ElementTree findall returns an empty list
- 2025-03-21 09:07:00
- admin (original post)
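Problem description:
I am new to XML parsing and to Python, so please bear with me. I am using lxml to parse a wiki dump, but I only want the title and text of each page.
This is what I have so far:
from xml.etree import ElementTree as etree
def parser(file_name):
    document = etree.parse(file_name)
    titles = document.findall('.//title')
    print(titles)
At the moment, titles comes back empty. I have looked at earlier answers, such as "ElementTree findall() returning empty list", and at the lxml documentation, but most of that material seems to be tailored to parsing HTML.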
Here is part of my XML:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.20wmf9</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="100" case="first-letter">Portal</namespace>
<namespace key="101" case="first-letter">Portal talk</namespace>
<namespace key="108" case="first-letter">Book</namespace>
<namespace key="109" case="first-letter">Book talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>Aratrum</title>
<ns>0</ns>
<id>65741</id>
<revision>
<id>349931990</id>
<parentid>225434394</parentid>
<timestamp>2010-03-15T02:55:02Z</timestamp>
<contributor>
<ip>143.105.193.119</ip>
</contributor>
<comment>/* Sources */</comment>
<sha1>2zkdnl9nsd1fbopv0fpwu2j5gdf0haw</sha1>
<text xml:space="preserve" bytes="1436">'''Aratrum''' is the Latin word for [[plough]], and "arotron" (αροτρον) is the [[Greek language|Greek]] word. The [[Ancient Greece|Greeks]] appear to have had diverse kinds of plough from the earliest historical records. [[Hesiod]] advised the farmer to have always two ploughs, so that if one broke the other might be ready for use. These ploughs should be of two kinds, the one called "autoguos" (αυτογυος, "self-limbed"), in which the plough-tail was of the same piece of timber as the share-beam and the pole; and the other called "pekton" (πηκτον, "fixed"), because in it, three parts, which were of three kinds of timber, were adjusted to one another, and fastened together by nails.
The ''autoguos'' plough was made from a [[sapling]] with two branches growing from its trunk in opposite directions. In ploughing, the trunk served as the pole, one of the two branches stood upwards and became the tail, and the other penetrated the ground and, sometimes shod with bronze or iron, acted as the [[ploughshare]].
==Sources==
Based on an article from ''A Dictionary of Greek and Roman Antiquities,'' John Murray, London, 1875.
ἄρατρον
==External links==
*[http://penelope.uchicago.edu/Thayer/E/Roman/Texts/secondary/SMIGRA*/Aratrum.html Smith's Dictionary article], with diagrams, further details, sources.
[[Category:Agricultural machinery]]
[[Category:Ancient Greece]]
[[Category:Animal equipment]]</text>
</revision>
</page>
I also tried iterparse, printing the tag of each element it finds:
for e in etree.iterparse(file_name):
    print(e.tag)
But it complains that e has no tag attribute.
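Solution 1:
The problem is that you are not taking XML namespaces into account. The XML document (and every element in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make the search work, you need to change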
titles = document.findall('.//title')
to
titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
The namespace can also be supplied through the namespaces parameter, which is a dictionary mapping prefixes to URIs:
NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)
For more information, see the "Parsing XML with Namespaces" section of the ElementTree documentation.
A third option (added in Python 3.8) is to use a namespace wildcard:
titles = document.findall('.//{*}title')
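To compare the three spellings side by side, here is a minimal sketch that runs them against a small inline document. SAMPLE is a hypothetical, trimmed-down stand-in for the dump in the question, and the prefix mw is an arbitrary choice for illustration; neither comes from the MediaWiki export itself.

```python
import sys
from xml.etree import ElementTree as etree

# Hypothetical sample: a trimmed-down stand-in for the dump in the question.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page>
    <title>Aratrum</title>
  </page>
</mediawiki>"""

MW_URI = 'http://www.mediawiki.org/xml/export-0.7/'
root = etree.fromstring(SAMPLE)

# Unqualified name: matches nothing, because every element is namespaced.
unqualified = root.findall('.//title')

# 1. Fully qualified name: the namespace URI in braces, then the local name.
clark = root.findall('.//{%s}title' % MW_URI)

# 2. Prefix mapping: 'mw' is an arbitrary prefix chosen for this sketch.
prefixed = root.findall('.//mw:title', namespaces={'mw': MW_URI})

# 3. Namespace wildcard: only understood by Python 3.8 and later,
#    so fall back to the fully qualified result on older interpreters.
if sys.version_info >= (3, 8):
    wildcard = root.findall('.//{*}title')
else:
    wildcard = clark

print(len(unqualified))
print([t.text for t in clark])
print([t.text for t in prefixed])
print([t.text for t in wildcard])
```

The prefix form tends to be the easiest to read once several element names are involved, while the wildcard form is convenient when the export version (and therefore the URI) may vary between dumps.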
The problem with iterparse() is that it yields (event, element) tuples, not bare elements. To get the tag name, change
for e in etree.iterparse(file_name):
    print(e.tag)
to:
for ev, el in etree.iterparse(file_name):
    print(el.tag)
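Because wiki dumps can be very large, iterparse() is often combined with clearing each element once it has been handled. The following is a sketch under assumptions rather than a definitive recipe: SAMPLE is a hypothetical two-page document, iter_titles is a helper name invented here, and the namespace URI is the one from the question.

```python
import io
from xml.etree import ElementTree as etree

# Hypothetical sample with two pages, standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>Aratrum</title><revision><text>Body one</text></revision></page>
  <page><title>Plough</title><revision><text>Body two</text></revision></page>
</mediawiki>"""

# Namespace from the question, written in the brace form used in element tags.
MW = '{http://www.mediawiki.org/xml/export-0.7/}'


def iter_titles(source):
    """Yield page titles from a file name or binary file object."""
    # 'end' events fire once an element and all of its children are parsed.
    for event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag == MW + 'page':
            yield elem.findtext(MW + 'title')
            # Drop the page's children and text so memory use stays low.
            elem.clear()


titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

Note that the tags reported by iterparse() carry the namespace in braces as well, which is why the comparison uses MW + 'page' rather than plain 'page'.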
Solution 2:
First, you need to find the parent element, page. I don't know how deeply it is nested, but once you have found it you can get the title tag directly:
>>> page_tag = ET.fromstring(xdata)
>>> title_tag = page_tag.find('title')
>>> title_tag.text
'Aratrum'
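With the additional information now provided, you can do this:
def parser(file_name):
    # Elements in the dump are in the MediaWiki export namespace,
    # so the names are qualified through a prefix mapping.
    ns = {'mw': 'http://www.mediawiki.org/xml/export-0.7/'}
    document = etree.parse(file_name)
    titles = []
    for page_tag in document.findall('mw:page', ns):
        titles.append(page_tag.find('mw:title', ns).text)
    return titles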
Hope this helps!
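Since the question asks for both the title and the text of each page, the approaches above can be combined as in the sketch below. It reads the namespace from the root element's tag instead of hard-coding the export version. The helper names namespace_of and titles_and_texts, and the SAMPLE document, are hypothetical and exist only for this illustration.

```python
import io
from xml.etree import ElementTree as etree

# Hypothetical sample, standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page>
    <title>Aratrum</title>
    <revision>
      <text>'''Aratrum''' is the Latin word for [[plough]].</text>
    </revision>
  </page>
</mediawiki>"""


def namespace_of(element):
    """Return the '{uri}' part of an element's tag, or '' if it has none."""
    tag = element.tag
    return tag[:tag.index('}') + 1] if tag.startswith('{') else ''


def titles_and_texts(source):
    """Return a list of (title, text) pairs, one per page element."""
    root = etree.parse(source).getroot()
    ns = namespace_of(root)  # e.g. '{http://www.mediawiki.org/xml/export-0.7/}'
    pages = []
    for page in root.iter(ns + 'page'):
        title = page.findtext(ns + 'title')
        # The article body sits in <revision><text> below each page.
        text = page.findtext('%srevision/%stext' % (ns, ns))
        pages.append((title, text))
    return pages


pages = titles_and_texts(io.BytesIO(SAMPLE))
print(pages)
```

This loads the whole tree into memory, so for a full dump a streaming loop based on iterparse() is likely the better fit.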