Extract text only from this element, not from its children


2025-03-04 08:24:00

admin



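Problem description:

I want to extract text only from the top-most element of my soup; however, soup.text also returns the text of all child elements:

I have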

import BeautifulSoup
soup=BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text

This outputs yesno. I want only 'yes'.

What is the best way to achieve this?

Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.
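
For readers on Python 3, the behaviour described in the question can be reproduced with the sketch below. It assumes the bs4 package is installed and uses the standard-library html.parser backend; neither appears in the original question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: bs4 with the standard-library "html.parser" backend.
soup1 = BeautifulSoup('<html>yes<b>no</b></html>', 'html.parser')
soup2 = BeautifulSoup('<html><b>no</b>yes</html>', 'html.parser')

# get_text() (like the .text shortcut) walks every descendant,
# so the text inside <b> is included in both cases.
all_text_1 = soup1.get_text()
all_text_2 = soup2.get_text()

print(all_text_1)
print(all_text_2)
```

In both cases the unwanted "no" is part of the result, which is the problem the answers address.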

Solution 1:

What about .find(text=True)?

>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
u'no'

Edit:

I think I now understand what you want. Try this:

>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
u'yes'
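
The session above uses the old BeautifulSoup 3 module under Python 2. A rough bs4 equivalent might look like the following sketch, assuming Python 3, the html.parser backend, and the newer string= keyword in place of text=. The helper name first_own_string is hypothetical.

```python
from bs4 import BeautifulSoup


def first_own_string(markup):
    """Return the first text node that is a direct child of <html>, or None."""
    # Assumption: the markup has an <html> root and is parsed with html.parser.
    soup = BeautifulSoup(markup, 'html.parser')
    # recursive=False restricts the search to direct children of <html>,
    # so text nested inside <b> is never considered.
    return soup.html.find(string=True, recursive=False)


print(first_own_string('<html>yes<b>no</b></html>'))
print(first_own_string('<html><b>no</b>yes</html>'))
```

Note that find returns only the first matching text node; when the element has no direct text at all, the result is None.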

Solution 2:

You can use contents:

>>> print soup.html.contents[0]
yes

Or, to get all the text directly under html, use findAll(text=True, recursive=False):

>>> soup = BeautifulSoup.BeautifulSOAP('<html>x<b>no</b>yes</html>')
>>> soup.html.findAll(text=True, recursive=False) 
[u'x', u'yes']

Join the results above to form a single string:

>>> ''.join(soup.html.findAll(text=True, recursive=False)) 
u'xyes'
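
One caveat with contents[0] is that it is only text when the first child happens to be a text node. The sketch below, written for bs4 on Python 3 with html.parser (assumptions, as the answer itself targets BeautifulSoup 3), contrasts it with joining all direct text nodes. The helper name direct_text is hypothetical.

```python
from bs4 import BeautifulSoup, Tag


def direct_text(tag):
    """Concatenate the text nodes that are direct children of `tag`."""
    return ''.join(tag.find_all(string=True, recursive=False))


soup = BeautifulSoup('<html><b>no</b>yes</html>', 'html.parser')

# Here the first child is the <b> element, not a string,
# so contents[0] does not give the element's own text.
first_child = soup.html.contents[0]
print(isinstance(first_child, Tag))

# Joining the direct text nodes works regardless of child order.
print(direct_text(soup.html))

soup_x = BeautifulSoup('<html>x<b>no</b>yes</html>', 'html.parser')
print(direct_text(soup_x.html))
```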

Solution 3:

This worked for me in bs4:

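import bs4
node = bs4.BeautifulSoup('<html><div>A<span>B</span>C</div></html>', 'html.parser').find('div')
# Keep only the div's direct NavigableString children, skipping nested tags such as <span>
print("".join(t for t in node.contents if type(t) == bs4.element.NavigableString))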

Output:

AC
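
The exact type comparison in this answer has a side effect worth knowing: in bs4, Comment is a subclass of NavigableString, so an isinstance check would also pick up HTML comments, while type(t) == NavigableString leaves them out. A small sketch under the assumption of Python 3 and html.parser; the markup with a comment is an illustrative addition, not from the answer.

```python
from bs4 import BeautifulSoup, NavigableString

# Illustrative markup: the answer's example with an HTML comment added.
markup = '<div>A<!-- note --><span>B</span>C</div>'
node = BeautifulSoup(markup, 'html.parser').find('div')

# Exact type match: plain text nodes only, comments are skipped.
exact = ''.join(t for t in node.contents if type(t) == NavigableString)

# isinstance match: also accepts subclasses such as Comment.
with_comments = ''.join(t for t in node.contents if isinstance(t, NavigableString))

print(exact)
print(with_comments)
```

Which of the two is preferable depends on whether comments should count as text for the task at hand.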

Solution 4:

In modern (as of 2023-06-17) BeautifulSoup4, given:

from bs4 import BeautifulSoup
node = BeautifulSoup("""
<html>
    <div>
        <span>A</span>
        B
        <span>C</span>
        D
    </div>
</html>""").find('div')

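Use the following to get the direct child text nodes (BD, along with the whitespace and newlines that surround them in the markup):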

s = "".join(node.find_all(string=True, recursive=False))

The following gets all descendant text nodes (ABCD, again with the surrounding whitespace):

s = "".join(node.find_all(string=True, recursive=True))
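
Because the example markup is indented, the joined strings contain newlines and spaces in addition to the letters. A minimal sketch of one way to clean that up, assuming html.parser and stripping each text node before joining (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<html>
    <div>
        <span>A</span>
        B
        <span>C</span>
        D
    </div>
</html>"""

# Assumption: the standard-library html.parser backend.
node = BeautifulSoup(html, 'html.parser').find('div')

# Raw join: keeps the indentation and newlines between the tags.
raw = ''.join(node.find_all(string=True, recursive=False))

# Strip each text node first to keep only the visible characters.
direct = ''.join(s.strip() for s in node.find_all(string=True, recursive=False))
everything = ''.join(s.strip() for s in node.find_all(string=True, recursive=True))

print(repr(raw))
print(direct)
print(everything)
```

Stripping per node removes all whitespace between fragments; if word boundaries matter, joining with a space instead of an empty string may be a better fit.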

Solution 5:

You might want to look into lxml's soupparser module, which supports XPath:

>>> from lxml.html.soupparser import fromstring
>>> s1 = '<html>yes<b>no</b></html>'
>>> s2 = '<html><b>no</b>yes</html>'
>>> soup1 = fromstring(s1)
>>> soup2 = fromstring(s2)
>>> soup1.xpath("text()")
['yes']
>>> soup2.xpath("text()")
['yes']
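
If installing lxml is not an option, a similar result can be sketched with the standard-library xml.etree.ElementTree. This is a swapped-in technique, not the soupparser approach above: it has no text() XPath support and only works on well-formed markup. In the ElementTree model, an element's own text is its .text plus the .tail of each direct child. The helper name own_text_et is hypothetical.

```python
import xml.etree.ElementTree as ET


def own_text_et(element):
    """Join an element's own text: its .text plus each direct child's .tail."""
    parts = [element.text or '']
    # The tail of a child is the text that follows it inside the parent.
    parts.extend(child.tail or '' for child in element)
    return ''.join(parts)


# Assumption: the input is well-formed XML, which holds for these snippets.
root1 = ET.fromstring('<html>yes<b>no</b></html>')
root2 = ET.fromstring('<html><b>no</b>yes</html>')

print(own_text_et(root1))
print(own_text_et(root2))
```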