Extracting text that is not inside a tag with BeautifulSoup
- 2025-03-04 08:25:00
- admin (original post)
Problem description:
My web page looks like this:
<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">RACE:</strong> WHITE<br/>
<strong class="offender">GENDER:</strong> FEMALE<br/>
<strong class="offender">HEIGHT:</strong> 5'05''<br/>
<strong class="offender">WEIGHT:</strong> 118<br/>
<strong class="offender">EYE COLOR:</strong> GREEN<br/>
<strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>
I want to extract each person's information as label/value pairs: YOB:1987, RACE:WHITE, and so on.
What I tried:
subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')
But this only gives me the labels YOB:, RACE:, and so on, without their values.
Is there a way to get the data in the form YOB:1987, RACE:WHITE?
Solution 1:
Loop over all the <strong> tags and read each tag's next_sibling, which is the text node that follows the label. Like this:
for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)
Demo:
from bs4 import BeautifulSoup
html = '''
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''
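# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)
This gives you: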
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
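The same next_sibling idea can also collect the pairs into a dictionary instead of printing them. The following is only a minimal sketch: the shortened HTML sample, the `record` name, and the choice to drop the trailing colon from each label are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">RACE:</strong> WHITE<br/>
<strong class="offender">HEIGHT:</strong> 5'05''<br/>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

record = {}
for strong_tag in soup.find_all("strong", class_="offender"):
    # Label text without the trailing colon, e.g. "YOB"
    label = strong_tag.get_text(strip=True).rstrip(":")
    value = strong_tag.next_sibling
    # next_sibling can be None or another tag; only text nodes are strings
    record[label] = value.strip() if isinstance(value, str) else ""

print(record)
```

The isinstance check guards against a label that is followed directly by another tag (or by nothing), in which case an empty string is stored.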
Solution 2:
I think you can get it with subc1.text.
>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
Or, if you want to explore the structure, you can use .contents:
>>> p = soup.find('p')
>>> from pprint import pprint
>>> pprint(p.contents)
[u'\n',
 <strong class="offender">YOB:</strong>,
 u' 1987',
 <br/>,
 u'\n',
 <strong class="offender">RACE:</strong>,
 u' WHITE',
 <br/>,
 u'\n',
 <strong class="offender">GENDER:</strong>,
 u' FEMALE',
 <br/>,
 u'\n',
 <strong class="offender">HEIGHT:</strong>,
 u" 5'05''",
 <br/>,
 u'\n',
 <strong class="offender">WEIGHT:</strong>,
 u' 118',
 <br/>,
 u'\n',
 <strong class="offender">EYE COLOR:</strong>,
 u' GREEN',
 <br/>,
 u'\n',
 <strong class="offender">HAIR COLOR:</strong>,
 u' BROWN',
 <br/>,
 u'\n']
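Then pick the needed items out of the list. The labels sit at indexes 1, 5, 9, ... and the values at indexes 2, 6, 10, ..., so a step of 4 selects each: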
>>> data = dict(zip([x.text for x in p.contents[1::4]], [x.strip() for x in p.contents[2::4]]))
>>> pprint(data)
{u'EYE COLOR:': u'GREEN',
u'GENDER:': u'FEMALE',
u'HAIR COLOR:': u'BROWN',
u'HEIGHT:': u"5'05''",
u'RACE:': u'WHITE',
u'WEIGHT:': u'118',
u'YOB:': u'1987'}
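The slicing above depends on the exact position of every node in .contents, so it may break if the markup's whitespace changes. As a hedged alternative, the sketch below pairs up consecutive items from stripped_strings, which skips whitespace-only strings. It assumes every label is followed by a non-empty value; the shortened HTML sample and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">EYE COLOR:</strong> GREEN<br/>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# stripped_strings yields: 'YOB:', '1987', 'EYE COLOR:', 'GREEN'
strings = iter(p.stripped_strings)
# Zipping one iterator with itself pairs each label with the string after it
data = dict(zip(strings, strings))

print(data)
```

If a value can be missing, the pairing would shift by one, so the next_sibling approach is the safer choice in that case.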
Solution 3:
You can try this inside your find_all for loop:
item_price = item.find('span', attrs={'class':'s-item__price'}).text
It extracts only the text and assigns it to item_price.
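Note that find returns None when no matching tag exists, so calling .text on the result directly can raise AttributeError. The sketch below shows one way to guard against that. The surrounding markup (a list of li elements with class s-item) and the `prices` list are hypothetical, invented only to make the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second item has no price span
html = """
<ul>
  <li class="s-item"><span class="s-item__price">$10.00</span></li>
  <li class="s-item">no price here</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

prices = []
for item in soup.find_all("li", class_="s-item"):
    price_tag = item.find("span", attrs={"class": "s-item__price"})
    # Store None instead of failing when the span is missing
    prices.append(price_tag.get_text(strip=True) if price_tag else None)

print(prices)
```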
Solution 4:
I think you can solve this with the .strip() method from the gazpacho library:
Input:
html = """\n<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
Code:
from gazpacho import Soup

soup = Soup(html)
text = soup.find("p").strip(whitespace=False)  # to keep \n characters intact
lines = [
    line.strip()
    for line in text.split("\n")
    if line != ""
]
# Split each "LABEL: value" line into a key/value pair
data = dict([line.split(": ") for line in lines])
Output:
print(data)
# {'YOB': '1987',
# 'RACE': 'WHITE',
# 'GENDER': 'FEMALE',
# 'HEIGHT': "5'05''",
# 'WEIGHT': '118',
# 'EYE COLOR': 'GREEN',
# 'HAIR COLOR': 'BROWN'}
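The line-splitting step does not depend on any particular parser. As a rough sketch, assuming the text has already been extracted into one "LABEL: value" pair per line, str.partition splits on the first colon only, so a value that itself contains ": " would not break the parsing. The sample text and variable names below are illustrative.

```python
# Text as it might look after stripping the tags (illustrative sample)
text = """
YOB: 1987
RACE: WHITE
HEIGHT: 5'05''
"""

data = {}
for line in text.splitlines():
    # partition splits on the first colon; sep is empty if no colon was found
    key, sep, value = line.partition(":")
    if sep:
        data[key.strip()] = value.strip()

print(data)
```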