How to find tags with only certain attributes - BeautifulSoup

2025-04-16 08:57:00

admin


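Question:

How can I use BeautifulSoup to search for tags that contain ONLY the attributes I search for?

For example, I want to find all <td valign="top"> tags.

The following code: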
raw_card_data = soup.fetch('td', {'valign':re.compile('top')})

returns all of the data I want, but it also returns every <td> tag that carries valign="top" alongside other attributes.

I also tried:
raw_card_data = soup.findAll(re.compile('<td valign="top">'))
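but this returned nothing (probably because of a bad regex).

Is there a way in BeautifulSoup to say "find <td> tags whose only attribute is valign:top"?

UPDATE
For example, if an HTML document contained the following <td> tags: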

<td valign="top">.....</td><br />
<td width="580" valign="top">.......</td><br />
<td>.....</td><br />

I want only the first <td> tag (<td valign="top">) to be returned.

Solution 1:

As explained in the BeautifulSoup documentation,

you can use this:

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})

EDIT:

To return only the tags whose sole attribute is valign="top", you can check the length of the tag's attrs property:

from bs4 import BeautifulSoup

html = '<td valign="top">.....</td>\n        <td width="580" valign="top">.......</td>\n        <td>.....</td>'

soup = BeautifulSoup(html, "html.parser")
# find_all is the bs4 spelling of findAll
results = soup.find_all("td", {"valign": "top"})

# Keep only the tags whose single attribute is valign
for result in results:
    if len(result.attrs) == 1:
        print(result)

Output:

<td valign="top">.....</td>
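The length check can be generalized to any set of required attributes. The following is a minimal sketch, assuming bs4 is installed; the helper name find_with_exact_attrs is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def find_with_exact_attrs(soup, name, wanted):
    """Return tags called `name` whose attributes are exactly `wanted`.

    Hypothetical helper for illustration. Multi-valued attributes such as
    class are stored as lists by bs4, so pass them as lists in `wanted`.
    """
    # find_all narrows the candidates by the wanted attributes; the dict
    # comparison then rejects tags that carry anything extra.
    return [tag for tag in soup.find_all(name, attrs=wanted)
            if tag.attrs == wanted]


html = '<td valign="top">a</td><td width="580" valign="top">b</td><td>c</td>'
soup = BeautifulSoup(html, "html.parser")

matches = find_with_exact_attrs(soup, "td", {"valign": "top"})
print(matches)
```

Comparing the whole attrs dictionary, rather than only its length, also rejects a tag that has the right number of attributes but the wrong names.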

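Solution 2:

You can pass a lambda function to findAll, as explained in the documentation. In your case, to search for td tags whose only attribute is valign="top", use the following: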

td_tag_list = soup.findAll(
                lambda tag: tag.name == "td" and
                len(tag.attrs) == 1 and
                # .get returns None instead of raising KeyError when the
                # single attribute is not valign
                tag.get("valign") == "top")
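The same filter can be written as a named function, which is easier to test and reuse. This is a sketch under the assumption that bs4 is installed; the function name only_valign_top is made up for the example. It uses tag.get so that a td whose single attribute is something other than valign is skipped rather than raising KeyError.

```python
from bs4 import BeautifulSoup


def only_valign_top(tag):
    # True for <td> tags whose single attribute is valign="top".
    return (tag.name == "td"
            and len(tag.attrs) == 1
            and tag.get("valign") == "top")


html = ('<td valign="top">a</td>'
        '<td width="580">b</td>'
        '<td width="580" valign="top">c</td>')
soup = BeautifulSoup(html, "html.parser")

# find_all calls the function once per tag and keeps the tags it accepts.
td_tag_list = soup.find_all(only_valign_top)
print(td_tag_list)
```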

Solution 3:

If you only want to search by attribute name, with any value:

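from bs4 import BeautifulSoup
import re

# html is assumed to be a response object (for example from requests.get) with a .text attribute
soup = BeautifulSoup(html.text, 'lxml')
# The regex matches any value, so every td that has a valign attribute is returned
results = soup.findAll("td", {"valign": re.compile(r".*")})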

According to Steve Lorimer, it is better to pass True instead of a regex:

results = soup.findAll("td", {"valign" : True})
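As a quick check that the two forms select the same tags, here is a small sketch on an inline HTML string (the sample markup is invented for the example, and html.parser is used so that lxml is not required).

```python
import re
from bs4 import BeautifulSoup

html = ('<td valign="top">a</td>'
        '<td valign="middle">b</td>'
        '<td width="580">c</td>')
soup = BeautifulSoup(html, "html.parser")

# True means "the attribute is present", whatever its value.
by_true = soup.find_all("td", {"valign": True})
# The catch-all regex selects the same tags, at the cost of a regex search.
by_regex = soup.find_all("td", {"valign": re.compile(r".*")})

print([td.get_text() for td in by_true])
print([td.get_text() for td in by_regex])
```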

Solution 4:

The easiest way to do this is with the new CSS-style select method:

soup = BeautifulSoup(html)
results = soup.select('td[valign="top"]')
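Note that the attribute selector matches on valign alone, so tags with additional attributes are also selected. A minimal sketch, assuming bs4 with its bundled CSS selector support, that combines select with the attribute-count check to keep only tags whose sole attribute is valign:

```python
from bs4 import BeautifulSoup

html = '<td valign="top">a</td><td width="580" valign="top">b</td><td>c</td>'
soup = BeautifulSoup(html, "html.parser")

# The CSS selector matches every td whose valign is "top".
all_top = soup.select('td[valign="top"]')
# Keep only those that carry no other attribute.
only_top = [td for td in all_top if len(td.attrs) == 1]

print(len(all_top), only_top)
```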

Solution 5:

Find by an attribute in any tag:

# <th class="team" data-sort="team">Team</th>
soup.find_all(attrs={"class": "team"})

# <th data-sort="team">Team</th>
soup.find_all(attrs={"data-sort": "team"})
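The attrs dictionary is the form to use for names such as data-sort, which cannot be written as Python keyword arguments because of the hyphen. A short runnable sketch follows; the table markup is invented for the example.

```python
from bs4 import BeautifulSoup

html = ('<table><tr>'
        '<th class="team" data-sort="team">Team</th>'
        '<th data-sort="wins">Wins</th>'
        '</tr></table>')
soup = BeautifulSoup(html, "html.parser")

# No tag name is given, so any tag with a matching attribute is returned.
by_class = soup.find_all(attrs={"class": "team"})
by_data = soup.find_all(attrs={"data-sort": "team"})
# True matches any value of the attribute.
any_data = soup.find_all(attrs={"data-sort": True})

print(by_class, by_data, any_data)
```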

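Solution 6:

If you want to pull all tags where a particular attribute is present at all, you can use the same code as the accepted answer, but instead of specifying a value for the attribute, just pass True.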

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : True})

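This returns all td tags that have a valign attribute. It is useful when your project needs to pull information from a widely used tag such as div, and you can narrow the search down to a very specific attribute.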

Solution 7:

Just pass it as an argument to findAll:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <html>
... <head><title>My Title!</title></head>
... <body><table>
... <tr><td>First!</td>
... <td valign="top">Second!</td></tr>
... </table></body></html>
... """, "html.parser")
>>>
>>> soup.findAll('td')
[<td>First!</td>, <td valign="top">Second!</td>]
>>>
>>> soup.findAll('td', valign='top')
[<td valign="top">Second!</td>]
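The keyword form is shorthand for the attrs dictionary. The sketch below, assuming bs4, shows both side by side, along with the class_ keyword that bs4 provides because class is a reserved word in Python; the class="note" attribute is added here only for illustration.

```python
from bs4 import BeautifulSoup

html = ('<table><tr><td>First!</td>'
        '<td valign="top" class="note">Second!</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

# Keyword argument and attrs dictionary select the same tags.
by_keyword = soup.find_all("td", valign="top")
by_dict = soup.find_all("td", attrs={"valign": "top"})
# class is a Python keyword, so bs4 accepts class_ instead.
by_class = soup.find_all("td", class_="note")

print(by_keyword, by_dict, by_class)
```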

Solution 8:

Combining the answers from Chris Redford and Amr, you can also use select to search for an attribute name with any value:

from bs4 import BeautifulSoup as Soup
html = '<td valign="top">.....</td>\n    <td width="580" valign="top">.......</td>\n    <td>.....</td>'
soup = Soup(html, 'lxml')
results = soup.select('td[valign]')

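Solution 9:

If you want to print, one per line, the names of all tags that have a particular attribute, for example all tags that have an id attribute regardless of its value: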

from bs4 import BeautifulSoup
from bs4 import element

html = '<!DOCTYPE html><html><head><title>Navigate Parse Tree</title></head>\n<body><h1>This is your Assignment</h1><a href = "https://www.google.com">This is a link that will take you to Google</a>\n<ul><li><p> This question is given to test your knowledge of <b>Web Scraping</b></p>\n<p>Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web.</p></li>\n<li id = "li2">This is an li tag given to you for scraping</li>\n<li>This li tag gives you the various ways to get data from a website\n<ol><li class = "list_or">Using API of the website</li><li>Scrape data using BeautifulSoup</li><li>Scrape data using Selenium</li>\n<li>Scrape data using Scrapy</li></ol></li>\n<li class = "list_or"><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">\nClicking on this takes you to the documentation of BeautifulSoup</a>\n<a href="https://selenium-python.readthedocs.io/" id="anchor">Clicking on this takes you to the documentation of Selenium</a>\n</li></ul></body></html>'

data = BeautifulSoup(html, 'html.parser')
# Walk every node in the tree and print the name of each tag that has an id attribute
for i in data.descendants:
    if isinstance(i, element.Tag) and 'id' in i.attrs:
        print(i.name)
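The same listing can probably be produced without walking descendants by hand, by letting find_all filter on the attribute's presence. A minimal sketch, assuming bs4; the helper name tag_names_with_attr and the shortened HTML are made up for the example.

```python
from bs4 import BeautifulSoup


def tag_names_with_attr(soup, attr):
    """Return the names of all tags that carry `attr`, in document order."""
    # attrs={attr: True} matches any tag on which the attribute is present.
    return [tag.name for tag in soup.find_all(attrs={attr: True})]


html = ('<ul><li id="li2">one</li>'
        '<li class="list_or">two</li>'
        '<li><a href="#" id="anchor">three</a></li></ul>')
soup = BeautifulSoup(html, "html.parser")

names = tag_names_with_attr(soup, "id")
for name in names:
    print(name)
```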