Scraping https://www.thenewboston.com/ fails with "SSL: certificate_verify..."

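Problem description:

I recently started learning Python with "The New Boston" videos on YouTube, and everything was going well until I reached his tutorial on building a simple web crawler. I understood it without any trouble, but when I run the code I get errors that all seem to come down to "SSL: CERTIFICATE_VERIFY_FAILED". I have been searching for an answer since last night. Nobody in the video comments or on his website seems to have the same problem, and I get the same result even when I use other people's code from his website. I'll post the code I got from the website, since it gives me the same error and the code I wrote myself is a mess right now.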

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=" + str(page) #this is page of popular posts
        source_code = requests.get(url)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easy
        for link in soup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
    page += 1
trade_spider(1)

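The full error is: ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

Sorry if this is a silly question. I'm still new to programming, but I really can't figure this out. I was going to skip this tutorial, but not being able to solve this is bothering me. Thanks!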

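Solution 1:

The problem is not in your code but in the website you are trying to access. Looking at the SSLLabs analysis, you will notice:

This server's certificate chain is incomplete. Grade capped to B.

This means the server is misconfigured, and not only Python but several other clients will have problems with this site. Some desktop browsers work around this configuration problem by trying to load the missing certificates from the internet or by filling in with cached certificates. Other browsers and applications will fail too, just like Python.

To work around the broken server configuration, you can explicitly extract the missing certificates and add them to your trust store. Alternatively, you can pass the certificate as trusted through the verify argument. From the documentation:

You can pass verify the path to a CA_BUNDLE file or directory with certificates of trusted CAs:
>>> requests.get('https://github.com', verify='/path/to/certfile')
This list of trusted CAs can also be specified through the REQUESTS_CA_BUNDLE environment variable.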
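
As a rough standard-library counterpart to the verify argument, the sketch below builds an ssl.SSLContext that either uses the system trust store or trusts only the CA file it is given. The names make_context and fetch_status are illustrative and not part of requests, and any path passed as cafile is assumed to point to a PEM file you obtained yourself.

```python
import ssl
import urllib.request


def make_context(cafile=None):
    """Return a verifying TLS context.

    cafile=None  -> use the system's default trusted CAs.
    cafile=path  -> trust only the CA certificates in that PEM file.
    """
    # create_default_context enables certificate and hostname checks.
    return ssl.create_default_context(cafile=cafile)


def fetch_status(url, cafile=None):
    """Open url over HTTPS with the chosen trust settings; return the HTTP status."""
    context = make_context(cafile)
    with urllib.request.urlopen(url, context=context, timeout=10) as resp:
        return resp.status
```

Calling fetch_status(url, cafile="/path/to/certfile") would then play the same role as the requests call quoted above.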

Solution 2:

You can tell requests not to verify the SSL certificate:

>>> url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1"
>>> response = requests.get(url, verify=False)
>>> response.status_code
200

More details are in the requests documentation.
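
For comparison, this is roughly what switching off verification looks like with the standard library's ssl module. It makes visible the two checks that a verify=False style setting turns off. The name make_unverified_context is illustrative. With both checks off the connection is still encrypted, but the server's identity is no longer checked.

```python
import ssl


def make_unverified_context():
    """Return a TLS context that skips certificate and hostname checks."""
    ctx = ssl.create_default_context()
    # Order matters: hostname checking must be disabled before
    # verify_mode can be set to CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```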

Solution 3:

Your system may be missing the stock certificates. For example, on Ubuntu, check that the ca-certificates package is installed.
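
One way to check this from Python itself is to ask the ssl module where it expects to find trusted CA certificates and whether those locations exist. This is a small sketch. describe_ca_paths is an illustrative name, and the paths it reports depend on how your Python and OpenSSL were built.

```python
import os
import ssl


def describe_ca_paths():
    """Report where this Python's OpenSSL looks for trusted CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        "cafile_exists": bool(paths.cafile) and os.path.isfile(paths.cafile),
        "capath_exists": bool(paths.capath) and os.path.isdir(paths.capath),
    }


if __name__ == "__main__":
    for key, value in describe_ca_paths().items():
        print(f"{key}: {value}")
```

If neither location exists, a missing CA bundle is a likely cause of the verification failure.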

Solution 4:

If you installed Python with the dmg installer, you also have to read Python 3's ReadMe and run the bash command it mentions to get new certificates.

Try running

/Applications/Python\ 3.6/Install\ Certificates.command

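Solution 5:

It is worth shedding more "hands-on" light on what is happening here, adding to @Steffen Ullrich's answers here and elsewhere:

urllib and "SSL: CERTIFICATE_VERIFY_FAILED" Error
Python Urllib2 SSL error (a very detailed answer)

Notes:

I'll use a website other than the OP's, because the OP's website currently has no issues.
I ran the following commands (curl and openssl) on Ubuntu. I tried running curl on Windows 10 but got different, unhelpful output.

The error the OP experienced can be "reproduced" with the following curl command: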

curl -vvI https://www.vimmi.net

Output (note the last line):

* TCP_NODELAY set
* Connected to www.vimmi.net (82.80.192.7) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS alert, Server hello (2):
* SSL certificate problem: unable to get local issuer certificate
* stopped the pause stream!
* Closing connection 0
curl: (60) SSL certificate problem: unable to get local issuer certificate

Now let's run it with the --insecure flag, which displays the problematic certificate:

curl --insecure -vvI https://www.vimmi.net

Output (note the last two lines):

* Rebuilt URL to: https://www.vimmi.net/
*   Trying 82.80.192.7...
* TCP_NODELAY set
* Connected to www.vimmi.net (82.80.192.7) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* [...]
* Server certificate:
*  subject: OU=Domain Control Validated; CN=vimmi.net
*  start date: Aug  5 15:43:45 2019 GMT
*  expire date: Oct  4 16:16:12 2020 GMT
*  issuer: C=US; ST=Arizona; L=Scottsdale; O=GoDaddy.com, Inc.; OU=http://certs.godaddy.com/repository/; CN=Go Daddy Secure Certificate Authority - G2
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.

The same result can be seen with openssl, which is worth mentioning because Python uses it internally:

echo | openssl s_client -connect vimmi.net:443

Output:

CONNECTED(00000005)
depth=0 OU = Domain Control Validated, CN = vimmi.net
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 OU = Domain Control Validated, CN = vimmi.net
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:OU = Domain Control Validated, CN = vimmi.net
   i:C = US, ST = Arizona, L = Scottsdale, O = "GoDaddy.com, Inc.", OU = http://certs.godaddy.com/repository/, CN = Go Daddy Secure Certificate Authority - G2
---
Server certificate
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----
[...]
---
DONE

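So why do both curl and openssl fail to verify the certificate that Go Daddy issued for this website?

Well, to "verify a certificate" (to use openssl's error message terminology) means to verify that the certificate carries the signature of a trusted source (put differently: the certificate was signed by a trusted source), thereby verifying vimmi.net's identity. "Identity" here strictly means that "the public key contained in a certificate belongs to the person, organization, server or other entity noted in the certificate".

A source is "trusted" if we can establish a "chain of trust" with the following properties:

1. The issuer of each certificate (except the last one) matches the subject of the next certificate in the list
2. Each certificate (except the last one) is signed by the secret key corresponding to the next certificate in the chain (i.e. the signature of one certificate can be verified using the public key contained in the following certificate)
3. The last certificate in the list is a trust anchor: a certificate that you trust because it was delivered to you by some trustworthy procedure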
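
The name-matching and trust-anchor properties can be sketched on a toy model in which each certificate is just a dict of names. This is an illustration only: check_name_chain is a made-up helper, the certificate names mirror the Go Daddy example used in this answer, and the cryptographic signature check (the second property) is deliberately left out.

```python
def check_name_chain(chain, trust_anchors):
    """Check properties 1 and 3 of a chain of trust on a toy model.

    chain: list of dicts with "subject" and "issuer" keys, leaf first.
    trust_anchors: set of subject names trusted in advance.
    Signature checks (property 2) are deliberately left out.
    """
    if not chain:
        return False
    # Property 1: each issuer matches the subject of the next certificate.
    for cert, next_cert in zip(chain, chain[1:]):
        if cert["issuer"] != next_cert["subject"]:
            return False
    # Property 3: the last certificate must be a trust anchor.
    return chain[-1]["subject"] in trust_anchors


leaf = {"subject": "vimmi.net",
        "issuer": "Go Daddy Secure Certificate Authority - G2"}
intermediate = {"subject": "Go Daddy Secure Certificate Authority - G2",
                "issuer": "Go Daddy Root Certificate Authority - G2"}
root = {"subject": "Go Daddy Root Certificate Authority - G2",
        "issuer": "Go Daddy Root Certificate Authority - G2"}
anchors = {"Go Daddy Root Certificate Authority - G2"}

print(check_name_chain([leaf, intermediate, root], anchors))  # True
print(check_name_chain([leaf], anchors))                      # False
```

The second call models a server that sends only its own certificate: the chain ends at a certificate that is not a trust anchor, so the check fails.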

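In our case, the issuer is "Go Daddy Secure Certificate Authority - G2". That is, the entity named "Go Daddy Secure Certificate Authority - G2" signed the certificate, so it is supposed to be a trusted source.

To establish this entity's trustworthiness, we have two options:

1. Assume that "Go Daddy Secure Certificate Authority - G2" is a "trust anchor" (see item 3 in the list above). As it turns out, curl and openssl try to act on this assumption: they searched for that entity's certificate in their default paths (called CA paths), which are:

* For `curl`, it is `/etc/ssl/certs`.
* For `openssl`, it is `/usr/lib/ssl` (run `openssl version -a` to see it).

But that certificate was not found there, which leaves us with the second option:

2. Follow items 1 and 2 in the list above. To do that, we need to get the certificate issued to that entity. This can be done by downloading it from its source or by using a browser.
- For example, open vimmi.net in Chrome, click the padlock > "Certificate" > "Certification Path" tab, select the entity > "View Certificate", then in the window that opens go to the "Details" tab > "Copy to File" > Base-64 encoded > save the file.

Great! Now that we have that certificate (it can be in any file format: cer, pem, etc.; you can even save it as a txt file), let's tell curl to use it: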

curl --cacert test.cer https://vimmi.net

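Back to Python

Once we have:

the "Go Daddy Secure Certificate Authority - G2" certificate
the "Go Daddy Root Certificate Authority - G2" certificate (not mentioned above, but it can be obtained in a similar way).

We need to copy their contents into a single file, which we'll call combined.cer, and put it in the current directory. Then, simply: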

import requests

res = requests.get("https://vimmi.net", verify="./combined.cer")
print(res.status_code)  # 200
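
The copy-into-one-file step can also be scripted. The sketch below assumes both certificates were saved in Base-64 (PEM) form. combine_certs is an illustrative helper, and the input file names in the usage note are hypothetical.

```python
from pathlib import Path


def combine_certs(cert_paths, output_path="combined.cer"):
    """Concatenate Base-64 (PEM) certificate files into one bundle file."""
    blocks = []
    for path in cert_paths:
        text = Path(path).read_text(encoding="ascii").strip()
        if "-----BEGIN CERTIFICATE-----" not in text:
            raise ValueError(f"{path} does not look like a PEM certificate")
        blocks.append(text)
    # One certificate per block, separated by newlines.
    Path(output_path).write_text("\n".join(blocks) + "\n", encoding="ascii")
    return output_path
```

For example, combine_certs(["intermediate.cer", "root.cer"]) would write combined.cer to the current directory, ready to be passed as verify="./combined.cer".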

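By the way, "Go Daddy Root Certificate Authority - G2" is listed as a trusted authority by browsers and various tools, which is why we did not have to specify it for curl.

Further reading:

How are SSL certificates verified?, especially @ychaouche's image.
The First Few Milliseconds of an HTTPS Connection
Wikipedia: Public key certificate, Certificate authority
Nice video: Basics of Certificate Chain Validation.
Helpful SE answers that focus on certificate signature terminology: 1, 2, 3.
Certificates in relation to man-in-the-middle attacks: 1, 2.
The most dangerous code in the world: validating SSL certificates in non-browser software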

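Solution 6:

I'm posting this as an answer because I have gotten past your SSL issue, but there are still problems in your code (I can update this once they are fixed).

Long story short: you may be using an old version of requests, or the SSL certificate may be invalid. There is more information in this SO question: Python requests "certificate verify failed"

I've updated the code into my own bsoup.py file: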

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=" + str(page) #this is page of popular posts
        source_code = requests.get(url, timeout=5, verify=False)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easy
        for link in BeautifulSoup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)

        page += 1

if __name__ == "__main__":
    trade_spider(1)

When I run the script, it gives me this error:

https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1
Traceback (most recent call last):
  File "./bsoup.py", line 26, in <module>
    trade_spider(1)
  File "./bsoup.py", line 16, in trade_spider
    for link in BeautifulSoup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
  File "/usr/local/lib/python3.4/dist-packages/bs4/element.py", line 1256, in find_all
    generator = self.descendants
AttributeError: 'str' object has no attribute 'descendants'

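There is a problem with your findAll call. In the script above it is called on the BeautifulSoup class itself rather than on a parsed document object, so the string 'a' is treated as the instance. I tried both python3 and python2, and python2 reports this:

TypeError: unbound method find_all() must be called with BeautifulSoup instance as first argument (got str instance instead)

So it looks like you need to fix that call before you can continue.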
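
A minimal sketch of the corrected call, assuming the page text has already been fetched into a string: parse it into a BeautifulSoup object first, then call find_all on that object. The helper name extract_links and the sample markup are made up for illustration. Only the class name index_singleListingTitles and the base URL come from the code above.

```python
from bs4 import BeautifulSoup


def extract_links(html, base="https://www.thenewboston.com/"):
    """Return (href, title) pairs for the forum listing links in html."""
    # Parse first: find_all is an instance method of the parsed document.
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", {"class": "index_singleListingTitles"}):
        results.append((base + link.get("href"), link.string))
    return results


# Hypothetical sample markup, for illustration only.
sample = '<a class="index_singleListingTitles" href="forum/topic.php?id=1">Hello</a>'
print(extract_links(sample))
```

In the spider, plain_text would be passed in place of sample.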

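Solution 7:

I spent several hours trying to fix some Python code and update the certificates on a VM. In my case, I was working against a server that someone else had set up. It turned out that the wrong certificate had been uploaded to the server. I found this command in another SO answer.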

root@ubuntu:~/cloud-tools# openssl s_client -connect abc.def.com:443
CONNECTED(00000005)
depth=0 OU = Domain Control Validated, CN = abc.def.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 OU = Domain Control Validated, CN = abc.def.com
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
0 s:OU = Domain Control Validated, CN = abc.def.com
   i:C = US, ST = Arizona, L = Scottsdale, O = "GoDaddy.com, Inc.", OU = http://certs.godaddy.com/repository/, CN = Go Daddy Secure Certificate Authority - G2