How do I read the contents of a URL with Python?

2025-03-10 08:52:00

admin

Original

Problem description:

When I paste this into a browser, it works:

http://www.somesite.com/details.pl?urn=2344

But when I try to read the URL with Python, nothing happens:

 link = 'http://www.somesite.com/details.pl?urn=2344'
 f = urllib.urlopen(link)           
 myfile = f.readline()  
 print myfile

Do I need to encode the URL, or is there something I'm not seeing?

Solution 1:

To answer your question:

import urllib.request

link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.request.urlopen(link)
myfile = f.read()
print(myfile)

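You need read(), not readline(). readline() returns only the first line of the response, which can be blank; read() returns the whole body (as bytes in Python 3).

See also the answers by Martin Thoma and innm to this question: Python 2/3 compat, Python 3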
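The difference between read() and readline() can be seen without a network connection. The sketch below uses io.BytesIO as a stand-in for the object returned by urlopen(); the body shown is hypothetical, and whether the page in the question really starts with a blank line is only an assumption.

```python
import io

# Hypothetical response body that happens to start with an empty line.
body = b"\r\n<html>\r\n<body>details</body>\r\n</html>\r\n"

# readline() stops at the first newline, so only the blank line comes back.
first_line = io.BytesIO(body).readline()

# read() returns everything up to the end of the stream.
everything = io.BytesIO(body).read()

print(repr(first_line))
print(len(everything))
```

If the first line of a page is blank, printing the result of readline() looks like "nothing happened", which would match the symptom described in the question.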

Alternatively, use requests:

import requests

link = "http://www.somesite.com/details.pl?urn=2344"
f = requests.get(link)
print(f.text)

Solution 2:

For Python 3 users, to save time, use the following code:

from urllib.request import urlopen

link = "https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html"

f = urlopen(link)
myfile = f.read()
print(myfile)

I know there are separate threads about the error "NameError: name 'urlopen' is not defined", but I thought this might save some time.

Solution 3:

None of these answers is very good for Python 3 (tested on the latest version at the time of posting).

This is how you do it:

import urllib.error
import urllib.request

try:
    with urllib.request.urlopen('http://www.python.org/') as f:
        print(f.read().decode('utf-8'))
except urllib.error.URLError as e:
    print(e.reason)

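The above works for content encoded as UTF-8. If you remove .decode('utf-8'), Python does not guess the encoding; read() simply returns the raw bytes. When the server declares a charset in its Content-Type header, f.headers.get_content_charset() returns it.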
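As a rough sketch of decoding with the charset the server declares instead of hard-coding UTF-8: decode_body and fetch_text are hypothetical helper names, and falling back to UTF-8 when no charset is declared is an assumption, not a rule.

```python
import urllib.request


def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Decode response bytes with the declared charset, else the fallback."""
    encoding = declared_charset or fallback
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(encoding, errors="replace")


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body as str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Returns None when the Content-Type header carries no charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Calling fetch_text('http://www.python.org/') would then return a str rather than bytes. A charset name that Python does not recognise would still raise LookupError in this sketch.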

Documentation:
https://docs.python.org/3/library/urllib.request.html#module-urllib.request

Solution 4:

A solution that works on both Python 2.X and Python 3.X uses six, the Python 2 and 3 compatibility library:

from six.moves.urllib.request import urlopen
link = "http://www.somesite.com/details.pl?urn=2344"
response = urlopen(link)
content = response.read()
print(content)

Solution 5:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Works on Python 3 and Python 2.
# Case 1: the server accepts the default User-Agent.

import sys
from contextlib import closing

if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib import urlopen

# closing() is used because Python 2's urlopen() result is not a context manager.
with closing(urlopen('https://www.facebook.com/')) as url:
    data = url.read()

print(data)

# Case 2: the server expects a browser-like User-Agent header.
# Works on Python 3.

import urllib.request

user_agent = (
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) '
    'Gecko/2009021910 Firefox/3.0.7'
)

url = 'https://www.facebook.com/'
headers = {'User-Agent': user_agent}

request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
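To check what a custom User-Agent request looks like before sending it, a Request object can be built and inspected offline. This is only a sketch: build_request and the USER_AGENT string are hypothetical, and whether a given site needs such a header at all is an assumption.

```python
import urllib.request

# Hypothetical identifier; substitute whatever the target site expects.
USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher/0.1)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request carrying a custom User-Agent; nothing is sent yet."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("https://www.facebook.com/")
# Request stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
# To actually fetch: urllib.request.urlopen(request, timeout=10).read()
```

Passing headers as a keyword argument avoids the positional None for the data parameter used in the answer above.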

Solution 6:

We can read a website's HTML content as follows:

from urllib.request import urlopen
response = urlopen('http://google.com/')
html = response.read()
print(html)

Solution 7:

from urllib.request import urlopen

# read() returns bytes; decode() is needed to get readable text such as Chinese
html = urlopen("https://blog.csdn.net/qq_39591494/article/details/83934260").read().decode('utf-8')
print(html)

Solution 8:

import requests
from bs4 import BeautifulSoup

link = "https://www.timeshighereducation.com/hub/sinorbis"

res = requests.get(link)
if res.status_code == 200:
    # BeautifulSoup expects markup (str or bytes), not the Response object
    soup = BeautifulSoup(res.text, 'html.parser')

    # get the text content of the webpage
    text = soup.get_text()

    print(text)

Using BeautifulSoup with its HTML parser, we can extract the text content of a web page.
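Where installing BeautifulSoup is not an option, a similar text extraction can be approximated with the standard library's html.parser. This is a different technique from the answer above, offered only as a rough sketch; TextExtractor and html_to_text are hypothetical names, and skipping script and style contents is a design choice of this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside script/style elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, html_to_text(requests.get(link).text) would feed the downloaded page into this parser in place of soup.get_text().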

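Solution 9:

I used the following code (Python 2 only; urllib.urlopen and the print statement do not exist in Python 3):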

import urllib

def read_text():
      quotes = urllib.urlopen("https://s3.amazonaws.com/udacity-hosted-downloads/ud036/movie_quotes.txt")
      contents_file = quotes.read()
      print contents_file

read_text()

Solution 10:

# retrieving data from url
# only for python 3

import urllib.request

def main():
    url = "http://docs.python.org"

    # retrieving data from URL
    webUrl = urllib.request.urlopen(url)
    print("Result code: " + str(webUrl.getcode()))

    # print data from URL
    print("Returned data: -----------------")
    data = webUrl.read().decode("utf-8")
    print(data)

if __name__ == "__main__":
    main()
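Note that urlopen() raises urllib.error.HTTPError for 4xx and 5xx responses, so the "Result code" line above is normally reached only on success. A minimal sketch of reporting failures as well, with describe_failure and fetch as hypothetical helper names:

```python
import urllib.error
import urllib.request


def describe_failure(exc):
    """Turn a urllib exception into a short message."""
    # HTTPError is a subclass of URLError, so test for it first.
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP error {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "Connection problem: {}".format(exc.reason)
    raise TypeError("not a urllib error")


def fetch(url, timeout=10):
    """Return (status, body) on success or (None, message) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.getcode(), response.read()
    except urllib.error.URLError as exc:
        return None, describe_failure(exc)
```

Calling fetch("http://docs.python.org") would then return either the status code with the raw body or None with a readable reason.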

Solution 11:

The URL should be a string (Python 2 syntax; readline() still returns only the first line):

import urllib

link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.urlopen(link)           
myfile = f.readline()  
print myfile
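On the encoding concern raised in the question: the URL shown contains only plain ASCII and needs no extra encoding. When query values may contain spaces or non-ASCII characters, urllib.parse.urlencode can build the query string in Python 3. A small sketch, where build_url and the parameter "q" are hypothetical:

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append percent-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)


# Reproduces the URL from the question.
print(build_url("http://www.somesite.com/details.pl", {"urn": 2344}))

# A value with spaces and non-ASCII characters is percent-encoded.
print(build_url("http://www.somesite.com/details.pl", {"q": "python url 读取"}))
```

The resulting string can be passed directly to urlopen() or requests.get().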