Python中的urllib介绍

在Python2中, 有urllib与urllib2两个库可以用来实现request的发送
而在Python3中, 没有urllib2了，统一称为：urllib
urllib中包括了四个模块：
- urllib.request: 可以用来发送request和获取request的结果
- urllib.error: 包含了urllib.request产生的异常
- urllib.parse: 用于解析和处理URL
- urllib.robotparse: 用于解析页面的robots.txt文件

关于urllib.request模块

该模块提供了最基本的构造HTTP请求的方法, 可以模拟浏览器的请求过程，同时还带有处理authentication(授权验证)、redirections(重定向)、cookies(浏览器Cookies)等内容
使用示例：urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
- 参数：url是地址；data：可选附加参数，字节流编码格式(bytes()可转换), 请求方式会变为POST; timeout(超时时间)单位为秒, 未及时响应则抛出异常
- 返回：HTTPResponse类型的对象：response.read() 可得到返回的网页内容, 可使用decode('utf-8')解码字符串； response.status 可以得到返回结果的状态码

关于urllib.request.Request模块

利用urlopen()可实现最基本的请求发起，但不足以构建一个完整的请求
使用强大的Request类可以在请求中加入需要的headers等信息
- 示例源码：class urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)
- 参数url: 是必传参数，其他都是可选参数
- 参数data: 参数如果要传必须传bytes(字节流)类型的, 如果是一个字典, 可以先用urllib.parse.urlencode()进行编码
- 参数headers: 是一个字典, 可在构造Request时通过headers参数传递，也可通过调用add_header()来添加请求头
- 参数origin_req_host: 请求方的host名称或IP地址
- 参数unverifiable 指的是这个请求是否是无法验证的，默认是False, 意思是用户没有足够权限来选择接受这个请求的结果
- 参数method是一个字符串，用来指示请求使用的方法，如：GET、POST、PUT等

使用urllib.request模块来爬取数据示例

import urllib.request
import re

url = "http://www.baidu.com"
res = urllib.request.urlopen(url)
# print(res.read().decode("utf-8")) # 请求的结果是html代码
# print(len(res.read())) # 153682 注：此处输出网页字符的长度
html = res.read().decode("utf-8")

dlist = re.findall("<title>(.*?)</title>", html)
print(dlist) # ['百度一下，你就知道']

使用urllib.request.Request模块进行请求封装示例

import urllib.request
import re

url = 'http://news.baidu.com/';

# 伪装浏览器用户
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
req = urllib.request.Request(url, headers=headers) # 封装请求对象
res = urllib.request.urlopen(req) # 执行请求获取响应信息 通过封装对象
# res = urllib.request.urlopen(url) # 执行请求获取响应信息 直接通过url
html = res.read().decode('utf-8') # 从响应对象中读取信息并解码
pat = '<a href="(.*?)" .*? target="_blank">(.*?)</a>' # 使用正则解析新闻标题

dlist = re.findall(pat, html)

for v in dlist:
    print(v[1] + ' : ' + v[0])

爬虫的异常处理

在爬虫运行时，经常会出现异常，若不进行处理则会因报错而终止运行
urllib.error 可接收有urllib.request产生的异常。它有两个类，URLError 和 HTTPError
- URLError内有一个属性：reason 返回错误原因
- HTTPError内有三个属性： code 返回状态码，reason返回错误原因，headers返回请求头
URLError 是 OSError 的一个子类，HTTPError 是 URLError的一个子类
错误处理: 在容易出错的位置，添加 try 块

方式1 (按错误类型分别处理)：

from urllib import request, error

url = 'https://avatar.csdn.net/x/y/sdsds.jpg'
req = request.Request(url)
try:
    res = request.urlopen(req)
    html = res.read().decode('utf-8') # 此处二进制内容转换为utf8的内容
    print(len(html))
# HTTPError 要在 URLError 之后
except error.HTTPError as e:
    print('HTTPError')
    print(e.reason)
    print(e.code)
except error.URLError as e:
    print('URLError')
    print(e.reason)

方式2 (开发中经常采用这种方式, 统一处理):

from urllib import request, error

url = 'https://avatar.csdn.net/x/y/sdsds.jpg'
req = request.Request(url)
try:
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    print(len(html))
except Exception as e:
    if hasattr(e, "code"):
        print("HTTPError")
        print(e.reason)
        print(e.code)
    elif hasattr(e, "reason"):
        print("URLError")
        print(e.reason)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

50.md

50.md

Python中的urllib介绍

关于urllib.request模块

关于urllib.request.Request模块

使用urllib.request模块来爬取数据示例

使用urllib.request.Request模块进行请求封装示例

爬虫的异常处理

Files

50.md

Latest commit

History

50.md

File metadata and controls

Python中的urllib介绍

关于urllib.request模块

关于urllib.request.Request模块

使用urllib.request模块来爬取数据示例

使用urllib.request.Request模块进行请求封装示例

爬虫的异常处理