1 urllib
A quick overview of this part is enough.
A simple demo:
```python
from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf8'))
```
Four submodules:
- urllib.request: opens and reads URLs.
- urllib.error: contains the exceptions raised by urllib.request.
- urllib.parse: parses URLs.
- urllib.robotparser: parses robots.txt files.
This section mainly covers a few methods that I either use often or am less familiar with.
1.1 urllib.request.Request
```python
request.Request(
    url,                    # required
    data=None,              # must be bytes (a byte stream)
    headers={},
    origin_req_host=None,
    unverifiable=False,
    method=None,
)
```
Demo:
```python
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66',
    'Host': 'httpbin.org'
}
payload = {'name': 'test_user'}
data = bytes(parse.urlencode(payload), encoding='utf8')  # encode the dict into a byte stream first

req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# output
"""
{
  "args": {},
  "data": "",
  "files": {},
  "form": {               # this is the content we uploaded
    "name": "test_user"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "14",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66",
    "X-Amzn-Trace-Id": "Root=1-62c7e429-2dfdef7313887be20022ab31"
  },
  "json": null,
  "origin": "60.168.149.12",
  "url": "http://httpbin.org/post"
}
"""
```
Headers can also be added with add_header():
```python
req.add_header('User-Agent', 'XXXXX')
```
1.2 The Handler classes
Handlers are mainly used for more advanced operations (cookie handling, proxies, and so on).

- BaseHandler: the base class that the other handlers inherit from

To be expanded when I need it; a small sketch of the usual pattern follows.
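For example, a minimal sketch chaining handlers with build_opener(); the proxy address here is just a placeholder, not a working proxy:

```python
from urllib import request
from http.cookiejar import CookieJar

cookie_jar = CookieJar()
opener = request.build_opener(
    request.HTTPCookieProcessor(cookie_jar),                   # store/send cookies automatically
    request.ProxyHandler({'http': 'http://127.0.0.1:8080'}),   # route requests through a proxy (placeholder address)
)
response = opener.open('http://www.baidu.com')
for cookie in cookie_jar:
    print(cookie.name, cookie.value)
```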
1.3 Exception handling: urllib.error
The urllib.error module defines the exception classes raised by urllib.request; the base exception class is URLError.
The module contains two main exception classes: URLError and HTTPError.
```python
# URLError has a single attribute: reason
from urllib import request, error

try:
    response = request.urlopen('https://baidumatches999.com')
except error.URLError as e:
    print(e.reason)
```
HTTPError is a subclass of URLError and has three attributes:

- code: the HTTP status code
- reason: the cause of the error
- headers: the response headers
```python
from urllib import request, error

try:
    response = request.urlopen('https://baidu.com/test.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

"""
Not Found
404
Content-Length: 206
Content-Type: text/html; charset=iso-8859-1
Date: Fri, 08 Jul 2022 08:24:56 GMT
Server: Apache
Connection: close
"""
```
1.4 urllib.parse
A URL is divided into six parts: scheme (protocol), netloc (domain), path (access path), params (parameters), query (query string), and fragment (anchor):
```
scheme://netloc/path;params?query#fragment
```
The module is mainly used for URL handling: splitting, joining, and so on (see the sketch after the list below).
- urlparse(): splits a URL into its components
  - urlstring: required; the URL to parse
  - scheme: the protocol to assume when the URL itself carries none
  - allow_fragments: whether to parse the fragment; if False, the fragment stays attached to the path or query
- urlunparse(): the inverse of urlparse(); assembles a URL from its six parts
- urlsplit(): like urlparse(), but params is not split out separately (five parts)
- urlunsplit(): the inverse of urlsplit()
- urljoin(): resolves a link against a base URL
- parse_qsl(): parses a query string into a list of (key, value) tuples
- quote(): converts content into URL-encoded (percent-encoded) form
```python
# parse
from urllib.parse import quote

url = "http://www.baidu.com/" + quote('你好')
print(url)
"""
'http://www.baidu.com/%E4%BD%A0%E5%A5%BD'
"""
```
- unquote(): decodes URL-encoded content
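A quick sketch of the other functions in this list, so they aren't left bare (my own examples):

```python
from urllib.parse import urlparse, urlunparse, urljoin, parse_qsl

# urlparse: split a URL into its six components
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path, result.params, result.query, result.fragment)
# http www.baidu.com /index.html user id=5 comment

# urlunparse: assemble a URL from a six-element iterable
print(urlunparse(['http', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']))
# http://www.baidu.com/index.html;user?id=5#comment

# urljoin: resolve a (possibly relative) link against a base URL
print(urljoin('http://www.baidu.com/about.html', 'faq.html'))
# http://www.baidu.com/faq.html

# parse_qsl: turn a query string into a list of (key, value) tuples
print(parse_qsl('name=test_user&age=20'))
# [('name', 'test_user'), ('age', '20')]
```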
1.5 The Robots protocol
The robots protocol (also called the crawler protocol or crawler rules) lets a website publish a robots.txt file that tells search engines which pages may be crawled and which may not; a search engine reads robots.txt to decide whether a page is allowed. Note, though, that the robots protocol is not a firewall and carries no enforcement power: a search engine is perfectly able to ignore robots.txt and crawl page snapshots anyway.
The urllib.robotparser module parses robots.txt.
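A small sketch of the usual usage (the Zhihu URLs are just examples):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.zhihu.com/robots.txt')  # point at the site's robots.txt
rp.read()                                        # fetch and parse it

# can_fetch(user_agent, url): is this agent allowed to crawl this URL?
print(rp.can_fetch('*', 'https://www.zhihu.com/explore'))
```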
2 The Requests module
requests makes many operations more convenient than urllib; below, the same examples are used to compare the two approaches.
2.1 Basic usage
The most used method is get():
```python
import requests

url = 'XXXX'
r = requests.get(url)
```
post(), put(), and delete() send the corresponding requests, for instance:
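```python
import requests

# each HTTP verb has a matching top-level helper; httpbin.org echoes back what we send
r = requests.post('http://httpbin.org/post', data={'name': 'test_user'})
r = requests.put('http://httpbin.org/put', data={'name': 'test_user'})
r = requests.delete('http://httpbin.org/delete')
print(r.status_code)
```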
- json(): parse the response body directly into a dict
```python
import requests

r = requests.get('http://httpbin.org/get')
print(r.text)
print(r.json())  # parse the response body as JSON (returns a dict)

"""
Output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-62c80370-7ed4a0a10c7b27106031d457"
  },
  "origin": "60.168.149.12",
  "url": "http://httpbin.org/get"
}
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0', 'X-Amzn-Trace-Id': 'Root=1-62c80370-7ed4a0a10c7b27106031d457'}, 'origin': '60.168.149.12', 'url': 'http://httpbin.org/get'}
"""
```
2.2 Request headers
Many sites return nothing useful unless we add request headers.
Take the Zhihu Explore page below:
```python
import requests

r = requests.get('https://www.zhihu.com/explore')
print(r.text)

"""
Output:
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>openresty</center>
</body>
</html>
"""
```
With a request header added, the content is fetched normally.
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66'
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
print(r.text)
```
Other header fields are also worth mastering; Referer and Cookie in particular are sometimes needed to get past anti-scraping checks.
See my earlier post: 【爬虫】Web基础——响应头、请求头、http&https、状态码
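For example, a hypothetical headers dict carrying both fields; all values are placeholders to be copied from the browser's developer tools:

```python
import requests

# Referer and Cookie values below are placeholders, not real credentials
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66',
    'Referer': 'https://www.zhihu.com/',
    'Cookie': 'sessionid=XXXXX',
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
```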
2.3 post
```python
def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """
```
I won't list another demo; the body is submitted through the data parameter. The docstring shows that its parameters mirror those of request(), i.e., every parameter of request() below can be used here too, e.g., files (which lets us upload files):
```python
# file upload
import requests

files = {'file': open('./data/test.png', 'rb')}
r = requests.post('http://httpbin.org/post', files=files)
print(r.text)
```
2.4 requests.request
Let's look at the request() function to see exactly which parameters it accepts.
```python
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``)
        for multipart encoding upload. ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``,
        3-tuple ``('filename', fileobj, 'content_type')`` or a 4-tuple
        ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object
        containing additional headers to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
        the server's TLS certificate, or a string, in which case it must be a path
        to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'https://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
```
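A quick sketch exercising a few of these parameters (the values are arbitrary):

```python
import requests

# params builds the query string; timeout is given as a (connect, read) tuple;
# allow_redirects and verify are shown with their defaults made explicit
r = requests.request(
    'GET',
    'http://httpbin.org/get',
    params={'name': 'test_user'},   # -> http://httpbin.org/get?name=test_user
    timeout=(3.0, 10.0),
    allow_redirects=True,
    verify=True,
)
print(r.url)
print(r.status_code)
```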
2.5 Cookies
Cookies are data that certain websites store on the user's device in order to identify the user and track the session; they can be used to maintain session state.
When the client requests the server for the first time, the response carries a Set-Cookie field and the client's browser stores the cookie.
On the next visit, the browser attaches the cookie to the request, and the server uses it to determine the session state.
```python
import requests

r = requests.get('http://www.baidu.com')
cookies = r.cookies
for key, value in cookies.items():
    print(key + ':' + value)

"""
BAIDUID:974A
BIDUPSID:974A9F
PSTM:1657
"""
```
2.6 Maintaining a session
The Session object keeps requests inside the same session; think of each Session as its own browser.
- Demo 1: equivalent to visiting with two different browsers
```python
import requests

requests.get('http://httpbin.org/cookies/set/number/1234567')
r = requests.get('http://httpbin.org/cookies')
print(r.text)

"""
Output:
{
  "cookies": {}
}
"""
```
You can also open the two URLs in a browser yourself, which makes the behaviour easier to see.
- Demo 2: with a Session object
```python
import requests

sess = requests.Session()
sess.get('http://httpbin.org/cookies/set/number/1234567')
r = sess.get('http://httpbin.org/cookies')
print(r.text)

"""
Output:
{
  "cookies": {
    "number": "1234567"
  }
}
"""
```
2.7 The PreparedRequest object
For the Request object itself, reading the source is enough; with a Request object, a request can be treated as a standalone object. A small demo below.
```python
from requests import Request, Session

url = 'http://www.baidu.com'
headers = {}

s = Session()
req = Request('GET', url)            # headers can be passed here too
prepared = s.prepare_request(req)
r = s.send(prepared)
print(r.text)
```
It's worth working out the difference between Request.prepare() and Session.prepare_request().
The Advanced Usage page of the Requests 2.28.1 documentation covers it, but not very clearly; I'll expand this once I actually need it. A rough sketch of the distinction follows.
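My current understanding from that page: Request.prepare() does not apply Session-level state (such as cookies already stored on the Session), while Session.prepare_request() merges that state in. A minimal sketch of that understanding:

```python
from requests import Request, Session

s = Session()
s.cookies.set('number', '1234567')   # session-level state

req = Request('GET', 'http://httpbin.org/cookies')

# Request.prepare(): session state is NOT applied, so no Cookie header
p1 = req.prepare()
print(p1.headers.get('Cookie'))      # None

# Session.prepare_request(): session cookies are merged in
p2 = s.prepare_request(req)
print(p2.headers.get('Cookie'))      # number=1234567

r = s.send(p2)
print(r.text)                        # {"cookies": {"number": "1234567"}}
```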