1 urllib
A quick overview of this part is enough.
A simple demo:
```python
from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf8'))
```
Four submodules:
- urllib.request: opens and reads URLs.
- urllib.error: contains the exceptions raised by urllib.request.
- urllib.parse: parses URLs.
- urllib.robotparser: parses robots.txt files.
This section mainly covers a few methods that I either use often or am less familiar with.
1.1 urllib.request.Request
```python
request.Request(
    url,                    # required
    data=None,              # must be bytes (a byte stream)
    headers={},
    origin_req_host=None,
    unverifiable=False,
    method=None,
)
```
Demo:
```python
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66',
    'Host': 'httpbin.org'
}
payload = {'name': 'test_user'}
data = bytes(parse.urlencode(payload), encoding='utf8')  # encode the dict into a byte stream first

req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# output
"""
{
  "args": {},
  "data": "",
  "files": {},
  "form": {               # this is the content we uploaded
    "name": "test_user"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "14",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66",
    "X-Amzn-Trace-Id": "Root=1-62c7e429-2dfdef7313887be20022ab31"
  },
  "json": null,
  "origin": "60.168.149.12",
  "url": "http://httpbin.org/post"
}
"""
```
Headers can also be added with add_header():
```python
req.add_header('User-Agent', 'XXXXX')
```
1.2 The Handler classes
Handlers are mainly used for more advanced operations (cookie handling, proxies, and so on).

- BaseHandler: the base class that the other handlers inherit from

To be expanded when I need it; a small sketch of the usual pattern follows.
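For example, a minimal sketch chaining handlers with build_opener(); the proxy address here is just a placeholder, not a working proxy:

```python
from urllib import request
from http.cookiejar import CookieJar

cookie_jar = CookieJar()
opener = request.build_opener(
    request.HTTPCookieProcessor(cookie_jar),                   # store/send cookies automatically
    request.ProxyHandler({'http': 'http://127.0.0.1:8080'}),   # route requests through a proxy (placeholder address)
)
response = opener.open('http://www.baidu.com')
for cookie in cookie_jar:
    print(cookie.name, cookie.value)
```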
1.3 Exception handling: urllib.error
The urllib.error module defines the exception classes raised by urllib.request; the base exception class is URLError.
The module contains two main exception classes: URLError and HTTPError.
```python
# URLError has a single attribute: reason
from urllib import request, error

try:
    response = request.urlopen('https://baidumatches999.com')
except error.URLError as e:
    print(e.reason)
```
HTTPError is a subclass of URLError and has three attributes:

- code: the HTTP status code
- reason: the cause of the error
- headers: the response headers
```python
from urllib import request, error

try:
    response = request.urlopen('https://baidu.com/test.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

"""
Not Found
404
Content-Length: 206
Content-Type: text/html; charset=iso-8859-1
Date: Fri, 08 Jul 2022 08:24:56 GMT
Server: Apache
Connection: close
"""
```
1.4 urllib.parse
A URL is divided into six parts: scheme (protocol), netloc (domain), path (access path), params (parameters), query (query string), and fragment (anchor):
```
scheme://netloc/path;params?query#fragment
```
The module is mainly used for URL handling: splitting, joining, and so on (see the sketch after the list below).
- urlparse(): splits a URL into its components
  - urlstring: required; the URL to parse
  - scheme: the protocol to assume when the URL itself carries none
  - allow_fragments: whether to parse the fragment; if False, the fragment stays attached to the path or query
- urlunparse(): the inverse of urlparse(); assembles a URL from its six parts
- urlsplit(): like urlparse(), but params is not split out separately (five parts)
- urlunsplit(): the inverse of urlsplit()
- urljoin(): resolves a link against a base URL
- parse_qsl(): parses a query string into a list of (key, value) tuples
- quote(): converts content into URL-encoded (percent-encoded) form
```python
# parse
from urllib.parse import quote

url = "http://www.baidu.com/" + quote('你好')
print(url)
"""
'http://www.baidu.com/%E4%BD%A0%E5%A5%BD'
"""
```
- unquote(): decodes URL-encoded content
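A quick sketch of the other functions in this list, so they aren't left bare (my own examples):

```python
from urllib.parse import urlparse, urlunparse, urljoin, parse_qsl

# urlparse: split a URL into its six components
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path, result.params, result.query, result.fragment)
# http www.baidu.com /index.html user id=5 comment

# urlunparse: assemble a URL from a six-element iterable
print(urlunparse(['http', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']))
# http://www.baidu.com/index.html;user?id=5#comment

# urljoin: resolve a (possibly relative) link against a base URL
print(urljoin('http://www.baidu.com/about.html', 'faq.html'))
# http://www.baidu.com/faq.html

# parse_qsl: turn a query string into a list of (key, value) tuples
print(parse_qsl('name=test_user&age=20'))
# [('name', 'test_user'), ('age', '20')]
```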
1.5 The Robots protocol
The robots protocol (also called the crawler protocol or crawler rules) lets a website publish a robots.txt file that tells search engines which pages may be crawled and which may not; a search engine reads robots.txt to decide whether a page is allowed. Note, though, that the robots protocol is not a firewall and carries no enforcement power: a search engine is perfectly able to ignore robots.txt and crawl page snapshots anyway.
The urllib.robotparser module parses robots.txt.
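A small sketch of the usual usage (the Zhihu URLs are just examples):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.zhihu.com/robots.txt')  # point at the site's robots.txt
rp.read()                                        # fetch and parse it

# can_fetch(user_agent, url): is this agent allowed to crawl this URL?
print(rp.can_fetch('*', 'https://www.zhihu.com/explore'))
```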
2 The Requests module
requests makes many operations more convenient than urllib; below, the same examples are used to compare the two approaches.
2.1 Basic usage
The most used method is get():
```python
import requests

url = 'XXXX'
r = requests.get(url)
```
post(), put(), and delete() send the corresponding requests, for instance:
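```python
import requests

# each HTTP verb has a matching top-level helper; httpbin.org echoes back what we send
r = requests.post('http://httpbin.org/post', data={'name': 'test_user'})
r = requests.put('http://httpbin.org/put', data={'name': 'test_user'})
r = requests.delete('http://httpbin.org/delete')
print(r.status_code)
```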
- json(): parse the response body directly into a dict
```python
import requests

r = requests.get('http://httpbin.org/get')
print(r.text)
print(r.json())  # parse the response body as JSON (returns a dict)

"""
Output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-62c80370-7ed4a0a10c7b27106031d457"
  },
  "origin": "60.168.149.12",
  "url": "http://httpbin.org/get"
}
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0', 'X-Amzn-Trace-Id': 'Root=1-62c80370-7ed4a0a10c7b27106031d457'}, 'origin': '60.168.149.12', 'url': 'http://httpbin.org/get'}
"""
```
2.2 Request headers
Many sites return nothing useful unless we add request headers.
Take the Zhihu Explore page below:
```python
import requests

r = requests.get('https://www.zhihu.com/explore')
print(r.text)

"""
Output:
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>openresty</center>
</body>
</html>
"""
```
With a request header added, the content is fetched normally.
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66'
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
print(r.text)
```
Other header fields are also worth mastering; Referer and Cookie in particular are sometimes needed to get past anti-scraping checks.
See my earlier post: 【爬虫】Web基础——响应头、请求头、http&https、状态码
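For example, a hypothetical headers dict carrying both fields; all values are placeholders to be copied from the browser's developer tools:

```python
import requests

# Referer and Cookie values below are placeholders, not real credentials
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66',
    'Referer': 'https://www.zhihu.com/',
    'Cookie': 'sessionid=XXXXX',
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
```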
2.3 post
```python
def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """
```
I won't list another demo; the body is submitted through the data parameter. The docstring shows that its parameters mirror those of request(), i.e., every parameter of request() below can be used here too, e.g., files (which lets us upload files):
```python
# file upload
import requests

files = {'file': open('./data/test.png', 'rb')}
r = requests.post('http://httpbin.org/post', files=files)
print(r.text)
```
2.4 requests.request
Let's look at the request() function to see exactly which parameters it accepts.
```python
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``)
        for multipart encoding upload. ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``,
        3-tuple ``('filename', fileobj, 'content_type')`` or a 4-tuple
        ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object
        containing additional headers to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
        the server's TLS certificate, or a string, in which case it must be a path
        to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'https://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
```
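A quick sketch exercising a few of these parameters (the values are arbitrary):

```python
import requests

# params builds the query string; timeout is given as a (connect, read) tuple;
# allow_redirects and verify are shown with their defaults made explicit
r = requests.request(
    'GET',
    'http://httpbin.org/get',
    params={'name': 'test_user'},   # -> http://httpbin.org/get?name=test_user
    timeout=(3.0, 10.0),
    allow_redirects=True,
    verify=True,
)
print(r.url)
print(r.status_code)
```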
2.5 Cookies
Cookies are data that certain websites store on the user's device in order to identify the user and track the session; they can be used to maintain session state.
When the client requests the server for the first time, the response carries a Set-Cookie field and the client's browser stores the cookie.
On the next visit, the browser attaches the cookie to the request, and the server uses it to determine the session state.
```python
import requests

r = requests.get('http://www.baidu.com')
cookies = r.cookies
for key, value in cookies.items():
    print(key + ':' + value)

"""
BAIDUID:974A
BIDUPSID:974A9F
PSTM:1657
"""
```
2.6 Maintaining a session
The Session object keeps requests inside the same session; think of each Session as its own browser.
- Demo 1: equivalent to visiting with two different browsers
```python
import requests

requests.get('http://httpbin.org/cookies/set/number/1234567')
r = requests.get('http://httpbin.org/cookies')
print(r.text)

"""
Output:
{
  "cookies": {}
}
"""
```
You can also open the two URLs in a browser yourself, which makes the behaviour easier to see.
- Demo 2: with a Session object
```python
import requests

sess = requests.Session()
sess.get('http://httpbin.org/cookies/set/number/1234567')
r = sess.get('http://httpbin.org/cookies')
print(r.text)

"""
Output:
{
  "cookies": {
    "number": "1234567"
  }
}
"""
```
2.7 The PreparedRequest object
For the Request object itself, reading the source is enough; with a Request object, a request can be treated as a standalone object. A small demo below.
```python
from requests import Request, Session

url = 'http://www.baidu.com'
headers = {}

s = Session()
req = Request('GET', url)            # headers can be passed here too
prepared = s.prepare_request(req)
r = s.send(prepared)
print(r.text)
```
It's worth working out the difference between Request.prepare() and Session.prepare_request().
The Advanced Usage page of the Requests 2.28.1 documentation covers it, but not very clearly; I'll expand this once I actually need it. A rough sketch of the distinction follows.
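My current understanding from that page: Request.prepare() does not apply Session-level state (such as cookies already stored on the Session), while Session.prepare_request() merges that state in. A minimal sketch of that understanding:

```python
from requests import Request, Session

s = Session()
s.cookies.set('number', '1234567')   # session-level state

req = Request('GET', 'http://httpbin.org/cookies')

# Request.prepare(): session state is NOT applied, so no Cookie header
p1 = req.prepare()
print(p1.headers.get('Cookie'))      # None

# Session.prepare_request(): session cookies are merged in
p2 = s.prepare_request(req)
print(p2.headers.get('Cookie'))      # number=1234567

r = s.send(p2)
print(r.text)                        # {"cookies": {"number": "1234567"}}
```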