Writing a Python Web Scraper from Scratch - Alibaba Cloud Developer Community


2017-09-20 1312



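As most people know, Python is widely used to write web scrapers (crawlers) that collect the information we need from the internet.

Writing a scraper in Python relies on a few packages:

requests

urllib

BeautifulSoup

and others. For an overview of the available tools, see: A List of Python Scraping Tools (Python 爬虫的工具列表).
Today's example is a simple scraper. Trading meme images ("斗图") is popular in online chats, and I stumbled on a site, www.doutula.com, whose images are quite funny and worth saving for later use.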

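Let's scrape the "最新斗图表情" (latest meme images) section.

The listing is paginated at the bottom, so first look at the format of the page URLs.

It is easy to see that the page URL is: https://www.doutula.com/photo/list/?page=x, where x is the page number.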
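To make the URL pattern concrete, here is a minimal sketch that builds the list of page URLs. The helper name `build_page_urls` is made up for this illustration; only the base URL comes from the article.

```python
BASE_URL = 'https://www.doutula.com/photo/list/?page='

def build_page_urls(first_page, last_page):
    """Return the listing URLs for pages first_page..last_page (inclusive)."""
    return [BASE_URL + str(page) for page in range(first_page, last_page + 1)]

# Print the URLs of the first three pages.
for url in build_page_urls(1, 3):
    print(url)
```

Nothing is requested here; the function only concatenates strings, so it is safe to run without touching the site.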

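Step by step:

First, try scraping only the images on the first page.

Each downloaded image is renamed and stored in the images directory under the project root.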
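One possible way to derive the new file name is sketched below: keep the extension of the original image and replace the rest with a running number. The function name `local_path` and its parameters are assumptions for this example, and it uses `urllib.parse` rather than plain string splitting so that a query string cannot end up in the name.

```python
import os
from urllib.parse import urlparse

def local_path(src, index, folder='images'):
    """Build '<folder>/<index><ext>' from an image URL, keeping its extension."""
    # Only look at the path component of the URL.
    url_path = urlparse(src).path
    extension = os.path.splitext(url_path)[1]   # e.g. '.gif'
    return os.path.join(folder, str(index) + extension)

src = 'http://ws1.sinaimg.cn/bmiddle/9150e4e5ly1fjlv8kgzr0g20ae08j74p.gif'
print(local_path(src, 0))
```

The directory itself still has to exist before a file is written into it; `os.makedirs('images', exist_ok=True)` takes care of that.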

Next, look at how the img tags on the page are written.
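As a rough illustration of what to look for, the sketch below extracts image addresses from a small piece of sample markup. The markup is an assumption reconstructed from the class name and the `data-original` attribute that the scraper reads, not a copy of the live page, and the parser uses only the standard library's `html.parser` so it runs without installing anything. `LazyImageParser` is a name invented for this example.

```python
from html.parser import HTMLParser

# Sample markup (assumed shape): the real address sits in data-original,
# while src only holds a placeholder until the image is lazily loaded.
SAMPLE_HTML = '''
<a href="#">
  <img class="img-responsive lazy image_dta"
       data-original="//ws1.sinaimg.cn/bmiddle/9150e4e5ly1fjlv8kgzr0g20ae08j74p.gif"
       src="placeholder.gif">
</a>
<img class="logo" src="logo.png">
'''

class LazyImageParser(HTMLParser):
    """Collect data-original from every <img> carrying the 'image_dta' class."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get('class') or '').split()
        if tag == 'img' and 'image_dta' in classes and attributes.get('data-original'):
            src = attributes['data-original']
            # Protocol-relative URLs ("//host/...") need a scheme before downloading.
            if src.startswith('//'):
                src = 'http:' + src
            self.sources.append(src)

parser = LazyImageParser()
parser.feed(SAMPLE_HTML)
print(parser.sources)
```

The logo image is skipped because it lacks the class, which is the same filtering idea as selecting by class with BeautifulSoup.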

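Now we can start writing the program. Create a project in the PyCharm IDE.

The scraper uses requests and urllib. urllib ships with the Python 3 standard library, so only requests has to be installed: File -> Default Settings

Click the green + button on the right.

Install BeautifulSoup (package name beautifulsoup4) and lxml the same way.

With the packages in place, we can import them and start coding: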

import os
import urllib.request  # "import urllib" alone does not load the request submodule

import requests
from bs4 import BeautifulSoup

# Browser-like headers. Accept-Encoding is left out on purpose:
# urllib does not decompress gzip/deflate responses by itself.
HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
}

url = 'https://www.doutula.com/photo/list/?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
img_list = soup.find_all('img', attrs={'class': 'img-responsive lazy image_dta'})

# Make sure the output directory exists before writing into it
os.makedirs('images', exist_ok=True)

for i, img in enumerate(img_list):
    src = img['data-original']
    # e.g. src = '//ws1.sinaimg.cn/bmiddle/9150e4e5ly1fjlv8kgzr0g20ae08j74p.gif'
    print(src)
    if not src.startswith('http'):
        src = 'http:' + src
    filename = src.split('/').pop()
    fileextra = filename.split('.').pop()
    # i is an int, so convert it before concatenating with strings
    filestring = str(i) + '.' + fileextra
    path = os.path.join('images', filestring)
    # Download the image
    req = urllib.request.Request(url=src, headers=HEADERS)
    cont = urllib.request.urlopen(req).read()
    # "with" closes the file even if the write fails
    with open(path, 'wb') as f:
        f.write(cont)


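Note:

1. Send headers with the request so that it looks like it comes from a browser; most sites refuse requests that obviously come from a script.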
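A minimal sketch of attaching such headers with `urllib.request` follows. The helper name `build_request` is an assumption for this example; building the `Request` object does not send anything over the network.

```python
import urllib.request

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/43.0.235',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}

def build_request(url):
    """Return a Request that carries browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(url=url, headers=HEADERS)

req = build_request('https://www.doutula.com/photo/list/?page=1')
# urllib stores header names capitalized, so the key is 'User-agent'.
print(req.get_header('User-agent'))
```

The request would only go out once it is passed to `urllib.request.urlopen(req)`. With the requests library the same dictionary can be passed as `requests.get(url, headers=HEADERS)`.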

With one page done, let's try several pages. Here we scrape pages 1 and 2:

import datetime
import os
import urllib.request  # "import urllib" alone does not load the request submodule

import requests
from bs4 import BeautifulSoup

# Browser-like headers. Accept-Encoding is left out on purpose:
# urllib does not decompress gzip/deflate responses by itself.
HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
}

# begin
print(datetime.datetime.now())
base_url = 'https://www.doutula.com/photo/list/?page='
# range(1, 3) yields pages 1 and 2
URL_LIST = [base_url + str(x) for x in range(1, 3)]

# Make sure the output directory exists before writing into it
os.makedirs('images', exist_ok=True)

i = 0  # running image number across all pages
for page_url in URL_LIST:
    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, 'lxml')
    img_list = soup.find_all('img', attrs={'class': 'img-responsive lazy image_dta'})
    for img in img_list:  # images on one page
        src = img['data-original']
        print(src)
        if not src.startswith('http'):
            src = 'http:' + src
        filename = src.split('/').pop()
        fileextra = filename.split('.').pop()
        filestring = str(i) + '.' + fileextra
        path = os.path.join('images', filestring)
        # Download the image
        req = urllib.request.Request(url=src, headers=HEADERS)
        cont = urllib.request.urlopen(req).read()
        # "with" closes the file even if the write fails
        with open(path, 'wb') as f:
            f.write(cont)
        i += 1
# end
print(datetime.datetime.now())


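That completes multi-page scraping, but it is rather slow; fetching every page this way would take a while.
Python supports multithreading, and because the program spends most of its time waiting on the network, threads can speed it up.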
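The effect can be seen without touching the network. In the sketch below, `fake_download` is a stand-in that merely sleeps for 0.2 seconds; all names are invented for this illustration. Five sequential waits add up, while five threads wait at the same time.

```python
import threading
import time

def fake_download(seconds=0.2):
    """Stand-in for a network request: the thread just waits."""
    time.sleep(seconds)

def run_sequential(count):
    """Run the simulated downloads one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(count):
        fake_download()
    return time.perf_counter() - start

def run_threaded(count):
    """Run each simulated download in its own thread and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # wait until every simulated download has finished
    return time.perf_counter() - start

sequential = run_sequential(5)
threaded = run_threaded(5)
print('sequential: %.2fs, threaded: %.2fs' % (sequential, threaded))
```

Real downloads will not scale this cleanly, since bandwidth and the server's own limits also matter, but the waiting time still overlaps in the same way.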

Let's look at what kind of task this is: we store all page URLs in one list and all image URLs in another, then take the image URLs in order and download them one by one.

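This is an ordered, multithreaded workflow, in other words the producer-consumer pattern. We implement the queue (FIFO, first in, first out) as a list protected by a lock.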
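For comparison, the same pattern can be sketched with the standard library's `queue.Queue`, which does the locking internally. This is a swapped-in alternative to the list-plus-lock approach described above, not the article's own code; the work items are made-up strings and "downloading" is replaced by upper-casing them.

```python
import queue
import threading

NUM_CONSUMERS = 3
STOP = None                      # sentinel telling a consumer to exit
tasks = queue.Queue()            # thread-safe FIFO queue
results = []
results_lock = threading.Lock()

def producer(items):
    """Put every work item on the queue, then one STOP marker per consumer."""
    for item in items:
        tasks.put(item)
    for _ in range(NUM_CONSUMERS):
        tasks.put(STOP)

def consumer():
    """Take items until a STOP marker arrives."""
    while True:
        item = tasks.get()       # blocks until something is available
        if item is STOP:
            break
        with results_lock:
            results.append(item.upper())   # placeholder for "download the image"

urls = ['url-%d' % n for n in range(10)]   # made-up work items
workers = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for w in workers:
    w.start()
producer_thread = threading.Thread(target=producer, args=(urls,))
producer_thread.start()
producer_thread.join()
for w in workers:
    w.join()
print(sorted(results))
```

The sentinel gives the consumers an unambiguous signal that no more work is coming, which is harder to get right when they only check whether a shared list happens to be empty.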

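Let's recall how a queue behaves: items join at the tail and leave from the head, so the first item put in is the first one taken out.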
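A tiny sketch of that behaviour, using `collections.deque` from the standard library, next to a plain list used as a stack for contrast:

```python
from collections import deque

pages = deque()
for page in (1, 2, 3):
    pages.append(page)        # enqueue at the tail

first_out = pages.popleft()   # dequeue from the head

stack = [1, 2, 3]
last_in = stack.pop()         # list.pop() with no argument removes the tail

print(first_out, last_in)     # prints: 1 3
```

Note that a plain `list.pop()` without an argument removes the last element, so a list only behaves as a queue when items are taken with `pop(0)`.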

Now the code. We download the images from page 1 through page 99:

import datetime
import os
import threading
import time
import urllib.request  # "import urllib" alone does not load the request submodule

import requests
from bs4 import BeautifulSoup

# Browser-like headers. Accept-Encoding is left out on purpose:
# urllib does not decompress gzip/deflate responses by itself.
HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
}

i = 0  # number of images downloaded so far
FACE_URL_LIST = []
base_url = 'https://www.doutula.com/photo/list/?page='
URL_LIST = [base_url + str(x) for x in range(1, 100)]
PRODUCERS = []  # producer threads, so consumers can tell when they have finished
# Create the lock that guards both lists and the counter
gLock = threading.Lock()


# Producer: extracts meme image URLs from the listing pages
class producer(threading.Thread):
    def run(self):
        while True:
            # Check and pop under the same lock, so two producers can never
            # both see the last URL and then pop from an empty list
            with gLock:
                if not URL_LIST:
                    break
                cur_url = URL_LIST.pop(0)  # pop(0) takes the oldest entry (FIFO)
            response = requests.get(cur_url)
            soup = BeautifulSoup(response.content, 'lxml')
            img_list = soup.find_all('img', attrs={'class': 'img-responsive lazy image_dta'})
            with gLock:
                for img in img_list:  # images on one page
                    src = img['data-original']
                    print(src)
                    if not src.startswith('http'):
                        src = 'http:' + src
                    FACE_URL_LIST.append(src)
            time.sleep(0.5)


# Consumer: takes URLs from FACE_URL_LIST and downloads the images
class consumer(threading.Thread):
    def run(self):
        global i
        print('%s is running' % threading.current_thread().name)
        while True:
            # Sample this before looking at the list: once every producer has
            # finished, an empty list can never be refilled
            producers_done = not any(p.is_alive() for p in PRODUCERS)
            with gLock:
                face_url = FACE_URL_LIST.pop(0) if FACE_URL_LIST else None
            if face_url is None:
                if producers_done:
                    break
                time.sleep(0.1)  # wait for the producers to add more URLs
                continue
            # Keep the original file name, which avoids numbering races between threads
            filename = face_url.split('/').pop()
            path = os.path.join('images', filename)
            # Download the image
            req = urllib.request.Request(url=face_url, headers=HEADERS)
            cont = urllib.request.urlopen(req).read()
            with open(path, 'wb') as f:
                f.write(cont)
            with gLock:
                print(i)
                i += 1


if __name__ == '__main__':  # only when this file is run as a script
    # begin
    print(datetime.datetime.now())
    os.makedirs('images', exist_ok=True)
    # 2 producer threads collect image links from the pages
    for x in range(2):
        PRODUCERS.append(producer())
    # 5 consumer threads take links from FACE_URL_LIST and download them
    consumers = [consumer() for x in range(5)]
    # Producers come first in the list, so they are running before any consumer checks them
    for t in PRODUCERS + consumers:
        t.start()
    # Wait for all threads, so the end time reflects the real duration
    for t in PRODUCERS + consumers:
        t.join()
    # end
    print(datetime.datetime.now())


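Look in the images folder: lots of new pictures, so you'll never run short of memes again!

OK, that's it for this walkthrough. To finish, a little plug for Python.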
