1. Environment Setup
Install scrapyd (https://github.com/scrapy/scrapyd):
pip install scrapyd
Install scrapyd-client (https://github.com/scrapy/scrapyd-client):
pip install scrapyd-client
Start the service:
scrapyd
Test that it is running: http://localhost:6800/
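If you prefer checking from a script rather than a browser, scrapyd's daemonstatus.json endpoint reports whether the service is up. A minimal sketch using only the Python standard library (the URL assumes the default host and port):

import json
from urllib.request import urlopen

# daemonstatus.json returns the node name plus pending/running/finished job counts
with urlopen('http://localhost:6800/daemonstatus.json') as resp:
    print(json.load(resp))
# e.g. {'node_name': '...', 'status': 'ok', 'pending': 0, 'running': 0, 'finished': 0}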
2. Deploying a Project to scrapyd
Edit the scrapy.cfg file in your Scrapy project:
[deploy]
url = http://localhost:6801/
username = scrapy
password = secret
project = myspider

[deploy:target]
url = http://localhost:6802/
username = scrapy
password = secret
project = myspider
Open a new terminal (Command+N), change into the spider project directory, and deploy the project:
# Deploy a single project to a single server
scrapyd-deploy <target> -p <project> --version <version>

# Deploy a single project to all servers
scrapyd-deploy -a -p <project>
Here target is the server name from your scrapy.cfg (if omitted, the default target is used), and project is your project name.
Deployment examples:
$ scrapy list                  # check the project's spiders
$ scrapyd-deploy -l            # list the available targets
$ scrapyd-deploy               # deploy the default project to the default server
$ scrapyd-deploy -p myspider   # deploy the given project to the default server
$ scrapyd-deploy target        # deploy the default project to the target server
$ scrapyd-deploy -a            # deploy all projects to all servers
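Under the hood, scrapyd-deploy packages the project as an egg and uploads it to scrapyd's addversion.json endpoint, so you can also deploy programmatically. A sketch assuming the third-party requests package is installed and that you have already built an egg (for example with scrapyd-deploy --build-egg=myspider.egg); the file name and version string here are placeholders:

import requests

# addversion.json uploads a packaged project egg to the scrapyd server
with open('myspider.egg', 'rb') as egg:
    resp = requests.post(
        'http://localhost:6800/addversion.json',
        data={'project': 'myspider', 'version': '1.0'},
        files={'egg': egg},
    )
print(resp.json())  # {'status': 'ok', 'spiders': ...} on success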
3. Starting a Spider
Start a spider with the following command:
curl http://localhost:6800/schedule.json -d project=PROJECT_NAME -d spider=SPIDER_NAME
Replace PROJECT_NAME with the name of your project and SPIDER_NAME with the name of your spider.
The command I ran:
curl http://localhost:6800/schedule.json -d project=myspider -d spider=baidu
Because this test spider is very simple, it finished almost immediately. On the web UI's Jobs page you can see that the spider has completed and now sits in the Finished column.
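The same scheduling call from Python, using only the standard library; schedule.json returns a jobid that you will need later for cancelling or status lookups (the project and spider names match the example above):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# schedule.json expects form-encoded POST data, just like curl -d
data = urlencode({'project': 'myspider', 'spider': 'baidu'}).encode()
with urlopen('http://localhost:6800/schedule.json', data=data) as resp:
    result = json.load(resp)
print(result['jobid'])  # keep this id to cancel or inspect the job later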
4. Stopping a Spider
curl http://localhost:6800/cancel.json -d project=PROJECT_NAME -d job=JOB_ID
More API endpoints are documented on the official site: http://scrapyd.readthedocs.io/en/latest/api.html
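To cancel a job you first need its JOB_ID; listjobs.json returns the ids of all pending, running, and finished jobs for a project. A sketch that chains the two endpoints (standard library only; the project name myspider is carried over from the earlier examples):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = 'http://localhost:6800'

# listjobs.json lists pending, running and finished jobs for a project
with urlopen(BASE + '/listjobs.json?project=myspider') as resp:
    jobs = json.load(resp)

# cancel every job that is still running
for job in jobs['running']:
    data = urlencode({'project': 'myspider', 'job': job['id']}).encode()
    with urlopen(BASE + '/cancel.json', data=data) as resp:
        print(json.load(resp))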
5. Custom Configuration
Create a scrapyd.conf file in the current directory to change the port scrapyd listens on, the log directory, and other settings:
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 127.0.0.1
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
If you have many pending jobs, try lowering poll_interval to 1.0 so queued jobs are picked up more quickly.
6. Scrapyd-API
Project page: https://github.com/djm/python-scrapyd-api
It lets you monitor and run Scrapy projects with a few lines of Python:
pip install python-scrapyd-api
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')
scrapyd.list_jobs('project_name')
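A slightly fuller sketch of the wrapper: its methods map one-to-one onto the JSON endpoints above, and schedule() returns the job id directly (the project and spider names are the ones used earlier; schedule, job_status, and cancel are part of the library's documented API):

from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')

# schedule() wraps schedule.json and returns the new job's id
job_id = scrapyd.schedule('myspider', 'baidu')

# job_status() resolves to '', 'pending', 'running' or 'finished'
print(scrapyd.job_status('myspider', job_id))

# cancel() wraps cancel.json
scrapyd.cancel('myspider', job_id)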