
Scrapy: debugging 415 and 400 errors on a POST spider

Posted: 2023-05-04 02:57:40 · Views: 184029 · Author: 4334

I ran into a few small problems while using the Scrapy framework today. They cost me half a day, so I'm writing them down.

Status 415: the request was sent without headers

The request
This is a POST request that submits form data, so I used scrapy.FormRequest to build it. The spider code:

import json

import scrapy
from scrapy import Spider


class yilicai(Spider):
    name = "yilicai"
    urls = "http://api.yilicai.cn/product/all5"
    base_url = "https://www.yilicai.cn"
    DOWNLOAD_DELAY = 0
    count = 0
    appmc = "壹理财"

    def start_requests(self):
        formdata = {
            'page': '1',
            'sType': '0',
            'sTerm': '0',
            'sRate': '0',
            'sRecover': '0',
            'sStart': '0'
        }
        yield scrapy.FormRequest(self.urls, callback=self.parse, formdata=formdata)

    def parse(self, response):
        datas = json.loads(response.body)
        print(json.dumps(datas, sort_keys=True, indent=2))

Running the spider produced a 415 error:

2018-05-07 17:00:20 [scrapy.core.engine] DEBUG: Crawled (415) <POST http://api.yilicai.cn/product/all5> (referer: None)
2018-05-07 17:00:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <415 http://api.yilicai.cn/product/all5>: HTTP status code is not handled or not allowed
2018-05-07 17:00:21 [scrapy.core.engine] INFO: Closing spider (finished)

I looked up what HTTP status code 415 means:

415 Unsupported Media Type: the server cannot handle the media format of the request.
It turned out I hadn't added any headers. With headers added, the code becomes:

headers = {
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Linux; U; Android 6.0; zh-cn; AOSP on HammerHead Build/MRA58K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30",
    "Content-Type": "application/json;charset=utf-8",
    "Host": "api.yilicai.cn",
    "Accept-Encoding": "gzip",
}

yield scrapy.FormRequest(self.urls, headers=self.headers, callback=self.parse, formdata=formdata)
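As a side note, when every request in a project needs the same headers, Scrapy can also set them once in settings.py through the standard USER_AGent and DEFAULT_REQUEST_HEADERS settings (per-request headers still override these). A minimal sketch, reusing the values above:

```python
# settings.py -- project-wide defaults applied to every request
USER_AGENT = ("Mozilla/5.0 (Linux; U; Android 6.0; zh-cn; AOSP on HammerHead "
              "Build/MRA58K) AppleWebKit/534.30 (KHTML, like Gecko) "
              "Version/4.0 Mobile Safari/534.30")

DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Content-Type": "application/json;charset=utf-8",
    "Accept-Encoding": "gzip",
}
```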

Status 400: the submitted data was not serialized as JSON
Running again, the 415 error was gone, but a new one appeared: 400.

2018-05-07 17:11:59 [scrapy.core.engine] DEBUG: Crawled (400) <POST http://api.yilicai.cn/product/all5> (referer: None)
2018-05-07 17:11:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://api.yilicai.cn/product/all5>: HTTP status code is not handled or not allowed
2018-05-07 17:11:59 [scrapy.core.engine] INFO: Closing spider (finished)

Frustrating. This 400 error was the real sticking point. First, what status code 400 means:

400 Bad Request: the request was malformed.

It turned out this endpoint strictly requires the submitted body to be JSON, so the form data has to be converted to a JSON string before it is sent.
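To see why the server rejected the first version, it helps to compare the two wire formats. FormRequest sends an application/x-www-form-urlencoded body, while this API wants a JSON document. A plain stdlib sketch using the same field names as above:

```python
import json
from urllib.parse import urlencode

formdata = {'page': '1', 'sType': '0'}

# What FormRequest puts on the wire (form-encoded key=value pairs):
form_body = urlencode(formdata)
print(form_body)   # page=1&sType=0

# What the API actually expects (a JSON object):
json_body = json.dumps(formdata)
print(json_body)   # {"page": "1", "sType": "0"}
```

Same data, two incompatible encodings; the Content-Type header promised JSON, but the body was form-encoded, hence the 400.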

Passing formdata=json.dumps(formdata) to scrapy.FormRequest raises an error (formdata must be a dict or an iterable of key/value tuples, not a string), so I switched to scrapy.Request and set the body myself:

import json

import scrapy
from scrapy import Spider


class yilicai(Spider):
    name = "yilicai"
    urls = "http://api.yilicai.cn/product/all5"
    base_url = "https://www.yilicai.cn"
    DOWNLOAD_DELAY = 0
    count = 0
    appmc = "壹理财"
    headers = {
        "Accept-Language": "zh-CN,zh;q=0.8",
        "User-Agent": "Mozilla/5.0 (Linux; U; Android 6.0; zh-cn; AOSP on HammerHead Build/MRA58K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30",
        "Content-Type": "application/json;charset=utf-8",
        "Host": "api.yilicai.cn",
        "Accept-Encoding": "gzip",
    }

    def start_requests(self):
        formdata = {
            'page': '1',
            'sType': '0',
            'sTerm': '0',
            'sRate': '0',
            'sRecover': '0',
            'sStart': '0'
        }
        temp = json.dumps(formdata)
        # note: no method='POST' is passed, so scrapy.Request defaults to GET;
        # this server accepted the JSON body anyway
        yield scrapy.Request(self.urls, body=temp, headers=self.headers, callback=self.parse)

    def parse(self, response):
        datas = json.loads(response.body)
        print(json.dumps(datas, sort_keys=True, indent=2))

This time the response came back successfully, and I could happily move on to analyzing the data:

2018-05-07 17:47:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://api.yilicai.cn/product/all5> (referer: None)
{
  "base_url": "https://www.yilicai.cn",
  "current_page": "1",
  "new_hand": 1,
  "pager": "1",
  "pagerParam": {
    "count": 16063,
    "maxPage": 1607,
    "perPage": 10
  },
  "product_list": [
    {
      ......
    }
  ],
  "sid": null,
  "status": "0"
}
2018-05-07 17:47:16 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-07 17:47:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
......

And with that, problem solved!
