
A Walkthrough of Scraping Website Data with Scrapy (Scraping Images)


The Scrapy Framework in Practice

Scraping target: girl photos from the site 唯美女生 (vmgirls.com)

First, open the site's home page.

Then analyze the page's source code.

It is not hard to spot the pattern in the detail-page URLs: each post lives at an address of the form https://www.vmgirls.com/<numeric ID>.html, which is exactly what the LinkExtractor rule in the spider below matches.


Inside a detail page, each image's address is stored in the data-src attribute of its img tag.

The page data is rendered directly into the HTML rather than loaded by JavaScript, so we can download the images simply by extracting their addresses from the markup.
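One quick way to verify this before writing any spider code is Scrapy's interactive shell (a check added here for illustration; the detail-page URL below is made up, and the XPath expressions are the ones the spider uses later):

scrapy shell "https://www.vmgirls.com/12345.html"   # illustrative detail-page URL
# at the Python prompt the shell opens:
response.xpath("//div[@class='post']//h1/text()").get()                      # post title
response.xpath("//div[@class='nc-light-gallery']//img/@data-src").getall()  # image addresses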

I used the crawl spider template that ships with Scrapy.
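The post does not show the project setup, but for reference these are the standard Scrapy commands that produce such a project and a crawl-template spider with the names used below:

scrapy startproject vmgirls
cd vmgirls
scrapy genspider -t crawl vmgirls_spider www.vmgirls.com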


The spider code:
vmgirls_spider.py

Note: the commented-out lines in the code are there because different detail pages store the image paths in differently structured markup, so the extraction has to differ.

Some pages put one <a> tag wrapping a single <img> inside each <p>; some put all the <img> tags inside a single <p>; still others spread the images across several <p> tags, as in the sketch below.
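A minimal sketch (the sample markup below is assumed, not copied from the real site) of why looping over every <p> and reading img/@data-src covers all of these layouts:

from scrapy.selector import Selector

html = """
<div class="nc-light-gallery">
  <p><a href="/1.jpg"><img data-src="/1.jpg"></a></p>
  <p><img data-src="/2.jpg"><img data-src="/3.jpg"></p>
</div>
"""

sel = Selector(text=html)
for p in sel.xpath("//div[@class='nc-light-gallery']/p"):
    # img/@data-src works for both layouts; a/@href only for the first
    print(p.xpath(".//img/@data-src").getall())
# prints ['/1.jpg'] and then ['/2.jpg', '/3.jpg']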

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from vmgirls.items import VmgirlsItem


class VmgirlsSpiderSpider(CrawlSpider):
    name = 'vmgirls_spider'
    allowed_domains = ['www.vmgirls.com']
    start_urls = ['https://www.vmgirls.com']

    rules = (
        # Detail pages are a numeric ID followed by .html
        Rule(LinkExtractor(allow=r'https://www.vmgirls.com/\d+.html'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        girl_div = response.xpath("//div[@class='post']")
        girl_title = girl_div.xpath(".//h1/text()").get()
        # Variant for pages where the links sit in the last <p>'s <a> tags:
        # girl_imgs_urls = girl_div.xpath(".//div[@class='post-content']/div[@class='nc-light-gallery']/p[last()]/a/@href").getall()
        girl_imgs_ps = girl_div.xpath(".//div[@class='post-content']/div[@class='nc-light-gallery']/p")
        girl_imgs_urls = []
        for girl_imgs_p in girl_imgs_ps:
            # Variant for pages that wrap each image in an <a> tag:
            # girl_imgs_url = girl_imgs_p.xpath(".//a/@href").getall()
            girl_imgs_url = girl_imgs_p.xpath(".//img/@data-src").getall()
            girl_imgs_urls.extend(girl_imgs_url)  # extend appends all elements to the list
        item = VmgirlsItem(title=girl_title, imgurls=girl_imgs_urls)
        yield item
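The spider relies on VmgirlsItem from items.py, which the post does not show; a minimal definition consistent with the fields used above would be:

import scrapy


class VmgirlsItem(scrapy.Item):
    title = scrapy.Field()    # post title, used as the folder name later
    imgurls = scrapy.Field()  # list of image URLs extracted from the page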

Data processing:
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request


class VmgirlsPipeline(object):
    def __init__(self):
        pass

    def open_spider(self, spider):
        print("Spider started")
        # Save images to an "images" folder at the project root
        self.images_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
        if not os.path.exists(self.images_path):
            os.mkdir(self.images_path)

    def process_item(self, item, spider):
        title = item['title']
        imgurls = item['imgurls']
        # One sub-folder per post, named after its title
        title_path = os.path.join(self.images_path, title)
        if not os.path.exists(title_path):
            os.mkdir(title_path)
        for url in imgurls:
            image_name = url.split('/')[-1]
            # Send a browser User-Agent so the download is not blocked as a script
            opener = request.build_opener()
            opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36')]
            request.install_opener(opener)
            request.urlretrieve(url, os.path.join(title_path, image_name))
        return item

    def close_spider(self, spider):
        print("Spider finished")
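As the header comment says, the pipeline must be registered in settings.py or process_item() is never called. The post does not show its settings; a sketch of the minimum wiring, with typical values assumed rather than taken from the author:

BOT_NAME = 'vmgirls'
SPIDER_MODULES = ['vmgirls.spiders']
NEWSPIDER_MODULE = 'vmgirls.spiders'

# Register the pipeline so scraped items reach VmgirlsPipeline
ITEM_PIPELINES = {
    'vmgirls.pipelines.VmgirlsPipeline': 300,
}

# Assumption: the site's robots.txt may disallow crawling, in which case
# the spider only works with this turned off
ROBOTSTXT_OBEY = False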

The scraped result:
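With the pipeline above, the downloaded files end up in a layout roughly like this (folder and file names are illustrative):

images/
    <post title A>/
        photo-1.jpeg
        photo-2.jpeg
    <post title B>/
        ...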
