Scraping Baidu Netdisk resources with Python, python3

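After working through the previous chapter, we are ready to write our first real crawler.

The site we will scrape this time is Baidu Tieba, specifically the 生活大爆炸 (The Big Bang Theory) forum.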

Forum URL:

https://tieba.baidu.com/f?kw=生活大爆炸&ie=utf-8

Python version: 3.6.2 (Python 3 is recommended).

Browser: Chrome

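Goal analysis

Fetch the forum pages for the chosen page numbers, filter and analyze the fetched content to find each post's title, poster, date, reply count, and link, and save the results to a text file.

Did the forum URL above look messy, full of characters you could not recognize? Those are simply Chinese characters:

生活大爆炸

After URL encoding, this becomes %E7%94%9F%E6%B4%BB%E5%A4%A7%E7%88%86%E7%82%B8.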

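The ie=utf-8 at the end of the link indicates that the link is UTF-8 encoded.

Python 3 strings are Unicode and source files default to UTF-8, so no manual encoding conversion is needed here.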
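To make the encoding step concrete, here is a minimal sketch using the standard library's urllib.parse. It only shows how the keyword maps to its percent-encoded form and back; the variable names are chosen for illustration.

```python
from urllib.parse import quote, unquote

keyword = "生活大爆炸"

# Percent-encode the UTF-8 bytes of the keyword, as a browser does
# when it places the text in a query string.
encoded = quote(keyword, encoding="utf-8")
print(encoded)

# Decoding reverses the step and recovers the original characters.
decoded = unquote(encoded, encoding="utf-8")
print(decoded)
```

The encoded form is the same string that appears after kw= in the base URL used by the full script.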

Next, go to the second page of the forum:

https://tieba.baidu.com/f?kw=生活大爆炸&ie=utf-8&pn=50

Notice that one more parameter has been appended to the end of the link:

pn=50

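From this it is easy to guess how the parameter relates to the page number:

pn=0: first page
pn=50: second page
pn=100: third page
pn=50*(n-1): page n

The 50 means each page holds 50 posts.

Now we can turn pages with a simple change to the URL.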
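The paging rule can be sketched as a small helper. The function name page_url and the constants below are assumptions made for illustration; only the kw, ie, and pn parameters come from the URLs shown above.

```python
from urllib.parse import urlencode

BASE = "https://tieba.baidu.com/f"
POSTS_PER_PAGE = 50  # from the observation above: pn grows by 50 per page


def page_url(keyword, page):
    """Return the list URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    query = urlencode({
        "kw": keyword,
        "ie": "utf-8",
        "pn": (page - 1) * POSTS_PER_PAGE,
    })
    return f"{BASE}?{query}"


# Build the URLs of the first three pages.
urls = [page_url("生活大爆炸", p) for p in range(1, 4)]
for u in urls:
    print(u)
```

urlencode also percent-encodes the Chinese keyword, so the result can be passed straight to an HTTP client.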

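Chrome developer tools

To write a crawler, we have to use the developer tools. They were designed for front-end developers, but they let us quickly locate the information we want to scrape and spot the patterns behind it.

Right-click on the page and choose Inspect to open the Chrome developer tools.

Use the element picker (the mouse-arrow icon in the top-left corner) to jump straight to the position of each post.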

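A closer look shows that each post's content is wrapped in an li tag:

<li class=" j_thread_list clearfix">

This lets us quickly find every tag that matches the rule, then analyze its contents and finally filter out the data.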

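Let's start writing code. First, the function that fetches a page:

This is the framework introduced earlier, and we will use it often from now on.

    import requests
    from bs4 import BeautifulSoup

    # First, the function used to fetch a web page
    def get_html(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            # Baidu Tieba is known to be UTF-8 encoded, so we set it by hand.
            # When crawling other pages, prefer:
            # r.encoding = r.apparent_encoding
            r.encoding = 'utf-8'
            return r.text
        except requests.RequestException:
            return " ERROR "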

Now let's break down the structure inside each li tag.

One large li tag wraps many div tags, and the information we want sits inside those divs.

# Title and post link:
<a href="/p/4830198616" title="你再给你几分第九季的这张侧脸？" target="_blank" class="j_th_tit ">…

# Poster:
<span class="tb_icon_author " title="主题作者: Li欣远" data-field="{&quot;user_id&quot;:836897637}"><i class="icon_author"></i><span class="frs-author-name-wrap"><a data-field="…" class="frs-author-name j_user_card " href="/home/main/?un=Li欣远&ie=utf-8&fr=frs" target="_bl…

# Reply count:
…<span class="threadlist_rep_num center_text" title="回复">24</span></div>

# Post date:
<span class="pull-right is_show_create_time" title="创建时间">2016-10</span>

With the analysis done, we can easily get the results we want with the soup.find() method.
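As a rough illustration of that lookup, the sketch below runs find_all() and find() against a hand-written fragment that imitates one post entry. The sample HTML and its values are made up for the example and were not copied from a live page. It uses the built-in html.parser and matches on a single class name, which works whatever other classes or stray spaces the attribute carries.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates one post entry (assumed structure).
SAMPLE = """
<ul>
  <li class=" j_thread_list clearfix">
    <div><span class="threadlist_rep_num center_text" title="回复">24</span></div>
    <div>
      <a href="/p/4830198616" title="sample title" target="_blank" class="j_th_tit ">sample title</a>
      <span class="tb_icon_author " title="主题作者: sample_user">sample_user</span>
      <span class="pull-right is_show_create_time" title="创建时间">2016-10</span>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

posts = []
# find_all returns every matching li; find returns the first match inside it.
for li in soup.find_all("li", class_="j_thread_list"):
    link = li.find("a", class_="j_th_tit")
    posts.append({
        "title": link.text.strip(),
        "link": "http://tieba.baidu.com" + link["href"],
        "name": li.find("span", class_="tb_icon_author").text.strip(),
        "time": li.find("span", class_="is_show_create_time").text.strip(),
        "replyNum": li.find("span", class_="threadlist_rep_num").text.strip(),
    })

print(posts)
```

On a real page any of these find() calls may return None, so the lookups should be guarded before reading .text.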

The full implementation:

    '''
    Scrape basic information from Baidu Tieba (the 生活大爆炸 forum)
    Crawler stack: requests - bs4
    Python version: 3.6
    OS: mac os 12.12.4
    '''
    import requests
    import time
    from bs4 import BeautifulSoup


    # First, the function that fetches a web page
    def get_html(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            # We know Baidu Tieba is UTF-8 encoded, so we set it by hand.
            # When crawling other pages, prefer:
            # r.encoding = r.apparent_encoding
            r.encoding = 'utf-8'
            return r.text
        except requests.RequestException:
            return " ERROR "


    def get_content(url):
        '''
        Parse a forum page, collect the post information,
        and return it as a list.
        '''
        # List that holds the information for every post
        comments = []
        # Download the page we want to scrape
        html = get_html(url)
        # Make the soup
        soup = BeautifulSoup(html, 'lxml')
        # Following the earlier analysis, find every li tag that carries the
        # 'j_thread_list' class. find_all returns a list.
        # We match on a single class name: a multi-class string with stray
        # spaces may not match in some BeautifulSoup versions.
        liTags = soup.find_all('li', attrs={'class': 'j_thread_list'})
        # Loop over the posts and pull out the fields we need
        for li in liTags:
            # Dictionary that stores one post's information
            comment = {}
            # try/except keeps the crawler running when a field is missing
            try:
                # Filter the information and save it in the dictionary
                comment['title'] = li.find(
                    'a', attrs={'class': 'j_th_tit'}).text.strip()
                comment['link'] = "http://tieba.baidu.com" + \
                    li.find('a', attrs={'class': 'j_th_tit'})['href']
                comment['name'] = li.find(
                    'span', attrs={'class': 'tb_icon_author'}).text.strip()
                comment['time'] = li.find(
                    'span', attrs={'class': 'is_show_create_time'}).text.strip()
                comment['replyNum'] = li.find(
                    'span', attrs={'class': 'threadlist_rep_num'}).text.strip()
                comments.append(comment)
            except (AttributeError, TypeError, KeyError):
                print('Something went wrong')
        return comments


    def Out2File(comments):
        '''
        Append the scraped posts to TTBT.txt in the current directory.
        '''
        with open('TTBT.txt', 'a+', encoding='utf-8') as f:
            for comment in comments:
                f.write('Title: {} \t Link: {} \t Poster: {} \t Posted: {} \t Replies: {} \n'.format(
                    comment['title'], comment['link'], comment['name'],
                    comment['time'], comment['replyNum']))
            print('Finished scraping the current page')


    def main(base_url, deep):
        url_list = []
        # Store every URL that needs to be crawled in a list
        for i in range(0, deep):
            url_list.append(base_url + '&pn=' + str(50 * i))
        print('All page URLs are ready! Start filtering information...')
        # Fetch each page and write its data
        for url in url_list:
            content = get_content(url)
            Out2File(content)
        print('All information has been saved!')


    base_url = 'http://tieba.baidu.com/f?kw=%E7%94%9F%E6%B4%BB%E5%A4%A7%E7%88%86%E7%82%B8&ie=utf-8'
    # Number of pages to crawl
    deep = 3

    if __name__ == '__main__':
        main(base_url, deep)

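The code carries detailed comments and the reasoning behind each step. If something is unclear, read through it a few more times.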
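The output step depends on tab and newline escapes and on the file encoding, so a small stand-alone sketch may help. The helper to_line, the FIELDS list, and the sample record are assumptions for illustration, and the file is written to a temporary directory so the sketch leaves the working directory untouched.

```python
import os
import tempfile

FIELDS = ["title", "link", "name", "time", "replyNum"]


def to_line(comment):
    """Join one post's fields with tabs and end the record with a newline."""
    return "\t".join(str(comment[key]) for key in FIELDS) + "\n"


# Made-up record with the same keys the crawler collects.
sample = {
    "title": "sample title",
    "link": "http://tieba.baidu.com/p/4830198616",
    "name": "sample_user",
    "time": "2016-10",
    "replyNum": "24",
}

path = os.path.join(tempfile.mkdtemp(), "TTBT.txt")

# An explicit encoding keeps Chinese titles intact whatever the OS default is.
with open(path, "a", encoding="utf-8") as f:
    f.write(to_line(sample))

# Read the file back and split each record into its fields.
with open(path, encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]

print(rows)
```

Keeping one record per line with a fixed separator makes the file easy to load again later, for example in a spreadsheet.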
That's all for today's article.