Liao Xuefeng's website, Liao Xuefeng Git tutorial PDF

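I happened to come across a post from [Python之禅]. After following the public account I found a lot of interesting material there, so I decided to follow the author's walkthrough and try it out myself.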

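Crawler: saving Liao Xuefeng's online tutorial as a local PDF file

1. Preparation:

1.1 Analyze the site structure:

URL: Liao Xuefeng's Python tutorial
Analysis:
The left side of the page is the tutorial's table of contents, and each URL in it corresponds to one article shown on the right. On the right, the title is at the top and the article body is in the middle. The body is what we care about: the data to crawl is the body of every page. Below the body is the user comment area, which is of no use to us and can be ignored.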

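1.2 Prepare the tools:

requests and beautifulsoup are the two workhorses of crawling: requests handles network requests, and beautifulsoup manipulates html data. Converting html files to pdf also needs library support. wkhtmltopdf is a very good tool for this that converts html to pdf on multiple platforms, and pdfkit is a Python wrapper around wkhtmltopdf.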

1.2.1 Install pip, if it was not selected when installing Python

See "Installing and using pip, Python's package manager".

    python get-pip.py

1.2.2 Install requests

    pip install requests

1.2.3 Install beautifulsoup

    pip install beautifulsoup

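This command fails with an error.

The printed output shows that the beautifulsoup package supports Python 2 but not Python 3.
Solution: install beautifulsoup4 instead:

    pip install beautifulsoup4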

1.2.4 Install pdfkit

    pip install pdfkit

1.2.5 Download and install wkhtmltopdf

Download page: wkhtmltopdf
After installation, add the installation directory to the system PATH.
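Before writing the crawler, it can help to confirm that the three packages are importable and that wkhtmltopdf is reachable through PATH. The sketch below uses only the standard library; the function name check_environment is my own and is not part of any of these tools.

```python
import importlib.util
import shutil


def check_environment():
    """Return a dict mapping each required tool to True (found) or False."""
    status = {}
    # Note: the beautifulsoup4 package is imported under the module name "bs4".
    for module in ("requests", "bs4", "pdfkit"):
        status[module] = importlib.util.find_spec(module) is not None
    # shutil.which returns None when the executable is not on PATH.
    status["wkhtmltopdf"] = shutil.which("wkhtmltopdf") is not None
    return status


if __name__ == "__main__":
    for tool, ok in check_environment().items():
        print("%-12s %s" % (tool, "ok" if ok else "MISSING"))
```

If wkhtmltopdf is reported as missing right after installation, the PATH change may not have reached the current terminal session yet; opening a new terminal usually picks it up.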

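2. Crawler implementation:

The goal of the program is to save the html body of every URL locally and then use pdfkit to convert these files into a single pdf file.

This breaks down into two steps: save the html body for one URL locally, then find all the URLs and perform the same operation on each.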
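The two steps can be sketched roughly as follows. To keep the example runnable offline, fetch_body is a hypothetical stand-in that returns canned HTML; in the real crawler it would download the page and extract the body. The file naming (0.html, 1.html, and so on) follows the source code in section 3.2.

```python
import os
import tempfile

HTML_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"></head>
<body>{content}</body>
</html>"""


def fetch_body(url):
    """Hypothetical stand-in: pretend to download `url` and return its body HTML."""
    return '<div class="x-wiki-content"><p>Body of %s</p></div>' % url


def save_bodies(urls, out_dir):
    """Save the body of every URL as <index>.html; return the paths in order."""
    paths = []
    for index, url in enumerate(urls):
        html = HTML_TEMPLATE.format(content=fetch_body(url))
        path = os.path.join(out_dir, "%d.html" % index)
        with open(path, "wb") as f:
            f.write(html.encode("utf-8"))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = save_bodies(["/wiki/a", "/wiki/b"], tmp)
        print([os.path.basename(p) for p in files])
```

The returned list keeps the menu order, which matters because it is the list that would then be handed to pdfkit.from_file to build the single PDF.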

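In Chrome, press F12 and locate the tag that holds the page body: <div class="x-wiki-content">. This div contains the article body. Once requests has loaded the whole page locally, beautifulsoup can operate on the HTML DOM elements to extract the body.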
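As a small illustration of the extraction step, the sketch below runs BeautifulSoup on an inline HTML string instead of a downloaded page, so it works without network access. The sample markup is made up; only the class name x-wiki-content comes from the page analysis above, and extract_body is a name I chose for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample page; a real page would come from requests.get(url).content
SAMPLE_PAGE = """
<html><body>
  <ul class="uk-nav uk-nav-side"><li><a href="/wiki/1">Intro</a></li></ul>
  <div class="x-wiki-content"><p>Hello, Python!</p></div>
  <div class="comments">not needed</div>
</body></html>
"""


def extract_body(page_html):
    """Return the body <div> as an HTML string, or None if it is not found."""
    soup = BeautifulSoup(page_html, "html.parser")
    body = soup.find("div", class_="x-wiki-content")
    return str(body) if body is not None else None


if __name__ == "__main__":
    print(extract_body(SAMPLE_PAGE))
```

Returning None when the div is missing lets the caller decide whether to skip the page or stop, instead of failing with an IndexError.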

3. Python implementation:

I downloaded the author's source code, then tidied it up and added comments.

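3.1 Problems encountered and their solutions:

Problem: file not found
Cause: the wkhtmltopdf value in Configuration is assigned incorrectly.
Fix:
In configuration.py, change the assignment self.wkhtmltopdf = wkhtmltopdf so that it holds the correct location of wkhtmltopdf.

Problem: unknown protocol error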
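For the file-not-found problem, an alternative to editing pdfkit's own configuration.py is to pass the executable's location explicitly through pdfkit.configuration(wkhtmltopdf=...). The sketch below looks on PATH first and then falls back to a list of candidate locations. The helper names and the example Windows path are assumptions for illustration, so adjust the path to wherever wkhtmltopdf is installed on your machine.

```python
import os
import shutil


def locate_wkhtmltopdf(candidates=()):
    """Return the path of the wkhtmltopdf executable, or None if not found."""
    found = shutil.which("wkhtmltopdf")  # search PATH first
    if found:
        return found
    for path in candidates:  # then try explicit install locations
        if os.path.isfile(path):
            return path
    return None


def make_pdfkit_config(path):
    """Build a pdfkit configuration object that points at `path`."""
    import pdfkit  # imported here so locate_wkhtmltopdf works without pdfkit
    return pdfkit.configuration(wkhtmltopdf=path)


if __name__ == "__main__":
    # Assumed install location; change it to match your system.
    exe = locate_wkhtmltopdf([r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"])
    print(exe or "wkhtmltopdf not found")
```

The object returned by make_pdfkit_config can be passed as the configuration argument of pdfkit.from_file, which leaves the installed library untouched.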

The crawler still produces a result, but this problem remains!

3.2 Source code:

    # coding=utf-8
    from __future__ import unicode_literals

    import logging
    import os
    import re
    import time

    try:
        from urllib.parse import urlparse  # py3
    except ImportError:
        from urlparse import urlparse  # py2

    import pdfkit
    import requests
    from bs4 import BeautifulSoup

    html_template = """
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
    {content}
    </body>
    </html>
    """


    class Crawler(object):
        """Base crawler class; every crawler should inherit from it."""

        name = None

        def __init__(self, name, start_url):
            """
            Initialize the crawler.
            :param name: name of the PDF file to save, without the extension
            :param start_url: entry URL of the crawler
            """
            self.name = name
            self.start_url = start_url
            self.domain = '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(self.start_url))

        def crawl(self, url):
            """Fetch a URL and return the response object."""
            print(url)
            response = requests.get(url)
            return response

        def parse_menu(self, response):
            """
            Parse the table of contents and return all article URLs; implemented by subclasses.
            :param response: response object returned by the crawler
            :return: an iterable of URLs (a list, generator, or tuple all work)
            """
            raise NotImplementedError

        def parse_body(self, response):
            """
            Parse the article body; implemented by subclasses.
            :param response: response object returned by the crawler
            :return: the processed html text
            """
            raise NotImplementedError

        def run(self):
            start = time.time()
            # options sets the PDF format
            options = {
                'page-size': 'Letter',
                'margin-top': '0.75in',
                'margin-right': '0.75in',
                'margin-bottom': '0.75in',
                'margin-left': '0.75in',
                'encoding': "UTF-8",
                'custom-header': [
                    ('Accept-Encoding', 'gzip')
                ],
                'cookie': [
                    ('cookie-name1', 'cookie-value1'),
                    ('cookie-name2', 'cookie-value2'),
                ],
                'outline-depth': 10,
            }

            # Parse the html for every menu entry and save each one as an html file
            htmls = []
            for index, url in enumerate(self.parse_menu(self.crawl(self.start_url))):
                html = self.parse_body(self.crawl(url))
                if html is None:  # parse_body logs the error and returns None on failure
                    continue
                f_name = ".".join([str(index), "html"])
                with open(f_name, 'wb') as f:
                    f.write(html)
                htmls.append(f_name)

            pdfkit.from_file(htmls, self.name + ".pdf", options=options)
            for html in htmls:
                os.remove(html)
            total_time = time.time() - start
            print(u"Total time: %f seconds" % total_time)


    class LiaoXueFengPythonCrawler(Crawler):  # the parentheses denote inheritance
        """Subclass: crawler for Liao Xuefeng's Python 3 tutorial."""

        def parse_menu(self, response):
            """
            Parse the table of contents and yield all article URLs.
            :param response: response object returned by the crawler
            :return: a generator of URLs
            """
            soup = BeautifulSoup(response.content, "html.parser")
            menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
            for li in menu_tag.find_all("li"):
                url = li.a.get("href")
                if not url.startswith("http"):
                    url = "".join([self.domain, url])  # complete to a full URL
                yield url

        def parse_body(self, response):
            """
            Parse the article body.
            :param response: response object returned by the crawler
            :return: the processed html text, encoded as UTF-8 bytes
            """
            try:
                soup = BeautifulSoup(response.content, 'html.parser')
                body = soup.find_all(class_="x-wiki-content")[0]

                # Add the title, centered
                title = soup.find('h4').get_text()
                center_tag = soup.new_tag("center")
                title_tag = soup.new_tag('h1')
                title_tag.string = title
                center_tag.insert(1, title_tag)
                body.insert(1, center_tag)

                html = str(body)
                # Change relative src paths of img tags in the body to absolute paths
                pattern = "(<img .*?src=\")(.*?)(\")"

                def func(m):
                    # group(2) is the src value; group(3) is only the closing quote
                    if not m.group(2).startswith("http"):
                        rtn = "".join([m.group(1), self.domain, m.group(2), m.group(3)])
                        return rtn
                    else:
                        return "".join([m.group(1), m.group(2), m.group(3)])

                html = re.compile(pattern).sub(func, html)
                html = html_template.format(content=html)
                html = html.encode("utf-8")
                return html
            except Exception as e:
                logging.error("Parse error", exc_info=True)


    if __name__ == '__main__':
        start_url = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"
        crawler = LiaoXueFengPythonCrawler("廖雪峰blogs", start_url)
        crawler.run()
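The img src rewrite inside parse_body is easy to get wrong, so here is the same idea isolated as a standalone function that can be tried on a plain string. It assumes, as the crawler does, that relative paths start with "/" and that src values are double-quoted. The function name and the sample HTML are mine, not the original author's.

```python
import re

# Group 1: everything up to and including src=", group 2: the URL, group 3: closing quote
IMG_SRC = re.compile(r'(<img .*?src=")(.*?)(")')


def absolutize_img_src(html, domain):
    """Prefix `domain` to every img src that does not already start with http."""
    def repl(m):
        src = m.group(2)
        if src.startswith("http"):
            return m.group(0)  # already absolute: leave the match unchanged
        return "".join([m.group(1), domain, src, m.group(3)])
    return IMG_SRC.sub(repl, html)


if __name__ == "__main__":
    sample = '<p><img alt="a" src="/files/1.png"><img src="http://example.com/2.png"></p>'
    print(absolutize_img_src(sample, "http://www.liaoxuefeng.com"))
```

A regular expression is enough for markup this regular; for pages with single-quoted or unquoted attributes, editing the src attribute of each img tag through BeautifulSoup would likely be more robust.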