What Is Python Used For? Python 3 Anti-Crawler Principles and Bypass in Practice (PDF)

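1. Preparation

Steps to work through before writing a crawler:

1. Where to crawl (where)

2. What to crawl (what)

3. How to crawl (how)

4. After crawling, how to save the information (save)

I call this WWHS. These are the most basic steps.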

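1.1 Where to crawl and what to crawl

In practice, where and what go hand in hand: once you know what you want to crawl, you have found where to crawl it, and once you have settled on where, what follows naturally.

This time we will crawl classic novels such as Romance of the Three Kingdoms and Romance of the Sui and Tang Dynasties from the Shicimingju site (http://www.shicimingju.com/).

The content to crawl is as follows:

Index page: novel title, chapter titles, chapter links

Subpages: the content of each chapter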

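1.2 How to crawl

1. On the index page, use regular expressions to extract the novel title into book_name, the chapter titles into chapter, and the chapter links into bookurl.

2. Use a for loop to visit each bookurl[i] in turn and fetch the HTML of the subpage.

3. Use regular expressions to extract the chapter content from the HTML of each subpage.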
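The extraction in step 1 can be tried offline on a small hand-written HTML fragment. The fragment and the parse_index helper below are made up for illustration and only assume what an index page might look like; the real markup of shicimingju.com may differ, so treat this as a sketch rather than the site's actual structure.

```python
import re

# Hypothetical index-page fragment; the real site's markup may differ.
SAMPLE_HTML = '''
<h1>Sample Novel</h1>
<ul>
  <li><a href="/book/sample/1.html">Chapter One</a></li>
  <li><a href="/book/sample/2.html">Chapter Two</a></li>
</ul>
'''

def parse_index(html):
    """Return (book_name, chapter titles, chapter links) found in html."""
    book_name = re.findall(r'<h1>(.*?)</h1>', html, re.S)
    # One pattern with two groups keeps each title paired with its link.
    pairs = re.findall(r'href="(/book/[^"]*?\d\.html)">(.*?)</a>', html, re.S)
    bookurl = [link for link, _ in pairs]
    chapter = [title for _, title in pairs]
    return book_name, chapter, bookurl

book_name, chapter, bookurl = parse_index(SAMPLE_HTML)
print(book_name, chapter, bookurl)
```

Capturing the link and the title with a single pattern is a design choice of this sketch: it guarantees that chapter[i] and bookurl[i] always describe the same chapter.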


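1.3 After crawling, how to save the information

1. A novel is better suited to file storage than to database storage.

Use book_name as the file name, and inside the for loop write each chapter title chapter[i] and each chapter body chapterText to that file, so that the result is one complete novel file.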
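The saving step can be sketched on its own, without any network access. The save_chapters helper, the sample titles and texts, and the temporary output directory are all assumptions made for this illustration; the article itself writes to a fixed folder.

```python
import os
import tempfile

def save_chapters(path, titles, texts):
    """Append each chapter title followed by its text to the file at path."""
    with open(path, 'a', encoding='utf-8') as f:
        for title, text in zip(titles, texts):
            f.write(title + '\n')
            f.write(text + '\n')

# Demo data (made up); a temporary directory stands in for the real output folder.
book_name = 'Sample Novel'
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, book_name + '.txt')
save_chapters(out_path, ['Chapter One', 'Chapter Two'], ['First text.', 'Second text.'])

# Read the file back to check what was written.
with open(out_path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

Opening the file in append mode ('a') means a second run adds the chapters again, so it may be worth deleting the old file before re-crawling a book.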


2. Writing the code

Read the page source

import urllib.request
import re

indexUrl = 'http://www.shicimingju.com/book/sanguoyanyi.html'
# Download the index page as bytes, then decode it to text
html = urllib.request.urlopen(indexUrl).read()
html = html.decode('utf8')
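The download-and-decode step can also be wrapped in a small function. The sketch below is one possible shape, not the article's code: the fetch_html name, the timeout value, and the User-Agent string are assumptions chosen for illustration.

```python
import urllib.request

def fetch_html(url, encoding='utf-8', timeout=10):
    """Download url and return its body decoded as text."""
    # Some sites may reject urllib's default User-Agent; this value is an arbitrary example.
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
    # errors='ignore' drops undecodable bytes instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors='ignore')
```

It would be called as fetch_html('http://www.shicimingju.com/book/sanguoyanyi.html'). The timeout keeps the crawler from hanging forever on a page that never responds.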

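Extract the book title book_name, the chapter titles chapter, and the chapter links bookurl

(Each site needs its own analysis. When extracting the chapter links, I found that they are relative paths and cannot be used directly. A small string edit on the link to the book is enough to build the address of each chapter page.)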
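As an alternative to editing the string by hand, the standard library's urllib.parse.urljoin resolves a relative link against the page it was found on. This is a different technique from the string replacement the article uses, and the relative links below are illustrative values in the shape the article describes.

```python
from urllib.parse import urljoin

index_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
# Example relative links (illustrative values, not fetched from the site).
relative_links = ['/book/sanguoyanyi/1.html', '/book/sanguoyanyi/2.html']

# urljoin keeps the scheme and host of index_url and swaps in the new path.
chapter_urls = [urljoin(index_url, link) for link in relative_links]
print(chapter_urls)
```

Because urljoin follows the same resolution rules a browser does, it also copes with links that are already absolute or that use "../" segments.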

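# NOTE: the HTML tags inside these patterns were stripped from the source text;
# <h1>, "> and </a> below are assumptions and may need adjusting to the real page.
book_name = re.findall('<h1>(.*?)</h1>', html, re.S)
chapter = re.findall(r'href="/book/.{0,30}\d\.html">(.*?)</a>', html, re.S)
bookurl = re.findall(r'href="(/book/.{0,30}\d\.html)">', html, re.S)
# Strip ".html" from the link to the book to get the prefix shared by all chapter links
chapterUrlBegin = re.sub(r'\.html$', '', indexUrl)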

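Get the content of each chapter, chapterText, and write it to the file.

Look closely at the actual output and replace the stray characters and tags it contains.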

for i in range(0, len(bookurl)):
    # Extract the chapter number
    number = re.findall(r'/(.{1,4})\.html', bookurl[i])
    # Join the strings to form the link to the chapter
    chapterUrl = chapterUrlBegin + '/' + number[0] + '.html'
    # Open the chapter page
    chapterHtml = urllib.request.urlopen(chapterUrl).read()
    chapterHtml = chapterHtml.decode('utf-8', 'ignore')
    # Find the chapter content.
    # NOTE: the markup in this pattern and in the replacements below was stripped
    # from the source text; <div>, <p>, </p> and &nbsp; are assumptions.
    chapterText = re.findall('<div.*?>(.*?)</div>', chapterHtml, re.S)
    # Replace the tags it contains
    chapterText = ''.join(chapterText)
    chapterText = re.sub('<p>', '', chapterText)
    chapterText = re.sub('</p>', '', chapterText)
    chapterText = re.sub('&nbsp;', '', chapterText)
    # Write to the output file
    with open('d://book/' + ''.join(book_name) + '.txt', 'a', encoding='utf-8') as f:
        f.write(chapter[i] + '\n')
        f.write(chapterText + '\n')
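The tag clean-up inside the loop can be tested separately on a made-up fragment. The clean_chapter helper and the sample input are assumptions for illustration; unlike the article's one-tag-at-a-time substitutions, this sketch removes any tag with a single generic pattern.

```python
import re

def clean_chapter(fragment):
    """Turn an HTML fragment into plain text."""
    text = fragment.replace('&nbsp;', ' ')
    # End each paragraph with a newline before the tags are removed.
    text = re.sub(r'</p\s*>', '\n', text)
    # Remove every remaining tag such as <p> or <br/>.
    text = re.sub(r'<[^>]+>', '', text)
    return text.strip()

sample = '<p>First&nbsp;paragraph.</p><p>Second paragraph.</p>'
cleaned = clean_chapter(sample)
print(cleaned)
```

Regular expressions are adequate for simple, predictable markup like this; for messier pages a real HTML parser is usually the safer choice.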

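3. Summary

The whole process in one sentence:

Use regular expressions to extract the information you need from the web page, then save it to a database or a file.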