Saving Web Pages Locally with a Python Crawler

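A Python crawler fetches data from the web programmatically, and saving pages to local disk is one of its most common uses. This article walks through several methods and techniques for doing so.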

1. Fetching page source with the urllib library

Python's urllib is part of the standard library; it can send HTTP requests and return a page's source. Example:

import urllib.request

url = "http://www.example.com"
response = urllib.request.urlopen(url)
html_content = response.read()

with open("example.html", "wb") as f:
    f.write(html_content)

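The code opens the URL with urlopen() and reads the response body with read(). Because read() returns raw bytes, the file is opened in binary mode ("wb") and the bytes are written unchanged to example.html.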
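The same steps can be wrapped in a small reusable function. The following is a minimal sketch, not a definitive implementation: the name save_page, its parameters, and the 10-second default timeout are assumptions made for illustration. It uses a with block so the connection is closed after reading.

```python
import urllib.request


def save_page(url, path, timeout=10):
    """Download url and write the raw bytes to path (hypothetical helper).

    Returns the number of bytes written.
    """
    # The with block closes the response once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()  # raw bytes, not yet decoded
    # Binary mode keeps the bytes exactly as the server sent them.
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

Under these assumptions, calling save_page("http://www.example.com", "example.html") would produce the same file as the example above.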

2. Fetching page source with the requests library

Compared with urllib, the third-party requests library is more convenient to use. Example:

import requests

url = "http://www.example.com"
response = requests.get(url)
html_content = response.text

with open("example.html", "w", encoding="utf-8") as f:
    f.write(html_content)

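The code sends a GET request with get() and reads the page source from response.text, then saves it to example.html. Note that response.text is an already decoded string, so the file is opened in text mode and the encoding must be given explicitly (UTF-8 here); otherwise open() falls back to a platform-dependent default.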
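When the encoding a server declares is missing or wrong, the decoded text can come out garbled. One way to handle this is to work from the raw bytes (response.content in requests, or read() in urllib) and try a few encodings in turn. The sketch below is only an illustration: the function name decode_html and the fallback order (declared encoding, then UTF-8, then GB18030) are assumptions, not something the libraries prescribe.

```python
from typing import Optional


def decode_html(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode page bytes, trying several encodings in order (hypothetical helper)."""
    # Assumed fallback order: declared encoding first, then UTF-8, then GB18030.
    for encoding in (declared, "utf-8", "gb18030"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding or unknown codec name: try the next candidate.
            continue
    # Last resort: keep what can be decoded and mark the rest as replacement characters.
    return raw.decode("utf-8", errors="replace")
```

The returned string can then be written with open(..., "w", encoding="utf-8") as in the example above.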

3. Handling relative paths in the page

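Some pages reference resources with relative paths, which no longer resolve once the page is saved locally. The urljoin() function can convert them to absolute URLs. Example: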

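import re
import requests
from urllib.parse import urljoin

url = "http://www.example.com/part2/page.html"
response = requests.get(url)
html_content = response.text

def make_absolute(match):
    # urljoin() resolves a relative path against the page URL
    # and returns an already absolute URL unchanged
    return match.group(1) + urljoin(url, match.group(2)) + '"'

# Rewrite the value of every double-quoted src="..." and href="..." attribute
html_content = re.sub(r'((?:src|href)=")([^"]*)"', make_absolute, html_content)

with open("example.html", "w", encoding="utf-8") as f:
    f.write(html_content)

The code uses urljoin() to resolve each src and href value against the page URL, then substitutes the result back into the HTML. Values that are already absolute pass through urljoin() unchanged, so only relative references are rewritten. The regular expression only matches double-quoted attribute values.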
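To make the behavior of urljoin() concrete, the short sketch below resolves several kinds of references against the page URL used in this section. The reference strings and the hosts cdn.example.com and other.example.org are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "http://www.example.com/part2/page.html"

# Illustrative references of the kinds commonly found in src/href attributes.
references = [
    "img/logo.png",                       # relative to the page's directory
    "../style.css",                       # one directory up
    "/js/app.js",                         # relative to the site root
    "//cdn.example.com/lib.js",           # scheme-relative
    "https://other.example.org/pic.png",  # already absolute
]

resolved = {ref: urljoin(page_url, ref) for ref in references}
for ref, absolute in resolved.items():
    print(ref, "->", absolute)
```

The first three resolve to http://www.example.com/part2/img/logo.png, http://www.example.com/style.css, and http://www.example.com/js/app.js. The scheme-relative reference inherits http from the page URL, and the absolute URL is returned unchanged.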

4. Parsing the page with Beautiful Soup

If the page content needs further processing, the Beautiful Soup library can parse it. Example:

import requests
from bs4 import BeautifulSoup

url = "http://www.example.com"
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, 'html.parser')
# Parse and process the page content here

with open("example.html", "w", encoding="utf-8") as f:
    f.write(str(soup))

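The code passes the page source to the BeautifulSoup class to build a Beautiful Soup object. The content can then be parsed and modified, and str(soup) serializes the result for saving to example.html.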
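As one example of such processing, the parsed tree can be used to turn relative src and href values into absolute URLs before saving. This is a minimal sketch under stated assumptions: the function name absolutize_links is hypothetical, and only the src and href attributes are handled.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolutize_links(html, page_url):
    """Return html with src/href values resolved against page_url (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for attr in ("src", "href"):
        # attrs={attr: True} selects every tag that carries the attribute.
        for tag in soup.find_all(attrs={attr: True}):
            tag[attr] = urljoin(page_url, tag[attr])
    return str(soup)
```

Working on the parsed tree means the attribute values are found regardless of quoting style or whitespace in the original markup.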

5. Setting request headers

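Some sites restrict crawlers and decide whether to allow access based on the request headers. Setting headers such as User-Agent makes the request resemble one from a browser. Example: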

import requests

url = "http://www.example.com"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get(url, headers=headers)
html_content = response.text

with open("example.html", "w", encoding="utf-8") as f:
    f.write(html_content)

Here the request carries a User-Agent header, so it looks as if it came from an ordinary browser.
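The same idea carries over to the urllib approach from section 1, where headers are attached to a urllib.request.Request object. In the sketch below, the helper name build_request and the shortened User-Agent string are illustrative assumptions.

```python
import urllib.request

# Illustrative, shortened browser-like User-Agent value.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def build_request(url, headers=BROWSER_HEADERS):
    """Build a Request carrying custom headers, without sending it (hypothetical helper)."""
    return urllib.request.Request(url, headers=headers)


request = build_request("http://www.example.com")
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(request.get_header("User-agent"))
```

The resulting object can be passed to urllib.request.urlopen() in place of the URL string.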

6. Handling exceptions

Crawling can fail in various ways, such as a connection timeout or a page that does not exist. A try-except statement catches these errors so they can be handled. Example:

import requests

url = "http://www.example.com"
try:
    response = requests.get(url, timeout=10)  # without a timeout, requests may wait indefinitely
    response.raise_for_status()
    html_content = response.text

    with open("example.html", "w", encoding="utf-8") as f:
        f.write(html_content)
except requests.exceptions.RequestException as e:
    print("Request failed:", e)

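The code calls raise_for_status(), which raises an exception if the response has a 4xx or 5xx status code. The except block then handles the error; requests.exceptions.RequestException is the base class of the exceptions requests raises, so it also covers connection errors and timeouts.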
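When different failures need different handling, the more specific exception classes can be caught before the base class. The sketch below is one possible arrangement, not a definitive one: the function name save_page, its returned status strings, the 10-second timeout, and the get parameter (which lets a stand-in be substituted for requests.get) are all assumptions for illustration.

```python
import requests


def save_page(url, path, get=requests.get):
    """Fetch url and save it to path; return a short status string (hypothetical helper)."""
    try:
        response = get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        # Covers both connect and read timeouts.
        return "timed out"
    except requests.exceptions.HTTPError as e:
        # Raised by raise_for_status() for 4xx/5xx responses, e.g. a missing page.
        status = e.response.status_code if e.response is not None else "unknown"
        return f"HTTP error {status}"
    except requests.exceptions.RequestException as e:
        # Any other requests failure, such as a refused connection.
        return f"request failed: {e}"
    # Write the raw bytes so no encoding guess is involved.
    with open(path, "wb") as f:
        f.write(response.content)
    return "saved"
```

Timeout is listed first because a connect timeout is also a connection error; the base class comes last so it only catches what the earlier clauses did not.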

With these methods, a Python crawler can save web pages locally in a flexible way and leave them ready for further processing and analysis.