In this article, we walk through how to do web scraping with Python. Let's start with fetching page content.
1. Scraping Web Content
1.1 Sending HTTP requests with the requests library
```python
import requests

url = "http://example.com"
response = requests.get(url)
content = response.text
print(content)
```
1.2 Parsing the HTML with BeautifulSoup
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
# Guard against pages that have no <title> element
title = soup.title.string if soup.title else ""
print("Page title: " + title)
```
1.3 Saving the scraped content to a file
```python
with open("output.html", "w", encoding="utf-8") as file:
    file.write(content)
```
2. Extracting Data
2.1 Extracting specific elements with CSS selectors
```python
elements = soup.select("div.container > h1")
for element in elements:
    print(element.text)
```
2.2 Extracting patterned data with regular expressions
```python
import re

pattern = r"\d+"  # one or more digits
matches = re.findall(pattern, content)
for match in matches:
    print(match)
```
2.3 Extracting specific elements with XPath
```python
from lxml import etree

tree = etree.HTML(content)
elements = tree.xpath("//div[@class='container']/h1")
for element in elements:
    print(element.text)
```
3. Processing Data
3.1 Cleaning and converting data
```python
cleaned_data = [data.strip() for data in matches]
print(cleaned_data)

converted_data = [int(data) for data in cleaned_data]
print(converted_data)
```
3.2 Storing data in a database
```python
import sqlite3

conn = sqlite3.connect("data.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS numbers (value INTEGER)")
for data in converted_data:
    cursor.execute("INSERT INTO numbers VALUES (?)", (data,))
conn.commit()
conn.close()
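To confirm the rows were actually written, you can read them back with a `SELECT`. A minimal, self-contained sketch using an in-memory database (so it does not touch `data.db`) and a small sample of values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB keeps this sketch self-contained
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS numbers (value INTEGER)")
# Sample values standing in for converted_data
cursor.executemany("INSERT INTO numbers VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

cursor.execute("SELECT value FROM numbers ORDER BY value")
rows = [row[0] for row in cursor.fetchall()]
print(rows)  # [1, 2, 3]
conn.close()
```

The same `SELECT` against `data.db` will show whatever your scrape inserted.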
3.3 Visualizing data
```python
import matplotlib.pyplot as plt

plt.plot(converted_data)
plt.xlabel("Index")
plt.ylabel("Value")
plt.title("Data Visualization")
plt.show()
```
4. Dealing with Anti-Scraping Measures
4.1 Setting request headers
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
}
response = requests.get(url, headers=headers)
```
4.2 Using proxy IPs
```python
# Placeholder addresses -- substitute a real proxy server
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "https://127.0.0.1:8080",
}
response = requests.get(url, proxies=proxies)
```
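A single proxy is easy to block, so it is common to rotate through a pool. A minimal sketch; `PROXY_POOL` and `pick_proxies` are hypothetical names, and the addresses are placeholders you would replace with proxies you actually control:

```python
import random

# Hypothetical pool of placeholder proxies -- substitute real servers
PROXY_POOL = [
    "http://127.0.0.1:8080",
    "http://127.0.0.1:8081",
    "http://127.0.0.1:8082",
]

def pick_proxies():
    """Pick one proxy at random and build the mapping requests expects."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Each request can then go out through a different proxy:
# response = requests.get(url, proxies=pick_proxies())
```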
4.3 Handling CAPTCHAs
CAPTCHA handling is site-specific, so the logic you add here depends on the target site.
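One common pattern is to detect when a CAPTCHA page comes back and retry after a pause (or hand off to a solver). A minimal sketch under stated assumptions: `CAPTCHA_MARKERS` is a hypothetical list of strings you would tune to the target site, and the `get` callable is injected so the logic can be exercised without network access:

```python
import time

# Hypothetical markers -- adjust to whatever the target site's CAPTCHA page contains
CAPTCHA_MARKERS = ("captcha", "verify you are human")

def looks_like_captcha(html):
    """Heuristic: does the page text mention a CAPTCHA challenge?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

def fetch_with_backoff(get, url, retries=3, delay=5):
    """Retry with a pause whenever a CAPTCHA page comes back.

    `get` is a callable such as requests.get, passed in so the
    retry logic can be tested without hitting the network.
    """
    for _ in range(retries):
        response = get(url)
        if not looks_like_captcha(response.text):
            return response
        time.sleep(delay)  # wait before retrying; a real solver could plug in here
    raise RuntimeError("CAPTCHA was not cleared after %d attempts" % retries)
```

Keeping the detection heuristic separate from the retry loop makes it easy to swap in a real solver or a manual-intervention step later.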
With the steps above, you can scrape web pages with Python and then extract, process, store, and visualize the scraped data.