In this article, we walk through how to do web scraping with Python. Let's start with fetching page content.
1. Scraping Web Content
1.1 Sending HTTP requests with the requests library
```python
import requests

url = "http://example.com"
response = requests.get(url)
content = response.text
print(content)
```
1.2 Parsing the HTML with BeautifulSoup
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
# Guard against pages that have no <title> element
title = soup.title.string if soup.title else ""
print("Page title: " + title)
```
1.3 Saving the scraped content to a file
```python
with open("output.html", "w", encoding="utf-8") as file:
    file.write(content)
```
2. Extracting Data
2.1 Extracting specific elements with CSS selectors
```python
elements = soup.select("div.container > h1")
for element in elements:
    print(element.text)
```
2.2 Extracting patterned data with regular expressions
```python
import re

pattern = r"\d+"  # one or more digits
matches = re.findall(pattern, content)
for match in matches:
    print(match)
```
2.3 Extracting specific elements with XPath
```python
from lxml import etree

tree = etree.HTML(content)
elements = tree.xpath("//div[@class='container']/h1")
for element in elements:
    print(element.text)
```
3. Processing Data
3.1 Cleaning and converting data
```python
cleaned_data = [data.strip() for data in matches]
print(cleaned_data)

converted_data = [int(data) for data in cleaned_data]
print(converted_data)
```
3.2 Storing data in a database
```python
import sqlite3

conn = sqlite3.connect("data.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS numbers (value INTEGER)")
for data in converted_data:
    cursor.execute("INSERT INTO numbers VALUES (?)", (data,))
conn.commit()
conn.close()
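To confirm the rows were actually written, you can read them back with a `SELECT`. A minimal, self-contained sketch using an in-memory database (so it does not touch `data.db`) and a small sample of values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB keeps this sketch self-contained
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS numbers (value INTEGER)")
# Sample values standing in for converted_data
cursor.executemany("INSERT INTO numbers VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

cursor.execute("SELECT value FROM numbers ORDER BY value")
rows = [row[0] for row in cursor.fetchall()]
print(rows)  # [1, 2, 3]
conn.close()
```

The same `SELECT` against `data.db` will show whatever your scrape inserted.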
3.3 Visualizing data
```python
import matplotlib.pyplot as plt

plt.plot(converted_data)
plt.xlabel("Index")
plt.ylabel("Value")
plt.title("Data Visualization")
plt.show()
```
4. Dealing with Anti-Scraping Measures
4.1 Setting request headers
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
}
response = requests.get(url, headers=headers)
```
4.2 Using proxy IPs
```python
# Placeholder addresses -- substitute a real proxy server
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "https://127.0.0.1:8080",
}
response = requests.get(url, proxies=proxies)
```
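A single proxy is easy to block, so it is common to rotate through a pool. A minimal sketch; `PROXY_POOL` and `pick_proxies` are hypothetical names, and the addresses are placeholders you would replace with proxies you actually control:

```python
import random

# Hypothetical pool of placeholder proxies -- substitute real servers
PROXY_POOL = [
    "http://127.0.0.1:8080",
    "http://127.0.0.1:8081",
    "http://127.0.0.1:8082",
]

def pick_proxies():
    """Pick one proxy at random and build the mapping requests expects."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Each request can then go out through a different proxy:
# response = requests.get(url, proxies=pick_proxies())
```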
4.3 Handling CAPTCHAs
CAPTCHA handling is site-specific, so the logic you add here depends on the target site.
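One common pattern is to detect when a CAPTCHA page comes back and retry after a pause (or hand off to a solver). A minimal sketch under stated assumptions: `CAPTCHA_MARKERS` is a hypothetical list of strings you would tune to the target site, and the `get` callable is injected so the logic can be exercised without network access:

```python
import time

# Hypothetical markers -- adjust to whatever the target site's CAPTCHA page contains
CAPTCHA_MARKERS = ("captcha", "verify you are human")

def looks_like_captcha(html):
    """Heuristic: does the page text mention a CAPTCHA challenge?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

def fetch_with_backoff(get, url, retries=3, delay=5):
    """Retry with a pause whenever a CAPTCHA page comes back.

    `get` is a callable such as requests.get, passed in so the
    retry logic can be tested without hitting the network.
    """
    for _ in range(retries):
        response = get(url)
        if not looks_like_captcha(response.text):
            return response
        time.sleep(delay)  # wait before retrying; a real solver could plug in here
    raise RuntimeError("CAPTCHA was not cleared after %d attempts" % retries)
```

Keeping the detection heuristic separate from the retry loop makes it easy to swap in a real solver or a manual-intervention step later.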
With the steps above, you can scrape web pages with Python and then extract, process, store, and visualize the scraped data.