Python Web Scraping Examples and Usage

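This article walks through several Python web scraping examples, covering page fetching, data storage, dynamic pages, data extraction, login, and anti-scraping countermeasures.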

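I. Basic Page Scraping

1. Use the Requests library to send an HTTP request and retrieve the page content.

2. Use the BeautifulSoup library to parse the HTML and extract the information you need.

Here is a simple example that uses Requests and BeautifulSoup to fetch the Douban Movie Top250 page and print the movie titles it lists: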

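import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# Douban tends to reject the default Requests User-Agent, so send a browser-like one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html = response.text

soup = BeautifulSoup(html, 'html.parser')
# Each movie's title link sits inside a <div class="hd"> block.
movies = soup.find_all('div', class_='hd')

for movie in movies:
    # The first <span> inside the link holds the primary title.
    title = movie.a.span.text
    print(title)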
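
The example above requests a single URL, so it only sees the movies listed on that one page. A minimal sketch of how the remaining pages could be addressed follows. It assumes the list is paginated with a `start` offset query parameter and 25 entries per page; `build_page_urls` is a hypothetical helper name, and both assumptions should be checked against the live site.

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # assumed number of movies per page
TOTAL = 250      # size of the full list


def build_page_urls(base_url=BASE_URL, total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per results page, using a `start` offset query parameter."""
    return [f'{base_url}?start={offset}' for offset in range(0, total, page_size)]


if __name__ == '__main__':
    for page_url in build_page_urls():
        print(page_url)
```

Each generated URL can then be fetched and parsed in the same way as the single page above, ideally with a short pause between requests.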

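II. Data Storage

1. Store the scraped data in a CSV file.

2. Store the scraped data in a MySQL database.

Here is an example that saves the movie titles and ratings from the Douban Movie Top250 page to a CSV file: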

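import requests
from bs4 import BeautifulSoup
import csv

url = 'https://movie.douban.com/top250'
# Douban tends to reject the default Requests User-Agent, so send a browser-like one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
html = response.text

soup = BeautifulSoup(html, 'html.parser')
# Each <div class="item"> block describes one movie.
movies = soup.find_all('div', class_='item')

# newline='' stops the csv module from writing blank rows on Windows.
with open('top250.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Rating'])  # header row

    for movie in movies:
        title = movie.find('span', class_='title').text
        rating = movie.find('span', class_='rating_num').text
        writer.writerow([title, rating])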
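
The list above also mentions storing data in a database. The sketch below illustrates the general pattern (create a table, insert rows with a parameterized query, commit), but it deliberately uses Python's built-in `sqlite3` module instead of MySQL so that it runs without a database server. The table name, column names, and sample rows are made up for illustration. A MySQL driver such as PyMySQL follows the same connect/execute/commit flow, typically with `%s` placeholders instead of `?`.

```python
import sqlite3


def save_movies(rows, db_path=':memory:'):
    """Insert (title, rating) pairs into a `movies` table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS movies (title TEXT NOT NULL, rating REAL)')
    # Parameterized placeholders keep scraped text from being interpreted as SQL.
    conn.executemany('INSERT INTO movies (title, rating) VALUES (?, ?)', rows)
    conn.commit()
    return conn


if __name__ == '__main__':
    # Sample rows invented for illustration, not scraped data.
    sample = [('Movie A', 9.1), ('Movie B', 8.7)]
    conn = save_movies(sample)
    query = 'SELECT title, rating FROM movies ORDER BY rating DESC'
    for title, rating in conn.execute(query):
        print(title, rating)
    conn.close()
```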

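III. Scraping Dynamic Pages

1. Use the Selenium library to drive a real browser and scrape the rendered page.

2. Use a packet-capture tool (or the browser's network panel) to analyze the XHR requests, then call the API endpoints directly to get the data.

Here is an example that uses Selenium to simulate a browser and collect the dynamically loaded questions on the Zhihu home page: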

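from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://www.zhihu.com'
selector = '.ListShortcut-item .ContentItem-title'

driver = webdriver.Chrome()
try:
    driver.get(url)
    # find_elements_by_css_selector was removed in recent Selenium 4 releases;
    # locate elements with By.CSS_SELECTOR and wait up to 10 seconds for them.
    # Zhihu may redirect to a login page first, in which case the wait times out.
    questions = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
    )
    for question in questions:
        print(question.text)
finally:
    driver.quit()  # always close the browser, even on error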
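
The second approach in the list above, calling the API behind the page directly, usually comes down to parsing a JSON response. The sketch below shows only that parsing step on a made-up response body. The `data`/`target`/`title` layout and the helper name `extract_titles` are assumptions for illustration, not the structure of any real endpoint.

```python
import json

# Hypothetical response body, shaped like what a feed endpoint might return.
SAMPLE_BODY = '''
{"data": [
  {"target": {"title": "First question"}},
  {"target": {"title": "Second question"}},
  {"target": {}}
]}
'''


def extract_titles(body):
    """Pull the title out of every item in a JSON feed, skipping items without one."""
    payload = json.loads(body)
    titles = []
    for item in payload.get('data', []):
        title = item.get('target', {}).get('title')
        if title:
            titles.append(title)
    return titles


if __name__ == '__main__':
    print(extract_titles(SAMPLE_BODY))
```

In a real crawler the body would come from a request to the XHR address found in the browser's network panel, sent with the same headers and cookies the page itself uses.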

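IV. Parsing and Extracting Data

1. Use regular expressions to parse and extract data.

2. Use XPath or CSS selectors to parse and extract data.

Here is an example that uses a regular expression to extract email addresses from page text: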

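import re

text = '''
Contact us: info@example.com
Inquiries: contact@example.com
'''

# \w matches letters, digits, and underscores; dots and hyphens are also allowed
# on both sides of the @ sign.
emails = re.findall(r'[\w.-]+@[\w.-]+', text)

for email in emails:
    print(email)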
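
For the XPath option in the list above, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires well-formed markup. Real-world HTML is rarely well formed, so a tolerant parser (for example lxml, or BeautifulSoup's `select()` for CSS selectors) is the more common choice in practice. The sample markup and the helper name `extract_movies` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a downloaded page.
SAMPLE = '''
<div>
  <div class="item"><span class="title">Movie A</span><span class="rating_num">9.1</span></div>
  <div class="item"><span class="title">Movie B</span><span class="rating_num">8.7</span></div>
</div>
'''


def extract_movies(markup):
    """Return (title, rating) string pairs selected with XPath-style expressions."""
    root = ET.fromstring(markup)
    results = []
    # './/' searches all descendants; [@class='item'] filters on an attribute value.
    for item in root.findall(".//div[@class='item']"):
        title = item.find("./span[@class='title']").text
        rating = item.find("./span[@class='rating_num']").text
        results.append((title, rating))
    return results


if __name__ == '__main__':
    for title, rating in extract_movies(SAMPLE):
        print(title, rating)
```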

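V. Login and Authentication

1. Reuse cookies from an existing logged-in session.

2. Simulate submitting the login form.

Here is an example that uses the Requests library to simulate logging in to GitHub and then fetch a page that requires authentication: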

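import re
import requests

login_url = 'https://github.com/login'
session_url = 'https://github.com/session'  # assumed: the address the login form posts to
profile_url = 'https://github.com/settings/profile'  # assumed: a page that requires login

session = requests.Session()

# Log in: load the form first to pick up cookies and the hidden anti-CSRF token.
response = session.get(login_url)
match = re.search(r'name="authenticity_token" value="(.*?)"', response.text)
if match is None:
    raise RuntimeError('authenticity_token not found; the login page may have changed')
payload = {
    'authenticity_token': match.group(1),
    'login': 'your_username',
    'password': 'your_password'
}
# Accounts with two-factor authentication need extra steps not covered here.
session.post(session_url, data=payload)

# Fetch the profile page with the logged-in session.
response = session.get(profile_url)
# The original extraction pattern was lost; the page <title> is used as a placeholder.
title = re.search(r'<title>(.*?)</title>', response.text, re.S)
print(title.group(1).strip() if title else 'title not found')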
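
The cookie-based approach in the list above usually starts from a cookie string copied out of the browser's developer tools after logging in manually. The sketch below converts such a string into a dictionary. `parse_cookie_string` is a hypothetical helper, and the cookie names and values are placeholders, not real credentials.

```python
def parse_cookie_string(raw):
    """Turn 'name=value; name2=value2' (as copied from browser dev tools) into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        if '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # split once: values may contain '='
        cookies[name.strip()] = value.strip()
    return cookies


if __name__ == '__main__':
    # Placeholder cookie names and values for illustration only.
    raw_cookie = 'user_session=abc123; logged_in=yes; token=a=b'
    print(parse_cookie_string(raw_cookie))
```

The resulting dictionary can be passed to Requests through the `cookies=` argument of `requests.get`, so the request is treated as part of the already authenticated browser session for as long as those cookies stay valid.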

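VI. Handling Anti-Scraping Measures

1. Set request headers (User-Agent, Referer, and so on) so the request looks like it comes from a browser.

2. Send requests through proxy IP addresses.

Here is an example that sends a request to Taobao with a browser-like User-Agent header: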

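import requests

url = 'https://www.taobao.com'

# Present the request as coming from a desktop Chrome browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

# The timeout keeps the script from hanging if the server never answers.
response = requests.get(url, headers=headers, timeout=10)
print(response.text)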
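
For the proxy option in the list above, a common pattern is to rotate through a pool of proxies so that consecutive requests leave from different addresses. The sketch below builds that rotation only. The proxy addresses are placeholders from a documentation-only IP range, and `proxy_rotation` is a hypothetical helper name.

```python
from itertools import cycle

# Placeholder proxy addresses (documentation-only IP range); replace with real ones.
PROXY_POOL = ['http://203.0.113.10:8080', 'http://203.0.113.11:8080']


def proxy_rotation(pool):
    """Yield an endless sequence of proxy mappings, one proxy address at a time."""
    for address in cycle(pool):
        # Use the same proxy for both schemes, in the mapping shape Requests accepts.
        yield {'http': address, 'https': address}


if __name__ == '__main__':
    rotation = proxy_rotation(PROXY_POOL)
    for _ in range(3):
        print(next(rotation))
```

Each mapping can be passed as the `proxies=` argument of `requests.get`, alongside the headers shown above.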

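The examples above illustrate the basic principles of Python web scraping and the techniques most commonly used with it. We hope this article helps readers learn and apply Python web scraping.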