This article takes a detailed look at web page extraction in Python 3 from several angles: parsing libraries, regular expressions, web APIs, and crawlers.
I. HTML Parsing Libraries
1. Beautiful Soup
<code>
from bs4 import BeautifulSoup
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.text

# Parse the page with Beautiful Soup
soup = BeautifulSoup(html_content, "html.parser")

# Extract the contents of a specific tag
title = soup.title.text
print(title)

# Extract all links
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
</code>
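Beautiful Soup also accepts CSS selectors through select() and select_one(), which is often more concise than chained find_all() calls. A minimal self-contained sketch (the markup and the nav-link class are made up for illustration):
<code>
from bs4 import BeautifulSoup

html = '<h1>Demo</h1><a class="nav-link" href="/a">A</a><a href="/b">B</a>'
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; here, only anchors with class "nav-link"
for link in soup.select("a.nav-link"):
    print(link.get("href"))

# select_one() returns the first match, or None if nothing matches
print(soup.select_one("h1").text)
</code>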
2. lxml
<code>
from lxml import etree
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.content

# Parse the page with lxml
tree = etree.HTML(html_content)

# Extract the contents of a specific tag
title = tree.xpath('//title')[0].text
print(title)

# Extract all links
links = tree.xpath('//a/@href')
for link in links:
    print(link)
</code>
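XPath expressions can also filter on attributes and pull text nodes directly, which avoids a separate .text step. A small self-contained sketch (the markup is invented for illustration):
<code>
from lxml import etree

html = '<div><a href="/a" class="nav">A</a><a href="/b">B</a></div>'
tree = etree.HTML(html)

# Attribute predicate: only links carrying class="nav"
print(tree.xpath('//a[@class="nav"]/@href'))  # ['/a']

# text() selects the text nodes themselves
print(tree.xpath('//a/text()'))               # ['A', 'B']
</code>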
II. Regular Expressions
1. Extracting page content with the re module
<code>
import re
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.text

# Extract the title with a regular expression
pattern = r'<title>(.*?)</title>'
result = re.findall(pattern, html_content)
if result:
    print(result[0])

# Extract all links
pattern = r'<a href="(.*?)"'
links = re.findall(pattern, html_content)
for link in links:
    print(link)
</code>
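Regular expressions treat HTML as flat text, so a tag that spans several lines slips past the pattern above: by default, . does not match newlines. Passing re.S (re.DOTALL) is the usual workaround; a short sketch:
<code>
import re

html = "<title>\n  Example\n</title>"

# Without re.S the match fails because '.' stops at line breaks
print(re.findall(r'<title>(.*?)</title>', html))        # []

# re.S lets '.' cross line breaks
print(re.findall(r'<title>(.*?)</title>', html, re.S))  # ['\n  Example\n']
</code>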
2. Extracting links with the re module
<code>
import re
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.text

# Extract links with a regular expression
pattern = r'<a href="(.*?)"'
links = re.findall(pattern, html_content)
for link in links:
    print(link)
</code>
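A pattern like this captures href values verbatim, so relative paths such as /about come back without a host. urllib.parse.urljoin from the standard library resolves them against the page URL; a minimal sketch:
<code>
from urllib.parse import urljoin

base = "https://www.example.com/blog/"
for href in ["/about", "post.html", "https://other.example.com/"]:
    # urljoin resolves relative paths and leaves absolute URLs alone
    print(urljoin(base, href))
</code>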
III. Extracting Data from Web APIs
1. Fetching API data with requests
<code>
import requests

# Request data from the API
url = "https://api.example.com/data"
response = requests.get(url)
data = response.json()

# Extract specific fields
title = data["title"]
print(title)

urls = data["urls"]
for url in urls:
    print(url)
</code>
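In practice an API call should guard against slow or failing endpoints. A hedged sketch using requests' timeout parameter and raise_for_status() (the URL is the same placeholder as above):
<code>
import requests

url = "https://api.example.com/data"
try:
    # timeout avoids hanging forever; raise_for_status surfaces 4xx/5xx errors
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
except requests.RequestException as exc:
    print(f"request failed: {exc}")
    data = None
</code>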
2. Processing API data with the json module
<code>
import json
import requests

# Request data from the API
url = "https://api.example.com/data"
response = requests.get(url)
data = json.loads(response.text)

# Extract specific fields
title = data["title"]
print(title)

urls = data["urls"]
for url in urls:
    print(url)
</code>
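The json module also goes the other way: json.dumps() serializes the parsed data, which is handy for inspecting a response or caching it to disk. A minimal sketch with sample data:
<code>
import json

data = {"title": "Example", "urls": ["https://www.example.com"]}

# indent and ensure_ascii control the human-readable form
print(json.dumps(data, indent=2, ensure_ascii=False))

# Round-trip to a file
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
</code>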
IV. Web Crawlers
1. Crawling with the Scrapy framework
<code>
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "https://www.example.com",
    ]

    def parse(self, response):
        # Extract the title
        title = response.css("title::text").get()
        print(title)

        # Extract all links and follow them recursively
        links = response.css("a::attr(href)").getall()
        for link in links:
            print(link)
            # response.follow resolves relative URLs, unlike a raw scrapy.Request
            yield response.follow(link, callback=self.parse)
</code>
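A spider like this is normally launched with the scrapy CLI (scrapy runspider myspider.py), but it can also run from a plain Python script via CrawlerProcess; a minimal sketch reusing the class above:
<code>
from scrapy.crawler import CrawlerProcess

# Runs MySpider as defined in the previous block
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes
</code>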
2. Crawling with requests and regular expressions
<code>
import re
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.text

# Extract the title with a regular expression
pattern = r'<title>(.*?)</title>'
result = re.findall(pattern, html_content)
if result:
    print(result[0])

# Extract all links
pattern = r'<a href="(.*?)"'
links = re.findall(pattern, html_content)
for link in links:
    print(link)
</code>
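The snippet above only fetches a single page; a crawler also needs to queue the links it finds and avoid revisiting them. A minimal breadth-first sketch under the same title/link patterns, adding a visited set, URL resolution, and a polite delay (the page cap is arbitrary, for the demo only):
<code>
import re
import time
from collections import deque
from urllib.parse import urljoin

import requests

start_url = "https://www.example.com"
queue, visited = deque([start_url]), set()

while queue and len(visited) < 10:   # cap the crawl for the demo
    url = queue.popleft()
    if url in visited:
        continue
    visited.add(url)

    html = requests.get(url, timeout=10).text
    title = re.findall(r'<title>(.*?)</title>', html, re.S)
    print(url, "->", title[0].strip() if title else "(no title)")

    # Resolve relative links and enqueue unseen ones
    for href in re.findall(r'<a href="(.*?)"', html):
        link = urljoin(url, href)
        if link.startswith("http") and link not in visited:
            queue.append(link)

    time.sleep(1)  # polite delay between requests
</code>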