
Extracting Web Content with Python 3


This article takes a detailed look at web page extraction in Python 3 from several angles.

I. Web Page Parsing Libraries

1. Beautiful Soup

<code>
from bs4 import BeautifulSoup
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.text

# Parse the page with Beautiful Soup
soup = BeautifulSoup(html_content, "html.parser")

# Extract the contents of a specific tag
title = soup.title.text
print(title)

# Extract all links
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
</code>
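
Beautiful Soup also understands CSS selectors through `select()`, which is often terser than chained `find_all` calls. A minimal sketch reusing the `soup` object from above; the selector strings are illustrative assumptions, not part of the original example:

<code>
# CSS selectors as an alternative to find_all (a sketch;
# the selectors below are assumptions for illustration)
for heading in soup.select("h1, h2"):
    print(heading.get_text(strip=True))

# Only anchors that actually carry an href attribute
for a in soup.select("a[href]"):
    print(a["href"])
</code>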

2. The lxml Library

<code>
from lxml import etree
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.content

# Parse the page with lxml
tree = etree.HTML(html_content)

# Extract the contents of a specific tag
title = tree.xpath('//title')[0].text
print(title)

# Extract all links
links = tree.xpath('//a/@href')
for link in links:
    print(link)

</code>
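
XPath returns `href` values exactly as they appear in the markup, so relative paths stay relative. A small sketch that normalizes them to absolute URLs, assuming `url` and `tree` from the example above are still in scope (`urljoin` is part of the standard library):

<code>
from urllib.parse import urljoin

# Resolve relative hrefs against the page URL
for href in tree.xpath('//a/@href'):
    print(urljoin(url, href))
</code>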

II. Regular Expressions

1. Extracting Page Content with the re Module

<code>
import re
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.text

# Extract the <title> text with a regular expression
pattern = r'<title>(.*?)</title>'
result = re.findall(pattern, html_content)
if result:
    print(result[0])

# Extract all link targets
pattern = r'<a\s+href="(.*?)"'
links = re.findall(pattern, html_content)
for link in links:
    print(link)
</code>
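
By default `.` does not match newlines and matching is case-sensitive, which makes plain patterns brittle against real-world HTML. A hedged sketch of a slightly more tolerant title pattern (still no substitute for a real parser):

<code>
import re

# Ignore case and let . span newlines (a sketch)
title_re = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)
match = title_re.search(html_content)  # html_content from the example above
if match:
    print(match.group(1).strip())
</code>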

2. Extracting Links with the re Module

<code>
import re
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.text

# Extract link targets with a regular expression
pattern = r'<a\s+href="(.*?)"'
links = re.findall(pattern, html_content)
for link in links:
    print(link)
</code>
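
Attribute values may be wrapped in single or double quotes. A small sketch that accepts both, assuming `html_content` from above; the backreference `\1` matches whichever quote character opened the attribute:

<code>
import re

# Accept href='...' as well as href="..." (a sketch)
link_re = re.compile(r'<a\s+[^>]*href=(["\'])(.*?)\1', re.IGNORECASE)
for _quote, href in link_re.findall(html_content):
    print(href)
</code>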

III. Extracting Data from Web APIs

1. Fetching API Data with the requests Library

<code>
import requests

# Request data from the API
url = "https://api.example.com/data"
response = requests.get(url)
data = response.json()

# Extract specific fields
title = data["title"]
print(title)

urls = data["urls"]
for url in urls:
    print(url)
</code>
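
Network calls can hang or return an error page, so it is worth adding a timeout and checking the status before reading the body. A minimal sketch; the 10-second timeout and the fallback value are assumptions:

<code>
import requests

url = "https://api.example.com/data"
response = requests.get(url, timeout=10)  # fail fast instead of hanging
response.raise_for_status()               # raise on 4xx/5xx responses
data = response.json()

# .get() avoids a KeyError when a field is missing
print(data.get("title", "<no title>"))
</code>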

2. Processing API Data with the json Library

<code>
import json
import requests

# Request data from the API
url = "https://api.example.com/data"
response = requests.get(url)
data = json.loads(response.text)

# Extract specific fields
title = data["title"]
print(title)

urls = data["urls"]
for url in urls:
    print(url)
</code>
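
`json.loads` raises `json.JSONDecodeError` when the body is not valid JSON, for example when the server returns an HTML error page. A small defensive sketch, reusing `response` from above:

<code>
import json

try:
    data = json.loads(response.text)
except json.JSONDecodeError:
    data = {}          # fall back to an empty payload
print(data.get("title"))
</code>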

IV. Web Crawlers

1. Crawling with the Scrapy Framework

<code>
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "https://www.example.com",
    ]

    def parse(self, response):
        # Extract the page title
        title = response.css("title::text").get()
        print(title)

        # Extract every link and follow it recursively;
        # response.follow resolves relative URLs against the current page
        links = response.css("a::attr(href)").getall()
        for link in links:
            print(link)
            yield response.follow(link, callback=self.parse)
</code>
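
To actually run the spider you can use the `scrapy runspider` command, or start it programmatically. A minimal sketch of the programmatic route, assuming the class above lives in the same file:

<code>
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
    process.crawl(MySpider)  # the spider class defined above
    process.start()          # blocks until the crawl finishes
</code>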

2. Crawling with requests and Regular Expressions

<code>
import re
import requests

# Fetch the page
url = "https://www.example.com"
response = requests.get(url)
html_content = response.text

# Extract the <title> text with a regular expression
pattern = r'<title>(.*?)</title>'
result = re.findall(pattern, html_content)
if result:
    print(result[0])

# Extract all link targets
pattern = r'<a\s+href="(.*?)"'
links = re.findall(pattern, html_content)
for link in links:
    print(link)
</code>
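
The snippet above fetches only a single page. A real crawler needs a queue of pages to visit and a set of pages already seen, so it never fetches the same URL twice. A hedged sketch built on the same requests-plus-regex approach; the page limit and same-domain check are assumptions added to keep the crawl bounded:

<code>
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests

LINK_RE = re.compile(r'<a\s+[^>]*href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl that stays on the starting domain."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        page = queue.popleft()
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        fetched += 1
        print(page)
        for href in LINK_RE.findall(html):
            link = urljoin(page, href)
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

crawl("https://www.example.com")
</code>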
