Scraping Historical Weather Data with Python

This article walks through how to scrape historical weather data with Python, step by step.

I. Preparation

Before scraping, a little preparation is needed:

1. Install Python and the required third-party libraries: requests, beautifulsoup4, and pandas.

pip install requests
pip install beautifulsoup4
pip install pandas

2. Get familiar with the target website.

This article uses China Weather Network (中国天气网, http://www.weather.com.cn/).
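
Before moving on, it can help to confirm that the libraries from step 1 are importable. The following is a minimal sketch that uses only the standard library. The helper name missing_modules is hypothetical, and note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == '__main__':
    # beautifulsoup4 installs the import package named bs4
    missing = missing_modules(['requests', 'bs4', 'pandas'])
    if missing:
        print('Missing modules:', ', '.join(missing))
    else:
        print('All required modules are available.')
```

If any name is reported as missing, rerunning the matching pip install command above should resolve it.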

II. Getting the Historical Weather Link

First, obtain the link to the historical weather page.

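from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'http://www.weather.com.cn/textFC/hb.shtml'

# Send the request and parse the page content
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

# Take the first link inside the 'conMidtab' div as the historical weather link
link = soup.find('div', {'class': 'conMidtab'}).find_all('a')[0].get('href')
# urljoin handles both relative and already-absolute hrefs
history_url = urljoin(url, link)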

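Code explanation:

First, the requests and BeautifulSoup libraries fetch and parse a page from China Weather Network.

Then the historical weather link is located in the page content and turned into a full URL.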
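
How the full URL should be built depends on whether the href is relative or already absolute. The following is a minimal sketch using urllib.parse.urljoin from the standard library, which handles both cases. The helper name to_absolute and the sample href values are hypothetical and only serve as illustration.

```python
from urllib.parse import urljoin

BASE_PAGE = 'http://www.weather.com.cn/textFC/hb.shtml'


def to_absolute(href, base=BASE_PAGE):
    """Resolve an href found on the base page into a full URL."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only
print(to_absolute('/textFC/beijing.shtml'))
print(to_absolute('http://www.weather.com.cn/weather/101010100.shtml'))
```

A root-relative href is resolved against the site root, while an absolute href is returned unchanged, so the result is valid in either case.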

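III. Getting the Historical Weather Data

With the link to the historical weather page in hand, the next step is to fetch the weather data itself.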

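import json
import re

import pandas as pd

# Fetch the historical weather data for a single city
def get_history_weather(city_code):
    url = 'http://www.weather.com.cn/weather/' + city_code + '.shtml'

    # Send the request and parse the page content
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Get the city name from the breadcrumb bar
    city_name = soup.find('div', {'class': 'crumbs fl'}).text.split()[0]

    # Find the inline script that defines hour3data and pull out its value
    pattern = re.compile(r'var hour3data=(.*?);', re.S)
    script = soup.find('script', string=pattern).text.strip()
    data = pattern.search(script).group(1)
    # Parse with json.loads rather than eval (assumes the value is valid JSON)
    data = json.loads(data)

    # Convert the data to a DataFrame
    # (assumes each record has exactly these five fields)
    df = pd.DataFrame(data)
    df.columns = ['date', 'temp', 'weather', 'wind_direction', 'wind_power']
    df['city'] = city_name

    return df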

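# Fetch the historical weather data for every city
def get_all_history_weather():
    # Send the request and parse the page content
    response = requests.get(history_url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Get the div for each province
    province_divs = soup.find('div', {'class': 'lqcontentBoxheader'}).find_all_next('div', {'class': 'lqcontentBox'})
    for province_div in province_divs:
        # Extract each city code: the digits right before '.shtml' in the href
        city_links = province_div.find_all('a')
        matches = (re.search(r'(\d+)\.shtml$', link.get('href') or '') for link in city_links)
        city_codes = [m.group(1) for m in matches if m]

        # Fetch the historical weather data for each city
        for city_code in city_codes:
            df = get_history_weather(city_code)
            yield df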

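Code explanation:

First, the get_history_weather function is defined to fetch the historical weather data for a single city.

It takes a city code, builds the URL from it, and requests that city's page.

It then uses a regular expression to extract the weather data from the page content and converts it to a DataFrame.

Finally, it returns the DataFrame holding that city's historical weather data.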
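
The extraction step can be tried in isolation, without any network access. The following is a minimal sketch that applies the same regular expression to a made-up script string and parses the result with json.loads. The helper name extract_hour3data and the sample content are hypothetical, the real page's structure may differ, and the sketch assumes the assigned value is valid JSON.

```python
import json
import re

# Same pattern as in get_history_weather
PATTERN = re.compile(r'var hour3data=(.*?);', re.S)


def extract_hour3data(script_text):
    """Return the parsed value assigned to hour3data, or None if absent."""
    match = PATTERN.search(script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up script content; the real page's structure may differ
sample = (
    'var other=1;\n'
    'var hour3data={"1d": ["08h,sunny,20C"], "7d": []};\n'
    'var more=2;'
)
print(extract_hour3data(sample))
```

Because the pattern is non-greedy, it stops at the first semicolon after the assignment, so a value that itself contains a semicolon would be cut short and fail to parse.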

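Next, the get_all_history_weather function is defined to fetch the historical weather data for all cities.

It requests the historical weather page and collects every city code found on it.

It then loops over the city codes and calls get_history_weather for each one.

Finally, the yield keyword hands back each city's DataFrame one at a time, which makes the function a generator.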
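
To make the generator behavior concrete, here is a minimal sketch that lazily yields city codes from a list of hrefs. The function name iter_city_codes and the sample hrefs are hypothetical, and the sketch assumes a city code is the run of digits immediately before '.shtml'.

```python
import re


def iter_city_codes(hrefs):
    """Lazily yield the digits that precede '.shtml' in each href."""
    for href in hrefs:
        match = re.search(r'(\d+)\.shtml$', href)
        if match:
            yield match.group(1)


# Hypothetical hrefs, for illustration only
hrefs = [
    'http://www.weather.com.cn/weather/101010100.shtml',
    'http://www.weather.com.cn/textFC/hb.shtml',
    'http://www.weather.com.cn/weather/101020100.shtml',
]

codes = iter_city_codes(hrefs)
print(next(codes))  # the first code is produced only when requested
print(list(codes))  # the remaining codes
```

Nothing is computed until the caller asks for the next item, which is why a caller can process one city's data at a time instead of holding everything in memory.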

IV. Storing the Historical Weather Data

Once the historical weather data is available, the next step is to save it to a local file.

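def save_to_csv():
    for df in get_all_history_weather():
        try:
            df.to_csv('history_weather.csv', mode='a', header=None, index=False, encoding='utf-8-sig')
        except OSError as exc:
            # Report write failures instead of silently ignoring them
            print(f'Failed to write CSV: {exc}')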

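Code explanation:

The save_to_csv function is defined to store the historical weather data in a local file.

It calls get_all_history_weather to obtain the historical weather data for all cities.

It then uses the pandas to_csv method to write the data to a local file.

Note: mode='a' appends the data to the end of the file, header=None leaves out the column names, index=False leaves out the row index, and encoding='utf-8-sig' writes UTF-8 with a byte order mark (BOM), which helps spreadsheet software such as Excel detect the encoding.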
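
Since every write appends without column names, the resulting file never gets a header row. One alternative is to write the header only when the file is new or empty. The following is a minimal sketch of that idea using the standard library csv module rather than pandas. The helper name append_rows is hypothetical, and the column names mirror those used in get_history_weather.

```python
import csv
import os

COLUMNS = ['date', 'temp', 'weather', 'wind_direction', 'wind_power', 'city']


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # Use the BOM-writing codec only for the first write so the BOM appears once
    encoding = 'utf-8-sig' if write_header else 'utf-8'
    # newline='' lets the csv module control line endings
    with open(path, 'a', newline='', encoding=encoding) as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(COLUMNS)
        writer.writerows(rows)
```

Calling append_rows twice on the same path produces one header row followed by all data rows. With pandas, a similar effect can be had by passing a true header value to to_csv on the first write only.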

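V. Summary

This article showed how to scrape historical weather data with Python and store it in a local file.

Specifically, it first obtains the link to the historical weather page, then collects each city's code and fetches the historical weather data city by city. Finally, it stores the data for all cities in a local file.

Hopefully this article helps with your study of web scraping in Python.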