Scraping news from news websites with Python: crawling Sina News with Python

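Contents:

1. How can Python crawl the text of the first five pages of Tencent News in a simple way?
2. How to crawl Tencent News content with a Python web crawler
3. How to crawl a news website with python3
4. How to scrape web page content with a Python crawler?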

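How can Python crawl the text of the first five pages of Tencent News in a simple way?

One option is beautifulsoup (Beautiful Soup), a Python library that makes it easy to pull data out of HTML. A crawler first needs the URL of each page, then downloads the page source, and finally extracts the required content with regular expressions or some other method. In practice you work directly against the page source: inspect it to see which parts hold the data you need, then use beautifulsoup to read the contents of the specific HTML tags. There is plenty of related material online.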
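
The approach described above can be sketched as follows. This is only an illustration: the URL template, the sample markup, and the class name "title" are assumptions made for the example and are not taken from the Tencent News site, whose real page structure has to be checked in the browser first.

```python
from bs4 import BeautifulSoup

# Hypothetical URL template; a real site will use its own paging scheme.
PAGE_TEMPLATE = "https://example.com/news?page={}"
page_urls = [PAGE_TEMPLATE.format(n) for n in range(1, 6)]  # pages 1 to 5

# Hypothetical markup standing in for the source of one downloaded list page.
SAMPLE_HTML = """
<div class="list">
  <a class="title" href="/a/1.html">First headline</a>
  <a class="title" href="/a/2.html">Second headline</a>
  <a class="ad" href="/promo">Sponsored</a>
</div>
"""


def extract_headlines(html):
    """Return (text, href) pairs for every <a class="title"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", class_="title")]


headlines = extract_headlines(SAMPLE_HTML)
for text, href in headlines:
    print(text, href)
```

In a real crawler, each URL in page_urls would be downloaded and its source passed to extract_headlines in place of SAMPLE_HTML.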

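How to crawl Tencent News content with a Python web crawler

Web scraping means reading the network resource identified by a URL out of the network stream and saving it locally. It is similar to using a program to simulate what the IE browser does: the URL is sent to the server as the content of an HTTP request, and the resource in the server's response is then read back. In Python 2 this is done with the urllib2 module (its Python 3 counterpart is urllib.request). u...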
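
As a rough sketch of that request and response cycle on Python 3, where urllib2 has been replaced by urllib.request, something like the following could be used. The function names fetch_html and decode_body, the User-Agent string, and the 10-second timeout are choices made for this example only.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode response bytes; fall back to UTF-8 and replace undecodable bytes."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Send a GET request for url and return the response body as text."""
    # Some sites reject the default Python user agent, so send a browser-like one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset()  # None if not declared
        return decode_body(resp.read(), charset)
```

Calling fetch_html with the address of a news page returns its source as a string, which can then be saved to disk or handed to a parser.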

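How to crawl a news website with python3

Requirement:

Crawl news from a portal site and save each article's title, author, time, and content to a local txt file.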

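Python modules used:

import re  # regular expressions
import bs4  # Beautiful Soup 4, the HTML parsing module
import urllib2  # network access (Python 2 name; urllib.request in Python 3)
import News  # the custom news structure, defined later in this article
import codecs  # key to the encoding problem in Python 2: open files with codecs.open
import sys  # used in Python 2 to cope with pages in different encodings

Of these, bs4 has to be installed separately (for example with pip install beautifulsoup4). For the procedure, see: Installing a Python whl package with pip from the Windows command line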

Program (written for Python 3):

# coding=utf-8
import re  # regular expressions
import urllib.request  # network access (urllib2 in Python 2)

import bs4  # Beautiful Soup 4 parsing module

import News  # the custom news structure

# Rule for article links
PATTERN = re.compile(r'http://\w+\.baijia\.baidu\.com/article/\w+')


# Collect all article links from the home page
def GetAllUrl(home):
    html = urllib.request.urlopen(home).read().decode('utf8')
    soup = bs4.BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=PATTERN):
        url_set.add(link['href'])


# Crawl queued pages until MaxNewsCount articles are saved or the queue is empty
def GetNews():
    global NewsCount  # global count of saved articles
    while url_set:
        # Take the next link
        url = url_set.pop()
        url_old.add(url)
        try:
            # Download and parse the page
            html = urllib.request.urlopen(url).read().decode('utf8')
            soup = bs4.BeautifulSoup(html, 'html.parser')
            # Queue article links that have not been crawled yet
            for link in soup.find_all('a', href=PATTERN):
                if link['href'] not in url_old:
                    url_set.add(link['href'])
            # Extract the article fields
            article = News.News()
            article.url = url  # URL
            page = soup.find('div', {'id': 'page'})
            article.title = page.find('h1').get_text()  # title
            info = page.find('div', {'class': 'article-info'})
            article.author = info.find('a', {'class': 'name'}).get_text()  # author
            article.date = info.find('span', {'class': 'time'}).get_text()  # date
            article.about = page.find('blockquote').get_text()  # summary
            pnode = page.find('div', {'class': 'article-detail'}).find_all('p')
            article.content = ''
            for node in pnode:  # collect the article paragraphs
                article.content += node.get_text() + '\n'
            SaveNews(article)
        except Exception as e:
            # Download failed or the page layout differs: report and move on
            print(e)
            continue
        NewsCount += 1
        print(NewsCount, article.title)
        # Stop once enough articles have been collected
        if NewsCount == MaxNewsCount:
            break


def SaveNews(Object):
    file.write("【" + Object.title + "】" + "\t")
    file.write(Object.author + "\t" + Object.date + "\n")
    file.write(Object.content + "\n" + "\n")


url_set = set()  # URLs waiting to be crawled
url_old = set()  # URLs already crawled
NewsCount = 0
MaxNewsCount = 3
home = ''  # start page (the URL is left blank in the source)
GetAllUrl(home)
file = open(r"D:\test.txt", "a+", encoding="utf-8")  # output file
GetNews()
file.close()
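
The queue-and-visited-set logic of the program can be tried without network access. The sketch below is a simplified variant, not the author's code: it applies the article-link rule (with the \w escapes written out) to the raw HTML instead of going through bs4, and it takes the download step as a parameter. The names crawl, fetch, and FAKE_PAGES, as well as the sample pages, are hypothetical.

```python
from collections import deque
import re

# Article-link rule from the program, with the regex escapes written out.
ARTICLE_LINK = re.compile(r"http://\w+\.baijia\.baidu\.com/article/\w+")


def crawl(start_urls, fetch, max_pages=3):
    """Visit at most max_pages pages, following article links found on them.

    fetch is any callable mapping a URL to its HTML text.
    Returns the URLs that were fetched successfully, in visit order.
    """
    queue = deque(start_urls)  # URLs waiting to be crawled
    seen = set(start_urls)     # URLs already queued or crawled
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception as exc:  # skip pages that cannot be downloaded
            print("skipped", url, exc)
            continue
        visited.append(url)
        for link in ARTICLE_LINK.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" so the sketch runs offline.
FAKE_PAGES = {
    "http://a.baijia.baidu.com/article/1":
        '<a href="http://b.baijia.baidu.com/article/2">next</a>',
    "http://b.baijia.baidu.com/article/2":
        '<a href="http://a.baijia.baidu.com/article/1">back</a>'
        '<a href="http://c.baijia.baidu.com/article/3">missing</a>',
}

visited = crawl(["http://a.baijia.baidu.com/article/1"], FAKE_PAGES.__getitem__)
print(visited)
```

Passing the download function in as an argument keeps the traversal logic testable; a real run would pass a function that performs the HTTP request.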

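News article structure (saved as News.py, since the program loads it with import News)

# coding: utf-8
# Article class definition
class News(object):
    def __init__(self):
        self.url = None
        self.title = None
        self.author = None
        self.date = None
        self.about = None
        self.content = None

The program also keeps a count of the articles it has crawled.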
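
On Python 3.7 or later, the same record could also be written as a dataclass, and the save step can return the number of articles written. This is one possible sketch rather than the author's code: the names NewsItem, format_item, and save_items are hypothetical, and the output layout (title in 【】, then author and date separated by tabs, then the content) follows SaveNews.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class NewsItem:  # hypothetical name, chosen so it does not clash with News
    url: str = ""
    title: str = ""
    author: str = ""
    date: str = ""
    about: str = ""
    content: str = ""


def format_item(item):
    """Render one article in the tab-separated layout used by SaveNews."""
    return "【{}】\t{}\t{}\n{}\n\n".format(
        item.title, item.author, item.date, item.content)


def save_items(items, path):
    """Append articles to a UTF-8 text file and return how many were written."""
    count = 0
    with open(path, "a", encoding="utf-8") as fh:
        for item in items:
            fh.write(format_item(item))
            count += 1
    return count


# Demo with made-up data, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "news.txt")
    sample = NewsItem(title="Hello", author="A", date="2020-01-01", content="Body")
    written = save_items([sample], path)
    with open(path, encoding="utf-8") as fh:
        saved_text = fh.read()
print(written)
```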

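How to scrape web page content with a Python crawler?

Crawler workflow

Viewed abstractly, a web crawler comes down to the following steps:

Simulate a page request. Act like a browser and open the target site.

Fetch the data. Once the site is open, the required data can be retrieved automatically.

Save the data. Once retrieved, the data has to be persisted to a local file, a database, or some other storage.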
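
The three steps can be laid out as a small pipeline. The sketch below uses only the standard library and makes several assumptions for the sake of the example: the page title stands in for "the data we need", SQLite is the storage, and the download step is passed in as a function so the example runs offline. All names (TitleParser, extract_title, save_page, run) are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save_page(conn, url, title):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, title))
    conn.commit()


def run(url, fetch, conn):
    html = fetch(url)            # step 1: request the page
    title = extract_title(html)  # step 2: pull out the data
    save_page(conn, url, title)  # step 3: persist it
    return title


conn = sqlite3.connect(":memory:")
fake_fetch = lambda url: "<html><title> Demo page </title></html>"
result = run("https://example.com/", fake_fetch, conn)
print(result)
```

For a real site, fake_fetch would be replaced by a function that performs the HTTP request, and ":memory:" by a file path so the data survives the run.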

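So how do we write our own crawler in Python? The Python library to focus on here is Requests.

Using Requests

Requests is a Python library for making HTTP requests, and it is very simple and convenient to use.

Simulating an HTTP request

Sending a GET request

When we open the Douban home page in a browser, the very first request sent is a GET request: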

import requests

res = requests.get('')  # the target URL is left blank in the source
print(res)
print(type(res))

<Response [200]>
<class 'requests.models.Response'>
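
To see what such a GET request contains before it goes out, Requests can prepare a request without sending it. The sketch below is an illustration under stated assumptions: the URL https://example.com/news, the page parameter, the User-Agent string, and the helper names build_get and fetch are all made up for the example.

```python
import requests

# Hypothetical target; the source leaves the URL blank.
URL = "https://example.com/news"


def build_get(url, params=None):
    """Prepare a GET request without sending it, so it can be inspected."""
    headers = {"User-Agent": "Mozilla/5.0 (tutorial sketch)"}
    return requests.Request("GET", url, params=params, headers=headers).prepare()


def fetch(url, params=None, timeout=10):
    """Send the prepared request and return the decoded response body."""
    with requests.Session() as session:
        res = session.send(build_get(url, params), timeout=timeout)
        res.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
        return res.text


prepared = build_get(URL, params={"page": 2})
print(prepared.method, prepared.url)
```

Calling fetch(URL) performs the actual request; res.status_code and res.text on the Response object then give the status code and the page source.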