What Can a Python Crawler Do, and Why Is Python Called a "Crawler"?

Welcome to the Python web crawler lecture hall. Let's start the crawling tour.

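Let's start your first crawler. I will use my own blog page as the example to walk through the basics of web crawling. First, install the requests library: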

Open a cmd window and run pip install requests. Then use the requests library to fetch the page:

    import requests

    link = 'https://blog.csdn.net/weixin_42183408'
    # Send a browser-like User-Agent with the request
    headers = {'User-Agent': 'Mozilla/5.0 (Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    r = requests.get(link, headers=headers)
    print(r.text)

There are a few things to note here.

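The User-Agent header makes the request look as if it came from a browser.
r.text is the source code of the page.
The headers argument is covered in more detail later.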
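To make the effect of the User-Agent header concrete, here is a minimal sketch that builds the request without sending it and compares the header requests would send by default with the browser-like one. The variable names (browser_headers, default_ua, sent_ua) are only illustrative, and the shortened User-Agent string is an assumption standing in for whatever your own browser reports.

```python
import requests

link = 'https://blog.csdn.net/weixin_42183408'

# Browser-like User-Agent (illustrative value; copy the real one from your browser).
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

# What requests announces itself as when no headers are passed.
default_ua = requests.utils.default_headers()['User-Agent']

# Build the request without sending it, so the outgoing headers can be inspected.
prepared = requests.Request('GET', link, headers=browser_headers).prepare()
sent_ua = prepared.headers['User-Agent']

print('default :', default_ua)   # begins with "python-requests/"
print('override:', sent_ua)
```

No network access is needed for this check, because prepare() only assembles the request. Some sites treat the default python-requests identifier differently from a browser, which is why the example in this post overrides it.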

Running the code prints the entire HTML source of the page.
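Printing the whole source can flood the terminal, and r.text is only useful if the request actually succeeded. The following is a rough sketch, not part of the original example: fetch and preview are hypothetical helper names, and the timeout of 10 seconds is an arbitrary choice.

```python
import requests


def fetch(url, headers=None, timeout=10):
    """Return the page source as text, or None if the request fails."""
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        print('request failed:', exc)
        return None
    return r.text


def preview(text, limit=200):
    """Shorten long page source so it does not flood the terminal."""
    if len(text) <= limit:
        return text
    return text[:limit] + '...'
```

With these helpers, you could pass the link and headers from the example to fetch, check that the result is not None, and print preview of it to see just the first part of the page.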

To extract data from the page, we next need to install the bs4 library.

Open a cmd window and run pip install bs4. The code:

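    import requests
    from bs4 import BeautifulSoup

    link = 'https://blog.csdn.net/weixin_42183408'
    headers = {'User-Agent': 'Mozilla/5.0 (Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    r = requests.get(link, headers=headers)

    # Parse the page source; 'html.parser' is Python's built-in parser (assumed here)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Text of the first <span class="date"> element
    date = soup.find('span', class_='date').text.strip()
    print(date)

This prints the date of a post, for example a value starting with 2019-02-14.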

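Here we use the BeautifulSoup library to parse the page. First we import the library, then we parse the page source into a BeautifulSoup object, and then we call soup.find('span', class_='date') to locate the element that holds the date.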
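The same three steps can be tried offline on a small piece of markup. This is only a sketch: sample_html is made-up markup standing in for r.text, and the parser 'html.parser' is an assumption (it ships with Python, so nothing else has to be installed).

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page source (r.text).
sample_html = """
<div class="article">
  <h4><a href="/post/1">First post</a></h4>
  <span class="date"> 2019-02-14 </span>
</div>
"""

# Step 1 was the import above; step 2 parses the source into a BeautifulSoup object.
soup = BeautifulSoup(sample_html, 'html.parser')

# Step 3: find() returns the first matching element, or None if nothing matches.
tag = soup.find('span', class_='date')
date = tag.text.strip() if tag is not None else None

missing = soup.find('span', class_='no-such-class')

print(date)
print(missing)
```

Note that the keyword is class_ with a trailing underscore, because class is a reserved word in Python. Checking for None before reading .text avoids an AttributeError when the page layout changes.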

So how do we find the position of the data we want in such a long page source?

This is where Chrome's Inspect feature makes its grand entrance:

Step 1: Open https://blog.csdn.net/weixin_42183408 in Chrome, right-click the page, and click "Inspect" in the menu that appears.

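Step 2: Click the mouse-pointer button next to Elements (top left), then click the element you are interested in on the page. The panel jumps automatically to that element's position in the source.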

Step 3: We find that the code here is <span class="date">2019-02-12 19:29:08</span>. So we can use soup.find('span', class_='date').text to extract the date.
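A page usually contains more than one element with the same class, and find only returns the first. As a sketch of how all of them could be collected, the example below uses find_all and the equivalent CSS selector. The markup and its values are made up for illustration, including the read-num class.

```python
from bs4 import BeautifulSoup

# Made-up markup: two posts, each with a date and a read counter.
sample_html = """
<ul>
  <li><span class="date">2019-02-14</span><span class="read-num">10</span></li>
  <li><span class="date">2019-02-12</span><span class="read-num">25</span></li>
</ul>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find_all() returns every matching element, in document order.
dates = [tag.get_text(strip=True)
         for tag in soup.find_all('span', class_='date')]

# The same query written as a CSS selector.
dates_css = [tag.get_text(strip=True) for tag in soup.select('span.date')]

print(dates)
print(dates_css)
```

The tag name and class you read off the Elements panel in Step 3 map directly onto the arguments of find and find_all, or onto a selector such as span.date.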

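Storing the data

    import requests
    from bs4 import BeautifulSoup

    link = 'https://blog.csdn.net/weixin_42183408'
    headers = {'User-Agent': 'Mozilla/5.0 (Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    r = requests.get(link, headers=headers)

    soup = BeautifulSoup(r.text, 'html.parser')
    date = soup.find('span', class_='date').text.strip()

    # Write the date to date.txt in the current directory
    with open('date.txt', 'w', encoding='utf-8') as f:
        f.write(date)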

For more on open(), see "Python Basics (8): the open() function".

Next, open the date.txt file. You will see the date written in it.
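Instead of opening the file by hand, the result can also be checked from Python. The sketch below writes several dates, one per line, and reads them back. The helper names save_lines and load_lines and the sample dates are hypothetical, and a temporary directory is used so that no existing date.txt is overwritten.

```python
import os
import tempfile


def save_lines(path, lines):
    """Write each item on its own line, replacing any existing file."""
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line + '\n')


def load_lines(path):
    """Read the file back as a list of lines without trailing newlines."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]


# Sample values standing in for dates scraped from the page.
dates = ['2019-02-14', '2019-02-12']

path = os.path.join(tempfile.mkdtemp(), 'date.txt')
save_lines(path, dates)
print(load_lines(path))
```

Mode 'w' replaces the file on every run. If you would rather keep the results of earlier runs, mode 'a' appends to the end of the file instead.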

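Now that you have finished your first crawler example, it does not seem so hard, does it? Still, I recommend typing the code out yourself rather than copying and pasting it. Only by writing the code yourself will you discover your own weak spots and fix them, and only then will the code really stick. With time, practice makes perfect.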

Next, we will take a closer look at the requests library used here. See you then.