What a Python web crawler can do: multiprocessing or multithreading in Python?

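In brief, my web crawler has two main jobs: a collector and a crawler. The collector gathers all the URL items for each site and stores the non-duplicated URLs. The crawler takes URLs from storage, extracts the needed data, and stores it back.

2 machines:

Bot machine: 8 cores, physical Linux OS (no VM on this machine)

Storage machine: MySQL with clustering (VM for clustering), 2 databases (URL and data); URL database on port 1 and data database on port 2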

Objective: crawl 100 sites and try to reduce the bottlenecks.

First case: the collector *requests (urllib) all sites, collects the URL items for each site, and *inserts each URL into the storage machine on port 1 if it is not a duplicate. The crawler *gets a URL from storage on port 1, *requests the site and extracts the needed data, and *stores it back on port 2.
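To make the "insert if it is not a duplicate" step concrete, here is a minimal sketch. It is only an illustration under assumptions: an in-memory `sqlite3` database stands in for the MySQL URL database, and the names `LinkParser`, `extract_urls`, `open_url_store`, and `insert_new_urls` are hypothetical. On MySQL the equivalent idea would be a unique index on the URL column together with `INSERT IGNORE`.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(base_url, html):
    """Return absolute URLs for all links found in an HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def open_url_store(path=":memory:"):
    """Assumption: SQLite stands in for the MySQL URL database."""
    conn = sqlite3.connect(path)
    # The primary key lets the database itself reject duplicates.
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
    return conn


def insert_new_urls(conn, urls):
    """Insert only URLs not stored yet; return how many rows were added."""
    before = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO urls (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    return after - before
```

Sending the URLs of a whole page in one `executemany` call, rather than one round trip per URL, is one way the number of trips to the storage machine might be reduced.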

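The first case causes a connection bottleneck for both the website requests and the MySQL connection.

Second case: instead of inserting across machines, the collector stores the URLs in my own mini database on the local file system. There is no *read of a huge file (using OS command techniques), just *write (append) and *remove the header of the file.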

The second case may cause a bottleneck in both the website requests and the file I/O (read, write).
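A rough sketch of such a file-backed URL queue follows. Note that it swaps in a different technique for the "remove header" step: instead of physically deleting the first line (which normally means rewriting the file), it keeps a byte offset in a sidecar file and only moves that offset forward. The class name `FileUrlQueue` and the `.offset` file are hypothetical, and the sketch assumes a single reader with no locking.

```python
class FileUrlQueue:
    """Append-only URL file plus a sidecar file holding the read offset."""

    def __init__(self, path):
        self.path = path
        self.offset_path = path + ".offset"

    def append(self, url):
        # Appending never reads the existing (possibly huge) file.
        with open(self.path, "a", encoding="utf-8", newline="\n") as f:
            f.write(url + "\n")

    def _read_offset(self):
        try:
            with open(self.offset_path, "r", encoding="utf-8") as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def pop(self):
        """Return the oldest unread URL, or None if nothing is queued."""
        offset = self._read_offset()
        try:
            with open(self.path, "rb") as f:
                f.seek(offset)          # jump straight to the logical head
                line = f.readline()     # read one line only
                new_offset = f.tell()
        except FileNotFoundError:
            return None
        if not line.endswith(b"\n"):
            return None                 # empty, or a partially written line
        # "Removing the header" is just advancing the stored offset.
        with open(self.offset_path, "w", encoding="utf-8") as f:
            f.write(str(new_offset))
        return line.decode("utf-8").rstrip("\n")
```

With several collector or crawler workers touching the same file, some form of locking would be needed, and the file would have to be compacted now and then, since consumed lines are never actually deleted.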

Both cases are also CPU bound, because of collecting and crawling 100 sites.

As I have heard, I/O-bound work should use multithreading and CPU-bound work should use multiprocessing.
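One way that rule of thumb could be combined in a single program is sketched below with the standard `concurrent.futures` module: a thread pool for the I/O-bound requests and, optionally, a process pool for the CPU-bound extraction. The function names `fetch`, `count_links`, and `crawl` are hypothetical, `count_links` is only a stand-in for real extraction work, and the pool sizes are guesses rather than tuned values.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """I/O-bound step: download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def count_links(page_bytes):
    """Stand-in for the CPU-bound extraction step (hypothetical)."""
    return page_bytes.count(b"<a ")


def crawl(urls, fetch_fn=fetch, parse_fn=count_links, io_workers=32, cpu_pool=None):
    """Fetch pages in threads, then parse them in cpu_pool if one is given.

    Results are returned in the same order as `urls`.
    """
    # Threads spend most of their time waiting on the network,
    # so many of them can overlap even on a few cores.
    with ThreadPoolExecutor(max_workers=io_workers) as io_pool:
        pages = list(io_pool.map(fetch_fn, urls))

    if cpu_pool is None:
        return [parse_fn(page) for page in pages]
    # With a ProcessPoolExecutor, parsing is spread across CPU cores.
    return list(cpu_pool.map(parse_fn, pages))


# Usage sketch (keep it under `if __name__ == "__main__":` so that
# worker processes can import this module safely):
#     from concurrent.futures import ProcessPoolExecutor
#     with ProcessPoolExecutor(max_workers=8) as cpu_pool:
#         results = crawl(site_urls, cpu_pool=cpu_pool)
```

This version waits for all downloads to finish before parsing starts; a real crawler would more likely hand each page to the process pool as soon as it arrives.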

How about using both? Scrapy? Any ideas or suggestions?