
Fetching web pages with the requests library, plus a Selenium crawler


A scraping example: collect the DOIs of interesting articles from Nature and download them via Sci-Hub

Do you want to become a master collector of the literature? Do you want to be the first to see the newest papers in your field? Do you want to skip the drudgery of searching for papers and clicking download over and over, and spend that time on something healthier? Then come along and fire up the homepage!

This post uses a few subjects on the Nature website as an example: it first fetches each subject page with requests, parses it with lxml to extract and save the articles' DOIs, and then uses Selenium to feed those DOIs to Sci-Hub one by one.
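
Before the full script, here is a minimal warm-up sketch of just the requests + lxml half for a single subject page. The URL and the CSS class in the XPath are taken from the script below; nature.com's markup may have changed since the post was written, so treat the selector as an assumption to verify in the browser's inspector.

import requests
from lxml import html

subject = 'https://www.nature.com/subjects/spatial-memory'
page = requests.get(subject)
root = html.fromstring(page.content)

# article links sit inside <h3 class="mb10 extra-tight-line-height"> headings
hrefs = root.xpath('//h3[@class="mb10 extra-tight-line-height"]/a/@href')
print(hrefs[:5])   # relative paths such as /articles/<slug>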

The full source code is below.

import os
import requests
from lxml import html
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path = r'C:\Users\TaoZZ\OneDrive\selenium'   # folder that stores the doi lists
main_url = 'http://www.nature.com'
scihub = 'http://sci-hub.tw/'                # the source of pdfs

# add your favorite subjects here
subjects = []
subjects.append('https://www.nature.com/subjects/spatial-memory')
# subjects.append('https://www.nature.com/subjects/...')

option = webdriver.ChromeOptions()
option.add_argument('--headless')            # run the scraping in the background
driver = webdriver.Chrome(options=option)    # launch chrome

for subject in subjects:
    re = requests.get(subject)               # get the subject page
    root = html.fromstring(re.content)       # use lxml to parse the html

    # get the hrefs of the articles
    links = [main_url + link for link in
             root.xpath('//h3[@class="mb10 extra-tight-line-height"]/a/@href')]
    # get the dois of the articles, rebuilt from the article slug
    # (Nature DOIs have the form 10.1038/<slug>)
    dois = ['10.1038/' + link.split('/')[-1] for link in
            root.xpath('//h3[@class="mb10 extra-tight-line-height"]/a/@href')]

    # one doi list per subject, created on the first run
    title = subject.split('/')[-1] + '_doi.txt'
    if title not in os.listdir(path):
        open(os.path.join(path, title), 'a').close()

    # keep only the dois we have not searched before
    with open(os.path.join(path, title), 'r') as f:
        lines = f.readlines()
    new_dois = []
    f = open(os.path.join(path, title), 'a')
    for doi in dois:
        if doi + '\n' not in lines:
            new_dois.append(doi)             # the doi has not been seen yet
            f.write(doi + '\n')              # record it so we will not search it again
    f.close()

    ind = 0
    fail_count = 0
    while ind < len(new_dois):
        try:
            driver.get(scihub)
            elem = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.NAME, 'request')))  # find the input area
            elem.send_keys(new_dois[ind])    # type in the doi
            elem.send_keys(Keys.RETURN)
            elem = driver.find_element(By.PARTIAL_LINK_TEXT, 'save')   # the save button
            elem.click()
            ind += 1
            fail_count = 0
        except Exception:
            # sometimes sci-hub crashes at specific dois; give up on a doi
            # after a few consecutive failures (5 is an arbitrary retry limit)
            fail_count += 1
            if fail_count > 5:
                ind += 1
                fail_count = 0

driver.quit()
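
One practical detail worth adding: when Chrome runs headless, clicking "save" only reliably produces a file if a download directory has been configured; otherwise the PDF may open in the built-in viewer or, depending on the Chrome version, not be saved at all. Below is a minimal sketch of the extra ChromeOptions preferences; the pdf_dir path is hypothetical and none of this appears in the original script, so adjust it to your own setup.

from selenium import webdriver

pdf_dir = r'C:\Users\TaoZZ\OneDrive\selenium\pdfs'   # hypothetical download folder

option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.add_experimental_option('prefs', {
    'download.default_directory': pdf_dir,           # where saved pdfs land
    'download.prompt_for_download': False,           # no save-as dialog
    'plugins.always_open_pdf_externally': True,      # download pdfs instead of previewing them
})
driver = webdriver.Chrome(options=option)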

Now, while eating takeout and watching videos, I can glance at the folder and find the downloaded papers already piled up into a small mountain. And that is with the crawler creeping along slowly; you can imagine that if its performance were tuned until it turned into a nimble little fly, there would be more papers than anyone could ever finish reading.
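
If you do want to speed it up, the requests + lxml half is the easy part to parallelize, since every subject page is independent; the Selenium half drives a single browser and is better left sequential. Here is a minimal sketch using the standard library's concurrent.futures, which is not part of the original post; collect_dois is a hypothetical helper that wraps the per-subject scraping shown above.

from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html

def collect_dois(subject_url):
    # fetch one subject page and derive the dois from its article links
    page = requests.get(subject_url)
    root = html.fromstring(page.content)
    hrefs = root.xpath('//h3[@class="mb10 extra-tight-line-height"]/a/@href')
    return ['10.1038/' + h.split('/')[-1] for h in hrefs]

subjects = ['https://www.nature.com/subjects/spatial-memory']   # add more subjects here

with ThreadPoolExecutor(max_workers=4) as pool:
    all_dois = [doi for per_subject in pool.map(collect_dois, subjects) for doi in per_subject]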
