
Fetching web pages with the requests library, plus a Selenium crawler


A scraping example: collect the DOIs of interesting articles from Nature and download them via Sci-Hub

Do you want to become a master collector of the literature? Do you want to be the first to see the newest papers in your field? Do you want to skip the drudgery of searching for papers and clicking download over and over, and spend that time on something healthier? Then come along and fire up the homepage!

This post uses a few subjects on the Nature website as an example: it first fetches each subject page with requests, parses it with lxml to extract and save the articles' DOIs, and then uses Selenium to feed those DOIs to Sci-Hub one by one.
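
Before the full script, here is a minimal warm-up sketch of just the requests + lxml half for a single subject page. The URL and the CSS class in the XPath are taken from the script below; nature.com's markup may have changed since the post was written, so treat the selector as an assumption to verify in the browser's inspector.

import requests
from lxml import html

subject = 'https://www.nature.com/subjects/spatial-memory'
page = requests.get(subject)
root = html.fromstring(page.content)

# article links sit inside <h3 class="mb10 extra-tight-line-height"> headings
hrefs = root.xpath('//h3[@class="mb10 extra-tight-line-height"]/a/@href')
print(hrefs[:5])   # relative paths such as /articles/<slug>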

The full source code is below.

import os
import requests
from lxml import html
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path = r'C:\Users\TaoZZ\OneDrive\selenium'   # folder that stores the doi lists
main_url = 'http://www.nature.com'
scihub = 'http://sci-hub.tw/'                # the source of pdfs

# add your favorite subjects here
subjects = []
subjects.append('https://www.nature.com/subjects/spatial-memory')
# subjects.append('https://www.nature.com/subjects/...')

option = webdriver.ChromeOptions()
option.add_argument('--headless')            # run the scraping in the background
driver = webdriver.Chrome(options=option)    # launch chrome

for subject in subjects:
    re = requests.get(subject)               # get the subject page
    root = html.fromstring(re.content)       # use lxml to parse the html

    # get the hrefs of the articles
    links = [main_url + link for link in
             root.xpath('//h3[@class="mb10 extra-tight-line-height"]/a/@href')]
    # get the dois of the articles, rebuilt from the article slug
    # (Nature DOIs have the form 10.1038/<slug>)
    dois = ['10.1038/' + link.split('/')[-1] for link in
            root.xpath('//h3[@class="mb10 extra-tight-line-height"]/a/@href')]

    # one doi list per subject, created on the first run
    title = subject.split('/')[-1] + '_doi.txt'
    if title not in os.listdir(path):
        open(os.path.join(path, title), 'a').close()

    # keep only the dois we have not searched before
    with open(os.path.join(path, title), 'r') as f:
        lines = f.readlines()
    new_dois = []
    f = open(os.path.join(path, title), 'a')
    for doi in dois:
        if doi + '\n' not in lines:
            new_dois.append(doi)             # the doi has not been seen yet
            f.write(doi + '\n')              # record it so we will not search it again
    f.close()

    ind = 0
    fail_count = 0
    while ind < len(new_dois):
        try:
            driver.get(scihub)
            elem = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.NAME, 'request')))  # find the input area
            elem.send_keys(new_dois[ind])    # type in the doi
            elem.send_keys(Keys.RETURN)
            elem = driver.find_element(By.PARTIAL_LINK_TEXT, 'save')   # the save button
            elem.click()
            ind += 1
            fail_count = 0
        except Exception:
            # sometimes sci-hub crashes at specific dois; give up on a doi
            # after a few consecutive failures (5 is an arbitrary retry limit)
            fail_count += 1
            if fail_count > 5:
                ind += 1
                fail_count = 0

driver.quit()
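
One practical detail worth adding: when Chrome runs headless, clicking "save" only reliably produces a file if a download directory has been configured; otherwise the PDF may open in the built-in viewer or, depending on the Chrome version, not be saved at all. Below is a minimal sketch of the extra ChromeOptions preferences; the pdf_dir path is hypothetical and none of this appears in the original script, so adjust it to your own setup.

from selenium import webdriver

pdf_dir = r'C:\Users\TaoZZ\OneDrive\selenium\pdfs'   # hypothetical download folder

option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.add_experimental_option('prefs', {
    'download.default_directory': pdf_dir,           # where saved pdfs land
    'download.prompt_for_download': False,           # no save-as dialog
    'plugins.always_open_pdf_externally': True,      # download pdfs instead of previewing them
})
driver = webdriver.Chrome(options=option)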

Now, while eating takeout and watching videos, I can glance at the folder and find the downloaded papers already piled up into a small mountain. And that is with the crawler creeping along slowly; you can imagine that if its performance were tuned until it turned into a nimble little fly, there would be more papers than anyone could ever finish reading.
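
If you do want to speed it up, the requests + lxml half is the easy part to parallelize, since every subject page is independent; the Selenium half drives a single browser and is better left sequential. Here is a minimal sketch using the standard library's concurrent.futures, which is not part of the original post; collect_dois is a hypothetical helper that wraps the per-subject scraping shown above.

from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html

def collect_dois(subject_url):
    # fetch one subject page and derive the dois from its article links
    page = requests.get(subject_url)
    root = html.fromstring(page.content)
    hrefs = root.xpath('//h3[@class="mb10 extra-tight-line-height"]/a/@href')
    return ['10.1038/' + h.split('/')[-1] for h in hrefs]

subjects = ['https://www.nature.com/subjects/spatial-memory']   # add more subjects here

with ThreadPoolExecutor(max_workers=4) as pool:
    all_dois = [doi for per_subject in pool.map(collect_dois, subjects) for doi in per_subject]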
