Batch web page text extraction tool: extract text from web pages in one step

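Have you ever run into this problem? You have a long list of article links and want the text from each one, but opening every page and copying the text by hand is far too tedious.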

This can be done with Python's newspaper library.
The library comes in two versions: the Python 2 version is called newspaper, and the Python 3 version is called newspaper3k. The Python 3 version is used for the tests here.

Install the library:

    pip3 install newspaper3k

Then run the following script, which reads the links from 3.txt (one URL per line):

    from newspaper import Article

    counts1 = 0  # links whose extracted text is longer than 256 characters
    counts2 = 0  # links whose extracted text is 256 characters or shorter
    counts3 = 0  # links that raised an error
    save_urls = '3.txt'

    # Read the previously saved URLs, skipping blank lines
    with open(save_urls, encoding='utf-8') as file:
        urlLinks = [line.strip() for line in file if line.strip()]
    print(len(urlLinks))
    print(urlLinks)

    for link in urlLinks:
        try:
            news = Article(link, language='zh')
            news.download()  # download the page
            news.parse()     # parse the page
            print(news.text)
            if len(news.text) > 256:
                counts1 += 1
            else:
                counts2 += 1
        except Exception as e:
            # Download or parsing failed; count the link and move on
            print('failed: ' + link + ' (' + str(e) + ')')
            counts3 += 1
        print('-' * 100)
        print('counts1:' + str(counts1))
        print('counts2:' + str(counts2))
        print('counts3:' + str(counts3))

    if urlLinks:  # avoid dividing by zero when the file is empty
        print('First success rate: ' + str(counts1 / len(urlLinks) * 100) + '%')
        print('Second success rate: ' + str((counts1 + counts2) / len(urlLinks) * 100) + '%')

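The first success rate is the number of links from which more than 256 characters were extracted, divided by the total number of links (a rough way to test the newspaper library).
The second success rate is counts1 plus counts2 divided by the total number of links, that is, the share of links that were processed without an error, whatever the length of the extracted text.
counts1 is the number of links that yielded more than 256 characters.
counts2 is the number of links that yielded 256 characters or fewer.
counts3 is the number of links that raised an error and could not be processed.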
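
The counters and the two rates can also be computed in one place, which makes them easy to check without downloading anything. The following is only a minimal sketch: the names summarize, lengths and threshold are made up for this illustration, and it assumes that a text of exactly 256 characters is counted in counts2.

```python
def summarize(lengths, threshold=256):
    """Tally extraction results and compute both success rates.

    `lengths` holds one entry per link: the number of characters
    extracted, or None if the link raised an error.
    """
    total = len(lengths)
    counts1 = sum(1 for n in lengths if n is not None and n > threshold)
    counts3 = sum(1 for n in lengths if n is None)
    counts2 = total - counts1 - counts3  # parsed, but at or below the threshold
    if total == 0:
        # No links at all: report zero rates instead of dividing by zero
        return counts1, counts2, counts3, 0.0, 0.0
    first_rate = counts1 / total * 100
    second_rate = (counts1 + counts2) / total * 100
    return counts1, counts2, counts3, first_rate, second_rate


# Hypothetical results for four links: two long texts, one short text, one failure
print(summarize([1200, 300, 40, None]))  # (2, 1, 1, 50.0, 75.0)
```

In a real run, the list would be filled inside the loop over the links, appending len(news.text) on success and None in the except branch.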

Commonly used newspaper attributes

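Once news.download() and news.parse() have been called, the following attributes are available:

    print(news.title)         # title
    print(news.text)          # body text
    print(news.authors)       # authors
    print(news.keywords)      # keywords (empty until news.nlp() has been called)
    print(news.summary)       # summary (empty until news.nlp() has been called)
    print(news.top_image)     # URL of the lead image
    print(news.movies)        # video URLs
    print(news.publish_date)  # publication date
    print(news.html)          # raw HTML source of the page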
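
For batch work it can be handier to collect these attributes into a plain dictionary than to print them one by one. The sketch below is one possible way to do that. The function name article_to_record is hypothetical, and the example uses a stand-in object with made-up values so that it runs without network access or the newspaper library installed.

```python
from datetime import datetime
from types import SimpleNamespace


def article_to_record(news):
    """Copy commonly used fields of a parsed article into a plain dict."""
    publish_date = news.publish_date
    return {
        "title": news.title,
        "text": news.text,
        "authors": list(news.authors),
        "top_image": news.top_image,
        "movies": list(news.movies),
        # The date may be missing when the page does not expose one
        "publish_date": publish_date.isoformat()
        if hasattr(publish_date, "isoformat")
        else None,
    }


# Stand-in for a parsed article; all values here are invented for the example
fake = SimpleNamespace(
    title="Example title",
    text="Example body text",
    authors=["Author A"],
    top_image="",
    movies=[],
    publish_date=datetime(2020, 1, 2),
)
print(article_to_record(fake))
```

With a real page, news would be an Article object on which download() and parse() have already been called.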