Batch-replacing characters in a PDF with Python: extracting text from PDF files with Python

There are all kinds of PDF-processing programs online, but their results are not ideal, so I use a Python library instead.

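PyPDF2 handles extracting the text content of a PDF; image extraction is left for a later date. Without further ado, here is the code:

    from PyPDF2 import PdfFileReader

    # Define a function that returns the text content of a PDF
    def getPdfContent(filename):
        # Create the PdfFileReader object (PyPDF2 1.x/2.x API)
        pdf = PdfFileReader(open(filename, "rb"))
        content = ""  # content is the output text
        for i in range(0, pdf.getNumPages()):  # loop over every page
            pageObj = pdf.getPage(i)
            try:
                extractedText = pageObj.extractText()
                content += extractedText + "\n"
            except BaseException:
                # Skip the page if extraction fails, e.g. when it contains an image
                pass
        # Drop non-ASCII characters, then decode back to str
        return content.encode("ascii", "ignore").decode("ascii")

    # Words per line; I set it to 10
    # Split the extracted text on spaces; f is the output file, already open for writing
    count = 0
    for item in getPdfContent("….pdf").split(" "):
        # Break the line when the current word ends with a period
        if item.endswith("."):
            f.write(item + "\n")
            count = 0
        else:
            f.write(item + " ")
            count += 1
        # Break the line after 10 words have been written
        if count == 10:
            f.write("\n")
            count = 0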
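The line-breaking rule described above (start a new line after a word ending in a period, or after 10 words) can also be tried on its own, without a PDF at hand. The following is only a minimal sketch of that rule: the function name wrap_words and the parameter words_per_line are made up for illustration and are not part of PyPDF2 or of the original script.

```python
def wrap_words(text, words_per_line=10):
    """Re-flow text: start a new line after a word ending in '.',
    or after words_per_line words, whichever comes first.

    Hypothetical helper for illustration only.
    """
    lines = []
    current = []
    # split() with no argument splits on any whitespace and drops empty strings
    for word in text.split():
        current.append(word)
        if word.endswith(".") or len(current) == words_per_line:
            lines.append(" ".join(current))
            current = []
    # Keep any trailing words that did not fill a whole line
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "This is a short sentence. Here is another one that runs on for more than ten words in total."
    print(wrap_words(sample))
```

Returning a string instead of writing straight to a file keeps the rule easy to check; the result can then be written to the output file in a single call.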