Python 中的 PDF 处理

Python 被认为是一种极其灵活的编程语言，拥有广泛的库，是一种高级语言，具有易于阅读和编写的语法。Python 的范围正在不同的领域扩展，如机器学习、网络开发、网络安全、应用开发等等。因此，这种编程语言在程序员、工程师和开发人员中被广泛选择。

在下面的教程中，我们将在 Python 编程语言的帮助下处理 pdf。pdf，缩写为便携式文档格式，是一种包含文本、表格、图像等的文档文件格式，通常在我们需要保存无法进一步修改或易于共享或打印的文件时使用。 PDF 文件格式是由 Adobe 在 1993 年开发的，目的是以独立于软件、应用、操作系统和硬件的方式呈现包含格式化文本和图像的文档。

为了让我们更好地理解与使用 Python 处理 PDF 相关的所有内容，下面的教程被分成了不同的部分。

那么，让我们开始吧。

一些著名的 Python PDF 库

Python 提供了大量用于操作 PDF 文件的库。处理 pdf 时通常使用的一些著名库有:

PDFMiner，
PyPDF4，
PyPDF2，
Python-docx，
PyMuPDF，

还有更多。

虽然在 Python 中使用不同的包来对 PDF 执行不同的功能操作，但我们将只讨论一些库的工作，如 PDFMiner、PyPDF2、PyMuPDF、reportlab、以及本教程中的其他一些库。 PyPDF2 被认为是广泛选择的与 PDF 一起工作的 Python 模块之一。该软件包易于使用，并提供各种功能。但是，当我们谈论文本提取时， PDFMiner 包更加精确和可靠。 PDFMiner 是专门为用户从 PDF 文件中提取文本而设计的。当我们考虑 PDF 文件操作时，有不同的场景，其中一个包在不同方面比另一个包更有效。因此，在本教程中，我们将根据 PDF 文件的舒适性和可靠性来讨论用于操作 PDF 文件的不同库。

使用 Python 从 pdf 中提取文本

pdf 由各种内容组成，如文本、表格、图像、表单等。这些文件是数据的图形解释。他们提供关于显示器或纸张确切位置的信息。然而，它们没有为句子或段落指定的逻辑结构，并且当显示的大小改变时，它们不能自我调整。 PDFMiner 包通过评估布局和预测文本和其他内容的位置来为用户执行工作。

PDFMiner 被认为是用来执行从 PDF 文件中提取文本等操作的健壮库之一。因此，在下一节中，我们将演示 PDFMiner 用于文本提取的用法。

首先我们要安装 PDFMiner 包。

安装 PDFMiner 包

我们可以使用以下命令安装 PDFMiner 包:

语法:


$ pip install pdfminer

一旦安装完成，我们将进入主要部分，使用 PDFMiner 库提取文本。

让我们考虑下面的例子，演示如何借助 Python 中的 PDFMiner 提取文本。

示例:


from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
O_string = StringIO()
with open('my_file.pdf', 'rb') as input_file:
    my_parser = PDFParser(input_file)
    my_doc = PDFDocument(my_parser)
    rsrcmgr = PDFResourceManager()
    my_device = TextConverter(rsrcmgr, O_string, laparams = LAParams())
    my_interpreter = PDFPageInterpreter(rsrcmgr, my_device)
    for my_page in PDFPage.create_pages(my_doc):
        my_interpreter.process_page(my_page)
print(O_string.getvalue())

输出:

A Simple PDF File 
 This is a small demonstration .pdf file - 
 just for use in the Virtual Mechanics tutorials. More text. And more 
 text. And more text. And more text. And more text. 
 And more text. And more text. And more text. And more text. And more
 text. And more text. Boring, zzzzz. And more text. And more text. And
 more text. And more text. And more text. And more text. And more text.
 And more text. And more text.
 And more text. And more text. And more text. And more text. And more
 text. And more text. And more text. Even more. Continued on page 2 ...
 Simple PDF File 2
 ...continued from page 1. Yet more text. And more text. And more text.
 And more text. And more text. And more text. And more text. And more
 text. Oh, how boring typing this stuff. But not as boring as watching
 paint dry. And more text. And more text. And more text. And more text.
 Boring.  More, a little more text. The end, and just as well.

说明:

在上面的代码片段中，我们已经从 IO 库中导入了 StringIO 模块，并从 PDFMiner 模块中导入了所需的函数和类。我们创建了一个 StringIO 对象，并使用 with 语句从目录中打开 pdf 文件。根据 PDFMiner 文档， PDFPageInterpreter 用于处理页面内容，而 PDFResourceManager 用于存储字体或图像等共享资源。 PDFPage 用于对数据进行逐页分析。参数加载字符、文本框、文本行、图像和图形的布局分析。在这些帮助下，文本转换器功能有助于将 PDF 文档转换为文本。

我们提供“my _ file . pdf”作为 PDF 文件，在 PDFMiner 模块的帮助下进行分析和执行。我们可以使用进程页面功能从 PDF 文件中提取文本。

最后，打印(文本)功能将从 PDF 中打印出提取的文本。因此，以这种方式，可以使用 PDFMiner 库从 PDF 文件中提取文本。

基于 Python 的 pdf 图像提取

每当我们想从 PDF 中提取图像时，我们可以利用 PyMuPDF。这个库使用了一个额外的模块， fitz，，这使得从 PDF 文件中提取图像变得更加容易。在开始直接使用模块之前，让我们安装所需的库。

安装 PyMuPDF 包

我们可以使用以下命令安装 PyMuPDF 包:

语法:


$ pip install pymupdf
$ pip install fitz

一旦安装完成，我们将进入主要部分，使用 PyMuPDF 库和 fitz 模块提取文本。

让我们考虑下面这个用 Python 演示图像提取的例子。

示例:


# PyMuPDF
import fitz
import io
from PIL import Image
# path to our input file
my_file = "file2.pdf" 
# Input PDF file
my_pdf = open(my_file)
for page_num in range(len(my_pdf)):
   cur_page = my_pdf[page_num]
   img = cur_page.getImageList()
   for image_num, image in enumerate(cur_page.getImageList()):
       # get the XREF of the image
       xref = image[0]
       # extract the image bytes
       cur_image = my_pdf.extractImage(xref)
       imgBytes = cur_image["image"]
       # get the image extension
       img_ext = cur_image["ext"]
       # load it to PIL
       image = Image.open(io.BytesIO(imgBytes))
       # save it to local disk
       image.save(open(f"page{page_num + 1}_img{image_num}.{img_ext}", "wb"))

输出:

[+] Found a total of 2 images in page 0
[+] Found a total of 2 images in page 1

说明:

在上面的代码片段中，我们已经导入了所需的模块。然后，我们使用 fitz 模块加载了 PDF 文件。然后我们一页一页地找到图像列表。然后，我们将 pdf 中的图像字节转换为实际图像，并保存在本地。因此，以这种方式，我们已经从一个 PDF 文件中提取了图像。

使用 Python 从 pdf 中提取表格

与图像和文本提取相比，从 PDF 文件中提取表格有点容易。Python 提供了一个预定义的库，我们可以用它来提取表。因此，在我们开始实现代码之前，首先有必要安装这个库。

安装卡梅洛特图书馆

我们可以使用 pip 安装程序使用以下命令安装 camelot 模块:

语法:


$ pip install camelot

安装完成后，让我们继续用 Python 从 PDF 文件中提取表格。

示例:


import camelot
# reading the pdf file
my_tables = camelot.read_pdf("my_table.pdf")
print(my_tables[0].df)

说明:

在上面的代码片段中，我们已经导入了 camelot 库。然后，我们使用卡梅洛特库的 read_pdf() 功能从 PDF 文件中提取表格，并将它们作为列表存储在变量中。最后，我们使用表的索引值和 df 属性打印了一个提取的表。因此，我们已经成功地从 PDF 文件中提取了表格。

使用 Python 从 pdf 中提取网址

提取网址被认为是 Python 提供的另一个方便的功能。Python 有一个预定义的库，称为“pdfx”，，通常用于从 PDF 文件中提取网址。我们可以利用诸如 PDFMiner、PyPDF2、等库来提取文本，并使用正则表达式来找出网址。然而，这个过程是漫长而忙碌的。因此，为了缩短代码的长度，我们将使用 pdfx 库从 PDF 文件中提取网址。

安装 pdfx 库

我们可以使用 pip 安装程序使用以下命令安装 pdfx 库:

语法:


$ pip install pdfx

一旦安装完成，让我们考虑下面的例子来理解从 pdf 中提取网址。

示例:


import pdfx
# reading the PDF File
my_pdf = pdfx.PDFx("sample-url.pdf")
# get list of URLS
print(my_pdf.get_references_as_dict())

输出:

{'url': ['https://www.javatpoint.com/python-pass', 'https://www.javatpoint.com/python-tutorial', 'https://www.javatpoint.com/python-seaborn-library', 'https://www.javatpoint.com/', 'https://www.javatpoint.com/chatbot-in-python', 'https://www.javatpoint.com/python-if-else']}

说明:

在上面的代码片段中，我们已经导入了 pdfx 库。然后我们使用 PDFx() 函数从目录中读取 PDF 文件。然后我们使用 get_references_as_dict() 函数以字典的形式提取输入 PDF 文件中所有可用的网址。

使用 Python 从 pdf 中提取页面作为图像

在本节中，我们将了解从图像形式的 PDF 文件中提取页面。为了完成任务，我们将需要另一个被称为 pdf2image 的简短库。当我们想要将 PDF 文件转换成图像时，通常会用到这个库。

让我们从安装库开始。

安装 pdf2image 库

我们可以使用 pip 安装程序的以下命令来安装 pdf2image 库:

语法:


$ pip install pdf2image

一旦安装完成，让我们考虑下面的例子来理解 pdf2image 库的工作。

示例:


from pdf2image import convert_from_path
my_pages = convert_from_path("my_file.pdf", 120) 
n = 0
# iterating through pages
for page in my_pages:
   n += 1
   page.save(f"output{n}.jpg", "JPEG")

说明:

在上面的代码片段中，我们从 pdf2image 库中导入了 convert_from_path 函数。然后，我们使用了导入的函数，其中我们提供了值 120。该值称为 DPI 或每英寸点数。数值越高，图像越清晰，尺寸越大。我们通过将页面保存为 JPEG 图像来遍历每一页。