Complete Code Examples for Processing PDF Files with Python

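This article walks through how to process PDF files with Python. With Python we can extract text and images from a PDF, search its text, and merge, split, and generate PDF files.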

1. Installing PyPDF2

To work with PDF files, we first need to install the PyPDF2 library. Run the following command in a terminal:

pip install PyPDF2
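
PyPDF2 renamed its classes and methods between major versions (camelCase names such as PdfFileReader were removed in 3.0), so it can be useful to check which version is actually installed before running any example. The following is a minimal sketch; the helper names installed_version and uses_snake_case_api are hypothetical and not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package="PyPDF2"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def uses_snake_case_api(version):
    """Return True for major version 3 or later, where the camelCase names are gone."""
    major = int(version.split(".")[0])
    return major >= 3


version = installed_version()
if version is None:
    print("PyPDF2 is not installed")
else:
    print(f"PyPDF2 {version}; snake_case API only: {uses_snake_case_api(version)}")
```

If the check reports a 1.x or 2.x release, the older camelCase names are still available there, although 2.x already warns that they are deprecated.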

2. Extracting Text from a PDF

We can use PyPDF2 to extract the text of a PDF. Here is an example:

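from PyPDF2 import PdfReader

def extract_text(file_path):
    """Return the text of every page in the PDF, concatenated into one string."""
    with open(file_path, 'rb') as file:
        reader = PdfReader(file)
        text = ''
        for page in reader.pages:
            # extract_text() returns an empty string for pages without extractable text
            text += page.extract_text()
        return text

pdf_text = extract_text('example.pdf')
print(pdf_text)

The code above defines an extract_text function that takes the path of a PDF file and returns the extracted text. It loops over reader.pages, calls extract_text() on each page, and concatenates the results into a single string, which we then print. These are the PyPDF2 3.x names; the older PdfFileReader, getNumPages(), getPage(), and extractText() were removed in version 3.0. Scanned pages that contain only images yield no text.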
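
Text extracted from a PDF often contains irregular spacing and words split across line breaks. As a rough sketch of a clean-up step, the hypothetical helper clean_extracted_text below (not part of PyPDF2) normalizes whitespace with regular expressions. Rejoining hyphenated words is only a heuristic: it will also merge genuine compounds that happen to break at the hyphen.

```python
import re


def clean_extracted_text(text):
    """Tidy raw text extracted from a PDF page."""
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Strip leading and trailing whitespace from every line
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Reduce three or more consecutive newlines to one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The function takes and returns a plain string, so it can be applied to the result of the extraction step regardless of how that text was obtained.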

3. Extracting Images from a PDF

Besides text, we can also extract the images embedded in a PDF. Here is an example:

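from PyPDF2 import PdfReader

def extract_images(file_path):
    """Return a list of (file_name, image_bytes) pairs for every image in the PDF."""
    reader = PdfReader(file_path)
    images = []
    for page in reader.pages:
        # page.images is empty for pages that have no image resources
        for image in page.images:
            images.append((image.name, image.data))
    return images

pdf_images = extract_images('example.pdf')
for i, (name, data) in enumerate(pdf_images):
    # Prefix an index because image names may repeat across pages
    with open(f'image_{i}_{name}', 'wb') as file:
        file.write(data)

The code above defines an extract_images function that takes the path of a PDF file and returns a list of (name, data) pairs. On each page, page.images lists the embedded image objects (the XObjects whose Subtype is Image); each name carries an extension that matches the image format, so the files are not all written as .jpg regardless of their real type. Finally, each image is saved as a separate file. Decoding some image formats requires the Pillow library to be installed.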
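
Image data embedded in a PDF is not always JPEG, so it can help to look at the first bytes of the data before choosing a file extension. The sketch below checks a few well-known file signatures; guess_image_extension is a hypothetical helper, and the list of formats is deliberately short rather than exhaustive.

```python
def guess_image_extension(data):
    """Guess a file extension from the leading signature bytes of image data."""
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return ".tif"
    if data.startswith(b"\x00\x00\x00\x0cjP  \r\n\x87\n"):
        return ".jp2"
    # Unknown signature: likely raw pixel data that needs decoding first
    return ".bin"
```

Data that matches none of the signatures is usually raw or compressed pixel data rather than a complete image file, and writing it to disk unchanged will not produce a viewable image.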

4. Searching for Text in a PDF

With PyPDF2 we can also search the text of a PDF. Here is an example:

from PyPDF2 import PdfReader

def search_text(file_path, keyword):
    """Return the 1-based numbers of the pages whose text contains keyword."""
    with open(file_path, 'rb') as file:
        reader = PdfReader(file)
        page_numbers = []
        # Start counting at 1 so the result matches the page numbers a reader sees
        for page_num, page in enumerate(reader.pages, start=1):
            text = page.extract_text()
            if keyword in text:
                page_numbers.append(page_num)
        return page_numbers

keyword = 'Python'
page_numbers = search_text('example.pdf', keyword)
print(f'Keyword "{keyword}" appears on pages: {page_numbers}')

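The code above defines a search_text function that takes the path of a PDF file and a keyword, and returns the list of page numbers that contain the keyword. It loops over the pages, extracts the text of each one, and looks for the keyword in it. When the keyword is found, the page number (counted from 1) is added to page_numbers. Finally, we print the matching page numbers. The match is case-sensitive, and a keyword that is split across a line break in the extracted text will not be found.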
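
The page-by-page search described above can be extended to ignore case and to report a little context around each hit. The sketch below works on a plain list of page texts, so it does not depend on how the text was extracted; find_keyword and its context parameter are hypothetical names chosen for illustration.

```python
import re


def find_keyword(page_texts, keyword, context=20):
    """Return (page_number, snippet) pairs for every case-insensitive match.

    page_texts is a list of strings, one per page; page numbers start at 1.
    """
    if not keyword:
        return []
    # re.escape makes the keyword match literally, not as a regex pattern
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for match in pattern.finditer(text):
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            # Collapse line breaks and repeated spaces inside the snippet
            snippet = " ".join(text[start:end].split())
            hits.append((page_number, snippet))
    return hits
```

Unlike the simple membership test, this returns one entry per occurrence, so a page that mentions the keyword several times appears several times in the result.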

5. Merging and Splitting PDF Files

PyPDF2 also lets us merge and split PDF files. Here is an example:

from PyPDF2 import PdfMerger, PdfReader, PdfWriter

def merge_pdfs(file_paths, output_path):
    """Concatenate the given PDF files, in order, into output_path."""
    merger = PdfMerger()
    for file_path in file_paths:
        merger.append(file_path)
    merger.write(output_path)
    merger.close()

def split_pdf(file_path, page_numbers, output_path):
    """Copy the given 1-based pages of file_path into a new PDF at output_path."""
    with open(file_path, 'rb') as file:
        reader = PdfReader(file)
        writer = PdfWriter()
        for page_num in page_numbers:
            # reader.pages is 0-based, so subtract 1 from the page number
            writer.add_page(reader.pages[page_num - 1])
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)

file_paths = ['file1.pdf', 'file2.pdf', 'file3.pdf']
output_path = 'merged.pdf'
merge_pdfs(file_paths, output_path)

file_path = 'example.pdf'
page_numbers = [1, 3, 5]
output_path = 'split.pdf'
split_pdf(file_path, page_numbers, output_path)

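The code above defines a merge_pdfs function and a split_pdf function. merge_pdfs takes a list of file paths and an output path, and concatenates the input PDF files into a single output file in the order given. split_pdf takes a file path, a list of page numbers (counted from 1), and an output path, and writes only the listed pages of the input PDF to one output file. A page number beyond the end of the document raises an IndexError.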
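
Page lists such as [1, 3, 5] are often easier to supply as a compact string like "1-3,5". The sketch below converts such a string into the 1-based page list that the splitting step expects and validates it against the document length up front. parse_page_spec and the string format are assumptions made for illustration, not something PyPDF2 provides.

```python
def parse_page_spec(spec, total_pages):
    """Turn a spec such as "1-3,5" into a list of 1-based page numbers.

    Raises ValueError if a number is malformed or outside 1..total_pages.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
        else:
            first = last = int(part)
        if not 1 <= first <= last <= total_pages:
            raise ValueError(
                f"invalid page range {part!r} for a {total_pages}-page document"
            )
        pages.extend(range(first, last + 1))
    return pages
```

Validating before any page is copied means a bad range is reported as a clear error instead of failing halfway through writing the output file.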

6. Generating PDF Files

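PyPDF2 can assemble pages into a new PDF file, but it has no API for drawing text onto a page. The following example therefore assumes that the ReportLab library is also installed (pip install reportlab): ReportLab renders each text onto a page, and PyPDF2 combines the pages into one file.

import io

from PyPDF2 import PdfReader, PdfWriter
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

def text_to_page(text):
    """Render plain text onto a single A4 page and return it as a PyPDF2 page."""
    buffer = io.BytesIO()
    pdf_canvas = canvas.Canvas(buffer, pagesize=A4)
    # Start 72 points (1 inch) in from the left edge and down from the top edge
    text_object = pdf_canvas.beginText(72, A4[1] - 72)
    for line in text.splitlines():
        text_object.textLine(line)
    pdf_canvas.drawText(text_object)
    pdf_canvas.save()
    buffer.seek(0)
    return PdfReader(buffer).pages[0]

def create_pdf(file_path, content):
    """Write one page per text in content to a new PDF at file_path."""
    writer = PdfWriter()
    for text in content:
        writer.add_page(text_to_page(text))
    with open(file_path, 'wb') as file:
        writer.write(file)

content = []
for name in ['text1.txt', 'text2.txt']:
    with open(name, 'r', encoding='utf-8') as file:
        content.append(file.read())

create_pdf('output.pdf', content)

The code above defines a create_pdf function that takes an output path and a list of texts, and generates a PDF with one page per text. For each text, text_to_page draws the lines onto an in-memory A4 page, and create_pdf adds that page to the writer before saving the file to the given path. This is a minimal version: long lines are not wrapped, text that runs past the bottom of the page is cut off, and the default font covers only Latin characters.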
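
Whatever library draws the text, long input usually has to be wrapped into lines and divided into pages first. The sketch below does this with the standard textwrap module; paginate_text and the limits chars_per_line and lines_per_page are hypothetical, and the real capacity of a page depends on the font, font size, and margins in use.

```python
import textwrap


def paginate_text(text, chars_per_line=80, lines_per_page=50):
    """Wrap text to a fixed width and group the lines into pages.

    Returns a list of pages, each page being a list of lines.
    """
    lines = []
    for paragraph in text.splitlines():
        wrapped = textwrap.wrap(paragraph, width=chars_per_line)
        # textwrap.wrap returns [] for an empty line; keep it as a blank line
        lines.extend(wrapped or [""])
    return [
        lines[start:start + lines_per_page]
        for start in range(0, len(lines), lines_per_page)
    ]
```

Each inner list can then be drawn onto its own page, so that no text falls outside the page area.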

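As these examples show, Python handles common PDF tasks with very little code. PyPDF2 covers text and image extraction, searching, merging, and splitting, and it can assemble new PDF files from existing or blank pages.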