A free OCR tutorial from GitHub: transcribing PDF slides

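Many people need to convert PDFs into editable text but have no simple way to do it. In the project described in this article, wmddy Soares, a senior machine learning engineer at K1 Digital, uses OCR (optical character recognition) to transcribe PDF slides automatically, with good results.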

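Traditional lectures usually come with a set of PDF slides. Taking notes on such a lecture generally means copying and pasting a lot of content out of the PDF.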

To avoid that manual copying and pasting, Soares transcribes the slides automatically with OCR and then works with the content directly in a markdown file.

Project author: wmddy Soares. Project repository: https://github.com/enkrateialucca/ocr_for_transcribing_pdf_slides

Why not use a conventional PDF-to-text tool?

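Soares found that conventional tools created more problems than they solved, and that those problems took time to work around. He tried conventional Python packages but ran into many issues, including having to parse the final output with complex regular-expression patterns, so he decided to use object detection and OCR instead.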

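The basic process has three steps: convert the PDF to images; detect and recognize the text in the images; display a sample of the output.

Transcribing PDFs to text with deep-learning-based OCR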

Converting the PDF to images

The PDF slides Soares uses come from David Silver's reinforcement learning course; the address is given below. Each slide is converted to a PNG image with the pdf2image package.

Sample PDF slides: https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf

The code is as follows.

from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)

pdf_path = 'path/to/file/intro_rl_lecture1.pdf'

# Render every page of the PDF as a PIL image.
images = convert_from_path(pdf_path)

# Save each page as image0.png, image1.png, ...
for i, image in enumerate(images):
    fname = 'image' + str(i) + '.png'
    image.save(fname, 'PNG')

After this step, every PDF slide has been converted to a PNG image.
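
One detail to watch at this step: names such as image10.png sort before image2.png in lexicographic order, and the recognition step iterates over the sorted file names. The following is a minimal sketch, not part of the original project, of a hypothetical helper `slide_filenames` that zero-pads the index so that sorted order matches slide order.

```python
def slide_filenames(n_slides, prefix="image", ext="png"):
    """Return zero-padded file names so that sorted() keeps slide order.

    `slide_filenames` is a hypothetical helper, not part of pdf2image.
    """
    # Pad to the number of digits of the largest index (at least 1).
    width = max(1, len(str(n_slides - 1)))
    return [f"{prefix}{i:0{width}d}.{ext}" for i in range(n_slides)]


names = slide_filenames(12)
print(names[:3], names[-1])
```

With this naming, the names returned by the helper could replace the `fname` built inside the saving loop; the files are then processed in slide order without any extra sorting logic.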

Detecting and recognizing text in the images

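Soares uses the text detector from the ocr.pytorch library to detect and recognize the text in the PNG images. Follow the library's instructions to download the models and save them in the checkpoints folder.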

ocr.pytorch repository: https://github.com/courao/ocr.pytorch

The code is as follows.

# Adapted from this source: https://github.com/courao/ocr.pytorch
# The two lines below are Jupyter notebook magics.
%load_ext autoreload
%autoreload 2

import os
from ocr import ocr
import time
import shutil
import numpy as np
import pathlib
from PIL import Image
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

def single_pic_proc(image_file):
    # Load the image as an RGB array and run detection plus recognition.
    image = np.array(Image.open(image_file).convert('RGB'))
    result, image_framed = ocr(image)
    return result, image_framed

image_files = glob('./input_images/*.*')
result_dir = './output_images_with_boxes/'

# If the output folder exists, remove it and create it again.
if os.path.exists(result_dir):
    shutil.rmtree(result_dir)
os.mkdir(result_dir)

for image_file in sorted(image_files):
    result, image_framed = single_pic_proc(image_file)  # detect and recognize the text
    filename = pathlib.Path(image_file).name
    output_file = os.path.join(result_dir, filename)
    txt_file = os.path.join(result_dir, pathlib.Path(image_file).stem + '.txt')
    # Save the image with the detected boxes drawn on it.
    Image.fromarray(image_framed).save(output_file)
    # Write one recognized line of text per detected box.
    with open(txt_file, 'w') as txt_f:
        for key in result:
            txt_f.write(result[key][1] + '\n')

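The code sets the input and output folders, loops over all input images (the converted PDF slides), runs the detection and recognition models from the ocr module through the single_pic_proc() function, and finally saves the output to the output folder.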

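Detection builds on a PyTorch implementation of the CTPN model and recognition on a PyTorch implementation of the CRNN model; both are part of the ocr module.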
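
The loop above reads the recognized text through `result[key][1]`. As a minimal sketch of that access pattern, assuming `result` is a dictionary whose values carry the text as their second element (the first element is assumed here to be box coordinates), a hypothetical helper `result_to_lines` could collect the text in key order. The sample dictionary is made up for illustration and is not real output from ocr.pytorch.

```python
def result_to_lines(result):
    """Collect the recognized strings from an OCR result dictionary.

    Assumes each value is a sequence whose second element is the text,
    mirroring the `result[key][1]` access in the loop above.
    """
    lines = []
    for key in sorted(result):
        text = result[key][1].strip()
        if text:  # skip detections that produced no text
            lines.append(text)
    return lines


# Made-up stand-in for the output of ocr(image); box values are placeholders.
fake_result = {
    1: ([10, 50, 200, 80], "Reinforcement Learning"),
    0: ([10, 10, 200, 40], "Lecture 1"),
    2: ([10, 90, 200, 120], "   "),
}
print(result_to_lines(fake_result))
```

Separating this step from the file writing makes it easy to filter or post-process lines before they reach the notes.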

Sample output

The code is as follows.

import pathlib
import cv2 as cv

output_dir = pathlib.Path('./output_images_with_boxes')

# Alternative: pick a random output image.
# image = cv.imread(str(np.random.choice(list(output_dir.iterdir()), 1)[0]))
image = cv.imread(f'{output_dir}/image7.png')

# cv.resize expects (width, height); here the original size is kept.
size_reshaped = (int(image.shape[1]), int(image.shape[0]))
image = cv.resize(image, size_reshaped)

# Show the image until a key is pressed, then close the window.
cv.imshow('image', image)
cv.waitKey(0)
cv.destroyAllWindows()

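The comparison figure (original PDF slide on the left, transcribed output text on the right) shows that the transcription is very accurate.

The recognized text can be printed as follows.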

filename = f'{output_dir}/image7.txt'

# Print the transcription line by line, without trailing newlines.
with open(filename, 'r') as text:
    for line in text.readlines():
        print(line.strip('\n'))

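With this approach you end up with a very powerful tool for transcribing all kinds of documents, from detecting and recognizing handwritten notes to detecting and recognizing arbitrary text in photos. Processing text with your own OCR tool is much better than relying on external software to convert documents.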
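
Since the goal is to work with the slide content in a markdown file, the per-slide text files can be merged into one notes file. The following is a minimal sketch, assuming the imageN.txt naming used above; the helper `build_markdown_notes` and the "Slide N" heading layout are illustrative choices, not part of the original project.

```python
import pathlib
import re


def build_markdown_notes(txt_dir, out_file):
    """Merge per-slide .txt transcriptions into a single markdown file."""
    txt_dir = pathlib.Path(txt_dir)

    def slide_index(path):
        # Pull the number out of names such as "image7.txt".
        match = re.search(r"(\d+)", path.stem)
        return int(match.group(1)) if match else -1

    sections = []
    # Sort numerically so that slide 10 comes after slide 2.
    for path in sorted(txt_dir.glob("*.txt"), key=slide_index):
        lines = [ln.strip() for ln in path.read_text(encoding="utf-8").splitlines()]
        lines = [ln for ln in lines if ln]  # drop empty lines
        body = "\n".join(f"- {ln}" for ln in lines)
        sections.append(f"## Slide {slide_index(path)}\n\n{body}\n")

    pathlib.Path(out_file).write_text("\n".join(sections), encoding="utf-8")
    return out_file
```

A call such as `build_markdown_notes('./output_images_with_boxes', 'notes.md')` would then produce one markdown section per slide, with each recognized line as a bullet that can be edited by hand.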

Original article: https://towardsdatascience.com/faster-notes-with-python-and-deep-learning-b713bbb3c186