Converting a PDF or Image to Text using Tesseract OCR on Ubuntu

A friend asked me to convert a scanned document (PDF) to text. This is how I did it.

First I installed tesseract-ocr with sudo apt install tesseract-ocr. It seems to be one of the most accurate tools available on Ubuntu for converting an image to text.

Tesseract doesn’t accept PDF input, so I first needed to convert the PDF to an image. I used the convert command for this, which is part of ImageMagick and was already on my system. When I tried convert file.pdf image.jpg I got an error about authorization. I worked around this by changing the ImageMagick policy.
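For reference, here is a rough sketch of what that policy change can look like. It rests on assumptions that are not spelled out above: that the policy file lives at /etc/ImageMagick-6/policy.xml (typical for Ubuntu's ImageMagick 6 packages, but check your system) and that it contains a line with pattern="PDF" and rights="none". The helper name allow_pdf is made up for this example. It only prints the modified policy, so nothing on the system changes until you copy the result into place yourself.

```shell
# Assumed location of the policy file on Ubuntu with ImageMagick 6; verify on your system.
POLICY_FILE="${POLICY_FILE:-/etc/ImageMagick-6/policy.xml}"

# Hypothetical helper: print the policy file given as $1 with the PDF rule
# changed from rights="none" to rights="read|write".
# Only lines that mention pattern="PDF" are touched.
allow_pdf() {
  sed '/pattern="PDF"/ s/rights="none"/rights="read|write"/' "$1"
}

# Review the output first, then install it by hand, for example:
#   allow_pdf "$POLICY_FILE" > /tmp/policy.xml
#   sudo cp "$POLICY_FILE" "$POLICY_FILE.bak"
#   sudo cp /tmp/policy.xml "$POLICY_FILE"
```

Keeping a backup of the original file makes it easy to restore the stricter default once the conversion is done.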

I used convert -quality 100 -density 300 file.pdf image.jpg for a good quality conversion.
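If the PDF has more than one page, convert writes one image per page. Here is a small sketch that wraps the same command and numbers the output files. The function name pdf_to_images and the default prefix "page" are my own choices for illustration, not something ImageMagick provides.

```shell
# Hypothetical wrapper: rasterize a PDF into numbered JPEG files.
# Usage: pdf_to_images file.pdf [output-prefix] [dpi]
pdf_to_images() {
  pdf="$1"
  prefix="${2:-page}"
  dpi="${3:-300}"
  # -density comes before the input file so the PDF is rendered at that resolution.
  # For a multi-page PDF, %03d in the output name numbers the pages:
  # page-000.jpg, page-001.jpg, ...
  convert -density "$dpi" -quality 100 "$pdf" "${prefix}-%03d.jpg"
}

# Example call (commented out so that sourcing this file does nothing):
#   pdf_to_images file.pdf image 300
```

A higher dpi gives Tesseract more pixels per character but produces larger files; 300 is the value used above.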

After this I used tesseract image.jpg name-of-output -l spa to convert the image to text. I added the -l spa part because the language in the image was Spanish; telling Tesseract the language leads to better results. The Spanish language data isn’t installed by default. You can install it using sudo apt install tesseract-ocr-spa.
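With several page images, the OCR step can be put in a loop. This is a minimal sketch, assuming the images already exist. The function name ocr_images is made up for this example. It uses tesseract --list-langs to check that the language data is installed before processing anything.

```shell
# Hypothetical helper: OCR each image into a text file with the same base name.
# Usage: ocr_images lang image...
ocr_images() {
  lang="$1"
  shift
  # --list-langs prints the installed language codes, one per line.
  # Some versions print to stderr, so both streams are searched.
  if ! tesseract --list-langs 2>&1 | grep -qx "$lang"; then
    echo "tesseract language '$lang' is not installed" >&2
    return 1
  fi
  for img in "$@"; do
    # tesseract appends .txt to the output base name itself,
    # so page-000.jpg becomes page-000.txt.
    tesseract "$img" "${img%.*}" -l "$lang"
  done
}

# Example call (commented out so that sourcing this file does nothing):
#   ocr_images spa image-*.jpg
```

The resulting text files can then be joined in page order with cat if a single document is wanted.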
