A friend asked me to convert a scanned document (PDF) to text. This is how I did it.
First I ran sudo apt install tesseract-ocr to install Tesseract. This software seems to be one of the most accurate solutions available on Ubuntu for converting an image to text.
Tesseract doesn’t accept PDF input, so I first needed to convert the PDF to an image. I used the convert command from ImageMagick, which was already installed on my system. When I tried convert file.pdf image.jpg I got an authorization error. I worked around this by changing the ImageMagick policy.
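The exact policy change isn't recorded above. As a minimal sketch, assuming the error came from the rule that disables the PDF coder in /etc/ImageMagick-6/policy.xml (the path and the rule's exact wording are assumptions and can differ between ImageMagick versions), the edit could look like this. allow_pdf is a hypothetical helper name, not part of ImageMagick.

```shell
# Sketch only: switch the PDF coder rule from rights="none" to read|write.
# Assumption: the policy file contains a line like
#   <policy domain="coder" rights="none" pattern="PDF" />
# allow_pdf is a hypothetical helper; pass a different path if yours differs.
allow_pdf() {
    policy_file=${1:-/etc/ImageMagick-6/policy.xml}
    # GNU sed edits the file in place; only the PDF rule is touched.
    sed -i 's/rights="none" pattern="PDF"/rights="read|write" pattern="PDF"/' "$policy_file"
}
```

The system policy file is owned by root, so the function has to run from a root shell. Passing it a copy of the file first is a safe way to check what it changes.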
For a good-quality conversion I used convert -quality 100 -density 300 file.pdf image.jpg, which renders the PDF at 300 DPI and keeps JPEG compression to a minimum.
After this I ran tesseract image.jpg name-of-output -l spa to convert the image to text, which Tesseract writes to name-of-output.txt. I added -l spa because the language in the image was Spanish, and telling Tesseract the language leads to better results. Spanish isn’t installed by default; you can install it with sudo apt install tesseract-ocr-spa.
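The steps above can be strung together in a small script. This is a rough sketch rather than the exact procedure I ran: pdf_to_text is a hypothetical helper name, the page-%03d.jpg naming is my own choice so that a PDF with several pages produces one numbered image per page, and the language defaults to spa only because that is what this document needed. It assumes convert and tesseract are installed and that the ImageMagick policy already allows reading PDFs.

```shell
# Sketch: OCR a scanned PDF into one text file.
# Usage: pdf_to_text INPUT.pdf OUTPUT.txt [LANG]
# pdf_to_text is a hypothetical helper, not part of Tesseract or ImageMagick.
pdf_to_text() (
    pdf=$1
    out=$2
    lang=${3:-spa}   # assumption: default to Spanish, as in this post

    tmp=$(mktemp -d) || exit 1

    # Render every page at 300 DPI into numbered JPEG files.
    convert -quality 100 -density 300 "$pdf" "$tmp/page-%03d.jpg" \
        || { rm -rf "$tmp"; exit 1; }

    : > "$out"   # start with an empty output file
    for img in "$tmp"/page-*.jpg; do
        # tesseract appends .txt to the output base name it is given.
        tesseract "$img" "${img%.jpg}" -l "$lang" \
            || { rm -rf "$tmp"; exit 1; }
        cat "${img%.jpg}.txt" >> "$out"
    done

    rm -rf "$tmp"
)
```

After sourcing the file, a call such as pdf_to_text file.pdf name-of-output.txt spa would collect the text of all pages in name-of-output.txt. The function body runs in a subshell, so its variables don't leak into the calling shell.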