A friend asked me to convert a scanned document (PDF) to text. This is how I did it.
First I installed tesseract-ocr:
sudo apt install tesseract-ocr. This software seems to be one of the most accurate OCR tools available on Ubuntu for converting an image to text.
Tesseract doesn’t accept PDF input, so I needed to convert the PDF to an image first. I used ImageMagick's
convert command for this, which was already installed. When I tried
convert file.pdf image.jpg I got an authorization error. I worked around this by changing the ImageMagick policy so that it allows PDF files.
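The exact policy change isn't spelled out here, so the following is only a sketch of one common way to do it. It assumes the policy file contains a line with rights="none" and pattern="PDF", and that the file lives somewhere like /etc/ImageMagick-6/policy.xml (the path varies by ImageMagick version, so check yours). The function name allow_pdf_policy is made up for illustration, and sed -i is the GNU form found on Ubuntu.

```shell
# Hypothetical helper: let ImageMagick read and write PDF files.
# Usage: allow_pdf_policy /path/to/policy.xml
# Assumption: the file has a line with rights="none" and pattern="PDF".
allow_pdf_policy() {
  # Keep a backup next to the original before editing it.
  cp "$1" "$1.bak" || return 1
  # Change rights="none" to rights="read|write" on the PDF line only.
  sed -i '/pattern="PDF"/ s/rights="none"/rights="read|write"/' "$1"
}
```

The system policy file is owned by root, so the function would have to be run from a root shell. The backup copy makes it easy to restore the original policy afterwards.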
I then ran
convert -quality 100 -density 300 file.pdf image.jpg for a good-quality conversion.
After this I used
tesseract image.jpg name-of-output -l spa to convert the image to text. Tesseract writes the result to name-of-output.txt. I added the
-l spa part because the text in the image was Spanish, and specifying the language leads to better results. The Spanish language data isn’t installed by default. You can install it using
sudo apt install tesseract-ocr-spa.
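The two commands can be wrapped in a small shell function so the whole conversion is one call. This is a minimal sketch rather than a polished tool. The names run and ocr_pdf and the DRY_RUN switch are made up for illustration, and it assumes a single-page PDF, as in the steps above.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the two commands from this post.
# Usage: ocr_pdf INPUT.pdf OUTPUT_BASE [LANG]
ocr_pdf() {
  # Render the PDF at 300 DPI so the text is sharp enough for OCR.
  run convert -quality 100 -density 300 "$1" image.jpg || return 1
  # Tesseract appends .txt to the output base name itself.
  # The language defaults to Spanish (spa), as in this post.
  run tesseract image.jpg "$2" -l "${3:-spa}"
}
```

Calling ocr_pdf file.pdf name-of-output spa runs both steps. Setting DRY_RUN=1 first only prints the commands, which is handy for checking the arguments before touching any files.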