If you want to OCR a document image, modern versions of Tesseract can work well....

amelius · on July 9, 2022

Thanks for posting. I immediately tried the browser link, and although the uploaded image has quite a decent quality, I'm not getting the results I'm looking for. Perhaps my expectations are too high?

This is the image I tried: https://imgur.com/a/tKId2al

robertknight · on July 9, 2022

Thanks for this test case. When I drop that image in, I see that the individual words are recognized correctly, but from about midway through they are not displayed in the correct order in the text box at the bottom. If the image is rotated so that the text baselines are horizontal (a rotation of about 1.5 degrees), the words are displayed in the correct order. So it looks like smarter methods or defaults are needed for the layout analysis.

I think with modern methods it ought to be relatively easy to teach a system to predict the amount of rotation needed to straighten the image, or make the layout analysis tolerate minor rotations of the input better. Needs someone to actually implement it though!
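To make the rotation idea concrete, here is a minimal sketch of one classical way to estimate skew. It is not a learned model and not what the demo does. It assumes you already have the center points of the word boxes on a single text line, fits a least-squares line through them, and reports the slope as an angle. The function names `estimate_skew_degrees` and `deskew` are hypothetical, made up for this illustration.

```python
import math


def estimate_skew_degrees(centers):
    """Estimate the baseline angle from word-box centers on one text line.

    `centers` is a list of (x, y) tuples. The angle is the arctangent of the
    least-squares slope, so its sign follows whatever y-axis direction the
    input coordinates use.
    """
    n = len(centers)
    if n < 2:
        raise ValueError("need at least two word centers")
    mean_x = sum(x for x, _ in centers) / n
    mean_y = sum(y for _, y in centers) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in centers)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in centers)
    if sxx == 0:
        raise ValueError("word centers share the same x coordinate")
    return math.degrees(math.atan2(sxy, sxx))


def deskew(centers, angle_degrees):
    """Rotate points about the origin by -angle so the baseline is horizontal."""
    a = math.radians(-angle_degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in centers]


if __name__ == "__main__":
    # Synthetic line of words tilted by 1.5 degrees (assumed test data).
    slope = math.tan(math.radians(1.5))
    line = [(x, 100 + x * slope) for x in range(0, 1000, 100)]
    angle = estimate_skew_degrees(line)
    print(round(angle, 3))
    print({round(y, 6) for _, y in deskew(line, angle)})
```

Once the coordinates are straightened, all words on a line share nearly the same y value, so grouping by y and then sorting by x should give a stable reading order. A real page would need an estimate that is robust across many lines, which this sketch does not attempt.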

gumboshoes · on July 9, 2022

This command:

ocrmypdf --deskew --clean-final --output-type pdf --tesseract-timeout 600 --force-ocr -l eng --jbig2-lossy --optimize 3 /Users/username/Desktop/C1jn2Kz.png.pdf /Users/username/Desktop/C1jn2Kz.out.pdf

generates a PDF with this text for me:

Bad Ul is causing people to get scammed 2022-07-08 If you've asked anybody who's tried to sell anything on Facebook Marketplace, Offerup or Craigslist, | can guarantee you that every one of them have encountered somebody trying to scam them. I've encountered quite a few but I'll explain how this particular scam works and how bad UI contributes to scammers being successful.

I see one error — "I" recognized as a pipe.
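If that pipe substitution shows up often, a small post-processing pass over the extracted text can patch it. The sketch below is an assumption on my part, not a feature of ocrmypdf or Tesseract: it replaces a "|" only when it stands alone between whitespace, and the function name `fix_lone_pipes` is hypothetical.

```python
import re

# A "|" with whitespace (or the string edge) on both sides is treated as a
# misread capital "I". Pipes embedded in other tokens are left untouched.
_LONE_PIPE = re.compile(r"(?<!\S)\|(?!\S)")


def fix_lone_pipes(text):
    """Replace standalone pipe characters with a capital I."""
    return _LONE_PIPE.sub("I", text)


if __name__ == "__main__":
    sample = "Offerup or Craigslist, | can guarantee you"
    print(fix_lone_pipes(sample))
```

This would also rewrite legitimate standalone pipes, for example in text that contains table separators, so it only makes sense for plain prose.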

On edit: here's the output PDF:

https://www.dropbox.com/s/8l9otcu9ohyoz9w/C1jn2Kz.out.pdf?dl...

rbanffy · on July 9, 2022

I opened the page, didn’t recognise the image you posted as the actual thing, and immediately below it was a video of ice cream bars being made. I immediately wondered how you could expect the OCR to figure that out and read it as “vanilla”. :-)