
If you want to OCR a document image, modern versions of Tesseract can work well. If you last used it a few years ago, the recognition has improved since due to a new text recognition algorithm that uses modern (deep learning) techniques. Browser demo using a modern version: https://robertknight.github.io/tesseract-wasm/.

OCR processing typically consists of two major steps: detecting/locating words or lines of text on the page, and recognizing those lines of text.

Tesseract's text recognition uses modern methods, but the text detection phase is still based on classical methods involving a lot of heuristics, and you may need to experiment with various configuration variables to get the best results. As a result it can fail to detect text if you present it with something other than a reasonably clean document image.
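As a rough sketch of what that experimentation can look like with the command-line tool: the page segmentation mode (`--psm`) and `-c name=value` overrides are the usual knobs. The helper name `build_tesseract_command` is made up for illustration, and which settings actually help depends on the document.

```python
def build_tesseract_command(image_path, psm=3, lang="eng", variables=None):
    """Build an argv list for the tesseract CLI (hypothetical helper; does not run it).

    psm 3 is the default fully automatic page segmentation; 6 assumes a
    single uniform block of text; 11 looks for sparse text.
    """
    # "stdout" as the output base makes tesseract print the text instead of writing a file.
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    # Each configuration variable is passed as its own "-c name=value" pair.
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Example: treat the page as one block of text and keep inter-word spacing.
cmd = build_tesseract_command(
    "page.png", psm=6, variables={"preserve_interword_spaces": "1"}
)
# To run it: subprocess.run(cmd, capture_output=True, text=True).stdout
```

Trying a few `--psm` values on the same image and comparing the output is usually the quickest way to see whether detection is the part that is failing.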

Doctr (https://github.com/mindee/doctr) is a new package that uses modern methods for both text detection and recognition. It is still young, though, and I expect it will take more time and effort to mature.



Thanks for posting. I immediately tried the browser link, and although the uploaded image has quite a decent quality, I'm not getting the results I'm looking for. Perhaps my expectations are too high?

This is the image I tried: https://imgur.com/a/tKId2al


Thanks for this test case. When I drop that image in, I see that the individual words are recognized correctly, but from about midway through they are not displayed in the correct order in the text box at the bottom. If the image is rotated so that the text baselines are horizontal (a rotation of about 1.5 degrees), the words are displayed in the correct order. So it looks like smarter methods or defaults are needed for the layout analysis.

I think with modern methods it ought to be relatively easy to teach a system to predict the amount of rotation needed to straighten the image, or make the layout analysis tolerate minor rotations of the input better. Needs someone to actually implement it though!
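For comparison, here is a minimal sketch of the classical (non-learned) way to estimate that rotation, a projection-profile search. This is not what Tesseract or the demo does; it only illustrates the idea. It assumes the page has already been binarized into a list of (x, y) ink-pixel coordinates, and all function names are made up for the example.

```python
import math
from collections import Counter


def projection_score(points, angle_deg):
    """Score how sharply ink pixels cluster into rows after rotating by angle_deg."""
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Only the rotated y coordinate matters for a horizontal projection profile.
    rows = Counter(round(y * cos_t - x * sin_t) for x, y in points)
    # The sum of squares is largest when pixels pile up in few rows (flat baselines).
    return sum(count * count for count in rows.values())


def estimate_skew(points, max_angle=5.0, step=0.25):
    """Return the candidate rotation (degrees) that gives the sharpest row profile."""
    n = round(max_angle / step)
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda angle: projection_score(points, angle))


def synthetic_lines(skew_deg, n_lines=5, width=400, spacing=20):
    """Hypothetical test input: thin text baselines tilted by skew_deg."""
    slope = math.tan(math.radians(skew_deg))
    return [
        (x, spacing * k + x * slope)
        for k in range(1, n_lines + 1)
        for x in range(width)
    ]


# The search recovers the tilt that was applied to the synthetic baselines.
angle = estimate_skew(synthetic_lines(1.5))
```

On real scans the profile is noisier than on these synthetic baselines, so the step size and angle range would need tuning.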


This command:

ocrmypdf --deskew --clean-final --output-type pdf --tesseract-timeout 600 --force-ocr -l eng --jbig2-lossy --optimize 3 /Users/username/Desktop/C1jn2Kz.png.pdf /Users/username/Desktop/C1jn2Kz.out.pdf

generates a PDF with this text for me:

Bad Ul is causing people to get scammed 2022-07-08 If you've asked anybody who's tried to sell anything on Facebook Marketplace, Offerup or Craigslist, | can guarantee you that every one of them have encountered somebody trying to scam them. I've encountered quite a few but I'll explain how this particular scam works and how bad UI contributes to scammers being successful.

I see one error — "I" recognized as a pipe.
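That kind of error is easy to patch up afterwards. A small sketch, assuming a lone pipe is never intended in the text (the function name is made up):

```python
import re


def fix_lone_pipes(text):
    """Replace a '|' that stands alone as a word with 'I' (heuristic)."""
    # Match a pipe with whitespace or a string boundary on both sides.
    return re.sub(r"(?<!\S)\|(?!\S)", "I", text)


fixed = fix_lone_pipes("Craigslist, | can guarantee you")
```

This would also rewrite legitimate standalone pipes (table separators, for instance), so it only suits running prose.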

On edit: here's the output PDF:

https://www.dropbox.com/s/8l9otcu9ohyoz9w/C1jn2Kz.out.pdf?dl...


I opened the page, didn’t recognise the image you posted as the actual thing, and immediately below it was a video of ice cream bars being made. I immediately wondered how you could expect the OCR to figure that out and read it as “vanilla”. :-)



