Hacker News
Tesseract.js – A Javascript port of the Tesseract OCR engine (projectnaptha.com)
132 points by kiyanwang on Aug 8, 2021 | 37 comments


I've used Tesseract.js to recognise the https://** links from the camera input and to make them clickable.

The first issue I encountered was text recognition performance. Depending on the camera input (whether or not the image contained something that looked like text), recognition took 2-20+ seconds per 640x640px image on an iPhone X. Not so fast, as you can see, but the recognition was pretty accurate.

The performance, as expected, improves as the image gets smaller and contains less text.

Since I didn't want to recognise the whole text, only the links, I used a TensorFlow Object Detection model to quickly find the areas containing the text http://**. Then, instead of recognising the whole image, I only needed to run recognition on those smaller parts of it. This improved performance: instead of a variable 2-20 seconds per frame I got a more stable 0.5-1 seconds. Still not good, but several times faster.

I've described the challenges in more detail here: https://trekhleb.dev/blog/2020/printed-links-detection/. To sum up, I got good recognition quality with questionable performance from Tesseract.js.
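
The "detect first, then OCR only the small regions" idea above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked article: the `Box` type, the `toCropRegion` name and the 8px default padding are all assumptions. It only turns a detector's bounding box into a padded crop rectangle that stays inside the frame; how that rectangle is then handed to the OCR engine is left out.

```typescript
// Hypothetical types and names for illustration; not taken from the linked article.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Expand a detected box by `padding` pixels and clamp it to the image bounds,
// so the crop handed to OCR never reaches outside the frame.
function toCropRegion(
  box: Box,
  imageWidth: number,
  imageHeight: number,
  padding: number = 8
): Box {
  const left = Math.max(0, Math.floor(box.x - padding));
  const top = Math.max(0, Math.floor(box.y - padding));
  const right = Math.min(imageWidth, Math.ceil(box.x + box.width + padding));
  const bottom = Math.min(imageHeight, Math.ceil(box.y + box.height + padding));
  return {
    x: left,
    y: top,
    width: Math.max(0, right - left),
    height: Math.max(0, bottom - top),
  };
}
```

A little padding tends to help because detectors often clip the first or last character of a word; the right amount is something to tune per model.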


Tesseract sucked for me. Had a simple use case where I was trying to read numbers (in a computer font) from .png files and at completely predictable locations in the image -- and Tesseract was getting it horribly wrong a huge percent of the time. Went with AWS Rekognition and results were instantly 1000x better.


Post processing is absolutely essential with tesseract. Not to self promote but I discussed this at some length in this blog post, if you're interested: https://kn100.me/taking-back-data-from-eufy/
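
As one example of what post-processing can look like for a digits-only field like the one described above, here is a minimal sketch. It is not the approach from the linked blog post; the confusion table and the `normalizeNumericField` name are assumptions chosen for illustration, and a real table should come from the errors you actually observe.

```typescript
// Assumed confusion table for a digits-only field; extend it from your own error logs.
const DIGIT_LOOKALIKES: Record<string, string> = {
  O: "0",
  o: "0",
  D: "0",
  I: "1",
  l: "1",
  "|": "1",
  Z: "2",
  S: "5",
  B: "8",
};

// Map look-alike characters to digits, drop inner whitespace, then reject
// anything that still is not a plain integer or decimal number.
function normalizeNumericField(raw: string): string | null {
  const mapped = raw
    .trim()
    .split("")
    .map((ch) => DIGIT_LOOKALIKES[ch] ?? ch)
    .join("")
    .replace(/\s+/g, "");
  return /^\d+(\.\d+)?$/.test(mapped) ? mapped : null;
}
```

Returning `null` instead of a best guess makes it easy to route doubtful reads to a retry or to manual review.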


We really need better open source {OCR, TTS, dictation, ...}. All of the common FOSS tools for these tasks are so horribly behind the state of the art.

The sad thing is most of the state of the art models and algorithms are open research, they just are usually not written by software engineers and need to be rewritten to be deployable. Usually you just get some shell script like "run_eval.sh" that generates the figures in the paper through a bunch of spaghetti code, and most of the time it will depend on a specific old version of Tensorflow, that probably isn't available for your CUDA version, and probably won't compile on your system without hours of Googling.


Had the _exact_ same situation! Was just trying to OCR values of screenshots, which were always of the same screen (app screenshots taken by users) and it was so bad. Ended up just using AWS Rekognition and it worked really well.


I think it's mostly for OCR'ing high-resolution scans of printed media. I scanned and OCR'd a several hundred page printed book (my grandfather's memoirs) with great results. The text needed very little processing. But I had to manually transcribe all of the image captions, because they were scans of photocopies of photocopies of typewriter labels stuck to photos by hand, and thus very poor quality, and Tesseract produced complete gibberish.


Two years ago I spent 2 months building a passport data extractor for KYC (know your customer) purposes. Unfortunately I never got to the point where the extracted data was really useful. I just tried this JS version (I'm sure the native one is the same) and, without changing anything (apart from the training dataset), I got much better results. Exciting.


For passports, I would use the MRZ instead. All of the passport data is encoded there and it's machine readable.

http://writecodeeveryday.github.io/projects/passportjs/
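
One practical benefit of the MRZ is that its fields carry check digits, so an OCR misread can often be caught. Below is a minimal sketch of the check-digit rule from ICAO Doc 9303 (weights 7, 3, 1 repeating; digits count as themselves, A-Z as 10-35, the filler `<` as 0; the sum is taken modulo 10). The function names are my own, and this is not code from the linked project.

```typescript
// Value of a single MRZ character under the ICAO 9303 check-digit scheme.
function mrzCharValue(ch: string): number {
  if (ch === "<") return 0; // filler character
  if (ch >= "0" && ch <= "9") return ch.charCodeAt(0) - 48; // '0' is code 48
  if (ch >= "A" && ch <= "Z") return ch.charCodeAt(0) - 55; // 'A' (65) maps to 10
  throw new Error(`Invalid MRZ character: ${ch}`);
}

// Weighted sum with repeating weights 7, 3, 1, reduced modulo 10.
function mrzCheckDigit(field: string): number {
  let sum = 0;
  for (let i = 0; i < field.length; i++) {
    const weight = i % 3 === 0 ? 7 : i % 3 === 1 ? 3 : 1;
    sum += mrzCharValue(field.charAt(i)) * weight;
  }
  return sum % 10;
}

// Document number from the ICAO specimen passport; its printed check digit is 6.
const specimenCheck = mrzCheckDigit("L898902C3");
```

If the computed digit does not match the one read from the MRZ, the field was probably misrecognised and is worth re-scanning.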


The MRZ is just an encoded string, though; by itself it tells you nothing. For proper validation you need to get the human-readable values as well.


Being disappointed by classic open source OCR, I started an attempt to package neural net based approaches (https://github.com/gtsoukas/scene_text, don't use it, it is crap), then I found out that Google's ML Kit (https://developers.google.com/ml-kit/vision/text-recognition) gives quite good results, as long as it is for Latin-based character sets.


I've used this library in the past for prototyping a project to extract Chinese subtitles from YouTube videos in a Chrome extension. It worked pretty well. The only problem is the library couldn't really handle realtime video. Can't really fault it for that though, since I was sending it every frame. The throughput was good but latency kept increasing, probably because I was giving it too much data.

There's a mode where you can increase the number of worker threads. Tesseract is also designed for text documents and the preprocessing filter I made to convert the images to look more like a text document was pretty naive.

I'm taking an online computer vision class next semester and hope to pick the project back up after learning a bit more.
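
The growing latency described above is what usually happens when every frame is queued: the backlog only gets longer. A common remedy is to keep at most one frame in flight and remember only the newest waiting frame, dropping the rest. Here is a minimal sketch of that bookkeeping; the `LatestFrameSlot` class is hypothetical and not part of Tesseract.js, and the actual recognition call is left to the caller.

```typescript
// Hypothetical helper: at most one frame is being processed, and only the
// newest waiting frame is kept; older waiting frames are dropped.
class LatestFrameSlot<T> {
  private busy = false;
  private pending: T | undefined = undefined;
  public dropped = 0;

  // Offer a new frame. Returns the frame to process right now,
  // or undefined if recognition is already running.
  offer(frame: T): T | undefined {
    if (!this.busy) {
      this.busy = true;
      return frame;
    }
    if (this.pending !== undefined) {
      this.dropped++; // the older waiting frame is discarded
    }
    this.pending = frame;
    return undefined;
  }

  // Call when recognition of the current frame finishes. Returns the
  // newest waiting frame to process next, or undefined if none is waiting.
  done(): T | undefined {
    const next = this.pending;
    this.pending = undefined;
    this.busy = next !== undefined;
    return next;
  }
}
```

With this in place latency stays bounded by roughly one or two recognition times, at the cost of skipping frames, which is usually acceptable for subtitles that stay on screen for a while.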


The total size of the download seems to be 3-4MB (based on https://github.com/naptha/tesseract.js/blob/master/docs/loca...), which is actually less than I expected.


It could be even smaller, it seems, as the wasm file is base64-encoded (so that it all fits in a single file - which is convenient, but larger).
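
For a rough sense of the base64 overhead: standard padded base64 emits 4 characters for every 3 input bytes, so the embedded file is about a third larger before any transfer compression. A minimal sketch follows; the 3,000,000-byte figure is only an illustration, not the measured size of the wasm file.

```typescript
// Length in characters of padded base64 for `byteCount` raw bytes:
// every group of up to 3 bytes becomes 4 output characters.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Illustrative only: a 3,000,000-byte binary becomes 4,000,000 base64 characters.
const illustrativeEncodedSize = base64Length(3000000);
```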


Only English language support is included. Additional downloads are required for other languages.


Tesseract is decent for scanned imagery, whether in actual images or in PDFs but definitely not for text in the wild.


what would you suggest to use instead?


I've reached for tesseract a few times throughout the last 5 or so years to see if it's ever improved for non-trivial use cases and... no, it hasn't.

Sad as it is to say, it's just not up to snuff for any application I've tried it on.


I wanted to use Tesseract for a project but found it to be a bit too slow for my needs. Does it have options to speed up its recognition, or is there another OCR project out there that's made to be faster?


Nice stuff!

I found an error in the Chinese demo, with the example you provided (the 4th character wasn't the same). I know no OCR is perfect, but IMHO at least your own demo should be free of errors.


> at least your own demo should be free of errors

:) That would be a dishonest demo.

You try to show how well it works, not that it works perfectly well (which is false). Edit: especially since we know that OCR is hardly perfect - we expect errors to be minimized, not absent, and the first interest is to see where the engine fails.


There's one in the English demo too: "hail!" -> "haill". They're both pretty bad images though. In practice I've found (command line) Tesseract very accurate on 300dpi scans of printed documents, with colour/greyscale, not binary.


Does anyone have any experience with the JS version of Tesseract? Is it accurate in general? And is it English only or does it work with any language?


What is the best way, paid or otherwise, to attempt OCR on a pdf of old typewritten text?


ABBYY FineReader has always come out ahead for me in terms of OCR accuracy.


Upload it to Google Drive and it will do it automatically.


I tried that but it wouldn’t load for me


> Tesseract.js wraps an emscripten port of the Tesseract OCR Engine

Calling this “pure JavaScript” seems misleading


Yes, it's kind of weird, since there's no benefit to claiming false things like "Tesseract.js is a pure Javascript port [...]". Say it's WASM, since people associate that with speed and newness (and heavyweight dependencies, but there's no hiding that).


Skimming the download, this does indeed use wasm, but it's also possible to build to pure JS with emscripten (in WASM=0 mode, wasm2js compiles the wasm to JS). Perhaps that's what they used to do and the docs have not been updated or something like that.


Still not a "port" though.


Congrats! Why would one create such a project with JS given all of the languages available to them?


It's mostly not JavaScript, since it uses the emscripten port of the Tesseract OCR Engine, and if you want to do things in the browser, JavaScript has to be involved.


Ease of deployment. Deploying a client-side JavaScript application remains far, far easier and less expensive than anything that runs server-side (or native compiled) code.

Also privacy: running OCR in someone's browser rather than sending the images back to the server keeps them fully in control of the data they are working with.


Client-side web apps. With today's smartphones, it does make sense to not do everything solely on the server side.

Theoretically, cross-platform support would be another possibility. But one could argue native C code could be bundled as well, albeit with separate integration being needed. (Android and iOS do support such extensions).


To reduce server side burden/costs


To use it on the web.


‘Drop an image’? Mobile devices exist…



