Saturday, February 11, 2006

gocr, optical recognition and oKular

So I sat down this evening and investigated the possibility of using gocr as a fallback for getting the text layer of a document in oKular (for example, when the backend does not support generating text layer information). To be clear - gocr is not ready, to say the least. Personally I'd even say the result of trying to OCR a page is so crappy that it is not even worth installing the gocr engine (it seems the total rewrite in 0.40 did not help much). And I am talking about plain black ASCII text on a white page, with no other elements. Gocr goes all the way down here - errors in 98% of recognized characters, randomly added spaces, etc. For example, "content" comes out as "C unrir" in gocr, which sounds like drunken elvish to me.

So at least for now it is not worth it: no OCR in oKular. Open source does not have good, working OCR software yet.

UPDATE: I also tried ocrad, which is advertised as good for Latin-1-only bitmaps. 3 out of 2400 characters in the sample were correctly matched; "content" here comes out as "C_ll_ICll_" - sounds like a totally drunken elf with heroin injections.
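
For anyone who wants to repeat this kind of comparison, here is a rough sketch of how a character-level score could be computed. It is not the method I used above (I just eyeballed the output), and the function name `char_accuracy` is made up for illustration. It assumes you have the expected text and the OCR output as plain strings, and it uses Python's standard `difflib` to count characters that match in order.

```python
from difflib import SequenceMatcher


def char_accuracy(expected: str, recognized: str) -> float:
    """Fraction of expected characters found, in order, in the OCR output.

    Hypothetical helper for illustration only; the comparison is
    case-sensitive, so "C" does not match "c".
    """
    if not expected:
        # Nothing to recognize: perfect only if the engine also produced nothing.
        return 1.0 if not recognized else 0.0
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected)


if __name__ == "__main__":
    # The two sample outputs quoted in this post for the word "content".
    samples = {"gocr": "C unrir", "ocrad": "C_ll_ICll_"}
    for engine, output in samples.items():
        print(f"{engine}: {char_accuracy('content', output):.2f}")
```

This is a deliberately crude measure: it rewards any in-order character match and ignores inserted garbage, so a real evaluation would more likely use an edit-distance based error rate.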