
Monday, August 22, 2011

Is OCR detecting the wrong thing?

After being totally disheartened by the current state of OCR and layout detection, I started thinking about the problem.

This is the sort of thing I do a lot: notice something that people have been working on for 50+ years still doesn't work, then start thinking maybe I have some (uninformed) ideas on the subject.

As usual, I've come to a potentially spurious conclusion solely backed by a bit of Googling and some logic: People are trying to detect the wrong thing.

In my mind, optical character recognition is only a tiny part of the actual problem of scanning a text for content. Arguably more important is layout detection - you can't read the OCR output of a three-columns-and-some-tables-plus-a-center-justified-image scan if the system thinks the whole page is one run of text.

However, layout detection always seems to be centered around finding unbroken areas of grey (apparently, judging by how it behaves) that are assumed to be perfectly orthogonal to the image borders. That is usually not true, and it's why people have been burning their sanity away trying to come up with really solid edge detection and deskewing algorithms - so they can make it true and then their orthogonality-based crap will work.

What if you didn't assume any orientation at all, except that the general top region of the image is upright? Obviously, this makes figuring out layout a bit harder, right?

Here's a good question: Why should it? If you're looking for straight lines, it shouldn't matter in what direction they're going - only that they're all (mostly) going in the same direction. I propose detecting the predominant angles of whitespace on the page. Yeah, whitespace.
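To make that concrete, here's a minimal sketch of one way you might find that predominant angle: binarize the page, rotate it through a range of candidate angles, and pick the angle where the row-wise ink profile alternates most sharply between text rows and gaps. The threshold, the angle range, and the use of scipy are my own arbitrary choices, not anything prescribed above.

```python
import numpy as np
from scipy import ndimage

def predominant_text_angle(page, angles=np.arange(-15, 15.5, 0.5), threshold=128):
    """Guess the dominant text-line angle of a greyscale page (dark text, light paper)."""
    ink = (page < threshold).astype(np.uint8)            # 1 where there is "ink"
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        rotated = ndimage.rotate(ink, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)                     # ink per row
        score = np.var(profile)                           # sharp line/gap alternation -> high variance
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```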

Dig: a document is going to be mostly left-justified or right-justified, whichever is most common for the language or format. Chinese and Japanese ideograms in newspapers let scanners off even easier in layout detection: the text is typically justified to both the left and right edges of the column.

So what's a column? Another good question. I propose a number of ways of finding what I'm going to call fuzzy whitespace.

Take half of one GREYSCALE page and flood-fill it at a series of increasing tolerances. Fill with a color that doesn't exist on the page, or maybe into an additional channel. Keep the flood-fill image that sits at the median of the percentage of the page covered. If there are large holes in any one of the flood fills, remember where those are. Look for straightish holes in the flood fill: those are your lines. They're probably not right, but we don't care right now.

Unless the page has really annoying sidebars, you've probably got a series of "horizontal" stripes connected to one longer "vertical" stripe. Count them by tracing a line parallel to the outside margin of the fattest stripe at 2cm intervals, then repeat at a 90 degree angle. The majority stripe count across those traced lines wins. Unless the layout is downright weird, the more numerous broken stripes are the whitespace between lines, and the largest unbroken stripes are the margins.
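Here's a rough sketch of the flood-fill step as described, using scikit-image's flood as the fill (its boolean mask plays the role of the "additional channel"). The seed position, the tolerance ladder, and running it on the whole page rather than half of it are assumptions of mine, not part of the idea above.

```python
import numpy as np
from skimage.segmentation import flood

def fuzzy_whitespace_mask(grey_page, tolerances=(5, 10, 20, 40, 80)):
    """Flood from a margin seed at several tolerances; keep the median-coverage fill."""
    seed = (0, 0)                                  # assume the top-left corner sits in the margin
    fills = [flood(grey_page, seed, tolerance=t) for t in tolerances]
    coverage = [f.mean() for f in fills]           # fraction of the page each fill covers
    median_idx = int(np.argsort(coverage)[len(coverage) // 2])
    return fills[median_idx]                       # True wherever the fill reached: fuzzy whitespace
```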

Using your orientation discovery, take the average angle of the "line whitespace" centerlines. They should be mostly parallel anyway. Now remove any whitespace that matches those centerlines. What's left over is your blocking. Determine how "broken" that blocking is, and segment the image into "zones" with that blocking.
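One way to realize that step, assuming the whitespace mask has already been rotated so the dominant line angle found earlier is horizontal: wide, flat whitespace stripes are treated as inter-line gaps and dropped, everything else stays as block-level whitespace, and the connected regions left between those separators become the zones. The area and aspect-ratio thresholds are invented for illustration.

```python
import numpy as np
from skimage.measure import label, regionprops

def segment_into_zones(whitespace_mask, min_area=200, line_aspect=4.0):
    """Drop line-gap whitespace, keep block whitespace, label what's in between as zones."""
    separators = np.zeros_like(whitespace_mask, dtype=bool)
    for region in regionprops(label(whitespace_mask)):
        if region.area < min_area:
            continue                               # speckle, ignore
        rmin, cmin, rmax, cmax = region.bbox
        height, width = rmax - rmin, cmax - cmin
        if width > line_aspect * height:
            continue                               # wide flat stripe: gap between lines, discard
        separators[region.coords[:, 0], region.coords[:, 1]] = True
    return label(~separators)                      # connected regions between separators = zones
```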

Chances are you've got some false positives. That's why you kept the color image as well. The human eye performs its version of OCR using color cues, so why shouldn't your algorithm? Figure/ground separation is a problem I don't claim to be clever enough to solve, but it seems that applying a color vision model similar to the average human eye would help matters. The AVERAGE color of the stuff that sits inside your line chunks for each block is the font color. Everything else should be considered an aid for determining page layout. So remove everything foreground-colored and replace it with the average background color, per block.
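A toy version of that last step might look like this: estimate the font color as the mean color of the ink pixels inside a block, estimate the background as the mean of everything else, and paint anything close to the font color over with the background. The distance threshold is a made-up number; nothing above says how close a pixel has to be to count as foreground.

```python
import numpy as np

def flatten_block(block_rgb, ink_mask, dist_threshold=60.0):
    """Replace font-colored pixels in one block with the block's average background color."""
    pixels = block_rgb.reshape(-1, 3).astype(float)
    ink = ink_mask.reshape(-1)
    font_color = pixels[ink].mean(axis=0)          # average color inside the line chunks
    background = pixels[~ink].mean(axis=0)         # average color of everything else
    distance = np.linalg.norm(pixels - font_color, axis=1)
    pixels[distance < dist_threshold] = background # paint foreground over with background
    return pixels.reshape(block_rgb.shape).astype(np.uint8)
```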

That's all I've got so far, mostly because it's early and I've hardly had any sleep. I really don't expect anybody in the field of computer vision to read this (or anybody at all), but I'm leaving this final note for my own satisfaction: there is more to text recognition than following the same old "tried-and-true" dogshit that produces pages of text reading "amd then the nan t7`'./\g+0 his m0th24 4/={ said"

If that's EVER your output, your method is not just a little wrong, it's completely the opposite of how human text recognition works. Also, what's wrong with having OCR that trains itself? "Hey, that's clearly not what the page says, I don't see any of that stuff in my dictionary... Try again with a bit of skew, an offset, hell, spawn 20 worker threads to try to figure out that block there. If all else fails, I'll drop any characters that are giving me problems into image blocks and ask the user to define any repeating ones."
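Just to show the shape of that loop, here's a toy sketch: run OCR on the block, score the result against a word list, retry at slightly different skews if too much of it is gibberish, and flag the block for a human if nothing passes. The ocr and rotate arguments are stand-ins for whatever engine and image library you'd actually use; this isn't any real OCR API.

```python
def recognise_with_retries(block_image, ocr, rotate, dictionary,
                           min_word_ratio=0.6, skews=(-2, -1, 0, 1, 2)):
    """Return (text, ok); ok=False means the block should be shown to a human."""
    best_text, best_ratio = "", 0.0
    for skew in skews:
        text = ocr(rotate(block_image, skew))      # re-run OCR at a slightly different skew
        words = [w for w in text.split() if w.isalpha()]
        ratio = sum(w.lower() in dictionary for w in words) / max(len(words), 1)
        if ratio >= min_word_ratio:
            return text, True                      # reads like real language, accept it
        if ratio > best_ratio:
            best_text, best_ratio = text, ratio
    return best_text, False                        # still gibberish: flag it for the user
```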