The first idea resembling OCR dates back to 1913, when the Optophone, a reading machine for the blind, was developed. It was the first step toward solving a problem that sounds deceptively simple: how do we get writing on paper into a computer?
A century of research, engineering breakthroughs, and hundreds of IDP products later, we could finally scan a receipt and have its fields filled out automatically - provided the document looked nice and friendly enough to the OCR model. Eureka.
Unfortunately for Tesseract, ABBYY, and co., they all suffer from the same complication: documents are written by humans. And humans love to:
Stamp over the most critical data because it feels like the right spot
Organize data in tables with four nested levels of groupings
Disagree on data standards, abandon any attempt at standardization, and simply send their very own format around
Add handwritten comments in their own language
Create documents that require at least a college degree in the relevant field to understand correctly.
This meant OCR models were basically just helpers for data scientists, who still had to handle cleanup, routing, and post-validation to get anything even vaguely close to real automation.
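To make that concrete, here is a minimal sketch of what such a pre-LLM pipeline often looked like: raw OCR via Tesseract, followed by hand-written post-validation rules. The receipt file and the regex for the total are illustrative assumptions, not any real product's logic.

```python
# Minimal sketch of a classic OCR + post-validation pipeline (illustrative).
# Assumes Tesseract is installed and `pip install pytesseract pillow`.
import re
from typing import Optional

from PIL import Image
import pytesseract


def extract_total(receipt_path: str) -> Optional[float]:
    """OCR a receipt image and try to pull out the total amount."""
    # Step 1: raw OCR - only works if the scan is clean enough.
    text = pytesseract.image_to_string(Image.open(receipt_path))

    # Step 2: hand-written post-validation. This pattern is a toy
    # assumption; real pipelines stacked dozens of such rules per layout.
    match = re.search(r"total[:\s]*\$?(\d+[.,]\d{2})", text, re.IGNORECASE)
    if not match:
        return None  # stamp, handwriting, or odd layout -> manual review
    return float(match.group(1).replace(",", "."))


if __name__ == "__main__":
    print(extract_total("receipt.jpg"))  # hypothetical input file
```

Every new document layout meant another rule like the one above, which is exactly why this approach never scaled to real automation.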
Multimodal LLMs enter the scene