GoKawiil - PDF to Text, a challenging problem

The search engine has recently gained the ability to index the PDF file format. The change will deploy over a few months. Extracting text from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that PDF isn’t a text format at all, but a graphical format. It doesn’t contain text in the way you might think of it, but rather a mapping of glyphs to coordinates on “paper”. These glyphs may be rotated, may overlap, and may appear out of order, with very little semantic information attached to them. You should probably be in awe of the fact that you can open a PDF file in your favorite viewer (or browser), hit ctrl+f, and search for text, even when vertical or rotated text sits next to horizontal text. Meanwhile, the search engine preferably wants clean HTML as input. The best way of doing this these days is likely a vision-based machine learning model, but that approach is very far away from scaling to processing hundred ...
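To make the "glyphs at coordinates" idea concrete, here is a minimal sketch of the naive approach: group glyphs that share roughly the same baseline, sort each group left to right, and insert a space wherever the horizontal gap is large. This is not the search engine's actual method. The `Glyph` type, the `glyphs_to_lines` function, and both thresholds are hypothetical names and values chosen for illustration, and the sketch assumes unrotated, single-column text with the y axis measured from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical type, for illustration only)."""
    char: str
    x: float      # left edge, in page units
    y: float      # baseline, measured from the top of the page (assumption)
    width: float  # advance width, in page units


def glyphs_to_lines(glyphs, y_tolerance=2.0, space_gap=1.5):
    """Group glyphs into lines by baseline, then order each line left to right.

    Both thresholds are arbitrary illustrative values, not tuned constants.
    """
    lines = []  # each entry is a list of glyphs on roughly the same baseline
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A gap wider than space_gap is treated as a word boundary.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        result.append(text)
    return result


# Glyphs deliberately listed out of reading order, as they may be in a file.
sample = [
    Glyph("o", 16, 30.0, 6),
    Glyph("H", 10, 10.0, 6),
    Glyph("y", 10, 30.5, 6),
    Glyph("i", 16, 10.0, 6),
    Glyph("a", 30, 10.0, 6),
]
print(glyphs_to_lines(sample))
```

Even this toy version has to guess at line membership and word boundaries, and it breaks down on the cases mentioned above: rotated glyphs, overlapping glyphs, and multi-column layouts.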
