arrays: Tell if text of PDF is visible or not

Tuesday, 4 August 2015

Tell if text of PDF is visible or not

I'm parsing some PDF files using the pdfminer library.

I need to know whether a document is a scanned document, where the scanning machine places the scanned image on top and the OCR-extracted text behind it.

Is there a way to tell whether the text is actually visible? OCR machines place the text on the page only so that it can be selected, not so that it can be seen.

More generally, the problem is distinguishing between two cases that are very different but look similar.

In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.

Here's the PDF as text with the image truncated: http://ift.tt/1ONm1cL

In the other case there's a background image that covers most of the page with the text in front of it.

Telling them apart is proving difficult for me.
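
One rough way to explore the difference is to look at the page's decoded content stream, like the text dump linked above. Two facts from the PDF format are relevant: the `Tr` operator sets the text rendering mode, and mode 3 means the text is neither filled nor stroked (invisible); and operators paint in order, so an image drawn with `Do` after the text can cover it. The sketch below is only a heuristic under several assumptions: the content stream is already available as a decoded string, the function name `classify_text_layer` is hypothetical, `q`/`Q` state saving and form XObjects are ignored, and it does not check whether the image is opaque or actually overlaps the text. It does not use pdfminer.

```python
import re

# Operators of interest in a decoded page content stream:
#   "<n> Tr"    sets the text rendering mode (3 = invisible text)
#   "Tj" / "TJ" show text
#   "/Name Do"  paints an XObject (often the scanned page image)
# Heuristic only: a regex can also match operator-like text inside strings.
TOKEN_RE = re.compile(
    r"(?P<tr>\d+)\s+Tr\b|(?P<show>\b(?:Tj|TJ)\b)|(?P<do>/\S+\s+Do\b)"
)


def classify_text_layer(content: str) -> str:
    """Guess whether page text is 'hidden', 'visible', 'mixed' or 'no text'."""
    tokens = list(TOKEN_RE.finditer(content))
    # Position of the last XObject paint; text shown before it may be covered.
    last_do = max((m.start() for m in tokens if m.group("do")), default=-1)

    mode = 0  # the default text rendering mode is 0 (fill)
    invisible = maybe_covered = on_top = 0
    for m in tokens:
        if m.group("tr") is not None:
            mode = int(m.group("tr"))
        elif m.group("show"):
            if mode == 3:
                invisible += 1        # explicitly invisible text
            elif m.start() < last_do:
                maybe_covered += 1    # painted before a later image
            else:
                on_top += 1           # painted after every image

    if invisible + maybe_covered + on_top == 0:
        return "no text"
    if on_top == 0:
        return "hidden"   # consistent with an OCR text layer
    if invisible == 0 and maybe_covered == 0:
        return "visible"  # consistent with text over a background image
    return "mixed"
```

For example, a stream that paints `/Im0 Do` and then shows text after `3 Tr` would be classified as "hidden", while a stream that paints a background image first and then shows text in the default mode would be classified as "visible". A "hidden" result from text painted before an image still needs a check of the image's position and opacity before it can be trusted.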

via Chebli Mohamed
