I'm parsing some PDF files using the pdfminer library.
I need to know whether a document is a scan, where the scanning machine places the scanned image on top and the OCR-extracted text behind it.
Is there a way to tell whether text on a page is actually visible? OCR machines do place text on the page, but only so that it can be selected.
More generally, the problem is distinguishing between two cases that are very different but look similar.
In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.
Here's the PDF as text with the image truncated: http://ift.tt/1ONm1cL
In the other case there's a background image that covers most of the page with the text in front of it.
Telling them apart is proving difficult for me.
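
To make the two cases concrete, here is a rough sketch of the check I have in mind. It is not pdfminer code: it assumes I can get the page items as a list of hypothetical `(kind, bbox)` tuples in the order they are painted, for example by walking the page's layout objects and reading their `bbox`. Whether pdfminer's iteration order matches paint order (especially with layout analysis turned on) is an assumption I have not verified. The names `classify_page`, `overlap_fraction` and the 0.8 / 0.5 thresholds are made up for illustration.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]   # (x0, y0, x1, y1)
Item = Tuple[str, BBox]                    # ("image" or "text", bbox), in paint order


def area(b: BBox) -> float:
    """Area of a bounding box; degenerate boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_fraction(inner: BBox, outer: BBox) -> float:
    """Fraction of inner's area that lies inside outer."""
    inter = (max(inner[0], outer[0]), max(inner[1], outer[1]),
             min(inner[2], outer[2]), min(inner[3], outer[3]))
    inner_area = area(inner)
    return area(inter) / inner_area if inner_area > 0 else 0.0


def classify_page(items: List[Item], page_bbox: BBox,
                  min_cover: float = 0.8) -> str:
    """Guess whether text sits behind or in front of a page-sized image.

    Assumes items are listed in paint order: earlier items are drawn
    first, so later items end up on top of them.
    """
    page_area = area(page_bbox)
    # Candidate images: those covering at least min_cover of the page.
    big = [(i, bbox) for i, (kind, bbox) in enumerate(items)
           if kind == "image" and area(bbox) >= min_cover * page_area]
    if not big:
        return "no_full_page_image"
    idx, img_bbox = max(big, key=lambda p: area(p[1]))

    # Count overlapping text painted before (under) and after (over) the image.
    before = after = 0
    for i, (kind, bbox) in enumerate(items):
        if kind != "text" or overlap_fraction(bbox, img_bbox) <= 0.5:
            continue
        if i < idx:
            before += 1
        elif i > idx:
            after += 1

    if before == 0 and after == 0:
        return "image_only"
    # Ties are treated as visible text over a background image.
    return "text_hidden_behind_image" if before > after else "text_over_background"
```

This only looks at paint order and overlap. It ignores image transparency, clipping paths, and text drawn with the invisible text rendering mode (`3 Tr` in the content stream), any of which could make the result wrong.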
via Chebli Mohamed