We added support for OCR in PSPDFKit 6.5 for Android, and in this small how-to, I’ll walk you through how to use PSPDFKit for Android to perform OCR on scanned documents and then explain what can be done with the resulting files. More specifically, we’ll be taking a scanned-in document, performing OCR to make the text machine readable, and then automatically redacting certain parts of it.

Before getting started, let me give you a quick reminder of what OCR actually is so you know what will be happening. OCR stands for Optical Character Recognition. It’s the process of taking a plain picture and extracting machine-readable text from it. So how does this work in the case of PSPDFKit? Here’s a simple outline of the steps needed:

First, we need to identify areas of text in a PDF.
Then, we extract lines of text from those areas.
Finally, we embed the textual information into the PDF using an invisible layer of text above the image.

After this process is complete, we can work with the new PDF the same way we would with any other document. This means we can select, copy, highlight, and redact any part of the text. For a more in-depth look at OCR, check out our Optical Character Recognition in Scanned PDFs blog post.

Let’s start by looking at the document we’ll be working with today. This is your typical scanned document: It’s a bit roughed up, mostly legible, and saved as a PDF.

To perform OCR, we simply need to load the document and then use PdfProcessor to apply a PdfProcessorTask configured to perform OCR.

ℹ️ Pro Tip: The same approach can also be used with Image Documents; we just need to use ImageDocument#getDocument() when creating the PdfProcessorTask.

With the OCR output in hand, we can search for the word we want to redact and apply the redactions:

    // Open the processed document we just produced.
    val processedDocument = PdfDocumentLoader.openDocument(this, Uri.fromFile(outputFile))

    // Create a `TextSearch` object to perform the search.
    val textSearch = TextSearch(processedDocument, )

    // Search for the word we want to redact.
    val searchResults = textSearch.performSearch("work")

    // Now for each result, create a matching redaction annotation.
    val redactionAnnotation = RedactionAnnotation(searchResult.pageIndex, )

    // Make sure the redaction annotations are properly stored.

    // Now save the file to a new path while applying the redactions.
    val redactedFile = File(filesDir, "$-ocr-redacted.pdf")

    // To apply the redactions, we simply create `DocumentSaveOptions` and make sure to call `setApplyRedactions(true)`.
    val documentSaveOptions = processedDocument.defaultDocumentSaveOptions
    documentSaveOptions.setApplyRedactions(true)
    processedDocument.save(redactedFile.canonicalPath, documentSaveOptions)

The final output should look something like the following, with all occurrences of "work" being redacted.

But wait! Something you might have noticed in the above screenshot that I feel is worth pointing out is that the very first instance of the word "work" wasn’t actually redacted. This is because OCR almost never has an accuracy of 100 percent, so in situations where it’s absolutely critical that the redactions cover everything, nothing beats a human double-checking things.
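To make the point about OCR accuracy concrete, here is a minimal Kotlin sketch that doesn’t use PSPDFKit at all. It assumes a made-up sentence and a made-up OCR misreading ("W0rk" instead of "Work"), and `findOccurrences` is a hypothetical helper written for this illustration. It shows how an exact-match search over OCR output can silently skip an occurrence that is clearly visible on the page.

```kotlin
// Minimal sketch: why an exact-match search over OCR output can miss a word.
// The sample text and the "W0rk" misrecognition are invented for illustration.

/** Returns the start index of every case-insensitive occurrence of [term] in [text]. */
fun findOccurrences(text: String, term: String): List<Int> {
    require(term.isNotEmpty()) { "Search term must not be empty." }
    val hits = mutableListOf<Int>()
    var index = text.indexOf(term, ignoreCase = true)
    while (index >= 0) {
        hits.add(index)
        // Continue searching after the end of the current match.
        index = text.indexOf(term, index + term.length, ignoreCase = true)
    }
    return hits
}

fun main() {
    // What is actually printed on the scanned page.
    val printedText = "Work begins at nine. The work is hard, but the work pays well."
    // Hypothetical OCR result: the first "Work" was misread as "W0rk".
    val ocrText = "W0rk begins at nine. The work is hard, but the work pays well."

    val onPage = findOccurrences(printedText, "work").size
    val foundByOcr = findOccurrences(ocrText, "work").size

    println("Occurrences on the page: $onPage, found in OCR text: $foundByOcr")
    if (foundByOcr < onPage) {
        println("At least one occurrence was missed and needs manual redaction.")
    }
}
```

In a real workflow, the printed text isn’t available for comparison, which is exactly why a human check of the redacted output is still worthwhile.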