What is redaction?
Redaction is the blacking out or deletion of text in a document. It allows selected information in a document to be disclosed while other parts are kept secret. It is common in court documents and in government. Commonly redacted categories include phone numbers, e-mail addresses, bank account numbers, dates and names. Redacting documents by hand takes a lot of time, but fortunately AI can help to speed up the process. Natural Language Processing (NLP) is a subfield of AI that studies how to analyze and process natural-language text. This technology allows us to extract the keywords from the text.
Slimmer AI develops AI software products that support industries, solve real-world challenges and take professionals into the future. They have developed an API for redacting PDF files. The API returns the redacted document based on your redaction action (e.g. all phone numbers). I have collaborated with Slimmer AI on building the interface for their new redaction application.
Redaction Application
The developed application has the following features:
- search for keyword(s) in the text; the query can be a regular expression
- AI search: search for items in a category like phone numbers
- select a piece of text in the document
- redact the results from the actions above
- display the redacted PDF
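As an illustration of the keyword search in the list above, here is a minimal sketch of how a query could be matched against the extracted text, either as a literal keyword or as a regular expression. The names `escapeRegExp` and `findMatches` and the `isRegex` flag are assumptions made for this example; they are not taken from the application's code.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that a plain keyword is matched literally.
function escapeRegExp(keyword) {
  return keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Find every occurrence of `query` in `text` (case-insensitive).
// `isRegex` is a hypothetical flag: when true, the query is used as a pattern.
// An invalid pattern makes `new RegExp` throw a SyntaxError.
function findMatches(text, query, isRegex = false) {
  if (query === "") return [];
  const pattern = new RegExp(isRegex ? query : escapeRegExp(query), "gi");
  const matches = [];
  for (const m of text.matchAll(pattern)) {
    if (m[0].length === 0) continue; // ignore empty matches
    matches.push({ start: m.index, end: m.index + m[0].length, text: m[0] });
  }
  return matches;
}
```

For example, `findMatches("a.b and axb", "a.b")` finds only the literal `a.b`, while `findMatches("a.b and axb", "a.b", true)` also finds `axb`, because the dot then matches any character.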
Below you see a screenshot of the application. The left sidebar is the search column where the keyword and AI search can be performed. At the bottom of this sidebar, the results of the search are shown. When a user clicks on a result, it is selected for redaction.
The center of the application contains the document. This is where text selection is performed. Once a piece of text is selected, a popup appears that asks whether the selected text should be redacted.
The right column contains the items that have been selected for redaction. When the user clicks the ‘Redact All’ button, the document is processed on the backend and the center section then shows the redacted version of the document.
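The list of selected items in the right column can be thought of as a small collection that stores each span only once. The sketch below is one possible way to model it. The `RedactionList` class, the item shape `{ page, start, end, text }` and the `toPayload` method are all hypothetical; the request format of the Slimmer AI API is not described here, so `toPayload` only returns a plain sorted array.

```javascript
// Collects the items the user has marked for redaction.
// Assumed item shape: { page, start, end, text }.
class RedactionList {
  constructor() {
    this.items = new Map(); // key -> item, so the same span is stored once
  }

  // Build a key that identifies a span by its page and character range.
  static keyOf(item) {
    return `${item.page}:${item.start}:${item.end}`;
  }

  // Add an item; adding the same span twice has no extra effect.
  add(item) {
    this.items.set(RedactionList.keyOf(item), item);
  }

  // Remove an item; returns true if it was in the list.
  remove(item) {
    return this.items.delete(RedactionList.keyOf(item));
  }

  // Return the items ordered by page and position, ready to be sent
  // to a backend in whatever format that backend expects.
  toPayload() {
    return [...this.items.values()].sort(
      (a, b) => a.page - b.page || a.start - b.start
    );
  }
}
```

Keying the map on page and range means that a span found by both a keyword search and a category search ends up in the list only once.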
The application uses the PDF.js library for basic functionality such as rendering the PDF and selecting text. It is a free and open-source library. There are commercial libraries that offer more functionality, but they were not required. The rest of the technology stack for the application consists of JavaScript, jQuery, Bootstrap 4 and HTML/CSS.
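PDF.js exposes the text of a page as a list of text items, each with a `str` property, through `page.getTextContent()`. A search hit therefore has to be mapped back to the items it covers. The sketch below shows one way to do that with plain data. The helper names `buildTextIndex` and `itemsInRange` are assumptions, and joining the items without separators is a simplification that may not match how the application actually handles line breaks.

```javascript
// Join the text items of a page into one searchable string and
// remember the character offset at which each item starts.
function buildTextIndex(items) {
  let text = "";
  const offsets = [];
  for (const item of items) {
    offsets.push(text.length);
    text += item.str;
  }
  return { text, offsets };
}

// Return the indices of the items that overlap the character
// range [start, end) of the joined text.
function itemsInRange(items, offsets, start, end) {
  const hits = [];
  items.forEach((item, i) => {
    const itemStart = offsets[i];
    const itemEnd = itemStart + item.str.length;
    if (itemStart < end && itemEnd > start) hits.push(i);
  });
  return hits;
}
```

With this mapping, a match that spans two items, for instance a phone number split across two text runs, can still be highlighted and redacted as a whole.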
Improvements
The application was meant as a Proof of Concept to see if we could create a user-friendly wrapper for the API. Since the current functionality works well, the application is being developed further. One item on the list of improvements is a rectangle select. In addition to redacting a piece of text on a line, as is possible now, this would allow the user to redact any rectangular area in the document.
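A rectangle select mostly comes down to two small geometry steps: turning the two corner points of a mouse drag into a proper rectangle, and converting that rectangle from screen pixels to PDF coordinates. The following is a minimal sketch of both steps, not the planned implementation. The function names are made up for this example, and the conversion assumes an unrotated page whose origin is at (0, 0).

```javascript
// Turn two drag points (mouse down and mouse up) into a rectangle with
// non-negative width and height, whichever direction the user dragged in.
function normalizeRect(p1, p2) {
  return {
    x: Math.min(p1.x, p2.x),
    y: Math.min(p1.y, p2.y),
    width: Math.abs(p2.x - p1.x),
    height: Math.abs(p2.y - p1.y),
  };
}

// Convert a rectangle in screen pixels (origin top-left, y pointing down)
// to PDF units (origin bottom-left, y pointing up).
// `scale` is the zoom factor used for rendering and `pageHeight` is the
// page height in PDF units. Assumes no page rotation.
function toPdfRect(rect, scale, pageHeight) {
  return {
    x: rect.x / scale,
    y: pageHeight - (rect.y + rect.height) / scale,
    width: rect.width / scale,
    height: rect.height / scale,
  };
}
```

PDF.js viewports also provide a `convertToPdfPoint` method that takes rotation into account, which would likely be the better choice in the real application.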