This is a bit involved, so please stay with me.
I'm working on a "Court Watch" project, which involves asking volunteers to take forms with them and indicating what happens during a trial/hearing in specific Courtrooms.
This is the form (redactions for privacy).
We have hundreds of them filled out by volunteers. Now, as these are handwritten documents, they need to be transcribed by other volunteers into a spreadsheet. We're using Google Forms and volunteers to handle this.
However, we need a primary key for this data and a way to handle duplicates.
Duplicates occur when two volunteers sit in the same courtroom and fill out different forms at the same hearing. We want to capture all of the data before assessing the quality and completeness, so we're not going through those and throwing out the duplicates.
We've decided on a system of barcodes to be created and used on these scanned PDFs. Using Reportlab, it seems that I can programmatically generate the barcodes on these OCR'd PDFs.
What I need is a way to handle the duplicates. The barcode (or primary key) used should account for the fact that two PDFs are duplicates.
For example, perhaps the resulting dataset partly looks like this:
+---------+---------+----------------+
| id | date | court |
+---------+---------+----------------+
| 45234-A | 7/24/18 | district court |
| 45234-B | 7/24/18 | district court |
| 45235 | 7/24/18 | superior court |
+---------+---------+----------------+
The "A" & "B" tell me there there are two forms that were compiled on the same day, in the same hearing.
How would you design a system (pseudocode is fine) do basically conditionally generate barcodes based on whether something is a duplicate or not?
Solution I've considered:
Perhaps, prior to scanning, I order the physical documents sequentially. I put duplicates one after another in this pile. Using a label maker, I put an "A", "B" on the top right of those files that are duplicates, and leave those that are not, blank. Then scan and OCR'd these documents. Then, in Python, I write some code to conditionally understand whether one of these documents is a duplicate or not by scanning the top right of a PDF for the letter "A", "B", etc. If it is, then generate a barcode and append "A" to it and check the next file. Append "B" to the barcode from "A" and add it to the 2nd file. Continue checking the next file until no letter is present, then move onto the next non-duplicate file.
If I were to begin the above solution, I'm not sure how to check for an "A" or "B" in a PDF using Python.
Thanks for reading! Any help is much appreciated.
Ninja edit: I forgot to mention. The purpose for using this sort of system is to allow distributed and decentralized volunteers to take like 50 forms, go to the Google Forms page, and input them in. Creating a barcode system allows us to account for the forms and keep track of them.