I would like to analyze the properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format; instead, it provides a 700+ page PDF of the properties going up for auction.
I'm wondering if the community has any thoughts on how I could parse this PDF into a structured format, either for insertion into a database or to create a spreadsheet of the properties.
Here's an image of what each page represents:
And here's a page that lists some properties:
I'm comfortable with Python and Ruby, so I don't have any issues scripting up a solution, but because the "columns" and the data in those columns aren't necessarily tied together, it seems like this would be a dubious proposition.
Any ideas would be greatly appreciated.
After mucking around with this for 3 hours, I was able to create a parseable XML document from the data. Unfortunately, I was unsuccessful in putting together a completely reusable set of steps that I can use for future auction publications.
As an aside, I did call Los Angeles County to ask whether they could provide the auction properties in an alternative format (Excel, etc.), and the answer was no. That's government for you.
Here's a high-level view of my approach:
I used http://xmlbeautifier.com/ as my XML beautifier / validator because it was fast and it gave accurate error reporting, including line numbers.
Use Homebrew to install Poppler for Mac:
After Poppler is installed, you should have access to the pdftotext utility to convert the PDF:
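For anyone who wants to drive this step from a script, here is a minimal Python sketch of calling pdftotext. The file names and the helper names (`pdftotext_command`, `convert`) are hypothetical, and it assumes Poppler was installed through Homebrew's `poppler` formula so that `pdftotext` is on the PATH.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for a plain pdftotext conversion."""
    return ["pdftotext", pdf_path, txt_path]


def convert(pdf_path, txt_path):
    """Run pdftotext, failing early if the utility is not installed."""
    # Assumption: Poppler was installed beforehand, e.g. `brew install poppler`.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install Poppler first")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)


if __name__ == "__main__":
    # Hypothetical file names; substitute the real auction PDF.
    convert("auction.pdf", "auction.txt")
```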
Here's a preview of the XML (Click here for full XML):
Edit: Adding the Ruby I wrote to convert the XML to a CSV.
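For readers who prefer Python, the same kind of XML-to-CSV conversion might look roughly like the sketch below. It is not a port of the Ruby script: the element names (`property`, `item`, `parcel`, `minimum_bid`) are hypothetical placeholders and would need to be replaced with whatever tags the generated XML actually uses.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical field names; replace with the tags in the real XML.
FIELDS = ["item", "parcel", "minimum_bid"]


def xml_to_csv(xml_text, fields=FIELDS, record_tag="property"):
    """Return CSV text with one row per <record_tag> element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)  # header row
    for record in root.iter(record_tag):
        # A missing child element becomes an empty cell instead of an error.
        writer.writerow([(record.findtext(name) or "").strip() for name in fields])
    return out.getvalue()
```

Missing fields are written as empty cells, which makes gaps in the extracted data easy to spot once the CSV is opened in a spreadsheet.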
Link to Final CSV
Convert to text with Xpdf using the pdftotext command.
I converted your file with the following:
This conversion leaves text exactly in its original layout (due to the -layout option). Options -f and -l indicate the first and last page numbers of the range of pages to extract.
From there, parsing should be simple: a number in column 8 indicates the first line of a record, and a blank line ends the record. Follow the guide for the exact positioning of elements within a record.
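As a rough illustration of that parsing rule, here is a Python sketch that builds a pdftotext command with the -layout, -f and -l options and then groups the resulting lines into records. The helper names and file names are hypothetical, and "column 8" is assumed to be 1-based (index 7 in Python); adjust the `column` argument if the real output is offset differently.

```python
def layout_command(pdf_path, txt_path, first_page, last_page):
    """Argument list for pdftotext with layout kept and a page range set."""
    return ["pdftotext", "-layout",
            "-f", str(first_page), "-l", str(last_page),
            pdf_path, txt_path]


def starts_record(line, column=8):
    """True if the character in the given 1-based column is a digit."""
    index = column - 1
    return len(line) > index and line[index].isdigit()


def split_records(lines, column=8):
    """Group lines into records: a digit in `column` opens a record,
    a blank line closes it. Lines outside any record (page headers,
    for example) are skipped."""
    records = []
    current = None
    for raw in lines:
        # pdftotext separates pages with form feeds; drop them so the
        # column positions are not shifted.
        line = raw.rstrip("\n").replace("\f", "")
        if not line.strip():
            if current is not None:
                records.append(current)
                current = None
            continue
        if starts_record(line, column):
            if current is not None:
                records.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        records.append(current)
    return records
```

Each record comes back as a list of raw lines, so the individual fields can then be sliced out by their fixed positions within those lines.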