I would like to use Python 2.7 to remove anything that isn't the documents' text from EDGAR filings (which are available online as .txt files). An example of what the files look like is here:
EDGAR provides its Document Type Definitions starting on page 48 of this file:
The first part of my program gets the .txt file from the EDGAR online database into a local file that I've named "parseme.txt". What I would like to know is how to use the DTD to parse the .txt file. I would use a canned parsing module like BeautifulSoup for the job, but EDGAR's format appears unique, and I hope to avoid a large regex to get the job done.
import os
filename = 'parseme.txt'
with open(filename) as f:
    lines = f.readlines()
My question is related to "Parse SGML with Open Arbitrary Tags in Python 3" and "Use lxml to parse text file with bad header in Python", but I believe it is distinct: it concerns Python 2.7, and I'm not concerned with the header, only with the text of the file.
The pysec project looks promising. It's a basic Django app that downloads the EDGAR index and then allows you to download specific filings and extract financial parameters from the XBRL.
Look at the OpenSP toolkit, which has programs to process SGML files. Your simplest option is probably to use the osx program to get an XML version of the input file, after which you can use XML processing tools.
There may be some setup to do first, as the OpenSP package doesn't come with the EDGAR DTD or its SGML declaration (the first part of the material in your reference at page 48, starting with <!SGML "ISO 8879-1986"). You will have to get these as text files and add them to the catalogs where the SP parser can find them.
UPDATE: This document seems to be a more up-to-date version. A casual Google search doesn't turn up any immediately machine-processable versions, though, so you may have to copy-paste from the PDF.
However, if you do so, there will be some extraneous formatting you'll have to remove: there seem to be page break indicators, labelled "C-1", "C-2", and so on. They are not part of SGML and need to be deleted.
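A minimal sketch of that cleanup step, assuming each page-break indicator ends up on a line of its own after copy-pasting (the function name and the sample text are made up for illustration; check your pasted text before relying on the pattern):

```python
import re

# Assumption: every page-break label sits alone on its line, e.g. "C-1" or "C-12".
PAGE_BREAK = re.compile(r'^\s*C-\d+\s*$')

def strip_page_breaks(text):
    """Return text without lines that consist only of a C-<number> label."""
    kept = [line for line in text.splitlines() if not PAGE_BREAK.match(line)]
    return '\n'.join(kept)

# Made-up stand-in for pasted DTD text, not taken from the real EDGAR DTD.
sample = '<!ELEMENT a - - (b)>\nC-1\n<!ELEMENT b - O (#PCDATA)>\n  C-12  \n'
cleaned = strip_page_breaks(sample)
```

If a label shares a line with real declarations in your copy, this pattern will not catch it and you would have to fix that line by hand.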
You can either add the SGML declaration and the EDGAR DTD to the catalog (in which case the DTD file should contain only the part between the [ after <!DOCTYPE submission and the matching ] at the end), or you can create a "prolog" file consisting of both parts together as is (i.e. including the <!DOCTYPE submission [ and ]>) and run any program in the toolkit on the prolog and your SGML file: put both names on the command line, with the prolog file first, so that the parser reads both files in the correct order. To understand what's happening, you need to know that an SGML parser needs three pieces of information for a parse: an SGML declaration to set some environmental and processing parameters, then a DTD to describe the structural constraints on a document, and finally the document itself.
The link below is a library that parses EDGAR filings into a SQLite DB. It contains functionality to pull Form10k and Form8Qk filings from the EDGAR FTP site for years that you specify and load them into a normalized format in SQLite DB tables. Considering how poorly the filings adhere to the standard, writing your own parsing script would be a significant undertaking. That library and code similar to the below will load filings for the wanted quarter, and from there you can simply query the table for the data you are seeking.
http://rf-contrib.googlecode.com/svn/trunk/ha/src/main/python/edgar/
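For the OpenSP route described earlier (this is unrelated to the SQLite library linked above), the prolog-first invocation could be driven from Python roughly like this. It is only a sketch: it assumes osx is installed and on your PATH, and edgar_prolog.sgml is a hypothetical name for the prolog file you assembled from the SGML declaration and DTD.

```python
import subprocess

def build_osx_command(prolog_path, document_path):
    # Prolog goes first so the parser sees the SGML declaration and DTD
    # before it reads the filing itself.
    return ['osx', prolog_path, document_path]

def sgml_to_xml(prolog_path, document_path):
    """Run osx and return (xml_output, exit_code). Requires OpenSP."""
    proc = subprocess.Popen(build_osx_command(prolog_path, document_path),
                            stdout=subprocess.PIPE)
    xml_output, _ = proc.communicate()
    # A non-zero exit code usually means the parser reported errors;
    # the XML written before or despite them is still returned.
    return xml_output, proc.returncode

# 'edgar_prolog.sgml' is a hypothetical file name; 'parseme.txt' is from the question.
cmd = build_osx_command('edgar_prolog.sgml', 'parseme.txt')
```

Calling sgml_to_xml('edgar_prolog.sgml', 'parseme.txt') would then give you XML that you can pass to whatever XML tools you prefer. Given how loosely filings follow the DTD, expect parser errors and check the exit code.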