Converting SEC index files into an SQLite database

Here is some Perl code for taking index files downloaded (and, critically, named) by Andrew Leone’s Perl code, and turning them into a single table in an SQLite database:

#!/usr/bin/perl
use DBI; # Package for SQLite database connection.

# Set up database connection
$dbfile = "./filings.db";
$dbh = DBI->connect("dbi:SQLite:dbname=$dbfile","","")
or die "Cannot connect: " . $DBI::errstr;

#  Delete the FILINGS table if it already exists.
$dbh ->do("DROP TABLE IF EXISTS filings;");

#
#  Create the FILINGS table afresh.
#
$sql = "CREATE TABLE filings (cik INTEGER, form_type, company_name, date_filed,";
$sql = "$sql file_name);";
$dbh ->do($sql);

foreach my $year (2000..2010) {         # Code imports index files for years 2000-2010
    foreach my $qtr (1..4) {            # and all four quarters
        $path = "./";                   # This code assumes that the index files      
                                        # are in the current directory
        $file = "$path" .               # The path and file name
        "company$qtr$year.idx";         # of the index file. 
        print "$file\n";                # Tell the user which file we're working on
        open(FH, "$file");              # Open the SEC index file
        $line = 0;                      # Reset the line counter

        # The SQL work is structured as a single transaction per file to keep the 
        # number of writes to the database down, thereby enhancing performance.
        $dbh-> begin_work;
        while (<FH>) {
            $line++;            # Auto-increment the line count
            if ($line >= 11) {  # The first 10 lines of the SEC index files are ignored,
                                # as they have header information
                   my ($company_name,$form_type,$cik,$date_filed,$file_name)
                  = unpack(a62a12a12a12a43,$_); # The string here ("a62a12...") reflect the 
                                                # location of the values for each field;
                                                # the SEC index files are in a fixed-width             
                                                # format.

                # Remove trailing spaces from each field
                $form_type =~ s/\s+$//;
                $cik =~ s/\s+$//;
                $date_filed =~ s/\s+$//;
                $file_name =~ s/\s+$//;
                $company_name =~ s/\s+$//;

                # Use SQL escape for single quotes (e.g., in "McDonald's") so that
                # SQL can interpret the names properly
                $company_name =~ s/'/''/g;

                # Insert the row in the database
                $sql = "INSERT INTO filings (cik, form_type, company_name, date_filed,";
                $sql = "$sql file_name) ";
                $sql = "$sql VALUES( $cik, '$form_type', '$company_name', '$date_filed',";
                $sql = "$sql '$file_name');";
                $dbh ->do( "$sql" );
            }
        }

        # Finish the transaction and close the file
        $dbh -> commit;
        close(FH);
    }
}

# Create indices to improve performance
$dbh -> do("CREATE INDEX filings_idx ON filings (cik)");
$dbh -> do("CREATE INDEX filings_idx2 ON filings (cik, date_filed)");

# Close the database connection
$dbh-> disconnect();

The current code only covers 2000 through 2010, but could easily be modified to cover earlier periods. To run the code, put it in the same directory as the download index files, then run it (you may need to change permissions, e.g., chmod +x ./make_filings.db.pl to execute the script).
./make_filings_db.pl
Here is what it looks like on my computer, including a sample of the table created.

Advertisements
This entry was posted in Uncategorized and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s