Getting SEC filing index files

Here is some R code to download SEC index files and put them into a database. This is an alternative to Perl code provided by Andrew Leone here.

First, a function to download the zipped index file from the SEC website, then parse it into an R dataframe:

getSECIndexFile <- function(year, quarter) {
    
    # Download the zipped index file from the SEC website
    tf <- tempfile()
    result <- try(download.file(
        url=paste("ftp://anonymous:your_id@ftp.sec.gov/edgar/full-index/",
                  year,"/QTR", quarter, "/company.zip",sep=""),
        destfile=tf))
    
    # If we didn't encounter and error downloading the file, parse it
    # and return as a R data frame
    if (!inherits(result, "try-error")) {
        
        # Small function to remove leading and trailing spaces
        trim <- function (string) {
            gsub("^\\s*(.*?)\\s*$","\\1", string, perl=TRUE)
        }

        # Read the downloaded file
        raw.data <- readLines(con=(zz<- unz(description=tf,
                                            filename="company.idx")))
        close(zz)
        raw.data <- raw.data[11:length(raw.data)] # Remove the first 10 rows.

        # Parse the downloaded file and return the extracted data as a data frame
        company_name <- trim(substr(raw.data,1,62))
        form_type <- trim(substr(raw.data,63,74))
        cik <- trim(substr(raw.data,75,86))
        date_filed <- as.Date(substr(raw.data,87,98))
        file_name <- trim(substr(raw.data,99,150))
        rm(raw.data)
        return(data.frame(company_name, form_type, cik, date_filed, file_name))
    } else { return(NULL)} 
}

Second, a function to add the file to my database:

addIndexFileToDatabase <- function(data) {
    if (is.null(data)) return(NULL)
    library(RPostgreSQL)
    drv <- dbDriver("PostgreSQL")
    pg <- dbConnect(drv, db="crsp")
    
    rs <- dbWriteTable(pg, c("filings", "filings"), data, append=TRUE)
    dbDisconnect(pg)
    return(rs)
}   

Following sentence is out of date (see below). Note that some of the SEC index files contain embedded backslashes, which cause problems for the version of RPostgreSQL on CRAN. I instead downloaded the version available on the Google project site, then installed it by issuing install.packages("/home/iangow/rpostgresql-read-only/RPostgreSQL/", type="source", repos=NULL) within R.

Finally, a few lines of code to delete the filings table if it already exists, then download all the filing files from 1993 to 2011 and post them in the database.

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
pg <- dbConnect(drv, db="crsp")
dbGetQuery(pg, "DROP TABLE IF EXISTS filings.filings")

for (year in 1993:2011) {
    for (quarter in 1:4) {
        addIndexFileToDatabase(getSECIndexFile(year, quarter))
    }
}

Advertisement
This entry was posted in PostgreSQL, R and tagged , , , . Bookmark the permalink.

2 Responses to Getting SEC filing index files

  1. iangow says:

    Note that the issue with backslashes and RPostgreSQL has now been resolved in the CRAN version.

  2. Pingback: Extracting files from SEC “complete submission” text filings | iangow

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s