Getting SEC filing index files

Here is some R code to download SEC index files and put them into a database. This is an alternative to Perl code provided by Andrew Leone here.

First, a function to download the zipped index file from the SEC website, then parse it into an R dataframe:

getSECIndexFile <- function(year, quarter) {
    
    # Download the zipped index file from the SEC website
    tf <- tempfile()
    result <- try(download.file(
        url=paste("http://www.sec.gov/Archives/edgar/full-index/",
                  year,"/QTR", quarter, "/company.zip",sep=""),
        destfile=tf))
    
    # If we didn't encounter and error downloading the file, parse it
    # and return as a R data frame
    if (!inherits(result, "try-error")) {
        
        # Small function to remove leading and trailing spaces
        trim <- function (string) {
            string <- enc2native(string)
            gsub("^\\s*(.*?)\\s*$","\\1", string, perl=TRUE)
        }
        
        # Read the downloaded file
        raw.data <- readLines(con=(zz<- unz(description=tf,
                                            filename="company.idx")))
        close(zz)
        raw.data <- raw.data[11:length(raw.data)] # Remove the first 10 rows.
        
        # Parse the downloaded file and return the extracted data as a data frame
        company_name <- trim(substr(raw.data,1,62))
        form_type <- trim(substr(raw.data,63,74))
        cik <- trim(substr(raw.data,75,86))
        date_filed <- as.Date(substr(raw.data,87,98))
        file_name <- trim(substr(raw.data,99,150))
        rm(raw.data)
        return(data.frame(company_name, form_type, cik, date_filed, file_name))
    } else { return(NULL)} 
}

Second, a function to add the file to my database:

addIndexFileToDatabase <- function(data) {
    if (is.null(data)) return(NULL)
    library(RPostgreSQL)
    drv <- dbDriver("PostgreSQL")
    pg <- dbConnect(drv, db="crsp")
    
    rs <- dbWriteTable(pg, c("filings", "filings"), data, append=TRUE)
    dbDisconnect(pg)
    return(rs)
}   

Finally, a few lines of code to delete the filings table if it already exists, then download all the filing files from 1993 to 2011 and post them in the database.

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
pg <- dbConnect(drv, db="crsp")
dbGetQuery(pg, "DROP TABLE IF EXISTS filings.filings")

for (year in 1993:2011) {
    for (quarter in 1:4) {
        addIndexFileToDatabase(getSECIndexFile(year, quarter))
    }
}
Advertisements
This entry was posted in PostgreSQL, R and tagged , , , . Bookmark the permalink.

26 Responses to Getting SEC filing index files

  1. iangow says:

    Note that the issue with backslashes and RPostgreSQL has now been resolved in the CRAN version.

  2. Pingback: Extracting files from SEC “complete submission” text filings | iangow

  3. An Nhiên says:

    Hi, I happened to bump into your website. I’m totally inexperienced in getting SEC filing index files. So, I just want to ask if i don’t want to download the files, I just want to get a merge file with all SEC filings url link from 1993-2011. How can I do that? Thanks a bunch

    • iangow says:

      Download an index file and look at the code on the blog for pulling the header files. It should be pretty easy to construct the URLs for the underlying filings from that.

  4. iangow says:

    I updated the code above to use the SEC’s Web servers, rather than its FTP servers, as the latter are very slow.

  5. Pingback: Extracting files from SEC “complete submission text files” | iangow

  6. Sam says:

    Ian, thanks for great examples how to download SEC filings.

    While using code above I get an error 2011-Q4 index file:
    > SEC.index.file.2011.Q4 <- getSECIndexFile(2011, 4)
    trying URL 'http://www.sec.gov/Archives/edgar/full-index/2011/QTR4/company.zip&#039;
    Content type 'application/zip' length 2910747 bytes (2.8 Mb)
    opened URL
    ==================================================
    downloaded 2.8 Mb

    Error in gsub("^\\s*(.*?)\\s*$", "\\1", string, perl = TRUE) :
    input string 60858 is invalid UTF-8

    Do you maybe have an idea on how to "solve" this "issue"?

    Thanks.

    • iangow says:

      I found an accented character near row 60858 in the index file you were looking at. I tweaked the code above (added “string <- enc2native(string)") and it now seems to work again.

  7. Zhu says:

    Hi, it is my first time to use R and I tried your program above. But I am confused with the final step, esp. library(RPostgreSQL). When I ran it, there shows no such package named “RPostgreSQL”. I guess it is your own package and used to store data. Would you like to tell me how to do it. BTW, is it similar with library in SAS?

    • iangow says:

      Zhu:

      RPostgreSQL is a way to connect to PostgreSQL from R. I have entries on PostgreSQL elsewhere on my blog and I think very highly of it as a RDBMS for empirical research. It is more like a way to get a much better version of PROC SQL in R. Other approaches exist, but this is the one I found most robust and powerful.

  8. J.T. says:

    Hi, thanks for sharing the code, this is what i was looking for. When running the code, I get the following error :
    pg <- dbConnect(drv, db="crsp")
    Error in postgresqlNewConnection(drv, …) :
    RS-DBI driver: (could not connect xxxxx@local on dbname "crsp"
    Is there a step that I am missing? thx

  9. I am at a loss on the Error in postgresqlNewConnection as well

    • iangow says:

      Does getSECIndexFile(year, quarter) work? Unless you have a PostgreSQL database set up, the portion of the code that pushes the value of getSECIndexFile into PostgreSQL won’t work. Of course, you don’t need to use PostgreSQL; one could adapt the code to store the data in some other way.

      • Thanks iangow. I do have PostreSQL, but you are right I cannot get it to listen in windows. I am now trying to write your code to MySQL unless you have other suggestions? I appreciate your code and help.

      • Gow, Ian says:

        Can you connect to your database using psql or pgAdmin3?

        From: iangow <comment-reply@wordpress.com> Reply-To: “comment+eyuyxt23w9zlil3w_gjjem78@comment.wordpress.com” <comment+eyuyxt23w9zlil3w_gjjem78@comment.wordpress.com> Date: Thursday, June 5, 2014 at 3:18 AM To: Ian Gow <igow@mba2002.hbs.edu> Subject: [iangow] Comment: “Getting SEC filing index files”

    • No I cannot, but I am trying to re-install PostgreSQL and see if I can get it to connect. Ian if you have any advice I will take it? I am running windows 7 home premium 64 bit and have installed all release versions of R.

  10. No I cannot, but I am trying to re-install PostgreSQL and see if I can get it to connect. Ian if you have any advice I will take it? I am running windows 7 home premium 64 bit and have installed all release versions of R.

  11. Thanks for the code. I had to re install postgresql, because I had my virus software running every time I installed postgresql. I had to uninstall postresql then I turned off my virus protection plus chromes malware service and re installed postgresql. Postgresql works great and so does the code. Now I just have to learn how to use postresql.

  12. Elizabeth S says:

    Thanks for sharing your code.

    It is really helpful and works perfectly with the exception of one quarter, 2011:Q4, that gives the error:
    Error in gsub(“^\\s*(.*?)\\s*$”, “\\1”, string, perl = TRUE) : input string 61312 is invalid UTF-8

    The input string 61312 is for the company: “EXPLORA INVESTIMENTOS GESTÃO DE RECURSOS LTDA”.

    Any suggestions for addressing the error besides searching for this special case and substituting A for à would be greatly appreciated.

    • Elizabeth S says:

      Figured it out.
      load library(stringi)
      change line 16 in getSECIndexFile from:
      string <- enc2native(string)
      to:
      string <- stri_encode(string,"", "UTF-8")

  13. iangow says:

    Elizabeth:

    Looking at my code on GitHub (more recent), it seems that I may have solved it there too.

    https://github.com/iangow/acct_data/blob/master/filings/get_filings.R

  14. Pingback: Use Python to get EDGAR index files | KAI CHEN

  15. Pingback: Use Python to get EDGAR index files and save in a database » Kai Chen

  16. George Mount says:

    Thank you for sharing this, very helpful. I am just starting out with SEC textual analysis and one problem I’m struggling to overcome is how to parse/clean the markups that come with the 10-Ks I receive. Can you suggest any strategies or resources to help? Thank you.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s