Getting SEC filing header files

This post provides some code for downloading header (*.sgml) files associated with SEC filings by firms. The first piece of code is a function that takes the path to the text filing, which is found in the SEC index files, transforms that path into the path for the SGML file, checks whether the file has already been downloaded, and, if not, downloads it.

get_sgml_file <- function(path) {
    
    # The name of the local file to be created and the remote file from which
    # it will be created.
    directory <- "/home/iangow/Documents/WRDS/filings/the_filings"
    local_filename <- gsub("^edgar\\/data(.*)\\.txt$", 
                          paste(directory,"\\1",".hdr.sgml",sep=""), 
                          path, perl=TRUE)
    remote_filename <- gsub("\\.txt$", ".hdr.sgml", path, perl=TRUE)                        
    
    # Only download the file if we don't already have a local copy
    if (!file.exists(local_filename)) {
        
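        # Build the URL; the accession number without dashes names the folder
        url <- paste("http://www.sec.gov/Archives",
                     dirname(path),
                     gsub("-|(\\.txt$)","",basename(path),perl=TRUE),
                     basename(remote_filename), sep="/") 
        # recursive=TRUE also creates any missing parent directories
        dir.create(dirname(local_filename), showWarnings=FALSE, recursive=TRUE)
        try(download.file(url=url, destfile=local_filename) )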
    }                      
    
    # Return the local filename if the file exists
    if (file.exists(local_filename)) { 
        return(local_filename) 
    } else { return(NA) } 
}
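
To make the path transformation concrete, here is a minimal sketch that isolates the URL-building step and applies it to a single path of the kind found in the index files. The helper name sgml_url is hypothetical (it is not part of the function above), the example path is only illustrative, and nothing is downloaded, so whether a header file actually exists at the resulting address is not checked.

```r
# Hypothetical helper: mirrors the URL construction inside get_sgml_file()
# without downloading anything.
sgml_url <- function(path) {
    # Header file name: swap the .txt extension for .hdr.sgml
    remote_filename <- gsub("\\.txt$", ".hdr.sgml", path, perl=TRUE)
    # Folder name: the accession number with dashes and extension removed
    folder <- gsub("-|(\\.txt$)", "", basename(path), perl=TRUE)
    paste("http://www.sec.gov/Archives", dirname(path), folder,
          basename(remote_filename), sep="/")
}

# Illustrative path in the format used by the index files
example_path <- "edgar/data/2646/0000002646-94-000001.txt"
example_url <- sgml_url(example_path)
print(example_url)
```

The folder component is the file name with the dashes and the extension stripped, which is why the same identifier appears twice in the resulting URL.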

Now, to test this code, I pull together announcements from First Call’s CIG (company-issued guidance) database, and then a list of 8-K filings made within the five-day window beginning with each announcement. (This takes around 50 seconds to pull together.)

# Connect to my database
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
pg <- dbConnect(drv, dbname="crsp")

# Get FirstCall identifier, announcement date, and CIK for all observations on 
# the FirstCall CIG file. I can get CIK matches for almost all observations and
# there are very few (about 6) Security_IDs with more than one CIK match; 
# I guess that the correct CIK match will be arrived at by the limited time
# window used below.
dbGetQuery(pg, "DROP TABLE IF EXISTS cig_ciks")
dbGetQuery(pg,"CREATE TABLE cig_ciks AS SELECT * FROM
    (SELECT DISTINCT \"Security_ID\" AS security_id, anndate, cusip AS cusip8
        FROM fc.cig) AS a
    LEFT JOIN (SELECT DISTINCT substr(cusip, 1,8) AS cusip8, 
           (cik::integer)::text AS cik
        FROM (SELECT DISTINCT gvkey, cusip FROM comp.secm) AS b
        INNER JOIN (SELECT DISTINCT gvkey, cik FROM comp.company 
                   WHERE cik IS NOT NULL) AS c
        USING (gvkey)) AS d
    USING (cusip8)")

# Pull together a list of 8-K filings within the five-day window beginning with
# the announcement on CIG; I should be more careful here with weekends, etc.
file.list <- dbGetQuery(pg, "
    SELECT * 
    FROM cig_ciks AS a
    INNER JOIN filings.filings AS b
    USING (cik)
    WHERE b.date_filed BETWEEN a.anndate AND a.anndate + interval '4 days'
                 AND form_type='8-K'")
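
The query above uses a calendar-day window, so an announcement made on a Friday gets fewer weekdays than one made on a Monday. One possible refinement, sketched here and not used in the rest of this post, is to compute the end of the window in weekdays on the R side. The function name add_weekdays is hypothetical, and public holidays are ignored.

```r
# Hypothetical helper: the date that falls n weekdays after start,
# skipping Saturdays and Sundays (holidays are NOT handled).
add_weekdays <- function(start, n) {
    start <- as.Date(start)
    if (n == 0) return(start)
    # 2 * n + 7 calendar days always contain at least n weekdays
    candidates <- start + seq_len(2 * n + 7)
    wday <- as.POSIXlt(candidates)$wday   # 0 = Sunday, ..., 6 = Saturday
    candidates[!(wday %in% c(0, 6))][n]
}

friday_end <- add_weekdays("2024-01-05", 4)   # 2024-01-05 is a Friday
monday_end <- add_weekdays("2024-01-01", 4)   # 2024-01-01 is a Monday
print(friday_end)
print(monday_end)
```

With a window defined this way, both announcements are followed by the same number of weekdays regardless of where the weekend falls.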

Now, download the SGMLs for each filing. This step takes some time: over 5 hours for the 58,000+ filings identified in the previous step. However, this is a one-off cost.

# Get the files
file.list$sgml_file <- unlist(lapply(file.list$file_name, get_sgml_file))
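
Because get_sgml_file returns NA when a download fails, it is worth checking how many filings ended up without a header file before moving on. A minimal sketch of that check on made-up data follows; toy and fake_get_sgml_file are stand-ins invented for illustration so that the block runs without network access.

```r
# Toy stand-in for file.list (made-up file names)
toy <- data.frame(file_name = c("a.txt", "b.txt", "c.txt"),
                  stringsAsFactors = FALSE)

# Hypothetical stand-in for get_sgml_file(): pretends "b.txt" failed
fake_get_sgml_file <- function(path) {
    if (path == "b.txt") NA else sub("\\.txt$", ".hdr.sgml", path)
}

# Same pattern as above: one result per filing, NA where nothing was saved
toy$sgml_file <- unlist(lapply(toy$file_name, fake_get_sgml_file))

# Count the failures and list the paths that would need another attempt
n_missing <- sum(is.na(toy$sgml_file))
to_retry <- toy$file_name[is.na(toy$sgml_file)]
print(n_missing)
print(to_retry)
```

On the real data, the last few lines applied to file.list would give the number of failed downloads and the paths to retry.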

Now, extract the “items” associated with each filing. See here for a description of each item. Finally, scan each filing for the presence of certain kinds of items. (This probably isn’t the most efficient code, but it gets through the 58,000+ filings in 11 seconds.)


# Extract the list of items for each filing
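item_scan <- function(sgml_file) {
    # No header file was downloaded for this filing, so there are no items
    if (is.na(sgml_file)) return(character(0))
    con <- file(sgml_file)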
    items <- grep("^<ITEMS>", readLines(con=con), value=TRUE, perl=TRUE)
    close(con)
    items <- gsub("^<ITEMS>","", items, perl=TRUE)
    return(items)
}

file.list$items <- lapply(file.list$sgml_file, item_scan)

# A function to create an indicator for the presence of a given item
has_item <- function(item_list, item) {
    is.element(item, unlist(item_list))
}

# Create indicators for the presence of Items 2.02, 7.01, and 9.01 on each 8-K filing
file.list$has_2.02 <- unlist(lapply(file.list$items, has_item, "2.02"))
file.list$has_7.01 <- unlist(lapply(file.list$items, has_item, "7.01"))
file.list$has_9.01 <- unlist(lapply(file.list$items, has_item, "9.01"))
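
To see what these functions do on a single file, the following sketch writes a few made-up header lines to a temporary file and runs the same logic on it. The sample lines are invented for illustration (a real header file contains many more tags), and compact versions of the two functions are repeated so that the block runs on its own.

```r
# Compact versions of item_scan() and has_item() from above
item_scan <- function(sgml_file) {
    lines <- readLines(sgml_file)
    items <- grep("^<ITEMS>", lines, value=TRUE, perl=TRUE)
    gsub("^<ITEMS>", "", items, perl=TRUE)
}
has_item <- function(item_list, item) {
    is.element(item, unlist(item_list))
}

# Made-up header content, written to a temporary file
tmp <- tempfile(fileext=".hdr.sgml")
writeLines(c("<TYPE>8-K", "<ITEMS>2.02", "<ITEMS>9.01"), tmp)

# Extract the items, then flag the three items of interest
items <- item_scan(tmp)
flags <- sapply(c("2.02", "7.01", "9.01"), function(x) has_item(items, x))
print(items)
print(flags)
unlink(tmp)
```

Only lines that begin with the <ITEMS> tag contribute to the result, so the other tags in the file are ignored.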

11 Responses to Getting SEC filing header files

  1. iangow says:

    I edited the post above to replace “ftp://anonymous:your_id@ftp.sec.gov” with “http://www.sec.gov/Archives”, which results in download speeds an order of magnitude faster.

  2. roger tubby says:

    This is perfect. Could you provide some information on the database objects that you reference:
    fc.cig, comp.secm, comp.company, filings.filings?

    Thanks!

    • iangow says:

      Most of these database objects come from Wharton Research Data Services (WRDS).
      fc.cig: FirstCall Company-Issued Guidance
      comp.secm: Compustat monthly security file
      comp.company: Compustat’s company table
      filings.filings: Pulled together from SEC’s website (see my earlier posting on this)

  3. roger tubby says:

    Thanks for the info. It seems that the WRDS and Compustat data are licensed and not freely available. I work for a very small financial services firm, and they wouldn’t want to obtain a license (if one is available at all).

    I’ll check into parsing the sec.gov data directly. I have a fairly good record of scraping gov’t web sites for info but I’d much rather use a supported interface (FTP/web services.)

    • iangow says:

      I have a bunch of code for working with SEC’s EDGAR site that I’ll try to post over the next week or so.

      • I think what you are doing here is simply great. I’m a PhD candidate researching mutual funds, and I’ve been stuck on downloading NSAR-B filings from EDGAR for a long time; I’m still trying to figure out how to program it. I’d greatly appreciate any hints on how to extract the files and get them nicely structured in a database.

      • iangow says:

        I don’t know much about NSAR-B filings, which are quite different from the filings I’ve been working with. It looks like you might need to write some special code to handle those. But the code on this blog looking at downloading index files and the filings themselves should help you get started.

      • Alright, I just discovered your blog today and I thought you might also have some stuff for NSARs. Yes, basically the goal is to download only the files for which the first column of the index is ‘NSAR-B’.
        I’ll keep on following your blog.
        Cheers

      • iangow says:

        Using the code on the blog, I have a database of SEC filings, so I can query it like this:

        crsp=# SELECT * FROM filings.filings WHERE form_type='NSAR-B' LIMIT 20;
        +-----------+--------------------------------------------------------+-----------+--------+------------+--------------------------------------------+
        | row_names | company_name                                           | form_type | cik    | date_filed | file_name                                  |
        +-----------+--------------------------------------------------------+-----------+--------+------------+--------------------------------------------+
        | 1         | MERRILL LYNCH LIFE VARIABLE ANNUITY SEPARATE ACCOUNT B | NSAR-B    | 880794 | 1993-02-26 | edgar/data/880794/9999999997-05-050433.txt |
        | 4         | 2002 TARGET TERM TRUST INC                             | NSAR-B    | 893227 | 1994-01-31 | edgar/data/893227/0000893227-94-000002.txt |
        | 40        | ACACIA CAPITAL CORP                                    | NSAR-B    | 708950 | 1994-03-08 | edgar/data/708950/0000708950-94-000002.txt |
        | 44        | ACCESSOR FUNDS INC                                     | NSAR-B    | 876603 | 1994-02-24 | edgar/data/876603/0000876603-94-000001.txt |
        | 48        | ACM GOVERNMENT INCOME FUND INC                         | NSAR-B    | 816754 | 1994-03-01 | edgar/data/816754/0000816754-94-000001.txt |
        | 50        | ACM GOVERNMENT SECURITIES FUND INC                     | NSAR-B    | 825650 | 1994-03-01 | edgar/data/825650/0000825650-94-000001.txt |
        | 51        | ACM GOVERNMENT SPECTRUM FUND INC /NY/                  | NSAR-B    | 830624 | 1994-03-01 | edgar/data/830624/0000830624-94-000001.txt |
        | 163       | ADVISORS FUND L P                                      | NSAR-B    | 825201 | 1994-02-28 | edgar/data/825201/0000053798-94-000090.txt |
        | 179       | AETNA INCOME SHARES                                    | NSAR-B    | 2646   | 1994-02-28 | edgar/data/2646/0000002646-94-000001.txt   |
        | 180       | AETNA INVESTMENT ADVISERS FUND INC                     | NSAR-B    | 846799 | 1994-02-28 | edgar/data/846799/0000846799-94-000001.txt |
        | 194       | AETNA SERIES FUND INC                                  | NSAR-B    | 877233 | 1994-02-28 | edgar/data/877233/0000877233-94-000001.txt |
        | 195       | AETNA VARIABLE ENCORE FUND INC                         | NSAR-B    | 2663   | 1994-02-28 | edgar/data/2663/0000002663-94-000001.txt   |
        | 196       | AETNA VARIABLE FUND                                    | NSAR-B    | 2664   | 1994-02-28 | edgar/data/2664/0000002664-94-000001.txt   |
        | 204       | AGGRESSIVE STOCK TRUST                                 | NSAR-B    | 701388 | 1994-02-25 | edgar/data/701388/0000701388-94-000001.txt |
        | 214       | AIM FUNDS GROUP/MA                                     | NSAR-B    | 19034  | 1994-03-14 | edgar/data/19034/0000019034-94-000001.txt  |
        | 216       | AIM STRATEGIC INCOME FUND INC                          | NSAR-B    | 844778 | 1994-03-02 | edgar/data/844778/0000844778-94-000002.txt |
        | 302       | ALGER AMERICAN FUND                                    | NSAR-B    | 832566 | 1994-02-25 | edgar/data/832566/0000832566-94-000001.txt |
        | 307       | ALGER FUND                                             | NSAR-B    | 3521   | 1994-01-05 | edgar/data/3521/0000003521-94-000001.txt   |
        | 358       | ALLIANCE FUND INC                                      | NSAR-B    | 19614  | 1994-02-28 | edgar/data/19614/0000019614-94-000001.txt  |
        | 360       | ALLIANCE MORTGAGE SECURITIES INCOME FUND INC           | NSAR-B    | 725919 | 1994-02-28 | edgar/data/725919/0000725919-94-000001.txt |
        +-----------+--------------------------------------------------------+-----------+--------+------------+--------------------------------------------+
        20 rows in set (0.03 sec)
        
        crsp=#
        
