Extracting files from SEC “complete submission” text filings

An updated version of this post can be found here.

The index files that can be downloaded from the SEC website (see here for more information) provide the location of “complete submission” text filings.
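As a quick illustration (the filing path below is made up, not a real filing), each path in those index files maps onto a URL under the SEC's Archives directory:

```r
# Hypothetical index entry: the path to a complete submission text filing
path <- "edgar/data/123456/0001234567-12-123456.txt"

# The full URL for the complete submission on the SEC website
url <- paste0("https://www.sec.gov/Archives/", path)
```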

Using the SEC header (SGML) files, I identify filings of interest (e.g., 8-K filings with Item 2.02 information within 7 days of a quarterly reporting date per Compustat). I then download these complete submissions to a folder on my computer (/edgar/data/) and extract the files (e.g., PDFs, JPEGs) embedded in them into a folder that is indexed using Recoll (based on Xapian) for fast full-text searches of the underlying files (this is on Linux; similar tools likely exist for Mac OS X or Windows). Below is the code I used to extract the component files. It mostly works (the Excel files don’t seem to extract correctly, either with this code or via the SEC links directly), though I’m sure there’s a better way to do this.

Note that, apart from putting the complete submission files in a different location, I retain the directory structure of the SEC website. I put the complete submissions in a separate location to prevent Recoll from indexing the meaningless binary data embedded in these files.
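A minimal sketch of that layout, using a made-up filing path (the `/edgar/data` root matches the hard-coded path in the function below; the download line is commented out so the sketch runs without network access):

```r
# Hypothetical filing path as it appears in an EDGAR index file
path <- "edgar/data/123456/0001234567-12-123456.txt"

# Remote location of the complete submission on the SEC website
sec.url <- paste0("https://www.sec.gov/Archives/", path)

# Local copy under a separate root, mirroring the SEC directory structure
# (this matches file.path("/edgar/data", file) in the function below)
local.file <- file.path("/edgar/data", path)

# download.file(sec.url, local.file) would fetch the filing
```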

# Go through local copies of SEC text filings and extract the component
# documents from each filing
extract.filings <- function(file, directory) {
    if (is.na(file)) return(NA)
    
    # Read the complete submission as plain text; component documents
    # are delimited by <DOCUMENT>...</DOCUMENT> SGML tags
    webpage <- readLines(file.path("/edgar/data", file))
    
    file.name <- gsub("<FILENAME>","", 
                      grep("<FILENAME>.*$", webpage,  perl=TRUE, value=TRUE))
   
    if (length(file.name)==0) { return(NA) }
    
    # Strip the dashes and extension from the accession number to get a
    # directory name (e.g., 0001234567-12-123456.txt -> 000123456712123456)
    partial.file.dir <- gsub("-(\\d{2})-(\\d{6})\\.txt$", "\\1\\2", file, perl=TRUE)
    file.dir <- file.path(directory, partial.file.dir)
       
    dir.create(file.dir, recursive=TRUE, showWarnings=FALSE)
    
    file.name <- unique(file.path(file.dir, file.name))
    start.line <- grep("<DOCUMENT>.*$", webpage,  perl=TRUE) 
    end.line <- grep("</DOCUMENT>.*$", webpage,  perl=TRUE)     
    
    get.ext <- function(path) { gsub(".*\\.(.*?)$", "\\1", path) }
    
    for (i in seq_along(file.name)) {
        if (file.exists(file.name[i])) {
            next
        }
        
        # Text components are written out as-is (including the SGML tags)
        if (get.ext(file.name[i])=="txt") {
            writeLines(webpage[start.line[i]:end.line[i]], con=file.name[i])
        }
        
        # PDFs are uuencoded between <PDF> tags; write the encoded text
        # to a temporary file and decode it with uudecode
        if (get.ext(file.name[i])=="pdf") {
            temp <- webpage[start.line[i]:end.line[i]]
            pdf.start <- grep("^<PDF>", temp, perl=TRUE) + 1
            pdf.end <- grep("^</PDF>", temp, perl=TRUE) - 1
            t <- tempfile()
            writeLines(temp[pdf.start:pdf.end], con=t)
            system(paste("uudecode -o", file.name[i], t))
            unlink(t)
        }
        
        # Plain-text components are wrapped in <TEXT> tags; write their
        # contents directly to the output file
        if (get.ext(file.name[i]) %in% c("js", "css", "paper", "xsd")) {
            temp <- webpage[start.line[i]:end.line[i]]
            text.start <- grep("^<TEXT>", temp, perl=TRUE) + 1
            text.end <- grep("^</TEXT>", temp, perl=TRUE) - 1
            writeLines(temp[text.start:text.end], con=file.name[i])
        }
        
        # Other binary components are uuencoded between "begin" and "end"
        # lines; write the encoded text out and decode it with uudecode
        if (get.ext(file.name[i]) %in% c("zip", "xls", "jpg", "gif")) {
            temp <- webpage[start.line[i]:end.line[i]]
            uu.start <- grep("^begin", temp, perl=TRUE)
            uu.end <- grep("^end", temp, perl=TRUE)
            t <- tempfile()
            writeLines(temp[uu.start:uu.end], con=t)
            system(paste("uudecode -o", file.name[i], t))
            unlink(t)
        }
    }
    return(partial.file.dir) 
}
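The accession-number-to-directory transform inside the function can be checked in isolation (the file name below is made up):

```r
# Hypothetical complete submission file name
file <- "edgar/data/123456/0001234567-12-123456.txt"

# Same gsub call as in extract.filings: drop the dashes and .txt extension
partial.file.dir <- gsub("-(\\d{2})-(\\d{6})\\.txt$", "\\1\\2", file, perl=TRUE)
```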

2 Responses to Extracting files from SEC “complete submission” text filings

  1. An Nhiên says:

    Hi,
    I just want to ask if you have any code to standardize the format of HTML filings from EDGAR. I downloaded 10-K filings from EDGAR, but they are HTML filings, so I wondered whether you have any code to standardize them.

    • iangow says:

      Not really. Perhaps try the XML package. I think I have an example back in July 2011 using that package.

      Here’s a code snippet using the XML package (with “i” referring to a file, I think):

      pagetree <- htmlTreeParse(i, useInternalNodes = TRUE)
      body <- xpathSApply(pagetree, "//html", xmlValue)
