Extracting files from SEC “complete submission text files”

The index files that can be downloaded from the SEC website (see here for more information) provide the location of “complete submission” text filings.

Using either the index files or the SEC header (SGML) files, I identify filings of interest (e.g., all 10-K filings using the index files, or 8-K filings with Item 2.02 information within 7 days of a quarterly reporting date per Compustat using the SGML files). I then download these complete submissions to a folder on my computer (/edgar/data/) and, using the function below, extract the files (e.g., PDFs, JPEGs) embedded in them into a folder that is indexed using Recoll (based on Xapian) for fast full-text searches of the underlying files. (This is on Linux; similar solutions likely exist for Mac OS X and Windows.)
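As a sketch of the download step (these helper names are mine, not from the original code): base R’s download.file() can fetch a complete submission from EDGAR’s Archives while preserving the site’s directory structure. Note that the SEC now asks automated clients to declare a User-Agent, which download.file() can pass via its headers argument on recent versions of R.

```r
# Hypothetical helpers (not the author's code): build the Archives URL
# for a filing path and download the file into raw_directory, creating
# intermediate directories as needed.
edgar_url <- function(path) {
    paste0("https://www.sec.gov/Archives/", path)
}

get_submission <- function(path, raw_directory = "") {
    local <- file.path(raw_directory, path)
    dir.create(dirname(local), showWarnings = FALSE, recursive = TRUE)
    download.file(edgar_url(path), destfile = local, quiet = TRUE)
    local
}

# e.g., get_submission("edgar/data/1412665/0001144204-09-014344.txt")
```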

Here is the code I use to extract the component files. This mostly works (the Excel files don’t seem to extract correctly, either with this code or via the SEC links directly), but I’m sure there’s a better way to do this. Note that, with the exception of putting the extracted files in a different location, I retain the directory structure of the SEC website. My reason for putting the extracted files in a separate location is to prevent Recoll from indexing the meaningless binary data embedded in the complete submission files.
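For reference, a complete submission text file is an SGML wrapper around the component documents, and the function below simply greps for the wrapper tags. The layout looks roughly like this (an abbreviated sketch, using the example filing discussed later in this post):

```
<SEC-DOCUMENT>0001144204-09-014344.txt : 20090319
<SEC-HEADER>
... filing metadata ...
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>v142581_10k.htm
<TEXT>
... the document itself (HTML here; uuencoded for binary types) ...
</TEXT>
</DOCUMENT>
... further <DOCUMENT> blocks, one per component file ...
</SEC-DOCUMENT>
```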

extract.filings <- function(file) {
    ## A function to extract filings from complete submission text files
    ## submitted to the SEC into the component files contained within them.
    new_location <- file.path(extract_directory, file)
    file <- file.path(raw_directory, file)
    # Read the submission; each component file sits between
    # <DOCUMENT> and </DOCUMENT> tags.
    webpage <- readLines(file)
    file.name <- gsub("<FILENAME>", "",
                      grep("<FILENAME>.*$", webpage, perl=TRUE, value=TRUE))
    # If there are no file names, then the complete submission is simply a
    # text file. Rather than copying this to the new location, I just
    # symlink it (this saves space).
    if (length(file.name)==0) {
        if (!file.exists(new_location)) {
            dir.create(dirname(new_location), showWarnings=FALSE,
                       recursive=TRUE)
            file.symlink(from=file, to=new_location)
        }
        return(dirname(new_location))
    }
    # If we got here, we have a complete submission that isn't simply a text
    # file. Make the parent directory for the component files embedded in the
    # submission (the accession number, minus its dashes, names the directory).
    file.dir <- gsub("-(\\d{2})-(\\d{6})\\.txt$", "\\1\\2", new_location, perl=TRUE)
    dir.create(file.dir, showWarnings=FALSE, recursive=TRUE)
    # Get a list of file names, and the start and end locations of each
    # document within the text file. (I use unique file names, as
    # sometimes--albeit rarely--a filename is repeated.)
    file.name <- unique(file.path(file.dir, file.name))
    start.line <- grep("<DOCUMENT>", webpage, perl=TRUE)
    end.line <- grep("</DOCUMENT>", webpage, perl=TRUE)
    print(file.name)
    for (i in seq_along(file.name)) {
        # Skip the file if it already exists and was extracted recently.
        if (file.exists(file.name[i]) &&
            as.Date(file.info(file.name[i])$ctime) > "2012-02-15") next
        # The extension of the file determines how it is extracted.
        file.ext <- gsub(".*\\.(.*?)$", "\\1", file.name[i])
        temp <- webpage[start.line[i]:end.line[i]]
        if (file.ext %in% c("zip", "xls", "jpg", "gif")) {
            # Binary files are uuencoded; write the encoded portion to a
            # temporary file and decode it with uudecode.
            uu.start <- grep("^begin", temp, perl=TRUE)
            uu.end <- grep("^end", temp, perl=TRUE)
            t <- tempfile()
            writeLines(temp[uu.start:uu.end], con=t)
            print(paste("uudecode -o", file.name[i], t))
            system(paste("uudecode -o", file.name[i], t))
        } else if (file.ext=="txt") {
            # Simple text files can be written out directly.
            writeLines(temp, con=file.name[i])
        } else if (file.ext %in% c("htm", "js", "css", "paper", "xsd")) {
            # Text-based formatted files sit between <TEXT> and </TEXT> tags.
            text.start <- grep("^<TEXT>", temp, perl=TRUE) + 1
            text.end <- grep("^</TEXT>", temp, perl=TRUE) - 1
            writeLines(temp[text.start:text.end], con=file.name[i])
        } else if (file.ext=="pdf") {
            # PDFs are uuencoded between <PDF> and </PDF> tags.
            pdf.start <- grep("^<PDF>", temp, perl=TRUE) + 1
            pdf.end <- grep("^</PDF>", temp, perl=TRUE) - 1
            t <- tempfile()
            writeLines(temp[pdf.start:pdf.end], con=t)
            print(paste("uudecode -o", file.name[i], t))
            system(paste("uudecode -o", file.name[i], t))
        }
    }
    return(dirname(new_location))
}

I try to follow the directory structure of SEC’s EDGAR whenever possible, but I don’t want to index the binary data (e.g., for PDFs or images) embedded in the complete submission text files. So I put the extracted files in a separate directory from the source documents and only index that directory (I have been using Recoll for this purpose). I keep the original filings in /edgar/data and the extracted filings in /hdd/edgar/data (a separate hard drive). So I set

raw_directory <- ""
extract_directory <- "/hdd"

before calling the function above. I have a complete submission text file located at /edgar/data/1412665/0001144204-09-014344.txt (see here for this filing online). An example call to this function looks like this (output included):

> extract.filings("edgar/data/1412665/0001144204-09-014344.txt")
[1] "/hdd/edgar/data/1412665/000114420409014344/v142581_10k.htm"   
[2] "/hdd/edgar/data/1412665/000114420409014344/logo.jpg"          
[3] "/hdd/edgar/data/1412665/000114420409014344/v142581_ex21-1.htm"
[4] "/hdd/edgar/data/1412665/000114420409014344/v142581_ex23-1.htm"
[5] "/hdd/edgar/data/1412665/000114420409014344/v142581_ex31-1.htm"
[6] "/hdd/edgar/data/1412665/000114420409014344/v142581_ex31-2.htm"
[7] "/hdd/edgar/data/1412665/000114420409014344/v142581_ex32-1.htm"
[8] "/hdd/edgar/data/1412665/000114420409014344/v142581_ex32-2.htm"
[1] "/hdd/edgar/data/1412665"
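The example above extracts a single filing. To scale this up, one option (this helper is a sketch of mine, not part of the original code) is to collect the relative paths of all complete submission files under raw_directory with list.files() and then apply extract.filings() to each.

```r
# Hypothetical helper: list all complete submission files under
# raw_directory, as paths relative to the root that extract.filings()
# expects (i.e., beginning with "edgar/data").
list_submissions <- function(raw_directory) {
    file.path("edgar/data",
              list.files(file.path(raw_directory, "edgar", "data"),
                         pattern = "\\.txt$", recursive = TRUE,
                         full.names = FALSE))
}

# Then, with extract.filings() defined as above:
# lapply(list_submissions(raw_directory), extract.filings)
```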

14 Responses to Extracting files from SEC “complete submission text files”

  1. Pingback: Extracting files from SEC “complete submission” text filings | iangow

  2. adrFinance says:

    Hi Ian,

    Your blog is very interesting. Thanks a lot.

    How do you get this information: “edgar/data/1412665/0001144204-09-014344.txt”)
    [1] “/hdd/edgar/data/1412665/000114420409014344/v142581_10k.htm”
    [2] “/hdd/edgar/data/1412665/000114420409014344/logo.jpg”
    [3] “/hdd/edgar/data/1412665/000114420409014344/v142581_ex21-1.htm”
    [4] “/hdd/edgar/data/1412665/000114420409014344/v142581_ex23-1.htm”
    [5] “/hdd/edgar/data/1412665/000114420409014344/v142581_ex31-1.htm”
    [6] “/hdd/edgar/data/1412665/000114420409014344/v142581_ex31-2.htm”
    [7] “/hdd/edgar/data/1412665/000114420409014344/v142581_ex32-1.htm”
    [8] “/hdd/edgar/data/1412665/000114420409014344/v142581_ex32-2.htm”
    [1] “/hdd/edgar/data/1412665”

    Is it possible that you somehow download all forms (e.g. all 10-K) from SEC Edgar without knowing their paths in advance?


    • iangow says:

      Absolutely. See code here. I have code there to download all 13-D and 13-G filings that could be adapted. Note that this uses my PostgreSQL database, but could easily be adapted for other solutions.

      • adrFinance says:

        That’s great, I was not aware of your github. Do you have maybe some guide on how to set up this database? I have not used PostgreSQL but I have used MySQL so normally it should not be extremely different.

  3. Michelle says:

    Hi Ian. This is extremely helpful. Is there a way to download appendices, for example appendices to S-1 filings?

    • iangow says:

They would be included in the “complete submission text files” that are being used here. I just updated the code that downloads the filings (as opposed to the extraction code discussed here); see here.

  4. KG says:

Hi Ian. One of the hardest things to do is to extract text reliably from the HTML/SGML within the SEC filings once I have downloaded them (nested divs, markup-language remnants, etc.). Specifically, do you have a reliable way to, for instance, extract the text of the MD&A section from 10-K filings? Your help is greatly appreciated!

  5. Rascar says:


To begin, I am new to R and am learning it to work with SEC filings, since you seem to have done an amazing amount of work through your script.

    Suppose, I have the path to the submission file. For example, I have this filing at http://www.sec.gov/Archives/edgar/data/1441634/000144163415000002/0001441634-15-000002.txt

Now, how do I get, say, the Form 10-Q data written to a comma-separated file that can later be read by Excel, using the extract.filings function?

For example, I tried extract.filings(“http://www.sec.gov/Archives/edgar/data/1441634/000144163415000002/0001441634-15-000002.txt”) and I get something like

    [61] “C://000144163415000002/R14.htm” “C://000144163415000002/R16.htm”
    [63] “C://000144163415000002/R34.htm” “C://000144163415000002/R51.htm”
    [65] “C://000144163415000002/R21.htm” “C://000144163415000002/R26.htm”
    [67] “C://000144163415000002/R49.htm” “C://000144163415000002/R41.htm”
    [69] “C://000144163415000002/R5.htm” “C://000144163415000002/R10.htm”
    [71] “C://000144163415000002/R58.htm” “C://000144163415000002/R27.htm”
    [73] “C://000144163415000002/FilingSummary.xml” “C://000144163415000002/R38.htm”
    [75] “C://000144163415000002/R20.htm”
    Error in file(con, “w”) : cannot open the connection
    In addition: Warning messages:
    1: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
    there is no package called ‘XML’
    2: In file(con, “w”) :
    cannot open file ‘C://000144163415000002/avgo-02012015x10q.htm’: No such file or directory


    Thank you very much.

  6. Rascar says:

    Iangow. Thanks a lot!! I did try that as well. Then I get something like..

    [75] “C:/000144163415000002/R20.htm”
    [1] “uudecode -o C:/000144163415000002/Financial_Report.xls C:\\Users\\AppData\\Local\\Temp\\RtmpmmAgv7\\file239c395036c8”
    [1] “uudecode -o C:/000144163415000002/0001441634-15-000002-xbrl.zip C:\\Users\\AppData\\Local\\Temp\\RtmpmmAgv7\\file239c74114aa4”
    [1] “C:/”
    Warning messages:
    1: running command ‘uudecode -o C:/000144163415000002/Financial_Report.xls C:\Users\AppData\Local\Temp\RtmpmmAgv7\file239c395036c8’ had status 127
    2: running command ‘uudecode -o C:/000144163415000002/0001441634-15-000002-xbrl.zip C:\Users\AppData\Local\Temp\RtmpmmAgv7\file239c74114aa4’ had status 127

    I am only looking for income statement, balance sheet and statement of cash flow tables.

    • iangow says:

      It seems that the code below works for the XLSX file, but not the XLS file.

      Sys.setenv(EDGAR_DIR="/Volumes/2TB/data") # My local copy of EDGAR is here

      temp <- get_text_file("edgar/data/1441634/0001441634-15-000002.txt")

  7. Tahir Janjua says:

I’m new to R and PostgreSQL, and your blog is very helpful. I just want to know:
> extract.filings(“edgar/data/1412665/0001144204-09-014344.txt”) works for only one txt file (I guess). How can I use the extract.filings function on the whole edgar/data/ directory, i.e., all file names at once? Thank you so very much; hats off for all your work.
