Here is some R code to download SEC index files and put them into a database. This is an alternative to Perl code provided by Andrew Leone here.
First, a function to download the zipped index file from the SEC website, then parse it into an R dataframe:
getSECIndexFile <- function(year, quarter) {
# Download the zipped index file from the SEC website
tf <- tempfile()
result <- try(download.file(
url=paste("ftp://anonymous:your_id@ftp.sec.gov/edgar/full-index/",
year,"/QTR", quarter, "/company.zip",sep=""),
destfile=tf))
# If we didn't encounter and error downloading the file, parse it
# and return as a R data frame
if (!inherits(result, "try-error")) {
# Small function to remove leading and trailing spaces
trim <- function (string) {
gsub("^\\s*(.*?)\\s*$","\\1", string, perl=TRUE)
}
# Read the downloaded file
raw.data <- readLines(con=(zz<- unz(description=tf,
filename="company.idx")))
close(zz)
raw.data <- raw.data[11:length(raw.data)] # Remove the first 10 rows.
# Parse the downloaded file and return the extracted data as a data frame
company_name <- trim(substr(raw.data,1,62))
form_type <- trim(substr(raw.data,63,74))
cik <- trim(substr(raw.data,75,86))
date_filed <- as.Date(substr(raw.data,87,98))
file_name <- trim(substr(raw.data,99,150))
rm(raw.data)
return(data.frame(company_name, form_type, cik, date_filed, file_name))
} else { return(NULL)}
}
Second, a function to add the file to my database:
addIndexFileToDatabase <- function(data) {
if (is.null(data)) return(NULL)
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
pg <- dbConnect(drv, db="crsp")
rs <- dbWriteTable(pg, c("filings", "filings"), data, append=TRUE)
dbDisconnect(pg)
return(rs)
}
Following sentence is out of date (see below). Note that some of the SEC index files contain embedded backslashes, which cause problems for the version of RPostgreSQL on CRAN. I instead downloaded the version available on the Google project site, then installed it by issuing
install.packages("/home/iangow/rpostgresql-read-only/RPostgreSQL/", type="source", repos=NULL) within R.
Finally, a few lines of code to delete the filings table if it already exists, then download all the filing files from 1993 to 2011 and post them in the database.
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
pg <- dbConnect(drv, db="crsp")
dbGetQuery(pg, "DROP TABLE IF EXISTS filings.filings")
for (year in 1993:2011) {
for (quarter in 1:4) {
addIndexFileToDatabase(getSECIndexFile(year, quarter))
}
}
Advertisement
Note that the issue with backslashes and RPostgreSQL has now been resolved in the CRAN version.
Pingback: Extracting files from SEC “complete submission” text filings | iangow