Remote WRDS SAS from R via Stata .dta format

This post is a follow-up to an earlier post. PROC EXPORT exports special missing values as text, which causes my database to choke when such values occur in non-text columns (e.g., nameenddt in crsp.ccmxpf_linktable). So this code uses Stata as an alternative intermediate format.

date <- as.Date("2012-01-06")
ticker <- "GM"

generate_sas_code <- function(ticker, date, tq="t") {
  # This function generates SAS code that will be piped into SAS
  paste("
        PROC SQL;
        CREATE TABLE temp AS
        SELECT * FROM taq.c", tq, "_", format(date, "%Y%m%d"),
        " WHERE symbol =\"", ticker, "\";
        quit;
        
        proc export data=temp
        outfile=\"data.dta\"
        dbms=stata;
        run;", sep="")
}

get_data <- function(ticker, date, tq="t") {
  sas_code <- generate_sas_code(ticker, date, tq)
  temp_file <- tempfile()
  # This command calls SAS on the remote server.
  # -C means "compress output" ... this seems to have an impact even though we're
  # using gzip for compression of the CSV file spat out by SAS after it's
  # been transferred to the local computer (trial and error suggested this was
  # the most efficient approach).
  # -stdio means that SAS will take input from STDIN and output to STDOUT
  sas_command <- paste("ssh -C iangow@wrds.wharton.upenn.edu ",
                       "'sas -stdio -noterminal; cat data.dta' > ",
                       temp_file)
  
  # The following pipes the SAS code to the SAS command. The "intern=TRUE"
  # means that we can capture the output in an R variable.
  system(paste("echo '", sas_code, "' |", sas_command), intern=TRUE)
  library(foreign)
  read.dta(temp_file)
}

# Now get the data from WRDS
system.time(trades <- get_data(ticker, date, tq="t"))
system.time(quotes <- get_data(ticker, date, tq="q"))

This can easily be adapted to create a “fetch data” routine


convertToInteger <- function(vec) {
  # This is a small function that converts numeric vectors to
  # integers if doing so does not lose information
  notNA <- !is.na(vec)
  
  if (all(vec[notNA]==as.integer(vec[notNA]))) {
    return(as.integer(vec))
  } else {
    return(vec)
  }  
}

get_data <- function(libname, datatable) {
  
  sas_code <- paste("
    proc export data=", libname, ".", datatable,"
      outfile=\"data.dta\"
      dbms=stata replace;
    run;", sep="")
  
  temp_file <- tempfile()
  # This command calls SAS on the remote server.
  # -C means "compress output" ... this seems to have an impact even though we're
  # using gzip for compression of the CSV file spat out by SAS after it's
  # been transferred to the local computer (trial and error suggested this was
  # the most efficient approach).
  # -stdio means that SAS will take input from STDIN and output to STDOUT
  sas_command <- paste("ssh -C iangow@wrds.wharton.upenn.edu ",
                       "'sas -stdio -noterminal; cat data.dta' > ",
                       temp_file)
  
  # The following pipes the SAS code to the SAS command. The "intern=TRUE"
  # means that we can capture the output in an R variable.
  system(paste("echo '", sas_code, "' |", sas_command), intern=FALSE)
  library(foreign)
  temp <- read.dta(temp_file)
  
  # Convert numeric vectors to integers if possible
  for (i in names(temp)) {
    if(is.numeric(temp[,i])) { temp[,i] <- convertToInteger(temp[,i]) }
  } 
  
  # Delete the temporary file
  unlink(temp_file)
  return(temp)
}



# Now get the data from WRDS
system.time(ccmxpf_linktable <- get_data("crsp", "ccmxpf_linktable"))

This seems to work quite nicely! (The main wrinkle is that SAS data from WRDS often doesn’t yield integers in cases where it should do so. Using double-precision variables slows matching down a tremendous amount in PostgreSQL, so it is important to get things like years as integers. So I’ve included a kluge to convert to integers when possible.)

If I hadn’t already done so, to get this working, I would’ve needed to get public-key authentication going on WRDS/my computer. Following hints taken from here, I set up a public key. I then copied that key to the WRDS server from the terminal on my computer. (Note that this code assumes you have a directory “.ssh” in your home directory. If not, log into WRDS via SSH, then type “mkdir ~/.ssh” to create this.)

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub | ssh iangow@wrds.wharton.upenn.edu "cat >> ~/.ssh/authorized_keys"
Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s