Executing SQL in a file against an ODBC source from Stata

I recently discovered that Stata has good support for ODBC. While the built-in functionality includes odbc sqlfile, a “batch job alternative to odbc exec,” there seems to be no equivalent for odbc load (even though odbc load is surely the more useful command).

But this is easily addressed using an .ado program I adapted from the one discussed in an earlier post.

I created a file (loadsql2.ado) containing the following:

program define loadsql2
*! Load the output of an SQL file into Stata, version 1.5 (iandgow@gmail.com)
version 13.1
syntax using/, CONN(string)

#delimit ;
tempname sqlfile exec line;

file open `sqlfile' using `"`using'"', read text;
file read `sqlfile' `line';

while r(eof)==0 {;
    local `exec' `"``exec'' ``line'' "';
    file read `sqlfile' `line';
};

file close `sqlfile';

* display `"``exec''"';
odbc load, exec(`"``exec''"') dsn("`conn'") clear;
#delimit cr

end
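In effect, loadsql2 just concatenates the lines of the SQL file into one string and hands that to odbc load. A rough sketch of the file-reading step in Python terms (load_sql is a hypothetical illustrative helper, not part of the ado file):

```python
import tempfile, os

def load_sql(path):
    # Join the file's lines with spaces, much as the while loop in the
    # ado program does when it builds up the `exec' macro.
    with open(path) as f:
        return " ".join(line.strip() for line in f)

# Example: a two-line SQL file collapses to a single query string
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write("SELECT gvkey, datadate\nFROM comp.funda;")
    path = f.name
print(load_sql(path))  # SELECT gvkey, datadate FROM comp.funda;
os.remove(path)
```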

I placed this file in ~/Library/Application\ Support/Stata/ado/personal.

Since I have an ODBC DSN named iangow set up, I can now issue the following (see my earlier post for details on ~/some_sql.sql):

loadsql2 using "~/some_sql.sql", conn("iangow") 

and then I have data in Stata:

. list in 1/5

     +--------------------------------------+
     |  gvkey    datadate       at       ni |
     |--------------------------------------|
  1. | 001073   31dec1985    3.095     .541 |
  2. | 001073   31dec1986    4.422     .584 |
  3. | 001073   31dec1987      3.9    -.359 |
  4. | 001074   31dec1970   34.856    1.716 |
  5. | 001073   31dec1988    2.067   -1.191 |
     +--------------------------------------+
Posted in Mac, PostgreSQL, Stata

Another way to access PostgreSQL from Stata

In an earlier post, I discussed a way of accessing PostgreSQL from Stata. One downside is that this approach would not work for Windows users without significant modification. I asked a question on StackOverflow to see if anyone could identify the necessary modifications.

While one user did answer my StackOverflow question, it seems the easier way to access PostgreSQL from Stata on Windows is via ODBC. It turns out to be fairly straightforward to get ODBC working on Mac OS X as well (see here).

Posted in Uncategorized

GitHub code repository

For anyone reading this blog, I also have a repository on GitHub dedicated to code for compiling data from standard sources (e.g., WRDS). I’m still in the process of documenting this, but the documentation for some parts (e.g., Perl scripts for getting data from WRDS into PostgreSQL) is fairly complete.

Posted in Uncategorized

Installing R on Mac OS from the command line

I just updated R remotely. I wanted to avoid the hassle of logging in via VNC (“Screen Sharing”), so I did this:

wget https://cran.r-project.org/bin/macosx/R-3.2.2.pkg
sudo installer -pkg R-3.2.2.pkg  -target /
rm R-3.2.2.pkg 
Posted in Mac, R

Prototyping PL/Python functions

I’ve seen a variable called numerical intensity crop up in a couple of papers recently (one example is Lundholm, Rogo, and Zhang (2014)).

I spent a little time coming up with a Python program to extract numbers from text.

the_text = """
Thanks, Steve. Good morning everyone. Yesterday we reported second quarter revenues of $617 million -- a 17% increase over the second quarter of 2004. We reported net earnings of $24.1 million or $0.78 per diluted share for the quarter compared with net earnings of $15.3 million or $0.50 per diluted share for the same period last year.
 Our reported net earnings for both periods included special charges.
 For the 2005 period, our reported earnings include a charge of $118,000 net of tax (ph) related to the early extinguishment of debt. Excluding this charge, earnings for the 2005 second quarter would be $0.80 per share which is above the Wall Street average estimate.
 For the first 6 months net sales increased 23% to $1.15 billion from $938 million in 2004. Net earnings were $30.5 million or $1.00 per diluted share compared with $21 million or $0.77 per diluted share.
 Excluding special items earnings per diluted share for the first half of 2005 were $1.19 up 53% from the first half of 2004. Our tax rate for the quarter was 21% down from 30% in the first quarter of 2005. The reduction in the effective tax rate for the quarter was due to clarity we received concerning the realization of certain tax benefits that were previously reserved against due to their uncertainty. We anticipate having an effective tax rate for the full fiscal year of 27%.
 Now I will review our segment financial results. Crane segment sales for the second quarter were $427 million, an increase of 29% compared to the same quarter last year. Operating earnings were $35.5 million, up 97% and operating margin improved to 8.3% from 5.4% a year ago. This improved (indiscernible) performances illustrates this significant operating leverage of our Crane business.
 Crane Boot (ph) backlog of $530 million is up sharply from the year ago quarter and we are seeing continued strong demand for our Tower Crane, mobile telescopic crane and crawler cranes in all markets except for crawler cranes in North America.
 Crane backlog has been increasing steadily since (indiscernible) in 2003 and has continued to be strong through the positive outlook in many of our end-user markets. Our Crane backlog is not dependent on any one geography or product line.
 Foodservice sales of $126.5 million declined 3.4% from the June 16, 2004 quarter as a result of cooler than normal weather that dampened demand for refrigeration and icemaking equipment in April and May. Lower sales volumes at Manitowoc Ice coupled with continued margin pressures at diversified refrigeration -- DRI -- our contract manufacturing unit combined to reduce Foodservice earnings and margins for the quarter.
 Tim Kraus will provide more detail on our Foodservice results later in the call.
 In our Marine segment, revenue for the quarter was essentially flat from 2004 at $63.4 million. Operating results were a loss of $2.7 million during the quarter compared with Operating earnings of $2.7 million in the 2004 quarter. As we indicated during our first quarter conference call we anticipated that Marine was both disappointing results for the second quarter and the loss is directly related to 3 specific projects.
 One of the projects has now been delivered to the customer, one is currently on route for delivery and the third project will leave our yard next week.
 Among the issues creating the extreme inefficiencies on these projects were considerable revision and were previously inspected and certified by a third party. Overruns in delivery costs and additional hours devoted to bringing the vessel into weight-compliance requirements. These projects are now complete and we anticipate that the Marine segment will return to modest profitability in the second half of 2005.
 Cash from operations in the quarter was positive $13.1 million compared with the use of cash of $1.5 million in the same quarter last year.
 We remain confident that we will achieve our objective of generating net debt reduction of $50 million for the full year. Manitowoc's normal seasonal pattern is to use cash in the first half of the year followed by strong cash generation in the second half. We expect this pattern to continue to hold true this year.
 Manitowoc improved its EVA performance in the first half of 2005 by more than fivefold over the same period last year. We expect this positive trend to continue for the full year, which continues the year-over-year improvement we've achieved since 2003. This improvement in EVA performance validates the strategy and execution of our acquisition.
 My final comments are in earnings guidance which we are tightening to $2.15 to $2.30 per share. This is roughly a 50% increase over last year despite worse than expected year-to-date results in our Foodservice and Marine segments.
 Though market drivers in the North American Lattice-boom Crawler Crane industry continued to improve this is one market sector -- truly the only one in Crane -- that has not experienced an appreciable upturn from (indiscernible) levels. A meaningful recovery in the North American crawler crane revenues is not anticipated until 2006.
 Clearly, the Crane business is driving our improved results. And our visibility is for continued strong market condition in cranes for the foreseeable future. We are seeing good indicators in most of the construction and infrastructure market and we believe we are still several quarters away from achieving historic peak cycle earnings.
 With that I will turn the call over to Tim Kraus to talk about our Foodservice business. Tim."""

import re

def numbers(text):
    # Regular expression to pick up months, as described in
    # Lundholm, Rogo and Zhang (2014)
    months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
              'August', 'September', 'October', 'November', 'December']
    months_abbrev = [month[0:3] + r'\.?' for month in months]
    month_re = '(?:' + '|'.join(months + months_abbrev) + ')'

    # Years are four-digit numbers without commas.
    # A better rule would use the fact that years begin with 19 or 20, etc.
    years = r'\s*\d{4}\.?'

    # Keep matching numbers, along with the month portion, if any
    matches = re.findall('(?:' + month_re + r'\s+)?[0-9][0-9,\.]*', text)

    # Return matches that aren't years and aren't preceded by months
    return [a_match for a_match in matches
            if not re.match(years, a_match)
            and not re.match(month_re, a_match)]
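A quick sanity check of the logic above on a fragment of the sample text (condensed into a self-contained snippet): plain numbers are kept, while month-year expressions are dropped.

```python
import re

def numbers(text):
    # Same logic as the function above, condensed for a quick test
    months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
              'August', 'September', 'October', 'November', 'December']
    month_re = '(?:' + '|'.join(months + [m[:3] + r'\.?' for m in months]) + ')'
    years = r'\s*\d{4}\.?'
    matches = re.findall('(?:' + month_re + r'\s+)?[0-9][0-9,\.]*', text)
    return [m for m in matches
            if not re.match(years, m) and not re.match(month_re, m)]

sample = "Revenues were $617 million, a 17% increase over June 2004."
print(numbers(sample))  # ['617', '17']
```

Note that “June 2004.” is matched as a single month-plus-number token and then filtered out, so the year never shows up as a free-standing number.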

Once I had the code above working (you could copy-and-paste it into any program that can execute Python code), I simply pasted it into a simple SQL wrapper and changed the return value to be the number of numbers (rather than the numbers themselves):

CREATE OR REPLACE FUNCTION num_count(text text) 
RETURNS integer AS
$BODY$
import re

# Regular expression to pick up months, as described in
# Lundholm, Rogo and Zhang (2014)
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
months_abbrev = [month[0:3] + r'\.?' for month in months]
month_re = '(?:' + '|'.join(months + months_abbrev) + ')'

# Years are four-digit numbers without commas.
# A better rule would use the fact that years begin with 19 or 20, etc.
years = r'\s*\d{4}\.?'

# Keep matching numbers, along with the month portion, if any
matches = re.findall('(?:' + month_re + r'\s+)?[0-9][0-9,\.]*', text)

# Return the number of matches that aren't years and aren't preceded by months
return len([a_match for a_match in matches
            if not re.match(years, a_match)
            and not re.match(month_re, a_match)])
$BODY$ LANGUAGE plpythonu;

Now I can call this function from SQL:

WITH raw_data AS (
    SELECT word_count(speaker_text), num_count(speaker_text)
    FROM streetevents.speaker_data
    LIMIT 10)
SELECT *, CASE WHEN word_count > 0 THEN num_count::float8/word_count END AS prop_nums
FROM raw_data;

Here is the output:


 word_count | num_count |      prop_nums      
------------+-----------+---------------------
        804 |         8 | 0.00995024875621891
        753 |        12 |  0.0159362549800797
       1010 |        40 |  0.0396039603960396
        373 |         5 |  0.0134048257372654
         67 |         0 |                   0
         32 |         3 |             0.09375
         13 |         0 |                   0
          4 |         0 |                   0
        126 |         0 |                   0
         12 |         0 |                   0
(10 rows)
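The CASE expression in the query guards against division by zero for passages with no words. The same guard in Python terms (a sketch; prop_nums is just an illustrative name):

```python
def prop_nums(num_count, word_count):
    # Mirror the SQL CASE: NULL (None) when word_count is zero
    return num_count / word_count if word_count > 0 else None

print(prop_nums(8, 804))  # ~0.00995, matching the first row above
print(prop_nums(0, 0))    # None
```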

Posted in PostgreSQL, Python

Upgrading PostgreSQL from 9.4 beta1 to 9.4 beta2

MacPorts installs PostgreSQL 9.4 beta 2 in the same directory that PostgreSQL 9.4 beta 1 was in, and going from beta 1 to beta 2 requires upgrading the cluster for it to work. So the upgrade is more like going from 9.3 to 9.4, but with some additional (albeit minor) complications. Here’s how I upgraded.

0. Shut down existing server

sudo launchctl unload /Library/LaunchDaemons/org.macports.postgresql94-server.plist

1. Move old binary and data directories

sudo mv /opt/local/lib/postgresql94/bin /opt/local/lib/postgresql94/bin_old
sudo mv /opt/local/var/db/postgresql94/defaultdb /opt/local/var/db/postgresql94/defaultdb_old

2. Install new version of PostgreSQL

sudo port install postgresql94 +perl +python postgresql94-server

3. Set up new cluster

sudo mkdir -p /opt/local/var/db/postgresql94/defaultdb
sudo chown postgres:postgres /opt/local/var/db/postgresql94/defaultdb
sudo su postgres -c '/opt/local/lib/postgresql94/bin/initdb -D /opt/local/var/db/postgresql94/defaultdb'

My existing cluster has PL/R installed, so I need to install that too:

cd ~
wget http://www.joeconway.com/plr/plr-
tar xzf plr-
cd plr
R_HOME="/Library/Frameworks/R.framework/Resources" USE_PGXS=1 PG_CONFIG="/opt/local/lib/postgresql94/bin/pg_config" make
sudo R_HOME="/Library/Frameworks/R.framework/Resources" USE_PGXS=1 PG_CONFIG="/opt/local/lib/postgresql94/bin/pg_config" make install

4. Run pg_upgrade.

The upgrade routine pg_upgrade needs to be run by the postgres user, and postgres needs read-write access to the directory it is run in. So I make a directory to run it from.

cd ~
mkdir pg
sudo chown postgres:postgres pg
cd pg
sudo su postgres -c '/opt/local/lib/postgresql94/bin/pg_upgrade -b /opt/local/lib/postgresql94/bin_old -B /opt/local/lib/postgresql94/bin -d /opt/local/var/db/postgresql94/defaultdb_old -D /opt/local/var/db/postgresql94/defaultdb'

5. Get new cluster going.

sudo port select postgresql postgresql94
sudo defaults write /Library/LaunchDaemons/org.macports.postgresql94-server.plist Disabled -bool false
sudo launchctl load /Library/LaunchDaemons/org.macports.postgresql94-server.plist
Posted in Mac, PostgreSQL

Get “tone” from corporate disclosures: PostgreSQL, Python, and R.

The paper by Loughran and McDonald (2011) develops lists of words that are intended to reflect “tone” in financial text. The following code retrieves these lists and applies them to a large body of text (over a billion words) from 250,000+ conference calls.

This provides a nice illustration of R, PostgreSQL, and Python working together. It also illustrates the use of JSON data as a way to store Python dictionaries in PostgreSQL.

1. Get tone dictionary

The following code gets the word lists from Bill McDonald’s website and puts them into a PostgreSQL table. I store the list of words for each category (i.e., positive, negative, etc.) as an array, as arrays are compact and easy to work with in Python (they become lists; see the PL/Python function below).

library(RPostgreSQL)

category <- c("positive", "negative", "uncertainty", 
              "litigious", "modal_strong", "modal_weak")
# URLs for the remaining categories follow the same naming pattern
# as the first one.
base_url <- "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists/"
url <- paste0(base_url,
              c("LoughranMcDonald_Positive.csv",
                "LoughranMcDonald_Negative.csv",
                "LoughranMcDonald_Uncertainty.csv",
                "LoughranMcDonald_Litigious.csv",
                "LoughranMcDonald_ModalStrong.csv",
                "LoughranMcDonald_ModalWeak.csv"))

df <- data.frame(category, url, stringsAsFactors=FALSE)

getWords <- function(url) {
    words <- read.csv(url, as.is=TRUE)    
    paste(words[, 1], collapse=",")
}
df$words <- unlist(lapply(df$url, getWords))

pg <- dbConnect(PostgreSQL())
rs <- dbWriteTable(pg, "lm_tone", df, row.names=FALSE, overwrite=TRUE)

rs <- dbGetQuery(pg, "ALTER TABLE lm_tone ADD COLUMN word_list text[];")
rs <- dbGetQuery(pg, "UPDATE lm_tone SET word_list = regexp_split_to_array(words, ',')")
rs <- dbGetQuery(pg, "ALTER TABLE lm_tone DROP COLUMN words;")
rs <- dbDisconnect(pg)
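The getWords step (take the first column of each CSV and join it with commas) is simple enough to sanity-check outside R. A sketch of the same transformation in Python terms (get_words is a hypothetical helper, not part of the post’s code):

```python
import csv, io

def get_words(csv_text):
    # read.csv treats the first row as a header, so skip it,
    # then join the first column with commas
    rows = list(csv.reader(io.StringIO(csv_text)))
    return ",".join(row[0] for row in rows[1:])

sample = "word\nABLE\nABUNDANT\nACCLAIMED"
print(get_words(sample))  # ABLE,ABUNDANT,ACCLAIMED
```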

2. Create a function to extract tone counts

The following PL/Python function began life as standalone Python code to count the number of words from given word lists appearing in a passage of text. It was easy to adapt this code to sit inside a PostgreSQL function, swap out the word lists (from LIWC, as used on another project, to the Loughran and McDonald (2011) tone words), and then run it on the 31 million+ passages of text (running on twelve cores, this took around 5 hours).

Most of the Python code sets up the regular expressions, storing them in SD, a dictionary that PL/Python preserves between calls. After the first call to the function in a session, only the last three lines of code actually do anything.

CREATE OR REPLACE FUNCTION tone_count(the_text text)  
RETURNS text AS
$CODE$
    # Function to return the number of matches against each tone
    # category in a passage of text
    if 're' in SD:
        re = SD['re']
        json = SD['json']
    else:
        import re, json
        SD['re'] = re
        SD['json'] = json

    if 'regex_list' in SD:
        regex_list = SD['regex_list']
        categories = SD['categories']
    else:
        rv = plpy.execute("SELECT category FROM lm_tone")
        categories = [r["category"] for r in rv]

        # Prepare the query for the word lists.
        plan = plpy.prepare("""
            SELECT word_list
            FROM lm_tone 
            WHERE category = $1""", ["text"])

        mod_word_list = {}
        for cat in categories:
            rows = list(plpy.cursor(plan, [cat]))
            word_list = rows[0]['word_list']
            mod_word_list[cat] = [word.lower() for word in word_list]

        # Pre-compile regular expressions.
        regex_list = {}
        for key in mod_word_list.keys():
            regex = '\\b(?:' + '|'.join(mod_word_list[key]) + ')\\b'
            regex_list[key] = re.compile(regex)
        SD['regex_list'] = regex_list
        SD['categories'] = categories

    # Count matches for each category and return the counts as JSON.
    text = re.sub(u'\u2019', "'", the_text).lower()
    the_dict = {category: len(re.findall(regex_list[category], text))
                for category in categories}
    return json.dumps(the_dict)
$CODE$ LANGUAGE plpythonu;
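The caching pattern is easier to see outside PostgreSQL. Here is a plain-Python sketch, with an ordinary dictionary standing in for SD and hard-coded word lists standing in for the lm_tone table (illustrative only, not the deployed function):

```python
import re, json

SD = {}  # stand-in for PL/Python's per-session SD dictionary

def tone_count(the_text, word_lists):
    if 'regex_list' in SD:
        regex_list = SD['regex_list']  # cached after the first call
    else:
        # Compile one whole-word regex per category, then cache it
        regex_list = {cat: re.compile(r'\b(?:' + '|'.join(words) + r')\b')
                      for cat, words in word_lists.items()}
        SD['regex_list'] = regex_list
    text = the_text.lower()
    return json.dumps({cat: len(regex.findall(text))
                       for cat, regex in regex_list.items()})

lists = {'positive': ['gain', 'improve'], 'negative': ['loss', 'decline']}
out = tone_count("The gain offset the loss; we improve while they decline.",
                 lists)
print(out)  # two matches in each category
```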

3. Create an R function to get tone data for a file

So I have a Python function that I can call within PostgreSQL. But I’m more facile with R, especially when it comes to parallel processing. Additionally, the RJSONIO package makes it easy to convert each key of the dictionary Python creates into a column of a data frame.

library(RPostgreSQL)
library(RJSONIO)

# Need to run word_count.sql first.
addToneData <- function(file_name) {
    # Function to get tone count data for all speakers in a call

    pg <- dbConnect(PostgreSQL())

    # Get tone data. Data is JSON converted to text.
    tone_raw <- dbGetQuery(pg, paste0("
        SELECT file_name, speaker_name, tone_count(string_agg(speaker_text, ' '))
        FROM streetevents.speaker_data
        WHERE file_name ='", file_name, "'
        GROUP BY file_name, speaker_name"))

    # Convert JSON-as-text to records where each key becomes a column
    tone_data <- as.data.frame(do.call(rbind,
                                       lapply(tone_raw$tone_count, fromJSON)))

    # Convert JSON data from numeric to integers
    for (i in names(tone_data)) {
        tone_data[, i] <- as.integer(tone_data[, i])
    }

    # Combine converted JSON data with other fields
    tone_data <- cbind(subset(tone_raw, TRUE, select=-tone_count), tone_data)

    # Put data back in database.
    dbWriteTable(pg, "word_counts", tone_data, append=TRUE, row.names=FALSE)

    rs <- dbDisconnect(pg)
}
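What fromJSON plus rbind accomplish, turning each JSON key into its own column alongside the original fields, looks like this in Python terms (a sketch with made-up sample data):

```python
import json

tone_raw = [
    {"file_name": "call1.xml", "speaker_name": "CEO",
     "tone_count": '{"positive": 2, "negative": 1}'},
    {"file_name": "call1.xml", "speaker_name": "CFO",
     "tone_count": '{"positive": 0, "negative": 3}'},
]

# Drop the JSON column and splice in one integer column per key
tone_data = [
    {**{k: v for k, v in row.items() if k != "tone_count"},
     **{k: int(v) for k, v in json.loads(row["tone_count"]).items()}}
    for row in tone_raw
]
print(tone_data[0])
```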

4. Apply R function.

library(RPostgreSQL)
library(parallel)

pg <- dbConnect(PostgreSQL())

dbGetQuery(pg, "DROP TABLE IF EXISTS word_counts")

# Get a list of file names for which we need to get tone data.
file_names <-  dbGetQuery(pg, "
    SELECT DISTINCT file_name
    FROM streetevents.calls
    WHERE call_type=1")

# Apply function to get tone data. Run on 12 cores.
system.time(temp <- mclapply(file_names$file_name, addToneData, mc.cores=12))
rs <- dbGetQuery(pg, "
    SET maintenance_work_mem='1GB';
    CREATE INDEX ON word_counts (file_name)")
rs <- dbDisconnect(pg)
Posted in Uncategorized