Prototyping PL/Python functions

I’ve seen a variable called numerical intensity crop up in a couple of papers recently (one example is Lundholm, Rogo, and Zhang (2014)).

I spent a little time coming up with a Python program to extract numbers from text.

the_text = """
Thanks, Steve. Good morning everyone. Yesterday we reported second quarter revenues of $617 million -- a 17% increase over the second quarter of 2004. We reported net earnings of $24.1 million or $0.78 per diluted share for the quarter compared with net earnings of $15.3 million or $0.50 per diluted share for the same period last year.
 Our reported net earnings for both periods included special charges.
 For the 2005 period, our reported earnings include a charge of $118,000 net of tax (ph) related to the early extinguishment of debt. Excluding this charge, earnings for the 2005 second quarter would be $0.80 per share which is above the Wall Street average estimate.
 For the first 6 months net sales increased 23% to $1.15 billion from $938 million in 2004. Net earnings were $30.5 million or $1.00 per diluted share compared with $21 million or $0.77 per diluted share.
 Excluding special items earnings per diluted share for the first half of 2005 were $1.19 up 53% from the first half of 2004. Our tax rate for the quarter was 21% down from 30% in the first quarter of 2005. The reduction in the effective tax rate for the quarter was due to clarity we received concerning the realization of certain tax benefits that were previously reserved against due to their uncertainty. We anticipate having an effective tax rate for the full fiscal year of 27%.
 Now I will review our segment financial results. Crane segment sales for the second quarter were $427 million, an increase of 29% compared to the same quarter last year. Operating earnings were $35.5 million, up 97% and operating margin improved to 8.3% from 5.4% a year ago. This improved (indiscernible) performances illustrates this significant operating leverage of our Crane business.
 Crane Boot (ph) backlog of $530 million is up sharply from the year ago quarter and we are seeing continued strong demand for our Tower Crane, mobile telescopic crane and crawler cranes in all markets except for crawler cranes in North America.
 Crane backlog has been increasing steadily since (indiscernible) in 2003 and has continued to be strong through the positive outlook in many of our end-user markets. Our Crane backlog is not dependent on any one geography or product line.
 Foodservice sales of $126.5 million declined 3.4% from the June 16, 2004 quarter as a result of cooler than normal weather that dampened demand for refrigeration and icemaking equipment in April and May. Lower sales volumes at Manitowoc Ice coupled with continued margin pressures at diversified refrigeration -- DRI -- our contract manufacturing unit combined to reduce Foodservice earnings and margins for the quarter.
 Tim Kraus will provide more detail on our Foodservice results later in the call.
 In our Marine segment, revenue for the quarter was essentially flat from 2004 at $63.4 million. Operating results were a loss of $2.7 million during the quarter compared with Operating earnings of $2.7 million in the 2004 quarter. As we indicated during our first quarter conference call we anticipated that Marine was both disappointing results for the second quarter and the loss is directly related to 3 specific projects.
 One of the projects has now been delivered to the customer, one is currently on route for delivery and the third project will leave our yard next week.
 Among the issues creating the extreme inefficiencies on these projects were considerable revision and were previously inspected and certified by a third party. Overruns in delivery costs and additional hours devoted to bringing the vessel into weight-compliance requirements. These projects are now complete and we anticipate that the Marine segment will return to modest profitability in the second half of 2005.
 Cash from operations in the quarter was positive $13.1 million compared with the use of cash of $1.5 million in the same quarter last year.
 We remain confident that we will achieve our objective of generating net debt reduction of $50 million for the full year. Manitowoc's normal seasonal pattern is to use cash in the first half of the year followed by strong cash generation in the second half. We expect this pattern to continue to hold true this year.
 Manitowoc improved its EVA performance in the first half of 2005 by more than fivefold over the same period last year. We expect this positive trend to continue for the full year, which continues the year-over-year improvement we've achieved since 2003. This improvement in EVA performance validates the strategy and execution of our acquisition.
 My final comments are in earnings guidance which we are tightening to $2.15 to $2.30 per share. This is roughly a 50% increase over last year despite worse than expected year-to-date results in our Foodservice and Marine segments.
 Though market drivers in the North American Lattice-boom Crawler Crane industry continued to improve this is one market sector -- truly the only one in Crane -- that has not experienced an appreciable upturn from (indiscernible) levels. A meaningful recovery in the North American crawler crane revenues is not anticipated until 2006.
 Clearly, the Crane business is driving our improved results. And our visibility is for continued strong market condition in cranes for the foreseeable future. We are seeing good indicators in most of the construction and infrastructure market and we believe we are still several quarters away from achieving historic peak cycle earnings.
 With that I will turn the call over to Tim Kraus to talk about our Foodservice business. Tim."""

def numbers(text):

	import re
	# Regular expression to pick up months as described in Lundholm, Rogo and Zhang (2014)
	months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
					'August', 'September', 'October', 
					'November', 'December']		
	months_abbrev = [ month[0:3] + '\.?' for month in months]
	month_re = '(?:' + '|'.join(months + months_abbrev) + ')'

	# Years are four-digit numbers without commas.
	# I have a better rule using the fact that years will begin with 19 or 20, etc.
	years = '\s*\d{4}\.?'

	# Keep matching numbers, as well as month portion, if any
	matches =  re.findall('(?:' + month_re + '\s+)?[0-9][0-9,\.]*', text)

	# return matches that aren't years and aren't preceded by months
	return [ a_match for a_match in matches
				if not re.match(years, a_match) 
					and not re.match(month_re, a_match)]

Once I had the code above working (you could copy-and-paste it into any program that can execute Python code), I simply pasted it into a simple SQL wrapper and changed the return value to be the number of numbers (rather than the numbers themselves):

CREATE OR REPLACE FUNCTION num_count(text text) 
RETURNS integer AS
import re

# Regular expression to pick up months as described in Lundholm, Rogo and Zhang (2014)
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
                'August', 'September', 'October', 
                'November', 'December']		
months_abbrev = [ month[0:3] + '\.?' for month in months]
month_re = '(?:' + '|'.join(months + months_abbrev) + ')'

# Years are four-digit numbers without commas.
# I have a better rule using the fact that years will begin with 19 or 20, etc.
years = '\s*\d{4}\.?'

# Keep matching numbers, as well as month portion, if any
matches =  re.findall('(?:' + month_re + '\s+)?[0-9][0-9,\.]*', text)

# return matches that aren't years and aren't preceded by months
return len([ a_match for a_match in matches
            if not re.match(years, a_match) 
                and not re.match(month_re, a_match)])
$BODY$ LANGUAGE plpythonu;

Now I can call this function from SQL:

WITH raw_data AS (
    SELECT word_count(speaker_text), num_count(speaker_text)
    FROM streetevents.speaker_data
    LIMIT 10)
SELECT *, CASE WHEN word_count > 0 THEN num_count::float8/word_count END AS prop_nums
FROM raw_data;

Here is the output:

=# WITH raw_data AS (
(#     SELECT word_count(speaker_text), num_count(speaker_text)
(#     FROM streetevents.speaker_data
(#     LIMIT 10)
-# SELECT *, CASE WHEN word_count > 0 THEN num_count::float8/word_count END AS prop_nums
-# FROM raw_data;

 word_count | num_count |      prop_nums      
        804 |         8 | 0.00995024875621891
        753 |        12 |  0.0159362549800797
       1010 |        40 |  0.0396039603960396
        373 |         5 |  0.0134048257372654
         67 |         0 |                   0
         32 |         3 |             0.09375
         13 |         0 |                   0
          4 |         0 |                   0
        126 |         0 |                   0
         12 |         0 |                   0
(10 rows)

Posted in PostgreSQL, Python | Tagged , , , , | Leave a comment

Upgrading PostgreSQL from 9.4 beta1 to 9.4 beta2

MacPorts installs PostgreSQL 9.4 beta2 in the same directory as PostgreSQL 9.4 beta 1 was in. And going from beta 1 to beta 2 requires an upgrade for the cluster to work. So upgrading is more like going from 9.3 to 9.4, but with some additional (albeit minor) complications. Here’s how I upgraded.

0. Shut down existing server

sudo launchctl unload /Library/LaunchDaemons/org.macports.postgresql94-server.plist

1. Move old binary and data directories

sudo mv /opt/local/lib/postgresql94/bin /opt/local/lib/postgresql94/bin_old
sudo mv /opt/local/var/db/postgresql94/defaultdb /opt/local/var/db/postgresql94/defaultdb_old

2. Install new version of PostgreSQL

sudo port install postgresql94 +perl +python postgresql94-server

3. Set up new cluster

sudo mkdir -p /opt/local/var/db/postgresql94/defaultdb
sudo chown postgres:postgres /opt/local/var/db/postgresql94/defaultdb
sudo su postgres -c '/opt/local/lib/postgresql94/bin/initdb -D /opt/local/var/db/postgresql94/defaultdb'

My existing cluster has PL/R installed, so I need to install that too:

cd ~
tar xzf plr-
cd plr
R_HOME="/Library/Frameworks/R.framework/Resources" USE_PGXS=1 PG_CONFIG="/opt/local/lib/postgresql94/bin/pg_config" make
sudo R_HOME="/Library/Frameworks/R.framework/Resources" USE_PGXS=1 PG_CONFIG="/opt/local/lib/postgresql94/bin/pg_config" make install

4. Run pg_upgrade.

The upgrade routine pg_upgrade wants to be run by postgres, but needs postgres to have read-write access to the directory it is run in. So I make a directory to run it from.

cd ~
mkdir pg
sudo chown postgres:postgres pg
cd pg
sudo su postgres -c '/opt/local/lib/postgresql94/bin/pg_upgrade -b /opt/local/lib/postgresql94/bin_old -B /opt/local/lib/postgresql94/bin -d /opt/local/var/db/postgresql94/defaultdb_old -D /opt/local/var/db/postgresql94/defaultdb'

5. Get new cluster going.

sudo port select postgresql postgresql94
sudo defaults write /Library/LaunchDaemons/org.macports.postgresql94-server.plist Disabled -bool false
sudo launchctl load /Library/LaunchDaemons/org.macports.postgresql94-server.plist
Posted in Mac, PostgreSQL, Uncategorized | Tagged , , | Leave a comment

Get “tone” from corporate disclosures: PostgreSQL, Python, and R.

The paper by Loughran and McDonald (2011) develops lists of words that are intended to reflect “tone” in financial text. The following code retrieves these lists and applies them to a large body of text (over a billion words) from 250,000+ conference calls.

This provides a nice illustration of R, PostgreSQL and Python working together. I also illustrates the use of JSON data as a way to store Python dictionaries in PostgreSQL.

1. Get tone dictionary

The following code gets the dictionaries from Bill McDonald’s website and puts them into a PostgreSQL table. I store the lists of words for each category (i.e., positive, etc.) as an array, as arrays are compact and easy to work with in Python (they become lists, see the PL/Python function below).

category <- c("positive", "negative", "uncertainty", 
                "litigious", "modal_strong", "modal_weak")
url <- c("",

df <- data.frame(category, url, stringsAsFactors=FALSE)

getWords <- function(url) {
    words <- read.csv(url,    
    paste(words[,1], collapse=",")
df$words <- unlist(lapply(df$url, getWords))

pg <- dbConnect(PostgreSQL())
rs <- dbWriteTable(pg, "lm_tone", df, row.names=FALSE, overwrite=TRUE)

rs <- dbGetQuery(pg, "ALTER TABLE lm_tone ADD COLUMN word_list text[];")
rs <- dbGetQuery(pg, "UPDATE lm_tone SET word_list = regexp_split_to_array(words, ',')")
rs <- dbGetQuery(pg, "ALTER TABLE lm_tone DROP COLUMN words;")
rs <- dbDisconnect(pg)

2. Create a function to extract tone counts

The following PL/Python function began life as standalone Python code to count the number of words from coming from lists of words appearing in a passage of text. It was easy to adapt this code to put inside a PostgreSQL function, change out the list of words (from LIWC, as used on another project, to the Loughran and McDonald (2011) tone words) and the run it on the 31 million+ passages of text (running on twelve cores, this took around 5 hours).

Most of the Python code is setting up the regular expressions (storing them in SD, a Python dictionary in PostgreSQL) After the first call to the function in a session, only the last three lines of code actually do anything.

CREATE OR REPLACE FUNCTION tone_count(the_text text)  
    if 're' in SD:
        re = SD['re']
        json = SD['json']
        import re, json
        SD['re'] = re
        SD['json'] = json

    if SD.has_key("regex_list"):
        regex_list = SD["regex_list"]
        categories = SD["categories"]
        rv = plpy.execute("SELECT category FROM lm_tone")

        categories = [ (r["category"]) for r in rv]

        # Prepare the regular expressions.
        plan = plpy.prepare("""
            SELECT word_list
            FROM lm_tone 
            WHERE category = $1""", ["text"])

        mod_word_list = {}
        for cat in categories:
            rows = list(plpy.cursor(plan, [cat]))
            word_list = rows[0]['word_list']
            mod_word_list[cat] = [word.lower() for word in word_list]

        # Pre-compile regular expressions.
        regex_list = {}
        for key in mod_word_list.keys():
            regex = '\\b(?:' + '|'.join(mod_word_list[key]) + ')\\b'
            regex_list[key] = re.compile(regex)
        SD["regex_list"] = regex_list
        SD["categories"] = categories

    # rest of function
    """Function to return number of matches against a LIWC category in a text"""
    text = re.sub(u'\u2019', "'", the_text).lower()
    the_dict = {category: len(re.findall(regex_list[category], text)) for category in categories}
    return json.dumps(the_dict)

$CODE$ LANGUAGE plpythonu;

3. Create an R function to get tone data for a file

So I have a Python function that I can call within PostgreSQL. But I’m more facile with R, especially when it comes to parallel processing. Additionally, the RJSONIO package makes it easy to convert each key of the dictionary Python creates into a column of a data frame.

# Need to run word_count.sql first.
addToneData <- function(file_name) {
    # Function to get word count data for all utterances in a call

    pg <- dbConnect(PostgreSQL())

    # Get tone data. Data is JSON converted to text.
    tone_raw <- dbGetQuery(pg, paste0("
        SELECT file_name, speaker_name, tone_count(string_agg(speaker_text, ' '))
        FROM streetevents.speaker_data
        WHERE file_name ='", file_name, "'
        GROUP BY file_name, speaker_name"))

    # Convert JSON-as-text to records where each key becomes a column
    tone_data <-,
                                       lapply(tone_raw$tone_count, fromJSON)))

    # Convert JSON data from numeric to integers
    for (i in names(tone_data)) {
        tone_data[,i] <- as.integer(tone_data[,i])

    # Combine converted JSON data with other fields
    tone_data <- cbind(subset(tone_raw, TRUE, select=-tone_count), tone_data)

    # Put data back in database.
    dbWriteTable(pg, "word_counts", tone_data, append=TRUE, row.names=FALSE)

    rs <- dbDisconnect(pg)

4. Apply R function.

pg <- dbConnect(PostgreSQL())

dbGetQuery(pg, "DROP TABLE IF EXISTS word_counts")

# Get a list of file names for which we need to get tone data.
file_names <-  dbGetQuery(pg, "
    SELECT DISTINCT file_name
    FROM streetevents.calls
    WHERE call_type=1")

# Apply function to get tone data. Run on 12 cores.
system.time(temp <- mclapply(file_names$file_name, addToneData, mc.cores=12))
rs <- dbGetQuery(pg, "
    SET maintenance_work_mem='1GB';
    CREATE INDEX ON word_counts (file_name)")
rs <- dbDisconnect(pg)
Posted in Uncategorized | Tagged , , , , , , , | Leave a comment

Using JSON to pass dictionaries from Python to PostgreSQL and back

To my (very untrained) eyes, JSON looks like Python dictionaries. I’d wondered if one could use JSON to pass around Python dictionaries. Turns out that it works quite well.


Get the top N words from passages of text (conference calls by companies). The idea is that words that aren’t in the top N by industry are more likely to be considered obscure in some sense. Splitting up raw text into sentences and then into words is a non-trivial task. So, I’d like to use the Python NLTK for the job.


I think this solution is a good illustration of using JSON with PL/Python. Note that I tried using marshal to pack and unpack bytea data into Python dictionaries, but it was no faster (perhaps slower) than using JSON. Also, I tried applying the word counts to each passage and a Python function to aggregate, but it was much, much slower than aggregating the text and passing to Python in one go. Note that I am using Counter objects here, but these subclass the standard Python dictionary.

Function to get word counts (a Python 3 version could omit the .decode('utf-8') step, I think):

CREATE OR REPLACE FUNCTION word_counts(raw text)
    import json, re
    from collections import Counter
    from nltk import sent_tokenize, word_tokenize

    # Use the Natural Language Toolkit to break text into
    # individual words
    text = raw.lower().decode('utf-8')
    token_lists = [word_tokenize(sent) for sent in sent_tokenize(text)]

    # We only want "words" with alphabetical characters.
    # Note that re.match() only looks at the first character
    # of the word.
    tokens = [word for sublist in token_lists for word in sublist
                 if re.match('[a-z]', word)]

    # Construct a counter of the words and return as JSON
    the_dict = Counter(tokens)
    return json.dumps(the_dict)
$CODE$ LANGUAGE plpythonu;

Function to get top N words:

CREATE OR REPLACE FUNCTION top_words(counter json, num integer)
    import json
    from collections import Counter
    return [word for word, count in Counter(json.loads(counter)).most_common(num)]
$$ LANGUAGE plpythonu;

Then some SQL to pull 100,000 passages and get the top 1,000 words by context (either “presentation” or “question and answer” portions of a conference call; see examples here).


raw AS (
    SELECT context, speaker_text
    FROM streetevents.speaker_data
    LIMIT 100000),

by_context AS (
    SELECT context, top_words(word_counts(string_agg(speaker_text, ' ')), 1000)
    FROM raw
    GROUP BY context)

SELECT context, unnest(top_words) AS word
FROM by_context
Posted in PostgreSQL, Python | Tagged , , , , | Leave a comment

Pulling data into Stata from PostgreSQL in Mac OS X

The following is perhaps of interest to Stata users.

This was pretty easy to install on my (OS X) laptop (a few tweaks to the Makefile, then the usual “make; sudo make install”). I suspect one could get it to work on Windows.

I found pgload here and made loadsql myself (a stretch for my limited Stata skills, but it was mostly copy-and-paste from Google). Here localhost is my laptop, while some.url is the computer in my attic.

  ___  ____  ____  ____  ____ (R)
 /__    /   ____/   /   ____/
___/   /   /___/   /   /___/   13.1   Copyright 1985-2013 StataCorp LP
  Statistics/Data Analysis            StataCorp
                                      4905 Lakeway Drive
     MP - Parallel Edition            College Station, Texas 77845 USA
                                      979-696-4601 (fax)
20-user 8-core Stata network license expires 17 Nov 2014:

. pgload "dbname=crsp host=localhost" "SELECT permno, date, abs(prc) AS prc FROM crsp.dsf LIMIT 10"
. list
     | permno        date      prc |
  1. |  10147   08aug1994   14.375 |
  2. |  10147   09aug1994    14.75 |
  3. |  10147   10aug1994     15.5 |
  4. |  10147   11aug1994    15.25 |
  5. |  10147   12aug1994     15.5 |
  6. |  10147   15aug1994    15.75 |
  7. |  10147   16aug1994   16.375 |
  8. |  10147   17aug1994   16.375 |
  9. |  10147   18aug1994   16.125 |
 10. |  10147   19aug1994   15.625 |
. clear
. pgload "dbname=crsp host=some.url" "SELECT permno, date, abs(prc) AS prc FROM crsp.dsf LIMIT 10"
. list
     | permno        date      prc |
  1. |  10000   07jan1986   2.5625 |
  2. |  10000   08jan1986      2.5 |
  3. |  10000   09jan1986      2.5 |
  4. |  10000   10jan1986      2.5 |
  5. |  10000   13jan1986    2.625 |
  6. |  10000   14jan1986     2.75 |
  7. |  10000   15jan1986    2.875 |
  8. |  10000   16jan1986        3 |
  9. |  10000   17jan1986        3 |
 10. |  10000   20jan1986        3 |

. loadsql using "~/some_sql.sql", conn("dbname=crsp host=some.url") 
. list in 1/10
     |  gvkey    datadate       at      ni |
  1. | 001000   31dec1961        .       . |
  2. | 001000   31dec1962        .       . |
  3. | 001000   31dec1963        .    .003 |
  4. | 001000   31dec1964    1.416    .052 |
  5. | 001000   31dec1965     2.31   -.197 |
  6. | 001000   31dec1966     2.43    .164 |
  7. | 001000   31dec1967    2.456    -.09 |
  8. | 001000   31dec1968    5.922    .463 |
  9. | 001001   31dec1982     8.59    .965 |
 10. | 001000   31dec1969   28.712   1.766 |

Here’s the SQL in ~/some_sql.sql:

SELECT gvkey, datadate, at, ni
FROM comp.funda
LIMIT 1000;


Here’s how I did this. Note that I have PostgreSQL installed from MacPorts and I also have wget (sudo port install wget).

1. Get file from Stata and Oxford site

cd ~/Downloads
tar -zxvf pgload-0.1.tar.gz
mkdir ~/Downloads/stata
cd ~/Downloads/stata/

2. Edit Makefile

vim ~/Downloads/pgload-0.1/Makefile 

The last step opens up the Makefile. I make changes to five lines (the last two will differ if you installed PostgreSQL 9.3 from somewhere other than MacPorts).

INSTALL_LOCATION=~/Library/Application\ Support/Stata/ado/personal

I then save and exit. Then

3. Compile and install

cd ~/Downloads/pgload-0.1
sudo make install

4. Add loadsql.ado file
The loadsql.ado file is placed in ~/Library/Application\ Support/Stata/ado/personal and contains the following:

program define loadsql
*! Load the output of an SQL file into Stata, version 1.4 (
version 13.1
syntax using/, CONN(string)
tempname sqlfile exec line;
file open `sqlfile' using `"`using'"', read text;
file read `sqlfile' `line';
while r(eof)==0 {;
    local `exec' `"``exec'' ``line'' 
    file read `sqlfile' `line';
file close `sqlfile';
* display "`conn'";
pgload "`conn'" "``exec''", clear;
* pgload "``dsn''" "SELECT permno, date, abs(prc) AS prc FROM crsp.dsf LIMIT 10", clear;

This is based on some code I found for MySQL (can’t remember where) and tweaked.

Posted in Mac, PostgreSQL, Stata | Leave a comment

Installing RPostgreSQL with R 3.1.0 Mavericks build

Update (2014-07-17)
Hmm. I have no idea why this seems to have worked in May, but does no longer. I think it worked on one computer, but not the others I have. (I’m now wondering if perhaps I wasn’t running R 3.1.0 Mavericks when I “successfully” installed the package.)

This is flagged as RPostgreSQL issue #61. In the meantime, the Snow Leopard package binary seems to work fine with R 3.1.0 Mavericks (I use it a lot and haven’t seen any issues after a couple of months).

Original post (2014-05-09)

Within R:

# Check that this is where pg_config is for you (I use MacPorts)
install.packages("RPostgreSQL", type="source")

Done! (This assumes you have the necessary development tools for compiling packages.)

As an alternative, one can get the current version from the development site. In Terminal:

cd ~
svn checkout rpostgresql-read-only

In R

install.packages("~/rpostgresql-read-only/RPostgreSQL/", type="source", repos=NULL)
Posted in Uncategorized | Leave a comment

Installing Python

I use MacPorts to install my basic python set-up as follows:

sudo port install python33 py33-zmq py33-pygments py33-ipython 
sudo port install py33-statsmodels py33-pyside py33-nltk3  
sudo port install py33-psycopg2 py33-patsy py33-pandas 
sudo port select --set ipython ipython33

Then it’s just a matter of typing:

ipython qtconsole
Posted in Uncategorized | Leave a comment