Get “tone” from corporate disclosures: PostgreSQL, Python, and R.

The paper by Loughran and McDonald (2011) develops lists of words that are intended to reflect “tone” in financial text. The following code retrieves these lists and applies them to a large body of text (over a billion words) from 250,000+ conference calls.

This provides a nice illustration of R, PostgreSQL and Python working together. I also illustrates the use of JSON data as a way to store Python dictionaries in PostgreSQL.

1. Get tone dictionary

The following code gets the dictionaries from Bill McDonald’s website and puts them into a PostgreSQL table. I store the lists of words for each category (i.e., positive, etc.) as an array, as arrays are compact and easy to work with in Python (they become lists, see the PL/Python function below).

category <- c("positive", "negative", "uncertainty", 
                "litigious", "modal_strong", "modal_weak")
url <- c("",

df <- data.frame(category, url, stringsAsFactors=FALSE)

getWords <- function(url) {
    words <- read.csv(url,    
    paste(words[,1], collapse=",")
df$words <- unlist(lapply(df$url, getWords))

pg <- dbConnect(PostgreSQL())
rs <- dbWriteTable(pg, "lm_tone", df, row.names=FALSE, overwrite=TRUE)

rs <- dbGetQuery(pg, "ALTER TABLE lm_tone ADD COLUMN word_list text[];")
rs <- dbGetQuery(pg, "UPDATE lm_tone SET word_list = regexp_split_to_array(words, ',')")
rs <- dbGetQuery(pg, "ALTER TABLE lm_tone DROP COLUMN words;")
rs <- dbDisconnect(pg)

2. Create a function to extract tone counts

The following PL/Python function began life as standalone Python code to count the number of words from coming from lists of words appearing in a passage of text. It was easy to adapt this code to put inside a PostgreSQL function, change out the list of words (from LIWC, as used on another project, to the Loughran and McDonald (2011) tone words) and the run it on the 31 million+ passages of text (running on twelve cores, this took around 5 hours).

Most of the Python code is setting up the regular expressions (storing them in SD, a Python dictionary in PostgreSQL) After the first call to the function in a session, only the last three lines of code actually do anything.

CREATE OR REPLACE FUNCTION tone_count(the_text text)  
    if 're' in SD:
        re = SD['re']
        json = SD['json']
        import re, json
        SD['re'] = re
        SD['json'] = json

    if SD.has_key("regex_list"):
        regex_list = SD["regex_list"]
        categories = SD["categories"]
        rv = plpy.execute("SELECT category FROM lm_tone")

        categories = [ (r["category"]) for r in rv]

        # Prepare the regular expressions.
        plan = plpy.prepare("""
            SELECT word_list
            FROM lm_tone 
            WHERE category = $1""", ["text"])

        mod_word_list = {}
        for cat in categories:
            rows = list(plpy.cursor(plan, [cat]))
            word_list = rows[0]['word_list']
            mod_word_list[cat] = [word.lower() for word in word_list]

        # Pre-compile regular expressions.
        regex_list = {}
        for key in mod_word_list.keys():
            regex = '\\b(?:' + '|'.join(mod_word_list[key]) + ')\\b'
            regex_list[key] = re.compile(regex)
        SD["regex_list"] = regex_list
        SD["categories"] = categories

    # rest of function
    """Function to return number of matches against a LIWC category in a text"""
    text = re.sub(u'\u2019', "'", the_text).lower()
    the_dict = {category: len(re.findall(regex_list[category], text)) for category in categories}
    return json.dumps(the_dict)

$CODE$ LANGUAGE plpythonu;

3. Create an R function to get tone data for a file

So I have a Python function that I can call within PostgreSQL. But I’m more facile with R, especially when it comes to parallel processing. Additionally, the RJSONIO package makes it easy to convert each key of the dictionary Python creates into a column of a data frame.

# Need to run word_count.sql first.
addToneData <- function(file_name) {
    # Function to get word count data for all utterances in a call

    pg <- dbConnect(PostgreSQL())

    # Get tone data. Data is JSON converted to text.
    tone_raw <- dbGetQuery(pg, paste0("
        SELECT file_name, speaker_name, tone_count(string_agg(speaker_text, ' '))
        FROM streetevents.speaker_data
        WHERE file_name ='", file_name, "'
        GROUP BY file_name, speaker_name"))

    # Convert JSON-as-text to records where each key becomes a column
    tone_data <-,
                                       lapply(tone_raw$tone_count, fromJSON)))

    # Convert JSON data from numeric to integers
    for (i in names(tone_data)) {
        tone_data[,i] <- as.integer(tone_data[,i])

    # Combine converted JSON data with other fields
    tone_data <- cbind(subset(tone_raw, TRUE, select=-tone_count), tone_data)

    # Put data back in database.
    dbWriteTable(pg, "word_counts", tone_data, append=TRUE, row.names=FALSE)

    rs <- dbDisconnect(pg)

4. Apply R function.

pg <- dbConnect(PostgreSQL())

dbGetQuery(pg, "DROP TABLE IF EXISTS word_counts")

# Get a list of file names for which we need to get tone data.
file_names <-  dbGetQuery(pg, "
    SELECT DISTINCT file_name
    FROM streetevents.calls
    WHERE call_type=1")

# Apply function to get tone data. Run on 12 cores.
system.time(temp <- mclapply(file_names$file_name, addToneData, mc.cores=12))
rs <- dbGetQuery(pg, "
    SET maintenance_work_mem='1GB';
    CREATE INDEX ON word_counts (file_name)")
rs <- dbDisconnect(pg)
Posted in Uncategorized | Tagged , , , , , , , | Leave a comment

Using JSON to pass dictionaries from Python to PostgreSQL and back

To my (very untrained) eyes, JSON looks like Python dictionaries. I’d wondered if one could use JSON to pass around Python dictionaries. Turns out that it works quite well.


Get the top N words from passages of text (conference calls by companies). The idea is that words that aren’t in the top N by industry are more likely to be considered obscure in some sense. Splitting up raw text into sentences and then into words is a non-trivial task. So, I’d like to use the Python NLTK for the job.


I think this solution is a good illustration of using JSON with PL/Python. Note that I tried using marshal to pack and unpack bytea data into Python dictionaries, but it was no faster (perhaps slower) than using JSON. Also, I tried applying the word counts to each passage and a Python function to aggregate, but it was much, much slower than aggregating the text and passing to Python in one go. Note that I am using Counter objects here, but these subclass the standard Python dictionary.

Function to get word counts (a Python 3 version could omit the .decode('utf-8') step, I think):

CREATE OR REPLACE FUNCTION word_counts(raw text)
    import json, re
    from collections import Counter
    from nltk import sent_tokenize, word_tokenize

    # Use the Natural Language Toolkit to break text into
    # individual words
    text = raw.lower().decode('utf-8')
    token_lists = [word_tokenize(sent) for sent in sent_tokenize(text)]

    # We only want "words" with alphabetical characters.
    # Note that re.match() only looks at the first character
    # of the word.
    tokens = [word for sublist in token_lists for word in sublist
                 if re.match('[a-z]', word)]

    # Construct a counter of the words and return as JSON
    the_dict = Counter(tokens)
    return json.dumps(the_dict)
$CODE$ LANGUAGE plpythonu;

Function to get top N words:

CREATE OR REPLACE FUNCTION top_words(counter json, num integer)
    import json
    from collections import Counter
    return [word for word, count in Counter(json.loads(counter)).most_common(num)]
$$ LANGUAGE plpythonu;

Then some SQL to pull 100,000 passages and get the top 1,000 words by context (either “presentation” or “question and answer” portions of a conference call; see examples here).


raw AS (
    SELECT context, speaker_text
    FROM streetevents.speaker_data
    LIMIT 100000),

by_context AS (
    SELECT context, top_words(word_counts(string_agg(speaker_text, ' ')), 1000)
    FROM raw
    GROUP BY context)

SELECT context, unnest(top_words) AS word
FROM by_context
Posted in PostgreSQL, Python | Tagged , , , , | Leave a comment

Pulling data into Stata from PostgreSQL in Mac OS X

The following is perhaps of interest to Stata users.

This was pretty easy to install on my (OS X) laptop (a few tweaks to the Makefile, then the usual “make; sudo make install”). I suspect one could get it to work on Windows.

I found pgload here and made loadsql myself (a stretch for my limited Stata skills, but it was mostly copy-and-paste from Google). Here localhost is my laptop, while some.url is the computer in my attic.

  ___  ____  ____  ____  ____ (R)
 /__    /   ____/   /   ____/
___/   /   /___/   /   /___/   13.1   Copyright 1985-2013 StataCorp LP
  Statistics/Data Analysis            StataCorp
                                      4905 Lakeway Drive
     MP - Parallel Edition            College Station, Texas 77845 USA
                                      979-696-4601 (fax)
20-user 8-core Stata network license expires 17 Nov 2014:

. pgload "dbname=crsp host=localhost" "SELECT permno, date, abs(prc) AS prc FROM crsp.dsf LIMIT 10"
. list
     | permno        date      prc |
  1. |  10147   08aug1994   14.375 |
  2. |  10147   09aug1994    14.75 |
  3. |  10147   10aug1994     15.5 |
  4. |  10147   11aug1994    15.25 |
  5. |  10147   12aug1994     15.5 |
  6. |  10147   15aug1994    15.75 |
  7. |  10147   16aug1994   16.375 |
  8. |  10147   17aug1994   16.375 |
  9. |  10147   18aug1994   16.125 |
 10. |  10147   19aug1994   15.625 |
. clear
. pgload "dbname=crsp host=some.url" "SELECT permno, date, abs(prc) AS prc FROM crsp.dsf LIMIT 10"
. list
     | permno        date      prc |
  1. |  10000   07jan1986   2.5625 |
  2. |  10000   08jan1986      2.5 |
  3. |  10000   09jan1986      2.5 |
  4. |  10000   10jan1986      2.5 |
  5. |  10000   13jan1986    2.625 |
  6. |  10000   14jan1986     2.75 |
  7. |  10000   15jan1986    2.875 |
  8. |  10000   16jan1986        3 |
  9. |  10000   17jan1986        3 |
 10. |  10000   20jan1986        3 |

. loadsql using "~/some_sql.sql", conn("dbname=crsp host=some.url") 
. list in 1/10
     |  gvkey    datadate       at      ni |
  1. | 001000   31dec1961        .       . |
  2. | 001000   31dec1962        .       . |
  3. | 001000   31dec1963        .    .003 |
  4. | 001000   31dec1964    1.416    .052 |
  5. | 001000   31dec1965     2.31   -.197 |
  6. | 001000   31dec1966     2.43    .164 |
  7. | 001000   31dec1967    2.456    -.09 |
  8. | 001000   31dec1968    5.922    .463 |
  9. | 001001   31dec1982     8.59    .965 |
 10. | 001000   31dec1969   28.712   1.766 |

Here’s the SQL in ~/some_sql.sql:

SELECT gvkey, datadate, at, ni
FROM comp.funda
LIMIT 1000;


Here’s how I did this. Note that I have PostgreSQL installed from MacPorts and I also have wget (sudo port install wget).

1. Get file from Stata and Oxford site

cd ~/Downloads
tar -zxvf pgload-0.1.tar.gz
mkdir ~/Downloads/stata
cd ~/Downloads/stata/

2. Edit Makefile

vim ~/Downloads/pgload-0.1/Makefile 

The last step opens up the Makefile. I make changes to five lines (the last two will differ if you installed PostgreSQL 9.3 from somewhere other than MacPorts).

INSTALL_LOCATION=~/Library/Application\ Support/Stata/ado/personal

I then save and exit. Then

3. Compile and install

cd ~/Downloads/pgload-0.1
sudo make install

4. Add loadsql.ado file
The loadsql.ado file is placed in ~/Library/Application\ Support/Stata/ado/personal and contains the following:

program define loadsql
*! Load the output of an SQL file into Stata, version 1.4 (
version 13.1
syntax using/, CONN(string)
tempname sqlfile exec line;
file open `sqlfile' using `"`using'"', read text;
file read `sqlfile' `line';
while r(eof)==0 {;
    local `exec' `"``exec'' ``line'' 
    file read `sqlfile' `line';
file close `sqlfile';
* display "`conn'";
pgload "`conn'" "``exec''", clear;
* pgload "``dsn''" "SELECT permno, date, abs(prc) AS prc FROM crsp.dsf LIMIT 10", clear;

This is based on some code I found for MySQL (can’t remember where) and tweaked.

Posted in Mac, PostgreSQL, Stata | Leave a comment

Installing RPostgreSQL with R 3.1.0 Mavericks build

Update (2014-07-17)
Hmm. I have no idea why this seems to have worked in May, but does no longer. I think it worked on one computer, but not the others I have. (I’m now wondering if perhaps I wasn’t running R 3.1.0 Mavericks when I “successfully” installed the package.)

This is flagged as RPostgreSQL issue #61. In the meantime, the Snow Leopard package binary seems to work fine with R 3.1.0 Mavericks (I use it a lot and haven’t seen any issues after a couple of months).

Original post (2014-05-09)

Within R:

# Check that this is where pg_config is for you (I use MacPorts)
install.packages("RPostgreSQL", type="source")

Done! (This assumes you have the necessary development tools for compiling packages.)

As an alternative, one can get the current version from the development site. In Terminal:

cd ~
svn checkout rpostgresql-read-only

In R

install.packages("~/rpostgresql-read-only/RPostgreSQL/", type="source", repos=NULL)
Posted in Uncategorized | Leave a comment

Installing Python

I use MacPorts to install my basic python set-up as follows:

sudo port install python33 py33-zmq py33-pygments py33-ipython 
sudo port install py33-statsmodels py33-pyside py33-nltk3  
sudo port install py33-psycopg2 py33-patsy py33-pandas 
sudo port select --set ipython ipython33

Then it’s just a matter of typing:

ipython qtconsole
Posted in Uncategorized | Leave a comment

Using knitr with TeXShop

Using guidance provided here, I created a file with the following contents

Rscript -e "library(knitr); knit('$1')"
pdflatex "${1%.*}"

and saved it as ~/Library/TeXShop/Engines/Knitr.engine.

Note that both Rscript and pdflatex are on my path, so I did not include anything for the PATH variable.

I then (following a tip from here) changed permissions on the file (otherwise I got some error message about “bit set”) as follows:

chmod +x ~/Library/TeXShop/Engines/Knitr.engine
Posted in Uncategorized | Leave a comment

Install Shiny Server on OS X 10.8.4

The instructions here work pretty much without modification.

The only tweak I made was to how I installed the shiny package. From Terminal, I ran R as root (sudo R)
and then installed shiny using something like install.packages('shiny', repos='', type='source', lib='/Library/Frameworks/R.framework/Versions/3.0/Resources/library').

Specifying the type as source ensured that I got the latest version of the package (the binary was a little older) and specifying the library location
ensured that it went into a system-wide location (for some reason even with sudo R, R wanted to install in my user directory).

Oddly enough, my first application was a simple lookup for my database. But it seems to work nicely.

Posted in Mac, R | 3 Comments