Get “tone” from corporate disclosures: PostgreSQL, Python, and R.

The paper by Loughran and McDonald (2011) develops lists of words that are intended to reflect “tone” in financial text. The following code retrieves these lists and applies them to a large body of text (over a billion words) from 250,000+ conference calls.

This provides a nice illustration of R, PostgreSQL and Python working together. I also illustrates the use of JSON data as a way to store Python dictionaries in PostgreSQL.

1. Get tone dictionary

The following code gets the dictionaries from Bill McDonald’s website and puts them into a PostgreSQL table. I store the lists of words for each category (i.e., positive, etc.) as an array, as arrays are compact and easy to work with in Python (they become lists, see the PL/Python function below).

category <- c("positive", "negative", "uncertainty", 
                "litigious", "modal_strong", "modal_weak")
url <- c("http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists/LoughranMcDonald_Positive.csv",
          "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists/LoughranMcDonald_Negative.csv",
          "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists/LoughranMcDonald_Uncertainty.csv",
          "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists/LoughranMcDonald_Litigious.csv",
          "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists/LoughranMcDonald_ModalStrong.csv",
          "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists/LoughranMcDonald_ModalWeak.csv")

df <- data.frame(category, url, stringsAsFactors=FALSE)

getWords <- function(url) {
    words <- read.csv(url, as.is=TRUE)    
    paste(words[,1], collapse=",")
}
df$words <- unlist(lapply(df$url, getWords))
library(RPostgreSQL)

pg <- dbConnect(PostgreSQL())
rs <- dbWriteTable(pg, "lm_tone", df, row.names=FALSE, overwrite=TRUE)

rs <- dbGetQuery(pg, "ALTER TABLE lm_tone ADD COLUMN word_list text[];")
rs <- dbGetQuery(pg, "UPDATE lm_tone SET word_list = regexp_split_to_array(words, ',')")
rs <- dbGetQuery(pg, "ALTER TABLE lm_tone DROP COLUMN words;")
rs <- dbDisconnect(pg)

2. Create a function to extract tone counts

The following PL/Python function began life as standalone Python code to count the number of words from coming from lists of words appearing in a passage of text. It was easy to adapt this code to put inside a PostgreSQL function, change out the list of words (from LIWC, as used on another project, to the Loughran and McDonald (2011) tone words) and the run it on the 31 million+ passages of text (running on twelve cores, this took around 5 hours).

Most of the Python code is setting up the regular expressions (storing them in SD, a Python dictionary in PostgreSQL) After the first call to the function in a session, only the last three lines of code actually do anything.

CREATE OR REPLACE FUNCTION tone_count(the_text text)  
RETURNS json AS
$CODE$
    if 're' in SD:
        re = SD['re']
        json = SD['json']
    else:
        import re, json
        SD['re'] = re
        SD['json'] = json

    if SD.has_key("regex_list"):
        regex_list = SD["regex_list"]
        categories = SD["categories"]
    else:
        rv = plpy.execute("SELECT category FROM lm_tone")

        categories = [ (r["category"]) for r in rv]

        # Prepare the regular expressions.
        plan = plpy.prepare("""
            SELECT word_list
            FROM lm_tone 
            WHERE category = $1""", ["text"])

        mod_word_list = {}
        for cat in categories:
            rows = list(plpy.cursor(plan, [cat]))
            word_list = rows[0]['word_list']
            mod_word_list[cat] = [word.lower() for word in word_list]

        # Pre-compile regular expressions.
        regex_list = {}
        for key in mod_word_list.keys():
            regex = '\\b(?:' + '|'.join(mod_word_list[key]) + ')\\b'
            regex_list[key] = re.compile(regex)
        SD["regex_list"] = regex_list
        SD["categories"] = categories

    # rest of function
    """Function to return number of matches against a LIWC category in a text"""
    text = re.sub(u'\u2019', "'", the_text).lower()
    the_dict = {category: len(re.findall(regex_list[category], text)) for category in categories}
    return json.dumps(the_dict)

$CODE$ LANGUAGE plpythonu;

3. Create an R function to get tone data for a file

So I have a Python function that I can call within PostgreSQL. But I’m more facile with R, especially when it comes to parallel processing. Additionally, the RJSONIO package makes it easy to convert each key of the dictionary Python creates into a column of a data frame.

# Need to run word_count.sql first.
addToneData <- function(file_name) {
    # Function to get word count data for all utterances in a call

    library(RPostgreSQL)
    pg <- dbConnect(PostgreSQL())

    # Get tone data. Data is JSON converted to text.
    tone_raw <- dbGetQuery(pg, paste0("
        SELECT file_name, speaker_name, tone_count(string_agg(speaker_text, ' '))
        FROM streetevents.speaker_data
        WHERE file_name ='", file_name, "'
        GROUP BY file_name, speaker_name"))

    # Convert JSON-as-text to records where each key becomes a column
    require(RJSONIO)
    tone_data <- as.data.frame(do.call(rbind,
                                       lapply(tone_raw$tone_count, fromJSON)))

    # Convert JSON data from numeric to integers
    for (i in names(tone_data)) {
        tone_data[,i] <- as.integer(tone_data[,i])
    }

    # Combine converted JSON data with other fields
    tone_data <- cbind(subset(tone_raw, TRUE, select=-tone_count), tone_data)

    # Put data back in database.
    dbWriteTable(pg, "word_counts", tone_data, append=TRUE, row.names=FALSE)

    rs <- dbDisconnect(pg)
}

4. Apply R function.

library(RPostgreSQL)
pg <- dbConnect(PostgreSQL())

dbGetQuery(pg, "DROP TABLE IF EXISTS word_counts")

# Get a list of file names for which we need to get tone data.
file_names <-  dbGetQuery(pg, "
    SELECT DISTINCT file_name
    FROM streetevents.calls
    WHERE call_type=1")

# Apply function to get tone data. Run on 12 cores.
library(parallel)
system.time(temp <- mclapply(file_names$file_name, addToneData, mc.cores=12))
rs <- dbGetQuery(pg, "
    SET maintenance_work_mem='1GB';
    CREATE INDEX ON word_counts (file_name)")
rs <- dbDisconnect(pg)
Advertisements
This entry was posted in Uncategorized and tagged , , , , , , , . Bookmark the permalink.

4 Responses to Get “tone” from corporate disclosures: PostgreSQL, Python, and R.

  1. Kleopatra says:

    Hello Ian, great work ! Nevertheless I would like to ask you, since I am not familiar at all with R but I am using instead Stata, is it possible to convert R code into Stata so as to apply the last part of your coding? Probably that’s a silly question but believe me I am quite new to all of this.
    Thank you in advance.

  2. Pingback: Get “tone” from financial texts | Kai Chen

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s