Using JSON to pass dictionaries from Python to PostgreSQL and back

To my (very untrained) eyes, JSON looks like Python dictionaries. I’d wondered if one could use JSON to pass around Python dictionaries. Turns out that it works quite well.

Task

Get the top N words from passages of text (conference calls by companies). The idea is that words that aren’t in the top N by industry are more likely to be considered obscure in some sense. Splitting up raw text into sentences and then into words is a non-trivial task. So, I’d like to use the Python NLTK for the job.

Solution

I think this solution is a good illustration of using JSON with PL/Python. Note that I tried using marshal to pack and unpack bytea data into Python dictionaries, but it was no faster (perhaps slower) than using JSON. Also, I tried applying the word counts to each passage and a Python function to aggregate, but it was much, much slower than aggregating the text and passing to Python in one go. Note that I am using Counter objects here, but these subclass the standard Python dictionary.

Function to get word counts (a Python 3 version could omit the .decode('utf-8') step, I think):

CREATE OR REPLACE FUNCTION word_counts(raw text)
RETURNS json AS
$CODE$
    import json, re
    from collections import Counter
    from nltk import sent_tokenize, word_tokenize

    # Use the Natural Language Toolkit to break text into
    # individual words
    text = raw.lower().decode('utf-8')
    token_lists = [word_tokenize(sent) for sent in sent_tokenize(text)]

    # We only want "words" with alphabetical characters.
    # Note that re.match() only looks at the first character
    # of the word.
    tokens = [word for sublist in token_lists for word in sublist
                 if re.match('[a-z]', word)]

    # Construct a counter of the words and return as JSON
    the_dict = Counter(tokens)
    return json.dumps(the_dict)
$CODE$ LANGUAGE plpythonu;

Function to get top N words:

CREATE OR REPLACE FUNCTION top_words(counter json, num integer)
RETURNS text[] AS
$$
    import json
    from collections import Counter
    return [word for word, count in Counter(json.loads(counter)).most_common(num)]
$$ LANGUAGE plpythonu;

Then some SQL to pull 100,000 passages and get the top 1,000 words by context (either “presentation” or “question and answer” portions of a conference call; see examples here).

WITH

raw AS (
    SELECT context, speaker_text
    FROM streetevents.speaker_data
    LIMIT 100000),

by_context AS (
    SELECT context, top_words(word_counts(string_agg(speaker_text, ' ')), 1000)
    FROM raw
    GROUP BY context)

SELECT context, unnest(top_words) AS word
FROM by_context
Advertisements
This entry was posted in PostgreSQL, Python and tagged , , , , . Bookmark the permalink.

One Response to Using JSON to pass dictionaries from Python to PostgreSQL and back

  1. having a problem connecting NLTK to Postgres.app on OSX… any advice?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s