To my (very untrained) eyes, JSON looks like Python dictionaries. I’d wondered if one could use JSON to pass around Python dictionaries. Turns out that it works quite well.
Get the top N words from passages of text (conference calls by companies). The idea is that words that aren’t in the top N by industry are more likely to be considered obscure in some sense. Splitting up raw text into sentences and then into words is a non-trivial task. So, I’d like to use the Python NLTK for the job.
I think this solution is a good illustration of using JSON with PL/Python. Note that I tried using
marshal to pack and unpack
bytea data into Python dictionaries, but it was no faster (perhaps slower) than using JSON. Also, I tried applying the word counts to each passage and a Python function to aggregate, but it was much, much slower than aggregating the text and passing to Python in one go. Note that I am using
Counter objects here, but these subclass the standard Python dictionary.
Function to get word counts (a Python 3 version could omit the
.decode('utf-8') step, I think):
CREATE OR REPLACE FUNCTION word_counts(raw text) RETURNS json AS $CODE$ import json, re from collections import Counter from nltk import sent_tokenize, word_tokenize # Use the Natural Language Toolkit to break text into # individual words text = raw.lower().decode('utf-8') token_lists = [word_tokenize(sent) for sent in sent_tokenize(text)] # We only want "words" with alphabetical characters. # Note that re.match() only looks at the first character # of the word. tokens = [word for sublist in token_lists for word in sublist if re.match('[a-z]', word)] # Construct a counter of the words and return as JSON the_dict = Counter(tokens) return json.dumps(the_dict) $CODE$ LANGUAGE plpythonu;
Function to get top N words:
CREATE OR REPLACE FUNCTION top_words(counter json, num integer) RETURNS text AS $$ import json from collections import Counter return [word for word, count in Counter(json.loads(counter)).most_common(num)] $$ LANGUAGE plpythonu;
Then some SQL to pull 100,000 passages and get the top 1,000 words by context (either “presentation” or “question and answer” portions of a conference call; see examples here).
WITH raw AS ( SELECT context, speaker_text FROM streetevents.speaker_data LIMIT 100000), by_context AS ( SELECT context, top_words(word_counts(string_agg(speaker_text, ' ')), 1000) FROM raw GROUP BY context) SELECT context, unnest(top_words) AS word FROM by_context