Calculating fog using PostgreSQL and Perl

A measure of the complexity of text that has become popular in accounting research is “fog”; this equals (number of words per sentence + percentage of words that are complex) * 0.4.

The following function run in PostgreSQL gives access to a database function fog().

CREATE OR REPLACE FUNCTION fog(text) RETURNS float8 AS $$ 

  # Load Perl modules that calculate fog, etc.
  use Lingua::EN::Fathom;
  use Lingua::EN::Sentence qw( get_sentences add_acronyms );

  # Calculate and return fog
  my $text = new Lingua::EN::Fathom;
  if (defined($_[0])) {
    $text->analyse_block($_[0]);
    return($text->fog);
  }

$$ LANGUAGE plperlu;

This function can then be used a part of regular SQL. For example, this SQL

SELECT file_name, role, speaker_number AS num, 
  context, fog(speaker_text)
FROM streetevents.speaker_data
LIMIT 10;

yields

-----------+---------------+-----+---------+------------------
 file_name |     role      | num | context |       fog        
-----------+---------------+-----+---------+------------------
 1475440_T | Chairman, CEO |  47 | qa      | 21.7658536585366
 1475440_T | Analyst       |   7 | qa      | 33.6727272727273
 1475440_T | Chairman, CEO |  36 | qa      |            21.55
 1475440_T |               |  54 | qa      |                4
 1475440_T | Analyst       |  31 | qa      |                8
 1475440_T | CFO           |  50 | qa      | 24.9524752475248
 1475440_T | Chairman, CEO |  23 | qa      | 8.04444444444444
 1475440_T | CFO           |  27 | qa      |              2.4
 1475440_T | Analyst       |  45 | qa      | 16.9714285714286
 1475440_T | CFO           |  30 | qa      | 31.8545454545455
(10 rows)

One small glitch in getting this to work is that it seems PostgreSQL is running an older version of Perl (5.12) on my database. I suspect this was the Perl I had installed when I compiled PostgreSQL. So I had to install the modules I needed by entering perl5.12 -MCPAN -e shell and then install Lingua::EN::Fathom and install Lingua::EN::Sentence. Another issue is that it seems that calling the Perl code above is far slower than loading text from the file system using Perl and calculating fog there.

Update (2014-07-18): More recently, I have obtained the modules using the following:

sudo cpan App::cpanminus
sudo cpanm Lingua::EN::Sentence
sudo cpanm Lingua::EN::Fathom

An alternative approach gets several statistics in one run:

CREATE TYPE fog_stats AS (fog float8, num_words integer, percent_complex float8, num_sentences integer);

CREATE OR REPLACE FUNCTION fog_data(text) RETURNS fog_stats AS $$ 

  # Load Perl module that calculates fog, etc.
  use Lingua::EN::Fathom;
  use Lingua::EN::Sentence qw( get_sentences add_acronyms );

  my $text = new Lingua::EN::Fathom;
  if (defined($_[0])) {
    $text->analyse_block($_[0]);
    $fog   = $text->fog;
    $num_words = $text->num_words;
    $percent_complex = $text->percent_complex_words;  
    $num_sentences = $text->num_sentences;
  }

  return {fog => $fog, num_words => $num_words, percent_complex => $percent_complex, 
	num_sentences => $num_sentences};

$$ LANGUAGE plperlu;

This can then be used as follows (syntax is a little awkward):

WITH raw_data AS (
  SELECT *, fog_data(speaker_text) AS f
  FROM streetevents.speaker_data
  LIMIT 10)
SELECT file_name, role, speaker_number, context, 
  (f).fog, (f).num_words, (f).percent_complex, (f).num_sentences 
FROM raw_data
About these ads
This entry was posted in Perl, PostgreSQL and tagged , , . Bookmark the permalink.

One Response to Calculating fog using PostgreSQL and Perl

  1. Pingback: Calculating fog using PostgreSQL and Perl « DbRunas – Noticias y Recursos sobre Bases de Datos

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s