Calculating fog using PostgreSQL and Perl

A measure of the complexity of text that has become popular in accounting research is “fog”; this equals (number of words per sentence + percentage of words that are complex) * 0.4.

The following function run in PostgreSQL gives access to a database function fog().

CREATE OR REPLACE FUNCTION fog(text) RETURNS float8 AS $$ 

  # Load Perl modules that calculate fog, etc.
  use Lingua::EN::Fathom;
  use Lingua::EN::Sentence qw( get_sentences add_acronyms );

  # Calculate and return fog
  my $text = new Lingua::EN::Fathom;
  if (defined($_[0])) {
    $text->analyse_block($_[0]);
    return($text->fog);
  }

$$ LANGUAGE plperlu;

This function can then be used a part of regular SQL. For example, this SQL

SELECT file_name, role, speaker_number AS num, 
  context, fog(speaker_text)
FROM streetevents.speaker_data
LIMIT 10;

yields

-----------+---------------+-----+---------+------------------
 file_name |     role      | num | context |       fog        
-----------+---------------+-----+---------+------------------
 1475440_T | Chairman, CEO |  47 | qa      | 21.7658536585366
 1475440_T | Analyst       |   7 | qa      | 33.6727272727273
 1475440_T | Chairman, CEO |  36 | qa      |            21.55
 1475440_T |               |  54 | qa      |                4
 1475440_T | Analyst       |  31 | qa      |                8
 1475440_T | CFO           |  50 | qa      | 24.9524752475248
 1475440_T | Chairman, CEO |  23 | qa      | 8.04444444444444
 1475440_T | CFO           |  27 | qa      |              2.4
 1475440_T | Analyst       |  45 | qa      | 16.9714285714286
 1475440_T | CFO           |  30 | qa      | 31.8545454545455
(10 rows)

One small glitch in getting this to work is that it seems PostgreSQL is running an older version of Perl (5.12) on my database. I suspect this was the Perl I had installed when I compiled PostgreSQL. So I had to install the modules I needed by entering perl5.12 -MCPAN -e shell and then install Lingua::EN::Fathom and install Lingua::EN::Sentence. Another issue is that it seems that calling the Perl code above is far slower than loading text from the file system using Perl and calculating fog there.

An alternative approach gets several statistics in one run:

CREATE TYPE fog_stats AS (fog float8, num_words integer, percent_complex float8, num_sentences integer);

CREATE OR REPLACE FUNCTION fog_data(text) RETURNS fog_stats AS $$ 

  # Load Perl module that calculates fog, etc.
  use Lingua::EN::Fathom;
  use Lingua::EN::Sentence qw( get_sentences add_acronyms );

  my $text = new Lingua::EN::Fathom;
  if (defined($_[0])) {
    $text->analyse_block($_[0]);
    $fog   = $text->fog;
    $num_words = $text->num_words;
    $percent_complex = $text->percent_complex_words;  
    $num_sentences = $text->num_sentences;
  }

  return {fog => $fog, num_words => $num_words, percent_complex => $percent_complex, 
	num_sentences => $num_sentences};

$$ LANGUAGE plperlu;

This can then be used as follows (syntax is a little awkward):

WITH raw_data AS (
  SELECT *, fog_data(speaker_text) AS f
  FROM streetevents.speaker_data
  LIMIT 10)
SELECT file_name, role, speaker_number, context, 
  (f).fog, (f).num_words, (f).percent_complex, (f).num_sentences 
FROM raw_data
About these ads
This entry was posted in Perl, PostgreSQL and tagged , , . Bookmark the permalink.

One Response to Calculating fog using PostgreSQL and Perl

  1. Pingback: Calculating fog using PostgreSQL and Perl « DbRunas – Noticias y Recursos sobre Bases de Datos

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s