Prototyping PL/Python functions

I’ve seen a variable called numerical intensity crop up in a couple of papers recently (one example is Lundholm, Rogo, and Zhang (2014)).

I spent a little time coming up with a Python program to extract numbers from text.


the_text = """
Thanks, Steve. Good morning everyone. Yesterday we reported second quarter revenues of $617 million -- a 17% increase over the second quarter of 2004. We reported net earnings of $24.1 million or $0.78 per diluted share for the quarter compared with net earnings of $15.3 million or $0.50 per diluted share for the same period last year.
 Our reported net earnings for both periods included special charges.
 For the 2005 period, our reported earnings include a charge of $118,000 net of tax (ph) related to the early extinguishment of debt. Excluding this charge, earnings for the 2005 second quarter would be $0.80 per share which is above the Wall Street average estimate.
 For the first 6 months net sales increased 23% to $1.15 billion from $938 million in 2004. Net earnings were $30.5 million or $1.00 per diluted share compared with $21 million or $0.77 per diluted share.
 Excluding special items earnings per diluted share for the first half of 2005 were $1.19 up 53% from the first half of 2004. Our tax rate for the quarter was 21% down from 30% in the first quarter of 2005. The reduction in the effective tax rate for the quarter was due to clarity we received concerning the realization of certain tax benefits that were previously reserved against due to their uncertainty. We anticipate having an effective tax rate for the full fiscal year of 27%.
 Now I will review our segment financial results. Crane segment sales for the second quarter were $427 million, an increase of 29% compared to the same quarter last year. Operating earnings were $35.5 million, up 97% and operating margin improved to 8.3% from 5.4% a year ago. This improved (indiscernible) performances illustrates this significant operating leverage of our Crane business.
 Crane Boot (ph) backlog of $530 million is up sharply from the year ago quarter and we are seeing continued strong demand for our Tower Crane, mobile telescopic crane and crawler cranes in all markets except for crawler cranes in North America.
 Crane backlog has been increasing steadily since (indiscernible) in 2003 and has continued to be strong through the positive outlook in many of our end-user markets. Our Crane backlog is not dependent on any one geography or product line.
 Foodservice sales of $126.5 million declined 3.4% from the June 16, 2004 quarter as a result of cooler than normal weather that dampened demand for refrigeration and icemaking equipment in April and May. Lower sales volumes at Manitowoc Ice coupled with continued margin pressures at diversified refrigeration -- DRI -- our contract manufacturing unit combined to reduce Foodservice earnings and margins for the quarter.
 Tim Kraus will provide more detail on our Foodservice results later in the call.
 In our Marine segment, revenue for the quarter was essentially flat from 2004 at $63.4 million. Operating results were a loss of $2.7 million during the quarter compared with Operating earnings of $2.7 million in the 2004 quarter. As we indicated during our first quarter conference call we anticipated that Marine was both disappointing results for the second quarter and the loss is directly related to 3 specific projects.
 One of the projects has now been delivered to the customer, one is currently on route for delivery and the third project will leave our yard next week.
 Among the issues creating the extreme inefficiencies on these projects were considerable revision and were previously inspected and certified by a third party. Overruns in delivery costs and additional hours devoted to bringing the vessel into weight-compliance requirements. These projects are now complete and we anticipate that the Marine segment will return to modest profitability in the second half of 2005.
 Cash from operations in the quarter was positive $13.1 million compared with the use of cash of $1.5 million in the same quarter last year.
 We remain confident that we will achieve our objective of generating net debt reduction of $50 million for the full year. Manitowoc's normal seasonal pattern is to use cash in the first half of the year followed by strong cash generation in the second half. We expect this pattern to continue to hold true this year.
 Manitowoc improved its EVA performance in the first half of 2005 by more than fivefold over the same period last year. We expect this positive trend to continue for the full year, which continues the year-over-year improvement we've achieved since 2003. This improvement in EVA performance validates the strategy and execution of our acquisition.
 My final comments are in earnings guidance which we are tightening to $2.15 to $2.30 per share. This is roughly a 50% increase over last year despite worse than expected year-to-date results in our Foodservice and Marine segments.
 Though market drivers in the North American Lattice-boom Crawler Crane industry continued to improve this is one market sector -- truly the only one in Crane -- that has not experienced an appreciable upturn from (indiscernible) levels. A meaningful recovery in the North American crawler crane revenues is not anticipated until 2006.
 Clearly, the Crane business is driving our improved results. And our visibility is for continued strong market condition in cranes for the foreseeable future. We are seeing good indicators in most of the construction and infrastructure market and we believe we are still several quarters away from achieving historic peak cycle earnings.
 With that I will turn the call over to Tim Kraus to talk about our Foodservice business. Tim."""

def numbers(text):

	import re
	
	# Regular expression to pick up months as described in Lundholm, Rogo and Zhang (2014)
	months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
					'August', 'September', 'October', 
					'November', 'December']		
	months_abbrev = [ month[0:3] + '\.?' for month in months]
	month_re = '(?:' + '|'.join(months + months_abbrev) + ')'

	# Years are four-digit numbers without commas.
	# I have a better rule using the fact that years will begin with 19 or 20, etc.
	years = '\s*\d{4}\.?'

	# Keep matching numbers, as well as month portion, if any
	matches =  re.findall('(?:' + month_re + '\s+)?[0-9][0-9,\.]*', text)

	# return matches that aren't years and aren't preceded by months
	return [ a_match for a_match in matches
				if not re.match(years, a_match) 
					and not re.match(month_re, a_match)]
	
print(numbers(the_text))

Once I had the code above working (you could copy-and-paste it into any program that can execute Python code), I simply pasted it into a simple SQL wrapper and changed the return value to be the number of numbers (rather than the numbers themselves):

CREATE OR REPLACE FUNCTION num_count(text text) 
RETURNS integer AS
$BODY$
import re

# Regular expression to pick up months as described in Lundholm, Rogo and Zhang (2014)
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
                'August', 'September', 'October', 
                'November', 'December']		
months_abbrev = [ month[0:3] + '\.?' for month in months]
month_re = '(?:' + '|'.join(months + months_abbrev) + ')'

# Years are four-digit numbers without commas.
# I have a better rule using the fact that years will begin with 19 or 20, etc.
years = '\s*\d{4}\.?'

# Keep matching numbers, as well as month portion, if any
matches =  re.findall('(?:' + month_re + '\s+)?[0-9][0-9,\.]*', text)

# return matches that aren't years and aren't preceded by months
return len([ a_match for a_match in matches
            if not re.match(years, a_match) 
                and not re.match(month_re, a_match)])
$BODY$ LANGUAGE plpythonu;

Now I can call this function from SQL:

WITH raw_data AS (
    SELECT word_count(speaker_text), num_count(speaker_text)
    FROM streetevents.speaker_data
    LIMIT 10)
SELECT *, CASE WHEN word_count > 0 THEN num_count::float8/word_count END AS prop_nums
FROM raw_data;

Here is the output:

=# WITH raw_data AS (
(#     SELECT word_count(speaker_text), num_count(speaker_text)
(#     FROM streetevents.speaker_data
(#     LIMIT 10)
-# SELECT *, CASE WHEN word_count > 0 THEN num_count::float8/word_count END AS prop_nums
-# FROM raw_data;

 word_count | num_count |      prop_nums      
------------+-----------+---------------------
        804 |         8 | 0.00995024875621891
        753 |        12 |  0.0159362549800797
       1010 |        40 |  0.0396039603960396
        373 |         5 |  0.0134048257372654
         67 |         0 |                   0
         32 |         3 |             0.09375
         13 |         0 |                   0
          4 |         0 |                   0
        126 |         0 |                   0
         12 |         0 |                   0
(10 rows)

Advertisements
This entry was posted in PostgreSQL, Python and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s