<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>iangow</title>
	<atom:link href="https://iangow.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://iangow.wordpress.com</link>
	<description>Just another WordPress.com site</description>
	<lastBuildDate>Wed, 22 Feb 2012 16:03:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='iangow.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>https://s-ssl.wordpress.com/i/buttonw-com.png</url>
		<title>iangow</title>
		<link>https://iangow.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="https://iangow.wordpress.com/osd.xml" title="iangow" />
	<atom:link rel='hub' href='https://iangow.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Extracting files from SEC &#8220;complete submission&#8221; text filings</title>
		<link>https://iangow.wordpress.com/2012/02/16/extracting-files-from-sec-complete-submission-text-filings/</link>
		<comments>https://iangow.wordpress.com/2012/02/16/extracting-files-from-sec-complete-submission-text-filings/#comments</comments>
		<pubDate>Thu, 16 Feb 2012 16:40:35 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[edgar data]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=351</guid>
		<description><![CDATA[The index files that can be downloaded from the SEC website (see here for more information) provide the location of &#8220;complete submission&#8221; text filings. Using the SEC header (SGML) files, I then identify filings of interest (e.g., 8-K filings with &#8230; <a href="https://iangow.wordpress.com/2012/02/16/extracting-files-from-sec-complete-submission-text-filings/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The index files that can be downloaded from the SEC website (see <a href="http://iangow.wordpress.com/2011/08/26/getting-sec-filing-index-files/" title="Getting SEC filing index files">here</a> for more information) provide the location of &#8220;complete submission&#8221; text filings. </p>
<p>Using the SEC header (<a href="http://iangow.wordpress.com/2011/08/29/getting-sec-filing-header-files/" title="Getting SEC filing header files">SGML</a>) files, I then identify filings of interest (e.g., 8-K filings with Item 2.02 information within 7 days of a quarterly reporting date per Compustat). I download these complete submissions to a folder on my computer (/edgar/data/) and then extract the files embedded in them (e.g., PDFs, JPEGs) into a folder that is indexed using <a href="http://recoll.org">Recoll</a> (based on <a href="http://www.xapian.org/">Xapian</a>) for fast full-text searches of the underlying files (this is on Linux; other solutions likely exist for Mac OS X or Windows). Here is some code that I used to extract the component files. This mostly works (the Excel files don&#8217;t seem to work, either using this code or the SEC links directly), but I&#8217;m sure there&#8217;s a better way to do this.</p>
<p>Note that, with the exception of putting the complete submission files in a different location, I try to retain the directory structure of the SEC website. My reason for putting the complete submissions in a separate location is to prevent <a href="http://www.lesbonscomptes.com/recoll/">Recoll</a> from indexing meaningless binary data embedded in these files.</p>
<p><pre class="brush: r;">
# Go through local copies of SEC text filings and extract the component
# documents from each filing.
# Note: `directory` (the root folder for the extracted files) must be
# defined before this function is called.
extract.filings &lt;- function(file) {
    if (is.na(file)) return(NA)
    
    # Read the complete submission as lines of text
    webpage &lt;- readLines(file.path(&quot;/edgar/data/&quot;, file)) #  encoding=&quot;Latin-1&quot;)
    
    file.name &lt;- gsub(&quot;&lt;FILENAME&gt;&quot;,&quot;&quot;, 
                      grep(&quot;&lt;FILENAME&gt;.*$&quot;, webpage,  perl=TRUE, value=TRUE))
   
    if (length(file.name)==0) { return(NA) }
    
    partial.file.dir &lt;- gsub(&quot;-(\\d{2})-(\\d{6})\\.txt$&quot;, &quot;\\1\\2&quot;, file, perl=TRUE)
    file.dir &lt;- file.path(directory, partial.file.dir)
       
    # recursive=TRUE creates any missing parent folders as well
    dir.create(file.dir, showWarnings=FALSE, recursive=TRUE)
    
    file.name &lt;- unique(file.path(file.dir, file.name))
    start.line &lt;- grep(&quot;&lt;DOCUMENT&gt;.*$&quot;, webpage,  perl=TRUE) 
    end.line &lt;- grep(&quot;&lt;/DOCUMENT&gt;.*$&quot;, webpage,  perl=TRUE)     
    
    get.ext &lt;- function(path) { gsub(&quot;.*\\.(.*?)$&quot;, &quot;\\1&quot;, path) }
    
    for (i in 1:length(file.name)) {
         if(file.exists(file.name[i])) {
             next
         }
        
        if (get.ext(file.name[i])==&quot;txt&quot;) {
            temp &lt;- webpage[start.line[i]:end.line[i]]
            writeLines(temp, con=file.name[i])
        }
        
        if (get.ext(file.name[i])==&quot;pdf&quot;) {
            temp &lt;- webpage[start.line[i]:end.line[i]]
            pdf.start &lt;- grep(&quot;^&lt;PDF&gt;&quot;, temp,  perl=TRUE) +1
            pdf.end &lt;- grep(&quot;^&lt;/PDF&gt;&quot;, temp,  perl=TRUE) -1  
            t &lt;- tempfile()
            writeLines(temp[pdf.start:pdf.end], con=t)
            print(paste(&quot;uudecode -o&quot;, file.name[i], t))
            system(paste(&quot;uudecode -o&quot;, file.name[i], t))
            unlink(t)
        }
        
        if (get.ext(file.name[i]) %in% c(&quot;js&quot;, &quot;css&quot;, &quot;paper&quot;, &quot;xsd&quot;)) {
            # Plain-text components: write the lines between the TEXT tags
            temp &lt;- webpage[start.line[i]:end.line[i]]
            text.start &lt;- grep(&quot;^&lt;TEXT&gt;&quot;, temp,  perl=TRUE) +1
            text.end &lt;- grep(&quot;^&lt;/TEXT&gt;&quot;, temp,  perl=TRUE) -1  
            writeLines(temp[text.start:text.end], con=file.name[i])
        }
        
        if (get.ext(file.name[i]) %in% c(&quot;zip&quot;, &quot;xls&quot;, &quot;jpg&quot;, &quot;gif&quot;)) {
            temp &lt;- webpage[start.line[i]:end.line[i]]
            pdf.start &lt;- grep(&quot;^begin&quot;, temp,  perl=TRUE)
            pdf.end &lt;- grep(&quot;^end&quot;, temp,  perl=TRUE)  
            t &lt;- tempfile()
            writeLines(temp[pdf.start:pdf.end], con=t)
            print(paste(&quot;uudecode -o&quot;, file.name[i], t))
            system(paste(&quot;uudecode -o&quot;, file.name[i], t))
            unlink(t)
        }
    }
    return(partial.file.dir) 
}
</pre></p>
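<p>To make the parsing step concrete, here is a minimal sketch on a made-up, two-document submission. The toy <em>webpage</em> vector, the <em>docs</em> data frame and the example path are assumptions for illustration only; real filings are much messier.</p>
<p><pre class="brush: r;">
# A made-up, minimal &quot;complete submission&quot; with two embedded documents
webpage &lt;- c(&quot;&lt;DOCUMENT&gt;&quot;, &quot;&lt;TYPE&gt;8-K&quot;, &quot;&lt;FILENAME&gt;form8k.txt&quot;, &quot;&lt;TEXT&gt;&quot;,
             &quot;Body of the 8-K&quot;, &quot;&lt;/TEXT&gt;&quot;, &quot;&lt;/DOCUMENT&gt;&quot;,
             &quot;&lt;DOCUMENT&gt;&quot;, &quot;&lt;TYPE&gt;EX-99&quot;, &quot;&lt;FILENAME&gt;style.css&quot;, &quot;&lt;TEXT&gt;&quot;,
             &quot;p { margin: 0; }&quot;, &quot;&lt;/TEXT&gt;&quot;, &quot;&lt;/DOCUMENT&gt;&quot;)

# One row per embedded document: its file name and the lines it spans
docs &lt;- data.frame(
    file.name = sub(&quot;^&lt;FILENAME&gt;&quot;, &quot;&quot;, grep(&quot;^&lt;FILENAME&gt;&quot;, webpage, value=TRUE)),
    start = grep(&quot;^&lt;DOCUMENT&gt;&quot;, webpage),
    end = grep(&quot;^&lt;/DOCUMENT&gt;&quot;, webpage),
    stringsAsFactors=FALSE)
print(docs)

# Pull out the text between the TEXT tags for the second document
temp &lt;- webpage[docs$start[2]:docs$end[2]]
body &lt;- temp[(grep(&quot;^&lt;TEXT&gt;&quot;, temp) + 1):(grep(&quot;^&lt;/TEXT&gt;&quot;, temp) - 1)]
print(body)

# The folder name for the extracted files drops the dashes and the .txt
# extension from the (hypothetical) path of the submission
file &lt;- &quot;edgar/data/1000180/0001000180-12-000005.txt&quot;
gsub(&quot;-(\\d{2})-(\\d{6})\\.txt$&quot;, &quot;\\1\\2&quot;, file, perl=TRUE)
</pre></p>
<p>Keeping each file name in the same row as its start and end lines keeps the three vectors aligned, which may matter if a submission repeats a file name.</p>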
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2012/02/16/extracting-files-from-sec-complete-submission-text-filings/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
		<item>
		<title>Code for 5.2.5 Extended Example: A Salary Study from &#8220;The Art of R Programming&#8221;</title>
		<link>https://iangow.wordpress.com/2012/02/12/code-for-5-2-5-extended-example-a-salary-study-from-the-art-of-r-programming/</link>
		<comments>https://iangow.wordpress.com/2012/02/12/code-for-5-2-5-extended-example-a-salary-study-from-the-art-of-r-programming/#comments</comments>
		<pubDate>Sun, 12 Feb 2012 15:07:25 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[The Art of R Programming]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=340</guid>
		<description><![CDATA[Here is code for 5.2.5 Extended Example: A Salary Study from &#8220;The Art of R Programming&#8221;. The main addition to the code in the book is code to get the data from the Department of Labor&#8217;s website.]]></description>
			<content:encoded><![CDATA[<p>Here is code for <strong><em>5.2.5 Extended Example: A Salary Study</em></strong> from &#8220;<a href="http://nostarch.com/artofr.htm">The Art of R Programming</a>&#8221;. The main addition to the code in the book is code to get the data from the Department of Labor&#8217;s website.</p>
<p><pre class="brush: r;">
# Get data from internet, read into R
zipped_data &lt;- &quot;Perm_FY_2006_TEXT.zip&quot;
download.file(&quot;http://www.flcdatacenter.com/download/Perm_FY2006_TEXT.zip&quot;,
              destfile=zipped_data)
raw_data &lt;- unzip(zipped_data)
all2006 &lt;- read.csv(raw_data, as.is=TRUE, header=TRUE)
unlink(raw_data); rm(raw_data)

# A little data-cleaning
all2006 &lt;- within(all2006, {
  Wage_Offered_From &lt;- as.numeric(gsub(&quot;\\$&quot;,&quot;&quot;, Wage_Offered_From))
  Prevailing_Wage_Amount &lt;- as.numeric(gsub(&quot;\\$&quot;,&quot;&quot;, Prevailing_Wage_Amount))
  rat &lt;- Wage_Offered_From/Prevailing_Wage_Amount
})

# Some more data-cleaning (per p.108)
all2006 &lt;- subset(all2006, 
                  Wage_Per==&quot;Year&quot; &amp;              # Exclude hourly-wagers
                    Wage_Offered_From &gt; 20000 &amp;   # Exclude weird cases
                    Prevailing_Wage_Amount &gt; 200) # Exclude hourly prv wages

# Subsetting as on p.109 (I changed the code, as the code given in the book
# behaves strangely for me)
se2006 &lt;- subset(all2006, grepl(&quot;Software Engineer&quot;, Prevailing_Wage_Job_Title))
prg2006 &lt;- subset(all2006, grepl(&quot;Programmer&quot;, Prevailing_Wage_Job_Title))
ee2006 &lt;- subset(all2006, grepl(&quot;Electronics Engineer&quot;, Prevailing_Wage_Job_Title))

medrat &lt;- function(dataframe) {
  return(median(dataframe$rat, na.rm=TRUE))
}

makecorp &lt;- function(corpname) {
  return(subset(all2006, Employer_Name == corpname))
}

corplist &lt;- c(&quot;MICROSOFT CORPORATION&quot;, &quot;ms&quot;, 
              &quot;INTEL CORPORATION&quot;, &quot;intel&quot;,
              &quot;SUN MICROSYSTEMS, INC.&quot;, &quot;sun&quot;,
              &quot;GOOGLE INC.&quot;, &quot;google&quot;)

for (i in 1:(length(corplist)/2)) {
  corp &lt;- corplist[2*i-1]
  newdtf &lt;- paste(corplist[2*i], &quot;2006&quot;, sep=&quot;&quot;)
  assign(newdtf, makecorp(corp), pos=.GlobalEnv)
}
</pre></p>
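<p>The <em>medrat</em> function is defined above but not called. As a minimal sketch of what it computes, here it is applied to a made-up data frame (<em>toy</em> and its three rows are assumptions, not Department of Labor data) after the same cleaning step.</p>
<p><pre class="brush: r;">
# Made-up rows mimicking the two wage columns used above
toy &lt;- data.frame(Wage_Offered_From=c(&quot;$60000&quot;, &quot;$85000&quot;, &quot;$72000&quot;),
                  Prevailing_Wage_Amount=c(&quot;$50000&quot;, &quot;$85000&quot;, &quot;$80000&quot;),
                  stringsAsFactors=FALSE)

# Same cleaning as above: strip the dollar sign, then form the ratio
toy &lt;- within(toy, {
  Wage_Offered_From &lt;- as.numeric(gsub(&quot;\\$&quot;, &quot;&quot;, Wage_Offered_From))
  Prevailing_Wage_Amount &lt;- as.numeric(gsub(&quot;\\$&quot;, &quot;&quot;, Prevailing_Wage_Amount))
  rat &lt;- Wage_Offered_From/Prevailing_Wage_Amount
})

medrat &lt;- function(dataframe) {
  return(median(dataframe$rat, na.rm=TRUE))
}

# Median of the offered-to-prevailing wage ratios (1.2, 1.0 and 0.9 here)
medrat(toy)
</pre></p>
<p>With the real data loaded, the same call can be made on any of the subsets, e.g. <em>medrat(se2006)</em>.</p>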
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2012/02/12/code-for-5-2-5-extended-example-a-salary-study-from-the-art-of-r-programming/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
		<item>
		<title>Code to get data from &#8220;Extended Example&#8221; from Chapter 5 of &#8220;The Art of R Programming&#8221;</title>
		<link>https://iangow.wordpress.com/2012/02/02/code-to-get-data-from-extended-example-from-chapter-5-of-the-art-of-r-programming/</link>
		<comments>https://iangow.wordpress.com/2012/02/02/code-to-get-data-from-extended-example-from-chapter-5-of-the-art-of-r-programming/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 00:51:02 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=330</guid>
		<description><![CDATA[I agree with the review here that &#8220;The Art of R Programming&#8221; is a nice book, but the lack of data for some of the examples is a downside (I find it nice to work along with real examples). One &#8230; <a href="https://iangow.wordpress.com/2012/02/02/code-to-get-data-from-extended-example-from-chapter-5-of-the-art-of-r-programming/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I agree with the review <a href="http://xianblog.wordpress.com/2012/01/31/the-art-of-r-programming-guest-post/">here</a> that &#8220;<a href="http://nostarch.com/artofr.htm">The Art of R Programming</a>&#8221; is a nice book, but the lack of data for some of the examples is a downside (I find it nice to work along with real examples).</p>
<p>One example that is hard to get the full value of without the underlying data is the extended example using data on the pronunciation of Chinese characters in Cantonese and Mandarin at the end of Chapter 5. Here is some code to pull together data that can be used for that example.</p>
<p><pre class="brush: r;">
# Get the raw data from Matloff's website
chinese.raw &lt;- readLines(paste(&quot;http://www.cs.ucdavis.edu/~matloff&quot;,
                               &quot;matloff/public_html/145/Handouts/R2&quot;, 
                               &quot;CanManB5.utf8&quot;, sep=&quot;/&quot;))
save(file=&quot;chinese.raw.Rdata&quot;, chinese.raw )

# Create a tab-delimited version of the raw data
chinese &lt;- sub(&quot; &quot;, &quot;\t&quot;, chinese.raw)
chinese &lt;- sub(&quot; &quot;, &quot;\t&quot;, chinese)
chinese &lt;- sub(&quot; &quot;, &quot;\t&quot;, chinese)

# Function to split the data into fields and turn into a data frame
process.row &lt;- function(string) {    
    temp &lt;- unlist(strsplit(string, &quot;\t&quot;))
    return(data.frame(char=temp[1], Can=temp[2], Man=temp[3], Eng=temp[4], 
                      stringsAsFactors=FALSE))
}

# Create a data frame from the raw data
chinese.list &lt;- lapply(chinese, process.row)
chinese.data &lt;- do.call(&quot;rbind&quot;, chinese.list)
names(chinese.data)[1] &lt;- &quot;Ch char&quot;

# Fix some cases with multiple Chinese pronunciations (I implicitly
# assume these are Cantonese pronunciations)
to.fix &lt;- grep(&quot;^[[:alpha:]]+\\d&quot;, chinese.data$Eng, perl=TRUE)
chinese.data$Man[to.fix] &lt;-
  gsub(&quot;^([[:alpha:]]+\\d) (.*)&quot;, &quot;\\1&quot;, chinese.data$Eng[to.fix], perl=TRUE)
chinese.data$Eng[to.fix] &lt;-
  gsub(&quot;^([[:alpha:]]+\\d) (.*)&quot;, &quot;\\2&quot;, chinese.data$Eng[to.fix], perl=TRUE)

# Make two datasets: one for Cantonese, the other for Mandarin
can8 &lt;- subset(chinese.data, select=c(&quot;Ch char&quot;, &quot;Can&quot;))
man8 &lt;- subset(chinese.data, select=c(&quot;Ch char&quot;, &quot;Man&quot;))

# Ditch unneeded variables and save data to feed into the Chapter 5 example
rm(chinese.list, chinese, chinese.raw, chinese.data)
save(can8, man8, file=&quot;chinese.Rdata&quot;)
</pre> (This is not elegant, but it seems to work.)</p>
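<p>To see what the three <em>sub()</em> calls do, here is a small sketch on a single made-up line (the line and its layout of character, Cantonese, Mandarin and English gloss separated by spaces are assumptions based on the code above). Only the first three spaces become tabs, so spaces inside the English gloss survive.</p>
<p><pre class="brush: r;">
# A made-up line: character, Cantonese, Mandarin, then an English gloss
line &lt;- &quot;\u4e00 jat1 yi1 one; a single&quot;

# sub() replaces only the first match, so three calls convert three spaces
for (k in 1:3) line &lt;- sub(&quot; &quot;, &quot;\t&quot;, line)

# Splitting on tabs now gives exactly four fields
fields &lt;- unlist(strsplit(line, &quot;\t&quot;))
fields
</pre></p>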
<p>Here is some code (closely based on that in the book) that uses this data.<br />
<pre class="brush: r;">
load(&quot;chinese.Rdata&quot;)

# merges data for 2 fangyans
merge2fy &lt;- function(fy1, fy2) {
  outdf &lt;- merge(fy1, fy2)

  # Separate tone from sound, and create new columns
  for (fy in list(fy1, fy2)) {
    # saplout will be a matrix, init cons in row 1, remainders in row
    # 2 and tones in row 3
    saplout &lt;- sapply((fy[[2]]), sepsoundtone)
    
    # convert it to a data frame
    tmpdf &lt;- data.frame(fy[, 1], t(saplout), row.names=NULL,
                        stringsAsFactors=FALSE)
    
    # Add names to the columns
    consname &lt;- paste(names(fy)[[2]], &quot; cons&quot;, sep=&quot;&quot;)
    restname &lt;- paste(names(fy)[[2]], &quot; sound&quot;, sep=&quot;&quot;)
    tonename &lt;- paste(names(fy)[[2]], &quot; tone&quot;, sep=&quot;&quot;)
    names(tmpdf) &lt;- c(&quot;Ch char&quot;, consname, restname, tonename)
    
    # Need to use merge(), not cbind(), due to possibly different
    # ordering of fy, outdf
    outdf &lt;- merge(outdf, tmpdf)
  }
  return(outdf)
}

# Separates romanized pronunciation pronun into the initial consonant, if any;
# the remainder of the sound; and the tone, if any
sepsoundtone &lt;- function(pronun) {
  nchr &lt;- nchar(pronun)
  vowels &lt;- c(&quot;a&quot;, &quot;e&quot;, &quot;i&quot;, &quot;o&quot;, &quot;u&quot;)
  
  # How many initial consonants?
  numcons &lt;- 0
  for (i in 1:nchr) {
    ltr &lt;- substr(pronun, i, i)
    if (!ltr %in% vowels) numcons &lt;- numcons + 1 else break
  }
  cons &lt;- if (numcons &gt; 0) substr(pronun, 1, numcons) else NA
  tone &lt;- substr(pronun, nchr, nchr)
  numtones &lt;- 1 - as.integer(tone %in% letters) # TRUE is 1, FALSE is 0
  if (numtones == 0) tone &lt;- NA
  therest &lt;- substr(pronun, numcons+1, nchr - numtones)
  return(c(cons, therest, tone))
}

system.time(canman8 &lt;- merge2fy(can8, man8))
</pre></p>
<p>Here is an alternative way to tackle the same problem using regular expressions (it seems to produce the same results, though I didn&#8217;t check this carefully). It illustrates that the Perl <a href="http://en.wikipedia.org/wiki/There's_more_than_one_way_to_do_it">motto</a>, &#8220;there&#8217;s more than one way to do it,&#8221; applies to R too.<br />
<pre class="brush: r;">

load(&quot;chinese.Rdata&quot;)

# merges data for 2 fangyans
merge2fy &lt;- function(fy1, fy2) {
  outdf &lt;- merge(fy1, fy2)
  
  # Separate tone from sound, and create new columns
  for (fy in list(fy1, fy2)) {
    
    # Matching on pronunciation requires this step to prevent
    # duplicate matches.
    # (Perhaps save some time by only processing unique pronunciations.)
    pronun &lt;- unique(fy[[2]])
    
    # Regular expression separates romanized pronunciation pronun into initial 
    # consonant, if any; the remainder of the sound, if any; and the tone, if any
    
    # tmpdf will be a matrix, init cons in column 2, remainders in column
    # 3 and tones in column 4
    # Three components to the match:
    #   - String at the beginning that does not contain vowels or digits
    #   - String beginning with a vowel, followed by letters
    #   - String consisting of a single digit
    matches &lt;- regexec(&quot;^([^aeiou0-9]*)([aeiou]\\D*)?(\\d)?$&quot;, pronun)
    tmpdf &lt;- do.call(&quot;rbind&quot;, regmatches(pronun, matches))
    
    # convert it to a data frame
    tmpdf &lt;- as.data.frame(tmpdf, stringsAsFactors=FALSE)
    
    # Add names to the columns
    names(tmpdf) &lt;- 
      paste(names(fy)[[2]], c(&quot;&quot;, &quot; cons&quot;, &quot; sound&quot;, &quot; tone&quot;), sep=&quot;&quot;)
    
    # Need to use merge(), not cbind(), due to possibly different
    # ordering of fy, outdf
    outdf &lt;- merge(outdf, tmpdf, all.x=TRUE)
  }
  return(outdf)
}

system.time(canman8 &lt;- merge2fy(can8, man8))
</pre></p>
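<p>As a quick illustration of the regular expression used above, here is a sketch that applies it to three sample pronunciations (the sample strings and the <em>parts</em> matrix are made up for illustration). Each row holds the full match followed by the three captured groups.</p>
<p><pre class="brush: r;">
# Sample romanized pronunciations (made up for illustration)
pronun &lt;- c(&quot;gwong2&quot;, &quot;aa3&quot;, &quot;yi1&quot;)

# Same pattern as above: leading consonants, rest of the sound, tone digit
matches &lt;- regexec(&quot;^([^aeiou0-9]*)([aeiou]\\D*)?(\\d)?$&quot;, pronun)

# regmatches() returns one character vector per input; stack them as rows
parts &lt;- do.call(&quot;rbind&quot;, regmatches(pronun, matches))
colnames(parts) &lt;- c(&quot;pronun&quot;, &quot;cons&quot;, &quot;sound&quot;, &quot;tone&quot;)
parts
</pre></p>
<p>A pronunciation with no initial consonant, such as the second one, simply gets an empty string in the consonant column rather than the NA produced by the loop-based version.</p>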
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2012/02/02/code-to-get-data-from-extended-example-from-chapter-5-of-the-art-of-r-programming/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
		<item>
		<title>IV regression and two-way cluster-robust standard errors</title>
		<link>https://iangow.wordpress.com/2012/01/19/iv-regression-and-two-way-cluster-robust-standard-errors/</link>
		<comments>https://iangow.wordpress.com/2012/01/19/iv-regression-and-two-way-cluster-robust-standard-errors/#comments</comments>
		<pubDate>Thu, 19 Jan 2012 14:32:00 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cluster-robust]]></category>
		<category><![CDATA[IV regression]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=317</guid>
		<description><![CDATA[As a follow-up to an earlier post, I was pleasantly surprised to discover that the code to handle two-way cluster-robust standard errors in R that I blogged about earlier worked out of the box with the IV regression routine available &#8230; <a href="https://iangow.wordpress.com/2012/01/19/iv-regression-and-two-way-cluster-robust-standard-errors/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>As a follow-up to an earlier post, I was pleasantly surprised to discover that the <a href="http://iangow.wordpress.com/2011/05/19/r-code-for-two-way-cluster-robust-standard-errors/" title="R code for two-way cluster-robust standard errors">code</a> for two-way cluster-robust standard errors in R described there works out of the box with the IV regression routine available in the AER package (<em>ivreg</em>).</p>
<p><pre class="brush: r;">
# Tests of R function using Mitchell Petersen's test-data

# Read the data             
test &lt;- read.table(
      url(paste(&quot;http://www.kellogg.northwestern.edu/&quot;,
            &quot;faculty/petersen/htm/papers/se/&quot;,
            &quot;test_data.txt&quot;,sep=&quot;&quot;)),
    col.names=c(&quot;firmid&quot;, &quot;year&quot;, &quot;x&quot;, &quot;y&quot;))

# Create a silly instrument (x plus noise)
test &lt;- within(test, z &lt;- x + rnorm(length(x), mean=0, sd=sd(x)/3))

# The fitted model
library(AER)
fm &lt;- ivreg(y ~ x | z, data=test)

# Tests
library(sandwich); library(lmtest)
coeftest(fm)                                    # OLS
coeftest(fm, vcov=vcovHC(fm, type=&quot;HC0&quot;))       # White
source(&quot;~/Dropbox/AGL/Code/R/cluster2.R&quot;)
coeftest.cluster(test,fm, cluster1=&quot;firmid&quot;)    # Clustered by firm
coeftest.cluster(test,fm, cluster1=&quot;year&quot;)      # Clustered by year
coeftest.cluster(test,fm, cluster1=&quot;firmid&quot;,
                          cluster2=&quot;year&quot;)      # Clustered by firm and year

# Save the data to a Stata file for comparison
library(foreign)
write.dta(dataframe=test, file=&quot;~/Dropbox/WRDS/petersen.dta&quot;)
</pre></p>
<p>Here&#8217;s the output:<br />
<pre class="brush: r;">
&gt; test &lt;- within(test, z &lt;- x + rnorm(length(x), mean=0, sd=sd(x)/3))
Error in within(test, z &lt;- x + rnorm(length(x), mean = 0, sd = sd(x)/3)) : 
  object 'test' not found
&gt; test &lt;- read.table(
+       url(paste(&quot;http://www.kellogg.northwestern.edu/&quot;,
+             &quot;faculty/petersen/htm/papers/se/&quot;,
+             &quot;test_data.txt&quot;,sep=&quot;&quot;)),
+     col.names=c(&quot;firmid&quot;, &quot;year&quot;, &quot;x&quot;, &quot;y&quot;))
&gt; 
&gt; test &lt;- within(test, z &lt;- x + rnorm(length(x), mean=0, sd=sd(x)/3))
&gt; fm &lt;- ivreg(y ~ x | z, data=test)
Error: could not find function &quot;ivreg&quot;
&gt; library(AER)
Loading required package: car
Loading required package: MASS
Loading required package: nnet
Loading required package: survival
Loading required package: splines
Loading required package: Formula
Loading required package: lmtest
Loading required package: zoo

Attaching package: ‘zoo’

The following object(s) are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
Loading required package: strucchange
&gt; fm &lt;- ivreg(y ~ x | z, data=test)
&gt; library(sandwich); library(lmtest)
&gt; coeftest(fm)                                    # OLS

t test of coefficients:

            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 0.029685   0.028359  1.0468   0.2953    
x           1.033802   0.030142 34.2982   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

&gt; coeftest(fm, vcov=vcovHC(fm, type=&quot;HC0&quot;))       # White

t test of coefficients:

            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 0.029685   0.028354  1.0469   0.2952    
x           1.033802   0.029851 34.6326   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

&gt; coeftest.cluster(test,fm, cluster1=&quot;firmid&quot;)    # Clustered by firm
Error: could not find function &quot;coeftest.cluster&quot;
&gt; coeftest.cluster(test,fm, cluster1=&quot;year&quot;)      # Clustered by year
Error: could not find function &quot;coeftest.cluster&quot;
&gt; source(&quot;~/Dropbox/AGL/Code/R/cluster2.R&quot;)
&gt; coeftest.cluster(test,fm, cluster1=&quot;firmid&quot;)    # Clustered by firm

t test of coefficients:

            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 0.029685   0.067011   0.443   0.6578    
x           1.033802   0.051423  20.104   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

&gt; coeftest.cluster(test,fm, cluster1=&quot;year&quot;)      # Clustered by year

t test of coefficients:

            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 0.029685   0.023395  1.2688   0.2046    
x           1.033802   0.036246 28.5219   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

&gt; coeftest.cluster(test,fm, cluster1=&quot;firmid&quot;,
+                           cluster2=&quot;year&quot;)      # Clustered by firm and year

t test of coefficients:

            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 0.029685   0.065065  0.4562   0.6482    
x           1.033802   0.055377 18.6684   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

&gt; library(foreign)
&gt; write.dta(dataframe=test, file=&quot;~/Dropbox/WRDS/petersen.dta&quot;)
</pre></p>
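<p>For readers less familiar with what <em>ivreg(y ~ x | z)</em> estimates, here is a minimal base-R sketch on simulated data (the data and all variable names are made up; this is not Petersen&#8217;s test data). With one regressor and one instrument, the IV slope equals cov(z, y)/cov(z, x), and two stages of OLS give the same point estimate.</p>
<p><pre class="brush: r;">
# Simulated data (an assumption for illustration only)
set.seed(1)
n &lt;- 1000
x &lt;- rnorm(n)
y &lt;- 1 + 2*x + rnorm(n)
z &lt;- x + rnorm(n, mean=0, sd=sd(x)/3)   # instrument: x plus noise

# Just-identified IV slope
beta.iv &lt;- cov(z, y)/cov(z, x)

# Same slope from two stages of OLS. The standard errors reported by the
# second-stage lm() would not be valid, which is one reason to use ivreg.
x.hat &lt;- fitted(lm(x ~ z))
beta.2sls &lt;- unname(coef(lm(y ~ x.hat))[&quot;x.hat&quot;])

all.equal(beta.iv, beta.2sls)
</pre></p>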
<p>Now, compare with Stata. First, I use the built-in <em>ivreg</em> routine, then I consider the <em>ivreg2</em> routine available <a href="http://ideas.repec.org/c/boc/bocode/s425401.html">here</a>.<br />
The R routine seems to produce the same standard errors as Stata&#8217;s <em>ivreg</em> routine, which only handles one-way clustering, but both differ from the <em>ivreg2</em> routine. My guess is that this reflects a difference in the degrees-of-freedom correction used; the numbers are fairly close.</p>
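<p>As a rough check on how large a degrees-of-freedom effect could be, here is a small sketch. It assumes the difference is the common finite-sample factor G/(G-1) &times; (N-1)/(N-K) for one-way clustering; that is an assumption, not something verified against the <em>ivreg2</em> source. N and G are taken from the Stata output for clustering by firm.</p>
<p><pre class="brush: r;">
N &lt;- 5000   # observations (from the Stata output)
G &lt;- 500    # firm clusters (from the Stata output)
K &lt;- 2      # estimated coefficients: intercept and x

# Finite-sample factor commonly applied to a one-way cluster-robust
# variance matrix (assumed here to be the source of the difference)
dfc &lt;- (G/(G - 1)) * ((N - 1)/(N - K))

# Implied ratio of standard errors with and without the factor
sqrt(dfc)
</pre></p>
<p>With these values the factor moves standard errors by only about 0.1%, so a gap of that size between routines would be consistent with this explanation.</p>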
<p><pre class="brush: plain;">
. use &quot;~/Dropbox/WRDS/petersen.dta&quot;
(Written by R.              )

. ivreg y (x=z)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  1,  4998) = 1176.37
       Model |  5270.65875     1  5270.65875           Prob &gt; F      =  0.0000
    Residual |  20097.6441  4998  4.02113727           R-squared     =  0.2078
-------------+------------------------------           Adj R-squared =  0.2076
       Total |  25368.3028  4999   5.0746755           Root MSE      =  2.0053

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P&gt;|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.033802   .0301416    34.30   0.000     .9747116    1.092893
       _cons |   .0296853   .0283594     1.05   0.295    -.0259115    .0852821
------------------------------------------------------------------------------
Instrumented:  x
Instruments:   z
------------------------------------------------------------------------------

. ivreg y (x=z), cluster(firmid)

Instrumental variables (2SLS) regression               Number of obs =    5000
                                                       F(  1,   499) =  404.17
                                                       Prob &gt; F      =  0.0000
                                                       R-squared     =  0.2078
                                                       Root MSE      =  2.0053

                               (Std. Err. adjusted for 500 clusters in firmid)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P&gt;|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.033802   .0514227    20.10   0.000     .9327707    1.134834
       _cons |   .0296853   .0670107     0.44   0.658    -.1019726    .1613431
------------------------------------------------------------------------------
Instrumented:  x
Instruments:   z
------------------------------------------------------------------------------

. ivreg y (x=z), cluster(year)

Instrumental variables (2SLS) regression               Number of obs =    5000
                                                       F(  1,     9) =  813.50
                                                       Prob &gt; F      =  0.0000
                                                       R-squared     =  0.2078
                                                       Root MSE      =  2.0053

                                  (Std. Err. adjusted for 10 clusters in year)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P&gt;|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.033802   .0362459    28.52   0.000     .9518085    1.115796
       _cons |   .0296853   .0233954     1.27   0.236    -.0232389    .0826094
------------------------------------------------------------------------------
Instrumented:  x
Instruments:   z
------------------------------------------------------------------------------

. ivreg2 y (x=z), cluster(year)

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on year

Number of clusters (year) =         10                Number of obs =     5000
                                                      F(  1,     9) =   813.50
                                                      Prob &gt; F      =   0.0000
Total (centered) SS     =  25368.30284                Centered R2   =   0.2078
Total (uncentered) SS   =  25374.51146                Uncentered R2 =   0.2080
Residual SS             =  20097.64409                Root MSE      =    2.005

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.033802   .0343824    30.07   0.000     .9664141    1.101191
       _cons |   .0296853   .0221926     1.34   0.181    -.0138115     .073182
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):              9.987
                                                   Chi-sq(1) P-val =    0.0016
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):              4.5e+04
                         (Kleibergen-Paap rk Wald F statistic):        6.1e+04
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):         0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         x
Excluded instruments: z
------------------------------------------------------------------------------

. ivreg2 y (x=z), cluster(firmid)

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on firmid

Number of clusters (firmid) =      500                Number of obs =     5000
                                                      F(  1,   499) =   404.17
                                                      Prob &gt; F      =   0.0000
Total (centered) SS     =  25368.30284                Centered R2   =   0.2078
Total (uncentered) SS   =  25374.51146                Uncentered R2 =   0.2080
Residual SS             =  20097.64409                Root MSE      =    2.005

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.033802   .0513661    20.13   0.000     .9331267    1.134478
       _cons |   .0296853   .0669369     0.44   0.657    -.1015087    .1608792
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):            308.025
                                                   Chi-sq(1) P-val =    0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):              4.5e+04
                         (Kleibergen-Paap rk Wald F statistic):        3.2e+04
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):         0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         x
Excluded instruments: z
------------------------------------------------------------------------------

. ivreg2 y (x=z), cluster(firmid fyear)
variable fyear not found
in option cluster()
r(111);

. ivreg2 y (x=z), cluster(firmid year)

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on firmid and year

Number of clusters (firmid) =      500                Number of obs =     5000
Number of clusters (year) =         10                F(  1,     9) =   328.27
                                                      Prob &gt; F      =   0.0000
Total (centered) SS     =  25368.30284                Centered R2   =   0.2078
Total (uncentered) SS   =  25374.51146                Uncentered R2 =   0.2080
Residual SS             =  20097.64409                Root MSE      =    2.005

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.033802   .0541255    19.10   0.000     .9277184    1.139886
       _cons |   .0296853   .0645686     0.46   0.646     -.096867    .1562375
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):              9.731
                                                   Chi-sq(1) P-val =    0.0018
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):              4.5e+04
                         (Kleibergen-Paap rk Wald F statistic):        4.0e+04
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):         0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         x
Excluded instruments: z
------------------------------------------------------------------------------


</pre></p>
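<p>The degrees-of-freedom guess can be checked with a little arithmetic. Here is a minimal sketch in R, assuming (this is an assumption on my part, not something shown in the output above) that <em>ivreg</em> scales the cluster-robust variance by (N-1)/(N-K) * M/(M-1), where N is the number of observations, K the number of coefficients and M the number of clusters, and that <em>ivreg2</em> applies no such scaling by default. The function name dof_adj is made up for this illustration; the standard errors are copied from the one-way output above.</p>
<p><pre class="brush: r;">
# Small-sample adjustment factor for standard errors (an assumed formula)
dof_adj &lt;- function(N, K, M) sqrt((N - 1)/(N - K) * M/(M - 1))

N &lt;- 5000   # observations
K &lt;- 2      # coefficients (intercept and x)

# Standard errors on x reported by ivreg2 (assumed to be unadjusted)
se_firm_ivreg2 &lt;- 0.0513661
se_year_ivreg2 &lt;- 0.0343824

# Scale up and compare with ivreg (0.0514227 by firm, 0.0362459 by year)
se_firm_ivreg2 * dof_adj(N, K, M=500)
se_year_ivreg2 * dof_adj(N, K, M=10)
</pre></p>
<p>Under that assumption, the scaled values agree with the <em>ivreg</em> standard errors to the precision printed, at least for the one-way cases.</p>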
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2012/01/19/iv-regression-and-two-way-cluster-robust-standard-errors/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
		<item>
		<title>Jumbling words</title>
		<link>https://iangow.wordpress.com/2012/01/03/jumbling-words/</link>
		<comments>https://iangow.wordpress.com/2012/01/03/jumbling-words/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 19:14:56 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=308</guid>
		<description><![CDATA[Here&#8217;s a function (written in R) to jumble a sentence by rearranging letters other than the first and last of each word. The idea comes from a birthday card I received from my aunt in Australia. Some sample output &#62; &#8230; <a href="https://iangow.wordpress.com/2012/01/03/jumbling-words/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a function (written in R) to jumble a sentence by rearranging letters other than the first and last of each word. The idea comes from a birthday card I received from my aunt in <a href="http://en.wikipedia.org/wiki/Hill_Top,_New_South_Wales">Australia</a>.</p>
<p><pre class="brush: r;">
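# A function to jumble a vector with no chance of
# returning the original vector (unless no other ordering exists)
vector_jumble &lt;- function(vector) {

    # A vector with fewer than two distinct values cannot be reordered
    # into anything different, so return it unchanged; without this
    # check the loop below would never end
    if (length(unique(vector)) &lt; 2) return(vector)

    # Create an initial sample
    jumble &lt;- sample(vector)

    # Keep going until the jumbled vector differs from the original
    while(identical(vector,jumble)) {
        jumble &lt;- sample(vector)
    }
    return(jumble)
}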

# A function to jumble a word
word_jumble &lt;- function(word) {
    
    # Split word into individual letters
    j &lt;- unlist(strsplit(word,&quot;&quot;))
    
    # If there are three or fewer letters (i.e., at most one middle letter),
    # there is nothing to do. Otherwise, jumble the middle letters.
    if (length(j) &lt; 4) {
        return(word)
    } else {
        first &lt;- 1
        middle &lt;- 2:(length(j)-1)
        last &lt;- length(j)
        return(paste(j[c(first, vector_jumble(middle), last)], collapse=&quot;&quot;))    
    }
}

# A function to jumble a sentence
sentence_jumble &lt;- function(sentence) {
    # Initialize a variable that will store the jumbled sentence
    y &lt;- NULL
    
    # Break the sentence into words (separated by spaces)
    x &lt;- unlist(strsplit(sentence, &quot; &quot;))
    
    # Jumble each word in the sentence
    for (i in x) {
        y &lt;- paste(y,word_jumble(i))
    }
    
    # Return the result, after deleting leading spaces
    return(gsub(&quot;^[ ]+&quot;,&quot;&quot;,y, perl=TRUE))
}
   
# A test sentence
a &lt;- &quot;The quick brown fox jumps over the lazy dog&quot;
sentence_jumble(a)
</pre></p>
<p>Some sample output:<br />
<code><br />
&gt; sentence_jumble(a)<br />
[1] "The qicuk bworn fox jpmus oevr the lzay dog"<br />
</code></p>
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2012/01/03/jumbling-words/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
		<item>
		<title>New feature of PL/R: WINDOW functions</title>
		<link>https://iangow.wordpress.com/2011/08/31/new-feature-of-plr-window-functions/</link>
		<comments>https://iangow.wordpress.com/2011/08/31/new-feature-of-plr-window-functions/#comments</comments>
		<pubDate>Wed, 31 Aug 2011 20:31:16 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=296</guid>
		<description><![CDATA[PL/R is a module that allows PostgreSQL to access R functionality. Here is an illustration of a feature recently added to PL/R. This is meant to represent a &#8220;panel&#8221; of ten-year time series for 100 firms numbered 1 through 100. I &#8230; <a href="https://iangow.wordpress.com/2011/08/31/new-feature-of-plr-window-functions/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.joeconway.com/plr/doc/index.html">PL/R</a> is a module that allows PostgreSQL to access R functionality.</p>
<p>Here is an illustration of a feature recently added to PL/R: support for WINDOW functions. The test data created below are meant to represent a &#8220;panel&#8221; of ten-year time series for 100 firms numbered 1 through 100.</p>
<p><pre class="brush: sql;">
DROP TABLE IF EXISTS test_data;

CREATE TABLE test_data (
  fyear integer,
  firm float8,
  eps float8
);

INSERT INTO test_data
SELECT (b.f + 1) % 10 + 2000 AS fyear,
	floor((b.f+1)/10) + 50 AS firm,
       f::float8/100 + random()/10 AS eps
FROM generate_series(-500,499,1) b(f);
</pre></p>
<p>I then run the following queries. The SQL version takes 115 ms on my hardware and the PL/R version 1528 ms, so there&#8217;s a potential tradeoff between performance and the flexibility of R. (The comparison is pretty unscientific, as I don&#8217;t require 9 years of data for the SQL version.) The good news is that most of the 1528 ms is R cranking away on the regression: if I replace the $BODY$ of the function with &#8220;return(1)&#8221;, the time comes down to 97 ms for the PL/R version. So standard approaches to speeding things up on the R side would seem to apply. (I used &#8220;float8&#8221; for the types, but didn&#8217;t notice a performance effect.)</p>
<p><pre class="brush: sql;">
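CREATE OR REPLACE
FUNCTION r_regr_slope(float8, float8)
RETURNS float8 AS
$BODY$
  # farg1 and farg2 hold the two arguments for every row in the current
  # window frame; fnumrows is the number of rows in that frame
  slope &lt;- NA
  y &lt;- farg1
  x &lt;- farg2
  # Only fit the regression when the frame has a full nine rows
  if (fnumrows==9) try (slope &lt;- lm(y ~ x)$coefficients[2])
  return(slope)
$BODY$
LANGUAGE plr WINDOW;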

SELECT *,
  regr_slope(eps, lag_eps) OVER w AS slope
FROM (SELECT firm, fyear, eps,
  lag(eps) OVER (ORDER BY firm, fyear) AS lag_eps
FROM test_data) AS a
WHERE eps IS NOT NULL
WINDOW w AS (ORDER BY firm, fyear ROWS 8 PRECEDING);

SELECT *, r_regr_slope(eps, lag_eps) OVER w AS slope_R
FROM (SELECT firm, fyear, eps,
  lag(eps) OVER (ORDER BY firm, fyear) AS lag_eps
FROM test_data) AS a
WHERE eps IS NOT NULL
WINDOW w AS (ORDER BY firm, fyear ROWS 8 PRECEDING);
</pre></p>
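<p>As one example of the kind of R-side speed-up mentioned above, the slope of a regression with a single regressor can be computed directly as cov(x, y)/var(x), which avoids the overhead of lm(). This is only a sketch in plain R using simulated data (the seed and coefficients are arbitrary); I have not timed it inside PL/R.</p>
<p><pre class="brush: r;">
# Simulate one nine-row window of data
set.seed(42)
x &lt;- rnorm(9)
y &lt;- 0.5 * x + rnorm(9)

# Slope from lm(), as in r_regr_slope above
slope_lm &lt;- unname(lm(y ~ x)$coefficients[2])

# Closed-form slope for a regression with a single regressor
slope_fast &lt;- cov(x, y) / var(x)

# The two agree up to floating-point error
all.equal(slope_lm, slope_fast)
</pre></p>
<p>One caveat: lm() silently drops rows with missing values, whereas cov() and var() return NA by default, so rows with a missing lag_eps would need to be filtered out first.</p>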
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2011/08/31/new-feature-of-plr-window-functions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
		<item>
		<title>Getting SEC filing header files</title>
		<link>https://iangow.wordpress.com/2011/08/29/getting-sec-filing-header-files/</link>
		<comments>https://iangow.wordpress.com/2011/08/29/getting-sec-filing-header-files/#comments</comments>
		<pubDate>Mon, 29 Aug 2011 14:07:04 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=283</guid>
		<description><![CDATA[This post provides some code for downloading header (*.sgml) files associated with SEC filings by firms. The first piece of code is a function that takes the path to the text filing, which is found on the SEC index files, &#8230; <a href="https://iangow.wordpress.com/2011/08/29/getting-sec-filing-header-files/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This post provides some code for downloading header (*.sgml) files associated with SEC filings by firms.  The first piece of code is a function that takes the path to the text filing (as found in the SEC index files), transforms that path into the path for the SGML file, checks whether the file has already been downloaded, and, if not, downloads it.</p>
<p><pre class="brush: r;">
get_sgml_file &lt;- function(path) {
    
    # The name of the local file to be created and the remote file from which
    # it will be created.
    directory &lt;- &quot;/home/iangow/Documents/WRDS/filings/the_filings&quot;
    local_filename &lt;- gsub(&quot;^edgar\\/data(.*)\\.txt$&quot;, 
                          paste(directory,&quot;\\1&quot;,&quot;.hdr.sgml&quot;,sep=&quot;&quot;), 
                          path, perl=TRUE)
    remote_filename &lt;- gsub(&quot;\\.txt$&quot;, &quot;.hdr.sgml&quot;, path, perl=TRUE)                        
    
    # Only download the file if we don't already have a local copy
    if (!file.exists(local_filename)) {
        
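        # Despite its name, ftp holds an HTTP URL. The folder name is the
        # accession number with hyphens (and the .txt extension) removed.
        ftp &lt;- paste(&quot;http://www.sec.gov/Archives&quot;,
                     dirname(path),
                     gsub(&quot;-|(\\.txt$)&quot;,&quot;&quot;,basename(path),perl=TRUE),
                     basename(remote_filename), sep=&quot;/&quot;)
        dir.create(dirname(local_filename), showWarnings=FALSE, recursive=TRUE)
        try(download.file(url=ftp, destfile=local_filename) )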
    }                      
    
    # Return the local filename if the file exists
    if (file.exists(local_filename)) { 
        return(local_filename) 
    } else { return(NA) } 
}
</pre></p>
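<p>To make the path transformation concrete, here is a small sketch that applies the same steps to a made-up path. The CIK, accession number and local directory are hypothetical, and nothing is downloaded.</p>
<p><pre class="brush: r;">
# A hypothetical entry from the file_name column of an index file
path &lt;- &quot;edgar/data/1234/0001234567-11-000001.txt&quot;
directory &lt;- &quot;/tmp/the_filings&quot;   # hypothetical local directory

# Local file: swap the edgar/data prefix for the local directory
local_filename &lt;- gsub(&quot;^edgar\\/data(.*)\\.txt$&quot;,
                       paste(directory, &quot;\\1&quot;, &quot;.hdr.sgml&quot;, sep=&quot;&quot;),
                       path, perl=TRUE)

# Remote file: the header sits in a folder named after the accession
# number with the hyphens removed
remote_filename &lt;- gsub(&quot;\\.txt$&quot;, &quot;.hdr.sgml&quot;, path, perl=TRUE)
url &lt;- paste(&quot;http://www.sec.gov/Archives&quot;,
             dirname(path),
             gsub(&quot;-|(\\.txt$)&quot;, &quot;&quot;, basename(path), perl=TRUE),
             basename(remote_filename), sep=&quot;/&quot;)

local_filename
# [1] &quot;/tmp/the_filings/1234/0001234567-11-000001.hdr.sgml&quot;
url
# [1] &quot;http://www.sec.gov/Archives/edgar/data/1234/000123456711000001/0001234567-11-000001.hdr.sgml&quot;
</pre></p>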
<p>Now, to test this code, I pull together announcements from First Call&#8217;s CIG (company-issued guidance) database, and then a list of 8-K filings made within the five-day window beginning with each announcement. (This takes around 50 seconds to pull together.)<br />
<pre class="brush: r;">
# Connect to my database
library(RPostgreSQL)
drv &lt;- dbDriver(&quot;PostgreSQL&quot;)
pg &lt;- dbConnect(drv, db=&quot;crsp&quot;)

# Get FirstCall identifier, announcement date, and CIK for all observations on 
# the FirstCall CIG file. I can get CIK matches for almost all observations and
# there are very few (about 6) Security_IDs with more than one CIK match; 
# I guess that the correct CIK match will be arrived at by the limited time
# window used below.
dbGetQuery(pg, &quot;DROP TABLE IF EXISTS cig_ciks&quot;)
dbGetQuery(pg,&quot;CREATE TABLE cig_ciks AS SELECT * FROM
    (SELECT DISTINCT \&quot;Security_ID\&quot; AS security_id, anndate, cusip AS cusip8
        FROM fc.cig) AS a
    LEFT JOIN (SELECT DISTINCT substr(cusip, 1,8) AS cusip8, 
           (cik::integer)::text AS cik
        FROM (SELECT DISTINCT gvkey, cusip FROM comp.secm) AS b
        INNER JOIN (SELECT DISTINCT gvkey, cik FROM comp.company 
                   WHERE cik IS NOT NULL) AS c
        USING (gvkey)) AS d
    USING (cusip8)&quot;)

# Pull together a list of 8-K filings within the five-day window beginning with
# the announcement on CIG; I should be more careful here with weekends, etc.
file.list &lt;- dbGetQuery(pg, &quot;
    SELECT * 
    FROM cig_ciks AS a
    INNER JOIN filings.filings AS b
    USING (cik)
    WHERE b.date_filed BETWEEN a.anndate AND a.anndate + interval '4 days'
                 AND form_type='8-K'&quot;)
</pre></p>
<p>Now, download the SGML header file for each filing. This step takes some time: over 5 hours for the 58,000+ filings identified in the previous step. However, this is a one-off cost.<br />
<pre class="brush: r;">
# Get the files
file.list$sgml_file &lt;- unlist(lapply(file.list$file_name, get_sgml_file))
</pre></p>
<p>Now, extract the &#8220;items&#8221; associated with each filing. See <a href="http://www.sec.gov/answers/form8k.htm">here</a> for a description of each item. Finally, scan each filing for the presence of certain kinds of items. (This probably isn&#8217;t the most efficient code, but it gets through the 58,000+ filings in 11 seconds.)<br />
<pre class="brush: r;">

# Extract the list of items for each filing
item_scan &lt;- function(sgml_file) {
    # get_sgml_file returns NA when a download fails; no items in that case
    if (is.na(sgml_file)) return(character(0))
    con &lt;- file(sgml_file)
    items &lt;- grep(&quot;^&lt;ITEMS&gt;&quot;, readLines(con=con), value=TRUE, perl=TRUE)
    close(con)
    items &lt;- gsub(&quot;^&lt;ITEMS&gt;&quot;,&quot;&quot;, items, perl=TRUE)
    return(items)
}

file.list$items &lt;- lapply(file.list$sgml_file, item_scan)

# A function to create an indicator for the presence of a given item
has_item &lt;- function(item_list, item) {
    is.element(item, unlist(item_list))
}

# Create indicators for the presence of Items 2.02, 7.01 and 9.01 on each 8-K filing
file.list$has_2.02 &lt;- unlist(lapply(file.list$items, has_item, &quot;2.02&quot;))
file.list$has_7.01 &lt;- unlist(lapply(file.list$items, has_item, &quot;7.01&quot;))
file.list$has_9.01 &lt;- unlist(lapply(file.list$items, has_item, &quot;9.01&quot;))
</pre></p>
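<p>To illustrate what the item scan picks up, here is a sketch that runs the same pattern matching on a tiny made-up header. The header lines are hypothetical (real *.hdr.sgml files contain many more tags), and the steps are written out inline rather than calling the functions above.</p>
<p><pre class="brush: r;">
# A hypothetical header fragment written to a temporary file
header &lt;- c(&quot;&lt;TYPE&gt;8-K&quot;, &quot;&lt;PERIOD&gt;20110826&quot;,
            &quot;&lt;ITEMS&gt;2.02&quot;, &quot;&lt;ITEMS&gt;9.01&quot;)
tf &lt;- tempfile(fileext=&quot;.hdr.sgml&quot;)
writeLines(header, tf)

# Keep only the ITEMS lines, then strip the tag to leave the item numbers
lines &lt;- readLines(tf)
items &lt;- grep(&quot;^&lt;ITEMS&gt;&quot;, lines, value=TRUE, perl=TRUE)
items &lt;- gsub(&quot;^&lt;ITEMS&gt;&quot;, &quot;&quot;, items, perl=TRUE)
unlink(tf)

items
# [1] &quot;2.02&quot; &quot;9.01&quot;
is.element(&quot;2.02&quot;, items)
# [1] TRUE
is.element(&quot;7.01&quot;, items)
# [1] FALSE
</pre></p>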
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2011/08/29/getting-sec-filing-header-files/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
		<item>
		<title>Getting SEC filing index files</title>
		<link>https://iangow.wordpress.com/2011/08/26/getting-sec-filing-index-files/</link>
		<comments>https://iangow.wordpress.com/2011/08/26/getting-sec-filing-index-files/#comments</comments>
		<pubDate>Fri, 26 Aug 2011 19:11:02 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[RpostgreSQL]]></category>
		<category><![CDATA[SEC]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=268</guid>
		<description><![CDATA[Here is some R code to download SEC index files and put them into a database. This is an alternative to Perl code provided by Andrew Leone here. First, a function to download the zipped index file from the SEC &#8230; <a href="https://iangow.wordpress.com/2011/08/26/getting-sec-filing-index-files/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Here is some R code to download SEC index files and put them into a database. This is an alternative to Perl code provided by Andrew Leone <a href="http://sbaleone.bus.miami.edu/PERLCOURSE/Perl_Resources.html">here</a>.</p>
<p>First, a function to download the zipped index file from the SEC website, then parse it into an R dataframe:<br />
<pre class="brush: r;">
getSECIndexFile &lt;- function(year, quarter) {
    
    # Download the zipped index file from the SEC website
    tf &lt;- tempfile()
    result &lt;- try(download.file(
        url=paste(&quot;ftp://anonymous:your_id@ftp.sec.gov/edgar/full-index/&quot;,
                  year,&quot;/QTR&quot;, quarter, &quot;/company.zip&quot;,sep=&quot;&quot;),
        destfile=tf))
    
    # If we didn't encounter an error downloading the file, parse it
    # and return it as an R data frame
    if (!inherits(result, &quot;try-error&quot;)) {
        
        # Small function to remove leading and trailing spaces
        trim &lt;- function (string) {
            gsub(&quot;^\\s*(.*?)\\s*$&quot;,&quot;\\1&quot;, string, perl=TRUE)
        }

        # Read the downloaded file
        raw.data &lt;- readLines(con=(zz&lt;- unz(description=tf,
                                            filename=&quot;company.idx&quot;)))
        close(zz)
        raw.data &lt;- raw.data[11:length(raw.data)] # Remove the first 10 rows.

        # Parse the downloaded file and return the extracted data as a data frame
        company_name &lt;- trim(substr(raw.data,1,62))
        form_type &lt;- trim(substr(raw.data,63,74))
        cik &lt;- trim(substr(raw.data,75,86))
        date_filed &lt;- as.Date(substr(raw.data,87,98))
        file_name &lt;- trim(substr(raw.data,99,150))
        rm(raw.data)
        return(data.frame(company_name, form_type, cik, date_filed, file_name))
    } else { return(NULL)} 
}
</pre></p>
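<p>To see how the fixed-width parsing works, here is a minimal sketch that applies the same column positions to a single made-up line. The company name, CIK and file name are hypothetical, and the padding only imitates the layout that the function above assumes; it is not taken from a real index file:<br />
<pre class="brush: r;">
# Build one fixed-width line: four padded fields of 62, 12, 12 and 12
# characters, followed by the file name. All values are made up.
sample.line &lt;- sprintf(&quot;%-62s%-12s%-12s%-12s%s&quot;,
                       &quot;EXAMPLE CORP&quot;, &quot;10-K&quot;, &quot;1234567&quot;, &quot;2011-03-15&quot;,
                       &quot;edgar/data/1234567/0000000000-11-000001.txt&quot;)

# Remove leading and trailing spaces
trim &lt;- function(string) gsub(&quot;^\\s+|\\s+$&quot;, &quot;&quot;, string, perl=TRUE)

# Cut the line at the same positions used in getSECIndexFile
company_name &lt;- trim(substr(sample.line, 1, 62))
form_type &lt;- trim(substr(sample.line, 63, 74))
cik &lt;- trim(substr(sample.line, 75, 86))
date_filed &lt;- as.Date(substr(sample.line, 87, 98))
file_name &lt;- trim(substr(sample.line, 99, 150))

print(data.frame(company_name, form_type, cik, date_filed, file_name))
</pre></p>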
<p>Second, a function to add the file to my database:<br />
<pre class="brush: r;">
addIndexFileToDatabase &lt;- function(data) {
    if (is.null(data)) return(NULL)
    library(RPostgreSQL)
    drv &lt;- dbDriver(&quot;PostgreSQL&quot;)
    pg &lt;- dbConnect(drv, db=&quot;crsp&quot;)
    
    rs &lt;- dbWriteTable(pg, c(&quot;filings&quot;, &quot;filings&quot;), data, append=TRUE)
    dbDisconnect(pg)
    return(rs)
}   
</pre><br />
<strong>Following sentence is out of date (see below).</strong> Note that some of the SEC index files contain embedded backslashes, which cause problems for the version of RPostgreSQL on CRAN. I instead downloaded the version available on the <a href="http://code.google.com/p/rpostgresql/source/checkout">Google project site</a>, then installed it by issuing <code>install.packages("/home/iangow/rpostgresql-read-only/RPostgreSQL/", type="source", repos=NULL)</code> within R. </p>
<p>Finally, a few lines of code to delete the filings table if it already exists, then download all the index files from 1993 to 2011 and load them into the database.<br />
<pre class="brush: r;">
library(RPostgreSQL)
drv &lt;- dbDriver(&quot;PostgreSQL&quot;)
pg &lt;- dbConnect(drv, db=&quot;crsp&quot;)
dbGetQuery(pg, &quot;DROP TABLE IF EXISTS filings.filings&quot;)

for (year in 1993:2011) {
    for (quarter in 1:4) {
        addIndexFileToDatabase(getSECIndexFile(year, quarter))
    }
}
</pre></p>
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2011/08/26/getting-sec-filing-index-files/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
		<item>
		<title>10-K filings by day of the month</title>
		<link>https://iangow.wordpress.com/2011/08/25/10-k-filings-by-day-of-the-month/</link>
		<comments>https://iangow.wordpress.com/2011/08/25/10-k-filings-by-day-of-the-month/#comments</comments>
		<pubDate>Thu, 25 Aug 2011 15:48:06 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[10-K]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[SEC]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=258</guid>
		<description><![CDATA[Here&#8217;s a cute picture. I was looking to sample 10-K filings and wondered whether I could assume they were well distributed over days of the month. This picture required surprisingly little code to produce:]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a cute picture. I was looking to sample 10-K filings and wondered whether I could assume they were well distributed over days of the month.</p>
<p><a href="http://iangow.files.wordpress.com/2011/08/day_filed_10ks1.png"><img class="alignnone size-medium wp-image-262" title="day_filed_10ks" src="http://iangow.files.wordpress.com/2011/08/day_filed_10ks1.png?w=600&#038;h=600" alt="" width="600" height="600" /></a></p>
<p>This picture required surprisingly little code to produce:<br />
<pre class="brush: r;">
# Some parameters
min.year &lt;- 1999
max.year &lt;- 2010

# Pull data on number of filings by day of the month
library(RPostgreSQL)
drv &lt;- dbDriver(&quot;PostgreSQL&quot;)
pg &lt;- dbConnect(drv, db=&quot;crsp&quot;)

filing.data &lt;- dbGetQuery(pg,paste(&quot;
    SELECT extract(year FROM date_filed) AS year_filed, 
        extract(day FROM date_filed) AS day_filed, count(*) AS num_10ks_filed 
    FROM filings.filings 
    WHERE form_type='10-K'
        AND extract(year FROM date_filed) BETWEEN &quot;,min.year, &quot; AND &quot;,max.year,
    &quot;GROUP BY year_filed, day_filed ORDER BY year_filed, day_filed&quot;))
filing.data$year_filed &lt;- as.factor(filing.data$year_filed)

# Make a bar chart
library(lattice)
png(file=&quot;~/Dropbox/day_filed_10ks.png&quot;)

chart.title &lt;- paste(&quot;Distribution of 10-K filings by day of month (&quot;,
                    min.year,&quot;-&quot;,max.year,&quot;)&quot;, sep=&quot;&quot;)
barchart(day_filed ~ num_10ks_filed | year_filed, data=filing.data, 
         xlab=&quot;Number of 10-K filings&quot;,
         ylab=&quot;Day of month&quot;,
         main=chart.title,
         scales=list(y=list(at=seq(from=1, to=31, by=5))))
dev.off()
</pre></p>
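<p>For readers without the database to hand, the following is a rough sketch of what the query computes, done in plain R on a handful of made-up filing dates. The data frame <code>filings</code> and its helper column <code>n</code> are hypothetical stand-ins for the <code>filings.filings</code> table, not part of the code above:<br />
<pre class="brush: r;">
# Hypothetical filing dates standing in for date_filed in filings.filings
filings &lt;- data.frame(
    date_filed=as.Date(c(&quot;2009-03-02&quot;, &quot;2009-03-16&quot;, &quot;2009-03-16&quot;,
                         &quot;2010-03-01&quot;, &quot;2010-03-16&quot;)),
    n=1)

# Equivalent of extract(year ...) and extract(day ...)
filings$year_filed &lt;- format(filings$date_filed, &quot;%Y&quot;)
filings$day_filed &lt;- as.integer(format(filings$date_filed, &quot;%d&quot;))

# Equivalent of count(*) ... GROUP BY year_filed, day_filed
filing.counts &lt;- aggregate(n ~ year_filed + day_filed, data=filings, FUN=sum)
names(filing.counts)[3] &lt;- &quot;num_10ks_filed&quot;
print(filing.counts)
</pre></p>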
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2011/08/25/10-k-filings-by-day-of-the-month/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>

		<media:content url="http://iangow.files.wordpress.com/2011/08/day_filed_10ks1.png?w=600" medium="image">
			<media:title type="html">day_filed_10ks</media:title>
		</media:content>
	</item>
		<item>
		<title>Follow up to a MySQL to SAS comparison</title>
		<link>https://iangow.wordpress.com/2011/08/07/follow-up-to-a-mysql-to-sas-comparison/</link>
		<comments>https://iangow.wordpress.com/2011/08/07/follow-up-to-a-mysql-to-sas-comparison/#comments</comments>
		<pubDate>Sun, 07 Aug 2011 17:08:27 +0000</pubDate>
		<dc:creator>iangow</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://iangow.wordpress.com/?p=251</guid>
		<description><![CDATA[The document here compares MySQL with SAS for some basic data tasks and gives the impression that MySQL is slower than SAS, but can be close enough in terms of performance to be worthy of consideration. I did a fairly &#8230; <a href="https://iangow.wordpress.com/2011/08/07/follow-up-to-a-mysql-to-sas-comparison/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The document <a href="http://www.mathworks.com/matlabcentral/fx_files/12027/1/Matlab%20and%20MySQL.pdf">here</a> compares MySQL with SAS for some basic data tasks and gives the impression that MySQL is slower than SAS, but can be close enough in terms of performance to be worthy of consideration.</p>
<p>I did a fairly unscientific comparison with PostgreSQL. I don&#8217;t have the same hardware (the document above is from 2007), but I used my slow-as-a-wet-week-for-database-stuff two-year-old MacBook Pro (5400 RPM drive with whole-disk encryption).</p>
<p><b>Conclusion:</b> PostgreSQL leaves SAS in its dust (the results posted in the source above are all in minutes). There was one query where PostgreSQL didn&#8217;t perform well, but one would actually use a &#8220;window function&#8221; for that particular query (not available in MySQL or SAS, as far as I know), and then PostgreSQL is much, much faster.</p>
<p><pre class="brush: sql;">
CREATE TABLE test AS SELECT * FROM crsp.msf LIMIT 1000000; -- 10.6 seconds 

CREATE TABLE sub1 AS SELECT * FROM test WHERE vol=0; -- 5.5 seconds
CREATE TABLE sub2 AS SELECT * FROM test WHERE vol&gt;0; -- 9.0 seconds

INSERT INTO sub1 SELECT * FROM sub2; -- 23.2 seconds

CREATE TABLE crop AS SELECT permno, date FROM test; -- 2.7 seconds

CREATE TABLE sort AS SELECT * FROM test ORDER BY permno, date; -- 17.6 seconds

CREATE TABLE stat AS SELECT date, avg(ret) FROM test GROUP BY date; -- 1.3 seconds

CREATE TABLE joined AS 
	SELECT a.permno, a.date, a.prc, b.prc AS lprc
	FROM test AS a 
	LEFT JOIN test AS b 
	ON a.permno=b.permno AND
		b.date BETWEEN a.date - interval '31 days' AND a.date - interval '1 day'; 
-- 259.2 seconds (higher than 182 seconds listed for SAS!)

-- Let's create an index to see what that does
CREATE INDEX test_idx ON test (permno, date); -- 3.5 seconds

CREATE TABLE joined2 AS 
	SELECT a.permno, a.date, a.prc, b.prc AS lprc
	FROM test AS a 
	LEFT JOIN test AS b 
	ON a.permno=b.permno AND
		b.date BETWEEN a.date - interval '31 days' AND a.date - interval '1 day'; 
-- 258.9 seconds (like SAS, PostgreSQL didn't use this index).

-- But you would never do that using PostgreSQL. Instead, use a window function:
CREATE TEMP TABLE joined3 AS 
	SELECT permno, date, prc, lag(prc) OVER w AS lprc 
	FROM test 
	WINDOW w AS (PARTITION BY permno ORDER BY date); -- 4.0 seconds!!!	
</pre></p>
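<p>To make clear what the window function returns, here is a small sketch of the same &#8220;previous price within each permno&#8221; calculation in R, on made-up data (the permno values and prices are hypothetical). Note that <code>lag()</code> picks up the previous row for the same permno whatever the gap in dates, so it matches the self-join only if each permno has one observation per month with no missing months, which this sketch assumes:<br />
<pre class="brush: r;">
# Made-up monthly prices for two securities
test &lt;- data.frame(
    permno=c(10001, 10001, 10001, 10002, 10002),
    date=as.Date(c(&quot;2010-01-29&quot;, &quot;2010-02-26&quot;, &quot;2010-03-31&quot;,
                   &quot;2010-01-29&quot;, &quot;2010-02-26&quot;)),
    prc=c(10.0, 10.5, 11.0, 20.0, 19.5))

# ORDER BY date within each permno
test &lt;- test[order(test$permno, test$date), ]

# lag(prc) OVER (PARTITION BY permno ORDER BY date):
# shift prices down one row within each permno; the first row gets NA
test$lprc &lt;- ave(test$prc, test$permno,
                 FUN=function(x) c(NA, head(x, -1)))
print(test)
</pre></p>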
]]></content:encoded>
			<wfw:commentRss>https://iangow.wordpress.com/2011/08/07/follow-up-to-a-mysql-to-sas-comparison/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="https://secure.gravatar.com/avatar/691c38b9bb82d8e6a53f831815bd3bb6?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">iangow</media:title>
		</media:content>
	</item>
	</channel>
</rss>
