Code to get data from “Extended Example” from Chapter 5 of “The Art of R Programming”

I agree with the review here that “The Art of R Programming” is a nice book, but the lack of data for some of the examples is a downside (I find it nice to work along with real examples).

One example that is hard to get the full value of without the underlying data is the extended example using data on the pronunciation of Chinese characters in Cantonese and Mandarin at the end of Chapter 5. Here is some code to pull together data that can be used for that example.

# Get the raw data from Matloff's website
chinese.raw <- readLines(paste("http://www.cs.ucdavis.edu/~matloff",
                               "matloff/public_html/145/Handouts/R2", 
                               "CanManB5.utf8", sep="/"))
save(file="chinese.raw.Rdata", chinese.raw )

# Create a tab-delimited version of the raw data
chinese <- sub(" ", "\t", chinese.raw)
chinese <- sub(" ", "\t", chinese)
chinese <- sub(" ", "\t", chinese)

# Function to split the data into fields and turn into a data frame
process.row <- function(string) {    
    temp <- unlist(strsplit(string, "\t"))
    return(data.frame(char=temp[1], Can=temp[2], Man=temp[3], Eng=temp[4], 
                      stringsAsFactors=FALSE))
}

# Create a data frame from the raw data
chinese.list <- lapply(chinese, process.row)
chinese.data <- do.call("rbind", chinese.list)
names(chinese.data)[1] <- "Ch char"

# Fix some cases with multiple Chinese pronunciations (I implicitly
# assume these are Cantonese pronunciations)
to.fix <- grep("^[[:alpha:]]+\\d", chinese.data$Eng, perl=TRUE)
chinese.data$Man[to.fix] <-
  gsub("^([[:alpha:]]+\\d) (.*)", "\\1", chinese.data$Eng[to.fix], perl=TRUE)
chinese.data$Eng[to.fix] <-
  gsub("^([[:alpha:]]+\\d) (.*)", "\\2", chinese.data$Eng[to.fix], perl=TRUE)

# Make two datasets: one for Cantonese, the other for Mandarin
can8 <- subset(chinese.data, select=c("Ch char", "Can"))
man8 <- subset(chinese.data, select=c("Ch char", "Man"))

# Ditch unneeded variables and save data to feed into the Chapter 5 example
rm(chinese.list, chinese, chinese.raw, chinese.data)
save(can8, man8, file="chinese.Rdata")

(This is not elegant, but it seems to work.)

Here is some code (closely based on that in the book) that uses this data.

load("chinese.Rdata")

# merges data for 2 fangyans
merge2fy <- function(fy1, fy2) {
  outdf <- merge(fy1, fy2)

  # Separate tone from sound, and create new columns
  for (fy in list(fy1, fy2)) {
    # saplout will be a matrix, init cons in row 1, remainders in row
    # 2 and tones in row 3
    saplout <- sapply((fy[[2]]), sepsoundtone)
    
    # convert it to a data frame
    tmpdf <- data.frame(fy[, 1], t(saplout), row.names=NULL,
                        stringsAsFactors=FALSE)
    
    # Add names to the columns
    consname <- paste(names(fy)[[2]], " cons", sep="")
    restname <- paste(names(fy)[[2]], " sound", sep="")
    tonename <- paste(names(fy)[[2]], " tone", sep="")
    names(tmpdf) <- c("Ch char", consname, restname, tonename)
    
    # Need to use merge, not cbind(), dues to possibly different
    # ordering of fy, outdf
    outdf <- merge(outdf, tmpdf)
  }
  return(outdf)
}

# Separates romanized pronunciation pronun into initial consonant, if any
# the remaninder of the sound, and the tone, if any
sepsoundtone <- function(pronun) {
  nchr <- nchar(pronun)
  vowels <- c("a", "e", "i", "o", "u")
  
  # How many initial consononants?
  numcons <- 0
  for (i in 1:nchr) {
    ltr <- substr(pronun, i, i)
    if (!ltr %in% vowels) numcons <- numcons + 1 else break
  }
  cons <- if (numcons > 0) substr(pronun, 1, numcons) else NA
  tone <- substr(pronun, nchr, nchr)
  numtones <- 1 - as.integer(tone %in% letters) # TRUE is 1, FALSE is 0
  if (numtones == 0) tone <- NA
  therest <- substr(pronun, numcons+1, nchr - numtones)
  return(c(cons, therest, tone))
}

system.time(canman8 <- merge2fy(can8, man8))

Here is an alternative way to tackle the same problem (seems to produce the same results, though I didn’t check this carefully) using regular expressions. This is an illustration that the Perl motto that “there’s more than one way to do it” applies to R too.


load("chinese.Rdata")

# merges data for 2 fangyans
merge2fy <- function(fy1, fy2) {
  outdf <- merge(fy1, fy2)
  
  # Separate tone from sound, and create new columns
  for (fy in list(fy1, fy2)) {
    
    # Matching on pronunciation requires this step to prevent
    # duplicate matches.
    # (Perhaps save some time by only processing unique pronunciations.)
    pronun <- unique(fy[[2]])
    
    # Regular expression separates romanized pronunciation pronun into initial 
    # consonant, if any; the remainder of the sound, if any; and the tone, if any
    
    # tmpdf will be a matrix, init cons in column 2, remainders in column
    # 3 and tones in column 4
    # Three components to the match:
    #   - String at the beginning that does not contain vowels or digits
    #   - String beginning with a vowel, followed by letters
    #   - String consisting of a single digit
    matches <- regexec("^([^aeiou0-9]*)([aeiou]\\D*)?(\\d)?$", pronun)
    tmpdf <- do.call("rbind", regmatches(pronun, matches))
    
    # convert it to a data frame
    tmpdf <- as.data.frame(tmpdf, stringsAsFactors=FALSE)
    
    # Add names to the columns
    names(tmpdf) <- 
      paste(names(fy)[[2]], c("", " cons", " sound", " tone"), sep="")
    
    # Need to use merge, not cbind(), dues to possibly different
    # ordering of fy, outdf
    outdf <- merge(outdf, tmpdf, all.x=TRUE)
  }
  return(outdf)
}

system.time(canman8 <- merge2fy(can8, man8))
Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

7 Responses to Code to get data from “Extended Example” from Chapter 5 of “The Art of R Programming”

  1. brucedai says:

    Thanks, iangow, I was looking for the data for this example. Could you tell me what R text editor are you using? Available for Mac, too? Appreciated!

  2. Kerapi says:

    Hi, is there any other way of getting the files? I’m dying to try this out, but I can’t find the files anywhere, and the links you provided don’t seem to be active anymore, or they’ve been moved.
    If you have them downloaded on your hdd could you pass them on to me?

  3. EG says:

    Actually I found a link with the downloadable data here: http://www.nostarch.com/artofr.htm

  4. palani says:

    unable to get the dataset from the url, Please update.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s