Friday, December 21, 2012

A simple web application using Rook

by Ben Ogorek

Rook

I'm grateful to Rook for helping me, a simple statistician, learn a few fundamentals of web technology. For R web application development, there are increasingly polished methods available (most notably Shiny [1]), but you can build one using Rook, and you might just learn something if you do.

Jeffrey Horner's Rook [2] is both a web server interface and an R package. The idea behind the former is to separate application development from server implementation. Thus, when a web server supports a web server interface, an application written to its specifications is guaranteed to run on that server.

This concept is not unique to R; there is WSGI for Python, Rack for Ruby (Horner's own inspiration [3]), and PSGI for Perl. In a world of modern web development frameworks, they do not appear to be losing steam. PSGI and its associated Perl module Plack are even said to be the “superglue interface between perl web application frameworks and web servers, just like Perl is the duct tape of the internet” [4]. Unlike the PSGI / Plack distinction, the name Rook is used for both the specification and the R package.

While there are nice examples of Rook on the web, I was unable to find a tutorial that guided me from the ground up. This is meant to be that tutorial. In the sections that follow, a Rook web application is built from scratch and explained at each stage. From a web browser, it takes user inputs, performs (simple) R calculations, and displays graphics.

Rook Application Basics

A Rook application is literally an R function that

  • takes an R environment as input,
  • returns a list of HTTP-relevant items as output.

Input. As for R environments, consider .GlobalEnv.

z <- "hello"
x.sq <- function(x) x^2
ls.str(.GlobalEnv)
## x.sq : function (x)  
## z :  chr "hello"

Given their official purpose, it's easy to forget a simple fact: R environments hold key-value pairs. This makes them perfect for storing HTTP headers and CGI environment variables.
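
As a quick aside (a toy example of my own, not part of any Rook application), you can treat an environment as a small key-value store:

env <- new.env()
assign("REQUEST_METHOD", "POST", envir = env)  # store a key-value pair
assign("HTTP_HOST", "127.0.0.1", envir = env)
get("REQUEST_METHOD", envir = env)             # retrieve a value by key
ls(env)                                        # list the stored keys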

Output. The following is an example Rook application's output.

$status
[1] 200

$headers
$headers$`Content-Length`
[1] 500

$headers$`Content-Type`
[1] "text/html"

$body
[1] "<i>***Times through the Rook.app function: 6 ***</i><form ..."
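
To make the structure concrete, here is a minimal sketch (a toy of my own, not the application built below) of a Rook application that assembles such a list by hand:

hello.app <- function(env) {
    body <- "<b>Hello from a bare-bones Rook application</b>"
    list(status = 200L,
         headers = list("Content-Type" = "text/html",
                        "Content-Length" = nchar(body)),
         body = body)
}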

The Example

Note: this example will not work in RStudio's IDE, and perhaps other IDEs where the default R web server has been modified.

In this section we will build a simple web application using Rook. It won't be much to look at, but working with user inputs and displaying graphics can take you quite a ways.

[Figure: the initial web page of the application]

First, install Rook from CRAN if you haven't already.

setInternet2(TRUE)  # if you're behind a firewall
install.packages("Rook")

Load it into memory.

library(Rook)

The server

We'll start by creating an R server object from the Rook class Rhttpd.

R.server <- Rhttpd$new()
cat("Type:", typeof(R.server), "Class:", class(R.server))
## Type: S4 Class: Rhttpd

The “httpd” in Rhttpd stands for “Hypertext Transfer Protocol Daemon,” also known as a web server.

First application: file hosting

Our first Rook application will host static .png files that are stored locally. This is necessary because, once we're up and running, all paths must be relative to the server's root directory. Luckily, the Rook function File$new() creates the application we need. To use the function, pass a directory as input and “add” the application to the server.

R.server$add(app = File$new(getwd()), name = "pic_dir")
print(R.server)
## Server stopped
## [1] pic_dir http://127.0.0.1:0/custom/pic_dir
## 
## Call browse() with an index number or name to run an application.

Printing our server object shows us that (1) our server is not currently running, and (2) a single application is loaded. We're about to add another application, but there's no reason why we can't start the server now.

R.server$start()
## 
## Server started on host 127.0.0.1 and port 23103 . App urls are:
## 
##  http://127.0.0.1:23103/custom/pic_dir

Keeping track of state

Below we define a global variable, iter, which will count how many times our next app is run.

iter <- 0

The feature application

This application, whose skeleton appears below, performs a simple calculation or plots the Iris data set, depending on which button is pushed. Take a moment to study the function names and their inputs.

Rook.app <- function(env) {
    iter <<- iter + 1

    request <- Request$new(env)
    response <- Response$new()

    write.initial.HTML(response, iter)
    write.icebreaker.HTML(request, response)
    write.iris.HTML(request, response, R.server)
    write.final.HTML(response)

    response$finish()
}

Request and Response

The Rook helper classes Request and Response are for communication between the client and server.

Inside Rook.app, only Request takes an environment as input. In fact, Request's very purpose is to work with Rook environments. And other than new, every method of Request is a getter, returning those important values from the environment. The most useful of these are GET and POST (i.e., the HTTP methods) that return user data as an R list.

The Response class is where the action happens. At the end of a Rook application, the finish method returns a list with HTTP information according to the Rook specification. This includes the HTML previously added with the write method of the Response class. We'll see in the next section how this leads to dynamic HTML content.
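
Before diving into the full example, here is a minimal sketch of how the two classes cooperate (a toy of my own, with a made-up input field named msg):

echo.app <- function(env) {
    request <- Request$new(env)
    response <- Response$new()

    response$write("<form method=\"POST\">")
    response$write("Message: <input type=\"text\" name=\"msg\">")
    response$write("<input type=\"submit\"></form>")

    # Echo the submitted value back, if one was POSTed
    if ("msg" %in% names(request$POST())) {
        response$write(paste("You said:", request$POST()$msg))
    }

    response$finish()
}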

The HTML content

The function write.initial.HTML is the first that uses the write method to write HTML. Note that the second call is dynamic, since the R variable iter is combined with the HTML string.

write.initial.HTML <- function(response, iter) {

    response$write("<h1>A Simple Web Application</h1>")
    response$write(paste("<i>***Times through the Rook.app function:", as.character(iter), 
        "***</i>"))
    response$write("<form method=\"POST\">")
    response$write("<h3>Icebreaker Survey</h3>")
    response$write("First name: <input type=\"text\" name=\"firstname\"><br><br>")
    response$write("Favorite number: <input type=\"text\" name=\"favnumber\"><br>")
    response$write("<h3>What do you want to do?</h3>")
    response$write("<input type=\"submit\" value=\"Generate Icebreaker from Survey\" name=\"submit_button\">")
    response$write("<button value=\"Plot\" name=\"iris_button\">Plot the Iris data set </button> <br>")

}

Also note the form tags, a common way of collecting user data on the web, and that the method “POST” has been specified. We could have just as easily chosen “GET.”

Performing a calculation

The “icebreaker” functionality encapsulated in write.icebreaker.HTML generates a brief conversational introduction, loosely modeled after my own. This function is the first to take a Request as well as a Response object.

[Figure: the personalized icebreaker output]


write.icebreaker.HTML <- function(request, response) {

    if ("firstname" %in% names(request$POST()) & "favnumber" %in% names(request$POST())) {

        response$write("<h3>Your personalized icebreaker</h3>")
        response$write(paste("Hi, my name is ", request$POST()$firstname, ".", 
            sep = ""))

        fav.number = as.numeric(request$POST()$favnumber)

        response$write(paste("<br>My favorite number, plus 1, is ", as.character(fav.number + 
            1), ". <br>", sep = ""))
    }
}

When the user fills in the input fields and clicks the button, a request is made to the server. The user input is contained in the list request$POST(), where it can be accessed in the R session. Displaying the results of any processing can be accomplished with Response$write.
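
One caveat worth noting: as.numeric returns NA when the submitted text cannot be parsed as a number. A small defensive variation (my own addition, not part of the original function) might look like:

fav.number <- as.numeric(request$POST()$favnumber)
if (is.na(fav.number)) {
    response$write("<br>Please enter a numeric favorite number.<br>")
} else {
    response$write(paste("<br>My favorite number, plus 1, is ",
        as.character(fav.number + 1), ". <br>", sep = ""))
}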

Displaying a plotted image

The last major function within our Rook application, write.iris.HTML, is the first to take an Rhttpd object in addition to a Request and Response object. This is because our Rhttpd object contains the file hosting application needed to display the iris data set.

write.iris.HTML <- function(request, response, server) {
    if ("iris_button" %in% names(request$POST())) {
        png(file.path(getwd(), "iris.png"))
        plot(iris)
        dev.off()

        response$write(paste("<img src =", server$full_url("pic_dir"), "/iris.png", 
            ">", sep = ""))
    }
}

The file iris.png is saved to the location specified in File$new (the working directory).

While the image is stored locally, we can use the full_url method of our Rhttpd object to get a file path that makes sense to the server (something like http://127.0.0.1:23103/custom/pic_dir/iris.png). If you try to access the file in a browser, you will get a “Forbidden” message, but we can still use it with response$write and an img tag.

[Figure: the iris plot displayed by the application]

And finally, since we can't have the form closing tag in one of the if statements, there is one remaining piece.

write.final.HTML <- function(response) {
    response$write("</form>")
}

As before, add the application to the server object.

# Add your Rook app to the Rhttpd object
R.server$add(app = Rook.app, name = "My rook app")
print(R.server)
## Server started on 127.0.0.1:23103
## [1] pic_dir     http://127.0.0.1:23103/custom/pic_dir
## [2] My rook app http://127.0.0.1:23103/custom/My rook app
## 
## Call browse() with an index number or name to run an application.

Using the application

To interact with the application, use the server object's browse method with the name given in the last step.

# view your web app in a browser
R.server$browse("My rook app")

As you interact with the application, pay attention to the iter variable, which counts the total number of times through the Rook.app function. And notice what happens to the HTML input fields every time a button is pressed. HTTP is truly a stateless protocol!

[Figure: the counter of times through Rook.app after several requests]

Cleaning up

To stop the server, use the stop method of the Rhttpd object.

# stop the server
R.server$stop()

Sometimes I have to restart R, especially after many changes and browsing attempts.

Conclusion

I've had a great time learning Rook, and I hope that readers of this article have enjoyed the journey as well. As one final point, while we have developed locally, note that Rook applications can also be deployed [5].

Acknowledgements

The markdown [6] and knitr [7] packages, in conjunction with RStudio's IDE [8], were used to create this document. Sincere thanks to Josh Mills for his sound advice and feedback. Keep up with ours and other great articles on R-Bloggers, and follow me on Twitter (@baogorek) for my latest research updates.

References

  1. RStudio Shiny. URL http://www.rstudio.com/shiny/

  2. Jeffrey Horner (2012). Rook: Rook - a web server interface for R. R package version 1.0-8. URL http://CRAN.R-project.org/package=Rook

  3. Introducing Rook, an article from the Jeffrey Horner blog by Jeffrey Horner. Published on April 18, 2011. URL http://jeffreyhorner.tumblr.com/post/4723187316/introducing-rook

  4. PSGI/Plack webpage. URL http://plackperl.org/

  5. Deploy Rook Apps with rApache: Part I, an article from the Jeffrey Horner blog by Jeffrey Horner. Published on July 23, 2012. URL http://jeffreyhorner.tumblr.com/post/27861973339/deploy-rook-apps-with-rapache-part-i

  6. JJ Allaire, Jeffrey Horner, Vicent Marti and Natacha Porte (2012). markdown: Markdown rendering for R. R package version 0.5.3. http://CRAN.R-project.org/package=markdown

  7. Yihui Xie (2012). knitr: A general-purpose package for dynamic report generation in R. R package version 0.6. http://CRAN.R-project.org/package=knitr

  8. RStudio IDE for Windows. URL http://www.rstudio.com/ide/

  9. R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/

Wednesday, October 31, 2012

Hierarchical linear models and lmer

Article by Ben Ogorek
Graphics by Bob Forrest

Background

Hierarchical Linear Models
My last article [1] featured linear models with random slopes. For estimation and prediction, we used the lmer function from the lme4 package [2].

Today we'll consider another level in the hierarchy, one where slopes and intercepts are themselves linked to a linear predictor. We'll simulate data to build intuition, derive the lmer formula using the linear mixed model $$ \mathbf{y} = \mathbf{X \phi} + \mathbf{Z b} + \mathbf{\epsilon}, $$ and recover the system parameters.


Examples

A realistic example

The nlme [3] package contains the data set MathAchieve, consisting of students, their socio-economic statuses (SES) and math achievement (MathAch) scores.
library(nlme)
head(MathAchieve, 3)
## Grouped Data: MathAch ~ SES | School
##   School Minority    Sex    SES MathAch MEANSES
## 1   1224       No Female -1.528   5.876  -0.428
## 2   1224       No Female -0.588  19.708  -0.428
## 3   1224       No   Male -0.528  20.349  -0.428

If you inquired about the relationship between SES and MathAch scores, you'd have to acknowledge that the student isn't the only unit. There is also the school. Notice how the variable MEANSES is constant over rows; it is a property of a school, not of a student. If the relationship between SES and MathAch was fundamentally different for schools according to MEANSES, what would that mean?

You can see Jon Fox's analysis of these data [4]. We won't be analyzing them. We'll be making up our own.

A not-so-realistic example

For the remainder of this article, we will simulate data from a hierarchical linear model and then analyze it. This may sound circular, but it can also be informative. First, we'll know if we are recovering the parameters (we chose them). We can quickly ask "what-if" questions, adjusting the sample sizes and parameter values. And finally, creating the expected data for a technique is a tactile way to understand its assumptions.

Our not-so-realistic data will feature $N$ abstract "units" at the top of the hierarchy, analogous to the schools in our realistic example and indexed by $i=1,\ldots, N$. These units will have an attribute (e.g., MEANSES) called $a_i$. Instead of students, we'll call our second level "measurements" with index $j=1,\ldots,J$. Our response and predictor variables (e.g. MathAch and SES) will be called $y_{ij}$ and $x_{ij}$, respectively. As we will have $J$ measurements for every unit, there will be $M = NJ$ rows of simulated data.

Simulating data

The unit-level

We first generate data at the unit level, drawing each unit's characteristic from the standard normal distribution. That is, $a_i \sim N(0,1).$
rm(list = ls())
set.seed(2345)

N <- 30
unit.df = data.frame(unit = c(1:N), a = rnorm(N))

head(unit.df, 3)
##   unit        a
## 1    1 -1.19142
## 2    2  0.54930
## 3    3 -0.06241

Every unit will have its own intercept, $\alpha_i$, and slope, $\beta_i$. The key is that these are now linearly related to $a_i$.
unit.df <-  within(unit.df, {
    E.alpha.given.a <-  1 - 0.15 * a
    E.beta.given.a <-  3 + 0.3 * a
})
head(unit.df, 3)
##   unit        a E.beta.given.a E.alpha.given.a
## 1    1 -1.19142          2.643          1.1787
## 2    2  0.54930          3.165          0.9176
## 3    3 -0.06241          2.981          1.0094

Adding noise

In a realistic scenario, a variable like $a_i$ probably won't tell you everything about unit $i$, including its exact slope and intercept. Thus for units $i=1,\ldots,N$, we'll need to simulate random disturbances that displace the slopes and intercepts from their conditional expectations given $a_i$.

Below we generate these random disturbances from the bivariate normal distribution, made easy thanks to the mvtnorm package [5,6].
library(mvtnorm)
q = 0.2
r = 0.9
s = 0.5
cov.matrix <-  matrix(c(q^2, r * q * s, r * q * s, s^2), nrow = 2,
    byrow = TRUE)
random.effects <-  rmvnorm(N, mean = c(0, 0), sigma = cov.matrix)

These disturbances are the random effects. Our first three (out of $N=30$) look like this:
##          [,1]    [,2]
## [1,]  0.19373  0.7516
## [2,]  0.07904  0.1776
## [3,] -0.30462 -0.5849

For the random effects covariance matrix, we used
##      [,1] [,2]
## [1,] 0.04 0.09
## [2,] 0.09 0.25

Based on our choices of $q$ and $s$ above, slopes vary more than intercepts. Also we've specified a correlation of $.9$ (i.e., $.09/ \sqrt{.04 *.25}$), so when the intercept's random effect is above its mean (of zero), the same is likely true for the slope.
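
As a quick sanity check (an aside of my own, not in the original analysis), the implied correlation can be recovered directly from the covariance matrix:
cov2cor(cov.matrix)  # off-diagonal entries equal 0.9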

Finally, we add the random effects to arrive at the unit-level intercepts ($\alpha_i$s) and slopes ($\beta_i$s).
unit.df$alpha = unit.df$E.alpha.given.a + random.effects[, 1]
unit.df$beta = unit.df$E.beta.given.a + random.effects[, 2]
head(unit.df, 3)
##   unit        a E.beta.given.a E.alpha.given.a  alpha  beta
## 1    1 -1.19142          2.643          1.1787 1.3724 3.394
## 2    2  0.54930          3.165          0.9176 0.9966 3.342
## 3    3 -0.06241          2.981          1.0094 0.7047 2.396

The relationship of $\alpha_i$ and $\beta_i$ to $a_i$ is depicted below.
[Figure: linear relationships of the intercepts and slopes to $a_i$]

The within-unit level

We now move to the within-unit level, generating a fixed $x$-grid to be shared among units.
J = 30
M = J * N  #Total number of observations
x.grid = seq(-4, 4, by = 8/J)[0:30]

within.unit.df <-  data.frame(unit = sort(rep(c(1:N), J)), j = rep(c(1:J),
    N), x =rep(x.grid, N))

Next, we generate data from the individual linear models based on $\alpha_i$, $\beta_i$, and within-unit level noise $\epsilon_{ij} \sim N(0, .75^2)$.
flat.df = merge(unit.df, within.unit.df)

flat.df <-  within(flat.df, y <-  alpha + x * beta + 0.75 * rnorm(n = M))

Below is a depiction of the $N=30$ linear systems.
[Figure: the $N=30$ simulated unit-level linear systems]

Using lmer

Before moving back to lmer, we boil down our data set (which currently includes fields we wouldn't know in practice) to the essentials.
simple.df <-  flat.df[, c("unit", "a", "x", "y")]
head(simple.df, 3)
##   unit      a      x      y
## 1    1 -1.191 -4.000 -10.95
## 2    1 -1.191 -3.733 -11.68
## 3    1 -1.191 -3.467 -10.43

If we ignore $a_i$, we could still get random lines with the following syntax.
library(lme4)
my.lmer <-  lmer(y ~ x + (1 + x | unit), data = simple.df)
cat("AIC =", AIC(my.lmer))
## AIC = 2226

For the purpose of comparison, we'll keep track of the Akaike information criterion (AIC), a general-purpose criterion for model comparison. Smaller is better, and there is good reason to believe we will find a model with a smaller AIC. For while $\alpha_i$ and $\beta_i$ vary, they vary not about fixed means, but rather about conditional means given $a_i$.

Separating fixed from random

To skip the math that's coming, see the next code block. But knowing what to put in lmer means understanding how to frame your problem in terms of the linear mixed model $$ \mathbf{y} = \mathbf{X \phi} + \mathbf{Z b} + \mathbf{\epsilon}. $$ In it, $\mathbf{X}$ and $\mathbf{Z}$ are $M$-row design matrices, and $\mathbf{y}$ and $\mathbf{\epsilon}$ are $M$-length vectors. In terms of explainable effects on $\mathbf{y}$, $\mathbf{X \phi}$ is the contribution from the fixed effects, and $\mathbf{Z b}$ is the contribution from the random effects.

As the relationship of $(\alpha_i, \beta_i)$ to $a_i$ depends on four quantities, so will $\mathbf{\phi}$ be a $4\times 1$ vector. As there are two random effects for every unit, $\mathbf{b}$ is a $2N \times 1$ vector.

The fixed effects vector is called $\mathbf{\phi}$ instead of the usual $\mathbf{\beta}$ because the latter is reserved for $\mathbf{\beta}_i = (\alpha_i, \beta_i)^T$ in the unit-level processes $$ \mathbf{y}_i = \mathbf{X}_i \mathbf{\beta_i} + \mathbf{\epsilon}_i. $$ It turns out that $\mathbf{\beta_i}$ is itself a function of $\phi$ and $\mathbf{b}_i$, written as $$ \mathbf{\beta_i} = \mathbf{A}_i \phi + \mathbf{b}_i, $$ where $$\mathbf{A}_i = \begin{bmatrix} 1 & a_i & 0 & 0 \\ 0 & 0 & 1 & a_i \end{bmatrix}, $$ $\mathbf{\phi} = (\phi_0, \phi_1, \phi_2, \phi_3)^T$, and $\mathbf{b}_i$ is the $2 \times 1$ random effect for the $i^{th}$ unit. Review your matrix multiplication if necessary, and convince yourself that this is indeed what we've been working with.

Distributing $\mathbf{X}_i$ over the terms in the $\mathbf{\beta_i}$ equation leads to $$ \mathbf{y_i} = \mathbf{X}_i \mathbf{A}_i \phi + \mathbf{X}_i \mathbf{b}_i + \mathbf{\epsilon}_i. $$ We complete the linear mixed effects model picture by setting $$ \mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \mathbf{A}_1 \\ \mathbf{X}_2 \mathbf{A}_2 \\ ... \\\mathbf{X}_N \mathbf{A}_N \end{bmatrix} $$ and $$ \mathbf{Z} = \begin{bmatrix} \mathbf{X}_1 & 0 & \ldots & 0\\ 0 & \mathbf{X}_2 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 &\ldots & \mathbf{X}_N \end{bmatrix}. $$

The lmer formula

Textbook problems often go from regression model to design matrix. We go the other way, from design matrix to linear model.

Fixed Effects

The design matrix for the fixed effects, $\mathbf{X}$, is seen above as a vertical stack of $\mathbf{X}_i \mathbf{A}_i$ matrices. If we understand $\mathbf{X}_i \mathbf{A}_i$, we understand $\mathbf{X}$. It is $$ \mathbf{X}_i \mathbf{A}_i = \begin{bmatrix} \mathbf{1}_J & \mathbf{x}_i\end{bmatrix} \begin{bmatrix} 1 & a_i & 0 & 0 \\ 0 & 0 & 1 & a_i \end{bmatrix}, $$ where $\mathbf{1}_J$ is a $J$-vector of ones and $\mathbf{x}_i$ is the within-unit covariate vector $(x_{i1}, x_{i2}, \ldots, x_{iJ})^T$. Multiplying these matrices, we arrive at $$ \mathbf{X}_i \mathbf{A}_i = \begin{bmatrix} 1 & a_i & x_{i1} & x_{i1} a_i\\ 1 & a_i & x_{i2} & x_{i2} a_i \\ \ldots & \dots & \ldots & \ldots \\ 1 & a_i & x_{iJ} & x_{iJ} a_i \end{bmatrix}. $$ These must be stacked vertically, but the structure is clear. This is the design matrix for the linear model $\text{E}(y|x, a) = \phi_0 + \phi_1 a + \phi_2 x + \phi_3 x a$. The fixed effects portion is just a simple linear model with interaction!
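
To see that this really is the familiar interaction design (an aside of mine, not in the original derivation), R's model.matrix builds exactly these columns from the interaction formula, using the simple.df data frame from above:
# Columns: (Intercept), a, x, a:x -- i.e., [1, a_i, x_ij, x_ij * a_i]
head(model.matrix(~ a + x + a:x, data = simple.df), 3)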

The random effects

After accounting for the fixed effects, the random effects are specified as if the coefficients were completely random. This part of the syntax is (1 + x | unit), specifying random effects for both the intercept and slope after adjusting for their conditional expectations given $a_i$.

The lmer code

The lmer formula is a concatenation of the linear model with interaction syntax and the random effects syntax.
my.lmer <-  lmer(y ~ x + a + x * a + (1 + x | unit), data = simple.df)
summary(my.lmer)
## Linear mixed model fit by REML 
## Formula: y ~ x + a + x * a + (1 + x | unit) 
##    Data: simple.df 
##   AIC  BIC logLik deviance REMLdev
##  2206 2244  -1095     2174    2190
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr  
##  unit     (Intercept) 0.0315   0.177          
##           x           0.2842   0.533    0.915 
##  Residual             0.5641   0.751          
## Number of obs: 900, groups: unit, 30
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   0.9776     0.0410   23.85
## x             3.0670     0.0980   31.31
## a            -0.1108     0.0475   -2.34
## x:a           0.3577     0.1134    3.15
## 
## Correlation of Fixed Effects:
##     (Intr) x      a     
## x    0.723              
## a   -0.025 -0.018       
## x:a -0.018 -0.025  0.723

Notice that this model has indeed achieved a lower AIC (2206) by incorporating $a_i$. Furthermore, you can find noisy versions of all our key parameters in the output, including
  • the random effects correlation of .9 (.915) and standard deviations of .2 (.177) for the intercept and .5 (.533) for the slope,
  • the standard deviation of $\epsilon_{ij}$ of .75 (.751),
  • the regression parameters for $\alpha_i$ and $\beta_i$ versus $a_i$ of 1.0 (.9776), 3.0 (3.0670), -0.15 (-0.1108), and 0.30 (.3577).
You can increase the sample sizes to get closer estimates on average.
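
If you prefer to extract these quantities programmatically rather than read them off the summary, lme4 provides accessors (a brief sketch of my own):
fixef(my.lmer)    # fixed effect estimates: compare to 1, 3, -0.15, and 0.30
VarCorr(my.lmer)  # random effect standard deviations and their correlation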

Discussion

This article did not discuss lmer's predictions of $\alpha_i$ and $\beta_i$ or their relationship to the individual estimates $\hat{\alpha}_i$ and $\hat{\beta}_i$ from lm. Nor did we investigate what happens when one of the distributional assumptions is violated.

To readers who have made it this far, I hope your familiarity with random effects models has increased in general, and that linear mixed modeling tools, such as lmer, are more available to your specific hierarchical applications.

Acknowledgements

All images created by Bob Forrest using ggplot2 [7], Cairo [8] and R functions for base graphics [9]. Bob also helped choose the colors for the R syntax highlighting theme used. Custom syntax highlighting and html rendering made possible with the knitr package [10], and made enjoyable via RStudio's Rhtml editor [11]. Thanks to Josh Mills, Chris Nicholas, and Mike McCaslin for their helpful comments. Keep up with ours and other great articles on R-Bloggers, and follow Ben on Twitter (@baogorek) for the latest research updates.


References

  1. Ben Ogorek (2012). Random regression coefficients using lme4. Anything but R-bitrary. URL http://anythingbutrbitrary.blogspot.com/2012/06/random-regression-coefficients-using.html

  2. Douglas Bates, Martin Maechler and Ben Bolker (2011). lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-42. URL http://CRAN.R-project.org/package=lme4

  3. Jose Pinheiro, Douglas Bates, Saikat DebRoy, Deepayan Sarkar and the R Development Core Team (2012). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-103.

  4. John Fox (2002). Linear Mixed Models. Appendix to An R and S-PLUS Companion to Applied Regression. URL http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-mixed-models.pdf

  5. Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, Torsten Hothorn (2012). mvtnorm: Multivariate Normal and t Distributions. R package version 0.9-9992. URL http://CRAN.R-project.org/package=mvtnorm

  6. Alan Genz, Frank Bretz (2009), Computation of Multivariate Normal and t Probabilities. Lecture Notes in Statistics, Vol. 195., Springer-Verlage, Heidelberg. ISBN 978-3-642-01688-2.

  7. H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

  8. Simon Urbanek and Jeffrey Horner (2011). Cairo: R graphics device using cairo graphics library for creating high-quality bitmap (PNG, JPEG, TIFF), vector (PDF, SVG, PostScript) and display (X11 and Win32) output. R package version 1.5-1. URL http://CRAN.R-project.org/package=Cairo

  9. R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/

  10. Yihui Xie (2012). knitr: A general-purpose package for dynamic report generation in R. R package version 0.6. http://CRAN.R-project.org/package=knitr

  11. RStudio IDE for Windows. URL http://www.rstudio.com/ide/

Wednesday, August 8, 2012

Manipulating Data Frames Using sqldf - A Brief Overview

By Josh Mills

Introduction

For those who are learning R and who may be well-versed in SQL, the sqldf package provides a mechanism to manipulate R data frames using SQL. Even for experienced R programmers, sqldf can be a useful tool for data manipulation. This site provides a useful introduction to SQL. [SQLCourse.com 2012]

The following packages will be used in this document:
  • sqldf - A package that allows manipulation of R data frames with SQL (as well as connectivity with a limited set of database engines). [Grothendieck 2012]
  • plyr - A useful package for aggregating and summarizing data over multiple subgroups, with more advanced applications. [Wickham 2011]
Load these packages into memory.
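
library(sqldf)
library(plyr)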


Data Sets Used

Highway Data (crashes.csv and roads.csv)

These are fictional data sets containing crash and highway data with the following variables:
Crash data (crashes.csv)
  • Year - The year in which the observation was taken
  • Road - The name of the road being studied
  • N_Crashes - The number of crashes on the road during that year
  • Volume - Average annual daily traffic (AADT) volumes on the road during that year. See this site for a formal definition. [North Carolina Department of Transportation 2012]
Road data (roads.csv)
  • Road - The name of the road being studied
  • District - The administrative district responsible for the road's upkeep and maintenance
  • Length - Length of the road in miles


Joins and Merges with sqldf

Read in and explore the data to get a feel for the data's structure.

setwd("W:/Data Mining and Modeling/Applied Analytics - R Discussion/Related Files")
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
head(crashes)
##   Year          Road N_Crashes Volume
## 1 1991 Interstate 65        25  40000
## 2 1992 Interstate 65        37  41000
## 3 1993 Interstate 65        45  45000
## 4 1994 Interstate 65        46  45600
## 5 1995 Interstate 65        46  49000
## 6 1996 Interstate 65        59  51000
tail(crashes)
##     Year           Road N_Crashes Volume
## 105 2007 Interstate 275        32  21900
## 106 2008 Interstate 275        21  21850
## 107 2009 Interstate 275        25  22100
## 108 2010 Interstate 275        24  21500
## 109 2011 Interstate 275        23  20300
## 110 2012 Interstate 275        22  21200
print(roads)
##            Road       District Length
## 1 Interstate 65     Greenfield    262
## 2 Interstate 70      Vincennes    156
## 3         US-36 Crawfordsville    139
## 4         US-40     Greenfield    150
## 5         US-52 Crawfordsville    172

Performing joins is one of the most common operations in SQL. Left joins return all rows in the “left-hand” table - the crash data set in this case, whereas right joins return all rows in the “right-hand” table - the road data set in this case. Inner joins return only rows with matching data for the common variable, and full outer joins return all rows in all data sets, even if there are rows without matches. Currently, sqldf does not support right joins or full outer joins.

It is useful to format SQL statements with spaces and line breaks for readability and to store the query in a character string. The following statement will perform a left join of the crash data set to the road data set based on the common variable Road.

join_string <- "select
                crashes.*
              , roads.District
              , roads.Length
              from crashes
                left join roads
                on crashes.Road = roads.Road"

A new data frame, crashes_join_roads, will be created using the sqldf statement. The sqldf statement, at minimum, requires a character string with the SQL operation to be performed. The stringsAsFactors argument will force categorical variables (like Road and District) to have the class character rather than factor.

crashes_join_roads <- sqldf(join_string, stringsAsFactors = FALSE)
## Loading required package: tcltk
head(crashes_join_roads)
##   Year          Road N_Crashes Volume   District Length
## 1 1991 Interstate 65        25  40000 Greenfield    262
## 2 1992 Interstate 65        37  41000 Greenfield    262
## 3 1993 Interstate 65        45  45000 Greenfield    262
## 4 1994 Interstate 65        46  45600 Greenfield    262
## 5 1995 Interstate 65        46  49000 Greenfield    262
## 6 1996 Interstate 65        59  51000 Greenfield    262
tail(crashes_join_roads)
##     Year           Road N_Crashes Volume District Length
## 105 2007 Interstate 275        32  21900     <NA>     NA
## 106 2008 Interstate 275        21  21850     <NA>     NA
## 107 2009 Interstate 275        25  22100     <NA>     NA
## 108 2010 Interstate 275        24  21500     <NA>     NA
## 109 2011 Interstate 275        23  20300     <NA>     NA
## 110 2012 Interstate 275        22  21200     <NA>     NA
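
Although sqldf lacks a right join, you can usually get the same effect by swapping the table order in a left join. A sketch of my own (with the column list adjusted to match):

right_join_string <- "select
                  roads.*
                , crashes.Year
                , crashes.N_Crashes
                , crashes.Volume
                from roads
                  left join crashes
                  on roads.Road = crashes.Road"
roads_join_crashes <- sqldf(right_join_string, stringsAsFactors = FALSE)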

By using an inner join, only matching rows will be kept.

join_string2 <- "select
                crashes.*
              , roads.District
              , roads.Length
              from crashes
                inner join roads
                on crashes.Road = roads.Road"
crashes_join_roads2 <- sqldf(join_string2, stringsAsFactors = FALSE)
head(crashes_join_roads2)
##   Year          Road N_Crashes Volume   District Length
## 1 1991 Interstate 65        25  40000 Greenfield    262
## 2 1992 Interstate 65        37  41000 Greenfield    262
## 3 1993 Interstate 65        45  45000 Greenfield    262
## 4 1994 Interstate 65        46  45600 Greenfield    262
## 5 1995 Interstate 65        46  49000 Greenfield    262
## 6 1996 Interstate 65        59  51000 Greenfield    262
tail(crashes_join_roads2)
##    Year  Road N_Crashes Volume       District Length
## 83 2007 US-36        49  24000 Crawfordsville    139
## 84 2008 US-36        52  24500 Crawfordsville    139
## 85 2009 US-36        55  24700 Crawfordsville    139
## 86 2010 US-36        35  23000 Crawfordsville    139
## 87 2011 US-36        33  21000 Crawfordsville    139
## 88 2012 US-36        31  20500 Crawfordsville    139

The merge statement in base R can perform the equivalent of inner and left joins, as well as right and full outer joins, which are unavailable in sqldf.

crashes_merge_roads <- merge(crashes, roads, by = c("Road"))
head(crashes_merge_roads)
##            Road Year N_Crashes Volume   District Length
## 1 Interstate 65 2000        95  74000 Greenfield    262
## 2 Interstate 65 1997        76  52000 Greenfield    262
## 3 Interstate 65 1998        90  58000 Greenfield    262
## 4 Interstate 65 1999        95  65000 Greenfield    262
## 5 Interstate 65 1991        25  40000 Greenfield    262
## 6 Interstate 65 1992        37  41000 Greenfield    262
tail(crashes_merge_roads)
##     Road Year N_Crashes Volume   District Length
## 83 US-40 2003        94  55200 Greenfield    150
## 84 US-40 2004        25  55300 Greenfield    150
## 85 US-40 2009        67  65000 Greenfield    150
## 86 US-40 2010       102  67000 Greenfield    150
## 87 US-40 2011        87  67500 Greenfield    150
## 88 US-40 2012        32  67500 Greenfield    150
crashes_merge_roads2 <- merge(crashes, roads, by = c("Road"), all.x = TRUE)
head(crashes_merge_roads2)
##             Road Year N_Crashes Volume District Length
## 1 Interstate 275 1994        21  21200     <NA>     NA
## 2 Interstate 275 1995        28  23200     <NA>     NA
## 3 Interstate 275 1996        22  20000     <NA>     NA
## 4 Interstate 275 1997        27  18000     <NA>     NA
## 5 Interstate 275 1998        21  19500     <NA>     NA
## 6 Interstate 275 1999        22  21000     <NA>     NA
tail(crashes_merge_roads2)
##      Road Year N_Crashes Volume   District Length
## 105 US-40 2003        94  55200 Greenfield    150
## 106 US-40 2004        25  55300 Greenfield    150
## 107 US-40 2009        67  65000 Greenfield    150
## 108 US-40 2010       102  67000 Greenfield    150
## 109 US-40 2011        87  67500 Greenfield    150
## 110 US-40 2012        32  67500 Greenfield    150
crashes_merge_roads3 <- merge(crashes, roads, by = c("Road"), all.y = TRUE)
head(crashes_merge_roads3)
##            Road Year N_Crashes Volume   District Length
## 1 Interstate 65 2000        95  74000 Greenfield    262
## 2 Interstate 65 1997        76  52000 Greenfield    262
## 3 Interstate 65 1998        90  58000 Greenfield    262
## 4 Interstate 65 1999        95  65000 Greenfield    262
## 5 Interstate 65 1991        25  40000 Greenfield    262
## 6 Interstate 65 1992        37  41000 Greenfield    262
tail(crashes_merge_roads3)
##     Road Year N_Crashes Volume       District Length
## 84 US-40 2004        25  55300     Greenfield    150
## 85 US-40 2009        67  65000     Greenfield    150
## 86 US-40 2010       102  67000     Greenfield    150
## 87 US-40 2011        87  67500     Greenfield    150
## 88 US-40 2012        32  67500     Greenfield    150
## 89 US-52   NA        NA     NA Crawfordsville    172
crashes_merge_roads4 <- merge(crashes, roads, by = c("Road"), all.x = TRUE, 
    all.y = TRUE)
head(crashes_merge_roads4)
##             Road Year N_Crashes Volume District Length
## 1 Interstate 275 1994        21  21200     <NA>     NA
## 2 Interstate 275 1995        28  23200     <NA>     NA
## 3 Interstate 275 1996        22  20000     <NA>     NA
## 4 Interstate 275 1997        27  18000     <NA>     NA
## 5 Interstate 275 1998        21  19500     <NA>     NA
## 6 Interstate 275 1999        22  21000     <NA>     NA
tail(crashes_merge_roads4)
##      Road Year N_Crashes Volume       District Length
## 106 US-40 2004        25  55300     Greenfield    150
## 107 US-40 2009        67  65000     Greenfield    150
## 108 US-40 2010       102  67000     Greenfield    150
## 109 US-40 2011        87  67500     Greenfield    150
## 110 US-40 2012        32  67500     Greenfield    150
## 111 US-52   NA        NA     NA Crawfordsville    172

Note how the order of the rows in the data frames was rearranged when using the merge statement.

The sqldf statement can process SQLite commands, which include most of the standard syntax used in ANSI SQL, except for some of the join operations mentioned previously. [SQLite.org 2012]

Modifying the inner join query to include a where clause is the equivalent of combining the merge and subset statements.

join_string2 <- "select
                crashes.*
              , roads.District
              , roads.Length
                from crashes
                    inner join roads
                    on crashes.Road = roads.Road
                where crashes.Road = 'US-40'"                
crashes_join_roads4 <- sqldf(join_string2, stringsAsFactors = FALSE)
head(crashes_join_roads4)
##   Year  Road N_Crashes Volume   District Length
## 1 1991 US-40        46  21000 Greenfield    150
## 2 1992 US-40       101  21500 Greenfield    150
## 3 1993 US-40        76  23000 Greenfield    150
## 4 1994 US-40        72  21000 Greenfield    150
## 5 1995 US-40        75  24000 Greenfield    150
## 6 1996 US-40       136  23500 Greenfield    150
tail(crashes_join_roads4)
##    Year  Road N_Crashes Volume   District Length
## 17 2007 US-40        45  59500 Greenfield    150
## 18 2008 US-40        23  61000 Greenfield    150
## 19 2009 US-40        67  65000 Greenfield    150
## 20 2010 US-40       102  67000 Greenfield    150
## 21 2011 US-40        87  67500 Greenfield    150
## 22 2012 US-40        32  67500 Greenfield    150
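
For comparison, a minimal base R sketch of that merge-plus-subset equivalent (the variable name is my own):

crashes_us40 <- subset(merge(crashes, roads, by = "Road"), Road == "US-40")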


Aggregation Functions and Limitations of sqldf

Aggregate functions available in SQLite can be applied through a group by clause.

group_string <- "select
                  crashes.Road
                 , avg(crashes.N_Crashes) as Mean_Crashes
                 from crashes
                    left join roads
                    on crashes.Road = roads.Road
                 group by 1"
sqldf(group_string)
##             Road Mean_Crashes
## 1 Interstate 275        24.95
## 2  Interstate 65       107.82
## 3  Interstate 70        65.18
## 4          US-36        48.00
## 5          US-40        68.68

The available aggregation functions within SQLite or ANSI SQL are limited, however. While sqldf can make certain data manipulation operations easier, more advanced data manipulation tasks and calculations must be performed in R, for example with Hadley Wickham's plyr package.

ddply(crashes_merge_roads,
      c("Road"),
      function(X) data.frame(Mean_Crashes = mean(X$N_Crashes),
                             Q1_Crashes = quantile(X$N_Crashes, 0.25),
                             Q3_Crashes = quantile(X$N_Crashes, 0.75),
                             Median_Crashes = quantile(X$N_Crashes, 0.50))
      )
##            Road Mean_Crashes Q1_Crashes Q3_Crashes Median_Crashes
## 1 Interstate 65       107.82      63.25     140.25          108.5
## 2 Interstate 70        65.18      52.00      75.50           66.5
## 3         US-36        48.00      42.00      57.25           47.0
## 4         US-40        68.68      45.25      90.75           70.0

In short, the sqldf package can make it easy for SQL users to begin making the transition to R. The package provides a convenient mechanism for data manipulation in R using SQL. While there are limitations to the use of SQL within R, the added convenience provides a useful alternative to using standard R functions. In addition, SQL statements used can easily be modified to function in a large-scale database environment such as Teradata or Netezza.

Download this document in R Markdown format.

References
  • SQLCourse.com (2012). SQLCourse.com®: Interactive Online SQL Training. Link
  • G. Grothendieck (2012). sqldf: Perform SQL Selects on R Data Frames. R package version 0.4-6.4. Link
  • H. Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. Link
  • North Carolina Department of Transportation (2012). Training Material for Traffic Engineering Accident Analysis System (TEAAS). Link
  • SQLite.org (2012). SQL As Understood By SQLite. Link
  • R Development Core Team (2012). R: A language and environment for statistical  computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN  3-900051-07-0. Link
This document was generated primarily using R Markdown and knitr.

Friday, July 27, 2012

ggplot2: A little twist on back-to-back bar charts

Sangyoon Lee

Background

While thinking about ways to represent incoming and outgoing flows in a business process, I thought about using export-import charts like the one shown here in the Learning R blog. However, as the author acknowledges, it is difficult to compare individual values using these charts. Regardless, I still wanted to have this graph for an at-a-glance view before breaking it into facets and comparing individual values.

My Solution

In the Learning R article, the author chooses to show multiple categories of import and export using stacked bars. Instead of representing multiple categories, I decided to use the color intensity on the bars’ fill as visual reinforcement of information the graph already contains. Import and export are represented by red and blue, respectively, and the transparency facilitates the visual comparison the reader must make between bars that are not side by side.

In the below example, I use the same subset of data as in the motivating post. Please refer to the linked article for the data used in this example. Make sure to click on the link that says "Access the subset used in this post in here." rather than going to the Eurostat website. Save the file as "trade.csv" in the working directory. These are monthly trade data for the 27 European Union countries by broad economic categories (BEC) in millions of Euros.

First, load the necessary packages.
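
library(ggplot2)
library(plyr)
library(reshape)
library(scales)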


For convenient and powerful data manipulation, plyr and reshape provide functions like ddply and melt. A relatively new package, scales, is required for the scale functions that format axis numbers within ggplot2.

Next, import the data, calculate the trade balance (export - import), and melt the data for ggplot2.

trade <- read.csv("trade.csv", header = TRUE, 
   stringsAsFactors = FALSE)
balance <- ddply(trade, .(Time), summarise, 
   balance = sum(EXP - IMP))
trade.m <- melt(trade, id.vars = c("BEC","Time"))

After the melt step, add another line to aggregate over BEC. This will further simplify the structure.

trade.a <- ddply(trade.m, c("Time", "variable"), summarise,
                 value = sum(value))

At this point, the data will look like this:

> head(trade.a)
     Time variable    value
1 2008M05      EXP 273153.2
2 2008M05      IMP 260789.1
3 2008M06      EXP 284994.7
4 2008M06      IMP 273033.0
5 2008M07      EXP 284681.6
6 2008M07      IMP 271122.2

We step through one layer at a time.

Layer 1: Start with export bars. We will add import data on the bottom of this graph.

ggplot(trade.a, aes(x=Time)) + 
   geom_bar(data = subset(trade.a, variable == "EXP"),
   aes(y=value, fill = value), stat = "identity")

Layer 2: Add the import data and attach it back-to-back to the export data. Label the x-axis and the y-axis accordingly, and format the y-axis numbers with commas.

last_plot() + geom_bar(data = subset(trade.a, variable == "IMP"), 
  aes(y=-value, fill = -value), stat = 'identity') +
  scale_y_continuous(labels = comma) + xlab("") + 
  ylab("Export  -  Import") + 
  scale_fill_gradient2(low = muted("red"),
  mid = "white", high = muted("blue"), midpoint = 0,space = "rgb")

Layer 3: Now add the balance trend line, draw a horizontal reference line at zero, and remove the meaningless legend.

last_plot() + geom_line(data = balance, aes(Time, balance, group = 1), size = 1) +
  geom_hline(yintercept = 0, colour = "grey90") +
  opts(legend.position = "none")

Layer 4: Finally, change the x-axis to make it easy for viewers to read. The following result is my final product.

labels <- gsub("20([0-9]{2})M([0-9]{2})", "\\2\n\\1", trade.m$Time)
last_plot() + scale_x_discrete(labels = labels)

The resulting plot shows the overall export and import trend, with different color intensities to reinforce the size of each bar. This eases the cognitive burden placed on readers when they visually compare export versus import.

While the overall trend shows that there are more exports than imports, the story might be more complicated when there are subcategories. An example is the United States economy: an aggregated USA import-export chart will show significantly larger import bars than export bars, but when it is broken into different categories, especially in agricultural goods, the graph will show a different story from the overall trend.

In the meantime, this graph provides a quick at-a-glance look at exports and imports before digging deeper into various categories for further analysis. 

All highlighted R-code segments were created by Pretty R at inside-R.org. Keep up with ours and other great articles relating to R on R-bloggers.

References