Friday, July 20, 2012

Modeling Permanent and Gradual Process Changes with CDFs

Spencer Herath
Special thanks to Ben Ogorek

Background
I recently faced a process with a structural change resulting in an increase in the process mean.  The jump to the new mean was not immediate; rather, there was a gradual increase in values over time.  I had previously benefited from multi-staged process-behavior charts when encountering immediate process shifts, but now I needed a way to account for this gradual increase. This paper describes a technique for developing a smooth curve to model permanent and gradual process changes.



















First Attempt: A Process-Behavior Chart/Linear Model
My first attempt at modeling this behavior involved staging into three phases: a “before” phase, a “change” phase, and an “after” phase. I ended the “before” phase as soon as the gradual increase became visually evident.  During the “change” phase, I calculated a line of best fit using simple linear regression.

It’s fairly easy for the human eye to determine when the data points have stopped increasing and settled around the new mean. However, I was working with hundreds of these charts and needed a way to automate the process.  Thus I developed an ad-hoc technique based on trend line slope and R-Squared. The technique was iterative; new data points were sequentially added and the stages were inferred by the rise and fall of the R-Squared statistic.

The staged Process-Behavior Chart/Linear model worked fairly well in practice, but it left some things to be desired.  First, there is a lack of cohesion, as three separate charts are merged into one.  Second, if you do not have the time to visually determine when the stages begin and end, you must rely on some algorithm to automate this decision process for you.  From my experience, it is difficult to find an airtight algorithm for this decision process, as they often produce undesirable stage cutoffs.

But perhaps most compelling is that a straight line is not a particularly appealing model of gradual change.  Both anthropologists and marketing professionals have found that a population’s adoption of innovations or trends typically follows the shape of an S-curve (Henrich, 2001).  In most cases, the initial growth is small; then it gradually increases until it eventually levels off.  It seems intuitive that a gradual shift in a process would likely follow this same pattern.  We will produce a gradual change model that is similar in shape to the class S-curve by exploiting the properties of a cumulative distribution function.

Properties of a CDF
A cumulative distribution function (CDF) describes the probability that a random variable X with a given probability distribution will be found at a value less than or equal to x (DeGroot, 2001).  It has been referred to as the “area so far” function of the probability distribution function (Gentle, 2009).

Examples of some CDF curves of the normal distribution
http://en.wikipedia.org/wiki/File:Normal_Distribution_CDF.svg

















Every CDF is monotone non-decreasing, which means that the function never decreases as it moves to the right.  Notice too that every CDF curve approaches 0 as it moves to the left and to 1 as it moves to the right.

For convenience, we will only deal with CDF of the normal distribution.  The CDF of a normal distribution relies on two parameters:  the mean and standard deviation of the distribution.  Though the curve never reaches either 0 or 1, notice that it approaches those limits very rapidly for small standard deviations.  Eventually, the CDF approaches a point on the x-axis where the curve is virtually horizontal.  In our application, we can take for granted that the curve eventually reaches 0 on the left and 1 on the right, even though it technically never does.

Our application of the CDF actually has little to do with probability distributions.  We will manipulate the normal CDF so that it no longer approaches 0 and 1, but rather any two numbers we desire.  In our case, these two numbers will be the old mean and the new mean of the process.  Notice that the CDF is similar in shape to the S-curve, which is often used in business for consumer adoption models (Bass, 1969).

Generating Data in R
Start with a clean R session.  For our data generation, we will produce a process with 61 discreet time intervals (0 to 60).

rm(list = ls())
set.seed(4321)

t <- seq(0, 60)

Now, let us randomly generate the mean before the process change as well as the mean after the process change.  The mean delta is simply the difference between the two.

mean.before <- rnorm(1, 20, 10)
mean.after <- mean.before + 10 + .3*rnorm(1) + .6*rnorm(1)
mean.delta <- mean.after - mean.before 

Now, we will generate mu and sigma—the two parameters that determine the shape of a normally distributed CDF.  Then we can build a CDF out of our sequence t using the pnorm function in R.

mu <- rnorm(1, 22, 5)
sigma <- rnorm(1, 4, .5)

cdf.1 <- pnorm(t, mean=mu, sd=sigma)

The next step is to scale the CDF so that the asymptotes are no longer 0 and 1, but rather the before mean and the after mean.  This can be accomplished quite easily using the formula below.  Then, all that is left is to add some random noise to the data.

scaled.up.cdf <- mean.before + mean.delta*cdf.1

y <- scaled.up.cdf + rnorm(length(t))

We now have 61 randomly generated data points that can be fitted using a CDF curve with the asymptotes adjusted to the before and after means.  Let’s plot the data:

plot(y ~ t, ylab="Data", main="Random Data", pch=20)


















Note: To avoid confusion, we will only use the word “mean” to describe the process means of our data.  We will use mu and sigma to describe the mean and standard deviation of the normal CDF.

Developing the Scaled CDF Model
Now that we have our randomly generated data, we can dive into an example that revolves around some process.  In the before stage, the process hovers around some mean.  Then the process begins to change. The values increase, and eventually settle around the new process mean. 

The good news is, for this technique, we do not need to know when the process started or stopped changing as we did with the Process-Behavior Chart/Linear Model.  Let’s think for a moment about the end goal: finding a scaled CDF curve that fits our data.  This curve will rely on four parameters:
·  Process means
1.  Mean before (a)
2.  Mean after (b)
·  CDF parameters
3.  Mu
4.  Sigma

We will eventually use R’s Nonlinear Least Squares function nls to determine the values for these four parameters that best fit our data, but this function requires that we first input approximate starting values for the parameters.  Thus our first step is to approximate these parameters.

Approximating the Parameters
Approximating our first two parameters, a and b, is fairly straightforward. We simply take the mean of the first few data points to be an approximation of a (the mean before) and the last few data points to be an approximation of b (the mean after).  The method of approximation may be changed to suit the given application, and in many cases, a simple plot will suffice.  It is important that we come up with decent estimates for the two means, however, as we rely on these two guesses in order to scale down our data.  Poor mean estimates will lead to poor scaling.

a.guess <- mean(y[1:5])
b.guess <- mean(y[(length(y)-4):length(y)])

Now that we have our approximate means, we must now scale our data down so that it is more representative of an actual CDF curve.  Recall that a true CDF curve must have a range from 0 to 1.  Currently we have data that roughly resembles the shape of a CDF, except that the range is from a to b.  To scale our data so that the mean before is close to 0 and the mean after is close to 1, we use the opposite form of the scaling function we used to generate the random data.

y.scaled <- (y-a.guess)/(b.guess-a.guess)

If the newly rescaled data does not have before and after means of roughly 0 and 1, then we have a problem with our estimates for a and b.  The scaled data looks like this:

plot(y.scaled ~ t, ylab="y-scaled", main="Scaled Data", pch=20)


















Now that our data resembles a CDF, the next step is to find the parameters mu and sigma of the CDF curve that best fit this scaled data.  This is a two-step process.  First, we use the loess function to generate a smooth curve that fits the data well, but does not over-fit (we want a fairly smooth curve).  Once we have this curve, we approximate a mu and sigma using theoretical CDF probabilities. 

Recall back to your introductory statistics course. For a normal CDF curve F, F(mu) =0.5.  Similarly, since about 68.2% of normally distributed data falls within one standard deviation from the mean, we can conclude that F(sigma)=.841.  To determine these points on the x-axis, we use the uniroot function in R.  To save computing time, I have set the error tolerance parameter fairly high, since we are only interested in rough estimates right now.

loess.1 <- loess(y.scaled ~ t, span=.2, degree=1)

mu.guess <- uniroot(function(x) predict(loess.1, newdata=x) - .5,
                    c(0, 60), tol=.01)$root
sigma.guess <- uniroot(function(x) predict(loess.1, newdata=x)
                          - .841, c(0, 60), tol=.01)$root
                          - mu.guess

Now let’s plot in order to get a visual of what is going on so far.

plot(y.scaled ~ t, ylab="y-scaled", main="", pch=20,
     ylim =c(-.2, 1.2))
par(new=TRUE)
plot(loess.1$fitted ~ t, ylab="", main="Curve Generated by loess",
     type="l",ylim =c(-.2, 1.2))
abline(h=.5, col="red", lty=2)
abline(h=.841, col="green", lty=2)
segments(0,0,21,0, lty=2)
segments(35,1,61,1, lty=2)


















Visually, the red line at y=0.5 shows that mu is close to 26 and the green line at y=0.841 shows that sigma is close to 4 (30 – 26).

With rough approximations of all four parameters, we can now afford to use nls.  The function inside our nls function is the CDF scaling function:

y ~ a + (b-a)*pnorm(t, mean=mu, sd=sigma)

Remember, this function scales a CDF on the range 0 to 1 up to the range a to b.  To find the best fit parameter values for our nonlinear function, nls does the rest:

nls.1 <- nls(y ~ a + (b-a)*pnorm(t, mean=mu, sd=sigma),
 start = list(a=a.guess, b=b.guess, mu=mu.guess,
              sigma=sigma.guess))

a <- coef(nls.1)["a"]
b <- coef(nls.1)["b"]
mu <- coef(nls.1)["mu"]
sigma <- coef(nls.1)["sigma"]

We now know the parameters that create the curve of best fit based on our nonlinear function.

parameter.summary <- data.frame(Parameters=c(a, b, mu, sigma))

> parameter.summary

     Parameters
a     15.734545
b     26.487421
mu    26.389278
sigma  3.751462

To create our final mean function before, during, and after the process change, simply insert the parameter values into the formula:

cdf.fit <- a + (b-a)*pnorm(t, mu, sigma)

To assess the fit of our mean model, we can plot it against our original data.

y.axis.min <- min(y)
y.axis.max <- max(y)

plot(cdf.fit ~ t, ylim=c(y.axis.min,y.axis.max), ylab="y", type="l")
par(new=TRUE)
plot(y ~ t, ylim=c(y.axis.min,y.axis.max), main="Scaled CDF Model",
     ylab="", pch=20)


















The curve fits the data quite well.

Discussion
The primary advantage of this model is its representation of gradual change using four parameters that are easily estimated using the helpful functions in base R. Now let us stop to think about what each of these parameters tells us specifically about the data.
·  a is the mean of the process before the growth stage
·  b is the mean of the process after the growth stage
·  mu occurs at the midpoint of the growth stage, so it gives us an idea of growth stage timing
·  sigma is gives us an idea of growth stage duration
Thus, we have four parameters that determine the shape of the curve, each of which tells us something about the process and its transition.

When we are dealing with many similar processes, we may benefit from regressing the four curve parameters on unit-level characteristic variables. This could increase knowledge about the process change itself as well as assist in prediction of the functional form when there are impending process changes.


Keep up with ours and other great articles relating to R on R-bloggers.

References
  • Base, Frank M (1969).  A New Product Growth for Model Consumer Durables.  Management Science, Vol. 15, No. 5, Theory Series, 215-227. 
  • DeGroot, Morris H. and Schervish, Mark J (2002).  Probability and Statistics. 3rd Edition.  Addison-Wesley.  Boston, Massachusetts, USA.
  • Gentle, J.E. (2009).  Elements of Computational Statistics. Springer. New York, New York, USA.
  • Henrich, Joseph (2001).  Cultural Transmission and the Diffusion of Innovations.  American Anthropologist.  Vol. 103, No. 4, 992-1013.
  • R Development Core Team (2012). R: A language and environment for statistical  computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN  3-900051-07-0, URL http://www.R-project.org/.

6 comments:

  1. Very impressive, thank you for sharing. I enjoyed the use of CDF curves to help the model deal with gradual change.

    ReplyDelete
  2. How about using a classical 4-parameter logistic sigmoid:

    nls(y ~ c + (d - c)/(1 + exp(b * (t - e))), start = list(c = 16, d = 26, b = 1, e = 30))

    Nonlinear regression model
    model: y ~ c + (d - c)/(1 + exp(b * (t - e)))
    data: parent.frame()
    c d b e
    26.4993 15.7079 0.4535 26.3603
    residual sum-of-squares: 38.84

    Number of iterations to convergence: 9
    Achieved convergence tolerance: 2.16e-06

    ReplyDelete
    Replies
    1. I believe this is equivalent to replacing the Normal CDF with the Logistic CDF in the above context.

      Delete
  3. Very clever!

    Did you try the Exponentially Weighted Moving Average (EWMA) chart to monitor this gradual increase? If yes, how's the performance of your model compared to the EWMA Chart?

    ReplyDelete
  4. Script runs fine until running :
    nls.1 <- nls(y ~ a + (b-a)*pnorm(t, mean=mu, sd=sigma),
    start = list(a=a.guess, b=b.guess, mu=mu.guess,sigma=sigma.guess))
    This produces:

    Error in numericDeriv(form[[3L]], names(ind), env) :
    Missing value or an infinity produced when evaluating the model
    In addition: Warning message:
    In pnorm(t, mean = mu, sd = sigma) : NaNs produced

    Any suggestion on how to fix this.

    ReplyDelete