  • Background

  • Diekmann, Krassnig, and Lorenz (1996) conducted a field study to explore whether driver characteristics are related to an aggressive response. (Thanks to Jeff Sklar for pointing us to this article.) The study was conducted at a busy intersection in Munich, West Germany, on two afternoons (Sunday and Monday) in 1986. The experimenters sat in a Volkswagen Jetta (the “blocking car”), did not accelerate after the traffic light turned green, and timed how long it took the driver of the blocked car to react (either by honking or flashing headlights). The response time (in seconds) is our variable of interest. Some values were “censored” in that the researcher stopped timing before the driver actually honked. This can happen if there is a time limit to the observation period and “success” has not been observed within that time period.

    Research questions: Suppose we want to answer questions like "What is a typical wait time?" or "How likely is it that someone would honk within 2 seconds?"

    Goals: In this investigation, you will

    • Continue to explore numerical and graphical summaries for a quantitative variable, including "modified" boxplots
    • Explore data transformations and non-normal probability distribution models
  • (a) How long do you think you would wait before you honked? (You should each answer.)
    Student 1: seconds
    Student 2: seconds

  • Describing the distribution

  • The data can be found in honking.txt.  Load these data into R/RStudio or JMP. (The dataset is small enough that you can copy and paste from the webpage; just watch for extra lines at the bottom of the file.)

    R reminder: honking <- read.table("clipboard", header=TRUE)  # "clipboard" works on Windows; on a Mac, use pipe("pbpaste") instead

  • Overlay a normal distribution on your histogram and also create a normal probability plot (normal quantile plot).  
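
    One possible way to do this in base R is sketched below. The `times` vector is made up for illustration and is NOT the data in honking.txt; with the real file you would use honking$responsetime instead.

```r
# Made-up response times in seconds (hypothetical, not from honking.txt)
times <- c(1.2, 1.9, 2.4, 2.8, 3.1, 3.6, 4.0, 4.7, 5.5, 6.3, 8.1, 12.4)

# Histogram on the density scale so that a density curve can be overlaid
hist(times, freq = FALSE, main = "", xlab = "time until reaction (sec)")
curve(dnorm(x, mean = mean(times), sd = sd(times)), add = TRUE)

# Normal probability plot; qqline() adds a reference line through the quartiles
qqnorm(times)
qqline(times)
```

    Points that bend away from the reference line in the probability plot suggest that a normal model is a poor fit.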

  • A popular numerical summary of the distribution of a quantitative variable is the five-number summary:

    min: smallest value in the data set
    lower quartile (Q1): value that has 25% of the observations below it (25th percentile)
    median (Q2): value that has 50% of the observations below it (50th percentile)
    upper quartile (Q3): value that has 75% of the observations below it (75th percentile)
    max: largest value in the data set

    With skewed data, rather than using the standard deviation as a measure of spread, we might prefer the interquartile range or IQR. The interquartile range is the upper quartile (aka Q3) minus the lower quartile (aka Q1) and measures the width of the middle 50% of the distribution.
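
    As a minimal sketch, these summaries can be computed in base R as follows. The `times` vector is made up for illustration (it is not the honking data). Note that different software uses slightly different quartile rules, so fivenum() and quantile() may not agree exactly.

```r
# Made-up response times in seconds (hypothetical, not from honking.txt)
times <- c(1.2, 1.9, 2.4, 2.8, 3.1, 3.6, 4.0, 4.7, 5.5, 6.3, 8.1, 12.4)

five <- fivenum(times)    # min, lower hinge, median, upper hinge, max
quarts <- quantile(times) # 0%, 25%, 50%, 75%, 100% using R's default rule
iqr <- IQR(times)         # upper quartile minus lower quartile, via quantile()

five
quarts
iqr
```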

  • Definition: A boxplot (invented by John Tukey in 1970) is a graph based on the five-number summary. The box extends from the lower quartile to the upper quartile with a vertical line inside the box at the location of the median. Whiskers then typically extend to the min and max values.

    Create a boxplot

    • In R: boxplot(honking$responsetime, xlab="time until reaction", horizontal=TRUE)
      • or iscamboxplot(honking$responsetime)
    • In JMP: In the Distributions output window, use the hot spot to select Outlier Boxplot.

    By default, these are "modified boxplots" which means they denote observations that are "outliers" according to the "1.5IQR criterion."

    Definition: A value is an outlier according to the 1.5IQR criterion if it is larger than the upper quartile + 1.5 × box length or smaller than the lower quartile – 1.5 × box length, where the box length (upper quartile – lower quartile) is the interquartile range. A modified boxplot displays such outliers separately and extends the whiskers to the most extreme non-outlier observations.
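
    As a rough sketch of the 1.5IQR criterion, the cutoffs ("fences") can be computed by hand. The `times` vector is made up for illustration (it is not the honking data), and the names `lower_fence` and `upper_fence` are just labels chosen here. quantile() uses R's default quartile rule, which can differ slightly from the hinges that boxplot() uses.

```r
# Made-up response times in seconds (hypothetical, not from honking.txt)
times <- c(1.2, 1.9, 2.4, 2.8, 3.1, 3.6, 4.0, 4.7, 5.5, 6.3, 8.1, 12.4)

q1 <- quantile(times, 0.25, names = FALSE)  # lower quartile
q3 <- quantile(times, 0.75, names = FALSE)  # upper quartile
box_length <- q3 - q1                       # interquartile range

lower_fence <- q1 - 1.5 * box_length
upper_fence <- q3 + 1.5 * box_length

# Values beyond either fence are flagged as outliers
outliers <- times[times < lower_fence | times > upper_fence]
outliers  # 12.4 for these made-up values
```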
  • Modelling non-normal data

  • These response times are not well modelled by a normal distribution. So can we still make predictions? There are a couple of strategies.  One approach is to try a different mathematical model.

    Overlay an exponential probability model (often used to model wait times) to these data and/or create a probability plot using the exponential distribution as the reference distribution.

    • In R: (for the qqplot we have to first get the quantiles)
      > theoquant = qexp(ppoints(12))  # Theoretical exponential quantiles at 12 evenly spaced probabilities (for 12 observations)
      > hist(theoquant)
      > qqplot(honking$responsetime, theoquant)  # The quantiles vs. your data. Look for a line.
      > iscamaddexp(honking$responsetime) #overlay exponential model
    • In JMP: In the Distribution window, use the hot spot to select Continuous Fit > Exponential.
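
    If the iscam functions are not available, a similar check can be sketched in base R. This is one possible approach, not necessarily what iscamaddexp does internally. The `times` vector is made up for illustration (not the honking data), and estimating the rate as 1/mean is an assumption of this sketch.

```r
# Made-up response times in seconds (hypothetical, not from honking.txt)
times <- c(1.2, 1.9, 2.4, 2.8, 3.1, 3.6, 4.0, 4.7, 5.5, 6.3, 8.1, 12.4)

n <- length(times)
rate_hat <- 1 / mean(times)                 # exponential rate estimated from the sample mean
theo <- qexp(ppoints(n), rate = rate_hat)   # theoretical exponential quantiles

# Exponential probability plot: points near the line y = x suggest a reasonable fit
qqplot(theo, times, xlab = "exponential quantiles", ylab = "observed times")
abline(0, 1)

# Histogram on the density scale with the exponential density overlaid
hist(times, freq = FALSE, main = "", xlab = "time until reaction (sec)")
curve(dexp(x, rate = rate_hat), add = TRUE)
```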
  • The exponential distribution is another continuous probability distribution, so we would determine the probability of someone waiting less than 2 seconds before honking by finding the area under the curve to the left of 2.

    Use technology to calculate the probability of a wait time under 2 seconds using the exponential distribution with mean 4.25 sec:

    • In R: > pexp(2, rate = 1/4.25) # P(X < 2) when X follows an Exp(mean = 4.25) distribution.
    • In JMP: Use the Distribution Calculator and select the Exponential distribution. Specify a scale parameter of 1/4.25 = 0.235. Choose X <= Qa and enter 2 for Qa. (We don’t need to worry about strict vs. non-strict inequalities here.)
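
    For an exponential distribution, the area to the left of x has a closed form, P(X < x) = 1 − exp(−x/mean). The short sketch below checks that this matches pexp() for the mean of 4.25 sec given above; the variable names are chosen here for illustration.

```r
mean_time <- 4.25                          # mean wait time from the text, in seconds

by_formula <- 1 - exp(-2 / mean_time)      # closed-form area to the left of 2
by_pexp <- pexp(2, rate = 1 / mean_time)   # the same probability from R's exponential CDF

c(by_formula, by_pexp)                     # both are about 0.375
```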
  • Repeat by comparing the data with a “lognormal” probability model and using the lognormal probability model to estimate the probability of someone waiting less than 2 seconds:

    • In R:
      > qqplot(honking$responsetime, qlnorm(ppoints(12)))
      > iscamaddlnorm(honking$responsetime)
      > plnorm(2, meanlog = 1.292, sdlog = 0.5238)
    • In JMP: Scroll down the Distribution list and choose the Lognormal distribution with location = 1.292 and scale = 0.5238.
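
    The meanlog and sdlog parameters describe the distribution on the natural-log scale, so one simple way to estimate them is from the mean and standard deviation of the logged values. The sketch below assumes that approach and uses a made-up `times` vector (not the honking data), so its estimates will not match 1.292 and 0.5238.

```r
# Made-up response times in seconds (hypothetical, not from honking.txt)
times <- c(1.2, 1.9, 2.4, 2.8, 3.1, 3.6, 4.0, 4.7, 5.5, 6.3, 8.1, 12.4)

meanlog_hat <- mean(log(times))  # mean of the natural-log values
sdlog_hat <- sd(log(times))      # standard deviation of the natural-log values

# Lognormal probability plot using the estimated parameters
theo <- qlnorm(ppoints(length(times)), meanlog = meanlog_hat, sdlog = sdlog_hat)
qqplot(theo, times, xlab = "lognormal quantiles", ylab = "observed times")
abline(0, 1)

# Estimated probability of a wait under 2 seconds for these made-up values
p_under_2 <- plnorm(2, meanlog = meanlog_hat, sdlog = sdlog_hat)
p_under_2
```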
  • Data transformations

  • Another approach would be to consider whether a rescaling or transformation of the data might create a more normal-looking distribution, allowing us to use the very friendly normal distribution to estimate probabilities. In this case, we need a transformation that will downsize the large values more than the small values. Log transformations are often very helpful in this regard.

    Definition: A data transformation applies a mathematical function to each value to re-express the data on an alternative scale. For example, the Richter scale is a log (base 10) scale: a one-unit increase in an earthquake's magnitude means the amplitude of the seismic waves is 10 times greater.
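
    To see how a log transformation shrinks large values more than small ones, here is a small illustration with made-up numbers (not the honking data). Each value is double the previous one, so the gaps grow on the original scale but are equal on the log scale.

```r
raw <- c(2, 4, 8, 16)          # made-up values, each double the one before

raw_gaps <- diff(raw)          # 2 4 8: gaps grow on the original scale
log_gaps <- diff(log(raw))     # every gap equals log(2), about 0.693

raw_gaps
log_gaps
```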

    Create a new variable which is log(responsetime). (You can use either natural log or log base 10, but so we all do the same thing, let’s use natural log here, which is the default in most software when you say “log.”)
    • In R: > lnresponsetime = log(honking$responsetime)
    • In JMP: Create a new column (e.g., double click on next column over) and then open the formula editor for that column (e.g., Cols > Formula). Type or use your mouse to select Transcendental > Log to create Log(responsetime). Press OK.

    Create a histogram of this new variable and a normal probability plot. This time, I do want you to upload your output.

  • Use a normal distribution with mean 1.29 ln-seconds and standard deviation 0.53 ln-seconds for the logged response times to estimate how often someone will honk within the first 2 seconds (e.g., using a distribution calculator, pnorm or iscamnormprob, or the Normal probability calculator applet). (Hint: What are you going to use as the "event" of interest?)

     

  • Summary

  • There are, of course, many other probability models we could look into. One limitation of the exponential distribution is that it assumes the same value for the mean and the standard deviation, which is clearly not the case for these data. There are other, more flexible distributions (e.g., Gamma and Weibull) that use two parameters to characterize the distribution rather than only one.
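
    As a rough illustration of the extra flexibility, the sketch below compares the sample mean and standard deviation and then matches a Gamma distribution to both. The `times` vector is made up (not the honking data), and the method-of-moments estimates used here are just one simple fitting choice among several.

```r
# Made-up response times in seconds (hypothetical, not from honking.txt)
times <- c(1.2, 1.9, 2.4, 2.8, 3.1, 3.6, 4.0, 4.7, 5.5, 6.3, 8.1, 12.4)

# An exponential model would force these two values to be equal
c(mean = mean(times), sd = sd(times))

# Method-of-moments Gamma estimates: two parameters, so both moments can be matched
shape_hat <- mean(times)^2 / var(times)
rate_hat <- mean(times) / var(times)

gamma_mean <- shape_hat / rate_hat       # mean of the fitted Gamma
gamma_sd <- sqrt(shape_hat) / rate_hat   # standard deviation of the fitted Gamma
c(gamma_mean, gamma_sd)
```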
