Quiz

Background

Diekmann, Krassnig, and Lorenz (1996) conducted a field study to explore whether driver characteristics are related to an aggressive response. (Thanks to Jeff Sklar for pointing us to this article.) The study was conducted at a busy intersection in Munich, West Germany, on two afternoons (Sunday and Monday) in 1986. The experimenters sat in a Volkswagen Jetta (the “blocking car”), did not accelerate after the traffic light turned green, and timed how long it took the driver of the blocked car to react (either by honking or flashing headlights). The response time (in seconds) is our variable of interest. Some values were “censored” in that the researcher stopped timing before the driver actually honked. This can happen if there is a time limit on the observation period and “success” has not been observed within that period.

Research questions: Suppose we wanted to answer questions like "what is a typical wait time?" or "how likely is it that someone would honk within 2 seconds?"

Goals: In this investigation, you will

Continue to explore numerical and graphical summaries for a quantitative variable, including "modified" boxplots
Explore data transformations and non-normal probability distribution models

(a) How long do you think you would wait before you honked? (You should each answer.)
Student 1: ____ seconds
Student 2: ____ seconds

Describing the distribution
The data can be found in honking.txt. Load these data into R/RStudio or JMP. (This dataset is small enough that you can copy and paste it from the webpage; just watch for extra lines at the bottom of the file.)

R reminder: honking <- read.table("clipboard", header=TRUE) (on macOS, use pipe("pbpaste") in place of "clipboard")
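If you would rather read the data from a saved file than from the clipboard, a minimal sketch follows. The three values and the temporary file are invented stand-ins for honking.txt (only the column name responsetime comes from the commands used later in this investigation), so substitute the path to your own copy.

```r
# Hypothetical stand-in for honking.txt: a header line, a few made-up values,
# and a stray blank line at the bottom like one you might pick up when copying.
tmp <- tempfile(fileext = ".txt")
writeLines(c("responsetime", "1.4", "2.5", "3.8", ""), tmp)

# read.table skips blank lines by default, so the stray line is ignored
honking <- read.table(tmp, header = TRUE)

# Quick checks that the import worked as intended
str(honking)                 # structure: column names and types
nrow(honking)                # should equal the number of drivers in the file
anyNA(honking$responsetime)  # TRUE would signal a problem row
```

If the row count is larger than you expect, or the column is not numeric, look for extra text at the top or bottom of what you pasted.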
(b) Create a histogram of the data. Summarize the behavior of the distribution. (Hints: You aren't uploading your figure, so describe it in enough detail that I have a pretty good idea of what you are looking at. Also remember to put your comments into context!)
Overlay a normal distribution on your histogram and also create a normal probability plot (normal quantile plot).
(c) Do these data behave like a normal distribution? If not, how do they deviate from normality? Does the shape make sense in this context? Explain your reasoning.
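One possible way to produce the histogram with a normal overlay and the normal probability plot in base R is sketched here. The twelve response times are made up for illustration (they are not the honking data), so replace responsetime with honking$responsetime once your data are loaded.

```r
# Hypothetical response times in seconds (NOT the real honking data)
responsetime <- c(1.4, 1.8, 2.1, 2.5, 2.9, 3.3, 3.8, 4.4, 5.2, 6.3, 8.1, 12.6)

# Histogram on the density scale so a density curve can be overlaid
hist(responsetime, freq = FALSE,
     main = "Response times", xlab = "time until reaction (seconds)")

# Normal curve with the same mean and standard deviation as the data
curve(dnorm(x, mean = mean(responsetime), sd = sd(responsetime)), add = TRUE)

# Normal probability plot: points close to the line suggest normality
qqnorm(responsetime)
qqline(responsetime)
```

When reading the probability plot, systematic curvature away from the line is more informative than one or two stray points.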

A popular numerical summary of the distribution of a quantitative variable is the five-number summary:

min	smallest value in data set
lower quartile (Q1)	value that has 25% of the observations below it	25th percentile
median (Q2)	value that has 50% of the observations below it	50th percentile
upper quartile (Q3)	value that has 75% of observations below it	75th percentile
max	largest value in the data set

With skewed data, rather than using the standard deviation as a measure of spread, we might prefer the interquartile range or IQR. The interquartile range is the upper quartile (aka Q3) minus the lower quartile (aka Q1) and measures the width of the middle 50% of the distribution.
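A short sketch of how these summaries can be computed in R, using made-up response times rather than the honking data. Note that software packages use slightly different rules for locating quartiles, so fivenum() and quantile() need not agree exactly.

```r
# Hypothetical response times in seconds (NOT the real honking data)
responsetime <- c(1.4, 1.8, 2.1, 2.5, 2.9, 3.3, 3.8, 4.4, 5.2, 6.3, 8.1, 12.6)

# Five-number summary (Tukey's version: min, lower hinge, median, upper hinge, max)
fivenum(responsetime)

# Quartiles using R's default interpolation rule
quantile(responsetime, c(0.25, 0.50, 0.75))

# IQR() is the upper quartile minus the lower quartile from quantile()
iqr <- IQR(responsetime)
q <- quantile(responsetime, c(0.25, 0.75), names = FALSE)
iqr_by_hand <- q[2] - q[1]

# Center and spread side by side
c(mean = mean(responsetime), sd = sd(responsetime),
  median = median(responsetime), IQR = iqr)
```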

(d) Find the following descriptive statistics. Because the measurement units (seconds) are the same for each value, better tables will put that information in the column header.

Rows	Waiting times (seconds)
Mean
Standard deviation
Median
Interquartile range (IQR)

Definition: Another graph, called a boxplot (invented by John Tukey in 1970), is based on the five-number summary. The box extends from the lower quartile to the upper quartile, with a line inside the box at the location of the median (a vertical line when the boxplot is drawn horizontally). Whiskers then typically extend to the min and max values.

Create a boxplot

In R: boxplot(honking$responsetime, ylab="time until reaction", horizontal=TRUE)
- or iscamboxplot(honking$responsetime)
In JMP: In the Distributions output window, use the hot spot to select Outlier Boxplot.

By default, these are "modified boxplots" which means they denote observations that are "outliers" according to the "1.5IQR criterion."

Definition: A value is an outlier according to the 1.5IQR criterion if the value is larger than the upper quartile + 1.5 × box length or smaller than the lower quartile – 1.5 × box length. Note: The box length = upper quartile – lower quartile, is the interquartile range. A modified boxplot will display such outliers separately and then extend the whiskers to the most extreme non-outlier observation.
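The criterion can be applied by hand as a check on what the boxplot displays. This sketch uses made-up response times (not the honking data) and R's default quantile rule. The boxplot() function locates the quartiles with a slightly different rule (hinges), so the fences it uses may differ a little from these.

```r
# Hypothetical response times in seconds (NOT the real honking data)
responsetime <- c(1.4, 1.8, 2.1, 2.5, 2.9, 3.3, 3.8, 4.4, 5.2, 6.3, 8.1, 12.6)

q1 <- quantile(responsetime, 0.25, names = FALSE)
q3 <- quantile(responsetime, 0.75, names = FALSE)
box_length <- q3 - q1                  # the interquartile range

lower_fence <- q1 - 1.5 * box_length
upper_fence <- q3 + 1.5 * box_length

# Values beyond the fences are flagged as outliers
low_outliers  <- responsetime[responsetime < lower_fence]
high_outliers <- responsetime[responsetime > upper_fence]

# Whiskers stop at the most extreme observations that are not outliers
whisker_ends <- range(responsetime[responsetime >= lower_fence &
                                   responsetime <= upper_fence])

# Compare with the values that boxplot() would draw as separate points
boxplot.stats(responsetime)$out
```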

(f) Did the boxplot display any outliers? Low outliers or High outliers?

Modelling non-normal data
These response times are not well modelled by a normal distribution. So can we still make predictions? There are a couple of strategies. One approach is to try a different mathematical model.

Overlay an exponential probability model (often used to model wait times) on these data and/or create a probability plot using the exponential distribution as the reference distribution.
- In R: (for the qqplot, we first have to get the theoretical quantiles)
  > theoquant = qexp(ppoints(12)) # Theoretical exponential quantiles at 12 evenly spaced probabilities
  > hist(theoquant)
  > qqplot(honking$responsetime, theoquant) # The quantiles vs. your data. Look for a line.
  > iscamaddexp(honking$responsetime) #overlay exponential model
- In JMP: In the Distribution window, use the hot spot to select Continuous Fit > Exponential.
Include a copy of your probability plot.
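If the iscam functions are not available, the same two displays can be approximated in base R. This is only a sketch: the response times are made up (not the honking data), and matching the exponential rate to 1/mean of the data is an assumption made here for illustration.

```r
# Hypothetical response times in seconds (NOT the real honking data)
responsetime <- c(1.4, 1.8, 2.1, 2.5, 2.9, 3.3, 3.8, 4.4, 5.2, 6.3, 8.1, 12.6)

n <- length(responsetime)
rate_hat <- 1 / mean(responsetime)   # assumed: exponential with the same mean as the data

# One theoretical exponential quantile per observation
theoquant <- qexp(ppoints(n), rate = rate_hat)

# Exponential probability plot: look for points close to the y = x line
qqplot(theoquant, responsetime,
       xlab = "exponential quantiles", ylab = "observed response time (seconds)")
abline(0, 1)

# Histogram with the exponential density overlaid
hist(responsetime, freq = FALSE, xlab = "time until reaction (seconds)")
curve(dexp(x, rate = rate_hat), add = TRUE)
```

Because the quantiles here use the fitted rate, both axes are in seconds and the reference line is y = x. With the default rate of 1, as in qexp(ppoints(12)), you would still look for a straight line, but its slope would no longer be 1.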

(g) Describe the behavior of the exponential probability model. Does it appear to be a reasonable fit for these data? Describe any deviations.
The exponential distribution is another continuous probability distribution, so we would determine the probability of someone waiting less than 2 seconds before honking by finding the area under the curve to the left of 2.

Use technology to calculate the probability of a wait time under 2 seconds using the exponential distribution with mean 4.25 sec:
- In R: > pexp(2, rate = 1/4.25) # P(X < 2) when X follows an Exp(mean = 4.25) distribution.
- In JMP: Use the Distribution Calculator and select the Exponential distribution. Specify a scale parameter of 1/4.25 = 0.235. Choose X <= Qa and enter 2 for Qa (we don’t need to worry about strict vs. non-strict inequalities here).
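For an exponential distribution, the area to the left of x has the closed form 1 − exp(−x/mean), so you can check the software against a direct calculation. A small sketch using the mean of 4.25 seconds given above (the variable names are just for illustration):

```r
mean_wait <- 4.25   # mean of the exponential model, in seconds
cutoff    <- 2      # we want P(X < 2)

# Area under the exponential curve to the left of the cutoff
p_software <- pexp(cutoff, rate = 1 / mean_wait)

# The same area from the closed-form expression
p_formula <- 1 - exp(-cutoff / mean_wait)

all.equal(p_software, p_formula)   # the two calculations should agree
```

Note that R parameterizes the exponential distribution by its rate, which is the reciprocal of the mean.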
(h) What is the estimated probability?
Repeat by comparing the data with a “lognormal” probability model and using the lognormal probability model to estimate the probability of someone waiting less than 2 seconds:
- In R:
  > qqplot(honking$responsetime, qlnorm(ppoints(12)))
  > iscamaddlnorm(honking$responsetime)
  > plnorm(2, meanlog = 1.292, sdlog = 0.5238)
- In JMP: Scroll down the Distribution list and choose the Lognormal distribution with location = 1.292 and scale = 0.5238.
(i) What is the estimated probability?

Data transformations

Another approach would be to consider whether a rescaling or transformation of the data might create a more normal-looking distribution, allowing us to use the very friendly normal distribution to estimate probabilities. In this case, we need a transformation that will downsize the large values more than the small values. Log transformations are often very helpful in this regard.

Definition: A data transformation applies a mathematical function to each value to re-express the data on an alternative scale. For example, the Richter scale is a logarithmic scale: a one-unit increase in magnitude means the amplitude of the seismic waves is 10 times greater.

Create a new variable which is log(responsetime). (You can use either natural log or log base 10, but so we all do the same thing, let’s use natural log here, which is the default in most software when you say “log.”)
• In R: > lnresponsetime = log(honking$responsetime)
• In JMP: Create a new column (e.g., double click on next column over) and then open the formula editor for that column (e.g., Cols > Formula). Type or use your mouse to select Transcendental > Log to create Log(responsetime). Press OK.

Create a histogram of this new variable and a normal probability plot. This time, I do want you to upload your output.
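One way to create these displays for the logged variable in base R is sketched here. The response times are made up (not the honking data), so your plots and summary values will differ.

```r
# Hypothetical response times in seconds (NOT the real honking data)
responsetime <- c(1.4, 1.8, 2.1, 2.5, 2.9, 3.3, 3.8, 4.4, 5.2, 6.3, 8.1, 12.6)

# log() is the natural log in R; it shrinks large values more than small ones
lnresponsetime <- log(responsetime)

# Histogram of the logged values with a matching normal curve
hist(lnresponsetime, freq = FALSE, xlab = "ln(response time)")
curve(dnorm(x, mean = mean(lnresponsetime), sd = sd(lnresponsetime)), add = TRUE)

# Normal probability plot of the logged values
qqnorm(lnresponsetime)
qqline(lnresponsetime)

# Mean and standard deviation on the log scale
c(mean = mean(lnresponsetime), sd = sd(lnresponsetime))
```

Remember that the mean and standard deviation reported here are in ln-seconds, not seconds.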

Upload the histogram and the normal probability plot.

(j) Does log(responsetime) approximately follow a normal distribution? What are the mean and standard deviation of this distribution?
Use a normal distribution with mean 1.29 ln-seconds and standard deviation 0.53 ln-seconds for the logged response times to estimate how often someone will honk within the first 2 seconds (e.g., distribution calculator, pnorm or iscamnormprob, Normal probability calculator applet). (Hint: What are you going to use as the "event" of interest?)
(k) Report your probability. (Hint: It should be in the ballpark of the other values and in fact, should even match one of them...)
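The reason two of the approaches are so closely related is that X follows a lognormal distribution exactly when ln(X) follows a normal distribution, so P(X < c) = P(ln X < ln c). The sketch below checks this identity with arbitrary illustrative parameter values, deliberately not the ones from this investigation.

```r
# Arbitrary illustrative values (NOT the estimates for the honking data)
mu     <- 0.8   # mean on the log scale
sigma  <- 0.6   # standard deviation on the log scale
cutoff <- 3     # a cutoff on the original scale

# Lognormal model applied to the original scale
p_lognormal <- plnorm(cutoff, meanlog = mu, sdlog = sigma)

# Normal model applied to the log scale: the event becomes ln(X) < ln(cutoff)
p_normal_on_log <- pnorm(log(cutoff), mean = mu, sd = sigma)

all.equal(p_lognormal, p_normal_on_log)   # the two calculations should agree
```

Any small difference you see in your own answers comes from rounding the parameter values, not from the models themselves.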
(l) Which model (the exponential, the lognormal, or the normal with log(response)) would you consider most valid for these data? Explain how you are deciding.

Summary
There are, of course, many other probability models we could look into. One limitation of the exponential distribution is that its mean and standard deviation are always equal, which is clearly not the case for these data. There are other more flexible distributions (e.g., Gamma and Weibull) that use two parameters to characterize the distribution rather than only one.
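To see what the second parameter buys you, here is a rough sketch that matches a Gamma distribution to both the mean and the variance of a sample. The response times are made up (not the honking data), and the method-of-moments matching used here is a simple choice for illustration; software such as JMP typically fits these models by maximum likelihood instead.

```r
# Hypothetical response times in seconds (NOT the real honking data)
responsetime <- c(1.4, 1.8, 2.1, 2.5, 2.9, 3.3, 3.8, 4.4, 5.2, 6.3, 8.1, 12.6)

m <- mean(responsetime)
v <- var(responsetime)

# An exponential model would force these two values to be equal
c(mean = m, sd = sqrt(v))

# Gamma has mean = shape/rate and variance = shape/rate^2,
# so matching both moments gives:
shape_hat <- m^2 / v
rate_hat  <- m / v

# Overlay the one-parameter and two-parameter models on the histogram
hist(responsetime, freq = FALSE, xlab = "time until reaction (seconds)")
curve(dexp(x, rate = 1 / m), add = TRUE, lty = 2)                   # exponential
curve(dgamma(x, shape = shape_hat, rate = rate_hat), add = TRUE)   # gamma
```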