• Background

    Researchers are often interested in analyzing different aspects of text. For example, inaugural speeches of presidents have been analyzed to compare word frequency, sentence length, style, and tone. Statistical analysis has been used to decide authorship of The Federalist Papers and to decide whether other playwrights were responsible for plays attributed to Shakespeare.

    Suppose we didn't want to examine every word in a passage in order to gather data on its characteristics, but instead only had time to look at a subset of the words. On the next page, you will be shown a passage of 268 words. Your task will be to select "10 representative words" from this passage for further analysis. Press Next to begin.

    Alternative: Word sampling app

     

  • Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal.

    Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war.

    We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

    But, in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead, who struggled here have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember, what we say here, but it can never forget what they did here. 

    It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us, that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion, that we here highly resolve that these dead shall not have died in vain, that this nation, under God, shall have a new birth of freedom, and that government of the people, by the people, for the people, shall not perish from the earth.

  • Definitions: The collection of all 268 words in the Gettysburg Address is our population of interest, and the ten words you have selected are the sample.  We want the sample to be representative of the population, that is, to have the same characteristics as the population.
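
    To make the idea of a population of word lengths concrete, here is a minimal sketch that turns a passage into a list of word lengths and computes their mean. It uses only the first sentence of the address as a stand-in population, and the helper name `word_lengths` and the rule for what counts as a word are assumptions for illustration. The count of 268 words and the mean quoted in this activity depend on how the full text is split into words.

```python
import re

# Assumption: only the first sentence of the address is used here, as a small
# stand-in for the full 268-word population described in the text.
EXCERPT = (
    "Four score and seven years ago, our fathers brought forth upon this "
    "continent a new nation: conceived in liberty, and dedicated to the "
    "proposition that all men are created equal."
)


def word_lengths(text):
    """Return the number of letters in each word of text (hypothetical helper)."""
    # Assumption: a word is a maximal run of ASCII letters; punctuation is dropped.
    return [len(word) for word in re.findall(r"[A-Za-z]+", text)]


lengths = word_lengths(EXCERPT)
population_size = len(lengths)
population_mean = sum(lengths) / population_size  # the parameter for this stand-in
print(population_size, round(population_mean, 2))
```

    Any sample of ten words would then be ten entries from `lengths`, and the sample's average length would be compared with `population_mean`.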

  • To determine whether the sampling method we asked you to use here tends to produce representative samples of the population, we can take many samples from the population using the same method and then plot the sample means of word length from these many samples. Below are some results we found in a previous class:

    Dotplot of average length of words in different samples

  • Although we don't expect the sample means to always equal the population mean, we will consider a sampling method to be unbiased if the distribution of the statistic (the sample means here) is centered around the population mean.  The population mean of all 268 words in the Gettysburg Address is 4.29 letters.  Because most of the samples produced an average that was larger than 4.29, this sampling method is biased towards longer words.
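
    One way to see what "biased towards longer words" means is to simulate a selection rule that favors long words. The sketch below is not the class data shown in the dotplot. It assumes, purely for illustration, that a word's chance of being picked is proportional to its length, and it uses the 30 word lengths from the first sentence of the address as a stand-in population. The function name `length_weighted_sample_mean` is hypothetical.

```python
import random

# Stand-in population (assumption): letter counts of the 30 words in the
# first sentence of the address, not the full 268-word population.
POPULATION = [4, 5, 3, 5, 5, 3, 3, 7, 7, 5, 4, 4, 9, 1, 3, 6, 9, 2, 7, 3,
              9, 2, 3, 11, 4, 3, 3, 3, 7, 5]


def length_weighted_sample_mean(population, n, rng):
    """Mean of n draws in which longer words are more likely to be chosen.

    Assumed model of judgment sampling: selection probability is proportional
    to word length, and draws are made with replacement.
    """
    sample = rng.choices(population, weights=population, k=n)
    return sum(sample) / n


rng = random.Random(2024)  # fixed seed so the run is repeatable
population_mean = sum(POPULATION) / len(POPULATION)
biased_means = [length_weighted_sample_mean(POPULATION, 10, rng)
                for _ in range(1000)]
center_of_biased_means = sum(biased_means) / len(biased_means)
share_above = sum(m > population_mean for m in biased_means) / len(biased_means)

print("population mean:", round(population_mean, 2))
print("center of biased sample means:", round(center_of_biased_means, 2))
print("share of sample means above the population mean:", round(share_above, 2))
```

    Under this assumed rule, the sample means pile up above the population mean rather than around it, which is the pattern the definition of bias describes.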

  • For the case of the Gettysburg Address, we do have access to the entire population.  The lengths of words have been entered in the applet below.  As indicated earlier, the mean of this population of all 268 words is 4.29 letters.

  • Let's use the computer to take a random sample of n = 5 words from this population.

    • Check the Show Sampling Options box.
    • Keep the Number of Samples set to 1.
    • Change the Sample Size from 10 to 5.
    • Press the Draw Samples button.
    • Press the Draw Samples button nine more times for a total of 10 samples.

    The 10 sample means will appear in the Sampled Statistics graph on the right. The most recent sample and its mean appear in blue.
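
    The applet steps above can be imitated in a few lines of code. This is a rough sketch, not the applet's actual implementation: it assumes sampling without replacement, and it uses the 30 word lengths from the first sentence of the address as a stand-in for the full population. The function name `draw_sample_mean` is hypothetical.

```python
import random

# Stand-in population (assumption): letter counts of the 30 words in the
# first sentence of the address, not the full 268-word population.
POPULATION = [4, 5, 3, 5, 5, 3, 3, 7, 7, 5, 4, 4, 9, 1, 3, 6, 9, 2, 7, 3,
              9, 2, 3, 11, 4, 3, 3, 3, 7, 5]


def draw_sample_mean(population, n, rng):
    """Draw one simple random sample of size n and return its mean."""
    sample = rng.sample(population, n)  # every word equally likely, no repeats
    return sum(sample) / n


rng = random.Random(1)  # fixed seed so the run is repeatable
sample_means = [draw_sample_mean(POPULATION, 5, rng) for _ in range(10)]
print(sample_means)
```

    Each call plays the role of one press of the Draw Samples button with the Sample Size set to 5.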

  • We expect sample-to-sample variation in these sample means because each sample will contain different words.  But if we take many samples, we may start to see a pattern in the distribution of sample means.

    • In the applet, change the Number of Samples from 1 to 100.
    • Press Draw Samples.

    You should start to see a pattern to the distribution of sample means... 

    • Press Draw Samples 9 more times, for a total of about 1,000 sample means. The distribution should begin to settle down to a consistent pattern.
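
    The same experiment can be sketched in code to check where the distribution of sample means is centered. As before, this is an illustration under stated assumptions: the population is the 30 word lengths from the first sentence of the address rather than all 268 words, samples are drawn without replacement, and the function name `simulate_sample_means` is hypothetical.

```python
import random

# Stand-in population (assumption): letter counts of the 30 words in the
# first sentence of the address, not the full 268-word population.
POPULATION = [4, 5, 3, 5, 5, 3, 3, 7, 7, 5, 4, 4, 9, 1, 3, 6, 9, 2, 7, 3,
              9, 2, 3, 11, 4, 3, 3, 3, 7, 5]


def simulate_sample_means(population, n, num_samples, rng):
    """Return the means of num_samples simple random samples of size n."""
    return [sum(rng.sample(population, n)) / n for _ in range(num_samples)]


rng = random.Random(42)  # fixed seed so the run is repeatable
population_mean = sum(POPULATION) / len(POPULATION)
means = simulate_sample_means(POPULATION, n=5, num_samples=1000, rng=rng)
center = sum(means) / len(means)

print("population mean:", round(population_mean, 2))
print("mean of 1,000 sample means:", round(center, 2))
```

    Individual sample means vary a good deal, but their average should land close to the population mean, which is what it means for the method to be unbiased.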
  • Definitions:

    • A number used to summarize a variable in the population is called a parameter.  Parameters are often denoted by Greek letters such as μ ("mu").
    • A number used to summarize a variable in a sample is called a statistic. We can use "x-bar" to refer to a sample mean.

    Key Idea: You have shown that selecting simple random samples (every word is equally likely to be selected) from the population is an unbiased sampling method because the distribution of the statistic is centered around the value of the parameter.

    Notice that selecting simple random samples of size n = 5 was unbiased, unlike our original haphazard sampling method, even though that method used a larger sample size of n = 10.

  • Return to the applet. This time, generate 1,000 samples of size n = 10.

  • Distribution of Sample Means

    The sample mean is often a good choice of statistic because, in repeated random samples from the same population, we can expect:

    • The mean of the distribution of sample means to be around the population mean
    • The standard deviation of the sample means to be around (population SD/sqrt(n))
      • For example, when the sample size was 5, you should have found a standard deviation similar to 2.119/sqrt(5) = 0.95 and then for a sample size of 10, this should have decreased to roughly 2.119/sqrt(10) = 0.67.

    Figure: sampling distribution for samples of size 5, mean of 100 samples near 4.3, SD of 100 samples near 0.95

    Figure: sampling distribution for samples of size 10, mean near 4.3, standard deviation near 0.67

    This gives us a good amount of predictability for the value of a sample mean when we have selected a simple random sample.
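
    The two predicted standard deviations quoted above follow directly from the formula (population SD/sqrt(n)). A short check, using the population SD of 2.119 given in the text (the function name `predicted_sd_of_sample_means` is a hypothetical label):

```python
import math

POPULATION_SD = 2.119  # population SD of word lengths, as quoted in the text


def predicted_sd_of_sample_means(population_sd, n):
    """Predicted SD of the sample means for simple random samples of size n."""
    return population_sd / math.sqrt(n)


predictions = {n: predicted_sd_of_sample_means(POPULATION_SD, n) for n in (5, 10)}
for n, sd in predictions.items():
    print(f"n = {n}: predicted SD of sample means = {sd:.2f}")
```

    Doubling the sample size from 5 to 10 does not halve the standard deviation. It divides it by sqrt(2), or about 1.41.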

  • Let's explore one more characteristic of the population and how that impacts the distribution of sample means.

    On the left side of the applet, set the Change population size button to x100. This will put 100 copies of the Gettysburg Address into the population.  You should see the population size change, but not the Mean or SD values.

  • Please submit a conjecture before continuing

  • Summary

    In the future, when you have a sample from a population and don't have access to the entire population, you should have faith that the sample is representative of the population if every member of the population was equally likely to be selected.  Other probability sampling methods are also expected to be unbiased.  In general, convenience sampling methods and using human judgement to select individuals are prone to sampling bias.

    Once you believe the sample is representative, probability rules will also allow us to estimate the amount of sample-to-sample variation.  For example, the distribution of sample means should have a standard deviation similar to the population standard deviation divided by the square root of the sample size (population SD/sqrt(n)).  There are two important properties of this formula:

    • The sample-to-sample variation decreases with the sample size.
    • The sample-to-sample variation does not depend on the size of the population.
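
    Both properties can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not the applet's code: the base population is a hypothetical one of 270 values, built by repeating the 30 word lengths from the first sentence of the address nine times so that its size is comparable to the 268 words in the activity. The x100 population contains 100 copies of that base, and samples are drawn without replacement.

```python
import math
import random
from statistics import pstdev

# Assumption: letter counts of the 30 words in the first sentence of the address.
BASE_LENGTHS = [4, 5, 3, 5, 5, 3, 3, 7, 7, 5, 4, 4, 9, 1, 3, 6, 9, 2, 7, 3,
                9, 2, 3, 11, 4, 3, 3, 3, 7, 5]

# Hypothetical populations: 270 values, and 100 copies of them (27,000 values).
population_x1 = BASE_LENGTHS * 9
population_x100 = population_x1 * 100


def sd_of_sample_means(population, n, num_samples, rng):
    """Simulate num_samples simple random samples of size n; return the SD of their means."""
    means = [sum(rng.sample(population, n)) / n for _ in range(num_samples)]
    return pstdev(means)


rng = random.Random(7)  # fixed seed so the run is repeatable
sigma = pstdev(population_x1)  # copying the population leaves its SD unchanged
results = {}
for label, population in (("x1", population_x1), ("x100", population_x100)):
    for n in (5, 10):
        results[(label, n)] = sd_of_sample_means(population, n, 2000, rng)
        print(label, "n =", n,
              "simulated SD:", round(results[(label, n)], 2),
              "predicted SD:", round(sigma / math.sqrt(n), 2))
```

    For a given sample size, the simulated standard deviations for the two population sizes should be close to each other and to the prediction, while moving from n = 5 to n = 10 should visibly shrink them.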

