What are Statistics? | Probability | Methods of Collecting, Representing, & Displaying Data | Data Calculations

Statistics involves the gathering and analysis of numerical data. Statistics are used to predict probability, which is the chance, or the odds, of an event happening.

What are Statistics?

Statistics involves the gathering and assessment of data from a particular source. Most often this involves numerical data, such as the population of a country, state or city, business profits and losses, inventories and the like, sports information such as batting averages, earned run averages, and so on, or the results of surveys and polls. 

Probability
CA GR5 SDAP 1.3

Statistics are used to predict probability. Probability is the likelihood of an event happening, or the odds that an event will occur. Examples are the prediction of election results, future performance of the stock market, the winners of sports events, and so on. 

Example 1

You are playing a game with a friend who likes to play tricks on you. You notice that in the game, which involves rolling a six-sided die, he seems to be winning way too often. You take the die and gather some data about what results it gives when tested [this is gathering statistics] to make a decision about whether it is fair and should be expected to have a one-sixth chance for each result [this is predicting probability]. Roll the die 300 times and record the results. 

If the die were fair, each possible result should come up about one-sixth of 300, or 50, times. In your experiment, it turns out that the result of five on the die shows up nearly 150 times while the other five results show up only around thirty times each. You predict that the die will roll a five about 50% of the time (calculated from 150/300) and will roll each other result about 10% of the time (from 30/300). 

Your friend is definitely playing a trick on you by using an unfair die!
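
The arithmetic in Example 1 can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the counts are the rounded figures quoted in the example (150 fives, 30 of everything else).

```python
# Rounded counts from Example 1 (illustrative names, not part of the lesson).
total_rolls = 300
observed = {1: 30, 2: 30, 3: 30, 4: 30, 5: 150, 6: 30}

# A fair die should show each face about one-sixth of the time.
expected_per_face = total_rolls / 6

# Estimated probability of each face = observed count / total rolls.
estimated = {face: count / total_rolls for face, count in observed.items()}

print("Expected count per face if the die is fair:", expected_per_face)
for face, p in estimated.items():
    print(f"Face {face}: about {p:.0%} of rolls")
```

Comparing the expected count with the observed counts is what lets you say the five comes up far too often.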


Methods of Collecting, Representing, & Displaying Data
CA GR5 SDAP 1.3

Usually, the statistics being collected are a little more complicated than those in the example above, but even in that simple example, you have to decide how to record the results of rolling the die 300 times. You certainly wouldn't try to remember how often each result came up! The data is usually kept track of in tables, charts or graphs. 

In the first round of data collection, the information is recorded in a "raw" fashion - each time the die is rolled, a tally mark is made to show whether the result was a one, two, three, four, five, or six. For example, after the first 20 rolls of the die, the table might read this way:

Results  
1
2
3
4
5
6

After all 300 rolls are completed and the 300 tally marks made, the totals for each result are added up and a data table is created. The following table shows one possible result of testing your friend's unfair six-sided die. 

Results Frequency
1 28
2 29
3 32
4 33
5 147
6 31
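
If you don't have a trick die handy, the tallying step can be tried out on a computer. The sketch below is an assumption-heavy illustration: it simulates a loaded die with made-up weights (they are not measurements of the die in the example) and uses Python's `Counter` as the tally sheet. The simulated totals will not match the table above exactly.

```python
import random
from collections import Counter

# Assumption: a loaded die that favors five about half the time.
faces = [1, 2, 3, 4, 5, 6]
weights = [1, 1, 1, 1, 5, 1]  # hypothetical weights, not measured values

rng = random.Random(0)  # fixed seed so the run can be repeated
rolls = rng.choices(faces, weights=weights, k=300)

# Counter plays the role of the tally sheet: one mark per roll.
frequency = Counter(rolls)

print("Results Frequency")
for face in faces:
    print(face, frequency[face])
```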

This kind of table is called a frequency table because it shows how often each type of possible result happened. The problem with just plain frequencies is that it is hard to judge the likelihood, or probability, of results. Does the one occurring 28 times indicate that one is very likely, unlikely, or normally likely? You can't tell until you know that there are a total of 300 rolls; then you can make a fraction of 28 over 300 and change it to a decimal, giving the portion of ones rolled during your data collection: 28/300 = 0.09333, or 9.333%. If you change each entry this way, by dividing by the total number of entries, the table is called a probability table. For the data above, it would look like this (rounded to the nearest tenth of a percent):

Results Probability
1 9.3%
2 9.7%
3 10.7%
4 11.0%
5 49.0%
6 10.3%
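
The conversion from the frequency table to the probability table is the same division repeated six times, so it is easy to hand to a computer. Here is a minimal sketch using the frequencies from the table above; the names `frequency` and `probability` are just illustrative.

```python
# Frequencies from the table of 300 rolls.
frequency = {1: 28, 2: 29, 3: 32, 4: 33, 5: 147, 6: 31}

# The total number of rolls is the sum of all the frequencies.
total = sum(frequency.values())

# Divide each frequency by the total to get its share of the rolls.
probability = {face: count / total for face, count in frequency.items()}

print("Results Probability")
for face, p in probability.items():
    # Multiply by 100 and round to the nearest tenth of a percent.
    print(f"{face} {p * 100:.1f}%")
```

The printed percentages should agree with the probability table above.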

Sometimes, tables of data don't make your point strongly enough, and it is necessary to create a more obvious way to show the results. Clearly, in this example, the result of five is much more common than any other. However, imagine trying to show this result to a much younger relative who can understand your reasoning but hasn't yet learned how to read. A graphical approach is much more useful.

This kind of chart is called a bar chart, for the obvious reason that the results are shown by the heights of the bars. Another obvious type of chart is the pie chart, where a whole circle, the pie, is sliced into wedges that are thick or thin to indicate how the results came out. This same data comes out as this pie chart:

The dominance of the five seems even more obvious in this pie chart than it was in the bar chart, and it is certainly true that the nearness to one-half is more visually obvious. 

There are multiple ways to show the results of data collection. You want to pick the way that best supports the argument you are making or opinion that you are supporting. For example, if you want to convince someone that your friend's die is unfair, then the pie chart is probably the most vivid and understandable evidence.
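
Even without drawing tools, you can get a rough bar chart out of a few lines of Python. The sketch below draws sideways bars out of `#` characters for the die-roll frequencies. The helper name `bar` and the scale of one mark per five rolls are arbitrary choices made for this illustration.

```python
# Frequencies from the table of 300 rolls.
frequency = {1: 28, 2: 29, 3: 32, 4: 33, 5: 147, 6: 31}

def bar(count, per_mark=5):
    """Return a row of '#' marks, one mark for roughly every five rolls."""
    return "#" * round(count / per_mark)

# One sideways bar per result; a longer bar means the result came up more often.
for face, count in frequency.items():
    print(f"{face} | {bar(count)} {count}")
```

The bar for five comes out several times longer than any other, which is the same message the bar chart and pie chart deliver.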

Example 2

Let us imagine that we have communicated with a dozen friends from different places and we have all measured the temperature, in degrees Fahrenheit, at noon, in the shade, on September 1st. We'll identify the dozen places using the first dozen letters of the alphabet (A through L). Suppose the data collected is 54, 62, 69, 77, 83, 89, 78, 62, 44, 39, 32, and 40. There are many ways to display this data. Since the data comes from 12 different sources, and they don't have a particular order associated with them, we simply draw a bar chart with a bar for each location. Because they are listed alphabetically, it's easy to check whether a given location is on the chart, and if it is, what the temperature was. The height of the bar simply indicates the data for that location. This is how it will look:

Another way to organize the same information would be to order the locations by temperature instead of by name. Which way you choose to show the data would depend on your audience - there's no "right" way, just "more appropriate" ways, according to your purpose. The new bar chart would look like this, making it easier to identify cold, hot, and moderate cities:
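
The two orderings can be sketched in Python as two different sorts of the same data. This sketch assumes the temperatures were listed in the same order as the locations A through L, which the example implies but does not state outright.

```python
import string

# Temperatures from Example 2, assumed to be listed in location order A..L.
temps = [54, 62, 69, 77, 83, 89, 78, 62, 44, 39, 32, 40]
locations = list(string.ascii_uppercase[:12])  # 'A' through 'L'
readings = dict(zip(locations, temps))

# Ordering 1: alphabetical by location, handy for looking up one place.
by_name = sorted(readings.items())

# Ordering 2: coldest to hottest, handy for comparing places.
by_temp = sorted(readings.items(), key=lambda item: item[1])

for location, temp in by_temp:
    print(location, temp)
```

Neither list contains more information than the other; they just answer different questions quickly.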

Yet another way to organize the information is into a table as numbers, instead of into a chart as pictures. One special way to do that is called the "stem-and-leaf" plot. That's a way to organize scores by the tens digit. Write the tens digit along the "stem" like this:

0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  

The "leaf" for each part of the stem is composed of the ones digit for each score in that range. So go through the data and enter the digits: 54 puts a 4 to the right of 5, 62 puts a 2 to the right of 6, 69 puts a 9 to the right of the 6-stem, after the 2-leaf you already put there, and so on... 

0  
1  
2  
3 9 2
4 4 0
5 4
6 2 9 2
7 7 8
8 3 9
9  
10  

Usually, the chart is done with the "leaf" entries in ascending order, and you can discard most of the unused stem sections, so fix it up like this:

2  
3 2 9
4 0 4
5 4
6 2 2 9
7 7 8
8 3 9
9  

The advantage of grouping data into a stem-and-leaf table is that you can get a good feel for the top and bottom scores (scores ranged from thirties to eighties), and you can literally see the most common range for the scores (a score in the sixties was most common). 
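
The stem-and-leaf steps described above (split each score into a tens digit and a ones digit, collect the ones digits beside their stem, then sort each row) can be sketched in Python as follows. The variable names are illustrative, and this version prints only the stems from the lowest to the highest one that has data.

```python
from collections import defaultdict

# Temperatures from Example 2.
temps = [54, 62, 69, 77, 83, 89, 78, 62, 44, 39, 32, 40]

# Split each score into a stem (tens digit) and a leaf (ones digit).
stems = defaultdict(list)
for t in temps:
    stem, leaf = divmod(t, 10)
    stems[stem].append(leaf)

# Print each stem with its leaves in ascending order.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in sorted(stems.get(stem, [])))
    print(stem, leaves)
```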

Another nice thing about a stem-and-leaf chart is that it can be quickly converted into a frequency table. On a stem-and-leaf chart, you can quickly count the number of occurrences in each 10-point interval. If you changed the chart to list the intervals on the left hand side, and the number of occurrences on the right hand side, you'd have a frequency table. I'll show you the previous example, side-by-side with the frequency table:

Stem-and-Leaf Chart

2  
3 2 9
4 0 4
5 4
6 2 2 9
7 7 8
8 3 9
9  

Frequency Table

Interval Frequency
19.5 - 29.5 0
29.5 - 39.5 2
39.5 - 49.5 2
49.5 - 59.5 1
59.5 - 69.5 3
69.5 - 79.5 2
79.5 - 89.5 2
89.5 - 99.5 0

Notice that the Frequency Table is less detailed because it has now grouped the data - for example, you can no longer learn the exact scores in the sixties, but you do know how many fell in that range. 
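
The grouped frequency table can also be built straight from the raw temperatures. The sketch below counts how many scores fall inside each 10-point interval, starting at 19.5; the names `edges` and `table` are made up for this illustration.

```python
# Temperatures from Example 2.
temps = [54, 62, 69, 77, 83, 89, 78, 62, 44, 39, 32, 40]

# Interval endpoints: 19.5, 29.5, ..., 99.5 (eight intervals of width 10).
start, width = 19.5, 10
edges = [start + width * i for i in range(9)]

# Count the scores that fall strictly between each pair of endpoints.
table = []
for low, high in zip(edges, edges[1:]):
    count = sum(1 for t in temps if low < t < high)
    table.append((low, high, count))

print("Interval Frequency")
for low, high, count in table:
    print(f"{low} - {high} {count}")
```

Once the counts are made, the individual scores are gone from the table, which is the loss of detail described above.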

It would be perfectly natural to wonder why all the numbers end in .5, because there is no mention of decimal places in the original data. Going one extra decimal place avoids vagueness! If you hadn't used any decimal places, your first two intervals would have to be 20-30 and 30-40. Where would you put a score of 30? Would you put it in the lower interval? The higher? Split it half and half? Randomly decide which? In order to avoid having to answer these questions, simply make your interval endpoints have one more decimal place of accuracy than the data you will be grouping into the intervals. 
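
The temperature data happens to contain a score of 40, which makes the problem easy to see. The sketch below checks which intervals could claim that score when the endpoints are whole numbers and when they end in .5. The helper `containing` is hypothetical, written just for this illustration.

```python
def containing(value, intervals):
    """Return every interval whose range includes the value, endpoints included."""
    return [(low, high) for low, high in intervals if low <= value <= high]

score = 40  # one of the temperatures in Example 2

whole_number_intervals = [(30, 40), (40, 50)]
half_point_intervals = [(29.5, 39.5), (39.5, 49.5)]

# With whole-number endpoints, 40 sits on the boundary of two intervals.
print(containing(score, whole_number_intervals))

# With endpoints ending in .5, exactly one interval can hold it.
print(containing(score, half_point_intervals))
```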

One more way of representing the data graphically is called a histogram, which is just a special kind of bar chart where the horizontal axis consists of a series of intervals that make up a smooth range of values. You need to decide where to start the range (here 19.5, because that is where your first interval begins) and what step size to use (here 10, which is the width of each of your intervals). Now you create a bar chart where the bars are drawn above each interval, 19.5 to 29.5, 29.5 to 39.5, and so on, until you get to the last of the data. In a histogram, the bars are drawn pressed up against each other, and the endpoints of the intervals are labeled instead of each bar. For the data on temperature we've been working on, the result looks like this:

You could pick different intervals and do a different histogram for the same data, so you'll always have to be thinking about the purpose and audience for the graphic. For example, you could use 15-point intervals starting at 29.5, the bottom of the first interval that holds any data, so the intervals would be 29.5 to 44.5, 44.5 to 59.5, 59.5 to 74.5, and 74.5 to 89.5. You don't need any more intervals because the data ranges from 32 to 89, so four intervals covering 29.5 to 89.5 are enough. Then you need to look at the raw data to figure out the frequency for each of these four new intervals. There are lots of other possible interval widths and starting points that you could use. 
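
Counting the raw data again for a new set of intervals is tedious by hand, so here is a sketch of a small helper that does it for any starting point and width. The function name `interval_counts` and its parameters are invented for this illustration. It is shown with the four 15-point intervals from 29.5 to 89.5, and it draws each count as a row of `#` marks for a rough sideways histogram.

```python
# Temperatures from Example 2.
temps = [54, 62, 69, 77, 83, 89, 78, 62, 44, 39, 32, 40]

def interval_counts(data, start, width, how_many):
    """Count the values that fall inside each of several equal-width intervals."""
    rows = []
    for i in range(how_many):
        low = start + i * width
        high = low + width
        rows.append((low, high, sum(1 for x in data if low < x < high)))
    return rows

# Four intervals of width 15 covering 29.5 to 89.5.
rows = interval_counts(temps, 29.5, 15, 4)

for low, high, count in rows:
    print(f"{low} - {high} {'#' * count} {count}")
```

Changing the `start` and `width` arguments gives the other groupings mentioned above, and each choice can make the same data look a little different.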

Example 3

Let us imagine that we have gone outside our homes on the first of each month starting in March and measured the temperature in degrees Fahrenheit. When we collect data from the real world, we try for consistency, so we measure it in the same place at the same time of day. Suppose the data collected is 54, 62, 69, 77, 83, 89, 78, 62, 44, 39, 32, and 40. Since the data is representing a measurement that evolves over time, a time series graph would be most appropriate. That's a graph where the horizontal axis is time, and the vertical axis is your measurement. Usually the measurements are indicated with dots and connected with line segments. Because the dots are connected, the graph seems to want to flow into the future and let you predict future temperatures. Along with weather, financial records are often presented this way, to help forecast how things will be in the near future so you can decide what to do. For the temperature data collected here, the future flow is predictable:

 

The data line will cycle back up and back down as you cycle through the seasons again and again. Go look up some financial records and their time-series graphs and see if you can predict the future. That's how people try to make money on the stock market!
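
A time series is just the measurements kept in time order, so it can be sketched in Python as a list of (month, temperature) pairs. The sketch below assumes the twelve readings run from March through the following February, as the example describes. It prints a rough sideways plot with one `*` per month and picks out the warmest and coldest readings.

```python
# Monthly readings from Example 3, assumed to run March through February.
months = ["Mar", "Apr", "May", "Jun", "Jul", "Aug",
          "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"]
temps = [54, 62, 69, 77, 83, 89, 78, 62, 44, 39, 32, 40]

# Keep the readings in time order: that ordering is what makes it a time series.
series = list(zip(months, temps))

# Rough sideways plot: the farther right the '*', the warmer the month.
for month, temp in series:
    print(f"{month} {' ' * (temp - 30)}* {temp}")

# The peak and the trough of the yearly cycle.
warmest = max(series, key=lambda pair: pair[1])
coldest = min(series, key=lambda pair: pair[1])
print("Warmest:", warmest, "Coldest:", coldest)
```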
