Basic Statistics for dentists
Really, really basic stats. Enough to let you interpret data and perhaps to think about how to present your own data. These notes are intended only as a guide. Understanding statistics is your responsibility.
Why do dentists need stats at all? Because data without statistics are meaningless. Consider the word meaningless: MEANING LESS; devoid of meaning. To say that drug x caused a 50% increase in survival rate of condition y sounds fantastic and is perfect for a tabloid headline but, without any statistical indication of how good the data are, the statement is worthless. The recent MMR "scandal" over the supposed link between immunisation and autism is a perfect example: a poorly designed study yielded meaningless results that nevertheless caused great alarm when the data were reported in the press in the complete absence of objective statistical analysis. Nobody with any biological or statistical understanding would have been in the least impressed or worried by the original paper. Tragically, this argument still rumbles on and the impact of the masses of misinformation from newspapers and elsewhere on vaccination programmes has effectively reintroduced measles into this country.
Note to future selves.
What follows are maths-free definitions of all the key statistical terms that I think you might need, plus a few others that you need to know about in order to make sense of the rest. Writing in this way is always a compromise between giving too much information (and so losing the message) and missing things out or simplifying to the point of nonsense. Please let me know if you think I have missed anything or generated too much nonsense. Links are provided to relevant articles in Wikipedia and elsewhere for anyone who wants to know more.
Definitions: Data, Probability and the Null Hypothesis
- Population: Everything. All possible items from which samples may be drawn. If it were possible to work with populations then we wouldn't need statistics. You can't (ever!) work with the entire population of anything so instead you take a sample of the population and use statistics to prove beyond reasonable doubt whether the sample does or does not represent a particular population.
- Sample: Part of a population about which you have made observations. For example, if you were interested in the distribution of eye colour in the UK you couldn't possibly check the eye colour of every individual but you could look at a sample of (for example) a few hundred and argue (statistically) that this sample was truly representative of the population as a whole. Similarly, if you wanted to know how tall people in the UK were then you could measure the heights of a sample of the population and so on. Making sure that you choose a representative and fair sample of the population can be the most important part of any study. Any bias in sample selection will invalidate the entire study. Even more entertaining, and completely invalid, are studies where the samples are chosen after the data have been collected.
- Probability: Chance in a random universe. Usually given as a fraction or a percentage but may be given as odds. A 0.5 probability is a 50% chance, or 1 chance in 2 (even odds), that an event will occur. If you take part in a Grand National sweepstake and draw a horse with a starting price of 20:1 you know that you have little chance of winning because bookies only offer 20:1 odds on a horse that they think has no more than about 1 chance in 20 of winning. Scientists and statisticians also think a 1 in 20 chance (0.05, 5%) is a bad bet and so use this probability to determine whether or not to reject the Null Hypothesis. Experimental scientists are generally sceptical about data (except their own). A 1 in 20 chance is a bad bet, but one that does come off from time to time (about 1 time in 20) and so we are much more impressed by a chance of 1 in 100 (0.01, 1%) or even 1 in 1000 (0.001, 0.1%) and may even refer to such data as being extremely statistically significant.
- Data types
- Continuous: data that you measure e.g. height, weight, intracellular calcium concentration etc.
- Discrete or discontinuous: data that you count. e.g. eye colour, petals on a flower etc.
The distinction between data types, and how you deal with them, is obvious at the extremes. For example, you wouldn't try to average eye colours (what is the average of blue and brown?), but it is possible to work statistically with such qualitative data simply by counting how often each type occurs (how many people have blue eyes?). Some discrete data types, e.g. exam scores, come as numbers. One way or another your sample data must end up as numbers so that you can then manipulate them using statistical tools.
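As a rough illustration of turning qualitative data into numbers by counting, here is a minimal Python sketch. The eye-colour observations are made up for the example and are not real survey data.

```python
from collections import Counter

# Hypothetical sample of observed eye colours (made-up data for illustration).
eye_colours = ["blue", "brown", "brown", "green", "blue", "brown", "hazel", "brown"]

# Counting how often each type occurs turns the observations into numbers.
counts = Counter(eye_colours)

# Express each count as a fraction of the whole sample.
fractions = {colour: n / len(eye_colours) for colour, n in counts.items()}

# List the colours from most to least common.
for colour, n in counts.most_common():
    print(f"{colour}: {n} ({fractions[colour]:.0%})")
```

Once the data are counts, they can be fed into tools designed for discrete data.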
- Simple Statistical tools
- Mean: (arithmetic mean, average): Add up all the items in your sample and divide by the number of items. The symbol for population mean is μ, the symbol for sample mean is x̄. You don't need to know about geometric or harmonic means....(unless you do an elective that involves compound rates of change).
- Mode: The most commonly occurring value in a discrete data set. For example, the mode of (1,2,3,3,3,4) is 3. Any data set may contain more than one mode, for example the modes of (1,2,2,3,3,4) are 2 and 3. Mode is a useful concept in discrete data sets, but not necessarily in continuous data sets where every observation may be unique. It is possible to determine a mode for a continuous data set by grouping the observations (turning them into a discrete data set). For example, the continuous data set (0.35, 0.34, 1.15, 1.55, 1.81, 2.11, 6.11) could be transformed into (0.5, 0.5, 1.5, 1.5, 1.5, 2.5, 6.5) which has a mode of 1.5. The question you must ask is not "is it possible to calculate the mode of this data set" but rather "is there any point in doing so".
- Median: The middle value of a data set arranged in numerical order. In an even numbered data set, take the mean of the two middle values.
- Quartiles: The "median" of the two halves of the data set divided by the median. Rank the data then find the median then find the middle value of the lower half of the data (lower quartile) then find the median of the upper half of the data (upper quartile). The median is the same as the middle quartile.
- Null Hypothesis: (By-and-large) The hypothesis that two observations are not different. If you were testing a drug or treatment your Null Hypothesis would be that the drug or treatment had no effect on the condition. In statistical parlance, the simplest Null Hypothesis is that two samples belong to the same population. Wherever there is a Null Hypothesis, there is also an Alternative Hypothesis.... which usually boils down to the two samples belonging to different populations. Statistical testing (see below) gives a probability that two (or more) samples belong to the same population. If your statistical test shows that the probability that two samples belong to the same population is 0.05 (5%) or less, then you get all excited because you can say that there is a significant difference between the samples, i.e. that the chance that they belong to the same population is no more than 1 in 20, and you may therefore reject the Null Hypothesis. A type I error is a false positive.... e.g. the 1 time in 20 that you shouldn't have rejected the Null Hypothesis. A type II error is the other way around, a false negative.
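The simple statistical tools defined above (mean, mode, median and quartiles) can be sketched in a few lines of Python using the standard `statistics` module. The `quartiles` helper is a hypothetical name written for this example; it follows the "median of each half" rule described above, which is only one of several quartile conventions in use, so other software may give slightly different values.

```python
import statistics

# The discrete data set used in the Mode definition above.
values = [1, 2, 3, 3, 3, 4]

mean = statistics.mean(values)          # add up the items, divide by how many there are
median = statistics.median(values)      # even-sized set: mean of the two middle values
modes = statistics.multimode(values)    # list of every most-common value

def quartiles(data):
    """Return (lower quartile, median, upper quartile).

    Uses the 'median of each half' rule; when the number of items is odd,
    the middle value itself is left out of both halves.
    """
    if len(data) < 2:
        raise ValueError("need at least two observations")
    ranked = sorted(data)
    half = len(ranked) // 2
    lower_half = ranked[:half]           # values below the median position
    upper_half = ranked[-half:]          # values above the median position
    return (statistics.median(lower_half),
            statistics.median(ranked),
            statistics.median(upper_half))

print("mean:", mean)
print("median:", median)
print("modes:", modes)
print("quartiles:", quartiles(values))
```

Calling `statistics.multimode([1, 2, 2, 3, 3, 4])` returns both modes of the second example, 2 and 3.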
The Normal Distribution (and friends)
Normal Distribution: (Gaussian) If you plot the frequency with which an observation occurs against the value of the observation and get a bell-shaped curve that centres on the mean of the sample then this is (probably) a normal distribution. Many, but not all, observations of biological (including clinical) phenomena are normally distributed. You may choose to believe that the reason that these data are normally distributed is a) magic or b) something to do with The Central Limit Theorem. Either explanation is equally useful in the context of these notes.
Standard Deviation: The most commonly used measure of the statistical dispersion of data. If standard deviation is small then the data are clustered closely around the mean. If standard deviation is large then the data are widely spread. Samples with a wide dispersal (big standard deviation) are harder to tell apart, and so less likely to be shown to belong to different populations, than samples with little dispersal. For normally distributed data, 68.27% of observations will occur within 1 standard deviation either side of the mean, 95.45% will occur within 2 standard deviations of the mean and 99.73% within 3. Given that about 95% of all normally-distributed data fall within 2 standard deviations of the mean.... if you make an observation that lies more than 2 standard deviations from the mean then you can be about 95% certain that it does not belong to this sample or population. The symbol for population standard deviation is σ. The symbol for sample standard deviation is s or SD. (Technical note: Standard Deviation is the square root of variance (v). Variance is the average of the squared difference between each datum and the mean. Now you know)
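For anyone who wants to see the technical note in action, here is a minimal Python sketch. It computes variance exactly as described (the average of the squared differences, i.e. dividing by n; sample formulas usually divide by n-1 instead) and then uses the standard library's `NormalDist` to check what fraction of a normal distribution lies within 1, 2 and 3 standard deviations of the mean. The function names are invented for this example.

```python
import math
from statistics import NormalDist

def variance(data):
    """Average of the squared differences between each datum and the mean."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)

def standard_deviation(data):
    """Standard Deviation is the square root of variance."""
    return math.sqrt(variance(data))

# A normal distribution with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of normally distributed data within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {fraction_within(k):.2%}")
```

The printed fractions should be close to the percentages quoted above.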
Student's t distribution: Almost the same as the normal distribution. There are two interesting things about the t distribution. The useful one is that it is a better fit for small samples than is the normal distribution. (Why? More magic). At sample sizes >30, t and normal are practically the same. The other interesting thing about the t distribution is that it was invented by William Sealy Gosset whilst working at the Guinness Brewery in Dublin. Before his work, statistical tests were designed by and used by biometricians who had hundreds of observations and no inclination to design tools specially to work with small samples. In experimental biology (e.g. brewing, cellular physiology etc), you only ever get small samples. Thank you Mr. Gosset. The t-distribution is of course the basis for the t-test.
Presenting Data
Data should be presented in such a way as to make clear the observation and the degree of statistical certainty of the observation. There are various ways in which this may be achieved. Mean +/- SEM (n=number of observations) is one common way. Odds Ratios and confidence intervals are another. The important thing is to use the most appropriate technique for the data.
- Confidence Interval: Usually, the 95% confidence interval.... The interval between two numbers in which you are 95% confident that the mean (or other appropriate statistic) lies. Data may be written 0.33 (0.13 - 0.82) i.e. mean (mean-CI - mean+CI). This makes data very easy to interpret.... Are 0.33 (0.13 - 0.82) and 1.68 (0.87 - 3.27) likely to be the means of samples from the same population? Answer: no. The 95% confidence intervals of the two means do not overlap, therefore we can be at least 95% certain that these two means are significantly different. Confidence Intervals are calculated using the t or normal distribution as appropriate. Confidence Intervals are a natural accompaniment to Odds Ratios.... As may be seen in the paper about dental treatment and infectious endocarditis.
- Odds Ratio: is the ratio of the odds of an event happening in one group to the odds of it occurring in another. Odds ratios may be used to assess the effectiveness of medical treatment. Suppose you have a wonder drug that stops people developing a particular sort of cancer. Find two samples of people with an equal chance of contracting this cancer. Treat one group with the drug and the other with a placebo. Observe how many develop the disease in each group, work out the odds of contracting the disease for each group (the number who develop the disease divided by the number who do not) and then divide the odds of the test group by those of the control (placebo) group. An Odds Ratio of 1 indicates that the treatment had no effect. An Odds Ratio (significantly) less than 1 indicates that the treatment reduced the chance of contracting the disease (hurrah). An Odds Ratio greater than 1 indicates that the treatment made things worse. Odds Ratios are often used to assess risk factors, for example those relating to dental treatment and infectious endocarditis.
- Standard Error (of the mean): (SE, SEM) can be a bit confusing, but it is much beloved by experimental scientists because it is easy to calculate and it is a value that you can make sense of on a graph. What it really represents, the standard deviation of the sample mean, is harder to get your head round. If you sample a population, you can calculate a sample mean from your observations. If you did this lots of times you would obtain lots of sample means.... If you then made a frequency plot of the sample means, it would be normally distributed.... with a Standard Deviation estimated by sample Standard Error of the mean. In other words, The Standard Error of the Mean is an estimate of how well the sample mean reflects the population mean. Mean +/- SEM (n) is commonly used to present data from laboratory studies (See the results section of this paper on Sjogren's syndrome).
It doesn't really matter whether you understand this or not; you do need to know that Standard Error is an appropriate statistic to use when reporting experimental results in biology and medicine. Whenever you look at a chart showing SEM, if the error bars overlap then there is NO significant difference between the means. If there is a gap between the error bars of the sample means equal to the combined size of the error bars then they are likely to be significantly different.
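The presentation statistics above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a definitive recipe: the function names and the measurements are invented for the example, the confidence interval uses the normal-distribution multiplier 1.96 by default (small samples need the larger value from t tables, which the caller would have to supply), and odds are taken as the number with the disease divided by the number without.

```python
import math
import statistics

def sem(data):
    """Standard Error of the Mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(data) / math.sqrt(len(data))

def approx_95_ci(data, critical_value=1.96):
    """Approximate 95% confidence interval for the mean.

    1.96 comes from the normal distribution; for small samples pass the
    appropriate (larger) value from t tables instead.
    """
    m = statistics.mean(data)
    half_width = critical_value * sem(data)
    return m - half_width, m + half_width

def odds_ratio(treated_cases, treated_non_cases, control_cases, control_non_cases):
    """Odds of disease in the treated group divided by odds in the control group."""
    treated_odds = treated_cases / treated_non_cases
    control_odds = control_cases / control_non_cases
    return treated_odds / control_odds

# Hypothetical measurements, made up for illustration.
sample = [4.1, 3.8, 4.4, 4.0, 3.9, 4.3, 4.2, 3.7]
low, high = approx_95_ci(sample)
print(f"{statistics.mean(sample):.2f} +/- {sem(sample):.2f} (n={len(sample)})")
print(f"approximate 95% CI: {low:.2f} - {high:.2f}")

# Hypothetical trial: 10 of 100 treated and 20 of 100 placebo patients develop the disease.
print(f"odds ratio: {odds_ratio(10, 90, 20, 80):.2f}")
```

An odds ratio below 1 in the last line would suggest the made-up treatment helped, but only a confidence interval around it says whether that is convincing.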
Statistical Tests
- t-test: The t-statistic may be calculated from one or more samples of continuous data taken from a normally distributed population. The t-statistic and the number of observations may be used together to calculate the probability that a sample belongs to a given population or the probability that two samples belong to the same population. In other words, the t-test tells you the chance that your calculated means are the same. If that chance is small enough (0.05 or less), you may reject the Null Hypothesis.
t-tests come in a variety of flavours
- 2-sample t-test: The bog-standard t-test. You have two independent sets of observations obtained under different conditions. The t-test will tell you the probability that the means of the observations are the same. Either way, you should be able to draw conclusions about the effects of the "different conditions".
- 2-sample t-test with paired data: As above, except that one set of data is dependent on the other. In other words, the data come in pairs. For example, data before and after treatment.
- 1-sample t-test: Compares a sample mean to the population mean. Offers an alternative way of analysing before and after treatment data. Express the "after" data as a % of the "before" data and then apply a 1-sample t-test to a population mean of 100%. (Technical note: The 1-sample t-test is appropriate when the population standard deviation is unknown. In the unlikely event that the population standard deviation is known then the z-test is more appropriate.)
I have deliberately included an example where it is not clear which test should be applied to the data. This is part of the fun. It may be very difficult to know which is the best statistical analysis for any given set of data. The golden rule is "keep it simple". A definitive answer to the question "which test should I use" usually requires input from a statistician. To be absolutely sure may require input from several statisticians. On the other hand, if you have good data sets that show clear effects (or clearly show a lack of effect) any "appropriate" statistical test will give essentially the same answer (Please don't tell any statisticians that I said this). Technical note: The t-test (and most other statistical tests) only work with samples with the same Standard Deviation (equivalent degree of spread). Therefore, you should test your data (perhaps using an F-test) first to see whether it is appropriate to perform a t-test.
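To make the t-test flavours a little more concrete, here is a minimal Python sketch of the 2-sample and paired t-statistics. The function names and the data are invented for this example. The 2-sample version pools the variances, so it assumes both samples have a similar spread. Turning a t-statistic into an exact probability needs the t distribution, which the standard library does not provide, so this sketch simply compares the result with the two-tailed 5% value from t tables for 8 degrees of freedom (2.306).

```python
import math
import statistics

def two_sample_t(a, b):
    """Student's 2-sample t-statistic, assuming equal spread in both samples.

    Returns (t, degrees_of_freedom).
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_variance = ((na - 1) * statistics.variance(a)
                       + (nb - 1) * statistics.variance(b)) / df
    se_of_difference = math.sqrt(pooled_variance * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se_of_difference, df

def paired_t(before, after):
    """Paired t-statistic: a 1-sample t-test on the differences against a mean of 0.

    Returns (t, degrees_of_freedom).
    """
    differences = [y - x for x, y in zip(before, after)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error, n - 1

# Hypothetical observations under two conditions (made-up data).
control = [5.1, 4.9, 5.6, 5.0, 5.3]
treated = [6.0, 5.8, 6.4, 5.9, 6.3]

t, df = two_sample_t(control, treated)
CRITICAL_T_DF8_P05 = 2.306   # two-tailed 5% value from t tables, 8 degrees of freedom
print(f"t = {t:.2f} with {df} degrees of freedom")
print("reject Null Hypothesis at 5%:", abs(t) > CRITICAL_T_DF8_P05)
```

For before-and-after measurements on the same subjects, `paired_t(before, after)` would be the one to reach for.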
- (Pearson's) χ2 (Chi squared) test: The t-test only works for data that are normally distributed. The χ2 test is the simplest statistical test that is not based on the normal distribution. χ2 will work on normally distributed data, but the test makes no assumptions about the frequency distribution of the data. This branch of statistics is called non-parametric statistics. χ2 is most commonly used to determine how well data fit a particular model. For example, you can make a prediction about the frequency with which any one number on a 6-sided dice should occur (1/6). This is your model. You could then test the model by rolling the dice and checking your observations using χ2. Applying the χ2 test returns a χ2-statistic, analogous to the t-statistic. You can use the χ2-statistic to calculate a probability value which you can then use to accept or reject the Null Hypothesis.
- Mann-Whitney test: The closest non-parametric alternative to a 2-sample t-test. May be used for data that are not normally distributed. It is no better than the t-test in dealing with 2 samples with very different Standard Deviations. Applying the Mann-Whitney test returns a U-statistic, analogous to the t-statistic. You can use the U-statistic to calculate a probability value which you can then use to accept or reject the Null Hypothesis.
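The dice example and the U-statistic can be sketched in Python as below. The roll counts are made up and the function names are invented for this example. Rather than calculating an exact probability, the sketch compares the χ2-statistic with the 5% value from χ2 tables for 5 degrees of freedom (11.07). The U function uses the direct pair-counting definition; published tables usually work with the smaller of the two possible U values.

```python
def chi_squared(observed, expected):
    """Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def mann_whitney_u(a, b):
    """U-statistic for sample a: number of (a, b) pairs where the a value is larger.

    Ties count as half a pair.
    """
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Hypothetical counts of each face from 60 rolls of a 6-sided dice (made-up data).
observed = [8, 9, 12, 11, 6, 14]
expected = [sum(observed) / 6] * 6       # the model: each face turns up 1/6 of the time

statistic = chi_squared(observed, expected)
degrees_of_freedom = len(observed) - 1

CRITICAL_5DF_P05 = 11.07                 # 5% value from chi-squared tables, 5 degrees of freedom
reject_null = statistic > CRITICAL_5DF_P05

print(f"chi-squared = {statistic:.2f} with {degrees_of_freedom} degrees of freedom")
print("reject the 'fair dice' model at 5%:", reject_null)
```

A statistic smaller than the table value means the observations are consistent with the model, so the Null Hypothesis stands.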
How to perform statistical calculations and tests
If you are really interested in what underlies this brief run through elementary statistics then follow any of the links into Wikipedia and go from there. The people writing these articles know far more about statistical processes than I do. So far as calculating Standard Deviation etc. from data..... look closely at your calculator.... the σ and σ-1 buttons will calculate the population Standard Deviation (dividing by n) and the sample Standard Deviation (dividing by n-1) respectively; use σ-1 for sample data. Better yet, use MS Excel (the spreadsheet...). Enter data into (for example) column A and then use the basic statistical tools. To calculate the mean of data contained in rows 1-20 enter "=average(A1:A20)" into any empty cell. Similarly "=stdev(A1:A20)" returns standard deviation and "=count(A1:A20)" returns the number of observations..... Always start a formula in Excel with "=". Excel will calculate (some) probabilities for you if you ask it nicely. A simpler alternative is one of the on-line statistical testing packages. The one offered by GraphPad is particularly good. If all you want is a simple t-test then try my home-grown software. I've compiled the most commonly used formulae in a simple spreadsheet.
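For anyone who prefers a script to a spreadsheet, the same basic calculations can be sketched with Python's standard `statistics` module. The data are made up and simply stand in for a spreadsheet column. `stdev` divides by n-1, as Excel's "=stdev(...)" and a calculator's σ-1 button do, whereas `pstdev` divides by n like the σ button.

```python
import statistics

# Hypothetical observations, standing in for a column of spreadsheet cells.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)              # like =average(...)
sample_sd = statistics.stdev(data)        # like =stdev(...): divides by n-1
population_sd = statistics.pstdev(data)   # divides by n
n = len(data)                             # like =count(...)

print(f"mean = {mean}, sample SD = {sample_sd:.3f}, population SD = {population_sd:.3f}, n = {n}")
```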
Laboratory based experimental biologists (and physicists) are suspicious of complex statistics because a) we don't understand them and b), to paraphrase Ernest Rutherford, "If your experiment needs (complex) statistics, then you ought to have done a better experiment". Non laboratory based disciplines (most social sciences, epidemiologists etc.) depend more heavily on statistical analyses because they can't easily perform experiments.
Learning support documents, such as this one, can only get better with feedback from users. Please give feedback. Positive or negative.