A correlation simply means the data falls in a pattern: in this case, higher SAT scores are often associated with better performance in college. Correlations do not tell us causation, and they cannot predict individual behavior. Even if that correlation holds, you can still most certainly have someone with a perfect score drop out of college, or somebody with a very low SAT score become valedictorian. The correlation just tells us that a pattern exists.
Correlation Coefficients
Correlations can be tricky, though, precisely because they don’t indicate causation. Correlations come from graphing data from two variables on a coordinate plane, then calculating what is known as the Pearson correlation coefficient (usually represented by the letter r), which measures how closely related the two variables are. The correlation coefficient can tell us that the two variables are strongly positively correlated, that they are strongly negatively correlated, or that there is no correlation at all (and everything in between).
Correlation coefficients, or r-values, are numbers ranging from –1 to +1. Think of it as a spectrum: An r-value of –1 means there is a perfect negative correlation, an r-value of 0 means there is no correlation at all, and an r-value of +1 means there is a perfect positive correlation.
A strong correlation is considered anything between 0.7 and 1 in absolute value (so between –0.7 and –1 for negative correlations). An absolute value of less than 0.7 means there is some correlation, but the closer the r-value gets to zero, the weaker the correlation.
Perfect correlations, either positive or negative, mean that the two variables are directly linked: Either one rises exactly in step with the other, or one declines exactly as the other rises. It’s kind of difficult to produce examples of perfect correlation in everyday life because most phenomena have some variation, even if slight. The number of miles you drive and the amount of gas you use would have a near-perfect positive correlation, since the amount of gas used increases steadily as the number of miles driven increases. This might not be a perfect correlation, though, because several factors influence how much gas you are using at any given moment. (Are you coasting on the highway? Idling at a light? Using the air conditioning?)
A strong negative correlation exists between, for example, the average temperature in winter and the amount of energy used to heat a home. As the temperature increases (assuming it’s still cold enough for houses to require heating), the energy used to heat homes decreases. Again, this isn’t a perfect correlation, because other factors might affect how much energy a given home uses.
What about correlations that are only somewhat strong? Let’s look at feet for an example. Have you noticed that your tall friends mostly wear a bigger shoe size than you do? Or maybe you are the tall friend, and you feel like you have clown feet when you stand next to your friends. Most of us would probably guess that shoe size and height are correlated, but we can look at the data to prove it. Here’s a graph from StatCrunch showing height on the y-axis (vertical) and shoe size on the x-axis (horizontal).
This type of graph is called a scatterplot. Each point on the graph represents one data point, or one person with a shoe size of x and a height of y. You can see that there are points all over the graph, but that they follow an upward pattern as your eye moves to the right. Based on this pattern, we can surmise that there is a positive correlation between height and shoe size. It turns out that, when a linear regression model is run on a calculator or computer, the r-value is .6222.[lix] Remember that the closer the number is to zero, the weaker the correlation, and an r-value of one means there is a perfect correlation. This correlation coefficient tells us that height and shoe size do have a positive correlation, but that it’s not perfect; it’s not even considered statistically “strong” since it is below 0.7. In real-world terms, this means most people who are tall have big feet, but there are plenty of outliers. You could be a shorter person with unusually large feet for your size, or a tall person who happens to wear the smallest shoes among your friends. Either of these would be acceptable, and even expected, with a correlation coefficient of .6222.
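If you’re curious how an r-value like that is actually computed, here is a minimal Python sketch. The heights and shoe sizes below are invented for illustration (they are not the StatCrunch data), but the calculation is the standard Pearson formula: the covariance of the two variables divided by the product of their spreads.

```python
# A minimal sketch of computing a Pearson correlation coefficient.
# The data below is invented for illustration; it is NOT the
# StatCrunch dataset discussed in the text.
import statistics

shoe_sizes = [7, 8, 8.5, 9, 10, 10.5, 11, 12]    # x-axis values
heights_in = [63, 65, 66, 68, 69, 71, 72, 74]    # y-axis values, in inches

def pearson_r(xs, ys):
    """Covariance of x and y divided by the product of their spreads."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(round(pearson_r(shoe_sizes, heights_in), 4))  # near 1: strong positive
```

Run on real survey data instead of these made-up points, the same function would return something close to the .6222 discussed above.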
A classic example of a strong (but not perfect) negative correlation is car price and age of the car. As cars get older, their cost goes down; a new car will usually cost you substantially more than a ten-year-old one. Just as a positive correlation means that the two variables increase together, a negative correlation means that as one variable (age) increases, the other (cost) decreases. The strength of this correlation varies by car brand, though. Some cars are known to “hold their value” better than others, meaning their value declines less rapidly. Once again, the correlation can tell us what the trend is or what we can expect to be the case, but it allows plenty of room for outliers.
The British website Auto Express gives us two examples of graphs showing car depreciation. The first graph (Car B) is for a typical car that loses much of its value within a year or two of being purchased:
The second graph (Car A) shows a more linear progression for a car that holds on to its value better, as certain brands are known to do.
Both of these graphs show a negative correlation between the variables: As the years since purchase increase, the value of the car decreases. Car A shows an almost linear relationship, with the value decreasing pretty steadily over time, while Car B’s value drops almost instantly.
Finally, let’s look at two variables that have no correlation. Imagine if the two variables we looked at were height and number of pets owned. We would most likely find no correlation there. The graph might look something like this, with each point again representing a single person’s data:
You can tell from looking that there is no correlation between those variables, which makes sense. Height has nothing to do with how many pets a person owns.
Misleading Correlations
Interestingly, we sometimes find a correlation where there is none. Imagine in the example above that my sample (the people I queried) happened to show that greater height was associated with more pets. We would have to think carefully about that correlation and whether it made sense. Was the sample set appropriate? Was the data accurate? Could I repeat the study and get comparable results? Answering these questions would reveal that the correlation was most likely just a fluke. Maybe my sample size was too small, and I happened to get a handful of tall people who have a lot of pets and a bunch of short people with no pets. This inappropriate sampling could suggest a correlation that wouldn’t exist if we sampled a larger population.
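To see how easily a small sample produces a fluke, here is a quick simulation sketch in Python. All of the data is randomly generated, and “heights” and “pets” are just labels; the point is how often eight random people show a sizable correlation by pure chance.

```python
# Sketch: small samples of two UNRELATED variables can still show a
# sizable correlation purely by chance. All data is randomly generated.
import random
import statistics

def pearson_r(xs, ys):
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = (var_x * var_y) ** 0.5
    return cov / denom if denom else 0.0

random.seed(1)
trials, flukes = 1000, 0
for _ in range(trials):
    heights = [random.gauss(67, 4) for _ in range(8)]  # 8 random "people"
    pets = [random.randint(0, 5) for _ in range(8)]    # unrelated to height
    if abs(pearson_r(heights, pets)) > 0.5:
        flukes += 1

print(f"{flukes} of {trials} tiny samples showed |r| > 0.5 by pure chance")
```

With larger samples, those chance correlations shrink toward zero, which is exactly why sample size matters.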
Spurious correlations can be fun, but they can also be misleading. Let’s look at some fun ones first. There’s an entire website, written by Tyler Vigen, devoted to correlations between variables that have nothing to do with each other. He examines, for instance, the correlation between yogurt consumption and Google searches of “i cant even”:
As we can see in the graph, those two variables look pretty well correlated—they would likely have a high r-value if he calculated it. He even used artificial intelligence (AI) to come up with an explanation for why this supposed correlation exists: “It’s simple. As yogurt consumption rose, so did our tolerance for the sour and curdled aspects of life. It’s as if the active cultures in the yogurt fermented a newfound ability to handle all the whey-ward frustrations. So next time you’re feeling moody, just grab a spoon and dairy yourself to a better mood. Remember, when life gives you lemons, make fro-yo!”
Vigen has even linked each of his graphs to an AI-generated “research” paper. He describes his process on the linked website Spurious Scholar:
- Step 1: Gather a bunch of data.
- Step 2: Dredge that data to find random correlations between variables.
- Step 3: Calculate the correlation coefficient, confidence interval, and p-value to see if the connection is statistically significant.
- Step 4: If it is, have a large language model draft a research paper.
- Step 5: Remind everyone that these papers are AI-generated and are not real. Seriously, just pick one and read the lit review section.
- Step 6: . . . publish.
The note after Step 1 claims that he has 25,156 variables in his database. After Step 2, he describes data dredging: “‘Dredging data’ means taking one variable and correlating it against every other variable just to see what sticks. It’s a dangerous way to go about analysis, because any sufficiently large dataset will yield strong correlations completely at random.”
Data dredging goes by another name: p-hacking. The term refers to a study’s p-value, which is the probability of getting results at least as extreme as the study’s purely by chance, assuming no real effect exists. It is used to indicate how statistically significant a result is. The lower the p-value, the greater the statistical significance. P-hacking means dredging a data set until you find something that is statistically significant, whether that thing makes any sense or not. Tyler Vigen’s “spurious correlations” prove the point that p-hacking is both possible, given enough data, and dangerous, as the results can be incredibly misleading.
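Vigen’s process is easy to imitate. Here is a small Python sketch of data dredging; it assumes the scipy library for the p-value calculation, and every variable in it is pure random noise. Even though nothing is related to anything, roughly 5 percent of the candidates will clear the conventional p < 0.05 bar anyway.

```python
# Sketch of data dredging / p-hacking: correlate one target variable
# against many random variables and keep whatever clears p < 0.05.
# Assumes the scipy library is installed; every variable is pure noise.
import random
from scipy.stats import pearsonr

random.seed(42)
n_points = 20        # observations per variable
n_variables = 200    # candidate variables to dredge through

target = [random.gauss(0, 1) for _ in range(n_points)]

hits = 0
for _ in range(n_variables):
    candidate = [random.gauss(0, 1) for _ in range(n_points)]
    r, p = pearsonr(target, candidate)
    if p < 0.05:
        hits += 1    # a "statistically significant" accident

print(f"{hits} of {n_variables} random variables 'correlated' at p < 0.05")
# Expect roughly 5 percent, even though nothing is related to anything.
```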
Correlation Does Not Equal Causation
It is also possible for a correlation to be real but not causal. In other words, the two variables are indeed linked, but the cause is a third variable that wasn’t measured. For example, according to the website Scribbr, ice cream sales and violent crime rates are closely correlated. One might draw incorrect conclusions based on that fact: Maybe eating ice cream leads people to commit crimes, or maybe criminals like to eat ice cream after committing a crime. Both of these seem highly unlikely, though. The correlation exists because a third variable, heat, affects both ice cream sales and violent crimes. When temperatures increase, both of these also increase. So although ice cream sales and violent crimes are correlated, they don’t have a causal relationship.
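We can watch this happen in a toy simulation. In the Python sketch below, all the numbers are invented: temperature drives both ice cream sales and crime, and sure enough the two correlate strongly with each other even though neither causes the other. A partial correlation (holding temperature fixed) makes the confounding visible.

```python
# Sketch of a confounding "third variable": in this toy model, heat
# drives both ice cream sales and crime, so the two correlate with
# each other even though neither causes the other. Numbers are invented.
import random
import statistics

def pearson_r(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(7)
temps = [random.uniform(30, 95) for _ in range(365)]        # daily highs
ice_cream = [2.0 * t + random.gauss(0, 15) for t in temps]  # heat-driven
crime = [0.5 * t + random.gauss(0, 8) for t in temps]       # also heat-driven

r_ic = pearson_r(ice_cream, crime)
r_it = pearson_r(ice_cream, temps)
r_ct = pearson_r(crime, temps)

# Partial correlation: ice cream vs. crime with temperature held fixed.
partial = (r_ic - r_it * r_ct) / ((1 - r_it**2) * (1 - r_ct**2)) ** 0.5

print(f"ice cream vs. crime:          r = {r_ic:.2f}")     # strong
print(f"controlling for temperature:  r = {partial:.2f}")  # near zero
```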
Finally, another problem we encounter when we look at correlated variables is lack of understanding about directionality. For example, researchers have known for many years that depression and vitamin D levels are negatively correlated. In other words, people with low vitamin D levels are often depressed. But researchers are still unclear about causality. As a 2020 mega-review in the Indian Journal of Psychological Medicine put it: “Overall findings were that there is a relationship between vitamin D and depression, though the directionality of this association remains unclear.”
Determining causality may seem like a minor point, but it’s critical in deciding on a course of action. Doctors may realize depression and vitamin D are linked, but it’s unclear if increasing serum vitamin D levels will ease depression symptoms. Should doctors tell their patients to take vitamin D? Or should they focus on other approaches to treating depression and other potential causes of vitamin D deficiency? Causality matters on both an individual and population level. More research is necessary to determine which direction the causality goes and thus what treatments are warranted.
Program Evaluation
If you’re a Gen Xer, you probably remember Joe Camel. Joe Camel appeared in Camel cigarette ads from 1988 through 1997. He was supposed to be cool, with his cigarette and masculine outfits, often alongside the tagline “smooth character.” His job was to entice people to smoke, thereby boosting sales of Camel cigarettes. Joe came under fire in the 1990s, with at least one study showing that the character was as recognizable to six-year-olds as the Disney Channel logo was. In 1997, after years of court battles, R.J. Reynolds Tobacco Company voluntarily retired Joe.
The pressure to ban Joe Camel was part of a national panic about teen smoking. Studies showed that teen smoking declined in the 1970s and 1980s but began to rise in the 1990s. Data from high school seniors showed that, in 1990, 19.4 percent of them were “current smokers” (defined as having smoked in the last thirty days). By 1997, that rate had risen to 24.5 percent. Studies have shown that most adult smokers began smoking when they were teens; very few adults pick up smoking as a new habit. All sorts of programs emerged in the 1990s and 2000s to try to reduce or prevent teen smoking, as that was seen as the key to lowering smoking rates overall. If you remember assemblies in school, ad campaigns, or public service announcements about the dangers of smoking, you were the target of one of these programs.
Billions of dollars are spent each year on large-scale programs like the ones to prevent teen smoking. But are these dollars being put to good use? Knowing whether or not these programs are effective is critically important. Nonprofits and governmental agencies do not want to waste billions on initiatives that aren’t making a difference. This leads us to another important use for statistical analysis: program evaluation.
Program evaluation, broadly speaking, is the process of figuring out whether a program has done what it was created to do. An evaluation of teen smoking prevention programs would tell us whether fewer teens smoked, and thus whether those billions of dollars were well spent. A program evaluation might also tell us if certain parts of a program are effective (certain ad campaigns, for example) and if other parts need to be tweaked or discontinued.
A good program evaluation is a thorough, systematic process that uses data to make a determination. According to a guide published by the US Department of Education:
A well-thought-out evaluation can identify barriers to program effectiveness, as well as catalysts for program successes. Program evaluation begins with outlining the framework for the program, determining questions about program milestones and goals, identifying what data address the questions, and choosing the appropriate analytical method to address the questions. By the end, an evaluation should provide easy-to-understand findings, as well as recommendations or possible actions.
Let’s stick with the campaign to reduce teen smoking for now. Program evaluations happened at many points in the process and looked at many different aspects of the campaign. A summative evaluation might tell us how effective the overall campaign was at reducing teen smoking, but other tools were used along the way to tweak the campaign, changing tactics and emphasizing new strategies.
To evaluate the campaign’s effectiveness, study designers had to first identify the outcomes they wanted (fewer teens initiating smoking, for example). They had to figure out how they were going to collect the data, and also what kind of data it was going to be. Quantitative data is numbers: Are there actually fewer teens who smoke? Qualitative or categorical data measures things numbers alone can’t measure: which ad campaigns teens remember seeing, for example. Both are useful, but study designers need to be clear on which type of data will give them the information they want. Program evaluators might also decide to use randomized controlled trials (RCTs) to evaluate a program’s effectiveness. You’ve probably heard of RCTs mostly in connection with new drugs or medical treatments, as researchers try to determine how effective they are.
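As a concrete (and entirely hypothetical) illustration of the randomization step, here is a minimal Python sketch that assigns schools at random to a program group and a control group. The school names are invented; the point is that random assignment, not the evaluator’s judgment, decides who gets the intervention.

```python
# Minimal sketch of random assignment for a program evaluation RCT.
# The school names are invented; real evaluations would also collect
# baseline data and track outcomes over time.
import random

schools = [f"School {i}" for i in range(1, 21)]
random.seed(3)
random.shuffle(schools)

treatment = schools[:10]  # gets the anti-smoking program
control = schools[10:]    # does not; serves as the comparison group

print("Treatment group:", treatment)
print("Control group:  ", control)
# Because assignment is random, other differences between schools tend
# to balance out, so a later gap in smoking rates points to the program.
```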
According to a meta-analysis published by the National Institutes of Health, the teen smoking campaign involved, among other approaches, three major school initiatives: an “information deficit” model in which school-age children were taught about the effects and risks of tobacco, an “affective education” model that emphasized self-esteem and developing values, and a “social influence resistance” model that taught teens how to resist social influences such as ad campaigns and peer pressure.
Think back to your years in school. Did you have a health or drug-education class that taught you about the effects of tobacco and other drugs? Do you remember attempts to build your self-esteem and influence your health outcomes, like wellness classes or meetings with school counselors? If so, these were likely part of that educational approach. Several studies published in the 1990s and early 2000s showed that, of the three approaches, the social influence resistance model was the most effective. In other words, teaching teens to identify and resist peer and societal pressure had the largest impact (in terms of educational programs) on preventing smoking.
Other aspects of the teen smoking campaign included laws targeted at sales of tobacco to teens, penalties for breaking these laws, advertising restrictions, counter-marketing campaigns (ad campaigns about the dangers of smoking, for example), and other community-based interventions. Of these approaches, preventing sales to minors proved to be one of the least effective models. Kids who want tobacco (or alcohol, for that matter) have a way of getting it from older friends and relatives! As of 2004, meta-analyses revealed that the most effective approach was a combined one:
[The] CDC recommends several components as critical in a comprehensive youth tobacco control program, all of which have parallels in efforts to reduce underage drinking. These components include implementing effective community-based and school-based interventions in a social context that is being hit with a strong media campaign (aimed at some set of “core values”) and with an effort to vigorously enforce existing policies regarding the purchase, possession, and use of the substance.
Without comprehensive data to back up claims, the smoking prevention campaign might have been abandoned after a few years, or ineffective aspects of the program might have continued while others were terminated. What if someone thought that simply preventing the sale of tobacco to minors would stop all youth from smoking, for example? What if no other intervention programs existed because someone believed so strongly in the power of the law to change behavior? Without an effort to study the data and systematically analyze the effectiveness of the program, smoking rates today might be as high as or higher than they were in the 1990s. As it turns out, in 2023, only two out of every one hundred high school students reported smoking cigarettes in the past thirty days.
Evaluating Healthcare Initiatives
Program evaluation is a critical part of many healthcare initiatives as well. Formative evaluations, meaning ongoing, mid-process ones rather than retrospective ones, have helped shape the Affordable Care Act, first passed in 2010 under President Obama. The ACA attempted a massive reform of healthcare, mandating that individuals carry coverage and attempting to curb costs from providers and insurance companies. Whether or not it succeeded in meeting those goals has been widely debated, with political views often complicating the picture.
Data tells us that the ACA did succeed in getting more people insured: According to the US Census Bureau, 26.4 million Americans remained uninsured in 2022, down from 45.2 million in 2013.