Incidentally, Shalizi is a great source for going back to the basics. His course at CMU "Advanced Data Analysis from an Elementary Point of View" has course notes and exercises available gratis: http://www.stat.cmu.edu/~cshalizi/uADA/13/. And any time I need to get a reading list, the best place to start is usually his 'notebooks' page.
Some time ago I tried to come up with the simplest possible explanation of Simpson's paradox. This was the result:
1) Imagine that most women with a certain disease survive, while most men die.
2) Imagine that most women with the disease take a certain medicine, while most men don't.
3) Imagine that the medicine has absolutely no effect. Women just happen to have better innate resistance to the disease, and also just happen to buy the medicine more because it's marketed to women.
Now if you do a statistical analysis without counting men and women separately, you will conclude that the medicine is very correlated with survival!
Note that there's no way to know in advance that you should slice the population along such-and-such variables, which can be a lot more subtle than just gender. Also note that the example works even if the medicine has a slight negative effect, i.e. you can reverse the direction of correlations by choosing to slice or not to slice.
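The scenario above is easy to simulate. The following sketch uses made-up probabilities (80% vs 30% survival, 90% vs 10% medicine uptake are my own illustrative numbers): the medicine has zero effect within each gender, yet the pooled data show a large survival gap between takers and non-takers.

```python
import random

random.seed(0)

def simulate(n=100_000):
    """Medicine has no effect; gender drives both survival and uptake."""
    took = took_and_survived = survived_total = 0
    for _ in range(n):
        woman = random.random() < 0.5
        # Survival and uptake each depend only on gender, not on each other.
        survives = random.random() < (0.8 if woman else 0.3)
        takes = random.random() < (0.9 if woman else 0.1)
        took += takes
        took_and_survived += takes and survives
        survived_total += survives
    surv_with = took_and_survived / took
    surv_without = (survived_total - took_and_survived) / (n - took)
    return surv_with, surv_without

surv_medicine, surv_no_medicine = simulate()
print(f"survival with medicine:    {surv_medicine:.2f}")    # roughly 0.75
print(f"survival without medicine: {surv_no_medicine:.2f}")  # roughly 0.35
```

Pooled, the medicine looks like it doubles survival; split by gender, the effect vanishes entirely.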
I think such results make it clear that you can't easily trust conclusions from statistics. One minute you're thinking that cholesterol causes heart disease, and the next minute you're asking yourself, what if cholesterol is part of the body's response to heart disease? That's why we need randomized controlled studies, and theories of causality.
Also note that slicing the data too many ways is also dangerous. Every time you slice the data a different way, you increase the chances that you will find a spurious correlation. It is very easy for a naive researcher to fail to adjust their p-values and mistake the spurious correlation for a real one. It is also very easy for a biased researcher to cherry-pick the correlations which match their expectations.
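The multiple-comparisons danger is easy to quantify: if each slice has a 5% chance of producing a spurious "significant" correlation, and the slices are (for simplicity) assumed independent, the chance of at least one false hit grows quickly with the number of slices.

```python
# P(at least one false positive) = 1 - (1 - alpha)^k for k independent tests
alpha = 0.05
for k in (1, 10, 50):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:3d} slices -> P(at least one spurious 'hit') = {p_any:.2f}")
# prints roughly 0.05, 0.40, 0.92
```

This is exactly why corrections such as Bonferroni exist: at fifty slices, a naive researcher is nearly guaranteed to "find" something.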
This article is largely focused on epidemiological studies. These are a very well-known type of scientific investigation, particularly among the tech crowd--because it is tech-intensive to manage and analyze large volumes of data.
But, epidemiology is not all of science, it's just one way to do science. So this statement is wrong, or, at least, not complete:
> The standard scientific answer to this question is that (with some caveats) we can infer causality from a well designed randomized controlled experiment.
The more general scientific answer is that we need a hypothesis that can be tested and disproven by observation. Such observations can come from a randomized controlled trial, but a surprising number of important scientific observations do not. For example, a major confirming test of relativity was to measure the apparent displacement of stars due to gravitational lensing during an eclipse. Obviously there is no way to create a randomized double-blind study of this. (Unless one were to stare too long without eye protection! [rimshot])
But wait, I could hypothesize that saying "blutarski" when flipping a coin will make it land heads up, and if I only do the observation once, my stupid hypothesis might get confirmed. Right?
In this case, yes, a lengthy trial would help provide much better observations. But the more general answer is that a reasonable hypothesis must propose a physical system that could plausibly cause the predicted result. There is no known physical system that connects "blutarski" with coins, or Facebook traffic with Greek bonds. But there is a plausible physical mechanism by which smoke inhalation could lead to lung cancer.
This is what scientists are talking about when they say things like "there's more to science than curve-fitting." But most people do not understand the difference, and that's why we see things like global climate circulation models get mixed up with historical climate reconstruction, or paleontology get mixed up with evolutionary biology.
A lot of critics of evolutionary theory believe that the fossil record is the most important piece of supporting evidence--hence the focus on "missing links" between species.
The scientific reality is that Darwin formulated his theory by looking at living species, and the modern study of evolutionary processes does too--primarily viruses, bacteria, and insects, because their lifespans are so short.
Thanks. This is a treat to see this article. I've long wanted to read Pearl's book on this topic, but haven't had the chance to carve out the time. I'm going to dive into this article now.
This is the fundamental reason why general AI might not be possible on a computer without a body. To infer causality, you must form a hypothesis and then design an experiment to confirm or deny it. Passively observing the world can't disambiguate between complex correlation and causation (even with a fancy calculus). You need action to learn the intricacies of the world.
Think how the discovery of electricity led to electronics. There is nothing like electronics in the natural world, the only route to electronics is via an iteratively refined causal model of the universe.
> This is the fundamental reason why general AI might not be possible on a computer without a body.
'Body' is a rather misleading term to use for what a general AI might need.
And I say 'might' because part of the motivation for the Pearlean program is for discovering under what data and conditions one can infer causality without randomized interventions.
Correlation + plausible based on your knowledge of the world implies causation (obviously to the appropriate degree).
It's the flip-side of extraordinary claims require extraordinary evidence.
Facebook driving Greek debt is implausible and two vaguely shaped curves aren't enough. A formula that predicts to many decimal places over a fair period, prospectively, would be really weird but hard to ignore.
Spanish debt, gold prices or something related on the other hand, would require proportionately less evidence for causation.
This disturbs people because it implies you can get stuck in a pathological view of the world and statistics and evidence won't get you out. Sorry about that.
Edit: And this is pretty much the Bayesian approach. It's just that Bayesian statistics and Pearl's arguments are themselves just models of the world. You can have others, but, even more, you need more of an argument than "stuff that seems implausible needs more evidence", and all of the more elaborate stuff is going to be specific to a given situation and thus might not be applicable to a different one.
Plausibility is a subjective measure, and while I would say that, yes, it can be a hint, you cannot disregard something merely because it's implausible.
> a hint, you cannot disregard something merely because it's implausible
Coming back to Bayesian statistics, the word for this is "prior", and it's not that you disregard evidence, it's that you can quantify both your existing beliefs about reality and the change in your beliefs according to the evidence you see.
First, our hint: we have a prior assumption of the probability that the popularity of Facebook is driving up Greek debt (call it P(Fg)). For the sake of argument, I'm going to make this prior 0.001 (I'd probably estimate less). Then, we observe a correlation between these two things.
Now, once we see this correlation, we now need to calculate two things: 1. The probability of observing that correlation (call that P(C)). Note that the more extraordinary the correlation, the less probable it is, and the smaller this term would be. In this case, the graph matches vaguely, I'm going to give it a probability of 0.1.
2. Given a world where there is a causation, what is the probability we'd see this correlation (Q: I'm not 100% on this part). Now, Greek debt could plausibly be driven by other things, which would mask the Facebook effect, so there's no guarantee there would be a correlation. This term is called P(C | Fg), and I have no idea what value to give it. Let's try 0.5.
What we want to know is: P(Fg | C), that is, the probability of a connection given we have observed a correlation.
Boom! P(Fg | C) = P(C | Fg) x P(Fg) / P(C)
So our posterior probability (after observing this correlation) changes from 0.001 to 0.001 x 0.5 / 0.1 = 0.005
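The arithmetic above is just Bayes' rule applied once; here is a minimal sketch using the same illustrative numbers (0.001, 0.5, 0.1 are the guesses from the comment, not measured quantities).

```python
def bayes_update(prior, p_evidence_given_h, p_evidence):
    """Posterior via Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_evidence_given_h * prior / p_evidence

# Prior P(Fg) = 0.001, P(C|Fg) = 0.5, P(C) = 0.1
posterior = bayes_update(prior=0.001, p_evidence_given_h=0.5, p_evidence=0.1)
print(f"{posterior:.3f}")  # 0.005
```

Note that the evidence shifted the belief by a factor of five, yet the hypothesis remains wildly improbable, which is exactly the "extraordinary claims" intuition made quantitative.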
Right.. but when you find correlation, you just see that two things are obviously related in some way - there is a common factor. It could be direct causation, or it could be any number of other things.
The question is whether or not you have a theory that can be analyzed to see if this is a CAUSE or not... and we can't test every correlated thing out there exhaustively.
Plausibility is a measure based on less than exact knowledge of the world. I indeed don't know with certainty the relationship between Greek debt and Facebook value.
Not all less-than-perfect knowledge of the world is subjective however - subjective involves personal prejudice but there's more than personal prejudice operating in the judgement that Facebook prices won't affect Greek Debt - there's an understanding of finance, a doubt about action at a distance, etc.
Also, if you read my post, I'm not "disregarding" anything. I am simply saying that some things have a higher bar than others.
No amount of correlation seems to imply causation. Suppose there is some event X that always causes two other events, Y and Z. Let's also say that Y and Z are exclusively caused by X. Then Y has a 100% perfect correlation with Z, but ex hypothesi it is not caused by it.
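A tiny simulation makes the common-cause point concrete: X fires at random, and Y and Z each occur exactly when X does. Their correlation is perfect, yet by construction neither causes the other.

```python
import random

random.seed(1)

# X is the common cause; Y and Z are exclusively caused by X.
x = [random.random() < 0.5 for _ in range(10_000)]
y = list(x)
z = list(x)

def corr(a, b):
    """Pearson correlation of two equal-length numeric (here boolean) series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n
    va = sum((ai - ma) ** 2 for ai in a) / n
    vb = sum((bi - mb) ** 2 for bi in b) / n
    return cov / (va * vb) ** 0.5

c = corr(y, z)
print(f"{c:.3f}")  # 1.000, yet Y does not cause Z
```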
It is dangerous to assume causality from any data alone. (Data and statistics are overrated nowadays.) You need to do the harder work of discovering the proper mathematical model (equation) explicitly relating the dependent (caused) variables to the independent (causing) variables. In the absence of such a verified and proven model, you just cannot take the shortcut of pulling causality out of statistics, like a rabbit out of a hat.
Incidentally, the "Simpson's paradox" to which so much attention is given here is a trivial illustration of the fact that you cannot meaningfully add percentages taken over differing amounts. Something every school kid ought to learn.
But why is it dangerous, on balance, to make assumptions of causality from data and statistics alone?
Animals, such as rats and ravens, face this problem all the time, and yet they can meaningfully affect the world in ways that imply causal understanding, and a sensitivity to the difference between mere correlation and a correlation with causal potential.
Humans do the same as well, naive people who have never learned about experimental design, or have never learned the concept of correlation, also make useful judgments on the causal model behind ordinary problems and events.
How did these machines make actionable judgments on causality with nothing more than noisy inputs to their sensory systems? Through what technique did they discern the difference between mere correlation, and a correlation with exploitable causality?
> naive people who have never learned about experimental design, or have never learned the concept of correlation, also make useful judgments on the causal model behind ordinary problems and events.
They also are often... racist. Or hold whatever other stereotypes to heart. Racism is just a good example of an extreme position to hold which is often due to assumptions of causality.
"Lots of minorities are in prison. There is a high correlation between being a minority and being in prison. Therefore being a minority leads to being a criminal."
This completely ignores external reasons why minorities might end up in prison more often than others. For instance, it could be that minorities have an equal amount of criminal activity as the general population, but are more likely to end up in prison because of it. Correlation does not imply causation.
I think the number of social issues that arise due to assumptions of causation is quite high, actually, and often leads to poor decision making in policy. That is why it is "dangerous."
It is probably the ability to make the abstract jump from the data and its correlations to generalised laws that marks the chief difference between people on one hand and animals and machines on the other. Correlation is only useful up to a point. It shows that there may be some relationship between the variables but it says nothing about its nature. Was X caused by Y or was Y caused by X or were X,Y both caused by some unknown Z, was it all just an accidental data sample, was it significant, with how much doubt? Does the assigned significance in fact rely on an unwarranted assumption of some underlying population distribution? Just too many questions and no answers.
Besides, causality is problematic enough (see non-aristotelian philosophies and/or quantum mechanics) even without trying to demonstrate it with statistics.
> But why is it dangerous, on balance, to make assumptions of causality from data and statistics alone?
Well it all depends on the actions you take based upon those assumptions. If the action is low risk, low cost then it may be the wise choice. You need to remain aware of the uncertainty and that you are basically guessing until a better understanding is achieved.
One of the dangers is that initial uncertainty is forgotten and wrong information becomes "common knowledge".
Another danger is where there actually is causation but it runs the opposite way to that assumed. For example, if a chemical substance is a useful form of self-medication for sufferers of a condition, there may be a correlation between use of the chemical and the condition, but banning/withdrawing/warning about the chemical would actually worsen the situation.
"Through what technique did they discern the difference between mere correlation, and a correlation with exploitable causality?"
Evolution - that is, assumptions that are accurate are favoured since they are more likely to lead to the animal surviving, vs embracing spurious correlations which are likely to get you killed.
Of course, this can break down if we try to apply the cognitive rules of thumb we've evolved to new domains outside our original evolutionary scope - our difficulties in thinking about statistics and probability are a great example of this. The math is trivial, but it just doesn't fit our brains very well without a great deal of cultural scaffolding.
I think it helps a lot if you have a plausible explanation for the chain of causes. For example, the correlation of dead people in houses with shitty stoves and furnaces in those houses is interesting, but once you discover that said stoves and furnaces leak carbon monoxide, AND that carbon monoxide binds to hemoglobin with much greater affinity than oxygen, AND that the affected people don't receive any specific warning symptoms when that happens, you suddenly have a reason to stop wondering whether or not they were committing suicide at a heightened rate because of poverty-related depression.
It depends where the data comes from. The gold standard in science is a controlled, randomized experiment - but obviously that standard is not always attainable.
This example is often cited for Simpson's Paradox, but because the South headcount for Republicans (10) and Democrats (94) had such a wide disparity and the Republican headcount was so low, it always seemed more like twisting the data to fit a political narrative than to make a statistical argument. Like other statistical methods, there should be a minimum population size for the paradox to be meaningful.
Had the sample sizes been larger, all would be valid for statistical analysis. That said, the paradox isn't really dependent on sample size; it is not so much a statistical test as a trend that can be observed when comparing subgroups with overall averages.
In the above example open surgery was better for both small and large kidney stones and is a prime example of Simpson's paradox. However, one should note that Simpson's paradox can also be subject to further confounding variables. Perhaps patients that were suitable for open surgery were in better medical condition and therefore the open surgery sample was non-random and caused the higher success rate.
The presence of Simpson's paradox, similar to correlation, also does not imply causation.
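The kidney-stone example can be checked with the success counts commonly quoted for the Charig et al. (1986) study (reproduced here from memory, so treat the exact figures as illustrative): open surgery wins in both subgroups yet loses overall.

```python
# (successes, total) per stone size; figures as commonly quoted for Charig et al.
open_surgery = {"small": (81, 87), "large": (192, 263)}
percutaneous = {"small": (234, 270), "large": (55, 80)}

def rate(successes, total):
    return successes / total

# Open surgery has the higher success rate within each subgroup...
for size in ("small", "large"):
    assert rate(*open_surgery[size]) > rate(*percutaneous[size])

# ...but the lower success rate in aggregate, because it was given
# the harder (large-stone) cases far more often.
open_total = tuple(map(sum, zip(*open_surgery.values())))
perc_total = tuple(map(sum, zip(*percutaneous.values())))
print(f"open overall:         {rate(*open_total):.0%}")  # 78%
print(f"percutaneous overall: {rate(*perc_total):.0%}")  # 83%, reversal
```

The reversal comes entirely from the unbalanced assignment of stone sizes, which is precisely the confounding the parent comment describes.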
And contrary to the OP, these results don't demonstrate a reversal of the predictors (i.e., that actually being a Democrat predicts voting for the Civil Rights Act and vice versa), but rather that we're possibly using the wrong predictors.
That is, rather than demonstrating that there is a different correlation between party and vote, it demonstrates that there is a (stronger) correlation between geography and vote.
Yes, but then the nodes in the graph become even more disconnected from the concretes in reality that they are supposed to represent.
I am actually highly skeptical of the entire notion of the article, but I haven't had time to finish the article yet. My skepticism comes from the fact that causal relationships can be explained by identifying and understanding the causal factors at play and drawing relevant conclusions. For example, smoking causes lung cancer because exposure to toxic chemicals causes genetic mutations. Existing mathematics (such as basic statistics, including correlation) can be used as additional empirical evidence to support such an explanation.
(1) A clear mechanism. Data: my car won't run. Cause: the universal joint at the differential for the rear wheels failed, leaving the rear end of the drive shaft on the ground.

(2) A solid scientific theory. Data: I let go of the 2 x 4, and it hit my foot and hurt. Cause: Newton's law of gravity.

(3) Other. Data: there is a correlation between smoking and lung cancer. Cause: guess that there are some chemicals in cigarette smoke that cause lung cancer. Without actually finding the chemicals, basically test the heck out of the connection, i.e., look for and reject (statistically, as in a hypothesis test) other candidate causes; see if cigarette smoke does cause mutations (causing mutations is easier to test for than causing cancer, and nearly every chemical that causes cancer also causes mutations, so if a chemical doesn't cause mutations, then it likely doesn't cause cancer, and if the chemicals in cigarette smoke do cause mutations, then we can't reject that they cause cancer and have to keep entertaining the possibility and keep testing); reject spurious correlations; do some more tests that might reject causality and observe that they do not; look for other causes; work hard; get tired; give up; and finally conclude that we have done enough work and it's time to stop smoking.
I responded directly to the question in the title of the post here at HN:

> If correlation doesn’t imply causation, then what does?

not directly to the article or the discussion here on HN. I don't think that the article makes much sense.

I gave a very simple answer, (1) and (2). For (3), that's a mess and closer to the discussion.

My view is that for causality, what I gave with (1) and (2), simple, childishly simple, dirt simple, is, unfortunately, in reality, about all there is to the poor, struggling subject.

Put more respectfully, beyond my simplistic (1) and (2), working with causality is super difficult and quite unpromising, as in mostly just f'get about it.

For my (3), it boils down to ways to reject causality, and then we accept causality once we get so tired of trying to reject it that we just give up and accept it. Why? Because without something detailed and mechanical, say, from chemistry, biochemistry, and the forbiddingly complicated, detailed biochemistry of cells, we are missing anything very solid to call causal. E.g., in my (1), with the universal joint that failed, we have a simple explanation that makes a solid cause; getting something so simple and solid for a cause of cancer from smoking will be super difficult. Don't worry, I don't smoke, but neither do I claim really to know a cause of cancer.

Causality is a great, intuitive idea for humans and animals, but looked at in detail it's tough to establish in all but some narrow situations.

For causal networks, path analysis, Markov random fields, directed acyclic graphs, lots of diagrams with circles and arrows, f'get about it. For getting causality out of data analysis, mostly just a fool's errand -- f'get about it.

There is a significant reason I concentrated on (1) something mechanical and (2) something from classic physics -- those two darned near cover what can be done with causality. For the biological sciences, causality is really important but really tough. For the social sciences, they try and try, as my wife did in her Ph.D. in essentially mathematical sociology and my brother did in his Ph.D. in political science, but, net, watching my wife and brother struggle to make causality work in social science, where it's so easy to make it with (1) and (2), I just said f'get about it.

You can entertain my views and my first very short post as a contribution to the discussion based on a lot of background and a claim that more is a fool's errand, or just chalk it up to my ignorance.

One more point: I didn't even mention correlation. Why? Because correlation is so far from causality that it's hardly worth even mentioning.
Yes and no. The "imply" of "correlation does not imply causation" is a technical term from logic meaning "to be a sufficient circumstance." It does not mean "to suggest".
As described in the article, "correlation does not imply causation" would hold true even if you take "imply" to mean "is weak evidence for". This is due to Simpson's paradox[1], which says that correlation can be inverted when you take into account an additional distinguishing factor in your data.
A specific example. Look at [2] and take the "y" axis to be "cigarettes per day" and the "x" axis to be "life expectancy in years". The overall trend is that more cigarettes leads to a shorter life. However, if we find that the blue group are people who do not inhale, and the red group are people who do inhale, the correlation is reversed for all smokers -- smoking more leads to a longer life.
Actually, that is precisely why you should form a hypothesis based on correlation. A properly conducted experiment should elucidate any causative link between the elements, and its direction.
Hypotheses are cheap. Create lots of them. Being 'wrong' about a hypothesis is awesome. It's the path of Reason.
(Yes, I woke up this morning wanting to capitalize terms for definitional emphasis. Sorry.)
Usually, you have some additional information, aside from the correlation itself. For example, instead of just knowing "some variable X correlates with some variable Y", you may know what X and Y actually are, and some facts about similar entities.
The problem is in distinguishing Logic from biased Rhetoric. p → q is often an inappropriate conclusion based on correlation of two vectors. The common counterexample is (r → p) ∧ (r → q).
Rhetorically, "correlation (does not imply|is not) causation" is often a red herring to dismiss a hypothesis that warrants further investigation, as correlation is essential for establishing causation.
Here is my understanding, from a non-expert. The obviousness is supposed to come from the fact that if `a` causes `b`, `b` cannot have caused `a`, because that would require some form of time travel.
`b` could cause `a2` though which is very similar to `a` but is separated in time from `a`.
Yeah that makes sense. But then he uses the model wrongly, since he puts "hidden factor" "smoking" "lung cancer" in the vertices, and not "hidden factor (t=0)", "smoking(t=1)" and "lung cancer(t=2)". Further if used like that, then a possible hidden factor could actually be "lung cancer(t=0)", since it could (conceivably) cause both "smoking(t=1)" and "lung cancer(t=2)".
> But then he uses the model wrongly, since he puts "hidden factor" "smoking" "lung cancer" in the vertices, and not "hidden factor (t=0)", "smoking(t=1)" and "lung cancer(t=2)".
The direction of the arrows shows causation, which also implies time has passed, though the amount of time is not specified in the model he is using (it could be added). He has a section later in the article about why he left it absent earlier, and he covers how it could be included.
> Further if used like that, then a possible hidden factor could actually be "lung cancer(t=0)", since it could (conceivably) cause both "smoking(t=1)" and "lung cancer(t=2)".
If lung cancer actually caused smoking, then yes, that would be a simple solution. If that were the answer, though, the very next question would be what caused "lung cancer(t=0)", for which "smoking(t=-1)", if it existed, would be a suspect, and of course another "hidden factor" (or several) as well.
I read chapters of it a few years back too, but I've forgotten the details. All in all, I was pretty convinced of the soundness of the approach at the time.
My recollection is that this framework of causal inference allows one to ask questions about a probabilistic model that one can then try and measure to test causality.
These questions are interventions or assertions (the do operators) that something happened.
So one would start out with a Graphical Model like in the smoking example which defines a probability model, and then make do assertions on the model for various candidate causes and see what that would imply about the change in probabilities and then design an experiment to measure them and confirm them.
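For a discrete model with a single observed confounder Z (Z -> X, Z -> Y, X -> Y), the do-operator reduces to Pearl's backdoor adjustment formula, P(Y | do(X=x)) = Σ_z P(Y | x, z) P(z). A minimal sketch, with all probabilities invented purely for illustration, showing how intervening differs from merely observing:

```python
# Toy model: Z -> X, Z -> Y, X -> Y, with made-up probability tables.
p_z = {0: 0.7, 1: 0.3}                      # P(Z=z)
p_x_given_z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.3,   # P(Y=1 | X=x, Z=z)
                (1, 0): 0.4, (1, 1): 0.6}

def p_y_do_x(x):
    """Interventional P(Y=1 | do(X=x)) via the backdoor adjustment formula."""
    return sum(p_y_given_xz[(x, z)] * p_z[z] for z in p_z)

def p_y_given_x(x):
    """Observational P(Y=1 | X=x), which remains confounded by Z."""
    p_x = sum((p_x_given_z[z] if x else 1 - p_x_given_z[z]) * p_z[z] for z in p_z)
    return sum(
        p_y_given_xz[(x, z)]
        * (p_x_given_z[z] if x else 1 - p_x_given_z[z]) * p_z[z] / p_x
        for z in p_z
    )

print(f"do(X=1):     {p_y_do_x(1):.3f}")    # 0.460
print(f"observe X=1: {p_y_given_x(1):.3f}")  # 0.526
```

The observational figure is inflated because X=1 is itself evidence of Z=1, which independently raises Y; the intervention severs that backdoor path, which is exactly what a randomized experiment does physically.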
I thought it was a good approach as well, but I have a hard time fitting it into the math I know. I've been learning category theory so I may take another stab at placing it. But this all takes time.
And the book's layout is fairly disjointed.
I may just begin experimenting with the theory, rather than placing it, my life is short.
Ref list: http://vserver1.cscs.lsa.umich.edu/~crshalizi/notebooks/caus... Chapters: http://www.stat.cmu.edu/~cshalizi/uADA/13/lectures/ch22.pdf http://www.stat.cmu.edu/~cshalizi/uADA/13/lectures/ch23.pdf http://www.stat.cmu.edu/~cshalizi/uADA/13/lectures/ch24.pdf