Editor's Note: In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, explains you how to show a mathematician that he is wrong.
Here's the unedited transcription:
Hi - I'd like to talk to you about a problem. It's a problem I've had. And the problem is how do you have a techie talk to a non-techie? Often this can be a very difficult conversation, because the frames of reference are so far apart. And in fact, this is based on a true story. Where arguably the best mathematician in the world at the time, Paul Erdős, was given a very, very simple puzzle, and he got the wrong answer. So here's the problem, suppose you were face to face with this guy that's very very emphatic, very very strong willed, he's got the wrong answer. How do you one: figure out what the right answer is. And two: how do you explain to him that he's got the wrong answer? How do you do it in such a way that you're on firm ground, whereas he is the best mathematician in the world, and it's a mathematical question.
So I'm going to talk today about how random numbers can simplify many kinds of computation to the point where you, or somebody can, and indeed they did convince this man that he'd had the wrong answer initially. He immediately realized what the problem was, but there had to be a common ground to talk with. And this works in reverse too, if you're on the technical side as I tend to be, speaking in mathematics doesn't convey the thoughts you want to a large fraction of the audience. And so what you'd have to do is have a more intuitive way to express some of these probabilistic mathematical concepts. And so this is common ground that goes both ways. It's how to tell the best mathematician in the world that he's wrong.
So the problem here is a classic one. It's a puzzle that's been posed, it was the fervor of the internet ten, fifteen years ago, and the question is the Monty Hall puzzle. You have door number one, two, and three. Behind one of the doors is a goat, presumably you don't want the goat, and behind one of the doors is a fabulous prize. Fabulous, you know the untold wealth's of the orient, things like that. You really really want what's behind that door, but you don't know which door it is. Now the way the game is played is you pick a door, two, that's the door we'll pick. And then the host, the original host was Monty Hall, he would show you that behind another door- so we're going to pick two. That's ours, and Monty now shows us that door number one is the goat. Our question then, should we stay with door number two, or should we switch?
Now of course there's people shouting at you in the game, and Monty Hall is offering you money to do what you don't want to do, so it's very difficult. It's very difficult under stress. But mathematically, we should be able to figure out the probabilities that given that we pick a door, that Monty will always show us the goat, should we stay, or should we switch? That's the question. Many people say 50/50, because they see two doors now, and they ignore the information they've had in the past, other people say you should always switch. Now, even if you do the mathematics very carefully and you're good at it, there's some chance you'll get it wrong. My case, I have to do it four or five times to get it solidly right. Because I have a high probability of error when I just jump and try to do something. Ultimately I get a high probability of correctness, but there's always that nibbling doubt. So here's an alternative way to solve the problem, that uses random numbers.
What we're going to do is first we will do thousands and thousands of samples that represent where the prize might be, we will pick a door that'll be our choice, and then we'll pick Monty's door, which is not ours and is not the prize. And then we will figure out two strategies, stand or switch. Stand will be computed by how often the door we pick equals the prize door. And switch is the frequent- the how often will- we'll figure out when our switching strategy which is to switch to the door that Monty did not pick, and we did not pick. That's the switching strategy. And when the switching strategy equals the prize, we can compute the means of these two things. And that will tell us how often the two strategies win or lose. We can do this ten or a hundred times and come up with very precise estimates. These are only estimates, but the confidence bounds on these estimates will be so tight, that we can be sure of which of the two answers, 50/50, or 2/3, 1/3, which of those answers is correct.
And here's a quick bit of live coding to show how this works. We start by sampling the prize. We do that by sampling the from the uniform interval, expanding it to zero-two-three, open ended on the right, and then taking the ceiling of that, that gets us a number one, two, or three, one hundred thousand times. If we summarize that, we get ones twos and threes roughly in equal proportions. Those are the prizes. So that's what we call it, a prize. Next, we will pick a door. Now, we could pick any door we like. We could pick it exactly the way we picked the prize, or we could pick door number one always. Since the problem is completely symmetrical here, there is no information whatsoever, let's just start by picking door number one.
You might like to replicate this, picking a different door, or different door at random. So our strategy, our pick, is door number one ten thousand or a hundred thousand times. Now Monty is going to pick a door, he's going to pick a door that is not the prize, that is not the door we picked. And so we first figure out the number of times the places where the prize is door number two, and we'll pick a three there, and the numbers of times where the places where the prize is door number three, and we'll pick a two there. Since we always picked a one, Monty will always pick something we didn't pick and always pick something that is not the prize. So that's Monty's. And no we need to compute two strategies variables. One is stand, that's our original thing, so stand equals original. And then switch, we need to compute a number which is not our original, and is not the one Monty picked. That will be the switching strategy, that's the door we pick under that.
And now we pick two means, the mean of stand equals the prize, and the mean of switch equals the prize. We do that here, and you can see that switching has a probability of winning, about two thirds. This is a very very concrete simulated way of solving the problem. It's concrete enough that it only involves a few steps, and the steps are highly non-controversial. We understand how they happen, they're very very procedural, very very similar to the rules of the game. They are much simpler to explain to people, technical or non-technical, than the mathematics that we would have to go through to try to compute a bayesian or a frequentest approach to this. And as such, this is the approach that convinced Paul Erdős that he had the wrong answer initially. And this is a very very handy way to either make a point with technical people if you're not very technical, or vice-versa, if you are technical to make a point to people who are not.
It's a common ground, people can understand simulations, they can see what's happening, and you can check your math. It's really a cool idea. We've done something very very similar to this. We simulated the probability of failure of a cluster. Disks fail, machines fail, we know probabilities of these things, we know how often they occur, we can model different lots of disks, having different frequencies, but what does that mean in terms of data loss? What is the probability of or the mean time to data loss with different assumptions about data transfer rates, disk sizes, machine sizes, and so on. Or you can do a back of the envelope sort of computation quickly, but it's very very hard to understand what the impact of all of these strategy discussions, like: how much bandwidth do you allot to recover data? Do you have a grace period before which you do not replicate data that is partially lost?
Well, all of these things can be computed in simulation relatively easily. And so you can estimate what the probabilities are, and you can pick strategies that are good, or bad, and more importantly you can make the case and get consensus across a group as to what are good strategies. So here you go, this is how you tell the best mathematician in the world that he is wrong, or right, and how you make peace with that. Thanks very much!
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.