Metadata
- Author: Ben Recht
- Full Title: Bit Prediction
- URL: https://www.argmin.net/p/bit-prediction
Highlights
- The prototypical statistical prediction problem is bit prediction: Having seen a long list of 0s and 1s, your task is to predict whether the next bit equals zero or one. (View Highlight)
- We reduce complex classification problems to predicting bits. So how do we evaluate methods for bit prediction? Why do we use the method of averaging in reference classes? (View Highlight)
- Evaluation measures the average difference between our articulated expectations of a system and its actual performance. For bit prediction, I’ll let the distance between expectations and performance be 0 if we get the answer right and 1 if we get the answer wrong. (View Highlight)
- Let me make our lives a bit easier today and say that we don’t have to predict a bit, but instead a number between zero and one. This way, if we predict 0.94 and the event happens, our score is 0.004, which is much smaller than if we predicted 0.5. In some way, this real-valued prediction quantifies our confidence in future predictions. (View Highlight)
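One way to see where the 0.004 comes from is to read the score as the squared error between the real-valued prediction and the realized bit. The post doesn't name the loss, so squared error is an assumption here, but it matches the numbers in the highlight:

```python
def squared_error_score(prediction: float, outcome: int) -> float:
    """Squared distance between a probability-like prediction and the realized bit."""
    return (outcome - prediction) ** 2

# Predicting 0.94 when the event happens (outcome = 1) scores (1 - 0.94)^2 = 0.0036,
# which the highlight rounds to 0.004. Predicting 0.5 scores 0.25, much worse.
print(squared_error_score(0.94, 1))  # ~0.0036
print(squared_error_score(0.5, 1))   # 0.25
```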
- Predicting confidence makes our lives easier. We could say we’re predicting the probability of the event happening. This is nicer than having to predict an actual bit because we can claim to never be wrong. As long as we predict something strictly greater than 0 and strictly less than 1, we are saying there’s a chance of both outcomes. (View Highlight)
- Following a legacy started by Ronald Fisher and pushed heavily in engineering applications by Wiener and Shannon, the first two models assume that all data is random. In this case, personal expectations become mathematical, probabilistic expectations. (View Highlight)
- There are two popular random models of bits:
  • i.i.d. That is, the sequence of bits is independent and identically distributed. In this model, every outcome at every time is equivalent to flipping a biased coin. Each coin flip is completely uninfluenced by every other coin flip. And the mechanism that generates the coin flip is the same each time. Data is generated by repeating the identical mechanism over and over and over again.
  • Exchangeable. i.i.d. is a pretty strong assumption! Exchangeable is the model that makes Bayesians feel better about their methodologies. In the exchangeable model, there is a random process that generates a sequence of bits, and the probability of any sequence is equal to the probability of any reordering of that sequence. That is, the probability of seeing (A, B, C) is the same as seeing (C, B, A) as seeing (B, A, C). Exchangeability asserts that the order of observations doesn’t matter. (View Highlight)
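A small simulation can make the two random models concrete. In the sketch below, the i.i.d. stream repeats one fixed biased coin, while the exchangeable stream follows the standard de Finetti-style construction: draw the bias once from a prior, then flip i.i.d. given that bias. The bias 0.7 and the Beta(2, 2) prior are illustrative choices, not from the post:

```python
import random

def iid_bits(n: int, p: float = 0.7) -> list[int]:
    """i.i.d. model: every bit is an independent flip of the same biased coin."""
    return [int(random.random() < p) for _ in range(n)]

def exchangeable_bits(n: int) -> list[int]:
    """Exchangeable model (de Finetti style): draw the coin's bias once from a prior,
    then flip i.i.d. given that bias. Reordering a sequence doesn't change its probability."""
    p = random.betavariate(2, 2)  # illustrative prior on the bias
    return [int(random.random() < p) for _ in range(n)]

print(iid_bits(20))
print(exchangeable_bits(20))
```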
- Let me add a nonrandom model to the pile:
  • Arbitrary sequence. Assume our observations are a sequence of 0s and 1s fatefully determined before we start measuring them. Our predictions do not influence their outcome. The bits are already set in stone before we start predicting. For people who work in causal inference, this is similar to the assumptions in the Neyman-Rubin model of potential outcomes. Our interactions don’t change the data. They simply reveal what we haven’t seen. In prediction problems, the data that is missing is the future. (View Highlight)
- These three models give us very similar predictions about bit prediction. They all tell us that predicting the rate in the reference class is about as good as you can do. The i.i.d. model gives the easiest calculations, but all three models for bit streams lead us to the same prediction procedures. (View Highlight)
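To make "predicting the rate in the reference class" concrete, here is a rough sketch of that procedure on an i.i.d. stream: at each step, predict the running frequency of 1s seen so far and score it with squared error. The loss, the bias of 0.7, and the initial guess of 0.5 are my assumptions, not the post's:

```python
import random

def running_rate_predictor(bits: list[int]) -> float:
    """At each step, predict the fraction of 1s seen so far (0.5 before any data),
    and return the average squared-error score against the realized bits."""
    ones, total_score = 0, 0.0
    for t, bit in enumerate(bits):
        prediction = ones / t if t > 0 else 0.5
        total_score += (bit - prediction) ** 2
        ones += bit
    return total_score / len(bits)

# An i.i.d. stream with bias 0.7; the average score lands near p*(1-p) = 0.21.
bits = [int(random.random() < 0.7) for _ in range(10_000)]
print(running_rate_predictor(bits))
```

On a long stream the average score settles near p(1 - p) = 0.21, the same score you would get by always predicting the true rate, which is one way to read the claim that predicting the rate in the reference class is about as good as you can do.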
- These models converge on the same answer because all three transform prediction problems into missing data problems. They assume that our sequence of bits has been chosen by fate, both in the past and the future. (View Highlight)