Thoughts on: The Signal and the Noise, the art and science of prediction by Nate Silver


Nate Silver has recently lost his sheen, having been a consistent Trump doubter throughout the Republican primaries – his piece ‘Donald Trump’s Six Stages of Doom’ felt like a ray of hope last year, predicting that Trump lacked the breadth of support to continue his rise as the Republican field narrowed. FiveThirtyEight remained highly sceptical of Trump’s ability to win the nomination until far too late, by which point the nomination had effectively been secured. In a return to form, Nate Silver has looked at his failed predictions more closely in ‘How I Acted Like a Pundit and Screwed Up on Donald Trump’ – maybe next time he will be a little less wrong.

The Signal and the Noise has been a strange book to read twice – the first read was very easy, the second dragged. Large passages that flow well on a first reading feel frustratingly shallow when read again. However, much of the book is excellent and shows a deep knowledge not only of statistics and the modelling of systems but also of the nuances of his subjects, in particular baseball, poker and American politics. I would give the chapters on climate change, flu and terrorism a miss.

The real strength of the book is its accessible introduction to some of the language around forecasting, with examples of its use. Some of these concepts include:

Forecast: a probabilistic statement, usually over a longer time scale, e.g. there is a 60 percent chance of an earthquake in Southern California over the next thirty years. There are many things that are very hard to predict but fairly easy to forecast once longer term data is available. An upgrade to a forecast is a ‘time dependent forecast’ – where the probability of an event occurring varies over time according to some variables, for example the forecast for a thunderstorm in London over the next week shows a higher probability when it is hot and humid.

Prediction: a definitive and specific statement about when and where something will happen, e.g. there will be a riot next Friday afternoon in Lullingstone – a good prediction is testable. Hypotheses live by the quality of their predictions.

The signal: the signal indicates the true underlying relationship between variables, in its ultimate form it is Laplace’s demon, where given the position and velocity of every particle in existence and enough calculation power, past, present and future are all the same to you.

The noise: the data that obscures the signal; it has a wide variety of sources.

Overfitting: finding a model (often extremely well calibrated against the past) that fits the noise rather than the signal (the underlying relationship). An overfitted model ‘predicts’ the past well but the future extremely poorly – it takes the wrong things into account and fails to find the parts of the model that matter. Superstitions are often examples of overfitting – finding false patterns in the shadows. The most famous example is probably the Super Bowl stock market indicator. With the growth of available data and computing power it is increasingly easy to create a falsely convincing overfitted model by searching millions of potential correlations, many of which will appear to be significant purely by chance. To avoid overfitting, theory can help, showing which variables might be the most useful in any given model. I am fairly certain I have been guilty of overfitting in various pieces of work in the past.
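
A minimal sketch of the idea (not from the book – the data and the degree-9 polynomial are my own invention): the flexible model fits the noisy training sample almost perfectly, but does worse than the simple model on data it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """True signal is y = 2x; everything else is noise."""
    x = rng.uniform(0, 1, n)
    y = 2 * x + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)      # small, noisy sample to fit against
x_test, y_test = make_data(1000)      # 'the future' the model never saw

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)                      # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-9 fit ‘predicts’ the training data better and the test data worse – it has memorised the noise.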

Calibration of a model: a well calibrated model gives accurate probabilities of events occurring across the range it predicts – events it predicts with X% likelihood of occurring actually occur X% of the time over the long term.
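
A sketch of what a calibration check might look like, using simulated forecasts and outcomes (the numbers are made up; for a well calibrated forecaster each bucket’s predicted and observed values line up):

```python
import numpy as np

rng = np.random.default_rng(1)

forecast_prob = rng.uniform(0, 1, 10_000)              # stated probabilities of an event
outcomes = rng.uniform(0, 1, 10_000) < forecast_prob   # simulate events occurring at exactly the stated rate

# Bucket the forecasts and compare the stated probability with the observed frequency.
bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (forecast_prob >= lo) & (forecast_prob < hi)
    print(f"forecast {lo:.1f}-{hi:.1f}: predicted {forecast_prob[mask].mean():.2f}, "
          f"observed {outcomes[mask].mean():.2f}")
```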

Wet bias: when the calibration of a model is deliberately poor, for example to provide more positive surprises than negative surprises, so the forecast is more pessimistic than would be accurate. This is taken from weather forecasting, where rain is given a higher than accurate probability as the public notice a false negative (i.e. low probability of rain when there actually is rain) more than a false positive.

Discrimination of a model: a model is capable of discriminating between different events to give different probabilities – a model that does not discriminate can still be very well calibrated, but is not useful. Saying 1/6th of rolls of a fair die will come out as a 6 is well calibrated, but not very useful if you need an advantage over someone in a game.

Goodhart’s law: the predictive value of an economic variable decreases when it is measured and targeted, in particular by government policy

Bayesian prior, posterior probability and Bayes’ theorem: a simple identity that helps to update the probability of something being true given a prior belief that it is true, the probability of observing the evidence seen if the belief is true, and the probability of observing the evidence seen if the belief is false. The outcome (the posterior probability) can then be used as the prior for the next piece of data, to update the probability of something being true again.

Derivation of Bayes theorem:

P(A|B)P(B) = P(B|A)P(A)

P(Rain|Humid)P(Humid) = P(Humid|Rain)P(Rain)

Then divide by P(B), or P(Humid); this gives Bayes’ theorem, and allows us to use it as a tool for updating our ‘degree of belief’ when provided with new evidence:

P(Rain|Humid) = P(Rain) * (P(Humid|Rain)/P(Humid))

Posterior = Prior * Bayes factor

The same can be applied to continuous distributions.
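
A numeric sketch of the rain/humidity update above – the probabilities are invented for illustration, and the bayes_update helper is my own naming:

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return the posterior probability after one new piece of evidence."""
    # Total probability of seeing the evidence, P(Humid) in the example above.
    p_evidence = p_evidence_if_true * prior + p_evidence_if_false * (1 - prior)
    # Posterior = prior * Bayes factor.
    return prior * p_evidence_if_true / p_evidence

p_rain = 0.20                # prior: P(Rain), before seeing any evidence
p_humid_if_rain = 0.90       # P(Humid | Rain)
p_humid_if_no_rain = 0.30    # P(Humid | no Rain)

posterior = bayes_update(p_rain, p_humid_if_rain, p_humid_if_no_rain)
print(f"P(Rain | Humid) = {posterior:.2f}")   # roughly 0.43 with these numbers

# The posterior becomes the prior when the next piece of evidence arrives.
posterior_2 = bayes_update(posterior, p_humid_if_rain, p_humid_if_no_rain)
```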


Frequentism: the interpretation of probability as the limiting proportion of times an event occurs as the number of opportunities for it to occur tends to infinity.

The role of assumptions in models: often the assumptions made in a model dominate its behavior – to get a sense of their importance, try breaking the assumptions and checking for changes in the output. To get a sense of your model’s sensitivity, make small changes to the inputs, run it many times and plot the range of outputs as a probability distribution.
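
A sketch of that kind of sensitivity check – the toy compound-growth model and the sizes of the perturbations are my own assumptions, not anything from the book:

```python
import numpy as np

rng = np.random.default_rng(2)

def model(growth_rate, initial_value, years=10):
    """Hypothetical toy model: compound growth over a fixed horizon."""
    return initial_value * (1 + growth_rate) ** years

# Perturb the inputs slightly, run the model many times,
# and look at the spread of outputs rather than a single point estimate.
runs = np.array([
    model(growth_rate=rng.normal(0.03, 0.01),   # assumed growth, with some uncertainty
          initial_value=rng.normal(100, 5))     # assumed starting point, with some uncertainty
    for _ in range(10_000)
])

print(f"median {np.median(runs):.0f}, "
      f"5th-95th percentile range {np.percentile(runs, 5):.0f} to {np.percentile(runs, 95):.0f}")
```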

Accuracy and precision: an accurate prediction falls, on average, on the correct value; a precise prediction has a small range of expected values.

Fitting the noise, correlation without causation: similar to overfitting

Nearest neighbor analysis: finding the results of similar individuals from the past, and using these to build a probabilistic view of the potential of the individual in question. Of those who have been here in the past, what actually happened to them?
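
A minimal sketch of the idea with invented data – standardised features for past individuals, their known outcomes, and a query for a new individual:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical historical records: one feature vector per past individual, plus what happened to them.
features = rng.normal(size=(500, 3))
outcomes = features @ np.array([0.5, 1.0, 0.8]) + rng.normal(0, 0.5, 500)

def nearest_neighbour_outcomes(query, k=20):
    """Return the actual outcomes of the k most similar past individuals."""
    distances = np.linalg.norm(features - query, axis=1)   # similarity measured as Euclidean distance
    neighbours = np.argsort(distances)[:k]
    return outcomes[neighbours]

similar = nearest_neighbour_outcomes(np.array([0.2, 1.1, -0.3]))
print(f"comparable cases: mean outcome {similar.mean():.2f}, spread {similar.std():.2f}")
```

Rather than a single prediction, the spread of those past outcomes gives a probabilistic view of what might happen.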

What makes a forecast good: accuracy, honesty, economic value. Accuracy is about whether the forecast is correct over the long term (is accuracy the same as calibration?); honesty is whether the model is the best the analyst could produce given the information available to them; economic value is whether the model allows those who use it to make better decisions, so it takes their biases into account.

What makes a forecast useful: beating the naive baselines of persistence and climatology. Persistence assumes a result will be the same as previously (i.e. the temperature today will be the same as yesterday); climatology (for a weather forecast) takes a slightly longer view, looking at the average behavior previously with only the most basic inputs (for example the day of the year).
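
A sketch of what those baselines look like in practice, using a simulated temperature series (the seasonal shape and noise level are invented):

```python
import numpy as np

rng = np.random.default_rng(4)

days = np.arange(3 * 365)
temps = 15 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 3, days.size)

persistence = temps[:-1]                                     # tomorrow will be like today
climatology = 15 + 10 * np.sin(2 * np.pi * days[1:] / 365)   # the long-run average for that day of year
actual = temps[1:]

for name, forecast in [("persistence", persistence), ("climatology", climatology)]:
    mae = np.mean(np.abs(forecast - actual))
    print(f"{name}: mean absolute error {mae:.2f} degrees")
```

A forecast that cannot beat both of these naive baselines is adding little.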

Results oriented thinking, process oriented thinking: acknowledging that over the short term your results may not be an indication of the quality of your predictions; this requires confidence that your process is sound, however.

Theory, correlations and causation: where many variables are available, theory is needed to help separate the relationships with true predictive value from those that are mere correlation and will not be of long term value.

Initial condition uncertainty, structural uncertainty and scenario uncertainty: when making a prediction about events, a lack of information about the current state of the system may dominate near term predictions – this is initial condition uncertainty. Scenario uncertainty dominates over the longer term, as the behavior of the system may change from our current model. At any time period there will be structural uncertainty – uncertainty about our current understanding of the system.

Log-log charts: make it much easier to observe relationships between variables that follow a power law, for example those that obey Zipf’s law. Log-log charts can reveal otherwise hard to see relationships around frequency of occurrence and rank – and sometimes reveal where the limits of the data might be: if the size of events and their frequency follow a straight line on such a plot, then it is not unimaginable that something an order of magnitude (or two) larger might occur in the future.
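
A sketch of the log-log view using simulated power-law data (the Pareto draw and the slope fit are my own illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(5)

# Pareto-distributed 'event sizes', ranked from largest to smallest.
sizes = np.sort(rng.pareto(1.0, 1000) + 1)[::-1]
ranks = np.arange(1, sizes.size + 1)

# On log-log axes a power law appears as a straight line; the fitted slope describes it.
slope, intercept = np.polyfit(np.log(ranks), np.log(sizes), 1)
print(f"fitted slope on log-log axes: {slope:.2f}")   # roughly -1 for Zipf-like data
# (matplotlib's plt.loglog(ranks, sizes) would show the same near-straight line visually.)
```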

