Small data – FiveThirtyEight and predicting the World Cup

A mug’s game

I should really begin with some sort of “full disclosure”. Despite being pretty obsessive about football – I have a season ticket to my home town club despite living a good couple of hours away by train, my favourite TV programme by miles is the Bundesliga highlights on ITV4 – and despite doing “stats” for a living, I am terrible at predicting football matches. I got beat in my World Cup Fantasy Football league by a child. I thought my team, Norwich, would finish comfortably mid-table last season, thanks to goals from our new striker, Dutch international Ricky van Wolfswinkel. We got relegated, and, with his one goal all season, Ricky van Wolfswinkel is no longer a Dutch international.

But that doesn’t mean I didn’t join in the schadenfreude around all those mistaken World Cup stats models when Germany battered Brazil last week. My favourite tweet was this

And I sent this one on my way home from the pub, where I appear to be criticising data viz legend Alberto Cairo (I wasn’t, but still, for shame. Also, for drink).

The main butt of the jokes appeared to be FiveThirtyEight, the data journalism site set up by Nate Silver. They gave Brazil a 65% chance of winning that game. That makes them definite favourites. (Again, full disclosure, I thought Brazil would win too, but we know about me now). Here’s one of many, many tweets.

(Note – that’s a different %age to the one I saw. Anyway)

Most bookies thought the odds were roughly even, with Germany possibly slight favourites. You could find pundits backing both sides, but most thought it would be close. So, in not predicting a 7-1 battering, FiveThirtyEight were far from alone, but they did rate Brazil’s chances higher than most.

So how did this high rating come about? FiveThirtyEight built a model estimating each side’s chances of winning the tournament before it kicked off, and adjusted those chances as the games went on. The high rating for Brazil comes, therefore, from this initial model. They have published the data behind their model, which is both honest and useful of them so we can have a poke around.

Brazil – everyone’s favourites

As a starting point, here’s a graph showing their pre tournament calculations of the probability of each country winning the World Cup. For comparison, we’ve also got the odds, supplied by FiveThirtyEight in their original post, from Betfair, a betting exchange.

There appears to be an outlier.

That 45% chance of Brazil winning the whole thing is pretty high. It’s a 32 team tournament and some of the others are pretty good, so saying Brazil were almost as likely as not to win seems pretty brave. It stands out. (Note that as a result of the Brazil figure being so high, the other favourites are necessarily low, so Betfair gives them a greater chance of winning than FiveThirtyEight does).

Yes, Brazil were the favourites with the bookies too. That’s fine, the favourites don’t always win. But the bookies had the chances of winning closer to 20%, half that of FiveThirtyEight. So why were FiveThirtyEight’s ratings so high? There could be a number of reasons, but there seems to be one big one – Home Field Advantage.
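To put that gap in betting terms, here is a minimal sketch in Python. It only uses the two rough figures above (45% from the model, about 20% from the bookies) and it ignores the bookmaker's margin, which is an assumption on my part. The function names are made up for illustration and have nothing to do with FiveThirtyEight's actual code.

```python
def fair_decimal_odds(probability: float) -> float:
    """Decimal odds at which a bet on an event with this probability breaks even."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability


def expected_profit(true_probability: float, decimal_odds: float) -> float:
    """Expected profit per unit staked, if true_probability is actually right."""
    return true_probability * decimal_odds - 1.0


# Rough figures from the text; the bookmaker's margin is ignored (assumption).
MODEL_PROBABILITY = 0.45   # FiveThirtyEight's pre-tournament figure for Brazil
MARKET_PROBABILITY = 0.20  # roughly what the bookies implied

model_odds = fair_decimal_odds(MODEL_PROBABILITY)
market_odds = fair_decimal_odds(MARKET_PROBABILITY)
edge = expected_profit(MODEL_PROBABILITY, market_odds)

print(f"Fair odds if the model is right: {model_odds:.2f}")
print(f"Odds the market was offering:    {market_odds:.2f}")
print(f"Expected profit per 1 staked:    {edge:.2f}")
```

If the model were right, backing Brazil at the market price would have been an enormously good bet. That is another way of saying the two estimates are a long way apart.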

There’s no place like etc

Sports teams tend to do better playing at home than away from home, and in football this is more pronounced than in other sports. Nate Silver, in his piece before the tournament, is quite clear that this gives Brazil a big advantage, and that’s the difference between his and other estimates.

One of the reasons teams do better at home than away is that refereeing decisions tend to go in their favour, and it wasn’t hard to spot examples of this in Brazil, from Fred’s hilarious penalty winning dive against Croatia to the non booking of Fernandinho for his continued assaults on James Rodriguez. With my small team supporter’s hat on, I would also point out that these are the kind of decisions that the big sides always get. That may also be irrelevant, but anyway.

Moreover, Brazil are (/ were) indisputably good at home. They hadn’t lost a competitive game at home since the 1970s. No European team had won in Brazil at all since John Barnes ran through the Brazilian defence in 1984. That’s a pretty good record.

But is it that amazing? Firstly, John Barnes’s goal is great (watch it again!) but Brazil rarely played friendlies at home until the run up to this year’s World Cup, preferring to jet off round the world for oodles of cash instead.

Likewise, competitive international matches are quite rare, so it’s not hard to go a long time without losing. England, rubbish England, who never, ever, ever learn, have only lost twice at home in competitive games since 2000, and both defeats cost the managers their jobs – Keegan’s resignation in the toilets and McClaren’s humiliation by umbrella.


Spain have only lost one competitive game at home since 2000, Argentina two (in normal time plus one on penalties).

Competitive home games in South America are less common, too. Unlike the European Championships, there is no qualification for the Copa America – the entire continent, plus the occasional guest, qualifies and the tournament takes place in one host nation. So if you’re not the host, there are no competitive home games. Brazil hasn’t hosted the Copa since 1989, when, obviously, they won, but that means that the only competitive home games they’ve played since then are World Cup qualifiers. As hosts, they didn’t have to qualify this time round, so that’s no competitive home games since 2009.

Small data

So we’re looking at quite a small dataset here and Brazil might not be quite as good at home as we think. This is, I think, a specific example of a general problem with modeling sports results for quadrennial tournaments – there just aren’t enough data points to go around. International teams don’t play very often, so models such as that used by FiveThirtyEight rely on club statistics. That seems fair enough, but the model only had good stats for a handful of European leagues and none at all for the Brazilian league. They likely underestimated quite how useless Fred was, for instance, since he doesn’t play in Europe.

It’s hard, then, using the data available to quantify home field advantage, and harder still in a tournament setting. There’s some evidence for it – England, France, South Korea and Japan all had their best World Cup results as hosts. There’s some evidence of no effect – no hosts have won the European Championships since France in 1984 and the last two sets of co-hosts have made almost no impact at all. In the Copa America, the hosts have won three out of the 11 tournaments in which there has been a host nation.
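To see how little three wins out of 11 can tell us, here is a rough sketch that puts a 95% Wilson score interval around that Copa America figure. It treats each hosted tournament as an independent trial with the same chance of a host win, which is a big simplifying assumption of mine and not something any of these models claims. The function name is just for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denominator = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    ) / denominator
    return centre - half_width, centre + half_width


# Figures from the text: hosts won 3 of the 11 Copa Americas that had a host.
low, high = wilson_interval(successes=3, trials=11)
print(f"Observed host win rate: {3 / 11:.0%}")
print(f"95% interval: {low:.0%} to {high:.0%}")
```

On these assumptions the interval runs from very roughly 10% to a bit under 60%. With so few tournaments the data can't do much to pin the host's chances down.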

If you wanted to play pop sports psychologist, and we all do, you could make a case for home field advantage being weaker in a tournament setting. In a league, and in the continental level knock out club competitions, teams play each other home and away. You can, to a certain extent, cede some advantage as the away side if you know that you’ve got the opportunity to redress the balance back at your place (or, even more so, if you bring such an advantage to the second leg). In a tournament, though, there are no second chances, and we saw Croatia, Chile and Colombia, if not so much Mexico and Cameroon, really take the game to Brazil.

In order to get a handle on the impact of home field advantage at the tournament level, you would need some sort of idea of how a host fared compared to how it would have fared elsewhere. To do that, you’d need an idea of how good the team was anyway and given that we don’t really have enough data on the national sides that would presumably require the kind of individual player stats at club level that we also don’t really have. By this stage, we’re piling small data on small data and it’s starting to creak.

Home field disadvantage?

So that’s the stats. But there was something else about the home field this time – the pressure it appeared to put the Brazilian players under. The game against Germany was a full on 11-man meltdown, which started like this


and somehow became less professional. It’s hard to imagine that such a loss of perspective upon losing a player to injury, or of control upon conceding an early goal, could have happened in, say, South Africa four years ago.

A lot of the commentary after the game focused on the way the whole side crumbled under the pressure, but this was not just wisdom after the event. This is an interesting piece by Juninho, who won the World Cup with Brazil in 2002 and was excellent on the BBC throughout the competition. It was written after the Croatia game and, almost in passing, he compares the current side to the one he played in, saying

“I know (the players) feel they are carrying a lot more responsibility on their shoulders”

Even in the opening game, against Croatia, Neymar was crying during the national anthem. Croatia started much the better side, and scored early due to an own goal from Marcelo. The pressure had been evident from the start and what happened in the semi final was an extreme version of what we’d seen previously.

None of which is to say modeling is stupid because it can’t predict such extreme responses, or such crazy results. Of course it can’t, no one can (actually, almost no one) and in many ways that’s the beauty of the whole thing. But the shortage of data is a problem, if you end up stacking one set of assumptions on top of another. In this case, it leads to Brazil’s home field advantage being inflated, and especially so for a tournament setting. That in turn means their chances of winning are inflated well beyond what most other people think, which means FiveThirtyEight stands out, hence the 140 character teasing in their direction.

Presumably, now we are officially in the era of big data, some of the inputs to these kinds of models will improve, even if the central problem of there only being one World Cup every four years remains unsolved. The interesting thing, then, is what happens now to the weight given to home field advantage in the FiveThirtyEight model. Presumably it gets adjusted down, because Brazil lost 7-1 while playing at home. But what about losing 7-1 because they were playing at home? Can we have a dummy variable for the crushing weight of 64 years of cultural, social and sporting expectation? Because we’ll need it if England host in 2030.
