bars and lines and things that show


I finally figured out how to make my maps zoomable. It was, compared to actually making a map, really easy – like 6 lines of code. Basically there’s a behaviour in D3 already that allows you to zoom about; you just have to set it to work on the map projection so everything moves together nicely.

It doesn’t interfere with the tooltip or the interactivity or anything. Although the more you zoom in the more the boundaries get a bit weird – there appears to be a lake just north of Croydon, for instance. The example below is based on the young adult unemployment data I got from the census.

This, from D3 Tips and Tricks, is what I used – it’s really very helpful indeed.

Small data – FiveThirtyEight and predicting the World Cup

A mug’s game

I should really begin with some sort of “full disclosure”. Despite being pretty obsessive about football – I have a season ticket to my home town club despite living a good couple of hours away by train, my favourite TV programme by miles is the Bundesliga highlights on ITV4 – and despite doing “stats” for a living, I am terrible at predicting football matches. I got beat in my World Cup Fantasy Football league by a child. I thought my team, Norwich, would finish comfortably mid-table last season, thanks to goals from our new striker, Dutch international Ricky van Wolfswinkel. We got relegated, and, with his one goal all season, Ricky van Wolfswinkel is no longer a Dutch international.

But that doesn’t mean I didn’t join in the schadenfreude around all those mistaken World Cup stats models when Germany battered Brazil last week. My favourite tweet was this

And I sent this one on my way home from the pub, where I appear to be criticising data viz legend Alberto Cairo (I wasn’t, but still, for shame. Also, for drink).

The main butt of the jokes appeared to be FiveThirtyEight, the data journalism site set up by Nate Silver. They gave Brazil a 65% chance of winning that game. That makes them definite favourites. (Again, full disclosure, I thought Brazil would win too, but we know about me now). Here’s one of many, many tweets.

(Note – that’s a different %age to the one I saw. Anyway)

Most bookies thought the odds were roughly even, with Germany possibly slight favourites. You could find pundits backing both sides, but most thought it would be close. So, in not predicting a 7-1 battering, FiveThirtyEight were far from alone, but they did rate Brazil’s chances higher than most.

So how did this high rating come about? FiveThirtyEight built a model estimating each side’s chances of winning the tournament before it kicked off, and adjusted those chances as the games went on. The high rating for Brazil comes, therefore, from this initial model. They have published the data behind their model, which is both honest and useful of them so we can have a poke around.

Brazil – everyone’s favourites

As a starting point, here’s a graph showing their pre-tournament calculations of the probability of each country winning the World Cup. For comparison, we’ve also got the odds, supplied by FiveThirtyEight in their original post, from Betfair, a betting exchange.

There appears to be an outlier.

That 45% chance of Brazil winning the whole thing is pretty high. It’s a 32 team tournament and some of the others are pretty good, so saying Brazil were almost as likely as not to win seems pretty brave. It stands out. (Note that as a result of the Brazil figure being so high, the other favourites are necessarily low, so Betfair gives them a greater chance of winning than FiveThirtyEight does).

Yes, Brazil were the favourites with the bookies too. That’s fine, the favourites don’t always win. But the bookies had the chances of winning closer to 20%, half that of FiveThirtyEight. So why were FiveThirtyEight’s ratings so high? There could be a number of reasons, but there seems to be one big one – Home Field Advantage.
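For anyone wanting to check the bookies' side of that comparison, here is a rough sketch of how betting odds turn into probabilities. It is written in Python purely for illustration, and the odds in it are made-up placeholders for a toy four-team tournament, not the real 2014 prices. Decimal odds of 5.0 imply a 1 in 5 chance. A bookmaker's implied chances usually add up to a bit more than 100%, so the sketch also rescales them to sum to one.

```python
def implied_probabilities(decimal_odds):
    """Turn decimal odds into probabilities that sum to 1.

    decimal_odds: dict mapping team name -> decimal odds (5.0 means "4/1").
    """
    # The raw implied probability is simply 1 / decimal odds.
    raw = {team: 1.0 / odds for team, odds in decimal_odds.items()}
    # Bookmakers usually build in a margin, so the raw values tend to
    # sum to slightly more than 1.
    total = sum(raw.values())
    # Rescale so the probabilities sum to exactly 1.
    return {team: p / total for team, p in raw.items()}


# Hypothetical odds for a toy tournament (NOT real World Cup prices).
example_odds = {"Hosts": 2.5, "Team B": 4.0, "Team C": 5.0, "Team D": 5.0}
probs = implied_probabilities(example_odds)
```

Running the same conversion over a real set of odds is how you would put a bookmaker's numbers on the same scale as a model's percentages.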

There’s no place like etc

Sports teams tend to do better playing at home than away from home, and in football this is more pronounced than in other sports. Nate Silver, in his piece before the tournament, is quite clear that this gives Brazil a big advantage, and that’s the difference between his and other estimates.

One of the reasons teams do better at home than away is that refereeing decisions tend to go in their favour, and it wasn’t hard to spot examples of this in Brazil, from Fred’s hilarious penalty winning dive against Croatia to the non booking of Fernandinho for his continued assaults on James Rodriguez. With my small team supporter’s hat on, I would also point out that these are the kind of decisions that the big sides always get. That may also be irrelevant, but anyway.

Moreover, Brazil are (/ were) indisputably good at home. They hadn’t lost a competitive game at home since the 1970s. No European team had won in Brazil at all since John Barnes ran through the Brazilian defence in 1984. That’s a pretty good record.

But is it that amazing? Firstly, John Barnes’s goal is great (watch it again!) but Brazil rarely played friendlies at home until the run up to this year’s World Cup, preferring to jet off round the world for oodles of cash instead.

Likewise, competitive international matches are quite rare, so it’s not hard to go a long time without losing. England, rubbish England, who never, ever, ever learn, have only lost twice at home in competitive games since 2000, and both defeats cost the manager his job – Keegan’s resignation in the toilets and McClaren’s humiliation by umbrella.


Spain have only lost one competitive game at home since 2000, Argentina two (in normal time plus one on penalties).

Competitive home games in South America are less common, too. Unlike the European Championships, there is no qualification for the Copa America – the entire continent, plus the occasional guest, qualifies and the tournament takes place in one host nation. So if you’re not the host, there are no competitive home games. Brazil hasn’t hosted the Copa since 1989, when, obviously, they won, but that means that the only competitive home games they’ve played since then are World Cup qualifiers. As hosts, they didn’t have to qualify this time round, so that’s no competitive home games since 2009.

Small data

So we’re looking at quite a small dataset here and Brazil might not be quite as good at home as we think. This is, I think, a specific example of a general problem with modeling sports results for quadrennial tournaments – there just aren’t enough data points to go around. International teams don’t play very often, so models such as that used by FiveThirtyEight rely on club statistics. That seems fair enough, but the model only had good stats for a handful of European leagues and none at all for the Brazilian league. They likely underestimated quite how useless Fred was, for instance, since he doesn’t play in Europe.

It’s hard, then, using the data available to quantify home field advantage, and harder still in a tournament setting. There’s some evidence for it – England, France, South Korea and Japan all had their best World Cup results as hosts. There’s some evidence of no effect – no hosts have won the European Championships since France in 1984 and the last two sets of co-hosts have made almost no impact at all. In the Copa America, the hosts have won three out of the 11 tournaments in which there has been a host nation.

If you wanted to play pop sports psychologist, and we all do, you could make a case for home field advantage being weaker in a tournament setting. In a league, and in the continental level knock out club competitions, teams play each other home and away. You can, to a certain extent, cede some advantage as the away side if you know that you’ve got the opportunity to redress the balance back at your place (or, even more so, if you bring such an advantage to the second leg). In a tournament, though, there are no second chances, and we saw Croatia, Chile and Colombia, if not so much Mexico and Cameroon, really take the game to Brazil.

In order to get a handle on the impact of home field advantage at the tournament level, you would need some sort of idea of how a host fared compared to how it would have fared elsewhere. To do that, you’d need an idea of how good the team was anyway and given that we don’t really have enough data on the national sides that would presumably require the kind of individual player stats at club level that we also don’t really have. By this stage, we’re piling small data on small data and it’s starting to creak.

Home field disadvantage?

So that’s the stats. But there was something else about the home field this time – the pressure it appeared to put the Brazilian players under. The game against Germany was a full on 11-man meltdown, which started like this


and somehow became less professional. It’s hard to imagine that such a loss of perspective upon losing a player to injury, or control upon conceding an early goal could have happened in, say, South Africa four years ago.

A lot of the commentary after the game focused on the way the whole side crumbled under the pressure, but this was not just wisdom after the event. This is an interesting piece by Juninho, who won the World Cup with Brazil in 2002, and was excellent on the BBC throughout the competition, from after the Croatia game where, almost in passing, he compares the current side to the one he played in, saying

“I know (the players) feel they are carrying a lot more responsibility on their shoulders”

Even in the opening game, against Croatia, Neymar was crying during the national anthem. Croatia started much the better side, and scored early due to an own goal from Marcelo. The pressure had been evident from the start and what happened in the semi final was an extreme version of what we’d seen previously.

None of which is to say modeling is stupid because it can’t predict such extreme responses, or such crazy results. Of course it can’t, no one can (actually, almost no one) and in many ways that’s the beauty of the whole thing. But the shortage of data is a problem, if you end up stacking one set of assumptions on top of another. In this case, it leads to Brazil’s home field advantage being inflated, and especially so for a tournament setting. That in turn means their chances of winning are inflated well beyond what most other people think, which means FiveThirtyEight stands out, hence the 140 character teasing in their direction.

Presumably, now we are officially in the era of big data, some of the inputs to these kinds of models will improve, even if the central problem of there only being one World Cup every four years remains unsolved. The interesting thing, then, is what happens now to the weight given to home field advantage in the FiveThirtyEight model. Presumably it gets adjusted down, because Brazil lost 7-1 while playing at home. But what about losing 7-1 because they were playing at home? Can we have a dummy variable for the crushing weight of 64 years of cultural, social and sporting expectation? Because we’ll need it if England host in 2030.


Every now and again I use this site to post links to datawrapper and google charts. Top blogging!

Odds/ ends

I’ve been working on quite a lot of things that I haven’t really managed to finish, so I thought I’d start putting them up here. Often when I’m doing a visualisation I’ll start with the analysis and then do the coding. Sometimes though I want to come up with a particular type of presentation, just to see how it works. Without the impetus of a decent piece of analysis behind it, though, they just languish on my hard drive.

There’s one particular example recently, where I wanted to do something compact for mobiles that would be interactive but simple. I based it on this idea by Scott Murray, which in turn was adapted here. Basically, what you’ve got in the example are three overlapping shapes. As you click one it moves to the front of the pack. It’s a really neat effect and you can try it out below.

At the same time, I was starting to get bothered by how the interactives I’d been working on aren’t great on mobiles. Those maps, for instance, are far too detailed for a 4 by 3 inch screen. All the clicking is really fiddly, the mouseover is actually a touch on a touch screen, which is also fiddly.

So this presentation seemed to be a good solution to that problem. You can have big things to click, but they overlap so they don’t take up much space. I thought that maybe you could use it for pie charts, showing each slice separately, which would give the slices more room. When I was working on it there had been some stats out about Food Banks and who was using them so I used those numbers. The text is too big cos I never got round to sizing it properly. And the colours are horrendous cos I never got round to doing nicer ones. Anyway.

What I thought, and still think, about this, is that it’s OK but a bit pointless. It seems a lot of work – a click! – to get one number. And there are only four numbers in the whole thing, including the total, that you see when you open it up. So I’m not really convinced it’s worth it. Also, it sort of doesn’t really work on a mobile. There’s some technical stuff I can’t quite crack about getting it to fill the screen and not doing this weird blinking thing on each click that happens at the moment. So it’s not great all round.

In retrospect, what happened here is that I had a solution – this nice bringing to the front thing that someone else had developed – and I tried to find it a problem to solve. That’s unlikely to be the right way round. I can see that the effect might be useful as a smaller part of something else, and I now have that on hand should I ever need it. But I’ll use it because I need it, not because I’ve got it.

Data wrappin

Here’s a graph I drew on data wrapper

See how you can select the different categories? Nice, huh?

Let’s draw maps

From around the middle of January, I’d been trying to figure out how to make interactive maps. Last month, I put up my first finished map (I say “my” – I got a lot of help along the way, and those people, who I owe a lot of thanks to, will be acknowledged below). So it looks like it took about two months to figure out how to do it. It felt like rather more. What I’m going to do here is explain a little about what I did, why it didn’t work, what I then did, why that was OK but still needed work, then what I ended up doing. It’s likely to be a long post.

Background – the kind of maps I wanted to draw

What I was keen on doing was coming up with an interactive map where you would colour in areas according to certain population level data – eg unemployment rates for an area, maybe GDP for a country, that kind of thing. I wasn’t looking to draw GIS maps, maps that would help people navigate between places. My interest was more in using the map as a picture, that could tell us about how the characteristics of areas differed.

I also wanted something that looked a bit different from most similar maps online. I don’t much like, for instance, the google fusion approach, where colours are overlaid on an existing street map. Visually I think it’s not great – it’s a bit cluttered. (Also there’s something about them that is almost misleading – including streets, and even, on a close zoom, buildings in the picture could make the viewer think that the data applies at the level of the street, or even the building, when actually it’s an average of a larger area). Really what I was looking to do was create something quite clean looking that I could give an identity to.

Starting out

Like almost all my D3 expeditions, this one begins with Scott Murray’s book, Interactive Data Visualization for the Web. There’s a chapter in there about drawing a map of the United States, linking it to some data and then colouring it in nicely.

The key to drawing the map of the United States is getting a file of the map coordinates in JSON format. You link that to the D3 code which does the work of turning these coordinates into lines and borders that form shapes that you can then colour in as you see fit. So what I needed was a JSON file for the UK, showing the boundaries of UK local authorities.

The first place I went was the Ordnance Survey, which gives away loads of boundary data, mainly as shapefiles. Shapefiles (.shp files) are what MapInfo and ArcGIS use. They make no sense without the software to read them, and I don’t have the software.

What I had to do was find something to turn these files into JSON files. Scott Murray recommended mapshaper, which allows you to drop files in and convert them to other file types.

That’s not the half of it, though. The main thing mapshaper does is allow you to smooth out wiggly lines, of which the UK has loads, both as internal borders between areas and, more obviously, in the form of coastline. Doing this makes the file size much smaller, and so the map loads much quicker which is obviously a good thing.
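To give a feel for what smoothing out wiggly lines means in practice, here is a small Python sketch of Douglas-Peucker, one standard line simplification method. It is only an illustration of the general idea and not a claim about what mapshaper does internally; the points and the tolerance are made up. Points that sit close to the straight line between their neighbours get thrown away, which is where the saving in file size comes from.

```python
import math


def _point_to_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b (all (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamping to the end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Douglas-Peucker: drop points that lie within `tolerance` of the
    straight line joining the points kept either side of them."""
    if len(points) < 3:
        return list(points)
    # Find the interior point furthest from the line between the two ends.
    distances = [
        _point_to_segment_distance(p, points[0], points[-1])
        for p in points[1:-1]
    ]
    furthest = max(range(len(distances)), key=distances.__getitem__)
    if distances[furthest] <= tolerance:
        # Everything in between is close enough: keep only the ends.
        return [points[0], points[-1]]
    split = furthest + 1  # index into `points`
    left = simplify(points[: split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    # The split point ends `left` and starts `right`; keep it once.
    return left[:-1] + right


# A made-up wiggly line: eight points in, fewer out.
wiggly = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
smooth = simplify(wiggly, tolerance=0.5)
```

A bigger tolerance throws away more points, so there is a trade-off between a small file and a coastline that still looks like a coastline.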

Anyway, the net result of all this is that it spits out a JSON file which you can point at the code which will just draw you the map you need. Pretty much. Here is my first attempt. I have coloured the authorities in red, as you can see.

A map of English local authorities.

Which is pretty good as a first try. Here’s a second effort

A second map of English local authorities


Clearly, this is a more detailed map. But it’s still not quite right.

What’s going wrong is the projection. Anytime you want to reproduce a 3d picture (which is what a map of the UK is) in 2d, you need to use a projection to map those coordinates. And if you get the wrong one, pretty much anything could happen.

The problem I was having was to do with the coordinate system the Ordnance Survey uses, which is unique to its own surveys, rather than a global system. When I put those coordinates into my code template, they got read as the above. Which is obviously hopeless.
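As a rough illustration of the two ideas here, the Python sketch below projects longitude and latitude onto a flat page using the spherical Mercator formula, and does a crude check on whether a set of coordinates looks global at all. Mercator is just one well-known projection, chosen for the example; it is not necessarily the one used for these maps. The grid point in the example is a made-up easting and northing in metres, which is the form national grid coordinates take.

```python
import math


def mercator(lon_deg, lat_deg, scale=1.0):
    """Project longitude/latitude in degrees to flat (x, y) using the
    spherical Mercator formula. `scale` stretches the result to fit a page."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    x = scale * lon
    y = scale * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def looks_like_lon_lat(points):
    """Rough sanity check: global coordinates stay within +/-180 and +/-90.
    Grid eastings and northings are in metres, so are far larger."""
    return all(-180 <= x <= 180 and -90 <= y <= 90 for x, y in points)


# Illustrative values only.
global_points = [(0.0, 51.48), (-1.5, 53.8)]
grid_points = [(538000, 177000), (430000, 434000)]  # hypothetical metres

x, y = mercator(0.0, 51.48, scale=100)
```

Feeding grid coordinates to code that expects degrees is the sort of mismatch that produces a picture like the ones above, and a check like `looks_like_lon_lat` would flag it before anything gets drawn.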

A new approach

What I needed, then, were files that used the right kinds of coordinates and actually there are plenty out there. Every time someone makes a google fusion map, if the settings are friendly enough, you can download the file that draws the boundaries. These boundaries are KML files but the key thing is that they are drawn off global coordinates, which are the ones my D3 code knows how to deal with. So we’re getting somewhere here.

Simon Rogers, who used to run the Guardian datablog and now works at twitter, draws lots of google fusion maps, and has a bunch of boundary files on his site, including some for UK local authorities. They’re free to download, so I took them.

All the data I need is in these files. The coordinates are global – you can see, for instance, in the borough of Greenwich that the latitude crosses 0 degrees. There’s a bit of messing about to do to get the right punctuation for a JSON file – JSON uses lots of [square brackets], KML files tend not to – but mostly it’s easy enough.
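The punctuation change can be sketched in a few lines of Python. KML stores a boundary as text, with each point written as longitude,latitude (optionally followed by an altitude) and the points separated by spaces, whereas JSON wants nested [square brackets]. The coordinates below are made up for illustration, and the function name is hypothetical.

```python
import json


def kml_coordinates_to_ring(kml_text):
    """Convert the text of a KML <coordinates> element into a GeoJSON-style
    ring: a list of [longitude, latitude] pairs."""
    ring = []
    # KML separates points with whitespace and the parts of each point
    # with commas: "lon,lat" or "lon,lat,altitude".
    for chunk in kml_text.split():
        parts = chunk.split(",")
        lon, lat = float(parts[0]), float(parts[1])  # ignore any altitude
        ring.append([lon, lat])
    return ring


# Made-up coordinates straddling 0 degrees longitude, for illustration only.
sample = "-0.01,51.47,0 0.02,51.47,0 0.02,51.49,0 -0.01,51.49,0 -0.01,51.47,0"
ring = kml_coordinates_to_ring(sample)
as_json = json.dumps({"type": "Polygon", "coordinates": [ring]})
```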

But after getting one clean line of JSON for each local authority, I ran into a problem. I could see the different LAs on the screen, the boundaries looked fine, but when I tried to colour them in, it would colour in the whole page. I thought that maybe the boundaries weren’t closed, but they kind of are by definition – that’s how the things work, they’re just lists of coordinates and the code links them up.

So, as I always do, I went to Stack Overflow and posted my problem. And, as always happens, someone helped me out. They pointed out that the problem was that the boundaries were effectively drawn inside out, so that the areas I was seeing were cut outs from the whole screen; the obverse of what I was looking for. So rather than colouring in the shape of the local authority, I was colouring everything but that shape and after I’d coloured in a couple, the whole screen was coloured in – local authorities, the North Sea, the Channel, Ireland, the lot.

The reason this had happened was that the coordinates were written anti clockwise, whereas the code wanted clockwise coordinates. My Stack Overflow helper sent me a link to a place that would tell me how to reverse them without tediously going through each one and rewriting them, which would have been impossible. It looked a little complicated, and it was already late in the day so I thanked my correspondent and went to the pub.

By the time I got back from the pub, the same person had responded again to my thank-you post. They said they hadn’t realized quite what they were linking to, and yes it did look rather complicated, and why don’t you try this piece of code I’ve just written especially that reverses all the coordinates for you, and prints the new ones you need underneath the map that demonstrates that this all now works?

I was pretty taken aback, and I am now eternally indebted to AmeliaBR, who, a quick inspection of Stack Overflow reveals, seemingly dedicates her waking life to solving the computing problems of the less able. An absolute star.

With this magic code in hand, I was nearly there. The one small problem remaining was the authorities that are made up of a few different areas – Great Yarmouth straddles a river, for instance, and lots of the South Essex coast includes tiny little islands. They needed to be broken up, reversed, and stitched back together again. With some places, I didn’t bother, and just hacked off the tiny extra parts. Anglesey will just have to do without Holyhead. With others – Great Yarmouth being a good example – you can’t really do that. It took a little while, but it’s done now.
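The reversing itself is simple once you can tell which way round a boundary runs. The Python sketch below is not AmeliaBR's code, just an illustration of the idea under two assumptions: the areas are small enough to treat the coordinates as flat, and none of the parts has a hole in it. It uses the shoelace formula to work out the direction of each ring, flips the anticlockwise ones, and treats a multi-part authority as a list of rings that are fixed one by one and put back together.

```python
def signed_area(ring):
    """Shoelace formula. With x to the right and y up, the result is
    positive for an anticlockwise ring and negative for a clockwise one."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def make_clockwise(ring):
    """Return the ring with its points in clockwise order."""
    return ring[::-1] if signed_area(ring) > 0 else list(ring)


def fix_parts(parts):
    """Break a multi-part area into its rings, fix each one, and stitch
    them back together in the same order (assumes no holes)."""
    return [make_clockwise(ring) for ring in parts]


# Two made-up squares standing in for a mainland and an island.
mainland = [(0, 0), (1, 0), (1, 1), (0, 1)]   # anticlockwise
island = [(2, 0), (2, 1), (3, 1), (3, 0)]     # already clockwise
fixed = fix_parts([mainland, island])
```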

So then you get a map and it looks like this.

A map of English local authorities that actually works

You can play with it a bit – hovering over the areas reveals the values in the tooltip. Off the top of my head, I can’t now remember what the underlying data is – it’s likely to be people claiming Job Seeker’s Allowance or something. Looks like bad news for Hull and Birmingham, whatever it is. I decided to pull out London, as it’s too small to look at in the full map.

I then worked on an interactive version that compared unemployment among under 25s from the 2001 and 2011 censuses. The finished product is below – the interactivity is based on the same ideas as the interactive graphs I’ve been working on – click a button, pass in new data. The design of it was done by my colleague Hannah, who also pointed me in the direction of a site for choosing the colours. You could lose hours to that site, choosing different colour combinations for maps. The one we ended up with goes from green (good) to red (bad).

We also added in a graph showing the distribution of unemployment rates in 2001 and 2011, as a quick way of looking at the whole picture. (That idea came from the Facts are Sacred book, by Simon Rogers.) This time, the map is of England and Wales.

Unemployment among Under 25s in the 2001 and 2011 censuses

I think it’s a good visualisation because there’s an obvious story – unemployment gets a lot worse, everywhere. There are geographical aspects to it – the north south split, the deterioration in parts of coastal England, the rural/ urban differences – which a map can show that a graph can’t.

And that, more or less, is it in terms of drawing a map. But if we go back to why I wanted to do this, there was the thing about not looking like google maps. But a far better reason, when I thought about it was that google maps is a bit restrictive. You’re kind of stuck with their layout, it’s hard to annotate the map or add any more to it graphically. So you’ve got a map, but not much more.

Since finishing this map, I’ve done three more quite quickly. The work is setting the thing up – once that’s done, you’ve essentially got a template you can keep dropping stuff into. Then the only additional work depends on what specific changes to the map you want to make – adding in another year’s data, or a different type of data. The map at the bottom here allows you to choose different data within each year, which was a fair bit more work. But again, once that’s done, it’s done.

The maps we eventually put up had a bit more to them; they were more like full visualisations, with words and other info. It’s really good to have that control. And the longer I spend on this stuff, the more I realise that there’s no other way I’d want to do it – to visualise stuff properly you need to be able to use all of the space how you want it.

Edit: I don’t take comments on the blog as dealing with the spam is too much of a pain. You can find me on twitter (all too often) – @tommacinnes


This is a stacked bar graph showing changes over time in some (made up) measure, broken down into its component parts (Things A to D). The static version would be OK, but a bit limited – it would show a total pretty well, but it’s hard to compare the changes in the component parts as (with the exception of whichever series you choose to go along the x axis) they all start at different heights. So what this graph does is allow you to choose the component parts you’re interested in and just compare those. That allows the user to identify which parts are driving the change in the whole. Click the squares at the top to show or remove the different slices of the chart in whichever combination you like (the numbers are nonsense, obviously).

Most of the graphs I’ve done so far allow the user to look at one thing, then another thing, then back to the first thing, maybe via some third thing, the advantage being that it’s all in the same space on the screen. That’s handy. But this allows the user to look at things in different ways, according to their preference, and that’s subtly different. And, I think, better.

(edit: Just struck me that maybe the difference is just quantitative, as there are simply more options to view. This graph has 4 different views, for all its seeming complexity. The one above has 15).
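The 15 presumably comes from counting every combination of the four components with at least one switched on (that reading is an assumption, but the arithmetic fits). A quick Python check:

```python
from itertools import combinations

components = ["Thing A", "Thing B", "Thing C", "Thing D"]

# Every way of showing at least one component: choose 1, 2, 3 or all 4.
views = [
    combo
    for size in range(1, len(components) + 1)
    for combo in combinations(components, size)
]

print(len(views))  # 15, i.e. 2**4 - 1
```

Each extra component doubles the number of combinations, so a fifth Thing would give 31 views.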

As always, it’s done using D3, and builds on the principles in this graph. The main challenge is getting the bars to move around when you remove others, but that’s pretty straightforward after a fashion. The code is a bit repetitive, but the main thing I wanted to avoid was writing code for every possible permutation of the four components separately and I managed that so I’m happy enough.
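The general idea of avoiding a separate case for every permutation can be sketched like this, in Python rather than D3 and not taken from the graph's actual code. For each bar you walk through the components in a fixed order, skip the ones that are switched off, and stack whatever is left from the bottom up. The period labels and numbers are nonsense, in keeping with the graph.

```python
def stack_bars(data, order, selected):
    """For each time period, work out where each visible slice starts and
    ends. Slices that are switched off are skipped, so those above drop down.

    data: dict of period -> dict of component -> value
    order: every component, bottom of the stack first
    selected: the components currently switched on
    """
    layout = {}
    for period, values in data.items():
        base = 0
        slices = []
        for name in order:
            if name not in selected:
                continue
            slices.append({"name": name, "y0": base, "y1": base + values[name]})
            base += values[name]
        layout[period] = slices
    return layout


# Made-up numbers for two hypothetical periods.
data = {
    "2012": {"A": 3, "B": 2, "C": 4, "D": 1},
    "2013": {"A": 4, "B": 1, "C": 5, "D": 2},
}
layout = stack_bars(data, order=["A", "B", "C", "D"], selected={"A", "C"})
```

With B switched off, C starts where A ends, which is the movement you see when a slice is removed.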

Multiple choice

Here’s a graph that lets you choose combinations of stuff, ie female or male, life expectancy or disability free life expectancy. So there’s four things you could look at. I’ve broken it down by the deprivation of the area people live in – poorest fifth, next poorest etc etc. The final version will turn up on the NPI site once I’ve made it look nicer.

When you click from male to female, that just does what it always does when you click, ie switch from one set of data to another. The trick here is that the data is conditional on what’s going on elsewhere, ie whether you selected LE or DFLE. So there’s an if function that refers to the thing you aren’t clicking, if that makes sense.

It’s OK, but it’s a bit limited. Currently there are 4 (2 x 2) options. If you increase that to say 6 (2 x 3, by adding in another measure of health), the amount of code you need to write also increases by (around) 50% (ish), and a lot of that would be set inside some complex, nested if function which would be prone to error and need loads of checking. So there are no real economies of scale in the way I’ve set it up.

That’s in contrast to say, analysing by ten different types of areas, rather than 5. That would be pretty much no work at all, if I could get hold of the data.
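One possible way round the nested if, sketched in Python and not how the graph is currently built, is to keep the datasets in a lookup table keyed by both choices. The click handler then updates whichever control was clicked and reads the other one from a shared record of the current state. The dataset names below are placeholders, and `on_click` is a hypothetical helper.

```python
# Placeholder dataset names; in practice each would be the loaded data.
datasets = {
    ("male", "LE"): "male_le_data",
    ("male", "DFLE"): "male_dfle_data",
    ("female", "LE"): "female_le_data",
    ("female", "DFLE"): "female_dfle_data",
}

# What is currently selected on each control.
state = {"sex": "male", "measure": "LE"}


def on_click(control, choice):
    """Update the control that was clicked, then pick the data using both
    selections, including the one that was not clicked."""
    state[control] = choice
    return datasets[(state["sex"], state["measure"])]


first = on_click("sex", "female")      # measure is still "LE"
second = on_click("measure", "DFLE")   # sex is still "female"
```

Set up like this, a third health measure would mean two more rows in the table rather than another branch of the if, though the extra data still has to come from somewhere.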

Entrance to exit

I’ve been working on a graph whereby you can switch the x axis and increase the number of categories. Essentially, lots of different breakdowns of the data in the same space. The changing x axis allows you a lot of flexibility re what you can look at.

Click the tabs below etc and so on.

This was made using d3 again. The main technical thing here is using the enter/ exit functions, and seeing the bars move in and out on the right hand side. Making the bars thinner and fatter so that they fit is pretty basic.
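For anyone who hasn't met enter and exit, the idea is that new data gets compared with what is already on the screen and sorted into three piles: things to add, things to keep and things to remove. The Python sketch below mimics that sorting and the bar width sum. It is an illustration of the concept only, not D3's code, and the function names and the padding value are made up.

```python
def data_join(on_screen, new_data):
    """Split keys into the three groups a data join works with.

    on_screen: keys of the bars currently drawn
    new_data: keys of the categories in the data just passed in
    """
    current, incoming = set(on_screen), set(new_data)
    return {
        "enter": [k for k in new_data if k not in current],   # bars to add
        "update": [k for k in new_data if k in current],      # bars to move/resize
        "exit": [k for k in on_screen if k not in incoming],  # bars to remove
    }


def bar_width(chart_width, n_bars, padding=0.1):
    """Width of each bar so that n_bars fill the chart, leaving a
    `padding` fraction of each slot as a gap."""
    slot = chart_width / n_bars
    return slot * (1 - padding)


join = data_join(["a", "b", "c"], ["b", "c", "d", "e"])
width = bar_width(400, len(["b", "c", "d", "e"]))
```

In this example "a" leaves, "d" and "e" arrive, and the bars that stay get thinner to make room.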

The point here is that if the analysis is all in the same space, then the explanatory text has to be as well. There’s no point in having the numbers at the top of the page and the words at the bottom. So I’ve appended some text to the graph that moves in and out with the clicks.

Each chunk of text is a jpeg. This is because Internet Explorer (even the most recent version) cannot handle the SVG “foreignObject” element, which allows me to use a text box like it was any other object in d3. IE is still around 25% of the browser market, so it’s not really OK just to rule those users out. So I needed a workaround.

It’s not ideal using jpegs – it seems a bit flickery on my browser (possibly my computer) and this project now contains 9 files, which feels too many (one basic html file, 4 data files and 4 text boxes made into jpegs). Hopefully someone will come up with something better soon, or IE will catch up with the rest of the world. Let’s hope.

Also, final thing. When you switch between c and d, the word “Me” scoots along the x axis to a new position. This is totally unintentional, and part of how d3 animates stuff, rather than a choice. Obviously it’s just the result of using “Me” twice in my different nonsense axes but it’s quite funny nonetheless.

It’s a London thing