Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Animations, and sorting by event. #2

Open
knowlsie opened this issue May 9, 2017 · 7 comments
Open

Animations, and sorting by event. #2

knowlsie opened this issue May 9, 2017 · 7 comments

Comments

@knowlsie
Copy link
Contributor

knowlsie commented May 9, 2017

Gold Medals Animation

You may know this already, but the animation above found in index.html and exploring.md doesn't actually chart the number of gold medals, but the number of gold medallists (with repeats for people who won multiple medals). This means that a team of 8 that won gold counts as 8 gold medals - which definitely isn't the standard way to count medals. It's especially bad, because up to 8 people are listed as taking part in a 4 x 100m relay, because the data also include athletes that competed in the preliminaries but not in the final. As a result, the chart makes it seem like the US won 141 gold medals, when they actually won 46.

You can start to solve this by also grouping by event - below is one way of doing this).

let data =
  olympics
    .'filter data'.'Games is'.'Rio (2016)'.then
    .'group data'.'by Event'.'and Team'.'and Gold'.then
    .'group data'.'by Team'.'sum Gold'.then
    .'sort data'.'by Gold descending'.then
    .paging.take(8)
    .'get series'.'with key Team'.'and value Gold'

The problem is that for Rio this groups the Men's events and the Women's events together, and therefore reports that the US only won 38 medals, rather than 46. One can't yet fix this, as for the Rio (2016) data, the genders are all marked as Unknown, and the genders aren't noted in the event name like they are in the some of the other data.

Speaking of the other data, I tried moving this animation from Rio (2016) to London (2012), because London has its genders marked. However, you again get a small error in some calculations due to unclearly named events, between which it's pretty much impossible to differentiate. For example, the "synchronized 3m springboard women" and the "individual 3m springboard women" have the same name ("3m springboard women"), so if you group the data by event, it looks like China only won one event, when they actually won 2. There's no (nice/safe) way to separate these two data without better labelled events, as all the other fields are the same.

By the same logic as above, the other animation on exploring.md suffers from the same issues.

Secondary Animation

Overall, it's currently impossible to accurately transform the medals dataset in some of the most meaningful ways that a journalist might want to try - e.g. actual gold medals per country. It therefore seems like it might be sensible to reclean the relevant columns the medals dataset, to try to make the whole thing more sortable. One could either do this by adding a "Single/Team" column, and improving the "Gender" column; or by simply improving the "Event" column so that every event is clearly distinguishable. It may also be sensible to clarify somewhere that this is a dataset of medallists, not of medals as such.

@knowlsie
Copy link
Contributor Author

knowlsie commented May 9, 2017

I probably should have noted that this issue also applies to all the other repos where this dataset is used, most notably https://github.com/the-gamma/thegamma-services which contains the dataset itself. It's just most obvious here because of the pretty animations 🙂

@tpetricek
Copy link
Member

tpetricek commented May 10, 2017

You are certainly right that this is wrong!

The Rio 2016 data is parsed from this JSON file downloaded from some Olympics organization web site. Sadly, they don't have gender as a separate field, but the event names often start with "Men's" or "Women's". My F# script for scraping this tries to guess the gender here, but it seems I must have gotten something wrong...

The script is also responsible for matching names - I think this would actually be a fun problem for the AIDA project - I remember @triangle-man was looking for "entity matching" problems and so matching events from different files that I'm merging in the messy script in workyard would be a good demo.

@myyong @triangle-man Are you still looking for entity resolution problems? Perhaps we could add this to your AIDA challenges!

@knowlsie
Copy link
Contributor Author

Awesome, thanks for the raw data. I'll spend a bit of time tomorrow trying to classify exactly what we actually want to change, then seeing if I can do anything about it. Possibly won't be able to fix much - (I haven't yet looked at what data the JSON includes, so not sure) - but I'll give it a shot.

@knowlsie
Copy link
Contributor Author

So I've been working on this, and the gender issues are certainly doable 👍 but I've hit some other issues with the fuzzy-matching and wrangling algorithms. Here is a summary of the main issues:

  • it makes mistakes on events. For example: Christian Toro won the 200m Kayak in 2016, which is logged in the JSON, but the CSV has him down as winning the 200m Canoe. This happens in a few places, particularly within canoeing.
  • it calls different events the same thing. This means that there are rows that are completely identical and indistinguishable. For example: synchronized "3m springboard women" and the individual "3m springboard women" have the same name ("3m springboard women").
  • it calls the same events different things over time. For example: the event above (Synchronized Diving, 3m Springboard, Women) is called "synchronized diving 3m springboard women" in earlier years, but loses the "synchronized diving" to become "3m springboard women".
  • it results in confusingly named or inconsistent event names team is used to describe the team events for Aquatics, Gymnastics, and Table Tennis, but the team events for other sports get very different names with their own sport-specific labels, sometimes with the word team and sometimes without.

So I've been struggling at this, and really can't come up with a solution that fuzzily integrates the Rio data and the other olympics...

If you ignore all the early data, and take the event names from just Rio 2016, you can generate a cleaned CSV with nice event names. Even though the event names are things like Women (which refers to the 'Women' events for sports with only one event, like Football, Basketball, Rugby Sevens etc.), every individual event and row is distinguishable provided you filter or group by Discipline. This means that you can plot the correct medals chart which I did right here. I've uploaded the dataset so you can do some dot-driven development on this table if you want.1

The problem with this solution is that while this works great for Rio, obviously this easy-way-out doesn't integrate any of the other 30 olympics we're interested in. I really have no clue how to solve this issue properly. We could try fuzzy-matching the other way (i.e. fuzzy-match other event names to Rio, rather than vice versa) but that will probably only exacerbate the current inaccuracies. We could just replace the current Rio data with this table, but that will just lead to Rio having different event names to everything else, so we can't compare events over time, and doesn't solve the problems we have with transforming the other datasets (which also have indistinguishable rows, and different names over time.) So possibly this is an entity-resolution challenge for AIDA...?

Anyway, I'm going to put this on hold for now until I hear more from @tpetricek as to what he wants to do. I might have a more in-depth look at how the earlier non-Rio data is retrieved and wrangled, as I haven't really done much on that yet, and it could maybe go somewhere


1: The "Sport" column for the CSV isn't ideal right now, but you shouldn't need it for now, as the Discipline column will suffice. If we go anywhere with this, we can always improve the "Sport" column.

@tpetricek
Copy link
Member

Regarding the data sources:

  • Medals before 2012 come from data obtained by The Guardian
  • Medals for 2012 are scraped from BBC (see source) - this does not seem to work anymore, but here is a data dump. I think event names there match the Guardian table.
  • Medals from 2016 come from Rio2016 web site, which had them as embedded JSON - probably not working anymore - but it is saved in workyard as a dump

I tried scraping data from olympics.com too, but ironically, they seem to have worse data than the Guradian 😆

For data on The Gamma, I think:

  • Linking people names right is very important for getting aggregate stats (and at least one female athlete changed her surname 😨 )
  • Linking the areas (e.g. athletics) is quite useful for some data visualizations too
  • Figuring out the genders right would be very useful
  • Linking specific disciplines such as 200m canoe is not that important and I think it is fine to keep them as different names in 2016 vs. 2012, because it's not likely that someone will want to visualize the timeline of 200m canoe medals :)

Aside from that, if you can come up with a better way of linking things than the one I hacked together here, that would be awesome - if we are to add this to the "AIDA challenges" for @triangle-man, then we'll need some sort of baseline that people can improve!

Do you have your code for this somewhere on GitHub?

@knowlsie
Copy link
Contributor Author

knowlsie commented May 16, 2017

Thanks @tpetricek. On the Guardian stuff, the issue is really with the resolution of the Guardian's data (which is pretty bad) rather than the fuzzy-matching 😢 Not much we can do about this honestly...

But, if you'd like to keep that animation, I can think of a couple short-term fixes here:

  1. Replace the code used by the animation with the following. The estimate this gives is still wrong, but it's better, and it's roughly how you'd do it were the dataset perfect.

    let data =
      olympics
        .'filter data'.'Games is'.'Rio (2016)'.then
        .'group data'.'by Event'.'and Discipline'.'and Gender'.'and Team'.'and Gold'.then
        .'group data'.'by Team'.'sum Gold'.then
        .'sort data'.'by Gold descending'.then
        .paging.take(8)
        .'get series'.'with key Team'.'and value Gold'

    This isn't really as clear as it was before, but it's a little more accurate.

  2. Add gender to the Rio dataset. This is quite easily doable, and this line really should do the job, and I can't figure out what's wrong with it. I've been able to do it with other cleaning scripts, but I just can't get olympics.fsx to run (it's because of the BBC stuff, to my knowledge) so I'm struggling to debug it...

Anyway, those ideas are probably obvious and not that helpful, but I can't think of any other way out, because really we just need to improve the quality of the Guardian's data 😄

@knowlsie
Copy link
Contributor Author

knowlsie commented May 16, 2017

On how I cleaned just the Rio data, it's very obvious, I've taken pretty much the exact data from the medals.json, and formatted the athlete names a bit. I hacked it together in a bad Jupyter notebook which you can find here. (Only problem is the JSON has no "Sport" column, so I just took what I could from your data; anything else is now marked as "Unknown". Not great.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants