Animations, and sorting by event. #2
Comments
I probably should have noted that this issue also applies to all the other repos where this dataset is used, most notably https://github.com/the-gamma/thegamma-services which contains the dataset itself. It's just most obvious here because of the pretty animations 🙂
You are certainly right that this is wrong! The Rio 2016 data is parsed from this JSON file downloaded from some Olympics organization website. Sadly, they don't have gender as a separate field, but the event names often start with "Men's" or "Women's". My F# script for scraping this tries to guess the gender here, but it seems I must have gotten something wrong... The script is also responsible for matching names. I think this would actually be a fun problem for the AIDA project: I remember @triangle-man was looking for "entity matching" problems, so matching events from the different files that I'm merging in the messy script in workyard would be a good demo. @myyong @triangle-man Are you still looking for entity resolution problems? Perhaps we could add this to your AIDA challenges!
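To illustrate the kind of guess described above, here is a minimal Python sketch that infers a gender label from an event name. It is not the F# script mentioned in this thread; the helper `guess_gender`, its patterns, and the sample event names are all hypothetical.

```python
import re

# Hypothetical patterns (an assumption, not the thread's F# logic).
# "Women" is checked first; the word boundary (\b) also stops the "Men"
# pattern from matching inside "Women".
_PATTERNS = [
    ("Women", re.compile(r"\bwomen(?:['’]?s)?\b", re.IGNORECASE)),
    ("Men", re.compile(r"\bmen(?:['’]?s)?\b", re.IGNORECASE)),
    ("Mixed", re.compile(r"\bmixed\b", re.IGNORECASE)),
]


def guess_gender(event_name: str) -> str:
    """Return "Women", "Men", "Mixed", or "Unknown" for an event name."""
    for label, pattern in _PATTERNS:
        if pattern.search(event_name):
            return label
    return "Unknown"


# Made-up event names for illustration only.
for name in ["Women's 100m", "Men's 4 x 100m Relay", "3m springboard women", "Marathon"]:
    print(name, "->", guess_gender(name))
```

Searching anywhere in the name, rather than only at the start, also covers names that put the gender last. Anything that matches no pattern stays "Unknown" instead of being guessed.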
Awesome, thanks for the raw data. I'll spend a bit of time tomorrow trying to classify exactly what we actually want to change, then seeing if I can do anything about it. Possibly won't be able to fix much - (I haven't yet looked at what data the JSON includes, so not sure) - but I'll give it a shot.
So I've been working on this, and the gender issues are certainly doable 👍 but I've hit some other issues with the fuzzy-matching and wrangling algorithms. Here is a summary of the main issues:
So I've been struggling with this, and really can't come up with a solution that fuzzily integrates the Rio data and the other Olympics...
If you ignore all the early data, and take the event names from just Rio 2016, you can generate a cleaned CSV with nice event names. Even though the event names are things like
The problem with this solution is that while it works great for Rio, this easy way out obviously doesn't integrate any of the other 30 Olympics we're interested in. I really have no clue how to solve this issue properly. We could try fuzzy-matching the other way (i.e. fuzzy-match other event names to Rio, rather than vice versa), but that will probably only exacerbate the current inaccuracies. We could just replace the current Rio data with this table, but that would leave Rio with different event names from everything else, so we couldn't compare events over time, and it wouldn't solve the problems we have with transforming the other datasets (which also have indistinguishable rows, and different names over time).
So possibly this is an entity-resolution challenge for AIDA...?
Anyway, I'm going to put this on hold for now until I hear more from @tpetricek as to what he wants to do. I might have a more in-depth look at how the earlier non-Rio data is retrieved and wrangled, as I haven't really done much on that yet, and it could maybe go somewhere.
1: The "Sport" column for the CSV isn't ideal right now, but you shouldn't need it for now, as the Discipline column will suffice. If we go anywhere with this, we can always improve the "Sport" column.
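For anyone wanting a baseline to poke at, fuzzy-matching other event names against the Rio names could be sketched roughly as below, using Python's standard `difflib`. This is an assumed approach for illustration, not the matching code used in the project; `normalise`, `match_event`, the 0.8 cutoff, and the event names are all hypothetical.

```python
import difflib
import re


def normalise(name: str) -> str:
    """Lowercase and collapse punctuation and whitespace so trivial differences vanish."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def match_event(name, candidates, cutoff=0.8):
    """Return the candidate most similar to `name`, or None if nothing reaches `cutoff`."""
    lookup = {normalise(c): c for c in candidates}
    hits = difflib.get_close_matches(normalise(name), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Made-up event names for illustration only.
rio_events = ["Men's 100m Freestyle", "Women's 100m Freestyle"]
print(match_event("MEN'S 100M FREESTYLE", rio_events))
print(match_event("Tug of war", rio_events))
```

One pitfall worth noting: "Men's ..." and "Women's ..." differ by only two characters, so if the correct candidate is missing, a pure string-similarity match will happily pair a men's event with the women's one. Matching within a gender (once the gender is known) would avoid that particular error.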
Regarding the data sources:
I tried scraping data from olympics.com too, but ironically, they seem to have worse data than the Guardian 😆 For data on The Gamma, I think:
Aside from that, if you can come up with a better way of linking things than the one I hacked together here, that would be awesome - if we are to add this to the "AIDA challenges" for @triangle-man, then we'll need some sort of baseline that people can improve! Do you have your code for this somewhere on GitHub?
Thanks @tpetricek. On the Guardian stuff, the issue is really with the resolution of the Guardian's data (which is pretty bad) rather than the fuzzy-matching 😢 Not much we can do about this honestly... But, if you'd like to keep that animation, I can think of a couple short-term fixes here:
Anyway, those ideas are probably obvious and not that helpful, but I can't think of any other way out, because really we just need to improve the quality of the Guardian's data 😄
On how I cleaned just the Rio data: it's very straightforward. I've taken pretty much the exact data from the medals.json and formatted the athlete names a bit. I hacked it together in a bad Jupyter notebook, which you can find here. (The only problem is that the JSON has no "Sport" column, so I just took what I could from your data; anything else is now marked as "Unknown". Not great.)
You may know this already, but the animation above found in index.html and exploring.md doesn't actually chart the number of gold medals, but the number of gold medallists (with repeats for people who won multiple medals). This means that a team of 8 that won gold counts as 8 gold medals - which definitely isn't the standard way to count medals. It's especially bad, because up to 8 people are listed as taking part in a 4 x 100m relay, because the data also include athletes that competed in the preliminaries but not in the final. As a result, the chart makes it seem like the US won 141 gold medals, when they actually won 46.
You can start to solve this by also grouping by event - below is one way of doing this.
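A minimal Python sketch of that grouping, for illustration: it counts each distinct event once per country instead of once per athlete. The `Row` fields, both helper functions, and the sample rows are made up here and are not taken from the real dataset or from The Gamma's query language.

```python
from collections import Counter
from typing import NamedTuple


class Row(NamedTuple):
    games: str
    country: str
    event: str
    gender: str
    athlete: str
    medal: str


# Made-up rows for illustration only: one relay team of four plus two individuals.
rows = [
    Row("Example Games", "USA", "4 x 100m relay", "Men", "Athlete A", "Gold"),
    Row("Example Games", "USA", "4 x 100m relay", "Men", "Athlete B", "Gold"),
    Row("Example Games", "USA", "4 x 100m relay", "Men", "Athlete C", "Gold"),
    Row("Example Games", "USA", "4 x 100m relay", "Men", "Athlete D", "Gold"),
    Row("Example Games", "USA", "100m", "Men", "Athlete E", "Gold"),
    Row("Example Games", "JAM", "100m", "Men", "Athlete F", "Silver"),
]


def count_medallists(rows, medal="Gold"):
    """One count per athlete row, which is what the chart currently shows."""
    return Counter(r.country for r in rows if r.medal == medal)


def count_medals(rows, medal="Gold"):
    """One count per distinct (games, event, gender) for each country."""
    distinct = {(r.games, r.event, r.gender, r.country) for r in rows if r.medal == medal}
    return Counter(key[-1] for key in distinct)


print(count_medallists(rows))  # Counter({'USA': 5})
print(count_medals(rows))      # Counter({'USA': 2})
```

The result is only as good as the grouping key: any two events whose name and gender are identical in the data collapse into a single medal.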
The problem is that for Rio this groups the Men's events and the Women's events together, and therefore reports that the US only won 38 medals, rather than 46. One can't yet fix this, as for the Rio (2016) data, the genders are all marked as Unknown, and the genders aren't noted in the event name like they are in some of the other data.
Speaking of the other data, I tried moving this animation from Rio (2016) to London (2012), because London has its genders marked. However, you again get a small error in some calculations due to unclearly named events, between which it's pretty much impossible to differentiate. For example, the "synchronized 3m springboard women" and the "individual 3m springboard women" have the same name ("3m springboard women"), so if you group the data by event, it looks like China only won one event, when they actually won 2. There's no (nice/safe) way to separate these two events without better-labelled data, as all the other fields are the same.
By the same logic as above, the other animation on exploring.md suffers from the same issues.
Overall, it's currently impossible to accurately transform the medals dataset in some of the most meaningful ways that a journalist might want to try - e.g. actual gold medals per country. It therefore seems like it might be sensible to reclean the relevant columns of the medals dataset, to try to make the whole thing more sortable. One could either do this by adding a "Single/Team" column, and improving the "Gender" column; or by simply improving the "Event" column so that every event is clearly distinguishable. It may also be sensible to clarify somewhere that this is a dataset of medallists, not of medals as such.
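As a rough idea of how a "Single/Team" column might be derived, here is a Python sketch that marks a row as "Team" when more than one medallist shares the same games, event, gender, country and medal. This is only an assumed heuristic: the column names ("Games", "Event", "Gender", "Team", "Medal", "Athlete"), the helper names, and the rows are hypothetical.

```python
from collections import Counter


def _group_key(row):
    # Hypothetical column names; adjust to whatever the cleaned CSV actually uses.
    return (row["Games"], row["Event"], row["Gender"], row["Team"], row["Medal"])


def add_single_team(rows):
    """Return copies of the row dicts with a derived "Single/Team" value."""
    sizes = Counter(_group_key(r) for r in rows)
    return [
        {**r, "Single/Team": "Team" if sizes[_group_key(r)] > 1 else "Single"}
        for r in rows
    ]


# Made-up rows for illustration only.
sample = [
    {"Games": "Example Games", "Event": "Double sculls", "Gender": "Women",
     "Team": "GBR", "Medal": "Gold", "Athlete": "Rower A"},
    {"Games": "Example Games", "Event": "Double sculls", "Gender": "Women",
     "Team": "GBR", "Medal": "Gold", "Athlete": "Rower B"},
    {"Games": "Example Games", "Event": "Single sculls", "Gender": "Women",
     "Team": "GBR", "Medal": "Silver", "Athlete": "Rower C"},
]

for row in add_single_team(sample):
    print(row["Athlete"], row["Single/Team"])
```

This inherits the problems described above: two events that share a name (such as the two springboard events) would be merged and labelled "Team", and a tie between two athletes from the same country would be too. So it can only work after the "Event" and "Gender" columns have been cleaned.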