add some advice on structuring #8

Open: wants to merge 1 commit into base: master
9 changes: 8 additions & 1 deletion data-bulletproofing.md
@@ -32,7 +32,7 @@ _by Jennifer LaFleur, ProPublica_
- Beware of nonscientific methods: web surveys, man-on-the-street interviews, or other self-selected samples.
- Know the sample size, which will give you the sampling error.
- Again, know the source.
- Account for margin of error and non-response or “don’t know” when drawing conclusions.
- If possible, run statistical tests on the data. What looks significant to you may not be.
- When reporting, avoid false precision. Saying 52.18 percent of people think “blah, blah, blah” conveys a level of accuracy the data cannot support.
- Put your numbers in perspective
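
As a rough illustration of the sample-size and false-precision points above, the sketch below computes an approximate 95 percent margin of error for a proportion, using the textbook formula `z * sqrt(p * (1 - p) / n)` for a simple random sample. The function names and the poll figures are hypothetical, and weighted or clustered surveys generally need a larger error estimate than this gives.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a proportion p from a simple random sample of size n.

    z=1.96 corresponds to roughly 95 percent confidence.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return z * math.sqrt(p * (1.0 - p) / n)


def describe(p, n):
    """Report a share in whole percentage points along with its margin of error."""
    moe = margin_of_error(p, n)
    return f"{round(p * 100)} percent, plus or minus {round(moe * 100)} points"


# Hypothetical poll: 52.18 percent of 600 respondents.
print(describe(0.5218, 600))  # 52 percent, plus or minus 4 points
```

With a margin of about four points, the second decimal place in "52.18 percent" carries no information, which is why the rounded figure is the honest one to print.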
@@ -69,6 +69,13 @@ _Russell Clemmings of the Fresno Bee on rechecking your data:_
- Have someone who knows the data check your results before publication -- even the target of the story, if possible.
- Double-check surprising results -- if citations spiked by 50 percent in one year, it could be a story or it could (more likely) be an error.
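
One way to make the "double-check surprising results" habit routine is to flag large year-over-year swings automatically so a person reviews them before publication. This is a minimal sketch; the function name, the 50 percent threshold, and the citation counts are all made up for illustration.

```python
def flag_surprising_changes(yearly_counts, threshold=0.5):
    """Return (year, previous, current, change) for swings at or above the threshold.

    yearly_counts maps year -> count; threshold is a fractional change (0.5 = 50 percent).
    """
    flagged = []
    years = sorted(yearly_counts)
    for prev_year, year in zip(years, years[1:]):
        previous, current = yearly_counts[prev_year], yearly_counts[year]
        if previous == 0:
            # A percentage change from zero is undefined; flag it for a manual look.
            flagged.append((year, previous, current, None))
            continue
        change = (current - previous) / previous
        if abs(change) >= threshold:
            flagged.append((year, previous, current, change))
    return flagged


# Hypothetical citation counts by year.
citations = {2009: 1200, 2010: 1230, 2011: 1850, 2012: 1790}
for year, previous, current, change in flag_surprising_changes(citations):
    print(f"{year}: {previous} -> {current}, check the source data before reporting")
```

A flag here is a prompt to go back to the raw records, not evidence of either a story or an error.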

_Peter Harkins, formerly of the Washington Post, on structuring data:_

- Extract real data (choose the worst, noisiest stuff) to use as fixtures in automated tests, so you'll know right away when a change breaks something.
- Don't model anything as many-to-many. Name the intermediate concept and give it two one-to-many associations. It will almost always want to accumulate more info in the future, and the cost of changing your model will be high.
Contributor

I'm not sure I fully understand this point. In what situation would a many-to-many relationship not have an intermediate table / linking vehicle like you describe?

Author

It will have to have an intermediate table, but I'm saying that the intermediate table must be considered an entity in its own right. For instance, if you link authors and screenplays in Rails, it'll call the join table `authors_screenplays`. Don't do that: name it `authorships`, model it, and treat it as a first-class record rather than plumbing. It will certainly gain complexity over time, and the longer you go before recognizing the relationship as a record, the more pain your code will suffer.

- Manually entered data (especially intern-powered scraping) needs even more spot checks than programmatically scraped data.
- You can't communicate too much or too often with the journalists, sources, and experts who know the topic better than you do about how you're extracting the data.
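
The advice above about naming the intermediate concept can be sketched outside Rails as well. The example below uses Python's built-in `sqlite3` module and the `authorships` name from the discussion; the `role` column and the sample rows are hypothetical, chosen only to show the kind of attribute a link record tends to pick up later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE screenplays (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- The link is a named record with its own id, so it can gain columns later.
CREATE TABLE authorships (
    id INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES authors(id),
    screenplay_id INTEGER NOT NULL REFERENCES screenplays(id),
    role TEXT NOT NULL DEFAULT 'writer'  -- hypothetical extra attribute
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO authors (id, name) VALUES (?, ?)",
                 [(1, "A. Writer"), (2, "B. Doctor")])
conn.execute("INSERT INTO screenplays (id, title) VALUES (1, 'Example Script')")
conn.executemany(
    "INSERT INTO authorships (author_id, screenplay_id, role) VALUES (?, ?, ?)",
    [(1, 1, "writer"), (2, 1, "script doctor")],
)


def credits_for(conn, screenplay_id):
    """Return (author name, role) pairs for one screenplay via the authorships records."""
    return conn.execute(
        """SELECT authors.name, authorships.role
           FROM authorships
           JOIN authors ON authors.id = authorships.author_id
           WHERE authorships.screenplay_id = ?
           ORDER BY authorships.id""",
        (screenplay_id,),
    ).fetchall()


print(credits_for(conn, 1))
```

Each side has a one-to-many association to `authorships`, so adding another attribute to the link later means adding a column rather than restructuring the model.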

## For More Information

Numbers in the Newsroom: Using Math and Statistics in News by Sarah Cohen for Investigative Reporters and Editors, Inc.