Simple tool to analyze data in order to support documenting some ad hoc "schema" of available information-elements.
The simple steps to get you enjoying this.
The data at hand is worth your time
You have a bunch of data that is worth being exchanged: some people are willing to maintain it, others are eager to do stuff with it.
Some of them actually have a clue about what the data supposedly means, and can be motivated to help augment whatever systematic description a robot has already derived from it.
Automatic Data Analyzer
One side of the shmdoc project is an automated process (robot) that analyzes any data presented to it to produce a basis for further documentation.
From the actual data, this system will primarily list all the various information elements found in the data, and where to find them.
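To make "listing information elements and where to find them" concrete, here is a minimal sketch of the idea, not the actual shmdoc code. The function name `collectElements`, the path notation (`.` for keys, `[]` for array items) and the recorded statistics are all assumptions made for illustration.

```javascript
// Illustrative sketch only: NOT the shmdoc implementation.
// Walks a parsed JSON value and records every leaf element by its path.
// The path syntax ("a.b", "a[]") is an assumption made for this example.
function collectElements(value, path = '', found = new Map()) {
  if (Array.isArray(value)) {
    // All items of an array share one path, marked with [] instead of an index.
    for (const item of value) collectElements(item, path + '[]', found);
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      collectElements(child, path ? `${path}.${key}` : key, found);
    }
  } else {
    // Leaf value: count the occurrence and remember which types were seen.
    const entry = found.get(path) || { count: 0, types: new Set() };
    entry.count += 1;
    entry.types.add(value === null ? 'null' : typeof value);
    found.set(path, entry);
  }
  return found;
}

// Example run on a tiny in-memory dataset.
const sampleData = [{ id: 1, tags: ['a', 'b'] }, { id: 2, name: 'x' }];
for (const [elementPath, info] of collectElements(sampleData)) {
  console.log(elementPath, info.count, [...info.types].join('|'));
}
```

The point of such a listing is that it comes from the real data rather than from a schema written up front, which gives people something concrete to start documenting.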
To let the robot do its work, you simply need to
- install nodejs,
- check out this project and run
npm install
- then use the tool with a command like
node index.js datafiles*.json -r path/to/write/result.json
You would typically set this up on a server that, through some scheduled job (cron),
- downloads the data
- feeds the data to the robot
- makes the result.json of the robot available via a web-server to the next step
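The three steps above could be wired together in a small Node script that cron calls. This is only a sketch under assumptions: `runCycle`, its `download` callback, `projectDir` and `publishDir` are hypothetical names, and the publish directory is assumed to be served by a web server you already run. Only the `node index.js <files> -r <result>` invocation comes from this README.

```javascript
const { execFileSync } = require('node:child_process');
const path = require('node:path');

// Builds the argument list for the documented command:
//   node index.js datafiles*.json -r path/to/write/result.json
// execFileSync does not expand shell globs, so pass an explicit file list.
function robotArgs(dataFiles, resultPath) {
  return ['index.js', ...dataFiles, '-r', resultPath];
}

// One scheduled cycle. All option names here are hypothetical.
// - download:   callback that fetches the data and returns absolute file paths
// - projectDir: where this project was checked out (contains index.js)
// - publishDir: directory assumed to be exposed by your web server
function runCycle({ download, projectDir, publishDir }) {
  const dataFiles = download();                             // step 1: download the data
  const resultPath = path.join(publishDir, 'result.json');
  execFileSync(process.execPath, robotArgs(dataFiles, resultPath), {
    cwd: projectDir,                                        // step 2: feed the data to the robot
    stdio: 'inherit',
  });
  return resultPath;                                        // step 3: result.json now sits in publishDir
}
```

A cron entry would then simply run a script that calls `runCycle` with your own download logic and paths.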
Manually Adding Human Insight (aka Documenting)
shmdoc chooses to simply use the Google Spreadsheets product, which allows both further documenting and immediately publishing the documentation about the data.
To use it, create a new Google Spreadsheet and add the shmdoc add-on from the add-on marketplace.
The code for the project is available on GitHub, and published from this apps-project.
Roundtrip the process to effectively maintain the documentation and assert the shmdoc
The good thing is that the human-augmented information in the Google Spreadsheet can now be fed back into the robot (step 1).
Doing this will make the robot still do its earlier job (i.e. learn about possible new fields), but it will additionally check and assert the human-endorsed version of the shmdoc coming from the Google Spreadsheet. In doing so, it will list any value errors it encountered in the (possibly newer) datasets.
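How the robot performs this check is not spelled out here. Purely as an illustration of the idea, the sketch below assumes a hypothetical documented form in which each field name maps to an expected JavaScript type; the real shmdoc columns and checks may well differ. Fields that are not documented yet are skipped rather than reported, since discovering new fields is the robot's other job.

```javascript
// Illustrative sketch only: the shape of `documented` is an assumption,
// not the actual shmdoc spreadsheet format.
function findValueErrors(records, documented) {
  const errors = [];
  records.forEach((record, index) => {
    for (const [field, value] of Object.entries(record)) {
      // Undocumented field: a "new field" finding, not a value error.
      if (!Object.prototype.hasOwnProperty.call(documented, field)) continue;
      const expectedType = documented[field].type;
      if (typeof value !== expectedType) {
        errors.push({ index, field, value, expected: expectedType });
      }
    }
  });
  return errors;
}

// Example: the second record violates the documented type of "age".
const documentedFields = { age: { type: 'number' } };
const newerRecords = [{ age: 3 }, { age: 'unknown', extra: 1 }];
console.log(findValueErrors(newerRecords, documentedFields));
```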
The easiest way to close this loop is to
- publish the shmdoc-spreadsheet as csv straight from google-docs
- feed that published uri back into the robot with the -b <<uri>> switch.
The google-spreadsheet-uri should typically follow this pattern: https://docs.google.com/spreadsheets/d/{doc_id}/pub?gid={tab_id}&single=true&output=csv
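A small helper can build that uri from the two identifiers and append it to the robot's arguments. The function names `publishedCsvUri` and `robotArgsWithBase` are made up for this sketch, and combining `-r` and `-b` in a single invocation is an assumption based on the two switches mentioned in this README.

```javascript
// Builds the published-CSV uri following the pattern given above.
// docId and tabId are the {doc_id} and {tab_id} placeholders of that pattern.
function publishedCsvUri(docId, tabId) {
  const doc = encodeURIComponent(docId);
  const tab = encodeURIComponent(tabId);
  return `https://docs.google.com/spreadsheets/d/${doc}/pub?gid=${tab}&single=true&output=csv`;
}

// Assumed argument order: data files, then -r <result>, then -b <uri>.
function robotArgsWithBase(dataFiles, resultPath, baseUri) {
  return ['index.js', ...dataFiles, '-r', resultPath, '-b', baseUri];
}

// Example with placeholder identifiers (not a real spreadsheet).
console.log(robotArgsWithBase(['data.json'], 'result.json', publishedCsvUri('DOC_ID', 0)).join(' '));
```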
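If all is well, you have successfully set up the full cycle: the robot analyzes the data, people document it in the spreadsheet, and the published spreadsheet is fed back into the robot.
- The robot currently only reads samples in json, xml and csv. We're open to additions that support more formats.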
- More suggestions and patches are welcome.
My personal, somewhat related, and almost relevant experience in this space can be summed up as coming from these random events:
During an ancient classical typed versus untyped language debate a wise man once told me
"Nobody (really) is interested in the types, only in the values".
The common knowledge among friends regarding Ikea furniture is
"Manuals (any documentation) are for sissies".
There. You can twist the above in a number of ways, so feel free :)
Here is my read: When faced with (real) people exchanging data through some formal mechanism I see them quickly agreeing on some serialization format (xml, json, csv) - but never taking the time to actually list or describe the information-elements being communicated, nor checking if their claims about the nature of the values really can be asserted.
And later, when the Master Engineering Peeps emerge from their Ivory Towers up in the Sky (gotta love the view) and see the Silly Dilettantes failing to bootstrap even the simplest interchange... they mostly end up removing any remaining enthusiasm and motivation by insisting upon a full-blown Schema. Ouch.
Having seen many of those without even the most basic shred of comments or explanations, let alone semantics, I've come to realise that once again this is about human communication, and not about technical wizardry. Far more often than from a schema, the project would benefit from simple (maintained) documentation of all available information.
What shmdoc does is just make sure people get to analyze real data (what they are interested in) and describe in the most intuitive and pragmatic way what it means for them.
The name is pronounced sjiemdoc (ʃim'dɔk) and spelled all lowercase: shmdoc.
It was meant as a play on 'only the gist (no vowels) of the schema' plus a 'focus on getting it documented' by a group of people.
The data-structures produced by the shmdoc robot and maintained in the spreadsheet do NOT strictly follow an explicit schema.
Why bother? After all, you can simply consult the shmdoc on shmdoc!