random discussion
maxogden edited this page Feb 1, 2012 · 1 revision
19:54 < eightysteele> working on a project. 200 institutions all sharing their biodiversity data.
19:54 < eightysteele> want to crowd source improvements.
19:54 < maxogden> wow
19:54 < eightysteele> yeah man, cool stuff
19:54 < eightysteele> so the idea would be
19:54 < eightysteele> i fork your data about tigers, improve the data, and send a pull request
19:55 < eightysteele> i've played with nodejs a smidge
19:55 < eightysteele> notice datacouch has some node deps
19:55 < maxogden> i am rewriting the whole thing in node at the moment (almost done)
19:55 < eightysteele> niiiice
19:55 < maxogden> mostly for performance reasons
19:55 < eightysteele> still rides on couch or no?
19:55 < maxogden> yep
19:56 < eightysteele> amazing
19:56 < eightysteele> so only thing with a nodejs backend is you can't serve the app entirely in couch, yeah?
19:56 < eightysteele> couchapp style
19:56 < maxogden> yeah exactly, but i wrote https://github.com/maxogden/rewriter to run couchapps in node
19:56 < eightysteele> nice!
19:56 < eightysteele> had not seen that
19:57 < eightysteele> are pull requests on deck in dc
19:57 < maxogden> basically im trying to get all the datacouch code to use 100% streaming parsers
19:57 < eightysteele> i see forking is there
19:57 < eightysteele> nice
19:57 < maxogden> pull requests are on the to do list for sure
19:57 < eightysteele> nodejs deps for pull requests or no
19:57 < eightysteele> like, would fork pull work without the node stuff
19:58 < maxogden> yeah you could technically go couch to couch
19:58 < maxogden> right now if you fork someones dataset and improve it
19:58 < maxogden> they can actually replicate back using pure couch
19:58 < maxogden> its just not in the UI
19:59 < maxogden> but im writing the node bits so that you're not just doing one-off forks but actually forking streams so to speak
19:59 < maxogden> so if i fork your dataset that has EVERYTHING IN CAPITAL LETTERS and run a JS function against the data that lowercases everything
19:59 < eightysteele> and you can replicate back to the same couch it was forked from
19:59 < maxogden> now i have a copy of your database in lowercase letters
19:59 < maxogden> but if you add a new doc to the database tomorrow
20:00 < maxogden> i want datacouch to automatically take the new doc and run it through the function i wrote and copy it to my database
20:00 < eightysteele> right
20:00 < eightysteele> exactly
20:00 < maxogden> and then if i send you a pull request it would just take the functions i wrote and apply them to your database as well as any new data that gets added to your database
20:00 < eightysteele> right right
20:01 < maxogden> so instead of sending diffs in the form of lines of code like github you are sending functional transformations
20:01 < eightysteele> huh, nice!
20:01 < maxogden> which is the same as couch's map reduce... a little javascript function that takes 1 doc at a time and emit()s a new value
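[editor's sketch] The "forking streams" idea above can be illustrated in a few lines. This is a minimal in-memory sketch, not datacouch's actual code: `lowercase`, `applyTransforms`, `fork` and `onUpstreamDoc` are hypothetical names, the fork is a plain Map rather than a couch database, and the sample docs are made up.

```javascript
// Hypothetical transform in the style described above: it takes one doc at a
// time and emit()s a new value. This one lowercases every string field.
function lowercase(doc, emit) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    // Leave couch metadata such as _id and _rev untouched.
    out[key] = (typeof value === 'string' && !key.startsWith('_'))
      ? value.toLowerCase()
      : value;
  }
  emit(out);
}

// Hypothetical helper: run a list of transforms over one doc, in order.
function applyTransforms(doc, transforms) {
  let current = doc;
  for (const fn of transforms) {
    let emitted;
    fn(current, (value) => { emitted = value; });
    if (emitted === undefined) return undefined; // transform emitted nothing
    current = emitted;
  }
  return current;
}

// Assumption: a fork is modelled as a Map plus the transforms attached to it.
const fork = { docs: new Map(), transforms: [lowercase] };

// Called for every doc copied at fork time and for every doc added later.
function onUpstreamDoc(doc) {
  const result = applyTransforms(doc, fork.transforms);
  if (result !== undefined) fork.docs.set(result._id, result);
}

// Initial copy of the upstream data (made-up sample doc).
[{ _id: '1', name: 'PANTHERA TIGRIS' }].forEach(onUpstreamDoc);

// A doc added upstream "tomorrow" goes through the same function.
onUpstreamDoc({ _id: '2', name: 'PANTHERA LEO' });
```

In this model a pull request would amount to handing `fork.transforms` to the upstream owner, who runs the same functions against their own docs.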
20:01 < eightysteele> would this thing be usable by scientists who can't write js functions
20:02 < eightysteele> like, what is the bare bones most simple way to get basic fork and pull request going
20:02 < maxogden> yeah js functions are just the underlying tool, but the UI of datacouch abstracts that away a bit
20:02 < eightysteele> sure sure
20:02 < maxogden> like if you click the little arrow on a dataset and get the dropdown menu then there are built in functions
20:02 < eightysteele> right
20:03 < maxogden> another motivation for the rewrite is to make it easier for people to contribute
20:03 < maxogden> cause couchapps are a little more exotic than nodejs now
20:03 < eightysteele> nice
20:03 < eightysteele> haha
20:03 < eightysteele> well
20:03 < eightysteele> so
20:03 < eightysteele> so when you fork, what's happening at the couch level. just a copy of documents?
20:03 < maxogden> so i figure if its just another node app it will be easier to get running
20:03 < eightysteele> right
20:04 < maxogden> eightysteele: yep exactly. the docs get copied to another couch database that your user account owns but all the docs have the same ids and revisions
20:04 < eightysteele> could it be the same couch?
20:04 < eightysteele> for example, 200 institutions have their data in a single couch, and we want people to fork and store the fork in the same couch
20:04 < maxogden> so if you fork a db of 100 docs and they are all at revision 1, then you edit 5 of them, now 5 of them will be at revision 2. if you replicate them back to the upstream master then it will only copy the 5 edited docs
20:05 < eightysteele> right right
20:05 < maxogden> if during the time between your fork and your replication back upstream those same 5 docs were also edited on the upstream couch then couch will tell you you have a conflict
20:05 < eightysteele> yep
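[editor's sketch] The revision bookkeeping described above can be modelled roughly as follows. This is a deliberately simplified sketch and not how CouchDB is implemented: revisions here are plain integers, a snapshot taken at fork time stands in for revision history, and conflicting docs are only reported, not stored. All function names are hypothetical.

```javascript
// Simplified model: a database is a Map of id -> { rev, body }.
function makeDatabase(count) {
  const db = new Map();
  for (let i = 0; i < count; i++) db.set(`doc${i}`, { rev: 1, body: { n: i } });
  return db;
}

// Forking copies every doc with the same id and the same revision.
function copyDatabase(source) {
  const copy = new Map();
  for (const [id, doc] of source) copy.set(id, { ...doc });
  return copy;
}

// Editing a doc bumps its revision by one.
function editDoc(db, id, body) {
  const doc = db.get(id);
  db.set(id, { rev: doc.rev + 1, body });
}

// Copy only docs that changed in the fork since forkPoint. If upstream also
// changed the same doc in the meantime, report a conflict instead.
function replicateBack(fork, upstream, forkPoint) {
  const copied = [];
  const conflicts = [];
  for (const [id, doc] of fork) {
    const baseRev = forkPoint.get(id)?.rev ?? 0;
    if (doc.rev === baseRev) continue; // untouched in the fork
    const upstreamRev = upstream.get(id)?.rev ?? 0;
    if (upstreamRev !== baseRev) {
      conflicts.push(id);
      continue;
    }
    upstream.set(id, { ...doc });
    copied.push(id);
  }
  return { copied, conflicts };
}

const upstream = makeDatabase(100);
const forkPoint = copyDatabase(upstream); // snapshot of revisions at fork time
const fork = copyDatabase(upstream);

// Edit 5 docs in the fork, and one of the same docs upstream.
for (let i = 0; i < 5; i++) editDoc(fork, `doc${i}`, { n: i, fixed: true });
editDoc(upstream, 'doc0', { n: 0, note: 'edited upstream' });

const result = replicateBack(fork, upstream, forkPoint);
console.log(result.copied.length, result.conflicts);
```

Of the 100 docs, only the 5 edited ones are considered at all: four are copied back and `doc0` is flagged because it changed on both sides.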
20:05 < maxogden> eightysteele: they can be on the same couch or different couches, no difference (other than upload/download speed)
20:06 < eightysteele> so you just keep track of forks by grouping users with revision numbers on a doc
20:07 < eightysteele> haha, no, that doesn't sound quite right. can you give me a pointer to your code where forking happens. it might make more sense seeing it.
20:07 < maxogden> eightysteele: not that granular actually, i just keep a list of every dataset and what user owns it, and if its a fork what dataset it was forked from
20:07 < eightysteele> oh, nice, that's pretty simple actually.
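[editor's sketch] The bookkeeping described above (every dataset, its owner, and what it was forked from) fits in a very small data structure. The sketch below is an assumption about its shape, not a copy of the linked provisioner code; the `forkedFrom` field name is borrowed from later in this log, everything else is hypothetical.

```javascript
// Hypothetical registry: one record per dataset, keyed by dataset id.
const datasets = new Map();

function createDataset(id, owner) {
  datasets.set(id, { id, owner, forkedFrom: null });
}

function forkDataset(sourceId, newId, owner) {
  if (!datasets.has(sourceId)) throw new Error(`unknown dataset: ${sourceId}`);
  datasets.set(newId, { id: newId, owner, forkedFrom: sourceId });
}

// All direct forks of a dataset.
function forksOf(id) {
  return [...datasets.values()]
    .filter((d) => d.forkedFrom === id)
    .map((d) => d.id);
}

// Walk from a fork back to the original dataset.
function lineage(id) {
  const chain = [];
  for (let cur = datasets.get(id); cur; cur = datasets.get(cur.forkedFrom)) {
    chain.push(cur.id);
  }
  return chain;
}

createDataset('tigers', 'maxogden');
forkDataset('tigers', 'tigers-cleaned', 'eightysteele');
forkDataset('tigers-cleaned', 'tigers-cleaned-2', 'maxogden');
```

Nothing here tracks individual docs or revisions; that part is left to couch, as the conversation says.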
20:08 < eightysteele> so then merging back in a fork... is that more tricky?
20:08 < maxogden> https://github.com/maxogden/datacouch/blob/node_server/service/database_provisioner.js#L49
20:08 < eightysteele> there we go, nice
20:08 < maxogden> i havent thought through all of the scenarios yet admittedly
20:09 < maxogden> so i have yet to decide whether pull requests will only be javascript functions
20:09 < maxogden> or if they will also include the documents
20:09 < eightysteele> right
20:09 < maxogden> technically every action you take in the UI in datacouch can be represented as a javascript function
20:10 < eightysteele> that's a good point
20:10 < maxogden> like if you edit 1 field in 1 document and update its value then the function would be function(doc, emit) { if (doc._id === "5") doc.name = "bob"; emit(doc) }
20:10 < eightysteele> ah, yes, this makes more sense now
20:10 < maxogden> but it wouldnt make sense from a performance standpoint to do it that way
20:10 < eightysteele> good example
20:11 < maxogden> because if you have 100000 docs it would go through 99999 of them and do nothing
20:11 < eightysteele> haha, yeah
20:11 < maxogden> so i need to figure out what the 'patch format' is for data
20:11 < eightysteele> patch format... like the diff format or no
20:11 < maxogden> maybe its a combination of functions and also HTTP operations like the normal couch api
20:12 < eightysteele> i like the thought of riding on http for this
20:12 < maxogden> yeah basically if i fork your dataset and change it
20:12 < maxogden> how do i describe my changes in the most flexible way
20:12 < eightysteele> yep
20:12 < maxogden> so that you can pick and choose which changes you want to have merged into your data
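[editor's sketch] The log leaves the "patch format" undecided, so the following is purely one hypothetical shape for it: a list of changes where each entry is either a single-doc write (in the spirit of the normal couch HTTP api) or a function run over every doc, and the upstream owner chooses which entries to accept. None of these names or fields come from datacouch.

```javascript
// Hypothetical patch: an ordered list of independently acceptable changes.
const patch = [
  // Targeted write: touches one doc instead of scanning all of them.
  { id: 'c1', type: 'put', docId: '5', fields: { name: 'bob' } },
  // Functional transformation: applied to every doc.
  { id: 'c2', type: 'function', fn: (doc) => ({ ...doc, name: doc.name.toLowerCase() }) },
];

// Apply only the accepted changes; returns how many doc writes were made.
function applyPatch(db, changes, acceptedIds) {
  let touched = 0;
  for (const change of changes) {
    if (!acceptedIds.includes(change.id)) continue;
    if (change.type === 'put') {
      const existing = db.get(change.docId) || { _id: change.docId };
      db.set(change.docId, { ...existing, ...change.fields });
      touched += 1;
    } else if (change.type === 'function') {
      for (const [id, doc] of db) {
        db.set(id, change.fn(doc));
        touched += 1;
      }
    }
  }
  return touched;
}

// Made-up upstream data.
const db = new Map([
  ['4', { _id: '4', name: 'ALICE' }],
  ['5', { _id: '5', name: 'CAROL' }],
  ['6', { _id: '6', name: 'DAVE' }],
]);

// The upstream owner accepts the single-doc edit but not the lowercase function.
const writes = applyPatch(db, patch, ['c1']);
```

Accepting only `c1` performs one write, which is the point made above about not walking 99999 docs to change one.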
20:12 < maxogden> i think theres 3 types of forking
20:13 < maxogden> a one time copy of all the documents
20:13 < eightysteele> right
20:13 < maxogden> a copy of the documents that uses continuous replication from couch
20:13 < eightysteele> yep
20:13 < maxogden> and then a continuous copy that also pipes through any functions that you have added to your fork
20:13 < eightysteele> k
20:13 < eightysteele> yeah that covers it
20:14 < eightysteele> 3 seems most complex
20:14 < maxogden> part of me just wants to do the 3rd one and give you controls on a function by function basis if you want each function to affect new data thats written
20:14 < maxogden> and then another control that says 'do you want data from the forkedFrom database to get automatically copied into your fork in real time or not'
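[editor's sketch] The two controls described above are enough to express all three kinds of fork. The settings shape below is an assumption for illustration only; `followUpstream` and `applyToNewData` are hypothetical names, not datacouch options.

```javascript
// Hypothetical fork settings: one upstream switch plus one switch per function.
function makeFork(functions, followUpstream) {
  return { docs: new Map(), functions, followUpstream };
}

// Called whenever a doc is written to the database this fork came from.
function handleUpstreamWrite(fork, doc) {
  if (!fork.followUpstream) return false; // behaves like a one-time copy
  let current = doc;
  for (const { fn, applyToNewData } of fork.functions) {
    if (applyToNewData) current = fn(current);
  }
  fork.docs.set(current._id, current);
  return true;
}

const lower = (doc) => ({ ...doc, name: doc.name.toLowerCase() });

// 1. one-time copy: ignores later upstream writes
const oneTime = makeFork([], false);
// 2. continuous replication: copies new docs unchanged
const mirror = makeFork([{ fn: lower, applyToNewData: false }], true);
// 3. continuous copy piped through the fork's functions
const piped = makeFork([{ fn: lower, applyToNewData: true }], true);

const newDoc = { _id: '7', name: 'NEW RECORD' };
handleUpstreamWrite(oneTime, newDoc);
handleUpstreamWrite(mirror, newDoc);
handleUpstreamWrite(piped, newDoc);
```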
20:15 < eightysteele> wow, that's super flex
20:15 < eightysteele> very nice
20:15 < maxogden> so far ive been trying to keep things as simple as possible, then i realized i could support all the fun real time streaming stuff using nodejs
20:15 < maxogden> so now that the backend is written for doing that stuff i need to figure out how i actually want to implement it
20:16 < eightysteele> yeah man, that's really good stuff
20:16 < maxogden> but there are other pending things as well, such as putting visualizations on top of the data in the form of HTML5 apps
20:16 < eightysteele> for my use case, simpler is better. we have the constraint of riding 100% on couchdb, so no other server
20:16 < eightysteele> html5 ftw
20:16 < maxogden> theres basic support for installing couchapp templates into your database to do things like put any geo data on a map
20:17 < maxogden> but it could be way better