
Split mongo architecture and rollout options

Don Mitchell edited this page Jan 7, 2014 · 42 revisions

Executive Summary & Action page

Goal: how to wire up split mongo asap with as little risk as possible?

Real goal: how to get RESTful api asap for SPOCs and other uses?

In particular, @rrubin says he wants to see these RESTful API use cases running:

  1. SPOC subsetting
    1. Create new course from existing course
    2. Set dates
    3. Delete some chapters, sections, units, &/or components
    4. Publish
  2. SPOC compilation
    1. Create new course from existing course
    2. Add chapter from another course
    3. Set dates
    4. Publish

Note: these do not require any lms functionality (grading, student tracking, randomization). The focus is on xblock CRUD and sharing.

I don't know the priority of the following Split mongo use cases, but they've been my highest priorities:

  1. Flag conflicting edits (forks)
  2. Undo edits
  3. Publish transactionally rather than inadvertent dribble

There are 2 RESTful APIs:

  1. Studio's existing one
  2. The proposed central one.

The proposed central one depends upon split mongo or at least a full integration of the new Locator syntax.

Of the above use cases, Studio's existing RESTful api supports:

  1. Create new course (but from scratch not from existing)
  2. Set dates (any xblock field editing)
  3. Create, update, or delete any xblocks
  4. Publish subtree

Risks of going live on split mongo:

  1. Performance: effect of any data migrations if done lazily. That is, will large course migrations block user access during migration?
  2. Non-invertibility of migration: what to do if a migrated course has a defect, since migration is only from old to split?

Decisions/options requiring action:

  1. Course migration from old to split mongo:
    1. Big bang: migrate all courses or all that may be edited?
    2. Lazy: migrate upon attempt to write to old mongo?
    3. Controlled dribble: explicitly migrate some subset and increase that subset over time
      1. Does Studio need to support unmigrated courses for more than read access? (hybrid split)
      2. Will this strategy only apply to edx or also edge and other sites?
      3. Can we implement this strategy by having two separate code branches and servers? One for old mongo and one for split?
  2. Hybrid split if chosen: Have Studio support both back ends at the same time for not only read but also write to enable gradual and deliberate course migration?
    1. broadcast updates to both to enable reversion to old if needed?
    2. just assign courses to one or the other (split v old)?
  3. Separate code branches and servers?
    1. Much less time to go live: don't need to work out co-habitation which has been the main impediment
    2. Requires updating lms to use split backend or changing publishing to publish to old as well as split
    3. Requires production ops to go back to the same type of dispatching as we were using when we had xml and old mongo running on separate servers. At a minimum, this should be a short-term strategy.
  4. Use & extend Studio's existing restful api or implement the more general one we proposed (in the short-run)?
  5. Choose where to put the locator - location mapping (see locator-location-locus)

Punchlist for go-live:

  1. xml export from split
  2. have mixed modulestore figure out whether to read & write to split v old mongo v xml
  3. if using broadcast model of updates, implement that.
  4. if using hybrid, reconcile the method signatures or have mixed know how to invoke each
  5. command line or admin page to invoke course migration from old to split mongo (unless using lazy migration only)
  6. what if any of the split mongo use cases above to support in Studio? What to do w/ that functionality in case of hybrid split b/c old won't support the use cases?
  7. hook up Studio to split &/or hybrid
  8. hook up lms to hybrid or split
  9. test, test, test
  10. extend the studio restful api or implement the general one

Architectural depictions with options

To illustrate the differences among these architectures, I will use a combined studio and lms use case. You may want to imagine what you think the students should see at each point:

  1. Teacher creates course, sections, and subsections (Studio)
  2. Student1 registers for course (LMS)
  3. Student1 looks at course content (LMS)
  4. Teacher creates units and components (Studio)
  5. Teacher edits titles and dates for the course, sections, and subsections (Studio)
  6. Teacher configures grading policy and marks some subsections as graded (Studio)
  7. Student1 looks at course content (LMS)
  8. Teacher makes some units (u_0..u_i) public (Studio)
  9. Student1 looks at course content (LMS)
  10. Student1 works through u_0..u_i (LMS)
  11. Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
  12. Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i (Studio)
  13. Student2 looks at course content (LMS)
  14. Student2 works through u_0..u_i (LMS)
  15. Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
  16. Student3 looks at course content (LMS)
  17. Student3 works through u_0..u_i (LMS)

Pre-split mongo

This section covers how the system worked before split mongo and the location mapper.

Pre-split architecture stack

Pre-split mongo architecture stack

This document does not currently fully explain this stack, but here are some notes on the diagram.

  • The top (yellow) are the user facing clients: currently just browser clients.
  • The next layer (light green) shows the app layer which is primarily restful and non-restful url handlers with any client models and app logic (e.g., most grading).
  • The dark green shows (external) grading and analytics as disconnected services purely as a reminder that these and others like them (e.g., drupal) exist, not to show how they use the back end. It would be good to get diagrams of how these plug into the back ends.
  • The cyan layer is the data access and modeling layer. It handles figuring out the identities and repositories, serializing and deserializing data, determining authorization, etc.
  • The xblock runtime currently is subordinate to the modulestore layer which instantiates it, computes addresses, and feeds it data models. lms writes directly to it for student state data which the runtime then persists directly in SQL; however, all courseware writes go through the modulestore layer. I believe @Cale envisions the xblock runtime as above or encompassing the modulestore; however, it currently doesn't and it's not obvious to me how it can.

Use case

  1. Teacher creates course, sections, and subsections (Studio)
    1. Studio uses MixedModulestore to create entries in Mongo
  2. Student1 registers for course (LMS)
    1. LMS uses auth svcs to create entries in SQL
  3. Student1 looks at course content (LMS)
    1. LMS uses MixedModulestore to access all of the courseware from step 1
    2. Student just sees an outline of the course with no content
  4. Teacher creates units and components (Studio)
    1. Studio uses MixedModulestore to create draft entries in Mongo
  5. Teacher edits titles and dates for the course, sections, and subsections (Studio)
    1. Studio uses MixedModulestore to update the entries in Mongo
  6. Teacher configures grading policy and marks some subsections as graded (Studio)
    1. Studio uses MixedModulestore to update the entries in Mongo
  7. Student1 looks at course content (LMS)
    1. LMS uses MixedModulestore to access all of the courseware from step 1
    2. Student1 sees an outline of the course with no content but with grading, dates, and new titles
  8. Teacher makes some units (u_0..u_i) public (Studio)
    1. Studio uses MixedModulestore to rename draft entries as non-draft ones in Mongo
  9. Student1 looks at course content (LMS)
    1. LMS uses MixedModulestore to access all of the courseware from step 1
    2. Student1 sees content
  10. Student1 works through u_0..u_i (LMS)
    1. LMS records student state via xblock runtime to SQL
  11. Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
    1. Studio uses MixedModulestore to update the entries in Mongo
  12. Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i and changing the order of some units and components (Studio)
    1. Studio uses MixedModulestore to copy non-draft entries into ones marked draft and update the draft entries in Mongo
    2. Studio updates the children of the subsections for the inserts and reorders.
  13. Student2 looks at course content (LMS)
    1. LMS uses MixedModulestore to access the courseware
    2. Student2 sees content in its new chapter, section, subsection, and unit order but not new component order and does not see the new units nor the changes to u_0..u_i
  14. Student2 works through u_0..u_i (LMS)
    1. LMS records student state via xblock runtime to SQL
  15. Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
    1. Studio uses MixedModulestore to convert draft entries to non-draft overwriting the existing non-drafts (and removing the drafts) in Mongo
  16. Student3 looks at course content (LMS)
    1. LMS uses MixedModulestore to access the courseware
    2. Student3 sees all content in its new order and material
  17. Student3 works through u_0..u_i (LMS)
    1. LMS records student state via xblock runtime to SQL
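
The draft handling for units in the walkthrough above can be sketched with a small in-memory model. This is a minimal sketch under stated assumptions, not the modulestore's code: `OldStyleStore` and its method names are hypothetical, and a dict stands in for the Mongo collection. It only models units, where edits go to a draft copy and publishing overwrites the non-draft entry.

```python
class OldStyleStore:
    """Hypothetical stand-in for old mongo: one entry per (block id, draft flag)."""

    def __init__(self):
        self.entries = {}  # (block_id, is_draft) -> dict of fields

    def create_draft(self, block_id, fields):
        self.entries[(block_id, True)] = dict(fields)

    def edit(self, block_id, **changes):
        # Editing a public block first copies it into an entry marked draft.
        if (block_id, True) not in self.entries:
            self.entries[(block_id, True)] = dict(self.entries[(block_id, False)])
        self.entries[(block_id, True)].update(changes)

    def publish(self, block_id):
        # The draft overwrites any existing non-draft entry and is removed.
        self.entries[(block_id, False)] = self.entries.pop((block_id, True))

    def lms_view(self, block_id):
        # The LMS only sees non-draft entries.
        return self.entries.get((block_id, False))


store = OldStyleStore()
store.create_draft("u_0", {"title": "Unit 0"})
before_public = store.lms_view("u_0")            # nothing visible yet
store.publish("u_0")                             # "make public"
store.edit("u_0", title="Unit 0 (revised)")      # teacher edits a live unit
while_editing = store.lms_view("u_0")["title"]   # students still see the old text
store.publish("u_0")
after_publish = store.lms_view("u_0")["title"]
```

Nothing in this model records what a student saw before the second publish, which is the gap the versioned design is meant to close.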

Long-term split mongo architecture

Eventually split mongo will completely replace the current mongo; so, the diagram will look just like the above one except that Mongo Modulestore will be Split Mongo Modulestore with its 3 collections giving it the ability to support editing undo, reusing content among courses, tracking changes over time (who and when), adding organizational governance over course id namespaces, and running courses over and over without export, rename, and import. The version tracking, for example, will enable the lms to know that student1 did not see the subsequently inserted material and mark which of u_0..u_i changed since the student saw them so the student can decide whether to check out the changes. It will enable analytics to compare performance before and after a courseware change. It will enable course authors to compare versions.

The eventual split mongo architecture will execute the use case above as follows:

  1. Teacher creates course, sections, and subsections (Studio)
    1. Studio uses MixedModulestore to create entries in draft course version in Mongo
  2. Student1 registers for course (LMS)
    1. LMS uses auth svcs to create entries in SQL
  3. Student1 looks at course content (LMS)
    1. LMS uses MixedModulestore to notice that there is no published content yet for the course
    2. Student sees that the course has no content nor outline yet
  4. Teacher creates units and components (Studio)
    1. Studio uses MixedModulestore to create draft entries in Mongo
  5. Teacher edits titles and dates for the course, sections, and subsections (Studio)
    1. Studio uses MixedModulestore to update the entries in Mongo
  6. Teacher configures grading policy and marks some subsections as graded (Studio)
    1. Studio uses MixedModulestore to update the entries in Mongo
  7. Student1 looks at course content (LMS)
    1. LMS uses MixedModulestore to notice that there is no published content yet for the course
    2. Student sees that the course has no content nor outline yet
  8. Teacher publishes some units (u_0..u_i) and their parents (Studio)
    1. Studio uses MixedModulestore to create a published branch and version and then copies the draft entries into it via Mongo
  9. Student1 looks at course content (LMS)
    1. LMS uses MixedModulestore to access the courseware
    2. Student sees content
  10. Student1 works through u_0..u_i (LMS)
    1. LMS records student state via xblock runtime to SQL
  11. Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
    1. Studio uses MixedModulestore to update the entries in draft branch in Mongo
  12. Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i and changing the order of some units (Studio)
    1. Studio uses MixedModulestore to update the entries in draft branch in Mongo
  13. Student2 looks at course content (LMS)
    1. LMS uses MixedModulestore to access the courseware
    2. Student2 sees same content as Student1 saw in the same order
  14. Student2 works through u_0..u_i (LMS)
    1. LMS records student state via xblock runtime to SQL
  15. Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
    1. Studio uses MixedModulestore to create a new live version in Mongo containing the changes from the draft entries being published
  16. Student3 looks at course content (LMS)
    1. LMS uses MixedModulestore to access the courseware
    2. Student3 sees content in its new order all the way down and new content
  17. Student3 works through u_0..u_i (LMS)
    1. LMS records student state via xblock runtime to SQL
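
The branch and version behavior in this walkthrough can be illustrated with a toy model. This is a sketch under stated assumptions, not split mongo's schema: `SplitStyleCourse` and its methods are hypothetical, and one dict of immutable versions stands in for the 3 collections mentioned above. Each edit or publish creates a new version that remembers its predecessor, which is what makes undo and "what did the student see" possible.

```python
import itertools


class SplitStyleCourse:
    """Hypothetical model: named branches point at immutable, numbered versions."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.versions = {}  # version id -> {"blocks": {...}, "previous": id or None}
        self.branches = {"draft": None, "published": None}

    def _commit(self, branch, blocks):
        version_id = next(self._ids)
        self.versions[version_id] = {
            "blocks": dict(blocks),
            "previous": self.branches[branch],
        }
        self.branches[branch] = version_id
        return version_id

    def blocks(self, branch):
        head = self.branches[branch]
        return {} if head is None else self.versions[head]["blocks"]

    def edit_draft(self, block_id, title):
        blocks = dict(self.blocks("draft"))
        blocks[block_id] = title
        return self._commit("draft", blocks)

    def publish(self):
        # Publishing creates a new published version from the draft head.
        return self._commit("published", self.blocks("draft"))

    def undo_draft(self):
        # Undo moves the draft head back; the undone version stays on record.
        head = self.branches["draft"]
        if head is not None:
            self.branches["draft"] = self.versions[head]["previous"]


course = SplitStyleCourse()
course.edit_draft("u_0", "Unit 0")
nothing_live = course.blocks("published")        # no published branch content yet
first_live = course.publish()
course.edit_draft("u_k", "Inserted unit")
still_old = dict(course.blocks("published"))     # Student2 sees what Student1 saw
second_live = course.publish()
course.undo_draft()                              # draft returns to the pre-insert state
```

Because `second_live` records `first_live` as its predecessor, the model can answer which published version was live when a given student worked through a unit.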

Hybrid intermediate state of split running with old mongo

The focus of this document is which of several intermediate state options we should support. The reason for the intermediate hybrid state is to incrementally deploy functionality and to not require a big bang conversion of all existing course material and records.

We have 2 possible approaches:

  1. make a purely split version of studio and lms and run it on separate server(s) from the current one
  2. make the current one simultaneously support some things in split and some in old mongo

Of these, the first is the easiest but runs the risk of another long-running code branch and the expense of nginx having to route requests based on course id (or lack thereof).

Roadblocks to big bang conversion:

  1. To enable reusing content among courses and versioning content, the new representation has a richer and slightly incompatible addressing scheme (Locators). This complicates:
    1. student state which uses the old locations
    2. analytics using the old locations
    3. references within a course to other course locations
      1. especially if the material is now being referenced in a different course than the original course because the references will reference the original course not the course-invariant address nor the new-course relative address.
    4. similarly references to assets because for some reason they're identified relative to the creating course.
  2. Risk around the data migration scripts which have unit tests but which have had no real course content test.
  3. The length of time it's taking to finish writing the code for the hypothetical future state.
  4. The absence of Studio UX design and development to take advantage of the new functionality (reusing content, undo, comparison, controlled publication, etc)

Strategy for mitigating the risks and roadblocks:

  1. To mitigate the addressing schema change,
    1. we've implemented and made live an address scheme mapping service (loc_mapper) which we need to wire wherever needed (it's currently wired at the highest levels of the Studio App and the Studio Client is using the new address scheme).
    2. we decided to temporarily not use the new address scheme in lms so that student state, analytics, grading, drupal, and other such things won't need to be aware of the new address scheme.
    3. we'll use loc_mapper to generate the asset addresses for now
    4. we'll use loc_mapper to try to recode cross-course references (assets and xblocks) into within course relative-references. n Locations map to same Locator. loc_mapper knows how to convert to the Locator and then from the Locator to the Location for any of the n course ids which map to it.
    5. If we separate the split and old mongo servers, then we don't need as much wiring for addresses. We could begin using the new address scheme in lms, but we'd need to either map to old for analytics, student state, drupal or update those to the new address scheme.
  2. to mitigate migration risk,
    1. we'll manually invoke migration on a subset of courses and ensure they work well in practice before migrating the others
    2. this some-migrated-some-not state will require ensuring Studio can write to either back end depending on which repository owns the course or using separate code branches.
    3. we may make mixed modulestore broadcast each write to each representation (which means it needs both addressing schemes simultaneously and a strategy for handling old mongo's restrictive capabilities).

Current Architecture (work-in-progress):

Location mapping in studio app. Using new Locators in studio client.

Location mapping for studio app diagram

The difference here is the insertion of the loc_mapper and its store. The studio app takes each outgoing Location and uses the loc_mapper to convert it to a Locator so that all the client sees are Locators. It takes each incoming Locator and reconverts it back to a Location so that all the MixedModulestore sees is Locations.

This change has no effect on the current use case other than the form of the urls (which the use case does not discuss).

The problem is how to wire split mongo which uses Locators and old mongo and xml which use Locations without having the applications know which of the two addressing schemes the underlying data access and modeling layer uses. This problem is complicated by the fact that addresses are usually passed around merely as strings without any hint to their semantics and often hidden within other structures. Another complication is that those using Locations must also provide the unique id for the course to get a valid Locator. The loc_mapper will give a mapping even if it doesn't know which course is really in effect, but that mapping may be wrong. In practice, we don't allow more than one course with the same org and "course name"; so, most mappings will be correct; however, we cannot guarantee that they will.
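
The mapping ambiguity described here can be made concrete with a toy mapper. This is a minimal sketch under stated assumptions, not the real loc_mapper interface: `LocMapperSketch`, the simplified `Location` and `Locator` tuples, and the course id strings are all hypothetical. It shows that a Location without its course id can only be mapped by guessing.

```python
from collections import namedtuple

# Hypothetical, simplified address types; the real classes carry more fields.
Location = namedtuple("Location", "org course name")
Locator = namedtuple("Locator", "course_id block_id")


class LocMapperSketch:
    def __init__(self):
        self._known = {}    # (org, course) -> {old course id: new course id}
        self._reverse = {}  # new course id -> (org, course)

    def register(self, org, course, old_course_id, new_course_id):
        self._known.setdefault((org, course), {})[old_course_id] = new_course_id
        self._reverse[new_course_id] = (org, course)

    def to_locator(self, location, old_course_id=None):
        candidates = self._known[(location.org, location.course)]
        if old_course_id is None:
            # No course context: guess. Correct only if exactly one course matches.
            old_course_id = sorted(candidates)[0]
        return Locator(candidates[old_course_id], location.name)

    def to_location(self, locator):
        org, course = self._reverse[locator.course_id]
        return Location(org, course, locator.block_id)


mapper = LocMapperSketch()
mapper.register("edX", "CS101", "edX/CS101/2013_Fall", "edX.CS101.2013_Fall")
mapper.register("edX", "CS101", "edX/CS101/2014_Spring", "edX.CS101.2014_Spring")

loc = Location("edX", "CS101", "u_0")
exact = mapper.to_locator(loc, "edX/CS101/2014_Spring")
guessed = mapper.to_locator(loc)          # silently picks the 2013 course
round_trip = mapper.to_location(exact)    # back to the original Location
```

With only one course per org and "course name", `guessed` and `exact` would agree, which is why most mappings are correct in practice.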

The above diagram's depiction of converting at the app tier does not work for using split mongo which does not want the conversion.

Considered approaches:

  1. use only the new Locators wherever we know that a field is an address instead of using strings, Locations, or other inert types (dicts, tuples, arrays).
    1. add behavior to these Locators for them to mock old Location functionality for read as well as create and write (calling loc_mapper as necessary)
    2. ensure every code place does not merely pass these around as assumed strings or ensure that the objects present such strings wherever such assumptions lie.
  2. move the mapping functionality to the low level modulestore methods having them accept any address form and converting it to whatever representation that modulestore needs via the loc_mapper.
    1. to ensure existing higher level code does not trip on alternative representations, we'd have to
      1. ensure those functions just pass the address around inertly,
      2. duplicate each xblock field which we know holds addresses and have a version of the field for each repr,
      3. ensure each access stipulates what type of address it wants (and provides the course_id), or
      4. tell the modulestore which representation to populate into the reference fields according to which app requested it.
  3. use separate code branch and servers:
    1. leave the current as-is and focus on making a split only version
    2. use the current to migrate courses over to the new
    3. in the new, use only Locators no Locations
      1. verify that student state, analytics, and drupal will accept Locators or
      2. convert the Locators to Locations for any outside service which cannot handle Locators or
      3. update the outside services to handle Locators perhaps also on separate branch w/ separate servers?

Of the 3 above approaches, the first and last seem cleanest. The first has some risks: performance, because our code frequently calls Location methods; race conditions, if the code requests a translation before the loc_mapper knows about the course; and the need to do 2 pass conversions for inadvertent wrong-course hard-coded references (see above where I described asset and in course references for things borrowed by other courses) (this dual conversion problem exists in both approaches). For the second approach, none of the sub-approaches is sufficient in and of itself. The last (encoding the address according to the application's preference) may be the closest to sufficient; however, because the code will not know how to find each reference in an xblock, some will leak to the upper layers, which will need to catch address failures and attempt conversions.

For either approach, we'll need to decide whether to convert the existing mongo (aka, "old mongo") to read and write persisted addresses in either representation or only use Locations because that's what old mongo uses now. For the separate server approach, it will only use the new Locators.

In the long run, I'd like to deprecate the old Location and its behavior; however, it's not clear how we get there.

If we use either mixed address approach, the architecture becomes the following where most of the location mapping is done at the modulestore layer and only inadvertent references get mapped in the apps. The xblock runtime may need to use the loc_mapper as well.

Location translation at the modulestore layer

Hybrid approaches

The hybrid approaches for running split mongo alongside or on a separate server from old mongo have several control dimensions:

  1. Which courses persist in split and which into old mongo?
    1. All courses: one time big bang conversion--unlikely approach.
    2. All current and future courses: leave archived courses alone but don't allow access from Studio--also unlikely.
    3. Any course being edited in Studio: proactively move any course which should be accessible in Studio and have Studio only use split mongo. (not possible in the separate server implementation)
    4. Lazily any course being edited in Studio: read from either store, but only allow writes to split mongo. This was the approach I was working on. It would force migration from old to split upon first update attempt.
    5. All new courses, but leave old ones in old mongo: this strategy doesn't save any work but may reduce risk for running courses by ensuring that no addresses change. It requires having Studio able to read and write to both stores and having LMS able to read from both (all of the below do as well).
    6. All new courses plus a gradually increasing set of other deliberately migrated courses.
  2. Should Studio use split but LMS use old mongo?
    1. Requires writing a publish mechanism from split to old mongo.
    2. Still requires determining strategy for when to move courses to split for Studio.
  3. Should Studio broadcast updates to both stores to enable easy roll-back?
    1. Will require some additional work as well as analysis as to what information is lost in the old mongo version and whether we care about that loss.
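
The lazy option in the list above (read from either store, write only to split, migrate on first update) could look roughly like the following. This is a sketch under stated assumptions: `LazyMigratingStore` is a hypothetical name, plain dicts stand in for both back ends, and the old copy is deliberately left in place after migration, which is a choice of this sketch rather than something the plan specifies.

```python
class LazyMigratingStore:
    """Hypothetical router: reads come from either store, writes only go to split."""

    def __init__(self, old_courses, split_courses):
        self.old = old_courses      # course_id -> {block_id: fields}
        self.split = split_courses  # same shape; stands in for split mongo

    def read(self, course_id, block_id):
        source = self.split if course_id in self.split else self.old
        return source[course_id][block_id]

    def write(self, course_id, block_id, fields):
        if course_id not in self.split:
            # The first write attempt forces migration of the whole course.
            self.split[course_id] = {
                bid: dict(f) for bid, f in self.old.get(course_id, {}).items()
            }
        self.split[course_id][block_id] = dict(fields)


old = {"c1": {"u_0": {"title": "Unit 0"}}}
store = LazyMigratingStore(old, {})
read_before = store.read("c1", "u_0")         # served from the old store
migrated_before = "c1" in store.split
store.write("c1", "u_1", {"title": "Unit 1"})  # triggers migration, then writes
migrated_after = "c1" in store.split
```

The copy inside `write` is where the performance risk from the executive summary shows up: a large course is migrated while the author's first save is waiting.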

Whatever choice we make is an interim choice; so, we need to patch together a path from all old mongo to all split no matter how hypothetical that end point may be.

Comparative effort estimates

Locator - Location approaches

Needed for cohabiting approaches, not the dual server approach.

  1. Locator w/ Location veneer
    1. Remove assumptions that Location is a tuple
    2. Change loc_mapper to know that these are not distinct types
    3. Upon instantiation, loc_map each Location to populate the Locator fields.
    4. Upon Location attr access, loc_map each Locator to get its Location fields (map once, cache)
  2. Lazy Locator w/ Location veneer. Location is a separate class. Attempts to access Location attrs on a Locator forces mapping and vice versa.
    1. Don't need to change any access patterns
    2. Upon Locator attr access, loc_map each Location to get its Locator fields (map once, cache)
    3. Upon Location attr access, loc_map each Locator to get its Location fields (map once, cache)
  3. Make app ignore reference type and have all mapping done at modulestore.
    1. Update all modulestore functions to take either Locator or Location and call loc_mapper if it didn't get the type it expected.
    2. Deserialize ids and children etc from persistence into either Locator or Location according to
      1. runtime preference?
      2. type info in serialized form?
    3. Have app tier pass Location/Locator around as intact objects rather than fields:
      1. Change each url and view function to accept either set of fields and cons up the appropriate object
      2. Refactor each app tier access of Location/Locator attrs to get via correct pattern or not use attr
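
The "map once, cache" behavior of the lazy veneer can be sketched as follows. This is an illustration under stated assumptions, not a proposed class: `LazyAddress` and `CountingMapper` are hypothetical, and plain strings stand in for real Location and Locator objects.

```python
class LazyAddress:
    """Hypothetical address holding one form and mapping to the other on demand."""

    def __init__(self, mapper, location=None, locator=None):
        if (location is None) == (locator is None):
            raise ValueError("provide exactly one of location or locator")
        self._mapper = mapper
        self._location = location
        self._locator = locator

    @property
    def location(self):
        if self._location is None:
            # First access forces the mapping; later accesses use the cache.
            self._location = self._mapper.to_location(self._locator)
        return self._location

    @property
    def locator(self):
        if self._locator is None:
            self._locator = self._mapper.to_locator(self._location)
        return self._locator


class CountingMapper:
    """Stand-in for loc_mapper that counts lookups; not the real interface."""

    def __init__(self):
        self.calls = 0

    def to_locator(self, location):
        self.calls += 1
        return "locator:" + location

    def to_location(self, locator):
        self.calls += 1
        return locator[len("locator:"):]


mapper = CountingMapper()
addr = LazyAddress(mapper, location="u_0")
calls_at_creation = mapper.calls   # nothing mapped yet
first = addr.locator               # triggers exactly one mapping call
second = addr.locator              # served from the cache
calls_after = mapper.calls
```

Code that only passes the address around never pays for a mapping, which addresses part of the performance concern, though the race with a course the loc_mapper does not yet know about remains.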

Hybrid approaches

The cohabiting approaches require that the app (at least Studio) never tries to write to any particular modulestore but lets the mixed modulestore layer figure out the routing. That entails changing all modulestore() and get_modulestore() calls as well as writing router logic in mixed.
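
A rough picture of that router logic, as a sketch under stated assumptions: `MixedRouterSketch`, `DictStore`, and the simplified `get_item`/`update_item` signatures here are hypothetical and do not match the real modulestore methods. The point is only that the app passes a course id and never picks a back end itself.

```python
class DictStore:
    """Minimal in-memory store used only to exercise the router."""

    def __init__(self):
        self.items = {}

    def get_item(self, course_id, block_id):
        return self.items[(course_id, block_id)]

    def update_item(self, course_id, block_id, fields):
        self.items[(course_id, block_id)] = dict(fields)


class MixedRouterSketch:
    """Hypothetical mixed layer: routes each call by the owning course."""

    def __init__(self, stores, course_to_store, default="split"):
        self.stores = stores                    # store name -> store object
        self.course_to_store = course_to_store  # course_id -> store name
        self.default = default                  # where unlisted courses live

    def _store_for(self, course_id):
        return self.stores[self.course_to_store.get(course_id, self.default)]

    def get_item(self, course_id, block_id):
        return self._store_for(course_id).get_item(course_id, block_id)

    def update_item(self, course_id, block_id, fields):
        self._store_for(course_id).update_item(course_id, block_id, fields)


old_store, split_store = DictStore(), DictStore()
mixed = MixedRouterSketch(
    {"old": old_store, "split": split_store},
    {"legacy-course": "old"},
)
mixed.update_item("legacy-course", "u_0", {"title": "Old unit"})
mixed.update_item("new-course", "u_0", {"title": "New unit"})
```

Migrating a course then amounts to copying its data and changing one entry in `course_to_store`, with no change to app code.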

Any implementation in which Studio must support both old and split mongo for write access is roughly equivalent in effort, and that effort is in addition to the existing planned work (which assumes we're converting all writes to Split). I'm fairly concerned about not only the amount of work for this simultaneous support but also the functional limitations, as I was envisioning the app tier quickly becoming more version aware so that we can begin implementing conflict management (detection as well as resolution), reuse (e.g., spoc reuse of course, course reuse of modules), version comparison, history tracking (show the user who and when last changed each xblock), etc.

The signatures and call patterns of split modulestore's create, update, and delete methods differ from old mongo's; so, simultaneous support will require that mixed modulestore mediate that difference and that we code the app tier to the superset of functionality, with some way of handling attempts to use split functionality on a non-split course.

The upside to a gradual migration is that if the version spamming of split has performance implications, the gradual migration will give us time to implement larger granularity version updates as well as other performance improvements before all authors are affected.

The additional work for broadcast update over supporting either will mainly impact performance, but it will require a few days of development work at the mixed modulestore layer.
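
A broadcast write, and the information-loss question it raises, might be sketched like this. Everything here is an assumption for illustration: `broadcast_write`, the two store classes, and in particular the `SUPPORTED` field set are invented and do not describe what old mongo can actually hold.

```python
class SplitLikeStore:
    """Stand-in for the richer store: keeps every field it is given."""

    def __init__(self):
        self.items = {}

    def write(self, key, fields):
        self.items[key] = dict(fields)
        return []  # nothing lost


class OldLikeStore:
    """Stand-in for a store with fewer capabilities; drops fields it cannot hold."""

    SUPPORTED = {"title", "start"}  # invented for this sketch

    def __init__(self):
        self.items = {}

    def write(self, key, fields):
        kept = {k: v for k, v in fields.items() if k in self.SUPPORTED}
        self.items[key] = kept
        return sorted(set(fields) - set(kept))  # names of the fields that were lost


def broadcast_write(stores, key, fields):
    """Write to every store and report which fields each one could not keep."""
    return {name: store.write(key, fields) for name, store in stores.items()}


stores = {"split": SplitLikeStore(), "old": OldLikeStore()}
lost = broadcast_write(
    stores, "u_0", {"title": "Unit 0", "reused_from": "other-course"}
)
```

Returning the lost fields per store makes the roll-back cost visible at write time instead of discovering it when someone reverts to the old store.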

If split does not have to lazily convert any courses, we can drop that behavior from mixed modulestore, which will save some work (not a lot).

Keeping LMS using only old mongo reduces work on LMS but adds the task of either reverse migration from split to old mongo or, better, changing publish from split to support writing to old mongo. In the short-run, the main impact will be flattening out reuse references into copies of reused content in old mongo. In the long-run, it will keep LMS from tracking versions and thus reconstructing what the student saw or controlling what the student sees by giving the student a consistent version w/in some courseware scope. As long as the publish action also publishes to split's published branch, we'd be able to reconstruct what the student saw via the edited on dates of the published versions in split.
