Split mongo architecture and rollout options
Goal: how to wire up split mongo asap with as little risk as possible?
Real goal: how to get RESTful api asap for SPOCs and other uses? This api depends upon split mongo or at least a full integration of the new Locator syntax.
In particular, @rrubin says he wants to see these RESTful API use cases running:
- SPOC subsetting
- Create new course from existing course
- Set dates
- Delete some chapters, sections, units, and/or components
- Publish
- SPOC compilation
- Create new course from existing course
- Add chapter from another course
- Set dates
- Publish
Note: these do not require any lms functionality (grading, student tracking, randomization). The focus is on xblock CRUD and sharing.
High priority split mongo functionality which is subordinate to the api includes:
- Flag conflicting edits without losing content (forks)
- Undo edits
- Publish transactionally rather than through an inadvertent dribble of small changes
Risks of going live on split mongo:
- Performance:
- split creates new versions of courses for every edit. These new versions may significantly expand our storage requirements and may take longer to persist
- Non-invertibility of migration: what do we do if split has a defect, given that migration only goes from old mongo to split?
- Interruption of in flight or archived courses:
- saved bookmarks may not work
- hardcoded references may not work (e.g., anchor tags, jumps, etc)
- ORA location references need to change to locators
- converting student state to reference Locators not Locations
- change in address syntax in middle of analytics stream
- Deciding how to handle uploaded assets:
- currently course relative which makes reuse problematic especially for locked assets.
- We should decide whether to keep hosting the assets through the app server, in which case we need an addressing solution that works on the app server, or to move the assets to a CDN, in which case we need an addressing and locking solution that works in the CDN.
- if on app server, should the addressing still be course relative even if multiple courses reference the same assets?
- Studio uses Locators, LMS uses Locations
- assumes LMS always has the old org/coursenum/run triple which the loc_mapper needs to uniquely map to Locators (if there's more than one course run with the same org/coursenum).
- mixed modulestore acts as more than a router and converts all addresses to consumer's format
- make mixed wrap each method w/ proper conversion code
- change all uses of modulestore through lms and cms to only use mixed not directly go to default, draft, or direct
- provide 2 mixed modulestore instances, a locator-based one and a location-based one, each of which treats every call as supplying the declared type and returns the declared type on every call.
- Convert not only reference fields but also known special indicators inside the data payload like /static, /jump-to, and /jump-to-module (need full enumeration)
- to get proper repr to lower level modulestores (Locators to split, Locations to old-mongo) use a config or method on lower level ones which mixed consults and uses to do the conversions in the wrapped methods when sending into the lower level one.
- Mixed will look at a local cache to see where the course is. If the cache doesn't say, it will look in split. If it doesn't find it there, it will look in old mongo. When it finds the course, it will record the routing in a memcache to expedite future retrieval.
- May still need to write top level view handlers for lms which handle any url requests for Locators which somehow slip through the mixed modulestore conversion and convert to Locations, but we believe that any hardcoded Locator references will be authoring errors not things the app should translate.
- Complete the conversion of Studio to use Locators all the way through (not just in client-server urls, auth, and the other places where it does now). For phase 1, don't take advantage of new split functionality.
- Course migration from old to split mongo will be a controlled dribble: explicitly migrate some subset and increase that subset over time.
- Studio needs to support both repositories but only one repository per course; however, migration will not remove the course from old mongo. It just won't update it with on-going edits.
- All new courses will go into split.
- Migration will be via a command. There will also be a command to roll back to the old mongo version.
- At first at least, we will do the migration on a staging server and ask the course team to verify the course before doing the migrations on the production servers. Proposed workflow:
- course team, PM, or someone decides to migrate course
- send ticket to devops to do migration
- devops copies the course to staging (export from prod, import onto staging)
- devops invokes the migration command
- course team verifies course on staging
- course team updates ticket with approval or rejection of migration
- devops migrates course on prod
- course team verifies course on prod
- We need to decide how to address uploaded assets. The options are:
- Convert assets to Locators as well (requires handling and rewriting old Locations)
- Copy assets to each reusing course using the old Location (c4x) form (means supporting old Locations in Studio just for assets)
- Move assets out to the webserver with a new asset-specific addressing scheme (requires handling and rewriting old Locations)
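The controlled-dribble migration described above (migrate by command, roll back by command, keep the old mongo copy, one owning repository per course) might be tracked with a small ownership registry. This is only a sketch under assumptions: `CourseRegistry` and its methods are hypothetical names, plain dicts stand in for the two stores, and the rollback behavior (simply pointing back at the untouched old copy) is assumed rather than specified.

```python
class CourseRegistry:
    """Toy ownership registry; dicts stand in for old mongo and split mongo."""

    def __init__(self):
        self.old_mongo = {}  # course_id -> course fields
        self.split = {}      # course_id -> course fields
        self.owner = {}      # course_id -> "old" or "split"

    def add_old_course(self, course_id, fields):
        self.old_mongo[course_id] = dict(fields)
        self.owner[course_id] = "old"

    def migrate(self, course_id):
        """Copy the course into split; the old copy is kept but no longer edited."""
        if self.owner.get(course_id) != "old":
            raise ValueError(f"{course_id} is not currently served from old mongo")
        self.split[course_id] = dict(self.old_mongo[course_id])
        self.owner[course_id] = "split"

    def rollback(self, course_id):
        """Point the course back at its untouched old mongo copy (assumed behavior)."""
        if self.owner.get(course_id) != "split" or course_id not in self.old_mongo:
            raise ValueError(f"{course_id} has no old mongo copy to roll back to")
        self.owner[course_id] = "old"

    def store_for(self, course_id):
        """Only one repository serves a given course at any time."""
        return self.split if self.owner[course_id] == "split" else self.old_mongo


registry = CourseRegistry()
course_id = "OrgX/Course101/2014_T1"  # made-up course id
registry.add_old_course(course_id, {"title": "Intro"})
registry.migrate(course_id)
registry.store_for(course_id)[course_id]["title"] = "Intro (edited in split)"
title_in_old = registry.old_mongo[course_id]["title"]
registry.rollback(course_id)
title_after_rollback = registry.store_for(course_id)[course_id]["title"]
```

In this sketch, edits made in split after migration are not carried back on rollback, which is the information-loss question any real rollback command would have to answer.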
Some ordering of the following functionality where the studio functionality may only apply to migrated courses. That is, studio may disable the functionality for unmigrated courses.
- studio supports only deliberate publishing as a large transaction not inadvertent publishing of small changes
- studio supports coursewide and xblock undo and redo
- studio supports version comparison on an xblock-by-xblock basis including "use that one" repointing
- studio supports course creation starting from some version of another course (new course points to same snapshot as old course not to a disconnected copy)
- studio supports looking at revisions of related courses (derived from this course, this course derived from, etc)
- studio supports governance around sharing of content (anyone can use my content, only my university system can use my content, only my university, only my department, these specific people or institutions, only my course team, ...)
- studio supports versioning of uploaded assets with all of the above comparisons
- all courses are migrated to split
- analytics supports locators w/ versions
- ora supports locators (or is agnostic)
- lms supports locators (or is agnostic)
LMS is Locator based. There is no more location mapping. Student, ORA, and analytics records carry version information. The ORA and student tables have an extra column for the Locator, and that column is what gets populated.
Punchlist for go-live:
- Ensure PMs agree to the migration plan and their role. (Don to talk to Jennifer A)
- xml export from split
- xml import to split
- mixed modulestore figures out whether to read and write to split vs. old mongo vs. xml
- reconcile the method signatures or have mixed know how to invoke each
- write loc_mapping calls into mixed as method wrappers
- command line or admin page to invoke course migration from old to split mongo (unless using lazy migration only)
- decide which, if any, of the split mongo use cases above to support in Studio, and what to do with that functionality in the hybrid case, because old mongo won't support those use cases
- hook up Studio to hybrid
- uploaded assets addressing
- test, test, test
- implement the restful api
To illustrate the differences among these architectures, I will use a combined studio and lms use case. You may want to imagine what you think the students should see at each point:
- Teacher creates course, sections, and subsections (Studio)
- Student1 registers for course (LMS)
- Student1 looks at course content (LMS)
- Teacher creates units and components (Studio)
- Teacher edits titles and dates for the course, sections, and subsections (Studio)
- Teacher configures grading policy and marks some subsections as graded (Studio)
- Student1 looks at course content (LMS)
- Teacher makes some units (u_0..u_i) public (Studio)
- Student1 looks at course content (LMS)
- Student1 works through u_0..u_i (LMS)
- Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
- Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i (Studio)
- Student2 looks at course content (LMS)
- Student2 works through u_0..u_i (LMS)
- Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
- Student3 looks at course content (LMS)
- Student3 works through u_0..u_i (LMS)
This section covers how the system worked before split mongo and the location mapper.
This document does not currently fully explain this stack, but here are some notes on the diagram.
- The top (yellow) are the user facing clients: currently just browser clients.
- The next layer (light green) shows the app layer which is primarily restful and non-restful url handlers with any client models and app logic (e.g., most grading).
- The dark green shows (external) grading and analytics as disconnected services purely as a reminder that these and others like these (e.g., drupal) exist not to show how they use the back end. It would be good to get diagrams of how these plug into the back ends.
- The cyan layer is the data access and modeling layer. It handles figuring out the identities and repositories, serializing and deserializing data, determining authorization, etc.
- The xblock runtime currently is subordinate to the modulestore layer which instantiates it, computes addresses, and feeds it data models. lms writes directly to it for student state data which the runtime then persists directly in SQL; however, all courseware writes go through the modulestore layer. I believe @Cale envisions the xblock runtime as above or encompassing the modulestore; however, it currently doesn't and it's not obvious to me how it can.
- Teacher creates course, sections, and subsections (Studio)
- Studio uses MixedModulestore to create entries in Mongo
- Student1 registers for course (LMS)
- LMS uses auth svcs to create entries in SQL
- Student1 looks at course content (LMS)
- LMS uses MixedModulestore to access all of the courseware from step 1
- Student just sees an outline of the course with no content
- Teacher creates units and components (Studio)
- Studio uses MixedModulestore to create draft entries in Mongo
- Teacher edits titles and dates for the course, sections, and subsections (Studio)
- Studio uses MixedModulestore to update the entries in Mongo
- Teacher configures grading policy and marks some subsections as graded (Studio)
- Studio uses MixedModulestore to update the entries in Mongo
- Student1 looks at course content (LMS)
- LMS uses MixedModulestore to access all of the courseware from step 1
- Student1 sees an outline of the course with no content but with grading, dates, and new titles
- Teacher makes some units (u_0..u_i) public (Studio)
- Studio uses MixedModulestore to rename draft entries as non-draft ones in Mongo
- Student1 looks at course content (LMS)
- LMS uses MixedModulestore to access all of the courseware from step 1
- Student1 sees content
- Student1 works through u_0..u_i (LMS)
- LMS records student state via xblock runtime to SQL
- Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
- Studio uses MixedModulestore to update the entries in Mongo
- Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i and changing the order of some units and components (Studio)
- Studio uses MixedModulestore to copy non-draft entries into ones marked draft and update the draft entries in Mongo
- Studio updates the children of the subsections for the inserts and reorders.
- Student2 looks at course content (LMS)
- LMS uses MixedModulestore to access the courseware
- Student2 sees content in its new chapter, section, subsection, and unit order but not new component order and does not see the new units nor the changes to u_0..u_i
- Student2 works through u_0..u_i (LMS)
- LMS records student state via xblock runtime to SQL
- Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
- Studio uses MixedModulestore to convert draft entries to non-draft overwriting the existing non-drafts (and removing the drafts) in Mongo
- Student3 looks at course content (LMS)
- LMS uses MixedModulestore to access the courseware
- Student3 sees all content in its new order, including the new material
- Student3 works through u_0..u_i (LMS)
- LMS records student state via xblock runtime to SQL
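The walkthrough above shows why Student2 sees a partial mix of changes: container edits are written in place while unit edits go to drafts. A toy model of that behavior follows. It is not the real modulestore; the `published` and `drafts` dicts and all helper names are assumptions made purely for illustration.

```python
# Live (non-draft) entries as the LMS would read them.
published = {
    "subsection": {"title": "Week 1", "children": ["u_0", "u_1"]},
    "u_0": {"text": "original u_0"},
    "u_1": {"text": "original u_1"},
}
drafts = {}  # unit id -> draft copy, created on the first edit of a unit


def edit_container(block_id, **fields):
    """Containers have no draft in this model: the change is live immediately."""
    published[block_id].update(fields)


def edit_unit(unit_id, text):
    """Units are edited through a draft copy; the live entry is untouched."""
    draft = drafts.setdefault(unit_id, dict(published.get(unit_id, {})))
    draft["text"] = text


def publish_unit(unit_id):
    """Overwrite the live entry with the draft and remove the draft."""
    published[unit_id] = drafts.pop(unit_id)


def lms_view():
    sub = published["subsection"]
    units = [(u, published[u]["text"]) for u in sub["children"] if u in published]
    return sub["title"], units


edit_container("subsection", title="Week 1 (revised)", children=["u_1", "u_k", "u_0"])
edit_unit("u_0", "revised u_0")
edit_unit("u_k", "new u_k")
student2_view = lms_view()   # new title and order, but old unit content, no u_k
for unit_id in ("u_0", "u_k"):
    publish_unit(unit_id)
student3_view = lms_view()   # everything, in the new order
```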
Eventually split mongo will completely replace the current mongo; so, the diagram will look just like the above one except that Mongo Modulestore will be Split Mongo Modulestore with its 3 collections giving it the ability to support editing undo, reusing content among courses, tracking changes over time (who and when), adding organizational governance over course id namespaces, and running courses over and over without export, rename, and import. The version tracking, for example, will enable the lms to know that student1 did not see the subsequently inserted material and mark which of u_0..u_i changed since the student saw them so the student can decide whether to check out the changes. It will enable analytics to compare performance before and after a courseware change. It will enable course authors to compare versions.
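As a rough illustration of how version tracking could let the lms mark what changed since a student last looked, the sketch below compares two snapshots. The snapshot shape (`{block_id: block_version}`) and the function name `diff_versions` are assumptions for this example, not the split mongo data model.

```python
def diff_versions(seen, current):
    """Compare two {block_id: block_version} snapshots of a course."""
    changed = sorted(b for b in seen if b in current and seen[b] != current[b])
    added = sorted(b for b in current if b not in seen)
    removed = sorted(b for b in seen if b not in current)
    return {"changed": changed, "added": added, "removed": removed}


# What student1 saw versus what is published now (made-up version labels).
seen_by_student1 = {"u_0": "v1", "u_1": "v1"}
currently_published = {"u_0": "v2", "u_1": "v1", "u_k": "v1"}
delta = diff_versions(seen_by_student1, currently_published)
```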
The eventual split mongo architecture will execute the use case above as follows:
- Teacher creates course, sections, and subsections (Studio)
- Studio uses MixedModulestore to create entries in draft course version in Mongo
- Student1 registers for course (LMS)
- LMS uses auth svcs to create entries in SQL
- Student1 looks at course content (LMS)
- LMS uses MixedModulestore to notice that there is no published content yet for the course
- Student sees that the course has no content nor outline yet
- Teacher creates units and components (Studio)
- Studio uses MixedModulestore to create draft entries in Mongo
- Teacher edits titles and dates for the course, sections, and subsections (Studio)
- Studio uses MixedModulestore to update the entries in Mongo
- Teacher configures grading policy and marks some subsections as graded (Studio)
- Studio uses MixedModulestore to update the entries in Mongo
- Student1 looks at course content (LMS)
- LMS uses MixedModulestore to notice that there is no published content yet for the course
- Student sees that the course has no content nor outline yet
- Teacher publishes some units (u_0..u_i) and their parents (Studio)
- Studio uses MixedModulestore to create a published branch and version and then copy the draft entries into it in Mongo
- Student1 looks at course content (LMS)
- LMS uses MixedModulestore to access the courseware
- Student sees content
- Student1 works through u_0..u_i (LMS)
- LMS records student state via xblock runtime to SQL
- Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
- Studio uses MixedModulestore to update the entries in draft branch in Mongo
- Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i and changing the order of some units (Studio)
- Studio uses MixedModulestore to update the entries in draft branch in Mongo
- Student2 looks at course content (LMS)
- LMS uses MixedModulestore to access the courseware
- Student2 sees same content as Student1 saw in the same order
- Student2 works through u_0..u_i (LMS)
- LMS records student state via xblock runtime to SQL
- Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
- Studio uses MixedModulestore to create a new live (published) version in Mongo containing the changes from the draft entries being published
- Student3 looks at course content (LMS)
- LMS uses MixedModulestore to access the courseware
- Student3 sees content in its new order all the way down and new content
- Student3 works through u_0..u_i (LMS)
- LMS records student state via xblock runtime to SQL
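The split walkthrough above can be mimicked with immutable versions and two branch pointers. This is a deliberately simplified sketch: `ToySplitCourse` is a hypothetical class, it publishes the whole draft rather than selected units, and it stores whole-course snapshots rather than split's actual collections.

```python
import itertools


class ToySplitCourse:
    def __init__(self):
        self._next_id = itertools.count(1)
        self.versions = {}  # version id -> snapshot, never modified once saved
        self.branches = {"draft": None, "published": None}

    def _save(self, snapshot):
        version_id = next(self._next_id)
        self.versions[version_id] = dict(snapshot)
        return version_id

    def edit_draft(self, **changes):
        """Every edit creates a new version and moves the draft branch to it."""
        base = self.versions.get(self.branches["draft"], {})
        self.branches["draft"] = self._save({**base, **changes})

    def publish(self):
        """Create a new live version from the current draft, as one transaction."""
        draft_id = self.branches["draft"]
        if draft_id is None:
            raise ValueError("nothing to publish")
        self.branches["published"] = self._save(self.versions[draft_id])

    def lms_view(self):
        version_id = self.branches["published"]
        return None if version_id is None else dict(self.versions[version_id])


course = ToySplitCourse()
course.edit_draft(outline=["section 1"])
before_first_publish = course.lms_view()   # nothing published yet
course.edit_draft(units=["u_0", "u_1"])
course.publish()
student1_view = course.lms_view()
course.edit_draft(units=["u_0", "u_k", "u_1"])
student2_view = course.lms_view()          # draft edits are not visible
course.publish()
student3_view = course.lms_view()
```

Because each edit saves a new snapshot, the version count grows with every change, which is the storage concern raised in the risks list.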
The focus of this document is the intermediate states we will support as we transition from Location-based to Locator-based. The reason for the intermediate hybrid state is to incrementally deploy functionality and to not require a big bang conversion of all existing course material and records.
Roadblocks to big bang conversion:
- To enable reusing content among courses and versioning content, the new representation has a richer and slightly incompatible addressing scheme (Locators). This complicates
- student state which uses the old locations
- analytics using the old locations
- references within a course to other course locations
- especially if the material is now being referenced in a different course than the original course because the references will reference the original course not the course-invariant address nor the new-course relative address.
- similarly references to assets because for some reason they're identified relative to the creating course.
- ORA because it uses the Locations to track the lifecycle of each response.
- Risk around the data migration scripts, which have unit tests but have not been tested against real course content.
- The length of time it's taking to finish writing the code for the hypothetical future state.
- The absence of Studio UX design and development to take advantage of the new functionality (reusing content, undo, comparison, controlled publication, etc)
Strategy for mitigating the risks and roadblocks:
- To mitigate the addressing scheme change,
- we've implemented and made live an address scheme mapping service (loc_mapper) which we need to wire wherever needed (it's currently wired at the highest levels of the Studio App and the Studio Client is using the new address scheme).
- we decided to temporarily not use the new address scheme in lms so that student state, analytics, grading, drupal, and other such things won't need to be aware of the new address scheme.
- we'll use loc_mapper to generate Location-based asset addresses for now, but this means that each course using the same modules must have its own copy of each asset.
- we'll use loc_mapper to try to recode cross-course references (assets and xblocks) into within-course relative references. n Locations map to the same Locator: loc_mapper knows how to convert to the Locator and then from the Locator to the Location for any of the n course ids which map to it.
- to mitigate migration risk,
- we'll manually invoke migration on a subset of courses and ensure they work well in practice before migrating the others
- this some-migrated-some-not state will require ensuring Studio can write to either back end depending on which repository owns the course or using separate code branches.
Location mapping in studio app. Using new Locators in studio client.
The difference here is the insertion of the loc_mapper and its store. The studio app takes each outgoing Location and uses the loc_mapper to convert it to a Locator so that all the client sees are Locators. It takes each incoming Locator and reconverts it back to a Location so that all the MixedModulestore sees is Locations.
This change has no effect on the current use case other than the form of the urls (which the use case does not discuss).
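A minimal sketch of that app-tier translation follows. `Location` and `Locator` here are simplified namedtuple stand-ins, not the real classes, and `ToyLocMapper` and `list_children` are hypothetical names; the real loc_mapper has its own store and API.

```python
from collections import namedtuple

# Simplified stand-ins for the two address types (field names are assumptions).
Location = namedtuple("Location", "org course category name")
Locator = namedtuple("Locator", "package_id branch block_id")


class ToyLocMapper:
    def __init__(self):
        self._to_locator = {}
        self._to_location = {}

    def register(self, location, locator):
        self._to_locator[location] = locator
        self._to_location[locator] = location

    def to_locator(self, location):
        return self._to_locator[location]

    def to_location(self, locator):
        return self._to_location[locator]


def list_children(mapper, modulestore, incoming_locator):
    """App-tier handler: Locators at the edges, Locations toward the store."""
    location = mapper.to_location(incoming_locator)       # incoming: Locator -> Location
    child_locations = modulestore[location]               # the store only sees Locations
    return [mapper.to_locator(c) for c in child_locations]  # outgoing: Location -> Locator


mapper = ToyLocMapper()
chapter_loc = Location("OrgX", "CS101", "chapter", "week1")
unit_loc = Location("OrgX", "CS101", "vertical", "u_0")
chapter_locator = Locator("orgx.cs101.2014", "draft", "week1")
unit_locator = Locator("orgx.cs101.2014", "draft", "u_0")
mapper.register(chapter_loc, chapter_locator)
mapper.register(unit_loc, unit_locator)
modulestore = {chapter_loc: [unit_loc]}  # dict standing in for MixedModulestore
children_for_client = list_children(mapper, modulestore, chapter_locator)
```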
The problem is how to wire split mongo which uses Locators and old mongo and xml which use Locations without having the applications know which of the two addressing schemes the underlying data access and modeling layer uses. This problem is complicated by the fact that addresses are usually passed around merely as strings without any hint to their semantics and often hidden within other structures. Another complication is that those using Locations must also provide the unique id for the course to get a valid Locator. The loc_mapper will give a mapping even if it doesn't know which course is really in effect, but that mapping may be wrong. In practice, we don't allow more than one course with the same org and "course name"; so, most mappings will be correct; however, we cannot guarantee that they will.
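The ambiguity can be made concrete with a small example. The course table, the package id strings, and `map_to_package_id` are all made up for illustration; the point is only that an org and course name without a run may match more than one course, so the mapper has to guess.

```python
# Hypothetical table of known course runs -> Locator package ids.
KNOWN_RUNS = {
    ("OrgX", "CS101", "2013_Fall"): "orgx.cs101.2013_fall",
    ("OrgX", "CS101", "2014_Spring"): "orgx.cs101.2014_spring",
    ("OrgX", "BIO200", "2014_Spring"): "orgx.bio200.2014_spring",
}


def map_to_package_id(org, course, run=None):
    """Return (best_guess, is_ambiguous) for a Location's org and course.

    Without a run the function still answers, but the answer may be wrong
    when several runs share the same org and course name.
    """
    matches = sorted(
        package_id
        for (o, c, r), package_id in KNOWN_RUNS.items()
        if (o, c) == (org, course) and (run is None or r == run)
    )
    if not matches:
        raise KeyError((org, course, run))
    return matches[0], len(matches) > 1


guess_without_run = map_to_package_id("OrgX", "CS101")
exact_with_run = map_to_package_id("OrgX", "CS101", "2014_Spring")
unique_course = map_to_package_id("OrgX", "BIO200")
```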
The above diagram's depiction of converting at the app tier does not work for split mongo, which does not want the conversion. The alternatives are:
- use only the new Locators wherever we know that a field is an address instead of using strings, Locations, or other inert types (dicts, tuples, arrays).
- add behavior to these Locators for them to mock old Location functionality for read as well as create and write (calling loc_mapper as necessary)
- ensure every code place does not merely pass these around as assumed strings or ensure that the objects present such strings wherever such assumptions lie.
- move the mapping functionality to the low level modulestore methods having them accept any address form and converting it to whatever representation that modulestore needs via the loc_mapper.
- to ensure existing higher level code does not trip on alternative representations, we'd have to
- ensure those functions just pass the address around inertly,
- duplicate each xblock field which we know holds addresses and have a version of the field for each repr,
- ensure each access stipulates what type of address it wants (and provides the course_id), or
- tell the modulestore which representation to populate into the reference fields according to which app requested it.
- use separate code branch and servers:
- leave the current as-is and focus on making a split only version
- use the current to migrate courses over to the new
- in the new, use only Locators no Locations
- verify that student state, analytics, and drupal will accept Locators or
- convert the Locators to Locations for any outside service which cannot handle Locators or
- update the outside services to handle Locators perhaps also on separate branch w/ separate servers?
- Make mixed modulestore know which address type (Locators or Locations) the caller wants and which the lower level persistence layer wants, and convert all references on the way in and out to the appropriate type.
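The last alternative, where mixed knows both the caller's and the store's address type, could look roughly like the sketch below. Addresses are modeled as `(type, key)` tuples and `convert` just swaps the tag; a real implementation would call loc_mapper there. `ToyStore`, `ConvertingMixed`, and `convert` are hypothetical names.

```python
class ToyStore:
    """Lower level store that declares which address type it accepts."""

    def __init__(self, address_type, children):
        self.address_type = address_type  # the setting mixed consults
        self._children = children

    def get_children(self, address):
        if address[0] != self.address_type:
            raise TypeError(f"expected {self.address_type}, got {address[0]}")
        return self._children.get(address, [])


def convert(address, wanted_type):
    """Placeholder for a loc_mapper call: re-tag the address if needed."""
    kind, key = address
    return address if kind == wanted_type else (wanted_type, key)


class ConvertingMixed:
    """Wraps a store method with conversion on the way in and on the way out."""

    def __init__(self, inner, caller_type):
        self._inner = inner
        self._caller_type = caller_type

    def get_children(self, address):
        inner_address = convert(address, self._inner.address_type)
        children = self._inner.get_children(inner_address)
        return [convert(child, self._caller_type) for child in children]


old_mongo = ToyStore("location", {("location", "week1"): [("location", "u_0")]})
studio_view = ConvertingMixed(old_mongo, caller_type="locator")
lms_view = ConvertingMixed(old_mongo, caller_type="location")
studio_children = studio_view.get_children(("locator", "week1"))
lms_children = lms_view.get_children(("location", "week1"))
```

Only one method is wrapped here; the same pattern would have to be repeated for every modulestore method and for reference fields inside returned data.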
The architecture becomes the following where most of the location mapping is done at the modulestore layer and only inadvertent references get mapped in the apps. The xblock runtime may need to use the loc_mapper as well.
The hybrid approaches for running split mongo alongside or on a separate server from old mongo have several control dimensions:
- Which courses persist in split and which into old mongo?
- All courses: one time big bang conversion--unlikely approach.
- All current and future courses: leave archived courses alone but don't allow access from Studio--also unlikely.
- Any course being edited in Studio: proactively move any course which should be accessible in Studio and have Studio only use split mongo. (not possible in the separate server implementation)
- Lazily any course being edited in Studio: read from either store, but only allow writes to split mongo. This was the approach I was working on. It would force migration from old to split upon first update attempt.
- All new courses, but leave old ones in old mongo: this strategy doesn't save any work but may reduce risk for running courses by ensuring that no addresses change. It requires having Studio able to read and write to both stores and having LMS able to read from both (all of the below do as well).
- All new courses plus a gradually increasing set of other deliberately migrated courses.
- Should Studio use split but LMS use old mongo?
- Requires writing a publish mechanism from split to old mongo.
- Still requires determining strategy for when to move courses to split for Studio.
- Should Studio broadcast updates to both stores to enable easy roll-back?
- Will require some additional work as well as analysis of what information is lost in the old mongo version and whether we care about that loss.
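Of the options above, the lazy variant (read from either store, write only to split, migrate on the first update attempt) is the easiest to show in a few lines. The class below is a hypothetical sketch with dicts standing in for the two stores, not the mixed modulestore itself.

```python
class LazyMigratingStore:
    def __init__(self, old_mongo, split):
        self.old_mongo = old_mongo  # read-only from this class's point of view
        self.split = split          # the only store that receives writes

    def read(self, course_id):
        """Prefer split; fall back to old mongo for unmigrated courses."""
        if course_id in self.split:
            return self.split[course_id]
        return self.old_mongo[course_id]

    def update(self, course_id, **fields):
        """The first write forces migration of the course into split."""
        if course_id not in self.split:
            self.split[course_id] = dict(self.old_mongo[course_id])
        self.split[course_id].update(fields)


store = LazyMigratingStore(old_mongo={"course-a": {"title": "A"}}, split={})
title_before = store.read("course-a")["title"]
migrated_before_write = "course-a" in store.split
store.update("course-a", title="A (edited)")
title_after = store.read("course-a")["title"]
```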
Whatever choice we make is an interim choice; so, we need to patch together a path from all old mongo to all split no matter how hypothetical that end point may be.
Needed for cohabiting approaches, not the dual server approach.
- Locator w/ Location veneer
- Remove assumptions that Location is a tuple
- Change loc_mapper to know that these are not distinct types
- Upon instantiation, loc_map each Location to populate the Locator fields.
- Upon Location attr access, loc_map each Locator to get its Location fields (map once, cache)
- Lazy Locator w/ Location veneer. Location is a separate class. Attempts to access Location attrs on a Locator forces mapping and vice versa.
- Don't need to change any access patterns
- Upon Locator attr access, loc_map each Location to get its Locator fields (map once, cache)
- Upon Location attr access, loc_map each Locator to get its Location fields (map once, cache)
- Make app ignore reference type and have all mapping done at modulestore.
- Update all modulestore functions to take either Locator or Location and call loc_mapper if it didn't get the type it expected.
- Deserialize ids and children etc from persistence into either Locator or Location according to
- runtime preference?
- type info in serialized form?
- Have app tier pass Location/Locator around as intact objects rather than fields:
- Change each url and view function to accept either set of fields and cons up the appropriate object
- Refactor each app tier access of Location/Locator attrs to get via correct pattern or not use attr
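The "map once, cache" behavior of the lazy veneer option above might be sketched as follows. `LazyAddress` and `CountingMapper` are hypothetical, the address strings are illustrative only, and the mapper counts its calls just to show that the second access does not map again.

```python
class CountingMapper:
    """Stand-in for loc_mapper that records how often it is consulted."""

    def __init__(self, pairs):
        self._to_locator = dict(pairs)
        self._to_location = {v: k for k, v in pairs.items()}
        self.calls = 0

    def to_locator(self, location):
        self.calls += 1
        return self._to_locator[location]

    def to_location(self, locator):
        self.calls += 1
        return self._to_location[locator]


class LazyAddress:
    """Holds the form it was built from; maps to the other form on first access."""

    def __init__(self, mapper, location=None, locator=None):
        if (location is None) == (locator is None):
            raise ValueError("pass exactly one of location or locator")
        self._mapper = mapper
        self._location = location
        self._locator = locator

    @property
    def location(self):
        if self._location is None:
            self._location = self._mapper.to_location(self._locator)
        return self._location

    @property
    def locator(self):
        if self._locator is None:
            self._locator = self._mapper.to_locator(self._location)
        return self._locator


old_form = "i4x://OrgX/CS101/vertical/u_0"   # illustrative strings only
new_form = "orgx.cs101.2014/draft/u_0"
mapper = CountingMapper({old_form: new_form})
address = LazyAddress(mapper, location=old_form)
calls_before_access = mapper.calls
first = address.locator
second = address.locator
```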
The cohabiting approaches require that the app (at least Studio) never tries to write to any particular modulestore but lets the mixed modulestore layer figure out the routing. That entails changing all modulestore() and get_modulestore() calls as well as writing router logic in mixed.
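The router logic could follow the lookup order proposed earlier in this document: check a cache, then split, then old mongo, and record where the course was found. The sketch below is an assumption-laden toy: `ToyRouter` is a hypothetical name, a dict stands in for memcache, and the stores are dicts keyed by course id.

```python
class ToyRouter:
    def __init__(self, split, old_mongo):
        self._stores = {"split": split, "old": old_mongo}
        self._cache = {}   # stand-in for memcache: course_id -> store name
        self.lookups = 0   # counts store probes, to show the cache working

    def store_for(self, course_id):
        name = self._cache.get(course_id)
        if name is None:
            for candidate in ("split", "old"):   # split first, then old mongo
                self.lookups += 1
                if course_id in self._stores[candidate]:
                    name = candidate
                    break
            else:
                raise KeyError(course_id)
            self._cache[course_id] = name        # remember the routing
        return self._stores[name]


router = ToyRouter(split={"new-course": {}}, old_mongo={"old-course": {}})
first_store = router.store_for("old-course")
lookups_after_first = router.lookups
second_store = router.store_for("old-course")
lookups_after_second = router.lookups
```

With routing hidden behind a call like `store_for`, app code would not need to name a particular store, which is the property the paragraph above asks for.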
All implementations in which Studio must support both old and split mongo for write access are roughly equivalent in effort, and each adds effort beyond the existing planned work (which assumes we're converting all writes to split). I'm fairly concerned about not only the amount of work for this simultaneous support but also the functional limitations, as I was envisioning the app tier quickly becoming more version aware so that we can begin implementing conflict management (detection as well as resolution), reuse (e.g., spoc reuse of course, course reuse of modules), version comparison, history tracking (show the user who last changed each xblock and when), etc.
The signatures and patterns of split modulestore's create, update, and delete methods differ from old mongo's; so, simultaneous support will require that mixed modulestore mediate that difference and that we code the app tier to the superset of functionality, with some way of handling attempts to use split functionality on a non-split course.
The upside to a gradual migration is that if the version spamming of split has performance implications, the gradual migration will give us time to implement larger granularity version updates as well as other performance improvements before all authors are affected.
The additional work for broadcast update over supporting either will mainly impact performance, but it will require a few days of development work at the mixed modulestore layer.
If split does not have to lazily convert any courses, that removes the lazy-conversion behavior from mixed modulestore, which will save some work (not a lot).
Keeping LMS using only old mongo reduces work on LMS but adds the task of either reverse migration from split to old mongo or, better, changing publish from split to support writing to old mongo. In the short-run, the main impact will be flattening out reuse references into copies of reused content in old mongo. In the long-run, it will keep LMS from tracking versions and thus reconstructing what the student saw or controlling what the student sees by giving the student a consistent version w/in some courseware scope. As long as the publish action also publishes to split's published branch, we'd be able to reconstruct what the student saw via the edited on dates of the published versions in split.