Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ADR about server side filtering #729

Merged
merged 5 commits into from
Jan 14, 2025
Merged

Conversation

leplatrem
Copy link
Contributor

No description provided.

@leplatrem leplatrem force-pushed the adr-server-side-filtering branch from 24ee112 to d4cee9d Compare January 7, 2025 17:16
@leplatrem leplatrem marked this pull request as ready for review January 7, 2025 17:34
@leplatrem leplatrem requested a review from a team as a code owner January 7, 2025 17:34
@ncloudioj
Copy link
Member

@leplatrem Thanks for putting this up together!

I agree with you that "option 2" appears to be the most reasonable choice at the moment for its low complexity & cost and its ability to meet our present business needs.

Longer term, from a consumer's perspective, I'd be curious about the viability of "option 4" specifically regarding how to handle the increased signing workload for server-side filtering. If the scalability problem is intrinsically hard to solve, would it be reasonable for us to opt out content signing for certain use cases to work around that challenging? An interesting observation we have made from building user-facing content serving features in Firefox over the past years, we've been gradually shifting our mental model from error prevention to error tolerance & rapid responding. Granted it will lose certain security protections w/o content signing, though that lose could be totally acceptable if we have other mechanisms in place for emergency/incident handling. As an RS consumer, having such flexibility seems quite useful and compelling in certain cases especially if it'd unlock other capabilities.

In order to choose our solution we considered the following criteria:

- **Complexity**: Low → High: how complex is the solution
- **User experience**: Bad → Good: how adapted to users is the solution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to take into account 2 user experiences here. The remote-settings admins, and everybody that uses firefox (and other potential remote-settings clients).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean developer experience? Or client API complexity?

I could be wrong but I think Firefox users should not be impacted by this internal implementation details. Do you have something concrete in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a good question. Maybe I'm overthinking it but I suppose there is:

  • remote-settings admins - How hard is it to change these records and understand what is happening in the background.
  • client developers - How hard is it to use and understand these enhancements. We do occasionally run into a latency question because of CDN caching, and this is adding one more layer on top of that.
  • end users - Is my computer running better because of these changes? Am I using more or less power/bandwidth? etc. I guess at an individual level the impact will be minimal though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this makes sense.
I rephrased some parts but didn't find justified to create a criteria for each of these. Let me know if you don't agree, I'd be happy to adjust more until it's good enough

For example, if records have `regions` and `locales` fields:

```
{
Copy link
Contributor

@alexcottner alexcottner Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What could collection config look like in this example to know how the records should be pivoted? Something like?

{
  ...
  pivotBy: [ "regions", "locales" ]
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with records should be pivoted?
How would this pivotBy field be used?

We could add a few lines below to show the client code required to leverage this option.
It would look like this:

ChromeUtils.defineESModuleGetters(lazy, {
  Region: "resource://gre/modules/Region.sys.mjs",
});

const region = lazy.Region.home || "default";
const locale = Services.locale.appLocaleAsBCP47.substr(0, 2);

client = RemoteSettings(`shops-${region}-${locale}`);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, maybe this was a bad assumption on my part. I assumed there would be a generic automated process that looks for these collections that need to create/update the read-only virtual collections. And that process would need to know how to disperse those records to the virtual collections when it runs.

In my head this is basically a generic ETL process that takes the records from the collection, and applies some "pivot table" like logic to disperse these records out.

I realize now there could be multiple processes that have their own custom logic in them, one for each client or collection. But I still like the idea of having a paved road for a generic process where we could list which properties we want to create collections for and it would just run with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, you were looking for a generic way to create and populate all the sub-collections, and on the client side "route" them to the right sub-collection.

With the proposed solution here, the sub-collections are hardcoded in the cronjobs manifests, and the clients have hardcoded logics to pick the proper sub-collection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly that. But you were thinking about storing that config in the cronjobs. I guess the end result is the same and this is down to small implementation details. Although if we put the config in the collection properties, admins could see it (and even play with it in dev) 🤔 . That could allow for some self-service.

Either way, we don't need to decide it here.

Copy link
Contributor Author

@leplatrem leplatrem Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generic/automatic/self-serve solution is neat, but unlike the proposed option, it requires some (minor) development.
Shall we add another entry in this doc and evaluate?
The configuration of this "pivot" would require 2 aspects to be configured: the list of fields, and the pattern of the side collections.
For example pivotBy: ["region", "locale"], and targetPattern: "shop-{region}-{locale}".
I think it would still require us to create the side collections manually, and the cronjob would just "ignore" the records if they no existing collection matches (eg. if there's no shop-oceania-de collection, do not put this record anywhere)

- **Cost of operation**: Low. Clients download only the subset of data, saving bandwidth, and signing datasets on each publication is relatively cheap.


### Option 4 - Implement dynamic signatures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a hybrid approach here. Where we create the result on the first request and cache it. However if many clients make the request at the same time, we would need to implement a queue so we don't do the same work for each of them.

@alexcottner
Copy link
Contributor

I think option 2 is our best value without major changes.

Option 4 feels like it's not quite complete but could be great. We could setup a custom sync endpoint where the client provides info on it's current state, and a signed bundle is returned with exactly what it needs to be brought up to date. Something similar to couchDB's _changes endpoint, but with a signature attached. But we'd need some additional infrastructure to prevent competing requests from being processed at the same time.

@leplatrem
Copy link
Contributor Author

Option 4 feels like it's not quite complete but could be great.

Should I add more details?

We could setup a custom sync endpoint where the client provides info on it's current state, and a signed bundle is returned with exactly what it needs to be brought up to date.

This could be one idea of improvement, but I feel that I does not relate directly to server side filtering.

@alexcottner
Copy link
Contributor

This could be one idea of improvement, but I feel that I does not relate directly to server side filtering.

You're right, that is a more generic solution and doesn't apply directly to server side filtering. We can leave that out for now.

@alexcottner
Copy link
Contributor

Because I like diagrams, does this look right?
jpg_diagram

@leplatrem
Copy link
Contributor Author

Because I like diagrams, does this look right?

Yes! I would define virtual here, because with Option2 they are real/concrete collections

Copy link
Contributor

@alexcottner alexcottner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loving it

@alexcottner
Copy link
Contributor

Because I like diagrams, does this look right?

Yes! I would define virtual here, because with Option2 they are real/concrete collections

Yeah virtual is the wrong word here. Maybe "staging" or "calculated" or "pre-filtered" or something?

@leplatrem
Copy link
Contributor Author

Because I like diagrams, does this look right?

Yes! I would define virtual here, because with Option2 they are real/concrete collections

Yeah virtual is the wrong word here. Maybe "staging" or "calculated" or "pre-filtered" or something?

Calculated or computed seems better 👌

@leplatrem
Copy link
Contributor Author

Loving it

Shall I merge? Did you want to add a diagram or more options? We can also always iterate...

@alexcottner
Copy link
Contributor

Added the diagram to the decision outcome section. If this looks good, let's merge it and start getting this built out.

@leplatrem leplatrem merged commit d0a7b01 into main Jan 14, 2025
6 checks passed
@leplatrem leplatrem deleted the adr-server-side-filtering branch January 14, 2025 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants