Add MLBF validation logic and corresponding management command #22983
base: master
Conversation
- Implemented a `validate` method in the `MLBF` class to check for duplicate items in the cache.json, ensuring each item occurs only once and in a single data type (a rough sketch of such a check follows this list).
- Added a new management command `validate_mlbf` to facilitate validation of MLBF directories from the command line.
- Created unit tests to verify the validation logic, ensuring it raises errors for duplicate items within a single data type and across multiple data types.
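As a rough sketch of what such a duplicate check could look like (the `validate` signature, data layout, and error type here are assumptions, not the PR's actual code):

```python
from collections import Counter

def validate(cache_data):
    """Raise if any item appears more than once, within or across data types.

    `cache_data` is assumed to map data-type names (e.g. 'blocked',
    'not_blocked') to lists of `guid:version` strings, as in cache.json.
    """
    counts = Counter(item for items in cache_data.values() for item in items)
    duplicates = sorted(item for item, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f'Duplicate items in cache.json: {duplicates[:10]}')
```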
Verified on production data that we do in fact have duplicates and that the verification command catches them. Now to find out how that happened in the first place and how to prevent it. FYI @Rob--W, your theory seems to be correct.
Note: although in my local testing that led to mozilla/addons#15261 (comment), I generated a new MLBF solely based on the …
I'm not sure I follow this logic. What do you mean by that? Like, what code did you actually run?
That makes sense, assuming the way you generated the filter is the same as the way production generated it; my understanding of that is based on the answer to the first question. But even if that were the case, if you did not de-duplicate the file and still generated a smaller filter, how did that happen? Isn't it even more odd if we get different-sized filters on the same data set in different environments? I must be missing something from your comment.
The production file size is …
This is the code using … I've attached the command output to https://mozilla-hub.atlassian.net/browse/AMOENG-1332?focusedCommentId=988314
Thanks for putting this fix up so quickly!
The fix of the primary issue looks good to me.
I'll defer to AMO engineers for the approving sign-off, in case they have thoughts on the tests or implementation details.
```python
)

reused_addon.update(guid=GUID_REUSE_FORMAT.format(addon.id))
# re-point the reused addon's addonguid at the original addon's guid,
# so both addons resolve to the same guid
reused_addon.addonguid.update(guid=addon.guid)
```
Add a sanity check that this does indeed generate duplicate entries? Otherwise the test may pass trivially.
I'm not sure how I would do that beyond what is already here. I'm explicitly setting the `addonguid.guid` of `reused_addon` to the guid of `addon`... neither is blocked, so I cannot check for two non-existent blocks. So I'm scratching my head on what exactly to verify.
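One possible shape for such a sanity check, purely as a hypothetical sketch reusing this test's names:

```python
# Sanity-check the fixture itself: the two addons really do share a guid,
# so the test would fail loudly if the setup ever stopped producing a duplicate.
assert reused_addon.addonguid.guid == addon.guid
assert reused_addon.guid != addon.guid
```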
```python
deduped_versions = []
unique_versions = set()
for version in versions:
    if version not in unique_versions:
        unique_versions.add(version)
        deduped_versions.append(version)
```
Question: is there a benefit to looping over the input and checking it for each entry, over using `list(set(versions))`?
Converting to a set and back to a list does not preserve order in Python, and it is not faster either, so there is no real benefit other than code prettiness.
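For illustration (the version strings are invented), the two approaches differ in ordering like this:

```python
versions = ['guid:2.0', 'guid:1.0', 'guid:2.0']

# set() removes duplicates, but the resulting order is arbitrary
print(list(set(versions)))             # e.g. ['guid:1.0', 'guid:2.0']

# order-preserving dedup: dict keys keep insertion order (Python 3.7+)
print(list(dict.fromkeys(versions)))   # ['guid:2.0', 'guid:1.0']
```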
I've suggested this in other PRs too, but we could just keep everything as `set`s, and then `sorted(...)` before being written to the JSON file. Python has efficient set handling (and the syntax is prettier too).
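A minimal sketch of that suggestion, assuming an invented file name and key:

```python
import json

unique_versions = {'guid:2.0', 'guid:1.0'}  # kept as a set throughout

# sorting at write time keeps the JSON output deterministic
# regardless of set iteration order
with open('cache.json', 'w') as f:
    json.dump({'not_blocked': sorted(unique_versions)}, f)
```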
```diff
@@ -77,7 +77,7 @@ def generate_mlbf(stats, include, exclude):
 # Extends the BlockType enum to include versions that have no block of any type
 MLBFDataType = Enum(
     'MLBFDataType',
-    [block_type.name for block_type in BlockType] + ['NOT_BLOCKED', 'not_blocked'],
+    [block_type.name for block_type in BlockType] + ['not_blocked'],
```
This is the primary fix to the cause of mozilla/addons#15261 (comment).
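For context, the functional `Enum` API creates one data type per name, so a stray duplicate name silently adds an extra data type; the `BlockType` members below are a simplified assumption:

```python
from enum import Enum

class BlockType(Enum):  # simplified stand-in for the real BlockType
    BLOCKED = 1
    SOFT_BLOCKED = 2

MLBFDataType = Enum(
    'MLBFDataType',
    [block_type.name for block_type in BlockType] + ['not_blocked'],
)

print(list(MLBFDataType))
# [<MLBFDataType.BLOCKED: 1>, <MLBFDataType.SOFT_BLOCKED: 2>, <MLBFDataType.not_blocked: 3>]
```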
Fixes: mozilla/addons#15281
Description

- Implemented a `validate` method in the `MLBF` class to check for duplicate items in the cache.json, ensuring each item occurs only once and in a single data type.
- Added a new management command `validate_mlbf` to facilitate validation of MLBF directories from the command line.

Context
We have discovered that MLBFs created since November 2024 are larger than expected. This validation is one step to ensure we do not see unchecked growth in the MLBF, by verifying the underlying data set used to produce the filter.
Testing

Copy the `mlbf` directory into your "storage" directory in the repository. `id` is the timestamp directory name `./mlbf/<id>/`.
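Validation can then be run with the new management command; the exact invocation below is an assumption based on the command name:

```bash
./manage.py validate_mlbf <id>
```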
Checklist

- Add `#ISSUENUM` at the top of your PR to an existing open issue in the mozilla/addons repository.