Create json tracking protection lists #176

Open. Wants to merge 10 commits into base: main

say-yawn
Contributor

@say-yawn say-yawn commented Sep 28, 2021

About this PR

Generate JSON files of the ETP lists for every versioned branch on shavar-prod-lists including the main branch.

Acceptance Criteria

  • Versioned ETP lists are created with the naming convention {version}-{list name}. For example, the 93.0 versioned fingerprinting list is named 93.0-base-fingerprinting-track-digest256.json.
  • JSON files are uploaded to the cloud under the folder for the respective list and version. For example, the 93.0 versioned fingerprinting list in JSON format should be uploaded to the folder base-fingerprinting-track-digest256/93.0 in the bucket for shavar-list-creation.
  • JSON files are uploaded to the cloud if and only if there are updates to the list that should be pushed. This means NO EXTRA LOGIC IS NEEDED: we can use the existing logic for checking whether the files should be uploaded.
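
As a rough illustration of the naming and folder conventions in the criteria above, here is a minimal sketch. The helper names `versioned_json_name` and `upload_key` are hypothetical and are not taken from the PR; appending the file name to the folder path is also an assumption about how the bucket key is built.

```python
from pathlib import PurePosixPath


def versioned_json_name(version: str, list_name: str) -> str:
    """Build the file name as {version}-{list name}.json."""
    return f"{version}-{list_name}.json"


def upload_key(version: str, list_name: str) -> str:
    """Build a bucket path of the form {list name}/{version}/{file name}.

    Assumption: the file keeps its versioned name inside the folder.
    """
    folder = PurePosixPath(list_name) / version
    return str(folder / versioned_json_name(version, list_name))


if __name__ == "__main__":
    print(versioned_json_name("93.0", "base-fingerprinting-track-digest256"))
    print(upload_key("93.0", "base-fingerprinting-track-digest256"))
```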

STR

Generate versioned list to upload to shavar-prod-lists

  1. Copy the prod.ini from shavar-list-creation-config here into the shavar_list_creation.ini locally.
  2. Within your python environment run lists2safebrowsing.py
  3. Check that you get versioned JSON files for:
  • ads-track-digest256.json
  • analytics-track-digest256.json
  • base-cryptomining-track-digest256.json
  • base-fingerprinting-track-digest256.json
  • base-track-digest256.json
  • content-cryptomining-track-digest256.json
  • content-fingerprinting-track-digest256.json
  • content-track-digest256.json
  • social-track-digest256.json
  • social-tracking-protection-digest256.json
  • social-tracking-protection-facebook-digest256.json
  • social-tracking-protection-linkedin-digest256.json
  • social-tracking-protection-twitter-digest256.json
  • social-tracking-protection-youtube-digest256.json
  4. Get the versioned base-fingerprinting-track-digest256.json you need, then copy it and rename the copy to base-fingerprinting-track.json.
  5. Go to the shavar-prod-lists repository and check out the versioned branch you chose in Step 4.
  6. Create a folder named normalized-lists in the project root folder and add the base-fingerprinting-track.json file to the newly created folder.
  7. Create a PR against the shavar-prod-lists repo that merges the changes into the versioned branch the JSON file was generated for.
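
The copy-and-rename part of the steps above could be scripted roughly as follows. This is only a sketch: `stage_normalized_list` is a hypothetical helper, the paths are placeholders, and checking out the branch and opening the PR are still done by hand.

```python
import shutil
import tempfile
from pathlib import Path


def stage_normalized_list(generated_json: Path, prod_lists_root: Path,
                          target_name: str = "base-fingerprinting-track.json") -> Path:
    """Copy a generated versioned list into <repo>/normalized-lists/ under its short name."""
    if not generated_json.is_file():
        raise FileNotFoundError(f"generated list not found: {generated_json}")
    target_dir = prod_lists_root / "normalized-lists"
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    target = target_dir / target_name
    shutil.copyfile(generated_json, target)
    return target


if __name__ == "__main__":
    # Demo with throwaway directories standing in for the two checkouts.
    with tempfile.TemporaryDirectory() as tmp:
        generated = Path(tmp) / "93.0-base-fingerprinting-track-digest256.json"
        generated.write_text("[]")
        repo_root = Path(tmp) / "shavar-prod-lists"
        repo_root.mkdir()
        print(stage_normalized_list(generated, repo_root))
```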

@isabelrios

I have checked this PR locally and I can see the files generated.
For Firefox I see these files match:

| Firefox generated files * | shavar-list-creation (Se Yeon branch) |
| --- | --- |
| disconnect-block-advertising.json | ads-track-digest256.json |
| disconnect-block-cryptomining.json | base-cryptomining-track-digest256.json |
| disconnect-block-analytics.json | analytics-track-digest256.json |
| disconnect-block-fingerprinting.json | base-fingerprinting-track-digest256.json |
| disconnect-block-social.json | social-track-digest256.json |
| disconnect-block-cookies-advertising.json | ads-track-digest256.json |
| disconnect-block-cookies-social.json | social-track-digest256.json |
| disconnect-block-cookies-content.json | content-track-digest256.json |
| disconnect-block-cookies-analytics.json | analytics-track-digest256.json |
  • Those files in Firefox/Focus are generated from disconnect-blacklist.json and disconnect-entitylist.json in ContentBlockerGen, in addition to base-fingerprinting-track.json, which we take directly from the shavar repo (v86)

@say-yawn can you confirm if that's correct? I see there are 5 tests failing that also fail for me locally. Does that mean the JSON files are not correct? I don't have enough knowledge about those yet, but if the lists are fine and the tests just need some changes to pass, we could start working on simplifying the iOS scripts to use these lists.
Thanks!
cc @st3fan

@st3fan

st3fan commented Sep 30, 2021

The tests have not been updated to take the JSON output into account.

I've been trying to update them, but it is very challenging because the tests mock the open() and write() calls and compare those calls to expected values, which have obviously changed with the introduction of JSON output.

I wonder if we should rewrite the tests without mocking and instead just write the generated files to a temporary directory where they can be inspected/validated. I think that would make it simpler to support multiple output formats.

What do you think, @say-yawn?
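
A test along the lines proposed above might look roughly like this. It is a sketch only: `write_json_list` is a hypothetical stand-in for the script's JSON writer, and the output shape (a sorted list of domains) is an assumption, not the format the PR actually emits.

```python
import json
import tempfile
import unittest
from pathlib import Path


def write_json_list(domains, output_dir: Path, name: str) -> Path:
    """Hypothetical stand-in for the script's JSON writer (not the PR's code)."""
    path = output_dir / f"{name}.json"
    path.write_text(json.dumps(sorted(domains), indent=2))
    return path


class JsonOutputTest(unittest.TestCase):
    def test_written_file_is_valid_sorted_json(self):
        with tempfile.TemporaryDirectory() as tmp:
            path = write_json_list({"b.example", "a.example"}, Path(tmp),
                                   "base-track-digest256")
            # Inspect the real file instead of asserting on mocked open()/write() calls.
            self.assertEqual(json.loads(path.read_text()),
                             ["a.example", "b.example"])


if __name__ == "__main__":
    unittest.main()
```

Because the assertions read the generated file back, the same test shape would work for any additional output format without touching mock expectations.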

@st3fan

st3fan commented Sep 30, 2021

I'm going to pick up this PR and refactor it a bit to make testing simpler.

Temporarily comment out the publish to cloud to remove noise/errors
@say-yawn
Contributor Author

I pushed some changes to generate the JSON list for all versioned branches too. The naming convention for the versioned ETP list is as mentioned in this code snippet. Before, the script was generating and replacing the JSON files every time get_tracker_lists was called. This meant that, before the two commits I just pushed, the JSON file left after lists2safebrowsing finished running was whichever versioned list get_tracker_lists had generated last.

@st3fan

st3fan commented Oct 1, 2021

@say-yawn I can't get this to work. I copy the config file from the config repo into place and run the tool, but it fails with an error:

(env) stefan@Discovery ~/M/shavar-list-creation (creat-json-tracking-protection-lists)> cp ../shavar-list-creation-config/prod.ini shavar_list_creat_latest_prod.ini
(env) stefan@Discovery ~/M/shavar-list-creation (creat-json-tracking-protection-lists)> ./lists2safebrowsing.py
Traceback (most recent call last):
  File "/Users/stefan/Mozilla/shavar-list-creation/./lists2safebrowsing.py", line 35, in <module>
    from publish2cloud import (
  File "/Users/stefan/Mozilla/shavar-list-creation/publish2cloud.py", line 31, in <module>
    REMOTE_SETTINGS_BUCKET = CONFIG.get('main', 'remote_settings_bucket')
  File "/opt/homebrew/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/configparser.py", line 781, in get
    d = self._unify_values(section, vars)
  File "/opt/homebrew/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/configparser.py", line 1152, in _unify_values
    raise NoSectionError(section) from None
configparser.NoSectionError: No section: 'main'

I don't understand this error because the [main] section is definitely in the file:

(env) stefan@Discovery ~/M/shavar-list-creation (creat-json-tracking-protection-lists) [0|1]> head shavar_list_creat_latest_prod.ini
[main]
default_disconnect_url=https://raw.githubusercontent.com/mozilla-services/shavar-prod-lists/master/disconnect-blacklist.json
s3_bucket=net-mozaws-prod-shavar
s3_upload=true

# DNT="", all categories except content category
[tracking-protection-base]
output=base-track-digest256
s3_key=tracking/base-track-digest256
versioning_needed=true
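
One possible explanation, which the thread does not confirm at this point, is that publish2cloud.py reads a different file name than the one that was copied into place. `ConfigParser.read()` silently skips files it cannot find, so a missing file only surfaces later as `NoSectionError`. The sketch below shows that behaviour and a stricter loader; `load_config` is a hypothetical helper, and the two file names are used purely for illustration.

```python
import configparser
import tempfile
from pathlib import Path


def load_config(path: Path) -> configparser.ConfigParser:
    """Load an ini file and fail loudly if it is missing."""
    config = configparser.ConfigParser()
    # read() returns the list of files it actually parsed; missing files are skipped.
    if not config.read(path):
        raise FileNotFoundError(f"config file not found or unreadable: {path}")
    return config


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        present = Path(tmp) / "shavar_list_creat_latest_prod.ini"
        present.write_text("[main]\ns3_upload=true\n")
        missing = Path(tmp) / "shavar_list_creation.ini"

        silent = configparser.ConfigParser()
        silent.read(missing)  # no exception here, even though the file does not exist
        try:
            silent.get("main", "s3_upload")
        except configparser.NoSectionError as err:
            print("lookup failed later:", err)

        print(load_config(present).get("main", "s3_upload"))
```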

@isabelrios

isabelrios commented Oct 4, 2021

I was able to run the script and get the json files before latest commit landed. On 4a7c13a it is working (I get the json files although they are not versioned)
But on 2c7b3a2 I get the error:
Error loading shavar_list_creation.ini

I have followed your steps above to copy the prod.ini into shavar_list_creation.ini

@isabelrios

> I was able to run the script and get the json files before latest commit landed. On 4a7c13a it is working (I get the json files although they are not versioned) But on 2c7b3a2 I get the error: Error loading shavar_list_creation.ini
>
> I have followed your steps above to copy the prod.ini into shavar_list_creation.ini

I was able to get this working. I had to rename sample_shavar_list_creation.ini to shavar_list_creation.ini (as this file did not exist in this branch), then modify the script to use this file, as it did before, instead of shavar_list_creat_latest_prod.ini. I guess that naming was used while debugging, @say-yawn?
I hope those changes make sense... @st3fan, in case you want to give this a try...

After that I ran ./lists2safebrowsing.py and got all the versioned json files! 👏
See some of the files:

[Screenshot 2021-10-04 at 11 12 04]

@isabelrios

Hello @say-yawn, please let me try to go back to this and see if we can move it a bit forward if you have some time at some point.

Once I was able to get the lists as described above, I tried to match them with the current lists used in Firefox. This is what I got; we need you to confirm whether it is right:

| Firefox generated files * | shavar-list-creation (Se Yeon branch) |
| --- | --- |
| disconnect-block-advertising.json | ads-track-digest256.json |
| disconnect-block-cryptomining.json | base-cryptomining-track-digest256.json |
| disconnect-block-analytics.json | analytics-track-digest256.json |
| disconnect-block-fingerprinting.json | base-fingerprinting-track-digest256.json |
| disconnect-block-social.json | social-track-digest256.json |
| disconnect-block-cookies-advertising.json | ads-track-digest256.json |
| disconnect-block-cookies-social.json | social-track-digest256.json |
| disconnect-block-cookies-content.json | content-track-digest256.json |
| disconnect-block-cookies-analytics.json | analytics-track-digest256.json |
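
For scripting purposes, the proposed mapping in the table above could be held as a plain dictionary. This is a sketch only: the mapping is copied from the table and is still unconfirmed at this point in the thread, and `shavar_lists_needed` is a hypothetical helper.

```python
from collections import defaultdict

# Proposed mapping copied from the table above (unconfirmed at this point).
FIREFOX_TO_SHAVAR = {
    "disconnect-block-advertising.json": "ads-track-digest256.json",
    "disconnect-block-cryptomining.json": "base-cryptomining-track-digest256.json",
    "disconnect-block-analytics.json": "analytics-track-digest256.json",
    "disconnect-block-fingerprinting.json": "base-fingerprinting-track-digest256.json",
    "disconnect-block-social.json": "social-track-digest256.json",
    "disconnect-block-cookies-advertising.json": "ads-track-digest256.json",
    "disconnect-block-cookies-social.json": "social-track-digest256.json",
    "disconnect-block-cookies-content.json": "content-track-digest256.json",
    "disconnect-block-cookies-analytics.json": "analytics-track-digest256.json",
}


def shavar_lists_needed(mapping):
    """Group the Firefox file names by the shavar list they would be built from."""
    grouped = defaultdict(list)
    for firefox_file, shavar_file in mapping.items():
        grouped[shavar_file].append(firefox_file)
    return dict(grouped)


if __name__ == "__main__":
    for shavar_file, firefox_files in shavar_lists_needed(FIREFOX_TO_SHAVAR).items():
        print(shavar_file, "->", ", ".join(firefox_files))
```

Grouping by source list makes it easy to see which generated files are reused for more than one Firefox file.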

I would say it looks good and we are a step closer to continue improving the environment.

  • Next steps would be:
    1-Confirm if that is right
    2-Fix the failing tests to get the PR in main
    3-Create PR to upload those files to S3
    4-Modify the code as we get those json files in Firefox/Focus

I can take care of 3 and 4 but need your help for 1 and 2. Thank you in advance!

@say-yawn
Contributor Author

say-yawn commented Dec 1, 2021

@isabelrios hello! Thanks for being patient with me

  • Generated files and mapping to ETP functionality table
| Firefox generated files * | shavar-list-creation (Se Yeon branch) |
| --- | --- |
| disconnect-block-cookies-content.json | content-track-digest256.json |

I am unsure what this is referring to, but content-track-digest256.json is only used for content blocking, not cookie blocking. You can see this doc for the mapping between the ETP feature and the Shavar list consumed by Firefox. Apart from the content-track-digest256.json row, the cookie blocking list looks correct, as does the other tracker blocking.

@isabelrios

> I am unsure what this is referring to but content-track-digest256.json is only used for content blocking, not cookie blocking. You can see this doc for the mapping between the ETP feature and the Shavar list consumed by Firefox

Thanks for your reply @say-yawn!! Your comment makes sense; in fact, that list is not being used for cookie blocking, as you can see here: https://github.com/mozilla-mobile/firefox-ios/blob/9a9b927484cf49a2cde438e3db71ee5203f88d66/content-blocker-lib-ios/src/ContentBlocker.swift#L41 which matches cell B8 of the doc you shared (if I'm reading that table correctly)

If the other iOS lists match with the json files you can create with this PR I think we are on a good path here. Happy to work with you, clarify what's needed so that we can move to next steps :)

@lmarceau
Copy link

I had a look and also came to the same mapping conclusion as @isabelrios described in #176 (comment) 👍

I think the next step is to 2-Fix the failing tests to get the PR in main. Once we have that PR merged, we'll be able to move forward with automating getting the generated files for Firefox for iOS (steps 3 and 4 have changed since the original comment was written, since the S3 bucket won't be used anymore).

@lmarceau

Hello! So turns out we'll need one more file to be generated with this PR, which is the entity-list as we had on the shavar-prod-lists. With that we'll be able to generate all files necessary for the Firefox for iOS project

@lmarceau

lmarceau commented Dec 7, 2022

Hello! All in all, on this PR the only thing missing is that we get the entity-list from it (as is, no change needed in format).

@isabelrios

@say-yawn would it be possible to get that entity-list json file when running your script lists2safebrowsing.py? I'm working on a solution to run it from a GitHub Action and save the desired json lists to be consumed by firefox-ios, but if we can't get that one from here, it may not work in the end. Do we need to run another script, or is there a way we can get it automatically? Thanks!

@say-yawn
Contributor Author

say-yawn commented Dec 19, 2022

Hi, if you can pull the entity lists from shavar-prod-lists that would be great. I believe that was the last thing that was identified as a need to merge this PR and I will not be able to make the changes on the script as I need to prioritize other work I was asked to deliver.

@isabelrios and @lmarceau, if this is an acceptable step forward will either of you review and merge the PR so you can use the changes from main rather than a feature branch?

@isabelrios

> Hi, if you can pull the entity lists from shavar-prod-lists that would be great. I believe that was the last thing that was identified as a need to merge this PR and I will not be able to make the changes on the script as I need to prioritize other work I was asked to deliver.
>
> @isabelrios and @lmarceau, if this is an acceptable step forward will either of you review and merge the PR so you can use the changes from main rather than a feature branch?

Hello @say-yawn, thanks for the update. I think we can move forward with what we have; there are a few questions, though, that we would need to clarify:

  • About entity lists. Yes, we can grab this list from shavar-prod-lists repo, but should we use the one in main or the one that is available in the latest branch created? (in this case the one in v108, and when a new branch is created, use the newer one?)
  • About this PR, we are not confident to land it. It may be too risky. There were failures in one of the commits and we don't see any CI running on the latest commits. We don't have enough knowledge to fix something related to this PR if it fails. Also, there are modifications we have to apply locally so that the script lists2safebrowsing.py works. Additionally, there is commented-out code (like # publish_to_cloud(config, chunknum, check_versioning=True)) that we don't know whether it will affect other projects or functionality. That said, we can use this branch for now. Actually, I have a POC working that uses this branch.
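
One way to avoid leaving the publish call commented out would be to gate it on a config flag. The sketch below is only an illustration: it assumes the `s3_upload` key visible in the prod.ini excerpt earlier in the thread can serve as that switch (whether the script already honours it is not stated here), and `should_publish` and `publish_if_enabled` are hypothetical helpers.

```python
import configparser


def should_publish(config: configparser.ConfigParser) -> bool:
    """Return True only when the [main] section explicitly enables uploads."""
    return config.getboolean("main", "s3_upload", fallback=False)


def publish_if_enabled(config, publish, *args, **kwargs):
    """Call the supplied publish function only when uploads are enabled."""
    if not should_publish(config):
        return None
    return publish(*args, **kwargs)


if __name__ == "__main__":
    config = configparser.ConfigParser()
    config.read_string("[main]\ns3_upload=false\n")
    # With uploads disabled, the publish callable is never invoked.
    publish_if_enabled(config, print, "would upload here")
```

With this shape, local runs can switch uploads off in the ini file while other consumers of the script keep the existing behaviour.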

In summary, the questions are:

  • Should we use main or latest branch for entity lists in shavar-prod-lists?
  • Is it fine if we continue using this branch instead of landing this PR as it is?

Thank you! @lmarceau please add anything I may have missed... 🤔

@isabelrios

I have one more question @say-yawn, just to double check that we are doing this correctly: when I run the lists2safebrowsing.py script I get versioned files (from version 69.0), but the latest version is 98.0. Then there are the same files without a version, which are the ones I am using. Is that right? Thanks
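
If the versioned files were used instead, picking the newest one needs a numeric comparison, because a plain string sort would rank 98.0 above 108.0. A minimal sketch, assuming the {version}-{list name}.json naming from the PR description; `split_version` and `latest_versioned` are hypothetical helpers, and the thread does not settle whether the versioned or unversioned files are the right ones to consume.

```python
import re

# Matches names such as 93.0-base-fingerprinting-track-digest256.json
VERSIONED_NAME = re.compile(r"^(\d+(?:\.\d+)*)-(.+)\.json$")


def split_version(file_name):
    """Return (version tuple, list name), or None for an unversioned file."""
    match = VERSIONED_NAME.match(file_name)
    if match is None:
        return None
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, match.group(2)


def latest_versioned(file_names, list_name):
    """Pick the highest-versioned file for one list, comparing versions numerically."""
    candidates = []
    for name in file_names:
        parsed = split_version(name)
        if parsed is not None and parsed[1] == list_name:
            candidates.append((parsed[0], name))
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    files = [
        "69.0-base-track-digest256.json",
        "98.0-base-track-digest256.json",
        "base-track-digest256.json",
    ]
    print(latest_versioned(files, "base-track-digest256"))
```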

@isabelrios

@say-yawn, I forgot to comment about our final solution to store the JSON and manage the files we generate. They will live in a public GCP bucket; is that fine, or do we need to check with someone? I remember that you said we would need a license if we wanted to have them in the repo, but since they will not be there... please confirm this with us to be sure we are doing this right. Thanks!

@say-yawn
Contributor Author

say-yawn commented Jan 4, 2023

I responded in a separate email about the public GCP bucket. The list should be fine to use if it's not public and may be fine to use on public. I recommend reaching out to legal to check if the public bucket is alright.

@isabelrios

Thank you @say-yawn for your detailed response. It helps a lot to understand how to continue with our solution.
The only pending issue we have is about legal, to be sure we can have the files stored there. We will reach out to the legal team to check this.

@say-yawn say-yawn changed the title Creat json tracking protection lists Create json tracking protection lists Jun 30, 2023
6 participants