Docs fixing (#1049)
* docs(papers): adding papers list

* docs(results): reference WebTAP data

* docs(results): add footnotes

* docs(README): redirect to correct link
vringar authored Sep 19, 2023
1 parent a4d2fba commit b21b914
Showing 11 changed files with 310 additions and 49 deletions.
1 change: 1 addition & 0 deletions Extension/.prettierignore
@@ -9,6 +9,7 @@ coverage
*.log

yarn.lock
package-lock.json

# built extension artifacts
dist
2 changes: 1 addition & 1 deletion Extension/package.json
@@ -82,4 +82,4 @@
"singleQuote": false,
"trailingComma": "all"
}
}
}
2 changes: 1 addition & 1 deletion README.md
@@ -162,7 +162,7 @@ Further information is available at [OPENWPM's Documentation Page](https://openw

## Advice for Measurement Researchers

OpenWPM is [often used](https://webtap.princeton.edu/software/) for web
OpenWPM is [often used](https://openwpm.readthedocs.io/Papers.html) for web
measurement research. We recommend the following for researchers using the tool:

**Use a versioned [release](https://github.com/openwpm/OpenWPM/releases).** We
259 changes: 259 additions & 0 deletions docs/Papers.rst
@@ -0,0 +1,259 @@
Studies using OpenWPM
======================

Data collected by WebTAP
------------------------

Since 2015, WebTAP has conducted a web census to study third-party online tracking.
Each month from 2015 to 2018, they visited the web’s 1 million most popular sites using
OpenWPM and recorded data pertaining to user privacy, including cookies, fingerprinting scripts,
the effect of browser privacy tools, and the exchange of tracking data between different sites
(“cookie syncing”).

WebTAP has `released <https://webtransparency.cs.princeton.edu/webcensus/data-release/>`_
the entire Princeton Web Census data — about 15 terabytes — containing
privacy measurements of 1 million sites conducted each month from December 2015 to June 2018.

List of Studies that have used OpenWPM
---------------------------------------

.. list-table::
:widths: 5 25 70
:header-rows: 1

* - Year
- Venue
- Study Name
* - 2014
- ACM CCS
- `The Web Never Forgets: Persistent Tracking Mechanisms in the Wild <https://securehomes.esat.kuleuven.be/~gacar/persistent/>`_
* - 2014
- ACM CoSN
- `Cognitive disconnect: Understanding Facebook Connect login permissions <http://cosn.acm.org/2014/files/cosn040f-robinsonA.pdf>`_
* - 2015
- WWW
- `Cookies that give you away: The surveillance implications of web tracking <http://senglehardt.com/papers/www15_cookie_surveil.pdf>`_
* - 2015
- NDSS
- `Upgrading HTTPS in midair: HSTS and key pinning in practice <https://www.internetsociety.org/sites/default/files/Upgrading%20HTTPS%20in%20Mid-Air-%20An%20Empirical%20Study%20of%20Strict%20Transport%20Security%20and%20Key%20Pinning.pdf>`_
* - 2015
- Tech Science
- `Web privacy census <http://techscience.org/a/2015121502/>`_
* - 2015
- W2SP
- `Variations in tracking in relation to geographic location <http://www.ieee-security.org/TC/SPW2015/W2SP/papers/W2SP_2015_submission_20.pdf>`_
* - 2016
- IFIP AICT
- `Evaluating Websites and Their Adherence to Data Protection Principles <https://dataskydd.net/sites/default/files/evaluating_websites.pdf>`_
* - 2016
- ACM CCS
- `Online Tracking: A 1-million-site Measurement and Analysis <http://senglehardt.com/papers/ccs16_online_tracking.pdf>`_
* - 2016
- WWW
- `No honor among thieves: A large-scale analysis of malicious web shells <https://www.securitee.org/files/webshells_www2016.pdf>`_
* - 2017
- NDSS
- `Dial One for Scam: A Large-Scale Analysis of Technical Support Scams <https://www.securitee.org/files/tss_ndss2017.pdf>`_
* - 2017
- PETS
- `Cross-Device Tracking: Measurement and Disclosures <https://petsymposium.org/2017/papers/issue2/paper29-2017-2-source.pdf>`_
* - 2017
- CODASPY
- `Identifying HTTPS-Protected Netflix Videos in Real-Time <https://dl.acm.org/citation.cfm?id=3029821>`_
* - 2017
- WWW
- `De-anonymizing Web Browsing Data with Social Networks <http://randomwalker.info/publications/browsing-history-deanonymization.pdf>`_ [#f1]_
* - 2017
- IWPE
- `Battery Status Not Included: Assessing Privacy in Web Standards <http://randomwalker.info/publications/battery-status-case-study.pdf>`_
* - 2017
- Annual Privacy Forum
- `PrivacyScore: Improving Privacy and Security via Crowd-Sourced Benchmarks of Websites <https://link.springer.com/chapter/10.1007/978-3-319-67280-9_10>`_
* - 2017
- arXiv
- `Horcrux: A Password Manager for Paranoids <https://arxiv.org/pdf/1706.05085.pdf>`_
* - 2017
- USENIX Security
- `Measuring the Insecurity of Mobile Deep Links of Android <http://people.cs.vt.edu/gangwang/deep17.pdf>`_
* - 2017
- Applied Economics Letters
- `Online advertising networks and consumer perceptions of privacy <http://www.tandfonline.com/doi/full/10.1080/13504851.2017.1366634>`_
* - 2018
- PETS
- `When the cookie meets the blockchain: Privacy risks of web payments via cryptocurrencies <https://www.petsymposium.org/2018/files/papers/issue4/popets-2018-0038.pdf>`_
* - 2018
- PETS
- `I never signed up for this! Privacy implications of email tracking <https://senglehardt.com/papers/pets18_email_tracking.pdf>`_
* - 2018
- ACM TOIT
- `Measuring third party tracker power across web and mobile <https://dl.acm.org/citation.cfm?id=3176246>`_
* - 2018
- CALIcon
- `Third Party Trackers on Law School Library Websites <https://docs.google.com/presentation/d/1kwJs5Tb2R93a8AQFBq_8oI97QwnhuyaXm7ZuhZhgjBs/edit?usp=sharing>`_
* - 2018
- Master's Thesis, Delft University of Technology
- `Tracking Cookies in the European Union, an Empirical Analysis of the Current Situation <https://repository.tudelft.nl/islandora/object/uuid%3A50d04f83-222d-479e-8ddd-661d2243857a?collection=education>`_
* - 2018
- ACM CCS
- `The Web’s Sixth Sense: A Study of Scripts Accessing Smartphone Sensors <https://sensor-js.xyz/webs-sixth-sense-ccs18.pdf>`_
* - 2018
- ACSAC
- `Raising the Bar: Evaluating Origin-wide Security Manifests <https://danielhausknecht.eu/papers/originmanifest_acsac2018.pdf>`_
* - 2018
- arXiv
- `The Unwanted Sharing Economy: An Analysis of Cookie Syncing and User Transparency under GDPR <https://arxiv.org/pdf/1811.08660.pdf>`_
* - 2018
- PhD Thesis, Princeton University
- `Automated discovery of privacy violations on the web <https://senglehardt.com/papers/princeton_phd_dissertation_englehardt.pdf>`_
* - 2018
- AINTEC’18
- `Understanding abusive web resources: characteristics and counter-measures of malicious web resources and cryptocurrency mining <https://dl.acm.org/citation.cfm?id=3289174>`_
* - 2018
- ACSAC
- `Raising the Bar: Evaluating Origin-wide Security Manifests <https://dl.acm.org/citation.cfm?id=3274701>`_
* - 2018
- SSRN
- `Acquisitions in the Third Party Tracking Industry: Competition and Data Protection Aspects <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3269473>`_
* - 2019
- Communications in Computer and Information Science
- `Transparency in Keyword Faceted Search: An Investigation on Google Shopping <https://link.springer.com/chapter/10.1007/978-3-030-11226-4_3>`_
* - 2019
- arXiv
- `The Price of Free Illegal Live Streaming Services <https://arxiv.org/abs/1901.00579>`_
* - 2019
- Advances in Intelligent Systems and Computing
- `Usage of HTTPS by Municipal Websites in Portugal <https://link.springer.com/chapter/10.1007/978-3-030-16184-2_16>`_
* - 2019
- ConPro
- `The Impact of User Location on Cookie Notices (Inside and Outside of the European Union) <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3361360>`_
* - 2019
- WWW
- `Before and After GDPR: The Changes in Third Party Presence at Public and Private European Websites <https://dl.acm.org/citation.cfm?doid=3308558.3313524>`_
* - 2019
- IEEE EuroS&P
- `TraffickStop: Detecting and Measuring Illicit Traffic Monetization Through Large-Scale DNS Analysis <https://www.luchaoyi.com/uploads/1/2/0/2/120274471/eurosp19.pdf>`_
* - 2019
- SSRN
- `The Market for Data Privacy <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3352175>`_
* - 2019
- ACM CSCW
- `Dark Patterns at Scale: Findings from a Crawl of 11K Shopping Websites <https://arxiv.org/pdf/1907.07032.pdf>`_
* - 2019
- Computer Communications
- `A comparison of web privacy protection techniques <https://arxiv.org/pdf/1712.06850.pdf>`_
* - 2019
- DPM
- `On Privacy Risks of Public WiFi Captive Portals <https://arxiv.org/pdf/1907.02142.pdf>`_
* - 2019
- Computers & Security
- `Towards a global perspective on web tracking <https://www.sciencedirect.com/science/article/pii/S0167404818314007>`_
* - 2019
- APF
- `Towards Transparency in Email Tracking <https://link.springer.com/chapter/10.1007/978-3-030-21752-5_2>`_
* - 2019
- RAID
- `Talon: An Automated Framework for Cross-Device Tracking Detection <https://arxiv.org/pdf/1812.11393.pdf>`_
* - 2019
- ACM CCS
- `Watching You Watch: The Tracking Ecosystem of Over-the-Top TV Streaming Devices <https://www.princeton.edu/~pmittal/publications/tv-tracking-ccs19.pdf>`_
* - 2019
- ACM IMC
- `Tales from the Porn: A Comprehensive Privacy Analysis of the Web Porn Ecosystem <http://www1.icsi.berkeley.edu/~narseo/papers/pornweb2019_preprint.pdf>`_
* - 2019
- IEEE EuroS&P
- `TraffickStop: Detecting and Measuring Illicit Traffic Monetization Through Large-scale DNS Analysis <https://www.researchgate.net/profile/Zhou_Li24/publication/332544947_TraffickStop_Detecting_and_Measuring_Illicit_Traffic_Monetization_Through_Large-Scale_DNS_Analysis/links/5cbb9445299bf12097747a16/TraffickStop-Detecting-and-Measuring-Illicit-Traffic-Monetization-Through-Large-Scale-DNS-Analysis.pdf>`_
* - 2019
- The New York Times
- `I Visited 47 Sites. Hundreds of Trackers Followed Me. <https://www.nytimes.com/interactive/2019/08/23/opinion/data-internet-privacy-tracking.html>`_
* - 2019
- The Washington Post
- `Think you’re anonymous online? A third of popular websites are ‘fingerprinting’ you. <https://www.washingtonpost.com/technology/2019/10/31/think-youre-anonymous-online-third-popular-websites-are-fingerprinting-you/>`_
* - 2019
- ESORICS
- `Fingerprint surface-based detection of web bot detectors <http://www.open.ou.nl/hjo/papers/ESORICS19.pdf>`_
* - 2019
- DPM
- `A Study on Subject Data Access in Online Advertising after the GDPR <https://www.researchgate.net/profile/Tobias_Urban2/publication/334706961_A_Study_on_Subject_Data_Access_in_Online_Advertising_after_the_GDPR/links/5d47eff492851cd046a26e5b/A-Study-on-Subject-Data-Access-in-Online-Advertising-after-the-GDPR.pdf>`_
* - 2019
- IEEE SPW
- `After GDPR, Still Tracking or Not? Understanding Opt-Out States for Online Behavioral Advertising <https://ieeexplore.ieee.org/document/8844599>`_
* - 2020
- PETS
- `Missed by Filter Lists: Detecting Unknown Third-Party Trackers with Invisible Pixels <https://www.petsymposium.org/2020/files/papers/issue2/popets-2020-0038.pdf>`_
* - 2020
- PETS
- `Inferring Tracker-Advertiser Relationships in the Online Advertising Ecosystem using Header Bidding <https://www.petsymposium.org/2020/files/papers/issue1/popets-2020-0005.pdf>`_
* - 2020
- PETS
- `A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments <https://petsymposium.org/2020/files/papers/issue2/popets-2020-0016.pdf>`_
* - 2020
- PETS
- `No boundaries: data exfiltration by third parties embedded on web pages <https://petsymposium.org/2020/files/papers/issue4/popets-2020-0068.pdf>`_
* - 2020
- PETS
- `In-Depth Evaluation of Redirect Tracking and Link Usage <https://petsymposium.org/2020/files/papers/issue4/popets-2020-0077.pdf>`_
* - 2020
- The Web Conference
- `The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing <https://research.mozilla.org/files/2020/02/Jestr_vs_crawl_WWW20202.pdf>`_ [#f2]_
* - 2020
- The Web Conference
- `Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web <https://sparta.cs.uiowa.edu/docs/www-2020.pdf>`_
* - 2020
- The Web Conference
- `Stop Tracking me Bro! Differential Tracking of User Demographics on Hyper-partisan Websites <https://arxiv.org/pdf/2002.00934.pdf>`_ [#f2]_
* - 2020
- The Web Conference
- `Beyond the Front Page: Measuring Third Party Dynamics in the Field <https://arxiv.org/pdf/2001.10248.pdf>`_
* - 2020
- ACM ASIACCS
- `Measuring the Impact of the GDPR on Data Sharing in Ad Networks <https://www.researchgate.net/profile/Tobias_Urban2/publication/337487496_Measuring_the_Impact_of_the_GDPR_on_Data_Sharing_in_Ad_Networks/links/5ddbc21792851c1fedafc663/Measuring-the-Impact-of-the-GDPR-on-Data-Sharing-in-Ad-Networks.pdf>`_
* - 2020
- arXiv
- `Actions speak louder than words: Semi-supervised learning for browser fingerprinting detection <https://arxiv.org/pdf/2003.04463.pdf>`_
* - 2020
- PAM
- `Extortion or Expansion? An investigation into the costs and consequences of ICANN’s gTLD experiments <https://people.cs.umass.edu/~shahrooz/Shahrooz_gTLDtmPAM202.pdf>`_
* - 2020
- Bachelor's Thesis, Radboud University
- `Design and implementation of a stealthy OpenWPM web scraper <http://www.cs.ru.nl/bachelors-theses/2020/Daniel_Go%C3%9Fen___4751051___Design_and_implementation_of_a_stealthy_OpenWPM_web_scraper.pdf>`_
* - 2020
- IWPE
- `On Compliance of Cookie Purposes with the Purpose Specification Principle <https://hal.inria.fr/hal-02567022/document>`_
* - 2020
- FTC PrivacyCon
- `Unaccounted Privacy Violation: A Comparative Analysis of Persistent Identification of Users Across Social Contexts <https://www.ftc.gov/system/files/documents/public_events/1548288/privacycon-2020-ido_sivan-sevilla.pdf>`_
* - 2020
- IEEE EuroS&P
- `Multi-country Study of Third Party Trackers from Real Browser Histories <https://nms.kcl.ac.uk/nishanth.sastry/pdf/2020/EurpSPMultiCountryStudy.pdf>`_
* - 2020
- TMA
- `Characterizing CNAME Cloaking-Based Tracking on the Web <https://tma.ifip.org/2020/wp-content/uploads/sites/9/2020/06/tma2020-camera-paper66.pdf>`_
* - 2020
- TMA
- `Clash of the Trackers: Measuring the Evolution of the Online Tracking Ecosystem <https://tma.ifip.org/2020/wp-content/uploads/sites/9/2020/06/tma2020-camera-paper36.pdf>`_
* - 2020
- WEIS
- `The Impact of the GDPR on Content Providers <https://weis2020.econinfosec.org/wp-content/uploads/sites/8/2020/06/weis20-final43.pdf>`_
* - 2020
- PhD Thesis, University of Michigan
- `Enhancing System Transparency, Trust, and Privacy with Internet Measurement <https://benvds.com/papers/dissertation.pdf>`_
* - 2020
- Master's Thesis, Concordia University
- `A Large-Scale Evaluation of Privacy Practices of Public WiFi Captive Portals <https://users.encs.concordia.ca/~mmannan/student-resources/Thesis-MASc-AliSuzan-2020.pdf>`_
* - 2020
- IEEE Globecom
- `A machine learning approach for detecting CNAME cloaking-based tracking on the Web <https://arxiv.org/pdf/2009.14330.pdf>`_
* - 2021
- NDSS
- `Reining in the Web’s Inconsistencies with Site Policy <https://swag.cispa.saarland/papers/calzavara2021reining.pdf>`_
* - 2021
- PETS
- `Unveiling Web Fingerprinting in the Wild Via Code Mining and Machine Learning <https://petsymposium.org/2021/files/papers/popets-2021-0004.pdf>`_
* - 2021
- IEEE S&P
- `Fingerprinting the Fingerprinters: Learning to Detect Browser Fingerprinting Behaviors <https://arxiv.org/abs/2008.04480>`_

.. rubric:: Footnotes

.. [#f1] Uses data released by us.
.. [#f2] Studies OpenWPM’s behavior.
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -23,6 +23,8 @@ We're hoping to improve this setup in the future.

Configuration

Papers

.. toctree::
:maxdepth: 4
:caption: Developer documentation
28 changes: 16 additions & 12 deletions openwpm/command_sequence.py
@@ -156,19 +156,23 @@ def recursive_dump_page_source(self, suffix="", timeout=30):
stored in `manager_params.source_dump_path` and is keyed by the
current `visit_id` and top-level url. The source dump is a gzipped json
file with the following structure:
{
'document_url': "http://example.com",
'source': "<html> ... </html>",
'iframes': {
'frame_1': {'document_url': ...,
'source': ...,
'iframes: { ... }},
'frame_2': {'document_url': ...,
'source': ...,
'iframes: { ... }},
'frame_3': { ... }
.. code-block:: JSON
:linenos:
{
"document_url": "http://example.com",
"source": "<html> ... </html>",
"iframes": {
"frame_1": {"document_url": "...",
"source": "...",
"iframes": "{ ... }"},
"frame_2": {"document_url": "...",
"source": "...",
"iframes": "{ ... }"},
"frame_3": "{ ... }"
}
}
}
"""
self.total_timeout += timeout
if not self.contains_get_or_browse:
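The docstring above describes the source dump as a gzipped JSON file whose `iframes` entries nest the same three keys. A minimal sketch of reading such a file and walking the nested frames might look like the following. The helper names `load_source_dump` and `iter_frames`, the file name `example_dump.json.gz`, and the sample URLs are all hypothetical; the commit does not show how dump files are actually named under `manager_params.source_dump_path`, so the example writes its own sample file first.

```python
import gzip
import json
import os
import tempfile


def load_source_dump(path):
    """Read one gzipped JSON source dump and return it as a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def iter_frames(frame, depth=0):
    """Yield (depth, document_url, source) for a frame and every nested iframe."""
    yield depth, frame["document_url"], frame["source"]
    for child in frame.get("iframes", {}).values():
        yield from iter_frames(child, depth + 1)


# Hypothetical sample data in the shape the docstring documents.
sample = {
    "document_url": "http://example.com",
    "source": "<html> ... </html>",
    "iframes": {
        "frame_1": {
            "document_url": "http://example.com/ad",
            "source": "<html> ad </html>",
            "iframes": {},
        },
        "frame_2": {
            "document_url": "http://example.com/widget",
            "source": "<html> widget </html>",
            "iframes": {
                "frame_1": {
                    "document_url": "http://example.com/nested",
                    "source": "<html> nested </html>",
                    "iframes": {},
                }
            },
        },
    },
}

# Write the sample as a gzipped JSON file (assumed file name), then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    dump_path = os.path.join(tmp_dir, "example_dump.json.gz")
    with gzip.open(dump_path, "wt", encoding="utf-8") as f:
        json.dump(sample, f)
    frames = list(iter_frames(load_source_dump(dump_path)))

# Print each frame's URL, indented by nesting depth.
for depth, url, _source in frames:
    print("  " * depth + url)
```

The walk is depth-first, so each frame's URL is printed before those of the iframes it contains.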
16 changes: 6 additions & 10 deletions openwpm/commands/browser_commands.py
@@ -40,8 +40,7 @@


def bot_mitigation(webdriver):
"""performs three optional commands for bot-detection
mitigation when getting a site"""
"""Performs three optional commands for bot-detection mitigation when getting a site"""

# bot mitigation 1: move the mouse randomly around a number of times
window_size = webdriver.get_window_size()
@@ -86,9 +85,7 @@ def close_other_windows(webdriver):


def tab_restart_browser(webdriver):
"""
kills the current tab and creates a new one to stop traffic
"""
"""kills the current tab and creates a new one to stop traffic"""
# note: this technically uses windows, not tabs, due to problems with
# chrome-targeted keyboard commands in Selenium 3 (intermittent
# nonsense WebDriverExceptions are thrown). windows can be reliably
@@ -114,9 +111,7 @@ def tab_restart_browser(webdriver):


class GetCommand(BaseCommand):
"""
goes to <url> using the given <webdriver> instance
"""
"""goes to <url> using the given <webdriver> instance"""

def __init__(self, url, sleep):
self.url = url
@@ -467,6 +462,7 @@ def collect_source(webdriver, frame_stack, rv={}):

class FinalizeCommand(BaseCommand):
"""This command is automatically appended to the end of a CommandSequence
Its appearance means there won't be any more commands for this
visit_id
"""
@@ -494,8 +490,8 @@ def execute(


class InitializeCommand(BaseCommand):
"""The command is automatically prepended to the beginning of a
CommandSequence
"""The command is automatically prepended to the beginning of a CommandSequence
It initializes state in both the extension and the
StorageController
"""