Update developer docs for backend with more notes
* Split it out into separate files
* Add notes about status of docs when relevant
* Integrate documentation from docs into main backend repo
hellais committed Jan 6, 2025
1 parent 032eefd commit 8ce2279
Showing 15 changed files with 3,992 additions and 518 deletions.
113 changes: 111 additions & 2 deletions README.md
@@ -1,3 +1,112 @@
# OONI backend
# OONI Backend

Welcome to the OONI backend!
The backend infrastructure performs multiple functions:

- Provide APIs for data consumers

- Instruct probes on what measurements to perform

- Receive measurements from probes, process them and store them in the database

- Upload new measurements to the [S3 data bucket](#s3-data-bucket)

- Fetch data from external sources e.g. fingerprints from a GitHub repository

## Main data flows

OONI Probes generally run once every hour or once every day, depending on the platform.
The following sequence diagram shows a single probe run:

```mermaid
sequenceDiagram
participant OONIProbe as OONI Probe
participant ProbeServices as OONI Backend
participant Internet
OONIProbe ->>+ Internet: lookupProbeMeta()
Internet ->>- OONIProbe: ProbeMeta
OONIProbe ->>+ ProbeServices: checkIn(ProbeMeta)
ProbeServices -->>- OONIProbe: []Targets
loop Every target
OONIProbe ->>+ Internet: runExperiment(target)
opt Control
OONIProbe ->>+ ProbeServices: runControl(target)
ProbeServices ->>- OONIProbe: CtrlMeasurement
end
Internet ->>- OONIProbe: Measurement
OONIProbe ->> ProbeServices: upload(Measurement)
end
```
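
The same sequence can be read as a short control loop. The following is a minimal sketch, not the actual probe implementation: the function and parameter names (`run_probe`, `lookup_probe_meta`, `check_in`, `run_experiment`, `run_control`, `upload`) are hypothetical stand-ins for the arrows in the diagram, and the control request is shown sequentially for simplicity even though the diagram places it while the experiment is in progress.

```python
from typing import Callable, Iterable, Optional


def run_probe(
    lookup_probe_meta: Callable[[], dict],
    check_in: Callable[[dict], Iterable[str]],
    run_experiment: Callable[[str], dict],
    upload: Callable[[dict], None],
    run_control: Optional[Callable[[str], dict]] = None,
) -> int:
    """Run one probe session and return the number of uploaded measurements.

    All callables are hypothetical placeholders for the steps in the diagram.
    """
    probe_meta = lookup_probe_meta()      # OONI Probe -> Internet
    targets = check_in(probe_meta)        # OONI Probe -> OONI Backend
    uploaded = 0
    for target in targets:
        measurement = run_experiment(target)  # OONI Probe -> Internet
        if run_control is not None:           # optional control step
            measurement["control"] = run_control(target)
        upload(measurement)                   # OONI Probe -> OONI Backend
        uploaded += 1
    return uploaded
```

Passing the steps in as callables keeps the sketch independent of any real network or backend client.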

The following diagram, on the other hand, represents the main flow of measurement data.

The dark rectangles represent processes. The cylinders represent data at rest:
files on disk, files on S3 or records in database tables.

```mermaid
flowchart LR
A(("Measurement")):::measurement --> B["Measurement is uploaded"]
B --> C["Fastpath (realtime)"]:::gray8Node & D["Disk Queue"]
C --> E["Fastpath Table"]:::gray3Node@{ shape: cyl}
D --> F["S3 Uploader (every hour)"]:::gray8Node
F --> G["s3://ooni-data-eu-fra bucket"]@{shape: cyl}
E --> H["OONI API"]:::gray8Node
D --> decision{"`is older than 1h?`"}
G --> decision
decision --> H
G --> PipelineV5["OONI Pipeline v5 (every day)"]:::gray8Node
PipelineV5 --> O["Observation Tables"]:::gray3Node@{ shape: cyl}
O --> H
classDef measurement fill:#0588cb,color:#fff
classDef gray2Node fill:#e9ecef,color:#000000
classDef gray3Node fill:#ced4da,color:#000000
classDef gray8Node fill:#343a40,color:#fff
```

Probes submit measurements to the API with a POST at the following path:
<https://api.ooni.io/apidocs/#/default/post_report__report_id_>. The
measurement is decompressed if zstd compression is detected. It is then
parsed, assigned a unique ID and saved to disk. Very little validation is
done at this stage so that all incoming measurements are accepted.
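
The ingestion steps above can be sketched as follows. This is an illustration under stated assumptions, not the backend's actual code: detecting zstd by its frame magic number, the `measurement_uid` field name, the UUID-based ID and the `<uid>.json` file layout are all assumptions. Python's standard library has no zstd codec in most versions, so the decompressor is passed in as a callable.

```python
import json
import uuid
from pathlib import Path
from typing import Callable, Optional

# Magic number that starts every zstd frame (0xFD2FB528, little-endian).
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def is_zstd(body: bytes) -> bool:
    """Return True if the body starts with the zstd frame magic number."""
    return body.startswith(ZSTD_MAGIC)


def ingest(
    body: bytes,
    queue_dir: Path,
    decompress: Optional[Callable[[bytes], bytes]] = None,
) -> str:
    """Decompress if needed, parse, tag with an ID and save one file to disk."""
    if is_zstd(body):
        if decompress is None:
            raise ValueError("zstd body received but no decompressor configured")
        body = decompress(body)
    measurement = json.loads(body)          # minimal validation: must be JSON
    measurement_uid = uuid.uuid4().hex      # assumption: hypothetical ID scheme
    measurement["measurement_uid"] = measurement_uid
    out_path = queue_dir / f"{measurement_uid}.json"  # assumption: file naming
    out_path.write_text(json.dumps(measurement))
    return measurement_uid
```

Beyond parsing the JSON, the sketch performs no validation, mirroring the stated goal of accepting all incoming measurements.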

Measurements are enqueued on disk using one file per measurement. At
hourly intervals they are batched together, compressed and uploaded to
S3 by the [Measurement uploader](#measurement-uploader)&thinsp;⚙. Batching
allows more efficient compression than compressing each file on its own. See the
[dedicated subchapter](#measurement-uploader)&thinsp;⚙ for details.
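
To illustrate the batching idea, the sketch below concatenates the queued files into a single gzip-compressed JSON Lines file. The function name `batch_queue`, the `*.json` naming of queue entries and the choice of gzip with JSON Lines are assumptions made for the example; the real uploader's file formats and naming may differ. The upload itself is left out.

```python
import gzip
from pathlib import Path
from typing import List


def batch_queue(queue_dir: Path, out_path: Path) -> List[Path]:
    """Write all queued measurements into one compressed batch file.

    Returns the queue files included in the batch, so the caller can
    delete them only after the batch has been uploaded successfully.
    """
    files = sorted(queue_dir.glob("*.json"))  # assumption: one JSON file each
    with gzip.open(out_path, "wb") as batch:
        for path in files:
            # One measurement per line (JSON Lines).
            batch.write(path.read_bytes().strip() + b"\n")
    return files
```

Compressing many similar JSON documents in one stream lets the compressor reuse repeated keys and values across measurements, which is why a batch typically compresses better than individual files.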

The measurement is also sent to the [Fastpath](#fastpath)&thinsp;⚙. The
Fastpath runs as a dedicated daemon with a pool of workers. It
calculates scoring for the measurement and writes a record in the
fastpath table. Each measurement is processed individually in real time.
See the [dedicated subchapter](#fastpath)&thinsp;⚙ below.

The disk queue is also used by the API to access recent measurements
that have not been uploaded to S3 yet. See the
[measurement API](#getting-measurement-bodies)&thinsp;🐝 for details.
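
A minimal sketch of that lookup order, assuming a queue directory with one `<uid>.json` file per measurement and a hypothetical `fetch_from_s3` callable standing in for the S3 read path:

```python
from pathlib import Path
from typing import Callable


def get_measurement_body(
    uid: str,
    queue_dir: Path,
    fetch_from_s3: Callable[[str], bytes],
) -> bytes:
    """Return a measurement body, preferring the local disk queue."""
    local = queue_dir / f"{uid}.json"  # assumption: hypothetical file naming
    if local.exists():
        # Recent measurement that has not been uploaded to S3 yet.
        return local.read_bytes()
    # Older measurement: already batched and uploaded.
    return fetch_from_s3(uid)
```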

## Reproducibility

The measurement processing pipeline is meant to generate outputs that
can be independently reproduced by third parties such as external
researchers and other organizations.

This keeps OONI accountable and serves as proof that we do not
arbitrarily delete or alter measurements, and that we score them as
accessible/anomaly/confirmed/failure in a predictable and transparent
way.

> **important**
> The only exceptions were due to privacy breaches that required removal
> of the affected measurements from the [S3 data bucket](#s3-data-bucket)&thinsp;💡.

As such, the backend infrastructure is
[FOSS](https://en.wikipedia.org/wiki/Free_and_open-source_software) and
can be deployed by third parties. We encourage researchers to replicate
our findings.

Incoming measurements are minimally altered by the
[Measurement uploader](#measurement-uploader)&thinsp;⚙ and uploaded to S3.