# Storing large real-time data streams in pod using LDES

The corresponding challenge is #82, which contributes to scenario #16.

## Problem

Data streams are becoming omnipresent and are a crucial component in many use cases. Storing streams in a low-cost, file-based Web environment can be done using Linked Data Event Streams (LDES). However, pushing large volumes of highly volatile data into a Solid-based LDES is currently not possible: the existing solution partitions the data only after all of it has been retrieved, instead of in a streaming fashion, and repartitioning large amounts of data in one go puts such a high load on the Community Solid Server that it crashes. We use data from the DAHCC dataset for this challenge, which contains data streams describing the behaviour of various patients. The streams contain over 100,000 events.

We want a streaming Solid-based LDES connector that replays data and partitions it in a streaming fashion while the data is being retrieved, instead of waiting until the whole dataset has been received. Besides avoiding the high load on the server, replaying makes it possible to mimic the real-time behaviour of data streams, even though the data is historical. This shows how solutions can process live data streams and how they handle different data rates.

## Approved solution

We developed a tool that replays captured streams. The tool has the following features:

  • The tool is pull-based: it pulls the observations from N-Triples files.
  • The tool adds new events to the right bucket.
  • If new buckets are required, the tool updates the LDES.
  • The user can parameterize the bucket size. A sketch of this bucketing behaviour follows this list.
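
The bucketing behaviour can be illustrated with a minimal TypeScript sketch. All identifiers below (`StreamingBucketiser`, `assignToBucket`, `bucketSize`) are hypothetical and only mirror the behaviour described above; the actual code lives in the repository.

```typescript
// Minimal sketch of streaming bucketisation; names are illustrative,
// not the engine's actual identifiers.
interface ReplayEvent {
  id: string;        // e.g. an observation IRI
  timestamp: number; // milliseconds since epoch
}

interface Bucket {
  startTimestamp: number; // a bucket is identified by its first event's timestamp
  events: ReplayEvent[];
}

class StreamingBucketiser {
  private buckets: Bucket[] = [];

  // The user-parameterisable bucket size: max number of events per bucket.
  constructor(private readonly bucketSize: number) {}

  /** Add one incoming event; create a new bucket if the current one is full. */
  assignToBucket(event: ReplayEvent): Bucket {
    const current = this.buckets[this.buckets.length - 1];
    if (current === undefined || current.events.length >= this.bucketSize) {
      // A new bucket is required: in the real engine this is the point
      // where the LDES in the Solid pod is updated with a new relation.
      const fresh: Bucket = { startTimestamp: event.timestamp, events: [event] };
      this.buckets.push(fresh);
      return fresh;
    }
    current.events.push(event);
    return current;
  }
}
```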

The repository contains both a Web app and an engine, but only the engine is relevant for challenge #82 and this report.

The solution builds on this code. The code is not included as a dependency, as it was not designed as a library; we copied and pasted the parts of the code that were useful for our solution.

We made the following important technological decisions and assumptions:

  • JavaScript/TypeScript as the main development language, in line with current practices.
  • LDES to represent a stream of events detected by IoT-related devices.
  • Implements pagination to support loading large datasets.
  • Data sources are N-Triples files, as they represent the state of the devices and observations as originally persisted using Apache Feather.
  • Single-user demonstration implementation, as the main goal is validating the approach, not developing an industry-ready implementation.
  • N3.js for streaming and high-throughput handling of RDF data.
  • Implements a recursive version of the merge sort algorithm to sort the observations based on their timestamp (configurable).
  • Implements a pointer-based algorithm to keep track of the last observation/event that has been replayed. The pointer can be advanced one observation at a time by the end user or automatically to the end (see the sketch after this list).
  • Implements optimisations to manage the size of the pods.
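
As a rough illustration of the last two decisions, the sketch below combines a recursive merge sort on a configurable timestamp key with a replay pointer into the sorted result. All names and types are hypothetical; they only mimic the behaviour the list describes.

```typescript
// Illustrative sketch of the sort-and-replay pointer; not the engine's real code.
type Observation = { iri: string; timestamp: number };

// Recursive merge sort on a configurable key (here: the timestamp).
function mergeSort(obs: Observation[], key: (o: Observation) => number): Observation[] {
  if (obs.length <= 1) return obs;
  const mid = Math.floor(obs.length / 2);
  const left = mergeSort(obs.slice(0, mid), key);
  const right = mergeSort(obs.slice(mid), key);
  const merged: Observation[] = [];
  let i = 0, j = 0;
  while (i < left.length && j < right.length) {
    merged.push(key(left[i]) <= key(right[j]) ? left[i++] : right[j++]);
  }
  return merged.concat(left.slice(i), right.slice(j));
}

// Pointer-based replay: remembers the last observation that has been replayed.
class Replayer {
  private pointer = 0;

  constructor(private readonly sorted: Observation[]) {}

  /** Replay the next observation and advance the pointer (cf. step 7 below). */
  advance(): Observation | undefined {
    return this.sorted[this.pointer++];
  }

  /** Replay all remaining observations (cf. step 8 below). */
  advanceToEnd(): Observation[] {
    const rest = this.sorted.slice(this.pointer);
    this.pointer = this.sorted.length;
    return rest;
  }
}
```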

## User flow

### Actors/actresses

  • User of the tool

### Preconditions

Configure the engine via the following steps:

  1. Clone the repository via

    git clone https://github.com/SolidLabResearch/LDES-in-SOLID-Semantic-Observations-Replay
  2. Navigate to LDES-in-SOLID-Semantic-Observations-Replay via

    cd LDES-in-SOLID-Semantic-Observations-Replay
  3. Start an instance of the Community Solid Server via

    docker run --rm -p 3000:3000 -it solidproject/community-server:latest -c config/default.json
  4. Open a new terminal at the same location.

  5. Navigate to engine via

    cd engine
  6. Install dependencies via

    npm i
  7. Download the example DAHCC dataset via

    curl -L https://cloud.ilabt.imec.be/index.php/s/8BatNcg2iEyJktR/download -o data/dataset_participant1_100obs
  8. Set the value of datasetFolders to the full path of the folder engine/data in the file src/config/replay_properties.json (a sketch of this config file follows these steps).

  9. Start the engine via

    npm start

If you get an error, see the README of the repository.
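
For orientation, src/config/replay_properties.json could look roughly like the sketch below. Only the datasetFolders key is confirmed by this report; the other keys are assumptions based on the configurable options mentioned in this document (chunk size, tree path, bucket size), so consult the file in the repository for the actual names and values.

```json
{
  "datasetFolders": "/home/user/LDES-in-SOLID-Semantic-Observations-Replay/engine/data",
  "chunkSize": 100,
  "treePath": "https://saref.etsi.org/core/hasTimestamp",
  "bucketSize": 10
}
```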

### Steps

  1. Get all loadable datasets using a GET request via

    curl http://localhost:3001/datasets

    You get something like

    ["dataset_participant1_100obs","dataset_participant2_100obs"]
  2. Load a particular dataset using a GET request via

    curl "http://localhost:3001/loadDataset?dataset=dataset_participant1_100obs"

    You get an empty result.

  3. Check the loading progress (in quad count) using a GET request via

    curl http://localhost:3001/checkLoadingSize

    You get something like

    [500]
  4. Get the actual observation count (the number of loaded quads divided by the number of quads per observation) using a GET request via

    curl http://localhost:3001/checkObservationCount

    You get something like

    [100]
  5. Sort the loaded observations (according to the configured TreePath) using a GET request via

    curl http://localhost:3001/sortObservations

    You get something like

    [["https://dahcc.idlab.ugent.be/Protego/_participant1/obs0","https://dahcc.idlab.ugent.be/Protego/_participant1/obs1","https://dahcc.idlab.ugent.be/Protego/_participant1/obs2" ... ]]
  6. Get a sample set of observations (one chunk, as configured) using a GET request via

    curl http://localhost:3001/getObservations

    You get something like

    [{"termType":"NamedNode","value":"https://dahcc.idlab.ugent.be/Protego/_participant1/obs0"},{"termType":"NamedNode","value":"https://dahcc.idlab.ugent.be/Protego/_participant1/obs1"} ...}]
  7. Replay the next observation using a GET request via

    curl http://localhost:3001/advanceAndPushObservationPointer

    You get something like

    [1]

    This represents the pointer to the next replayable observation. When you check the LDES in the Solid pod (default: http://localhost:3000/test/), you should see at least two containers (the inbox and the LDES buckets); the LDES buckets should now contain the replayed observation, for example http://localhost:3000/test/1641197095000/aa28a2fa-010f-4b81-8f3c-a57f45e13758.

  8. Replay all remaining observations using a GET request via

    curl http://localhost:3001/advanceAndPushObservationPointerToTheEnd
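
The steps above can also be scripted. The sketch below drives the same HTTP endpoints with Node's built-in fetch (Node 18+); the endpoints and the dataset name come from the steps above, while the polling heuristic for detecting the end of the loading phase is an assumption made for illustration.

```typescript
// Sketch of scripting the replay flow via the engine's HTTP API (Node 18+,
// which provides a global fetch). Endpoints match the steps above.
const base = 'http://localhost:3001';

async function getJson<T>(path: string): Promise<T> {
  const response = await fetch(`${base}${path}`);
  return (await response.json()) as T;
}

async function replay(dataset: string): Promise<void> {
  // Step 2: load the dataset (returns an empty body, so don't parse it).
  await fetch(`${base}/loadDataset?dataset=${encodeURIComponent(dataset)}`);

  // Step 3: poll the quad count until it stops growing (a simple,
  // assumed heuristic for "loading finished").
  let previous = -1;
  let current = 0;
  while (current !== previous) {
    previous = current;
    await new Promise((resolve) => setTimeout(resolve, 1000));
    [current] = await getJson<number[]>('/checkLoadingSize');
  }

  // Step 5: sort the loaded observations on the configured TreePath.
  await getJson<string[][]>('/sortObservations');

  // Steps 7 and 8: replay one observation, then all remaining ones.
  await fetch(`${base}/advanceAndPushObservationPointer`);
  await fetch(`${base}/advanceAndPushObservationPointerToTheEnd`);
}

replay('dataset_participant1_100obs').catch(console.error);
```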

### Postconditions

All observations are in the pod.

### Follow-up actions

None.

## Future work

  • Real-time replay including throttling. Challenge #83 is relevant, together with the roadmap in the solution's repository.
  • More elaborate filtering of the datasets used, such as selecting specific metrics to replay rather than the entire dataset.
  • Improve the LDES in LDP approach, if possible.