The corresponding challenge is #82, which contributes to scenario #16.
Data streams are becoming omnipresent and are a crucial component in many use cases. Storing streams in a low-cost, file-based Web environment could be done using Linked Data Event Streams (LDES). However, pushing large volumes of highly volatile data into a Solid-based LDES is currently not possible because the current solution partitions the data only after all data has been retrieved, instead of in a streaming fashion. This crashes the Community Solid Server due to the high load on the server when repartitioning large amounts of data. We use data from the DAHCC dataset for this challenge, which contains data streams describing the behaviour of various patients. The streams contain over 100,000 events.
We want a streaming Solid-based LDES connector that replays data and partitions it in a streaming fashion while the data is being retrieved, instead of waiting until the whole dataset has been received. Besides avoiding the high load on the server, replaying allows mimicking the real-time behaviour of data streams, even though the data is historical. This makes it possible to showcase how solutions can process live data streams and how they handle different data rates.
We developed a tool that replays captured streams. The tool has the following features:
- The tool is pull-based: it pulls the observations from N-Triples files.
- The tool adds new events to the right bucket.
- If new buckets are required, the tool updates the LDES.
- The user can parameterize the bucket size (see the bucketing sketch after this list).
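To make the bucketing idea concrete, here is a minimal TypeScript sketch of one possible strategy: fixed time windows derived from the event timestamp and a parameterizable bucket size. The function `bucketFor` and the windowing scheme are illustrative assumptions, not the tool's actual code (the tool may, for instance, name buckets after member timestamps instead).

```typescript
// Hypothetical bucketing helper: not the tool's actual code, just the idea.
// Buckets are identified by the start of the time window they cover, similar in
// shape to the millisecond-timestamp container names seen in the pod.

/** Returns the bucket (container) a timestamped event belongs to. */
function bucketFor(eventTimestampMs: number, bucketSizeMs: number): string {
  const bucketStart = Math.floor(eventTimestampMs / bucketSizeMs) * bucketSizeMs;
  return `${bucketStart}/`;
}

// Example: with 1-hour buckets, an event at 1641197095000 lands in the
// bucket that starts at 1641196800000 (the enclosing hour).
console.log(bucketFor(1641197095000, 60 * 60 * 1000)); // "1641196800000/"
```

Whenever an event maps to a bucket that does not exist yet, the tool creates the container and updates the LDES metadata accordingly, as described in the feature list above.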
The repository contains both a Web app and an engine, but only the engine is relevant for challenge #82 and this report.
The solution builds on this code. The code is not included as a dependency, as it was not designed as a library; we copied and pasted the parts of the code that were useful for our solution.
We made the following important technological decisions and assumptions:
- JavaScript/TypeScript as the main development language, to be in line with current practices.
- LDES to represent a stream of events detected by IoT-related devices.
- Implements pagination to support loading large datasets.
- Data sources are N-Triples files, as they represent the state of the devices and observations as originally persisted using Apache Feather.
- Single-user demonstration implementation, as the main goal is validating the approach, not developing an industry-ready implementation.
- N3.js for streaming and high-throughput handling of RDF data.
- Implements a recursive version of the merge sort algorithm to sort the observations based on their timestamp (configurable); see the sketch after this list.
- Implements a pointer-based algorithm to keep track of the last observation/event that has been replayed. The pointer can be advanced according to the ordering, either by the end-user or automatically.
- Implements optimisations to manage the size of the pods.
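The following TypeScript sketch illustrates how the N3.js and merge sort decisions above could fit together: stream-parsing an N-Triples file into (observation, timestamp) pairs and sorting them with a recursive merge sort. The `Observation` shape, `loadObservations`, and the `timestampPredicate` parameter (standing in for the configured TreePath) are hypothetical; only the N3.js `StreamParser` usage reflects the library's documented API.

```typescript
import { createReadStream } from 'fs';
import { StreamParser, type Quad } from 'n3';

// Hypothetical shape for a replayable event: the observation IRI and its timestamp.
interface Observation {
  subject: string;
  timestamp: number;
}

// Stream-parse an N-Triples file with N3.js and collect (subject, timestamp) pairs,
// without ever holding the raw file in memory.
function loadObservations(path: string, timestampPredicate: string): Promise<Observation[]> {
  return new Promise((resolve, reject) => {
    const observations: Observation[] = [];
    createReadStream(path)
      .pipe(new StreamParser())
      .on('data', (quad: Quad) => {
        if (quad.predicate.value === timestampPredicate) {
          observations.push({
            subject: quad.subject.value,
            timestamp: Date.parse(quad.object.value), // assumes an ISO dateTime literal
          });
        }
      })
      .on('error', reject)
      .on('end', () => resolve(observations));
  });
}

// Recursive merge sort on the timestamp, mirroring the decision above.
function mergeSort(items: Observation[]): Observation[] {
  if (items.length <= 1) return items;
  const middle = Math.floor(items.length / 2);
  return merge(mergeSort(items.slice(0, middle)), mergeSort(items.slice(middle)));
}

function merge(left: Observation[], right: Observation[]): Observation[] {
  const result: Observation[] = [];
  let i = 0, j = 0;
  while (i < left.length && j < right.length) {
    // `<=` keeps the sort stable for observations with equal timestamps.
    result.push(left[i].timestamp <= right[j].timestamp ? left[i++] : right[j++]);
  }
  return result.concat(left.slice(i), right.slice(j));
}
```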
Users of the tool configure the engine via the following steps:
- Clone the repository via `git clone https://github.com/SolidLabResearch/LDES-in-SOLID-Semantic-Observations-Replay`.
- Navigate to `LDES-in-SOLID-Semantic-Observations-Replay` via `cd LDES-in-SOLID-Semantic-Observations-Replay`.
- Start an instance of the Community Solid Server via `docker run --rm -p 3000:3000 -it solidproject/community-server:latest -c config/default.json`.
- Open a new terminal at the same location.
- Navigate to `engine` via `cd engine`.
- Install the dependencies via `npm i`.
- Download the example DAHCC dataset via `curl -L https://cloud.ilabt.imec.be/index.php/s/8BatNcg2iEyJktR/download -o data/dataset_participant1_100obs`.
- Set the value of `datasetFolders` to the full path of the folder `engine/data` in the file `src/config/replay_properties.json`.
- Start the engine via `npm start`. If you get an error, see the README of the repository.
- Get all loadable datasets using a GET request via `curl http://localhost:3001/datasets`. You get something like `["dataset_participant1_100obs","dataset_participant2_100obs"]`.
- Load a particular dataset using a GET request via `curl http://localhost:3001/loadDataset?dataset=dataset_participant1_100obs`. You get an empty result.
- Check the loading progress (in quad count) using a GET request via `curl http://localhost:3001/checkLoadingSize`. You get something like `[500]`.
- Get the actual observation count (quads / observation) using a GET request via `curl http://localhost:3001/checkObservationCount`. You get something like `[100]`.
- Sort the loaded observations (according to the configured TreePath) using a GET request via `curl http://localhost:3001/sortObservations`. You get something like `[["https://dahcc.idlab.ugent.be/Protego/_participant1/obs0","https://dahcc.idlab.ugent.be/Protego/_participant1/obs1","https://dahcc.idlab.ugent.be/Protego/_participant1/obs2" ... ]]`.
- Get a sample (one configured chunk) of observations using a GET request via `curl http://localhost:3001/getObservations`. You get something like `[{"termType":"NamedNode","value":"https://dahcc.idlab.ugent.be/Protego/_participant1/obs0"},{"termType":"NamedNode","value":"https://dahcc.idlab.ugent.be/Protego/_participant1/obs1"} ...]`.
- Replay the next observation using a GET request via `curl http://localhost:3001/advanceAndPushObservationPointer`. You get something like `[1]`, which represents the pointer to the next replayable observation. When you check the LDES in the Solid pod (default: http://localhost:3000/test/), you should see at least two containers (the inbox and the LDES buckets); the LDES buckets should now contain the replayed observation, for example http://localhost:3000/test/1641197095000/aa28a2fa-010f-4b81-8f3c-a57f45e13758.
- Replay all remaining observations using a GET request via `curl http://localhost:3001/advanceAndPushObservationPointerToTheEnd`. Afterwards, all observations are in the pod.
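For convenience, the HTTP steps above can also be scripted. The sketch below chains the documented endpoints in TypeScript, assuming the engine runs on port 3001 (as configured above) and Node.js 18+ with the built-in `fetch`; the polling threshold of 100 observations is tied to the example dataset and would differ for other datasets.

```typescript
// Minimal driver that walks through the same endpoints as the steps above.
const ENGINE = 'http://localhost:3001';

async function replayDataset(dataset: string): Promise<void> {
  await fetch(`${ENGINE}/loadDataset?dataset=${dataset}`);

  // Poll until all observations are loaded; the threshold (100) matches the
  // example dataset dataset_participant1_100obs.
  let count = 0;
  while (count < 100) {
    const response = await fetch(`${ENGINE}/checkObservationCount`);
    [count] = await response.json(); // endpoint returns e.g. [100]
    await new Promise(resolve => setTimeout(resolve, 1000));
  }

  // Sort the observations on the configured TreePath, then replay them all.
  await fetch(`${ENGINE}/sortObservations`);
  await fetch(`${ENGINE}/advanceAndPushObservationPointerToTheEnd`);
}

replayDataset('dataset_participant1_100obs')
  .then(() => console.log('All observations replayed to the pod.'))
  .catch(console.error);
```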
None.
- Real-time replay including throttling. Challenge #83 is relevant, together with the roadmap in the solution's repository.
- Elaborate filtering on the datasets used, such as selecting specific metrics that need to be replayed rather than the entire dataset.
- Improve the LDES in LDP approach, if possible.