Seeing data already ingested into a timeline when the related search index is being updated #3219

Open
jbaptperez opened this issue Oct 30, 2024 · 4 comments · May be fixed by #3241

Comments

@jbaptperez

Is your feature request related to a problem? Please describe.
I cannot see data already ingested into a timeline while the related search index is being updated with new data (a new Plaso file is being ingested).
An automated process sends a set of Plaso files successively to the same timeline/search index, and I can only see the final result, once everything has been ingested.
I need access to the data ingested so far, as soon as possible, even if it is incomplete.

Describe the solution you'd like
The frontend and the backend should not exclude search indices whose status is "running".

Describe alternatives you've considered
Waiting for the complete set of Plaso files to be ingested.

Additional context
The whole timeline status depends on status of its data sources (see the timesketch.lib.tasks._set_datasource_status method).

Both the frontend and the backend filter out timelines whose status is neither ready nor fail.
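
To make the filtering described above concrete, here is a minimal sketch. It is not Timesketch's code: the `DataSource` and `Timeline` classes, the aggregation rule in `status`, and the `include_running` switch are all assumptions made for illustration, and the status strings follow the wording of this issue rather than the values Timesketch stores.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    status: str  # "ready", "running" or "fail" (wording from this issue)


@dataclass
class Timeline:
    name: str
    data_sources: list = field(default_factory=list)

    @property
    def status(self) -> str:
        # Assumed aggregation rule: one running data source makes the
        # whole timeline "running"; only failures means "fail".
        statuses = {ds.status for ds in self.data_sources}
        if "running" in statuses:
            return "running"
        if statuses == {"fail"}:
            return "fail"
        return "ready"


def searchable_timelines(timelines, include_running=False):
    """Return the timelines a search may target."""
    allowed = {"ready", "fail"}
    if include_running:
        # The change requested in this issue, shown as an opt-in.
        allowed.add("running")
    return [t for t in timelines if t.status in allowed]


timelines = [
    Timeline("done", [DataSource("ready")]),
    Timeline("partial", [DataSource("ready"), DataSource("running")]),
]
print([t.name for t in searchable_timelines(timelines)])
print([t.name for t in searchable_timelines(timelines, include_running=True)])
```

With the default, the partially ingested timeline is hidden even though one of its data sources is already ready, which is the behaviour this issue asks to relax.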

Moreover, assuming the modification above is done, another frontend change would be required for consistency:
allowing a timeline to be included in or excluded from a query while it is being updated, that is, changing the timeline widget to behave as it does when the timeline is ready.
I imagine identical behavior (full access to the menu), with the spinning wheel shown only while the timeline is being updated.
The regular polling mechanism (which currently runs only while the status is running) would be extended to all timeline states, for real-time adjustment. Otherwise, a manual refresh is necessary, as it is now.

@jkppr
Collaborator

jkppr commented Nov 1, 2024

Hi @jbaptperez,

I want to make sure I fully understand your workflow and needs.

To clarify, it seems like you're:

  1. Uploading multiple Plaso files to the same timeline (same name, multiple data sources).
  2. These files are being processed sequentially, and you'd like to be able to query the data from the already processed files even while other files are still being indexed.

Is that correct?

  • Could you share a bit more about how you're uploading the Plaso files to Timesketch? Are you using the importer client, the API, or another method?
  • How often are you adding new Plaso files to the timeline? How large are these files typically?
  • Have you explored any workarounds or alternative solutions? For example, could you potentially create separate timelines for each Plaso file? Would it be an option to merge them after the indexing is finished?
  • When you say you want to "manipulate include/exclude a timeline in a query," can you elaborate on what you envision? How would you like to interact with the timelines that are still being indexed?
    • This would probably be a separate feature request.

Would you be willing to contribute to such a feature?

@jbaptperez
Author

Hi @jkppr,

I am away from work for the rest of the week (and from my project's source code, which uses Timesketch), but I can already answer most of your questions.

> To clarify, it seems like you're:
>
> 1. Uploading multiple Plaso files to the same timeline (same name, multiple data sources).
> 2. These files are being processed sequentially, and you'd like to be able to query the data from the already processed files even while other files are still being indexed.
>
> Is that correct?

Yes, that's exactly what I want.
Actually, I work for a Product Owner who needs to get a timeline's content as soon as it arrives, even with partial data.
Time is a key parameter in my project.

> Could you share a bit more about how you're uploading the Plaso files to Timesketch? Are you using the importer client, the API, or another method?

Plaso files are sent using the importer client (in one chunk, but using chunks would not change anything here).
The amount of data can be large (up to a couple of gigabytes).
As time is important, we already split the data into chunks (100 MiB directories) that are processed into multiple "small" Plaso files (instead of a single big one).
The process is a little more complex, but that's enough for my explanation.

Upstream, we parallelise this and we send those small Plaso files to Timesketch.

> How often are you adding new Plaso files to the timeline? How large are these files typically?

In the main case, a single timeline ingests a set of 100 MiB Plaso files once, and that's it.
Note that the Plaso file itself is not actually 100 MiB; it represents 100 MiB of data.

Another situation can lead to other sets of Plaso files that complement the first ones (in different time ranges).

A concrete example could be posting 20 100-MiB Plaso files using 5 POST requests in parallel (the parallelism can be adjusted) with the importer client.
An extreme case could be sending 80 Plaso files using 20 POST requests at the same time.
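
The upload pattern described above (20 files, 5 at a time) can be sketched as follows. This only illustrates the bounded parallelism: `upload_plaso` is a hypothetical placeholder that does not call Timesketch, and the file names are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def upload_plaso(path: Path) -> str:
    """Hypothetical stand-in for one importer-client upload."""
    # A real implementation would hand the file to the Timesketch
    # importer client here; this stub only reports the file name.
    return f"uploaded {path.name}"


def upload_all(paths, max_parallel=5):
    """Upload every file, with at most max_parallel uploads in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves the input order of the results.
        return list(pool.map(upload_plaso, paths))


files = [Path(f"chunk_{i:02d}.plaso") for i in range(20)]
results = upload_all(files, max_parallel=5)
print(len(results), results[0])
```

Raising `max_parallel` to 20 and the file count to 80 gives the extreme case mentioned above.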

> Have you explored any workarounds or alternative solutions? For example, could you potentially create separate timelines for each Plaso file? Would it be an option to merge them after the indexing is finished?

I'm trying to find a proper workaround but I haven't found one yet.
I don't know whether it is possible to build multiple timelines and merge them (there is a single OpenSearch index per timeline, but maybe this feature already exists?).

Your idea is interesting, in particular if a single timeline can be "completed" with data coming from others (the user only has to refresh the same page), and assuming the other ones are deleted once merged.

Actually, I really need your advice to find a correct solution that respects the philosophy of Timesketch.

The two solutions I have thought of so far are the one described in the issue and a very ugly quick-and-dirty one: constantly forcing the status of a timeline to ready in the database with a scheduled task (an UPDATE SQL statement every n seconds).
I really don't want to apply that kind of "fix", but my Product Owner needs the feature.

> When you say you want to "manipulate include/exclude a timeline in a query," can you elaborate on what you envision? How would you like to interact with the timelines that are still being indexed?
> This would probably be a separate feature request.

First of all, I am not an OpenSearch expert but, since it is a database, I would bet we can query one of its indices while it is being updated (getting partial data).
Tell me if I am mistaken.
Assuming that, the idea is to lift the restriction whereby Timesketch excludes a timeline (actually its search index) from a request when this timeline has a status other than ready or fail.

I studied the source code and saw that both the frontend and the backend exclude a running search index (i.e. timeline) from an OpenSearch request.
That means at least one change is needed to remove the filter (but it may have drawbacks that I am not aware of, which worries me).

Moreover, the frontend part seems harder because a timeline widget behaves differently depending on its status.
In the running state (spinning wheel), it is excluded from requests in the code, its menu is inaccessible, and its status is polled constantly.
I don't remember whether the user can enable or disable a timeline in that state by clicking on it.
Otherwise (ready or fail), there is no polling, the menu is available (data sources, etc.) and the timeline can be manually included in or excluded from an OpenSearch request.

So, the idea of "unlocking" timelines would mean rethinking the way a timeline appears and can be manipulated in Timesketch.
To be consistent, I thought about the following changes to reach this behaviour:

  • Whatever the state of a timeline, a user can enable or disable it (click), open the menu and access all functionality,
  • The spinning wheel appears beside it only in the running state,
  • The polling of timeline states becomes permanent:
    • It allows constant refreshing of a timeline widget (seeing live when new data is being imported),
    • Only the widget is refreshed, not the OpenSearch results,
    • OpenSearch result refreshes would be manual (simpler), or a toggle could enable live refreshing of the OpenSearch results.
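
The difference between the behaviour described above and the proposed one can be summarised in a small state table. The sketch below is in Python for illustration only (the real widget is frontend code); `WidgetState` and both functions are hypothetical names, and the status strings follow the wording of this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WidgetState:
    menu_enabled: bool   # data sources menu, etc.
    can_be_queried: bool # timeline may be included in a search request
    spinner: bool        # spinning wheel shown beside the widget
    polling: bool        # widget status is refreshed periodically


def current_widget_state(status: str) -> WidgetState:
    """Behaviour as described in this comment."""
    running = status == "running"
    return WidgetState(
        menu_enabled=not running,
        can_be_queried=not running,
        spinner=running,
        polling=running,
    )


def proposed_widget_state(status: str) -> WidgetState:
    """Proposed behaviour: full access in every state, permanent polling."""
    return WidgetState(
        menu_enabled=True,
        can_be_queried=True,
        spinner=status == "running",
        polling=True,
    )


for status in ("ready", "running", "fail"):
    print(status, current_widget_state(status), proposed_widget_state(status))
```

Only the spinner still depends on the status in the proposed version; everything else becomes uniform.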

This is actually a big change, and I'm afraid I'll face technical or philosophical limitations; that's why I need clarification for such a feature.

> Would you be willing to contribute to such a feature?

Yes, definitely.
I need a proper "path" to do it, and your approval, so that a future PR would be accepted.

I'll work on that feature starting next week and for the whole month (November).

To sum up, my target is "just" to be able to see timeline data while it is being imported.
If I send 100 Plaso files in sequence (or in parallel), I would like to see data as it arrives, just by refreshing the OpenSearch request.
Moreover, if data arrived at the granularity of a single Plaso file (and no finer), I think it would be acceptable (remember that we split data upstream, so we could split it into smaller chunks).

Thank you for your help.

@jkppr
Collaborator

jkppr commented Nov 6, 2024

Thanks for the additional context, @jbaptperez. Also summoning @berggren here for his expertise.

I've considered your feature request and how it can be implemented within Timesketch's design philosophy. These suggestions assume you're using the latest Timesketch release with the default (frontend-ng) UI.

Clarifications

  • Timelines and Search Indices: Timesketch creates one OpenSearch index per sketch and data source type (Plaso, CSV/JSONL). Multiple Plaso files uploaded within a sketch will be indexed into the same OpenSearch index. The timeline logic (grouping events) is handled by adding a __ts_timeline_id field to each event.
  • Querying Processing Timelines: Allowing queries against processing timelines was the previous behavior in Timesketch. However, it led to user confusion due to incomplete results. If reintroduced, it needs to be configurable and clearly indicate to the user that data is still being ingested.
  • Forcing "Ready" Status: Continuously updating the timeline status to "ready" is not a recommended solution. It could interfere with features like analyzers, which rely on the timeline status to determine when to run. Running analyzers on incomplete data could lead to incorrect or misleading results.
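
As an illustration of the first clarification, the following sketch builds an OpenSearch query body that restricts a search to specific timelines through the `__ts_timeline_id` field. The overall shape (a `bool` query with a `terms` filter) is standard OpenSearch query DSL, but the `build_query` helper is hypothetical and Timesketch's actual query construction may differ.

```python
def build_query(query_string, timeline_ids):
    """Build a search body limited to events of the given timelines."""
    return {
        "query": {
            "bool": {
                # What the user typed in the search box.
                "must": [{"query_string": {"query": query_string}}],
                # Events carry the id of the timeline they belong to,
                # so several timelines can share one index.
                "filter": [
                    {"terms": {"__ts_timeline_id": list(timeline_ids)}}
                ],
            }
        }
    }


body = build_query("*", [12, 15])
print(body)
```

Because the filter is per event rather than per index, including or excluding a timeline that is still being processed is a matter of which ids end up in this list.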

Any solution would require:

  • Configurability: The feature should be enabled via a backend configuration flag (timesketch.conf) or a user setting.
  • Transparency: The UI should clearly communicate to the user that search results against processing timelines may be incomplete and subject to change (e.g. via a banner above the timeline).
  • Analyzer Compatibility: The solution should not prevent analyzers from correctly identifying when a timeline is fully processed and ready for analysis.
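
The three requirements above could fit together roughly as in this sketch. It is an assumption-laden illustration: the flag name `SEARCH_PROCESSING_TIMELINES`, the `select_timelines` and `analyzers_may_run` helpers, and the status strings (taken from the wording of this issue) are all hypothetical and not existing Timesketch settings or functions.

```python
def select_timelines(statuses, config):
    """Pick searchable timelines and build a warning if needed.

    statuses maps a timeline id to its status string.
    """
    # Configurability: off unless the (hypothetical) flag is set.
    allow_processing = bool(config.get("SEARCH_PROCESSING_TIMELINES", False))
    selected, incomplete = [], []
    for timeline_id, status in sorted(statuses.items()):
        if status in ("ready", "fail"):
            selected.append(timeline_id)
        elif status == "running" and allow_processing:
            selected.append(timeline_id)
            incomplete.append(timeline_id)
    # Transparency: text the UI could show in a banner.
    warning = None
    if incomplete:
        warning = (
            f"Results may be incomplete: timelines {incomplete} "
            "are still being indexed."
        )
    return {"timeline_ids": selected, "warning": warning}


def analyzers_may_run(status):
    # Analyzer compatibility: the real status is left untouched, so
    # analyzers still wait for a fully processed timeline.
    return status == "ready"


statuses = {1: "ready", 2: "running"}
print(select_timelines(statuses, {}))
print(select_timelines(statuses, {"SEARCH_PROCESSING_TIMELINES": True}))
```

The point of the design is that search visibility is decided separately from the stored status, unlike the "force ready" workaround.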

Allow Querying Processing Timelines/Indices:

  • Implementation: Modify frontend and backend to include OpenSearch indices of processing timelines in search queries. This involves changes in:
    • Backend: timesketch.lib.datastores.opensearch.OpenSearchDataStore.search, timesketch.api.v1.resources.explore.ExploreResource.post, timesketch.models.sketch.Sketch.active_timelines, and potentially other locations.
    • Frontend: TimelinePicker.vue, TimelineChip.vue (remove status filtering, add visual cues for status), Explore.vue (adapt search method), EventList.vue (handle incomplete data, display a dismissable warning banner, and potentially disable aggregation features).
  • Pros:
    • Directly addresses your need to access data as it's indexed, even if incomplete.
    • Probably needs no changes to the data models.
  • Cons:
    • Data Consistency: Requires careful handling to ensure users understand results might be incomplete.
    • Performance: Querying actively indexing indices could impact performance and requires testing/optimization.
    • Complexity: Involves non-trivial changes in both frontend and backend.

Alternative to be considered:
Timesketch uses "Data Sources" to represent each imported file and "Timelines" to group one or more Data Sources. Since processing occurs at the Data Source level, the search restriction could be applied there instead of at the Timeline level. This would allow querying a timeline as soon as at least one Data Source is "ready," even if others are still processing.

However, this requires a significant backend change. Currently, events in OpenSearch are linked to timelines via the __ts_timeline_id field. To enable Data Source-level filtering, a new field like __ts_datasource_id needs to be added to each event during indexing. This change would impact how Timesketch queries and filters events, and also affects analyzer execution (which relies on knowing when all data sources in a timeline are ready).
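
A rough sketch of what the data-source-level alternative implies, under stated assumptions: `__ts_datasource_id` is the field name suggested above and does not exist today, and `tag_event` and `ready_datasource_filter` are hypothetical helpers, not Timesketch functions.

```python
def tag_event(event, timeline_id, datasource_id):
    """Return a copy of the event carrying both ids (indexing time)."""
    return {
        **event,
        "__ts_timeline_id": timeline_id,
        "__ts_datasource_id": datasource_id,  # proposed new field
    }


def ready_datasource_filter(datasource_statuses):
    """Build a terms filter matching only events of ready data sources.

    datasource_statuses maps a data source id to its status string.
    """
    ready = sorted(
        ds_id
        for ds_id, status in datasource_statuses.items()
        if status == "ready"
    )
    return {"terms": {"__ts_datasource_id": ready}}


event = tag_event({"message": "login"}, timeline_id=7, datasource_id=42)
query_filter = ready_datasource_filter({42: "ready", 43: "running"})
print(event)
print(query_filter)
```

Events indexed before such a field existed would not carry it, which is part of why this route requires a significant backend change.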

What do you think? Would this be a solution you are willing to tackle?

@jkppr
Collaborator

jkppr commented Nov 7, 2024

After some internal discussion with @berggren, we recommend proceeding with the proposed "Allow Querying Processing Timelines/Indices" solution and ignoring the alternative for now, since it would require too many changes to our existing backend logic.
