Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete old pipeline logs after X days or Y new runs #1068

Open
3 tasks done
qwerty287 opened this issue Aug 3, 2022 · 19 comments
Open
3 tasks done

Delete old pipeline logs after X days or Y new runs #1068

qwerty287 opened this issue Aug 3, 2022 · 19 comments
Labels
feature add new functionality

Comments

@qwerty287
Copy link
Contributor

Clear and concise description of the problem

From https://codeberg.org/Codeberg-CI/feedback/issues/63:

It would be nice to have an option to delete old logs of pipeline entries. Either automatic (delete logs of succesful runs after X days or Y new runs?) or with some sort of checkbox selection and action?

Suggested solution

An option that automatically deletes old pipeline logs

Alternative

No response

Additional context

No response

Validations

@qwerty287 qwerty287 added the feature add new functionality label Aug 3, 2022
@maxkratz
Copy link

Indeed, an option to manually delete pipeline runs (for example, after debugging the CI config on a testing branch) would be nice to have.

@anbraten
Copy link
Member

anbraten commented Mar 27, 2024

I would like to add support for automatic deletion of old pipelines.
Therefore I would suggest the "keep max last x pipelines" approach as it seems to be covering the issue (I want to keep the database size to a minimum by deleting unused content) and seems to be the easiest solution to implement keeping things simple. This would make special api endpoints as requested in #3506 obsolete for now.

My plan would be the following:

  • add UI option to set max amount of pipelines to keep per repo (we could later on add a server max amount as well)
  • everytime a new pipeline is created we simply check the amount and delete the oldest ones to get back to amount x specified

@xoxys
Copy link
Member

xoxys commented Mar 28, 2024

I don't get it. Why not keep the API first approach? Why would you hardcode this feature instead of a reusable API?

@anbraten
Copy link
Member

anbraten commented Mar 28, 2024

I don't get it. Why not keep the API first approach? Why would you hardcode this feature instead of a reusable API?

How should any other users benefit from this feature, will you open-source it as a service? If the plan is to introduce it as an internal feature later on, wouldn't those new api endpoints be unnecessary? I guess we don't plan to call the api endpoints from the code?

@xoxys
Copy link
Member

xoxys commented Mar 28, 2024

Still don't understand your point...

How should any other users benefit from this feature, will you open-source it as a service?

Of course not. Why should I create an external service? I just said, let's keep the PRs small, implement the API first and let's create a second PR that implements the UI components based on the API. I would have also added a CLI command to e.g. run woodpecker pipeline prune --older-than 30d based on the API. This cli command can be executed manually whenever needed or as a scheduled job.

Nobody has said that we don't want proper UI integration. As I wrote in the PR, the API can be called from the UI (that's how the UI works in general, right?). The API can also be called from the CLI (the cli client uses the go sdk, which only makes simple API calls).

If this is not true, why do we provide APIs at all? Creating an internal-only function prevents manual cleanups whenever a user wants to do so with any scope they want to clean up. If you don't plan to create a reusable API first, what should the UI implementation even look like? Nevertheless, I thought Woodpecker is following an api-first approach. This means that users are able to write their own CLI clients or even their own web front-ends if they want to. That's why I don't understand why you don't want a public API (especially if such exists already that exposes a lot of Pipeline/Repo related operations).

@pat-s
Copy link
Contributor

pat-s commented Apr 9, 2024

Reading this thread, @xoxys reasoning makes a lot of sense to me (having API endpoints which can be called from the UI and CLI) and would also cover the intentions of @anbraten AFAICS.

@anbraten Can you eloborate what the downsides would be in your view to the approach of @xoxys?

@anbraten
Copy link
Member

anbraten commented Apr 9, 2024

I thought about adding an api in the beginning as well. However to have an automatic deletion / retention of pipelines this has to be integrated into some kind of routine. At first there was the idea to have it executed by a cron scheduler, but then someone in chat suggested to simply remove old pipelines each time a new pipeline is created which seems pretty smart in terms of not checking inactive repos on large instances. This would make an api for this kinda obsolete. The main question remaining from what I read is if a keep x pipelines setting or a keep pipelines not older than x days setting is more useful for the majority of users.

Keep x pipelines seems to be nice as projects often have a different amounts of activity over time. Based on our UI most users are probably looking at sth like the last 10 pipelines in the UI. So keeping the 100 latest pipeline could do the job. If there is no activity for a year or so on a project you would still have some pipelines once you come back to the project.

Keeping pipelines for x days could make more sense for legal requirements, but I haven't heard of someone that this is a thing for CI pipelines yet.

@xoxys
Copy link
Member

xoxys commented Apr 9, 2024

You still just ignore my main points.

If you don't plan to create a reusable API first, what should the UI implementation even look like?

Is not answered.

This would make an api for this kinda obsolete.

No because users should still have the ability to delete pipelines whenever they want and the amount they want. I can't understand why you try to avoid an API at all costs. However, I have outlined and explained my points pretty clear multiple times now.

@xoxys
Copy link
Member

xoxys commented Apr 9, 2024

Keep x pipelines seems to be nice as projects often have a different amounts of activity over time. Based on our UI most users are probably looking at sth like the last 10 pipelines in the UI. So keeping the 100 latest pipeline could do the job. If there is no activity for a year or so on a project you would still have some pipelines once you come back to the project.

I have chosen the time bases approach for the API because it can handle "Keep x pipelines" as well. If we prefer this for e.g. the Woodpecker UI, just do:

  • get last 100 pipelines via "get pipelines api" /repos/{repo_id}/pipelines [get] (exists already)
  • read "create date" from last pipeline in the returned list
  • call "delete pipelines" api /repos/{repo_id}/pipelines [delete] with /?before=... (was proposed in the linked PR)

If some people want to use a time-based approach, they can use the same API while doing it the other way around (implementing an API to "Keep x pipelines") prevents any time-based (e.g. older than 30d) deletion.

In the real world scenario, IMO time-based is preferred in general because "Keep x pipelines" has a drawback. Let's say someone is spamming your repo with PR's (by accident or as an "attack") while "Keep 100 pipelines" is configured. You will lose your build history in seconds (yes, one could enabled required approvals).

@anbraten
Copy link
Member

You still just ignore my main points.

If you don't plan to create a reusable API first, what should the UI implementation even look like?

To achieve the goal "cleanup the pipeline list / database to not fill up the disk" most user probably don't wont to create an external service / cron-job. Therefore a setting like the following which enables an automatic cleanup would be sufficient for the majority:

image

If a user still has specific needs the /repos/{repo_id}/pipelines endpoint could be used to receive a list of pipelines and we could consider adding a generic endpoint for deleting just a single pipeline as suggested before. For even more specific needs there is always the option to manually interact with the database directly.

@xoxys
Copy link
Member

xoxys commented Apr 10, 2024

To achieve the goal "cleanup the pipeline list / database to not fill up the disk" most user probably don't wont to create an external service / cron-job.

You are making assumptions about what users want instead of letting them decide for themselves. As I am also a user, I can tell you that I do not want such a setting, and it would not help me at all.

Therefore a setting like the following which enables an automatic cleanup would be sufficient for the majority:

I tried to find a middle ground and clearly showed how this can also be achieved with the proposed API. I was also willing to do exactly what you want for the web UI (just using APIs instead)... That way everyone would get something out of it: you get the setting in the UI, and we get an API. However, I'm out at this point 👋.

@zc-devs
Copy link
Contributor

zc-devs commented Apr 13, 2024

What a hot discussion! It's funny and sad at the same time...

TLDR: go on and implement it in your way. Who wants API/whatever will implement it on top of your service.


why you try to avoid an API at all costs

I doubt it is the main reason. The main is avoiding scheduling service, I think. It is and it was. I don't understand why though...
If one has concerns in regards HA, then

  1. It doesn't work now anyway;
  2. We can develop separate micro service as Harbor did Job service, for example.

The second one is Why one should implement functionality, which is orthogonal to what he wants/needs (requirements)?
Why orthogonal? Because current feature requires just one integer - the setting of retention policy - not low-level API.

would like to add support for automatic deletion of old pipelines

Note automatic.

I have chosen the time bases approach ... just do ...

Who? Me? Manually? ^ Automatic.

users are able to write their own CLI clients or even their own web front-ends

What a beautiful definition of a user!

To achieve the goal "cleanup the pipeline list / database to not fill up the disk" most user probably don't wont to create an external service / cron-job

Exactly! I'm in that bucket of users. We do not consume cool Application Programming Interfaces, nor write frontends. We use GUI (CLI at worst).

I can tell you that I do not want such a setting, and it would not help me at all

Don't want - don't use. It would not help me either, but it might help somebody.


Anbraten are going to write internal service (file/class) to clean up logs. That service will be triggered by pipeline creation (I prefer cron, as you know:). There will be retention policy setting in GUI.
Who wants to expose that service via REST API, GRPC, CLI / whatever are welcomed, I guess.

If a user still has specific needs the /repos/{repo_id}/pipelines endpoint could be used to receive a list of pipelines and we could consider adding a generic endpoint for deleting just a single pipeline as suggested before. For even more specific needs there is always the option to manually interact with the database directly.

But it's a different task.

@xoxys
Copy link
Member

xoxys commented Apr 13, 2024

You missed the entire point...

  • I have created a PR to expose an API to delete pipelines and it wad refused
  • I have not refused to implement this feature in the UI I have not even refused the exact way Anbraten had suggested, Im fine with it and as I said multiple times in this thread of course we need an easy to use UI implementation for automated cleanups
  • I just wanted to create the API first because thats how the Woodpecker UI and the cli works already by using the API of the woodpecker server
  • The automstic cleanup feature in the UI can just re-use the proposed API and we avoid code duplication
  • This way power users can use the API in the way they want and everybody else can use the easy to use UI setting for automated pipeline cleanups. Everybody wins, but that was refused also

@zc-devs
Copy link
Contributor

zc-devs commented Apr 13, 2024

  1. That is sad, but not related to implementation in discussion.
  2. --
  3. Your API is completely different to what current implementation requires.
  4. I doubt. Are you suggesting to run this algorithm in frontend? When, considering automatic aspect?
  5. It is completely different features.

@xoxys
Copy link
Member

xoxys commented Apr 13, 2024

Having an API to delete pipelines is a different feature than having a UI setting to do it? Why do we have an API to delete a Repo if we have a button in the UI to do it? And what is the red button "Delete Repo" doing? It calls the DeleteRepo API.

And thats exactly what Im talking about.

@xoxys
Copy link
Member

xoxys commented Apr 13, 2024

I doubt. Are you suggesting to run this algorithm in frontend? When, considering automatic aspect?

I agree the UI might be the wrong place to automate it.

@zc-devs
Copy link
Contributor

zc-devs commented Apr 13, 2024

Having an API to delete pipelines is a different feature than having a UI setting to do it?

Having an API to delete pipelines is a different feature than having an UI to set a retention policy / setting up a trigger.

Also, I've just read through #3506:

Note: These APIs are about deleting the entire pipeline, not only it logs.

And this issue title:

Delete old pipeline logs


deleting the entire pipeline

Don't want to search, but there was concern about audit.

Why do we have an API to delete a Repo if we have a button in the UI to do it? And what is the red button "Delete Repo" doing? It calls the DeleteRepo API. And thats exactly what Im talking about.

I understood it. If Anbranten wanted just button to delete pipeline/s (or logs?;), then your points are absolutely valid.

@xoxys
Copy link
Member

xoxys commented Apr 13, 2024

Having an API to delete pipelines is a different feature than having an UI to set a retention policy / setting up a trigger.

You are right.

Don't want to search, but there was concern about audit.

If that would have been the mail problem I could understand it as well but nobody said something like this not in the PR not in this discussion.

Anyway I agree it might be better to separate these topics.

@zc-devs
Copy link
Contributor

zc-devs commented Apr 15, 2024

I've been thinking for a while...

Hooking into pipeline lifecycle is fine if we could decouple it. Instead of calling directly LogsCleaningService from PipelineService, we could emit event PipelineCreatedEvent and handle it in cleaning service.

  1. LogsCleaningService is singleton, loads configuration (retention policy) at startup.
  2. Pipeline runs. At some point it is saved to DB in PipelineService (should already exist, didn't look for a name in the code).
  3. Right after that PipelineCreatedEvent is created/emitted.
  4. LogsCleaningService catches event, gets repo ID.
  5. Loads (via PipelineService probably) pipeline runs IDs by condition from retention policy: older than X days / X count.
  6. Calls method (supplying IDs from previous step) of PipelineService (or maybe some dedicated to logs exists PipelineService) to remove log entries like removeLogsOf(int[]{1,2,3}).

Looking forward to implementation of this feature, cause I could apply this approach to the agents cleaning also.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature add new functionality
Projects
None yet
Development

No branches or pull requests

6 participants