chore: backend blackouts to maintenance windows
talboren committed Sep 2, 2024
1 parent 5094c6c commit b1ed3c1
Showing 16 changed files with 387 additions and 360 deletions.
File renamed without changes
2 changes: 1 addition & 1 deletion docs/mint.json
Original file line number Diff line number Diff line change
@@ -38,7 +38,7 @@
"overview/enrichment/mapping"
]
},
"overview/blackouts",
"overview/maintenance-windows",
"overview/examples",
"overview/comparison"
]
59 changes: 0 additions & 59 deletions docs/overview/blackouts.mdx

This file was deleted.

59 changes: 59 additions & 0 deletions docs/overview/maintenance-windows.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
---
title: "Maintenance Windows"
---

# Alert Management: Maintenance Windows

Keep's Maintenance Windows feature provides a critical mechanism for managing alert noise during scheduled maintenance periods or other planned events. By defining Maintenance Window rules, users can suppress alerts that are irrelevant during these times, ensuring that only actionable alerts reach the operations team.

## Introduction

In dynamic IT environments, it's common to have periods where certain alerts are expected and should not trigger incident responses. Keep's Maintenance Windows feature allows users to define specific rules that temporarily suppress alerts based on various conditions, such as time windows or alert attributes. This helps prevent unnecessary alert fatigue and ensures that teams can focus on critical issues.

## How It Works

1. **Maintenance Window Rule Definition**: Users define Maintenance Window rules specifying the conditions under which alerts should be suppressed.
2. **Condition Specification**: A CEL (Common Expression Language) query is associated with each Maintenance Window rule to define the conditions for suppression.
3. **Time Window Configuration**: Maintenance Window rules can be set for specific start and end times, or based on a relative duration.
4. **Alert Suppression**: During the active period of a Maintenance Window rule, any alerts matching the defined conditions are suppressed and not forwarded to the alerting system.
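
The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `WindowRule` and `is_suppressed` are hypothetical names, a Python callable stands in for the rule's CEL query, and the start-time check is assumed rather than taken from Keep's implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass
class WindowRule:
    """Hypothetical stand-in for a Maintenance Window rule."""

    name: str
    matches: Callable[[dict], bool]  # stands in for the rule's CEL query
    start_time: datetime
    end_time: datetime
    enabled: bool = True


def is_suppressed(alert: dict, rules: list[WindowRule], now: datetime) -> bool:
    """Return True if any enabled, currently active rule matches the alert."""
    for rule in rules:
        if not rule.enabled:
            continue
        if not (rule.start_time <= now < rule.end_time):
            continue  # rule is outside its time window
        if rule.matches(alert):
            return True
    return False


db_upgrade = WindowRule(
    name="db-upgrade",
    matches=lambda alert: alert.get("source") == "database",
    start_time=datetime(2024, 9, 2, 22, 0),
    end_time=datetime(2024, 9, 3, 2, 0),
)

during = datetime(2024, 9, 2, 23, 30)
after = datetime(2024, 9, 3, 3, 0)
print(is_suppressed({"source": "database"}, [db_upgrade], during))  # True
print(is_suppressed({"source": "frontend"}, [db_upgrade], during))  # False
print(is_suppressed({"source": "database"}, [db_upgrade], after))   # False
```

An alert is suppressed only when both the time window and the condition hold, so a narrowly scoped condition keeps unrelated alerts flowing during the window.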

## Practical Example

Suppose your team schedules a database upgrade that could trigger numerous non-critical alerts. You can create a Maintenance Window rule that suppresses alerts from the database service during the upgrade window. This ensures that your operations team isn't overwhelmed by non-actionable alerts, allowing them to focus on more critical issues.

## Core Concepts

- **Maintenance Window Rules**: Configurations that define when and which alerts should be suppressed based on time windows and conditions.
- **CEL Query**: An expression written in CEL (Common Expression Language) that specifies which alerts a rule suppresses. For example, `source == "database"` matches alerts whose source is the database service.
- **Time Window**: The specific start and end times or relative duration during which the Maintenance Window rule is active.
- **Alert Suppression**: The process of ignoring alerts that match the Maintenance Window rule's conditions during the specified time window.

## Status-Based Filtering in Maintenance Windows

In Keep, certain alert statuses are exempt from Maintenance Window rules. Specifically, alerts with the statuses RESOLVED and ACKNOWLEDGED are never suppressed, even when they match an active rule. This is intentional: these status updates must still be processed so that they can close or update active incidents.

### Why Are Some Statuses Ignored?

- **RESOLVED Alerts**: These alerts indicate that an issue has been resolved. By allowing them to bypass Maintenance Window rules, Keep ensures that any active incidents related to the alert can be properly closed, maintaining the integrity of the alert lifecycle.
- **ACKNOWLEDGED Alerts**: These alerts have been acknowledged by an operator, signaling that they are being addressed. Exempting them from Maintenance Window rules lets operators keep tracking the progress of incidents and take necessary actions without interference.

By excluding these statuses from Maintenance Window suppression, Keep allows for the continuous and accurate management of alerts, even during Maintenance Window periods, ensuring that resolution processes are not disrupted.
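
A minimal sketch of this status check, assuming alerts are plain dictionaries and that the status strings are `"resolved"` and `"acknowledged"` (the real values come from Keep's `AlertStatus` enum; `eligible_for_suppression` is a hypothetical helper name):

```python
# Assumed status strings; the real values come from Keep's AlertStatus enum.
STATUSES_TO_IGNORE = {"resolved", "acknowledged"}


def eligible_for_suppression(alert: dict) -> bool:
    """Alerts in an exempt status bypass Maintenance Window rules entirely."""
    return alert.get("status") not in STATUSES_TO_IGNORE


alerts = [
    {"name": "db-latency", "status": "firing"},
    {"name": "db-latency", "status": "resolved"},
    {"name": "db-disk", "status": "acknowledged"},
]

# Only alerts that pass this check would go on to be matched against rules.
checked = [a for a in alerts if eligible_for_suppression(a)]
print([a["status"] for a in checked])  # ['firing']
```

Running the status check before any rule evaluation means an exempt alert is never suppressed, regardless of which rules are active.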

## Creating a Maintenance Window Rule

To create a Maintenance Window rule:

<Frame width="100" height="200">
<img height="10" src="/images/maintenance-window-creation.png" />
</Frame>

1. **Define the Maintenance Window Name and Description**: Provide a name and optional description for the Maintenance Window rule to easily identify its purpose.
2. **Specify the CEL Query**: Use CEL to define the conditions under which alerts should be suppressed (e.g., `source == "database"`).
3. **Set the Time Window**: Choose a specific start and end time, or define a relative duration for the Maintenance Window.
4. **Enable the Rule**: Decide whether the rule should be active immediately or scheduled for future use.
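
As an illustration of step 3, a relative duration can be turned into an explicit end time before the rule is saved. The sketch below is an assumption-laden example: `build_rule_payload` is a hypothetical helper, and the `start_time` field name and the ISO 8601 serialization are guesses rather than Keep's documented request format.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_rule_payload(
    name: str,
    cel_query: str,
    start_time: datetime,
    duration: timedelta,
    description: Optional[str] = None,
    enabled: bool = True,
) -> dict:
    """Build a rule definition, deriving end_time from a relative duration."""
    end_time = start_time + duration
    return {
        "name": name,
        "description": description,
        "cel_query": cel_query,
        "start_time": start_time.isoformat(),  # assumed field name
        "end_time": end_time.isoformat(),
        "enabled": enabled,
    }


start = datetime(2024, 9, 2, 22, 0, tzinfo=timezone.utc)
payload = build_rule_payload(
    name="db-upgrade",
    cel_query='source == "database"',
    start_time=start,
    duration=timedelta(hours=4),
    description="Suppress database alerts during the upgrade",
)
print(payload["end_time"])  # 2024-09-03T02:00:00+00:00
```

Using timezone-aware datetimes here avoids ambiguity when the person scheduling the window and the server are in different time zones.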

## Best Practices

- **Plan Maintenance Windows in Advance**: Create Maintenance Window rules ahead of known maintenance work so that expected alerts are suppressed from the start.
- **Use Specific Conditions**: Define precise CEL queries to ensure only the intended alerts are suppressed.
- **Review and Update Maintenance Windows**: Regularly review active Maintenance Window rules to ensure they are still relevant and adjust them as necessary.
4 changes: 2 additions & 2 deletions keep/api/api.py
Original file line number Diff line number Diff line change
@@ -28,11 +28,11 @@
actions,
ai,
alerts,
blackout,
dashboard,
extraction,
healthcheck,
incidents,
maintenance,
mapping,
metrics,
preset,
@@ -220,7 +220,7 @@ def get_app(
)
app.include_router(dashboard.router, prefix="/dashboard", tags=["dashboard"])
app.include_router(tags.router, prefix="/tags", tags=["tags"])
app.include_router(blackout.router, prefix="/blackout", tags=["blackout"])
app.include_router(maintenance.router, prefix="/maintenance", tags=["maintenance"])
app.include_router(topology.router, prefix="/topology", tags=["topology"])

# if its single tenant with authentication, add signin endpoint
Original file line number Diff line number Diff line change
@@ -8,11 +8,11 @@
from keep.api.core.db import get_session_sync
from keep.api.models.alert import AlertDto, AlertStatus
from keep.api.models.db.alert import AlertActionType, AlertAudit
from keep.api.models.db.blackout import BlackoutRule
from keep.api.models.db.maintenance_window import MaintenanceWindowRule
from keep.api.utils.cel_utils import preprocess_cel_expression


class BlackoutsBl:
class MaintenanceWindowsBl:

ALERT_STATUSES_TO_IGNORE = [
AlertStatus.RESOLVED.value,
@@ -23,41 +23,43 @@ def __init__(self, tenant_id: str, session: Session | None) -> None:
self.logger = logging.getLogger(__name__)
self.tenant_id = tenant_id
self.session = session if session else get_session_sync()
self.blackouts: list[BlackoutRule] = (
self.session.query(BlackoutRule)
.filter(BlackoutRule.tenant_id == tenant_id)
.filter(BlackoutRule.enabled == True)
.filter(BlackoutRule.end_time >= datetime.datetime.now())
self.maintenance_rules: list[MaintenanceWindowRule] = (
self.session.query(MaintenanceWindowRule)
.filter(MaintenanceWindowRule.tenant_id == tenant_id)
.filter(MaintenanceWindowRule.enabled == True)
.filter(MaintenanceWindowRule.end_time >= datetime.datetime.now())
.all()
)

def check_if_alert_in_blackout(self, alert: AlertDto) -> bool:
def check_if_alert_in_maintenance_windows(self, alert: AlertDto) -> bool:
extra = {"tenant_id": self.tenant_id, "fingerprint": alert.fingerprint}

if not self.blackouts:
if not self.maintenance_rules:
self.logger.debug(
"No blackout rules for this tenant", extra={"tenant_id": self.tenant_id}
"No maintenance window rules for this tenant",
extra={"tenant_id": self.tenant_id},
)
return False

if alert.status in self.ALERT_STATUSES_TO_IGNORE:
self.logger.debug(
"Alert status is set to be ignored, not blacking out",
"Alert status is set to be ignored, ignoring maintenance windows",
extra={"tenant_id": self.tenant_id},
)
return False

self.logger.info("Checking blackout for alert", extra=extra)
self.logger.info("Checking maintenance window for alert", extra=extra)
env = celpy.Environment()

for blackout in self.blackouts:
if blackout.end_time <= datetime.datetime.now():
for maintenance_rule in self.maintenance_rules:
if maintenance_rule.end_time <= datetime.datetime.now():
# this is wtf error, should not happen because of query in init
self.logger.error(
"Fetched blackout which already ended by mistake, should not happen!"
"Fetched maintenance window which already ended by mistake, should not happen!"
)
continue

cel = preprocess_cel_expression(blackout.cel_query)
cel = preprocess_cel_expression(maintenance_rule.cel_query)
ast = env.compile(cel)
prgm = env.program(ast)

@@ -76,28 +78,29 @@ def check_if_alert_in_blackout(self, alert: AlertDto) -> bool:
raise
if cel_result:
self.logger.info(
"Alert is blacked out", extra={**extra, "blackout_id": blackout.id}
"Alert is in maintenance window",
extra={**extra, "maintenance_rule_id": maintenance_rule.id},
)

try:
audit = AlertAudit(
tenant_id=self.tenant_id,
fingerprint=alert.fingerprint,
user_id="Keep",
action=AlertActionType.BLACKOUT.value,
description=f"Alert was blacked out due to rule `{blackout.name}`",
action=AlertActionType.MAINTENANCE.value,
description=f"Alert in maintenance due to rule `{maintenance_rule.name}`",
)
self.session.add(audit)
self.session.commit()
except Exception:
self.logger.exception(
"Failed to write audit for alert blackout",
"Failed to write audit for alert maintenance window",
extra={
"tenant_id": self.tenant_id,
"fingerprint": alert.fingerprint,
},
)

return True
self.logger.info("Alert is not blacked out", extra=extra)
self.logger.info("Alert is not in maintenance window", extra=extra)
return False
2 changes: 1 addition & 1 deletion keep/api/core/db.py
Original file line number Diff line number Diff line change
@@ -30,9 +30,9 @@
from keep.api.models.alert import IncidentDtoIn
from keep.api.models.db.action import Action
from keep.api.models.db.alert import * # pylint: disable=unused-wildcard-import
from keep.api.models.db.blackout import * # pylint: disable=unused-wildcard-import
from keep.api.models.db.dashboard import * # pylint: disable=unused-wildcard-import
from keep.api.models.db.extraction import * # pylint: disable=unused-wildcard-import
from keep.api.models.db.maintenance_window import * # pylint: disable=unused-wildcard-import
from keep.api.models.db.mapping import * # pylint: disable=unused-wildcard-import
from keep.api.models.db.preset import * # pylint: disable=unused-wildcard-import
from keep.api.models.db.provider import * # pylint: disable=unused-wildcard-import
2 changes: 1 addition & 1 deletion keep/api/models/db/alert.py
Original file line number Diff line number Diff line change
@@ -239,4 +239,4 @@ class AlertActionType(enum.Enum):
# commented
COMMENT = "a comment was added to the alert"
UNCOMMENT = "a comment was removed from the alert"
BLACKOUT = "Alert is blacked out"
MAINTENANCE = "Alert is in maintenance window"
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@
from sqlmodel import Column, Field, Index, SQLModel, func


class BlackoutRule(SQLModel, table=True):
class MaintenanceWindowRule(SQLModel, table=True):
id: Optional[int] = Field(default=None, primary_key=True)
name: str
tenant_id: str = Field(foreign_key="tenant.id")
@@ -30,12 +30,12 @@ class BlackoutRule(SQLModel, table=True):
enabled: bool = True

__table_args__ = (
Index("ix_blackout_tenant_id", "tenant_id"),
Index("ix_blackout_tenant_id_end_time", "tenant_id", "end_time"),
Index("ix_maintenance_rule_tenant_id", "tenant_id"),
Index("ix_maintenance_rule_tenant_id_end_time", "tenant_id", "end_time"),
)


class BlackoutRuleCreate(BaseModel):
class MaintenanceRuleCreate(BaseModel):
name: str
description: Optional[str] = None
cel_query: str
@@ -44,7 +44,7 @@ class BlackoutRuleCreate(BaseModel):
enabled: bool = True


class BlackoutRuleRead(BaseModel):
class MaintenanceRuleRead(BaseModel):
id: int
name: str
description: Optional[str]
2 changes: 1 addition & 1 deletion keep/api/models/db/migrations/env.py
Original file line number Diff line number Diff line change
@@ -9,9 +9,9 @@
from keep.api.core.db_utils import create_db_engine
from keep.api.models.db.action import *
from keep.api.models.db.alert import *
from keep.api.models.db.blackout import *
from keep.api.models.db.dashboard import *
from keep.api.models.db.extraction import *
from keep.api.models.db.maintenance_window import *
from keep.api.models.db.mapping import *
from keep.api.models.db.preset import *
from keep.api.models.db.provider import *
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
"""Your message
"""Maintenance Windows
Revision ID: dfeaa5383cc6
Revision ID: 70671c95028e
Revises: 1c650a429672
Create Date: 2024-09-01 17:23:44.757056
Create Date: 2024-09-02 12:07:09.147349
"""

@@ -11,7 +11,7 @@
from alembic import op

# revision identifiers, used by Alembic.
revision = "dfeaa5383cc6"
revision = "70671c95028e"
down_revision = "1c650a429672"
branch_labels = None
depends_on = None
@@ -20,7 +20,7 @@
def upgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
op.create_table(
"blackoutrule",
"maintenancewindowrule",
sa.Column(
"updated_at",
sa.DateTime(timezone=True),
@@ -43,19 +43,23 @@ def upgrade() -> None:
),
sa.PrimaryKeyConstraint("id"),
)
with op.batch_alter_table("blackoutrule", schema=None) as batch_op:
batch_op.create_index("ix_blackout_tenant_id", ["tenant_id"], unique=False)
with op.batch_alter_table("maintenancewindowrule", schema=None) as batch_op:
batch_op.create_index(
"ix_blackout_tenant_id_end_time", ["tenant_id", "end_time"], unique=False
"ix_maintenance_rule_tenant_id", ["tenant_id"], unique=False
)
batch_op.create_index(
"ix_maintenance_rule_tenant_id_end_time",
["tenant_id", "end_time"],
unique=False,
)
# ### end Alembic commands ###


def downgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
with op.batch_alter_table("blackoutrule", schema=None) as batch_op:
batch_op.drop_index("ix_blackout_tenant_id_end_time")
batch_op.drop_index("ix_blackout_tenant_id")
with op.batch_alter_table("maintenancewindowrule", schema=None) as batch_op:
batch_op.drop_index("ix_maintenance_rule_tenant_id_end_time")
batch_op.drop_index("ix_maintenance_rule_tenant_id")

op.drop_table("blackoutrule")
op.drop_table("maintenancewindowrule")
# ### end Alembic commands ###