Building Alerts w/ Clickhouse. The MVP Approach.
Follow my journey as I design and build near real-time, pull-based alerts using Clickhouse. The anti-over engineered approach.
"Clickhouse, like celestial beacons in the cosmic expanse, illuminate the path, signaling alerts in the vast ocean of real-time data." - Galileo Galilei
The Context
Helicone is an at-scale LLM monitoring platform (with tons of amazing features). Recently, we’ve been receiving a surge of requests for alerting, especially for cost, error rate, and latency. In this blog, we delve into the details of our design and build process with the rationale behind our approach.
Requirements
Approaching requirements at a startup requires a different mindset than you might be used to at larger, more established companies. It is critical to prioritize rapid development to find product-market fit. Perfection can wait. Below is a demonstration of the process we go through to determine the MVP.
Versatile Alerts: Support different alert types.
MVP: Begin with error rate alerts and determine if there's user demand before branching out into more types.

Real-Time Push-Based: Instant notification.
MVP: Near real-time, pull-based works.

Efficient Data Management: Store logs in minute blocks.
MVP: All in Clickhouse <3

Smart Alert Resolution: Avoid constant re-triggering.
MVP: Necessary, or alerts could become useless. A 5-minute cooldown before an alert can be resolved after it drops below the threshold.

Easy Integration: Easily add new alert types.
MVP: Meh, figure this out later.

Scalable: Handle all the scale!
MVP: As long as it can handle a few months of scale, we're good. If alerts become a hit, then it's time to scale up! We can always throw money at this problem and scale vertically.

Customizable Thresholds: Threshold to signal an alert.
MVP: A necessity. Custom thresholds are enabled because some users expect a certain % of errors.

Customizable Time Windows: Timespan to check the alert threshold.
MVP: Not a necessity, but easy to implement.

Customizable Minimum Request Threshold: Minimum requests required before an alert can be triggered.
MVP: Necessary to prevent low-traffic errors from triggering alerts, avoiding the need for higher threshold adjustments in high-traffic periods.

Multi-Channel Notifications: SMS, email, webhooks, etc.
MVP: Email works.

Real-Time Dashboards: Instantly be aware of the alert.
MVP: Near real-time seems to do the trick.
“The most effective startup engineers are those who know the right way to do a thing, but actively choose not to do it” - Justin Torre
Implementation Details
Configured alerts are simply stored in Postgres; however, we still need a solution for checking the alert state (triggered, resolved, cooldown). We compared the following three options:
Postgres
Postgres, while reliable, falls short in scaling, especially for time-sensitive metric aggregation.
Durable Objects
A good fit with our Cloudflare Workers stack but adds complexity in managing alerts across both Postgres and Durable Objects.
Clickhouse
Our choice. We already log requests and responses to Clickhouse to support aggregations, so we can leverage it by adding a cron job that runs every minute to check the alert state. The maximum time window is 1 hour, so the checks will be quite efficient. If we hit a scaling bottleneck, we're ready to scale financially (vertically) and technically as needed. Cue the all-nighters fueled by Red Bull and coffee (not sponsored).
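As a rough sketch (not our exact wiring), the per-minute check can hang off a Cloudflare Workers cron trigger that calls into the AlertManager class described below. The handler shape and the AlertStore construction here are illustrative assumptions.

// wrangler.toml (assumed): [triggers] crons = ["* * * * *"]
export default {
  // Cloudflare invokes `scheduled` on every cron tick, i.e. once a minute here.
  async scheduled(
    _controller: ScheduledController,
    env: Env,
    ctx: ExecutionContext
  ): Promise<void> {
    // AlertStore construction is assumed; it wraps our data access.
    const alertManager = new AlertManager(new AlertStore(env), env);

    // Let the check run to completion without blocking the scheduled event.
    ctx.waitUntil(alertManager.checkAlerts());
  },
};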
Postgres Table Definitions For Alerts & Alert History
Some denormalization takes place for faster querying.
CREATE TABLE alert (
  id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
  org_id UUID NOT NULL REFERENCES public.organization(id),
  metric TEXT NOT NULL,
  threshold DECIMAL NOT NULL,
  time_window BIGINT NOT NULL,
  time_block_duration BIGINT NOT NULL DEFAULT 60000,
  emails TEXT[] NOT NULL,
  status TEXT NOT NULL DEFAULT 'resolved',
  name TEXT NOT NULL,
  soft_delete BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMP WITH TIME ZONE NULL DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE NULL DEFAULT NOW()
);

CREATE TABLE alert_history (
  id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
  org_id UUID NOT NULL REFERENCES public.organization(id),
  alert_id UUID NOT NULL REFERENCES alert(id),
  alert_start_time TIMESTAMP WITH TIME ZONE NOT NULL,
  alert_end_time TIMESTAMP WITH TIME ZONE,
  alert_metric TEXT NOT NULL,
  alert_name TEXT NOT NULL,
  triggered_value TEXT NOT NULL,
  status TEXT NOT NULL,
  soft_delete BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMP WITH TIME ZONE NULL DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE NULL DEFAULT NOW()
);
Yes, RLS is enabled.
Initialization: The AlertManager Class
export class AlertManager {
  COOLDOWN_PERIOD_MS = 5 * 60 * 1000;

  private utilityKv: Env["UTILITY_KV"];
  private resendApiKey: Env["RESEND_API_KEY"];

  constructor(private alertStore: AlertStore, private env: Env) {
    this.utilityKv = env.UTILITY_KV;
    this.resendApiKey = env.RESEND_API_KEY;
  }

  // ...
}
This class is the heart of our alerting logic. It is initialized with the essential configuration: the cooldown period, the KV namespace used to track cooldowns, and the Resend API key.
The Core Function: checkAlerts
Fetch alerts from Postgres
const { data: alerts, error: alertsErr } = await this.supabaseClient
.from("alert")
.select("*")
.eq("soft_delete", false);
Retrieve each alert's current state from Clickhouse (error count, total count, etc.)
SELECT
  COUNT() AS totalCount,
  COUNTIf(status BETWEEN 400 AND 599) AS errorCount
FROM response_copy_v3
WHERE
  organization_id = {val_0: UUID} AND
  request_created_at >= toDateTime64(now(), 3) - INTERVAL {val_1: Int64} MILLISECOND
This query calculates the total count of events and counts the number of error events (status between 400 and 599) for a specific organization within a given time window.
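If you're curious how this runs from a Worker, here's a minimal sketch against Clickhouse's HTTP interface, binding the {val_0}/{val_1} placeholders via param_* query-string values and returning the AlertState shape described in the next section. The CLICKHOUSE_* env bindings and the helper name are placeholders, not our actual client code.

// Hypothetical helper: run the error-rate query over Clickhouse's HTTP interface.
async function fetchAlertState(
  env: Env,
  organizationId: string,
  timeWindowMs: number
): Promise<AlertState> {
  const query = `
    SELECT
      COUNT() AS totalCount,
      COUNTIf(status BETWEEN 400 AND 599) AS errorCount
    FROM response_copy_v3
    WHERE
      organization_id = {val_0: UUID} AND
      request_created_at >= toDateTime64(now(), 3) - INTERVAL {val_1: Int64} MILLISECOND
    FORMAT JSON`;

  // param_<name> query-string values fill the {name: Type} placeholders above.
  const url = new URL(env.CLICKHOUSE_HOST); // assumed binding, e.g. "https://my-clickhouse:8443"
  url.searchParams.set("param_val_0", organizationId);
  url.searchParams.set("param_val_1", String(timeWindowMs));

  const response = await fetch(url.toString(), {
    method: "POST",
    headers: {
      "X-ClickHouse-User": env.CLICKHOUSE_USER, // assumed binding
      "X-ClickHouse-Key": env.CLICKHOUSE_PASSWORD, // assumed binding
    },
    body: query,
  });

  // FORMAT JSON returns { data: [...] }; 64-bit counts come back as strings by default.
  const { data } = (await response.json()) as {
    data: { totalCount: string | number; errorCount: string | number }[];
  };
  return {
    totalCount: Number(data[0].totalCount),
    errorCount: Number(data[0].errorCount),
  };
}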
Alert States
The AlertState object returned by the Clickhouse query is pivotal for our alerting logic. It looks like this:
type AlertState = {
  totalCount: number;
  errorCount: number;
};
Check For Alert State Updates
For each AlertState, we need to calculate if there is an update.
const rate = alertState.totalCount > 0 ? (alertState.errorCount / alertState.totalCount) * 100 : 0;
Based on the rate and the alert's current status, the system marks each alert as 'triggered', 'resolved', or 'unchanged'. Here is the logic:
// Function to determine the update needed for an alert's state
async getAlertStateUpdate(alert, alertState, timestamp) {
  // Calculate the error rate
  const rate = alertState.totalCount > 0 ? (alertState.errorCount / alertState.totalCount) * 100 : 0;

  // Check if the error rate is below the threshold
  if (rate < alert.threshold) {
    // Handle scenario when the error rate is below the threshold
    return this.handleRateBelowThreshold(alert, timestamp);
  } else {
    // Handle scenario when the error rate is above or equal to the threshold
    return this.handleRateAboveThreshold(alert, rate, timestamp);
  }
}

// Handle alerts when the error rate is below the threshold
async handleRateBelowThreshold(alert, timestamp) {
  // If alert is already resolved, no change is needed
  if (alert.status === "resolved") {
    return { status: "unchanged", timestamp, alert };
  }

  // Get the start time of the cooldown period
  const cooldownStart = await this.getCooldownStart(alert.id);

  // If cooldown hasn't started, begin the cooldown period
  if (!cooldownStart) {
    await this.setCooldownStart(alert.id, timestamp);
    return { status: "unchanged", timestamp, alert };
  }

  // If cooldown has elapsed, mark the alert as resolved
  if (timestamp - cooldownStart >= this.COOLDOWN_PERIOD_MS) {
    await this.deleteCooldown(alert.id);
    return { status: "resolved", timestamp, alert };
  }

  // If cooldown is still active, no change is needed
  return { status: "unchanged", timestamp, alert };
}

// Handle alerts when the error rate is above or equal to the threshold
async handleRateAboveThreshold(alert, rate, timestamp) {
  // If alert was previously resolved, trigger a new alert
  if (alert.status === "resolved") {
    await this.deleteCooldown(alert.id);
    return { status: "triggered", timestamp, triggeredThreshold: rate, alert };
  }

  // If alert is already active, no change is needed
  return { status: "unchanged", timestamp, alert };
}
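The cooldown helpers referenced above (getCooldownStart, setCooldownStart, deleteCooldown) aren't shown in the snippet. A minimal version backed by the UTILITY_KV binding from the AlertManager constructor could look like this; the key format and TTL are illustrative rather than our exact implementation.

// Illustrative cooldown helpers backed by Workers KV (the UTILITY_KV binding).
private cooldownKey(alertId: string): string {
  return `alert:cooldown:${alertId}`;
}

async getCooldownStart(alertId: string): Promise<number | null> {
  const value = await this.utilityKv.get(this.cooldownKey(alertId));
  return value ? Number(value) : null;
}

async setCooldownStart(alertId: string, timestamp: number): Promise<void> {
  // Expire the key well after the 5-minute cooldown so stale entries clean themselves up.
  await this.utilityKv.put(this.cooldownKey(alertId), String(timestamp), {
    expirationTtl: 60 * 60, // seconds
  });
}

async deleteCooldown(alertId: string): Promise<void> {
  await this.utilityKv.delete(this.cooldownKey(alertId));
}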
Handling Alert State Updates
If an alert is triggered
Update the alert records to be triggered
const { error: alertUpdateErr } = await this.supabaseClient
.from("alert")
.update({
status: status,
updated_at: new Date().toISOString(),
})
.in("id", alertIds);
Insert alert history records
const { error: alertHistoryInsErr } = await this.supabaseClient
.from("alert_history")
.insert(alertHistories);
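The alertHistories payload isn't shown above; roughly, it maps the triggered state updates onto the alert_history columns defined earlier. The triggeredUpdates variable below is a hypothetical name for the 'triggered' results returned by getAlertStateUpdate.

// Sketch: map triggered state updates onto alert_history rows (column names from the DDL above).
const alertHistories = triggeredUpdates.map((update) => ({
  org_id: update.alert.org_id,
  alert_id: update.alert.id,
  alert_start_time: new Date(update.timestamp).toISOString(),
  alert_metric: update.alert.metric,
  alert_name: update.alert.name,
  triggered_value: update.triggeredThreshold?.toFixed(2) ?? "",
  status: "triggered",
}));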
If an alert is resolved
Update the alert records to be resolved
const { error: alertUpdateErr } = await this.supabaseClient
.from("alert")
.update({
status: status,
updated_at: new Date().toISOString(),
})
.in("id", alertIds);
Update the alert history records, changing the status to resolved and adding the alert_end_time
const { error: alertHistoryUpdateErr } = await this.supabaseClient
.from("alert_history")
.update({
status: status,
alert_end_time: alertEndTime,
updated_at: new Date().toISOString(),
})
.in("alert_id", alertIds)
.eq("status", "triggered");
Send Emails
Send alert update emails to all users subscribed to the alert. For this, we use Resend.
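A rough sketch of what that looks like from the Worker, going straight at Resend's REST API; the sender address, subject, and body here are placeholders, not our actual templates.

// Sketch: notify subscribers through Resend's REST API.
async sendAlertEmail(alert: { name: string; emails: string[] }, status: string): Promise<void> {
  const response = await fetch("https://api.resend.com/emails", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${this.resendApiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      from: "Alerts <alerts@example.com>", // placeholder sender
      to: alert.emails,
      subject: `Alert ${status}: ${alert.name}`,
      html: `<p>Your alert <strong>${alert.name}</strong> is now <strong>${status}</strong>.</p>`,
    }),
  });

  if (!response.ok) {
    console.error(`Resend request failed: ${response.status} ${await response.text()}`);
  }
}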
Conclusion
That's it! A simple implementation, and surprisingly scalable. These Clickhouse aggregation queries are incredibly quick, consistently executing in under 0.05s for 5 to 10 minute windows. Running all alert checks asynchronously should not be an issue for a while.
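Concretely, "asynchronously" here just means the per-alert checks fan out concurrently inside checkAlerts, roughly like the sketch below (reusing the hypothetical fetchAlertState helper from earlier and the alerts fetched from Postgres).

// Sketch: run all alert checks concurrently rather than one at a time.
const stateUpdates = await Promise.all(
  alerts.map(async (alert) => {
    const alertState = await fetchAlertState(this.env, alert.org_id, alert.time_window);
    return this.getAlertStateUpdate(alert, alertState, Date.now());
  })
);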
What would I have done differently?
My first attempt at this was over-engineered: I went for push-based, real-time alerts using Cloudflare Durable Objects. After realizing the extra complexity was unnecessary and would only beget more complexity, I scrapped it and rewrote the feature in a day using Clickhouse. It was a painful lesson, but a useful one; I'll approach future projects with a different mindset.