Building Alerts w/ Clickhouse. The MVP Approach.
Follow my journey as I design and build near real-time, pull-based alerts using Clickhouse. The anti-over engineered approach.
"Clickhouse, like celestial beacons in the cosmic expanse, illuminate the path, signaling alerts in the vast ocean of real-time data." - Galileo Galilei
The Context
Helicone is an at-scale LLM monitoring platform (with tons of amazing features). Recently, we’ve been receiving a surge of requests for alerting, especially for cost, error rate, and latency. In this blog, we delve into the details of our design and build process with the rationale behind our approach.
Requirements
Approaching requirements at a startup requires a different mindset than you might be used to at larger, more established companies. It is critical to prioritize rapid development to find product-market fit. Perfection can wait. Below is a demonstration of the process we go through to determine the MVP.
Versatile Alerts: Support different alert types.
MVP: Begin with error rate alerts and determine if there's user demand before branching out into more types.

Real-Time Push-Based: Instant notification.
MVP: Near real-time, pull-based works.

Efficient Data Management: Store logs in minute blocks.
MVP: All in Clickhouse <3

Smart Alert Resolution: Avoid constant re-triggering.
MVP: Necessary, or alerts could become useless. A 5-minute cooldown before an alert can be resolved after it drops below the threshold.

Easy Integration: Easily add new alert types.
MVP: Meh, figure this out later.

Scalable: Handle all the scale!
MVP: As long as it can handle a few months of scale, we're good. If alerts become a hit, then it's time to scale up! We can always throw money at this problem and scale vertically.

Customizable Thresholds: Threshold to signal an alert.
MVP: A necessity. Custom thresholds are enabled because some users expect a certain % of errors.

Customizable Time Windows: Timespan to check the alert threshold.
MVP: Not a necessity, but easy to implement.

Customizable Minimum Request Threshold: Minimum requests required before an alert can be triggered.
MVP: Necessary to prevent low-traffic errors from triggering alerts, avoiding the need for higher threshold adjustments in high-traffic periods.

Multi-Channel Notifications: SMS, email, webhooks, etc.
MVP: Email works.

Real-Time Dashboards: Instantly be aware of the alert.
MVP: Near real-time seems to do the trick.
“The most effective startup engineers are those who know the right way to do a thing, but actively choose not to do it” - Justin Torre
Implementation Details
Configured alerts are simply stored in Postgres; however, we still need a solution for checking the alert state (triggered, resolved, cooldown). We compared the following three options:
Postgres
Postgres, while reliable, falls short in scaling, especially for time-sensitive metric aggregation.
Durable Objects
A good fit with our Cloudflare Workers stack but adds complexity in managing alerts across both Postgres and Durable Objects.
Clickhouse
Our choice. We already log requests and responses to Clickhouse to support aggregations, so we can leverage it by adding a cron job that runs every minute to check the alert state. The maximum time window is 1 hour, so the checks will be quite efficient. If we hit a scaling bottleneck, we're ready to scale financially (vertically) and technically as needed. Cue the all-nighters fueled by Red Bull and coffee (not sponsored).
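As a rough sketch (not our exact wiring), the per-minute check can hang off a Cloudflare Workers cron trigger that calls into the AlertManager class described below. The handler shape and the AlertStore construction here are illustrative assumptions.

// wrangler.toml (assumed): [triggers] crons = ["* * * * *"]
export default {
  // Cloudflare invokes `scheduled` on every cron tick, i.e. once a minute here.
  async scheduled(
    _controller: ScheduledController,
    env: Env,
    ctx: ExecutionContext
  ): Promise<void> {
    // AlertStore construction is assumed; it wraps our data access.
    const alertManager = new AlertManager(new AlertStore(env), env);

    // Let the check run to completion without blocking the scheduled event.
    ctx.waitUntil(alertManager.checkAlerts());
  },
};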
Postgres Table Definitions For Alerts & Alert History
Some denormalization takes place for faster querying.
CREATE TABLE alert (
  id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
  org_id UUID NOT NULL REFERENCES public.organization(id),
  metric TEXT NOT NULL,
  threshold DECIMAL NOT NULL,
  time_window BIGINT NOT NULL,
  time_block_duration BIGINT NOT NULL DEFAULT 60000,
  emails TEXT[] NOT NULL,
  status TEXT NOT NULL DEFAULT 'resolved',
  name TEXT NOT NULL,
  soft_delete BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMP WITH TIME ZONE NULL DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE NULL DEFAULT NOW()
);

CREATE TABLE alert_history (
  id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
  org_id UUID NOT NULL REFERENCES public.organization(id),
  alert_id UUID NOT NULL REFERENCES alert(id),
  alert_start_time TIMESTAMP WITH TIME ZONE NOT NULL,
  alert_end_time TIMESTAMP WITH TIME ZONE,
  alert_metric TEXT NOT NULL,
  alert_name TEXT NOT NULL,
  triggered_value TEXT NOT NULL,
  status TEXT NOT NULL,
  soft_delete BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMP WITH TIME ZONE NULL DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE NULL DEFAULT NOW()
);
Yes, RLS is enabled.
Initialization: The AlertManager Class
export class AlertManager {
  COOLDOWN_PERIOD_MS = 5 * 60 * 1000;

  private utilityKv: Env["UTILITY_KV"];
  private resendApiKey: Env["RESEND_API_KEY"];

  constructor(private alertStore: AlertStore, private env: Env) {
    this.utilityKv = env.UTILITY_KV;
    this.resendApiKey = env.RESEND_API_KEY;
  }

  // ...
}
This class is the heart of our alerting logic. It is initialized with the essential configuration: the cooldown period, the KV namespace used to track cooldowns, and the Resend API key.
The Core Function: checkAlerts
Fetch alerts from Postgres
const { data: alerts, error: alertsErr } = await this.supabaseClient
.from("alert")
.select("*")
.eq("soft_delete", false);
Retrieve each alert's current state from Clickhouse (error count, total count, etc.)
SELECT
  COUNT() AS totalCount,
  COUNTIf(status BETWEEN 400 AND 599) AS errorCount
FROM response_copy_v3
WHERE
  organization_id = {val_0: UUID} AND
  request_created_at >= toDateTime64(now(), 3) - INTERVAL {val_1: Int64} MILLISECOND
This query calculates the total count of events and counts the number of error events (status between 400 and 599) for a specific organization within a given time window.
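If you're curious how this runs from a Worker, here's a minimal sketch against Clickhouse's HTTP interface, binding the {val_0}/{val_1} placeholders via param_* query-string values and returning the AlertState shape described in the next section. The CLICKHOUSE_* env bindings and the helper name are placeholders, not our actual client code.

// Hypothetical helper: run the error-rate query over Clickhouse's HTTP interface.
async function fetchAlertState(
  env: Env,
  organizationId: string,
  timeWindowMs: number
): Promise<AlertState> {
  const query = `
    SELECT
      COUNT() AS totalCount,
      COUNTIf(status BETWEEN 400 AND 599) AS errorCount
    FROM response_copy_v3
    WHERE
      organization_id = {val_0: UUID} AND
      request_created_at >= toDateTime64(now(), 3) - INTERVAL {val_1: Int64} MILLISECOND
    FORMAT JSON`;

  // param_<name> query-string values fill the {name: Type} placeholders above.
  const url = new URL(env.CLICKHOUSE_HOST); // assumed binding, e.g. "https://my-clickhouse:8443"
  url.searchParams.set("param_val_0", organizationId);
  url.searchParams.set("param_val_1", String(timeWindowMs));

  const response = await fetch(url.toString(), {
    method: "POST",
    headers: {
      "X-ClickHouse-User": env.CLICKHOUSE_USER, // assumed binding
      "X-ClickHouse-Key": env.CLICKHOUSE_PASSWORD, // assumed binding
    },
    body: query,
  });

  // FORMAT JSON returns { data: [...] }; 64-bit counts come back as strings by default.
  const { data } = (await response.json()) as {
    data: { totalCount: string | number; errorCount: string | number }[];
  };
  return {
    totalCount: Number(data[0].totalCount),
    errorCount: Number(data[0].errorCount),
  };
}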
Alert States
The AlertState object returned by the Clickhouse query is pivotal for our alerting logic. It looks like this:
type AlertState = {
  totalCount: number;
  errorCount: number;
};
Check For Alert State Updates
For each AlertState, we need to calculate if there is an update.
const rate = alertState.totalCount > 0 ? (alertState.errorCount / alertState.totalCount) * 100 : 0;
Based on the rate and the alert's current status, the system marks each alert as 'triggered', 'resolved', or 'unchanged'. Here is the logic:
// Function to determine the update needed for an alert's state
async getAlertStateUpdate(alert, alertState, timestamp) {
  // Calculate the error rate
  const rate = alertState.totalCount > 0 ? (alertState.errorCount / alertState.totalCount) * 100 : 0;

  // Check if the error rate is below the threshold
  if (rate < alert.threshold) {
    // Handle scenario when the error rate is below the threshold
    return this.handleRateBelowThreshold(alert, timestamp);
  } else {
    // Handle scenario when the error rate is above or equal to the threshold
    return this.handleRateAboveThreshold(alert, rate, timestamp);
  }
}

// Handle alerts when the error rate is below the threshold
async handleRateBelowThreshold(alert, timestamp) {
  // If alert is already resolved, no change is needed
  if (alert.status === "resolved") {
    return { status: "unchanged", timestamp, alert };
  }

  // Get the start time of the cooldown period
  const cooldownStart = await this.getCooldownStart(alert.id);

  // If cooldown hasn't started, begin the cooldown period
  if (!cooldownStart) {
    await this.setCooldownStart(alert.id, timestamp);
    return { status: "unchanged", timestamp, alert };
  }

  // If cooldown has elapsed, mark the alert as resolved
  if (timestamp - cooldownStart >= this.COOLDOWN_PERIOD_MS) {
    await this.deleteCooldown(alert.id);
    return { status: "resolved", timestamp, alert };
  }

  // If cooldown is still active, no change is needed
  return { status: "unchanged", timestamp, alert };
}

// Handle alerts when the error rate is above or equal to the threshold
async handleRateAboveThreshold(alert, rate, timestamp) {
  // If alert was previously resolved, trigger a new alert
  if (alert.status === "resolved") {
    await this.deleteCooldown(alert.id);
    return { status: "triggered", timestamp, triggeredThreshold: rate, alert };
  }

  // If alert is already active, no change is needed
  return { status: "unchanged", timestamp, alert };
}
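The cooldown helpers referenced above (getCooldownStart, setCooldownStart, deleteCooldown) aren't shown in the snippet. A minimal version backed by the UTILITY_KV binding from the AlertManager constructor could look like this; the key format and TTL are illustrative rather than our exact implementation.

// Illustrative cooldown helpers backed by Workers KV (the UTILITY_KV binding).
private cooldownKey(alertId: string): string {
  return `alert:cooldown:${alertId}`;
}

async getCooldownStart(alertId: string): Promise<number | null> {
  const value = await this.utilityKv.get(this.cooldownKey(alertId));
  return value ? Number(value) : null;
}

async setCooldownStart(alertId: string, timestamp: number): Promise<void> {
  // Expire the key well after the 5-minute cooldown so stale entries clean themselves up.
  await this.utilityKv.put(this.cooldownKey(alertId), String(timestamp), {
    expirationTtl: 60 * 60, // seconds
  });
}

async deleteCooldown(alertId: string): Promise<void> {
  await this.utilityKv.delete(this.cooldownKey(alertId));
}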
Handling Alert State Updates
If an alert is triggered
Update the alert records to be triggered
const { error: alertUpdateErr } = await this.supabaseClient
.from("alert")
.update({
status: status,
updated_at: new Date().toISOString(),
})
.in("id", alertIds);
Insert alert history records
const { error: alertHistoryInsErr } = await this.supabaseClient
.from("alert_history")
.insert(alertHistories);
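The alertHistories payload isn't shown above; roughly, it maps the triggered state updates onto the alert_history columns defined earlier. The triggeredUpdates variable below is a hypothetical name for the 'triggered' results returned by getAlertStateUpdate.

// Sketch: map triggered state updates onto alert_history rows (column names from the DDL above).
const alertHistories = triggeredUpdates.map((update) => ({
  org_id: update.alert.org_id,
  alert_id: update.alert.id,
  alert_start_time: new Date(update.timestamp).toISOString(),
  alert_metric: update.alert.metric,
  alert_name: update.alert.name,
  triggered_value: update.triggeredThreshold?.toFixed(2) ?? "",
  status: "triggered",
}));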
If an alert is resolved
Update the alert records to be resolved
const { error: alertUpdateErr } = await this.supabaseClient
.from("alert")
.update({
status: status,
updated_at: new Date().toISOString(),
})
.in("id", alertIds);
Update the alert history records, changing the status to resolved and adding the alert_end_time
const { error: alertHistoryUpdateErr } = await this.supabaseClient
.from("alert_history")
.update({
status: status,
alert_end_time: alertEndTime,
updated_at: new Date().toISOString(),
})
.in("alert_id", alertIds)
.eq("status", "triggered");
Send Emails
Send alert update emails to all users subscribed to the alert. For this, we use Resend.
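A rough sketch of what that looks like from the Worker, going straight at Resend's REST API; the sender address, subject, and body here are placeholders, not our actual templates.

// Sketch: notify subscribers through Resend's REST API.
async sendAlertEmail(alert: { name: string; emails: string[] }, status: string): Promise<void> {
  const response = await fetch("https://api.resend.com/emails", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${this.resendApiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      from: "Alerts <alerts@example.com>", // placeholder sender
      to: alert.emails,
      subject: `Alert ${status}: ${alert.name}`,
      html: `<p>Your alert <strong>${alert.name}</strong> is now <strong>${status}</strong>.</p>`,
    }),
  });

  if (!response.ok) {
    console.error(`Resend request failed: ${response.status} ${await response.text()}`);
  }
}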
Conclusion
That's it! A simple implementation, and surprisingly scalable. These Clickhouse aggregation queries are incredibly quick, consistently executing in under 0.05s for 5 to 10 minute windows. Running all alert checks asynchronously should not be an issue for a while.
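Concretely, "asynchronously" here just means the per-alert checks fan out concurrently inside checkAlerts, roughly like the sketch below (reusing the hypothetical fetchAlertState helper from earlier and the alerts fetched from Postgres).

// Sketch: run all alert checks concurrently rather than one at a time.
const stateUpdates = await Promise.all(
  alerts.map(async (alert) => {
    const alertState = await fetchAlertState(this.env, alert.org_id, alert.time_window);
    return this.getAlertStateUpdate(alert, alertState, Date.now());
  })
);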
What would I have done differently?
My first attempt at this was over-engineered: I went for push-based, real-time alerts using Cloudflare Durable Objects. After realizing the extra complexity was unnecessary and would only beget more complexity, I scrapped it and rewrote the feature in a day using Clickhouse. It was a painful lesson, but a useful one; I'll approach future projects with a different mindset.