DeepInterview

Webhook Delivery System

System Design · 2025/11

Description

Company: OpenAI · Date: 25 days ago

Project Requirements

Design a highly scalable webhook delivery system that allows users to register callback URLs for specific events.

Key constraints:

  • Scale: 1 billion events per day (~11,600 events/sec on average).
  • Reliability: Must deliver messages at least once, retrying automatically when delivery fails.
  • Features: Security, Monitoring (observability), Error handling.

Candidate Interview Story

"I was asked to build a Webhook service. Users register a URL and an eventId. When that eventId triggers, the system calls the registered URL. I could assume that one eventId triggers exactly one URL. The interviewer asked about the REST API, how to use caching, how to design the database, and specifically how to handle failures using a message queue."

Key Topics Discussed

  1. REST API Design: Endpoints for registering (POST /webhooks) and viewing webhooks.
  2. Caching Strategy: Caching active webhooks (config data) to reduce DB load. Trade-offs between consistency and speed.
  3. Database Design: Schema for registrations and delivery logs. Indexing for speed.
  4. Failure and Retry Logic:
    • Using message queues (SQS/Kafka).
    • Exponential Backoff: Waiting longer between retries.
    • Dead Letter Queue (DLQ): Handling permanently failed messages.
    • Retry Storms: Preventing system crashes during outages.

What OpenAI Usually Asks

  1. Real Implementation Details: Define actual URLs, specific DB columns, and specific Queue features (visibility timeout).
  2. Handling Failures: Detailed plan for when user servers are down (Backoff, DLQ, Isolation).
  3. Speed and Performance: Caching strategies for reading configuration.
  4. Massive Scale: Database Sharding and Horizontal Scaling for 1B events/day.

Step-by-Step Design Plan

Phase 1: Clarify Requirements

  • One eventId -> One callback.
  • Scale: 1B/day.
  • Retries required.

Phase 2: High-Level Design

Core components: API, Database, Workers, Message Queue.

  • Registration: User POSTs to API -> Save to DB.
  • Event Flow: Event happens -> Find webhook -> Send to Queue -> Worker delivers -> Log result.
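The event flow above can be sketched end to end. This is a minimal in-process sketch: the dict registry and `queue.Queue` stand in for the real database/cache and SQS/Kafka, and all names (`webhook_registry`, `delivery_queue`, `dispatch_event`) are illustrative.

```python
import json
import queue
import uuid

# In-memory stand-ins for the real components (DB/cache and SQS/Kafka).
webhook_registry = {
    "user.created": {"webhook_id": str(uuid.uuid4()),
                     "callback_url": "https://example.com/hook"}
}
delivery_queue = queue.Queue()

def dispatch_event(event_id: str, payload: dict) -> bool:
    """Event happens -> find webhook -> enqueue a delivery job for a worker."""
    webhook = webhook_registry.get(event_id)  # one eventId -> one callback
    if webhook is None:
        return False                          # no registration for this event
    delivery_queue.put(json.dumps({
        "delivery_id": str(uuid.uuid4()),
        "webhook_id": webhook["webhook_id"],
        "callback_url": webhook["callback_url"],
        "payload": payload,
        "attempt_count": 0,
    }))
    return True
```

A worker would then consume `delivery_queue`, POST to the callback URL, and log the result to the deliveries table.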

Phase 3: Deep Dive Implementation

REST API Design

POST /webhooks
Body: { "event_id": "user.created", "callback_url": "https://...", "headers": {...} }

GET /webhooks/:webhook_id
Response: Webhook configuration

GET /webhooks/:webhook_id/deliveries?status=failed&limit=50
Response: Paginated delivery history
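A minimal sketch of the registration endpoint's validation logic, assuming the request body shown above. The function name, the HTTPS-only rule, and the returned fields are illustrative choices; persistence is elided.

```python
import uuid
from urllib.parse import urlparse

def register_webhook(body: dict) -> tuple[int, dict]:
    """Minimal POST /webhooks handler: validate the payload, return (status, response)."""
    event_id = body.get("event_id")
    callback_url = body.get("callback_url")
    if not event_id or not callback_url:
        return 400, {"error": "event_id and callback_url are required"}
    if urlparse(callback_url).scheme != "https":  # require TLS for callbacks
        return 400, {"error": "callback_url must use https"}
    # Real implementation would INSERT into the webhooks table here.
    return 201, {"webhook_id": str(uuid.uuid4()), "event_id": event_id,
                 "callback_url": callback_url, "is_active": True}
```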

Database Schema

CREATE TABLE webhooks (
  webhook_id UUID PRIMARY KEY,
  user_id UUID NOT NULL,
  event_id VARCHAR NOT NULL,
  callback_url TEXT NOT NULL,
  is_active BOOLEAN DEFAULT true,
  created_at TIMESTAMP,
  UNIQUE(user_id, event_id)
);

CREATE INDEX idx_event_active ON webhooks(event_id, is_active);

CREATE TABLE webhook_deliveries (
  delivery_id UUID PRIMARY KEY,
  webhook_id UUID REFERENCES webhooks(webhook_id),
  status VARCHAR, -- pending, success, failed, retrying
  attempt_count INT,
  next_retry_at TIMESTAMP,
  response_code INT,
  error_message TEXT,
  created_at TIMESTAMP
);
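To show how the deliveries table supports the retry path, here is a sketch of the scan a retry worker might run, using an in-memory SQLite stand-in (UUIDs and timestamps stored as TEXT). The `(status, next_retry_at)` index is an assumption added to keep that scan cheap; it is not part of the schema above.

```python
import sqlite3
from datetime import datetime, timedelta

# SQLite stand-in for the webhook_deliveries schema above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE webhook_deliveries (
        delivery_id TEXT PRIMARY KEY,
        webhook_id TEXT,
        status TEXT,
        attempt_count INTEGER,
        next_retry_at TEXT
    )""")
# Assumed index so the due-retry scan doesn't walk the whole table.
conn.execute("CREATE INDEX idx_retry ON webhook_deliveries(status, next_retry_at)")

now = datetime(2025, 11, 1, 12, 0, 0)
conn.executemany("INSERT INTO webhook_deliveries VALUES (?, ?, ?, ?, ?)", [
    ("d1", "w1", "retrying", 2, (now - timedelta(minutes=1)).isoformat()),  # due
    ("d2", "w1", "retrying", 1, (now + timedelta(minutes=5)).isoformat()),  # not yet
    ("d3", "w2", "success", 1, None),
])

def due_retries(conn, now):
    """Fetch deliveries whose backoff window has elapsed, oldest first."""
    cur = conn.execute(
        "SELECT delivery_id FROM webhook_deliveries "
        "WHERE status = 'retrying' AND next_retry_at <= ? "
        "ORDER BY next_retry_at LIMIT 100", (now.isoformat(),))
    return [r[0] for r in cur.fetchall()]
```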

Caching Strategy

  • Cache webhook settings using event_id as key in Redis.
  • TTL: 5 minutes.
  • Invalidate cache on update/delete.

Retry Logic (Critical)

  • Queue: SQS (Visibility Timeout) or Kafka.
  • Backoff: 1 min -> 2 min -> 4 min.
  • DLQ: Move after 5-10 failed attempts.
  • Isolation: Ensure one bad endpoint doesn't backlog the entire queue (Tenant isolation or separate queues for slow consumers).
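The backoff and DLQ rules above can be sketched as a single failure handler. The constants mirror the bullets (1-minute base delay doubling per attempt, DLQ after 5 attempts); full jitter is an assumed refinement that spreads retries out to blunt retry storms.

```python
import random

BASE_DELAY_SECONDS = 60      # 1 min, doubling each attempt (1 -> 2 -> 4 min)
MAX_ATTEMPTS = 5             # after this, route to the DLQ

def next_retry_delay(attempt: int) -> float:
    """Exponential backoff with full jitter: pick uniformly below the cap
    so thousands of failed deliveries don't all retry at the same instant."""
    cap = BASE_DELAY_SECONDS * (2 ** (attempt - 1))
    return random.uniform(0, cap)

def handle_failure(job: dict) -> str:
    """Decide what to do with a failed delivery: schedule a retry or dead-letter it."""
    job["attempt_count"] += 1
    if job["attempt_count"] >= MAX_ATTEMPTS:
        return "dlq"         # permanently failed; park for inspection/replay
    job["retry_in"] = next_retry_delay(job["attempt_count"])
    return "retry"
```

With SQS, `retry_in` would map to the message's visibility timeout (or a delayed re-enqueue) rather than an in-process timer.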

Traps to Avoid

  • Vague API: Define exact endpoints and payloads.
  • Lazy DB: Write out schemas and indexes.
  • Basic Retry: Explain how the queue handles delays (visibility timeout).
  • Ignoring Scale: Mention sharding and horizontal scaling immediately.
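The sharding point above can be sketched as a stable hash-based shard choice, so a webhook's rows always land on the same shard. This is a minimal sketch; `NUM_SHARDS` and the function name are assumptions, and real systems often prefer consistent hashing to ease resharding.

```python
import hashlib

NUM_SHARDS = 16  # illustrative; chosen for headroom in a real deployment

def shard_for(webhook_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a webhook_id to a shard deterministically via a stable hash.
    Using sha256 (not Python's hash()) keeps the mapping stable across processes."""
    digest = hashlib.sha256(webhook_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```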