Description
Company: OpenAI
Date: 25 days ago
Project Requirements
Design a highly scalable webhook delivery system that allows users to register callback URLs for specific events.
Key constraints:
- Scale: 1 billion events per day (~11,600 events/sec on average; plan for peaks well above that).
- Reliability: Must deliver messages successfully and retry if delivery fails.
- Features: Security, Monitoring (observability), Error handling.
Candidate Interview Story
"I was asked to build a Webhook service. Users register a URL and an eventId. When that eventId triggers, the system calls the registered URL. I could assume that one eventId triggers exactly one URL. The interviewer asked about the REST API, how to use caching, how to design the database, and specifically how to handle failures using a message queue."
Key Topics Discussed
- REST API Design: Endpoints for registering (POST /webhooks) and viewing webhooks.
- Caching Strategy: Caching active webhooks (config data) to reduce DB load. Trade-offs between consistency and speed.
- Database Design: Schema for registrations and delivery logs. Indexing for speed.
- Failure and Retry Logic:
- Using message queues (SQS/Kafka).
- Exponential Backoff: Waiting longer between retries.
- Dead Letter Queue (DLQ): Handling permanently failed messages.
- Retry Storms: Preventing system crashes during outages.
What OpenAI Usually Asks
- Real Implementation Details: Define actual URLs, specific DB columns, and specific Queue features (visibility timeout).
- Handling Failures: Detailed plan for when user servers are down (Backoff, DLQ, Isolation).
- Speed and Performance: Caching strategies for reading configuration.
- Massive Scale: Database Sharding and Horizontal Scaling for 1B events/day.
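To make the sharding point concrete, here is a minimal sketch of partitioning by event_id with a hash-mod scheme. The shard count and function names are illustrative, not part of the original spec:

```python
import hashlib

NUM_SHARDS = 16  # illustrative shard count


def shard_for(event_id: str) -> int:
    """Map an event_id to a shard by hashing, so load spreads evenly
    and all traffic for one event_id stays on the same shard."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Same event_id always routes to the same shard.
assert shard_for("user.created") == shard_for("user.created")
```

A real deployment would more likely use consistent hashing so that adding shards does not reshuffle every key, but hash-mod is the simplest version to state in an interview.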
Step-by-Step Design Plan
Phase 1: Clarify Requirements
- One eventId -> One callback.
- Scale: 1B/day.
- Retries required.
Phase 2: High-Level Design
API → Database ← Workers ← Message Queue
- Registration: User POSTs to API -> Save to DB.
- Event Flow: Event happens -> Find webhook -> Send to Queue -> Worker delivers -> Log result.
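The two flows above can be sketched end to end with an in-memory deque standing in for SQS/Kafka and a dict standing in for the database; every name here is illustrative:

```python
from collections import deque

# In-memory stand-ins for the DB, the queue, and the delivery log (illustrative).
webhooks = {"user.created": "https://example.com/hook"}  # event_id -> callback_url
queue: deque = deque()
delivery_log: list[dict] = []


def on_event(event_id: str, payload: dict) -> None:
    """Event happens -> find webhook -> enqueue a delivery job."""
    url = webhooks.get(event_id)
    if url is not None:
        queue.append({"url": url, "payload": payload})


def worker_step(send) -> None:
    """Worker pops one job, attempts delivery via `send`, logs the result."""
    if not queue:
        return
    job = queue.popleft()
    status = "success" if send(job["url"], job["payload"]) else "failed"
    delivery_log.append({"url": job["url"], "status": status})


on_event("user.created", {"id": 42})
worker_step(lambda url, body: True)  # fake HTTP send that always succeeds
print(delivery_log)  # -> [{'url': 'https://example.com/hook', 'status': 'success'}]
```

The same shape survives the swap to real infrastructure: `queue` becomes SQS/Kafka, `webhooks` becomes the cached DB lookup, and `send` becomes an HTTP POST with a timeout.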
Phase 3: Deep Dive Implementation
REST API Design
POST /webhooks
Body: { "event_id": "user.created", "callback_url": "https://...", "headers": {...} }
GET /webhooks/:webhook_id
Response: Webhook configuration
GET /webhooks/:webhook_id/deliveries?status=failed&limit=50
Response: Paginated delivery history
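A registration endpoint should validate its body before writing anything. Here is a hedged sketch of validating the POST /webhooks body shown above; the https-only rule is an assumption this sketch adds for delivery security, not something the original spec states:

```python
from urllib.parse import urlparse


def validate_registration(body: dict) -> dict:
    """Validate a POST /webhooks body (fields as in the spec above).
    Raises ValueError on a bad request; returns the cleaned body."""
    for field in ("event_id", "callback_url"):
        if not body.get(field):
            raise ValueError(f"missing required field: {field}")
    parsed = urlparse(body["callback_url"])
    if parsed.scheme != "https":  # assumed rule: require TLS on callbacks
        raise ValueError("callback_url must be https")
    body.setdefault("headers", {})  # custom headers are optional
    return body


ok = validate_registration({"event_id": "user.created",
                            "callback_url": "https://example.com/hook"})
print(ok["headers"])  # -> {}
```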
Database Schema
CREATE TABLE webhooks (
webhook_id UUID PRIMARY KEY,
user_id UUID NOT NULL,
event_id VARCHAR NOT NULL,
callback_url TEXT NOT NULL,
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP,
UNIQUE(user_id, event_id)
);
CREATE INDEX idx_event_active ON webhooks(event_id, is_active);
CREATE TABLE webhook_deliveries (
delivery_id UUID PRIMARY KEY,
webhook_id UUID REFERENCES webhooks,
status VARCHAR, -- pending, success, failed, retrying
attempt_count INT,
next_retry_at TIMESTAMP,
response_code INT,
error_message TEXT,
created_at TIMESTAMP
);
Caching Strategy
- Cache webhook settings using event_id as key in Redis.
- TTL: 5 minutes.
- Invalidate cache on update/delete.
Retry Logic (Critical)
- Queue: SQS (Visibility Timeout) or Kafka.
- Backoff: 1 min -> 2 min -> 4 min.
- DLQ: Move after 5-10 failed attempts.
- Isolation: Ensure one bad endpoint doesn't backlog the entire queue (Tenant isolation or separate queues for slow consumers).
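The backoff and DLQ rules above can be stated as a few lines of code. This sketch adds full jitter on top of the 1 → 2 → 4 minute doubling (an assumption, chosen to blunt retry storms) and picks 8 as the cutoff within the 5-10 range:

```python
import random

BASE_DELAY_S = 60   # first retry after 1 minute, doubling thereafter
MAX_ATTEMPTS = 8    # assumed cutoff within the 5-10 range before the DLQ


def next_delay(attempt: int) -> float:
    """Exponential backoff with full jitter: up to 1 min, 2 min, 4 min, ...
    Jitter spreads retries so a recovering endpoint isn't hit by a storm."""
    delay = BASE_DELAY_S * (2 ** (attempt - 1))
    return random.uniform(0, delay)


def handle_failure(attempt: int) -> str:
    """Decide whether to retry or dead-letter after a failed delivery."""
    return "retry" if attempt < MAX_ATTEMPTS else "dead_letter"


assert handle_failure(1) == "retry"
assert handle_failure(8) == "dead_letter"
```

With SQS, `next_delay` maps naturally onto re-enqueueing with a per-message delay (or extending the visibility timeout), and the `dead_letter` branch maps onto the queue's redrive policy.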
Traps to Avoid
- Vague API: Define exact endpoints and payloads.
- Lazy DB: Write out schemas and indexes.
- Basic Retry: Explain how the queue handles delays (visibility timeout).
- Ignoring Scale: Mention sharding and horizontal scaling immediately.