DeepInterview

Web Crawler

Coding · 2025/11

Description

Company: Anthropic

Problem Statement

You are given:

  • A starting web address called startUrl.
  • A tool called HtmlParser that finds all links on a specific web page.

You need to build a web crawler. It must find all URLs that you can reach from the startUrl.

Important Rule: You must only keep URLs that belong to the same hostname as the startUrl. The order of the list does not matter.

Here is the interface for the HtmlParser:

interface HtmlParser {
    // Returns all URLs from a given page URL.
    public List<String> getUrls(String url);
}

Your function signature will look like this: List<String> crawl(String startUrl, HtmlParser htmlParser).

Core Rules

Your crawler must follow these steps:

  • Start at the startUrl.
  • Use HtmlParser.getUrls(url) to get all links from that page.
  • Avoid duplicates: Do not visit the same URL more than once.
  • Check the Hostname: Only follow links whose hostname matches the startUrl's (a minimal check is sketched right after this list).
  • Assume all links use http and do not use port numbers.
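
A minimal sketch of that hostname check, using the standard URL class (available in modern browsers and Node.js) and assuming every link is an absolute http URL as the problem states:

function sameHostname(startUrl, candidateUrl) {
    // Compare only the hostname part; path, query, and fragment are ignored.
    return new URL(startUrl).hostname === new URL(candidateUrl).hostname;
}

// sameHostname("http://news.yahoo.com/news", "http://news.yahoo.com/sports") -> true
// sameHostname("http://news.yahoo.com/news", "http://news.google.com")       -> false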

Things to Clarify

  • URL Fragments: Ask the interviewer about links with # (like http://example.com/page#section1). Should you treat them as the same page or a new page?
  • URL Normalization: Ask if you need to clean up or standardise the URLs. Usually, you can assume this is not needed for the basic part.

Part 2: Multithreading (Important!)

First, solve the problem using a single thread. Once that works, make a multithreaded or concurrent version to make it faster.
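
As a baseline, here is a minimal single-threaded sketch: an iterative BFS over a work queue. It assumes htmlParser.getUrls(url) returns a promise (like the mock parser later on this page) and skips fragment normalization for brevity.

async function crawlSingleThreaded(startUrl, htmlParser) {
    const startHostname = new URL(startUrl).hostname;
    const visited = new Set([startUrl]);
    const queue = [startUrl];

    // Fetch one page at a time, breadth-first.
    while (queue.length > 0) {
        const url = queue.shift();
        const links = await htmlParser.getUrls(url);
        for (const link of links) {
            // Keep only same-host links we have not seen yet.
            if (new URL(link).hostname === startHostname && !visited.has(link)) {
                visited.add(link);
                queue.push(link);
            }
        }
    }
    return [...visited];
}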

Goals

  • Run in Parallel: Download multiple pages at the same time.
  • Thread Safety: Make sure your data (like the visited list) doesn't get corrupted when many threads touch it at once.
  • Avoid Duplicates: Even with multiple threads, never visit a URL twice.
  • Check the Hostname: Keep following the hostname rule.

Hint for the Candidate

  • Use a Thread Pool to manage your threads.
  • Do not make a new thread for every single URL. With millions of URLs, that many threads will exhaust memory and crash the system.
  • A thread pool limits how many threads run at once.
  • Common setup: Use a fixed size (like 10-20 threads) and a queue for tasks.

Solution: JavaScript Implementation

Below is a full JavaScript solution. It limits how many pages are fetched at once with a promise-based task pool and normalizes URLs so the same page is not counted twice.

ThreadPool Class

class ThreadPool {
    constructor(maxConcurrency) {
        this.maxConcurrency = maxConcurrency;
        this.queue = [];
        this.runningTasks = 0;
        this.waitForFinishPromise = null;
        this.waitForFinishResolve = null;
    }

    // Add a task to the queue and start processing
    execute(task) {
        return new Promise((resolve, reject) => {
            this.queue.push({
                task,
                resolve,
                reject
            });
            this._processQueue();
        });
    }

    _processQueue() {
        // Stop if we reached the max number of running tasks
        if (this.runningTasks >= this.maxConcurrency) {
            return;
        }

        // If queue is empty, check if all work is fully done
        if (this.queue.length === 0) {
            if (this.runningTasks === 0) {
                this.waitForFinishResolve && this.waitForFinishResolve();
                this.waitForFinishPromise = null;
                this.waitForFinishResolve = null;
            }
            return;
        }

        // Run the next task from the queue
        const { task, resolve, reject } = this.queue.shift();
        this.runningTasks++;
        Promise.resolve()
            .then(task)
            .then(resolve, reject)
            .finally(() => {
                this.runningTasks--;
                this._processQueue();
            });
    }

    // Returns a promise that finishes when all tasks are done
    waitForFinish() {
        if (this.runningTasks === 0 && this.queue.length === 0) {
            return Promise.resolve();
        }

        if (!this.waitForFinishPromise) {
            this.waitForFinishPromise = new Promise((resolve) => {
                this.waitForFinishResolve = resolve;
            });
        }
        return this.waitForFinishPromise;
    }
}
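
A quick standalone usage example of the pool (the task count and delays are arbitrary, just to show that at most two tasks are in flight at once; run it as an ES module because of the top-level await):

const pool = new ThreadPool(2);
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (let i = 1; i <= 5; i++) {
    // Tasks 3-5 sit in the queue until a slot frees up.
    pool.execute(async () => {
        console.log(`task ${i} started`);
        await sleep(100);
        console.log(`task ${i} finished`);
    });
}

await pool.waitForFinish();
console.log("all tasks done");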

URL Helper Function

function getProcessedUrlObject(url) {
    const newUrl = new URL(url);
    // Remove the hash (#) part so http://example.com/page#1
    // and http://example.com/page#2 are seen as the same URL
    newUrl.hash = "";
    return newUrl;
}
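
For example, fragments are stripped so two anchors on the same page collapse to one URL; note that the URL class also does some normalization of its own, such as adding a trailing slash to a bare hostname:

getProcessedUrlObject("http://example.com/page#section1").toString(); // "http://example.com/page"
getProcessedUrlObject("http://example.com/page#section2").toString(); // "http://example.com/page"
getProcessedUrlObject("http://example.com").toString();               // "http://example.com/"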

Main Crawler Logic

async function crawl(startUrl, htmlParser, maxConcurrency = 10) {
    const processedStartUrlObject = getProcessedUrlObject(startUrl);
    const startHostName = processedStartUrlObject.hostname;

    const result = [processedStartUrlObject.toString()];
    const visited = new Set([processedStartUrlObject.toString()]);
    const pool = new ThreadPool(maxConcurrency);

    const crawlUrl = async (url) => {
        try {
            const nextUrls = await htmlParser.getUrls(url);
            for (const nextUrl of nextUrls) {
                const nextUrlProcessedObject = getProcessedUrlObject(nextUrl);
                const nextUrlProcessedString = nextUrlProcessedObject.toString();

                // Only crawl if hostname matches and we haven't visited it yet
                if (nextUrlProcessedObject.hostname === startHostName
                    && !visited.has(nextUrlProcessedString)) {
                    visited.add(nextUrlProcessedString);
                    result.push(nextUrlProcessedString);
                    // Add to the pool (fire and forget)
                    pool.execute(() => crawlUrl(nextUrl)).catch((err) => {
                        console.error(`Error crawling ${nextUrl}:`, err);
                    });
                }
            }
        } catch (error) {
            console.error(`Error fetching URLs from ${url}:`, error);
        }
    };

    // Start crawling from the first URL
    await pool.execute(() => crawlUrl(startUrl));
    // Wait for everything to finish
    await pool.waitForFinish();

    return result;
}

How to Run the Code

// Example with a fake HtmlParser
class HtmlParser {
    constructor(urls) {
        this.urls = urls;
    }

    async getUrls(url) {
        // Fake a network delay
        await new Promise(resolve => setTimeout(resolve, Math.random() * 15));
        return this.urls[url] || [];
    }
}

const urls = {
    "http://news.yahoo.com": [
        "http://news.yahoo.com/news/topics/",
        "http://news.yahoo.com/news"
    ],
    "http://news.yahoo.com/news/topics/": [
        "http://news.yahoo.com/news",
        "http://news.yahoo.com/news/sports"
    ],
    "http://news.yahoo.com/news": [
        "http://news.google.com"
    ],
    "http://news.yahoo.com/news/sports": []
};

const parser = new HtmlParser(urls);
// Note: top-level await requires an ES module (e.g. a .mjs file or "type": "module" in Node)
const result = await crawl("http://news.yahoo.com", parser, 2);
console.log("Crawled URLs:", result);
// Output contains only news.yahoo.com URLs (no google.com). Note that the URL class
// normalizes the start URL to "http://news.yahoo.com/" (it adds a trailing slash).

Why We Built It This Way

  • ThreadPool: We made a custom queue. This lets us limit how many downloads run at once so we don't overwhelm the system.
  • URL Normalization: We remove the # part of URLs. This ensures we don't treat the same page as two different links.
  • Thread Safety: We use a Set to track visited pages. Because the JavaScript event loop runs our callbacks one at a time, the check-and-add on the Set never interleaves, so there are no race conditions here.
  • Error Handling: If one link fails, we log the error, but the crawler keeps going.
  • Concurrency Control: We limit the number of parallel requests to be polite to the server.

System Design Questions

The interviewer may ask these questions verbally. You usually don't need to code these, but you should know how to explain them.

1. Threads vs Processes

Question: What is the difference between a thread and a process? Which one is better for a web crawler?

Answer Strategy:

  • Threads: These are "lightweight." They share memory. They are great for tasks that wait a lot (like waiting for a website to load). This is called I/O-bound.
  • Processes: These are "heavy." They have their own separate memory. They are better for tasks that do heavy math (CPU-bound).
  • For Crawling: Use Threads. Crawling is mostly waiting for the internet, and threads are more efficient for that.

2. Scaling to Many Machines

Question: If we have millions of URLs, one computer isn't enough. How do we build a distributed system?

Answer Strategy:

  • URL Distribution: How do we split the work? We can use Consistent Hashing on the hostname to decide which machine crawls which website (a simplified sketch follows this list).
  • Coordination: Machines need to talk to check for duplicates. We can use a shared cache like Redis.
  • Load Balancing: We need to make sure every machine has roughly the same amount of work.
  • Fault Tolerance: If a machine crashes, another machine needs to pick up its work.
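
A simplified sketch of the hostname-based partitioning mentioned in the URL Distribution bullet: a toy hash ring with a basic string hash. A real consistent-hashing setup would add virtual nodes and handle machines joining or leaving.

function hashString(value) {
    // Simple 32-bit string hash; stands in for MD5/SHA in this sketch.
    let hash = 0;
    for (let i = 0; i < value.length; i++) {
        hash = (hash * 31 + value.charCodeAt(i)) >>> 0;
    }
    return hash;
}

function buildRing(nodes) {
    // Place each crawler node at a position on the ring, sorted clockwise.
    return nodes
        .map((node) => ({ node, position: hashString(node) }))
        .sort((a, b) => a.position - b.position);
}

function pickNode(ring, hostname) {
    // Walk clockwise to the first node at or after this hostname's position (wrap to the start).
    const position = hashString(hostname);
    const entry = ring.find((e) => e.position >= position) || ring[0];
    return entry.node;
}

const ring = buildRing(["crawler-1", "crawler-2", "crawler-3"]);
console.log(pickNode(ring, "news.yahoo.com")); // the same host always maps to the same node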

3. Being "Polite" (Rate Limiting)

Question: If we crawl too fast, we might crash the website. How do we prevent this?

Answer Strategy:

  • robots.txt: Always check this file first. It tells us what we are allowed to crawl.
  • Rate limiting: Set a per-host limit. For example, "only 1 request per second for yahoo.com" (a sketch follows this list).
  • Throttling: If the server starts responding slowly, our crawler should slow down automatically.
  • Distributed Control: If multiple machines are crawling the same site, they need to coordinate so they don't attack the site together.
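
A minimal sketch of the per-host rate limit from the list above: enforce a minimum gap between requests to the same hostname (the 1000 ms figure matches the "1 request per second" example).

class PerHostRateLimiter {
    constructor(minIntervalMs = 1000) {
        this.minIntervalMs = minIntervalMs;
        this.nextAllowedTime = new Map(); // hostname -> earliest time for the next request
    }

    // Resolves once it is this hostname's turn; call it right before fetching a page.
    async waitForTurn(hostname) {
        const now = Date.now();
        const earliest = this.nextAllowedTime.get(hostname) ?? now;
        const scheduled = Math.max(now, earliest);
        this.nextAllowedTime.set(hostname, scheduled + this.minIntervalMs);
        if (scheduled > now) {
            await new Promise((resolve) => setTimeout(resolve, scheduled - now));
        }
    }
}

// Hypothetical usage inside crawlUrl:
//   await limiter.waitForTurn(new URL(url).hostname);
const limiter = new PerHostRateLimiter(1000);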

4. Handling Duplicate Pages

Question: Many URLs point to the exact same content. How do we detect this so we don't waste space?

Answer Strategy:

  • Fingerprinting: Create a hash (like MD5 or SHA-256) of the page content. If the hash matches one we already have, it's a duplicate (a sketch follows this list).
  • URL Normalization: Clean up the URL text (remove tracking IDs, sort parameters) before fetching.
  • Similarity Check: Use algorithms like Simhash or Jaccard similarity to find pages that are mostly the same, even if a few words changed.
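
A minimal sketch of the fingerprinting idea, using the Web Crypto API's SHA-256 (available in browsers and Node 18+). The dedupe store is just an in-memory Set here; a distributed crawler would use a shared store such as Redis.

const seenFingerprints = new Set();

async function fingerprint(content) {
    // SHA-256 of the page body, rendered as a hex string.
    const bytes = new TextEncoder().encode(content);
    const digest = await crypto.subtle.digest("SHA-256", bytes);
    return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
}

async function isDuplicateContent(content) {
    const hash = await fingerprint(content);
    if (seenFingerprints.has(hash)) {
        return true; // exact same bytes seen before
    }
    seenFingerprints.add(hash);
    return false;
}

// await isDuplicateContent("<html>same page</html>"); // false the first time
// await isDuplicateContent("<html>same page</html>"); // true the second time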