The data pipeline — Common Crawl to training tokens

Section 16.2

For a long time the LLM community treated data as the boring part; the interesting stuff was the architecture, the optimizer, the scaling laws. That was wrong. After parameter count, the single biggest predictor of an LLM’s quality is the quality of its pretraining corpus. A 7B model trained on FineWeb (Allal 2024) beats a 13B model trained on raw Common Crawl. The data pipeline that turns 10 PB of internet sludge into a few trillion high-quality tokens is now arguably the most important engineering capability inside frontier labs. This section walks through the five stages (URL filtering, language ID, content extraction, quality classification, deduplication) and the empirical evolution from C4 → RefinedWeb → FineWeb.

Common Crawl as the universal raw material

Common Crawl is a public, free archive of the internet. Each crawl (released monthly) is ~250-300 TB compressed and covers ~3 billion web pages. The cumulative archive over 12 years is ~10 PB raw. Every major LLM’s pretraining corpus starts here.

Common Crawl is the floor. Building a usable training corpus from it takes a five-stage pipeline that throws away 99%+ of the raw input.

Pipeline volumes (FineWeb example, ~2024):

  Stage 0: Common Crawl (90 dumps merged)          ~96 trillion tokens
  Stage 1: URL filter (block + allow lists)        removes ~15% (adult, malware, ML-tainted)
  Stage 2: language ID (English only)              keeps ~50% (English-only filter)
  Stage 3: trafilatura content extraction          HTML → clean text, lossy
  Stage 4: quality classifier + heuristic rules    keeps ~25% (filters spam, gibberish, low-info)
  Stage 5: deduplication (MinHash LSH near-dup)    removes 30-60% (huge yield)
  Final:                                           ~15 trillion English tokens

Llama 3 used a 15T-token dataset built this way (though Meta's exact pipeline is private).

Stage by stage

Stage 1 — URL filtering. A massive blocklist and a smaller allowlist. The blocklist removes adult content (Common Crawl’s “adult URL list” is ~50M URLs), known malware hosts, mirror-of-mirrors, and increasingly ML-tainted content (sites known to be largely AI-generated like content-farm SEO sites). The allowlist boosts trusted sources (Wikipedia, arxiv, Stack Exchange, official government documents).
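A minimal sketch of the blocklist/allowlist check, assuming domain-level lists. The domain names, the `classify_url` function, and the three return labels are all hypothetical; real pipelines load far larger lists from files and may also match on URL paths or keywords.

```python
from urllib.parse import urlsplit

# Hypothetical example lists; real pipelines load millions of entries from files.
BLOCKED_DOMAINS = {"spam-farm.example", "malware-host.example"}
TRUSTED_DOMAINS = {"wikipedia.org", "arxiv.org", "stackexchange.com"}


def _domain_candidates(host):
    """Yield the host and each parent domain: a.b.c -> a.b.c, b.c, c."""
    parts = host.lower().split(".")
    for i in range(len(parts)):
        yield ".".join(parts[i:])


def classify_url(url):
    """Return 'block', 'trusted', or 'keep' for a URL (blocklist wins over allowlist)."""
    host = urlsplit(url).hostname or ""
    candidates = list(_domain_candidates(host))
    if any(d in BLOCKED_DOMAINS for d in candidates):
        return "block"
    if any(d in TRUSTED_DOMAINS for d in candidates):
        return "trusted"
    return "keep"
```

Matching on parent domains means `en.wikipedia.org` inherits the status of `wikipedia.org`, which is usually what a domain-level list intends.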

Stage 2 — Language identification. fastText’s lang-ID model (Joulin 2016) is the workhorse. It computes a language probability vector from character n-grams in ~100 microseconds per document. English-only training corpora keep documents with P(en) > 0.65 (a conservative cutoff that throws away ambiguous content).
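The threshold rule can be sketched as below. `keep_english` and `make_fasttext_predictor` are hypothetical helper names. The wrapper assumes the `fasttext` Python package is installed and that `model_path` points at a downloaded language-ID model file; the filter itself accepts any callable, so it can be exercised with a stub.

```python
def keep_english(text, predict, threshold=0.65):
    """Keep a document if the top predicted language is English above the threshold.

    `predict` is any callable mapping a string to (language_code, probability).
    """
    # fastText predicts one line at a time, so collapse newlines first.
    label, prob = predict(text.replace("\n", " "))
    return label == "en" and prob > threshold


def make_fasttext_predictor(model_path):
    """Wrap a fastText language-ID model (assumption: `fasttext` is installed
    and `model_path` is a local copy of a language-ID model)."""
    import fasttext  # imported lazily so the filter above runs without it

    model = fasttext.load_model(model_path)

    def predict(text):
        labels, probs = model.predict(text, k=1)
        # Labels look like "__label__en"; strip the prefix.
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict
```

Passing the predictor in as a parameter keeps the threshold logic separate from the model, which makes the cutoff easy to tune and test.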

Stage 3 — Content extraction. HTML pages are 70-95% navigation, ads, headers, JavaScript, and CSS, and only 5-30% actual content. Extractors like trafilatura (the standard since 2023) strip the boilerplate using DOM-tree heuristics + content-length scoring. The output is plain text with paragraph boundaries preserved. A 100 KB HTML page typically becomes 5-15 KB of text.

trafilatura: an open-source HTML-to-text extraction library specifically tuned for high-quality content extraction at web scale. It uses DOM-tree heuristics + readability-style scoring + dropdown-aware boilerplate removal, and is the de facto standard in modern LLM data pipelines (FineWeb and FineWeb-Edu both use it). It is reported to be ~5× faster and 15-30% more accurate than the standard 'beautifulsoup + readability' baseline.
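To make the idea of boilerplate stripping concrete, here is a toy stand-in built on the standard library's `html.parser`. It is not trafilatura and is far cruder: it simply drops the contents of a hand-picked set of tags (an assumption made for illustration) and keeps paragraph breaks. A real pipeline would call a dedicated extractor such as `trafilatura.extract` instead.

```python
from html.parser import HTMLParser

# Illustrative tag sets; a real extractor uses much richer DOM heuristics.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "br"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return visible text, one line per block element, boilerplate tags dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).split("\n"))
    return "\n".join(line for line in lines if line)
```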

Stage 4 — Quality classification. Two layers:

Heuristic rules: minimum word count, maximum repeated n-gram ratio (catches transcript spam), proper sentence/punctuation density (filters bad OCR), profanity/toxicity scoring.
Learned quality classifier: a small (~100M param) BERT-style model trained on positive examples (Wikipedia + arxiv) vs negative examples (Common Crawl pages flagged by obvious-spam rules). It outputs a quality score per document; keep the top 25-50%.
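The heuristic layer can be sketched in a few lines. The function names and every threshold below (`min_words`, `max_repeat_ratio`, `min_terminal_ratio`) are illustrative assumptions, not values taken from any published pipeline, and the profanity/toxicity check is omitted.

```python
from collections import Counter


def repeated_ngram_ratio(words, n=3):
    """Fraction of n-gram occurrences that belong to an n-gram seen more than once."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def passes_heuristics(text, min_words=50, max_repeat_ratio=0.3, min_terminal_ratio=0.5):
    """Apply three cheap rules: length, n-gram repetition, sentence-ending punctuation."""
    words = text.split()
    if len(words) < min_words:
        return False
    if repeated_ngram_ratio(words) > max_repeat_ratio:
        return False
    # Share of non-empty lines that end like a sentence.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    terminal = sum(ln.endswith((".", "!", "?", '"')) for ln in lines)
    return terminal / len(lines) >= min_terminal_ratio
```

Rules like these are cheap enough to run on every document, so they typically run before the learned classifier and spare it the obvious junk.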

Stage 5 — Deduplication. This is the surprising one — the biggest single yield bump, and the technique that distinguishes serious data pipelines from amateur ones.

Deduplication via MinHash LSH (locality-sensitive hashing):

1. For each document, compute a MinHash signature:
   - tokenise into shingles (sliding 5-grams)
   - hash each shingle with K=128 independent hash functions
   - for each hash function, store the MIN of its values across all shingles
   Result: 128 hashes per document = the signature.
2. The Jaccard similarity of two documents ≈ the fraction of MinHash entries that agree. For documents with ≥0.8 Jaccard similarity (substantial overlap), the MinHash signatures agree on a predictable fraction of indices.
3. LSH: split the signature into B=20 bands of R=6 hashes each (20 × 6 = 120, so this banding uses 120 of the K=128 entries). Documents that match on ALL 6 hashes of ANY one band go into the same bucket. The math: P(match | similarity s) = 1 - (1 - s^R)^B, an S-shaped curve in s.
4. Within each bucket: compare all pairs; keep one representative per near-duplicate cluster.

Cost: O(N) for hashing; roughly O(N log N) for grouping documents by band key (for example, by sorting).

Result: ~30-60% of Common Crawl documents are near-duplicates of others. Removing duplicates SHRINKS the corpus and IMPROVES model quality.
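Steps 1 and 2 can be sketched as follows. This assumes word-level 5-gram shingles and emulates the K independent hash functions by salting a single BLAKE2b hash with the function index; both are illustrative choices, and the function names are hypothetical.

```python
import hashlib


def shingles(text, n=5):
    """Set of word-level n-gram shingles (short texts yield one shingle)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed acts as a salt to emulate hash function #seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, k=128):
    """For each of k hash functions, keep the minimum hash over all shingles."""
    if not shingle_set:
        raise ValueError("cannot compute a signature for an empty shingle set")
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(k)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison on small inputs."""
    return len(set_a & set_b) / len(set_a | set_b)
```

With K = 128 the estimate is noisy by a few percentage points, which is acceptable because it is only used to decide whether two documents are near-duplicates, not to measure similarity precisely.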

Lee 2022 “Deduplicating Training Data Makes Language Models Better” showed empirically that deduplication helps even when keeping training compute constant — the model learns more per token from a deduplicated corpus because it isn’t re-seeing the same content. This was a revelation; before 2022 most pretraining pipelines just kept everything.

Stage 1 — URL filter: blocklist (adult, malware, ML-tainted) + allowlist (Wikipedia, arxiv, etc.). Throws away ~15% by URL match.

Yield: ~80T tokens.

Stage 2 — Language ID: fastText classifier keeps English-only (P(en) > 0.65). Throws away ~50% (non-English content).

Yield: ~40T tokens.

Stage 3 — Content extraction: trafilatura strips HTML boilerplate. Reduces token count by ~75% (most HTML is navigation/ads/CSS).

Yield: ~40T → ~10T tokens. Token counts before this stage are loose, since raw WARC HTML is not really text yet; extraction shrinks raw HTML to clean text by roughly 5×, and in pipeline order it precedes some of the filtering steps.

Stage 4 — Quality classification: heuristic rules (n-gram repetition, sentence density) + learned quality classifier (kept if score > threshold). Throws away ~50-75% of remaining content.

Yield: ~10T → ~3-5T tokens.

Stage 5 — Deduplication (MinHash LSH): finds near-duplicates (Jaccard ≥ 0.8) and keeps one representative per cluster. Removes ~30-60% of remaining content.

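Yield: ~15T final tokens. The rough per-stage percentages in this walkthrough multiply out to less than that, so treat them as order-of-magnitude illustrations rather than exact accounting.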

The point: the pipeline is mostly subtraction. Going from 96T → 15T is throwing away 84% of the input as “not worth training on.” The quality of what remains is what determines model performance — and the same input run through a SLOPPIER pipeline (e.g., C4’s original 2020 pipeline) produces a measurably worse training corpus despite having more tokens.

C4 → RefinedWeb → FineWeb — the evolution

Quality progression visible in downstream eval:

  Dataset              Tokens   Model trained        HellaSwag eval (0-shot)
  C4 (2020)            750B     T5 (11B)             ~50% baseline
  The Pile (2020)      300B     GPT-NeoX 20B         ~58%
  RefinedWeb (2023)    600B     Falcon 40B           ~75%  ← huge jump
  RedPajama-2 (2023)   1.5T     OpenLLaMA 13B        ~75%
  FineWeb (2024)       15T      FineWeb-trained 7B   ~75%  ← matches 40B from RefinedWeb

The headline is in the last row: a 7B model trained on FineWeb performs about as well as a 40B model trained on RefinedWeb, a corpus from just one year earlier. Data quality compounded into a ~6× model-size equivalent.

Setup: each document has a K-dimensional MinHash signature (K = 128 typical). Two documents with Jaccard similarity s agree on any single MinHash entry with probability s.

LSH bands: split the signature into B = 20 bands of R = 6 hashes each (20 × 6 = 120, so 120 of the K = 128 entries are used). Two documents are “candidate duplicates” if they agree on ALL R hashes in any one band.

P(agree on full band | similarity s) = s^R = s⁶.

P(at least one band matches | similarity s) = 1 − (1 − s⁶)^B = 1 − (1 − s⁶)²⁰.

This is an S-shaped curve in s: flat near 0, a sharp transition around s ≈ 0.55-0.6 (the probability crosses 50% near s ≈ 0.57), saturating near 1.
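A short sketch for evaluating this curve numerically with B = 20 and R = 6. The function names are hypothetical, and `approx_threshold` uses the common rule of thumb (1/B)^(1/R) for where the curve rises most steeply, which is an approximation rather than an exact result.

```python
def lsh_match_probability(s, bands=20, rows=6):
    """P(at least one band matches) for two documents with Jaccard similarity s."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(bands=20, rows=6):
    """Rule-of-thumb similarity at which the S-curve rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    # Print the curve at a few similarity levels to see the transition.
    for s in (0.3, 0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"s={s:.1f}  P(candidate)={lsh_match_probability(s):.3f}")
    print(f"approximate threshold: {approx_threshold():.2f}")
```

Changing `bands` and `rows` while holding their product fixed shifts the transition point, which is the trade-off discussed next.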

Trade-off:

More bands B (fewer hashes per band R): more sensitive — catches lower-similarity duplicates, more false positives.
Fewer bands B (more hashes per band R): more precise — only catches very-similar pairs, more false negatives.

Typical choice: B = 20, R = 6 (120 hashes in total) catches duplicates with similarity ≥ ~0.7 with high precision (~95%) and high recall (~90%).

Sublinear lookup: hash each document by its 20 band-keys. Documents sharing any band-key are placed in the same bucket. Pairwise comparison happens only within buckets. Grouping N docs by band-key costs O(N·B) hash-table inserts (or O(N log N) if done by sorting), and as long as buckets stay small the within-bucket comparisons add little on top. That is vastly cheaper than naive O(N²) pairwise comparison.
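The bucketing step can be sketched with a dictionary keyed by (band index, band contents). `candidate_pairs` is a hypothetical name, and this in-memory version only illustrates the logic; at corpus scale the grouping would be done with a distributed sort or shuffle.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands=20, rows=6):
    """Return pairs of doc ids that share at least one full band.

    `signatures` maps doc id -> MinHash signature with at least bands * rows entries.
    """
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            # Include the band index so band 0 never collides with band 1.
            buckets[(b, band)].append(doc_id)

    pairs = set()
    for ids in buckets.values():
        # Pairwise comparison happens only inside a bucket.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

Candidate pairs are only candidates: a pipeline would still confirm each pair (for example, with the signature-agreement estimate) before dropping a document.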

Why this is good enough for LLM data: we don’t need exact duplicates — we want to catch “the same article paraphrased across 50 sites” and “Wikipedia content scraped onto 100 mirrors.” Jaccard ≥ 0.7 captures these; LSH makes it tractable at petabyte scale.

What “high-quality” actually means

The quality classifier is the hardest part. What does “high-quality text” mean operationally?

The empirical answer (Penedo 2023, Falcon RefinedWeb): text that looks like Wikipedia, arxiv, or curated books — coherent paragraphs, complete sentences, factually-supportable claims, low repetition, varied vocabulary. The training signal is built like a binary classifier:

Positive examples: Wikipedia, arxiv preprints, reputable news sites, well-curated public-domain books, Stack Exchange answers.
Negative examples: SEO spam, content-farm articles, comment threads, link directories, machine-translated content, OCR errors.

A ~100M-param classifier trained on this signal scores each document; keep the top 25-50%. The classifier’s threshold is the main tunable.
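Turning "keep the top 25-50%" into a score threshold is a small ranking exercise. The sketch below assumes the scores are already computed and held in memory; the function names are hypothetical, and a real pipeline would more likely estimate the threshold from a sample of scores.

```python
def score_threshold_for_top_fraction(scores, keep_fraction):
    """Smallest score that still falls inside the top `keep_fraction` of documents."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[n_keep - 1]


def filter_by_quality(docs, scores, keep_fraction=0.25):
    """Keep documents whose score reaches the top-fraction threshold."""
    threshold = score_threshold_for_top_fraction(scores, keep_fraction)
    return [doc for doc, score in zip(docs, scores) if score >= threshold]
```

Ties at the threshold are all kept, so the retained share can slightly exceed `keep_fraction`.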

The headline fact: 7B (FineWeb) ≈ 40B (RefinedWeb). The ~6× model-size gap collapses with better data.

The deeper lesson:

Chinchilla scaling laws (§16.3) tell you the optimal (N, D) trade-off ASSUMING data quality is constant. But data quality isn’t fixed — it’s a third axis you can move along, and it interacts non-linearly with the others.

Better data lets the model:

Learn more per token (less wasted learning capacity on garbage).
Avoid memorising spurious patterns from spam/SEO content.
Develop higher-quality “reasoning behaviour” because reasoning is more visible in well-written text.

One way to formalise this (as a hypothesis, not an established law): write a quality-aware scaling law L(N, D, Q) where Q is data quality, and let the scaling exponent in N grow with Q. At low Q (raw CC), adding parameters helps but plateaus quickly. At high Q (FineWeb), adding parameters keeps paying off for much longer.

Why this matters economically:

Parameter count is expensive: 2× model size = 2-4× training compute + 2× inference cost forever. Data quality is “cheap” — a one-time data engineering investment of ~$1M of compute can improve every subsequent training run.

The implication: frontier labs reportedly spend more on data pipelines than on architecture research. Meta’s “Llama 3 data crew” is said to have more people than the architecture team, and the same is said of Anthropic, OpenAI, and DeepMind.

The newer wrinkle (post-2024): as ML-generated content floods the internet, raw Common Crawl is increasingly polluted with AI-generated text. The data pipeline now has to detect and filter “training-tainted” content. This is an active research area — distinguishing AI-generated from human-written at scale is technically difficult.

Next: §16.3 — Chinchilla scaling laws. The empirical L(N, D) function, the compute-optimal frontier, why GPT-3 was under-trained and what changed with Llama 2/3.