READING RESEARCH LIKE A RESEARCHER
Section 27.2

What frontier-lab work actually looks like

A typical paper from OpenAI or Anthropic represents 6-18 months of work by a team of 5-30 people. What you see in the paper: the final architecture, the training recipe, the result. What you don’t see: the 90% of experiments that failed, the infrastructure debugging that took 3 months, the data team’s years of work on the corpus, the alignment team’s back-and-forth with the pre-training team, the deployment engineers preparing for inference. This section is a candid description of what daily work actually looks like in 2025-2026 at the frontier labs — for someone considering joining one, or just to calibrate your reading of their papers.

The role mix at a frontier lab

A modern frontier lab (OpenAI, Anthropic, Google DeepMind, Meta AI, xAI) has roughly these technical roles:

  • Pre-training research (~15-25% of staff): Architecture experiments, data curation, training recipe design. "What model should we train next, with what data, for how long?"
  • Pre-training engineering (~15-20%): Distributed training infrastructure, performance optimisation. "How do we keep 16K GPUs running for 30 days without losing the run?"
  • Alignment research (~15-25%): RLHF, DPO, constitutional AI, red-teaming, safety evaluation. "How do we make this model useful and safe?"
  • Alignment engineering (~10-15%): Reward modeling infra, preference data collection pipelines. "How do we collect 1M preference comparisons efficiently?"
  • Inference & deployment (~10-15%): Serving infrastructure, optimization (vLLM-style), capacity planning. "How do we serve this model at $/1K tokens cheap enough to ship?"
  • Product & research engineering (~10-15%): Building tools (Claude Code, ChatGPT, Gemini), eval frameworks, APIs. "How do we make this useful to humans?"
  • Research operations + management (~5-10%): Coordinating across teams, hiring, infrastructure budgets. "Who's working on what, with what GPUs, why?"

The exact split varies, but pre-training, alignment, and inference together are typically 60-70% of staff. Product and research engineering have grown substantially in 2024-2025.

A typical day in pre-training

Pre-training researcher (typical week):

  • Monday: Architecture experiment fails. Loss explodes around step 50K. Spend the morning debugging. Find it's an interaction with the new tokenizer's special tokens.
  • Tuesday: Re-run with the fix. Loss looks normal. Now wait 6 days for training to finish.
  • Wednesday: Meanwhile, kick off architecture variant 2 on a fresh cluster. Discuss a curriculum issue with the data team.
  • Thursday: Variant 2 runs out of memory. The tensor-parallel layout was wrong. Half a day fixing. Re-launch.
  • Friday: Read 3 papers that people on the team flagged as important. Come up with an idea for variant 3; schedule it for next week.
  • 2 weeks later: One of the 3 variants beats baseline by 0.02 perplexity. Spend 3 weeks figuring out WHY (ablations). Spend 2 weeks writing an internal report.

3 months total for one "experiment" that may or may not ship.

The reality: lots of waiting (training runs are slow), lots of debugging (everything breaks), lots of negative results (most experiments don’t work), lots of communication overhead (your decisions affect 10 other teams).

A typical day in alignment

Alignment researcher (typical week):

  • Monday: Reward model is rejecting valid outputs. Inspect logs. Find that the RM penalises long responses, biasing the policy.
  • Tuesday: Retrain the RM with debiased data. Wait 12 hours. Find that the policy now games the new RM in a different way.
  • Wednesday: Manual review of 100 responses. Discuss with the PM team "what does 'helpful' mean for this user segment."
  • Thursday: Run a smaller-scale experiment. Compare 5 different reward formulations. Pick the best (or least bad).
  • Friday: Red-teaming session: try to break the model. Find 20 issues. Triage which are alignment issues vs prompt-injection vs capability gaps.
  • 2 months later: Ship an improvement. No formal paper: most alignment work isn't published, for capabilities and safety reasons. Internal: a 5-page memo describing what was tried. External: a blog post or model card update.

Alignment work is heavier on judgment calls than pre-training. Most decisions don’t have clean metrics; you’re balancing many constraints (helpful, harmless, honest, factual, etc.). Failures are sometimes more informative than successes.
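
The Monday diagnosis in the week above (a reward model that penalises long responses) is the kind of thing a quick correlation check can surface. The following is a minimal sketch, not any lab's actual tooling: the function names, the whitespace-based length measure, and the sample data are all hypothetical, and a real check would use tokenizer lengths and far more samples.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation in one variable, so no measurable correlation
    return cov / math.sqrt(var_x * var_y)

def length_bias(responses, rewards):
    """Correlation between response length (whitespace words) and reward score."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, rewards)

# Hypothetical responses and reward-model scores, for illustration only.
responses = [
    "Yes.",
    "Yes, that is correct.",
    "Yes, that is correct, and here is a longer explanation of why it holds.",
    "No.",
]
rewards = [0.9, 0.7, 0.2, 0.8]

bias = length_bias(responses, rewards)
print(f"length/reward correlation: {bias:.2f}")
```

A strongly negative value suggests the reward model is scoring length rather than quality. It does not prove it, since longer responses could also be genuinely worse, which is why the manual review on Wednesday still matters.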

A typical day in inference

Inference engineer (typical week):

  • Monday: Production serving cluster has elevated latency for some requests. Profile. Find that GPU 7 is degraded by thermal throttling.
  • Tuesday: Push out a hotfix to route around GPU 7. Coordinate with the hardware team to schedule a replacement.
  • Wednesday: Meanwhile, work on the next model release. Optimise the new speculative decoder. Find an inefficiency in the attention kernel for long contexts.
  • Thursday: Implement the fix. Test on standard benchmarks. Verify no quality regression.
  • Friday: Review changes to the inference engine from other team members. Discuss a new feature for the next quarter.
  • 3 months later: Average serving latency drops 12%. Cost per token down 8%. No paper, but an internal quarterly report highlights the improvements.

Inference engineering is high-leverage but rarely glamorous. You’re optimising the SAME model that’s already deployed — every percentage point of improvement compounds across billions of requests.

— think, then check —

Realistic roles for ML-strong / systems-weak:

1. Pre-training research — your strongest match. Algorithms, architectures, data, scaling. You can pick up the systems side gradually.

2. Alignment research — also strong match. Mostly ML + judgment. Less infrastructure-heavy.

3. Research engineering — bridges ML and systems. You’d implement research ideas, build evaluation frameworks. Decent transition path; the systems work is well-bounded.

Less realistic:

4. Inference engineering — heavy systems focus. CUDA, distributed systems, latency optimisation. Would need substantial systems study first.

5. Pre-training engineering — heaviest systems focus. Distributed training at scale. Year+ of systems work before being effective.

Typical interview process (5-8 rounds):

  1. Initial chat (30 min): high-level fit, what you’re working on, what excites you.
  2. Background interview (45 min): deep dive on your most recent work. Be prepared to defend technical choices.
  3. Technical interview (60 min): ML knowledge. Could be paper discussion, system design, or coding.
  4. Coding (60 min): implement something non-trivial. Often a numpy implementation of a known algorithm (LayerNorm, attention, etc.). Tests fluency, not creativity.
  5. Research interview (60 min): present research direction or critique a paper. Tests taste and depth.
  6. Manager interview (45 min): cultural fit, what kind of work environment you want.
  7. Final loop (3-5 hours): meeting with potential team members. Often pair-programming or whiteboard discussions.
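
For the coding round in step 4, the kind of exercise mentioned (LayerNorm, attention) might look like the sketch below. This is one plausible NumPy version written for illustration, not a known interview answer key; the function signatures, the `causal` flag, and the toy shapes are assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise over the last axis, then apply a learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # population variance (ddof=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def softmax(z, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    """Scaled dot-product attention for arrays shaped (..., seq_len, d)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if causal:
        t_q, t_k = scores.shape[-2:]
        # True above the diagonal: position i may not attend to j > i.
        mask = np.triu(np.ones((t_q, t_k), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

# Toy inputs: 4 positions, model width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
out = attention(x, x, x, causal=True)
```

In an interview the follow-up questions tend to matter as much as the code: why the max is subtracted in the softmax, why scores are divided by the square root of the key width, and what `eps` protects against.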

What they’re actually evaluating:

  • Technical depth: can you reason about systems / models / math correctly?
  • Speed of thought: how quickly do you get to the right framing?
  • Judgment / taste: what would you research; what would you NOT research?
  • Collaboration: can you debate without being dogmatic?
  • Communication: can you explain complex ideas clearly?

Common failure modes:

  • Strong on math, weak on engineering practicality.
  • Strong on engineering, weak on motivating research directions.
  • Strong on knowing existing techniques, weak on extending them.
  • Knowing every paper but not having an opinion about which are important.

Realistic timelines:

  • From “deciding to apply” to “offer in hand”: 2-6 months.
  • From “joining” to “shipping first thing”: 6-12 months.
  • From “joining” to “having opinions that influence direction”: 1-2 years.

The labs hire slowly and selectively. Be patient.

What papers hide

Papers polish, abstract, and sanitise. What’s missing:

What you won't see in a paper:

1. THE FAILED EXPERIMENTS. For every "result" reported, dozens of variants were tried. Sweeps across architectures, hyperparameters, and data mixes. The paper shows the winners.

2. THE INFRASTRUCTURE DEBUGGING. Training runs fail. GPUs die. Storage corrupts. NCCL hangs. Months can be lost to systems-level issues that don't appear in the paper.

3. THE DATA WORK. The corpus took 1-2 years to build. Cleaning, dedup, filtering, and quality classification, none of which fits in the paper.

4. THE INTERNAL POLITICS. Multiple teams compete for cluster time, GPU allocations, researcher attention. Decisions involve people, not just ideas.

5. THE FAILED MODELS. For every shipped model, several were trained and abandoned. Models such as Llama 4 and GPT-4o likely had internal variants and earlier experiments, none of which is publicly described.

6. THE LATE-STAGE CHANGES. The "final" architecture in the paper may differ from what was actually deployed. Last-minute changes for alignment reasons, safety improvements, performance tweaks.

7. THE PEOPLE INVOLVED. The author list shows ~10-30 people. The actual work involved ~50-100 contributors (data team, infra team, eval team, etc.) whose names aren't listed.

8. THE TIMELINE. Papers list "5 months of training" but the project started 18 months earlier with planning, experimentation, etc.

— think, then check —

Realistic expectations for the first 3 months:

1. You won’t ship anything significant. Frontier labs have enormous existing tooling, codebases, and conventions. Learning these takes weeks to months.

2. You’ll be assigned to a team. Could be pretraining, alignment, inference, etc. Your work scope is initially narrow.

3. You’ll work on small contributions first. Implementing a small experiment, writing an eval, fixing a bug, contributing to a meeting note. The “small first” approach is by design: you need to demonstrate competence on small tasks before getting bigger projects.

4. You’ll attend many meetings. Frontier labs are heavily coordinated; daily standups, weekly team meetings, monthly all-hands. Initially, much of this is over your head; that’s normal.

5. You’ll feel slow. Compared to academic work where you could iterate freely, lab work has more constraints (GPU cluster scheduling, code review, alignment with team direction). This is frustrating initially.

How to ramp up effectively:

  1. Learn the codebase, not the science. The papers you’ve read aren’t directly applicable; the existing code IS. Spend the first month reading code, running existing scripts, understanding how data flows.
  2. Find a “small but meaningful” first project. Ideally something where you can demonstrate competence in 2-4 weeks. Avoid biting off too much.
  3. Build relationships with 3-5 people. Senior people you can ask questions, peers you can collaborate with, junior people whose work you can boost. Network in the building.
  4. Attend the weekly group meetings of adjacent teams. You’ll learn what’s happening across the lab. This builds the implicit knowledge of “what’s working, what’s not, what’s coming.”
  5. Don’t try to publish a paper in your first 6 months. Counterintuitive: focus on internal contributions, learn the systems, build reputation. Papers come naturally from solid contributions, not from rushed first-author attempts.
  6. Read the internal documentation. Frontier labs have extensive internal wikis, design docs, decision logs. Reading these is far more efficient than learning by trial.

Common mistakes:

  • Trying to “fix” the existing system before understanding why it’s that way.
  • Proposing too-ambitious research directions without buy-in.
  • Working in isolation instead of pair-programming with team members.
  • Saying “I can do that in a weekend” — usually it takes 3 weeks.
  • Underestimating the importance of evaluation and reproducibility.

Where the value comes from in 6-18 months:

  • You contribute to a real shipped model improvement.
  • You become the “owner” of some system or experiment series.
  • You influence the team’s research direction.
  • You write a paper (internal or external) on your work.
  • You mentor newer hires through their first 3 months.

The trajectory is slower than academic research but the impact per unit of effort is much higher because of the scale of resources.

What the published papers do represent

Despite hiding much, papers ARE valuable signal. They represent:

When you read papers, treat them as a sanitised report: much was tried that isn’t there, but what IS there is real and important. The skill is to read them as a signal of how the field is moving, not as instruction manuals.

— think, then check —

What papers don’t tell you:

1. The hyperparameter search budget. Did they try 100 configs or 1000? You probably can’t afford the same exploration.

2. Negative results. Their first attempt may have been worse than baseline; the published result is the polished version.

3. Infrastructure assumptions. The paper assumes you have access to certain training infrastructure (kernels, distributed primitives) that you may not have.

4. Data quality dependencies. The improvement may only show up at certain data scales / qualities / distributions.

5. Interaction with other techniques. The improvement might interact with other techniques (RoPE, MoE, etc.) in non-obvious ways.

6. Failure modes at smaller scale. Many techniques work at 70B but not at 7B (or vice versa). Often unstated.

7. What “the baseline” actually was. Was it tuned aggressively? Or just standard hyperparameters?

8. The compute budget. “We trained for 100K steps” — but on how many GPUs? You may not have the same compute.
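
Item 8 can be made concrete with back-of-envelope arithmetic. The sketch below uses the common 6 × parameters × tokens rule of thumb for dense-transformer training FLOPs, which is an approximation and not something the text above states. Every number in it (batch size, sequence length, model size, GPU count, peak throughput, utilisation) is a hypothetical placeholder to be replaced with what the paper reports and what you actually have.

```python
def training_tokens(steps, global_batch_size, seq_len):
    """Total tokens seen: steps x sequences per step x tokens per sequence."""
    return steps * global_batch_size * seq_len

def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training cost for a dense transformer: ~6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

def gpu_days(total_flops, n_gpus, peak_flops_per_gpu, utilisation):
    """Wall-clock days, given a cluster size and the fraction of peak achieved."""
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * utilisation)
    return seconds / 86_400

# Hypothetical reading of "we trained for 100K steps".
tokens = training_tokens(steps=100_000, global_batch_size=1024, seq_len=4096)
flops = approx_train_flops(n_params=7_000_000_000, n_tokens=tokens)

# Hypothetical hardware: 64 GPUs, 1e15 FLOP/s peak each, 40% utilisation.
days = gpu_days(flops, n_gpus=64, peak_flops_per_gpu=1e15, utilisation=0.4)

print(f"tokens: {tokens:.3e}, FLOPs: {flops:.3e}, wall-clock days: {days:.1f}")
```

The same step count implies very different budgets depending on batch size and sequence length, which is why "100K steps" on its own says little about whether a result is within reach.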

How to extract more information:

1. Read released code carefully. Often more detailed than the paper. Look at hyperparameters, schedules, and special handling.

2. Check GitHub issues on the paper’s repository. Other replicators often discuss what they had to change.

3. Reach out to authors. Be specific: “I’m trying to apply technique X to model Y, but seeing Z. Any thoughts on what might be different?”

4. Check follow-up papers. Other groups that have built on this work usually report what they had to do differently.

5. Look for community resources. Twitter/X discussions, blog posts, podcast interviews — sometimes important details surface there.

6. Run a small-scale replication. Apply the technique at a much smaller scale first. If it doesn’t work even there, something’s wrong with your understanding.

What to do if you can’t replicate:

1. Adapt the principle, not the recipe. The paper’s specific recipe may not generalise; the underlying idea probably does.

2. Find a similar simpler technique. Often the paper builds on prior work; the prior work may be more accessible.

3. Be skeptical of small improvements. A 1-2% improvement in a paper is often within replication variance. Don’t over-engineer to chase it.

4. Build your own baselines. Compare against YOUR strongest implementation, not the paper’s reported number. This is often more meaningful.
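
Points 3 and 4 can be combined into a simple habit: run your own baseline over several seeds and compare any claimed gain against that spread. The sketch below is a rough heuristic and not a formal significance test; the function name, the threshold `k`, and the scores are hypothetical.

```python
import math
import statistics

def compare_to_baseline(baseline_scores, new_scores, k=2.0):
    """Return (mean difference, standard error, within_noise).

    within_noise is True when the gain is smaller than k standard errors
    of the difference between the two means.
    """
    diff = statistics.mean(new_scores) - statistics.mean(baseline_scores)
    se = math.sqrt(
        statistics.variance(baseline_scores) / len(baseline_scores)
        + statistics.variance(new_scores) / len(new_scores)
    )
    return diff, se, abs(diff) < k * se

# Hypothetical benchmark accuracies (%) from 4 seeds each.
baseline = [71.2, 70.5, 71.9, 70.8]
with_technique = [71.9, 71.0, 72.3, 70.9]

diff, se, noise = compare_to_baseline(baseline, with_technique)
print(f"gain: {diff:.3f}, standard error: {se:.3f}, within noise: {noise}")
```

With these made-up numbers the gain of roughly 0.4 points is smaller than twice the standard error, so it would not justify extra engineering. More seeds narrow the standard error and make the call clearer.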

The bigger picture:

Frontier labs have huge advantages: compute, data, tooling, talent, internal lore. As a smaller player, you can’t replicate their full stack. But you CAN apply their high-level insights to your scale.

The papers are a contribution to the field; they don’t have to be exactly applicable to be valuable. The principle is what travels; the recipe is local to the lab that produced it.

Next: §27.3 — Open problems and the close. Where the genuine research seams are in 2026, and the closing reflection on what this book was trying to do.