
Playbook · LLM Fundamentals · Editors' pick

How does training data affect model quality?


Practical answer framework for AI engineer interview loops.

01 Interview Context

The trap here is treating this as a question about scale. The interviewer is checking whether you can trace a model's failure on a specific task, language, or domain back to the training corpus — before you reach for a bigger model or a different architecture.

02 The 90-second answer

Model quality is mostly a question of whether the training corpus covers your production problem. Most base models train on variants of Common Crawl, which skews heavily toward English and toward whatever the public web produces in volume. If your task, language, or domain sits outside that distribution, you see real degradation regardless of parameter count. Before recommending a different model, I'd ask what the training data actually covered.

03 Weak vs Strong Answer

Weak answer

"I'd use a larger model or add more training data to improve quality."

Strong answer

"I'd check the distribution gap first. If the language or domain was absent from the training corpus, more parameters won't close it. For missing factual knowledge I'd reach for retrieval. For a domain with no public training signal — protein structures, medical imaging — I'd evaluate purpose-built models rather than pushing a general one. And I'd measure by language and domain slice separately, not with an averaged quality score that hides where the product is actually failing."
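The slice-level measurement in the strong answer can be sketched in a few lines. The field names and example data here are illustrative, not from any real eval harness:

```python
from collections import defaultdict

def slice_scores(examples):
    """Aggregate accuracy per (language, domain) slice instead of one average.

    `examples` is a list of dicts with hypothetical keys
    'language', 'domain', and 'correct' (bool); the names are illustrative.
    """
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for ex in examples:
        key = (ex["language"], ex["domain"])
        totals[key][0] += ex["correct"]
        totals[key][1] += 1
    return {k: correct / total for k, (correct, total) in totals.items()}

results = [
    {"language": "en", "domain": "legal", "correct": True},
    {"language": "en", "domain": "legal", "correct": True},
    {"language": "my", "domain": "legal", "correct": False},
    {"language": "my", "domain": "legal", "correct": True},
]
# The averaged score is 75%, but the Burmese slice sits at 50%.
print(slice_scores(results))
```

The averaged number looks acceptable while one language slice is failing half the time, which is exactly the failure mode the strong answer warns about.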

04 Decision Tree

I'd use this flow to decide which approach to take:

flowchart TD
    A[Production problem] --> B[Compare with training distribution]
    B --> C{Does the training distribution match?}

    C -->|Yes| D[Use the base model directly]
    C -->|No| E[Add domain adaptation]

    E --> F[Retrieval]
    E --> G[Fine-tuning]
    E --> H[Domain-specific model]

    F --> I[Inject missing knowledge at runtime]
    G --> J[Teach task patterns or domain language]
    H --> K[Train or use a model built for the domain]
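The flow above can be encoded as a tiny function, which keeps the branching honest. The input names and gap categories here are my own shorthand for the branches in the diagram:

```python
def adaptation_plan(in_distribution, gap_type=None):
    """Map the decision tree above to an approach.

    in_distribution: does the training distribution cover the production problem?
    gap_type (hypothetical labels): 'facts' for missing factual knowledge,
    'patterns' for task/domain language, 'no_public_signal' for domains
    with no public training data at all.
    """
    if in_distribution:
        return "use the base model directly"
    return {
        "facts": "retrieval: inject missing knowledge at runtime",
        "patterns": "fine-tuning: teach task patterns or domain language",
        "no_public_signal": "domain-specific model built for the domain",
    }[gap_type]
```

The point of writing it out is that "add domain adaptation" is not one action; which branch you take depends on what kind of gap the training corpus left.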

05 Where the data goes wrong

Quality is uneven and filtering is imperfect

Common Crawl indexes a few billion pages a month and underlies GPT-3, Gemini, and most other large models. The raw crawl includes duplicates, SEO filler, and misinformation. OpenAI filtered GPT-2's training data using Reddit upvote counts; that helped, but it biased the corpus toward what is popular rather than what is accurate.
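A toy sketch of two common cleaning steps for a raw crawl: exact-duplicate removal by content hashing, and a crude length heuristic to drop thin SEO-style pages. The threshold is illustrative and far simpler than what production pipelines actually use:

```python
import hashlib

def dedup_and_filter(docs, min_words=20):
    """Keep each document once, and only if it clears a length heuristic.

    Real pipelines also do near-duplicate detection (e.g. MinHash),
    language ID, and quality classifiers; this shows only the shape.
    """
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        if len(doc.split()) < min_words:
            continue  # too thin to carry useful training signal
        kept.append(doc)
    return kept

long_doc = "training data coverage " * 10  # 30 words, kept once
docs = [long_doc, long_doc, "thin page"]
print(len(dedup_and_filter(docs)))  # -> 1: duplicate and thin page dropped
```

Note what this kind of filtering cannot do: it removes junk by form, not by truth value, which is why upvote-style popularity signals sneak bias in.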

Gunasekar et al. (2023) is worth citing here: a 1.3B-parameter model trained on 7B tokens of rigorously curated code outperformed much larger models on coding benchmarks. When someone argues for a bigger model to fix quality, that result is my counter.

Multilingual coverage is not what the marketing suggests

English accounts for roughly 46% of Common Crawl — around 8× the share of the next largest language. For languages spoken by hundreds of millions of people, the under-representation is severe.

Translating inputs through English first looks like an easy fix. I'd push back on three fronts. First, translation loses signal before the model ever sees the input: idioms, register, and culture-specific references degrade in transit. Second, model behavior shifts across languages beyond vocabulary: a NewsGuard audit found the model was more likely to generate false claims in Chinese than in English, so English safety evals don't validate your multilingual deployment. Third, tokenization cost compounds the gap: the same content in Burmese takes roughly 10× as many tokens as in English, which hits API cost and latency hard at scale.
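The token-cost point can be illustrated without a real tokenizer. UTF-8 bytes per character is a crude proxy for how byte-level tokenization penalizes scripts with little merge coverage in the vocabulary; real BPE tokenizers narrow the gap with learned merges, but the skew survives. This is a back-of-envelope sketch, not a measurement:

```python
def byte_fertility(text):
    """UTF-8 bytes per character: a rough lower-bound proxy for token cost
    when a script falls back to byte-level encoding with few merges."""
    return len(text.encode("utf-8")) / len(text)

print(byte_fertility("hello"))   # Latin script: 1 byte per character
print(byte_fertility("မြန်မာ"))    # Burmese script: 3 bytes per character
```

A 3× byte-level penalty, compounded by sparse merge coverage for under-represented scripts, is how the roughly 10× token gap arises in practice.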

Private and rare domains are a different failure class

For domains like drug discovery, protein structures, or medical imaging, there is almost no public training signal to draw from. Parameter count does not compensate for absent data. For those cases I'd look at purpose-built models: AlphaFold for protein structure prediction trained on ~100K known structures, NVIDIA BioNeMo for biomolecular and drug discovery, Med-PaLM 2 for clinical question answering. RAG can fill gaps in factual recall but it cannot substitute for a domain that was never in the weights.

06 Follow-up questions to expect

  1. Why doesn't routing inputs through English first solve the multilingual problem?
  2. When would you use a domain-specific model rather than RAG on a general one?
  3. How would you measure whether training data coverage is adequate for a new task?
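For the third question, one cheap first signal is the out-of-vocabulary rate of production traffic against a sample of the training-corpus vocabulary. Whitespace tokenization and these names are my own simplification; in practice you'd use the model's actual tokenizer and measure per language and domain slice:

```python
def oov_rate(production_texts, corpus_vocab):
    """Fraction of production-traffic tokens absent from a training-corpus
    vocabulary sample: one rough indicator of a coverage gap.

    Whitespace tokenization is illustrative only; a real check would use
    the model's tokenizer and compare distributions, not just vocabulary.
    """
    tokens = [t for text in production_texts for t in text.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(t not in corpus_vocab for t in tokens)
    return unseen / len(tokens)

vocab_sample = {"protein", "energy", "model", "training"}
print(oov_rate(["protein folding energy"], vocab_sample))  # 1 of 3 unseen
```

A high OOV rate doesn't prove the model will fail, and a low one doesn't prove coverage, but it's a fast way to flag slices that deserve a proper sliced evaluation.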