How to Read an AI Paper
Most practitioners consume AI research through blog posts, Twitter threads, and YouTube summaries. These are useful for awareness but they strip out the details that matter: what the authors actually did, what the baselines were, what the limitations are, and whether the evaluation is fair. Building the ability to read primary sources is one of the highest-leverage skills for anyone who wants to understand the field deeply rather than just track it.
Why Primary Sources Matter
What you get from papers
- Exact architecture decisions and hyperparameters that blog posts omit
- The authors' own description of limitations — often buried but honest
- The baseline comparisons (or absence of them)
- Implementation details in the appendix that determine whether results replicate
- What the paper does not claim, which is often more informative than what it does
Risks of secondhand sources
- LLMs summarizing papers frequently hallucinate details or misstate claims
- Blog posts amplify the headline result and drop the caveats
- Social media summaries are optimized for engagement, not accuracy
- Benchmark numbers get quoted without context of evaluation protocol
- Impressive charts frequently represent cherry-picked examples
Paper Anatomy
Virtually all ML papers follow the same structural template. Knowing what each section is supposed to do tells you which sections to read, and in what order, for your purpose.
| Section | Purpose | Read priority |
|---|---|---|
| Abstract | High-level claim, method, and result in 150–200 words | Always — first pass starts here |
| Introduction | Problem motivation, gap in prior work, contributions list | Always — reveals what they claim to contribute |
| Related Work | Context-setting; positions paper relative to prior art | Skim first pass; read if you need citations |
| Methods | What they actually built and how — the technical core | Always — this is what matters most |
| Experiments | Evaluation setup, baselines, main results, ablations | Always — do results actually support the claims? |
| Conclusion | Summary and limitations statement | Always — read on the first pass; authors state limitations here |
| Appendix | Hyperparameters, ablations, proofs, additional results | Read when reproducing; often contains crucial details |
The Three-Pass Method
Keshav (2007) proposed the three-pass reading method, originally for systems research papers. It maps directly onto ML papers and prevents the most common reading failure — spending four hours on a paper that turns out to be irrelevant.
First Pass — 5–10 minutes
Goal: decide whether the paper is worth your time. Read in order:
- Title and abstract (carefully)
- Section headings only
- Conclusion paragraph
- One glance at each figure/table — what are they measuring?
Exit outcome: you should know the paper's category (new method, benchmark, analysis, survey), the main claim, and whether it's relevant to you.
Second Pass — 45–90 minutes
Goal: understand the key ideas without full technical depth. Focus on:
- Figures and tables in detail — can you explain each figure in your own words?
- The methods section at a conceptual level — what is the core idea?
- The main results table — does the improvement look meaningful?
- The limitations paragraph in the conclusion
- Skip proofs and derivations; note them for the third pass if needed
Exit outcome: you could summarize the paper accurately to someone else.
Third Pass — 4–8 hours (for papers you need to implement)
Goal: full technical understanding, sufficient to implement or build on. Cover:
- Every equation and derivation — verify them against your own understanding
- Appendix hyperparameters and implementation details
- All ablation experiments — which components of the method matter?
- Implicit assumptions you can now challenge: What would happen if they changed X?
- Cross-reference with cited papers to verify the claims about prior work
Exit outcome: you could implement the method from scratch and identify flaws in the experimental design.
Reading Experiments Critically
The experiments section is where most papers are vulnerable. Learning to read it critically is the most valuable skill for distinguishing genuine advances from overfitted or cherry-picked results.
Questions to ask about every experiment
- Is the baseline competitive and fair? (Authors often compare against weak baselines)
- Is the evaluation metric appropriate for the task?
- Is the test set genuinely held out, or could the model have been tuned on it?
- How many seeds were run? Are standard deviations reported?
- Are ablation experiments present — do they confirm which components matter?
- Are the example outputs shown representative or cherry-picked?
Red flags that weaken a paper
- No baselines, or baselines from 3+ years ago
- Results only on a narrow set of benchmarks the method was tuned on
- No ablation study — you cannot tell which parts of the method matter
- Large variance in results with no significance testing
- Qualitative examples only, no quantitative evaluation
- Evaluation set not clearly described or potentially contaminated with training data
- Claims not matched to results: the abstract says X, but the results show X only under a subset of conditions
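The seed and standard-deviation questions above can be made concrete. The sketch below uses invented, purely illustrative scores: when the gap between two methods is comparable to the spread across seeds, a single-seed comparison proves little.

```python
import statistics

# Hypothetical accuracy scores from 5 random seeds for two methods.
# These numbers are illustrative, not taken from any real paper.
baseline = [71.2, 72.8, 70.9, 73.1, 71.5]
proposed = [72.4, 73.0, 71.8, 74.2, 72.1]

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)

b_mean, b_std = summarize(baseline)
p_mean, p_std = summarize(proposed)
print(f"baseline: {b_mean:.2f} +/- {b_std:.2f}")
print(f"proposed: {p_mean:.2f} +/- {p_std:.2f}")

# If the mean gap is smaller than the per-seed spread, a lucky seed
# for one method and an unlucky seed for the other could reverse the
# ranking entirely -- which is why papers should report both.
gap = p_mean - b_mean
print(f"gap: {gap:.2f} vs per-seed spread of roughly {b_std:.2f}")
```

A paper that reports only a single number per method gives you no way to run this comparison yourself, which is exactly why the absence of seeds and variance is a red flag.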
Key Paper Discovery Sources
| Source | Best for | Caveats |
|---|---|---|
| arXiv (arxiv.org) | All ML papers — posted before or simultaneously with peer review; cs.LG, cs.CL, cs.CV are the main sections | Not peer-reviewed; quality varies widely; preprints may change before the final version |
| Papers with Code | Links papers to open implementations and benchmark leaderboards; filter by task/dataset | Coverage incomplete; linked code may differ from paper version |
| Semantic Scholar | Citation graph; find papers that cite or are cited by a key paper; track a field's evolution | Less useful for papers from the last 6 months (citation lag) |
| HuggingFace Papers | Curated feed of significant recent papers; community upvoting surfaces high-signal work | Skews toward applied and model-release papers; theory underrepresented |
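Several of these sources are scriptable. arXiv, for instance, exposes a public Atom-feed API; a minimal sketch of building a query for recent submissions in one category (parameter choices here are illustrative, not the only options):

```python
from urllib.parse import urlencode

def arxiv_query_url(category: str, max_results: int = 5) -> str:
    """Build a query URL for the arXiv export API (Atom feed).

    Example categories: cs.LG, cs.CL, cs.CV. Sorting by submission
    date, newest first, gives a feed of the latest preprints.
    """
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)

url = arxiv_query_url("cs.LG")
print(url)
# Fetching this URL returns an Atom feed; each <entry> carries the
# title, abstract, authors, and PDF link for one paper.
```

This is handy for building a personal "new papers" feed, though it replaces none of the critical reading described above.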
Efficient Note-Taking
A minimal paper note template:
- What they claim: one sentence from the abstract
- What they actually did: method summary in your own words
- Key result: the one number that matters, with its context (dataset, metric, baseline)
- Main limitation: from their own limitations section or your own reading
- Would I trust this? Your critical assessment after reading the experiments
- Follow-up: papers cited or citing that you should read next
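If you keep notes in code rather than free text, the template above maps naturally onto a small record type. The structure below is a hypothetical sketch using this guide's field names, not any standard tool's schema; the example note is invented.

```python
from dataclasses import dataclass, field

@dataclass
class PaperNote:
    claim: str            # one sentence from the abstract
    method: str           # what they actually did, in your own words
    key_result: str       # the one number that matters, with context
    limitation: str       # from their limitations section or your reading
    verdict: str          # would I trust this?
    follow_up: list = field(default_factory=list)  # papers to read next

# A filled-in example (entirely fictional paper and numbers):
note = PaperNote(
    claim="Method X improves accuracy on benchmark Y.",
    method="Adds a retrieval step before generation.",
    key_result="74.2 vs 71.9 baseline on Y, exact match, 3 seeds.",
    limitation="English only; benchmark Y may overlap training data.",
    verdict="Promising, but wait for an independent replication.",
    follow_up=["Keshav (2007)"],
)
print(note.claim)
```

Keeping the fields fixed forces you to answer every question for every paper, which is the point of the template.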
Checklist: Do You Understand This?
- Why is reading the primary paper more reliable than reading an LLM summary or a blog post about it?
- Describe the three-pass method. What is the goal and time budget for each pass?
- When reading the experiments section, what are three questions you should ask to assess whether the results are credible?
- List three red flags in an ML paper's experimental design that should make you skeptical of the results.
- What is the difference between arXiv and Papers with Code, and when would you use each?
- If you only had 10 minutes to understand a paper you've never seen, what would you read, and in what order?