“Programs must be written for people to read, and only incidentally for machines to execute.” Harold Abelson and Gerald Jay Sussman, Structure and Interpretation of Computer Programs, 2nd ed., MIT Press, 1996, Preface.

🎯 Why Code Reviews Matter

Code reviews aren’t just about catching bugs. They’re about:

Knowledge sharing: Every review is a chance to level each other up.
Long-term velocity: Fast code is good. Maintainable code is better.
Protecting the user: Especially with LLMs and agents, bad assumptions = hallucinations, bugs, or trust failures.

✅ What to Look For

1. Clarity & Intent

Is it obvious what this code does and why it was added?
Are variable names, function names, and structure easy to follow?
- A good resource to learn about variables is Code Complete (we have a hard copy in the office)
- Specifically for python, make sure to follow https://peps.python.org/pep-0008/#naming-conventions
- [TODO] Add link to Typescript’s naming conventions
Are there inline comments or Notion links explaining non-obvious logic?

2. Atomic Commits

Is this pull request focused on one thing?
If not, it should be split. No catch-all “cleanup” PRs unless discussed.

3. Evaluation-Ready

If the code touches model logic, prompt flow, or user-facing AI output:
- Does it log outputs?
- Is there an eval set or at least a test input set?
- Are results tracked somewhere (ClickHouse, Notion, markdown)?
  - [TODO] Require specific approach for this

4. Observability

Is the feature debug-friendly? Look for:
- Logs (with clear log levels)
- Error tracking/reporting
- Analytics hooks if user-facing

5. Security & Privacy

Any hardcoded secrets, access tokens, or unsafe evals?
Are we respecting user data boundaries?
Any potential for prompt injection or LLM misuse?

6. Consistency

Does it follow our stack’s conventions?
Does it use shared libraries/utilities instead of reinventing?

7. Tests

Are there meaningful unit/integration tests?
- [TODO] Add test pyramid
Are eval prompts tested where relevant?
If no tests, is there a reason (e.g. exploratory code, not prod-bound yet)?

🚨 Red Flags

Logic buried in prompts with no version control or testing
Silent failures or except: pass patterns
Feature flagging missing for experimental code
Pushing without a corresponding design doc or Notion task
PR > 500 LOC without a strong reason or breakdown

🙋 Reviewer Mindset

Be kind. Be curious. Ask clarifying questions instead of assuming mistakes.
Review the why, not just the what. Does this fit the direction of the product, not just the codebase?
Don’t block unless necessary. If something’s not ideal but isn’t critical, suggest + approve.
Leave clear, thoughtful, and useful feedback. Avoid sarcasm, personal attacks, etc.
- To help with clarity of your comments, follow the Conventional Comments style.

🤖 AI-Specific Code Review Tips

Prompts are code. Version them. Document them. Review them like logic.
LLM calls need fallback. Always check: What happens if the model fails or returns junk?
Data > assumptions. Encourage logging real outputs, not just theoretical flows.

📋 Code Review Checklist (for Authors)

Before requesting a review, make sure:

The PR has a clear title and description
The PR links to a Notion task or goal
You’ve tested it locally or on staging
Prompt logic is isolated and versioned
Eval/logging is included if it touches AI
You’ve written a Loom walkthrough (if >300 LOC or user-facing)

🧠 Final Note: Build With Trust

Every line of code is a commitment to the team, the product, and our users. Reviews are how we protect velocity without losing quality. The goal is not to gatekeep—it’s to raise the floor for everyone.

📖 Required Reading

Make sure you to read Google’s The Standard of Code Review, and the corresponding Author’s Guide and Reviewer Guide.