“I never start a project without skimming my own notes first.”
— Every engineer the night before a deadline
I’ve been writing down my AI/LLM notes over the last six months. What started as a few scrappy records about Transformer quirks has turned into 75 pages of tips spanning everything from model tweaks to prod inference hacks. I’m sharing it publicly so you can:
- Refresh the basics
- Jump into a new concept fast
- Prepare for an AI fundamentals interview without opening 50 tabs
Scroll down for a quick tour and tips on getting the most out of it.
1. Architectures: Where It All Begins
- Transformers, RoPE, and friends – A concise recap of attention basics and how to scale rotary position embeddings past the context length the model was trained on (toy sketch after this list).
- Beyond vanilla – Notes on Mixture‑of‑Experts routing, grouped‑query attention, and “switch” layers for bandwidth‑friendly scaling.
- Long‑context toolkit – FlashAttention‑2, context‑window grafting, and knob‑turning for retrieval‑augmented pipelines.
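To make the RoPE-scaling point concrete, here's a minimal sketch of rotary embeddings with linear position interpolation, assuming a toy `(seq, dim)` query tensor; the `scale` value is arbitrary, and real long-context deployments often prefer NTK-aware or YaRN-style frequency scaling instead:

```python
# Toy RoPE with linear position interpolation: squeezing positions by
# `scale` makes position 8192 "look like" 4096 to a model trained on 4096.
import torch

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # Standard RoPE frequencies over pairs of channels.
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    return torch.outer(positions.float() / scale, inv_freq)

def apply_rope(x, positions, scale=1.0):
    # x: (seq, dim) with dim even; rotate consecutive channel pairs.
    theta = rope_angles(positions, x.shape[-1], scale=scale)
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8192, 64)
q_scaled = apply_rope(q, torch.arange(8192), scale=2.0)  # 2x context stretch
print(q_scaled.shape)
```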
2. Training at Scale: Beyond Data Parallelism
The techniques below kick in once plain data parallelism tops out: think training runs on 512+ GPUs or context windows stretching past 64K tokens. A toy tensor-parallel sketch follows the table.
| Technique | TL;DR in the notes |
|---|---|
| Tensor Parallelism | Shard massive weight matrices column‑wise or row‑wise so each GPU hosts a slice; synchronise activations over high‑bandwidth links such as NVLink. |
| Pipeline Parallelism | Classic assembly line: partition layers into stages and hide latency with micro‑batches. |
| Sequence Parallelism | Give each worker a chunk of the token stream and overlap compute/communication to keep GPUs busy. |
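To see the tensor-parallel row in action, here's a single-process sketch that fakes two ranks on CPU; in a real run each shard sits on its own GPU and the final concat is an all-gather over NVLink:

```python
# Column-parallel sharding: splitting W by output columns and gathering
# the partial results reproduces the full matmul exactly.
import torch

torch.manual_seed(0)
d_model, d_ff, batch = 8, 16, 4

x = torch.randn(batch, d_model)  # activations, replicated on every rank
W = torch.randn(d_model, d_ff)   # full weight matrix (too big for one GPU in practice)

W0, W1 = W.chunk(2, dim=1)       # each "rank" holds half the output features

y0 = x @ W0                      # computed on rank 0
y1 = x @ W1                      # computed on rank 1
y = torch.cat([y0, y1], dim=1)   # the all-gather step

assert torch.allclose(y, x @ W, atol=1e-6)
print("sharded matmul matches the full matmul:", y.shape)
```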
3. Datasets: Building the Fuel
- Source Mixes – OpenWeb, Common Crawl, curated corpora, code, synthetic dialogues; pros/cons and licensing notes.
- Cleaning & Deduplication – Detect near‑duplicates with MinHash/SimHash (toy sketch after this list), strip profanity, and nuke boilerplate.
- Domain Balancing – Up‑weight niche domains (medical, legal) without starving general language coverage.
- Synthetic & Augmented Data – Self‑instruct, RAG‑generated Q&A, tool‑augmented reasoning traces.
- Evaluation Splits – Leak‑proof dev/test and benchmark alignment tips.
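For a feel of the deduplication bullet above, here's a toy MinHash in plain Python; the shingle size and permutation count are arbitrary, and production pipelines would add LSH banding or reach for a library like datasketch:

```python
# Toy MinHash near-duplicate check over word 5-gram shingles.
import hashlib

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(shingle_set, num_perm=64):
    # "Permutation" i = hash salted with i; keep the minimum hash per salt.
    return [
        min(int.from_bytes(
                hashlib.blake2b(i.to_bytes(4, "little") + s.encode(),
                                digest_size=8).digest(), "big")
            for s in shingle_set)
        for i in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    # Fraction of matching minima estimates Jaccard similarity of the sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river today"
print(est_jaccard(minhash(shingles(doc_a)), minhash(shingles(doc_b))))
```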
4. Data & Scaling Laws: Feeding the Beast
- Dataset Curation – Practical heuristics for filtering web text, multilingual balancing, and deduplication without blowing up compute.
- Token Mixtures – Why code, math, and synthetic Q‑A boost downstream zero‑shot tasks; quick recipes for ratio tuning.
- Chinchilla‑style Scaling Laws – The compute‑optimal sweet spot (≈20 tokens per parameter) and how to project loss at larger model sizes (sketch after this list).
- Dynamic Data Pacing – Curriculum vs self‑curriculum, temperature‑based sampling, and tricks like “token replay” for long‑tail skills.
- Tracking Data Quality – Per‑source perplexity dashboards and the “marginal utility of another billion tokens” checklist.
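For the Chinchilla bullet, a back-of-envelope sketch assuming the parametric fit `L(N, D) = E + A/N^alpha + B/D^beta` from Hoffmann et al. (2022); the constants below are the paper's published fit, so treat the outputs as rough projections:

```python
# Chinchilla parametric loss fit (Hoffmann et al., 2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def projected_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_optimal_tokens(n_params, tokens_per_param=20):
    # The ~20 tokens/param rule of thumb from the same paper.
    return tokens_per_param * n_params

for n in (1e9, 7e9, 70e9):
    d = compute_optimal_tokens(n)
    print(f"{n/1e9:>5.0f}B params -> {d/1e9:,.0f}B tokens, "
          f"projected loss {projected_loss(n, d):.3f}")
```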
5. Inference Tricks & Optimisations
- Speculative Decoding – Pair a small draft model with a larger verifier to double effective tokens‑per‑second (control-flow sketch after this list).
- KV‑cache management – Bucketing, sliding windows, and other memory‑saver moves.
- Quantisation cheatsheet – What survives INT8 vs FP8, and a one‑pager on GPTQ hyper‑params.
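Here's the speculative-decoding control flow as a greedy toy, where `draft` and `verify` are hypothetical stand-ins for the two models; real implementations score the whole draft block in one verifier forward pass and use a probabilistic accept/reject rule rather than exact matching:

```python
# Greedy speculative decoding: propose k cheap tokens, keep the longest
# prefix the verifier agrees with, and take the verifier's token at the
# first mismatch. Best case: k tokens for roughly one big-model pass.

def speculative_decode(ctx, draft, verify, k=4, max_new=32):
    out = list(ctx)
    while len(out) - len(ctx) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Verifier checks each position (one batched pass in practice).
        accepted = 0
        for i in range(k):
            target = verify(out + proposal[:i])
            if proposal[i] == target:
                accepted += 1
            else:
                out.extend(proposal[:accepted])
                out.append(target)   # verifier's token replaces the miss
                break
        else:
            out.extend(proposal)     # all k accepted
    return out

# Toy demo: the draft agrees with the verifier most of the time.
verify = lambda ctx: len(ctx) % 7
draft = lambda ctx: len(ctx) % 7 if len(ctx) % 4 else 0
print(speculative_decode([0], draft, verify, k=4, max_new=12))
```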
6. Alignment, Reasoning & RL
- Reward Modeling 101 – From scalar preference labels to pairwise ranking loss; why good rewards beat sparse accuracy metrics.
- RLHF & Friends – PPO vs. simpler DPO/IPO methods; where RLAIF (AI feedback) slots in when human labels run dry (minimal DPO loss sketch after this list).
- Constitutional & Rule‑based Policies – Self‑critique loops, safety layers, and how to encode “don’t do that” without killing creativity.
- Reasoning Boosters – Chain‑of‑Thought, Self‑Consistency, Tree‑of‑Thought, ReAct, and graph‑based planners—what works when tokens are precious.
- Benchmarks & Eval – GSM8K, MATH, BBH, AGIEval, and reward‑hacking pitfalls (the “Wireheading Watchlist”).
- Practical Tips – Start with supervised fine‑tuning, scale label diversity, and monitor KL divergence to keep models “on policy”.
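Since DPO comes up repeatedly in the notes, here's a minimal sketch of its pairwise loss in PyTorch; the variable names and `beta` are illustrative rather than tied to any particular library:

```python
# DPO loss over summed log-probs of chosen/rejected responses under the
# policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Implicit reward of each response: beta * (log pi - log pi_ref).
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    # Pairwise ranking loss: push the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of 3 preference pairs (summed token log-probs).
lp = lambda *v: torch.tensor(v)
loss = dpo_loss(lp(-12.0, -9.5, -20.1), lp(-14.2, -9.9, -19.0),
                lp(-12.5, -9.7, -20.0), lp(-13.8, -10.1, -19.3))
print(float(loss))
```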
7. Safety & Governance
- Prompt‑level Filtering – Llama Guard, OpenAI policy templates, regex and tree‑sitter tricks for fast rule checks.
- System & Tooling – Safety layers that run after draft generation (re‑rankers) vs during (constrained decoding, refusal tokens).
- Red‑Teaming Playbooks – Manual adversarial prompts, automated mutation (JailbreakGym), and cross‑model ensemble attacks.
- Eval Suites – RealToxicityPrompts, HarmBench, HELM safety subset; measuring bias, toxicity, and jailbreak rate.
- Content Policies – How to encode “allow / safe‑complete / refuse” tiers and avoid loopholes (toy rule-check sketch after this list).
- Incident Response – Logging, canary prompts, and rollback plans when models misbehave in production.
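And a toy version of the tiered rule check, with made-up patterns; fast regex triage like this usually sits in front of a heavier classifier such as Llama Guard:

```python
# Toy prompt-level rule check mapping regex hits to policy tiers.
# The patterns and tiers are invented for illustration only.
import re

RULES = [
    (re.compile(r"\b(build|make)\b.*\bexplosive", re.I), "refuse"),
    (re.compile(r"\bself[- ]harm\b", re.I), "safe_complete"),  # answer with care + resources
]

def triage(prompt: str) -> str:
    for pattern, tier in RULES:
        if pattern.search(prompt):
            return tier
    return "allow"

print(triage("How do I make sourdough?"))           # allow
print(triage("steps to make an explosive device"))  # refuse
```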
8. Technical Report CliffsNotes
- Gemini – Multimodal routing and that infamous planner‑solver split.
- Llama 2/3 & Llama Guard – Safety alignment prompts and where RLHF still bites.
- DeepSeek‑R1 – Reasoning trained largely via reinforcement learning (GRPO), plus the distillation recipe for smaller models.
- Plus quick hits on Falcon, Mistral, Phi‑3 and other crowd favourites.
Final Words
Whether you’re shipping models to prod, writing your first Transformer from scratch, or prepping for an onsite, I hope this bundle saves you a few hours of searching, and maybe sparks your next idea.