My Long-Running Notebook of AI & LLM

“I never start a project without skimming my own notes first.”
— Every engineer the night before a deadline

I’ve been writing down my AI/LLM notes over the last 6 months—what started as a few scrappy records about Transformer quirks has grown into 75 pages of tips spanning everything from model tweaks to prod inference hacks. I’m sharing it publicly so you can:

  • Refresh the basics
  • Jump into a new concept fast
  • Prepare for an AI fundamentals interview without opening 50 tabs

Link to my notes

Scroll down for a quick tour and tips on getting the most out of it.


1. Architectures: Where It All Begins

  • Transformers, RoPE, and friends – A concise recap of attention basics and how to scale rotary position embeddings when your context grows past the length the model was trained on (minimal sketch after this list).
  • Beyond vanilla – Notes on Mixture‑of‑Experts routing, grouped‑query attention, and “switch” layers for bandwidth‑friendly scaling.
  • Long‑context toolkit – FlashAttention‑2, context‑window grafting, and knob‑turning for retrieval‑augmented pipelines.
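
To ground the RoPE bullet, here's a minimal NumPy sketch of rotary embeddings with linear position interpolation. The `scale` argument is the one extra knob over the standard formulation; all shapes and values are illustrative, not lifted from any particular model:

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Inverse frequencies for each (even, odd) dimension pair, as in the RoPE paper.
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def apply_rope(x, positions, scale=1.0):
    # Rotate each (even, odd) feature pair by a position-dependent angle.
    # scale < 1 is linear position interpolation: new positions get squeezed
    # into the training range, so a 4K-trained model can attend over 8K.
    freqs = rope_frequencies(x.shape[-1])
    angles = np.outer(positions * scale, freqs)   # (seq_len, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Run 8K positions through a model trained for 4K: scale = 4096 / 8192.
q = np.random.randn(8192, 64)
q_rot = apply_rope(q, np.arange(8192), scale=4096 / 8192)
```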

2. Training at Scale: Beyond Data Parallelism

The techniques below kick in once plain data parallelism tops out—think training runs on 512+ GPUs or context windows stretching past 64K tokens.

  • Tensor Parallelism – Shard massive weight matrices column‑wise or row‑wise so each GPU hosts a slice; synchronise activations over high‑bandwidth links such as NVLink (toy sketch below).
  • Pipeline Parallelism – Classic assembly line—partition layers into stages and hide latency with micro‑batches.
  • Sequence Parallelism – Give each worker a chunk of the token stream and overlap compute/communication to keep GPUs busy.
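
To make the tensor-parallel item concrete, here's a toy column-wise sharding on two simulated devices. The list comprehension stands in for per-GPU matmuls and the concatenate for an NCCL all-gather; a real setup would use torch.distributed, but the arithmetic is identical:

```python
import numpy as np

d_model, d_ff, n_dev = 512, 2048, 2
rng = np.random.default_rng(0)

W = rng.standard_normal((d_model, d_ff))
shards = np.split(W, n_dev, axis=1)      # each "device" holds d_ff / n_dev columns

x = rng.standard_normal((4, d_model))    # a micro-batch of activations

partials = [x @ w for w in shards]       # computed independently on each device
y = np.concatenate(partials, axis=1)     # the all-gather step

assert np.allclose(y, x @ W)             # matches the unsharded matmul exactly
```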

3. Datasets: Building the Fuel

  • Source Mixes – OpenWeb, Common Crawl, curated corpora, code, synthetic dialogues; pros/cons and licensing notes.
  • Cleaning & Deduplication – Detect near‑duplicates with MinHash/SimHash, strip profanity, and nuke boilerplate (MinHash sketch after this list).
  • Domain Balancing – Up‑weight niche domains (medical, legal) without starving general language coverage.
  • Synthetic & Augmented Data – Self‑instruct, RAG‑generated Q&A, tool‑augmented reasoning traces.
  • Evaluation Splits – Leak‑proof dev/test and benchmark alignment tips.
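
For the dedup bullet, a pure-Python MinHash sketch. Production pipelines reach for a library like datasketch plus banded LSH instead of brute-force signature comparison, but the estimator is the same:

```python
import hashlib

def shingles(text, k=5):
    # Character k-grams; word shingles work better for long documents.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_perm=64):
    # Salt one hash function num_perm ways; keep the minimum per "permutation".
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    # Fraction of matching slots estimates the Jaccard similarity of the sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumped over the lazy dog"))
print(est_jaccard(a, b))  # high score -> near-duplicates, keep only one
```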

4. Data & Scaling Laws: Feeding the Beast

  • Dataset Curation – Practical heuristics for filtering web text, multilingual balancing, and deduplication without blowing up compute.
  • Token Mixtures – Why code, math, and synthetic Q‑A boost downstream zero‑shot tasks; quick recipes for ratio tuning.
  • Chinchilla‑style Scaling Laws – The compute‑optimal sweet spot (≈20 tokens per parameter) and how to project loss at larger model sizes (back‑of‑envelope sketch after this list).
  • Dynamic Data Pacing – Curriculum vs self‑curriculum, temperature‑based sampling, and tricks like “token replay” for long‑tail skills.
  • Tracking Data Quality – Per‑source perplexity dashboards and the “marginal utility of another billion tokens” checklist.
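
The Chinchilla bullet reduces to two lines of arithmetic under the usual approximation C ≈ 6·N·D and the ~20 tokens-per-parameter rule of thumb:

```python
# Compute-optimal sizing, Chinchilla style: C ≈ 6*N*D and D ≈ 20*N
# together give N = sqrt(C / 120) parameters and D = 20 * N tokens.

def compute_optimal(flops):
    n_params = (flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
```

Sanity check: plugging in Chinchilla's own budget (≈5.8e23 FLOPs) recovers roughly 70B parameters and 1.4T tokens, matching the paper's headline numbers.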

5. Inference Tricks & Optimisations

  • Speculative Decoding – Pair a small draft model with a larger verifier to roughly double effective tokens‑per‑second (accept/reject rule sketched after this list).
  • KV‑cache management – Bucketing, sliding windows, and other memory‑saver moves.
  • Quantisation cheatsheet – What survives INT8 vs FP8, and a one‑pager on GPTQ hyper‑params.
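
The speculative-decoding bullet hinges on one accept/reject rule, sketched below in stripped-down form. A real implementation scores all k draft positions in a single batched verifier pass and resamples the first rejected slot from the residual distribution; the probabilities here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(draft_tokens, p_big, p_small):
    # Accept draft token i with probability min(1, p_big[i] / p_small[i]).
    # This keeps the output distribution identical to sampling the big model
    # alone; the first rejection truncates the draft (production code then
    # resamples that slot from the clipped residual p_big - p_small).
    accepted = []
    for tok, pb, ps in zip(draft_tokens, p_big, p_small):
        if rng.random() < min(1.0, pb / ps):
            accepted.append(tok)
        else:
            break
    return accepted

# Four draft tokens: most survive where the two models roughly agree.
print(speculative_accept([17, 42, 7, 99],
                         p_big=[0.30, 0.25, 0.05, 0.20],
                         p_small=[0.28, 0.30, 0.40, 0.22]))
```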

6. Alignment, Reasoning & RL

  • Reward Modeling 101 – From scalar preference labels to pairwise ranking loss; why good rewards beat sparse accuracy metrics.
  • RLHF & Friends – PPO vs. simpler DPO/IPO methods; where RLAIF (AI feedback) slots in when human labels run dry (DPO loss sketched after this list).
  • Constitutional & Rule‑based Policies – Self‑critique loops, safety layers, and how to encode “don’t do that” without killing creativity.
  • Reasoning Boosters – Chain‑of‑Thought, Self‑Consistency, Tree‑of‑Thought, ReAct, and graph‑based planners—what works when tokens are precious.
  • Benchmarks & Eval – GSM8K, MATH, BBH, AGIEval, and reward‑hacking pitfalls (the “Wireheading Watchlist”).
  • Practical Tips – Start with supervised fine‑tuning, scale label diversity, and monitor KL divergence to keep models “on policy”.
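
Since the notes pit PPO against DPO, here's the DPO loss in a few lines of PyTorch. Inputs are summed log-probs of the chosen and rejected responses under the policy and the frozen reference model; the numbers in the toy batch are invented:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Sequence-level log-probs under the policy and the frozen reference.
    # beta scales the implicit reward margin; larger beta pins the policy
    # closer to the reference (an implicit KL anchor).
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_chosen - ref_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.6]), torch.tensor([-13.5, -9.2]))
print(loss.item())
```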

7. Safety & Governance

  • Prompt‑level Filtering – Llama Guard, OpenAI policy templates, regex and tree‑sitter tricks for fast rule checks.
  • System & Tooling – Safety layers that run after draft generation (re‑rankers) vs during (constrained decoding, refusal tokens).
  • Red‑Teaming Playbooks – Manual adversarial prompts, automated mutation (JailbreakGym), and cross‑model ensemble attacks.
  • Eval Suites – RealToxicityPrompts, HarmBench, HELM safety subset; measuring bias, toxicity, and jailbreak rate.
  • Content Policies – How to encode “allow / safe‑complete / refuse” tiers and avoid loopholes (toy triage sketch after this list).
  • Incident Response – Logging, canary prompts, and rollback plans when models misbehave in production.
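
A toy version of the allow / safe-complete / refuse triage from the content-policy bullet. The regex rules are deliberately silly placeholders; in practice they sit alongside a learned classifier such as Llama Guard rather than replacing it:

```python
import re

# Illustrative placeholder rules only; a real policy pairs regex pre-filters
# with a learned classifier and post-generation re-rankers.
REFUSE = [re.compile(r"\bbuild a bomb\b", re.I)]
SAFE_COMPLETE = [re.compile(r"\bself[- ]harm\b", re.I)]

def triage(prompt: str) -> str:
    if any(p.search(prompt) for p in REFUSE):
        return "refuse"          # hard block; log as a potential incident
    if any(p.search(prompt) for p in SAFE_COMPLETE):
        return "safe_complete"   # respond with guardrails and resources
    return "allow"

print(triage("What's the weather like today?"))   # allow
print(triage("I'm struggling with self-harm"))    # safe_complete
```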

8. Technical Report CliffsNotes

  • Gemini – Multimodal routing and that infamous planner‑solver split.
  • Llama 2/3 & Llama Guard – Safety alignment prompts and where RLHF still bites.
  • DeepSeek‑R1 – Reinforcement learning that incentivises chain‑of‑thought reasoning, plus distillation into smaller dense models.
  • Plus quick hits on Falcon, Mistral, Phi‑3 and other crowd favourites.

Final Words

Whether you’re shipping models to prod, writing your first Transformer from scratch, or prepping for an onsite, I hope this bundle saves you a few hours of searching—and maybe sparks your next idea.