Some Resources for Reinforcement Learning Basics

Reinforcement learning (RL) has surged in popularity over the past few months, largely thanks to Large Reasoning Models (LRMs) and test‑time scaling techniques. A solid understanding of RL fundamentals often makes the difference between a model that merely trains and one that converges stably and efficiently when applied to LLMs.

To refresh my own knowledge, I’ve been revisiting the basics and prototyping a few canonical algorithms. One of my resources is Prof. Ernest Ryu’s UCLA YouTube course—it leans theoretical, but does a great job unpacking core ideas such as value functions, advantage estimates, policy gradients, Generalized Advantage Estimation (GAE), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), and more.

To ground the theory, I implemented several algorithms in Google Colab using Gym environments. The code is in this GitHub repo and is kept deliberately in pure Python and PyTorch for clarity. Below is a quick look at what each notebook covers:

RL_Basics_CartPole_REINFORCE.ipynb

  • gym: CartPole‑v1
  • Algorithms implemented:
    • Vanilla REINFORCE
    • REINFORCE + Value Network (one‑step Actor–Critic)
  • Main Points:
    • Monte-Carlo policy gradient
    • Benefit of subtracting a baseline to reduce variance
    • Walks through the full policy‑gradient pipeline, from collecting rollouts to computing returns/advantages (a minimal sketch follows after this list).
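
As a rough sketch of that pipeline (not the notebook's exact code; the function name, tensor shapes, and the 0.5 value-loss weight are my assumptions), the REINFORCE update with a value-network baseline looks roughly like this:

```python
import torch
import torch.nn as nn

def reinforce_loss(log_probs, rewards, values, gamma=0.99):
    # log_probs: (T,) tensor of log pi(a_t|s_t) for one episode
    # rewards:   list of T floats collected during the rollout
    # values:    (T,) tensor of V(s_t) from the value network
    returns = []
    G = 0.0
    for r in reversed(rewards):            # Monte-Carlo returns G_t
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    advantages = returns - values.detach() # subtracting the baseline reduces variance
    policy_loss = -(log_probs * advantages).mean()
    value_loss = nn.functional.mse_loss(values, returns)  # fit the baseline to the returns
    return policy_loss + 0.5 * value_loss
```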

RL_Sparse_reward_Actor_Critic_and_PPO.ipynb

  • gym: FrozenLake‑v1 (sparse‑reward setting)
  • Algorithms implemented:
    • REINFORCE + Value Network
    • PPO with Generalized Advantage Estimation (GAE)
  • Main Points:
    • Illustrates how sparse rewards derail vanilla policy gradient.
    • Demonstrates how PPO’s clipped objective and GAE stabilize training (both are sketched after this list).
    • Includes side‑by‑side reward curves comparing REINFORCE vs. PPO.
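
For reference, here is a minimal sketch of GAE and the PPO clipped surrogate; the λ = 0.95 and clip range 0.2 values are common defaults rather than the notebook's exact settings, and the function names are placeholders:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length T; values: length T+1 (includes a bootstrap value for the last state)
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    return advantages

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # clipped surrogate objective
```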

RL_DQN.ipynb

  • gym: LunarLander‑v2
  • Algorithms implemented:
    • Deep Q‑Network (DQN) with experience replay
    • Target‑network updates (with notes on Double‑DQN)
  • Main Points:
    • Provides a value‑based contrast to policy‑gradient methods.
    • Covers ε‑greedy exploration, replay‑buffer tuning, and target‑network syncing (see the update sketch below).
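
A minimal sketch of the DQN update with a target network is shown below; the replay-buffer sampling, network definitions, and the Huber loss choice are assumptions, not necessarily what the notebook does:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch: (states, actions, rewards, next_states, dones) sampled from the replay buffer
    # actions: LongTensor of shape (B,); dones: FloatTensor of 0/1 flags
    states, actions, rewards, next_states, dones = batch

    # Q(s_t, a_t) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Bootstrap from the frozen target network: max_a' Q_target(s', a')
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1 - dones) * next_q

    loss = F.smooth_l1_loss(q_values, targets)  # Huber loss on the TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```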

Please note that LRMs primarily rely on PPO and its variants (e.g., GRPO). PPO can leverage a strong pre‑trained policy—in the LRM setting, the language model’s next‑token predictor. By contrast, DQN does not learn a separate policy network; instead, its policy is the simple greedy rule a_t = argmax_a Q(s_t, a). Consequently, DQN cannot benefit from a pre‑trained policy in the same way.
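
To make that contrast concrete, here is a hypothetical side-by-side of the two action-selection rules (the tiny linear "networks" are placeholders just to keep the snippet runnable):

```python
import torch
import torch.nn as nn

state = torch.randn(1, 8)        # e.g., one LunarLander-v2 observation
policy_net = nn.Linear(8, 4)     # placeholder policy head producing action logits
q_net = nn.Linear(8, 4)          # placeholder Q-value head

# PPO-style: sample from an explicit (possibly pre-trained) policy distribution
action = torch.distributions.Categorical(logits=policy_net(state)).sample()

# DQN-style: the policy is implicit, just the greedy argmax over Q-values
action = q_net(state).argmax(dim=-1)
```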