From itsmostafa-llm-engineering-skills
Understanding Reinforcement Learning from Human Feedback (RLHF) for aligning language models. Use when learning about preference data, reward modeling, policy optimization, or direct alignment algorithms like DPO.
npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin itsmostafa-llm-engineering-skills

This skill uses the workspace's default tool permissions.
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning language models with human preferences. Rather than relying solely on next-token prediction, RLHF uses human judgment to guide model behavior toward helpful, harmless, and honest outputs.
Pretraining produces models that predict likely text, not necessarily good text. A model trained on internet data learns to complete text in ways that reflect its training distribution—including toxic, unhelpful, or dishonest patterns. RLHF addresses this gap by optimizing for human preferences rather than likelihood.
The core insight: humans can often recognize good outputs more easily than they can specify what makes an output good. RLHF exploits this by collecting human judgments and using them to shape model behavior.
Language models face several alignment challenges: being helpful (actually addressing the user's intent), harmless (avoiding toxic or dangerous content), and honest (not confidently fabricating information) are properties that likelihood training alone does not guarantee. RLHF provides a framework for encoding these properties through preference data.
The standard RLHF pipeline consists of three main stages:
Stage 1, supervised fine-tuning (SFT): start with a pretrained language model and fine-tune it on high-quality demonstrations. This teaches the model the desired format and style of responses.
Input: Pretrained model + demonstration dataset
Output: SFT model that can follow instructions
Stage 2, reward modeling: train a model to predict human preferences between pairs of outputs. The reward model learns to score outputs in a way that correlates with human judgment.
Input: SFT model + preference dataset (chosen/rejected pairs)
Output: Reward model that scores any output
Stage 3, policy optimization: use reinforcement learning to optimize the SFT model against the reward model, while staying close to the original SFT distribution.
Input: SFT model + reward model
Output: Final aligned model
Direct alignment algorithms (DPO, IPO, KTO) skip the reward model entirely, optimizing directly from preference data. This simplifies the pipeline but trades off some flexibility.
Preference data encodes human judgment about model outputs. The most common format is pairwise comparisons.
Given a prompt, collect two or more model outputs and have humans indicate which is better:
Prompt: "Explain quantum entanglement"
Response A: [technical explanation]
Response B: [simpler explanation with analogy]
Human preference: B > A
This creates (prompt, chosen, rejected) tuples for training.
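For illustration, a single preference example can be stored as a small record; the field names below are just one common convention, not a fixed schema:

```python
# One pairwise preference example as a (prompt, chosen, rejected) record.
# Field names and contents are illustrative.
preference_example = {
    "prompt": "Explain quantum entanglement",
    "chosen": "Imagine two coins that always land on matching sides...",   # simpler explanation with analogy (B)
    "rejected": "Entanglement is a non-separable joint quantum state...",  # technical explanation (A)
}
```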
Human annotation: Trained annotators compare outputs according to guidelines. Most reliable but expensive and slow.
AI feedback: Use a capable model to generate preferences. Faster and cheaper but may propagate biases. This is the basis for Constitutional AI (CAI) and RLAIF.
Implicit signals: User interactions like upvotes, regeneration requests, or conversation length. Noisy but abundant.
Instruction tuning (supervised fine-tuning on instruction-response pairs) serves as the foundation for RLHF.
Typical instruction tuning datasets combine human-written demonstrations, examples converted from existing NLP tasks, and synthetically generated instruction-response pairs.
The SFT model defines the "prior" that RLHF refines. A better SFT model means the later stages start closer to the desired behavior, so the reward model sees higher-quality comparisons and the RL step needs less aggressive optimization.
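In code, the SFT objective is ordinary next-token cross-entropy restricted to the response tokens. A minimal sketch, assuming a PyTorch causal LM whose logits and token IDs are already available (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_lengths):
    """Cross-entropy on response tokens only; prompt tokens are masked out.

    logits:         (batch, seq_len, vocab) from a causal LM
    input_ids:      (batch, seq_len) prompt tokens followed by response tokens
    prompt_lengths: (batch,) number of prompt tokens in each example
    """
    # Standard causal shift: the logit at position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask prompt positions so only response tokens contribute to the loss.
    positions = torch.arange(shift_labels.size(1), device=input_ids.device)
    prompt_mask = positions.unsqueeze(0) < (prompt_lengths.unsqueeze(1) - 1)
    shift_labels[prompt_mask] = -100  # ignore_index for cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```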
The reward model transforms pairwise preferences into a scalar signal for RL optimization.
Preferences are modeled using the Bradley-Terry framework:
P(A > B) = sigmoid(r(A) - r(B))
Where r(x) is the reward for output x. This assumes preferences depend only on the difference in rewards.
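For example, if r(A) = 1.0 and r(B) = 0.0, the Bradley-Terry model implies P(A > B) = sigmoid(1.0) ≈ 0.73.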
The loss function is:
L = -log(sigmoid(r(chosen) - r(rejected)))
This pushes the reward model to assign higher scores to chosen outputs.
Reward models are typically initialized from the SFT model, with the language-modeling head replaced by a linear head that outputs a single scalar score for the whole sequence.
See reference/reward-modeling.md for detailed training procedures.
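As a minimal sketch of the pairwise loss (assuming sequence-level hidden states are already pooled; sizes and names are illustrative, not a reference implementation):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry pairwise loss: -log sigmoid(r(chosen) - r(rejected)).

    chosen_rewards, rejected_rewards: (batch,) scalar scores from the reward model.
    """
    # logsigmoid is numerically stabler than log(sigmoid(x)).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# The reward head is typically a single linear layer mapping the final
# hidden state of each sequence to one scalar.
hidden_size = 4096  # illustrative
reward_head = torch.nn.Linear(hidden_size, 1)

# Toy usage with random tensors standing in for transformer hidden states.
chosen_hidden = torch.randn(8, hidden_size)
rejected_hidden = torch.randn(8, hidden_size)
loss = reward_model_loss(
    reward_head(chosen_hidden).squeeze(-1),
    reward_head(rejected_hidden).squeeze(-1),
)
```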
Policy optimization uses RL to maximize expected reward while staying close to the reference policy.
maximize E[R(x, y)] - β * KL(π || π_ref)
Where R(x, y) is the reward model's score for response y to prompt x, π is the policy being trained, π_ref is the frozen SFT reference policy, KL is the Kullback-Leibler divergence between them, and β controls the strength of the KL penalty.
PPO is the most common algorithm for RLHF: it samples responses from the current policy, scores them with the reward model, estimates advantages with a learned value function, and updates the policy with a clipped surrogate objective.
The clipping prevents large policy updates that could destabilize training.
The KL penalty serves multiple purposes: it keeps generations fluent and close to the SFT distribution, limits how far optimization can drift from the reference policy, and makes it harder for the policy to exploit weaknesses in the reward model.
Higher β means more conservative updates; lower β allows more aggressive optimization.
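One common way to apply the KL term in practice (a sketch, assuming per-token log-probabilities from the policy and the frozen reference model are already computed) is to subtract a per-token KL penalty from the reward before the RL update:

```python
import torch

def kl_penalized_rewards(reward_scores, policy_logprobs, ref_logprobs, beta=0.1):
    """Shape rewards with a per-token KL penalty against the reference policy.

    reward_scores:   (batch,) scalar reward-model scores, one per response
    policy_logprobs: (batch, resp_len) log pi(y_t | x, y_<t) under the current policy
    ref_logprobs:    (batch, resp_len) log pi_ref(y_t | x, y_<t) under the frozen SFT model
    beta:            KL coefficient; larger values keep the policy closer to pi_ref
    """
    # Per-token estimate of KL(pi || pi_ref) along the sampled response.
    kl_per_token = policy_logprobs - ref_logprobs            # (batch, resp_len)
    # Every token pays the KL penalty; the scalar reward is added at the final token.
    shaped = -beta * kl_per_token
    shaped[:, -1] += reward_scores
    return shaped  # (batch, resp_len) per-token rewards for the RL update
```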
REINFORCE is simpler but has higher variance: it weights each sampled response's log-probability directly by its (optionally baseline-adjusted) reward and takes a single gradient step per batch.
PPO adds complexity but improves stability: ratio clipping, a learned value baseline, and multiple optimization epochs per batch reduce gradient variance and prevent destructive updates.
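A minimal sketch of the clipped surrogate objective at the center of PPO (log-probabilities and advantage estimates are assumed to be available):

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss.

    new_logprobs: log-probs of the sampled tokens under the current policy
    old_logprobs: log-probs of the same tokens under the policy that generated them
    advantages:   advantage estimates (e.g., from a learned value function)
    """
    # Probability ratio between the updated policy and the sampling policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate to obtain a loss.
    return -torch.min(unclipped, clipped).mean()
```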
See reference/policy-optimization.md for algorithm details.
Direct alignment methods optimize the RLHF objective without training a separate reward model.
DPO reparameterizes the RLHF objective to derive a closed-form loss:
L = -log sigmoid(β * (log(π(y_w|x) / π_ref(y_w|x)) - log(π(y_l|x) / π_ref(y_l|x))))
Where y_w is the preferred response and y_l is the dispreferred response.
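A minimal sketch of this loss (assuming summed per-response log-probabilities under both the trainable policy and the frozen reference model; this mirrors the expression above rather than any particular library's implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from sequence-level log-probabilities.

    Each argument is a (batch,) tensor of log pi(y|x) summed over response tokens,
    under either the trainable policy or the frozen reference model.
    """
    # Implicit reward margins: beta * log-ratio of policy to reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid of the margin, matching the loss written above.
    return -F.logsigmoid(margin).mean()
```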
Advantages: no separate reward model to train, no online sampling loop during optimization, and a simpler, more stable training setup that behaves much like supervised fine-tuning.
Trade-offs: there is no reusable reward model for filtering or online feedback, optimization is restricted to the fixed offline preference set, and the implicit reward can overfit the preference data.
IPO addresses potential overfitting in DPO by using a different loss formulation that doesn't assume the Bradley-Terry model perfectly describes preferences.
KTO works with binary feedback (good/bad) rather than pairwise comparisons, making data collection easier. It's based on prospect theory from behavioral economics.
Direct alignment is preferred when simplicity and compute efficiency matter, the preference data is fixed and offline, or running a full RL loop is impractical.
Reward-based RLHF is preferred when you want a reusable reward model, online sampling and scoring during training, or the ability to combine preference signals with other reward terms.
See reference/direct-alignment.md for detailed algorithm comparisons.
As optimization proceeds, the policy may exploit weaknesses in the reward model rather than improving on the true objective. Symptoms include reward-model scores that keep climbing while human ratings stagnate or degrade, increasingly long, repetitive, or sycophantic outputs, and a rapidly growing KL divergence from the reference policy.
Mitigations include a sufficiently strong KL penalty, early stopping based on held-out human evaluation, reward model ensembles, and periodically collecting fresh preference data on the current policy's outputs.
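As one illustrative mitigation (a sketch, not a prescribed recipe), training loops often track the mean KL from the reference policy and stop once it stays above a budget:

```python
def should_stop(mean_kl_history, kl_budget=10.0, patience=3):
    """Early-stopping check: stop if the mean KL from the reference policy
    has exceeded the budget for `patience` consecutive evaluations."""
    recent = mean_kl_history[-patience:]
    return len(recent) == patience and all(kl > kl_budget for kl in recent)

# Example: stops on the third consecutive evaluation above the budget.
assert should_stop([2.0, 11.0, 12.0, 13.0]) is True
```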
The policy finds outputs that score highly with the reward model but don't represent genuine improvement: padding responses to be longer, excessive hedging or flattery, or formatting quirks the reward model happens to favor.
Evaluating aligned models is difficult because human evaluation is slow and expensive, automatic metrics correlate poorly with preference quality, and a higher reward-model score does not by itself demonstrate genuine improvement.
The preference data comes from a specific distribution of prompts and responses. The deployed model will encounter different inputs, and the reward model may not generalize well.
reference/reward-modeling.md - Detailed reward model training procedures
reference/policy-optimization.md - PPO and policy gradient algorithms for RLHF
reference/direct-alignment.md - DPO, IPO, KTO and other direct methods