Direct Preference Optimization (DPO) from Scratch

End-to-end PyTorch implementation of the DPO algorithm to align LLM outputs with human preference data.

Self Project · Jul – Aug 2025


Overview

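DPO (Rafailov et al., 2023) reframes RLHF as a supervised learning problem: there is no separately trained reward model and no PPO loop. Instead, a reparameterization expresses the reward implicitly through the policy and a frozen reference model, so preferences can be optimized directly with a classification-style loss. This project implements it from scratch to understand how preference learning works in practice.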
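
For reference, the objective from the paper can be written as below. Here π_θ is the policy being trained, π_ref is the frozen reference model, (x, y_w, y_l) is a prompt with its chosen and rejected responses, σ is the logistic sigmoid, and β controls how far the policy may drift from the reference. The notation follows the paper and is not necessarily the naming used in this project's code.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The term inside σ is the difference between the implicit rewards of the chosen and rejected responses, so minimizing the loss pushes the policy to rank y_w above y_l relative to the reference.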

What Was Built

  • Full DPO training loop written from scratch in PyTorch, without the Hugging Face Trainer abstraction
  • Preference dataset handling: chosen/rejected response pairs with proper tokenization and batching
  • Loss implementation: the DPO objective, computed from the difference between the policy-to-reference log-probability ratios of the chosen and rejected responses
  • Verified alignment behavior by comparing policy outputs on preference pairs before and after training
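
The loss and the before/after check described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `sequence_logprob` and `dpo_loss`, the mask convention (1 for response tokens, 0 for prompt and padding), and the default `beta=0.1` are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def sequence_logprob(logits, labels, mask):
    """Sum the log-probabilities of `labels` over positions where mask == 1.

    logits: (batch, seq_len, vocab) output of a causal LM
    labels: (batch, seq_len) token ids (int64)
    mask:   (batch, seq_len) 1 for response tokens, 0 for prompt/padding (assumed convention)
    """
    # The logits at position t predict the token at position t + 1,
    # so drop the last logit and the first label to align them.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = mask[:, 1:].to(logits.dtype)
    log_probs = F.log_softmax(logits, dim=-1)
    token_logps = torch.gather(log_probs, 2, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss from per-sequence log-probs, each of shape (batch,).

    Returns the mean loss and the fraction of pairs where the implicit
    reward of the chosen response exceeds that of the rejected one.
    """
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(margin).mean()
    # Implicit rewards are detached: they are used for monitoring only.
    chosen_reward = beta * chosen_logratio.detach()
    rejected_reward = beta * rejected_logratio.detach()
    accuracy = (chosen_reward > rejected_reward).float().mean()
    return loss, accuracy
```

In a training step, the reference log-probabilities would typically be computed under `torch.no_grad()` with the reference model frozen, so gradients flow only through the policy. Tracking the returned accuracy on held-out preference pairs before and after training is one simple way to check that the policy has moved toward the chosen responses.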

Why

Understanding preference optimization at the implementation level, not just from the paper, is essential for anyone working on LLM alignment, reward modeling, or recommender systems with human feedback.

Stack

Python · PyTorch · Hugging Face Tokenizers · Human Preference Datasets