diff --git a/README.md b/README.md
index 80fd6f3..ac93305 100644
--- a/README.md
+++ b/README.md
@@ -106,6 +106,34 @@ print(f"Parameters: {total:,}")
 
 ---
 
+## Training
+
+The training script for the 3B model on FineWeb-Edu is at [`training/3b_fine_web_edu.py`](training/3b_fine_web_edu.py).
+
+**Single GPU:**
+```bash
+python training/3b_fine_web_edu.py
+```
+
+**Multi-GPU (auto-detects GPU count):**
+```bash
+torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/3b_fine_web_edu.py
+```
+
+Key design choices:
+
+| Feature | Detail |
+|---|---|
+| Optimizer | Muon for 2D weight matrices, AdamW for embeddings/norms |
+| Dataset | `HuggingFaceFW/fineweb-edu` (`sample-10BT` by default; swap to `sample-100BT` or `default` for a full run) |
+| Tokenizer | `openai/gpt-oss-20b` via `MythosTokenizer` |
+| Parallelism | PyTorch DDP via `torchrun`, sharded streaming dataset |
+| Precision | bfloat16 on H100/A100, float16 + GradScaler on older GPUs |
+| Schedule | Linear warmup (2000 steps) → cosine decay |
+| Target | 30B tokens (approx. Chinchilla budget, adjusted for the looped architecture) |
+
+---
+
 ## Documentation
 
 | Page | Description |
diff --git a/pyproject.toml b/pyproject.toml
index 40dca10..e4fea69 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -38,7 +38,9 @@ classifiers = [
 
 [tool.poetry.dependencies]
 python = ">=3.10,<4.0"
-torch = "*"
+torch = ">=2.1.0"
+transformers = ">=4.40.0"
+datasets = ">=2.18.0"
 
 [tool.poetry.group.lint.dependencies]
 
diff --git a/requirements.txt b/requirements.txt
index d428db7..3b01619 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,2 +1,4 @@
 torch>=2.1.0
+transformers>=4.40.0
+datasets>=2.18.0
 pytest>=7.0.0
diff --git a/train.py b/training/3b_fine_web_edu.py
similarity index 100%
rename from train.py
rename to training/3b_fine_web_edu.py
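
For the optimizer row in the new Training section: a minimal sketch of how a Muon/AdamW parameter split could be wired up, assuming the repo provides a `Muon` optimizer class (Muon is not part of PyTorch). The `"embed"` name filter and all learning rates here are illustrative placeholders, not the script's actual values.

```python
import torch

def build_optimizers(model, muon_cls):
    """Route 2-D weight matrices to Muon and everything else to AdamW."""
    muon_params, adamw_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Embedding tables are 2-D but conventionally stay on AdamW,
        # as do 1-D tensors such as norm gains and biases.
        if param.ndim == 2 and "embed" not in name:
            muon_params.append(param)
        else:
            adamw_params.append(param)
    return (
        muon_cls(muon_params, lr=2e-2),                              # illustrative lr
        torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.1),  # illustrative
    )
```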
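
The "sharded streaming dataset" entry in the parallelism row pairs naturally with the node-splitting helper that ships with `datasets`; a sketch using the `RANK`/`WORLD_SIZE` environment variables that `torchrun` exports, which may differ from how the script actually shards.

```python
import os
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# Stream FineWeb-Edu without downloading it up front, then give each
# DDP rank a disjoint shard of the stream.
dataset = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",  # see the README table for the larger configs
    split="train",
    streaming=True,
)
dataset = split_dataset_by_node(
    dataset,
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)
```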
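
The precision row reduces to a capability check: bfloat16 needs Ampere-class hardware or newer (A100/H100), and older GPUs fall back to float16 with loss scaling. A sketch assuming `model`, `optimizer`, and `loader` are already built; the scaler is disabled under bfloat16, where loss scaling is unnecessary.

```python
import torch

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)  # no-op when bf16

for batch in loader:
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(batch)  # placeholder forward returning a scalar loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```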
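
The warmup → cosine schedule from the table maps onto a plain `LambdaLR`. The 2000-step warmup comes from the README; the total step count below is a placeholder, since it depends on tokens per step relative to the 30B-token target.

```python
import math
from torch.optim.lr_scheduler import LambdaLR

WARMUP_STEPS = 2_000   # from the README table
TOTAL_STEPS = 100_000  # placeholder: 30B tokens / (tokens per step)

def lr_lambda(step: int) -> float:
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)  # linear warmup from 0
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

scheduler = LambdaLR(optimizer, lr_lambda)  # assumes `optimizer` exists
```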