From 5ffb897dcfff6962ce7992edf604386796985dd7 Mon Sep 17 00:00:00 2001
From: Kye Gomez
Date: Sun, 19 Apr 2026 22:48:30 -0400
Subject: [PATCH] [feat][training-script][add 3b fineweb-edu training
 script][feat][tokenizer][add MythosTokenizer class with encode
 decode][improvement][deps][add transformers and datasets
 dependencies][docs][readme-training][add training section with run
 commands][improvement][pyproject][pin torch and add new deps]

---
 README.md                               | 28 +++++++++++++++++++++++++
 pyproject.toml                          |  4 +++-
 requirements.txt                        |  2 ++
 train.py => training/3b_fine_web_edu.py |  0
 4 files changed, 33 insertions(+), 1 deletion(-)
 rename train.py => training/3b_fine_web_edu.py (100%)

diff --git a/README.md b/README.md
index 80fd6f3..ac93305 100644
--- a/README.md
+++ b/README.md
@@ -106,6 +106,34 @@ print(f"Parameters: {total:,}")
 
 ---
 
+## Training
+
+The training script for the 3B model on FineWeb-Edu is at [`training/3b_fine_web_edu.py`](training/3b_fine_web_edu.py).
+
+**Single GPU:**
+```bash
+python training/3b_fine_web_edu.py
+```
+
+**Multi-GPU (auto-detects GPU count):**
+```bash
+torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/3b_fine_web_edu.py
+```
+
+Key design choices:
+
+| Feature | Detail |
+|---|---|
+| Optimizer | Muon for 2D weight matrices, AdamW for embeddings/norms |
+| Dataset | `HuggingFaceFW/fineweb-edu` (`sample-10BT` by default; swap to `sample-100BT` or `default` for a full run) |
+| Tokenizer | `openai/gpt-oss-20b` via `MythosTokenizer` |
+| Parallelism | PyTorch DDP via `torchrun`, sharded streaming dataset |
+| Precision | bfloat16 on H100/A100, float16 + GradScaler on older GPUs |
+| Schedule | Linear warmup (2,000 steps) → cosine decay |
+| Target | 30B tokens (roughly Chinchilla-optimal, adjusted for the looped architecture) |
+
+---
+
 ## Documentation
 
 | Page | Description |
diff --git a/pyproject.toml b/pyproject.toml
index 40dca10..e4fea69 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -38,7 +38,9 @@ classifiers = [
 
 [tool.poetry.dependencies]
 python = ">=3.10,<4.0"
-torch = "*"
+torch = "2.11.0"
+transformers = ">=4.40.0"
+datasets = ">=2.18.0"
 
 
 [tool.poetry.group.lint.dependencies]
diff --git a/requirements.txt b/requirements.txt
index d428db7..3b01619 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,2 +1,4 @@
 torch>=2.1.0
+transformers>=4.40.0
+datasets>=2.18.0
 pytest>=7.0.0
diff --git a/train.py b/training/3b_fine_web_edu.py
similarity index 100%
rename from train.py
rename to training/3b_fine_web_edu.py
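The "Schedule" row in the README table above (linear warmup over 2,000 steps, then cosine decay) can be sketched in plain Python. `TOTAL_STEPS` and `PEAK_LR` here are hypothetical placeholders for illustration, not values taken from the training script:

```python
import math

WARMUP_STEPS = 2_000   # from the "Schedule" row above
TOTAL_STEPS = 100_000  # hypothetical; the real value follows from the 30B-token target
PEAK_LR = 3e-4         # hypothetical peak learning rate

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        # Ramp linearly so step WARMUP_STEPS - 1 reaches exactly PEAK_LR.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay over the remaining steps: progress goes 0 -> 1.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks exactly at the end of warmup and reaches the halfway learning rate at the midpoint of the decay phase; the actual script may parameterize this differently.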
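The "sharded streaming dataset" in the Parallelism row means each DDP rank consumes a disjoint slice of one token stream. A minimal interleaved-sharding sketch under that assumption (the actual script may instead rely on a built-in helper such as `datasets.distributed.split_by_node`):

```python
from itertools import islice
from typing import Iterable, Iterator

def shard_for_rank(stream: Iterable, rank: int, world_size: int) -> Iterator:
    """Give rank r items r, r + world_size, r + 2 * world_size, ...
    so every item of the stream goes to exactly one rank."""
    return islice(stream, rank, None, world_size)
```

With `world_size=4`, rank 1 sees items 1, 5, 9, ... of the stream; interleaving keeps ranks balanced even when the streaming dataset's total length is unknown up front.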