# Verifying SFT tokenization is byte-identical to origin/main
When you change code in `scripts/data/convert_sft_data_for_olmocore.py` or `open_instruct/numpy_dataset_conversion.py`, you need to prove the new output matches `origin/main` byte-for-byte before merging. This page describes the procedure.
The full production mixer takes ~8 hours to run. For a tight feedback loop during development, do a 50k-example A/B first (about 8 minutes of wall clock per side, plus a ~15s compare). Move up to a full-scale run only once the small A/B passes.
## How the compare works
The compare produces a SHA-256 for every output artifact:

- `token_ids_part_*.npy`
- `labels_mask_part_*.npy`
- `token_ids_part_*.csv.gz` (decompress before hashing)
- `dataset_statistics.json`: hash after stripping `timestamp` and `output_directory` (these vary by run and aren't meaningful to compare)
Skip `dataset_statistics.txt` (a human-readable duplicate), the `tokenizer/` dir, and any `_checkpoint*` files.
## Step 1: two image builds
Tokenization is driven from a Beaker image, so you need one image per side of the A/B.
- **Origin/main image.** Create a throwaway worktree at `origin/main` and build an image from it:

  ```bash
  git worktree add -b verify-main /tmp/oi-main-verify origin/main
  cd /tmp/oi-main-verify
  ./scripts/train/build_image_and_launch.sh \
      scripts/train/olmo-hybrid/7b_think_sft_tokenization.sh
  ```

  Note the image ID printed at the end.

- **HEAD image.** From your branch:

  ```bash
  ./scripts/train/build_image_and_launch.sh \
      scripts/train/olmo-hybrid/7b_think_sft_tokenization.sh
  ```
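Once the origin/main image is built, the throwaway worktree is no longer needed; standard git commands clean it up:

```bash
# Run from your normal checkout, not from /tmp/oi-main-verify.
git worktree remove /tmp/oi-main-verify
git branch -D verify-main
```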
## Step 2: parallel tokenize jobs
Launch both images against the full production mixer with identical args (Dolci-Think-SFT-32B at 1.0 plus the 5 tool datasets at 3.0x, `--chat_template_name olmo123`, `--max_seq_length 32768`). Write into distinct weka output dirs, e.g. `/weka/oe-adapt-default/$USER/dataset/olmo-hybrid-{main,head}`.

For a quick 50k A/B, add `--num_examples 50000` to both sides. At 50k each job takes ~7–8 minutes.
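The exact launch command depends on how `7b_think_sft_tokenization.sh` wires things up. As a rough sketch of one side, reusing the `mason.py` flags from the Step 3 compare job: everything about the converter invocation except `--chat_template_name`, `--max_seq_length`, and `--num_examples` is an assumption here, including the `--output_dir` flag name.

```bash
# Hedged sketch only -- the dataset-mixer arguments are elided, --output_dir
# is an assumed flag name, and the resource flags are mirrored from the Step 3
# compare job. Repeat with <MAIN_IMAGE_ID> and the ...-main output dir.
uv run python mason.py \
  --cluster ai2/jupiter \
  --budget ai2/oe-adapt \
  --workspace ai2/olmo-instruct \
  --image <HEAD_IMAGE_ID> \
  --pure_docker_mode --no-host-networking \
  --gpus 0 --priority urgent \
  --description "tokenize head" \
  -- uv run python scripts/data/convert_sft_data_for_olmocore.py \
     --chat_template_name olmo123 \
     --max_seq_length 32768 \
     --num_examples 50000 \
     --output_dir /weka/oe-adapt-default/$USER/dataset/olmo-hybrid-head
```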
## Step 3: compare on Beaker
weka isn't mounted locally, so run the compare as a tiny CPU job on Beaker. Pin `--image` to an explicit image ID: the `$USER/open-instruct-integration-test` tag moves in place on every push, so by the time the tokenize jobs finish the tag may no longer match what you ran.
```bash
uv run python mason.py \
  --cluster ai2/jupiter \
  --budget ai2/oe-adapt \
  --workspace ai2/olmo-instruct \
  --image <HEAD_IMAGE_ID> \
  --pure_docker_mode --no-host-networking \
  --gpus 0 --priority urgent \
  --description "compare head vs main" \
  --no_auto_dataset_cache \
  -- bash -c 'uv run python - "$1" "$2" <<PY
import gzip, hashlib, json, os, sys

new, ref = sys.argv[1], sys.argv[2]
fail = 0
for name in sorted(set(os.listdir(new)) | set(os.listdir(ref))):
    # Skip artifacts that are not meaningful to compare.
    if name.startswith("_checkpoint") or name in ("dataset_statistics.txt", "tokenizer"):
        continue
    a, b = os.path.join(new, name), os.path.join(ref, name)
    if not (os.path.isfile(a) and os.path.isfile(b)):
        print(f"MISSING: {name}"); fail += 1; continue
    if name == "dataset_statistics.json":
        # Strip run-varying fields, then hash a canonical JSON dump.
        da, db = json.load(open(a)), json.load(open(b))
        for d in (da, db):
            d.pop("timestamp", None); d.pop("output_directory", None)
        ha = hashlib.sha256(json.dumps(da, sort_keys=True).encode()).hexdigest()
        hb = hashlib.sha256(json.dumps(db, sort_keys=True).encode()).hexdigest()
    elif name.endswith(".gz"):
        # Hash the decompressed payload so gzip metadata cannot cause a false diff.
        ha = hashlib.sha256(gzip.open(a).read()).hexdigest()
        hb = hashlib.sha256(gzip.open(b).read()).hexdigest()
    else:
        ha = hashlib.sha256(open(a, "rb").read()).hexdigest()
        hb = hashlib.sha256(open(b, "rb").read()).hexdigest()
    tag = "OK  " if ha == hb else "DIFF"
    if ha != hb: fail += 1
    print(f"{tag} {name} {ha} vs {hb}")
print("=== PASSED ===" if not fail else f"=== FAILED: {fail} mismatches ===")
sys.exit(1 if fail else 0)
PY
' -- /weka/oe-adapt-default/$USER/dataset/olmo-hybrid-head \
     /weka/oe-adapt-default/$USER/dataset/olmo-hybrid-main
```
Success looks like `=== PASSED ===`. On failure, use `cmp -l` on the diverging artifact to find the first mismatching byte range.
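For the `.npy` shards, diffing at the token level is usually more informative than raw byte offsets. A minimal sketch, run as the same kind of CPU job since weka isn't mounted locally (`<head_dir>`, `<main_dir>`, and the shard name are placeholders):

```bash
uv run python <<'PY'
import numpy as np

# mmap so multi-GB shards are not read fully into memory up front
a = np.load("<head_dir>/token_ids_part_0.npy", mmap_mode="r")
b = np.load("<main_dir>/token_ids_part_0.npy", mmap_mode="r")
if a.shape != b.shape or a.dtype != b.dtype:
    print(f"layout mismatch: {a.shape}/{a.dtype} vs {b.shape}/{b.dtype}")
else:
    diff = np.flatnonzero(a != b)
    if diff.size:
        i = diff[0]
        print(f"{diff.size} differing positions; first at flat index {i}: "
              f"{a.flat[i]} vs {b.flat[i]}")
    else:
        print("shards identical")
PY
```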
## Gotchas
- Always pin `--image` to an explicit Beaker image ID, not a tag. Tags move when anyone rebuilds the image.
- The Beaker container doesn't have `jq`; use `uv run python` for JSON hashing.
- If the file listings differ (not just hashes), the culprit is usually a code change in `scripts/data/convert_sft_data_for_olmocore.py` or upstream in `apply_chat_template`. Check `git log origin/main..HEAD -- scripts/data/ open_instruct/dataset_transformation.py`.
- If you've already produced a full-scale `origin/main` reference on weka, point the compare's reference side at it to skip re-running the full mixer.
## Golden reference run
There is an existing full-scale `origin/main` tokenization on weka that you can compare against without re-running the full mixer:

- Beaker experiment: https://beaker.org/ex/01KPRDGYEM81EASNNSBZ2HA7KA
- Output directory: `/weka/oe-adapt-default/finbarrt/dataset/olmo-hybrid-main-repro`

Point the compare's reference side at that directory.
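Concretely, the Step 3 compare command is unchanged except for its second positional argument:

```bash
# Tail of the Step 3 command, pointing the reference side at the golden run:
' -- /weka/oe-adapt-default/$USER/dataset/olmo-hybrid-head \
     /weka/oe-adapt-default/finbarrt/dataset/olmo-hybrid-main-repro
```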