# Dataset Transformations
Dataset transformations are a key part of the training process. Typically, we are given some text dataset, and we tokenize and filter it to be used for training.
Open Instruct includes a `dataset_transformation.py` utility which:
- handles dataset mixing
- handles different tokenization functions
- caches the tokenized dataset so we don't have to re-tokenize every time
    - This is especially important when we have 405B SFT models: 32 nodes each spend roughly 5 minutes just tokenizing the dataset, which translates to 32 nodes * 5 minutes * 8 GPUs = 1280 GPU-minutes, about 21 hours of wasted H100 time.
    - Sometimes we also launch on clusters that don't have a shared cache, so we would download the individual datasets 32 times and wait for concatenation and tokenization (actually twice, because the `with accelerator.main_process_first()` context manager assumes a shared cache; see the sketch after this list).
    - Using a cache like this also minimizes the time to the first training output, making debug cycles faster.
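
To make that last caveat concrete, here is a minimal, self-contained sketch of the usual `main_process_first` pattern; the model and data below are placeholders, not what Open Instruct actually loads:

```python
from accelerate import Accelerator
from datasets import Dataset
from transformers import AutoTokenizer

accelerator = Accelerator()
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
raw_dataset = Dataset.from_dict({"text": ["hello world", "open instruct"]})  # placeholder data

# The main process enters the block first, tokenizes, and writes the
# datasets cache; the remaining processes then run the same map() and are
# expected to hit that cache instead of re-tokenizing. Without a cache
# directory shared across nodes, every node redoes the work itself.
with accelerator.main_process_first():
    tokenized = raw_dataset.map(lambda batch: tokenizer(batch["text"]), batched=True)
```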
## SFT Dataset Format
We expect the dataset to have a `messages` key, which is a list of dictionaries with `role` and `content` keys. For example, a single row might look like this (the values are illustrative):
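
```python
# One SFT example: a conversation stored under the `messages` key.
# The contents are made up for illustration.
example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
```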
Below is a minimal example of how `dataset_transformation.py` is used in the `finetune.py` script to mix, tokenize, and filter a dataset for SFT. You can run `python scripts/data/finetune_dataset_transformation.py` to see the output.
See `scripts/data/finetune_dataset_transformation.py` for the full listing.
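
In rough outline, the script does something like the following. This is a sketch under assumptions, not the verbatim code: `get_cached_dataset_tulu`, its argument names, and the dataset mix are guesses for illustration, so consult the script itself for the real API (only `TokenizerConfig` and `chat_template_name` are confirmed by this page):

```python
# Hypothetical sketch: the helper name and its arguments are assumptions,
# not the verbatim open-instruct API.
from open_instruct.dataset_transformation import TokenizerConfig, get_cached_dataset_tulu

tc = TokenizerConfig(
    tokenizer_name_or_path="meta-llama/Llama-3.1-8B",  # assumed field and model
    chat_template_name="tulu",                         # assumed template name
)

# Mix datasets (assumed "name, weight" convention), then tokenize, filter,
# and cache the result so later runs can skip straight to training.
dataset = get_cached_dataset_tulu(
    dataset_mixer_list=["allenai/tulu-3-sft-mixture", "1.0"],  # assumed mix
    dataset_mixer_list_splits=["train"],
    tc=tc,
)
print(dataset)
```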
You can also use a different `chat_template_name`. For example,

```python
tc = TokenizerConfig(
    # ...
    chat_template_name="simple_chat",
)
# ...
```

would give us the same conversations rendered with the `simple_chat` template instead.
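
For intuition about what swapping the chat template changes, here is a minimal, self-contained sketch using Hugging Face's `apply_chat_template`; the Jinja template string is purely illustrative and is not the actual `simple_chat` definition:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
# Illustrative template only; NOT the real `simple_chat` definition.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] | capitalize }}: {{ message['content'] }}\n"
    "{% endfor %}"
)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
# User: What is the capital of France?
# Assistant: The capital of France is Paris.
```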