TIGER in torch-rechub
This document describes how TIGER (Transformer Index for GEnerative Recommenders) is implemented and run in torch-rechub. TIGER frames "predict the next item" as a sequence-to-sequence task of "generate the next item's semantic ID": each item is first quantized by RQ-VAE into a tuple of codebook tokens (a semantic ID, e.g. <a_1><b_3><c_5>), then T5 autoregressively generates the next item's semantic ID, constrained to legal items via prefix-restricted beam search.
The example scripts follow the same "one script per dataset" layout as HSTU / HLLM:
- `examples/generative/run_tiger_movielens.py`
- `examples/generative/run_tiger_amazon_books.py`
1. Module Layout
- Model: `torch_rechub/models/generative/tiger.py`
  - `TIGERModel`: subclasses `transformers.T5ForConditionalGeneration`, adding `set_hyper(temperature)` and a temperature-scaled `ranking_loss`.
- Dataset & constrained decoding: `torch_rechub/utils/data.py`
  - `TigerSeqDataset`: maps item-id sequences in `inter.json` to semantic-ID strings and applies a leave-one-out split for `train` / `valid` / `test`.
  - `Trie`: builds a prefix tree over all legal semantic IDs and yields a `prefix_allowed_tokens_fn` for constrained beam search.
- Semantic-ID generation: `examples/generative/run_rqvae_amazon_books.py`
  - Trains an RQ-VAE over item embeddings and exports `semantic_ids.json`.
- Example scripts: `run_tiger_movielens.py` / `run_tiger_amazon_books.py`, each implementing `train` / `test` and toy-data generation.
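The prefix-tree idea behind `Trie` can be illustrated with a small standalone sketch. This is not torch-rechub's implementation: `PrefixTrie`, `make_prefix_allowed_tokens_fn`, and all token ids below are hypothetical. The returned callable takes `(batch_id, input_ids)`, which is the shape Hugging Face `generate` expects for `prefix_allowed_tokens_fn`; the sketch assumes each stored sequence starts with the decoder start token and ends with EOS.

```python
from typing import Callable, Dict, Iterable, List


class PrefixTrie:
    """Hypothetical minimal trie over token-id sequences."""

    def __init__(self, sequences: Iterable[List[int]] = ()):
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            self.add(seq)

    def add(self, seq: List[int]) -> None:
        node = self.root
        for tok in seq:
            node = node.setdefault(tok, {})

    def next_tokens(self, prefix: List[int]) -> List[int]:
        """Return the token ids that may legally follow `prefix`."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


def make_prefix_allowed_tokens_fn(trie: PrefixTrie, eos_id: int) -> Callable:
    def fn(batch_id, input_ids) -> List[int]:
        # Accept a tensor (via .tolist()) or a plain list of ints.
        prefix = input_ids.tolist() if hasattr(input_ids, "tolist") else list(input_ids)
        allowed = trie.next_tokens(prefix)
        # Fall back to EOS so decoding can always terminate.
        return allowed if allowed else [eos_id]

    return fn


START, EOS = 0, 1  # hypothetical decoder-start and EOS ids
item_tokens = {1: [100, 200, 300], 2: [100, 200, 301], 3: [101, 202, 300]}
trie = PrefixTrie([START] + ids + [EOS] for ids in item_tokens.values())
allowed_fn = make_prefix_allowed_tokens_fn(trie, EOS)

print(allowed_fn(0, [START]))            # first-level codes shared by legal items
print(allowed_fn(0, [START, 100, 200]))  # third-level codes under that prefix
```

At every decoding step the beam can only extend a prefix with tokens that keep it on a path to some legal item, which is why the generated semantic IDs map back to real items.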
2. Data Format
TIGER needs two JSON files:
- `inter.json`: `{user_id: [item_id, item_id, ...]}`, each user's chronologically ordered item-id history. Item ids are 1-based; `0` is reserved for padding.
- `semantic_ids.json`: `{item_id: ["<a_..>", "<b_..>", ...]}`, a semantic ID for every item id that appears in `inter.json`.
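A quick consistency check between the two files can catch misaligned ids before training. The sketch below is illustrative only: `find_alignment_problems` is a hypothetical helper, not part of torch-rechub, and it assumes the files have already been loaded with `json.load` (so the keys of `semantic_ids.json` arrive as strings).

```python
from typing import Dict, List


def find_alignment_problems(
    inter: Dict[str, List[int]],
    semantic_ids: Dict[str, List[str]],
) -> Dict[str, list]:
    """Report item ids without a semantic ID and users whose history contains 0."""
    # 0 is reserved for padding, so it is reported separately.
    used = {item for items in inter.values() for item in items if item != 0}
    known = {int(key) for key in semantic_ids}
    return {
        "missing_semantic_id": sorted(used - known),
        "uses_padding_id": sorted(user for user, items in inter.items() if 0 in items),
    }


inter = {"u1": [1, 2, 3], "u2": [2, 4, 0]}
semantic_ids = {
    "1": ["<a_1>", "<b_3>", "<c_5>"],
    "2": ["<a_1>", "<b_4>", "<c_0>"],
    "3": ["<a_2>", "<b_3>", "<c_5>"],
}
problems = find_alignment_problems(inter, semantic_ids)
print(problems)
```

Both lists should be empty for a dataset that is ready for training.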
TigerSeqDataset leave-one-out split:
- `train`: expand the history `items[:-2]` into multiple `(history, next_item)` samples.
- `valid`: history is `items[:-2]`, label is `items[-2]`.
- `test`: history is `items[:-1]`, label is `items[-1]`.
So each user needs at least 3 interactions to produce a training sample.
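The split can be sketched on plain item-id lists as follows. This is a simplified illustration rather than the code in `TigerSeqDataset`: `leave_one_out` is a hypothetical name, the mapping to semantic-ID strings is omitted, and the way the training prefix is expanded (every non-empty prefix paired with the item that follows it) is an assumption.

```python
from typing import Dict, List, Tuple

Sample = Tuple[List[int], int]  # (history item ids, next item id)


def leave_one_out(items: List[int]) -> Dict[str, List[Sample]]:
    """Split one user's chronological item ids into train/valid/test samples."""
    train_items = items[:-2]
    # Assumption: each training sample uses a non-empty history prefix.
    train = [(train_items[:i], train_items[i]) for i in range(1, len(train_items))]
    valid = [(items[:-2], items[-2])]
    test = [(items[:-1], items[-1])]
    return {"train": train, "valid": valid, "test": test}


splits = leave_one_out([3, 7, 1, 9, 4])
for name, samples in splits.items():
    print(name, samples)
```

The last item is only ever seen as the test label and the second-to-last only as the validation label, so neither leaks into training targets.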
3. Run Modes
Both scripts use --mode to select the stage(s):
| mode | Description |
|---|---|
| `generate-toy-data` | Write a small synthetic dataset to `--data_inter_path` / `--data_indice_path` (the exact paths the loader reads) |
| `prepare-data` | MovieLens only: build `inter.json` and `movie_id_map.json` from the real `ratings.dat` |
| `train` | Add semantic-ID tokens, `resize_token_embeddings`, train T5 from scratch (random init, no pretrained weights), save tokenizer / config / model to `--output_dir` |
| `test` | Load from `--ckpt_path` (defaults to `--output_dir`), run constrained beam search, report `hit@k` / `ndcg@k` |
| `all` | Run `generate-toy-data` → `train` → `test` (default) |
Key implementation details:
- Trained from scratch, no pretrained weights: per the TIGER paper, the T5 encoder-decoder is randomly initialized and trained from scratch (the semantic-ID vocabulary is not natural language, so pretrained NL weights are not useful). `train()` builds the architecture from `--base_model`'s config via `TIGERModel(config)`; `--base_model` only supplies the architecture/config and tokenizer, not pretrained weights.
- Semantic-ID tokens must be added before training: `train()` calls `tokenizer.add_tokens(dataset.get_new_tokens())` and then `resize_token_embeddings`; otherwise tokens like `<a_1>` are split into sub-words and training is meaningless.
- Generated and read paths are identical: `generate-toy-data` writes to the same paths the dataset reads, avoiding a "generated filename ≠ read filename" mismatch.
- `test` loads weights with the checkpoint's own config: this avoids passing a grown `vocab_size` into `from_pretrained` (which would raise an embedding-size mismatch); the tokenizer is then reconciled and the model resized only if needed before evaluation.
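To make the "add tokens first" step concrete, the sketch below shows one plausible way to derive the new-token list from `semantic_ids.json`. It is an assumption about what such a list looks like, not the body of `dataset.get_new_tokens()`; `collect_new_tokens` and `render_item` are hypothetical helpers. The Hugging Face calls appear only in comments so the example runs without `transformers`.

```python
from typing import Dict, List


def collect_new_tokens(semantic_ids: Dict[str, List[str]]) -> List[str]:
    """Return every distinct codebook token, sorted for a stable vocabulary order."""
    return sorted({tok for tokens in semantic_ids.values() for tok in tokens})


def render_item(tokens: List[str]) -> str:
    """Join one item's codebook tokens into a semantic-ID string."""
    return "".join(tokens)


semantic_ids = {
    "1": ["<a_1>", "<b_3>", "<c_5>"],
    "2": ["<a_1>", "<b_4>", "<c_0>"],
    "3": ["<a_2>", "<b_3>", "<c_5>"],
}
new_tokens = collect_new_tokens(semantic_ids)
print(f"{len(new_tokens)} distinct semantic-id tokens: {new_tokens}")
print(render_item(semantic_ids["1"]))

# With a real tokenizer and model, the list would then be registered like this:
#   tokenizer.add_tokens(new_tokens)
#   model.resize_token_embeddings(len(tokenizer))
```

Once registered, each token such as `<a_1>` maps to a single vocabulary id instead of being broken into sub-word pieces.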
4. Quick Toy Run
No external data is required; this runs end to end on CPU:
```bash
cd examples/generative

# Synthetic MovieLens-shaped data
python run_tiger_movielens.py --mode all \
    --toy_num_users 16 --toy_num_items 20 \
    --epochs 2 --per_device_batch_size 4 \
    --num_beams 4 --test_batch_size 2 --num_workers 0

# Built-in Amazon-Books toy data
python run_tiger_amazon_books.py --mode all \
    --epochs 5 --per_device_batch_size 4 \
    --num_beams 4 --test_batch_size 2 --num_workers 0
```

You should see `Added N semantic-id tokens` → `Model saved to ...` → `Test results: {...}`.
Offline / no access to the HuggingFace Hub: the legacy alias `t5-small` may not resolve, so use the canonical repo id: `--base_model google-t5/t5-small`.
5. Real-Data Pipeline (MovieLens-1M)
Real data needs semantic IDs aligned with inter.json, so it is a two-stage RQ-VAE → TIGER pipeline:
1. Build interaction sequences:

   ```bash
   python run_tiger_movielens.py --mode prepare-data \
       --ratings_path ./data/ml-1m/ratings.dat \
       --data_inter_path ./data/ml-1m/tiger/inter.json \
       --min_seq_len 5 --max_his_len 20
   ```

   This orders interactions by timestamp, filters users with too few interactions, remaps movie ids to contiguous 1-based item ids, and also writes `movie_id_map.json`.

2. Generate semantic IDs: prepare item embeddings for the same item ids (e.g. the text/ID embeddings produced by HLLM preprocessing), then train an RQ-VAE in the style of `run_rqvae_amazon_books.py` and export `semantic_ids.json`. The RQ-VAE output must be keyed by the same item ids as step 1, otherwise `inter.json` and `semantic_ids.json` will not line up.

3. Train and test:

   ```bash
   python run_tiger_movielens.py --mode train \
       --data_inter_path ./data/ml-1m/tiger/inter.json \
       --data_indice_path ./data/ml-1m/tiger/semantic_ids.json \
       --output_dir ./ckpt/tiger_ml

   python run_tiger_movielens.py --mode test --ckpt_path ./ckpt/tiger_ml \
       --data_inter_path ./data/ml-1m/tiger/inter.json \
       --data_indice_path ./data/ml-1m/tiger/semantic_ids.json
   ```
The Amazon-Books real-data flow is identical: produce semantic_ids.json with run_rqvae_amazon_books.py, prepare inter.json, then --mode train / --mode test.
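The MovieLens interaction-building step can be approximated in a few lines. The sketch below is not the script's `prepare-data` code: `build_inter` is a hypothetical function, the remapping order (first appearance among kept users) is an assumption, and history truncation via `--max_his_len` is left out. It relies only on the MovieLens-1M `ratings.dat` line format `UserID::MovieID::Rating::Timestamp`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_inter(
    lines: Iterable[str], min_seq_len: int = 5
) -> Tuple[Dict[str, List[int]], Dict[str, int]]:
    """Return (inter, movie_id_map) built from ratings.dat-style lines."""
    per_user = defaultdict(list)
    for line in lines:
        user, movie, _rating, timestamp = line.strip().split("::")
        per_user[user].append((int(timestamp), movie))

    inter: Dict[str, List[int]] = {}
    movie_id_map: Dict[str, int] = {}
    for user, events in per_user.items():
        if len(events) < min_seq_len:
            continue  # drop users with too few interactions
        events.sort(key=lambda event: event[0])  # chronological order
        sequence = []
        for _, movie in events:
            if movie not in movie_id_map:
                # Contiguous 1-based ids; 0 stays free for padding.
                movie_id_map[movie] = len(movie_id_map) + 1
            sequence.append(movie_id_map[movie])
        inter[user] = sequence
    return inter, movie_id_map


sample_lines = [
    "1::10::5::300",
    "1::20::4::100",
    "1::30::3::200",
    "2::10::5::50",
]
inter, movie_id_map = build_inter(sample_lines, min_seq_len=3)
print(inter)
print(movie_id_map)
```

Keeping `movie_id_map` around matters for step 2: the item embeddings fed to the RQ-VAE have to be indexed by the remapped ids, not by the original MovieLens ids.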
6. Evaluation
The test stage runs constrained beam search per test sample:
- A `Trie` builds the `prefix_allowed_tokens_fn` so each step only allows the next token of a legal semantic ID.
- `--num_beams` is also used as `num_return_sequences` to obtain top-N candidates; `--filter_items` pushes any generated result outside the legal item set to a very low score.
- Metrics are leave-one-out `hit@k` and `ndcg@k` (each user has a single ground truth, so IDCG=1). Configure with `--metrics hit@1,hit@5,hit@10,ndcg@5,ndcg@10`.
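With a single ground-truth item per user, both metrics reduce to simple functions of the target's rank in the beam output. The sketch below shows that computation for one user; `hit_at_k` and `ndcg_at_k` are illustrative names and not necessarily how the scripts organize it.

```python
import math
from typing import List


def hit_at_k(ranked: List[str], target: str, k: int) -> float:
    """1.0 if the target appears in the top-k candidates, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """DCG of the single relevant item; IDCG is 1, so no normalization is needed."""
    for position, candidate in enumerate(ranked[:k]):
        if candidate == target:
            return 1.0 / math.log2(position + 2)  # position is 0-based
    return 0.0


# Beam-search output for one user, best candidate first (toy semantic IDs).
ranked = ["<a_2><b_3><c_5>", "<a_1><b_4><c_0>", "<a_1><b_3><c_5>", "<a_2><b_1><c_1>"]
target = "<a_1><b_3><c_5>"

print(hit_at_k(ranked, target, 1), hit_at_k(ranked, target, 5))
print(ndcg_at_k(ranked, target, 5))
```

The reported `hit@k` / `ndcg@k` values are then the averages of these per-user scores over the test set.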
7. Troubleshooting
- `OSError: We couldn't connect to 'https://huggingface.co'`: when offline, use `--base_model google-t5/t5-small` (canonical repo id) and make sure the files it needs (config and tokenizer; pretrained weights are not used) are cached locally.
- `Trainer.__init__() got an unexpected keyword argument 'tokenizer'`: `transformers>=5` renamed the `tokenizer` argument to `processing_class`; the script auto-adapts via the signature, so no manual change is needed.
- `'TIGERModel' object has no attribute 'model_parallel'`: the legacy T5 model-parallel guards are no longer initialized in `transformers>=5`; `TIGERModel.__init__` now sets `model_parallel=False` / `device_map=None`, so multi-GPU DataParallel works too.
- Outputs are always sub-words / accuracy is abnormally low: confirm that training actually ran `add_tokens` + `resize_token_embeddings` (the log shows `Added N semantic-id tokens`) and that `test` loads the vocabulary from the `--output_dir` saved during training.
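Adapting to the `tokenizer` → `processing_class` rename by inspecting the constructor signature can look roughly like this. It is a generic sketch of the technique, not a copy of the script: `tokenizer_kwarg` and the two stand-in classes are hypothetical, and stand-ins are used so the example runs without `transformers` installed.

```python
import inspect
from typing import Any, Dict


def tokenizer_kwarg(trainer_cls: type, tokenizer: Any) -> Dict[str, Any]:
    """Pick whichever keyword the given Trainer class accepts for the tokenizer."""
    params = inspect.signature(trainer_cls.__init__).parameters
    name = "processing_class" if "processing_class" in params else "tokenizer"
    return {name: tokenizer}


class OldStyleTrainer:  # stand-in for a Trainer that still takes `tokenizer`
    def __init__(self, model=None, tokenizer=None):
        self.tokenizer = tokenizer


class NewStyleTrainer:  # stand-in for a Trainer that takes `processing_class`
    def __init__(self, model=None, processing_class=None):
        self.processing_class = processing_class


tok = object()  # placeholder for a real tokenizer
old = OldStyleTrainer(**tokenizer_kwarg(OldStyleTrainer, tok))
new = NewStyleTrainer(**tokenizer_kwarg(NewStyleTrainer, tok))
print(old.tokenizer is tok, new.processing_class is tok)
```

Checking for `processing_class` first means the newer keyword is preferred whenever a class accepts both names.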
