
[Question] what to do when model doesn't have tokenizer.model? #2212

Open
steveepreston opened this issue Dec 29, 2024 · 5 comments

Comments

steveepreston commented Dec 29, 2024

tokenizer.model is required in the YAML config, but there are many models that don't have a tokenizer.model file (example: unsloth/Llama-3.2-1B).

In these cases, how can we use the tokenizer.json or tokenizer_config.json files, which are included with almost all models, instead of tokenizer.model?
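For reference, tokenizer.json is the serialized fast-tokenizer format that the Hugging Face tokenizers library loads directly; a minimal sketch, assuming the file has already been downloaded locally:

```python
# Minimal sketch: load a Hugging Face tokenizer.json with the `tokenizers` library.
# This is what HF-based stacks use instead of a sentencepiece/tiktoken tokenizer.model.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # assumes the file is present locally
print(tok.encode("Hello world").ids)
```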

RdoubleA (Contributor) commented Jan 1, 2025

In your case specifically, you can use the original Llama 3.2 1B tokenizer.model from https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (if the unsloth version is based off the instruct model; otherwise use the base one). If unsloth modified any of the special tokens, then you will need a new tokenizer.model.

I don't believe you can load the tokenizer without the tokenizer.model file, because it contains the BPE encoding itself.
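As a rough illustration of pulling just that file down (the path inside the repo is an assumption and the repo is gated, so you need an authenticated token):

```python
# Sketch: download only tokenizer.model from the original meta-llama repo with
# huggingface_hub, then point your torchtune tokenizer config at the local path.
from huggingface_hub import hf_hub_download

tokenizer_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    filename="original/tokenizer.model",  # path within the repo; adjust if it differs
)
print(tokenizer_path)
```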

steveepreston (Author)

@RdoubleA Thanks for the explanation, I understand that case now.
Here are some other models that don't have a tokenizer.model:

deepseek-ai/DeepSeek-V3
Qwen/QVQ
nvidia/Llama-3.1-Nemotron
openai/gpt2
mistralai/Mistral-Nemo
CohereForAI/c4ai
facebook/opt-125m

I'm not sure what should be done here.

krammnic mentioned this issue Jan 12, 2025
krammnic (Contributor)

@joecummings @RdoubleA I ran into this while working on the Phi4 PR. There are several possible solutions, but I would love to get your comments first.

ebsmothers (Contributor)

So if I understand correctly, this is basically a consequence of torchtune not integrating with the Hugging Face tokenizers library, correct? For most of the examples listed above, I believe there are tokenizer.json and tokenizer_config.json files that HF uses to build the tokenizer. I think we could consider building a utility to parse a given HF tokenizer and wrap it into a format that is compatible with torchtune. This would require a fair bit of discussion though, as there are a lot of details we'd need to iron out. cc @joecummings @RdoubleA for your thoughts
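To make the idea concrete, a very rough sketch of what such a utility could look like (the class and method names here are made up for illustration, not existing torchtune APIs):

```python
# Hypothetical wrapper: adapt a Hugging Face tokenizer.json into an object with a
# simple encode/decode surface. Special-token and template handling are the details
# that would need to be ironed out.
from typing import List

from tokenizers import Tokenizer


class HFTokenizerAdapter:
    def __init__(self, tokenizer_json_path: str):
        self._tok = Tokenizer.from_file(tokenizer_json_path)

    def encode(self, text: str) -> List[int]:
        # Defers to whatever normalizer/pre-tokenizer/post-processor tokenizer.json defines.
        return self._tok.encode(text).ids

    def decode(self, token_ids: List[int]) -> str:
        return self._tok.decode(token_ids)
```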

RdoubleA (Contributor)

@krammnic Took a look at your PR. I agree we need a better solution here. We are working on better HF integration so it's easier to port over new models, with tokenizers being a major pain point. A few options:

  • We build a converter that takes in the tokenizer_config.json from HF and queries the tokenizer_class. For a small subset of very common classes, we map to the analogue in torchtune and load a default tokenizer.model (rough sketch after this list). For Phi4, it would be GPT2Tokenizer (we don't have an analogue for this; it could be tiktoken, but I'm not sure) (see https://huggingface.co/microsoft/phi-4/blob/main/tokenizer_config.json#L779)
  • We build a converter that takes the entire mapping in tokenizer.json from HF and builds the tokenizer from scratch. I'm not sure what abstractions are needed to support this, but it would remove the need to keep adding supported HF tokenizer classes.
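A rough sketch of the first option (everything here is illustrative; the mapping targets are placeholder names, not existing torchtune builders):

```python
# Sketch: read tokenizer_config.json, look up tokenizer_class, and map a small set
# of known HF classes to a torchtune-side builder. Mapping values are placeholders.
import json

# Hypothetical mapping from HF tokenizer_class to a torchtune tokenizer builder name.
HF_CLASS_TO_TORCHTUNE = {
    "LlamaTokenizer": "llama_sentencepiece_tokenizer",  # placeholder name
    "GPT2Tokenizer": "bpe_or_tiktoken_tokenizer",       # placeholder; no analogue today
}


def resolve_torchtune_tokenizer(tokenizer_config_path: str) -> str:
    with open(tokenizer_config_path) as f:
        tokenizer_class = json.load(f)["tokenizer_class"]  # e.g. "GPT2Tokenizer" for Phi-4
    try:
        return HF_CLASS_TO_TORCHTUNE[tokenizer_class]
    except KeyError:
        raise ValueError(f"No torchtune analogue registered for {tokenizer_class}")
```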

The other thing to consider is that once a new model tokenizer is added, we don't need to "convert" from HF anymore, because users can just instantiate the added model tokenizer. Or maybe we'll just need to load from some base tokenizer.model each time.

Open to other solutions.
