
[Question] what to do when model doesn't have tokenizer.model? #2212

Open
steveepreston opened this issue Dec 29, 2024 · 5 comments

Comments

steveepreston commented Dec 29, 2024

tokenizer.model is required in the YAML config, but there are many models that don't have a tokenizer.model file (example: unsloth/Llama-3.2-1B).

In these cases, how can we use the tokenizer.json or tokenizer_config.json files, which are included with almost all models, instead of tokenizer.model?
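For reference, tokenizer.json is the serialized fast-tokenizer format that the Hugging Face tokenizers library loads directly; a minimal sketch, assuming the file has already been downloaded locally:

```python
# Minimal sketch: load a Hugging Face tokenizer.json with the `tokenizers` library.
# This is what HF-based stacks use instead of a sentencepiece/tiktoken tokenizer.model.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # assumes the file is present locally
print(tok.encode("Hello world").ids)
```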

RdoubleA (Contributor) commented Jan 1, 2025

In your case specifically, you can use the original Llama 3.2 1B tokenizer.model from https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (if the unsloth version is based off the instruct model; otherwise use the base one). If unsloth modified any of the special tokens, then you will need a new tokenizer.model.

I don't believe you can load the tokenizer without the tokenizer.model file, because it contains the BPE encoding itself.
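As a rough illustration of pulling just that file down (the path inside the repo is an assumption and the repo is gated, so you need an authenticated token):

```python
# Sketch: download only tokenizer.model from the original meta-llama repo with
# huggingface_hub, then point your torchtune tokenizer config at the local path.
from huggingface_hub import hf_hub_download

tokenizer_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    filename="original/tokenizer.model",  # path within the repo; adjust if it differs
)
print(tokenizer_path)
```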

steveepreston (Author)

@RdoubleA Thanks for the explanation, I understand that case now.
Here are some other models that don't have a tokenizer.model:

deepseek-ai/DeepSeek-V3
Qwen/QVQ
nvidia/Llama-3.1-Nemotron
openai/gpt2
mistralai/Mistral-Nemo
CohereForAI/c4ai
facebook/opt-125m

I'm not sure what should be done here.

krammnic mentioned this issue Jan 12, 2025
krammnic (Contributor)

@joecummings @RdoubleA I ran into this while working on the Phi4 PR. There are several possible solutions, but I would love to get your comments first.

ebsmothers (Contributor)

So if I understand correctly, this is basically a consequence of torchtune not integrating with the Hugging Face tokenizers library, correct? For most of the examples listed above, I believe there are tokenizer.json and tokenizer_config.json files that HF uses to build the tokenizer. I think we could consider building a utility to parse a given HF tokenizer and wrap it into a format that is compatible with torchtune. This would require a fair bit of discussion though, as there are a lot of details we'd need to iron out. cc @joecummings @RdoubleA for your thoughts
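To make the idea concrete, a very rough sketch of what such a utility could look like (the class and method names here are made up for illustration, not existing torchtune APIs):

```python
# Hypothetical wrapper: adapt a Hugging Face tokenizer.json into an object with a
# simple encode/decode surface. Special-token and template handling are the details
# that would need to be ironed out.
from typing import List

from tokenizers import Tokenizer


class HFTokenizerAdapter:
    def __init__(self, tokenizer_json_path: str):
        self._tok = Tokenizer.from_file(tokenizer_json_path)

    def encode(self, text: str) -> List[int]:
        # Defers to whatever normalizer/pre-tokenizer/post-processor tokenizer.json defines.
        return self._tok.encode(text).ids

    def decode(self, token_ids: List[int]) -> str:
        return self._tok.decode(token_ids)
```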

RdoubleA (Contributor)

@krammnic Took a look at your PR. I agree we need a better solution here. We are working on better HF integration so it's easier to port over new models, with tokenizers being a major pain point. A few options:

  • We build a converter that takes in the tokenizer_config.json from HF and queries the tokenizer_class. For a small subset of very common classes, we map to the analogue in torchtune and load a default tokenizer.model (rough sketch after this list). For Phi4, it would be GPT2Tokenizer (we don't have an analogue for this; it could be tiktoken, but I'm not sure) (see https://huggingface.co/microsoft/phi-4/blob/main/tokenizer_config.json#L779)
  • We build a converter that takes the entire mapping in tokenizer.json from HF and builds the tokenizer from scratch. I'm not sure what abstractions are needed to support this, but it would remove the need to keep adding supported HF tokenizer classes.
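A rough sketch of the first option (everything here is illustrative; the mapping targets are placeholder names, not existing torchtune builders):

```python
# Sketch: read tokenizer_config.json, look up tokenizer_class, and map a small set
# of known HF classes to a torchtune-side builder. Mapping values are placeholders.
import json

# Hypothetical mapping from HF tokenizer_class to a torchtune tokenizer builder name.
HF_CLASS_TO_TORCHTUNE = {
    "LlamaTokenizer": "llama_sentencepiece_tokenizer",  # placeholder name
    "GPT2Tokenizer": "bpe_or_tiktoken_tokenizer",       # placeholder; no analogue today
}


def resolve_torchtune_tokenizer(tokenizer_config_path: str) -> str:
    with open(tokenizer_config_path) as f:
        tokenizer_class = json.load(f)["tokenizer_class"]  # e.g. "GPT2Tokenizer" for Phi-4
    try:
        return HF_CLASS_TO_TORCHTUNE[tokenizer_class]
    except KeyError:
        raise ValueError(f"No torchtune analogue registered for {tokenizer_class}")
```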

The other thing to consider is that once a new model tokenizer is added, we don't need to "convert" from HF anymore, because users can just instantiate the added model tokenizer. Or maybe we'll just need to load from some base tokenizer.model each time.

Open to other solutions.
