
Extend Vocabulary #88

Merged: 14 commits into main from extend-vocabulary, Nov 19, 2024
Conversation

torymur (Contributor) commented on Nov 5, 2024

This is Part 1, partially addressing #81; closes #91.

In this PR, only new logic is introduced, with minimal changes to the current interface:

  • Introduce an EOS token locator and its common locations
  • Introduce a token processor to handle all token modifications, depending on the tokenizer's decoders
  • Connect them with Vocabulary
  • More tests & docs
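
For readers skimming the diff, here is a rough sketch of how these pieces could fit together. All names and the config-scanning approach below are illustrative assumptions for this note, not the crate's actual API.

```rust
use std::collections::HashMap;

type Token = String;
type TokenId = u32;

/// Illustrative locator: tries a few common config fields where an EOS
/// token id might be stored (a stand-in for reading real tokenizer metadata).
fn locate_eos_token_id(config: &HashMap<&str, TokenId>) -> Option<TokenId> {
    const COMMON_LOCATIONS: [&str; 3] = ["eos_token_id", "eos_id", "end_of_text_id"];
    COMMON_LOCATIONS
        .iter()
        .find_map(|field| config.get(field).copied())
}

/// Illustrative processor: normalizes a raw token depending on the
/// tokenizer's decoder conventions (here, only the GPT-2 style "Ġ" prefix).
fn process_token(raw: &str) -> Token {
    raw.replace('Ġ', " ")
}

/// Minimal vocabulary holding processed tokens plus the located EOS id.
struct Vocabulary {
    eos_token_id: Option<TokenId>,
    tokens: HashMap<Token, Vec<TokenId>>,
}

fn main() {
    let config = HashMap::from([("eos_token_id", 2u32)]);
    let raw_tokens = [("Ġhello", 100u32), ("world", 101)];

    let mut tokens: HashMap<Token, Vec<TokenId>> = HashMap::new();
    for (raw, id) in raw_tokens {
        tokens.entry(process_token(raw)).or_default().push(id);
    }

    let vocab = Vocabulary {
        eos_token_id: locate_eos_token_id(&config),
        tokens,
    };
    println!("eos: {:?}, tokens: {:?}", vocab.eos_token_id, vocab.tokens);
}
```

The idea is simply that the locator tries a handful of places an EOS token id might live, while the processor normalizes raw tokenizer strings before they are stored in the vocabulary.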

torymur (Contributor, Author) commented on Nov 8, 2024

Apart from a few less important tests I still have in mind, and possibly some further doc improvements, this is ready to be pre-reviewed.

There are a few TODOs in vocabulary that I will follow up on separately, since they change the already-defined vocabulary interface and this PR is already large.

A couple of questions for the follow-up PR, in which I'm planning to change:

  1. The Token type from String to bytes (Vec<u8>)
  2. I was also wondering about the logic behind having the vocabulary map's values be Vec<TokenId> in HashMap<Token, Vec<TokenId>> rather than just TokenId

Any thoughts on doing (or not doing) these? Or things to watch out for from the Python outlines side, or any other perspective?

@brandonwillard @umut-sahin Maybe you can help me with these?
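
As context for question 1, here is a minimal sketch (illustrative types only, not the crate's code) of one common motivation for byte keys: a token that is not valid UTF-8 on its own cannot be held losslessly in a String, but it works fine as a Vec<u8> key.

```rust
use std::collections::HashMap;

type TokenId = u32;

fn main() {
    // A raw token from a byte-level tokenizer: 0xF0 0x9F is a truncated
    // 4-byte UTF-8 sequence, so it is not valid UTF-8 on its own.
    let raw: Vec<u8> = vec![0xF0, 0x9F];

    // As a String it cannot be represented without loss...
    assert!(String::from_utf8(raw.clone()).is_err());

    // ...but as a Vec<u8> key it is unproblematic.
    let mut vocab: HashMap<Vec<u8>, Vec<TokenId>> = HashMap::new();
    vocab.entry(raw).or_default().push(42);

    println!("{:?}", vocab);
}
```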

umut-sahin (Contributor) commented:

> Was wondering about the logic behind having the vocabulary map's values be Vec<TokenId> in HashMap<Token, Vec<TokenId>> rather than just TokenId

It's because the same token can have multiple entries in the tokenizer (e.g., in Llama-like tokenizers both <0x61> -> 20 and a -> 30 are in the tokenizer, so the token a has two token ids, 20 and 30 in this case).
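
To make this concrete, here is a small sketch with illustrative types (not the crate's API): after decoding, both entries land on the same token string, so insertion has to append ids rather than overwrite them.

```rust
use std::collections::HashMap;

type Token = String;
type TokenId = u32;

fn main() {
    // Two distinct tokenizer entries that both decode to the token "a".
    let decoded_entries = [("a", 20u32), ("a", 30)];

    let mut vocab: HashMap<Token, Vec<TokenId>> = HashMap::new();
    for (token, id) in decoded_entries {
        // Appending keeps both ids; a HashMap<Token, TokenId> would
        // silently drop one of them.
        vocab.entry(token.to_string()).or_default().push(id);
    }

    assert_eq!(vocab["a"], vec![20, 30]);
}
```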

torymur (Contributor, Author) commented on Nov 11, 2024

@umut-sahin Appreciate you taking a look here!

> It's because the same token can have multiple entries in the tokenizer

Yep, I understand this point from the token-as-String perspective, but what if we move on to tokens as bytes?

umut-sahin (Contributor) commented:

> @umut-sahin Appreciate you taking a look here!

Of course 🙌

> Yep, I understand this point from the token-as-String perspective, but what if we move on to tokens as bytes?

It's the same there: in that case we'd have token [0x61] -> ids [20, 30].
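
The same situation sketched with byte keys (again with illustrative types only): once <0x61> is resolved to its raw byte and a is taken as its UTF-8 byte, both entries collapse onto the key [0x61], so the value still needs to hold multiple ids.

```rust
use std::collections::HashMap;

type Token = Vec<u8>;
type TokenId = u32;

fn main() {
    // "<0x61>" resolved to its raw byte, and "a" as UTF-8 bytes:
    // both produce the same key, [0x61].
    let entries: [(Token, TokenId); 2] = [(vec![0x61], 20), ("a".as_bytes().to_vec(), 30)];

    let mut vocab: HashMap<Token, Vec<TokenId>> = HashMap::new();
    for (token, id) in entries {
        vocab.entry(token).or_default().push(id);
    }

    // token [0x61] -> ids [20, 30], exactly as described above.
    let key: &[u8] = &[0x61];
    assert_eq!(vocab[key], vec![20, 30]);
}
```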

414owen left a comment:

One or two stylistic questions, but looks good!

torymur force-pushed the extend-vocabulary branch 4 times, most recently from 8565981 to 10f31fd, on Nov 12, 2024
ERROR tests/fsm/test_regex.py - RuntimeError: Failed to import transformers.models.auto.tokenization_auto because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
numpy.core is deprecated and has been renamed to numpy._core. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.multiarray.
torymur added the documentation, enhancement, and testing labels on Nov 18, 2024
torymur marked this pull request as ready for review on Nov 18, 2024
torymur (Contributor, Author) commented on Nov 18, 2024

This PR is now ready for review ✔️

The missed lines in the combined coverage report were checked and can be ignored.

umut-sahin (Contributor) left a comment:

Almost ready to be merged 🙌

torymur (Contributor, Author) commented on Nov 19, 2024

@umut-sahin @rlouf All addressed, thanks for taking a look! 🙌

umut-sahin (Contributor) left a comment:

Looks great!

torymur merged commit c5db1dd into main on Nov 19, 2024 (7 of 8 checks passed).
torymur deleted the extend-vocabulary branch on Nov 19, 2024.
Successfully merging this pull request may close these issues: Create vocabulary from pretrained models

4 participants