
Extend Vocabulary #88

Merged: 14 commits into main from extend-vocabulary, Nov 19, 2024
Conversation

torymur (Contributor) commented on Nov 5, 2024

This is Part 1, partially addressing #81; closes #91.

In this PR, only new logic is introduced, with minimal changes to the current interface:

  • Introduce an EOS token locator and its common locations
  • Introduce a token processor to handle all token modifications, depending on the tokenizer's decoders
  • Connect them with Vocabulary
  • More tests & docs
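
For readers skimming the diff, here is a rough sketch of how these pieces could fit together. All names and the config-scanning approach below are illustrative assumptions for this note, not the crate's actual API.

```rust
use std::collections::HashMap;

type Token = String;
type TokenId = u32;

/// Illustrative locator: tries a few common config fields where an EOS
/// token id might be stored (a stand-in for reading real tokenizer metadata).
fn locate_eos_token_id(config: &HashMap<&str, TokenId>) -> Option<TokenId> {
    const COMMON_LOCATIONS: [&str; 3] = ["eos_token_id", "eos_id", "end_of_text_id"];
    COMMON_LOCATIONS
        .iter()
        .find_map(|field| config.get(field).copied())
}

/// Illustrative processor: normalizes a raw token depending on the
/// tokenizer's decoder conventions (here, only the GPT-2 style "Ġ" prefix).
fn process_token(raw: &str) -> Token {
    raw.replace('Ġ', " ")
}

/// Minimal vocabulary holding processed tokens plus the located EOS id.
struct Vocabulary {
    eos_token_id: Option<TokenId>,
    tokens: HashMap<Token, Vec<TokenId>>,
}

fn main() {
    let config = HashMap::from([("eos_token_id", 2u32)]);
    let raw_tokens = [("Ġhello", 100u32), ("world", 101)];

    let mut tokens: HashMap<Token, Vec<TokenId>> = HashMap::new();
    for (raw, id) in raw_tokens {
        tokens.entry(process_token(raw)).or_default().push(id);
    }

    let vocab = Vocabulary {
        eos_token_id: locate_eos_token_id(&config),
        tokens,
    };
    println!("eos: {:?}, tokens: {:?}", vocab.eos_token_id, vocab.tokens);
}
```

The idea is simply that the locator tries a handful of places an EOS token id might live, while the processor normalizes raw tokenizer strings before they are stored in the vocabulary.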

torymur (Contributor, Author) commented on Nov 8, 2024

Apart from a few less important tests I still have in mind, and possibly some further doc improvements, this is ready to be pre-reviewed.

There are a few TODOs in vocabulary that I will follow up on separately, since they change the already-defined vocabulary interface and this PR is already large.

A couple of questions for the follow-up PR, in which I'm planning to change:

  1. The Token type from String to bytes (Vec<u8>)
  2. I was also wondering about the logic behind having the vocabulary map's values be Vec<TokenId> in HashMap<Token, Vec<TokenId>> rather than just TokenId

Any thoughts on doing (or not doing) these? Or things to watch out for from the Python outlines side, or any other perspective?

@brandonwillard @umut-sahin Maybe you can help me with these?
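
As context for question 1, here is a minimal sketch (illustrative types only, not the crate's code) of one common motivation for byte keys: a token that is not valid UTF-8 on its own cannot be held losslessly in a String, but it works fine as a Vec<u8> key.

```rust
use std::collections::HashMap;

type TokenId = u32;

fn main() {
    // A raw token from a byte-level tokenizer: 0xF0 0x9F is a truncated
    // 4-byte UTF-8 sequence, so it is not valid UTF-8 on its own.
    let raw: Vec<u8> = vec![0xF0, 0x9F];

    // As a String it cannot be represented without loss...
    assert!(String::from_utf8(raw.clone()).is_err());

    // ...but as a Vec<u8> key it is unproblematic.
    let mut vocab: HashMap<Vec<u8>, Vec<TokenId>> = HashMap::new();
    vocab.entry(raw).or_default().push(42);

    println!("{:?}", vocab);
}
```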

umut-sahin (Contributor) commented:

> Was wondering about the logic behind having the vocabulary map's values be Vec<TokenId> in HashMap<Token, Vec<TokenId>> rather than just TokenId

It's because the same token can have multiple entries in the tokenizer (e.g., in Llama-like tokenizers both <0x61> -> 20 and a -> 30 are in the tokenizer, so the token a has two token ids, 20 and 30 in this case).
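
To make this concrete, here is a small sketch with illustrative types (not the crate's API): after decoding, both entries land on the same token string, so insertion has to append ids rather than overwrite them.

```rust
use std::collections::HashMap;

type Token = String;
type TokenId = u32;

fn main() {
    // Two distinct tokenizer entries that both decode to the token "a".
    let decoded_entries = [("a", 20u32), ("a", 30)];

    let mut vocab: HashMap<Token, Vec<TokenId>> = HashMap::new();
    for (token, id) in decoded_entries {
        // Appending keeps both ids; a HashMap<Token, TokenId> would
        // silently drop one of them.
        vocab.entry(token.to_string()).or_default().push(id);
    }

    assert_eq!(vocab["a"], vec![20, 30]);
}
```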

torymur (Contributor, Author) commented on Nov 11, 2024

@umut-sahin Appreciate you taking a look here!

> It's because the same token can have multiple entries in the tokenizer

Yep, I understand this point from the token-as-String perspective, but what if we move on to tokens as bytes?

umut-sahin (Contributor) commented:

> @umut-sahin Appreciate you taking a look here!

Of course 🙌

> Yep, I understand this point from the token-as-String perspective, but what if we move on to tokens as bytes?

It's the same there: in that case we'd have token [0x61] -> ids [20, 30].
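
The same situation sketched with byte keys (again with illustrative types only): once <0x61> is resolved to its raw byte and a is taken as its UTF-8 byte, both entries collapse onto the key [0x61], so the value still needs to hold multiple ids.

```rust
use std::collections::HashMap;

type Token = Vec<u8>;
type TokenId = u32;

fn main() {
    // "<0x61>" resolved to its raw byte, and "a" as UTF-8 bytes:
    // both produce the same key, [0x61].
    let entries: [(Token, TokenId); 2] = [(vec![0x61], 20), ("a".as_bytes().to_vec(), 30)];

    let mut vocab: HashMap<Token, Vec<TokenId>> = HashMap::new();
    for (token, id) in entries {
        vocab.entry(token).or_default().push(id);
    }

    // token [0x61] -> ids [20, 30], exactly as described above.
    let key: &[u8] = &[0x61];
    assert_eq!(vocab[key], vec![20, 30]);
}
```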

414owen left a comment:

One or two stylistic questions, but looks good!

torymur force-pushed the extend-vocabulary branch 4 times, most recently from 8565981 to 10f31fd, on Nov 12, 2024
ERROR tests/fsm/test_regex.py - RuntimeError: Failed to import transformers.models.auto.tokenization_auto because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
numpy.core is deprecated and has been renamed to numpy._core. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.multiarray.
torymur added the documentation, enhancement, and testing labels on Nov 18, 2024
torymur marked this pull request as ready for review on Nov 18, 2024
torymur (Contributor, Author) commented on Nov 18, 2024

This PR is now ready for review ✔️

The missed lines in the combined coverage report were checked and can be ignored.

umut-sahin (Contributor) left a comment:

Almost ready to be merged 🙌

torymur (Contributor, Author) commented on Nov 19, 2024

@umut-sahin @rlouf All addressed, thanks for taking a look! 🙌

umut-sahin (Contributor) left a comment:

Looks great!

torymur merged commit c5db1dd into main on Nov 19, 2024 (7 of 8 checks passed).
torymur deleted the extend-vocabulary branch on Nov 19, 2024.
Successfully merging this pull request may close these issues: Create vocabulary from pretrained models

4 participants