tokenize API #11
Comments
Perhaps. But the more advanced user might also want to configure it and have it apply globally, even into other packages. However, the big issue I see with that is that the tokenizer is really corpus-specific. Related: I am thinking more like:
and we should expose
(This should be using Languages.jl for type-based language IDs here, and there too; cf. JuliaText/Embeddings.jl#6)
@oxinabox Would it be a good idea to add a |
I think multiple different trait functions.
I really just think docstrings would be better here. It's a case of KISS until there's a clear need for any more complexity.
Yeah, Then the same for Embeddings.jl (which almost does this). Then we could do things like:
That's a good use case, although even then, wasn't #14 meant to implement something fairly general and language-agnostic? It seems better to have the same default for all languages if at all possible.
#18 is fairly general and language-agnostic, and is now the default. So until we do this is not really pressing, as the answer would always be use |
I think that the Tokenizer API should also be able to expose the TokenBuffer API and its various lexer functions for building custom tokenizers.
I'm not sure how that would be. The Tokenizer API specifies what should happen when you call |
The set_tokenizer API seems a bit suspect here, given that it can be replaced with and likewise for RevTok etc, without bringing in multiple packages just to define an alias :)
I also think it's generally a good idea to expose people to higher order functions and such; people might not realise that you can just e.g. pass a custom tokenize function into a constructor rather than setting and unsetting it globally.
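As a rough illustration of that last point, here is a minimal sketch of passing a tokenizer into a constructor instead of configuring one globally. Everything in it is hypothetical and written only for this example: `whitespace_tokenize`, `letters_tokenize`, `Vocabulary`, `add_text!` and `my_tokenize` are not part of this package or any other, and only Base Julia is used.

```julia
# Two stand-in tokenizers (hypothetical, Base Julia only).
whitespace_tokenize(text::AbstractString) = String.(split(text))
letters_tokenize(text::AbstractString) =
    [String(m.match) for m in eachmatch(r"\p{L}+", text)]

# A hypothetical consumer that stores its tokenizer as a field,
# so nothing has to be set (and later unset) globally.
struct Vocabulary{F}
    tokenize::F
    counts::Dict{String,Int}
end

# The tokenizer is an ordinary argument with a default value.
Vocabulary(tokenize=whitespace_tokenize) = Vocabulary(tokenize, Dict{String,Int}())

# Count tokens using whichever tokenizer this instance was built with.
function add_text!(v::Vocabulary, text::AbstractString)
    for token in v.tokenize(text)
        v.counts[token] = get(v.counts, token, 0) + 1
    end
    return v
end

text = "Hello, world! Hello again."
default_vocab = add_text!(Vocabulary(), text)                 # whitespace splitting
custom_vocab  = add_text!(Vocabulary(letters_tokenize), text) # letters only

# A local alias is just a constant binding; no global setter is involved.
const my_tokenize = letters_tokenize
```

With this shape, two `Vocabulary` instances built with different tokenizers can coexist in the same session, which a single global setting would not allow.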