
Fix and rework GPT-TF.js #807

Open · JulienVig wants to merge 18 commits into develop from 654-improve-gpt-julien
Conversation

@JulienVig (Collaborator) commented on Oct 16, 2024

Closes #654

  • Fix weight initialization: initialize weights from a random uniform distribution instead of zeros
  • Implement weight sharing between the token embeddings and the language modeling head
  • Improve generation with a top-k sampling option (a minimal sketch follows this list)
  • Add a seed option for deterministic runs
  • Implement text loaders that read by byte chunk rather than by line, which avoids padding each line to the context length
  • Waiting on Transformers.js #984 and #1019 to migrate to @huggingface/transformers v3.
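A minimal sketch of the top-k sampling idea, assuming TF.js-style next-token logits (an illustration, not the PR's actual implementation):

import * as tf from "@tensorflow/tfjs";

// Minimal sketch, not the PR's code: sample the next token from the k most
// likely candidates, with an optional seed for deterministic generation.
// (Tensor disposal is omitted for brevity.)
function sampleTopK(logits: tf.Tensor1D, k: number, seed?: number): number {
  const { values, indices } = tf.topk(logits, k); // keep the k largest logits
  // tf.multinomial accepts unnormalized log-probabilities, so the top-k
  // logits can be passed directly; it returns the sampled position(s).
  const sampled = tf.multinomial(values as tf.Tensor1D, 1, seed);
  const pick = sampled.dataSync()[0];
  return indices.dataSync()[pick]; // map back to a vocabulary token id
}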

@JulienVig JulienVig self-assigned this Oct 16, 2024
@JulienVig JulienVig force-pushed the 654-improve-gpt-julien branch 20 times, most recently from 1d88d35 to 03e5c7d on October 23, 2024 at 11:58
@JulienVig JulienVig marked this pull request as ready for review October 24, 2024 15:58
@tharvik (Collaborator) left a comment:
superb work, thanks for clearing the GPT's mud, every comment makes it more understandable!
yeah, sadly, as I forgot to merge the processing PR (#781) before you branched off, the whole processing pipeline changed a lot. sorry for the toe-stepping (hopefully, it will simplify this PR).

btw, it seems that @xenova/transformers has recently been updated to @huggingface/transformers. did you try it out? maybe it'll help with the tokenizer usage (doesn't look much changed to me but you know best)

Comment on lines 53 to 105
const tokens = models.tokenize(tokenizer, endOfPreviousChunk + chunk, {
  padding: false,
  truncation: false,
  return_tensor: false,
})
if (tokens.length < blockSize + 1) {
  // throw if it happens on the 1st iteration
  if (iteration === 0)
    throw new Error(`the chunk (${tokens.length} tokens) is too small ` +
      `to get a sequence of length blockSize (${blockSize + 1} tokens). ` +
      `Either the text file or the chunk size (${chunkBitSize} bits) is too small.`);
  // if this isn't the first iteration we simply skip,
  // as we expect the last chunk to be potentially smaller than the block size
  debug("chunk smaller than block size, loading next chunk")
  continue
}
debug("batch per chunk: %o", tokens.length / (batchSize * blockSize))
let currentPosition = 0;
// yield one block of tokens at a time
while (currentPosition + blockSize + 1 <= tokens.length) {
  yield tokens.slice(currentPosition, currentPosition + blockSize + 1);
  currentPosition += blockSize; // don't add + 1 here
}
// keep the last tokens for the next chunk;
// if this was the last chunk the remaining tokens are discarded
if (currentPosition < tokens.length) {
  // We actually need to decode the tokens to get the leftover text
  // instead of simply keeping the remaining tokens.
  // This is because the tokens may be different once prepended to the next chunk,
  // e.g. if the remaining text is ". A" and the next chunk starts with "nother",
  // the tokenization will be different than if we simply concatenated the remaining tokens.
  endOfPreviousChunk = tokenizer.decode(
    tokens.slice(currentPosition),
    { skip_special_tokens: true }
  )
  debug("End of chunk, remaining text: '%s'", endOfPreviousChunk)
} else {
  // Note that a mismatch between tokenizing the chunks separately and
  // tokenizing the concatenated text can still happen at the boundary when
  // there is no remaining text to carry over. We consider this difference negligible.
  endOfPreviousChunk = "";
}
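For illustration, a hypothetical example of the boundary effect described in the comments above (token ids are invented, not produced by a real tokenizer):

// tokenize(". A") + tokenize("nother")  -> [13, 317] + [77, 847]
// tokenize(". Another")                 -> [13, 6521]
// Keeping the leftover as text and re-tokenizing it together with the next
// chunk reproduces the second tokenization, which matches tokenizing the
// whole text at once.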
tharvik (Collaborator) commented:

this is duplicated in discojs-web/loaders/text. that hints to me that it shouldn't happen in the loader but be applied after.
the issue at hand is that lines were output by the previous version. I think we can change it to output characters (single-letter strings) instead. that would also drop the blockSize, batchSize & minChunkSize arguments, which aren't really relevant for reading text (separation of concerns and all that)

in the newly merged processing PR (#781), it is much simpler to combine such transformations. I think that something like

loadText($path).batch($blockSize).map((block) => tokenize(block, $tokenizer))

with tokenize updated to accept a block (List<string>) instead, and maybe dropping the padding (but then what would the behavior be at the end of the file?)
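For context, a rough sketch of what that suggestion could look like (loadText, batch, and tokenize are stand-ins here, not the exact discojs signatures):

loadText(path)                       // yields single characters
  .batch(blockSize)                  // groups blockSize characters per block
  .map((block) => tokenize(tokenizer, block.join("")))  // tokenize each block of text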

JulienVig (Collaborator, Author) replied:

Alright, the current implementation ended up being much simpler than my original tokenizing text loaders:

loadText('../datasets/wikitext/wiki.train.tokens')
  .map(text => processing.tokenize(tokenizer, text)) // tokenize the whole chunk
  .flat() // (I renamed unbatch to flat)
  .batchWithOverlap(config.blockSize) // one token overlap between each batch
  .map((tokens) => [tokens.pop(), tokens.last()] as [List<number>, number])
  .batch(batchSize);
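For intuition, a small illustration of what batchWithOverlap plus the final map produces (placeholder token names, not real data):

// blockSize = 4, token stream: t0 t1 t2 t3 t4 t5 t6 t7 t8
// batchWithOverlap(4) -> windows of blockSize + 1 tokens, overlapping by one token:
//   [t0, t1, t2, t3, t4]
//   [t4, t5, t6, t7, t8]
// map(tokens => [tokens.pop(), tokens.last()]) -> [input sequence, next token] pairs:
//   [[t0, t1, t2, t3], t4]
//   [[t4, t5, t6, t7], t8]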

Resolved (outdated) review threads on:
  • discojs-node/src/loaders/text.ts (×2)
  • cli/src/benchmark_gpt.ts
  • discojs-node/src/loaders.spec.ts
  • docs/CONTRIBUTING.md
  • webapp/cypress.config.ts
  • webapp/src/components/dataset_input/TextDatasetInput.vue (×2)
  • webapp/src/components/testing/TestSteps.vue
@martinjaggi (Member) commented:
maybe rename block size to context len, that would be more specific

discojs/src/dataset: implement and test repeat and batchWithOverlap
@JulienVig JulienVig added this to the v4.0.0 milestone Nov 14, 2024
@JulienVig JulienVig requested review from tharvik and removed request for tharvik November 14, 2024 16:54
Labels: none yet · Projects: none yet
Successfully merging this pull request may close these issues:
  • Improve and rework GPT-tfjs

3 participants