
Could not get the train json by gsutil #1

Open · erichen510 opened this issue Dec 30, 2021 · 7 comments

@erichen510

The error message is: "does not have storage.objects.list access to the Google Cloud Storage bucket."

@peregilk
Collaborator

Exactly what URL are you trying to retrieve?

Are you authenticated on gcloud?
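
For reference, a typical authentication flow looks like the sketch below (assuming the Google Cloud SDK is installed); note that authenticating only helps if your account has actually been granted read access to the bucket.

```bash
# Log in with your Google account; this opens a browser-based login flow.
gcloud auth login

# Optionally set a default project (MY_PROJECT is a placeholder).
gcloud config set project MY_PROJECT

# Check which account is active, then retry listing the bucket.
# This will still fail with the storage.objects.list error unless your
# account has been granted access to the bucket.
gcloud auth list
gsutil ls gs://notram-west4-a/
```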

@erichen510
Author

The exact URL is:
gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_social_media/splits/social_train.jsonl social_train.json
How do I get authorization on gcloud? Am I supposed to join the project?

@peregilk
Collaborator

You are trying to access a non-open dataset. Where was this linked from?

@erichen510
Author

The link is from:

## RoBERTa

I want to pretrain RoBERTa-large on this corpus. If I cannot get the JSON, where should I get the original corpus?
I notice that https://huggingface.co/datasets/NbAiLab/NCC lists the datasets; could you tell me how to convert the original data to the JSON format required by run_mlm_flax_stream.py?

@erichen510
Author

Sorry, the link is:
gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_official_short/norwegian_colossal_corpus_train.jsonl norwegian_colossal_corpus_train.json

@peregilk
Collaborator

Sorry, that is an internal link in this guide. You should replace it with whatever dataset you have available.

One alternative is of course the NCC (that was released after this tutorial was written).

There are several ways of training on this dataset. Assuming you are using Flax (since you are following the tutorial), a simple way is to specify `dataset_name NbAiLab/NCC` instead of a train and validation file. Another way is to clone the Hugging Face repo and copy/combine the files from it. NCC is already in JSON format, but it is sharded and zipped. If you insist on having the files locally, they should be combined and unzipped.
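
As a rough sketch of both approaches (flag names follow the standard Hugging Face Flax MLM example and may differ slightly in your copy of run_mlm_flax_stream.py; the config, tokenizer, and output paths are placeholders):

```bash
# Option 1: stream NCC directly from the Hugging Face Hub instead of
# passing local train/validation files.
python run_mlm_flax_stream.py \
  --model_type roberta \
  --config_name ./roberta_config \
  --tokenizer_name ./tokenizer \
  --dataset_name NbAiLab/NCC \
  --max_seq_length 512 \
  --output_dir ./output

# Option 2: clone the dataset repo (requires git-lfs for the large shards)
# and rebuild a single local jsonl file from the compressed shards.
# The glob below is illustrative; check the actual file layout in the repo.
git clone https://huggingface.co/datasets/NbAiLab/NCC
zcat NCC/data/train*.json.gz > ncc_train.jsonl
```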

@peregilk
Collaborator

Early next year, we will also place the NCC in an open gcloud bucket.
