
Could not get the train json by gsutil #1

Open · erichen510 opened this issue Dec 30, 2021 · 7 comments

@erichen510

The error message is: "does not have storage.objects.list access to the Google Cloud Storage bucket."

@peregilk
Collaborator

Exactly what URL are you trying to retrieve?

Are you authenticated on gcloud?
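
For reference, a typical authentication flow looks like the sketch below (assuming the Google Cloud SDK is installed); note that authenticating only helps if your account has actually been granted read access to the bucket.

```bash
# Log in with your Google account; this opens a browser-based login flow.
gcloud auth login

# Optionally set a default project (MY_PROJECT is a placeholder).
gcloud config set project MY_PROJECT

# Check which account is active, then retry listing the bucket.
# This will still fail with the storage.objects.list error unless your
# account has been granted access to the bucket.
gcloud auth list
gsutil ls gs://notram-west4-a/
```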

@erichen510
Author

The exact URL is:
gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_social_media/splits/social_train.jsonl social_train.json
How do I get authorization on gcloud? Am I supposed to join the project?

@peregilk
Collaborator

You are trying to access a non-open dataset. Where was this linked from?

@erichen510
Author

The link is from:

## RoBERTa

I want to pretrain RoBERTa-large on this corpus. If I cannot get the JSON, where should I get the original corpus?
I notice that https://huggingface.co/datasets/NbAiLab/NCC lists the datasets; could you tell me how to convert the original data to the JSON format required by run_mlm_flax_stream.py?

@erichen510
Author

Sorry, the link is:
gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_official_short/norwegian_colossal_corpus_train.jsonl norwegian_colossal_corpus_train.json

@peregilk
Collaborator

Sorry, that is an internal link in this guide. You should replace it with whatever dataset you have available.

One alternative is of course the NCC (that was released after this tutorial was written).

There are several ways of training on this dataset. Assuming you are using Flax (since you are following the tutorial), a simple way is to specify `dataset_name NbAiLab/NCC` instead of a train and validation file. Another way is to clone the Hugging Face repo and copy/combine the files from it. NCC is already in JSON format, but it is sharded and zipped. If you insist on having the files locally, they should be combined and unzipped.
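
As a rough sketch of both approaches (flag names follow the standard Hugging Face Flax MLM example and may differ slightly in your copy of run_mlm_flax_stream.py; the config, tokenizer, and output paths are placeholders):

```bash
# Option 1: stream NCC directly from the Hugging Face Hub instead of
# passing local train/validation files.
python run_mlm_flax_stream.py \
  --model_type roberta \
  --config_name ./roberta_config \
  --tokenizer_name ./tokenizer \
  --dataset_name NbAiLab/NCC \
  --max_seq_length 512 \
  --output_dir ./output

# Option 2: clone the dataset repo (requires git-lfs for the large shards)
# and rebuild a single local jsonl file from the compressed shards.
# The glob below is illustrative; check the actual file layout in the repo.
git clone https://huggingface.co/datasets/NbAiLab/NCC
zcat NCC/data/train*.json.gz > ncc_train.jsonl
```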

@peregilk
Collaborator

Early next year, we will also place the NCC in an open gcloud bucket.
