How is the training data divided? #87

wsa-dhu · 2023-09-27T12:59:32Z

Hello, I'm very interested in your work, and I'm currently attempting to train a general sentence representation model. I have a question: When my training dataset comes from different domains, how can I ensure that samples within the same batch belong to the same task during the training process? Would it be better to include samples from different tasks within the same batch during training? I'm not sure about my assumption. Could you provide insights based on your experience?

hongjin-su · 2023-09-27T13:46:33Z

Hi, thanks a lot for your interest in INSTRUCTOR!

You can arrange the examples in a sequence such that, after they are divided into batches, examples in the same batch come from the same task. As we use in-batch negative sampling, it would be better if we provide meaningful negative instances from the same task.
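The ordering described above could be sketched roughly as follows. This is a minimal illustration, not the training script used for INSTRUCTOR: the function name `order_by_task_batches` is hypothetical, and it assumes each example is a dict with a `task_id` key. As a simplification, examples that do not fill a complete batch for their task are dropped, so that every batch-sized chunk of the output contains a single task.

```python
import random
from collections import defaultdict


def order_by_task_batches(examples, batch_size, seed=0):
    """Return examples ordered so each consecutive chunk of
    `batch_size` items comes from a single task.

    Assumes every example is a dict with a "task_id" key
    (an assumption for this sketch).
    """
    rng = random.Random(seed)

    # Group examples by task.
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task_id"]].append(ex)

    # Shuffle within each task, then cut into full batches.
    batches = []
    for task_examples in by_task.values():
        rng.shuffle(task_examples)
        for start in range(0, len(task_examples), batch_size):
            batch = task_examples[start:start + batch_size]
            # Drop the incomplete tail so batch boundaries stay aligned.
            if len(batch) == batch_size:
                batches.append(batch)

    # Shuffle the order of batches so tasks are interleaved over training.
    rng.shuffle(batches)

    # Flatten back into one sequence.
    return [ex for batch in batches for ex in batch]


if __name__ == "__main__":
    data = [{"task_id": t, "text": f"task{t}-example{i}"}
            for t in range(3) for i in range(5)]
    ordered = order_by_task_batches(data, batch_size=2)
    for start in range(0, len(ordered), 2):
        print([ex["task_id"] for ex in ordered[start:start + 2]])
```

For the batches to stay single-task, the resulting sequence has to be read in order with the same batch size, i.e. without any further shuffling by the data loader.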

wsa-dhu · 2023-10-15T06:06:24Z

Hello author, I am very interested in your work on INSTRUCTOR. I would like to ask about the task_id values in your training dataset: which datasets do these ids correspond to? Working out the mapping on my own would take considerable time. By the way, I found only 329 distinct task_id values (302 is missing), which does not match the 330 in the paper. Looking forward to your reply, and wishing you success with your work.


hongjin-su · 2023-12-19T09:42:06Z

We found that we may have missed a task_id when we uploaded the dataset. We expect to fix it in our next version.

robro612 · 2024-06-03T11:44:23Z

Is there any update on where to find the meaning of these task_ids?

Edit: Sorry, I was looking at the dataset link on the paper website, which appears to be stale. The dataset linked in the README has the actual dataset names in the id.
