How the training data is divided? #87
Comments
Hi, thanks a lot for your interest in INSTRUCTOR! You can arrange the examples in a sequence such that, once they are divided into batches, all examples in a batch come from the same task. Because we use in-batch negative sampling, it is better to provide meaningful negative instances from the same task.
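The ordering described above could be sketched roughly as follows. This is only an illustration, not the INSTRUCTOR training code: the function name `task_homogeneous_batches`, the `"task_id"` and `"text"` keys, and the choice to drop each task's leftover examples are all assumptions made for the example.

```python
import random
from collections import defaultdict


def task_homogeneous_batches(examples, batch_size, seed=0):
    """Group examples into batches that each contain a single task.

    `examples` is a list of dicts with a "task_id" key (assumed layout).
    Returns a list of batches; each batch is a list of examples.
    """
    rng = random.Random(seed)

    # Bucket the examples by task.
    by_task = defaultdict(list)
    for example in examples:
        by_task[example["task_id"]].append(example)

    batches = []
    for task_examples in by_task.values():
        rng.shuffle(task_examples)
        # Keep only full batches so no batch has to borrow from another task.
        full = len(task_examples) - len(task_examples) % batch_size
        for start in range(0, full, batch_size):
            batches.append(task_examples[start:start + batch_size])

    # Shuffle at the batch level so tasks are interleaved during training.
    rng.shuffle(batches)
    return batches


# Toy data: three hypothetical tasks of different sizes.
examples = [{"task_id": t, "text": f"task{t}-example{i}"}
            for t, n in [(0, 5), (1, 4), (2, 3)] for i in range(n)]
batches = task_homogeneous_batches(examples, batch_size=2)

# Flattened sequence: reading it in consecutive chunks of `batch_size`
# reproduces the single-task batches.
ordered = [example for batch in batches for example in batch]
```

If `ordered` is then read sequentially (without reshuffling) using the same batch size, every batch stays within one task, so the in-batch negatives come from that task. Dropping the remainder of each task is a simplification; padding or carrying leftovers to the next epoch would be alternatives.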
Hello author, I am very interested in your work on INSTRUCTOR. I would like to ask about the task_id values in your training dataset: which dataset does each id correspond to? Working out the correspondence myself would take a lot of time. By the way, I found only 329 distinct task_id values (302 is missing), which does not match the 330 in the paper. Looking forward to your reply and wishing your work success.
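A check like the one reported above can be reproduced with a few lines. This is a minimal sketch on toy data, assuming the ids are zero-based integers running up to 329; the helper name `missing_task_ids` is hypothetical, and the `observed` list stands in for the task_id column of the real dataset.

```python
def missing_task_ids(task_ids, expected_count):
    """Return the ids in range(expected_count) that never appear in task_ids."""
    return sorted(set(range(expected_count)) - set(task_ids))


# Toy stand-in for the task_id column: ids 0..329 with 302 left out (assumed numbering).
observed = [i for i in range(330) if i != 302]

distinct = len(set(observed))
missing = missing_task_ids(observed, expected_count=330)
print(distinct)  # 329
print(missing)   # [302]
```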
We found that we may have missed a task_id when we uploaded the dataset. We expect to fix it in the next version.
Is there any update on where to find the meaning of these task_ids? Edit: Sorry, I was looking at the dataset link on the paper website, which appears to be stale. The link in the README has actual dataset names in the id.
Hello, I'm very interested in your work, and I'm currently attempting to train a general sentence representation model. I have a question: When my training dataset comes from different domains, how can I ensure that samples within the same batch belong to the same task during the training process? Would it be better to include samples from different tasks within the same batch during training? I'm not sure about my assumption. Could you provide insights based on your experience?