fix: the cosine similarity is evaluated for top comments and bot comments are ignored #225
base: development
Conversation
Very skeptical of tfidf approach. We should go simpler and filter out more instead.
```diff
@@ -15,6 +15,9 @@ import {
 import { BaseModule } from "../types/module";
 import { ContextPlugin } from "../types/plugin-input";
 import { GithubCommentScore, Result } from "../types/results";
+import { TfIdf } from "../helpers/tf-idf";
+
+const TOKEN_MODEL_LIMIT = 124000;
```
This depends on the model and possibly should be an environment variable because we might change models.
Hard coding the 12400 doesn't seem like a solution there either
@0x4007 It is not hard-coded, but configurable within the config file? There is no API to retrieve a model's max token value, as far as I know.
Line 179 is hard coded
What should I do if the configuration is undefined? Should I throw an error and stop the run?
Yes, if we don't have it saved in our library or collection of known amounts, then it should throw.
There are no known amounts because no API or endpoint can give this information, so I'll just throw when it is undefined.
Manually get the numbers from their docs then
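For what it's worth, a minimal sketch of that approach: a hand-maintained map of context-window sizes copied from the providers' docs, with the configured value taking precedence and an error thrown when neither is available. The map name, the `resolveTokenLimit` helper, and the example numbers below are illustrative, not part of this PR.

```typescript
// Hypothetical sketch: context-window sizes maintained manually from provider docs.
// The example numbers must be kept in sync with the docs by hand.
const KNOWN_MODEL_TOKEN_LIMITS: Record<string, number> = {
  "gpt-4o": 128000,
  "gpt-4o-mini": 128000,
};

function resolveTokenLimit(model: string, configuredLimit?: number): number {
  // An explicit value from the plugin configuration always wins.
  if (configuredLimit !== undefined) {
    return configuredLimit;
  }
  const knownLimit = KNOWN_MODEL_TOKEN_LIMITS[model];
  if (knownLimit === undefined) {
    // No API exposes this value, so fail loudly instead of guessing.
    throw new Error(`No token limit configured or known for model "${model}"`);
  }
  return knownLimit;
}
```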
The problem is that this number is arbitrary. Didn't you just ask OpenAI to raise the limits on the account we're using, with the same model as before, and they did increase the limit? I'm afraid this number can't be guessed or hard-coded.
I don't think TF-IDF would be the best option for selecting the comments, as it only takes into account word frequency and does not give any importance to semantics. A better solution might be to switch to a larger-context model, like Gemini, which provides a 2 million token context window, when we reach the token limit, and to exclude bot-generated comments from the selection process.
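For reference on what is being debated, here is a minimal, hypothetical sketch of TF-IDF plus cosine similarity used to pick the top comments; it is purely frequency-based, which is exactly the limitation raised above. None of these helpers are the ones in this PR (`src/helpers/tf-idf` is a separate implementation).

```typescript
type Vector = Map<string, number>;

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// Inverse document frequency across all documents (issue body + comments).
function buildIdf(documents: string[][]): Map<string, number> {
  const documentFrequency = new Map<string, number>();
  for (const doc of documents) {
    for (const term of new Set(doc)) {
      documentFrequency.set(term, (documentFrequency.get(term) ?? 0) + 1);
    }
  }
  const idf = new Map<string, number>();
  for (const [term, df] of documentFrequency) {
    idf.set(term, Math.log(documents.length / (1 + df)) + 1);
  }
  return idf;
}

// TF-IDF weights: raw term count scaled by the term's IDF.
function tfIdfVector(doc: string[], idf: Map<string, number>): Vector {
  const vector: Vector = new Map();
  for (const term of doc) {
    vector.set(term, (vector.get(term) ?? 0) + (idf.get(term) ?? 0));
  }
  return vector;
}

function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  for (const [term, weight] of a) dot += weight * (b.get(term) ?? 0);
  const norm = (v: Vector) => Math.sqrt([...v.values()].reduce((sum, w) => sum + w * w, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Keep the `keep` comments whose TF-IDF vectors are closest to the issue body.
// Frequency only: two comments saying the same thing with different words score
// low, which is the semantic blind spot mentioned above.
function selectTopComments(issueBody: string, comments: string[], keep: number): string[] {
  const docs = [issueBody, ...comments].map(tokenize);
  const idf = buildIdf(docs);
  const issueVector = tfIdfVector(docs[0], idf);
  return comments
    .map((comment, i) => ({
      comment,
      score: cosineSimilarity(tfIdfVector(docs[i + 1], idf), issueVector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map((entry) => entry.comment);
}
```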
This was added as a failsafe for when we exceed the limit. Gemini can be an option, but theoretically we could also go beyond its token limit (even if that is unlikely). Since we can configure models, which all have different max token limits (and third parties could be using smaller and cheaper ones), I think it is important that the technique we use to lower the context size is not based on an LLM.
More careful filtering of comments, like removal of bot commands and comments, and possibly text summarization or concatenation of multiple calls, are all more accurate approaches. TFIDF is not the right tool for the job.
The commands and bot comments were also fixed in this PR. I added this as a last resort in case that was not enough. As I said before, I don't think it should rely on an LLM itself, because third-party users could choose a tiny model.
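The filtering part is the less controversial piece. A hedged sketch of what excluding bot comments and slash commands before evaluation could look like; the `ConversationComment` shape and the `isSlashCommand` helper are hypothetical, not the plugin's actual types.

```typescript
// Hypothetical shapes: the plugin's own GithubCommentScore type differs.
interface ConversationComment {
  body: string;
  user: { login: string; type: "User" | "Bot" };
}

// Slash commands such as "/start" or "/help" carry no content worth scoring.
function isSlashCommand(body: string): boolean {
  return /^\/\w+/.test(body.trim());
}

// Drop bot-authored comments and slash commands before any evaluation happens.
function filterEvaluableComments(comments: ConversationComment[]): ConversationComment[] {
  return comments.filter((comment) => comment.user.type !== "Bot" && !isSlashCommand(comment.body));
}
```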
Doing multiple calls to score everything and then concatenating the results seems the most straightforward, with no data loss.
@0x4007 Doing the evaluation is not the problem; the problem is that the context given to the LLM gets too big. If an issue has 300 comments, the prompt would contain those 300 comments during evaluation, which would exceed the token limit, so it has to get smaller. I don't see a way to fix that with no data loss, unless you meant comparing the comment against every other comment one by one?
Divide into two and do 150 each call. Receive the results array and concatenate them together.
@0x4007 This is not reliable, especially if a third party decides to use a tiny model. Plus, concatenating would not make sense: I would have to run the comment against the first 150 and then the last 150 (which would make the context inaccurate) and then probably average the result. I believe you didn't see how comments are evaluated: the problem is that, to give the LLM context for understanding the comments, we send all of the conversation's comments in the prompt alongside the comments of the user being evaluated. Refer to:
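The referenced snippet is not reproduced here; as a rough, hypothetical illustration of why the prompt grows with the comment count (the function name and prompt wording are made up for the example):

```typescript
// Illustrative only: every conversation comment is embedded as context, so
// the prompt size grows linearly with the number of comments in the issue.
function buildRelevancePrompt(allComments: string[], userComments: string[]): string {
  return [
    "Evaluate the relevance of each comment below against the whole conversation.",
    "Conversation context:",
    ...allComments.map((comment, i) => `[${i + 1}] ${comment}`),
    "Comments to score:",
    ...userComments.map((comment, i) => `(${i + 1}) ${comment}`),
  ].join("\n");
}
```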
Surely it's a bit of a trade-off without all of the comments in one shot, but this approach seems to trade off the least.
Why would they complain about using a non-default model? We set the default to what we know works.
@0x4007 Changed the behavior when the model's token limit is exceeded.
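The exact change is not quoted here; as a minimal sketch of one way such splitting could work, assuming injected helpers for token counting, prompt building, and the completion request (all three are hypothetical stand-ins, not the module's actual functions):

```typescript
// Hypothetical helper types this sketch assumes.
type CountTokens = (text: string) => number;
type BuildPrompt = (allComments: string[], userComments: string[]) => string;
type RequestScores = (prompt: string) => Promise<number[]>;

// Split the conversation context into chunks that each fit the token limit,
// score the user's comments against every chunk, then average the results.
async function evaluateWithChunking(
  allComments: string[],
  userComments: string[],
  tokenLimit: number,
  countTokens: CountTokens,
  buildPrompt: BuildPrompt,
  requestScores: RequestScores
): Promise<number[]> {
  const chunks: string[][] = [[]];
  for (const comment of allComments) {
    const lastChunk = chunks[chunks.length - 1];
    const candidate = [...lastChunk, comment];
    // Start a new chunk once adding this comment would overflow the limit.
    if (lastChunk.length > 0 && countTokens(buildPrompt(candidate, userComments)) > tokenLimit) {
      chunks.push([comment]);
    } else {
      chunks[chunks.length - 1] = candidate;
    }
  }

  const totals = new Array<number>(userComments.length).fill(0);
  for (const chunk of chunks) {
    const scores = await requestScores(buildPrompt(chunk, userComments));
    scores.forEach((score, i) => (totals[i] += score));
  }
  // Averaging across chunks is the lossy part of this trade-off.
  return totals.map((total) => total / chunks.length);
}
```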
# Conflicts:
#	dist/index.js
View | Contribution | Count | Reward |
---|---|---|---|
Issue | Task | 1 | 100 |
Issue | Specification | 1 | 15.94 |
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward |
---|---|---|---|---|
> @gentlementlegen perhaps we have too m… | 7.97 (content: p 0×2, em 0×1, a 5×1 → 5; words: 54 × 0.1 → 2.97) | 1 | 2 | 15.94 |
[ 21.956 WXDAI ]
@0x4007
Contributions Overview
View | Contribution | Count | Reward |
---|---|---|---|
Issue | Comment | 1 | 0.69 |
Review | Comment | 19 | 21.266 |
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward |
---|---|---|---|---|
I think high accuracy is the best choice from your selection. I… | 1.38 (content: p 0×1 → 0; words: 22 × 0.1 → 1.38) | 0.25 | 2 | 0.69 |
Very skeptical of tfidf approach. We should go simpler and filte… | 0.94 (content: p 0×1 → 0; words: 14 × 0.1 → 0.94) | 0.15 | 2 | 0.282 |
This depends on the model and possibly should be an environment … | 1.11 (content: p 0×1 → 0; words: 17 × 0.1 → 1.11) | 0.4 | 2 | 0.888 |
We should also filter out slash commands? And minimized comments? | 0.71 (content: p 0×1 → 0; words: 10 × 0.1 → 0.71) | 0.3 | 2 | 0.426 |
I'm skeptical about this whole TFIDF approach. 1. The tokenizer a… | 8.64 (content: p 0×1, ol 1×1, li 0.5×3 → 2.5; words: 127 × 0.1 → 6.14) | 0.35 | 2 | 6.048 |
Can you articulate the weaknesses or concerns | 0.52 (content: p 0×1 → 0; words: 7 × 0.1 → 0.52) | 0.25 | 2 | 0.26 |
Hard coding the 12400 doesn't seem like a solution there either | 0.83 (content: p 0×1 → 0; words: 12 × 0.1 → 0.83) | 0.2 | 2 | 0.332 |
Line 179 is hard coded | 0.39 (content: p 0×1 → 0; words: 5 × 0.1 → 0.39) | 0.15 | 2 | 0.117 |
Yes if we don't have it saved in our library or collection of kn… | 1.28 (content: p 0×1 → 0; words: 20 × 0.1 → 1.28) | 0.25 | 2 | 0.64 |
It shouldn't affect it at all. I would proceed with implicit app… | 1.44 (content: p 0×1 → 0; words: 23 × 0.1 → 1.44) | 0.1 | 2 | 0.288 |
Manually get the numbers from their docs then | 0.59 (content: p 0×1 → 0; words: 8 × 0.1 → 0.59) | 0.3 | 2 | 0.354 |
Why is this a constant? Makes more sense to use let and directly… | 1.28 (content: p 0×1 → 0; words: 20 × 0.1 → 1.28) | 0.4 | 2 | 1.024 |
```suggestion``` | 0 (content: {} → 0; words: 0 × 0.1 → 0) | 0.05 | 2 | 0 |
Add more chunks if the request to OpenAI fails for being too lon… | 2.1 (content: p 0×1 → 0; words: 36 × 0.1 → 2.1) | 0.45 | 2 | 1.89 |
@shiv810 rfc | 0.18 (content: p 0×1 → 0; words: 2 × 0.1 → 0.18) | 0.1 | 2 | 0.036 |
Separate is fine then just as long as the current code is stable. | 0.88 (content: p 0×1 → 0; words: 13 × 0.1 → 0.88) | 0.25 | 2 | 0.44 |
More careful filtering of comments like removal of bot commands … | 2.05 (content: p 0×1 → 0; words: 35 × 0.1 → 2.05) | 0.45 | 2 | 1.845 |
Doing multiple calls to score everything and then concatenate re… | 1.17 (content: p 0×1 → 0; words: 18 × 0.1 → 1.17) | 0.425 | 2 | 0.995 |
Divide into two and do 150 each call. Receive the results array … | 1.06 (content: p 0×1 → 0; words: 16 × 0.1 → 1.06) | 0.375 | 2 | 0.795 |
Surely it's a bit of a trade off without all of the comments in … | 6.58 (content: p 0×1, ol 1×1, li 0.5×2 → 2; words: 90 × 0.1 → 4.58) | 0.35 | 2 | 4.606 |
[ 2.456 WXDAI ]
@shiv810
Contributions Overview
View | Contribution | Count | Reward |
---|---|---|---|
Review | Comment | 3 | 2.456 |
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward |
---|---|---|---|---|
@gentlementlegen It shouldn't impact the comment evaluation at a… | 1.7 (content: p 0×1 → 0; words: 28 × 0.1 → 1.7) | 0.6 | 2 | 0.516 |
Besides the error code, `OpenRouter` provides `Provi… | 1.38 (content: p 0×1 → 0; words: 22 × 0.1 → 1.38) | 0.8 | 2 | 0.56 |
I don't think TF-IDF would be the best option for selecting the … | 3.66 (content: p 0×1 → 0; words: 69 × 0.1 → 3.66) | 0.75 | 2 | 1.38 |
This is the result I got while limiting the token count to 2000, which led to comment splitting. Note that I used the default configuration, not the ubiquity-os-marketplace one. The target was ubiquity-os-marketplace/text-conversation-rewards/174.
Resolves #174
What are the changes
Rfc @sshivaditya2019