-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: ignore file path and diffs #21
fix: ignore file path and diffs #21
Conversation
QA: issue With the same 15 mb index.js file in the PR |
Unused devDependencies (1)
Unused exports (1)
|
src/helpers/issue-fetching.ts
Outdated
//Find the filenames which do not have more than 200 changes | ||
let files = stats.filter((file) => file.changes < 500).map((file) => file.filename); | ||
//Ignore files like in dist or build or .lock files | ||
const ignoredFiles = ["dist/*", "build/*", ".lock", "index.js"]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should definitely be dynamic from git ignore and such. Otherwise this was not included in the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It uses the gitignore from the Repository if one is present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it combine this + gitignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It used to, I have removed all of that now. It just checks for file sizes in bytes now.
src/helpers/issue-fetching.ts
Outdated
//Fetch the statistics of the pull request | ||
const stats = await githubDiff.getPullRequestStats(org, repo, issue); | ||
//Find the filenames which do not have more than 200 changes | ||
let files = stats.filter((file) => file.changes < 500).map((file) => file.filename); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong approach. Should not be any hard cut off from some arbitrary number. Follow the spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised the approach:
- Sorts the diffs in ascending order based on file size
- Continues adding diffs until the maximum context token limit is reached
How should the maximum token limit be configured—via config settings or environment variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Max i guess config so that partners can budget their token use.
src/helpers/format-chat-history.ts
Outdated
@@ -83,7 +88,18 @@ async function createContextBlockSection( | |||
if (!issueNumber || isNaN(issueNumber)) { | |||
throw context.logger.error("Issue number is not valid"); | |||
} | |||
const prDiff = await fetchPullRequestDiff(context, org, repo, issueNumber); | |||
const pulls = await fetchLinkedPrFromIssue(org, repo, issueNumber, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These pulls
rely on searching the repo for all PRs then parsing the PR bodies for a #<issueNumber>
, which is recommended but not enforced so this isn't 100% reliable. We tend to use GQL API for this nowadays as it's most reliable. Also for queries that potentially return more than 100 items I'm not sure of the limit actually but it's likely 100, we should use octokit.paginate(octokit.pulls.list))
as when scanning all PRs it's likely.
It's also relying on the PR author actually linking using #
because if they copy paste the url which most do then it'll be a url in the pr body instead. I had logic for hashtag + url matching in my implementation but I think you may have removed it.
I think it may also be pulling more PR diffs than intended within this promise on each call to createContextBlockSection
, unsure without QA-ing and inspecting the formatted chat on a context rich repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to GraphQL to directly fetch pull requests that close issues through GitHub GQL's closedByPullRequestsReferences field. This replaces the previous API-based approach which relied on parsing PR body text for issue references (via # or URLs).
@@ -17,7 +17,7 @@ | |||
"knip-ci": "knip --no-exit-code --reporter json --config .github/knip.ts", | |||
"prepare": "husky install", | |||
"test": "jest --setupFiles dotenv/config --coverage", | |||
"worker": "wrangler dev --env dev --port 4000" | |||
"worker": "wrangler dev --env dev --port 5000" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason that you change this? The template uses :4000
so all other plugins also, it's tedious having to change it back for no reason
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is port 4000 required by some other service/kernel? I'm currently unable to use it but can switch back if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plugin-template comes with :4000
as standard so it's habitual for plugin devs to copy paste that url into -config.yml
and expecting it to work thinking something is broken for it to be the dev port lmao.
Ideally don't commit it if you want to use another port, like running multiple plugins locally for example, always change it back to the template port if you can remember to.
@@ -2,6 +2,7 @@ import OpenAI from "openai"; | |||
import { Context } from "../../../types"; | |||
import { SuperOpenAi } from "./openai"; | |||
import { CompletionsModelHelper, ModelApplications } from "../../../types/llm"; | |||
import { encode } from "gpt-tokenizer"; | |||
const MAX_TOKENS = 7000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should move this into config
and pass this via the yml config
src/helpers/format-chat-history.ts
Outdated
import { splitKey } from "./issue"; | ||
const MAX_TOKENS_ALLOWED = 7000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Should also be renamed to MAX_COMPLETION_TOKENS
.
return data as unknown as string; | ||
const githubDiff = new GithubDiff(octokit); | ||
//Fetch the statistics of the pull request | ||
const stats = await githubDiff.getPullRequestStats(org, repo, issue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lib is only called twice and can be done just as simply with one or two rest calls I'm sure, it's not clear if this is ideal relying on a lib for this. Nothing to hold back the PR for just an observation.
src/helpers/issue-fetching.ts
Outdated
@@ -296,3 +318,41 @@ function castCommentsToSimplifiedComments(comments: Comments, params: FetchParam | |||
url: comment.html_url, | |||
})); | |||
} | |||
|
|||
export async function fetchLinkedPrFromIssue(owner: string, repo: string, issueNumber: number, context: Context) { | |||
const prs = await context.octokit.rest.pulls.list({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be best to wrap this in a try-catch because if e.g the repo has been deleted this will likely throw
src/helpers/issue-fetching.ts
Outdated
@@ -296,3 +318,41 @@ function castCommentsToSimplifiedComments(comments: Comments, params: FetchParam | |||
url: comment.html_url, | |||
})); | |||
} | |||
|
|||
export async function fetchLinkedPrFromIssue(owner: string, repo: string, issueNumber: number, context: Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be better named as it's actually fetching the linked issue from the pull request body
src/helpers/issue-fetching.ts
Outdated
state: "all", | ||
}); | ||
//Filter the PRs which are linked to the issue using the body of the PR | ||
return prs.data.filter((pr) => pr.body?.includes(`#${issueNumber}`)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be best to add handling for url matching too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not applicable - replaced with GraphQL solution
QA: Issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps my spec isnt clear but youre not following it.
- Remove all the file ignores.
- Count amount of changes per file.
- sort in order of amount of changes
- filter out files from most changes to least based on context limits.
This will automatically filter out large automated changes like compiled dist and lock files.
src/helpers/issue-fetching.ts
Outdated
//Fetch the statistics of the pull request | ||
const stats = await githubDiff.getPullRequestStats(org, repo, issue); | ||
//Ignore files like in dist or build or .lock files | ||
const ignoredFiles = (await buildIgnoreFilesFromGitIgnore(context, org, repo)) || []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey i just realized this is pointless.
If its ignored it wont be on git or on the diff. Remove all this logic and just rely on filtering files out based on largest amount of changes to smallest.
src/helpers/issue-fetching.ts
Outdated
repo, | ||
path: ".gitignore", | ||
}); | ||
// Build an array of files to ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get rid of all this logic too
I think that counting the number of changes isn't the best metric. For instance, I'm following the guidelines outlined in this comment. Is there another specification for this issue? |
This could have been worded better. I meant that it would automatically exclude those based on how large their diffs are. GitHub pull request code view UI handles this in a smart way, where it wont display the dist/index.js and lock files due to "large diffs" Perhaps they also have a line length limitation. It may be wise to replicate. Then again, 300 line changes is significant. Much more than a normal file diff i think? This would almost certainly be filtered out if there were a ton of file changes with ~10 lines changed per each etc. |
It seems to use file size as a criterion to determine whether to show the UI, as those files wouldn't be accessible even in a regular file viewer without the diffs.
It is indeed significant, more than what you'd typically see in a standard file diff. However, while 300 lines could potentially be changed through code, the diff size in bytes for index.js was around 15.6 MB. In comparison, a code file with the same number of line changes would be much smaller in size. |
Alright so then it seems clear to me that we can measure the diff size in bytes and then go from there. |
@0x4007 I need some clarification. Currently, the system retrieves both filenames and the difference size in bytes, and then it fetches the diffs for each filename. Once that’s done, we sort the results in ascending order and continue adding to the context until we reach the maximum limit. I followed along up to the sorting step in the specifications, but I'm unsure if I understand it correctly. What criteria should be used to select the diffs? Is it based on file size being under a certain threshold, or is there another method we should use? |
I think that's all you need to do. What is the problem? |
Ok, I see now that I misunderstood these lines. I think this PR is ready to be merged. I'll post more QA for this. |
@@ -23,6 +23,7 @@ export const pluginSettingsSchema = T.Object({ | |||
model: T.String({ default: "o1-mini" }), | |||
openAiBaseUrl: T.Optional(T.String()), | |||
similarityThreshold: T.Number({ default: 0.9 }), | |||
maxTokens: T.Number({ default: 10000 }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default should be the max for the model we are using? 10k seems arbitrary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10k seemed like a reasonable limit. If someone is using older OpenAI models (3.5 Turbo, Gpt4Turbo-28K), it could result in a hefty bill instantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default to the latest sensible models then like 4o etc
QA: Link |
Resolves #19