
fix: adds scraper logic and token handling #190

Open · wants to merge 32 commits into development

Conversation

shiv810 commented Dec 27, 2024

Resolves #187

  • Adds the scraper logic
  • Fetches only if the data in the DB is older than 24 hours

shiv810 requested a review from 0x4007 as a code owner on December 27, 2024, 01:22
ubiquity-os-deployer bot commented Dec 27, 2024

0x4007 (Member) left a comment

The changes seem generally fine; I'm not sure how this can be QA'd.

I was wondering if another approach makes sense, such as only copying with high precision when an issue is closed as complete, so only that gets copied over. But if we are hunting down historical issues, then obviously this isn't a solution.

My concern is repeatedly finding the same historical issues and doing the same redundant search and scrape work. I am wondering if there is a better approach to this problem.

shiv810 (Author) commented Dec 28, 2024

@zugdev There are two lockfiles, bun.lockb and yarn.lock. Is that intended? Which one is supposed to be used?

shiv810 (Author) commented Dec 28, 2024

The changes seem generally fine; I'm not sure how this can be QA'd.

I'll create a demo video to walk through the flows.

Right now, we store the timestamp of the user's last scrape. It should be possible to use this timestamp as a filter to check for any newly closed issues marked as completed: if there are any, we proceed with the scrape; otherwise, we skip the process for that user.
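A rough sketch of what that filter could look like, assuming Octokit is available where the scraper runs; the helper name, query shape, and qualifiers are illustrative, not part of this PR:

import { Octokit } from "@octokit/rest";

// Hypothetical helper: re-scrape only if at least one of the user's issues was
// closed as completed after the stored last-scrape timestamp.
async function shouldScrape(octokit: Octokit, username: string, lastScrapeMs: number): Promise<boolean> {
  // Day granularity is enough for a ">= 24 hours" gate.
  const closedSince = new Date(lastScrapeMs).toISOString().split("T")[0];
  const { data } = await octokit.rest.search.issuesAndPullRequests({
    q: `assignee:${username} is:issue is:closed reason:completed closed:>=${closedSince}`,
    per_page: 1,
  });
  return data.total_count > 0;
}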

zugdev (Collaborator) commented Dec 29, 2024

@zugdev There are two lockfiles, bun.lockb and yarn.lock. Is that intended? Which one is supposed to be used?

The bun.lockb one is related to Cloudflare Wrangler; the main one is yarn.lock. Use the latter.

0x4007 (Member) commented Dec 30, 2024

Perhaps we should consolidate all to bun? @gentlementlegen rfc

gentlementlegen (Member):
Ideally we should use bun everywhere for consistency, except if there is a reason not to. It matters mostly because when testing packages locally, bun link and yarn link are not interchangeable (at least from my experience).

shiv810 (Author) commented Dec 30, 2024

Is there a way to add secrets in the final build without exposing them in dist? I’m thinking of creating a pg function to handle this or using Cloudflare KV.

gentlementlegen (Member):
Yes, you can store and pull them from the GitHub Actions environment during deployment.

shiv810 (Author) commented Dec 30, 2024

Yes, you can store and retrieve them from the GitHub Actions environment during deployment.

You’ll still be able to view the values in the home.js file after the final build. Is that fine? I think it’s fine for the SUPABASE URL and ANON KEY, but storing the VOYAGE API KEY in plain text doesn’t seem secure.

gentlementlegen (Member):
Why would it be in the final build? It should not be there.

shiv810 (Author) commented Dec 30, 2024

The scraper logic is executed when a user logs in on the client side, so we will need the Voyage API key there for embeddings.

gentlementlegen (Member):
Ah yes, I see now; I was missing the context, sorry. We should definitely not expose that API key on the client side. Sadly, we don't use SSR, so I don't know how we can handle this except by having some API that we host ourselves somewhere, or storing it in cookies?

0x4007 (Member) commented Dec 30, 2024

Make a worker endpoint. We have a setup like this for pay.ubq.fi that's related to the cards. Check its code.

@EresDev rfc

EresDev (Contributor) commented Dec 30, 2024

Make a worker endpoint. We have a setup like this for pay.ubq.fi that's related to the cards. Check its code.

@EresDev rfc

Yes, it looks like the VoyageAI part needs a backend. You can get started with Pages Functions.

We store API keys in Cloudflare Worker secrets. However, one thing that ubiquity-os-kernel does differently is store the API keys in GitHub and push them to Cloudflare during worker deployment. This helps with managing the secrets in one place (GitHub), and we will probably do the same soon in pay.ubq.fi if possible.
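For illustration, a minimal sketch of such a Pages Function in TypeScript (the PagesFunction type comes from @cloudflare/workers-types); the file name, the VOYAGEAI_API_KEY binding, and the model name are assumptions, not what this PR ships:

// functions/embeddings.ts (hypothetical): proxy embedding requests server-side so the
// VoyageAI key stays in the Cloudflare environment and never reaches the client bundle.
interface Env {
  VOYAGEAI_API_KEY: string;
}

export const onRequestPost: PagesFunction<Env> = async ({ request, env }) => {
  const { texts } = await request.json<{ texts: string[] }>();

  const upstream = await fetch("https://api.voyageai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${env.VOYAGEAI_API_KEY}`,
    },
    // Model name is a placeholder; the body follows VoyageAI's embeddings request shape.
    body: JSON.stringify({ input: texts, model: "voyage-large-2" }),
  });

  return new Response(await upstream.text(), {
    status: upstream.status,
    headers: { "Content-Type": "application/json" },
  });
};

The client would then call this endpoint instead of VoyageAI directly, so only the Supabase URL and anon key would remain visible in the built assets.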

shiv810 (Author) commented Dec 30, 2024

We store API keys in Cloudflare Worker secrets.

We can store API keys in Cloudflare KV, but I don't think this is the best approach, as we could quickly hit the read limit. A better option might be Supabase Functions, which would allow us to read directly from the Supabase Vault.

gentlementlegen (Member):
You can have environment secrets in Cloudflare that do not count against any quota. We can upload them from GitHub environment secrets, like we do for most plugins.

shiv810 (Author) commented Jan 4, 2025

QA:

[Screenshot: 2025-01-04 at 11:51:53 AM]

Mean Stats:

  • CPU wall time: 3.2 ms
  • Sub-requests: 16 (for about 61 issues)

@zugdev could you add the Supabase service role key to the secrets under SUPABASE_KEY? That's needed for adding the scraped results back to the table. Please ping me on Telegram for the value.

shiv810 (Author) commented Jan 6, 2025

@0x4007

QA:

https://scraperui.work-ubq-fi-50d.pages.dev/

You should be able to see the scraper function request in the Network tab of the browser's inspector.

0x4007 (Member) commented Jan 6, 2025

I'm on mobile these days, so I don't want to hold up the review; somebody else please check on their computer for me.

.github/workflows/deploy.yml (outdated; thread resolved)
functions/issue-scraper.ts (thread resolved)
): Promise<void> {
const { error } = await supabase.from("issues").upsert(issues);
if (error) {
throw new Error(`Error during batch upsert: ${error.message}`);
Member:

Perhaps we can try an exponential backoff? I wonder if this is worth implementing to make this more robust although I imagine that errors are quite rare for this.

Author:

It should be possible to implement, but if a request fails and we retry multiple times, the entire request will eventually be terminated once it hits the maximum CPU wall time.

Member:

Retries should be handled from the client so no timeouts on the worker side should be relevant.
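A small sketch of what client-side retries with exponential backoff could look like; the helper and endpoint names are illustrative:

// Retry a failing call from the browser with exponential backoff, so the Worker
// itself never spends CPU time on retries.
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 3, baseDelayMs = 500): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break;
      // Wait 0.5 s, 1 s, 2 s, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Usage against a hypothetical endpoint:
// await withBackoff(() => fetch("/issue-scraper", { method: "POST" }));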

functions/issue-scraper.ts (two outdated threads, resolved)
src/home/authentication.ts (two outdated threads, resolved)

if (lastFetch) {
const lastFetchTimestamp = Number(lastFetch);
if (now - lastFetchTimestamp < 24 * 60 * 60 * 1000) {
Member:

I wonder if there is a better way to implement this. For example, instead of going by timing, we could do it based on an event.

Although I can't think of how that would be possible in outside-organization contexts.

src/home/scraper/issue-scraper.ts (thread resolved)
return md.plainText;
}

const SEARCH_ISSUES_QUERY = `
Member:

You can add /* GraphQL */ to help IDEs parse and format GraphQL queries.

Suggested change
const SEARCH_ISSUES_QUERY = `
const SEARCH_ISSUES_QUERY = /* GraphQL */`

package.json (outdated)
@@ -36,10 +36,14 @@
"@octokit/request-error": "^6.1.0",
"@octokit/rest": "^20.0.2",
"@supabase/supabase-js": "^2.39.0",
"@types/markdown-it": "^14.1.2",
Member:

Should be in devDependencies

yarn.lock (outdated)
Member:

Shouldn't this be deleted since you used bun?

Author:

I'm not entirely sure about this. @zugdev mentioned that Bun is related to Cloudflare, while Yarn is the main one. So, I've reverted the bun.lockb to match the repo version and added the dependencies to the yarn.lock.

Member:

I suppose it's okay to use either, because the frontend doesn't rely on plugins or the kernel, but having both lockfiles seems unnecessary; I would suggest removing one or the other 😄

Development

Successfully merging this pull request may close these issues.

Scraper: Post Login with Github
6 participants