Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for clipboard processing from images #97

Merged
merged 10 commits into from
Jul 19, 2024

Conversation

jaresty
Copy link
Collaborator

@jaresty jaresty commented Jul 15, 2024

  • This add support for image processing
  • requires the use of gpt-4o

@jaresty
Copy link
Collaborator Author

jaresty commented Jul 16, 2024

I think this approach is workable if you are open to it. Because we are sending the requests much later than we are processing sources, I needed to make a signal that would allow me to know that I needed to pull from the clipboard later when sending the request to ChatGPT. This was the smallest change that I could find that seems to be workable-another alternative would be to build the request parts earlier where we are doing the source processing, but some of the other parts of the code rely on the text being raw so I wasn't sure that was a good idea for now.

@jaresty
Copy link
Collaborator Author

jaresty commented Jul 16, 2024

This change allows you to use almost any screenshot or clipboard contents and use them as input to every command. The possibilities are pretty open, and inspiring. You can use model blend to blend screenshots into text, or take a screenshot of a presentation and then use model snip to insert it into VSCode. The grammar is no different than it was before-the only change is that it check if the contents are an image to determine what to do.

],
"max_tokens": 2024,
"temperature": settings.get("user.model_temperature"),
"n": 1,
"stop": None,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was causing an exception when running with an image prompt but not with text. I didn't notice any behavioral difference however so I removed it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good w me if it works with you

- This allows it to pull from images as well
@jaresty
Copy link
Collaborator Author

jaresty commented Jul 16, 2024

This works really well. I just tried "model snip format html clip" with a prompt to generate html and a screenshot of a design I needed to make and it generated a snippet based on the visual design dynamically. This is pretty amazing stuff 😮

@jaresty
Copy link
Collaborator Author

jaresty commented Jul 16, 2024

Incidentally appears that our current vision implementation is broken because the ChatGPT model is deprecated, however I didn't want to complicate this discussion by making a change there as well: tldraw/make-real-starter#30

@jaresty
Copy link
Collaborator Author

jaresty commented Jul 18, 2024

gpt-4o-mini launched today with vision and is cheaper than 3.5 turbo, apparently: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

GPT/gpt.py Outdated
@@ -244,7 +246,7 @@ def gpt_get_source_text(spoken_text: str) -> str:
"""Get the source text that is will have the prompt applied to it"""
match spoken_text:
case "clipboard":
return clip.text()
return "clip"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we accomplish this PR without needing to change this to return a string which then is matched in the other function? Is it possible for us to return the actual source text here?

If this is changed then gpt_get_source_text doesn't return the source text for the clipboard anymore so it doesn't semantically match what it is doing. (i.e. if we wanted to call it elsewhere)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult to do this without a pretty significant change. The nice thing about this is that it transparently falls back if it's not an image in the clipboard, but we have to change the structure of the message we send to ChatGPT. We need some way to indicate that that should happen. This was the smallest change that I could think of to make this happen.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's chat about this tomorrow-this syntax / strategy unlocks a lot of interesting possibilities that I'd love to share with you. A bit hard to explain in text.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a workaround that only makes the clipboard image a special case.

@C-Loftus
Copy link
Owner

Generally looks fine w me if we can find a way to clean up that one comment I made. I am fine switching to gpt-4o-mini. Seems like a good choice.

For some reason I have felt gpt 3.5 turbo is not as good as it once was. Maybe it is just me but it seems the newer versions of the model are extremely focused on trying to return conversation-style responses, even with proper prompting

@C-Loftus
Copy link
Owner

C-Loftus commented Jul 19, 2024

Ok looks good to me. Thanks. This is once again pretty clean. Nice work!

To return the image as a string, I think we could b64 encode the image and pass it to the query function, but if you have other ideas don't want to break that. This is working well.

@C-Loftus
Copy link
Owner

C-Loftus commented Jul 19, 2024

Oh and I switched to gpt-4o-mini. Don't see any reason not to use that. Gives users more support and has better performance.

@jaresty
Copy link
Collaborator Author

jaresty commented Jul 19, 2024

I debated returning the base sixty four encoded images as well, but you have to add a different type and pass the image URL when sending the message to ChatGPT, so you'd have to do some kind of a match to determine if it was a base sixty four encoded image, which is less clean than doing a direct string comparison I think.

@C-Loftus
Copy link
Owner

C-Loftus commented Jul 19, 2024

Good to merge this now whenever you want. Changed the magic string to be __IMAGE__ so it is essentially not possible to trigger accidentally.

@jaresty jaresty merged commit 29832e8 into main Jul 19, 2024
3 checks passed
@jaresty jaresty deleted the support-image-clipboard-processing branch July 19, 2024 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants