Add support for clipboard processing from images #97
Conversation
jaresty commented Jul 15, 2024
- This adds support for image processing
- Requires the use of gpt-4o
I think this approach is workable if you are open to it. Because we send the requests much later than we process sources, I needed a signal that tells me to pull from the clipboard later, when sending the request to ChatGPT. This was the smallest workable change I could find. An alternative would be to build the request parts earlier, where we do the source processing, but some other parts of the code rely on the text being raw, so I wasn't sure that was a good idea for now.
This change allows you to take almost any screenshot or clipboard contents and use them as input to every command. The possibilities are pretty open, and inspiring. You can use model blend to blend screenshots into text, or take a screenshot of a presentation and then use model snip to insert it into VSCode. The grammar is no different than it was before; the only change is that it checks whether the contents are an image to determine what to do.
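The image-or-text check described above can be pictured with a small sketch (the names here are hypothetical illustrations, not the actual Talon clip API):

```python
# Minimal sketch of the check described above: if the clipboard holds
# an image, route it through the image path; otherwise treat it as text.
# `ClipboardState` and `describe_clipboard` are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipboardState:
    text: Optional[str] = None
    image_bytes: Optional[bytes] = None


def describe_clipboard(clip: ClipboardState) -> str:
    """Decide which request path the clipboard contents should take."""
    if clip.image_bytes is not None:
        return "image"  # becomes an image_url content part later
    return "text"       # stays a plain text content part


print(describe_clipboard(ClipboardState(image_bytes=b"\x89PNG")))
print(describe_clipboard(ClipboardState(text="some copied text")))
```

Because the check happens at dispatch time, every existing command picks up image support without any grammar changes.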
    ],
    "max_tokens": 2024,
    "temperature": settings.get("user.model_temperature"),
    "n": 1,
    "stop": None,
This was causing an exception when running with an image prompt but not with text. I didn't notice any behavioral difference, however, so I removed it.
Good w me if it works with you
- This allows it to pull from images as well
This works really well. I just tried "model snip format html clip" with a prompt to generate HTML and a screenshot of a design I needed to make, and it generated a snippet based on the visual design dynamically. This is pretty amazing stuff 😮
Incidentally, it appears that our current vision implementation is broken because the ChatGPT model is deprecated; however, I didn't want to complicate this discussion by making a change there as well: tldraw/make-real-starter#30
gpt-4o-mini launched today with vision and is cheaper than 3.5 turbo, apparently: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
GPT/gpt.py (Outdated)
@@ -244,7 +246,7 @@ def gpt_get_source_text(spoken_text: str) -> str:
     """Get the source text that is will have the prompt applied to it"""
     match spoken_text:
         case "clipboard":
-            return clip.text()
+            return "clip"
Can we accomplish this PR without changing this to return a string that is then matched in the other function? Is it possible for us to return the actual source text here?
If this is changed, then gpt_get_source_text doesn't return the source text for the clipboard anymore, so it no longer semantically matches what it is doing (i.e. if we wanted to call it elsewhere).
It's difficult to do this without a pretty significant change. The nice thing about this approach is that it transparently falls back if the clipboard doesn't contain an image, but we have to change the structure of the message we send to ChatGPT, and we need some way to indicate that that should happen. This was the smallest change I could think of to make it work.
Let's chat about this tomorrow; this syntax / strategy unlocks a lot of interesting possibilities that I'd love to share with you. It's a bit hard to explain in text.
I found a workaround that only makes the clipboard image a special case.
Generally looks fine w me if we can find a way to clean up that one comment I made. I am fine switching to gpt-4o-mini; seems like a good choice. For some reason I have felt GPT-3.5 Turbo is not as good as it once was. Maybe it is just me, but the newer versions of the model seem extremely focused on returning conversation-style responses, even with proper prompting.
- Make image the only clipboard special case
for more information, see https://pre-commit.ci
Ok, looks good to me. Thanks. This is once again pretty clean. Nice work! To return the image as a string, I think we could base64-encode the image and pass it to the query function, but if you have other ideas I don't want to break that. This is working well.
Oh, and I switched to gpt-4o-mini. I don't see any reason not to use that; it gives users more support and has better performance.
I debated returning the base64-encoded images as well, but you have to add a different type and pass the image URL when sending the message to ChatGPT, so you'd have to do some kind of match to determine if it was a base64-encoded image, which I think is less clean than doing a direct string comparison.
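For reference, the base64 path discussed here would look roughly like this: the raw image bytes are encoded into a data URL and sent as an image_url content part alongside any text parts (the PNG bytes below are a placeholder, not a real image):

```python
# Building an image content part for the chat request: base64-encode
# the clipboard image bytes into a data URL. The bytes here are a
# truncated placeholder standing in for a real clipboard image.
import base64

image_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder, not a complete PNG
b64 = base64.b64encode(image_bytes).decode("ascii")

image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{b64}"},
}
text_part = {"type": "text", "text": "Describe this screenshot."}

print(image_part["image_url"]["url"][:30])
```

This is why returning the encoded string from the source resolver would still require a match downstream: a base64 payload needs the image_url part type, while plain text does not.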
Good to merge this now whenever you want. Changed the magic string to be