Replies: 1 comment 1 reply
-
I am 100% with you on this -- we are trying to exploit the gradient abstraction by swapping in human feedback instead of LLM feedback. This is on our roadmap!
-
The biggest limitation that I currently see for applying this framework to my production generative AI systems is simply that LLMs are not very good judges. It is much easier to get a model to give a good output than it is to get it to reliably recognize a good output.
I think that the current implementation is fantastic for reducing the cost of use cases that are clearly within reach of AI capabilities, such as using a more expensive model to train a cheaper model to do a better job.
I am personally much more interested in use cases where we are pushing the best models to their peak performance. To do this I think there needs to be a way to incorporate human evaluation into textgrad. Even just a few optimization cycles with expert human input as the evaluation criteria could quickly unlock novel capabilities that may have been previously undiscovered.
In terms of implementation, I think this could be as simple as adding a human input field, which the "teacher" would incorporate as context into the optimization step. It should be fairly straightforward to prompt the teacher to defer to the human evaluation instead of making up its own; a rough sketch of what this might look like is included after this comment.
Obviously, this approach doesn't scale very well, but again, I think that a few optimization steps with thoughtful human expert feedback could really unlock huge gains for many real-world applications.
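For what it's worth, here is a minimal sketch of the idea, assuming TextGrad's documented Variable / TextLoss / TGD interface (as shown in the README quickstart, with TextLoss taking an instruction string). The `collect_human_feedback` helper, the engine name, the loop count, and the exact prompt wording are hypothetical and only illustrate deferring to the expert's critique instead of an LLM judge's own criteria:

```python
# Human-in-the-loop sketch (hypothetical usage; assumes TextGrad's public
# Variable / TextLoss / TGD API as described in its README quickstart).
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)  # "teacher" engine that produces textual gradients

# The text we want to optimize, e.g. a draft output from the production system.
draft = tg.Variable(
    "Initial draft produced by the production system...",
    requires_grad=True,
    role_description="draft output to refine with expert feedback",
)

optimizer = tg.TGD(parameters=[draft])

def collect_human_feedback(text: str) -> str:
    """Hypothetical helper: show the current draft to a domain expert and return their critique."""
    print(text)
    return input("Expert feedback: ")

# A handful of optimization steps, each driven by expert evaluation rather than an LLM judge.
for _ in range(3):
    feedback = collect_human_feedback(draft.value)

    # Build the evaluation instruction around the human critique and tell the
    # teacher to defer to it rather than inventing its own judgment.
    loss_fn = tg.TextLoss(
        "You are given expert human feedback on the text. Defer to this feedback; "
        "do not add your own evaluation criteria. Expert feedback: " + feedback
    )

    loss = loss_fn(draft)   # textual "loss" grounded in the expert's critique
    loss.backward()         # teacher converts the critique into textual gradients
    optimizer.step()        # rewrites the draft using those gradients
```

The only change from the standard loop is that the evaluation instruction is constructed from the expert's feedback at each step, so the gradient abstraction is reused as-is.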