To a beginner, this is weird! But then I learn that under the hood, for some reason, each 'cell' is actually a dictionary. And even though there are three options for different types of datasets, they all seem to have this requirement that the input and output MUST be a dictionary.
The confusion is amplified by the wide range of terminology for the same thing (input/question/prompt and output/target/reference/etc). So for example, in one tutorial the dataset column is called "output", the dataset record output is a dict with the key 'answer', and the evaluator uses "reference".
And an LLM Judge doesn't compare two strings, it compares two dicts! That can't be optimal.
I get that some people may want this complexity, but I can tell you that I don't.
Suggestion:
Allow strings as values for the input and output columns of a dataset.
I assume there's a good reason for having dictionaries in the backend, but I would like to suggest that this needn't be exposed to the user in all cases. Just do this whenever processing inputs behind the scenes:
if isinstance(input, str):
    input = {"input": input}
Or allow a new type of dataset "Simple" that's just input and output strings.
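The wrapping suggested above could look roughly like the following. This is only a minimal sketch of the idea, not how LangSmith works internally: the helper names `wrap_value` and `normalize_example`, and the default keys "input" and "output", are hypothetical and not part of the LangSmith API.

```python
from typing import Any, Dict, Union

# A dataset cell is either a bare string or an already-structured dict.
Value = Union[str, Dict[str, Any]]


def wrap_value(value: Value, default_key: str) -> Dict[str, Any]:
    """Return dicts unchanged; wrap a bare string under default_key (hypothetical helper)."""
    if isinstance(value, str):
        return {default_key: value}
    if isinstance(value, dict):
        return value
    raise TypeError(f"expected str or dict, got {type(value).__name__}")


def normalize_example(inputs: Value, outputs: Value) -> Dict[str, Dict[str, Any]]:
    """Build a dict-based record from plain strings, dicts, or a mix of both."""
    return {
        "inputs": wrap_value(inputs, "input"),     # "input" is an assumed default key
        "outputs": wrap_value(outputs, "output"),  # "output" is an assumed default key
    }


example = normalize_example("What is 2 + 2?", "4")
print(example)
# {'inputs': {'input': 'What is 2 + 2?'}, 'outputs': {'output': '4'}}
```

With a shim like this, anyone who already passes dicts sees no change, while a beginner can pass plain strings and still end up with the dict-shaped records the backend expects.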
Maybe as I progress in my knowledge I will come to understand/appreciate why these values MUST be dicts, but I thought I'd share my newbie perspective while it's still fresh.
Side note: LangSmith is awesome, as is the boatload of "Hi this is Lance from LangChain" videos on YouTube.
Agreed that dataset types are confusing and that we need better consistency in the docs. Will forward to the team. We're planning some changes to hopefully simplify the experience a bit. I can't promise support for raw string datasets yet.
Issue you'd like to raise.
I'm new to LangSmith and find the dataset structure more complicated (and confusing) than it needs to be.
In some ways, the dataset is treated like a table, with Input and Output columns, like in the UI:
And I can upload a dataframe to fill those columns like:
Which makes it seem like I'm creating a table where the Input and Output column values are strings. But this code doesn't work: