Issue: dataset types lacking simple string option #1073

davidgilbertson · 2024-10-07T20:06:36Z

Issue you'd like to raise.

I'm new to LangSmith and find the dataset structure more complicated (and confusing) than it needs to be.

In some ways, the dataset is treated like a table, with Input and Output columns, as it appears in the UI.

And I can upload a dataframe to fill those columns like:

client.upload_dataframe(
    df,
    name="my-dataset",
    input_keys=["question"],
    output_keys=["expected"],
)

Which makes it seem like I'm creating a table where the Input and Output column values are strings. But this code doesn't work:

client.create_examples(
    inputs=df.question,
    outputs=df.expected,
    dataset_name="my-dataset",
)

To a beginner, this is weird! But then I learn that under the hood, for some reason, each 'cell' is actually a dictionary. And even though there are three options for different types of datasets, they all seem to have this requirement that the input and output MUST be a dictionary.
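As a rough sketch of the workaround this implies: wrap each string in a one-key dictionary before handing the columns to `create_examples`. The helper name `wrap_column` is hypothetical (not part of LangSmith), plain lists stand in for `df.question` and `df.expected`, and the keys are simply the ones used in the `upload_dataframe` call above.

```python
from typing import Iterable


def wrap_column(values: Iterable[str], key: str) -> list[dict[str, str]]:
    """Turn a column of plain strings into a list of one-key dicts."""
    return [{key: value} for value in values]


# Plain lists standing in for df.question and df.expected.
questions = ["What is 2 + 2?", "Capital of France?"]
expected = ["4", "Paris"]

inputs = wrap_column(questions, "question")
outputs = wrap_column(expected, "expected")

# The wrapped lists would then be passed instead of the raw columns, e.g.:
# client.create_examples(inputs=inputs, outputs=outputs, dataset_name="my-dataset")
```

With a real dataframe, `wrap_column(df.question, "question")` should behave the same way, since iterating a pandas Series yields its values.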

The confusion is amplified by the wide range of terminology for the same thing (input/question/prompt and output/target/reference/etc). For example, in one tutorial the dataset column is called "output", the dataset record's output is a dict with the key 'answer', and the evaluator uses "reference".
And an LLM Judge doesn't compare two strings, it compares two dicts! That can't be optimal.
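To make the dict-versus-string point concrete, here is a minimal sketch of what even a trivial comparison has to do: unpack both dicts before it can look at the strings. The function name `exact_match` and the default key `"answer"` are assumptions for illustration only; this is not LangSmith's evaluator API.

```python
def exact_match(
    outputs: dict,
    reference_outputs: dict,
    output_key: str = "answer",      # assumed key, for illustration
    reference_key: str = "answer",   # assumed key, for illustration
) -> bool:
    """Pull the strings out of both dicts, then compare them."""
    prediction = outputs[output_key]
    reference = reference_outputs[reference_key]
    # Ignore leading/trailing whitespace so "Paris " still matches "Paris".
    return prediction.strip() == reference.strip()
```

Every evaluator ends up needing to know which key holds the string, which is exactly the bookkeeping a plain-string dataset would avoid.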

I get that some people may want this complexity, but I can tell you that I don't.

Suggestion:

Allow strings as values for the input and output columns of a dataset.

I assume there's a good reason for having dictionaries in the backend, but I would like to suggest that this needn't be exposed to the user in all cases. Just do this whenever processing inputs behind the scenes:

if isinstance(input, str):
    input = {"input": input}

Or allow a new type of dataset "Simple" that's just input and output strings.
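A slightly fuller sketch of the `isinstance` idea above, written as a standalone function so it can accept either form. The name `normalize_value` and the `default_key` parameter are hypothetical; this only illustrates the suggestion and says nothing about how LangSmith processes inputs internally.

```python
from collections.abc import Mapping
from typing import Any


def normalize_value(value: Any, default_key: str) -> dict:
    """Accept a plain string or a dict-like value; always return a dict."""
    if isinstance(value, str):
        # Wrap a bare string under the default key.
        return {default_key: value}
    if isinstance(value, Mapping):
        # Already dict-shaped: copy it so the caller's object is not shared.
        return dict(value)
    raise TypeError(
        f"expected str or mapping, got {type(value).__name__}"
    )


# Strings and dicts can be mixed freely in the same column.
raw_inputs = ["What is 2 + 2?", {"question": "Capital of France?"}]
normalized = [normalize_value(v, "input") for v in raw_inputs]
```

Existing dict-based datasets would pass through unchanged, so the string form would be purely additive.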

Maybe as I progress in my knowledge I will come to understand/appreciate why these values MUST be dicts, but I thought I'd share my newbie perspective while it's still fresh.

Side note: LangSmith is awesome, as is the boatload of "Hi this is Lance from LangChain" videos on YouTube.

hinthornw · 2024-10-08T03:48:26Z

Agreed that dataset types are confusing and that we need better consistency in the docs. Will forward to the team. We're planning some changes to hopefully simplify the experience a bit. I can't promise support for raw string datasets yet.

Will also forward your gratitude to Lance!
