Issue: dataset types lacking simple string option #1073

Open
davidgilbertson opened this issue Oct 7, 2024 · 1 comment

@davidgilbertson

Issue you'd like to raise.

I'm new to LangSmith and find the dataset structure more complicated (and confusing) than it needs to be.

In some ways, the dataset is treated like a table, with Input and Output columns, like in the UI:
[image: screenshot of the dataset table in the UI]

And I can upload a dataframe to fill those columns like:

client.upload_dataframe(
    df,
    name="my-dataset",
    input_keys=["question"],
    output_keys=["expected"],
)

Which makes it seem like I'm creating a table where the Input and Output column values are strings. But this code doesn't work:

client.create_examples(
    inputs=df.question,
    outputs=df.expected,
    dataset_name="my-dataset",
)

To a beginner, this is weird! But then I learn that under the hood, for some reason, each 'cell' is actually a dictionary. And even though there are three options for different types of datasets, they all seem to have this requirement that the input and output MUST be a dictionary.
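
A minimal sketch of a workaround, assuming `create_examples` accepts a sequence of dicts for `inputs` and `outputs`, as the dict requirement described above implies. The helper `wrap_column` is made up for illustration, the keys `"question"` and `"expected"` are borrowed from the `upload_dataframe` call, and plain lists stand in for the dataframe columns:

```python
from typing import Any, Iterable


def wrap_column(values: Iterable[Any], key: str) -> list[dict[str, Any]]:
    """Wrap each value of a column in a single-key dict (hypothetical helper)."""
    return [{key: value} for value in values]


# Plain lists stand in for df.question and df.expected.
questions = ["What is 2 + 2?", "What is the capital of France?"]
expected = ["4", "Paris"]

inputs = wrap_column(questions, "question")
outputs = wrap_column(expected, "expected")

# With a real client, the call would then look something like this
# (not executed here, since it needs LangSmith credentials):
# client.create_examples(
#     inputs=inputs,
#     outputs=outputs,
#     dataset_name="my-dataset",
# )
```

With a dataframe, `wrap_column(df.question, "question")` should behave the same way, since iterating over a column yields its values.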

The confusion is amplified by the wide range of terminology for the same thing (input/question/prompt and output/target/reference/etc). So, for example, in one tutorial the dataset column is called "output", the dataset record's output is a dict with the key 'answer', and the evaluator uses "reference".
And an LLM Judge doesn't compare two strings, it compares two dicts! That can't be optimal.

I get that some people may want this complexity, but I can tell you that I don't.

Suggestion:

Allow strings as values for the input and output columns of a dataset.

I assume there's a good reason for having dictionaries in the backend, but I would like to suggest that this needn't be exposed to the user in all cases. Just do this whenever processing inputs behind the scenes:

if isinstance(input, str):
    input = {"input": input}

Or allow a new type of dataset "Simple" that's just input and output strings.
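
A slightly fuller sketch of that normalization idea, under the assumption that it would run once wherever inputs and outputs enter the library. The function name `normalize_example_value` and the `default_key` parameter are hypothetical; this is not LangSmith's actual code:

```python
from collections.abc import Mapping
from typing import Any


def normalize_example_value(value: Any, default_key: str) -> dict[str, Any]:
    """Return a dict for `value`, wrapping a bare string under `default_key`."""
    if isinstance(value, str):
        # The simple case: a bare string becomes a single-key dict.
        return {default_key: value}
    if isinstance(value, Mapping):
        # Existing dict-style values pass through as a plain dict copy.
        return dict(value)
    raise TypeError(f"expected str or mapping, got {type(value).__name__}")


simple = normalize_example_value("What is 2 + 2?", "input")
existing = normalize_example_value({"question": "What is 2 + 2?"}, "input")
```

Here `simple` ends up as `{"input": "What is 2 + 2?"}`, while `existing` keeps its own `"question"` key, so users who want the dict structure would be unaffected.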

Maybe as I progress in my knowledge I will come to understand/appreciate why these values MUST be dicts, but I thought I'd share my newbie perspective while it's still fresh.

Side note: LangSmith is awesome, as is the boatload of "Hi this is Lance from LangChain" videos on YouTube.

@hinthornw
Collaborator

Agreed that dataset types are confusing and that we need better consistency in the docs. Will forward to the team. We're planning some changes to hopefully simplify the experience a bit. I can't promise support for raw string datasets yet.

Will also forward your gratitude to Lance!
