.Net: New Feature: Qdrant snapshot download support #9645

Open
lilhoser opened this issue Nov 11, 2024 · 7 comments
Labels
memory connector .NET Issue or Pull requests regarding .NET code

Comments

@lilhoser
name: Add an API to download a Qdrant snapshot
about: SK already supports creating, enumerating, and deleting snapshots, but there is no method to download them.


@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code triage labels Nov 11, 2024
@github-actions github-actions bot changed the title .net: New Feature: Qdrant snapshot download support .Net: New Feature: Qdrant snapshot download support Nov 11, 2024
@westey-m
Contributor

Hi @lilhoser, thanks for the feature request.
When you say snapshots, do you mean the tar archive files that Qdrant supports? https://qdrant.tech/documentation/concepts/snapshots/
We don't support creating or deleting those in SK, so I wasn't sure if you meant collections?

Or do you mean in the Qdrant Client Library: https://github.com/qdrant/qdrant-dotnet/blob/main/src/Qdrant.Client/QdrantClient.cs#L3958
I don't see a method in it to download snapshots, though I do see create/delete/etc. Note that although we depend on this library, it is supplied by Qdrant.

@markwallace-microsoft markwallace-microsoft added question Further information is requested and removed triage labels Nov 12, 2024
@markwallace-microsoft markwallace-microsoft moved this to Sprint: In Review in Semantic Kernel Nov 12, 2024
@lilhoser
Author

Yeah, the Qdrant client library (qdrant-dotnet) provided by Qdrant allows creating, deleting, and enumerating those tar-based snapshots through gRPC, but it offers no snapshot download/restore (I filed an issue, qdrant/qdrant-dotnet#75, last week).

I noticed SK already interacts with the Qdrant endpoint via HTTP requests for collection tasks, so extending that to snapshots might be worth considering. This is useful for saving and restoring large Qdrant collections when you want to migrate the data without regenerating it manually. I wasn't sure if this is what you folks had in mind on this TBD page: https://learn.microsoft.com/en-us/semantic-kernel/concepts/vector-store-connectors/serialization?pivots=programming-language-csharp

Here is the code I'm using right now:

Generate a snapshot:

var snapshotName = "";
using (var client = new QdrantClient("localhost"))
{
    try
    {
        var result = await client.CreateSnapshotAsync(s_ManifestCollectionName);
        if (string.IsNullOrEmpty(result.Name))
        {
            throw new Exception("Snapshot name is empty");
        }
        snapshotName = result.Name;
    }
    catch (Exception ex)
    {
        m_StateManager.ProgressState.FinalizeProgress($"Unable to create snapshot: {ex.Message}");
        return;
    }
}

Download the snapshot:

using (var httpClient = new HttpClient())
{
    httpClient.BaseAddress = new Uri(@"http://localhost:6333");
    try
    {
        var uri = $"collections/{collection.CollectionName}/snapshots/" +
            $"{snapshotName}";
        var request = new HttpRequestMessage(HttpMethod.Get, uri);
        var response = await httpClient.SendAsync(request, HttpCompletionOption.ResponseContentRead).ConfigureAwait(false);
        if (response.StatusCode != HttpStatusCode.OK)
        {
            throw new Exception($"HTTP status code is {response.StatusCode}");
        }
        var content = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
        var target = System.IO.Path.Combine(Path, snapshotName);
        File.WriteAllBytes(target, content);
    }
    catch (Exception ex)
    {
        m_StateManager.ProgressState.FinalizeProgress($"Unable to download snapshot: {ex.Message}");
        return;
    }
}

Upload the snapshot:

using (var httpClient = new HttpClient())
{
    httpClient.BaseAddress = new Uri(@"http://localhost:6333");
    try
    {
        var uri = $"collections/{collection.CollectionName}/snapshots/" +
            $"upload?priority=snapshot"; // priority overrides local data
        var snapshotData = File.ReadAllBytes(Path);
        var content = new MultipartFormDataContent()
        {
            {new ByteArrayContent(snapshotData), "snapshot"}
        };
        var response = await httpClient.PostAsync(uri, content).ConfigureAwait(false);
        if (response.StatusCode != HttpStatusCode.OK)
        {
            throw new Exception($"HTTP status code is {response.StatusCode}");
        }
    }
    catch (Exception ex)
    {
        m_StateManager.ProgressState.FinalizeProgress($"Unable to upload snapshot: {ex.Message}");
        return;
    }
}
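
For large snapshots, the download step above buffers the whole archive in memory before writing it to disk. A minimal sketch of a streaming variant follows. It assumes the same REST path used above (`collections/{collection}/snapshots/{snapshot}`) and .NET 5 or later; `SnapshotDownloader` and `DownloadAsync` are hypothetical names for illustration, not part of SK or the Qdrant client.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class SnapshotDownloader
{
    // Hypothetical helper: streams a snapshot to disk instead of buffering it.
    // Assumes httpClient.BaseAddress points at the Qdrant HTTP endpoint,
    // e.g. http://localhost:6333 as in the snippets above.
    public static async Task DownloadAsync(
        HttpClient httpClient,
        string collectionName,
        string snapshotName,
        string targetPath,
        CancellationToken cancellationToken = default)
    {
        var uri = $"collections/{Uri.EscapeDataString(collectionName)}" +
                  $"/snapshots/{Uri.EscapeDataString(snapshotName)}";

        // ResponseHeadersRead returns as soon as the headers arrive,
        // so the body is not loaded into memory up front.
        using var response = await httpClient
            .GetAsync(uri, HttpCompletionOption.ResponseHeadersRead, cancellationToken)
            .ConfigureAwait(false);
        response.EnsureSuccessStatusCode();

        await using var source = await response.Content
            .ReadAsStreamAsync(cancellationToken)
            .ConfigureAwait(false);
        await using var target = new FileStream(
            targetPath, FileMode.Create, FileAccess.Write, FileShare.None,
            bufferSize: 81920, useAsync: true);

        // Copies the response body to the file in chunks.
        await source.CopyToAsync(target, cancellationToken).ConfigureAwait(false);
    }
}
```

The caller would pass a shared `HttpClient` whose `BaseAddress` is the Qdrant endpoint, plus the snapshot name returned by `CreateSnapshotAsync`. Error handling is left to the caller here; `EnsureSuccessStatusCode` throws `HttpRequestException` on a non-success status.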

@westey-m
Contributor

Thanks for the additional information. The value we are targeting with the VectorStore abstractions is twofold:

  1. To provide an abstraction over different databases so you can use them interchangeably
  2. To provide the type of experience that .NET developers typically expect, e.g. strong typing, model first, etc.

Snapshots are a really cool feature, but I don't believe they're very common. I haven't seen other DBs support this. If a feature isn't supported across different DBs, adding it to the abstraction is not great, because it would just throw NotSupportedException for every DB except Qdrant.
There's also not much extra value we can provide over what the Qdrant SDK will hopefully add, i.e. downloading the tar file.

In this case we typically just recommend using the Qdrant SDK directly (assuming they can add the download feature to the client), since it seems like the right place for this to live.

@lilhoser
Author

Yep, that makes sense!

Just curious, what is the expected deliverable for the page I linked above regarding vector store serialization?

My use case for the qdrant snapshot is so that users of my application can easily export and share large collections for analysis.

@westey-m
Contributor

About the vector store serialization page.
It's only filled out on the python tab at the moment, see https://learn.microsoft.com/en-us/semantic-kernel/concepts/vector-store-connectors/serialization?pivots=programming-language-python

Our Python connectors have some unique capabilities when it comes to converting to and from storage, and this page is really about that conversion. On the .NET side, we still need to add general details about how data model to storage conversion happens. This is typically already documented on each connector's page as well, but an overview here would be useful.

Exporting and importing data to/from disk is an interesting use case. E.g. let's say we had some code that can query the DB for all records, download them, and then produce a json file or many json files in the format of the serialized data model objects. Also supporting an import with that could be really useful, since it allows easy data migration and backup/restore.

There are some challenges here though. Such a process could be very long running depending on the data set size, so supporting recovery and restart from a failure point, if the process fails for some reason, would be very important. E.g. sort the results by field x asc, continuously save the last value of x downloaded, and restart by querying where x > the last downloaded value of x. While some vector dbs can be a bit restrictive when it comes to querying, this should be doable for most DBs.
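
The restart-from-checkpoint idea above can be sketched as follows. This is only an illustration under stated assumptions, not an SK API: `ExportRecord`, `ResumableExporter`, and the `fetchPage` delegate are hypothetical, and `fetchPage(lastKey, pageSize)` stands in for whatever query a given connector would need to support ("where x > lastKey, order by x ascending, take pageSize").

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical record shape; Key plays the role of the sortable field "x".
public sealed record ExportRecord(long Key, string Text);

public static class ResumableExporter
{
    // Exports records as JSON lines, persisting the last exported key so a
    // later run resumes after it. Returns the number of records written by
    // this run.
    public static int Export(
        Func<long?, int, IReadOnlyList<ExportRecord>> fetchPage,
        string outputPath,
        string checkpointPath,
        int pageSize = 100)
    {
        // Resume from the saved key if a checkpoint exists.
        long? lastKey = File.Exists(checkpointPath)
            ? long.Parse(File.ReadAllText(checkpointPath), CultureInfo.InvariantCulture)
            : null;

        var exported = 0;
        while (true)
        {
            var page = fetchPage(lastKey, pageSize);
            if (page.Count == 0)
            {
                break; // nothing left after lastKey
            }

            File.AppendAllLines(outputPath, page.Select(r => JsonSerializer.Serialize(r)));

            // Save progress after each page. A crash between the append and
            // this write can re-export the last page on restart, so consumers
            // should tolerate duplicate keys.
            lastKey = page[^1].Key;
            File.WriteAllText(checkpointPath, lastKey.Value.ToString(CultureInfo.InvariantCulture));
            exported += page.Count;
        }

        return exported;
    }
}
```

Running `Export` a second time with the same checkpoint file fetches only records whose key is greater than the saved value, which is the restart behavior described above. The sketch relies on the key being unique and the query returning results in ascending key order.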

I'd say the starting point would be supporting more complex filtering capabilities for search, without vectors and with sorting.

@lilhoser
Author

Thanks @westey-m - we can close this ticket if there is no new issue to file from this discussion (not sure if any of this is already on SK roadmap).

@westey-m westey-m moved this from Sprint: In Review to Backlog in Semantic Kernel Nov 14, 2024
@westey-m
Contributor

We discussed this earlier, and adding better search capabilities to support this kind of behavior makes sense. We just need to prioritize it, so we will keep this issue open to track it. Thanks again @lilhoser for bringing this up.

@westey-m westey-m added memory connector and removed question Further information is requested labels Nov 14, 2024
Projects
Status: Backlog