diff --git a/codelabs/get-started-with-vector-db-0/index.md b/codelabs/get-started-with-vector-db-0/index.md index e963f0d..66e6c45 100644 --- a/codelabs/get-started-with-vector-db-0/index.md +++ b/codelabs/get-started-with-vector-db-0/index.md @@ -12,7 +12,7 @@ Feedback Link: https://github.com/milvus-io/milvus ## Introduction -Welcome to Milvus codelabs. This is the first tutorial, and will be mostly a text-based overview of _unstructured data_. We know, we know, doesn't sound like a very sexy topic, but before you press that little x button on your browser tab, hear us out. +Welcome to Milvus codelabs. This is the first tutorial, and will be mostly a text-based overview of _unstructured data_. I know, I know, doesn't sound like a very sexy topic, but before you press that little x button on your browser tab, hear us out. New data is being generated every day, and is undoubtedly a key driver of both worldwide integration as well as the global economy. From heart rate monitors worn on wrists to GPS positions of a vehicle fleet to videos uploaded to social media, data is being generated at an exponentially increasing rate. The importance of this ever-increasing amount of data cannot be understated; data can help better serve existing customers, identify supply chain weaknesses, pinpoint workforce inefficiencies, and help companies identify and break into new markets, all factors that can enable a company (and you) to generate more $$$. @@ -106,7 +106,7 @@ Excited yet? Excellent. But before we dive headfirst into vector databases and M >>> document = collection.find_one({'Author': 'Bill Bryson'}) ``` -This type of querying methodology is not dissimilar to that of traditional relational databases, which rely on SQL statements to filter and fetch data. The concept is the same: databases for structured/semi-structured data perform filtering and querying using mathematical (e.g. `<=`, string distance) or logical (e.g. 
`EQUALS`, `NOT`) operators across numerical values and/or strings. For traditional relational databases, this is called _relational algebra_; for those of you unfamiliar with it, trust us when we say it's much worse than linear algebra. You may have seen examples of extremely complex filters being constructed through relational algebra, but the core concept remains the same - traditional databases are _deterministic_ systems that always return exact matches for a given set of filters. +This type of querying methodology is not dissimilar to that of traditional relational databases, which rely on SQL statements to filter and fetch data. The concept is the same: databases for structured/semi-structured data perform filtering and querying using mathematical (e.g. `<=`, string distance) or logical (e.g. `EQUALS`, `NOT`) operators across numerical values and/or strings. For traditional relational databases, this is called _relational algebra_; for those of you unfamiliar with it, trust me when I say it's much worse than linear algebra. You may have seen examples of extremely complex filters being constructed through relational algebra, but the core concept remains the same - traditional databases are _deterministic_ systems that always return exact matches for a given set of filters. Unlike databases for structured/semi-structured data, vector database queries are done by specifying an input _query vector_ as opposed to SQL statement or data filters (such as `{'Author': 'Bill Bryson'}`). This vector is the embedding-based representation of the unstructured data. As a quick example, this can be done in Milvus with the following snippet (using `pymilvus`): @@ -131,7 +131,7 @@ Thanks for making it this far! Here are the key takewaways for this tutorial: - Searching and analyzing unstructured data is done through ANN search, a process that is inherently probabilistic. Querying across structured/semi-structured data, on the other hand, is deterministic. 
- Unstructured data processing is very different from semi-structured data processing, and requires a complete paradigm shift. This naturally necessiates a new type of database - the vector database. -This concludes part one of our introductory series - for those of you new to vector databases, welcome to Milvus! In the next tutorial, we'll cover vector databases in more detail: +This concludes part one of this introductory series - for those of you new to vector databases, welcome to Milvus! In the next tutorial, we'll cover vector databases in more detail: - We'll first provide a birds-eye view of the the Milvus vector database. - We'll then follow it up with how Milvus differs from vector search libraries (FAISS, ScaNN, DiskANN, etc). - We'll also discuss how vector databases differ from vector search plugins (for traditional databases and search systems). diff --git a/codelabs/get-started-with-vector-db-1/index.md b/codelabs/get-started-with-vector-db-1/index.md index dfba8b1..b5a2411 100644 --- a/codelabs/get-started-with-vector-db-1/index.md +++ b/codelabs/get-started-with-vector-db-1/index.md @@ -28,14 +28,14 @@ Phew. That was quite a bit of info, so we'll summarize it right here: a vector d ## Vector databases versus vector search libraries -A common misconception that we hear around the industry is that vector databases are merely wrappers around ANN search algorithms. This could not be further from the truth! A vector database is, at its core, a full-fledged solution for unstructured data. As we've already seen in the previous section, this means that user-friendly features present in today's database management systems for structured/semi-structured data - cloud-nativity, multi-tenancy, scalability, etc - should also be attributes for a mature vector database as well. All of these features will become clear as we dive deeper into this tutorial. 
+A common misconception that I hear around the industry is that vector databases are merely wrappers around ANN search algorithms. This could not be further from the truth! A vector database is, at its core, a full-fledged solution for unstructured data. As we've already seen in the previous section, this means that user-friendly features present in today's database management systems for structured/semi-structured data - cloud-nativity, multi-tenancy, scalability, etc - should also be attributes for a mature vector database as well. All of these features will become clear as we dive deeper into this tutorial. On ther other hand, projects such as FAISS, ScaNN, and HNSW are lightweight _ANN libraries_ rather than managed solutions. The intention of these libraries is to aid in the construction of vector indices - data structures designed to significantly speed up nearest neighbor search for multi-dimensional vectors[^1]. If your dataset is small and limited, these libraries can prove to be sufficient for unstructured data processing, even for systems running in production. However, as dataset sizes increase and more users are onboarded, the problem of scale becomes increasingly difficult to solve.
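To make "nearest neighbor search" concrete, here is a minimal sketch of the exact, brute-force search that a vector index is meant to speed up. It uses only the Python standard library; the function names `l2_distance` and `brute_force_search` and the toy vectors are made up for illustration and are not part of FAISS, ScaNN, or Milvus.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, vectors, top_k=2):
    """Return (index, distance) pairs for the top_k vectors closest to query.

    Every stored vector is compared against the query, so the cost grows
    linearly with the dataset size. ANN indices trade a little accuracy
    to avoid this full scan.
    """
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]

# Toy 3-dimensional "embeddings" (hypothetical values for illustration).
vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 4.0],
    [1.0, 1.0, 1.0],
]
results = brute_force_search([0.9, 0.1, 0.0], vectors, top_k=2)
print(results)
```

A full scan like this is fine for a few thousand vectors, which is roughly the "small and limited" case described above; it is the linear cost per query that pushes larger datasets toward an index.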
-

High-level overview of Milvus's architecture. It looks confusing, we know, but don't worry, we'll dive into each component in the next tutorial.

+

High-level overview of Milvus's architecture. It looks confusing, I know, but don't worry, we'll dive into each component in the next tutorial.

Vector databases also operate in a totally different layer of abstraction from vector search libraries - vector databases are full-fledged services, while ANN libraries are meant to be integrated into the application that you're developing. In this sense, ANN libraries are one of the many components that vector databases are built on top of, similar to how Elasticsearch is built on top of Apache Lucene. To give an example of why this abstraction is so important, let's take a look at inserting a new unstructured data element into a vector database. This is super easy in Milvus: @@ -53,7 +53,7 @@ Great, now that we've established the difference between vector search libraries An increasing number of traditional databases and search systems such as Clickhouse and Elasticsearch are including built-in vector search plugins. Elasticsearch 8.0, for example, includes vector insertion and ANN search functionality that can be called via restful API endpoints. The problem with vector search plugins should be clear as night and day - these solutions do not take a full-stack approach to embedding management and vector search. Instead, these plugins are meant to be enhancements on top of existing architectures, thereby making them limited and unoptimized. Developing an unstructured data application atop a traditional database would be like trying to fit lithium batteries and electric motors inside a the frame of a gas-powered car - not a great idea! -To illustrate why this is, let's go back to the list of features that a vector database should implement (from the first section). Vector search plugins are missing two of these features - tunability and user-friendly APIs/SDKs. We'll continue to use Elasticsearch's ANN engine as an example; other vector search plugins operate very similarly so we won't go too much further into detail. 
Elasticsearch supports vector storage via the `dense_vector` data field type and allows for querying via the `_knn_search` endpoint: +To illustrate why this is, let's go back to the list of features that a vector database should implement (from the first section). Vector search plugins are missing two of these features - tunability and user-friendly APIs/SDKs. I'll continue to use Elasticsearch's ANN engine as an example; other vector search plugins operate very similarly so I won't go too much further into detail. Elasticsearch supports vector storage via the `dense_vector` data field type and allows for querying via the `_knn_search` endpoint: ```json PUT index @@ -88,7 +88,7 @@ GET index/_knn_search } ``` -Elasticsearch's ANN plugin supports only one indexing algorithm: Hierarchical Navigable Small Worlds, also known as HNSW (we like to think that the creator was ahead of Marvel when it came to popularizing the multiverse). On top of that, only L2/Euclidean distance is supported as a distance metric. This is an okay start, but let's compare it to Milvus, a full-fledged vector database. Using `pymilvus`: +Elasticsearch's ANN plugin supports only one indexing algorithm: Hierarchical Navigable Small Worlds, also known as HNSW (I like to think that the creator was ahead of Marvel when it came to popularizing the multiverse). On top of that, only L2/Euclidean distance is supported as a distance metric. This is an okay start, but let's compare it to Milvus, a full-fledged vector database. Using `pymilvus`: ```python >>> field1 = FieldSchema(name='id', dtype=DataType.INT64, description='int64', is_primary=True) @@ -119,7 +119,7 @@ We just blew through quite a bit of content. This section was admittedly fairly ## Technical challenges -Earlier in this tutorial, we listed the desired features a vector database should implement, before comparing vector databases to vector search libraries and vector search plugins. 
Now, let's briefly go over some high-level technical challenges associated with modern vector databases. In future tutorials, we'll provide an overview of how Milvus tackles each of these, in addition to how these technical decisions improve Milvus' performance over other open-source vector databases. +Earlier in this tutorial, I listed the desired features a vector database should implement, before comparing vector databases to vector search libraries and vector search plugins. Now, let's briefly go over some high-level technical challenges associated with modern vector databases. In future tutorials, we'll provide an overview of how Milvus tackles each of these, in addition to how these technical decisions improve Milvus' performance over other open-source vector databases. Picture an airplane. The airplane itself contains a number of interconnected mechanical, electrical, and embedded systems, all working on harmony to provide us with a smooth and pleasurable in-flight experience. Likewise, vector databases are composed of a number of evolving software components. Roughly speaking, these can be broken down into the storage, the index, and the service. Although these three components are tightly integrated[^2], companies such as Snowflake have shown the broader storage industry that "shared nothing" database architectures are arguably superior to the traditional "shared storage" cloud database models. Thus, the first technical challenge associated with vector databases is _designing a flexible and scalable data model_. 
diff --git a/codelabs/get-started-with-vector-db-2/index.md b/codelabs/get-started-with-vector-db-2/index.md index 5e87472..fbca678 100644 --- a/codelabs/get-started-with-vector-db-2/index.md +++ b/codelabs/get-started-with-vector-db-2/index.md @@ -100,8 +100,6 @@ In the next several tutorials, we'll provide a series of _Milvus quickstarts_, d Lastly, we at the Milvus community have provided [a short video](https://www.youtube.com/watch?v=nQkmgCtVz5k) introducing Milvus in 2.5 minutes, narrated by yours truly. Grab a cup of coffee and enjoy a front-row seat for the video! -See you in the next couple of tutorials. - --- [^1]: These libraries include ANNOY, FAISS, ScaNN, DiskANN, and others. diff --git a/codelabs/get-started-with-vector-db-3/index.md b/codelabs/get-started-with-vector-db-3/index.md new file mode 100644 index 0000000..d0b5d35 --- /dev/null +++ b/codelabs/get-started-with-vector-db-3/index.md @@ -0,0 +1,190 @@ +summary: Getting started with Milvus - the world's most popular open-source vector database. +id: getting-started-with-vector-databases-introduction-to-milvus +categories: Getting Started +tags: getting-started +status: Published +authors: Frank Liu +Feedback Link: https://github.com/milvus-io/milvus + +--- + +# Getting Started with Vector Database - Milvus Quickstart + +Hey there - welcome back to Milvus codelabs. In the previous tutorial, we provided a brief introduction to Milvus, Milvus' history, as well as the primary differences between Milvus 1.x and Milvus 2.x. We also took a quick tour of the architecture of Milvus 2.x and helped shine some light on how Milvus' architecture allows it to implement all of the required features of vector databases. + +## Let's get started + +If you haven't read the previous tutorials ([unstructured data](), [vector databases](), [Milvus introduction]()), I recommend you go ahead and read them. If you have, great. Let's get started with Milvus! 
+ +We offer two different modes of deployment: standalone and cluster. In Milvus standalone, all nodes - coordinators, worker nodes, and forward-facing proxies - are deployed as a single instance. For persistent data and metadata, Milvus standalone relies on `MinIO` and `etcd`, respectively. In future releases, we hope to eliminate these two third-party dependencies, allowing everything to run in a single process. + +Milvus cluster is our full-fledged version of Milvus, complete with separate instances/pods for all eight microservice components along with three third-party dependencies: `MinIO`, `etcd`, and `Pulsar` (Pulsar serves as the log broker and provides log pub/sub services). If you haven't gotten the chance to take a look at the Milvus overview from the previous tutorial, please do so! It'll help clarify what each of these third-party dependencies is used for and why we've included them in Milvus cluster. + +## Milvus standalone (`docker-compose`) + +Milvus standalone is meant to be super easy to install. In this section, we'll go over how `docker-compose` can be used to install Milvus. You can view the recommended prerequisites [here](https://github.com/milvus-io/milvus-docs/blob/v2.0.x/site/en/getstarted/prerequisite-docker.md). + +Let's first download the [`docker-compose.yml`](https://github.com/milvus-io/milvus/releases/download/v2.0.2/milvus-standalone-docker-compose.yml) configuration file needed for the standalone installation. If you're on any Debian-based Linux distribution (including Ubuntu), you can use the following command: + +```shell +$ wget https://github.com/milvus-io/milvus/releases/download/v2.0.2/milvus-standalone-docker-compose.yml -O docker-compose.yml +``` + + + Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ... + Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... 
connected. + HTTP request sent, awaiting response... 200 OK + Length: 1303 (1.3K) [application/octet-stream] + Saving to: ‘docker-compose.yml’ + + docker-compose.yml 100%[===================>] 1.27K --.-KB/s in 0s + + 2022-06-29 13:58:49 (113 MB/s) - ‘docker-compose.yml’ saved [1303/1303] + +Alternatively, if you're on macOS, make sure you have [Docker Desktop](https://docs.docker.com/desktop/mac/install/) installed first. I recommend using `brew`: + +```shell +% brew install --cask docker +``` + +You can then follow this up with the command below (the `-L` flag tells `curl` to follow the redirect that GitHub issues for release downloads): + +```shell +% curl -L https://github.com/milvus-io/milvus/releases/download/v2.0.2/milvus-standalone-docker-compose.yml -o docker-compose.yml +``` + +With everything ready, we can now spin up our Milvus standalone instance: + +```shell +$ docker-compose up -d +``` + + Docker Compose is now in the Docker CLI, try `docker compose up` + Creating milvus-etcd ... done + Creating milvus-minio ... done + Creating milvus-standalone ... done + + +Now, we can check on the status of our containers: + +```shell +$ docker ps -a +``` + + CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES + 711d54ab15c7 milvusdb/milvus:v2.0.2 "/tini -- milvus run…" 42 seconds ago Up 40 seconds 0.0.0.0:19530->19530/tcp milvus-standalone + 0d85f4927864 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…" 42 seconds ago Up 40 seconds (healthy) 9000/tcp milvus-minio + 99de39278b35 quay.io/coreos/etcd:v3.5.0 "etcd -advertise-cli…" 42 seconds ago Up 40 seconds 2379-2380/tcp milvus-etcd + + +Here's a quick rundown of what each of the containers is doing. `milvus-standalone` is the compiled/compressed version of Milvus, meant to run on a single machine. `milvus-minio` and `milvus-etcd` hold Milvus' persistent data and metadata, respectively. + +To stop Milvus standalone, run: + +```shell +$ docker-compose down +``` + +And that's it for Milvus standalone! Easy, right? + +## Milvus standalone (`apt`) + +We also provide a handy `apt` package for Debian-based distributions. 
Simply run: + +```shell +$ sudo apt install software-properties-common +$ sudo add-apt-repository ppa:milvusdb/milvus +$ sudo apt update +$ sudo apt install milvus +``` + +Once that's done, you're good to go. You can check the status of the running services with: + +```shell +$ sudo systemctl status milvus +$ sudo systemctl status milvus-etcd +$ sudo systemctl status milvus-minio +``` + +## Milvus cluster + +From the previous tutorial, we know that Milvus is composed of four primary components: the access layer, coordinator service, worker nodes, and object storage. Requests are sent to a cluster of proxies in the access layer, which then forwards the requests to either the coordinator layer or a streaming service for vector data. The stateful coordinator nodes within the coordinator service manage and control all of the stateless worker nodes, allowing for easy horizontal scaling. Object storage is accomplished via S3 or any "S3-like" storage layer, allowing Milvus to be run both in the cloud and on-premises via MinIO. + +Milvus' remaining third-party dependencies, Pulsar/Kafka and etcd, are also distributed and cloud-native, allowing the entirety of Milvus to run via Kubernetes as an orchestration engine. Using [Kubernetes](https://github.com/kubernetes/kubernetes) is a no-brainer for nearly all distributed applications, as it provides out-of-the-box support for application deployment, maintenance, and scaling. We recommend deploying Milvus as a Kubernetes application via [Helm](https://helm.sh/): + +```shell +% helm repo add milvus https://milvus-io.github.io/milvus-helm/ +``` + + "milvus" has been added to your repositories + +Now, let's grab the latest Milvus chart from the `milvus-io/milvus-helm` repository. + +```shell +% helm repo update +``` + + Hang tight while we grab the latest from your chart repositories... + ...Successfully got an update from the "milvus" chart repository + Update Complete. ⎈Happy Helming!⎈ + +Great. 
Now that we've gotten all of the dependencies out of the way, let's install Milvus (cluster)! + +```shell +% helm install my-release milvus/milvus +``` + + W0629 16:01:00.674407 21803 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget + W0629 16:01:00.676536 21803 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget + W0629 16:01:00.678594 21803 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget + W0629 16:01:00.680671 21803 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget + W0629 16:01:00.808448 21803 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget + W0629 16:01:00.809339 21803 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget + W0629 16:01:00.809344 21803 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget + W0629 16:01:00.809594 21803 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget + NAME: my-release + LAST DEPLOYED: Wed Jun 29 16:01:00 2022 + NAMESPACE: default + STATUS: deployed + REVISION: 1 + TEST SUITE: None + +With this done, we can now see the pods that are up and running via `kubectl`: + +```shell +$ kubectl get pods +``` + + NAME READY STATUS RESTARTS AGE + my-release-etcd-0 1/1 Running 0 2m23s + my-release-etcd-1 1/1 Running 0 2m23s + my-release-etcd-2 1/1 Running 0 2m23s + my-release-milvus-datacoord-6fd4bd885c-gkzwx 1/1 Running 0 2m23s + my-release-milvus-datanode-68cb87dcbd-4khpm 1/1 Running 0 2m23s 
+ my-release-milvus-indexcoord-5bfcf6bdd8-nmh5l 1/1 Running 0 2m23s + my-release-milvus-indexnode-5c5f7b5bd9-l8hjg 1/1 Running 0 2m24s + my-release-milvus-proxy-6bd7f5587-ds2xv 1/1 Running 0 2m24s + my-release-milvus-querycoord-579cd79455-xht5n 1/1 Running 0 2m24s + my-release-milvus-querynode-5cd8fff495-k6gtg 1/1 Running 0 2m24s + my-release-milvus-rootcoord-7fb9488465-dmbbj 1/1 Running 0 2m23s + my-release-minio-0 1/1 Running 0 2m23s + my-release-minio-1 1/1 Running 0 2m23s + my-release-minio-2 1/1 Running 0 2m23s + my-release-minio-3 1/1 Running 0 2m23s + my-release-pulsar-autorecovery-86f5dbdf77-lchpc 1/1 Running 0 2m24s + my-release-pulsar-bookkeeper-0 1/1 Running 0 2m23s + my-release-pulsar-bookkeeper-1 1/1 Running 0 98s + my-release-pulsar-broker-556ff89d4c-2m29m 1/1 Running 0 2m23s + my-release-pulsar-proxy-6fbd75db75-nhg4v 1/1 Running 0 2m23s + my-release-pulsar-zookeeper-0 1/1 Running 0 2m23s + my-release-pulsar-zookeeper-metadata-98zbr 1/1 Completed 0 2m24s + +That's it! You now have Milvus installed directly on your on-premises cluster. Check out our [next tutorial]() to see how to create a collection within Milvus and begin inserting and querying embeddings. + +If you're interested in running Milvus on cloud infrastructure, check out the [Milvus standalone on AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-hzmmt4xyvi7ei). + +## Wrapping up + +In this tutorial, we took a look at how to install the standalone version of Milvus (via `docker-compose`) and the cluster version of Milvus (via `helm`). The standalone version is suitable for testing purposes, while the cluster version is suitable for internal clusters or on-premises deployments. In the next tutorial, we'll look at basic Milvus operations: connecting to a Milvus server, creating a collection (equivalent to a table in relational databases), creating a partition within the collection, inserting embedding vector data, and conducting a vector search. 
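Before connecting with a client, it can help to confirm that the standalone instance is actually reachable. The sketch below uses only the Python standard library and checks port 19530, the port published by the `milvus-standalone` container in the `docker ps` output earlier. The helper name `is_port_open` is made up for this example, and `localhost` assumes Milvus is running on the same machine. An open port only shows that something is listening; it does not prove that Milvus itself is healthy.

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False

# 19530 comes from the `docker ps` output above; "localhost" is an assumption.
print("Milvus port reachable:", is_port_open("localhost", 19530))
```

If this prints `False` right after `docker-compose up -d`, waiting a few seconds and retrying is a reasonable first step, since the containers may still be starting.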
+ +See you in the next couple of tutorials.