Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot insert spark dataframe in batches into a collection #14

Open
timtimich35 opened this issue May 8, 2024 · 4 comments
Open

Cannot insert spark dataframe in batches into a collection #14

timtimich35 opened this issue May 8, 2024 · 4 comments

Comments

@timtimich35
Copy link

Dependencies:
python 3.8.12
pyspark 3.5.0
pymilvus 2.4.1
grpcio-tools 1.60.0
protobuf 4.25.3
milvus cluster was deployed in k8s using milvus operator 0.9.13

SparkSession setup:

spark = SparkSession.builder \
                    .master("local[*]") \
                    .appName("similarity_search_mech") \
                    .config("spark.driver.extraClassPath", '/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar') \
                    .config("spark.jars", '/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar') \
                    .config("spark.driver.memory", "4g") \
                    .getOrCreate()

Milvus setup:

client = MilvusClient(uri="http://cluster_address:19530")

# create schema
schema = MilvusClient.create_schema(auto_id=False, enable_dynamic_field=True)

# add fields to schema
schema.add_field(field_name="id", 
                 datatype=DataType.INT64, 
                 is_primary=True)

schema.add_field(field_name="vec", 
                 datatype=DataType.FLOAT_VECTOR, 
                 dim=3000)

# prepare index parameters
index_params = client.prepare_index_params()

# add indexes
index_params.add_index(field_name="id", 
                       index_type="STL_SORT")

index_params.add_index(field_name="vec", 
                       index_type="IVF_FLAT", 
                       metric_type="IP", 
                       params={"nlist": 128})

# create a collection
client.create_collection(collection_name="similarity_search_mech",
                         schema=schema,
                         index_params=index_params)

Given:
I work in DataLore IDE deployed in k8s along with Milvus and Spark.
I have a spark dataframe of 2.5 million rows and 2 columns, id and 3000 elements long vector of floats.
I try to load it in batches 100,000 records each so it should be 25 iterations in total.
None of the batches gets inserted.

Insert operation:

    batch.write \
    .mode("append") \
    .option("milvus.host", "cluster_address") \
    .option("milvus.port", "19530") \
    .option("milvus.collection.name", "similarity_search_mech") \
    .option("milvus.collection.vectorField", "vec") \
    .option("milvus.collection.vectorDim", "3000") \
    .option("milvus.collection.primaryKeyField", "id") \
    .format("milvus") \
    .save()

Can you please help me understand what do I do wrong?

@timtimich35
Copy link
Author

timtimich35 commented May 8, 2024

The error I get is saying "Py4JJavaError: An error occurred while calling o100353.save. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: milvus." however the jar file is just there where I specified it. SparkSession doesn't return any error on clickhouse driver I also specify and utilizes it just fine.

@wayblink
Copy link
Contributor

wayblink commented May 8, 2024

@timtimich35 Obviously the spark-milvus jar is not loaded correctly. I 'm not quite familiar to pyspark. Have you ever tried .config("spark.driver.extraClassPath", '/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar,/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar') \ .config("spark.executor.extraClassPath", '/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar,/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar')
or pyspark --jars /data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar,/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar

@timtimich35
Copy link
Author

timtimich35 commented May 8, 2024

@wayblink Nope. Haven't yet tried this approach. Will do. Will text you back.

@timtimich35
Copy link
Author

How can I get this part (marked yellow) done if I'm on Windows?
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants