diff --git a/docs/sphinx/source/advanced-configuration.ipynb b/docs/sphinx/source/advanced-configuration.ipynb new file mode 100644 index 00000000..4e7886ac --- /dev/null +++ b/docs/sphinx/source/advanced-configuration.ipynb @@ -0,0 +1,458 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "given-adoption", + "metadata": {}, + "source": [ + "\n", + " \n", + " \n", + " \"#Vespa\"\n", + "\n", + "\n", + "# Advanced Configuration\n", + "\n", + "Vespa supports a wide range of configuration options to customize the behavior of the system through the `services.xml`-[file](https://docs.vespa.ai/en/reference/services.html). Until pyvespa version 0.50.0, only a limited subset of these configuration options was available in pyvespa.\n", + "\n", + "Now, we have added support for passing a `ServicesConfiguration` object to your `ApplicationPackage`, which allows you to define any configuration you want. This notebook demonstrates how to use this feature when you need more advanced configuration.\n", + "\n", + "Note that providing a `ServicesConfiguration` is not required; if none is passed, the default configuration is still created for you.\n", + "\n", + "There are some slight differences in which configuration options are available when running self-hosted (Docker) and when running on the cloud (Vespa Cloud). For details, see the [Vespa Cloud services.xml-reference](https://cloud.vespa.ai/en/reference/services). This notebook demonstrates how to use the `ServicesConfiguration` object to configure a Vespa application for some common use cases, with options that are available in both environments.\n" + ] + }, + { + "cell_type": "markdown", + "id": "8c967bd2", + "metadata": {}, + "source": [ + "
\n", + " Refer to troubleshooting\n", + " for any problem when running this guide.\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "id": "8345b2fe", + "metadata": {}, + "source": [ + "[Install pyvespa](https://pyvespa.readthedocs.io/), start the Docker daemon, and validate that at least 6 GB of memory is available:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "03f3d0f2", + "metadata": {}, + "outputs": [], + "source": [ + "!pip3 install pyvespa\n", + "!docker info | grep \"Total Memory\"" + ] + }, + { + "cell_type": "markdown", + "id": "db637322", + "metadata": {}, + "source": [ + "## Setting up document-expiry\n", + "\n", + "As an example of a common use case for advanced configuration, we will configure document-expiry. This feature allows you to set a time-to-live for documents in your Vespa application. This is useful when you have documents that are only relevant for a certain period of time, and you want to avoid serving stale data.\n", + "\n", + "For reference, see the [docs on document-expiry](https://docs.vespa.ai/en/documents.html#document-expiry).\n" + ] + }, + { + "cell_type": "markdown", + "id": "a3619ad1", + "metadata": {}, + "source": [ + "### Define a schema\n", + "\n", + "We define a simple schema, with a timestamp field that we will use in the document selection expression to set the document-expiry.\n", + "\n", + "Note that the fields that are referenced in the selection expression should be attributes (in-memory).\n", + "\n", + "Also, either the fields should be set with `fast-access` or the number of searchable copies in the content cluster should be the same as the redundancy. 
Otherwise, the document selection maintenance will be slow and have a major performance impact on the system.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "bd5c2629", + "metadata": {}, + "outputs": [], + "source": [ + "from vespa.package import Document, Field, Schema, ApplicationPackage\n", + "\n", + "application_name = \"music\"\n", + "music_schema = Schema(\n", + " name=application_name,\n", + " document=Document(\n", + " fields=[\n", + " Field(\n", + " name=\"artist\",\n", + " type=\"string\",\n", + " indexing=[\"attribute\", \"summary\"],\n", + " ),\n", + " Field(\n", + " name=\"title\",\n", + " type=\"string\",\n", + " indexing=[\"attribute\", \"summary\"],\n", + " ),\n", + " Field(\n", + " name=\"timestamp\",\n", + " type=\"long\",\n", + " indexing=[\"attribute\", \"summary\"],\n", + " attribute=[\"fast-access\"],\n", + " ),\n", + " ]\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5be18383", + "metadata": {}, + "source": [ + "## The `ServicesConfiguration` object\n", + "\n", + "The `ServicesConfiguration` object allows you to define any configuration you want in the `services.xml` file.\n", + "\n", + "The syntax is as follows:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "05318c34", + "metadata": {}, + "outputs": [], + "source": [ + "from vespa.package import ServicesConfiguration\n", + "from vespa.configuration.services import (\n", + " services,\n", + " container,\n", + " search,\n", + " document_api,\n", + " document_processing,\n", + " content,\n", + " redundancy,\n", + " documents,\n", + " document,\n", + " node,\n", + " nodes,\n", + ")\n", + "\n", + "# Create a ServicesConfiguration with document-expiry set to 1 day (timestamp > now() - 86400)\n", + "services_config = ServicesConfiguration(\n", + " application_name=application_name,\n", + " services_config=services(\n", + " container(\n", + " search(),\n", + " document_api(),\n", + " document_processing(),\n", + " 
id=f\"{application_name}_container\",\n", + " version=\"1.0\",\n", + " ),\n", + " content(\n", + " redundancy(\"1\"),\n", + " documents(\n", + " document(\n", + " type=application_name,\n", + " mode=\"index\",\n", + " # Note that the selection-expression does not need to be escaped, as it will be automatically escaped during xml-serialization\n", + " selection=\"music.timestamp > now() - 86400\",\n", + " ),\n", + " garbage_collection=\"true\",\n", + " ),\n", + " nodes(node(distribution_key=\"0\", hostalias=\"node1\")),\n", + " id=f\"{application_name}_content\",\n", + " version=\"1.0\",\n", + " ),\n", + " ),\n", + ")\n", + "application_package = ApplicationPackage(\n", + " name=application_name,\n", + " schema=[music_schema],\n", + " services_config=services_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "aa40c0ce", + "metadata": {}, + "source": [ + "There are a few gotchas to keep in mind when constructing the `ServicesConfiguration` object.\n", + "\n", + "First, let's establish a common vocabulary through an example. Consider the following `services.xml` file, which is what we are actually representing with the `ServicesConfiguration` object from the previous cell:\n", + "\n", + "```xml\n", + "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n", + "<services>\n", + "  <container id=\"music_container\" version=\"1.0\">\n", + "    <search></search>\n", + "    <document-api></document-api>\n", + "    <document-processing></document-processing>\n", + "  </container>\n", + "  <content id=\"music_content\" version=\"1.0\">\n", + "    <redundancy>1</redundancy>\n", + "    <documents garbage-collection=\"true\">\n", + "      <document type=\"music\" mode=\"index\" selection=\"music.timestamp &gt; now() - 86400\"></document>\n", + "    </documents>\n", + "    <nodes>\n", + "      <node distribution-key=\"0\" hostalias=\"node1\"></node>\n", + "    </nodes>\n", + "  </content>\n", + "</services>\n", + "```\n", + "\n", + "In this example, `services`, `container`, `search`, `document-api`, `document-processing`, `content`, `redundancy`, `documents`, `document`, `nodes`, and `node` are _tags_. 
The `id`, `version`, `type`, `mode`, `selection`, `distribution-key`, `hostalias`, and `garbage-collection` are _attributes_, with a corresponding _value_.\n", + "\n", + "### Tag names\n", + "\n", + "All tags as referenced in the [Vespa documentation](https://docs.vespa.ai/en/reference/services.html) are available in the `vespa.configuration.services` module with the following modifications:\n", + "\n", + "- All `-` in the tag names are replaced by `_` to avoid conflicts with Python syntax.\n", + "- Some tags that are Python reserved words (or commonly used objects) are constructed by adding a `_` at the end of the tag name. These are:\n", + " - `type_`\n", + " - `class_`\n", + " - `for_`\n", + " - `time_`\n", + " - `io_`\n", + "\n", + "Only valid tags are exported by the `vespa.configuration.services` module.\n", + "\n", + "### Attributes\n", + "\n", + "- _Any_ attribute can be passed to the tag constructor (no validation at construction time).\n", + "- The attribute name should be the same as in the Vespa documentation, but with `-` replaced by `_`. For example, the `garbage-collection` attribute in the `documents` tag should be passed as `garbage_collection`.\n", + "- In case the attribute name is a Python reserved word, the same rule as for the tag names applies (add `_` at the end). An example of this is the `global` attribute which should be passed as `global_`.\n", + "- Some attributes, such as `id` in the `container` tag, are mandatory and should be passed as keyword arguments to the tag constructor (positional arguments are treated as child tags).\n", + "\n", + "### Values\n", + "\n", + "- The value of an attribute can be a string, an integer, or a boolean. For types `bool` and `int`, the value is converted to a string (lowercased for `bool`). If you need to pass a float, you should convert it to a string before passing it to the tag constructor, e.g. `container(version=\"1.0\")`.\n", + "- Note that you should _not_ escape the values yourself; they are escaped automatically during XML serialization. 
In the xml file, the value of the `selection` attribute in the `document` tag is `music.timestamp &gt; now() - 86400`. (`&gt;` is the escaped form of `>`.) When passing this value to the `document` tag constructor in Python, we should _not_ escape the `>` character, i.e. `document(selection=\"music.timestamp > now() - 86400\")`.\n" + ] + }, + { + "cell_type": "markdown", + "id": "careful-savage", + "metadata": {}, + "source": [ + "## Deploy the Vespa application\n", + "\n", + "Deploy `application_package` on the local machine using Docker,\n", + "without leaving the notebook, by creating an instance of\n", + "[VespaDocker](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.deployment.VespaDocker). `VespaDocker` connects\n", + "to the local Docker daemon socket and starts the [Vespa docker image](https://hub.docker.com/r/vespaengine/vespa/).\n", + "\n", + "If this step fails, please check\n", + "that the Docker daemon is running, and that the Docker daemon socket can be used by clients (configurable under advanced settings in Docker Desktop).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "canadian-blood", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Waiting for configuration server, 0/60 seconds...\n", + "Waiting for application to come up, 0/300 seconds.\n", + "Application is up!\n", + "Finished deployment.\n" + ] + } + ], + "source": [ + "from vespa.deployment import VespaDocker\n", + "\n", + "vespa_docker = VespaDocker()\n", + "app = vespa_docker.deploy(application_package=application_package)" + ] + }, + { + "cell_type": "markdown", + "id": "aaae2f91", + "metadata": {}, + "source": [ + "`app` now holds a reference to a [Vespa](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa) instance.\n", + "See this [notebook](https://pyvespa.readthedocs.io/en/latest/authenticating-to-vespa-cloud.html) for details on authenticating to Vespa Cloud.\n" + ] + }, + { + 
"cell_type": "markdown", + "id": "sealed-mustang", + "metadata": {}, + "source": [ + "## Feeding documents to Vespa\n", + "\n", + "Now, let us feed some documents to Vespa. We will feed one document with a timestamp 24 hours and 1 second (86401 seconds) in the past, and another document with the current time as its timestamp. We will then visit the documents to verify that document-expiry is working as expected.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e9d3facd", + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "\n", + "docs_to_feed = [\n", + " {\n", + " \"id\": \"1\",\n", + " \"fields\": {\n", + " \"artist\": \"Snoop Dogg\",\n", + " \"title\": \"Gin and Juice\",\n", + " \"timestamp\": int(time.time()) - 86401,\n", + " },\n", + " },\n", + " {\n", + " \"id\": \"2\",\n", + " \"fields\": {\n", + " \"artist\": \"Dr.Dre\",\n", + " \"title\": \"Still D.R.E\",\n", + " \"timestamp\": int(time.time()),\n", + " },\n", + " },\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "6185fbce", + "metadata": {}, + "outputs": [], + "source": [ + "from vespa.io import VespaResponse\n", + "\n", + "\n", + "def callback(response: VespaResponse, id: str):\n", + " if not response.is_successful():\n", + " print(f\"Error when feeding document {id}: {response.get_json()}\")\n", + "\n", + "\n", + "app.feed_iterable(docs_to_feed, schema=application_name, callback=callback)" + ] + }, + { + "cell_type": "markdown", + "id": "8430dd98", + "metadata": {}, + "source": [ + "## Verify document expiry through visiting\n", + "\n", + "[Visiting](https://docs.vespa.ai/en/visiting.html) is a feature to efficiently get or process a set of documents, identified by a [document selection](https://docs.vespa.ai/en/reference/document-select-language.html) expression.\n", + "Here is how you can use visiting in pyvespa:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "450a925f", + "metadata": {}, + "outputs": [ + { + "data": 
{ + "text/plain": [ + "[{'pathId': '/document/v1/music/music/docid/',\n", + " 'documents': [{'id': 'id:music:music::2',\n", + " 'fields': {'artist': 'Dr.Dre',\n", + " 'title': 'Still D.R.E',\n", + " 'timestamp': 1727175623}}],\n", + " 'documentCount': 1}]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "visit_results = []\n", + "for slice_ in app.visit(\n", + " schema=application_name,\n", + " content_cluster_name=f\"{application_name}_content\",\n", + " timeout=\"5s\",\n", + "):\n", + " for response in slice_:\n", + " visit_results.append(response.json)\n", + "visit_results" + ] + }, + { + "cell_type": "markdown", + "id": "7e8fefc9", + "metadata": {}, + "source": [ + "We can see that the document with the timestamp from just over 24 hours ago is not returned by the visit, while the document with the current timestamp is returned.\n" + ] + }, + { + "cell_type": "markdown", + "id": "28591491", + "metadata": {}, + "source": [ + "## Cleanup\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "e5064bd2", + "metadata": {}, + "outputs": [], + "source": [ + "vespa_docker.container.stop()\n", + "vespa_docker.container.remove()" + ] + }, + { + "cell_type": "markdown", + "id": "d1872b31", + "metadata": {}, + "source": [ + "## Next steps\n", + "\n", + "This is just an introduction to the advanced configuration options available in Vespa. 
For more details, see the [Vespa documentation](https://docs.vespa.ai/en/reference/services.html).\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.11.4 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.19" + }, + "vscode": { + "interpreter": { + "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tests/integration/test_integration_docker.py b/tests/integration/test_integration_docker.py index 7bfd6d0f..121fe3dc 100644 --- a/tests/integration/test_integration_docker.py +++ b/tests/integration/test_integration_docker.py @@ -25,7 +25,9 @@ QueryTypeField, AuthClient, Struct, + ServicesConfiguration, ) +from vespa.configuration.services import * from vespa.deployment import VespaDocker from vespa.application import VespaSync from vespa.exceptions import VespaError @@ -1621,3 +1623,104 @@ def callback(response: VespaResponse, id: str): def tearDown(self) -> None: self.vespa_docker.container.stop(timeout=CONTAINER_STOP_TIMEOUT) self.vespa_docker.container.remove() + + +class TestDocumentExpiry(unittest.TestCase): + def setUp(self) -> None: + application_name = "music" + self.application_name = application_name + music_schema = Schema( + name=application_name, + document=Document( + fields=[ + Field( + name="artist", + type="string", + indexing=["attribute", "summary"], + ), + Field( + name="title", + type="string", + indexing=["attribute", "summary"], + ), + Field( + name="timestamp", + type="long", + indexing=["attribute", "summary"], + attribute=["fast-access"], + ), + ] + ), + ) + # Create a ServicesConfiguration with document-expiry set to 1 day (timestamp > now() - 86400) + services_config = 
ServicesConfiguration( + application_name=application_name, + services_config=services( + container( + search(), + document_api(), + document_processing(), + id=f"{application_name}_container", + version="1.0", + ), + content( + redundancy("1"), + documents( + document( + type=application_name, + mode="index", + selection="music.timestamp > now() - 86400", + ), + garbage_collection="true", + ), + nodes(node(distribution_key="0", hostalias="node1")), + id=f"{application_name}_content", + version="1.0", + ), + ), + ) + self.application_package = ApplicationPackage( + name=application_name, + schema=[music_schema], + services_config=services_config, + ) + self.vespa_docker = VespaDocker(port=8089) + self.app = self.vespa_docker.deploy( + application_package=self.application_package + ) + + def test_document_expiry(self): + docs_to_feed = [ + { + "id": "1", + "fields": { + "artist": "Snoop Dogg", + "title": "Gin and Juice", + "timestamp": int(time.time()) - 86401, + }, + }, + { + "id": "2", + "fields": { + "artist": "Dr.Dre", + "title": "Still D.R.E", + "timestamp": int(time.time()), + }, + }, + ] + self.app.feed_iterable(docs_to_feed, schema=self.application_name) + visit_results = [] + for slice_ in self.app.visit( + schema=self.application_name, + content_cluster_name=f"{self.application_name}_content", + timeout="5s", + ): + for response in slice_: + visit_results.append(response.json) + # Visit results: [{'pathId': '/document/v1/music/music/docid/', 'documents': [{'id': 'id:music:music::2', 'fields': {'artist': 'Dr. 
Dre', 'title': 'Still D.R.E', 'timestamp': 1726836495}}], 'documentCount': 1}] + self.assertEqual(len(visit_results), 1) + self.assertEqual(visit_results[0]["documentCount"], 1) + + def tearDown(self) -> None: + self.vespa_docker.container.stop(timeout=CONTAINER_STOP_TIMEOUT) + self.vespa_docker.container.remove() diff --git a/tests/unit/test_configuration.py b/tests/unit/test_configuration.py index ca379fca..09db3ad4 100644 --- a/tests/unit/test_configuration.py +++ b/tests/unit/test_configuration.py @@ -1,7 +1,17 @@ import unittest from lxml import etree import xml.etree.ElementTree as ET -from vespa.configuration.vt import * +from vespa.configuration.vt import ( + VT, + vt, + create_tag_function, + attrmap, + valmap, + to_xml, + compare_xml, + vt_escape, +) + from vespa.configuration.services import * @@ -186,18 +196,7 @@ def test_generate_colbert_services(self): generated_xml = generated_services.to_xml() # Validate against relaxng self.assertTrue(validate_services(etree.fromstring(str(generated_xml)))) - # Check all nodes and attributes being equal - tree_original = ET.fromstring(self.xml_schema.encode("utf-8")) - tree_generated = ET.fromstring(str(generated_xml)) - for original, generated in zip(tree_original.iter(), tree_generated.iter()): - # print(f"Original: {original.tag}, {original.attrib}, {original.text}") - # print(f"Generated: {generated.tag}, {generated.attrib}, {generated.text}") - self.assertEqual(original.tag, generated.tag) - self.assertEqual(original.attrib, generated.attrib) - self.assertEqual( - original.text.strip() if original.text else None, - generated.text.strip() if generated.text else None, - ) + self.assertTrue(compare_xml(self.xml_schema, str(generated_xml))) class TestBillionscaleServiceConfiguration(unittest.TestCase): @@ -426,21 +425,21 @@ def test_generate_billion_scale_services(self): requestthreads(persearch("2")), feeding(concurrency("1.0")), summary( - io(read("directio")), + io_(read("directio")), store( cache( 
maxsize_percent("5"), compression( - vt_type("lz4") - ), # Using vt_type as type is a reserved keyword + type_("lz4") + ), # Using type_ as type is a reserved keyword ), logstore( chunk( maxsize("16384"), compression( - vt_type( + type_( "zstd" - ), # Using vt_type as type is a reserved keyword + ), # Using type_ as type is a reserved keyword level("3"), ), ), @@ -459,16 +458,154 @@ def test_generate_billion_scale_services(self): # Validate against relaxng self.assertTrue(validate_services(etree.fromstring(str(generated_xml)))) # Check all nodes and attributes being equal - tree_original = ET.fromstring(self.xml_schema.encode("utf-8")) - tree_generated = ET.fromstring(str(generated_xml)) - for original, generated in zip(tree_original.iter(), tree_generated.iter()): - # print(f"Original: {original.tag}, {original.attrib}, {original.text}") - # print(f"Generated: {generated.tag}, {generated.attrib}, {generated.text}") - self.assertEqual(original.tag, generated.tag) - self.assertEqual(original.attrib, generated.attrib) - orig_text = original.text or "" - gen_text = generated.text or "" - self.assertEqual(orig_text.strip(), gen_text.strip()) + self.assertTrue(compare_xml(self.xml_schema, str(generated_xml))) + + +class TestValidateServices(unittest.TestCase): + def setUp(self): + # Prepare some sample valid and invalid XML data + self.valid_xml_content = """ + + + + + + + 1 + + + + + + + +""" + self.invalid_xml_content = """ + + + + + + + 1 + + + + + + + +""" + + # Create temporary files with valid and invalid XML content + self.valid_xml_file = "valid_test.xml" + self.invalid_xml_file = "invalid_test.xml" + + with open(self.valid_xml_file, "w") as f: + f.write(self.valid_xml_content) + + with open(self.invalid_xml_file, "w") as f: + f.write(self.invalid_xml_content) + + # Create etree.Element from valid XML content + self.valid_xml_element = etree.fromstring(self.valid_xml_content) + + def tearDown(self): + # Clean up temporary files + os.remove(self.valid_xml_file) + 
os.remove(self.invalid_xml_file) + + def test_validate_valid_xml_content(self): + # Test with valid XML content as string + result = validate_services(self.valid_xml_content) + self.assertTrue(result) + + def test_validate_invalid_xml_content(self): + # Test with invalid XML content as string + result = validate_services(self.invalid_xml_content) + self.assertFalse(result) + + def test_validate_valid_xml_file(self): + # Test with valid XML file path + result = validate_services(self.valid_xml_file) + self.assertTrue(result) + + def test_validate_invalid_xml_file(self): + # Test with invalid XML file path + result = validate_services(self.invalid_xml_file) + self.assertFalse(result) + + def test_validate_valid_xml_element(self): + # Test with valid etree.Element + result = validate_services(self.valid_xml_element) + self.assertTrue(result) + + def test_validate_nonexistent_file(self): + # Test with a non-existent file path + result = validate_services("nonexistent.xml") + self.assertFalse(result) + + def test_validate_invalid_input_type(self): + # Test with invalid input type + result = validate_services(123) + self.assertFalse(result) + + +class TestDocumentExpiry(unittest.TestCase): + def setUp(self): + self.xml_schema = """ + + + + + + + 1 + + + + + + + + +""" + + def test_xml_validation(self): + to_validate = etree.fromstring(self.xml_schema.encode("utf-8")) + # Validate against relaxng + self.assertTrue(validate_services(to_validate)) + + def test_document_expiry(self): + application_name = "music" + generated = services( + container( + search(), + document_api(), + document_processing(), + id=f"{application_name}_container", + version="1.0", + ), + content( + redundancy("1"), + documents( + document( + type=application_name, + mode="index", + selection="music.timestamp > now() - 86400", + ), + garbage_collection="true", + ), + nodes(node(distribution_key="0", hostalias="node1")), + id=f"{application_name}_content", + version="1.0", + ), + ) + generated_xml = 
generated.to_xml() + # Validate against relaxng + self.assertTrue(validate_services(etree.fromstring(str(generated_xml)))) + # Compare the generated XML with the schema + self.assertTrue(compare_xml(self.xml_schema, str(generated_xml))) if __name__ == "__main__": diff --git a/tests/unit/test_package.py b/tests/unit/test_package.py index fb6f7cfe..ea5f6fb1 100644 --- a/tests/unit/test_package.py +++ b/tests/unit/test_package.py @@ -35,6 +35,7 @@ ApplicationConfiguration, ) from vespa.configuration.vt import compare_xml +from vespa.configuration.services import * class TestField(unittest.TestCase): @@ -1805,3 +1806,83 @@ def test_default_service_config_to_text(self): self.assertTrue( compare_xml(app_package.services_to_text_vt, expected_result), ) + + def test_document_expiry(self): + # Create a Schema with name music and a field with name artist, title and timestamp + # Ref https://docs.vespa.ai/en/documents.html#document-expiry + application_name = "music" + music_schema = Schema( + name=application_name, + document=Document( + fields=[ + Field( + name="artist", + type="string", + indexing=["attribute", "summary"], + ), + Field( + name="title", + type="string", + indexing=["attribute", "summary"], + ), + Field( + name="timestamp", + type="long", + indexing=["attribute", "summary"], + attribute=["fast-access"], + ), + ] + ), + ) + # Create a ServicesConfiguration with document-expiry set to 1 day (timestamp > now() - 86400) + services_config = ServicesConfiguration( + application_name=application_name, + services_config=services( + container( + search(), + document_api(), + document_processing(), + id=f"{application_name}_container", + version="1.0", + ), + content( + redundancy("1"), + documents( + document( + type=application_name, + mode="index", + selection="music.timestamp > now() - 86400", + ), + garbage_collection="true", + ), + nodes(node(distribution_key="0", hostalias="node1")), + id=f"{application_name}_content", + version="1.0", + ), + ), + ) + 
application_package = ApplicationPackage( + name=application_name, + schema=[music_schema], + services_config=services_config, + ) + expected = """ + + + + + + + + 1 + + + + + + + + +""" + self.assertEqual(expected, application_package.services_to_text) + self.assertTrue(validate_services(application_package.services_to_text)) diff --git a/vespa/application.py b/vespa/application.py index f57848e4..a22dcd57 100644 --- a/vespa/application.py +++ b/vespa/application.py @@ -37,6 +37,9 @@ import gzip from requests.models import PreparedRequest from io import BytesIO +import logging + +logging.getLogger("urllib3").setLevel(logging.ERROR) VESPA_CLOUD_SECRET_TOKEN: str = "VESPA_CLOUD_SECRET_TOKEN" diff --git a/vespa/configuration/services.py b/vespa/configuration/services.py index 653c38b2..115d399c 100644 --- a/vespa/configuration/services.py +++ b/vespa/configuration/services.py @@ -1,5 +1,9 @@ from vespa.configuration.vt import VT, create_tag_function, voids from vespa.configuration.relaxng import RELAXNG +from lxml import etree +from pathlib import Path +import os +from typing import Union # List of XML tags (customized for Vespa configuration) services_tags = [ @@ -165,14 +169,37 @@ _g[sanitized_name] = create_tag_function(tag, tag in voids) -def validate_services(xml_schema: str) -> bool: +def validate_services(xml_input: Union[Path, str, etree.Element]) -> bool: """ - Validate an XML schema against the RelaxNG schema file for services.xml + Validate an XML input against the RelaxNG schema file for services.xml Args: - xml_schema (str): XML schema to validate - + xml_input (Path or str or etree.Element): The XML input to validate. Returns: - bool: True if the XML schema is valid, False otherwise + True if the XML input is valid according to the RelaxNG schema, False otherwise. 
""" - return RELAXNG["services"].validate(xml_schema) + try: + if isinstance(xml_input, etree._Element): + xml_tree = etree.ElementTree(xml_input) + elif isinstance(xml_input, etree._ElementTree): + xml_tree = xml_input + elif isinstance(xml_input, (str, Path)): + if isinstance(xml_input, Path) or os.path.exists(xml_input): + # Assume it's a file path + xml_tree = etree.parse(str(xml_input)) + elif isinstance(xml_input, str): + # May hav unicode string with encoding declaration + if "encoding" in xml_input: + xml_tree = etree.ElementTree(etree.fromstring(xml_input.encode())) + else: + # Assume it's a string containing XML content + xml_tree = etree.ElementTree(etree.fromstring(xml_input)) + else: + raise TypeError("xml_input must be a Path, str, or etree.Element.") + except Exception as e: + # Handle parsing exceptions + print(f"Error parsing XML input: {e}") + return False + + is_valid = RELAXNG["services"].validate(xml_tree) + return is_valid diff --git a/vespa/configuration/vt.py b/vespa/configuration/vt.py index b2a645c2..74ba1265 100644 --- a/vespa/configuration/vt.py +++ b/vespa/configuration/vt.py @@ -4,11 +4,14 @@ from fastcore.utils import patch import xml.etree.ElementTree as ET -# If the vespa tags correspond to reserved Python keywords, they are replaced with the following: +# If the vespa tags correspond to reserved Python keywords or commonly used names, +# they are replaced with the following: replace_reserved = { - "type": "vt_type", - "class": "cls", - "for": "fr", + "type": "type_", + "class": "class_", + "for": "for_", + "time": "time_", + "io": "io_", } restore_reserved = {v: k for k, v in replace_reserved.items()} @@ -65,11 +68,17 @@ def __iter__(self): def attrmap(o): + """This maps the attributes that we don't want to be Python keywords or commonly used names to the replacement names.""" o = dict(_global="global").get(o, o) return o.lstrip("_").replace("_", "-") def valmap(o): + """Convert values to the string representation for xml. 
integers to strings and booleans to 'true' or 'false'""" + if isinstance(o, bool): + return str(o).lower() + elif isinstance(o, int): + return str(o) return o if isinstance(o, str) else " ".join(map(str, o)) @@ -84,8 +93,22 @@ def _flatten_tuple(tup): def _preproc(c, kw, attrmap=attrmap, valmap=valmap): + """ + Preprocess the children and attributes of a VT structure. + + :param c: Children of the VT structure + :param kw: Attributes of the VT structure + :param attrmap: Dict to map attribute names + :param valmap: Dict to map attribute values + + :return: Tuple of children and attributes + """ + + # If the children are a single generator, map, or filter, convert it to a tuple if len(c) == 1 and isinstance(c[0], (types.GeneratorType, map, filter)): c = tuple(c[0]) + # Create the attributes dictionary by mapping the keys and values + # TODO: Check if any of Vespa supported attributes are camelCase attrs = {attrmap(k.lower()): valmap(v) for k, v in kw.items() if v is not None} return _flatten_tuple(c), attrs @@ -99,7 +122,7 @@ def vt( valmap: callable = valmap, **kw, ): - "Create an `VT` structure for `to_xml()`" + "Create a VT structure with `tag`, `children` and `attrs`" # NB! fastcore.xml uses tag.lower() for tag names. This is not done here. 
return VT(tag, *_preproc(c, kw, attrmap=attrmap, valmap=valmap), void_=void_) @@ -110,7 +133,7 @@ def vt( def Xml(*c, version="1.0", encoding="UTF-8", **kwargs) -> VT: - "An top level XML tag, with `encoding` and children `c`" + "A top level XML tag, with `encoding` and children `c`" res = vt("?xml", *c, version=version, encoding=encoding, void_="?") return res @@ -173,8 +196,10 @@ def _to_xml(elm, lvl, indent, do_escape): # Handle the case where children are text or elements res = f"{sp}<{stag}{attr_str}>" - # If the children are just text, don't introduce newlines - if len(cs) == 1 and isinstance(cs[0], str): + # If the children are just text or int, don't introduce newlines + if len(cs) == 1 and isinstance(cs[0], (str, int)): + if isinstance(cs[0], int): + cs = [str(cs[0])]  # keep a one-element list so cs[0] is the whole number, not its first digit res += f"{esc_fn(cs[0].strip())}{nl if indent else ''}" else: # If there are multiple children, properly indent them
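The attribute handling touched in `vespa/configuration/vt.py` can be tried in isolation. What follows is a minimal standalone sketch, assuming only the `attrmap` and `valmap` bodies shown in this diff; `map_attrs` is a hypothetical helper that mirrors the dictionary comprehension in `_preproc` and is not part of pyvespa.

```python
def attrmap(o):
    # Map a Python keyword-argument name to an XML attribute name
    # (body copied from the diff above).
    o = dict(_global="global").get(o, o)
    return o.lstrip("_").replace("_", "-")


def valmap(o):
    # Convert a value to its XML string form. bool is checked before int
    # because bool is a subclass of int in Python.
    if isinstance(o, bool):
        return str(o).lower()
    elif isinstance(o, int):
        return str(o)
    return o if isinstance(o, str) else " ".join(map(str, o))


def map_attrs(**kw):
    # Hypothetical helper: same comprehension as in _preproc, dropping None values.
    return {attrmap(k.lower()): valmap(v) for k, v in kw.items() if v is not None}


print(map_attrs(garbage_collection=True, distribution_key=0, hostalias="node1", mode=None))
# {'garbage-collection': 'true', 'distribution-key': '0', 'hostalias': 'node1'}
```

Checking `bool` before `int` is what makes `True` serialize as `true` rather than `1`, which matches the lowercase boolean form used in the `services.xml` examples.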