From b24b41678f392dca42432395feb650dc686865d0 Mon Sep 17 00:00:00 2001
From: Jon Bratseth
Date: Mon, 28 Oct 2024 22:03:24 +0100
Subject: [PATCH] Document pack_bits and binarize

---
 en/reference/indexing-language-reference.html | 83 ++++++++++++++-----
 1 file changed, 62 insertions(+), 21 deletions(-)

diff --git a/en/reference/indexing-language-reference.html b/en/reference/indexing-language-reference.html
index 72dd176740..639ad30594 100644
--- a/en/reference/indexing-language-reference.html
+++ b/en/reference/indexing-language-reference.html
@@ -201,12 +201,9 @@

Arithmetics

-

Converters

-

There are several expressions that allow you to convert from one data type to another. -These are often used within a for_each to convert -e.g. an array of strings to an array of integers.

+

These expressions let you convert from one data type to another.

@@ -217,21 +214,65 @@

Converters

- - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + @@ -541,9 +582,9 @@

Other expressions

Description
embedStringA tensor of the type of the receiving field

Invokes an embedder to convert a text to a point in a tensor space. - Arguments are given space separated, as in embed colbert chunk. - The first argument is the id of the embedder, and can be omitted when only one is configured. - Any additional arguments are passed to the embedder implementation.

hashStringAny string

Converts the input to a hash value (using SipHash). +

binarize [threshold]Any tensorAny tensor +

+ Replaces each value in a tensor with 0 or 1. + The optional argument is the threshold: values larger than the threshold are replaced by 1, + all other values by 0. The default threshold is 0. + This is useful for creating suitable input to pack_bits. +

+
embed [id]StringA tensor

Invokes an embedder to convert a text to one or more vector embeddings. + The type of the output tensor is what is required by the following expression (as supported by the specific embedder). + Arguments are given space separated, as in embed colbert chunk. + The first argument is the id of the embedder, and can be omitted when only one is configured. + Any additional arguments are passed to the embedder implementation.

hashStringint or long

Converts the input to a hash value (using SipHash). The hash will be int or long depending on the target field.

pack_bitsA tensorA tensor +

+ Packs the values of a binary tensor into bytes with 1 bit per value in big-endian order. +

+

+ The input tensor must: +

  • Only have values that are 0 or 1
  • Have a single dense dimension
+ It can have any value type and any number of sparse dimensions. +

+

+ The output tensor will have: +

  • int8 as the value type.
  • A dense dimension size equal to the input size divided by 8, rounded up to the nearest integer.
  • The same sparse dimensions as the input.
+ The resulting tensor can be unpacked during ranking using + unpack_bits. + A tensor can be converted to binary form suitable as input to this by the + binarize function. +

+
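The behaviour documented for binarize and pack_bits can be illustrated with a minimal Python sketch over a single dense dimension. This is not Vespa's implementation: the function names only mirror the expressions for readability, sparse dimensions are ignored, and zero-padding the last byte when the dimension size is not a multiple of 8 is an assumption that the text does not state.

```python
def binarize(values, threshold=0.0):
    """Replace each value by 1 if it is strictly larger than threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def pack_bits(bits):
    """Pack 0/1 values into signed 8-bit integers, 1 bit per value.

    Big-endian bit order: the first value of each group of 8 becomes the
    most significant bit of the byte. A short final group is padded with
    zeros in the low-order bits (an assumption of this sketch).
    """
    if any(b not in (0, 1) for b in bits):
        raise ValueError("pack_bits input may only contain 0 or 1")
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for offset, bit in enumerate(bits[start:start + 8]):
            byte |= int(bit) << (7 - offset)
        # Reinterpret the unsigned byte (0..255) as int8 (-128..127).
        packed.append(byte - 256 if byte >= 128 else byte)
    return packed


# Chaining the two, as in "binarize | pack_bits":
bits = binarize([0.3, -0.1, 0.0, 2.5, -4.0, 0.0, 0.0, 0.0, 1.2])
# bits is [1, 0, 0, 1, 0, 0, 0, 0, 1]
packed = pack_bits(bits)
# packed is [-112, -128]: 9 input values give ceil(9 / 8) = 2 bytes
```

Note that a value equal to the threshold maps to 0, since the text requires a value to be larger than the threshold to become 1.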
to_array random [ <max> ]

- Returns a random integer value. - Lowest value is 0 and the highest value is determined either by the argument or, - if no argument is given, the execution value. + Returns a random integer value. + Lowest value is 0 and the highest value is determined either by the argument or, + if no argument is given, the execution value.