Distributed inserts (_bulk, replace, update) #392

donhardman · 2024-11-05T04:37:18Z

Proposal:

For the initial version, we should implement inserts into sharded tables on the Buddy side and support the JSON protocol ONLY.

Key considerations:

- Implement ID generation on the Buddy side in a way that is easy to maintain and can later be moved to the daemon.
- Use a proper algorithm for the sharding logic. If we allow users to pass their own IDs, we should hash them with MD5 or a similar function; otherwise, we should not allow custom IDs at all.
- For replace or update operations:
  - When performed without an ID, send the request to all nodes.
  - When performed with an ID, determine the specific shard from the ID.
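
The routing rule for replace and update could be sketched roughly as follows. This is only an illustration: the function name `resolveTargetShards` is hypothetical, and the MD5-modulo mapping is a placeholder assumption, since the hashing algorithm is still an open question here.

```php
<?php
// Hypothetical helper: the name and the MD5-modulo rule are assumptions for illustration.

/**
 * Return the shard numbers a replace/update request should be sent to.
 *
 * @param int|string|null $id          Document ID, or null when the request has none
 * @param int             $totalShards Number of shards in the table
 * @return int[]
 */
function resolveTargetShards($id, int $totalShards): array
{
    if ($totalShards < 1) {
        throw new InvalidArgumentException('totalShards must be at least 1');
    }
    if ($id === null) {
        // No ID: any shard may hold matching documents, so broadcast to all of them.
        return range(0, $totalShards - 1);
    }
    // ID given: hash it so the request goes to exactly one shard.
    $hash = hexdec(substr(md5((string) $id), 0, 8));
    return [$hash % $totalShards];
}

echo count(resolveTargetShards(null, 4)), " shard(s) without an ID\n";
echo count(resolveTargetShards(12345, 4)), " shard(s) with an ID\n";
```

Whatever hash is chosen in the end, the same function has to be used for inserts and for later replace/update calls, otherwise a document and its update would land on different shards.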

Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.


donhardman · 2024-11-12T03:38:37Z

Two Approaches for ID Generation and Sharding

1. Snowflake-like ID Generation

Even distribution is prioritized. This approach generates unique IDs with embedded shard information.

Structure (63-bit integer):

41 bits: timestamp (milliseconds since custom epoch)
10 bits: shard/node ID (0-1023)
12 bits: sequence number (0-4095)

Formula:

$id = ($timestamp << 22) | (($shardId % 1024) << 12) | $sequence;

Example in PHP:

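class SnowflakeGenerator {
    private const CUSTOM_EPOCH = 1640995200000; // 2022-01-01 00:00:00 UTC, in milliseconds
    private $sequence = 0;
    private $lastTimestamp = -1;

    public function generateId(int $shardId): int {
        $timestamp = $this->getCurrentTimestamp();

        if ($timestamp === $this->lastTimestamp) {
            // Same millisecond as the previous ID: advance the 12-bit sequence
            $this->sequence = ($this->sequence + 1) & 4095;
            if ($this->sequence === 0) {
                // Sequence exhausted for this millisecond: wait for the next one
                $timestamp = $this->waitNextMillis($this->lastTimestamp);
            }
        } else {
            $this->sequence = 0;
        }

        $this->lastTimestamp = $timestamp;

        // 41-bit timestamp | 10-bit shard ID | 12-bit sequence
        return (($timestamp - self::CUSTOM_EPOCH) << 22)
             | (($shardId % 1024) << 12)
             | $this->sequence;
    }

    // Current Unix time in milliseconds
    private function getCurrentTimestamp(): int {
        return (int) floor(microtime(true) * 1000);
    }

    // Busy-wait until the clock has moved past $lastTimestamp
    private function waitNextMillis(int $lastTimestamp): int {
        do {
            $timestamp = $this->getCurrentTimestamp();
        } while ($timestamp <= $lastTimestamp);

        return $timestamp;
    }
}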
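
Since the shard ID is embedded in the ID itself, routing by ID needs only shifts and masks. A minimal sketch, assuming the 41/10/12-bit layout and the custom epoch above; the function and constant names are hypothetical:

```php
<?php
// Layout assumed from the structure above: 41-bit timestamp | 10-bit shard | 12-bit sequence.
const SNOWFLAKE_EPOCH_MS = 1640995200000; // same custom epoch as the generator

// Split a Snowflake-like ID back into its three components.
function decodeSnowflakeId(int $id): array
{
    return [
        'timestampMs' => ($id >> 22) + SNOWFLAKE_EPOCH_MS, // Unix time in milliseconds
        'shardId'     => ($id >> 12) & 0x3FF,              // 10 bits
        'sequence'    => $id & 0xFFF,                      // 12 bits
    ];
}

// Round trip: build an ID by hand for shard 7, sequence 42, then decode it.
$id = ((1700000000000 - SNOWFLAKE_EPOCH_MS) << 22) | (7 << 12) | 42;
$parts = decodeSnowflakeId($id);
echo "shard {$parts['shardId']}, sequence {$parts['sequence']}\n";
```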

Characteristics:

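Number of shards is limited (at most 1024) but not fixed
Cannot support custom IDs
No extra overhead: the shard is read directly from the ID
IDs are time-ordered (increasing per generator), though not consecutive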

2. MD5-based Sharding

For scenarios where custom IDs are needed.

Formula:

$shardNumber = hexdec(substr(md5($id), 0, 8)) % $totalShards;

Example in PHP:

class Md5Sharding {
    private $totalShards;

    public function __construct($totalShards) {
        $this->totalShards = $totalShards;
    }

    public function getShardNumber($id) {
        return hexdec(substr(md5((string)$id), 0, 8)) % $this->totalShards;
    }
}

// Usage example
$sharding = new Md5Sharding(16);
$customId = 12345;
$shardNumber = $sharding->getShardNumber($customId);

Characteristics:

Fixed number of shards
Supports custom IDs
Small performance overhead for MD5 calculation
Even distribution across shards
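
Why the number of shards has to stay fixed can be checked with a small experiment: count how many IDs map to a different shard when the shard count changes. The sketch below reuses the formula above as a standalone function; `md5Shard` and `countMovedIds` are hypothetical names.

```php
<?php
// Same formula as getShardNumber() above, as a standalone function.
function md5Shard($id, int $totalShards): int
{
    return hexdec(substr(md5((string) $id), 0, 8)) % $totalShards;
}

// Count how many of the IDs 1..$count land on a different shard
// after the shard count changes from $from to $to.
function countMovedIds(int $count, int $from, int $to): int
{
    $moved = 0;
    for ($id = 1; $id <= $count; $id++) {
        if (md5Shard($id, $from) !== md5Shard($id, $to)) {
            $moved++;
        }
    }
    return $moved;
}

$moved = countMovedIds(10000, 16, 17);
printf("%d of 10000 IDs change shard when going from 16 to 17 shards\n", $moved);
```

With a plain modulo, a hash keeps its shard only when it has the same remainder for both shard counts, so the large majority of IDs are expected to move and would have to be re-inserted elsewhere.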

Both approaches have their use cases:

Use Snowflake when you need time-ordered IDs and embedded shard information
Use MD5-based sharding when you need to support custom IDs or have a fixed number of shards

sanikolaev · 2024-11-12T09:41:21Z

As we discussed in Slack, let's avoid using the snowflake ID approach, as we need to keep the option to provide custom IDs. Instead of the modulo function, let's explore other options, like jump consistent hashing.
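
For reference, jump consistent hashing (Lamping and Veach) could look roughly like this in PHP. This is a sketch and not Buddy's implementation: all function names are hypothetical, 64-bit PHP is assumed, and deriving the 64-bit key from the first 8 bytes of the MD5 digest is only one possible choice. The 16-bit limb multiplication is there because PHP promotes overflowing integer products to float instead of wrapping.

```php
<?php
// Sketch only: function names are hypothetical; requires 64-bit PHP.

// Compute (a * b + c) mod 2^64 with 16-bit limbs so nothing overflows into a float.
// $c must be a small non-negative integer.
function mulAdd64(int $a, int $b, int $c): int
{
    $al = [$a & 0xFFFF, ($a >> 16) & 0xFFFF, ($a >> 32) & 0xFFFF, ($a >> 48) & 0xFFFF];
    $bl = [$b & 0xFFFF, ($b >> 16) & 0xFFFF, ($b >> 32) & 0xFFFF, ($b >> 48) & 0xFFFF];
    $result = 0;
    $carry = $c;
    for ($k = 0; $k < 4; $k++) {
        $sum = $carry;
        for ($i = 0; $i <= $k; $i++) {
            $sum += $al[$i] * $bl[$k - $i]; // each product is below 2^32
        }
        $result |= ($sum & 0xFFFF) << (16 * $k);
        $carry = $sum >> 16;
    }
    return $result;
}

// Map a 64-bit key to a bucket in [0, $numBuckets).
function jumpConsistentHash(int $key, int $numBuckets): int
{
    if ($numBuckets < 1) {
        throw new InvalidArgumentException('numBuckets must be at least 1');
    }
    $b = -1;
    $j = 0;
    while ($j < $numBuckets) {
        $b = $j;
        // Linear congruential step from the original algorithm
        $key = mulAdd64($key, 2862933555777941757, 1);
        $high = ($key >> 33) & 0x7FFFFFFF; // unsigned top 31 bits
        $j = (int) (($b + 1) * (2147483648.0 / ($high + 1)));
    }
    return $b;
}

// Assumption: take the first 8 bytes of the raw MD5 digest as the key.
// 'J' unpacks an unsigned 64-bit big-endian value, which PHP stores as a signed int.
function idToKey($id): int
{
    return unpack('J', substr(md5((string) $id, true), 0, 8))[1];
}

function jumpShard($id, int $totalShards): int
{
    return jumpConsistentHash(idToKey($id), $totalShards);
}

printf("ID 12345 -> shard %d of 16\n", jumpShard(12345, 16));
```

Unlike the modulo approach, growing from N to N+1 shards leaves each ID either on its current shard or moves it to the new shard N, so only about 1/(N+1) of the IDs are expected to move. Shards can only be added or removed at the end of the range, though.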

sanikolaev · 2024-11-12T11:15:34Z

@donhardman Also, as we discussed, select uuid_short() may be required in the daemon. I've talked to @tomatolog about it, and it's not a big deal to add. Please create a separate task for it if required.

donhardman · 2024-11-12T12:28:41Z

I have created a task: manticoresoftware/manticoresearch#2752

sanikolaev · 2024-11-13T10:13:55Z

manticoresoftware/manticoresearch#2752 is done

donhardman self-assigned this Nov 5, 2024

sanikolaev added the est::size_M label Nov 5, 2024

donhardman assigned sanikolaev and unassigned donhardman Nov 12, 2024

sanikolaev assigned donhardman and unassigned sanikolaev Nov 12, 2024
