Distributed inserts (_bulk, replace, update) #392

Open
1 of 6 tasks
donhardman opened this issue Nov 5, 2024 · 5 comments

@donhardman
Contributor

Proposal:

For the initial version of inserts into sharded tables, we should implement the logic on the Buddy side and support the JSON protocol ONLY.

Key considerations:

  • Implement ID generation on the Buddy side in a way that is easy to maintain and can be moved to the daemon later
  • Use a proper algorithm for the sharding logic. If we allow users to pass an ID, we should use MD5 or similar hashing to pick the shard; otherwise, we should not allow passing IDs at all
  • For replace or update operations:
    • When performed without an ID, send to all nodes
    • When performed with an ID, we can determine the specific shard from the ID
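
The routing rule for replace and update could be sketched roughly as follows. This is only an illustration: `targetShards` is a hypothetical helper, not an existing Buddy function, and the MD5-modulo mapping stands in for whatever hashing is finally chosen.

```php
<?php
/**
 * Hypothetical helper: which shards should receive a replace/update?
 * With an ID, only the shard that owns the ID; without one, every shard.
 */
function targetShards(?int $id, int $totalShards): array {
    if ($totalShards < 1) {
        throw new InvalidArgumentException('totalShards must be >= 1');
    }
    if ($id === null) {
        // No ID: the document could live anywhere, so broadcast
        return range(0, $totalShards - 1);
    }
    // Assumption: MD5-based mapping; 8 hex chars always fit in a 64-bit int
    $hash = hexdec(substr(md5((string) $id), 0, 8));
    return [$hash % $totalShards];
}

echo count(targetShards(null, 4)), " shards without an ID\n";
echo count(targetShards(12345, 4)), " shard with an ID\n";
```

The broadcast branch is the expensive one, so callers that know the ID should always pass it.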

Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

  • Implementation completed
  • Tests developed
  • Documentation updated
  • Documentation reviewed
  • Changelog updated
  • OpenAPI YAML updated and issue created to rebuild clients
@donhardman donhardman self-assigned this Nov 5, 2024
@donhardman
Contributor Author

Two Approaches for ID Generation and Sharding

1. Snowflake-like ID Generation

Even distribution is prioritized. This approach generates unique IDs with embedded shard information.

Structure (63-bit integer):

  • 41 bits: timestamp (milliseconds since custom epoch)
  • 10 bits: shard/node ID (0-1023)
  • 12 bits: sequence number (0-4095)

Formula:

$id = ($timestamp << 22) | (($shardId % 1024) << 12) | $sequence;

Example in PHP:

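class SnowflakeGenerator {
    private const CUSTOM_EPOCH = 1640995200000; // 2022-01-01 00:00:00 UTC, in milliseconds
    private const MAX_SEQUENCE = 4095;          // 12 bits
    private int $sequence = 0;
    private int $lastTimestamp = -1;

    public function generateId(int $shardId): int {
        $timestamp = $this->getCurrentTimestamp();

        if ($timestamp === $this->lastTimestamp) {
            // Same millisecond: bump the sequence; on overflow, wait for the next millisecond
            $this->sequence = ($this->sequence + 1) & self::MAX_SEQUENCE;
            if ($this->sequence === 0) {
                $timestamp = $this->waitNextMillis($this->lastTimestamp);
            }
        } else {
            // Note: a clock that moves backwards is not handled in this sketch
            $this->sequence = 0;
        }

        $this->lastTimestamp = $timestamp;

        return (($timestamp - self::CUSTOM_EPOCH) << 22)
             | (($shardId % 1024) << 12)
             | $this->sequence;
    }

    // Current Unix time in milliseconds
    private function getCurrentTimestamp(): int {
        return (int) floor(microtime(true) * 1000);
    }

    // Busy-wait until the clock moves past $lastTimestamp
    private function waitNextMillis(int $lastTimestamp): int {
        do {
            $timestamp = $this->getCurrentTimestamp();
        } while ($timestamp <= $lastTimestamp);
        return $timestamp;
    }
}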
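
Since the shard is embedded in the ID, it can be read back with shifts and masks. A minimal sketch, assuming the 41/10/12 layout above; `decodeSnowflakeId` is a hypothetical name used only for illustration:

```php
<?php
// Hypothetical helper: split an ID built with the 41/10/12 layout into its parts.
function decodeSnowflakeId(int $id, int $customEpoch = 1640995200000): array {
    return [
        'timestamp' => ($id >> 22) + $customEpoch, // milliseconds since Unix epoch
        'shardId'   => ($id >> 12) & 0x3FF,        // 10 bits
        'sequence'  => $id & 0xFFF,                // 12 bits
    ];
}

// Build an ID by hand with the same formula (1000 ms after the custom epoch,
// shard 5, sequence 7), then decode it.
$id = (1000 << 22) | ((5 % 1024) << 12) | 7;
$parts = decodeSnowflakeId($id);
printf("shard=%d sequence=%d\n", $parts['shardId'], $parts['sequence']); // prints "shard=5 sequence=7"
```

This is what lets replace or update with an ID go straight to one shard without a lookup table.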

Characteristics:

  • Number of shards is limited (1024 with 10 bits) but not fixed
  • Cannot support custom IDs
  • No extra overhead
  • IDs are roughly time-ordered (increasing per generator), not strictly sequential

2. MD5-based Sharding

For scenarios where custom IDs are needed.

Formula:

$shardNumber = hexdec(substr(md5($id), 0, 8)) % $totalShards;

Example in PHP:

class Md5Sharding {
    private $totalShards;

    public function __construct($totalShards) {
        $this->totalShards = $totalShards;
    }

    public function getShardNumber($id) {
        return hexdec(substr(md5((string)$id), 0, 8)) % $this->totalShards;
    }
}

// Usage example
$sharding = new Md5Sharding(16);
$customId = 12345;
$shardNumber = $sharding->getShardNumber($customId);

Characteristics:

  • Fixed number of shards
  • Supports custom IDs
  • Small performance overhead for MD5 calculation
  • Even distribution across shards
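
One way to see why the shard count has to stay fixed: with plain modulo, changing `$totalShards` remaps most IDs. The small experiment below is only a sketch; `md5Shard` and `countMoved` are hypothetical names, and the mapping is the formula above.

```php
<?php
// Same mapping as the MD5 formula above.
function md5Shard(int $id, int $totalShards): int {
    return hexdec(substr(md5((string) $id), 0, 8)) % $totalShards;
}

// Hypothetical experiment: how many of the IDs 1..$count land on a
// different shard when the shard count changes from $from to $to?
function countMoved(int $count, int $from, int $to): int {
    $moved = 0;
    for ($id = 1; $id <= $count; $id++) {
        if (md5Shard($id, $from) !== md5Shard($id, $to)) {
            $moved++;
        }
    }
    return $moved;
}

printf("%d of 1000 IDs move when going from 16 to 17 shards\n", countMoved(1000, 16, 17));
```

For a uniform hash, about 16 in 17 IDs would be expected to move in this case, so adding a shard means re-inserting nearly all of the data.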

Both approaches have their use cases:

  • Use Snowflake when you need time-ordered IDs and embedded shard information
  • Use MD5-based sharding when you need to support custom IDs or have a fixed number of shards

@donhardman donhardman assigned sanikolaev and unassigned donhardman Nov 12, 2024
@sanikolaev
Collaborator

As we discussed in Slack, let's avoid using the snowflake ID approach, as we need to keep the option to provide custom IDs. Instead of the modulo function, let's explore other options, like jump consistent hashing.
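
For reference, jump consistent hashing could look roughly like this in PHP. It is a sketch under several assumptions: a 64-bit PHP build, a 64-bit key derived from the first 16 hex characters of MD5 (that derivation is not something decided in this thread), and hypothetical function names. The multiplication is done on 16-bit limbs because PHP integers turn into floats on overflow instead of wrapping.

```php
<?php
// Sketch for 64-bit PHP builds only.
const JUMP_MULTIPLIER = 2862933555777941757; // LCG constant used by jump consistent hash

// Computes $key * JUMP_MULTIPLIER + 1 modulo 2^64 on 16-bit limbs,
// so that no intermediate value overflows a PHP integer.
function lcgStep(int $key): int {
    $carry = 1; // the "+ 1" of the LCG
    $result = 0;
    for ($n = 0; $n < 4; $n++) {
        $sum = $carry;
        for ($i = 0; $i <= $n; $i++) {
            $a = ($key >> (16 * $i)) & 0xFFFF;
            $b = (JUMP_MULTIPLIER >> (16 * ($n - $i))) & 0xFFFF;
            $sum += $a * $b;
        }
        $result |= ($sum & 0xFFFF) << (16 * $n);
        $carry = $sum >> 16;
    }
    return $result;
}

// Maps a 64-bit key to a bucket in [0, $numBuckets).
function jumpConsistentHash(int $key, int $numBuckets): int {
    if ($numBuckets < 1) {
        throw new InvalidArgumentException('numBuckets must be >= 1');
    }
    $b = -1;
    $j = 0;
    while ($j < $numBuckets) {
        $b = $j;
        $key = lcgStep($key);
        $r = ($key >> 33) & 0x7FFFFFFF; // top 31 bits, read as unsigned
        $j = (int) (($b + 1) * ((1 << 31) / ($r + 1)));
    }
    return $b;
}

// Assumption: derive the 64-bit key from the first 16 hex chars of MD5.
function shardForId($id, int $totalShards): int {
    $hex = md5((string) $id);
    $key = (hexdec(substr($hex, 0, 8)) << 32) | hexdec(substr($hex, 8, 8));
    return jumpConsistentHash($key, $totalShards);
}

echo shardForId(12345, 16), "\n";
```

The property that matters here: when the shard count grows from n to n + 1, a key either stays where it was or moves to the new shard n, so only about 1/(n + 1) of the data has to be moved.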

@sanikolaev sanikolaev assigned donhardman and unassigned sanikolaev Nov 12, 2024
@sanikolaev
Collaborator

@donhardman also, as we discussed select uuid_short() may be required in the daemon. I've discussed it with @tomatolog and it's not a big deal to add it. Pls create a separate task about it if required.

@donhardman
Contributor Author

I have created a task: manticoresoftware/manticoresearch#2752

@sanikolaev
Collaborator
