Distributed inserts (_bulk, replace, update) #392
Two Approaches for ID Generation and Sharding

1. Snowflake-like ID Generation

Even distribution is prioritized. This approach generates unique IDs with embedded shard information.

Structure (63-bit integer), as implied by the formula below:
- 41 bits: timestamp in milliseconds since a custom epoch
- 10 bits: shard ID (0-1023)
- 12 bits: sequence number (0-4095)

Formula:
$id = ($timestamp << 22) | (($shardId % 1024) << 12) | $sequence;

Example in PHP:
class SnowflakeGenerator {
    private const CUSTOM_EPOCH = 1640995200000; // 2022-01-01 00:00:00 UTC, in milliseconds
    private $sequence = 0;
    private $lastTimestamp = -1;

    public function generateId($shardId) {
        $timestamp = $this->getCurrentTimestamp();
        if ($timestamp == $this->lastTimestamp) {
            // Same millisecond: advance the 12-bit sequence counter.
            $this->sequence = ($this->sequence + 1) & 4095;
            if ($this->sequence == 0) {
                // Sequence exhausted: wait for the next millisecond.
                $timestamp = $this->waitNextMillis($this->lastTimestamp);
            }
        } else {
            $this->sequence = 0;
        }
        $this->lastTimestamp = $timestamp;

        return (($timestamp - self::CUSTOM_EPOCH) << 22)
            | (($shardId % 1024) << 12)
            | $this->sequence;
    }

    // Helper methods...
}

Characteristics:
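As a sanity check on the bit layout used in the Snowflake example, here is a minimal sketch that packs and unpacks an ID with the same shifts and masks. The function names (`composeId`, `decomposeId`, `currentMillis`, `waitNextMillis`) are hypothetical and do not come from the original comment; the two time helpers are only one plausible way to fill in the elided helper methods.

```php
<?php
// Hypothetical helpers, not taken from the original comment.

// Current Unix time in milliseconds.
function currentMillis(): int
{
    return (int) floor(microtime(true) * 1000);
}

// Busy-wait until the clock moves past $lastTimestamp (milliseconds).
function waitNextMillis(int $lastTimestamp): int
{
    do {
        $now = currentMillis();
    } while ($now <= $lastTimestamp);
    return $now;
}

// Pack the three fields using the layout from the formula:
// timestamp << 22 | shard (10 bits) << 12 | sequence (12 bits).
function composeId(int $millisSinceEpoch, int $shardId, int $sequence): int
{
    return ($millisSinceEpoch << 22)
        | (($shardId % 1024) << 12)
        | ($sequence & 4095);
}

// Recover the fields from an ID, so the shard can be read back
// without any lookup table.
function decomposeId(int $id): array
{
    return [
        'timestamp' => $id >> 22,
        'shard'     => ($id >> 12) & 1023,
        'sequence'  => $id & 4095,
    ];
}
```

Because the shard number sits in fixed bits of the ID, routing a read or update only needs `decomposeId($id)['shard']`. That is also why this scheme cannot accommodate IDs chosen by the user.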
2. MD5-based Sharding

For scenarios where custom IDs are needed.

Formula:
$shardNumber = hexdec(substr(md5($id), 0, 8)) % $totalShards;

Example in PHP:
class Md5Sharding {
    private $totalShards;

    public function __construct($totalShards) {
        $this->totalShards = $totalShards;
    }

    public function getShardNumber($id) {
        return hexdec(substr(md5((string)$id), 0, 8)) % $this->totalShards;
    }
}

// Usage example
$sharding = new Md5Sharding(16);
$customId = 12345;
$shardNumber = $sharding->getShardNumber($customId);

Characteristics:
Both approaches have their use cases:
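One property of the MD5 modulo formula is worth measuring before relying on it: what happens to existing IDs when the number of shards changes. The sketch below reuses the same formula and counts how many sample IDs land on a different shard after going from 16 to 17 shards. The function names and the sample of sequential integer IDs are assumptions made for illustration.

```php
<?php
// Same formula as Md5Sharding::getShardNumber, as a plain function.
function md5Shard($id, int $totalShards): int
{
    return hexdec(substr(md5((string) $id), 0, 8)) % $totalShards;
}

// Fraction of IDs 1..$sampleSize whose shard differs between two shard counts.
function movedFraction(int $fromShards, int $toShards, int $sampleSize = 10000): float
{
    $moved = 0;
    for ($id = 1; $id <= $sampleSize; $id++) {
        if (md5Shard($id, $fromShards) !== md5Shard($id, $toShards)) {
            $moved++;
        }
    }
    return $moved / $sampleSize;
}

printf("%.1f%% of IDs change shard going from 16 to 17 shards\n", 100 * movedFraction(16, 17));
```

For a uniform hash, a key keeps its shard only when `hash % 16` equals `hash % 17`, which happens for about 1 in 17 keys, so nearly all documents would have to be relocated after adding a single shard.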
As we discussed in Slack, let's avoid using the Snowflake ID approach, as we need to keep the option to provide custom IDs. Instead of the modulo function, let's explore other options, like jump consistent hashing.
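For reference, a minimal sketch of jump consistent hashing (the Lamping and Veach algorithm) in PHP follows. It assumes a 64-bit PHP build. PHP promotes overflowing `+` and `*` to float, so the 64-bit wrap-around arithmetic the algorithm depends on is emulated here with shifts and masks. All function names are hypothetical, and deriving the 64-bit key from the first 8 bytes of the MD5 digest is an assumption for illustration, not something decided in this thread.

```php
<?php
const JUMP_MULTIPLIER = 2862933555777941757; // LCG multiplier from the published algorithm

// Add two 64-bit patterns modulo 2^64 without float promotion.
function add64(int $a, int $b): int
{
    $lo = ($a & 0xFFFFFFFF) + ($b & 0xFFFFFFFF);
    $hi = (($a >> 32) & 0xFFFFFFFF) + (($b >> 32) & 0xFFFFFFFF) + ($lo >> 32);
    return (($hi & 0xFFFFFFFF) << 32) | ($lo & 0xFFFFFFFF);
}

// Multiply a 64-bit pattern by a positive constant modulo 2^64 (shift-and-add).
function mul64(int $a, int $multiplier): int
{
    $result = 0;
    while ($multiplier > 0) {
        if ($multiplier & 1) {
            $result = add64($result, $a);
        }
        $a <<= 1;          // bits shifted past bit 63 are dropped
        $multiplier >>= 1;
    }
    return $result;
}

// Map a 64-bit key to a bucket in [0, $numBuckets); returns -1 if $numBuckets < 1.
function jumpConsistentHash(int $key, int $numBuckets): int
{
    $bucket = -1;
    $jump = 0;
    while ($jump < $numBuckets) {
        $bucket = $jump;
        $key = add64(mul64($key, JUMP_MULTIPLIER), 1);
        $top31 = ($key >> 33) & 0x7FFFFFFF; // logical shift of the unsigned key
        $jump = (int) (($bucket + 1) * (2147483648.0 / ($top31 + 1)));
    }
    return $bucket;
}

// Hypothetical helper: hash any custom ID to a 64-bit key, then pick a shard.
function shardForId($id, int $totalShards): int
{
    $key = unpack('J', substr(md5((string) $id, true), 0, 8))[1];
    return jumpConsistentHash($key, $totalShards);
}
```

The relevant property is that growing from N to N+1 shards only moves keys onto the new shard; every other key stays where it was. The trade-off is that shards can only be added or removed at the end of the range.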
@donhardman also, as we discussed
I have created a task: manticoresoftware/manticoresearch#2752
Proposal:
For the initial version, in which inserts into sharded tables are handled on the Buddy side, we should implement the logic for the JSON protocol ONLY.
Key considerations:
- `id` generation on the Buddy side that is easy to maintain and can be moved to the daemon part later
- If passing a custom `id` is allowed, we should use MD5 or similar hashing; otherwise, don't allow passing IDs

Checklist:
To be completed by the assignee. Check off tasks that have been completed or are not applicable.