-
Notifications
You must be signed in to change notification settings - Fork 726
Metadata Management
The data of TiDB will be stored in TiKV in the form of key-value. TiKV is a distributed key-value storage engine and can ensure global key order. The key actually is a byte array of arbitrary length. How to ensure that the key is ordered in the global view and how to access the data?
TiDB splits data by range. For different Regions, each Region corresponds to a key range, and the data in this range is stored in order. The metadata of a Region records the corresponding start key and end key. All requests are processed by finding the corresponding Region through these metadata. Therefore, the management and storage of metadata are important to guarantee the normal operation of the entire cluster. The metadata is not directly stored in TiKV but is stored and managed by placement driver (PD). Like TiKV, PD embeds etcd to utilize raft algorithm to ensure high availability. In addition to the metadata of Region, there are some other metadata in TiDB, which are managed and stored by PD. At present, there are mainly the following types of metadata:
- Metadata of Region
- Metadata about TiDB, including DDL, GC, safe point
- Metadata registered by TiDB tools, such as pump, drainer, etc
- Allocated ID and TSO window, etc
The metadata of Region is reported by TiKV's Heartbeat. Each Region is a raft group. By default, a Region heartbeat is reported to the PD leader every 60s
. The information reported by a Region defined in pdpb.proto is mainly composed of:
type RegionHeartbeatRequest struct {
...
Region *metapb.Region `protobuf:"bytes,2,opt,name=region" json:"region,omitempty"`
// Leader Peer sending the heartbeat.
Leader *metapb.Peer `protobuf:"bytes,3,opt,name=leader" json:"leader,omitempty"`
// Leader considers that these peers are down.
DownPeers []*PeerStats `protobuf:"bytes,4,rep,name=down_peers,json=downPeers" json:"down_peers,omitempty"`
// Pending peers are the peers that the leader can't consider as
// working followers.
PendingPeers []*metapb.Peer `protobuf:"bytes,5,rep,name=pending_peers,json=pendingPeers" json:"pending_peers,omitempty"`
// Bytes read/written during this period.
BytesWritten uint64 `protobuf:"varint,6,opt,name=bytes_written,json=bytesWritten,proto3" json:"bytes_written,omitempty"`
BytesRead uint64 `protobuf:"varint,7,opt,name=bytes_read,json=bytesRead,proto3" json:"bytes_read,omitempty"`
// Keys read/written during this period.
KeysWritten uint64 `protobuf:"varint,8,opt,name=keys_written,json=keysWritten,proto3" json:"keys_written,omitempty"`
KeysRead uint64 `protobuf:"varint,9,opt,name=keys_read,json=keysRead,proto3" json:"keys_read,omitempty"`
// Approximate Region size.
ApproximateSize uint64 `protobuf:"varint,10,opt,name=approximate_size,json=approximateSize,proto3" json:"approximate_size,omitempty"`
// Actually reported time interval
Interval *TimeInterval `protobuf:"bytes,12,opt,name=interval" json:"interval,omitempty"`
// Approximate number of keys.
ApproximateKeys uint64 `protobuf:"varint,13,opt,name=approximate_keys,json=approximateKeys,proto3" json:"approximate_keys,omitempty"`
// Term is the term of raft group.
Term uint64 `protobuf:"varint,14,opt,name=term,proto3" json:"term,omitempty"`
ReplicationStatus *replication_modepb.RegionReplicationStatus `protobuf:"bytes,15,opt,name=replication_status,json=replicationStatus" json:"replication_status,omitempty"`
...
}
The metadata includes the information of start key, end key, Region epoch, as well as some flow and Region size information. It will be stored in the cache as one of the inputs of the scheduler to make scheduling decisions. The Region metadata in the cache is ordered by key in the BTree structure to implement quick query of a Region.
In order to reduce the RPC overhead caused by the requests which are used to get the routing information of Regions from PD, TiDB will maintain the routing information of the Region cache itself. For a specific request, TiDB will first check if there is any relevant Region information in its own cache, which can be used directly. Otherwise, it will request PD to get the corresponding Region information through the key and update its local cache. In addition to its own unique ID, each Region also has epoch information, which is used to mark the version of Region change. The definition is listed as follows:
type RegionEpoch struct {
// Conf change version, auto increment when add or remove peer
ConfVer uint64 `protobuf:"varint,1,opt,name=conf_ver,json=confVer,proto3" json:"conf_ver,omitempty"`
// Region version, auto increment when split or merge
Version uint64 `protobuf:"varint,2,opt,name=version,proto3" json:"version,omitempty"`
...
}
The epoch can uniquely identify the Region of the current request. Before TiKV processes a request, the request will be verified by comparing the epoch. If the status is inconsistent, the request will be regarded as illegal and an error will be returned. There are some common errors:
- Not Leader. The leader may change due to PD scheduling or network jitter. If the error of TiKV is returned, a new leader related information will be carried in it, and the cached leader information of TiDB will be directly updated
- Stale Epoch. If the Region epoch is out of date due to member changes or splitting, it is necessary to request PD to update the local cache information
Besides, a default 10 minute TTL is set to the Region cache in TiDB. If a request occurs in a Region with TTL expired, the Region cache will be updated directly from PD to prevent cache from being incorrect.
In order to prevent the loss of metadata of PD after restarting, it is necessary to store the important metadata. PD is embedded with etcd, which itself is a highly available and strongly consistent key-value storage. Therefore, PD stores metadata by writing etcd directly. Each Region has its own unique Region ID, and PD processes the Region metadata as KV pair, and the encoding method of its key is {clusterID}/r/{regionID}
, and the value is Region
which defined in metapb.proto.
type Region struct {
Id uint64 `protobuf:"varint,1,opt,name=id,proto3" json:"id,omitempty"`
// Region key range [start_key, end_key).
StartKey []byte `protobuf:"bytes,2,opt,name=start_key,json=startKey,proto3" json:"start_key,omitempty"`
EndKey []byte `protobuf:"bytes,3,opt,name=end_key,json=endKey,proto3" json:"end_key,omitempty"`
RegionEpoch *RegionEpoch `protobuf:"bytes,4,opt,name=region_epoch,json=regionEpoch" json:"region_epoch,omitempty"`
Peers []*Peer `protobuf:"bytes,5,rep,name=peers" json:"peers,omitempty"`
...
}
In addition, in order to speed up the processing of the heartbeat, the corresponding Region information reported by the heartbeat will be updated to KV only when the epoch changes or a new Region is inserted. Otherwise, only the cache is updated.