title | summary |
---|---|
TiCDC Canal-JSON Protocol |
Learn the concept of TiCDC Canal-JSON Protocol and how to use it. |
Canal-JSON is a data exchange format protocol defined by Alibaba Canal. In this document, you can learn how Canal-JSON data formats are implemented in TiCDC, including the TiDB extension field, the definitions of the Canal-JSON data formats, and comparison with the official Canal.
When using Message Queue (MQ) as the downstream Sink, you can specify Canal-JSON in sink-uri
. TiCDC wraps and constructs Canal-JSON messages with Event as the basic unit, and sends TiDB data change Events to the downstream.
There are three types of Events:
- DDL Event: Represents a DDL change record. It is sent after an upstream DDL statement is successfully executed. The DDL Event is sent to the MQ Partition with the index being 0.
- DML Event: Represents a row data change record. This type of Event is sent when a row change occurs. It contains the information about the row after the change occurs.
- WATERMARK Event: Represents a special time point. It indicates that the Events received before this point is complete. It applies only to the TiDB extension field and takes effect when you set
enable-tidb-extension
totrue
insink-uri
.
The following is an example of using Canal-JSON
:
{{< copyable "shell-regular" >}}
cdc cli changefeed create --server=http://127.0.0.1:8300 --changefeed-id="kafka-canal-json" --sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&protocol=canal-json"
The Canal-JSON protocol is originally designed for MySQL. It does not contain important fields such as the TiDB-specific unique identifier for the CommitTS transaction. To solve this problem, TiCDC appends a TiDB extension field to the Canal-JSON protocol format. After you set enable-tidb-extension
to true
(false
by default) in sink-uri
, TiCDC behaves as follows when generating Canal-JSON messages:
- TiCDC sends DML Event and DDL Event messages that contain a field named
_tidb
. - TiCDC sends WATERMARK Event messages.
The following is an example:
{{< copyable "shell-regular" >}}
cdc cli changefeed create --server=http://127.0.0.1:8300 --changefeed-id="kafka-canal-json-enable-tidb-extension" --sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&protocol=canal-json&enable-tidb-extension=true"
This section describes the formats of DDL Event, DML Event and WATERMARK Event, and how the data is resolved on the consumer side.
TiCDC encodes a DDL Event into the following Canal-JSON format.
{
"id": 0,
"database": "test",
"table": "",
"pkNames": null,
"isDdl": true,
"type": "QUERY",
"es": 1639633094670,
"ts": 1639633095489,
"sql": "drop database if exists test",
"sqlType": null,
"mysqlType": null,
"data": null,
"old": null,
"_tidb": { // TiDB extension field
"commitTs": 163963309467037594
}
}
The fields are explained as follows.
Field | Type | Description |
---|---|---|
id | Number | The default value is 0 in TiCDC. |
database | String | The name of the database where the row is located |
table | String | The name of the table where the row is located |
pkNames | Array | The names of all the columns that make up the primary key |
isDdl | Bool | Whether the message is a DDL event |
type | String | Event types defined by Canal-JSON |
es | Number | 13-bit (millisecond) timestamp when the event that generated the message happened |
ts | Number | 13-bit (millisecond) timestamp when TiCDC generated the message |
sql | String | When isDdl is true , records the corresponding DDL statement |
sqlType | Object | When isDdl is false , records how the data type of each column is represented in Java |
mysqlType | object | When isDdl is false , records how the data type of each column is represented in MySQL |
data | Object | When isDdl is false , records the name of each column and its data value |
old | Object | Only if the message is generated by an update Event, records the name of each column and the data value before the update |
_tidb | Object | TiDB extension field. It exists only if you set enable-tidb-extension to true . The value of commitTs is the TSO of the transaction that caused the row to change. |
TiCDC encodes a row of DML data change event as follows:
{
"id": 0,
"database": "test",
"table": "tp_int",
"pkNames": [
"id"
],
"isDdl": false,
"type": "INSERT",
"es": 1639633141221,
"ts": 1639633142960,
"sql": "",
"sqlType": {
"c_bigint": -5,
"c_int": 4,
"c_mediumint": 4,
"c_smallint": 5,
"c_tinyint": -6,
"id": 4
},
"mysqlType": {
"c_bigint": "bigint",
"c_int": "int",
"c_mediumint": "mediumint",
"c_smallint": "smallint",
"c_tinyint": "tinyint",
"id": "int"
},
"data": [
{
"c_bigint": "9223372036854775807",
"c_int": "2147483647",
"c_mediumint": "8388607",
"c_smallint": "32767",
"c_tinyint": "127",
"id": "2"
}
],
"old": null,
"_tidb": { // TiDB extension field
"commitTs": 163963314122145239
}
}
TiCDC sends a WATERMARK Event only when you set enable-tidb-extension
to true
. The value of the type
field is TIDB_WATERMARK
. The Event contains the _tidb
field, and the field contains only one parameter watermarkTs
. The value of watermarkTs
is the TSO recorded when the Event is sent.
When you receive an Event of this type, all Events with commitTs
less than watermarkTs
have been sent. Because TiCDC provides the "At Least Once" semantics, data might be sent repeatedly. If a subsequent Event with commitTs
less than watermarkTs
is received, you can safely ignore this Event.
The following is an example of the WATERMARK Event.
{
"id": 0,
"database": "",
"table": "",
"pkNames": null,
"isDdl": false,
"type": "TIDB_WATERMARK",
"es": 1640007049196,
"ts": 1640007050284,
"sql": "",
"sqlType": null,
"mysqlType": null,
"data": null,
"old": null,
"_tidb": { // TiDB extension field
"watermarkTs": 429918007904436226
}
}
As you can see from the example above, Canal-JSON has a uniform data format, with different field filling rules for different Event types. You can use a uniform method to resolve this JSON format data, and then determine the Event type by checking the field values.
- When
isDdl
istrue
, the message contains a DDL Event. - When
isDdl
isfalse
, you need to further check thetype
field. Iftype
isTIDB_WATERMARK
, it is a WATERMARK Event; otherwise, it is a DML Event.
The Canal-JSON format records the corresponding data type in the mysqlType
field and the sqlType
field.
In the mysqlType
field, the Canal-JSON format records the string of MySQL Type in each column. For more information, see TiDB Data Types.
In the sqlType
field, the Canal-JSON format records Java SQL Type of each column, which is the data type corresponding to the data in JDBC. Its value can be calculated by MySQL Type and the specific data value. The mapping is as follows:
MySQL Type | Java SQL Type Code |
---|---|
Boolean | -6 |
Float | 7 |
Double | 8 |
Decimal | 3 |
Char | 1 |
Varchar | 12 |
Binary | 2004 |
Varbinary | 2004 |
Tinytext | 2005 |
Text | 2005 |
Mediumtext | 2005 |
Longtext | 2005 |
Tinyblob | 2004 |
Blob | 2004 |
Mediumblob | 2004 |
Longblob | 2004 |
Date | 91 |
Datetime | 93 |
Timestamp | 93 |
Time | 92 |
Year | 12 |
Enum | 4 |
Set | -7 |
Bit | -7 |
JSON | 12 |
You need to consider whether integer types have the Unsigned
constraint and the value size, which corresponds to different Java SQL Type Codes respectively, as shown in the following table.
MySQL Type String | Value Range | Java SQL Type Code |
---|---|---|
tinyint | [-128, 127] | -6 |
tinyint unsigned | [0, 127] | -6 |
tinyint unsigned | [128, 255] | 5 |
smallint | [-32768, 32767] | 5 |
smallint unsigned | [0, 32767] | 5 |
smallint unsigned | [32768, 65535] | 4 |
mediumint | [-8388608, 8388607] | 4 |
mediumint unsigned | [0, 8388607] | 4 |
mediumint unsigned | [8388608, 16777215] | 4 |
int | [-2147483648, 2147483647] | 4 |
int unsigned | [0, 2147483647] | 4 |
int unsigned | [2147483648, 4294967295] | -5 |
bigint | [-9223372036854775808, 9223372036854775807] | -5 |
bigint unsigned | [0, 9223372036854775807] | -5 |
bigint unsigned | [9223372036854775808, 18446744073709551615] | 3 |
The following table shows the mapping relationships between Java SQL Types in TiCDC and their codes.
Java SQL Type | Java SQL Type Code |
---|---|
CHAR | 1 |
DECIMAL | 3 |
INTEGER | 4 |
SMALLINT | 5 |
REAL | 7 |
DOUBLE | 8 |
VARCHAR | 12 |
DATE | 91 |
TIME | 92 |
TIMESTAMP | 93 |
BLOB | 2004 |
CLOB | 2005 |
BIGINT | -5 |
TINYINT | -6 |
Bit | -7 |
For more information about Java SQL Types, see Java SQL Class Types.
TiCDC encodes binary types in the Canal-JSON format by converting each byte to its character representation as follows:
- Printable characters are represented using the ISO/IEC 8859-1 character encodings.
- Non-printable characters and certain characters with special meaning in HTML are represented using their UTF-8 escape sequence.
The following table shows the detailed representation information.
Character type | Value range | Character representation |
---|---|---|
Control characters | [0, 31] | UTF-8 escape (such as \u0000 through \u001F ) |
Horizontal tab | [9] | \t |
Line feed | [10] | \n |
Carriage return | [13] | \r |
Printable characters | [32, 127] | Literal character (such as A ) |
Ampersand | [38] | \u0026 |
Less-than sign | [60] | \u0038 |
Greater-than sign | [62] | \u003E |
Extended control characters | [128, 159] | Literal character |
ISO 8859-1 (Latin-1) | [160, 255] | Literal character |
For example, the following 16 bytes [5 7 10 15 36 50 43 99 120 60 38 255 254 45 55 70]
stored in a VARBINARY
column called c_varbinary
are encoded in a Canal-JSON Update
event as follows:
{
...
"data": [
{
...
"c_varbinary": "\u0005\u0007\n\u000f$2+cx\u003c\u0026ÿþ-7F"
}
]
...
}
The way that TiCDC implements the Canal-JSON data format, including the Update
Event and the mysqlType
field, differs from the official Canal. The following table shows the main differences.
Item | TiCDC Canal-JSON | Canal |
---|---|---|
Event of Update Type |
By default, the old field contains all the column data. When only_output_updated_columns is true , the old field contains only the modified column data. |
The old field contains only the modified column data |
mysqlType field |
For types with parameters, it does not contain the information of the type parameter | For types with parameters, it contains the full information of the type parameter |
Starting from v6.5.6, v7.1.3, and v7.6.0, TiCDC Canal-JSON supports compatibility with the data format of the official Canal. When creating a changefeed, you can set content-compatible=true
in sink-uri
to enable this feature. In this mode, TiCDC outputs Canal-JSON format data that is compatible with the official Canal. The specific changes are as follows:
- The
mysqlType
field contains the full information of the type parameter for each type. - An Event of
Update
Type only outputs the modified column data.
For an Event of Update
Type:
- In TiCDC, the
old
field contains all the column data - In the official Canal, the
old
field contains only the modified column data
Assume that the following SQL statements are executed sequentially in the upstream TiDB:
create table tp_int
(
id int auto_increment,
c_tinyint tinyint null,
c_smallint smallint null,
c_mediumint mediumint null,
c_int int null,
c_bigint bigint null,
constraint pk
primary key (id)
);
insert into tp_int(c_tinyint, c_smallint, c_mediumint, c_int, c_bigint)
values (127, 32767, 8388607, 2147483647, 9223372036854775807);
update tp_int set c_int = 0, c_tinyint = 0 where c_smallint = 32767;
For the update
statement, TiCDC outputs an Event message with type
as UPDATE
, as shown below. The update
statement only modifies the c_int
and c_tinyint
columns. The old
field in the output event message contains all the column data.
{
"id": 0,
...
"type": "UPDATE",
...
"sqlType": {
...
},
"mysqlType": {
...
},
"data": [
{
"c_bigint": "9223372036854775807",
"c_int": "0",
"c_mediumint": "8388607",
"c_smallint": "32767",
"c_tinyint": "0",
"id": "2"
}
],
"old": [ // In TiCDC, this field contains all the column data.
{
"c_bigint": "9223372036854775807",
"c_int": "2147483647", // Modified column
"c_mediumint": "8388607",
"c_smallint": "32767",
"c_tinyint": "127", // Modified column
"id": "2"
}
]
}
For the official Canal, the old
field in the output event message contains only the modified column data, as shown below.
{
"id": 0,
...
"type": "UPDATE",
...
"sqlType": {
...
},
"mysqlType": {
...
},
"data": [
{
"c_bigint": "9223372036854775807",
"c_int": "0",
"c_mediumint": "8388607",
"c_smallint": "32767",
"c_tinyint": "0",
"id": "2"
}
],
"old": [ // In Canal, this field contains only the modified column data.
{
"c_int": "2147483647", // Modified column
"c_tinyint": "127", // Modified column
}
]
}
For the mysqlType
field, if a type contains parameters, the official Canal contains the full information of the type parameter. TiCDC does not contain such information.
In the following example, the table-defining SQL statement contains a parameter for each column, such as the ones for decimal
, char
, varchar
and enum
. By comparing the Canal-JSON formats generated by TiCDC and the official Canal, you can see that TiCDC only contains the basic MySQL information in the mysqlType
field. If you need the full information of the type parameter, you need to implement it by other means.
Assume that the following SQL statements are executed sequentially in the upstream TiDB:
create table t (
id int auto_increment,
c_decimal decimal(10, 4) null,
c_char char(16) null,
c_varchar varchar(16) null,
c_binary binary(16) null,
c_varbinary varbinary(16) null,
c_enum enum('a','b','c') null,
c_set set('a','b','c') null,
c_bit bit(64) null,
constraint pk
primary key (id)
);
insert into t (c_decimal, c_char, c_varchar, c_binary, c_varbinary, c_enum, c_set, c_bit)
values (123.456, "abc", "abc", "abc", "abc", 'a', 'a,b', b'1000001');
The output of TiCDC is as follows:
{
"id": 0,
...
"isDdl": false,
"sqlType": {
...
},
"mysqlType": {
"c_binary": "binary",
"c_bit": "bit",
"c_char": "char",
"c_decimal": "decimal",
"c_enum": "enum",
"c_set": "set",
"c_varbinary": "varbinary",
"c_varchar": "varchar",
"id": "int"
},
"data": [
{
...
}
],
"old": null,
}
The output of the official Canal is as follows:
{
"id": 0,
...
"isDdl": false,
"sqlType": {
...
},
"mysqlType": {
"c_binary": "binary(16)",
"c_bit": "bit(64)",
"c_char": "char(16)",
"c_decimal": "decimal(10, 4)",
"c_enum": "enum('a','b','c')",
"c_set": "set('a','b','c')",
"c_varbinary": "varbinary(16)",
"c_varchar": "varchar(16)",
"id": "int"
},
"data": [
{
...
}
],
"old": null,
}
From v5.4.0, the old
field of the Delete
events has changed.
The following is a Delete
event message. Before v5.4.0, the old
field contains the same content as the "data" field. In v5.4.0 and later versions, the old
field is set to null. You can get the deleted data by using the "data" field.
{
"id": 0,
"database": "test",
...
"type": "DELETE",
...
"sqlType": {
...
},
"mysqlType": {
...
},
"data": [
{
"c_bigint": "9223372036854775807",
"c_int": "0",
"c_mediumint": "8388607",
"c_smallint": "32767",
"c_tinyint": "0",
"id": "2"
}
],
"old": null,
// The following is an example before v5.4.0. The `old` field contains the same content as the "data" field.
"old": [
{
"c_bigint": "9223372036854775807",
"c_int": "0",
"c_mediumint": "8388607",
"c_smallint": "32767",
"c_tinyint": "0",
"id": "2"
}
]
}