Skip to content

Latest commit

 

History

History
448 lines (327 loc) · 20.3 KB

tidb-lightning-data-source.md

File metadata and controls

448 lines (327 loc) · 20.3 KB
title summary aliases
TiDB Lightning Data Sources
Learn all the data sources supported by TiDB Lightning.
/docs/dev/tidb-lightning/migrate-from-csv-using-tidb-lightning/
/docs/dev/reference/tools/tidb-lightning/csv/
/tidb/dev/migrate-from-csv-using-tidb-lightning/

TiDB Lightning Data Sources

TiDB Lightning supports importing data from multiple data sources to TiDB clusters, including CSV, SQL, and Parquet files.

To specify the data source for TiDB Lightning, use the following configuration:

[mydumper]
# Local source data directory or the URI of the external storage such as S3. For more information about the URI of the external storage, see https://docs.pingcap.com/tidb/v6.6/backup-and-restore-storages#uri-format.
data-source-dir = "/data/my_database"

When TiDB Lightning is running, it looks for all files that match the pattern of data-source-dir.

File Type Pattern
Schema file Contains the CREATE TABLE DDL statement ${db_name}.${table_name}-schema.sql
Schema file Contains the CREATE DATABASE DDL statement ${db_name}-schema-create.sql
Data file If the data file contains data for a whole table, the file is imported into a table named ${db_name}.${table_name} ${db_name}.${table_name}.${csv|sql|parquet}
Data file If the data for a table is split into multiple data files, each data file must be suffixed with a number in its filename ${db_name}.${table_name}.001.${csv|sql|parquet}
Compressed file If the file contains a compression suffix, such as gzip, snappy, or zstd, TiDB Lightning will decompress the file before importing it. Note that the Snappy compressed file must be in the official Snappy format. Other variants of Snappy compression are not supported. ${db_name}.${table_name}.${csv|sql|parquet}.{compress}

TiDB Lightning processes data in parallel as much as possible. Because files must be read in sequence, the data processing concurrency is at the file level (controlled by region-concurrency). Therefore, when the imported file is large, the import performance is poor. It is recommended to limit the size of the imported file to no greater than 256 MiB to achieve the best performance.

Rename databases and tables

TiDB Lightning follows filename patterns to import data to the corresponding database and table. If the database or table names change, you can either rename the files and then import them, or use regular expressions to replace the names online.

Rename files in batch

If you are using Red Hat Linux or a distribution based on Red Hat Linux, you can use the rename command to batch rename files in the data-source-dir directory.

For example:

rename srcdb. tgtdb. *.sql

After you modify the database name, it is recommended that you delete the ${db_name}-schema-create.sql file that contains the CREATE DATABASE DDL statement from the data-source-dir directory. If you want to modify the table name as well, you also need to modify the table name in the ${db_name}.${table_name}-schema.sql file that contains the CREATE TABLE DDL statement.

Use regular expressions to replace names online

To use regular expressions to replace names online, you can use the pattern configuration within [[mydumper.files]] to match filenames, and replace schema and table with your desired names. For more information, see Match customized files.

The following is an example of using regular expressions to replace names online. In this example:

  • The match rule for the data file pattern is ^({schema_regrex})\.({table_regrex})\.({file_serial_regrex})\.(csv|parquet|sql).
  • Specify schema as '$1', which means that the value of the first regular expression schema_regrex remains unchanged. Or specify schema as a string, such as 'tgtdb', which means a fixed target database name.
  • Specify table as '$2', which means that the value of the second regular expression table_regrex remains unchanged. Or specify table as a string, such as 't1', which means a fixed target table name.
  • Specify type as '$3', which means the data file type. You can specify type as either "table-schema" (representing the schema.sql file) or "schema-schema" (representing the schema-create.sql file).
[mydumper]
data-source-dir = "/some-subdir/some-database/"
[[mydumper.files]]
pattern = '^(srcdb)\.(.*?)-schema-create\.sql'
schema = 'tgtdb'
type = "schema-schema"
[[mydumper.files]]
pattern = '^(srcdb)\.(.*?)-schema\.sql'
schema = 'tgtdb'
table = '$2'
type = "table-schema"
[[mydumper.files]]
pattern = '^(srcdb)\.(.*?)\.(?:[0-9]+)\.(csv|parquet|sql)'
schema = 'tgtdb'
table = '$2'
type = '$3'

If you are using gzip to back up data files, you need to configure the compression format accordingly. The matching rule of the data file pattern is '^({schema_regrex})\.({table_regrex})\.({file_serial_regrex})\.(csv|parquet|sql)\.(gz)'. You can specify compression as '$4' to represent the compressed file format. For example:

[mydumper]
data-source-dir = "/some-subdir/some-database/"
[[mydumper.files]]
pattern = '^(srcdb)\.(.*?)-schema-create\.(sql)\.(gz)'
schema = 'tgtdb'
type = "schema-schema"
compression = '$4'
[[mydumper.files]]
pattern = '^(srcdb)\.(.*?)-schema\.(sql)\.(gz)'
schema = 'tgtdb'
table = '$2'
type = "table-schema"
compression = '$4'
[[mydumper.files]]
pattern = '^(srcdb)\.(.*?)\.(?:[0-9]+)\.(sql)\.(gz)'
schema = 'tgtdb'
table = '$2'
type = '$3'
compression = '$4'

CSV

Schema

CSV files are schema-less. To import CSV files into TiDB, you must provide a table schema. You can provide schema by either of the following methods:

  • Create files named ${db_name}.${table_name}-schema.sql and ${db_name}-schema-create.sql that contain DDL statements.
  • Manually create the table schema in TiDB.

Configuration

You can configure the CSV format in the [mydumper.csv] section in the tidb-lightning.toml file. Most settings have a corresponding option in the LOAD DATA statement of MySQL.

[mydumper.csv]
# The field separator. Can be one or multiple characters. The default is ','.
# If the data might contain commas, it is recommended to use '|+|' or other uncommon
# character combinations as a separator.
separator = ','
# Quoting delimiter. Empty value means no quoting.
delimiter = '"'
# Line terminator. Can be one or multiple characters. Empty value (default) means
# both "\n" (LF) and "\r\n" (CRLF) are line terminators.
terminator = ''
# Whether the CSV file contains a header.
# If `header` is true, the first line is skipped and mapped
# to the table columns.
header = true
# Whether the CSV file contains any NULL value.
# If `not-null` is true, all columns from CSV cannot be parsed as NULL.
not-null = false
# When `not-null` is false (that is, CSV can contain NULL),
# fields equal to this value will be treated as NULL.
null = '\N'
# Whether to parse backslash as escape character.
backslash-escape = true
# Whether to treat `separator` as the line terminator and trim all trailing separators.
trim-last-separator = false

If the input of a string field such as separator, delimiter, or terminator involves special characters, you can use a backslash to escape the special characters. The escape sequence must be a double-quoted string ("…"). For example, separator = "\u001f" means using the ASCII character 0X1F as the separator.

You can use single-quoted strings ('…') to suppress backslash escaping. For example, terminator = '\n' means using the two-character string, a backslash (\) followed by the letter n, as the terminator, rather than the LF \n.

For more details, see the TOML v1.0.0 specification.

separator

  • Defines the field separator.

  • Can be one or multiple characters, but must not be empty.

  • Common values:

    • ',' for CSV (comma-separated values).
    • "\t" for TSV (tab-separated values).
    • "\u0001" to use the ASCII character 0x01.
  • Corresponds to the FIELDS TERMINATED BY option in the LOAD DATA statement.

delimiter

  • Defines the delimiter used for quoting.

  • If delimiter is empty, all fields are unquoted.

  • Common values:

    • '"' quotes fields with double-quote. The same as RFC 4180.
    • '' disables quoting.
  • Corresponds to the FIELDS ENCLOSED BY option in the LOAD DATA statement.

terminator

  • Defines the line terminator.
  • If terminator is empty, both "\n" (Line Feed) and "\r\n" (Carriage Return + Line Feed) are used as the line terminator.
  • Corresponds to the LINES TERMINATED BY option in the LOAD DATA statement.

header

  • Whether all CSV files contain a header row.
  • If header is true, the first row is used as the column names. If header is false, the first row is treated as an ordinary data row.

not-null and null

  • The not-null setting controls whether all fields are non-nullable.

  • If not-null is false, the string specified by null is transformed to the SQL NULL instead of a specific value.

  • Quoting does not affect whether a field is null.

    For example, in the following CSV file:

    A,B,C
    \N,"\N",
    

    In the default settings (not-null = false; null = '\N'), the columns A and B are both converted to NULL after being imported to TiDB. The column C is an empty string '' but not NULL.

backslash-escape

  • Whether to parse backslash inside fields as escape characters.

  • If backslash-escape is true, the following sequences are recognized and converted:

    Sequence Converted to
    \0 Null character (U+0000)
    \b Backspace (U+0008)
    \n Line feed (U+000A)
    \r Carriage return (U+000D)
    \t Tab (U+0009)
    \Z Windows EOF (U+001A)

    In all other cases (for example, \"), the backslash is stripped, leaving the next character (") in the field. The character left has no special roles (for example, delimiters) and is just an ordinary character.

  • Quoting does not affect whether backslash is parsed as an escape character.

  • Corresponds to the FIELDS ESCAPED BY '\' option in the LOAD DATA statement.

trim-last-separator

  • Whether to treat separator as the line terminator and trim all trailing separators.

    For example, in the following CSV file:

    A,,B,,
    
    • When trim-last-separator = false, this is interpreted as a row of 5 fields ('A', '', 'B', '', '').
    • When trim-last-separator = true, this is interpreted as a row of 3 fields ('A', '', 'B').
  • This option is deprecated. Use the terminator option instead.

    If your existing configuration is:

    separator = ','
    trim-last-separator = true

    It is recommended to change the configuration to:

    separator = ','
    terminator = ",\n" # Use ",\n" or ",'\r\n" according to your actual file.

Non-configurable options

TiDB Lightning does not support every option supported by the LOAD DATA statement. For example:

  • There cannot be line prefixes (LINES STARTING BY).
  • The header cannot be skipped (IGNORE n LINES) and must be valid column names.

Strict format

TiDB Lightning works best when the input files have a uniform size of around 256 MiB. When the input is a single huge CSV file, TiDB Lightning can only process the file in one thread, which slows down the import speed.

This can be fixed by splitting the CSV into multiple files first. For the generic CSV format, there is no way to quickly identify where a row starts or ends without reading the whole file. Therefore, TiDB Lightning by default does not automatically split a CSV file. However, if you are certain that the CSV input adheres to certain restrictions, you can enable the strict-format setting to allow TiDB Lightning to split the file into multiple 256 MiB-sized chunks for parallel processing.

[mydumper]
strict-format = true

In a strict CSV file, every field occupies only a single line. In other words, one of the following must be true:

  • Delimiter is empty.
  • Every field does not contain the terminator itself. In the default configuration, this means every field does not contain CR (\r) or LF (\n).

If a CSV file is not strict, but strict-format is wrongly set to true, a field spanning multiple lines may be cut in half into two chunks, causing parse failure, or even quietly importing corrupted data.

Common configuration examples

CSV

The default setting is already tuned for CSV following RFC 4180.

[mydumper.csv]
separator = ',' # If the data might contain a comma (','), it is recommended to use '|+|' or other uncommon character combinations as the separator.
delimiter = '"'
header = true
not-null = false
null = '\N'
backslash-escape = true

Example content:

ID,Region,Count
1,"East",32
2,"South",\N
3,"West",10
4,"North",39

TSV

[mydumper.csv]
separator = "\t"
delimiter = ''
header = true
not-null = false
null = 'NULL'
backslash-escape = false

Example content:

ID    Region    Count
1     East      32
2     South     NULL
3     West      10
4     North     39

TPC-H DBGEN

[mydumper.csv]
separator = '|'
delimiter = ''
terminator = "|\n"
header = false
not-null = true
backslash-escape = false

Example content:

1|East|32|
2|South|0|
3|West|10|
4|North|39|

SQL

When TiDB Lightning processes a SQL file, because TiDB Lightning cannot quickly split a single SQL file, it cannot improve the import speed of a single file by increasing concurrency. Therefore, when you import data from SQL files, avoid a single huge SQL file. TiDB Lightning works best when the input files have a uniform size of around 256 MiB.

Parquet

TiDB Lightning currently only supports Parquet files generated by Amazon Aurora or Apache Hive. To identify the file structure in S3, use the following configuration to match all data files:

[[mydumper.files]]
# The expression needed for parsing Amazon Aurora parquet files
pattern = '(?i)^(?:[^/]*/)*([a-z0-9\-_]+).([a-z0-9\-_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$'
schema = '$1'
table = '$2'
type = '$3'

Note that this configuration only shows how to match the parquet files exported by Aurora snapshot. You need to export and process the schema file separately.

For more information on mydumper.files, refer to Match customized file.

Compressed files

TiDB Lightning currently supports compressed files exported by Dumpling or compressed files that follow the naming rules. Currently, TiDB Lightning supports the following compression algorithms: gzip, snappy, and zstd. When the file name follows the naming rules, TiDB Lightning automatically identifies the compression algorithm and imports the file after streaming decompression, without additional configuration.

Note:

  • Because TiDB Lightning cannot concurrently decompress a single large compressed file, the size of the compressed file affects the import speed. It is recommended that a source file is no greater than 256 MiB after decompression.
  • TiDB Lightning only imports individually compressed data files and does not support importing a single compressed file with multiple data files included.
  • TiDB Lightning does not support parquet files compressed through another compression tool, such as db.table.parquet.snappy. If you want to compress parquet files, you can configure the compression format for the parquet file writer.
  • TiDB Lightning v6.4.0 and later versions only support the following compressed data files: gzip, snappy, and zstd. Other types of files cause errors. If an unsupported compressed file exists in the directory where the source data file is stored, this will cause the task to report an error. You can move those unsupported files out of the import data directory to avoid such errors.
  • The Snappy compressed file must be in the official Snappy format. Other variants of Snappy compression are not supported.

Match customized files

TiDB Lightning only recognizes data files that follow the naming pattern. In some cases, your data file might not follow the naming pattern, and thus data import is completed in a short time without importing any file.

To resolve this issue, you can use [[mydumper.files]] to match data files in your customized expression.

Take the Aurora snapshot exported to S3 as an example. The complete path of the Parquet file is S3://some-bucket/some-subdir/some-database/some-database.some-table/part-00000-c5a881bb-58ff-4ee6-1111-b41ecff340a3-c000.gz.parquet.

Usually, data-source-dir is set to S3://some-bucket/some-subdir/some-database/ to import the some-database database.

Based on the preceding Parquet file path, you can write a regular expression like (?i)^(?:[^/]*/)*([a-z0-9\-_]+).([a-z0-9\-_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$ to match the files. In the match group, index=1 is some-database, index=2 is some-table, and index=3 is parquet.

You can write the configuration file according to the regular expression and the corresponding index so that TiDB Lightning can recognize the data files that do not follow the default naming convention. For example:

[[mydumper.files]]
# The expression needed for parsing the Amazon Aurora parquet file
pattern = '(?i)^(?:[^/]*/)*([a-z0-9\-_]+).([a-z0-9\-_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$'
schema = '$1'
table = '$2'
type = '$3'
  • schema: The name of the target database. The value can be:
    • The group index obtained by using a regular expression, such as $1.
    • The name of the database that you want to import, such as db1. All matched files are imported into db1.
  • table: The name of the target table. The value can be:
    • The group index obtained by using a regular expression, such as $2.
    • The name of the table that you want to import, such as table1. All matched files are imported into table1.
  • type: The file type. Supports sql, parquet, and csv. The value can be:
    • The group index obtained by using a regular expression, such as $3.
  • key: The file number, such as 001 in ${db_name}.${table_name}.001.csv.
    • The group index obtained by using a regular expression, such as $4.

Import data from Amazon S3

The following examples show how to import data from Amazon S3 using TiDB Lightning. For more parameter configurations, see URI Formats of External Storage Services.

  • Use the locally configured permissions to access S3 data:

    ./tidb-lightning --tidb-port=4000 --pd-urls=127.0.0.1:2379 --backend=local --sorted-kv-dir=/tmp/sorted-kvs \
        -d 's3://my-bucket/sql-backup'
  • Use the path-style request to access S3 data:

    ./tidb-lightning --tidb-port=4000 --pd-urls=127.0.0.1:2379 --backend=local --sorted-kv-dir=/tmp/sorted-kvs \
        -d 's3://my-bucket/sql-backup?force-path-style=true&endpoint=http://10.154.10.132:8088'
  • Use a specific AWS IAM role ARN to access S3 data:

    ./tidb-lightning --tidb-port=4000 --pd-urls=127.0.0.1:2379 --backend=local --sorted-kv-dir=/tmp/sorted-kvs \
        -d 's3://my-bucket/test-data?role-arn=arn:aws:iam::888888888888:role/my-role'
  • Use access keys of an AWS IAM user to access S3 data:

    ./tidb-lightning --tidb-port=4000 --pd-urls=127.0.0.1:2379 --backend=local --sorted-kv-dir=/tmp/sorted-kvs \
        -d 's3://my-bucket/test-data?access_key={my_access_key}&secret_access_key={my_secret_access_key}'
  • Use the combination of AWS IAM role access keys and session tokens to access S3 data:

    ./tidb-lightning --tidb-port=4000 --pd-urls=127.0.0.1:2379 --backend=local --sorted-kv-dir=/tmp/sorted-kvs \
        -d 's3://my-bucket/test-data?access_key={my_access_key}&secret_access_key={my_secret_access_key}&session-token={my_session_token}'

More resources