doc: batch processing overview and pipelined dml #19818

qiancai · 2024-12-26T10:19:06Z

First-time contributors' checklist

I've signed the Contributor License Agreement, which is required for repo owners to accept my contribution.

What is changed, added or deleted? (Required)

A batch processing overview doc to disambiguate features
A standalone doc for Pipelined DML

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions (in Chinese).

What is the related PR or file link(s)?

This PR is translated from: doc: batch processing overview and pipelined dml docs-cn#19021
Other reference link(s):

Do your changes match any of the following descriptions?

Delete files
Change aliases
Need modification after applied to another branch
Might cause conflicts after applied to another branch

ti-chi-bot · 2024-12-27T02:23:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from qiancai, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


ekexium · 2024-12-27T10:21:20Z

TOC.md

@@ -397,6 +397,7 @@
  - [Use Load Base Split](/configure-load-base-split.md)
  - [Use Store Limit](/configure-store-limit.md)
  - [DDL Execution Principles and Best Practices](/ddl-introduction.md)
+  - [Batch Data Processing](/batch-processing.md)


Batch processing is a more commonly used term, I suppose?

ekexium · 2024-12-27T11:48:54Z

batch-processing.md

+
+- Data import
+    - `IMPORT INTO` statement (introduced in TiDB v7.2.0 and GA in v7.5.0)
+- Data inserts, updates, and deletions


Should this be "inserts" or "insertions"?

ekexium · 2024-12-27T11:52:33Z

batch-processing.md

+
+#### Key benefits
+
+- Streams data to the storage layer during transaction execution instead of caching it entirely in memory, allowing transaction size no longer limited by TiDB memory and supporting ultra-large-scale data processing


Suggested change

-- Streams data to the storage layer during transaction execution instead of caching it entirely in memory, allowing transaction size no longer limited by TiDB memory and supporting ultra-large-scale data processing
+- Streams data to the storage layer during transaction execution instead of buffering it entirely in memory, allowing transaction size no longer limited by TiDB memory and supporting ultra-large-scale data processing

ekexium · 2024-12-27T11:54:04Z

pipelined-dml.md.md

+
+## Overview
+
+Pipelined DML is an experimental feature introduced in TiDB v8.0.0 to improve the performance of large-scale data write operations. When this feature is enabled, TiDB streams data directly to the storage layer during DML operations, instead of caching it entirely in memory. This pipeline-like approach simultaneously reads data (input) and writes it to the storage layer (output), effectively resolving common challenges in large-scale DML operations as follows:


Suggested change

-Pipelined DML is an experimental feature introduced in TiDB v8.0.0 to improve the performance of large-scale data write operations. When this feature is enabled, TiDB streams data directly to the storage layer during DML operations, instead of caching it entirely in memory. This pipeline-like approach simultaneously reads data (input) and writes it to the storage layer (output), effectively resolving common challenges in large-scale DML operations as follows:
+Pipelined DML is an experimental feature introduced in TiDB v8.0.0 to improve the performance of large-scale data write operations. When this feature is enabled, TiDB streams data directly to the storage layer during DML operations, instead of buffering it entirely in memory. This pipeline-like approach simultaneously reads data (input) and writes it to the storage layer (output), effectively resolving common challenges in large-scale DML operations as follows:
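
For readers of this thread, the difference between buffering a whole transaction and streaming it can be illustrated with a minimal Python sketch. This is a conceptual model only, not TiDB's implementation: `FakeStore`, `buffered_write`, `pipelined_write`, and the `flush_rows` threshold are all hypothetical names chosen for illustration, and the sketch flushes sequentially rather than overlapping reads and writes as the doc text describes.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]


class FakeStore:
    """Hypothetical stand-in for the storage layer; records the size of each write."""

    def __init__(self) -> None:
        self.flushes: List[int] = []

    def write(self, rows: List[Row]) -> None:
        self.flushes.append(len(rows))


def buffered_write(rows: Iterable[Row], store: FakeStore) -> int:
    """Hold every row in memory, then write once. Returns the peak number of rows held."""
    buffer = list(rows)
    store.write(buffer)
    return len(buffer)


def pipelined_write(rows: Iterable[Row], store: FakeStore, flush_rows: int = 1000) -> int:
    """Write to the store whenever the buffer reaches flush_rows (an assumed threshold).

    Returns the peak number of rows held, which never exceeds flush_rows.
    """
    buffer: List[Row] = []
    peak = 0
    for row in rows:
        buffer.append(row)
        peak = max(peak, len(buffer))
        if len(buffer) >= flush_rows:
            store.write(buffer)
            buffer = []
    if buffer:  # write whatever is left after the input is exhausted
        store.write(buffer)
    return peak


def generate_rows(n: int) -> Iterator[Row]:
    """Yield n rows lazily so that the input itself is never fully materialized."""
    for i in range(n):
        yield {"id": i}
```

In this model the buffered path holds all rows at once, while the streaming path holds at most `flush_rows` rows regardless of input size, which is the property the "no longer limited by TiDB memory" wording refers to.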

ekexium · 2024-12-27T11:57:24Z

pipelined-dml.md.md

+
+- Memory limits: traditional DML operations might encounter out-of-memory (OOM) errors when handling large datasets.
+- Performance bottlenecks: large transactions are often inefficient and are prone to causing workload fluctuations.
+- Operational limits: TiDB memory limits make it difficult to execute ultra-large data processing tasks.


How about removing this point (operational limits), since it duplicates the first point? The same applies to the Chinese doc.

ekexium · 2024-12-27T12:00:45Z

pipelined-dml.md.md

+- Check the [`tidb_last_txn_info`](/system-variables.md#tidb_last_txn_info-new-in-v409) system variable to get information about the last transaction executed in the current session, including whether Pipelined DML was used.
+- Look for lines containing `"[pipelined dml]"` in TiDB logs to understand the execution process and progress of Pipelined DML, including the current stage and the amount of data written.
+- View the `affected rows` field in the [`expensive query`](/identify-expensive-queries.md#expensive-query-log-example) logs to track the progress of long-running statements.
+- Query the [`INFORMATION_SCHEMA.PROCESSLIST`](/information-schema/information-schema-processlist.md) table to view transaction execution progress. Pipelined DML is typically used for large transactions, so you can use this table monitor their execution progress.


Suggested change

-- Query the [`INFORMATION_SCHEMA.PROCESSLIST`](/information-schema/information-schema-processlist.md) table to view transaction execution progress. Pipelined DML is typically used for large transactions, so you can use this table monitor their execution progress.
+- Query the [`INFORMATION_SCHEMA.PROCESSLIST`](/information-schema/information-schema-processlist.md) table to view transaction execution progress. Pipelined DML is typically used for large transactions, so you can use this table to monitor their execution progress.
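
As an illustration of the log-based check in the list above, the following Python sketch filters log lines for the `[pipelined dml]` marker quoted in the doc. Only the marker string comes from the doc text; the function name is hypothetical, and the sketch makes no assumption about the rest of the TiDB log line format.

```python
from typing import Iterable, List

# Marker string taken from the doc text under review.
MARKER = "[pipelined dml]"


def pipelined_dml_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain the Pipelined DML marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if MARKER in line]
```

An open file object can be passed directly as `lines`, for example `pipelined_dml_lines(open(path))`, where `path` is wherever the TiDB log is stored in a given deployment.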

Add temp.md

c0080ee

Delete temp.md

c901e79

ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Dec 26, 2024

qiancai marked this pull request as draft December 26, 2024 10:19

ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 26, 2024

Create batch-processing.md

74fcca5

ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 27, 2024

qiancai added 2 commits December 27, 2024 16:42

add pipelined dml

57cbb3c

Update system-variables.md

874e45e

qiancai commented Dec 27, 2024

pipelined-dml.md.md (outdated)

Update pipelined-dml.md.md

4e031cb

qiancai requested a review from ekexium December 27, 2024 08:56

qiancai self-assigned this Dec 27, 2024

qiancai commented Dec 27, 2024

pipelined-dml.md.md (outdated)

Update pipelined-dml.md.md

e779ee7

github-actions bot pushed a commit to qiancai/pingcap-docsite-preview that referenced this pull request Dec 27, 2024

Preview PR pingcap/docs#19818 and this preview is triggered from commit

adfdf27

pingcap/docs@e779ee7

qiancai added 2 commits December 27, 2024 17:25

Update batch-processing.md

a5ab9e1

Merge branch 'batch-processing-overview-19021' of https://github.com/…

2e33e21

…qiancai/docs into batch-processing-overview-19021

github-actions bot pushed a commit to qiancai/pingcap-docsite-preview that referenced this pull request Dec 27, 2024

Preview PR pingcap/docs#19818 and this preview is triggered from commit

346bc18

pingcap/docs@2e33e21

ekexium reviewed Dec 27, 2024
