doc: batch processing overview and pipelined dml #19818

Draft
wants to merge 9 commits into
base: master

qiancai
Collaborator

@qiancai qiancai commented Dec 26, 2024

First-time contributors' checklist

What is changed, added or deleted? (Required)

  1. A batch processing overview doc to disambiguate features
  2. A standalone doc for Pipelined DML

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions (in Chinese).

  • master (the latest development version)
  • v8.5 (TiDB 8.5 versions)
  • v8.4 (TiDB 8.4 versions)
  • v8.3 (TiDB 8.3 versions)
  • v8.2 (TiDB 8.2 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)
  • v5.3 (TiDB 5.3 versions)

What is the related PR or file link(s)?

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after applied to another branch
  • Might cause conflicts after applied to another branch

@qiancai qiancai added area/transaction Indicates that the Issue or PR belongs to the area of transaction. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. translation/from-docs-cn This PR is translated from a PR in pingcap/docs-cn. type/enhancement The issue or PR belongs to an enhancement. v8.5 labels Dec 26, 2024
@ti-chi-bot ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Dec 26, 2024
@qiancai qiancai marked this pull request as draft December 26, 2024 10:19
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 26, 2024

ti-chi-bot bot commented Dec 27, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from qiancai, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 27, 2024
pipelined-dml.md.md: review thread (outdated, resolved)
@qiancai qiancai requested a review from ekexium December 27, 2024 08:56
@qiancai qiancai self-assigned this Dec 27, 2024
pipelined-dml.md.md: review thread (outdated, resolved)
github-actions bot pushed a commit to qiancai/pingcap-docsite-preview that referenced this pull request Dec 27, 2024
github-actions bot pushed a commit to qiancai/pingcap-docsite-preview that referenced this pull request Dec 27, 2024
@@ -397,6 +397,7 @@
- [Use Load Base Split](/configure-load-base-split.md)
- [Use Store Limit](/configure-store-limit.md)
- [DDL Execution Principles and Best Practices](/ddl-introduction.md)
- [Batch Data Processing](/batch-processing.md)
Contributor

Batch processing is a more commonly used term, I suppose?


- Data import
- `IMPORT INTO` statement (introduced in TiDB v7.2.0 and GA in v7.5.0)
- Data inserts, updates, and deletions
Contributor

inserts or insertions?


#### Key benefits

- Streams data to the storage layer during transaction execution instead of caching it entirely in memory, allowing transaction size no longer limited by TiDB memory and supporting ultra-large-scale data processing
Contributor

Suggested change
Before:
- Streams data to the storage layer during transaction execution instead of caching it entirely in memory, allowing transaction size no longer limited by TiDB memory and supporting ultra-large-scale data processing
After:
- Streams data to the storage layer during transaction execution instead of buffering it entirely in memory, allowing transaction size no longer limited by TiDB memory and supporting ultra-large-scale data processing


## Overview

Pipelined DML is an experimental feature introduced in TiDB v8.0.0 to improve the performance of large-scale data write operations. When this feature is enabled, TiDB streams data directly to the storage layer during DML operations, instead of caching it entirely in memory. This pipeline-like approach simultaneously reads data (input) and writes it to the storage layer (output), effectively resolving common challenges in large-scale DML operations as follows:
Contributor

Suggested change
Before:
Pipelined DML is an experimental feature introduced in TiDB v8.0.0 to improve the performance of large-scale data write operations. When this feature is enabled, TiDB streams data directly to the storage layer during DML operations, instead of caching it entirely in memory. This pipeline-like approach simultaneously reads data (input) and writes it to the storage layer (output), effectively resolving common challenges in large-scale DML operations as follows:
After:
Pipelined DML is an experimental feature introduced in TiDB v8.0.0 to improve the performance of large-scale data write operations. When this feature is enabled, TiDB streams data directly to the storage layer during DML operations, instead of buffering it entirely in memory. This pipeline-like approach simultaneously reads data (input) and writes it to the storage layer (output), effectively resolving common challenges in large-scale DML operations as follows:


- Memory limits: traditional DML operations might encounter out-of-memory (OOM) errors when handling large datasets.
- Performance bottlenecks: large transactions are often inefficient and are prone to causing workload fluctuations.
- Operational limits: TiDB memory limits make it difficult to execute ultra-large data processing tasks.
Contributor

How about removing this point as it's a duplicate of the 1st point? Also for the Chinese doc.
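
For readers following this thread, the memory point in the overview can be illustrated with a toy Python model that contrasts buffering a whole transaction in memory with flushing it in fixed-size batches while the input is still being read. This is only a sketch under stated assumptions: `buffered_write`, `pipelined_write`, the `flush_threshold` parameter, and the plain list standing in for the storage layer are all hypothetical names for illustration and do not reflect TiDB's actual implementation or any TiDB API.

```python
from typing import Iterable, List


def buffered_write(rows: Iterable[int], storage: List[int]) -> int:
    """Hold every row in memory, then write once. Returns the peak buffer size."""
    buffer = list(rows)      # the whole transaction sits in memory
    peak = len(buffer)
    storage.extend(buffer)   # single write at the end
    return peak


def pipelined_write(
    rows: Iterable[int], storage: List[int], flush_threshold: int = 1000
) -> int:
    """Flush to storage whenever the buffer reaches flush_threshold.

    Returns the peak buffer size, which is bounded by flush_threshold
    (a hypothetical knob, not a TiDB setting).
    """
    buffer: List[int] = []
    peak = 0
    for row in rows:
        buffer.append(row)
        peak = max(peak, len(buffer))
        if len(buffer) >= flush_threshold:
            storage.extend(buffer)   # write while still reading input
            buffer.clear()
    storage.extend(buffer)           # flush whatever remains
    return peak


if __name__ == "__main__":
    all_at_once: List[int] = []
    streamed: List[int] = []
    print("buffered peak:", buffered_write(range(10_000), all_at_once))
    print("pipelined peak:", pipelined_write(range(10_000), streamed, 1000))
```

In this model both functions store the same rows, but the peak buffer size of the pipelined variant depends on the flush threshold instead of on the transaction size, which is the property the memory-limit bullet describes.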

- Check the [`tidb_last_txn_info`](/system-variables.md#tidb_last_txn_info-new-in-v409) system variable to get information about the last transaction executed in the current session, including whether Pipelined DML was used.
- Look for lines containing `"[pipelined dml]"` in TiDB logs to understand the execution process and progress of Pipelined DML, including the current stage and the amount of data written.
- View the `affected rows` field in the [`expensive query`](/identify-expensive-queries.md#expensive-query-log-example) logs to track the progress of long-running statements.
- Query the [`INFORMATION_SCHEMA.PROCESSLIST`](/information-schema/information-schema-processlist.md) table to view transaction execution progress. Pipelined DML is typically used for large transactions, so you can use this table monitor their execution progress.
Contributor

Suggested change
Before:
- Query the [`INFORMATION_SCHEMA.PROCESSLIST`](/information-schema/information-schema-processlist.md) table to view transaction execution progress. Pipelined DML is typically used for large transactions, so you can use this table monitor their execution progress.
After:
- Query the [`INFORMATION_SCHEMA.PROCESSLIST`](/information-schema/information-schema-processlist.md) table to view transaction execution progress. Pipelined DML is typically used for large transactions, so you can use this table to monitor their execution progress.
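
The log-based check in the list above (looking for lines containing `"[pipelined dml]"`) could be scripted roughly as follows. This is a minimal sketch: the helper name `pipelined_dml_lines` and the sample lines are hypothetical, and the real TiDB log layout is not shown in this PR, so only the substring match itself comes from the doc text.

```python
from typing import Iterable, List

# The marker the doc says to look for in TiDB logs.
PIPELINED_DML_TAG = "[pipelined dml]"


def pipelined_dml_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the Pipelined DML marker."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if PIPELINED_DML_TAG in line
    ]


if __name__ == "__main__":
    # Hypothetical sample lines, not real TiDB log output.
    sample = [
        "line one without the marker\n",
        "line two [pipelined dml] with the marker\n",
    ]
    for match in pipelined_dml_lines(sample):
        print(match)
```

Because the function accepts any iterable of strings, an open file object can be passed in directly, so a large log is scanned line by line without loading it all at once.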

Labels

  • area/transaction: Indicates that the Issue or PR belongs to the area of transaction.
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • needs-cherry-pick-release-8.5: Should cherry pick this PR to release-8.5 branch.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
  • translation/from-docs-cn: This PR is translated from a PR in pingcap/docs-cn.
  • type/enhancement: The issue or PR belongs to an enhancement.
  • v8.5
2 participants