diff --git a/TOC-tidb-cloud-lake.md b/TOC-tidb-cloud-lake.md index 2c7157f3df310..9b0cdab9f656b 100644 --- a/TOC-tidb-cloud-lake.md +++ b/TOC-tidb-cloud-lake.md @@ -23,10 +23,20 @@ - Security - [Authenticate with AWS IAM Role](/tidb-cloud-lake/guides/authenticate-with-aws-iam-role.md) - [Connect with AWS PrivateLink](/tidb-cloud-lake/guides/connect-with-aws-privatelink.md) -- Integrations - - [Data Integration Overview](/tidb-cloud-lake/guides/data-integration-overview.md) - - [Integrate with MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) - - [Integrate with Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) +- Data Integration + - [Overview](/tidb-cloud-lake/guides/data-integration-overview.md) + - Data Sources + - [Overview](/tidb-cloud-lake/guides/data-sources.md) + - [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md) + - [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md) + - [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md) + - [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md) + - Integration Tasks + - [Overview](/tidb-cloud-lake/guides/integration-tasks.md) + - [Task Management](/tidb-cloud-lake/guides/task-management.md) + - [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) + - [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md) + - [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md) - Connect - [Overview](/tidb-cloud-lake/guides/connection-overview.md) - SQL Clients diff --git a/media/tidb-cloud-lake/dataintegration-run-history-page.png b/media/tidb-cloud-lake/dataintegration-run-history-page.png deleted file mode 100644 index a988add36ffda..0000000000000 Binary files a/media/tidb-cloud-lake/dataintegration-run-history-page.png and /dev/null differ diff --git a/media/tidb-cloud-lake/feishubot-example.png b/media/tidb-cloud-lake/feishubot-example.png new file mode 100644 index 
0000000000000..2c42c2433e0e9 Binary files /dev/null and b/media/tidb-cloud-lake/feishubot-example.png differ diff --git a/tidb-cloud-lake/guides/aws-credentials.md b/tidb-cloud-lake/guides/aws-credentials.md new file mode 100644 index 0000000000000..644563777a1b6 --- /dev/null +++ b/tidb-cloud-lake/guides/aws-credentials.md @@ -0,0 +1,35 @@ +--- +title: AWS - Credentials +summary: This page describes how to create an "AWS - Credentials" data source. This data source stores the credentials required to access Amazon S3 and can be reused across multiple S3 integration tasks. +--- + +# AWS - Credentials + +This page describes how to create an `AWS - Credentials` data source. This data source stores the credentials required to access Amazon S3 and can be reused across multiple S3 integration tasks. + +## Use Cases + +- Manage one set of AWS Access Key and Secret Key credentials for multiple S3 import tasks +- Avoid re-entering the same S3 access credentials in every task +- Update credentials centrally when they are rotated + +## Create AWS - Credentials + +1. Navigate to **Data** > **Data Sources** and click **Create Data Source**. +2. Select **AWS - Credentials** as the service type, then fill in the credentials: + + | Field | Required | Description | + |-------|----------|-------------| + | **Name** | Yes | A descriptive name for this data source | + | **Access Key** | Yes | AWS Access Key ID | + | **Secret Key** | Yes | AWS Secret Access Key | + +3. Click **Test Connectivity** to validate the credentials. If the test succeeds, click **OK** to save the data source. + +## Permission Requirements + +The AWS credentials must have read access to the target S3 bucket. If downstream tasks will enable **Clean Up Original Files**, the credentials must also have write and delete permissions. + +## Next Steps + +After creating this data source, you can use it to create an [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md). 
diff --git a/tidb-cloud-lake/guides/data-integration-overview.md b/tidb-cloud-lake/guides/data-integration-overview.md index 538a8ed57b34d..0f8b64d4a06f1 100644 --- a/tidb-cloud-lake/guides/data-integration-overview.md +++ b/tidb-cloud-lake/guides/data-integration-overview.md @@ -1,74 +1,36 @@ --- -title: Data Integration -summary: The Data Integration feature in {{{ .lake }}} enables you to load data from external sources into {{{ .lake }}} through a visual, no-code interface. You can create data sources, configure integration tasks, and monitor synchronization — all from the {{{ .lake }}} console. +title: Data Integration Overview +summary: The Data Integration feature in {{{ .lake }}} provides a visual, no-code interface for importing or synchronizing data from external systems into {{{ .lake }}}. --- -# Data Integration +# Data Integration Overview -The Data Integration feature in {{{ .lake }}} enables you to load data from external sources into {{{ .lake }}} through a visual, no-code interface. You can create data sources, configure integration tasks, and monitor synchronization — all from the {{{ .lake }}} console. - -## Supported Data Sources - -| Data Source | Description | -| -------------------- | ---------------------------------------------------------------------------------------- | -| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Sync data from MySQL databases with support for Snapshot, CDC, and Snapshot + CDC modes. | -| [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Import files from Amazon S3 buckets with support for CSV, Parquet, and NDJSON formats. | +The Data Integration feature in {{{ .lake }}} provides a visual, no-code interface for importing or synchronizing data from external systems into {{{ .lake }}}. The feature centers around two key concepts: **data sources** and **integration tasks**. ## Key Concepts -### Data Source - -A data source represents a connection to an external system. 
It stores the credentials and connection details needed to access the source data. Once configured, a data source can be reused across multiple integration tasks. - -{{{ .lake }}} currently supports two types of data sources: - -- **MySQL - Credentials**: Connection to a MySQL database (host, port, username, password, database). -- **AWS - Credentials**: Connection to Amazon S3 (Access Key and Secret Key). - -### Integration Task - -An integration task defines how data flows from a source to a target table in {{{ .lake }}}. Each task specifies the source configuration, target warehouse and table, and operational parameters specific to the data source type. - -## Managing Data Sources - -To manage data sources, navigate to **Data** > **Data Sources** from the left sidebar. From this page you can: - -- View all configured data sources -- Create new data sources -- Edit or delete existing data sources -- Test connectivity to verify credentials - -> **Tip:** -> -> It is recommended to always test the connection before saving a data source. This helps catch common issues such as incorrect credentials or network restrictions early. - -## Managing Tasks - -### Starting and Stopping Tasks - -After creation, a task is in a **Stopped** state. To begin data synchronization, click the **Start** button on the task. - -To stop a running task, click the **Stop** button. The task will gracefully shut down and save its progress. - -### Task Status +| Concept | Description | +|---------|-------------| +| [Data Sources](/tidb-cloud-lake/guides/data-sources.md) | Reusable connection settings or credentials used to access external systems or send notifications, such as AWS Access Key / Secret Key, MySQL hostname / username / password, or a FeiShu bot webhook. 
| +| [Integration Tasks](/tidb-cloud-lake/guides/integration-tasks.md) | Executable tasks that define where data comes from, which {{{ .lake }}} table it is written to, which runtime parameters are used, and how the task is started and monitored. | -The Data Integration page displays all tasks with their current status: +Data sources do not move data by themselves. They only store the information required to access external systems. Integration tasks are the units that actually perform imports, snapshots, and continuous synchronization. -| Status | Description | -| ------- | ----------------------------- | -| Running | Task is actively syncing data | -| Stopped | Task is not running | -| Failed | Task encountered an error | +Not every data source corresponds to an ingestion task. For example, `FeiShuBot` is used for notifications rather than loading source data into {{{ .lake }}}. -### Viewing Run History +## Supported Integration Task Types -Click on a task to view its execution history. The run history includes: +| Task Type | Description | +|-----------|-------------| +| [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Imports CSV, Parquet, or NDJSON files from Amazon S3 with support for one-time or continuous ingestion. | +| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC` modes. | -- Execution start and end times -- Number of rows synced -- Error details (if any) +## Recommended Flow -![Run History](/media/tidb-cloud-lake/dataintegration-run-history-page.png) +1. Create and test reusable connection settings on the [Data Sources](/tidb-cloud-lake/guides/data-sources.md) page. +2. Review supported task types and their use cases on the [Integration Tasks](/tidb-cloud-lake/guides/integration-tasks.md) page. +3. Read the task-specific guide to configure the source, preview the data, and set the target table. +4. 
Use the [Task Management](/tidb-cloud-lake/guides/task-management.md) page to start tasks, check status, and troubleshoot execution issues. ## Video Tour diff --git a/tidb-cloud-lake/guides/data-sources.md b/tidb-cloud-lake/guides/data-sources.md new file mode 100644 index 0000000000000..1ae813f836fb3 --- /dev/null +++ b/tidb-cloud-lake/guides/data-sources.md @@ -0,0 +1,37 @@ +--- +title: Data Sources +summary: A data source in {{{ .lake }}} represents a connection to an external system. It stores the credentials and connection details required to access external systems and can be reused across multiple integration tasks or notification scenarios. +--- + +# Data Sources + +A data source in {{{ .lake }}} represents a connection to an external system. It stores the credentials and connection details required to access external systems and can be reused across multiple integration tasks or notification scenarios. + +Data sources do not execute synchronization by themselves. Their role is to centralize access settings so you do not need to repeatedly enter accounts, passwords, keys, or notification endpoints in every task. + +## Supported Data Source Types + +| Type | Purpose | +|------|---------| +| [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md) | Stores the Access Key and Secret Key required to access Amazon S3. These credentials can be reused across multiple S3 import tasks. | +| [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md) | Stores the host, port, username, password, and database information required to access MySQL. These settings can be reused across multiple MySQL sync tasks. | +| [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md) | Stores a FeiShu bot webhook and message template for task failure notifications and similar scenarios. | + +Not every data source corresponds to an integration task. 
For example, `FeiShuBot` is used for notification configuration, while `AWS - Credentials` and `MySQL - Credentials` are referenced by actual data import or synchronization tasks. + +## Managing Data Sources + +Navigate to **Data** > **Data Sources**. On this page, you can: + +- View all configured data sources +- Create new data sources +- Edit or delete existing data sources +- Test connectivity to validate credentials + +> **Tip:** +> +> Run **Test Connectivity** before saving a data source to catch issues such as invalid credentials, missing permissions, or network restrictions as early as possible. + +## Next Steps + +After creating a data source, you can reference it in an [integration task](/tidb-cloud-lake/guides/integration-tasks.md) or a notification configuration, depending on its purpose. diff --git a/tidb-cloud-lake/guides/feishubot.md b/tidb-cloud-lake/guides/feishubot.md new file mode 100644 index 0000000000000..ba7f57b1b1a61 --- /dev/null +++ b/tidb-cloud-lake/guides/feishubot.md @@ -0,0 +1,113 @@ +--- +title: FeiShuBot +summary: This page describes how to create a "FeiShuBot" data source. This data source stores a FeiShu bot webhook and message template for task failure notifications and similar scenarios. +--- + +# FeiShuBot + +This page describes how to create a `FeiShuBot` data source. This data source stores a FeiShu bot webhook and message template and is typically used for task failure notifications. + +## Use Cases + +- Send notifications to a FeiShu group when a task run fails +- Reuse the same bot configuration and message template across multiple tasks +- Manage the notification endpoint and message format centrally + +## Create FeiShuBot + +1. Navigate to **Data** > **Data Sources** and click **Create Data Source**. +2. Select **FeiShuBot** as the service type, then fill in the fields: + + | Field | Required | Description | + |-------|----------|-------------| + | **Name** | Yes | A descriptive name for this data source. 
Only letters, numbers, and underscores are supported | + | **URL** | Yes | Custom FeiShu bot webhook URL | + | **Warehouse** | Yes | The warehouse used to create the `NOTIFICATION INTEGRATION` | + | **Payload** | Yes | Message payload type. Currently, only `Task Error` is supported | + | **Template** | Yes | Custom message template | + +3. Click **Test Connectivity** to validate the configuration. If the test succeeds, click **OK** to save the data source. + +## Usage + +`FeiShuBot` can be used with the SQL Task `ERROR_INTEGRATION` property, or referenced from the console Task Flow UI through **Error Notification**. + +### Set a SQL Task Property + +Set the Task `ERROR_INTEGRATION` property. In the following example, the data source name is `test_1`: + +```sql +CREATE TASK my_daily_task + WAREHOUSE = 'compute_wh' + SCHEDULE = USING CRON '0 0 9 * * *' 'America/Los_Angeles' + COMMENT = 'Daily summary task' + ERROR_INTEGRATION = 'test_1' +AS + INSERT INTO summary_table SELECT * FROM source_table; +``` + +### Configure It in the Task Flow UI + +On the create or edit page, set **Error Notification** to the corresponding `FeiShuBot` data source. 
+ +### Customize the Task Error Template + +Default template: + +```text +**[ALERT] {{ .MessageType }} - {{ .TaskName }}** +--- +taskId: {{ .TaskId }} +taskName: {{ .TaskName }} +tenantId: {{ .TenantId }} + +Messages: {{ range .Messages }} +- runId: {{ .RunId }} + queryId: {{ .QueryId }} + error: {{ .ErrorKind }} ({{ .ErrorCode }}) + message: {{ .ErrorMessage }} {{ end }} + +--- +{{ .Timestamp }} +``` + +A received message looks similar to this: + +![FeiShu notification example](/media/tidb-cloud-lake/feishubot-example.png) + +Custom templates support: + +- Markdown content +- Golang template syntax + +The following variables are available: + +```golang +type ErrorIntegrationPayload struct { + Version string `json:"version"` + MessageId string `json:"messageId"` + MessageType string `json:"messageType"` + Timestamp time.Time `json:"timestamp"` + TenantId string `json:"tenantId"` + TaskName string `json:"taskName"` + TaskId string `json:"taskId"` + RootTaskName string `json:"rootTaskName"` + RootTaskId string `json:"rootTaskId"` + Messages []*ErrorMessage `json:"messages"` +} + +type ErrorMessage struct { + RunId string `json:"runId"` + ScheduledTime time.Time `json:"scheduledTime"` + QueryStartTime *time.Time `json:"queryStartTime"` + CompletedTime *time.Time `json:"completedTime"` + QueryId string `json:"queryId"` + ErrorKind string `json:"errorKind"` + ErrorCode string `json:"errorCode"` + ErrorMessage string `json:"errorMessage"` +} +``` + +## Notes + +`FeiShuBot` is a notification-oriented data source. It is not used to load business data into {{{ .lake }}}. 
diff --git a/tidb-cloud-lake/guides/integrate-with-amazon-s3.md b/tidb-cloud-lake/guides/integrate-with-amazon-s3.md index c3e6a9526547f..4c470e439e357 100644 --- a/tidb-cloud-lake/guides/integrate-with-amazon-s3.md +++ b/tidb-cloud-lake/guides/integrate-with-amazon-s3.md @@ -1,11 +1,13 @@ --- -title: Amazon S3 +title: Amazon S3 Integration Task summary: The Amazon S3 data integration enables you to import files from S3 buckets into {{{ .lake }}}. It supports CSV, Parquet, and NDJSON file formats, with options for one-time imports or continuous ingestion that automatically polls for new files. --- -# Amazon S3 +# Amazon S3 Integration Task -The Amazon S3 data integration enables you to import files from S3 buckets into {{{ .lake }}}. It supports CSV, Parquet, and NDJSON file formats, with options for one-time imports or continuous ingestion that automatically polls for new files. +This page describes how to create an Amazon S3 integration task that imports files from an S3 bucket into {{{ .lake }}}. CSV, Parquet, and NDJSON file formats are supported, and the task can be configured for one-time import or continuous ingestion. + +If you need to create reusable AWS credentials first, see [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md). ## Supported File Formats @@ -15,23 +17,11 @@ The Amazon S3 data integration enables you to import files from S3 buckets into | Parquet | Columnar storage format, efficient for analytical workloads | | NDJSON | Newline-delimited JSON, one JSON object per line | -## Creating an S3 Data Source - -1. Navigate to **Data** > **Data Sources** and click **Create Data Source**. - -2. 
Select **AWS - Credentials** as the service type, and fill in the credentials: - - | Field | Required | Description | - |----------------|----------|--------------------------------------| - | **Name** | Yes | A descriptive name for this data source | - | **Access Key** | Yes | AWS Access Key ID | - | **Secret Key** | Yes | AWS Secret Access Key | +## Prerequisites -3. Click **Test Connectivity** to verify the credentials. If the test succeeds, click **OK** to save the data source. - -> **Tip:** -> -> The AWS credentials must have read access to the target S3 bucket. If you plan to use the **Clean Up Original Files** option, write and delete permissions are also required. +- An **AWS - Credentials** data source has already been created +- The AWS credentials have read access to the target S3 bucket +- If you plan to enable **Clean Up Original Files**, the credentials also need write and delete permissions ## Creating an S3 Integration Task @@ -41,14 +31,12 @@ The Amazon S3 data integration enables you to import files from S3 buckets into 2. Select an S3 data source, then configure the basic settings: -| Field | Required | Description | -|--------------------|----------|--------------------------------------------------------------------------------------------------| -| **Data Source** | Yes | Select an existing AWS data source from the dropdown | -| **Name** | Yes | A name for this integration task | -| **File Path** | Yes | S3 URI with optional wildcard pattern (e.g., `s3://mybucket/data/2025-*.csv`) | -| **File Type** | Auto | Auto-detected from file extension. 
Supported: CSV, Parquet, NDJSON | - -![Create S3 Task - Basic Info](/media/tidb-cloud-lake/create-s3-task-basic-info.png) + | Field | Required | Description | + |--------------------|----------|--------------------------------------------------------------------------------------------------| + | **Data Source** | Yes | Select an existing **AWS - Credentials** data source from the dropdown | + | **Name** | Yes | A name for this integration task | + | **File Path** | Yes | S3 URI with optional wildcard pattern (e.g., `s3://mybucket/data/2025-*.csv`) | + | **File Type** | Auto | Auto-detected from file extension. Supported: CSV, Parquet, NDJSON | #### CSV Options diff --git a/tidb-cloud-lake/guides/integrate-with-mysql.md b/tidb-cloud-lake/guides/integrate-with-mysql.md index 615efca8ca950..be329038159d3 100644 --- a/tidb-cloud-lake/guides/integrate-with-mysql.md +++ b/tidb-cloud-lake/guides/integrate-with-mysql.md @@ -1,11 +1,13 @@ --- -title: MySQL +title: MySQL Integration Task summary: The MySQL data integration enables you to sync data from MySQL databases into {{{ .lake }}} in real-time, with support for full snapshot loads, continuous Change Data Capture (CDC), or a combination of both. --- -# MySQL +# MySQL Integration Task -The MySQL data integration enables you to sync data from MySQL databases into {{{ .lake }}} in real-time, with support for full snapshot loads, continuous Change Data Capture (CDC), or a combination of both. +This page describes how to create a MySQL integration task that synchronizes data from a MySQL database into {{{ .lake }}}. MySQL tasks support full `Snapshot` loads, continuous `Change Data Capture (CDC)`, or a combination of both. + +If you need to create reusable MySQL connection settings first, see [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md). 
## Sync Modes @@ -19,6 +21,9 @@ The MySQL data integration enables you to sync data from MySQL databases into {{ Before setting up MySQL data integration, ensure your MySQL instance meets the following requirements: +- A **MySQL - Credentials** data source has already been created +- The target MySQL instance is reachable from {{{ .lake }}} + ### Enable Binlog MySQL binlog must be enabled with ROW format for CDC and Snapshot + CDC modes: @@ -47,25 +52,6 @@ FLUSH PRIVILEGES; Ensure the MySQL instance is accessible from {{{ .lake }}}. Check your firewall rules and security groups to allow inbound connections on the MySQL port. -## Creating a MySQL Data Source - -1. Navigate to **Data** > **Data Sources**, and click **Create Data Source**. - -2. Select **MySQL - Credentials** as the service type, and fill in the connection details: - - | Field | Required | Description | - |-----------------|----------|-----------------------------------------------------------------------------| - | **Name** | Yes | A descriptive name for this data source | - | **Hostname** | Yes | MySQL server hostname or IP address | - | **Port Number** | Yes | MySQL server port (default: 3306) | - | **DB Username** | Yes | MySQL user with replication permissions | - | **DB Password** | Yes | Password for the MySQL user | - | **Database Name** | Yes | The source database name | - | **DB Charset** | No | Character set (default: utf8mb4) | - | **Server ID** | No | Unique binlog replication identifier. Auto-generated if not provided | - -3. Click **Test Connectivity** to verify the connection. If the test succeeds, click **OK** to save the data source. - ## Creating a MySQL Integration Task ### Step 1: Basic Info @@ -76,19 +62,19 @@ Ensure the MySQL instance is accessible from {{{ .lake }}}. Check your firewall 2. 
Configure the basic settings: -| Field | Required | Description | -|----------------------------|-------------|--------------------------------------------------------------------------------------------------| -| **Data Source** | Yes | Select an existing MySQL data source from the dropdown | -| **Name** | Yes | A name for this integration task | -| **Source Database** | — | Automatically displayed based on the selected data source | -| **Source Table** | Yes | Select the table to sync from the MySQL database | -| **Sync Mode** | Yes | Choose from **Snapshot**, **CDC Only**, or **Snapshot + CDC** | -| **Primary Key** | Conditional | The unique identifier column for merge operations. Required for CDC Only and Snapshot + CDC modes | -| **Sync Interval** | Yes | Interval (in seconds) between write operations (default: 3) | -| **Batch Size** | No | Number of rows per batch | -| **Allow Delete** | No | Whether to permit DELETE operations in CDC. Available for CDC Only and Snapshot + CDC modes | - -![Create Task - Basic Info](/media/tidb-cloud-lake/create-mysql-task-step1-basic-info.png) + | Field | Required | Description | + |----------------------------|-------------|--------------------------------------------------------------------------------------------------| + | **Data Source** | Yes | Select an existing **MySQL - Credentials** data source from the dropdown | + | **Name** | Yes | A name for this integration task | + | **Source Database** | — | Automatically displayed based on the selected data source | + | **Source Table** | Yes | Select the table to sync from the MySQL database | + | **Sync Mode** | Yes | Choose from **Snapshot**, **CDC Only**, or **Snapshot + CDC** | + | **Primary Key** | Conditional | The unique identifier column for merge operations. 
Required for CDC Only and Snapshot + CDC modes | + | **Sync Interval** | Yes | Interval (in seconds) between write operations (default: 3) | + | **Batch Size** | No | Number of rows per batch | + | **Allow Delete** | No | Whether to permit DELETE operations in CDC. Available for CDC Only and Snapshot + CDC modes | + + ![Create Task - Basic Info](/media/tidb-cloud-lake/create-mysql-task-step1-basic-info.png) #### Snapshot Mode Options diff --git a/tidb-cloud-lake/guides/integrate-with-postgresql.md b/tidb-cloud-lake/guides/integrate-with-postgresql.md new file mode 100644 index 0000000000000..5ca2c51fbd524 --- /dev/null +++ b/tidb-cloud-lake/guides/integrate-with-postgresql.md @@ -0,0 +1,196 @@ +--- +title: PostgreSQL Integration Task +summary: This page describes how to create a PostgreSQL integration task that synchronizes data from a PostgreSQL database into {{{ .lake }}}. +--- + +# PostgreSQL Integration Task + +This page describes how to create a PostgreSQL integration task that synchronizes data from a PostgreSQL database into {{{ .lake }}}. PostgreSQL tasks support full `Snapshot` loads, continuous `Change Data Capture (CDC)`, or a combination of both. + +If you need to create reusable PostgreSQL connection settings first, see [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md). + +## Sync Modes + +| Sync Mode | Description | +|----------------|--------------------------------------------------------------------------------------------------------------| +| Snapshot | Performs a one-time full data load from the source table. Ideal for initial data migration or periodic bulk imports. | +| CDC Only | Continuously captures real-time changes (inserts, updates, deletes) via PostgreSQL logical replication. Requires a primary key for merge operations. | +| Snapshot + CDC | First performs a full snapshot, then seamlessly transitions to continuous CDC. Recommended for most use cases. 
| + +## Prerequisites + +Before setting up PostgreSQL data integration, ensure your PostgreSQL instance meets the following requirements: + +- A **PostgreSQL - Credentials** data source has already been created +- The target PostgreSQL instance is reachable from {{{ .lake }}} +- The PostgreSQL version is 10 or later + +### Enable Logical Replication + +For CDC Only and Snapshot + CDC modes, the PostgreSQL WAL (Write-Ahead Log) level must be set to `logical`: + +```ini title='postgresql.conf' +wal_level = logical +max_replication_slots = 4 +max_wal_senders = 4 +``` + +After modifying the configuration, restart PostgreSQL for the changes to take effect. + +### Create a Dedicated User (Recommended) + +Create a PostgreSQL user with the necessary permissions for data replication: + +```sql +CREATE USER databend_cdc WITH PASSWORD 'your_password' REPLICATION; +GRANT CONNECT ON DATABASE your_database TO databend_cdc; +GRANT USAGE ON SCHEMA public TO databend_cdc; +GRANT SELECT ON ALL TABLES IN SCHEMA public TO databend_cdc; +ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO databend_cdc; +``` + +### Create Publication and Replication Slot (Required for CDC) + +For CDC Only and Snapshot + CDC modes, a publication and a replication slot must already exist. Because `CREATE PUBLICATION ... FOR ALL TABLES` requires superuser privileges, and adding individual tables requires table ownership, these objects should be created by a database owner or superuser before starting the CDC task.
+ +Run the following as a superuser or database owner: + +```sql +-- Create a publication that includes the tables you want to replicate +CREATE PUBLICATION bend_cdc_pub FOR ALL TABLES; + +-- Create a logical replication slot +SELECT * FROM pg_create_logical_replication_slot('bend_cdc_slot', 'pgoutput'); + +-- Grant the dedicated user permission to use the replication slot +ALTER ROLE databend_cdc WITH REPLICATION; +``` + +> **Note:** +> +> If you only need to replicate specific tables instead of all tables, you can use: +> +> ```sql +> CREATE PUBLICATION bend_cdc_pub FOR TABLE table1, table2; +> ``` +> +> This avoids the superuser requirement but still requires ownership of the listed tables. + +### Network Access + +Ensure the PostgreSQL instance is accessible from {{{ .lake }}}. Check your firewall rules and security groups to allow inbound connections on the PostgreSQL port. + +## Creating a PostgreSQL Integration Task + +### Step 1: Basic Info + +1. Navigate to **Data** > **Data Integration** and click **Create Task**. + +2. Configure the basic settings: + + | Field | Required | Description | + |----------------------------|-------------|--------------------------------------------------------------------------------------------------| + | **Data Source** | Yes | Select an existing **PostgreSQL - Credentials** data source from the dropdown | + | **Name** | Yes | A name for this integration task | + | **Source Database** | — | Automatically displayed based on the selected data source | + | **Source Table** | Yes | Select the table to sync from the PostgreSQL database | + | **Sync Mode** | Yes | Choose from **Snapshot**, **CDC Only**, or **Snapshot + CDC** | + | **Primary Key** | Conditional | The unique identifier column for merge operations. 
Required for CDC Only and Snapshot + CDC modes | + | **Sync Interval** | Yes | Interval (in seconds) between write operations (default: 3) | + | **Batch Size** | No | Number of rows per batch | + | **Allow Delete** | No | Whether to permit DELETE operations in CDC. Available for CDC Only and Snapshot + CDC modes | + +#### Snapshot Mode Options + +When using **Snapshot** mode, an additional option is available: + +- **Snapshot WHERE Condition**: A SQL WHERE clause to filter data during the snapshot (e.g., `created_at > '2024-01-01'`). This allows you to load only a subset of the source data. + +### Step 2: Preview Data + +After configuring the basic settings, click **Next** to preview the source data. + +The system fetches a sample row from the selected PostgreSQL table and displays the column names and data types. Review the data to ensure the correct table and columns are selected before proceeding. + +### Step 3: Set Target Table + +Configure the destination in {{{ .lake }}}: + +| Field | Description | +|---------------------|--------------------------------------------------------------------| +| **Warehouse** | Select the target {{{ .lake }}} warehouse for running the sync | +| **Target Database** | Choose the target database in {{{ .lake }}} | +| **Target Table** | The table name in {{{ .lake }}} (defaults to the source table name) | + +The system automatically maps source columns to the target table schema. Review the column mappings, then click **Create** to finalize the integration task. + +## Task Behavior by Sync Mode + +| Sync Mode | Behavior | +|----------------|---------------------------------------------------------------------------------------------------| +| Snapshot | Runs once and automatically stops after the full data load is complete. | +| CDC Only | Runs continuously, capturing real-time changes until manually stopped. | +| Snapshot + CDC | Completes the initial snapshot first, then transitions to continuous CDC until manually stopped. 
| + +For CDC tasks, the current LSN (Log Sequence Number) is saved as a checkpoint when stopped, allowing the task to resume from where it left off when restarted. + +## Sync Mode Details + +### Snapshot + +Snapshot mode performs a one-time full read of the source table and loads all data into the target table in {{{ .lake }}}. + +**Use cases:** + +- Initial data migration from PostgreSQL to {{{ .lake }}} +- Periodic full data refresh +- One-time data imports with WHERE condition filtering + +**Features:** + +- Supports WHERE condition filtering to load a subset of data +- Task automatically stops after completion + +### CDC (Change Data Capture) + +CDC mode continuously monitors the PostgreSQL WAL (Write-Ahead Log) via logical replication and captures real-time row-level changes (INSERT, UPDATE, DELETE) from the source table. + +**Use cases:** + +- Real-time data replication +- Keeping {{{ .lake }}} in sync with operational PostgreSQL databases +- Event-driven data pipelines + +**How it works:** + +1. Connects to PostgreSQL using a logical replication slot +2. Captures row-level changes in real-time via the `pgoutput` plugin +3. Writes changes to a raw staging table in {{{ .lake }}} +4. Periodically merges changes into the target table using the primary key +5. Saves checkpoint (LSN position) for crash recovery + +> **Note:** +> +> CDC mode requires PostgreSQL WAL level set to `logical`, and a primary key (unique column) must be specified. The PostgreSQL user must have `REPLICATION` privilege. + +### Snapshot + CDC + +This mode combines both approaches: it first performs a full snapshot of the source table, then seamlessly transitions to CDC mode for continuous change capture. This is the recommended mode for most data integration scenarios, as it ensures a complete initial data load followed by ongoing real-time synchronization. 
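The CDC prerequisites described above (WAL level, `REPLICATION` privilege, replication slot, and publication) can be checked on the source database before you start a task. The following is a minimal sketch that assumes the role, slot, and publication names used earlier in this guide (`databend_cdc`, `bend_cdc_slot`, `bend_cdc_pub`); substitute your own names if they differ.

```sql
-- Confirm that logical decoding is enabled (the value should be "logical")
SHOW wal_level;

-- Confirm that the dedicated user has the REPLICATION attribute
SELECT rolname, rolreplication
FROM pg_roles
WHERE rolname = 'databend_cdc';

-- Inspect the replication slot and the LSN it has confirmed so far
SELECT slot_name, plugin, active, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'bend_cdc_slot';

-- List the tables included in the publication
SELECT *
FROM pg_publication_tables
WHERE pubname = 'bend_cdc_pub';
```

If `wal_level` is not `logical`, or the slot or publication is missing, CDC Only and Snapshot + CDC tasks are unlikely to capture changes until the source is reconfigured.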
+ +## Advanced Configuration + +### Primary Key + +The primary key specifies the unique identifier column used for MERGE operations during CDC. When a change event is captured, {{{ .lake }}} uses this key to determine whether to insert a new row or update an existing one. Typically, this should be the primary key of the source table. + +### Sync Interval + +The sync interval (in seconds) controls how frequently captured changes are merged into the target table. A shorter interval provides lower latency but may increase resource usage. The default value of 3 seconds is suitable for most workloads. + +### Batch Size + +Controls the number of rows processed per batch during data loading. Adjusting this value can help optimize throughput for large tables. Leave empty to use the system default. + +### Allow Delete + +When enabled (default for CDC modes), DELETE operations captured from PostgreSQL WAL are applied to the target table in {{{ .lake }}}. When disabled, deletes are ignored, and the target table retains all historical records. This is useful for scenarios where you want to maintain a complete audit trail. diff --git a/tidb-cloud-lake/guides/integration-tasks.md b/tidb-cloud-lake/guides/integration-tasks.md new file mode 100644 index 0000000000000..2a9ced5f4c704 --- /dev/null +++ b/tidb-cloud-lake/guides/integration-tasks.md @@ -0,0 +1,30 @@ +--- +title: Integration Tasks +summary: This page provides an overview of integration tasks in {{{ .lake }}}. Integration tasks define how data flows from external sources into {{{ .lake }}}, including source settings, target tables, and runtime parameters. +--- + +# Integration Tasks + +An integration task in {{{ .lake }}} defines how data flows from a source into a target table in {{{ .lake }}}. Each task references an existing data source and specifies source settings, a target warehouse, a target database / table, and runtime parameters that are specific to the task type. 
+ +Unlike data sources, integration tasks are the executable units that actually perform data movement and synchronization. Data sources store access settings, while tasks handle scheduling, ingestion, synchronization, stopping, resuming, and monitoring. + +## Supported Task Types + +| Task Type | Description | +|-----------|-------------| +| [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Imports CSV, Parquet, or NDJSON files from Amazon S3 with support for one-time or continuous ingestion. | +| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC`. | +| [PostgreSQL](/tidb-cloud-lake/guides/integrate-with-postgresql.md) | Synchronizes table data from PostgreSQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC`. | + +## Reading Guide + +Recommended reading order: + +1. Start with [Task Management](/tidb-cloud-lake/guides/task-management.md) to understand the task creation flow, start / stop behavior, status, and run history. +2. Then read the task-specific guide for the source type you want to configure. + +## Task Type Differences + +- Amazon S3 tasks are designed for file import scenarios and mainly focus on file path patterns, file formats, and ingestion behavior. +- MySQL and PostgreSQL tasks are designed for table synchronization scenarios and mainly focus on sync modes, primary keys, incremental capture, and archive scheduling. diff --git a/tidb-cloud-lake/guides/mysql-credentials.md b/tidb-cloud-lake/guides/mysql-credentials.md new file mode 100644 index 0000000000000..4205cf063a5a2 --- /dev/null +++ b/tidb-cloud-lake/guides/mysql-credentials.md @@ -0,0 +1,42 @@ +--- +title: MySQL - Credentials +summary: This page describes how to create a "MySQL - Credentials" data source. This data source stores the connection information required to access MySQL and can be reused across multiple MySQL integration tasks. 
+--- + +# MySQL - Credentials + +This page describes how to create a `MySQL - Credentials` data source. This data source stores the connection information required to access MySQL and can be reused across multiple MySQL integration tasks. + +## Use Cases + +- Manage host, port, and account information centrally for multiple MySQL sync tasks +- Avoid re-entering the same database connection settings in every task +- Update all dependent tasks in one place when the database endpoint or account changes + +## Create MySQL - Credentials + +1. Navigate to **Data** > **Data Sources** and click **Create Data Source**. +2. Select **MySQL - Credentials** as the service type, then fill in the connection details: + + | Field | Required | Description | + |-------|----------|-------------| + | **Name** | Yes | A descriptive name for this data source | + | **Hostname** | Yes | MySQL server hostname or IP address | + | **Port Number** | Yes | MySQL server port (default: `3306`) | + | **DB Username** | Yes | Username used to access MySQL | + | **DB Password** | Yes | Password for the MySQL user | + | **Database Name** | Yes | The source database name | + | **DB Charset** | No | Character set (default: `utf8mb4`) | + | **Server ID** | No | Unique binlog replication identifier. Auto-generated if not provided | + +3. Click **Test Connectivity** to validate the connection. If the test succeeds, click **OK** to save the data source. + +## Usage Recommendations + +- Use a dedicated MySQL account instead of sharing one with application workloads +- If you plan to create `CDC Only` or `Snapshot + CDC` tasks, make sure the account has replication-related privileges +- Verify network access, binlog configuration, and permissions before creating downstream tasks + +## Next Steps + +After creating this data source, you can use it to create a [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md). 
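For the usage recommendations above, a dedicated account and a quick binlog check might look like the sketch below. The account name `lake_sync` and its password are hypothetical, and the privilege list is an assumption based on what binlog-based CDC tools typically need; this page does not list the exact privileges that {{{ .lake }}} requires, so treat it as a starting point only.

```sql
-- Hypothetical account name and password; replace with your own values
CREATE USER 'lake_sync'@'%' IDENTIFIED BY 'change-me';

-- Read access for snapshots, plus privileges commonly needed to read the binlog
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'lake_sync'@'%';

-- Check that binary logging is enabled and which format it uses
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
```

Restricting the host part (`'%'`) and the scope of `SELECT` to the databases you actually sync is a reasonable hardening step.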
diff --git a/tidb-cloud-lake/guides/postgresql-credentials.md b/tidb-cloud-lake/guides/postgresql-credentials.md new file mode 100644 index 0000000000000..7bab8f8a49ebb --- /dev/null +++ b/tidb-cloud-lake/guides/postgresql-credentials.md @@ -0,0 +1,41 @@ +--- +title: PostgreSQL - Credentials +summary: This page describes how to create a "PostgreSQL - Credentials" data source. This data source stores the connection information required to access PostgreSQL and can be reused across multiple PostgreSQL integration tasks. +--- + +# PostgreSQL - Credentials + +This page describes how to create a `PostgreSQL - Credentials` data source. This data source stores the connection information required to access PostgreSQL and can be reused across multiple PostgreSQL integration tasks. + +## Use Cases + +- Manage host, port, and account information centrally for multiple PostgreSQL sync tasks +- Avoid re-entering the same database connection settings in every task +- Update all dependent tasks in one place when the database endpoint or account changes + +## Create PostgreSQL - Credentials + +1. Navigate to **Data** > **Data Sources** and click **Create Data Source**. +2. Select **PostgreSQL - Credentials** as the service type, then fill in the connection details: + + | Field | Required | Description | + |-------|----------|-------------| + | **Name** | Yes | A descriptive name for this data source | + | **Hostname** | Yes | PostgreSQL server hostname or IP address | + | **Port Number** | Yes | PostgreSQL server port (default: `5432`) | + | **DB Username** | Yes | Username used to access PostgreSQL | + | **DB Password** | Yes | Password for the PostgreSQL user | + | **Database Name** | Yes | The source database name | + | **SSL Mode** | No | SSL connection mode: `disable`, `require`, `verify-ca`, or `verify-full` (default: `disable`) | + +3. Click **Test Connectivity** to validate the connection. If the test succeeds, click **OK** to save the data source. 
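The **DB Username** and **DB Password** fields above expect an existing PostgreSQL account. A minimal sketch of creating a dedicated account is shown below; the role name `lake_sync`, the password, the database name `mydb`, and the `public` schema are all hypothetical placeholders, and the grants are an assumption about what a read-only sync account typically needs.

```sql
-- Hypothetical role name, password, and database name; replace with your own values
CREATE ROLE lake_sync WITH LOGIN PASSWORD 'change-me';

-- Allow the role to connect and read tables in the public schema
GRANT CONNECT ON DATABASE mydb TO lake_sync;
GRANT USAGE ON SCHEMA public TO lake_sync;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO lake_sync;

-- Only needed for CDC Only or Snapshot + CDC tasks
ALTER ROLE lake_sync WITH REPLICATION;
```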
+ +## Usage Recommendations + +- Use a dedicated PostgreSQL account instead of sharing one with application workloads +- If you plan to create `CDC Only` or `Snapshot + CDC` tasks, make sure the account has replication-related privileges +- Verify network access, WAL configuration, and permissions before creating downstream tasks + +## Next Steps + +After creating this data source, you can use it to create a [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md). diff --git a/tidb-cloud-lake/guides/task-management.md b/tidb-cloud-lake/guides/task-management.md new file mode 100644 index 0000000000000..fb5bbc1832895 --- /dev/null +++ b/tidb-cloud-lake/guides/task-management.md @@ -0,0 +1,53 @@ +--- +title: Task Management +summary: This page describes common operations for integration tasks, including the task creation flow, start and stop behavior, task states, and run history. For source-specific configuration, see the detailed task guides. +--- + +# Task Management + +This page describes common operations for integration tasks, including the task creation flow, start and stop behavior, task states, and run history. For source-specific configuration, see the detailed task guides. + +## General Task Creation Flow + +1. Navigate to **Data** > **Data Integration** and click **Create Task**. +2. Select an existing data source. +3. Fill in source-side parameters based on the task type, such as file path, source table, sync mode, or filter conditions. +4. Preview the source data and verify the schema and field types. +5. Select the target warehouse, target database, and target table. +6. Create the task and start it when needed. + +## Starting and Stopping Tasks + +After a task is created, its initial state is **Stopped**. To begin synchronization or ingestion, click **Start** on the task. + +To stop a running task, click **Stop**. The task will shut down gracefully and save its current progress. 
+ +## Task Status + +The Data Integration page shows all tasks and their current status: + +| Status | Description | +|--------|-------------| +| Running | The task is actively synchronizing or importing data | +| Stopped | The task is currently not running | +| Failed | The task encountered an error during execution | + +## Viewing Run History + +Click a task to view its execution history. The run history includes: + +- Execution start or end time +- Number of rows imported or synchronized +- Error details, if any + +## Runtime Behavior by Task Type + +- S3 tasks can run once or continuously poll for new files. +- MySQL and PostgreSQL `Snapshot` tasks stop automatically after the full load completes. +- MySQL and PostgreSQL `CDC Only` and `Snapshot + CDC` tasks continue running until manually stopped. + +For field-level configuration and detailed behavior, continue with the relevant task guide: + +- [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) +- [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md) +- [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md)