18 changes: 14 additions & 4 deletions TOC-tidb-cloud-lake.md
@@ -23,10 +23,20 @@
- Security
- [Authenticate with AWS IAM Role](/tidb-cloud-lake/guides/authenticate-with-aws-iam-role.md)
- [Connect with AWS PrivateLink](/tidb-cloud-lake/guides/connect-with-aws-privatelink.md)
- Integrations
- [Data Integration Overview](/tidb-cloud-lake/guides/data-integration-overview.md)
- [Integrate with MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md)
- [Integrate with Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md)
- Data Integration
- [Overview](/tidb-cloud-lake/guides/data-integration-overview.md)
- Data Sources
- [Overview](/tidb-cloud-lake/guides/data-sources.md)
- [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md)
- [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md)
- [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md)
- [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md)
- Integration Tasks
- [Overview](/tidb-cloud-lake/guides/integration-tasks.md)
- [Task Management](/tidb-cloud-lake/guides/task-management.md)
- [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md)
- [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md)
- [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md)
- Connect
- [Overview](/tidb-cloud-lake/guides/connection-overview.md)
- SQL Clients
Binary file not shown.
Binary file added media/tidb-cloud-lake/feishubot-example.png
35 changes: 35 additions & 0 deletions tidb-cloud-lake/guides/aws-credentials.md
@@ -0,0 +1,35 @@
---
title: AWS - Credentials
summary: This page describes how to create an "AWS - Credentials" data source. This data source stores the credentials required to access Amazon S3 and can be reused across multiple S3 integration tasks.
---

# AWS - Credentials

This page describes how to create an `AWS - Credentials` data source. This data source stores the credentials required to access Amazon S3 and can be reused across multiple S3 integration tasks.

## Use Cases

- Manage one set of AWS Access Key and Secret Key credentials for multiple S3 import tasks
- Avoid re-entering the same S3 access credentials in every task
- Update credentials centrally when they are rotated

## Create AWS - Credentials

1. Navigate to **Data** > **Data Sources** and click **Create Data Source**.
2. Select **AWS - Credentials** as the service type, then fill in the credentials:

| Field | Required | Description |
|-------|----------|-------------|
| **Name** | Yes | A descriptive name for this data source |
| **Access Key** | Yes | AWS Access Key ID |
| **Secret Key** | Yes | AWS Secret Access Key |

3. Click **Test Connectivity** to validate the credentials. If the test succeeds, click **OK** to save the data source.

## Permission Requirements

The AWS credentials must have read access to the target S3 bucket. If downstream tasks will enable **Clean Up Original Files**, the credentials must also have write and delete permissions.
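The permission requirement above can be sketched as an IAM policy document. The following Go program is a minimal illustration, not an official policy: the bucket name `mybucket`, the helper `buildPolicy`, and the mapping of "read" to `s3:ListBucket` plus `s3:GetObject` and of "write and delete" to `s3:PutObject` plus `s3:DeleteObject` are assumptions. Check the actions your tasks actually need before attaching a policy.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statement and policy mirror the standard IAM JSON policy layout.
type statement struct {
	Effect   string   `json:"Effect"`
	Action   []string `json:"Action"`
	Resource []string `json:"Resource"`
}

type policy struct {
	Version   string      `json:"Version"`
	Statement []statement `json:"Statement"`
}

// buildPolicy is a hypothetical helper. It grants list and read access to the
// bucket and, when cleanUp is true, also write and delete access on objects
// (assumed to be what Clean Up Original Files requires).
func buildPolicy(bucket string, cleanUp bool) policy {
	objectActions := []string{"s3:GetObject"}
	if cleanUp {
		objectActions = append(objectActions, "s3:PutObject", "s3:DeleteObject")
	}
	return policy{
		Version: "2012-10-17",
		Statement: []statement{
			// Bucket-level action: applies to the bucket ARN itself.
			{Effect: "Allow", Action: []string{"s3:ListBucket"}, Resource: []string{"arn:aws:s3:::" + bucket}},
			// Object-level actions: apply to keys inside the bucket.
			{Effect: "Allow", Action: objectActions, Resource: []string{"arn:aws:s3:::" + bucket + "/*"}},
		},
	}
}

func main() {
	doc, err := json.MarshalIndent(buildPolicy("mybucket", true), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(doc))
}
```

Passing `false` as the second argument yields a read-only policy, which is likely enough when **Clean Up Original Files** stays disabled.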

## Next Steps

After creating this data source, you can use it to create an [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md).
78 changes: 20 additions & 58 deletions tidb-cloud-lake/guides/data-integration-overview.md
@@ -1,74 +1,36 @@
---
title: Data Integration
summary: The Data Integration feature in {{{ .lake }}} enables you to load data from external sources into {{{ .lake }}} through a visual, no-code interface. You can create data sources, configure integration tasks, and monitor synchronization — all from the {{{ .lake }}} console.
title: Data Integration Overview
summary: The Data Integration feature in {{{ .lake }}} provides a visual, no-code interface for importing or synchronizing data from external systems into {{{ .lake }}}.
---

# Data Integration
# Data Integration Overview

The Data Integration feature in {{{ .lake }}} enables you to load data from external sources into {{{ .lake }}} through a visual, no-code interface. You can create data sources, configure integration tasks, and monitor synchronization — all from the {{{ .lake }}} console.

## Supported Data Sources

| Data Source | Description |
| -------------------- | ---------------------------------------------------------------------------------------- |
| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Sync data from MySQL databases with support for Snapshot, CDC, and Snapshot + CDC modes. |
| [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Import files from Amazon S3 buckets with support for CSV, Parquet, and NDJSON formats. |
The Data Integration feature in {{{ .lake }}} provides a visual, no-code interface for importing or synchronizing data from external systems into {{{ .lake }}}. The feature centers around two key concepts: **data sources** and **integration tasks**.

## Key Concepts

### Data Source

A data source represents a connection to an external system. It stores the credentials and connection details needed to access the source data. Once configured, a data source can be reused across multiple integration tasks.

{{{ .lake }}} currently supports two types of data sources:

- **MySQL - Credentials**: Connection to a MySQL database (host, port, username, password, database).
- **AWS - Credentials**: Connection to Amazon S3 (Access Key and Secret Key).

### Integration Task

An integration task defines how data flows from a source to a target table in {{{ .lake }}}. Each task specifies the source configuration, target warehouse and table, and operational parameters specific to the data source type.

## Managing Data Sources

To manage data sources, navigate to **Data** > **Data Sources** from the left sidebar. From this page you can:

- View all configured data sources
- Create new data sources
- Edit or delete existing data sources
- Test connectivity to verify credentials

> **Tip:**
>
> It is recommended to always test the connection before saving a data source. This helps catch common issues such as incorrect credentials or network restrictions early.

## Managing Tasks

### Starting and Stopping Tasks

After creation, a task is in a **Stopped** state. To begin data synchronization, click the **Start** button on the task.

To stop a running task, click the **Stop** button. The task will gracefully shut down and save its progress.

### Task Status
| Concept | Description |
|---------|-------------|
| [Data Sources](/tidb-cloud-lake/guides/data-sources.md) | Reusable connection settings or credentials used to access external systems or send notifications, such as AWS Access Key / Secret Key, MySQL hostname / username / password, or a FeiShu bot webhook. |
| [Integration Tasks](/tidb-cloud-lake/guides/integration-tasks.md) | Executable tasks that define where data comes from, which {{{ .lake }}} table it is written to, which runtime parameters are used, and how the task is started and monitored. |

The Data Integration page displays all tasks with their current status:
Data sources do not move data by themselves. They only store the information required to access external systems. Integration tasks are the units that actually perform imports, snapshots, and continuous synchronization.

| Status | Description |
| ------- | ----------------------------- |
| Running | Task is actively syncing data |
| Stopped | Task is not running |
| Failed | Task encountered an error |
Not every data source corresponds to an ingestion task. For example, `FeiShuBot` is used for notifications rather than loading source data into {{{ .lake }}}.

### Viewing Run History
## Supported Integration Task Types

Click on a task to view its execution history. The run history includes:
| Task Type | Description |
|-----------|-------------|
| [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Imports CSV, Parquet, or NDJSON files from Amazon S3 with support for one-time or continuous ingestion. |
| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC` modes. |

- Execution start and end times
- Number of rows synced
- Error details (if any)
## Recommended Flow

![Run History](/media/tidb-cloud-lake/dataintegration-run-history-page.png)
1. Create and test reusable connection settings on the [Data Sources](/tidb-cloud-lake/guides/data-sources.md) page.
2. Review supported task types and their use cases on the [Integration Tasks](/tidb-cloud-lake/guides/integration-tasks.md) page.
3. Read the task-specific guide to configure the source, preview the data, and set the target table.
4. Use the [Task Management](/tidb-cloud-lake/guides/task-management.md) page to start tasks, check status, and troubleshoot execution issues.

## Video Tour

37 changes: 37 additions & 0 deletions tidb-cloud-lake/guides/data-sources.md
@@ -0,0 +1,37 @@
---
title: Data Sources
summary: A data source in {{{ .lake }}} represents a connection to an external system. It stores the credentials and connection details required to access external systems and can be reused across multiple integration tasks or notification scenarios.
---

# Data Sources

A data source in {{{ .lake }}} represents a connection to an external system. It stores the credentials and connection details required to access external systems and can be reused across multiple integration tasks or notification scenarios.

Data sources do not execute synchronization by themselves. Their role is to centralize access settings so you do not need to repeatedly enter accounts, passwords, keys, or notification endpoints in every task.

## Supported Data Source Types

| Type | Purpose |
|------|---------|
| [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md) | Stores the Access Key and Secret Key required to access Amazon S3. These credentials can be reused across multiple S3 import tasks. |
| [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md) | Stores the host, port, username, password, and database information required to access MySQL. These settings can be reused across multiple MySQL sync tasks. |
| [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md) | Stores a FeiShu bot webhook and message template for task failure notifications and similar scenarios. |

Not every data source corresponds to an integration task. For example, `FeiShuBot` is used for notification configuration, while `AWS - Credentials` and `MySQL - Credentials` are referenced by actual data import or synchronization tasks.

## Managing Data Sources

Navigate to **Data** > **Data Sources**. On this page, you can:

- View all configured data sources
- Create new data sources
- Edit or delete existing data sources
- Test connectivity to validate credentials

> **Tip:**
>
> Run **Test Connectivity** before saving a data source to catch issues such as invalid credentials, missing permissions, or network restrictions as early as possible.

## Next Steps

After creating a data source, you can reference it in an [integration task](/tidb-cloud-lake/guides/integration-tasks.md) or a notification configuration, depending on its purpose.
113 changes: 113 additions & 0 deletions tidb-cloud-lake/guides/feishubot.md
@@ -0,0 +1,113 @@
---
title: FeiShuBot
summary: This page describes how to create a "FeiShuBot" data source. This data source stores a FeiShu bot webhook and message template for task failure notifications and similar scenarios.
---

# FeiShuBot

This page describes how to create a `FeiShuBot` data source. This data source stores a FeiShu bot webhook and message template and is typically used for task failure notifications.

## Use Cases

- Send notifications to a FeiShu group when a task run fails
- Reuse the same bot configuration and message template across multiple tasks
- Manage the notification endpoint and message format centrally

## Create FeiShuBot

1. Navigate to **Data** > **Data Sources** and click **Create Data Source**.
2. Select **FeiShuBot** as the service type, then fill in the fields:

| Field | Required | Description |
|-------|----------|-------------|
| **Name** | Yes | A descriptive name for this data source. Only letters, numbers, and underscores are supported |
| **URL** | Yes | Custom FeiShu bot webhook URL |
| **Warehouse** | Yes | The warehouse used to create the `NOTIFICATION INTEGRATION` |
| **Payload** | Yes | Message payload type. Currently, only `Task Error` is supported |
| **Template** | Yes | Custom message template |

3. Click **Test Connectivity** to validate the configuration. If the test succeeds, click **OK** to save the data source.
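The naming rule in the **Name** field can be checked before you open the console. The sketch below is only an illustration: the function `isValidDataSourceName` is hypothetical, and it assumes that "letters" means ASCII letters and that the name must be non-empty. The console's actual validation may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// validName encodes the documented rule: letters, numbers, and underscores.
// Assumption: ASCII only, at least one character.
var validName = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// isValidDataSourceName is a hypothetical pre-check, not part of any product API.
func isValidDataSourceName(name string) bool {
	return validName.MatchString(name)
}

func main() {
	for _, name := range []string{"test_1", "feishu-alerts", "ops alerts"} {
		fmt.Printf("%q valid=%v\n", name, isValidDataSourceName(name))
	}
}
```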

## Usage

`FeiShuBot` can be used with the SQL Task `ERROR_INTEGRATION` property, or referenced from the console Task Flow UI through **Error Notification**.

### Set a SQL Task Property

Set the Task `ERROR_INTEGRATION` property. In the following example, the data source name is `test_1`:

```sql
CREATE TASK my_daily_task
WAREHOUSE = 'compute_wh'
SCHEDULE = USING CRON '0 0 9 * * *' 'America/Los_Angeles'
COMMENT = 'Daily summary task'
ERROR_INTEGRATION = 'test_1'
AS
INSERT INTO summary_table SELECT * FROM source_table;
```

### Configure It in the Task Flow UI

On the create or edit page, set **Error Notification** to the corresponding `FeiShuBot` data source.

### Customize the Task Error Template

Default template:

```text
**[ALERT] {{ .MessageType }} - {{ .TaskName }}**
---
taskId: {{ .TaskId }}
taskName: {{ .TaskName }}
tenantId: {{ .TenantId }}

Messages: {{ range .Messages }}
- runId: {{ .RunId }}
queryId: {{ .QueryId }}
error: {{ .ErrorKind }} ({{ .ErrorCode }})
message: {{ .ErrorMessage }} {{ end }}

---
{{ .Timestamp }}
```

A received message looks similar to this:

![FeiShu notification example](/media/tidb-cloud-lake/feishubot-example.png)

Custom templates support:

- Markdown content
- Golang template syntax

The following variables are available:

```golang
type ErrorIntegrationPayload struct {
    Version      string          `json:"version"`
    MessageId    string          `json:"messageId"`
    MessageType  string          `json:"messageType"`
    Timestamp    time.Time       `json:"timestamp"`
    TenantId     string          `json:"tenantId"`
    TaskName     string          `json:"taskName"`
    TaskId       string          `json:"taskId"`
    RootTaskName string          `json:"rootTaskName"`
    RootTaskId   string          `json:"rootTaskId"`
    Messages     []*ErrorMessage `json:"messages"`
}

type ErrorMessage struct {
    RunId          string     `json:"runId"`
    ScheduledTime  time.Time  `json:"scheduledTime"`
    QueryStartTime *time.Time `json:"queryStartTime"`
    CompletedTime  *time.Time `json:"completedTime"`
    QueryId        string     `json:"queryId"`
    ErrorKind      string     `json:"errorKind"`
    ErrorCode      string     `json:"errorCode"`
    ErrorMessage   string     `json:"errorMessage"`
}
```

## Notes

`FeiShuBot` is a notification-oriented data source. It is not used to load business data into {{{ .lake }}}.
42 changes: 15 additions & 27 deletions tidb-cloud-lake/guides/integrate-with-amazon-s3.md
@@ -1,11 +1,13 @@
---
title: Amazon S3
title: Amazon S3 Integration Task
summary: The Amazon S3 data integration enables you to import files from S3 buckets into {{{ .lake }}}. It supports CSV, Parquet, and NDJSON file formats, with options for one-time imports or continuous ingestion that automatically polls for new files.
---

# Amazon S3
# Amazon S3 Integration Task

The Amazon S3 data integration enables you to import files from S3 buckets into {{{ .lake }}}. It supports CSV, Parquet, and NDJSON file formats, with options for one-time imports or continuous ingestion that automatically polls for new files.
This page describes how to create an Amazon S3 integration task that imports files from an S3 bucket into {{{ .lake }}}. CSV, Parquet, and NDJSON file formats are supported, and the task can be configured for one-time import or continuous ingestion.

If you need to create reusable AWS credentials first, see [AWS - Credentials](/tidb-cloud-lake/guides/aws-credentials.md).

## Supported File Formats

@@ -15,23 +17,11 @@
| Parquet | Columnar storage format, efficient for analytical workloads |
| NDJSON | Newline-delimited JSON, one JSON object per line |

## Creating an S3 Data Source

1. Navigate to **Data** > **Data Sources** and click **Create Data Source**.

2. Select **AWS - Credentials** as the service type, and fill in the credentials:

| Field | Required | Description |
|----------------|----------|--------------------------------------|
| **Name** | Yes | A descriptive name for this data source |
| **Access Key** | Yes | AWS Access Key ID |
| **Secret Key** | Yes | AWS Secret Access Key |
## Prerequisites

3. Click **Test Connectivity** to verify the credentials. If the test succeeds, click **OK** to save the data source.

> **Tip:**
>
> The AWS credentials must have read access to the target S3 bucket. If you plan to use the **Clean Up Original Files** option, write and delete permissions are also required.
- An **AWS - Credentials** data source has already been created
- The AWS credentials have read access to the target S3 bucket
- If you plan to enable **Clean Up Original Files**, the credentials also need write and delete permissions

## Creating an S3 Integration Task

@@ -41,14 +31,12 @@

2. Select an S3 data source, then configure the basic settings:

| Field | Required | Description |
|--------------------|----------|--------------------------------------------------------------------------------------------------|
| **Data Source** | Yes | Select an existing AWS data source from the dropdown |
| **Name** | Yes | A name for this integration task |
| **File Path** | Yes | S3 URI with optional wildcard pattern (e.g., `s3://mybucket/data/2025-*.csv`) |
| **File Type** | Auto | Auto-detected from file extension. Supported: CSV, Parquet, NDJSON |

![Create S3 Task - Basic Info](/media/tidb-cloud-lake/create-s3-task-basic-info.png)
| Field | Required | Description |
|--------------------|----------|--------------------------------------------------------------------------------------------------|
| **Data Source** | Yes | Select an existing **AWS - Credentials** data source from the dropdown |
| **Name** | Yes | A name for this integration task |
| **File Path** | Yes | S3 URI with optional wildcard pattern (e.g., `s3://mybucket/data/2025-*.csv`) |

| **File Type** | Auto | Auto-detected from file extension. Supported: CSV, Parquet, NDJSON |

Check failure on line 38 in tidb-cloud-lake/guides/integrate-with-amazon-s3.md (GitHub Actions / vale, reported by reviewdog): [PingCAP.Latin] Use 'for example' instead of 'e.g.,'. Location: line 38, column 77. Severity: ERROR.

#### CSV Options
