diff --git a/docs/assets/images/guides/fs/data_source/crm_sales_analytics_facebook_ads.png b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_facebook_ads.png new file mode 100644 index 000000000..4f2dec663 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_facebook_ads.png differ diff --git a/docs/assets/images/guides/fs/data_source/crm_sales_analytics_freshdesk.png b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_freshdesk.png new file mode 100644 index 000000000..36a2dcdfd Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_freshdesk.png differ diff --git a/docs/assets/images/guides/fs/data_source/crm_sales_analytics_google_ads.png b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_google_ads.png new file mode 100644 index 000000000..5d52a4cf8 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_google_ads.png differ diff --git a/docs/assets/images/guides/fs/data_source/crm_sales_analytics_google_analytics.png b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_google_analytics.png new file mode 100644 index 000000000..7403f5dd9 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_google_analytics.png differ diff --git a/docs/assets/images/guides/fs/data_source/crm_sales_analytics_hubspot.png b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_hubspot.png new file mode 100644 index 000000000..87d31fe2f Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_hubspot.png differ diff --git a/docs/assets/images/guides/fs/data_source/crm_sales_analytics_pipedrive.png b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_pipedrive.png new file mode 100644 index 000000000..6009b18c0 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_pipedrive.png differ diff --git a/docs/assets/images/guides/fs/data_source/crm_sales_analytics_salesforce.png b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_salesforce.png new file mode 100644 index 000000000..ad492abec Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_salesforce.png differ diff --git a/docs/assets/images/guides/fs/data_source/crm_sales_analytics_shopify.png b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_shopify.png new file mode 100644 index 000000000..5f57a9538 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/crm_sales_analytics_shopify.png differ diff --git a/docs/assets/images/guides/fs/data_source/rest_api_creation.png b/docs/assets/images/guides/fs/data_source/rest_api_creation.png new file mode 100644 index 000000000..fd8a51860 Binary files /dev/null and b/docs/assets/images/guides/fs/data_source/rest_api_creation.png differ diff --git a/docs/assets/images/guides/fs/feature_group/dlthub_configure_job_rest_incremental.png b/docs/assets/images/guides/fs/feature_group/dlthub_configure_job_rest_incremental.png new file mode 100644 index 000000000..309d6865b Binary files /dev/null and b/docs/assets/images/guides/fs/feature_group/dlthub_configure_job_rest_incremental.png differ diff --git a/docs/assets/images/guides/fs/feature_group/dlthub_configure_job_sql.png b/docs/assets/images/guides/fs/feature_group/dlthub_configure_job_sql.png new file mode 100644 index 000000000..74d4d3ece Binary files /dev/null and 
b/docs/assets/images/guides/fs/feature_group/dlthub_configure_job_sql.png differ
diff --git a/docs/assets/images/guides/fs/feature_group/dlthub_configure_rest_endpoint.png b/docs/assets/images/guides/fs/feature_group/dlthub_configure_rest_endpoint.png
new file mode 100644
index 000000000..f01664aba
Binary files /dev/null and b/docs/assets/images/guides/fs/feature_group/dlthub_configure_rest_endpoint.png differ
diff --git a/docs/assets/images/guides/fs/feature_group/dlthub_rest_page_number_pagination.png b/docs/assets/images/guides/fs/feature_group/dlthub_rest_page_number_pagination.png
new file mode 100644
index 000000000..1b1e2cdcd
Binary files /dev/null and b/docs/assets/images/guides/fs/feature_group/dlthub_rest_page_number_pagination.png differ
diff --git a/docs/assets/images/guides/fs/feature_group/dlthub_review_modal.png b/docs/assets/images/guides/fs/feature_group/dlthub_review_modal.png
new file mode 100644
index 000000000..f8958ba5a
Binary files /dev/null and b/docs/assets/images/guides/fs/feature_group/dlthub_review_modal.png differ
diff --git a/docs/assets/images/guides/fs/feature_group/dlthub_select_crm_resource.png b/docs/assets/images/guides/fs/feature_group/dlthub_select_crm_resource.png
new file mode 100644
index 000000000..7553f93f6
Binary files /dev/null and b/docs/assets/images/guides/fs/feature_group/dlthub_select_crm_resource.png differ
diff --git a/docs/assets/images/guides/fs/feature_group/dlthub_select_sql_table.png b/docs/assets/images/guides/fs/feature_group/dlthub_select_sql_table.png
new file mode 100644
index 000000000..1c7bda34a
Binary files /dev/null and b/docs/assets/images/guides/fs/feature_group/dlthub_select_sql_table.png differ
diff --git a/docs/user_guides/fs/data_source/creation/crm_sales_analytics.md b/docs/user_guides/fs/data_source/creation/crm_sales_analytics.md
new file mode 100644
index 000000000..3f8516235
--- /dev/null
+++ b/docs/user_guides/fs/data_source/creation/crm_sales_analytics.md
@@ -0,0 +1,167 @@
+# How-To set up a CRM, Sales & Analytics Data Source
+
+## Introduction
+
+The `CRM, Sales & Analytics` data source lets you connect Hopsworks to supported business applications and marketing platforms.
+The following sources are available:
+
+- Facebook Ads
+- Freshdesk
+- Google Ads
+- Google Analytics
+- HubSpot
+- Pipedrive
+- Salesforce
+- Shopify
+
+In this guide, you will configure a Data Source in Hopsworks by saving the credentials required by the selected source.
+
+!!! note
+    Currently, it is only possible to create data sources in the Hopsworks UI.
+    You cannot create a data source programmatically.
+
+## Prerequisites
+
+Before you begin, make sure you have:
+
+- A unique name for the data source in Hopsworks.
+- Read credentials for the external system you want to connect to.
+- Any source-specific identifiers required by that system, such as account, customer, property, or domain identifiers.
+- For Google Ads and Google Analytics, a service account JSON keyfile that can be uploaded to the Hopsworks project.
+
+## Creation in the UI
+
+### Step 1: Set up new Data Source
+
+Head to the Data Source View on Hopsworks (1) and set up a new data source (2).
+
+
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
+
+ +### Step 2: Select storage and source + +Choose `CRM, Sales & Analytics` as the storage type. +Then enter a unique **Name**, an optional **Description**, and select the source you want to configure. + +
+ ![CRM, Sales & Analytics - Facebook Ads](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_facebook_ads.png) +
CRM, Sales & Analytics data source selection
+
+ +### Step 3: Enter source-specific credentials + +The required fields depend on the selected source. + +#### Facebook Ads + +Required fields: + +- **Access Token** +- **Account Id** + +
+ ![Facebook Ads Data Source](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_facebook_ads.png) +
Facebook Ads data source form
+
+ +#### Freshdesk + +Required fields: + +- **API Key** +- **Domain** + +
+ ![Freshdesk Data Source](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_freshdesk.png) +
Freshdesk data source form
+
+ +#### Google Ads + +Required fields: + +- **Authentication JSON Keyfile** +- **Developer Token** +- **Customer Id** +- **Impersonated Email** + +The JSON keyfile can be selected either from an existing project file or uploaded as a new file. + +
+ ![Google Ads Data Source](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_google_ads.png) +
Google Ads data source form
+
+ +#### Google Analytics + +Required fields: + +- **Authentication JSON Keyfile** +- **Property Id** + +The JSON keyfile can be selected either from an existing project file or uploaded as a new file. + +
+ ![Google Analytics Data Source](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_google_analytics.png) +
Google Analytics data source form
+
+ +#### HubSpot + +Required fields: + +- **API Key** + +
+ ![HubSpot Data Source](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_hubspot.png) +
HubSpot data source form
+
+ +#### Pipedrive + +Required fields: + +- **API Key** + +
+ ![Pipedrive Data Source](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_pipedrive.png) +
Pipedrive data source form
+
+ +#### Salesforce + +Required fields: + +- **Security Token** +- **Username** +- **Password** + +
+ ![Salesforce Data Source](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_salesforce.png) +
Salesforce data source form
+
+ +#### Shopify + +Required fields: + +- **Shop URL** +- **Private App Password** + +
+ ![Shopify Data Source](../../../../assets/images/guides/fs/data_source/crm_sales_analytics_shopify.png) +
Shopify data source form
+
+
+### Step 4: Save the credentials
+
+After entering the required fields for the selected source:
+
+1. Click **Save Credentials**.
+2. Click **Next: Select resource** to continue configuring the data source for downstream use.
+
+## Next Steps
+
+Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created data source.
diff --git a/docs/user_guides/fs/data_source/creation/rest_api.md b/docs/user_guides/fs/data_source/creation/rest_api.md
new file mode 100644
index 000000000..d19033a52
--- /dev/null
+++ b/docs/user_guides/fs/data_source/creation/rest_api.md
@@ -0,0 +1,71 @@
+# How-To set up a REST API Data Source
+
+## Introduction
+
+The `REST API` data source lets you connect Hopsworks to external HTTP APIs.
+You can use it to store the base connection details, optional headers, and the authentication method required by the target API.
+
+In this guide, you will configure a REST API Data Source in the Hopsworks UI.
+
+!!! note
+    Currently, it is only possible to create data sources in the Hopsworks UI.
+    You cannot create a data source programmatically.
+
+## Prerequisites
+
+Before you begin, make sure you have:
+
+- A unique name for the data source in Hopsworks.
+- The **Base URL** of the target API.
+- Any headers you want to send with requests.
+- The authentication details required by the target API.
+
+## Creation in the UI
+
+### Step 1: Set up new Data Source
+
+Head to the Data Source View on Hopsworks (1) and set up a new data source (2).
+
+
+ ![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png) +
The Data Source View in the User Interface
+
+ +### Step 2: Enter REST API settings + +Select `REST API` as the storage type. +Then provide the common connection settings shown in the form: + +1. **Name:** A unique name for the data source. +2. **Description:** Optional description. +3. **Base URL:** The base endpoint for the external API. +4. **Headers:** Optional header key-value pairs. Use the `+` button to add headers. +5. **Authentication:** Select the authentication mode required by the API. + +The following authentication modes are available in the UI: + +- `NONE` +- `BEARER_TOKEN` +- `API_KEY` +- `HTTP_BASIC` +- `OAUTH2_CLIENT` + +
+ ![REST API Data Source](../../../../assets/images/guides/fs/data_source/rest_api_creation.png) +
REST API data source form
+
+
+!!! note
+    The screenshot shows the form with `NONE` selected.
+    When you choose another authentication mode, the form will prompt for the additional credentials required by that method.
+
+### Step 3: Save the credentials
+
+After entering the connection details:
+
+1. Click **Save Credentials**.
+2. Click **Next: Select resource** to continue configuring the data source for downstream use.
+
+## Next Steps
+
+Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created REST API data source.
diff --git a/docs/user_guides/fs/data_source/index.md b/docs/user_guides/fs/data_source/index.md
index 90af4eb1b..fc0cca41e 100644
--- a/docs/user_guides/fs/data_source/index.md
+++ b/docs/user_guides/fs/data_source/index.md
@@ -36,6 +36,8 @@ Cloud agnostic storage systems:
 2. [Snowflake](creation/snowflake.md): Query Snowflake databases and tables using SQL.
 3. [Kafka](creation/kafka.md): Read data from a Kafka cluster into a Spark Structured Streaming Dataframe.
 4. [HopsFS](creation/hopsfs.md): Easily connect and read from directories of Hopsworks' internal File System.
+5. [CRM, Sales & Analytics](creation/crm_sales_analytics.md): Connect to supported CRM, sales, and analytics platforms.
+6. [REST API](creation/rest_api.md): Connect to external HTTP APIs with configurable headers and authentication.
 ## AWS
diff --git a/docs/user_guides/fs/data_source/usage.md b/docs/user_guides/fs/data_source/usage.md
index 0cb0c72a6..d7cfba09e 100644
--- a/docs/user_guides/fs/data_source/usage.md
+++ b/docs/user_guides/fs/data_source/usage.md
@@ -154,6 +154,25 @@ Example for any data warehouse/SQL based external sources, we set the desired SQ
 This enables users to create feature groups within Hopsworks without the hassle of data migration.
 For more information on `Connector API`, read detailed guide about [external feature groups](../feature_group/create_external.md).
+## Ingesting Data into a Managed Feature Group
+
+Data Sources can also be used to create a managed feature group and ingest data from the source into Hopsworks.
+In this workflow, Hopsworks creates a sink-enabled feature group together with an ingestion job that copies data from the source into the feature group.
+
+This is different from an external feature group:
+
+- An **external feature group** keeps the data in the external source and stores only metadata in Hopsworks.
+- A **managed feature group with ingestion enabled** copies the source data into Hopsworks and can keep it synchronized through recurring ingestion jobs.
+
+This workflow is especially useful when you want to:
+
+- Materialize source data inside Hopsworks.
+- Schedule recurring ingestions.
+- Use full-load or incremental ingestion strategies.
+- Build managed feature groups from SQL, CRM, or REST API sources.
+
+For the full workflow, including schema selection, ingestion job configuration, loading strategies, and REST pagination, see [Ingest Data with dltHub][ingest-data-with-dlthub].
+
 ## Writing Training Data
 Data Sources are also used while writing training data to external sources.
diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index 7cae94876..484904b7b 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -2,7 +2,7 @@
 description: Documentation on how to create a Feature Group and the different APIs available to insert data to a Feature Group in Hopsworks.
 ---
-# How to create a Feature Group
+# How to create a Feature Group { #create-feature-group }
 ## Introduction
diff --git a/docs/user_guides/fs/feature_group/create_external.md b/docs/user_guides/fs/feature_group/create_external.md
index fedee8c51..bfb6e978e 100644
--- a/docs/user_guides/fs/feature_group/create_external.md
+++ b/docs/user_guides/fs/feature_group/create_external.md
@@ -2,7 +2,7 @@
 description: Documentation on how to create an external feature group in Hopsworks and the different APIs available to interact with them.
 ---
-# How to create an External Feature Group
+# How to create an External Feature Group { #create-external-feature-group }
 ## Introduction
diff --git a/docs/user_guides/fs/feature_group/index.md b/docs/user_guides/fs/feature_group/index.md
index 4161ff878..20e0010d3 100644
--- a/docs/user_guides/fs/feature_group/index.md
+++ b/docs/user_guides/fs/feature_group/index.md
@@ -4,6 +4,7 @@ This section serves to provide guides and examples for the common usage of abstr
 - [Create a Feature Group](create.md)
 - [Create an external Feature Group](create_external.md)
+- [Ingest Data with dltHub](ingest_with_dlthub.md)
 - [Deprecating Feature Group](deprecation.md)
 - [Data Types and Schema management](data_types.md)
 - [Statistics](statistics.md)
diff --git a/docs/user_guides/fs/feature_group/ingest_with_dlthub.md b/docs/user_guides/fs/feature_group/ingest_with_dlthub.md
new file mode 100644
index 000000000..24bff8edd
--- /dev/null
+++ b/docs/user_guides/fs/feature_group/ingest_with_dlthub.md
@@ -0,0 +1,373 @@
+---
+description: Documentation on how to ingest data from a data source into a new feature group using dltHub in Hopsworks.
+---
+
+# How to ingest data into a Feature Group with dltHub { #ingest-data-with-dlthub }
+
+## Introduction
+
+Hopsworks can copy data from an existing data source into a new managed feature group using dltHub.
+This workflow creates:
+
+- A new feature group in Hopsworks.
+- An ingestion job that copies data from the selected source into that feature group.
+
+This is different from creating an external feature group.
+An external feature group keeps the data in the source system, while the dltHub ingestion flow copies the data into Hopsworks.
+
+!!! note
+    You can configure this workflow both in the Hopsworks UI and with the Hopsworks Python APIs.
+
+## When to use this workflow
+
+Use `Ingest Data to New Feature Group` when you want to:
+
+- Copy data from the source into Hopsworks.
+- Schedule recurring ingestion jobs.
+- Use incremental loading for supported source types.
+
+## Supported source types
+
+This ingestion flow supports multiple source types, with some differences in behavior:
+
+- SQL-like sources can either create an external feature group or ingest data into a new feature group.
+- CRM and REST API sources use the ingestion path only.
+- Incremental loading is available for SQL and REST API sources.
+- CRM sources currently use full-load ingestion.
+
+## Step 1: Open the Data Source and start Feature Group creation
+
+Navigate to the data source you want to use and start the feature-group creation flow from the UI.
+
+For SQL-based sources, select a database object first and then choose `Ingest Data to New Feature Group`.
+
+
+ ![dltHub SQL Feature Group Selection](../../../assets/images/guides/fs/feature_group/dlthub_select_sql_table.png) +
Select a source table and choose Ingest Data to New Feature Group
+
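+
+If you prefer to explore a SQL source in Python first, the `get_tables()` call used in the API examples at the end of this guide returns the same database objects that the UI lists.
+A minimal sketch, assuming an existing project handle and a SQL data source named `my_sql_source` (both hypothetical):
+
+```python
+fs = project.get_feature_store()
+
+# List the database objects exposed by the SQL data source.
+sql_source = fs.get_data_source("my_sql_source")
+tables = sql_source.get_tables()
+
+# Fetch fresh sample data for the first table and inspect the schema
+# that Hopsworks infers from it.
+data = tables[0].get_data(use_cached=False)
+print([feature["name"] for feature in data.features])
+```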
+
+For CRM sources, choose the source resource and then configure the feature schema for the new feature group.
+
+
+ ![dltHub CRM Feature Group Selection](../../../assets/images/guides/fs/feature_group/dlthub_select_crm_resource.png) +
Select a CRM resource and configure the feature group schema
+
+
+For REST API sources, first configure the endpoint before fetching the schema.
+
+### REST endpoint configuration
+
+REST sources require endpoint configuration up front so Hopsworks can fetch the schema correctly.
+In this step, define:
+
+- **Resource**: A unique identifier for the endpoint.
+- **Relative URL**: The endpoint path relative to the configured REST data source base URL.
+- **Request Parameters**: Optional query parameters sent with the request.
+- **Pagination Configuration**: The pagination mode and its parameters, if the API returns paged results.
+
+The REST pagination form supports these modes:
+
+- `NONE`
+- `HEADER_CURSOR`
+- `HEADER_LINK`
+- `JSON_CURSOR`
+- `JSON_LINK`
+- `OFFSET`
+- `PAGE_NUMBER`
+- `SINGLE_PAGE`
+
+For example, `PAGE_NUMBER` pagination exposes:
+
+- **Page Parameter Name**: Name of the request parameter that contains the page number.
+- **Base Page**: Starting page number used by the API, for example `0` or `1`.
+- **Total Pages Path**: Response path containing the total number of pages.
+
+
+ ![dltHub REST Page Number Pagination](../../../assets/images/guides/fs/feature_group/dlthub_rest_page_number_pagination.png) +
REST API pagination configuration using PAGE_NUMBER
+
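+
+The same endpoint and pagination settings can be declared in Python with the classes used in the API examples at the end of this guide.
+A minimal sketch for a hypothetical `/transactions` endpoint that pages results through a `page` query parameter:
+
+```python
+from hopsworks_common.core import rest_endpoint
+
+# PAGE_NUMBER pagination: "Page Parameter Name" maps to page_param,
+# "Base Page" to base_page, and "Total Pages Path" to total_path.
+# The endpoint path and response field names are illustrative only.
+endpoint_config = rest_endpoint.RestEndpointConfig(
+    relative_url="/transactions",
+    query_params={"page_size": 100},
+    pagination_config=rest_endpoint.PageNumberPaginationConfig(
+        base_page=1,              # the API counts pages from 1
+        page_param="page",        # page number is sent as ?page=<n>
+        total_path="total",       # response path holding the page count
+        stop_after_empty_page=True,
+    ),
+)
+```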
+
+Other pagination modes expose their own mode-specific fields in the form:
+
+- `OFFSET`: offset parameter name, limit parameter name, limit value, total-items path, and has-more path.
+- `JSON_CURSOR`: cursor parameter name and cursor path.
+- `HEADER_CURSOR`: cursor header key and cursor path.
+- `HEADER_LINK`: next-link header key.
+- `JSON_LINK`: next URL path.
+
+For more details on how these pagination strategies work in dltHub, see the [dltHub REST API pagination documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api/basic#pagination).
+
+## Step 2: Configure the feature group schema
+
+After fetching metadata from the source, Hopsworks shows the feature selection table.
+At this stage you can:
+
+- Set the **Feature Group Name**.
+- Include or exclude columns.
+- Edit feature names and data types.
+- Mark one or more features as **Primary key**.
+- Optionally select a **Partition key**.
+- Optionally select an **Event time** column.
+- Preview metadata and preview data before continuing.
+
+When you are ready, click `Next: Configure Ingestion Job`.
+
+!!! note
+    `Create External Feature Group` is not supported for CRM and REST connectors.
+
+!!! note
+    For CRM and REST sources, schema fetching reads only a small sample of records from the source.
+    Hopsworks uses this sample to infer the feature-group schema before you create the ingestion job.
+
+## Step 3: Configure the dltHub ingestion job
+
+The next page configures the ingestion job that will populate the feature group.
+
+
+ ![dltHub SQL Job Configuration](../../../assets/images/guides/fs/feature_group/dlthub_configure_job_sql.png) +
Configure the dltHub ingestion job
+
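+
+The job settings on this page map to the `SinkJobConfiguration` object used in the API examples at the end of this guide.
+A minimal sketch limited to the fields shown in those examples, namely the job name and the write mode:
+
+```python
+from hopsworks_common.core import sink_job_configuration
+
+# Name the ingestion job and pick a write mode; APPEND is the faster
+# option, as discussed in the write-modes section below.
+sink_job_conf = sink_job_configuration.SinkJobConfiguration(
+    name="sql_to_fg_ingestion",
+    write_mode=sink_job_configuration.WriteMode.APPEND,
+)
+```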
+
+### Common job settings
+
+The following fields are available in the job configuration:
+
+- **Job Name**: Name of the ingestion job created in Hopsworks.
+- **Source Read Parallelism**: Number of parallel readers that pull data from the source database or API. Increase it to speed up ingestion if the source can handle the extra load.
+- **Data Processing Parallelism**: Number of parallel processes that prepare and transform data before loading it into the feature group. Increase it if processing is slow and CPU is available.
+- **Destination Write Batch Size**: Number of records written to the feature group in each batch during ingestion.
+- **Max Write Batch Size (MB)**: Maximum file size, in megabytes, when writing data to the feature group.
+- **Write Mode**: Controls whether incoming data is appended as-is or merged with existing rows using the primary key.
+- **Start the job after creation**: Starts the ingestion job immediately after the resources are created.
+- **Memory (in MB)**: Memory allocated to the ingestion job.
+- **CPU Cores**: CPU cores allocated to the ingestion job.
+- **Schedule**: Optional recurring schedule for future ingestion runs.
+- **Alerts**: Optional alerting configuration for the ingestion job.
+
+### SQL-only settings
+
+For SQL sources, the job configuration also includes:
+
+- **Source Read Batch Size**: Number of records fetched per read from the SQL source.
+- **Source Table Partitions**: Number of partitions used when reading from SQL sources. For very large tables, increase this value to split the read into smaller chunks that fit the allocated memory.
+
+These options control how data is read from the source table during ingestion.
+
+### Write modes
+
+Two write modes are available:
+
+- **APPEND**: Appends new data without merging with existing rows. This greatly speeds up writes and uses less memory, but can result in duplicate rows. If you are ingesting a large amount of data, this is the recommended mode; duplicates can be handled later in a separate pipeline step.
+- **MERGE**: Merges incoming data with existing rows using the feature-group primary key. This avoids duplicate rows, but slows down ingestion and requires more memory, especially for large ingestions. Use it when ingesting smaller amounts of data.
+
+## Step 4: Choose a loading strategy
+
+The `Loading Strategy` section controls whether the pipeline reads the entire source or only new data.
+
+The following strategies are available in the UI:
+
+- `FULL_LOAD`
+- `INCREMENTAL_ID`
+- `INCREMENTAL_TIMESTAMP`
+- `INCREMENTAL_DATE`
+
+### Full load
+
+`FULL_LOAD` is available for all sources in this workflow.
+With a full load, the ingestion job reads the complete dataset from the source and writes it again to the destination feature group.
+
+In practice, this means the target feature group is refreshed from scratch for the same feature-group name and version.
+Any data already stored in that feature group version is removed and replaced by the newly ingested data from the source.
+
+Use `FULL_LOAD` when you want the feature group to be a complete copy of the source at the time of ingestion, rather than an incremental continuation of previous runs.
+This is useful when:
+
+- The source does not provide a reliable incremental cursor.
+- You want to rebuild the feature group from a clean state.
+- The source data can change retroactively and you want to re-sync the full table or endpoint.
+
+Because a full load rewrites the destination dataset, it is typically more expensive than incremental ingestion for large sources.
+For recurring pipelines, prefer an incremental strategy when the source supports it and when you only need newly added or updated records.
+
+For SQL sources, you can also optionally define:
+
+- **Source Cursor Field**: A field used to efficiently synchronize only new or changed data from the source into the destination feature group.
+- **Initial Value**: Starting value for the selected source cursor field.
+
+This can be used to split or optimize the load when the source table has a monotonic column, even though the ingestion mode remains a full refresh of the feature group.
+
+### Incremental loading
+
+Incremental loading is available for SQL and REST API sources.
+With incremental loading, the ingestion job does not re-copy the full source on every run.
+Instead, it keeps track of a cursor value and only fetches records that are newer than, or come after, the last processed value.
+
+This makes incremental loading the preferred option for recurring ingestion jobs when the source exposes a stable field that can be used to identify new or updated data.
+Typical cursor fields are:
+
+- Increasing numeric identifiers.
+- Update timestamps.
+- Event dates.
+
+Compared to `FULL_LOAD`, incremental loading typically:
+
+- Reduces the amount of data read from the source.
+- Shortens ingestion time.
+- Lowers resource usage.
+- Avoids rebuilding the destination feature group from scratch on every run.
+
+To work reliably, the selected cursor field should be monotonic or consistently ordered for the records you want to ingest.
+If the source does not provide such a field, `FULL_LOAD` is usually the safer option.
+
+All incremental strategies share one field:
+
+- **Source Cursor Field**: A field used to efficiently synchronize only new or changed data from the source into the destination feature group.
+
+Depending on the strategy, you must also define:
+
+- **INCREMENTAL_ID**: **Initial Value**, the numeric starting value for incremental reads.
+- **INCREMENTAL_TIMESTAMP**: **Initial Value**, the starting Unix timestamp for incremental reads.
+- **INCREMENTAL_DATE**: **Initial Date**, the starting date and time for incremental reads.
+
+The initial value defines where the first run starts.
+After that, subsequent runs continue from the last successfully processed cursor value.
+
+For REST API sources, incremental loading also requires:
+
+- **REST Filter Param**: The API request parameter used to request only new data since the last run, for example `start_date`, `updated_at`, or `since`.
+
+Choose the incremental strategy that matches the source cursor type:
+
+- `INCREMENTAL_ID` for sources with increasing numeric identifiers.
+- `INCREMENTAL_TIMESTAMP` for sources that expose Unix timestamps.
+- `INCREMENTAL_DATE` for sources that filter by date or datetime values.
+
+
+ ![dltHub REST Incremental Job](../../../assets/images/guides/fs/feature_group/dlthub_configure_job_rest_incremental.png) +
REST API ingestion job with incremental loading
+
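+
+Expressed with the API classes from the examples at the end of this guide, an incremental strategy is a `LoadingConfig` attached to the job configuration.
+A sketch for an id-based SQL load; it assumes the UI strategies map to `LoadingStrategy` members of the same name, as `INCREMENTAL_DATE` does in the REST example below:
+
+```python
+from hopsworks_common.core import sink_job_configuration
+
+# Each run fetches only rows whose cursor value comes after the last
+# processed value; the very first run starts from initial_value.
+loading_config = sink_job_configuration.LoadingConfig(
+    loading_strategy=sink_job_configuration.LoadingStrategy.INCREMENTAL_ID,  # assumed member name
+    source_cursor_field="id",
+    initial_value=0,
+)
+
+sink_job_conf = sink_job_configuration.SinkJobConfiguration(
+    name="incremental_sql_ingestion",
+    loading_config=loading_config,
+)
+```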
+
+## Step 5: Review and create
+
+After configuring the ingestion job, click `Next: Review Configuration`.
+The review dialog shows:
+
+- The source schema, table, connector, or resource.
+- The final feature group name.
+- Whether sink ingestion is enabled.
+- The ingestion job name.
+- The number of selected features.
+
+You can still edit the feature-group name and ingestion-job name in this step before creating the resources.
+
+
+ ![dltHub Review Configuration](../../../assets/images/guides/fs/feature_group/dlthub_review_modal.png) +
Review the feature group and ingestion job before creation
+
+
+Click `Create` to create the feature group and the dltHub ingestion job.
+
+## Result
+
+After creation:
+
+- The feature group is registered in Hopsworks.
+- The ingestion job is available on the project's Jobs page.
+- If `Start the job after creation` is enabled, the initial ingestion starts immediately.
+- If a schedule is configured, future synchronizations will run automatically.
+
+## Next Steps
+
+- Use the [Feature Group creation guide][create-feature-group] to understand managed feature groups in more detail.
+- Use the [External Feature Group guide][create-external-feature-group] if you want to query the source in place without copying data into Hopsworks.
+- Use the [Online Ingestion Observability guide][online-ingestion-observability] to monitor ingestion behavior for online-enabled feature groups.
+
+## API support
+
+You can also configure data source ingestion programmatically with the Hopsworks Python APIs.
+This is done by creating a sink-enabled feature group and passing a sink job configuration, including loading strategy and, for REST sources, endpoint and pagination settings.
+
+### Example: create a sink-enabled feature group
+
+```python
+from hopsworks_common.core import sink_job_configuration
+
+fs = project.get_feature_store()
+data_source = fs.get_data_source("my_sql_source").get_tables()[0]
+data = data_source.get_data(use_cached=False)
+
+sink_job_conf = sink_job_configuration.SinkJobConfiguration(
+    name="sql_to_fg_ingestion",
+    write_mode=sink_job_configuration.WriteMode.APPEND,
+)
+
+fg = fs.get_or_create_feature_group(
+    name="transactions_fg",
+    version=1,
+    description="Managed feature group populated from a data source.",
+    primary_key=[data.features[0]["name"]],
+    features=data.features,
+    data_source=data_source,
+    time_travel_format="DELTA",
+    sink_enabled=True,
+    sink_job_conf=sink_job_conf,
+)
+fg.save()
+
+# Run the ingestion job
+fg.sink_job.run(await_termination=True)
+```
+
+### Example: REST ingestion with incremental loading
+
+```python
+from hopsworks_common.core import rest_endpoint, sink_job_configuration
+from hsfs.core import data_source as ds
+
+fs = project.get_feature_store()
+parent_data_source = fs.get_data_source("my_rest_source")
+
+endpoint_config = rest_endpoint.RestEndpointConfig(
+    relative_url="/transactions",
+    query_params={"page_size": 100},
+    pagination_config=rest_endpoint.PageNumberPaginationConfig(
+        base_page=1,
+        page_param="page",
+        total_path="total",
+        stop_after_empty_page=True,
+    ),
+)
+
+rest_data_source = ds.DataSource(
+    table="transactions_rest",
+    rest_endpoint=endpoint_config,
+    storage_connector=parent_data_source.storage_connector,
+)
+
+rest_data = rest_data_source.get_data(use_cached=False)
+
+loading_config = sink_job_configuration.LoadingConfig(
+    loading_strategy=sink_job_configuration.LoadingStrategy.INCREMENTAL_DATE,
+    source_cursor_field="timestamp",
+    initial_value="2024-01-01T00:00:00Z",
+    rest_filter_param="start_time",
+)
+
+sink_job_conf = sink_job_configuration.SinkJobConfiguration(
+    name="rest_to_fg_ingestion",
+    loading_config=loading_config,
+)
+
+fg = fs.get_or_create_feature_group(
+    name="transactions_rest_fg",
+    version=1,
+    description="Managed feature group populated from a REST source.",
+    primary_key=["id"],
+    features=rest_data.features,
+    data_source=rest_data_source,
+    time_travel_format="DELTA",
+    sink_enabled=True,
+    sink_job_conf=sink_job_conf,
+)
+fg.save()
+```
diff --git a/docs/user_guides/fs/feature_group/online_ingestion_observability.md b/docs/user_guides/fs/feature_group/online_ingestion_observability.md
index 734b3e120..97d9a3a72 100644
--- a/docs/user_guides/fs/feature_group/online_ingestion_observability.md
+++ b/docs/user_guides/fs/feature_group/online_ingestion_observability.md
@@ -2,7 +2,7 @@
 description: Documentation on Online ingestion observability in Hopsworks.
 ---
-# Online ingestion observability
+# Online ingestion observability { #online-ingestion-observability }
 ## Introduction
diff --git a/mkdocs.yml b/mkdocs.yml
index 26765d1ee..a3d480a3f 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -71,11 +71,14 @@ nav:
 - BigQuery: user_guides/fs/data_source/creation/bigquery.md
 - GCS: user_guides/fs/data_source/creation/gcs.md
 - SQL: user_guides/fs/data_source/creation/sql.md
+ - CRM, Sales & Analytics: user_guides/fs/data_source/creation/crm_sales_analytics.md
+ - REST API: user_guides/fs/data_source/creation/rest_api.md
 - Usage: user_guides/fs/data_source/usage.md
 - Feature Group:
 - user_guides/fs/feature_group/index.md
 - Create: user_guides/fs/feature_group/create.md
 - Create External: user_guides/fs/feature_group/create_external.md
+ - Ingest Data with dltHub: user_guides/fs/feature_group/ingest_with_dlthub.md
 - Create Spine: user_guides/fs/feature_group/create_spine.md
 - Deprecate: user_guides/fs/feature_group/deprecation.md
 - Data Types and Schema management: user_guides/fs/feature_group/data_types.md