116 changes: 116 additions & 0 deletions docs/user_guides/fs/data_source/creation/mongodb.md
@@ -0,0 +1,116 @@
# How-To set up a MongoDB Data Source { #data-source-mongodb }

## Introduction

MongoDB is a document database widely used as the operational store behind web and mobile applications.
Documents are stored in collections and grouped into databases on a MongoDB server or Atlas cluster.

A MongoDB Data Source in Hopsworks stores the connection details required to read collections from a MongoDB deployment.
Once configured, you can use the same data source as the basis for an external (on-demand) Feature Group, or as the source for a dltHub-driven ingestion job that materialises MongoDB documents into a managed Feature Group.

In this guide, you will configure a Data Source in Hopsworks that holds the authentication information needed to connect to your MongoDB deployment.

!!! note
Currently, it is only possible to create data sources in the Hopsworks UI.
You cannot create a data source programmatically.

## Prerequisites

Before you begin this guide, you'll need to retrieve the following information from your MongoDB deployment (self-hosted or Atlas).
The following options are **mandatory**:

- **Connection String**: The MongoDB connection URI without embedded credentials, for example `mongodb+srv://my-cluster.abcde.mongodb.net` for an Atlas cluster or `mongodb://mongo.example.com:27017` for a self-hosted deployment.
The username and password live in separate fields so the URI itself can be checked into project metadata without leaking credentials.
- **User**: The MongoDB user the connector authenticates as.
- **Password**: The password for that user.

These are a few additional **optional** arguments:

- **Database**: The default database the connector points at when no database is explicitly selected.
The picker can target any database the user has access to, regardless of this default.
- **Collection**: The default collection used when no collection is provided at read time.
- **Auth Source**: The authentication database for the user (typically `admin`).
Required when the user is created in a database other than the one being read from, which is typical for Atlas.
- **Auth Mechanism**: The MongoDB authentication mechanism, e.g. `SCRAM-SHA-256`.
Leave empty to let the server negotiate the default.
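
The split between the connection string and the credential fields can be illustrated with a small sketch.
It uses only the standard library; the helper name `build_client_kwargs` and the example user are hypothetical, and the resulting mapping uses the keyword arguments that `pymongo.MongoClient` accepts (`host`, `username`, `password`, `authSource`, `authMechanism`).
It is not how Hopsworks itself assembles the connection.

```python
from urllib.parse import urlsplit


def build_client_kwargs(connection_string, user, password,
                        auth_source=None, auth_mechanism=None):
    """Combine a credential-free URI with separately stored credentials."""
    parts = urlsplit(connection_string)
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError("expected a mongodb:// or mongodb+srv:// URI")
    if parts.username or parts.password:
        raise ValueError("keep credentials out of the connection string")

    kwargs = {"host": connection_string, "username": user, "password": password}
    # Optional settings are only passed when set, so the server default applies otherwise.
    if auth_source:
        kwargs["authSource"] = auth_source
    if auth_mechanism:
        kwargs["authMechanism"] = auth_mechanism
    return kwargs


# Hypothetical values, for illustration only.
kwargs = build_client_kwargs(
    "mongodb://mongo.example.com:27017",
    "feature_reader",
    "change-me",
    auth_source="admin",
)
# With pymongo installed, the mapping can be passed on unchanged:
#   from pymongo import MongoClient
#   client = MongoClient(**kwargs)
```

Running a check like this against your own values before filling in the form can catch a URI that still carries an embedded `user:password@` prefix.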

!!! info "Drivers"
Hopsworks ships the MongoDB drivers needed to read from MongoDB out of the box.
The Hopsworks Spark image bundles the `mongo-spark-connector` for Spark reads, and the dlt ingestion image and Arrow Flight server bundle `pymongo` for the on-demand read path.
You do not need to install or upload the drivers yourself.

## Creation in the UI

### Step 1: Set up new Data Source

Head to the Data Source View on Hopsworks and start the creation flow for a new data source.

<figure markdown>
![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png)
<figcaption>The Data Source View in the User Interface</figcaption>
</figure>

### Step 2: Enter MongoDB Settings

Enter the details for your MongoDB connector.
Start by giving it a **name** and an optional **description**.

01. Select "MongoDB" as storage.
02. Specify the **Connection String** of your MongoDB cluster.
03. Provide the **User** name of the MongoDB database user.
04. Provide the **Password** for that user.
05. Optionally set a default **Database** and **Collection**.
06. Optionally set an **Auth Source** (typically `admin` for Atlas) and an **Auth Mechanism**.
07. Click on "Save Credentials".

## Use it as an ingestion source

Once the MongoDB data source exists, you can use it with the dltHub-based ingestion workflow described in [Ingest Data with dltHub][ingest-data-with-dlthub].
MongoDB is treated as a NoSQL document source, so the ingestion job runs a `pymongo` aggregation pipeline against the chosen collection and materialises the result into a managed Feature Group.

## Type mapping

MongoDB collections are schemaless: documents in the same collection can have different fields and different types per field.
When you browse a collection in the data source picker, Hopsworks samples a small batch of documents and infers a type for each field by classifying every observed value and applying promotion rules: `int` + `float` becomes `float`, `date` + `timestamp` becomes `timestamp`, and a `list` or nested document becomes a JSON-encoded `string`.
The inferred types are then projected to Hopsworks offline feature types using the table below.

| Python value type (Compass-style) | Hopsworks offline feature type |
| --- | --- |
| `int` | `bigint` |
| `float` | `double` |
| `Decimal128` | `decimal(38,18)` |
| `bool` | `boolean` |
| `datetime` | `timestamp` |
| `date` | `date` |
| `ObjectId` / `str` / `UUID` | `string` |
| `bytes` / `Binary` | `binary` |
| `list` / nested `dict` | `string` (JSON-encoded) |

You can also override the inferred type for any column in the Feature Group creation form.
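
The promotion rules and the table above can be sketched in plain Python.
This is a minimal illustration, not the Hopsworks implementation: the function names are hypothetical, BSON-specific types (`ObjectId`, `Decimal128`, `Binary`, `UUID`) are left out to keep the sketch standard-library only, and two behaviours are assumptions that the guide does not specify, namely that `null` values are skipped and that any other type mix falls back to `string`.

```python
from datetime import date, datetime

# Offline feature types from the table above, keyed by inferred kind.
OFFLINE_TYPES = {
    "int": "bigint",
    "float": "double",
    "bool": "boolean",
    "timestamp": "timestamp",
    "date": "date",
    "binary": "binary",
    "string": "string",
}


def classify(value):
    """Classify one observed value; subclass checks must come first."""
    if isinstance(value, bool):  # bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, datetime):  # datetime is a subclass of date
        return "timestamp"
    if isinstance(value, date):
        return "date"
    if isinstance(value, (bytes, bytearray)):
        return "binary"
    # str, lists and nested documents all end up as (JSON) strings.
    return "string"


def promote(a, b):
    """Merge two kinds observed for the same field."""
    if a == b:
        return a
    if {a, b} == {"int", "float"}:
        return "float"
    if {a, b} == {"date", "timestamp"}:
        return "timestamp"
    return "string"  # assumption: fallback for mixes the guide does not cover


def infer_schema(docs):
    """Infer one offline feature type per field from a sampled batch."""
    kinds = {}
    for doc in docs:
        for field, value in doc.items():
            if value is None:
                continue  # assumption: nulls do not influence the type
            kind = classify(value)
            kinds[field] = promote(kinds[field], kind) if field in kinds else kind
    return {field: OFFLINE_TYPES[kind] for field, kind in kinds.items()}


sample = [
    {"n": 1, "when": date(2024, 3, 1), "tags": ["a", "b"]},
    {"n": 2.5, "when": datetime(2024, 3, 1, 12, 0), "flag": True},
]
schema = infer_schema(sample)
# schema == {"n": "double", "when": "timestamp", "tags": "string", "flag": "boolean"}
```

Because only the sampled documents are inspected, a field that is an `int` in the sample but a `float` elsewhere would be inferred as `bigint` here, which is the case the type override in the creation form is meant for.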

## Known limitations

### Schemaless collections

The picker's schema inference walks a sampled batch of documents, not the full collection.
If a rare field type appears only in documents outside the sample, it may not be reflected in the inferred Feature Group schema.
You can override or extend the schema in the Feature Group creation form before saving.

### Nested documents and arrays

Nested documents and arrays are read as JSON-encoded `string` features.
If you need to expose nested fields as individual features, project them out at the source by writing a custom aggregation pipeline as the data source `query`.
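
As a hedged sketch of such a projection, the pipeline below flattens a hypothetical document shape (`user.name`, `user.address.city`, and a `tags` array) into top-level fields using the standard `$project`, `$size`, and `$ifNull` operators.
The field names are invented for illustration, and serialising the pipeline to JSON is an assumption about how the `query` is supplied, which this guide does not specify.

```python
import json

# Hypothetical source document:
#   {"_id": ..., "user": {"name": "Ann", "address": {"city": "Oslo"}}, "tags": ["a", "b"]}
pipeline = [
    {
        "$project": {
            "_id": 1,
            # Dotted paths pull nested values up to top-level fields.
            "user_name": "$user.name",
            "user_city": "$user.address.city",
            # $ifNull guards against documents that have no `tags` array.
            "tag_count": {"$size": {"$ifNull": ["$tags", []]}},
        }
    },
]

# Assumption: the pipeline is handed over as a JSON string.
query = json.dumps(pipeline)
```

Each projected field then arrives as a scalar, so it is inferred as its own feature rather than as part of a JSON-encoded string.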

### Online ingestion requires non-null primary keys

When you create a managed Feature Group fed from MongoDB via the dltHub ingestion workflow and enable online serving, online ingestion validates that every row has a non-null value in the Feature Group's primary-key column.
If the source documents can carry `null` (or omit the field entirely) in that column, either filter them out at source, pick a different primary key on the Feature Group, or disable online serving for the Feature Group.
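
Filtering at source can be sketched as a `$match` stage: in MongoDB, `{field: {"$ne": None}}` selects only documents where the field exists and is not `null`.
The helper names and the `customer_id` key below are hypothetical, and the local predicate is included only to show which documents the filter would keep.

```python
def non_null_match_stage(key):
    """Aggregation stage that drops documents with a missing or null key."""
    return {"$match": {key: {"$ne": None}}}


def has_primary_key(doc, key):
    """Local equivalent of the filter, useful for checking a sampled batch."""
    return doc.get(key) is not None


docs = [
    {"customer_id": 7, "amount": 12.5},
    {"customer_id": None, "amount": 3.0},  # explicit null
    {"amount": 9.9},                       # field omitted entirely
]
kept = [doc for doc in docs if has_primary_key(doc, "customer_id")]
stage = non_null_match_stage("customer_id")
```

Placing the stage first in the pipeline keeps later stages from processing documents that online ingestion would reject anyway.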

### `_id` is renamed

MongoDB's reserved `_id` field starts with an underscore, which Hopsworks' feature-name rule (`^[a-z][a-z0-9_]*$`) does not accept.
The data-source flow surfaces it as the feature `id` by default; you can rename it in the Feature Group creation form before saving.
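
The feature-name rule can be checked locally with the regular expression quoted above.
The `to_feature_name` helper is a hypothetical sketch of one possible normalisation (lower-casing, replacing unsupported characters, stripping leading underscores); the guide only states that `_id` is surfaced as `id`, so the rest is an assumption.

```python
import re

# The feature-name rule quoted in this section.
FEATURE_NAME = re.compile(r"^[a-z][a-z0-9_]*$")


def is_valid_feature_name(name):
    return FEATURE_NAME.fullmatch(name) is not None


def to_feature_name(field):
    """Hypothetical normalisation of a MongoDB field name."""
    candidate = re.sub(r"[^a-z0-9_]", "_", field.lower()).lstrip("_")
    if not is_valid_feature_name(candidate):
        raise ValueError(f"cannot derive a feature name from {field!r}")
    return candidate


renamed = to_feature_name("_id")  # "id"
```

Names that still fail the rule after normalisation, such as a field starting with a digit, raise an error here and would need a manual rename in the creation form.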

## Next Steps

Move on to the [usage guide for data sources][data-source-usage] to see how you can use your newly created MongoDB connector.
19 changes: 19 additions & 0 deletions docs/user_guides/fs/data_source/creation/s3.md
@@ -15,6 +15,10 @@ In this guide, you will configure a Data Source in Hopsworks to save all the aut
When you're finished, you'll be able to read files using Spark through Hopsworks APIs.
You can also use the connector to write out training data from the Feature Store, in order to make it accessible by third parties.

Once the S3 data source is set up, the bucket is **browsable** in the data-source picker.
You can register any individual parquet file as an external Feature Group, or register a directory of parquet files as a single external Feature Group.
For a directory, schema differences across files are unioned (`union_by_name`), and Hive-style partition columns (e.g. `year=2024/month=03/...`) are exposed and pushed down at read time.
See the [Create External Feature Group](../../feature_group/create_external.md) guide for the full workflow.

!!! note
Currently, it is only possible to create data sources in the Hopsworks UI.
You cannot create a data source programmatically.
@@ -116,6 +120,21 @@ You can also add options to configure the S3A client. For example, to disable SS

Click on "Save Credentials".

## Browse the bucket for parquet files and directories

After saving, click **Next: Select Tables** to open the data-source picker.

The picker walks the bucket (and the optional **Path** prefix) and surfaces, side by side:

- **Each individual `*.parquet` file** as its own selectable entry.
The full key is the path that the read engine uses.
- **Each prefix that contains two or more parquet files** as a single directory entry.
Selecting the directory registers the whole prefix as one external Feature Group; the read engine globs `s3://<bucket>/<prefix>/**/*.parquet` with `union_by_name=true` and `hive_partitioning=true`, so schema drift across files and Hive-partitioned subdirectories (`year=2024/month=03/...`) are handled automatically.
- **Each `_delta_log/`-shaped directory** as a single Delta Lake entry.
- **Each `.hoodie/`-shaped directory** as a single Apache Hudi entry.

`*.orc`, `*.avro`, and `*.csv` files are surfaced individually but are not currently supported as the read source for an external Feature Group through the Feature Query Service; use Spark via [`prepare_spark`](../usage.md#prepare-spark-api) for those.
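
To illustrate the Hive-style partition segments mentioned above, the following sketch extracts partition columns from an object key.
The function name is hypothetical and the values are kept as strings; how the read engine types them is not covered here.

```python
def hive_partitions(key):
    """Return the partition columns encoded in the directory segments of a key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


parts = hive_partitions("sales/year=2024/month=03/part-0.parquet")
# parts == {"year": "2024", "month": "03"}
```

A filter on `year` or `month` can therefore be answered from the key alone, which is what allows whole subdirectories to be skipped at read time.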

## Next Steps

Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created S3 connector.
1 change: 1 addition & 0 deletions docs/user_guides/fs/data_source/index.md
@@ -39,6 +39,7 @@ Cloud agnostic storage systems:
5. [CRM, Sales & Analytics](creation/crm_sales_analytics.md): Connect to supported CRM, sales, and analytics platforms.
6. [REST API](creation/rest_api.md): Connect to external HTTP APIs with configurable headers and authentication.
7. [SAP HANA][data-source-sap-hana]: Query SAP HANA tenant databases using SQL.
8. [MongoDB][data-source-mongodb]: Read collections from a self-hosted MongoDB deployment or Atlas cluster.

## AWS

47 changes: 46 additions & 1 deletion docs/user_guides/fs/feature_group/create_external.md
@@ -63,14 +63,19 @@ Once you have defined the metadata, you can

#### Data Lake based external feature group

External Feature Groups backed by an S3 data source can point at either a single parquet file or a directory of parquet files.
The directory form globs `**/*.parquet` and unions schemas across files (`union_by_name`), so adding columns to newer files in the prefix does not require recreating the Feature Group; Hive-style partition segments (`year=2024/month=03/...`) are exposed as columns and pushed down at filter time.
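
The effect of unioning schemas by name can be sketched in plain Python.
This is a conceptual illustration with hypothetical helper names, not the read engine's implementation: columns are matched by name across files, and rows from files that lack a column get `None` for it.

```python
def union_by_name(file_schemas):
    """Merge per-file column lists, keeping first-seen order."""
    columns = []
    for schema in file_schemas:
        for column in schema:
            if column not in columns:
                columns.append(column)
    return columns


def align(rows, columns):
    """Fill columns missing from older files with None."""
    return [{column: row.get(column) for column in columns} for row in rows]


# Two files in the same prefix; the newer one adds a `discount` column.
columns = union_by_name([
    ["ss_store_sk", "sale_date"],
    ["ss_store_sk", "sale_date", "discount"],
])
rows = align(
    [{"ss_store_sk": 1, "sale_date": "2024-03-01"},
     {"ss_store_sk": 2, "sale_date": "2024-03-02", "discount": 0.1}],
    columns,
)
```

Rows from the older file simply carry `None` in `discount`, which is why adding a column to newer files does not invalidate the existing Feature Group.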

=== "Python"

```python
# Single parquet file on S3
fg = feature_store.create_external_feature_group(
    name="sales",
    version=1,
    description="Physical shop sales features",
    data_format="parquet",
    path="sales/2024.parquet",
    data_source=ds,
    primary_key=["ss_store_sk"],
    event_time="sale_date",
@@ -79,6 +84,46 @@ Once you have defined the metadata, you can
fg.save()
```

=== "Python"

```python
# Directory of parquet files on S3: schema evolution and partition pushdown
fg = feature_store.create_external_feature_group(
    name="sales",
    version=1,
    description="Physical shop sales features",
    data_format="parquet",
    path="sales/",
    data_source=ds,
    primary_key=["ss_store_sk"],
    event_time="sale_date",
)

fg.save()
```

#### MongoDB external feature group

External Feature Groups backed by a [MongoDB data source][data-source-mongodb] read documents from a chosen collection.
Provide the database and collection on the data source; the per-Feature-Group selection overrides the connector's default database and default collection.

=== "Python"

```python
ds.database = "sample_mflix"
ds.collection = "comments"

fg = feature_store.create_external_feature_group(
    name="comments",
    version=1,
    description="Movie comments",
    data_source=ds,
    primary_key=["id"],
)

fg.save()
```

You can read the full [`FeatureStore.create_external_feature_group`][hsfs.feature_store.FeatureStore.create_external_feature_group] documentation for more details.
`name` is a mandatory parameter of `create_external_feature_group` and represents the name of the feature group.

@@ -135,7 +180,7 @@ Users can select which subset of the feature group data they want to make availa

Hopsworks Feature Store does not support time-travel queries on external feature groups.

Additionally, support for `.read()` and `.show()` methods when using by the Python engine is limited to external feature groups defined on BigQuery and Snowflake and only when using the [Feature Query Service](../../../setup_installation/common/arrow_flight_duckdb.md).
Additionally, support for the `.read()` and `.show()` methods with the Python engine is limited to external feature groups defined on BigQuery, Snowflake, Redshift, SQL (MySQL / PostgreSQL / Oracle), SAP HANA, Unity Catalog, MongoDB, and S3 parquet (single file or directory), and only when using the [Feature Query Service](../../../setup_installation/common/arrow_flight_duckdb.md).
Nevertheless, external feature groups defined on top of any data source can be used to create a training dataset from a Python environment by invoking one of the following methods: [`FeatureView.create_training_data`][hsfs.feature_view.FeatureView.create_training_data], [`FeatureView.create_train_test_split`][hsfs.feature_view.FeatureView.create_train_test_split] or [`FeatureView.create_train_validation_test_split`][hsfs.feature_view.FeatureView.create_train_validation_test_split].

### API Reference
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -75,6 +75,7 @@ nav:
- REST API: user_guides/fs/data_source/creation/rest_api.md
- Unity Catalog: user_guides/fs/data_source/creation/unity_catalog.md
- SAP HANA: user_guides/fs/data_source/creation/sap_hana.md
- MongoDB: user_guides/fs/data_source/creation/mongodb.md
- Usage: user_guides/fs/data_source/usage.md
- Feature Group:
- user_guides/fs/feature_group/index.md