29 changes: 28 additions & 1 deletion content/en/observability_pipelines/processors/sample.md
@@ -9,6 +9,33 @@ products:

{{< product-availability >}}

{{% observability_pipelines/processors/sample %}}
## Overview

This processor samples your logs at the retention rate you define, keeping a representative subset and dropping the rest. For example, you can use this processor to sample 20% of logs from a noisy, non-critical service.

Sampling applies only to logs that match your filter query; other logs are not affected. If a log is dropped by this processor, it is not sent to any of the processors that follow.

## Setup

To set up the sample processor:
1. Define a **filter query**. Only logs that match the specified [filter query](#filter-query-syntax) are sampled at the retention rate you set below. Both the sampled logs and the logs that do not match the filter query are sent to the next step in the pipeline.
1. Enter your desired sampling rate in the **Retain** field. For example, entering `2` means that 2% of the logs that match the filter query are retained.
1. Optionally, enter a **Group By** field to create a separate sampling group for each unique value of that field. For example, `status:error` and `status:info` are two unique field values, and each group of events with the same field value is sampled independently. Click **Add Field** to partition by additional fields. See the [group-by example](#group-by-example).

### Group-by example

If you have the following setup for the sample processor:
- Filter query: `env:staging`
- Retain: `40%` of matching logs
- Group by: `status` and `service`

{{< img src="observability_pipelines/processors/group-by-example-service.png" alt="The sample processor with example values" style="width:40%;" >}}

Then, 40% of logs for each unique combination of `status` and `service` from `env:staging` are retained. For example:

- 40% of logs with `status:info` and `service:networks` are retained.
- 40% of logs with `status:info` and `service:core-web` are retained.
- 40% of logs with `status:error` and `service:networks` are retained.
- 40% of logs with `status:error` and `service:core-web` are retained.

{{% observability_pipelines/processors/filter_syntax %}}
@@ -9,16 +9,114 @@ products:

{{< product-availability >}}

{{% observability_pipelines/processors/sensitive_data_scanner %}}
## Overview

The Sensitive Data Scanner processor scans logs to detect and then redact or hash sensitive information, such as PII, PCI data, and custom sensitive data. You can pick from Datadog's library of predefined rules, or define custom regex rules to scan for sensitive data.

## Setup

To set up the processor:

1. Define a filter query. Only logs that match the specified filter query are scanned and processed. All logs are sent to the next step in the pipeline, regardless of whether they match the filter query.
1. Click **Add Scanning Rule**.
1. Select one of the following:

{{% collapse-content title="Add rules from the library" level="h5" %}}

{{% observability_pipelines/processors/sds_library_rules %}}
1. In the dropdown menu, select the library rule you want to use.
1. Recommended keywords are automatically added based on the library rule selected. After the scanning rule has been added, you can [add additional keywords or remove recommended keywords](#add-additional-keywords).
1. In the **Define rule target and action** section, select if you want to scan the **Entire Event**, **Specific Attributes**, or **Exclude Attributes** in the dropdown menu.
    - If you are scanning the entire event, you can optionally exclude specific attributes from getting scanned. Use [path notation](#path-notation-example-lib) (`outer_key.inner_key`) to access nested keys. For specified attributes with nested data, all nested data is excluded.
    - If you are scanning specific attributes, specify which attributes you want to scan. Use [path notation](#path-notation-example-lib) (`outer_key.inner_key`) to access nested keys. For specified attributes with nested data, all nested data is scanned.
1. For **Define actions on match**, select the action you want to take for the matched information (see the [redaction example](#redaction-example-lib) after these steps). **Note**: Redaction, partial redaction, and hashing are all irreversible actions.
    - **Redact**: Replaces all matching values with the text you specify in the **Replacement text** field.
    - **Partially Redact**: Replaces a specified portion of all matched data. In the **Redact** section, specify the number of characters you want to redact and which part of the matched data to redact.
    - **Hash**: Replaces all matched data with a unique identifier. The UTF-8 bytes of the match are hashed with the 64-bit fingerprint of FarmHash.
1. Optionally, click **Add Field** to add tags you want to associate with the matched events.
1. Add a name for the scanning rule.
1. Optionally, add a description for the rule.
1. Click **Save**.
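
### Redaction example {#redaction-example-lib}

The following is an illustrative example; the log attributes and values are hypothetical and are not tied to a specific library rule. A rule that matches email addresses, with the **Redact** action and the replacement text `[REDACTED]`, transforms this log:

```json
{
  "message": "Password reset requested",
  "usr": {
    "email": "jane.doe@example.com"
  }
}
```

into this log:

```json
{
  "message": "Password reset requested",
  "usr": {
    "email": "[REDACTED]"
  }
}
```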

### Path notation example {#path-notation-example-lib}

For the following message structure:

```json
{
  "outer_key": {
    "inner_key": "inner_value",
    "a": {
      "double_inner_key": "double_inner_value",
      "b": "b value"
    },
    "c": "c value"
  },
  "d": "d value"
}
```

- Use `outer_key.inner_key` to refer to the key with the value `inner_value`.
- Use `outer_key.a.double_inner_key` to refer to the key with the value `double_inner_value`.

### Add additional keywords

After adding scanning rules from the library, you can edit each rule separately and add additional keywords to the keyword dictionary.

1. Navigate to your [pipeline][1].
1. In the Sensitive Data Scanner processor with the rule you want to edit, click **Manage Scanning Rules**.
1. Toggle **Use recommended keywords** if you want the rule to use them. Otherwise, add your own keywords to the **Create keyword dictionary** field. You can also require that these keywords be within a specified number of characters of a match. By default, keywords must be within 30 characters before a matched value.
1. Click **Update**.

[1]: https://app.datadoghq.com/observability-pipelines

{{% /collapse-content %}}
{{% collapse-content title="Add a custom rule" level="h5" %}}

{{% observability_pipelines/processors/sds_custom_rules %}}
1. In the **Define match conditions** section, specify the regex pattern to use for matching against events in the **Define the regex** field. Enter sample data in the **Add sample data** field to verify that your regex pattern is valid.
    Sensitive Data Scanner supports Perl Compatible Regular Expressions (PCRE), but the following patterns are not supported:
    - Backreferences and capturing sub-expressions (lookarounds)
    - Arbitrary zero-width assertions
    - Subroutine references and recursive patterns
    - Conditional patterns
    - Backtracking control verbs
    - The `\C` "single-byte" directive (which breaks UTF-8 sequences)
    - The `\R` newline match
    - The `\K` start of match reset directive
    - Callouts and embedded code
    - Atomic grouping and possessive quantifiers
1. For **Create keyword dictionary**, add keywords to refine detection accuracy when matching regex conditions. For example, if you are scanning for a sixteen-digit Visa credit card number, you can add keywords like `visa`, `credit`, and `card` (see the [custom rule example](#custom-rule-example) after these steps). You can also require that these keywords be within a specified number of characters of a match. By default, keywords must be within 30 characters before a matched value.
1. In the **Define rule target and action** section, select if you want to scan the **Entire Event**, **Specific Attributes**, or **Exclude Attributes** in the dropdown menu.
    - If you are scanning the entire event, you can optionally exclude specific attributes from getting scanned. Use [path notation](#path-notation-example-custom) (`outer_key.inner_key`) to access nested keys. For specified attributes with nested data, all nested data is excluded.
    - If you are scanning specific attributes, specify which attributes you want to scan. Use [path notation](#path-notation-example-custom) (`outer_key.inner_key`) to access nested keys. For specified attributes with nested data, all nested data is scanned.
1. For **Define actions on match**, select the action you want to take for the matched information. **Note**: Redaction, partial redaction, and hashing are all irreversible actions.
    - **Redact**: Replaces all matching values with the text you specify in the **Replacement text** field.
    - **Partially Redact**: Replaces a specified portion of all matched data. In the **Redact** section, specify the number of characters you want to redact and which part of the matched data to redact.
    - **Hash**: Replaces all matched data with a unique identifier. The UTF-8 bytes of the match are hashed with the 64-bit fingerprint of FarmHash.
1. Optionally, click **Add Field** to add tags you want to associate with the matched events.
1. Add a name for the scanning rule.
1. Optionally, add a description for the rule.
1. Click **Add Rule**.
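
### Custom rule example {#custom-rule-example}

The following is an illustrative sketch, not a vetted credit card detection rule; the regex, keywords, and log values are hypothetical. A custom rule could use the regex `4\d{15}` with the keywords `visa`, `credit`, and `card`, and a **Partially Redact** action that redacts the first 12 characters of the match (shown here with asterisks as placeholder characters). It transforms this log:

```json
{
  "message": "Payment submitted with credit card 4111111111111111",
  "service": "checkout"
}
```

into this log:

```json
{
  "message": "Payment submitted with credit card ************1111",
  "service": "checkout"
}
```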

### Path notation example {#path-notation-example-custom}

For the following message structure:

```json
{
  "outer_key": {
    "inner_key": "inner_value",
    "a": {
      "double_inner_key": "double_inner_value",
      "b": "b value"
    },
    "c": "c value"
  },
  "d": "d value"
}
```

- Use `outer_key.inner_key` to refer to the key with the value `inner_value`.
- Use `outer_key.a.double_inner_key` to refer to the key with the value `double_inner_value`.

{{% /collapse-content %}}

152 changes: 151 additions & 1 deletion content/en/observability_pipelines/processors/split_array.md
@@ -9,6 +9,156 @@ products:

{{< product-availability >}}

{{% observability_pipelines/processors/split_array %}}
## Overview

This processor splits nested arrays into distinct events so that you can query, filter, alert, and visualize data within an array. The arrays must already be parsed. For example, the processor can process `[item_1, item_2]`, but cannot process `"[item_1, item_2]"`. The items in the array can be JSON objects, strings, integers, floats, or Booleans. All unmodified fields are added to the child events. For example, if you are sending the following items to the Observability Pipelines Worker:

```json
{
  "host": "my-host",
  "env": "prod",
  "batched_items": [item_1, item_2]
}
```

Use the Split Array processor to send each item in `batched_items` as a separate event:

```json
{
  "host": "my-host",
  "env": "prod",
  "batched_items": item_1
}
```

```json
{
  "host": "my-host",
  "env": "prod",
  "batched_items": item_2
}
```

See the [split array example](#split-array-example) for a more detailed example.

## Setup

To set up this processor:

Click **Manage arrays to split** to add a new array to split or edit an existing one. This opens a side panel.

- If you have not created any arrays yet, enter the array parameters as described in the [Add a new array](#add-a-new-array) section below.
- If you have already created arrays, click an array's row in the table, or use the search bar to find a specific array and then select it, to edit or delete it. Click **Add Array to Split** to add a new array.

### Add a new array

1. Define a filter query. Only logs that match the specified filter query are processed. All logs, regardless of whether they match the filter query, are sent to the next step in the pipeline.
1. Enter the path to the array field. Use the path notation `<OUTER_FIELD>.<INNER_FIELD>` to match subfields. See the [Path notation example](#path-notation-example-split-array) below.
1. Click **Save**.

### Split array example

This is an example event:

```json
{
  "ddtags": ["tag1", "tag2"],
  "host": "my-host",
  "env": "prod",
  "message": {
    "isMessage": true,
    "myfield": {
      "timestamp": 14500000,
      "firstarray": ["one", 2]
    }
  },
  "secondarray": [
    {
      "some": "json",
      "Object": "works"
    },
    44
  ]
}
```

If the processor is splitting the arrays `message.myfield.firstarray` and `secondarray`, it outputs child events that are identical to the parent event, except that the values of `message.myfield.firstarray` and `secondarray` each become a single item from their respective original arrays. Each child event is a unique combination of items from the two arrays, so four child events (2 items * 2 items = 4 combinations) are created in this example.

```json
{
  "ddtags": ["tag1", "tag2"],
  "host": "my-host",
  "env": "prod",
  "message": {
    "isMessage": true,
    "myfield": {"timestamp": 14500000, "firstarray": "one"}
  },
  "secondarray": {
    "some": "json",
    "Object": "works"
  }
}
```

```json
{
  "ddtags": ["tag1", "tag2"],
  "host": "my-host",
  "env": "prod",
  "message": {
    "isMessage": true,
    "myfield": {"timestamp": 14500000, "firstarray": "one"}
  },
  "secondarray": 44
}
```

```json
{
  "ddtags": ["tag1", "tag2"],
  "host": "my-host",
  "env": "prod",
  "message": {
    "isMessage": true,
    "myfield": {"timestamp": 14500000, "firstarray": 2}
  },
  "secondarray": {
    "some": "json",
    "Object": "works"
  }
}
```

```json
{
  "ddtags": ["tag1", "tag2"],
  "host": "my-host",
  "env": "prod",
  "message": {
    "isMessage": true,
    "myfield": {"timestamp": 14500000, "firstarray": 2}
  },
  "secondarray": 44
}
```

### Path notation example {#path-notation-example-split-array}

For the following message structure:

```json
{
  "outer_key": {
    "inner_key": "inner_value",
    "a": {
      "double_inner_key": "double_inner_value",
      "b": "b value"
    },
    "c": "c value"
  },
  "d": "d value"
}
```

- Use `outer_key.inner_key` to refer to the key with the value `inner_value`.
- Use `outer_key.a.double_inner_key` to refer to the key with the value `double_inner_value`.

{{% observability_pipelines/processors/filter_syntax %}}
@@ -10,6 +10,16 @@ aliases:

## Overview

{{% observability_pipelines/processors/tags_processor %}}
For logs coming from the Datadog Agent, use this processor to exclude or include specific tags in the Datadog tags (`ddtags`) array. Tags that are excluded, or that are not in the include list, are dropped, which can reduce your outbound log volume.
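
For example, with illustrative tag values, excluding the `env` tag key transforms the `ddtags` array of a matching log from this:

```json
{
  "message": "User logged in",
  "ddtags": ["env:prod", "service:web-store", "version:5.1.0"]
}
```

to this:

```json
{
  "message": "User logged in",
  "ddtags": ["service:web-store", "version:5.1.0"]
}
```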

## Setup

To set up the processor:

1. Define a filter query. Only matching logs are processed by this processor, but all logs continue to the next step in the pipeline.
1. Optionally, enter a Datadog tags array in the **Configure tags** section. The supported formats are `key:value` and `key`, for example `["key:value", "key"]`. See [Define Tags][1] for more information about the `key:value` format.
1. In the **Configure tags** section, choose whether to **Exclude tags** or **Include tags**. If you provided a tag array in the previous step, select the tag keys you want to configure. You can also manually add tag keys. **Note**: You can select up to 100 tags.

[1]: /getting_started/tagging/#define-tags

{{% observability_pipelines/processors/filter_syntax %}}
9 changes: 2 additions & 7 deletions content/en/observability_pipelines/processors/throttle.md
@@ -10,8 +10,6 @@ products:

{{< product-availability >}}

{{% observability_pipelines/processors/throttle %}}

## Overview

Use this processor to set a limit on the number of logs sent within a specific time window. For example, you can set a limit so that only 100 logs are sent per second. Setting a rate limit can help you catch any spikes in log ingestion and prevent unexpected billing costs.
@@ -20,11 +18,8 @@ Use this processor to set a limit on the number of logs sent within a specific t

To set up the processor:

1. Define a [filter query](#filter-query-syntax).
- Only logs that match the specified filter query are processed.
- All matched logs get throttled. Logs that are sent within the throttle limit and logs that do not match the filter are sent to the next step. Logs sent after the throttle limit has been reached, are dropped.
1. Set the throttling rate. This is the number of events allowed for a given bucket during the set time window.
- **Note**: This rate limit is applied on a **per-worker level**. If you scale the number of workers up or down, you may want to adjust the processor rate limit accordingly. You can update the rate limit programmatically using the [Observability Pipelines API][1].
1. Define a [filter query](#filter-query-syntax). Only logs that match the specified filter query are processed. All matched logs are throttled: logs sent within the throttle limit, along with logs that do not match the filter, are sent to the next step, while logs sent after the throttle limit has been reached are dropped.
1. Set the throttling rate. This is the number of events allowed for a given bucket during the set time window. **Note**: This rate limit is applied at the **per-worker level**. If you scale the number of workers up or down, you may want to adjust the processor rate limit accordingly. For example, with a limit of 100 events per one-second window, four workers can collectively pass up to roughly 400 matching events per second. You can update the rate limit programmatically using the [Observability Pipelines API][1].
1. Set the time window.
1. Optionally, click **Add Field** if you want to group by a field.
