Conversation
Contributor

Copilot AI commented Jan 29, 2026

Implementation Plan for Kafka Documentation and Tests

Overview

Add documentation and verification tests to replace the Envoy integration tests that are being removed from the main envoy repository.

Checklist

  • Kafka Broker - Empty Topic / Consumer Waiting Behavior

    • Add new section to kafka/example.rst documenting:
      • How to create and consume from empty topic
      • Envoy correctly proxies fetch requests even with no data
      • How to observe fetch metrics increasing
      • Expected Kafka long-polling behavior
    • Add corresponding test to kafka/verify.sh:
      • Create new empty topic
      • Record initial fetch_request metric count
      • Run consumer with timeout on empty topic
      • Verify fetch_request metrics increased
      • Clean up by deleting topic
  • Kafka Mesh - High-Volume / Batched Producing

    • Add new section to kafka-mesh/example.rst documenting:
      • Kafka producers can batch records into single ProduceRequest
      • How to send multiple messages rapidly
      • All messages correctly routed to appropriate cluster
      • How to verify all messages arrived
    • Add corresponding test to kafka-mesh/verify.sh:
      • Send multiple messages (20+) rapidly to trigger batching
      • Verify all messages arrived at correct upstream cluster
      • Verify produce metrics reflect the batched requests
  • Code Review Fixes

    • Give metric variables default values (so unset metrics don't break comparisons)
    • Guard numeric comparisons against "integer expression expected" errors
    • Apply RST inline-literal formatting to technical terms (``ProduceRequest``, ``cluster1``)
  • Validation

    • Run kafka/verify.sh to ensure changes work correctly ✅
    • Run kafka-mesh/verify.sh to ensure changes work correctly ✅
    • Review all changes for minimal scope and correctness ✅
Original prompt

Context

The envoyproxy/envoy repository contains Kafka integration tests that depend on the envoy-static binary being built. These tests need to be removed from envoy, but before doing so we need to ensure the functionality they test is documented and verified in the examples repo.

The examples repo is documentation-first - we add documentation that provides value to users, and the verify.sh scripts exist to test that the documentation works correctly.

Documentation to Add

1. Empty Topic / Consumer Waiting Behavior (kafka/example.rst)

User value: Users often want to understand how Envoy behaves when a consumer is waiting for messages that haven't arrived yet. Does the proxy handle long-polling correctly? Do metrics still work?

Add a new section to kafka/example.rst that:

  • Shows how to create a topic and consume from it before any messages are sent
  • Demonstrates that Envoy correctly proxies fetch requests even when no data is returned
  • Shows how to observe fetch metrics increasing even when consuming from an empty topic (proving requests are being proxied)
  • Explains that this is expected Kafka consumer behavior (long-polling)

The corresponding test in kafka/verify.sh should:

  • Create a new empty topic
  • Record the initial fetch_request metric count
  • Run a consumer with a timeout on the empty topic (expecting no messages)
  • Verify that fetch_request metrics increased (proving Envoy proxied the fetch requests correctly)
  • Clean up by deleting the topic
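The steps above could be sketched roughly as follows. This is a hypothetical outline, not the repo's actual script: the `kafka_client` helper and `proxy:10000` bootstrap address follow the sandbox's conventions, while the admin stats port (`8001`) and the exact metric line format are assumptions.

```shell
#!/bin/bash
# Hypothetical sketch of the empty-topic fetch test; kafka_client, the
# admin port and the metric name are assumptions, not verified details.

# Pure helper: extract the counter value from a stats line such as
# "kafka.request.fetch_request: 7".
parse_metric_count () {
    echo "${1##*: }"
}

# Pure helper: succeed only if the second count is strictly greater.
metric_increased () {
    [ "${2:-0}" -gt "${1:-0}" ]
}

run_empty_topic_test () {
    # Create a fresh topic that no producer has written to.
    kafka_client kafka-topics --bootstrap-server proxy:10000 --create --topic empty-topic

    # Record the fetch_request counter before consuming.
    before=$(parse_metric_count "$(curl -s http://localhost:8001/stats | grep fetch_request)")

    # Consume with a timeout; zero messages are expected, but Envoy should
    # still proxy the long-polling fetch requests.
    kafka_client kafka-console-consumer --bootstrap-server proxy:10000 \
        --topic empty-topic --timeout-ms 10000 || true

    # The counter must have increased even though no data was returned.
    after=$(parse_metric_count "$(curl -s http://localhost:8001/stats | grep fetch_request)")
    metric_increased "$before" "$after"

    # Clean up by deleting the topic.
    kafka_client kafka-topics --bootstrap-server proxy:10000 --delete --topic empty-topic
}
```

The comparison helpers are kept pure so the pass/fail logic can be exercised without a running broker.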

2. High-Volume / Batched Producing through Mesh Filter (kafka-mesh/example.rst)

User value: Users running Kafka mesh in production need to understand how Envoy handles high-throughput scenarios where producers batch records. Does routing still work correctly when a single ProduceRequest contains records for multiple topics destined for different clusters?

Add a new section to kafka-mesh/example.rst that:

  • Explains that Kafka producers can batch multiple records into a single ProduceRequest
  • Shows how to send multiple messages (e.g., 20+) rapidly to demonstrate batching behavior
  • Demonstrates that all messages are correctly routed to the appropriate upstream cluster
  • Shows how to verify all messages arrived by consuming from the upstream cluster directly

The corresponding test in kafka-mesh/verify.sh should:

  • Send multiple messages (e.g., 20) rapidly to a topic (to trigger producer batching)
  • Verify all messages arrived at the correct upstream cluster
  • Optionally verify produce metrics reflect the batched requests
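A rough sketch of that test might look like the following. The `cherries` topic and `proxy:10000` listener come from the kafka-mesh sandbox docs; the upstream broker address `kafka-server-1:9092` and the `kafka_client` helper are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical sketch of the batched-produce test; upstream broker name
# and helper names are assumptions.

EXPECTED=20

# Pure helper: count consumed lines that look like our messages.
count_messages () {
    echo "$1" | grep -c '^cherry message'
}

run_batched_produce_test () {
    # Send $EXPECTED messages in one pipe so the producer can batch them
    # into a small number of ProduceRequests.
    for i in $(seq 1 "$EXPECTED"); do
        echo "cherry message $i"
    done | kafka_client kafka-console-producer --broker-list proxy:10000 --topic cherries

    # Consume directly from the upstream cluster1 broker to confirm that
    # the mesh filter routed every record there.
    output=$(kafka_client kafka-console-consumer --bootstrap-server kafka-server-1:9092 \
        --topic cherries --from-beginning --timeout-ms 10000 || true)

    # All messages must have arrived.
    [ "$(count_messages "$output")" -eq "$EXPECTED" ]
}
```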

Reference: Existing Envoy Integration Tests Being Replaced

For reference, here are the specific test behaviors from envoy that should be covered:

From kafka_broker_integration_test.py:

  • test_kafka_consumer_with_no_messages_received: Creates a consumer that polls an empty topic 10 times, verifies fetch metrics increase without errors

From kafka_mesh_integration_test.py:

  • test_producing_with_batched_records: Sends 100 messages each to two topics with linger_ms=1000 and batch_size=100 to force batching, verifies all messages are correctly routed to different clusters
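The `linger_ms`/`batch_size` settings in the Python test correspond to Kafka's standard `linger.ms` and `batch.size` producer configs, which `kafka-console-producer` accepts via `--producer-property`. A sketch of how the replacement test could force the same batching (the command is only echoed here, since it needs a running broker):

```shell
# Mirror the original test's batching settings (linger.ms=1000, batch.size=100)
# using standard Kafka producer configs.
BATCH_FLAGS="--producer-property linger.ms=1000 --producer-property batch.size=100"

# Example invocation against the sandbox listener:
echo "kafka-console-producer --broker-list proxy:10000 --topic cherries $BATCH_FLAGS"
```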

Files to Modify

  1. kafka/example.rst - Add new documentation section
  2. kafka/verify.sh - Add test coverage for the new documentation
  3. kafka-mesh/example.rst - Add new documentation section
  4. kafka-mesh/verify.sh - Add test coverage for the new documentation

Style Guidelines

  • Follow the existing documentation style in the .rst files (use existing sections as templates)
  • Follow the existing shell script patterns in verify.sh files (use run_log, kafka_client helper, has_metric_with_at_least_1 where applicable)
  • Documentation should be user-focused, explaining the "why" not just the "how"

This pull request was created from Copilot chat.



Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
@phlax (Member) left a comment

In production, Kafka producers often batch multiple records into a single
ProduceRequest to improve throughput. The mesh filter must correctly handle
Suggested change
ProduceRequest to improve throughput. The mesh filter must correctly handle
``ProduceRequest`` to improve throughput. The mesh filter must correctly handle

Copilot AI (Contributor, Author) replied:
Applied in c53084e

these batched requests and route records to the appropriate upstream cluster.

Send multiple messages rapidly to the ``cherries`` topic (which routes to
cluster1 based on the ``c`` prefix). The producer will automatically batch
Suggested change
cluster1 based on the ``c`` prefix). The producer will automatically batch
``cluster1`` based on the ``c`` prefix). The producer will automatically batch

Copilot AI (Contributor, Author) replied:
Applied in c53084e

echo \"cherry message \$i\"; \
done | kafka-console-producer --request-required-acks 1 --producer-property enable.idempotence=false --broker-list proxy:10000 --topic cherries"

Now verify that all 20 messages arrived at cluster1 by consuming directly
Suggested change
Now verify that all 20 messages arrived at cluster1 by consuming directly
Now verify that all 20 messages arrived at ``cluster1`` by consuming directly

Copilot AI (Contributor, Author) replied:
Applied in c53084e

20

This confirms that even though the producer may have batched the records into
multiple ProduceRequests, the mesh filter correctly routed all messages to the
Suggested change
multiple ProduceRequests, the mesh filter correctly routed all messages to the
multiple ``ProduceRequest``s, the mesh filter correctly routed all messages to the

Copilot AI (Contributor, Author) replied:
Applied in c53084e

Copilot AI and others added 2 commits January 29, 2026 19:32
Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add documentation for empty topic consumer behavior in Kafka" to "Add Kafka documentation for empty topic consumer behavior and batched produce routing" Jan 29, 2026
Copilot AI requested a review from phlax January 29, 2026 19:37
Co-authored-by: phlax <454682+phlax@users.noreply.github.com>