[kafka_actions] Bound message reads to a start-of-check snapshot (#24162)

piochelepiotr · claude · web-flow · commit bd332d9e51ce · 2026-06-24T13:40:51.000Z
* [kafka_actions] Bound message reads to a start-of-check snapshot

read_messages could hang until its global timeout whenever a selective
filter matched fewer messages than n_messages_retrieved: once the consumer
drained the existing backlog it kept polling the live head, and because a
continuously-produced topic almost always delivers a message within the
poll window, the "no more messages" (poll == None) exit never fired.
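
For illustration only (not code from this change): a minimal, Kafka-free model
of the old exit conditions. The names old_read, poll, matches and timeout_polls
are hypothetical, poll counts stand in for wall-clock time, and the
max_scanned_messages limit is left out for brevity.

```python
import itertools
from typing import Callable, Optional


def old_read(
    poll: Callable[[], Optional[int]],
    matches: Callable[[int], bool],
    wanted: int,
    timeout_polls: int,
) -> tuple[list[int], str]:
    """Model of the old loop: stop on enough matches, a None poll, or the timeout."""
    found: list[int] = []
    for _ in range(timeout_polls):
        msg = poll()
        if msg is None:
            # The old "no more messages" exit: only reachable on an idle topic.
            return found, "drained"
        if matches(msg):
            found.append(msg)
            if len(found) >= wanted:
                return found, "enough"
    return found, "timeout"


def selective(msg: int) -> bool:
    # A filter that matches fewer messages than the caller asks for.
    return msg == 3


# Idle topic: five messages of backlog, then poll() returns None.
backlog = iter(range(5))
print(old_read(lambda: next(backlog, None), selective, wanted=2, timeout_polls=100))
# ([3], 'drained')

# Live topic: every poll delivers a message, so None is never seen.
live = itertools.count()
print(old_read(lambda: next(live), selective, wanted=2, timeout_polls=100))
# ([3], 'timeout')
```

In this model both runs find the same single match, but only the idle topic
exits early; the live topic runs until the timeout.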

Bound consumption to a snapshot of the log taken when the check starts:

- Capture each partition's high watermark up front and never yield a
  message at or beyond it, so messages produced after the check began are
  excluded and live-tailing is impossible.
- Enable enable.partition.eof and stop a partition on its EOF event or when
  its captured watermark is reached; return as soon as all partitions are
  drained.
- Reduce the default timeout from 20s to 5s (now only a safety net) and
  surface a hit_timeout stat so a truncated read is distinguishable from a
  complete one.
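
A minimal sketch of the snapshot idea, assuming an in-memory dict of lists in
place of a real topic. read_snapshot, Log and on_read are hypothetical names
used only here; the actual change takes the watermarks from
AdminClient.list_offsets and reads through consumer.poll, as shown in the diff
below.

```python
from typing import Callable, Iterator

# Hypothetical in-memory stand-in for a topic: {partition: [messages]}.
Log = dict[int, list[str]]


def read_snapshot(
    log: Log, on_read: Callable[[], None] = lambda: None
) -> Iterator[tuple[int, int, str]]:
    """Yield (partition, offset, message) for messages present when iteration starts."""
    # Snapshot the high watermark (offset of the next message to be written).
    end_offsets = {p: len(msgs) for p, msgs in log.items()}
    # Track only partitions that have something in [0, high_watermark).
    active = {p for p, end in end_offsets.items() if end > 0}
    positions = dict.fromkeys(log, 0)

    while active:
        for p in sorted(active):
            offset = positions[p]
            if offset >= end_offsets[p]:
                # Watermark reached: stop this partition even if the log grew.
                active.discard(p)
                continue
            yield p, offset, log[p][offset]
            positions[p] = offset + 1
            on_read()  # hook that lets a caller simulate a concurrent producer


log = {0: ["a", "b"], 1: ["c"], 2: []}
# Simulate a live producer appending to partition 0 after every read.
result = list(read_snapshot(log, on_read=lambda: log[0].append("late")))
print(result)  # [(0, 0, 'a'), (1, 0, 'c'), (0, 1, 'b')]
```

Partition 0 keeps growing during the read, yet the loop ends once every
partition reaches its captured watermark, and none of the late messages are
yielded.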

Verified against a live 10-partition topic: the previously-hanging filtered
read now returns in ~0.3s instead of 20s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Wolski <piotr.wolski@datadoghq.com>

* [kafka_actions] Add changelog entry

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Wolski <piotr.wolski@datadoghq.com>

* [kafka_actions] Fix import grouping for ruff isort

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Wolski <piotr.wolski@datadoghq.com>

---------

Signed-off-by: Piotr Wolski <piotr.wolski@datadoghq.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/kafka_actions/assets/configuration/spec.yaml b/kafka_actions/assets/configuration/spec.yaml
@@ -205,8 +205,10 @@ files:
     - name: read_messages
       description: |
         Configuration for reading messages from Kafka topics.
-        Messages are streamed in real-time and sent to Datadog as they arrive.
-        The check has a 20-second timeout for the entire operation.
+        Only messages already present in the log when the check starts are read; the
+        per-partition high watermark is captured up front and consumption stops once every
+        partition is drained, so messages produced after the check began are never returned.
+        A 5-second timeout bounds the entire operation as a safety net.
         Supports JSON, BSON, Protobuf, and Avro with optional Schema Registry integration.
         Filtering is applied after deserialization using jq-style expressions.
       fleet_configurable: true
diff --git a/kafka_actions/changelog.d/24162.fixed b/kafka_actions/changelog.d/24162.fixed
@@ -0,0 +1 @@
+Fix `read_messages` hanging until the global timeout when a filter matched fewer messages than `n_messages_retrieved`. Consumption is now bounded to a snapshot of the log taken when the check starts (per-partition high watermark + `enable.partition.eof`), the default timeout is reduced from 20s to 5s, and a `hit_timeout` stat distinguishes a truncated read from a complete one.
diff --git a/kafka_actions/datadog_checks/kafka_actions/check.py b/kafka_actions/datadog_checks/kafka_actions/check.py
@@ -267,7 +267,7 @@ def _action_read_messages(self):
         start_timestamp = config.get('start_timestamp')
         n_messages_retrieved = config.get('n_messages_retrieved', 10)
         max_scanned_messages = config.get('max_scanned_messages', 1000)
-        timeout_ms = config.get('timeout_ms', 20000)
+        timeout_ms = config.get('timeout_ms', 5000)
         filter_expression = config.get('filter', '')
         consumer_group_id = config.get('consumer_group_id') or f"datadog-agent-{self.remote_config_id}"
 
@@ -326,6 +326,8 @@ def _action_read_messages(self):
         if scanned_count >= max_scanned_messages and sent_count < n_messages_retrieved:
             hit_scan_limit = True
 
+        hit_timeout = self.kafka_client.hit_timeout and not hit_retrieved_limit and not hit_scan_limit
+
         elapsed_time = time.time() - start_time
 
         stats = {
@@ -337,6 +339,7 @@ def _action_read_messages(self):
             'messages_filtered_out': filtered_out_count,
             'hit_scan_limit': hit_scan_limit,
             'hit_retrieved_limit': hit_retrieved_limit,
+            'hit_timeout': hit_timeout,
             'elapsed_time_seconds': round(elapsed_time, 3),
             'n_messages_retrieved': n_messages_retrieved,
             'max_scanned_messages': max_scanned_messages,
@@ -358,6 +361,14 @@ def _action_read_messages(self):
                 sent_count,
             )
 
+        if hit_timeout:
+            self.log.warning(
+                "Hit the %dms timeout after scanning %d messages and retrieving %d. Result may be incomplete.",
+                timeout_ms,
+                scanned_count,
+                sent_count,
+            )
+
         return stats
 
     def _evaluate_filter(self, filter_expression: str, deserialized_msg: DeserializedMessage) -> bool:
diff --git a/kafka_actions/datadog_checks/kafka_actions/data/conf.yaml.example b/kafka_actions/datadog_checks/kafka_actions/data/conf.yaml.example
@@ -197,8 +197,10 @@ instances:
 
     ## @param read_messages - mapping - optional
     ## Configuration for reading messages from Kafka topics.
-    ## Messages are streamed in real-time and sent to Datadog as they arrive.
-    ## The check has a 20-second timeout for the entire operation.
+    ## Only messages already present in the log when the check starts are read; the
+    ## per-partition high watermark is captured up front and consumption stops once every
+    ## partition is drained, so messages produced after the check began are never returned.
+    ## A 5-second timeout bounds the entire operation as a safety net.
     ## Supports JSON, BSON, Protobuf, and Avro with optional Schema Registry integration.
     ## Filtering is applied after deserialization using jq-style expressions.
     #
diff --git a/kafka_actions/datadog_checks/kafka_actions/kafka_client.py b/kafka_actions/datadog_checks/kafka_actions/kafka_client.py
@@ -32,6 +32,8 @@ def __init__(self, config: KafkaActionsConfig, log):
         self.consumer = None
         self.producer = None
         self.admin_client = None
+        # True when consume_messages stopped on the timeout rather than draining all partitions.
+        self.hit_timeout = False
 
     def _get_authentication_config(self) -> dict[str, Any]:
         """Build authentication configuration for librdkafka."""
@@ -134,6 +136,8 @@ def get_consumer(self, group_id: str = 'kafka_actions') -> Consumer:
                     'group.id': group_id,
                     'auto.offset.reset': 'earliest',
                     'enable.auto.commit': False,
+                    # Signal end-of-partition via a _PARTITION_EOF event so we stop once drained.
+                    'enable.partition.eof': True,
                 }
             )
             self.consumer = Consumer(config)
@@ -193,29 +197,34 @@ def consume_messages(
         start_offset: int = -2,
         start_timestamp: int | None = None,
         max_messages: int = 1000,
-        timeout_ms: int = 30000,
+        timeout_ms: int = 5000,
         group_id: str = 'kafka_actions',
     ):
-        """Consume messages from a Kafka topic, yielding them as they arrive.
+        """Consume the messages already present in a topic, yielding them as they are read.
 
-        This is a generator that yields messages in real-time as they're consumed,
-        allowing for immediate processing and sending to Datadog.
+        The per-partition high watermark is captured before consumption begins and no message
+        at or beyond it is yielded, so messages produced after the check starts are never
+        returned and the generator can't tail a live topic. A partition stops on EOF or when its
+        captured watermark is reached; the generator returns once all are drained. ``timeout_ms``
+        is only a safety net.
 
         Args:
             topic: Topic name
             partition: Partition number (-1 for all partitions)
             start_offset: Starting offset (-1 for latest, -2 for earliest)
             start_timestamp: Starting timestamp in milliseconds since epoch. When set, start_offset is ignored.
             max_messages: Maximum messages to consume
-            timeout_ms: Global timeout in milliseconds for the entire consumption
+            timeout_ms: Safety-net timeout in milliseconds for the entire consumption
             group_id: Consumer group ID
 
         Yields:
-            Kafka messages as they arrive
+            Kafka messages that existed in the log when consumption began
         """
         consumer = self.get_consumer(group_id)
+        admin = self.get_admin_client()
         start_time = time.time()
         global_timeout_s = timeout_ms / 1000.0
+        self.hit_timeout = False
 
         try:
             if partition == -1:
@@ -227,71 +236,113 @@ def consume_messages(
             else:
                 partition_ids = [partition]
 
-            if start_timestamp is not None:
-                # Resolve timestamp to per-partition offsets using offsets_for_times.
-                timestamp_partitions = [TopicPartition(topic, p, start_timestamp) for p in partition_ids]
-                partitions = consumer.offsets_for_times(timestamp_partitions, timeout=10)
-                for tp in partitions:
-                    if tp.offset != -1:
-                        self.log.debug(
-                            "Partition %d: timestamp %d resolved to offset %d",
-                            tp.partition,
-                            start_timestamp,
-                            tp.offset,
-                        )
-            elif start_offset == -1:
-                # For "latest" offset, seek back from the high watermark to read the last N existing messages.
-                # Use AdminClient.list_offsets to fetch all high watermarks in a single batched call.
-                admin = self.get_admin_client()
-                offset_request = {TopicPartition(topic, p): OffsetSpec.latest() for p in partition_ids}
-                futures = admin.list_offsets(offset_request, request_timeout=10)
-
-                partitions = []
-                for tp, future in futures.items():
-                    result = future.result()
-                    seek_offset = max(0, result.offset - max_messages)
-                    partitions.append(TopicPartition(topic, tp.partition, seek_offset))
-                    self.log.debug("Partition %d: high=%d, seeking to %d", tp.partition, result.offset, seek_offset)
-            else:
-                partitions = [TopicPartition(topic, p, start_offset) for p in partition_ids]
+            # Snapshot each partition's high watermark; we never read at or beyond it.
+            end_request = {TopicPartition(topic, p): OffsetSpec.latest() for p in partition_ids}
+            end_futures = admin.list_offsets(end_request, request_timeout=10)
+            end_offsets = {tp.partition: future.result().offset for tp, future in end_futures.items()}
 
-            self.log.debug("Assigning partitions: %s", partitions)
+            start_offsets = self._resolve_start_offsets(
+                consumer, admin, topic, partition_ids, start_offset, start_timestamp, max_messages, end_offsets
+            )
+
+            # Assign only partitions that have messages in [start, high_watermark).
+            partitions = []
+            active = set()
+            for p in partition_ids:
+                start = start_offsets.get(p, 0)
+                end = end_offsets.get(p, 0)
+                if start < end:
+                    partitions.append(TopicPartition(topic, p, start))
+                    active.add(p)
+                else:
+                    self.log.debug("Partition %d: nothing to read (start=%d, high=%d)", p, start, end)
+
+            if not partitions:
+                self.log.debug("No messages to read for topic %s in [start, high-watermark)", topic)
+                return
+
+            self.log.debug("Assigning partitions: %s (high watermarks: %s)", partitions, end_offsets)
             consumer.assign(partitions)
 
             consumed = 0
 
-            while consumed < max_messages:
+            while consumed < max_messages and active:
                 elapsed = time.time() - start_time
                 remaining_timeout = global_timeout_s - elapsed
 
                 if remaining_timeout <= 0:
+                    self.hit_timeout = True
                     self.log.debug("Global timeout reached after %d messages", consumed)
                     break
 
                 poll_timeout = min(1.0, remaining_timeout)
                 msg = consumer.poll(timeout=poll_timeout)
 
                 if msg is None:
-                    self.log.debug("Poll returned None (no more messages available), stopping consumption")
-                    break
+                    # End-of-data arrives as an EOF event, not None; keep polling until drained.
+                    continue
 
                 if msg.error():
                     if msg.error().code() == KafkaError._PARTITION_EOF:
-                        self.log.debug("Reached end of partition")
+                        active.discard(msg.partition())
                         continue
                     else:
                         raise KafkaException(msg.error())
 
+                p = msg.partition()
+                # Never surface a message at or beyond the captured high watermark.
+                if p not in active or msg.offset() >= end_offsets.get(p, 0):
+                    active.discard(p)
+                    continue
+
                 yield msg
                 consumed += 1
 
+                if msg.offset() >= end_offsets[p] - 1:
+                    active.discard(p)
+
             self.log.debug("Consumed %d messages from topic %s in %.2fs", consumed, topic, time.time() - start_time)
 
         finally:
             if consumer:
                 consumer.close()
                 self.consumer = None
 
+    def _resolve_start_offsets(
+        self,
+        consumer,
+        admin,
+        topic: str,
+        partition_ids: list[int],
+        start_offset: int,
+        start_timestamp: int | None,
+        max_messages: int,
+        end_offsets: dict[int, int],
+    ) -> dict[int, int]:
+        """Return a {partition: start_offset} map. A start at or beyond the high watermark
+        means there is nothing to read for that partition."""
+        if start_timestamp is not None:
+            # An offset < 0 means the timestamp is past the end of the log: nothing to read.
+            timestamp_partitions = [TopicPartition(topic, p, start_timestamp) for p in partition_ids]
+            resolved = consumer.offsets_for_times(timestamp_partitions, timeout=10)
+            start_offsets = {}
+            for tp in resolved:
+                end = end_offsets.get(tp.partition, 0)
+                start_offsets[tp.partition] = tp.offset if tp.offset is not None and tp.offset >= 0 else end
+            return start_offsets
+
+        if start_offset == -1:
+            # "latest": seek back from the high watermark to read the last N existing messages.
+            return {p: max(0, end_offsets.get(p, 0) - max_messages) for p in partition_ids}
+
+        if start_offset == -2:
+            # "earliest": use the low watermark as the numeric start.
+            low_request = {TopicPartition(topic, p): OffsetSpec.earliest() for p in partition_ids}
+            low_futures = admin.list_offsets(low_request, request_timeout=10)
+            return {tp.partition: future.result().offset for tp, future in low_futures.items()}
+
+        return dict.fromkeys(partition_ids, start_offset)
+
     def produce_message(
         self,
         topic: str,
diff --git a/kafka_actions/tests/test_unit.py b/kafka_actions/tests/test_unit.py
