Skip to content

Fix flaky kafka tests by flushing producer before ending sender transactions#2619

Merged
xrmx merged 5 commits intomainfrom
copilot/investigate-flaky-kafka-test
Apr 24, 2026
Merged

Fix flaky kafka tests by flushing producer before ending sender transactions#2619
xrmx merged 5 commits intomainfrom
copilot/investigate-flaky-kafka-test

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 20, 2026

producer.send() in kafka-python is async — messages are queued to a background I/O thread. The consumer has a 500ms timeout. Without flushing the producer, messages may not reach the broker before the consumer gives up, resulting in zero items consumed and zero spans.

A matrix run showed that flushing after ending the sender-side transaction could change timing enough to break test_kafka_consume under python-3.9 / kafka-python-newest.

  • Updated all 6 delayed_send functions across kafka tests to use a single producer.flush()
  • In the sender-thread tests that open and close a transaction, moved producer.flush() to run before elasticapm_client.end_transaction("foo")
def delayed_send():
    time.sleep(0.2)
    elasticapm_client.begin_transaction("foo")
    producer.send(topic="foo", key=b"foo", value=b"bar")
    producer.send("bar", key=b"foo", value=b"bar")
    producer.send("test", key=b"foo", value=b"bar")
    producer.flush()
    elasticapm_client.end_transaction("foo")

Copilot AI linked an issue Apr 20, 2026 that may be closed by this pull request
The producer.send() method in kafka-python is asynchronous - it queues
messages to be sent by a background I/O thread. Without calling
producer.flush(), messages may not be delivered to the Kafka broker
before the consumer's consumer_timeout_ms=500 expires. This creates a
race condition where the consumer loop exits with no items consumed,
resulting in no spans being created.

Added producer.flush() after producer.send() calls in all delayed_send
functions to ensure messages are actually delivered to Kafka.

Agent-Logs-Url: https://github.com/elastic/apm-agent-python/sessions/e5ad59a6-a44a-4dd2-978e-fe4844b5c198

Co-authored-by: xrmx <12932+xrmx@users.noreply.github.com>
Copilot AI changed the title [WIP] Investigate flaky kafka test Fix flaky kafka tests by flushing producer before consumer timeout Apr 20, 2026
Copilot AI requested a review from xrmx April 20, 2026 13:02
Copilot finished work on behalf of xrmx April 20, 2026 13:02
Copy link
Copy Markdown
Member

@xrmx xrmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test matrix

@xrmx
Copy link
Copy Markdown
Member

xrmx commented Apr 21, 2026

@copilot please take a look at the failure and update the PR with a fix

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 21, 2026

@copilot please take a look at the failure and update the PR with a fix

I investigated the matrix failure and updated the PR in 059906f. The failing job was python-3.9 / kafka-python-newest, where test_kafka_consume regressed after the earlier flush() change. I replaced those flush() calls with producer.send(...).get(timeout=5) in the kafka consumer sender threads so delivery is deterministic without shifting the transaction timing.

Copilot AI changed the title Fix flaky kafka tests by flushing producer before consumer timeout Fix flaky kafka tests by waiting for producer send futures Apr 21, 2026
Copilot finished work on behalf of xrmx April 21, 2026 16:04
Copilot AI requested a review from xrmx April 21, 2026 16:04
@xrmx
Copy link
Copy Markdown
Member

xrmx commented Apr 22, 2026

@copilot what about calling flush before closing the transaction instead, could that have the same effect instead of waiting for every send? Am I misunderstanding?

@xrmx
Copy link
Copy Markdown
Member

xrmx commented Apr 23, 2026

@copilot please investigate on calling flush before closing the transaction instead of waiting for every send

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 23, 2026

@copilot what about calling flush before closing the transaction instead, could that have the same effect instead of waiting for every send? Am I misunderstanding?

Yes — in the tests that open and close a producer-side transaction inside delayed_send(), calling producer.flush() before elasticapm_client.end_transaction("foo") should give the same delivery guarantee without waiting on every individual send. I updated the PR in d2bad11 to use that approach.

@copilot please investigate on calling flush before closing the transaction instead of waiting for every send

I investigated that approach and updated the PR in d2bad11. The tests now use a single producer.flush() per delayed_send(), and in the producer-side transaction cases it runs before end_transaction().

Copilot AI changed the title Fix flaky kafka tests by waiting for producer send futures Fix flaky kafka tests by flushing producer before ending sender transactions Apr 23, 2026
Copilot finished work on behalf of xrmx April 23, 2026 10:27
Copy link
Copy Markdown
Member

@xrmx xrmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test matrix

@xrmx xrmx marked this pull request as ready for review April 23, 2026 15:58
@xrmx xrmx requested a review from a team as a code owner April 23, 2026 15:58
Copy link
Copy Markdown

@hegerchr hegerchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xrmx xrmx merged commit 1b79b46 into main Apr 24, 2026
95 of 96 checks passed
@xrmx xrmx deleted the copilot/investigate-flaky-kafka-test branch April 24, 2026 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate flaky kafka test

3 participants