[bug] OTLP exporters break in Swoole coroutines: gRPC broken pipe on reused connection + HTTP eventLoop conflict #23

Description

@vitucs

Summary

When running hyperf-opentelemetry inside a long-lived Hyperf/Swoole worker, the
OTLP exporters become unreliable after the collector recycles the HTTP/2
connection (e.g. LoadBalancer idle-timeout, rolling restart of the collector pod).
Two independent issues surface:

  1. SwooleGrpcTransport::getClient() never detects a server-side close, so
    the transport keeps reusing a half-closed Swoole\Coroutine\Http2\Client and
    every subsequent export fails with "Broken pipe".
  2. OtlpHttpTransportFactory falls back to PsrTransport with a blocking cURL
    client, which is incompatible with Swoole's coroutine scheduler and throws
    eventLoop already created whenever the batch processor flushes inside a
    coroutine context.

Both issues are reproducible with the exporters otlp_grpc and otlp_http as
documented in publish/open-telemetry.php, using:

  • opencodeco/hyperf-opentelemetry: 0.3.6 (latest)
  • PHP: 8.2+ / 8.3
  • ext-swoole: 5.1.x / 5.x
  • Hyperf: 3.1
  • Collector: otel/opentelemetry-collector-contrib (any recent version), reached
    through a Kubernetes Service / ALB that closes idle HTTP/2 connections after
    a few minutes.

Bug 1 — SwooleGrpcTransport::getClient() does not detect closed server connections

Location

src/Transport/SwooleGrpcTransport.php

private function getClient(): Client
{
    if ($this->client === null || ! $this->client->connected) {
        // (re)connect ...
    }
    return $this->client;
}

Problem

Swoole\Coroutine\Http2\Client::$connected only reflects whether
connect() was called — it does not flip to false when the peer sends a
TCP FIN / HTTP/2 GOAWAY. After the collector rotates, the client is still marked
connected === true, so getClient() returns the stale instance and the next
send() fails with:

RuntimeException: Failed to send gRPC request: Broken pipe

or, depending on timing:

RuntimeException: Failed to receive gRPC response: recv timeout

The transport keeps the same dead client forever, so every subsequent span/
metric export fails for the lifetime of the worker.

Additional noise

On PHP 8.2+ the same code path also triggers:

PHP Deprecated:  Creation of dynamic property Swoole\Coroutine\Http2\Client::$serverLastStreamId is deprecated

The property is written by Swoole's own extension code inside recv(), so the
deprecation cannot be fixed by userland — it should at least be filtered.

Suggested fix

In send(), treat "broken pipe", "connection reset", "connection closed",
"GOAWAY" and send/recv false as retryable: drop and rebuild the client, then
retry once before surfacing the error. The stored client should also be nulled
on any error so the next batch re-connects.

Minimal patch:

public function send(string $payload, ?CancellationInterface $c = null): FutureInterface
{
    if ($this->closed) {
        return new ErrorFuture(new RuntimeException('Transport is closed'));
    }

    $attempts = 0;
    retry:
    try {
        $client = $this->getClient();
        $streamId = $client->send($this->buildRequest($payload));
        if ($streamId === false || $streamId <= 0) {
            throw new RuntimeException('send failed: ' . ($client->errMsg ?: 'unknown'));
        }
        $response = $client->recv($this->timeout);
        if ($response === false) {
            throw new RuntimeException('recv failed: ' . ($client->errMsg ?: 'timeout'));
        }
        // ... grpc-status handling
        return new CompletedFuture(null);
    } catch (Throwable $e) {
        $this->client?->close();
        $this->client = null;
        if ($attempts++ < 1 && $this->isRetryable($e)) {
            goto retry;
        }
        return new ErrorFuture($e);
    }
}
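
The patch above calls isRetryable() without defining it. A minimal sketch of what that check could look like, written as a standalone function so it runs on its own (in the transport it would be a private method). The marker list is an assumption built from the failure modes named in this issue plus the 'send failed' / 'recv failed' prefixes used in the patch; it is not an exhaustive list of Swoole error strings.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decides whether a transport failure looks like a
 * dead connection that is worth one reconnect-and-retry.
 */
function isRetryable(Throwable $e): bool
{
    // Substrings taken from the failure modes described in this issue.
    // Matching is case-insensitive because the casing of errMsg is not
    // something this sketch relies on.
    $markers = [
        'broken pipe',
        'connection reset',
        'connection closed',
        'goaway',
        'send failed',
        'recv failed',
    ];

    $message = strtolower($e->getMessage());
    foreach ($markers as $marker) {
        if (str_contains($message, $marker)) {
            return true;
        }
    }

    return false;
}
```

Anything that does not match (for example a non-zero grpc-status returned by a healthy collector) is surfaced immediately instead of being retried.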

Bonus: wrap send()/recv() in a scoped set_error_handler that swallows
only the serverLastStreamId E_DEPRECATED noise.
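
A rough sketch of that scoped handler, assuming the deprecation message always contains the literal $serverLastStreamId. The function name withoutStreamIdDeprecation is made up for illustration. Every other error is passed on to whatever handler was registered before, and the previous handler is restored even if the callback throws.

```php
<?php
declare(strict_types=1);

/**
 * Runs $fn with a temporary error handler that swallows only the
 * serverLastStreamId dynamic-property deprecation. All other errors go to
 * the previously registered handler, or to PHP's default handling.
 */
function withoutStreamIdDeprecation(callable $fn): mixed
{
    $previous = null;
    $previous = set_error_handler(
        static function (int $errno, string $errstr, string $file = '', int $line = 0) use (&$previous): bool {
            if ($errno === E_DEPRECATED && str_contains($errstr, '$serverLastStreamId')) {
                return true; // handled: suppress this one notice
            }
            // Delegate to the earlier handler if there was one; returning
            // false lets PHP's built-in handling run.
            return $previous !== null
                ? (bool) $previous($errno, $errstr, $file, $line)
                : false;
        }
    );

    try {
        return $fn();
    } finally {
        // Always undo the temporary handler, including on exceptions.
        restore_error_handler();
    }
}
```

Inside the transport this would wrap only the Swoole calls, e.g. withoutStreamIdDeprecation(fn () => $client->recv($this->timeout)), so unrelated deprecations elsewhere in the worker stay visible.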

Workaround we shipped

We had to subclass and wire our own ResilientSwooleGrpcTransport via
config/autoload/dependencies.php, overriding both
OtlpGrpcTraceExporterFactory and OtlpGrpcMetricExporterFactory. Happy to
open a PR with the fix.


Bug 2 — OtlpHttpTransportFactory / PsrTransport blocks the Swoole event loop

Location

src/Factory/Trace/Exporter/OtlpHttpTraceExporterFactory.php
src/Factory/Log/Exporter/OtlpHttpLogExporterFactory.php
src/Factory/Metric/Exporter/OtlpHttpMetricExporterFactory.php

These factories resolve to open-telemetry/exporter-otlp's
OtlpHttpTransportFactory, which picks a PSR-18 client via discovery. In our
environment (Hyperf + Swoole 5.x) that discovery lands on the blocking cURL
stack.

Problem

When the BatchSpan/Metric/LogProcessor flushes inside a worker coroutine, the
blocking cURL call triggers:

RuntimeException: eventLoop already created. Swoole only support one eventLoop,
it cannot be created repeatedly.

…and the worker dies on the next exporter tick.

Suggested fix

OtlpHttp*ExporterFactory should prefer HyperfGuzzle (already declared via
Support/HyperfGuzzle) or a custom coroutine-safe PSR transport by default
when running under Swoole, and only fall back to cURL when ext-swoole is not
loaded. Alternatively, expose a config key to force the PSR client.
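
The selection rule could be as small as the sketch below. The function name, the 'coroutine' / 'curl' labels and the $forced override are all hypothetical and only illustrate the decision; they are not existing config values or APIs of the library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical selection rule for the OTLP/HTTP client. The labels returned
 * here are illustrative names, not values the library understands today.
 */
function selectHttpClientKind(bool $swooleLoaded, ?string $forced = null): string
{
    if ($forced !== null) {
        return $forced; // an explicit config key always wins
    }

    // Under Swoole, a blocking cURL client should not be the default.
    return $swooleLoaded ? 'coroutine' : 'curl';
}

// At wiring time the flag would come from the runtime:
$kind = selectHttpClientKind(extension_loaded('swoole'));
```

The factories would then map 'coroutine' to the HyperfGuzzle-backed client and 'curl' to the current discovery path.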

Workaround

Currently we are forced to use otlp_grpc exclusively (which hits Bug 1). A
coroutine-native HTTP transport in the library would let us avoid gRPC
altogether.


Reproduction

  1. Hyperf 3.1 skeleton with opencodeco/hyperf-opentelemetry: ^0.3.6.
  2. Enable traces + metrics exporting to a local collector over gRPC.
  3. Run tc qdisc add dev eth0 root netem loss 100% for ~30s to cut traffic to
    the collector (or simply restart the collector pod) to force a server-side
    close.
  4. After traffic resumes, every exporter tick fails with "Broken pipe" until the
    worker is killed.

For Bug 2: set OTEL_TRACES_EXPORTER=otlp_http, fire a request, observe
eventLoop already created on the first batch flush.


Stack traces

Bug 1 (gRPC broken pipe)

RuntimeException: Failed to send gRPC request: Broken pipe
  at Hyperf\OpenTelemetry\Transport\SwooleGrpcTransport::send()
     /vendor/opencodeco/hyperf-opentelemetry/src/Transport/SwooleGrpcTransport.php:60
  at OpenTelemetry\Contrib\Otlp\SpanExporter::export()
  at OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor::flush()

Companion:

PHP Deprecated: Creation of dynamic property Swoole\Coroutine\Http2\Client::$serverLastStreamId is deprecated
  in /vendor/opencodeco/hyperf-opentelemetry/src/Transport/SwooleGrpcTransport.php line ~69

Bug 2 (HTTP / eventLoop)

RuntimeException: eventLoop already created. Swoole only support one eventLoop, it cannot be created repeatedly.
  at OpenTelemetry\SDK\Common\Export\Http\PsrTransport::send()
  at OpenTelemetry\Contrib\Otlp\SpanExporter::export()
  at OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor::flush()

Environment

Component                          Version
opencodeco/hyperf-opentelemetry    0.3.6
open-telemetry/sdk                 latest compatible
PHP                                8.2.x / 8.3.x
ext-swoole                         5.1.x / 5.x
Hyperf                             3.1
OS                                 Linux (k8s pod, Alpine/Debian base)

Happy to contribute

If a maintainer can confirm the approach, I can open a PR for Bug 1 (retry +
error-handler scoping) and start a discussion on Bug 2 (Swoole-native HTTP
transport).
