[Bug] --connection-mode pull is not applied to the CopyOnMaster path (replicated / small tables fail in dest-only-reachable topologies) #32

@talmacschen-arch

Apache Cloudberry and cbcopy version

cbcopy: v1.1.5 (main, current HEAD 4fa9725)

What happened

--connection-mode pull is documented as "destination connects to source",
but the CopyOnMaster strategy ignores --connection-mode entirely — it
always uses src master → dest master for the data connection regardless of
the flag. CopyOnMaster is forced for two classes of tables:

  • any table with rows <= --on-segment-threshold (default 1,000,000), and
  • any DISTRIBUTED REPLICATED table (hardcoded, no flag to opt out).

As a consequence, in any topology where the destination cluster cannot be
actively dialed by the source cluster (e.g. destination GP cluster running
inside Kubernetes, source GP cluster outside, only dest → src is
reachable), cbcopy fails as soon as it encounters such a table — even when
the user sets --connection-mode pull and --on-segment-threshold 0,
because replicated tables still take the CopyOnMaster path.
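The selection rule described above can be sketched roughly as follows. This is a simplified model for illustration only, not cbcopy's actual code: the tableInfo type and chooseStrategy function are hypothetical names, and only the two strategies named in this issue are modeled.

```go
package main

import "fmt"

// tableInfo is a hypothetical, simplified view of a source table.
type tableInfo struct {
	Name       string
	Rows       int64
	Replicated bool // true for DISTRIBUTED REPLICATED tables
}

// chooseStrategy models the rule described in this issue: replicated tables
// always go through CopyOnMaster, and so does any table whose row count is
// at or below --on-segment-threshold.
func chooseStrategy(t tableInfo, onSegmentThreshold int64) string {
	if t.Replicated || t.Rows <= onSegmentThreshold {
		return "CopyOnMaster"
	}
	return "CopyOnSegment"
}

func main() {
	tables := []tableInfo{
		{Name: "big_fact", Rows: 5_000_000},
		{Name: "small_dim", Rows: 1_200},
		{Name: "replicated_lookup", Rows: 5_000_000, Replicated: true},
	}
	// Compare the default threshold with --on-segment-threshold 0.
	for _, threshold := range []int64{1_000_000, 0} {
		for _, t := range tables {
			fmt.Printf("threshold=%d table=%s strategy=%s\n",
				threshold, t.Name, chooseStrategy(t, threshold))
		}
	}
}
```

With the threshold set to 0, the small table moves to CopyOnSegment in this model, but the replicated table still resolves to CopyOnMaster, which is the case the flag cannot avoid.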

On top of that, when both --connection-mode pull and CopyOnMaster are
in play, the helper-port temp table is created on src but queried from
dest, producing:

ERROR: relation "public.cbcopy_ports_temp_onmaster_<ts>" does not exist (SQLSTATE 42P01)

What you think should happen instead

--connection-mode pull should apply uniformly to every copy strategy,
including CopyOnMaster. Specifically, under pull:

  • src master runs cbcopy_helper --listen ... --direction send
  • dest master runs cbcopy_helper --host <src> --port <p> --direction receive

so that the only network requirement is dest → src, matching the documented
intent of pull mode.
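A minimal sketch of the expected role assignment, assuming the push layout is the one CopyOnMaster currently hardcodes (src master dials, dest master listens). The helperRole type and the masterRoles and requiredReachability functions are hypothetical names used only for illustration; only the --listen, --host/--port and --direction roles come from this report.

```go
package main

import "fmt"

// helperRole describes how one cbcopy_helper process would be started.
type helperRole struct {
	Listens   bool   // true: started with --listen; false: dials --host/--port
	Direction string // "send" or "receive"
}

// masterRoles returns the helper roles on the src and dest masters for a
// master-to-master copy under the given connection mode.
func masterRoles(mode string) (src, dest helperRole) {
	if mode == "pull" {
		// Expected under pull: src listens and sends, dest dials and receives.
		return helperRole{Listens: true, Direction: "send"},
			helperRole{Listens: false, Direction: "receive"}
	}
	// Push: src dials and sends, dest listens and receives.
	return helperRole{Listens: false, Direction: "send"},
		helperRole{Listens: true, Direction: "receive"}
}

// requiredReachability reports which side must be able to open the TCP
// connection, based on which master is the listener.
func requiredReachability(src, dest helperRole) string {
	if src.Listens && !dest.Listens {
		return "dest -> src"
	}
	return "src -> dest"
}

func main() {
	for _, mode := range []string{"push", "pull"} {
		src, dest := masterRoles(mode)
		fmt.Printf("mode=%s src=%+v dest=%+v needs %s\n",
			mode, src, dest, requiredReachability(src, dest))
	}
}
```

In this model the data still flows from src to dest in both modes; only the side that opens the connection changes.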

The helper-port temp table inconsistency goes away on its own once the
direction is correct (both _onmaster_ and _onall_ external tables end up
on the same side, src, under pull).

How to reproduce

Minimal logical repro (a real two-cluster setup just makes the failure mode
obvious; same-host environments mask the bug because src and dest masters
share /tmp):

  1. Set up source and destination clusters where the source side cannot
    open connections to the destination side; only dest → src works.
  2. Create a DISTRIBUTED REPLICATED table in the source schema, or any
    small table (rows ≤ 1,000,000).
  3. Run:
    cbcopy \
        --source-host <src> --source-port 5432 --source-user gpadmin \
        --dest-host <dest> --dest-port <port> --dest-user gpadmin \
        --schema gpadmin.<schema> --dest-schema <destdb>.<schema> \
        --connection-mode pull \
        --on-segment-threshold 0 \
        --data-port-range 50000-60000
    
  4. cbcopy attempts src master → dest master for the replicated/small
    table; with the dest-only-reachable topology this fails with either a TCP
    connection error or the relation "public.cbcopy_ports_temp_onmaster_*"
    does not exist (SQLSTATE 42P01) error shown above.

Root cause references (against current main):

  • copy/copy_command.go:126-163: CopyOnMaster.CopyTo / CopyFrom hardcode
    src master --host/--port and dest master --listen, with no
    if cc.ConnectionMode == option.ConnectionModePull branch (compare with
    CopyOnSegment.CopyTo/CopyFrom at lines 179-225, which do have the branch).
  • copy/copy.go:271-280: the port temp table is built on srcManageConn when
    connectionMode == pull.
  • copy/copy_operation.go:135-147: the port query, however, uses
    destManageConn when connectionMode == push || op.command.IsMasterCopy(),
    so pull + CopyOnMaster queries from dest while the table lives on src.
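The mismatch between where the port table is created and where it is queried can be modeled as below. This is a hedged sketch of the conditions quoted above, not the real cbcopy code: the function names are hypothetical, and the assumption that the table is created on dest whenever the mode is not pull is mine.

```go
package main

import "fmt"

// portTableSide models copy/copy.go as described: the helper-port temp table
// is created on src under pull (assumed: on dest otherwise).
func portTableSide(mode string) string {
	if mode == "pull" {
		return "src"
	}
	return "dest"
}

// currentQuerySide models copy/copy_operation.go as described: dest is used
// when the mode is push OR the command is a master copy.
func currentQuerySide(mode string, isMasterCopy bool) string {
	if mode == "push" || isMasterCopy {
		return "dest"
	}
	return "src"
}

// proposedQuerySide drops the master-copy special case, as the fix suggests.
func proposedQuerySide(mode string) string {
	if mode == "push" {
		return "dest"
	}
	return "src"
}

func main() {
	for _, mode := range []string{"push", "pull"} {
		for _, master := range []bool{false, true} {
			table := portTableSide(mode)
			cur := currentQuerySide(mode, master)
			fix := proposedQuerySide(mode)
			fmt.Printf("mode=%s masterCopy=%t table=%s current=%s (match=%t) proposed=%s (match=%t)\n",
				mode, master, table, cur, cur == table, fix, fix == table)
		}
	}
}
```

Under these assumptions the only combination where the current query side and the table side disagree is pull with a master copy, which corresponds to the SQLSTATE 42P01 error reported here.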

Note: the existing e2e test at end_to_end/basic_test.go:427
({"CopyOnMaster", "pull", ...}) does not catch this because in that test
src and dest are two databases on the same GP cluster — same master process,
same /tmp — so the bug is masked.

Operating System

Linux (issue is environment-independent; observed on RHEL 9.5 / CentOS 7
class hosts and Kubernetes Pod-based dest clusters).

Anything else

Workaround (only covers part of the problem): if the schema has no
DISTRIBUTED REPLICATED tables, or they can be left out with --exclude-table,
the user can set --on-segment-threshold 0, keeping CopyOnMaster from being
hit. This is verified to work but does not cover schemas whose replicated
tables must be copied, since those are forced into CopyOnMaster
unconditionally.

I have a fix design ready (minimal: adds the pull branch to
CopyOnMaster.CopyTo/CopyFrom symmetric to CopyOnSegment, and drops the
|| op.command.IsMasterCopy() special case in copy_operation.go). Happy
to submit it as a PR linked to this issue.

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!
