
[Ascend][P0] Missing device-side non-contiguous copy / InfiniOps rearrange wrapper #621

@Ceng23333

Summary

InfiniLM on Ascend NPU throws during forward/compile when device tensors are non-contiguous, because TensorImpl::copy_from / contiguous() require an external InfiniOps rearrange wrapper that is not implemented today.

Downstream symptom: RuntimeError: RankWorker stopped during run (RankWorker sets should_exit_ after the real C++ exception).

Priority: P0 — inference worker exits permanently; inference_server restart required.

Not in scope: KV block recycle cliff (#186) and lm_eval slowdown (#89) are separate InfiniLM / capacity issues.


Primary errors (InfiniLM explicitly names InfiniOps)

From InfiniLM/csrc/core/src/infinicore/tensor/copy.cc:

| # | Message |
|---|---------|
| A | Device-side non-contiguous copy requires an external InfiniOps rearrange wrapper |
| B | Device-side contiguous() for non-contiguous tensors requires an external InfiniOps rearrange wrapper |
| C | Device-side non-contiguous H2D copy requires an external InfiniOps rearrange wrapper |

Same-device D2D branch (non-contiguous → throw):

} else if (this->is_contiguous() && src->is_contiguous()) {
    context::memcpyD2D(...);
} else {
    throw std::runtime_error("Device-side non-contiguous copy requires an external InfiniOps rearrange wrapper");
}

CPU path uses rearrange_cpu(); Ascend non-contiguous path has no fallback.
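
For reference, the semantics a device-side rearrange would need to match can be sketched on the host. This is a minimal illustration, not the actual rearrange_cpu() from rearrange.h: the names rearrange_reference and contiguous_strides are hypothetical, strides are assumed to be in elements (not bytes), and elements are copied as opaque bytes so that 2-byte types such as BF16/FP16 need no special handling.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: row-major (C-order) strides, in elements, for a shape.
inline std::vector<std::ptrdiff_t> contiguous_strides(const std::vector<std::size_t> &shape) {
    std::vector<std::ptrdiff_t> strides(shape.size());
    std::ptrdiff_t running = 1;
    for (std::size_t d = shape.size(); d-- > 0;) {
        strides[d] = running;
        running *= static_cast<std::ptrdiff_t>(shape[d]);
    }
    return strides;
}

// Hypothetical host-side reference: copy every logical element from a strided
// source layout into a strided destination layout.
// Assumptions: strides are in elements; elem_size is in bytes (2 for BF16/FP16).
inline void rearrange_reference(void *dst, const void *src,
                                const std::vector<std::size_t> &shape,
                                const std::vector<std::ptrdiff_t> &dst_strides,
                                const std::vector<std::ptrdiff_t> &src_strides,
                                std::size_t elem_size) {
    const std::size_t ndim = shape.size();
    std::size_t total = 1;
    for (std::size_t extent : shape) total *= extent;
    if (total == 0) return;  // empty tensor: nothing to copy

    auto *dst_bytes = static_cast<unsigned char *>(dst);
    const auto *src_bytes = static_cast<const unsigned char *>(src);
    const auto esize = static_cast<std::ptrdiff_t>(elem_size);
    std::vector<std::size_t> index(ndim, 0);  // current multi-dimensional index

    for (std::size_t n = 0; n < total; ++n) {
        // Translate the logical index into element offsets in both layouts.
        std::ptrdiff_t dst_off = 0;
        std::ptrdiff_t src_off = 0;
        for (std::size_t d = 0; d < ndim; ++d) {
            dst_off += static_cast<std::ptrdiff_t>(index[d]) * dst_strides[d];
            src_off += static_cast<std::ptrdiff_t>(index[d]) * src_strides[d];
        }
        std::memcpy(dst_bytes + dst_off * esize, src_bytes + src_off * esize, elem_size);

        // Advance the index like an odometer, last dimension fastest.
        for (std::size_t d = ndim; d-- > 0;) {
            if (++index[d] < shape[d]) break;
            index[d] = 0;
        }
    }
}
```

A device kernel would parallelize the outer loop rather than walk it serially, but its output for any shape/stride pair should agree with a reference of this kind, which makes it a convenient oracle for tests.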


Log evidence (2026-05-21, production)

  • Container: infinilm-ascend-run, log: /tmp/infinilm-server.log
  • Device: ASCEND:0, flags: --enable-paged-attn --enable-graph, model 9g_8b_thinking_llama
  • ≥6 hits between 03:28–05:35 UTC, e.g.:
exception during forward: Device-side non-contiguous copy requires an external InfiniOps rearrange wrapper
→ Error in step loop: RankWorker stopped during run

Also seen:

  • exception during compile (same message, graph compile)
  • Device-side contiguous() for non-contiguous tensors requires an external InfiniOps rearrange wrapper
  • Related: RotaryEmbedding: InfiniOps adapter requires contiguous Q/K tensors

Reproduction

  1. Ascend + InfiniLM + InfiniOps, inference_server.py --device=ascend --enable-paged-attn --enable-graph
  2. Sustained POST /v1/chat/completions (variable batch/seq len), or
  3. Minimal: non-contiguous Ascend tensor → copy_from() or contiguous() → immediate throw
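
To make the trigger in step 3 concrete, the sketch below shows one common way to decide contiguity from a shape and its strides. It is an illustration under assumptions: is_row_major_contiguous is a hypothetical name, strides are taken to be in elements, and InfiniLM's own is_contiguous() may apply different rules (for example around size-1 dimensions).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical check: a layout is row-major contiguous when each stride equals
// the product of all extents to its right. Strides are assumed to be in elements.
inline bool is_row_major_contiguous(const std::vector<std::size_t> &shape,
                                    const std::vector<std::ptrdiff_t> &strides) {
    std::ptrdiff_t expected = 1;
    for (std::size_t d = shape.size(); d-- > 0;) {
        if (shape[d] == 1) continue;  // stride of a size-1 dimension is never used
        if (strides[d] != expected) return false;
        expected *= static_cast<std::ptrdiff_t>(shape[d]);
    }
    return true;
}
```

Under this definition a transposed view (shape {3, 2}, strides {1, 3}) and a column slice of a wider tensor (shape {2, 2}, strides {3, 1}) are both non-contiguous, so either kind of view would be expected to reach the throwing branch.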

Expected vs actual

| Item | Expected | Actual |
|------|----------|--------|
| Non-contiguous D2D copy | InfiniOps rearrange / general copy | throw, worker exit |
| Device contiguous() | allocate + rearrange on device | throw, worker exit |
| Service | continuous forward | process restart |

Requested deliverables (P0)

  1. Ascend device-side general rearrange / copy — arbitrary shape/strides → contiguous (BF16/FP16 minimum)
  2. Integration hook for InfiniLM copy.cc throw sites (lines ~46, ~72, ~83)
  3. Tests: non-contiguous→contiguous on Ascend; compatible with graph capture (exception during compile also failed)
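
One possible shape for the integration hook in item 2 is sketched below. Everything here is hypothetical: RearrangeHook, register_rearrange_hook, and copy_d2d are illustrative names, the hook signature is deliberately reduced to two pointers, and the real throw sites in copy.cc operate on tensor objects rather than raw pointers. The point is only the dispatch order: fast memcpy when both sides are contiguous, the registered rearrange otherwise, and the existing error only when no hook is installed.

```cpp
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical hook type. A real hook would also need shape, strides, dtype,
// and a stream/context argument; they are omitted to keep the sketch short.
using RearrangeHook = std::function<void(void *dst, const void *src)>;

// Process-wide slot that an InfiniOps adapter could fill at start-up.
inline RearrangeHook &rearrange_hook() {
    static RearrangeHook hook;
    return hook;
}

inline void register_rearrange_hook(RearrangeHook hook) {
    rearrange_hook() = std::move(hook);
}

enum class CopyPath { Memcpy, Rearrange };

// Hypothetical stand-in for the same-device D2D branch of copy_from().
inline CopyPath copy_d2d(void *dst, const void *src,
                         bool dst_contiguous, bool src_contiguous) {
    if (dst_contiguous && src_contiguous) {
        // The real code calls context::memcpyD2D(...) at this point.
        return CopyPath::Memcpy;
    }
    RearrangeHook &hook = rearrange_hook();
    if (hook) {
        hook(dst, src);  // device-side strided copy supplied by the adapter
        return CopyPath::Rearrange;
    }
    // No hook installed: keep today's behavior so the failure stays explicit.
    throw std::runtime_error(
        "Device-side non-contiguous copy requires an external InfiniOps rearrange wrapper");
}
```

Keeping the throw as the final fallback means builds without the adapter fail exactly as they do today, while builds that register a hook never reach it.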

Acceptance: none of the throws listed above occur on the Ascend serving path; >500 forward steps complete without RankWorker stopped.


References (InfiniLM side)

  • InfiniLM/csrc/core/src/infinicore/tensor/copy.cc — throw sites
  • InfiniLM/csrc/core/utils/rearrange.h — CPU reference
  • InfiniLM/csrc/engine/rank_worker.cpp: RankWorker stopped during run
  • Full write-up: InfiniLM deploy repo rca/issue-infinops-rearrange-en.md (can attach log excerpt 03:28–05:35 UTC)

Environment

| Item | Value |
|------|-------|
| NPU | Huawei Ascend davinci2 |
| CANN | 8.5.1 (from deploy logs) |
| InfiniLM | Paged KV + graph, OpenAI API server |
| Downstream | lm_eval @ http://<host>:8000/v1 |

infinilm-server.log
