Scalable conversion to GraphAr format #917

### Describe the enhancement requested

Today the conversion/import path does not seem to scale well for larger datasets.

From reading the current code and trying it in practice, I see that:
- the C++ high-level builders are convenience APIs and keep data in memory until `Dump()`
- the Spark writer scales better in principle, but still does heavy batch work such as index generation, joins, sorting, repartitioning, and offset construction
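To make the batch steps named above concrete, here is a minimal single-machine sketch in plain Python. It is not GraphAr code and does not use the GraphAr or Spark APIs; the function names (`build_vertex_index`, `convert_edges`) and the CSR-style offset layout are assumptions chosen only for illustration. In a distributed setting each step would correspond to a shuffle-heavy operation (index assignment, a join for remapping, a global sort, and a grouped count for offsets).

```python
from typing import Dict, List, Tuple


def build_vertex_index(vertex_ids: List[str]) -> Dict[str, int]:
    """Index generation: map each original vertex ID to a dense 0-based index.

    Assumes vertex IDs are unique (hypothetical helper, not a GraphAr API).
    """
    return {vid: i for i, vid in enumerate(vertex_ids)}


def convert_edges(
    vertex_ids: List[str], edges: List[Tuple[str, str]]
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Remap, sort, and build offsets for an edge list (illustrative only)."""
    index = build_vertex_index(vertex_ids)

    # Remapping: replace original IDs with dense indices.
    # On Spark this is typically a join between edges and the vertex index.
    remapped = [(index[src], index[dst]) for src, dst in edges]

    # Sorting: order edges by (source, destination).
    remapped.sort()

    # Offset construction: offsets[i] is the position of the first edge whose
    # source is vertex i; offsets[-1] is the total number of edges.
    offsets = [0] * (len(vertex_ids) + 1)
    for src, _ in remapped:
        offsets[src + 1] += 1
    for i in range(len(vertex_ids)):
        offsets[i + 1] += offsets[i]

    return remapped, offsets


if __name__ == "__main__":
    sorted_edges, offsets = convert_edges(
        ["a", "b", "c"], [("c", "a"), ("a", "b"), ("a", "c")]
    )
    print(sorted_edges)  # [(0, 1), (0, 2), (2, 0)]
    print(offsets)       # [0, 2, 2, 3]
```

Every step here touches the full dataset, which is why holding everything in memory (as the C++ builders do) or running several full shuffles (as the Spark writer does) becomes the bottleneck as the graph grows.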

Because GraphAr is positioned for use with "large-scale graph data", it would be useful for the community to have a clearer path for scalable conversion.

Assuming I'm not missing something, my suggestion is:
  - keep the C++ high-level writer/builder path simple/reference-oriented and convenient for small/medium imports
  - optimize the Spark API/writer as the primary path for large-scale conversion

This way we treat Spark as the practical scalable backend for data lakes, object stores, HDFS, and distributed preprocessing.

Why Spark seems like the better place to optimize first:
- Spark is a first-class citizen in data lake environments and is already used in production by many organizations, so in practice it is more accessible to end users than a dedicated VM set up only for a C++ import
- storage backends such as S3/HDFS are abstracted through Spark/Hadoop
- large joins / remapping / repartitioning are natural Spark workloads
- avoiding two separate “fully optimized” implementations (Spark and C++) may be easier to maintain long-term

To sum up, would the project agree with this direction?

If it sounds reasonable, I’d be happy to help investigate and propose concrete improvements in the Spark conversion path.

### Component(s)

C++, Spark
