Scalable conversion to GraphAr format #917

@Iskander14yo

Describe the enhancement requested

Today the conversion/import path does not seem to scale well for larger datasets.

From reading the current code and trying it in practice, I see that:

  • the C++ high-level builders are convenience APIs and keep data in memory until Dump()
  • the Spark writer scales better in principle, but still does heavy batch work such as index generation, joins, sorting, repartitioning, and offset construction
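To make the batch steps named above concrete, here is a minimal single-machine sketch of index generation, edge remapping (the join), sorting, and offset construction. It is not taken from the GraphAr C++ or Spark code: all type and function names are hypothetical, and it assumes that "offset construction" means CSR-style offsets over edges sorted by source index. Each step needs a view of the whole vertex or edge set, which is why an in-memory implementation is bounded by RAM and a distributed one needs shuffles.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not part of the GraphAr API.
using RawEdge = std::pair<std::string, std::string>;  // (src key, dst key)
using IndexedEdge = std::pair<int64_t, int64_t>;      // (src index, dst index)

// Step 1, index generation: give every distinct vertex key a dense
// 0-based index in first-seen order.
std::unordered_map<std::string, int64_t> BuildVertexIndex(
    const std::vector<std::string>& vertex_keys) {
  std::unordered_map<std::string, int64_t> index;
  index.reserve(vertex_keys.size());
  for (const auto& key : vertex_keys) {
    // emplace is a no-op for duplicate keys, so indices stay dense.
    index.emplace(key, static_cast<int64_t>(index.size()));
  }
  return index;
}

// Steps 2 and 3, remapping and sorting: replace keys with indices
// (the equivalent of a join against the vertex table), then sort by
// (src, dst). Throws std::out_of_range if an edge uses an unknown key.
std::vector<IndexedEdge> RemapAndSortEdges(
    const std::vector<RawEdge>& edges,
    const std::unordered_map<std::string, int64_t>& index) {
  std::vector<IndexedEdge> out;
  out.reserve(edges.size());
  for (const auto& edge : edges) {
    out.emplace_back(index.at(edge.first), index.at(edge.second));
  }
  std::sort(out.begin(), out.end());
  return out;
}

// Step 4, offset construction: offsets[v] is the position of the first
// edge whose source is v, and offsets[num_vertices] is the edge count.
std::vector<int64_t> BuildOffsets(const std::vector<IndexedEdge>& sorted_edges,
                                  int64_t num_vertices) {
  std::vector<int64_t> offsets(static_cast<size_t>(num_vertices) + 1, 0);
  for (const auto& edge : sorted_edges) {
    ++offsets[static_cast<size_t>(edge.first) + 1];  // count out-edges per source
  }
  for (size_t v = 0; v < static_cast<size_t>(num_vertices); ++v) {
    offsets[v + 1] += offsets[v];  // prefix sum turns counts into offsets
  }
  return offsets;
}
```

In this sketch the vertex index map and the full edge vector both live in memory at once, which mirrors the limitation of the builder path; in a distributed engine the same steps would become a join, a global sort, and a grouped count.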

Because GraphAr is positioned for use with "large-scale graph data", it would be useful for the community to have a clearer path for scalable conversion.

Assuming I'm not missing something, my suggestion is:

  • keep the C++ high-level writer/builder path simple/reference-oriented and convenient for small/medium imports
  • optimize the Spark API/writer as the primary path for large-scale conversion

This way, we treat Spark as the practical scalable backend for data lakes, object stores, HDFS, and distributed preprocessing.

Why Spark seems like the better place to optimize first:

  • Spark is a first-class citizen in data lake environments and is used in production by many organizations, so in practice it is more accessible to end users than a dedicated VM set up only for the C++ import
  • storage backends such as S3/HDFS are abstracted through Spark/Hadoop
  • large joins / remapping / repartitioning are natural Spark workloads
  • avoiding two separate “fully optimized” implementations (Spark and C++) may be easier to maintain long-term

To sum up, would the project agree with this direction?

If it sounds reasonable, I’d be happy to help investigate and propose concrete improvements in the Spark conversion path.

Component(s)

C++, Spark

Labels

enhancement (New feature or request)
