Describe the enhancement requested
Today the conversion/import path does not seem to scale well for larger datasets.
From reading the current code and trying it in practice, I see that:
- the C++ high-level builders are convenience APIs and keep data in memory until Dump() is called
- the Spark writer scales better in principle, but still does heavy batch work such as index generation, joins, sorting, repartitioning, and offset construction
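To make the batch steps in the second point concrete, here is a toy single-process sketch of what index generation, the id-remapping join, sorting, and offset construction amount to. It is not the GraphAr Spark writer's code: the function names and the sample data are hypothetical, and a real Spark job would do each step as a distributed shuffle rather than in local memory.

```python
from collections import Counter
from itertools import accumulate


def build_vertex_index(vertex_ids):
    """Assign a dense 0-based internal index to each distinct external vertex id."""
    return {vid: idx for idx, vid in enumerate(sorted(set(vertex_ids)))}


def remap_and_sort_edges(edges, index):
    """Swap external ids for internal indices (the 'join'), then sort by (src, dst)."""
    return sorted((index[src], index[dst]) for src, dst in edges)


def build_offsets(sorted_edges, num_vertices):
    """CSR-style offsets: edges of vertex v sit at sorted_edges[offsets[v]:offsets[v + 1]]."""
    degree = Counter(src for src, _ in sorted_edges)
    return [0] + list(accumulate(degree.get(v, 0) for v in range(num_vertices)))


# Hypothetical sample input: edges keyed by external (string) vertex ids.
edges = [("bob", "alice"), ("alice", "carol"), ("alice", "bob")]
index = build_vertex_index(v for edge in edges for v in edge)
sorted_edges = remap_and_sort_edges(edges, index)
offsets = build_offsets(sorted_edges, len(index))
```

Each of these steps needs a global view of the data (all distinct ids, a total order over edges, per-vertex degree counts), which is why they turn into joins, sorts, and repartitions at scale.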
Because GraphAr is positioned for use with "large-scale graph data", it would be useful for the community to have a clearer path for scalable conversion.
Assuming I'm not missing something, my suggestion is:
- keep the C++ high-level writer/builder path simple/reference-oriented and convenient for small/medium imports
- optimize the Spark API/writer as the primary path for large-scale conversion
This way we treat Spark as the practical scalable backend for data lakes, object stores, HDFS, and distributed preprocessing.
Why Spark seems like the better place to optimize first:
- Spark is a first-class citizen in the data lake ecosystem and is used in production by many organizations, so in practice it is more accessible to end users than a dedicated VM provisioned only for a C++ import
- storage backends such as S3/HDFS are abstracted through Spark/Hadoop
- large joins / remapping / repartitioning are natural Spark workloads
- avoiding two separate “fully optimized” implementations (Spark and C++) may be easier to maintain long-term
To sum up, would the project agree with this direction?
If it sounds reasonable, I’d be happy to help investigate and propose concrete improvements in the Spark conversion path.
Component(s)
C++, Spark