| 1 | +--- |
| 2 | +icon: simple/apachespark |
| 3 | +tags: |
| 4 | + - Introduction |
| 5 | + - Explainer |
| 6 | +--- |
| 7 | +# Apache Spark within Corporate Memory Build |
| 8 | + |
| 9 | +## Introduction |
| 10 | + |
| 11 | +This documentation provides an overview of Apache Spark and its integration within Corporate Memory’s Build component. |
| 12 | +The goal is to provide a conceptual understanding of Spark, its purpose in Build, and how workflows leverage Spark-aware datasets for efficient, distributed data processing. |
| 13 | + |
| 14 | +The documentation is structured in two parts: |
| 15 | + |
| 16 | +1. What is Apache Spark? |
| 17 | +2. Apache Spark within Build |
| 18 | + |
| 19 | +## What is Apache Spark? |
| 20 | + |
| 21 | +Apache Spark is an open-source engine for distributed, large-scale data processing. Its main data processing use cases are: |
| 22 | + |
| 23 | +* data loading, |
| 24 | +* SQL queries, |
| 25 | +* machine learning, |
| 26 | +* streaming, |
| 27 | +* graph processing. |
| 28 | + |
| 29 | +Hundreds of plugins provide additional functionality beyond these core use cases. |
| 30 | + |
| 31 | +By itself, Apache Spark is a compute engine that is decoupled from any particular data storage and from Input/Output (I/O) operations. |
| 32 | +Within Corporate Memory, the relevant configuration is documented in the [Spark configuration](../../deploy-and-configure/configuration/dataintegration/#spark-configuration). |
| 33 | + |
| 34 | +## Apache Spark within Build |
| 35 | + |
| 36 | +### Why is Apache Spark used in Corporate Memory? |
| 37 | + |
| 38 | +Apache Spark is integrated into Corporate Memory to enable scalable, distributed execution of data integration workflows within its Build component. |
| 39 | +While Corporate Memory’s overall architecture already consists of multiple distributed services (e.g. Build for data integration, Explore for knowledge graph management), the execution of workflows in Build is typically centralized. |
| 40 | +Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process large, complex, or computation-heavy datasets. |
| 41 | + |
| 42 | +The rationale behind using Spark aligns with its general strengths: |
| 43 | + |
| 44 | +- **Parallelization and scalability** for high-volume transformations and joins. |
| 45 | +- **Fault tolerance** through resilient distributed datasets (RDDs), which can recompute lost partitions from their lineage. |
| 46 | +- **Optimization** via Spark’s DAG-based execution planner, minimizing data movement. |
| 47 | +- **Interoperability** with widely used big data formats (Parquet, ORC, Avro). |
| 48 | + |
| 49 | +By leveraging Spark, Corporate Memory can handle data integration workflows that would otherwise be constrained by processing limits, while maintaining compatibility with its semantic and knowledge-graph-oriented ecosystem. |
| 50 | +However, since Spark support is optional, its usage depends on specific deployment needs and data volumes. |
| 51 | + |
| 52 | +### How and where is Apache Spark used by Build? |
| 53 | + |
| 54 | +Within the Build component, Apache Spark is used exclusively for executing workflows that involve **Spark-aware datasets**. |
| 55 | +These workflows connect datasets, apply transformations, and produce outputs, with Spark handling large data volumes and complex computations efficiently. |
| 56 | + |
| 57 | +For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. |
| 58 | +In such cases, Build’s standard local execution engine is sufficient. |
| 59 | +Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine. |
| 60 | + |
| 61 | +Each Spark-aware dataset corresponds to an **executor-aware entity**. |
| 62 | +During workflow execution, Build translates the **workflow graph** into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. |
| 63 | +The results are then materialized or written back into Corporate Memory’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. |
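
The idea of turning a workflow graph into an ordered execution plan can be illustrated with a small, self-contained sketch. It is not Build's or Spark's actual planner: the node names and the `plan_waves` helper are hypothetical, and the grouping shown here is a plain dependency ordering rather than Spark's own stage construction.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow graph: each key is a node, each value is the set
# of nodes it depends on. The names are illustrative only.
workflow = {
    "orders_parquet": set(),
    "customers_jdbc": set(),
    "join_orders_customers": {"orders_parquet", "customers_jdbc"},
    "normalize_names": {"join_orders_customers"},
    "output_dataset": {"normalize_names"},
}


def plan_waves(graph):
    """Group nodes into waves; nodes in one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = plan_waves(workflow)
for index, wave in enumerate(waves):
    print(index, wave)
```

Nodes in the same wave could in principle run in parallel, which is the property a distributed engine exploits when it schedules independent work across a cluster.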
| 64 | + |
| 65 | +### Types of Spark-aware datasets |
| 66 | + |
| 67 | +The main types of Spark-aware datasets include: |
| 68 | + |
| 69 | +- **Avro datasets**: row-oriented, self-describing file format that stores its schema alongside the data. |
| 70 | +- **Parquet datasets**: highly efficient columnar storage format that supports predicate pushdown and column pruning. |
| 71 | +- **ORC datasets**: optimized row-columnar format commonly used in Hadoop ecosystems, enabling fast scans and compression. |
| 72 | +- **Hive tables**: structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark. |
| 73 | +- **HDFS datasets**: file-based, row-oriented datasets stored in the Hadoop Distributed File System, optimized for partitioned, parallel processing. |
| 74 | +- **JSON datasets**: semi-structured datasets that allow flexible schema inference and in-memory transformations. |
| 75 | +- **JDBC datasets**: external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. |
| 76 | +- **Embedded SQL Endpoint**: workflow results published as virtual SQL tables, queryable via JDBC or ODBC without persistent storage, optionally cached in memory. |
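
The terms predicate pushdown and column pruning, mentioned for columnar formats above, can be made concrete with a toy example. This is a plain-Python sketch over an in-memory table, not Spark or Parquet code; the data and the `scan` helper are invented for illustration.

```python
# Toy columnar table: one list per column (hypothetical data).
columns = {
    "id": [1, 2, 3, 4],
    "country": ["DE", "FR", "DE", "US"],
    "amount": [10.0, 25.5, 7.25, 99.0],
    "comment": ["a", "b", "c", "d"],
}


def scan(table, select, where_column, predicate):
    """Return only the selected columns for rows that match the predicate."""
    # Predicate pushdown: the filter is evaluated on a single column
    # before any other column is touched.
    keep = [i for i, value in enumerate(table[where_column]) if predicate(value)]
    # Column pruning: only the requested columns are read at all.
    return {name: [table[name][i] for i in keep] for name in select}


result = scan(columns, ["id", "amount"], "country", lambda c: c == "DE")
print(result)  # {'id': [1, 3], 'amount': [10.0, 7.25]}
```

In this sketch the `comment` column is never read, and `id` and `amount` are read only for the two matching rows. A columnar layout makes this kind of saving possible, while a row-oriented format generally has to read whole records.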
| 77 | + |
| 78 | +### What is the relation between Build’s Spark-aware workflows and the Knowledge Graph? |
| 79 | + |
| 80 | +The Spark-aware workflows operate on datasets within Build, executing transformations and producing outputs. |
| 81 | +The Knowledge Graph, managed by Explore, serves as the persistent semantic storage layer, but Spark itself does not directly interact with the graph. |
| 82 | +Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately. |
| 83 | + |
| 84 | +This separation of concerns allows Spark to focus on high-performance computation without being constrained by the APIs of the Knowledge Graph or by the rest of Corporate Memory’s architecture. |
| 85 | +Data can flow into workflows from various sources and ultimately be integrated into the graph, while the execution engine mediates this process, handling dependencies, scheduling, and parallelism. |
| 86 | + |
| 87 | +### What is the relation between Spark-aware dataset plugins and other Build plugins? |
| 88 | + |
| 89 | +Spark-aware dataset plugins are a specialized subset of dataset plugins that integrate seamlessly into Build workflows. |
| 90 | +They implement the same source-and-sink interfaces as all other plugins, allowing workflows to connect Spark-aware datasets, traditional datasets, and transformations. |
| 91 | + |
| 92 | +These plugins also cover JSON and JDBC sources, providing consistent behavior and integration across a wide range of data types and endpoints. |
| 93 | +Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial. |
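
The rule that Spark is used only for workflows involving Spark-aware datasets can be sketched as a simple decision function. This is a hypothetical illustration, not Build's implementation: the type identifiers, the `choose_backend` helper, and the fallback to local execution when Spark is disabled are all assumptions.

```python
# Hypothetical type identifiers mirroring the Spark-aware dataset types
# listed in this document; Build's real identifiers may differ.
SPARK_AWARE_TYPES = {
    "avro", "parquet", "orc", "hive", "hdfs", "json", "jdbc", "sql_endpoint",
}


def choose_backend(dataset_types, spark_enabled):
    """Pick "spark" only if Spark is enabled and a Spark-aware dataset is used."""
    uses_spark_aware = any(t in SPARK_AWARE_TYPES for t in dataset_types)
    return "spark" if spark_enabled and uses_spark_aware else "local"


print(choose_backend(["csv", "parquet"], spark_enabled=True))  # spark
print(choose_backend(["csv", "xml"], spark_enabled=True))      # local
print(choose_backend(["parquet"], spark_enabled=False))        # local
```

The sketch treats Spark as an optional backend selected per workflow, which matches the description of Spark as a complement to the standard execution engine rather than a replacement for it.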