Merge pull request #212 from eccenca/feature/cmem-build-spark-doc

rpietzsch · web-flow · commit 53fc63d02852 · 2026-02-23T12:58:04.000+01:00
Add Apache Spark within CMEM BUILD documentation
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -2,7 +2,7 @@
 
 👍🎉 First off, thanks for taking the time to contribute! 🎉👍
 
-The following is a set of guidelines for contributing to the eccenca Corporate Memory documention project.
+The following is a set of guidelines for contributing to the eccenca Corporate Memory documentation project.
 
 ## How Can I Contribute?
 
@@ -124,7 +124,7 @@ On this page is search function for icons available as well.
   <summary>Extend section</summary>
 
 -   do not use a cluttered desktop
--   do not show other esp. personal project artifacts then relevant for the tutorial / feature to show
+-   do not show other artifacts, esp. personal project artifacts, beyond those relevant to the tutorial / feature being shown
 -   select cropping area carefully (omit backgrounds, lines on the edges, etc.)
 -   use the same or a similar area for similar screens
 -   all relevant elements should be clearly visible and not be truncated
diff --git a/docs/build/.pages b/docs/build/.pages
@@ -18,3 +18,4 @@ nav:
     - Project and Global Variables: variables
     - Evaluate Template Operator: evaluate-template
     - Build Knowledge Graphs from Kafka Topics: kafka-consumer
+    - Spark: spark
diff --git a/docs/build/index.md b/docs/build/index.md
@@ -9,30 +9,50 @@ hide:
 
 # :material-star: Build
 
-The Build stage is used to turn your legacy data points from existing datasets into an Enterprise Knowledge Graph structure. The subsections introduce the features of Corporate Memory that support this stage and provide guidance through your first lifting activities.
+The Build stage turns your source data—across files, databases, APIs, and streams—into an Enterprise Knowledge Graph. The sections below explain the Build workspace and guide you from first lifting steps to reusable patterns and reference material.
 
 **:octicons-people-24: Intended audience:** Linked Data Experts
 
 <div class="grid cards" markdown>
 
--   :eccenca-application-dataintegration: Introduction and Best Practices
+-   :eccenca-application-dataintegration: Foundations: Introduction and Best Practices
 
     ---
 
     - [Introduction to the User Interface](introduction-to-the-user-interface/index.md) --- a short introduction to the **Build** workspace incl. projects and tasks management.
     - [Rule Operators](rule-operators/index.md) --- Overview on operators that can be used to build linkage and transformation rules.
-    - [Cool IRIs](cool-iris/index.md) --- URIs and IRIs are character strings identifying the nodes and edges in the graph. Defining them is an important step in creating an exploitable Knowledge Graph for your Company.
-    - [Define Prefixes / Namespaces](define-prefixes-namespaces/index.md) --- Define Prefixes / Namespaces — Namespace declarations allow for abbreviation of IRIs by using a prefixed name instead of an IRI, in particular when writing SPARQL queries or Turtle.
+    - [Cool IRIs](cool-iris/index.md) --- URIs and IRIs are character strings identifying the nodes and edges in the graph. Defining them is an important step in creating an exploitable Knowledge Graph for your Company.
+    - [Define Prefixes / Namespaces](define-prefixes-namespaces/index.md) --- Namespace declarations allow for abbreviation of IRIs by using a prefixed name instead of an IRI, in particular when writing SPARQL queries or Turtle.
+    - [Spark](spark/index.md) --- An explainer of Apache Spark and its integration into the Build component.
 
 -   :material-list-status: Tutorials
 
     ---
 
-     - [Lift Data from Tabular Data](lift-data-from-tabular-data-such-as-csv-xslx-or-database-tables/index.md) --- Build a Knowledge Graph from from Tabular Data such as CSV, XSLX or Database Tables.
-     - [Lift data from JSON and XML sources](lift-data-from-json-and-xml-sources/index.md) --- Build a Knowledge Graph based on input data from hierarchical sources such as JSON and XML files.
-     - [Extracting data from a Web API](extracting-data-from-a-web-api/index.md) --- Build a Knowledge Graph based on input data from a Web API.
-     - [Reconfigure Workflow Tasks](workflow-reconfiguration/index.md) --- During its execution, new parameters can be loaded from any source, which overwrites originally set parameters.
-     - [Incremental Database Loading](loading-jdbc-datasets-incrementally/index.md) --- Load data incrementally from a JDBC Dataset (relational database Table) into a Knowledge Graph.
+    - [Lift Data from Tabular Data](lift-data-from-tabular-data-such-as-csv-xslx-or-database-tables/index.md) --- Build a Knowledge Graph from tabular data such as CSV, XLSX or database tables.
+    - [Lift data from JSON and XML sources](lift-data-from-json-and-xml-sources/index.md) --- Build a Knowledge Graph based on input data from hierarchical sources such as JSON and XML files.
+    - [Extracting data from a Web API](extracting-data-from-a-web-api/index.md) --- Build a Knowledge Graph based on input data from a Web API.
+    - [Incremental Database Loading](loading-jdbc-datasets-incrementally/index.md) --- Load data incrementally from a JDBC Dataset (relational database Table) into a Knowledge Graph.
+    - [Active learning](active-learning/index.md) --- Advanced workflows that improve results iteratively by incorporating feedback signals.
+    - [Connect to Snowflake](snowflake-tutorial/index.md) --- Connect Snowflake as a scalable cloud warehouse and lift/link its data in Corporate Memory to unify it with your other sources in one knowledge graph.
+    - [Build Knowledge Graphs from Kafka Topics](kafka-consumer/index.md) --- Consume Kafka topics and lift event streams into a Knowledge Graph.
+    - [Evaluate Jinja Template and Send an Email Message](evaluate-template/index.md) --- Template and send an email after a workflow execution.
+    - [Link Intrusion Detection Systems to Open-Source INTelligence](tutorial-how-to-link-ids-to-osint/index.md) --- Link IDS data to OSINT sources.
+
+-   :fontawesome-regular-snowflake: Patterns
+
+    ---
+
+    - [Reconfigure Workflow Tasks](workflow-reconfiguration/index.md) --- During its execution, new parameters can be loaded from any source, which overwrites originally set parameters.
+    - [Project and Global Variables](variables/index.md) --- Define and reuse variables across tasks and projects.
+
+-   :material-book-open-variant-outline: Reference
+
+    ---
+
+    - [Mapping Creator](mapping-creator/index.md) --- Create and manage mappings to lift legacy data into a Knowledge Graph.
+    - [Integrations](integrations/index.md) --- Supported integrations and configuration options for connecting data sources and sinks.
+    - [Task and Operator Reference](reference/index.md) --- Reference documentation for tasks and operators in the Build workspace.
 
 </div>
 
diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md
@@ -0,0 +1,93 @@
+---
+icon: simple/apachespark
+tags:
+    - Introduction
+    - Explainer
+---
+# Apache Spark within Corporate Memory Build
+
+## Introduction
+
+This documentation provides an overview of Apache Spark and its integration within Corporate Memory’s Build component.
+The goal is to provide a conceptual understanding of Spark, its purpose in Build, and how workflows leverage Spark-aware datasets for efficient, distributed data processing.
+
+The documentation is structured in two parts:
+
+1. What is Apache Spark?
+2. Apache Spark within Build
+
+## What is Apache Spark?
+
+The main data processing use-cases of Apache Spark are:
+
+* data loading,
+* SQL queries,
+* machine learning,
+* streaming,
+* graph processing.
+
+Hundreds of plugins add further functionality.
+
+By itself, Apache Spark is detached from any data and Input/Output (IO) operations.
+Within Corporate Memory, the relevant configuration is documented in the [Spark configuration](../../deploy-and-configure/configuration/dataintegration/#spark-configuration).
+
+## Apache Spark within Build
+
+### Why is Apache Spark used in Corporate Memory?
+
+Apache Spark is integrated into Corporate Memory to enable scalable, distributed execution of data integration workflows within its Build component.
+While Corporate Memory’s overall architecture already consists of multiple distributed services (e.g. Build for data integration, Explore for knowledge graph management), the execution of workflows in Build is typically centralized.
+Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process large, complex, or computation-heavy datasets.
+
+The rationale behind using Spark aligns with its general strengths:
+
+- **Parallelization and scalability** for high-volume transformations and joins.
+- **Fault tolerance** through resilient distributed datasets (RDDs).
+- **Optimization** via Spark’s DAG-based execution planner, minimizing data movement.
+- **Interoperability** with widely used big data formats (Parquet, ORC, Avro).
+
+By leveraging Spark, Corporate Memory can handle data integration workflows that would otherwise be constrained by the processing limits of centralized execution, while maintaining compatibility with its semantic and knowledge-graph-oriented ecosystem.
+However, since Spark support is optional, its usage depends on specific deployment needs and data volumes.
+
+### How and where is Apache Spark used by Build?
+
+Within the Build component, Apache Spark is used exclusively for executing workflows that involve **Spark-aware datasets**.
+These workflows connect datasets, apply transformations, and produce outputs, with Spark handling large volumes of data and complex computations efficiently.
+
+For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used.
+In such cases, Build’s standard local execution engine is sufficient.
+Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine.
+
+Each Spark-aware dataset corresponds to an **executor-aware entity**.
+During workflow execution, Build translates the **workflow graph** into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster.
+The results are then materialized or written back into Corporate Memory’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph.
+
+### Types of Spark-aware datasets
+
+The main types of Spark-aware datasets include:
+
+- **Avro datasets** — row-oriented, self-describing file format that stores its schema alongside the data.
+- **Parquet datasets** — highly efficient columnar storage format that supports predicate pushdown and column pruning.
+- **ORC datasets** — optimized row-columnar format commonly used in Hadoop ecosystems, enabling fast scans and compression.
+- **Hive tables** — structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark seamlessly.
+- **HDFS datasets** — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing.
+- **JSON datasets** — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations.
+- **JDBC datasets** — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames.
+- **Embedded SQL Endpoint** — workflow results published as virtual SQL tables, queryable via JDBC or ODBC without persistent storage, optionally cached in memory.
+
+### What is the relation between Build’s Spark-aware workflows and the Knowledge Graph?
+
+The Spark-aware workflows operate on datasets within Build, executing transformations and producing outputs.
+The Knowledge Graph, managed by Explore, serves as the persistent semantic storage layer, but Spark itself does not directly interact with the graph.
+Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately.
+
+This separation of concerns allows Spark to focus on high-performance computation without being constrained by the APIs of the Knowledge Graph or by the rest of Corporate Memory’s architecture.
+Data can flow into workflows from various sources and ultimately be integrated into the graph, while the execution engine mediates this process, handling dependencies, scheduling, and parallelism.
+
+### What is the relation between Spark-aware dataset plugins and other Build plugins?
+
+Spark-aware dataset plugins are a specialized subset of dataset plugins that integrate seamlessly into Build workflows.
+They implement the same source-and-sink interfaces as all other plugins, allowing workflows to connect Spark-aware datasets, traditional datasets, and transformations.
+
+These plugins also cover JSON and JDBC sources, providing consistent behavior and integration across a wide range of data types and endpoints.
+Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial.