
Commit 53fc63d

Merge pull request #212 from eccenca/feature/cmem-build-spark-doc
Add Apache Spark within CMEM BUILD documentation
2 parents 3cea581 + 7ad9669 commit 53fc63d

4 files changed

Lines changed: 125 additions & 11 deletions

File tree

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22

33
👍🎉 First off, thanks for taking the time to contribute! 🎉👍
44

5-
The following is a set of guidelines for contributing to the eccenca Corporate Memory documention project.
5+
The following is a set of guidelines for contributing to the eccenca Corporate Memory documentation project.
66

77
## How Can I Contribute?
88

@@ -124,7 +124,7 @@ On this page is search function for icons available as well.
124124
<summary>Extend section</summary>
125125

126126
- do not use a cluttered desktop
127-
- do not show other esp. personal project artifacts then relevant for the tutorial / feature to show
127+
- do not show other esp. personal project artifacts than relevant for the tutorial / feature to show
128128
- select cropping area carefully (omit backgrounds, lines on the edges, etc.)
129129
- use the same or a similar area for similar screens
130130
- all relevant elements should be clearly visible and not be truncated

docs/build/.pages

Lines changed: 1 addition & 0 deletions
@@ -18,3 +18,4 @@ nav:
1818
- Project and Global Variables: variables
1919
- Evaluate Template Operator: evaluate-template
2020
- Build Knowledge Graphs from Kafka Topics: kafka-consumer
21+
- Spark: spark

docs/build/index.md

Lines changed: 29 additions & 9 deletions
@@ -9,30 +9,50 @@ hide:
99

1010
# :material-star: Build
1111

12-
The Build stage is used to turn your legacy data points from existing datasets into an Enterprise Knowledge Graph structure. The subsections introduce the features of Corporate Memory that support this stage and provide guidance through your first lifting activities.
12+
The Build stage turns your source data—across files, databases, APIs, and streams—into an Enterprise Knowledge Graph. The sections below explain the Build workspace and guide you from first lifting steps to reusable patterns and reference material.
1313

1414
**:octicons-people-24: Intended audience:** Linked Data Experts
1515

1616
<div class="grid cards" markdown>
1717

18-
- :eccenca-application-dataintegration: Introduction and Best Practices
18+
- :eccenca-application-dataintegration: Foundations: Introduction and Best Practices
1919

2020
---
2121

2222
- [Introduction to the User Interface](introduction-to-the-user-interface/index.md) --- a short introduction to the **Build** workspace incl. projects and tasks management.
2323
- [Rule Operators](rule-operators/index.md) --- Overview on operators that can be used to build linkage and transformation rules.
24-
- [Cool IRIs](cool-iris/index.md) --- URIs and IRIs are character strings identifying the nodes and edges in the graph. Defining them is an important step in creating an exploitable Knowledge Graph for your Company.
25-
- [Define Prefixes / Namespaces](define-prefixes-namespaces/index.md) --- Define Prefixes / Namespaces — Namespace declarations allow for abbreviation of IRIs by using a prefixed name instead of an IRI, in particular when writing SPARQL queries or Turtle.
24+
- [Cool IRIs](cool-iris/index.md) --- URIs and IRIs are character strings identifying the nodes and edges in the graph. Defining them is an important step in creating an exploitable Knowledge Graph for your Company.
25+
- [Define Prefixes / Namespaces](define-prefixes-namespaces/index.md) --- Namespace declarations allow for abbreviation of IRIs by using a prefixed name instead of an IRI, in particular when writing SPARQL queries or Turtle.
26+
- [Spark](spark/index.md) --- Explainer of Apache Spark and its integration within the BUILD platform.
2627

2728
- :material-list-status: Tutorials
2829

2930
---
3031

31-
- [Lift Data from Tabular Data](lift-data-from-tabular-data-such-as-csv-xslx-or-database-tables/index.md) --- Build a Knowledge Graph from from Tabular Data such as CSV, XSLX or Database Tables.
32-
- [Lift data from JSON and XML sources](lift-data-from-json-and-xml-sources/index.md) --- Build a Knowledge Graph based on input data from hierarchical sources such as JSON and XML files.
33-
- [Extracting data from a Web API](extracting-data-from-a-web-api/index.md) --- Build a Knowledge Graph based on input data from a Web API.
34-
- [Reconfigure Workflow Tasks](workflow-reconfiguration/index.md) --- During its execution, new parameters can be loaded from any source, which overwrites originally set parameters.
35-
- [Incremental Database Loading](loading-jdbc-datasets-incrementally/index.md) --- Load data incrementally from a JDBC Dataset (relational database Table) into a Knowledge Graph.
32+
- [Lift Data from Tabular Data](lift-data-from-tabular-data-such-as-csv-xslx-or-database-tables/index.md) --- Build a Knowledge Graph from tabular data such as CSV, XLSX or database tables.
33+
- [Lift data from JSON and XML sources](lift-data-from-json-and-xml-sources/index.md) --- Build a Knowledge Graph based on input data from hierarchical sources such as JSON and XML files.
34+
- [Extracting data from a Web API](extracting-data-from-a-web-api/index.md) --- Build a Knowledge Graph based on input data from a Web API.
35+
- [Incremental Database Loading](loading-jdbc-datasets-incrementally/index.md) --- Load data incrementally from a JDBC Dataset (relational database Table) into a Knowledge Graph.
36+
- [Active learning](active-learning/index.md) --- Advanced workflows that improve results iteratively by incorporating feedback signals.
37+
- [Connect to Snowflake](snowflake-tutorial/index.md) --- Connect Snowflake as a scalable cloud warehouse and lift/link its data in Corporate Memory to unify it with your other sources in one knowledge graph.
38+
- [Build Knowledge Graphs from Kafka Topics](kafka-consumer/index.md) --- Consume Kafka topics and lift event streams into a Knowledge Graph.
39+
- [Evaluate Jinja Template and Send an Email Message](evaluate-template/index.md) --- Template and send an email after a workflow execution.
40+
- [Link Intrusion Detection Systems to Open-Source INTelligence](tutorial-how-to-link-ids-to-osint/index.md) --- Link IDS data to OSINT sources.
41+
42+
- :fontawesome-regular-snowflake: Patterns
43+
44+
---
45+
46+
- [Reconfigure Workflow Tasks](workflow-reconfiguration/index.md) --- During its execution, new parameters can be loaded from any source, which overwrites originally set parameters.
47+
- [Project and Global Variables](variables/index.md) --- Define and reuse variables across tasks and projects.
48+
49+
- :material-book-open-variant-outline: Reference
50+
51+
---
52+
53+
- [Mapping Creator](mapping-creator/index.md) --- Create and manage mappings to lift legacy data into a Knowledge Graph.
54+
- [Integrations](integrations/index.md) --- Supported integrations and configuration options for connecting data sources and sinks.
55+
- [Task and Operator Reference](reference/index.md) --- Reference documentation for tasks and operators in the Build workspace.
3656

3757
</div>
3858

docs/build/spark/index.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
1+
---
2+
icon: simple/apachespark
3+
tags:
4+
- Introduction
5+
- Explainer
6+
---
7+
# Apache Spark within Corporate Memory Build
8+
9+
## Introduction
10+
11+
This documentation provides an overview of Apache Spark and its integration within Corporate Memory’s Build component.
12+
The goal is to provide a conceptual understanding of Spark, its purpose in Build, and how workflows leverage Spark-aware datasets for efficient, distributed data processing.
13+
14+
The documentation is structured in two parts:
15+
16+
1. What is Apache Spark?
17+
2. Apache Spark within Build
18+
19+
## What is Apache Spark?
20+
21+
Apache Spark is an open-source engine for distributed, large-scale data processing. Its main use cases are:
22+
23+
* data loading,
24+
* SQL queries,
25+
* machine learning,
26+
* streaming,
27+
* graph processing.
28+
29+
Additionally, hundreds of plugins provide further functionality.
30+
31+
By itself, Apache Spark is detached from any data and Input/Output (IO) operations.
32+
Within Corporate Memory, the relevant configuration is documented in the [Spark configuration](../../deploy-and-configure/configuration/dataintegration/#spark-configuration) section.
33+
34+
## Apache Spark within Build
35+
36+
### Why is Apache Spark used in Corporate Memory?
37+
38+
Apache Spark is integrated into Corporate Memory to enable scalable, distributed execution of data integration workflows within its Build component.
39+
While Corporate Memory’s overall architecture already consists of multiple distributed services (e.g. Build for data integration, Explore for knowledge graph management), the execution of workflows in Build is typically centralized.
40+
Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process large, complex, or computation-heavy datasets.
41+
42+
The rationale behind using Spark aligns with its general strengths:
43+
44+
- **Parallelization and scalability** for high-volume transformations and joins.
45+
- **Fault tolerance** through resilient distributed datasets (RDDs).
46+
- **Optimization** via Spark’s DAG-based execution planner, minimizing data movement.
47+
- **Interoperability** with widely used big data formats (Parquet, ORC, Avro).
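
The first two points can be illustrated without Spark itself. The following is a minimal sketch in plain Python, not Spark or Build code: it splits a dataset into partitions, processes them in parallel, and recomputes a failed partition from its own input, which is loosely the idea behind lineage-based recovery of RDDs. All names (`split`, `process_partition`, `run_partitioned`) are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, num_partitions):
    """Split a list into roughly equal, contiguous partitions."""
    size = max(1, -(-len(records) // num_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_partition(partition):
    """Hypothetical transformation: square every value in one partition."""
    return [value * value for value in partition]


def run_partitioned(records, num_partitions=4, max_retries=2,
                    transform=process_partition):
    """Apply `transform` to each partition in parallel, retrying failures.

    A failed partition is recomputed from its own input; the other
    partitions are unaffected.
    """
    partitions = split(records, num_partitions)

    def with_retry(partition):
        for attempt in range(max_retries + 1):
            try:
                return transform(partition)
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last allowed attempt

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(with_retry, partitions))
    # Flatten per-partition results, preserving the original order.
    return [value for part in results for value in part]
```

Spark applies the same principle across machines rather than threads, and tracks how each partition was derived so that lost partitions can be rebuilt.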
48+
49+
By leveraging Spark, Corporate Memory can handle data integration workflows that would otherwise be constrained by the limits of centralized execution, while maintaining compatibility with its semantic and knowledge-graph-oriented ecosystem.
50+
However, since Spark support is optional, its usage depends on specific deployment needs and data volumes.
51+
52+
### How and where is Apache Spark used by Build?
53+
54+
Within the Build component, Apache Spark is used exclusively for executing workflows that involve **Spark-aware datasets**.
55+
These workflows connect datasets, apply transformations, and produce outputs, with Spark handling large data volumes and complex computations efficiently.
56+
57+
For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used.
58+
In such cases, Build’s standard local execution engine is sufficient.
59+
Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine.
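
The choice between the two engines can be sketched as a simple rule. This is an illustrative assumption about the decision logic, not Build's actual implementation; the type identifiers and the function `choose_backend` are hypothetical.

```python
# Hypothetical identifiers for Spark-aware dataset types (assumed names,
# not Build's real plugin IDs).
SPARK_AWARE_TYPES = {"avro", "parquet", "orc", "hive"}


def choose_backend(dataset_types, spark_enabled):
    """Return "spark" only if Spark is enabled and a Spark-aware dataset is used.

    In every other case the standard local engine is selected.
    """
    if spark_enabled and any(t in SPARK_AWARE_TYPES for t in dataset_types):
        return "spark"
    return "local"
```

For example, `choose_backend(["csv", "parquet"], spark_enabled=True)` selects Spark, while the same workflow on a deployment without Spark falls back to local execution.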
60+
61+
Each Spark-aware dataset corresponds to an **executor-aware entity**.
62+
During workflow execution, Build translates the **workflow graph** into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster.
63+
The results are then materialized or written back into Corporate Memory’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph.
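
The translation of a workflow graph into ordered execution steps can be approximated with a topological sort. The sketch below uses Python's standard `graphlib` module and is only an analogy: the node names and `plan_stages` are hypothetical, and Spark's real stage boundaries are determined by shuffles rather than by this simple grouping.

```python
from graphlib import TopologicalSorter


def plan_stages(workflow):
    """Group workflow nodes into ordered stages.

    `workflow` maps each node to the set of nodes it depends on.
    Nodes within one stage do not depend on each other, so they
    could run in parallel.
    """
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical workflow: two inputs are joined, transformed, and written out.
workflow = {
    "customers.parquet": set(),
    "orders.parquet": set(),
    "join": {"customers.parquet", "orders.parquet"},
    "transform": {"join"},
    "knowledge-graph": {"transform"},
}
stages = plan_stages(workflow)
```

Here the two input datasets form the first stage, followed by `join`, `transform`, and the final write, one stage each.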
64+
65+
### Types of Spark-aware datasets
66+
67+
The main types of Spark-aware datasets include:
68+
69+
- **Avro datasets** — row-oriented, self-describing file format that stores its schema alongside the data.
70+
- **Parquet datasets** — highly efficient columnar storage format that supports predicate pushdown and column pruning.
71+
- **ORC datasets** — optimized row-columnar format commonly used in Hadoop ecosystems, enabling fast scans and compression.
72+
- **Hive tables** — structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark seamlessly.
73+
- **HDFS datasets** — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing.
74+
- **JSON datasets** — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations.
75+
- **JDBC datasets** — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames.
76+
- **Embedded SQL Endpoint** — workflow results published as virtual SQL tables, queryable via JDBC or ODBC without persistent storage, optionally cached in memory.
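
The difference between row-oriented and columnar layouts, and why columnar formats allow column pruning, can be shown with a small plain-Python sketch. It does not read real Parquet or ORC files; the data and the functions `to_columnar` and `scan` are hypothetical. Real Parquet readers additionally use per-row-group statistics to skip data, which this sketch does not model.

```python
# Row-oriented layout: one record after another (hypothetical sample data).
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 25.5},
    {"id": 3, "country": "DE", "amount": 7.25},
]


def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def scan(columns, select, where_column, predicate):
    """Read only the needed columns and keep rows matching the predicate.

    Column pruning: columns outside `select` and `where_column` are never read.
    The predicate is evaluated during the scan, before rows are assembled.
    """
    keep = [i for i, value in enumerate(columns[where_column]) if predicate(value)]
    return [{name: columns[name][i] for name in select} for i in keep]


columns = to_columnar(rows)
result = scan(columns, select=["id", "amount"],
              where_column="country", predicate=lambda c: c == "DE")
```

With a row-oriented layout, the same query would have to read every field of every record before discarding the unneeded ones.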
77+
78+
### What is the relation between Build’s Spark-aware workflows and the Knowledge Graph?
79+
80+
The Spark-aware workflows operate on datasets within Build, executing transformations and producing outputs.
81+
The Knowledge Graph, managed by Explore, serves as the persistent semantic storage layer, but Spark itself does not directly interact with the graph.
82+
Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately.
83+
84+
This separation of concerns allows Spark to focus on high-performance computation without being constrained by the architecture or APIs of the Knowledge Graph or of the rest of Corporate Memory.
85+
Data can flow into workflows from various sources and ultimately be integrated into the graph, while the execution engine mediates this process, handling dependencies, scheduling, and parallelism.
86+
87+
### What is the relation between Spark-aware dataset plugins and other Build plugins?
88+
89+
Spark-aware dataset plugins are a specialized subset of dataset plugins that integrate seamlessly into Build workflows.
90+
They implement the same source-and-sink interfaces as all other plugins, allowing workflows to connect Spark-aware datasets, traditional datasets, and transformations.
91+
92+
These plugins also cover JSON and JDBC sources, providing consistent behavior and integration across a wide range of data types and endpoints.
93+
Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial.
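
The idea of a shared source-and-sink interface can be sketched as follows. This is not Build's plugin API; the `Source` and `Sink` protocols, the classes, and the `run` function are hypothetical names used to show how a workflow step can stay independent of the concrete dataset type.

```python
from typing import Iterable, Protocol


class Source(Protocol):
    def read(self) -> Iterable[dict]: ...


class Sink(Protocol):
    def write(self, records: Iterable[dict]) -> None: ...


class InMemoryDataset:
    """Stand-in for any dataset plugin; acts as both source and sink."""

    def __init__(self, records=None):
        self.records = list(records or [])

    def read(self):
        return iter(self.records)

    def write(self, records):
        self.records.extend(records)


class UppercaseNames:
    """Hypothetical transformation applied between a source and a sink."""

    def apply(self, records):
        for record in records:
            yield {**record, "name": record["name"].upper()}


def run(source: Source, transform, sink: Sink) -> None:
    """Connect a source, a transformation, and a sink into one workflow step."""
    sink.write(transform.apply(source.read()))


source = InMemoryDataset([{"name": "ada"}, {"name": "grace"}])
sink = InMemoryDataset()
run(source, UppercaseNames(), sink)
```

Because `run` depends only on the two interfaces, a Spark-backed dataset and a traditional one could be swapped in without changing the workflow step.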
