Hydra-Java

This directory contains a complete Java implementation of Hydra. Hydra-Java passes all tests in the common test suite, ensuring identical behavior with Hydra-Haskell and Hydra-Python.

Hydra is a type-aware data transformation toolkit which aims to be highly flexible and portable. It has its roots in graph databases and type theory, and provides APIs in Haskell, Java, Python, Scala, TypeScript, and Lisp. See the main Hydra README for more details.

JavaDocs for Hydra-Java can be found here, and releases can be found on Maven Central here.

Getting Started

Hydra-Java requires Java 11 or later. The Gradle wrapper lives at heads/java/gradlew; all subprojects (hydra-java, hydra-rdf4j, hydra-neo4j, hydra-pg-dsl) are configured from there.

cd heads/java
./gradlew :hydra-java:build

To publish the resulting JAR to your local Maven repository:

cd heads/java
./gradlew :hydra-java:publishToMavenLocal

You may need to set the JAVA_HOME environment variable:

cd heads/java
JAVA_HOME=/path/to/java11/installation ./gradlew :hydra-java:build

Performance note: native JDK on Apple Silicon

On Apple Silicon (M1/M2/M3/M4) Macs, an x86_64 JDK runs under Rosetta 2 translation, which can make code generation and test execution roughly 20x slower than with a native arm64 JDK. This applies to any Java version: the critical factor is architecture, not version number.

To check your JDK's architecture:

file "$(which java)"
# arm64 = native (fast), x86_64 = Rosetta (slow)

If you see x86_64, install a native arm64 JDK. Downloads are available from Oracle, Adoptium, or via Homebrew (brew install openjdk@11).
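The same check can be made from inside the JVM. The following is a minimal sketch, assuming the standard os.arch system property reports aarch64 for a native Apple Silicon JDK and x86_64 for one running under Rosetta 2. The class and method names are illustrative and not part of Hydra.

```java
import java.util.Locale;

public class JdkArchCheck {
  /** Returns true when the given os.arch value names a 64-bit ARM JVM. */
  public static boolean isArm64(String osArch) {
    String arch = osArch.toLowerCase(Locale.ROOT);
    return arch.equals("aarch64") || arch.equals("arm64");
  }

  public static void main(String[] args) {
    // os.arch describes the architecture of the running JVM, not of the hardware
    String arch = System.getProperty("os.arch");
    System.out.println("JVM architecture: " + arch
        + (isArm64(arch) ? " (arm64 JVM)" : " (not an arm64 JVM)"));
  }
}
```

Because the property reflects the JVM binary rather than the machine, an x86_64 result on an Apple Silicon Mac suggests the JDK is running under translation.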

To compare Hydra's performance across JDK versions on your machine, use the benchmark test runner with --tag to label each run and --repeat for statistical reliability:

export JAVA_HOME=$(/usr/libexec/java_home -v 11)
bin/run-benchmark-tests.sh --hosts java --tag java11 --repeat 5

export JAVA_HOME=$(/usr/libexec/java_home -v 17)
bin/run-benchmark-tests.sh --hosts java --tag java17 --repeat 5

# Compare results
bin/run-benchmark-tests.sh dashboard diff --old java11 --new java17

Documentation

For comprehensive documentation about Hydra's architecture and usage, see:

Testing

Hydra-Java has two types of tests: the common test suite (shared across all Hydra implementations) and Java-specific tests. See the Testing wiki page for comprehensive documentation.

Common Test Suite

The common test suite (hydra.test.testSuite) ensures parity across all Hydra implementations. Passing all common test suite cases is the criterion for a true Hydra implementation.

To run all tests:

cd heads/java
./gradlew :hydra-java:test

The test suite is generated from Hydra DSL sources and includes:

  • Primitive function tests (lists, strings, math, etc.)
  • Case conversion tests (camelCase, snake_case, etc.)
  • Type inference tests
  • Type checking tests
  • Evaluation tests
  • JSON coder tests
  • Rewriting and hoisting tests

Java-Specific Tests

Java-specific tests validate implementation details and Java-specific functionality. These are located in src/test/java/ alongside the common test suite runner.

To run a specific test class:

cd heads/java
./gradlew :hydra-java:test --tests "hydra.VisitorTest"

Code organization

Hydra's Java code is split across four locations (see the Code organization wiki page for the full picture):

  • This package (packages/hydra-java/src/main/java/hydra/sources/java/) — the Java coder DSL sources (written in Java). These are the source of truth for the hydra.java.* modules (Syntax, Language, Coder, Serde, Names, Utils, Environment, Testing, plus the hand-written JavaHelpers and SourceDsl support classes).

    Legacy backup: packages/hydra-java/src/main/haskell/Hydra/Sources/Java/ still contains the older Haskell-DSL versions of these modules. They are kept as a backup through the 0.15 line and produce byte-identical dist/json/hydra-java/ output, but will be dropped before 0.16. Edits should go into the Java sources, not the Haskell ones.

  • Java head (heads/java/src/main/java/) — hand-written Java runtime

    • hydra/lib/ — primitive function implementations
    • hydra/dsl/ — Java DSL (Terms, Types, Expect, ...)
    • hydra/util/ — core utilities (Either, Maybe, Pair, Lazy) plus the persistent collection helpers ConsList / PersistentMap / PersistentSet (see Collection classes under design notes)
    • hydra/tools/ — framework classes (PrimitiveFunction, MapperBase, ...)
    • hydra/UpdateJavaJson.java — driver that updates dist/json/hydra-java/ from the Java DSL sources in this package (see Generate Java code)
  • Generated Java kernel (dist/java/hydra-kernel/src/main/java/) — code-generated from the kernel DSL sources

    • hydra/core/, hydra/graph/, hydra/packaging/, hydra/coders/, hydra/typing/, ...
    • hydra/reduction/, hydra/rewriting/, hydra/hoisting/
    • hydra/inference/, hydra/checking/
  • Generated Java test suite (dist/java/hydra-kernel/src/test/java/) — the common test suite compiled into Java.

Generate Java code

Java code generation has two stages: first the Java coder modules' DSL sources are exported to JSON (Phase 1), then the JSON is loaded by the Java host and used to generate dist/java/hydra-kernel/ (Phase 2). The two stages live in different scripts and can be invoked independently.

Phase 1: regenerate dist/json/hydra-java/ from the Java DSL sources

bin/generate-hydra-java-from-java.sh is the self-hosting entry point: it runs the Java DSL sources in this package through the Java host and writes dist/json/hydra-java/.

# Regenerate hydra-java JSON from packages/hydra-java/src/main/java/hydra/sources/java/
bin/generate-hydra-java-from-java.sh

# Same, with byte-compare against the existing canonical
bin/generate-hydra-java-from-java.sh --compare

# Force a rebuild of the Java host (kernel JSON + dist/java/hydra-kernel) first
bin/generate-hydra-java-from-java.sh --force-rebuild

The script:

  1. Runs bin/sync.sh to ensure every per-language dist/java/hydra-* tree is current (the gradle rollup imports hydra.python.*, hydra.haskell.*, etc., so a scoped sync-java.sh is not sufficient). Gated by HYDRA_IN_SYNC=1 so that sync.sh Phase 5 invoking us doesn't recurse. Warm-cache sync is ~3 minutes.
  2. Compiles the rollup ((cd heads/java && ./gradlew :hydra-java:compileHeadsExtrasJava)).
  3. Runs hydra.UpdateJavaJson (via bin/update-java-json.sh), which loads the kernel universe from dist/json/hydra-kernel/, discovers the Java DSL source modules via reflection, infers types for those that don't carry pre-computed type schemes (Coder ships its schemes pre-computed; see bin/update-java-json.md for the rationale), and writes the resulting JSON.

End-to-end is ~30 seconds once dist/ is current.

Note: bin/sync.sh Phase 5 invokes generate-hydra-java-from-java.sh automatically — the native Java DSL path is authoritative. The legacy Haskell DSL copy at packages/hydra-java/src/main/haskell/ remains as a bootstrap fallback (used by Phase 1 on a cold checkout) and will be retired before 0.16. See claude/pitfalls.md for the HYDRA_IN_SYNC convention around wrapper-script self-syncing.

Phase 2: regenerate dist/java/ from the JSON

The narrowest end-to-end script is:

bin/sync-java.sh

(equivalent to bin/sync.sh --hosts java --targets java)

This will:

  1. Generate / refresh dist/json/ from the legacy Haskell DSL sources
  2. Generate the Java kernel into dist/java/hydra-kernel/src/main/java
  3. Generate the default lib modules
  4. Generate the kernel tests into dist/java/hydra-kernel/src/test/java

sync-java.sh does not run the Java test suite. To validate against the generated tests, run heads/java/bin/test-distribution.sh hydra-kernel afterward (or do a full bootstrap demo via bin/run-bootstrapping-demo.sh --hosts java --targets java, which includes test runs).

Note on Phase 5 (Java self-host). sync-java.sh will also run Phase 5 (generate-hydra-java-from-java.sh), which compiles hydra.Generation and friends. That compile imports hydra.{python,haskell,lisp,typescript}.* from per-language dist/java/ trees, so on a cold checkout Phase 5 will fail until those siblings have been populated by bin/sync.sh (full matrix) or bin/sync.sh --hosts java --targets <every-language>. See claude/pitfalls.md §"gradle :hydra-java:test needs all coder language packages in dist/java/".

Design notes

Algebraic data types

The Java coder DSL sources that drive Java code generation live here. A variety of techniques are used in order to materialize Hydra's core language in Java, including a pattern for representing algebraic data types which was originally proposed by Gabriel Garcia, and used in Dragon.

For example, the generated Vertex class represents a property graph vertex, and corresponds to a record type:

public class Vertex<V> {
  public final hydra.pg.model.VertexLabel label;
  public final V id;
  public final java.util.Map<hydra.pg.model.PropertyKey, V> properties;

  public Vertex (hydra.pg.model.VertexLabel label, V id, java.util.Map<hydra.pg.model.PropertyKey, V> properties) {
    java.util.Objects.requireNonNull((label));
    java.util.Objects.requireNonNull((id));
    java.util.Objects.requireNonNull((properties));
    this.label = label;
    this.id = id;
    this.properties = properties;
  }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof Vertex)) {
      return false;
    }
    Vertex o = (Vertex) (other);
    return label.equals(o.label) && id.equals(o.id) && properties.equals(o.properties);
  }

  @Override
  public int hashCode() {
    return 2 * label.hashCode() + 3 * id.hashCode() + 5 * properties.hashCode();
  }

  // ... with* methods for immutable updates
}

See Vertex.java for the complete class, as well as the Vertex type in Pg/Model.hs for comparison. Both files were generated from the property graph model defined here.
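The with* methods elided above follow a copy-on-update pattern: each returns a new instance with one field replaced and leaves the receiver unchanged. Here is a minimal sketch using a simplified, hypothetical two-field class (SimpleVertex is not part of Hydra, and the generated methods may differ in detail):

```java
import java.util.Objects;

public final class SimpleVertex {
  public final String label;
  public final Integer id;

  public SimpleVertex(String label, Integer id) {
    this.label = Objects.requireNonNull(label);
    this.id = Objects.requireNonNull(id);
  }

  /** Returns a copy with a different label; this instance is unchanged. */
  public SimpleVertex withLabel(String label) {
    return new SimpleVertex(label, this.id);
  }

  /** Returns a copy with a different id; this instance is unchanged. */
  public SimpleVertex withId(Integer id) {
    return new SimpleVertex(this.label, id);
  }

  public static void main(String[] args) {
    SimpleVertex original = new SimpleVertex("person", 1);
    SimpleVertex updated = original.withId(2);
    System.out.println(original.id + " " + updated.id); // prints "1 2"
  }
}
```

Since all fields are final, instances can be shared freely between callers without defensive copying.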

Collection classes

Hydra's term-level lists, maps, and sets get persistent (immutable, structurally-shared) implementations under hydra.util:

  • ConsList<T> — singly-linked list with O(1) cons and tail sharing, matching Haskell's [a].
  • PersistentMap<K, V> — ordered red-black tree map with O(log n) insert, delete, and union, matching Data.Map in Haskell. Iteration is sorted by key. Keys must be Comparable at runtime.
  • PersistentSet<T> — wrapper over PersistentMap, matching Data.Set.

These are the concrete instances behind the standard java.util.List, java.util.Map, and java.util.Set interfaces in generated kernel code: the kernel's API surfaces (POJO fields, primitive apply signatures, accessors) all expose the JDK interfaces, but the values flowing through them are persistent. This matches the Haskell semantics: every "incremental" operation (cons, insert, delete, union) returns a new collection that shares structure with the original instead of copying it. As a result, an O(n) Haskell algorithm runs in O(n) on the JVM instead of the O(n²) that a naive copy-then-append on every step (copy = new ArrayList<>(old); copy.add(x)) would yield.
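To illustrate structural sharing, here is a minimal sketch of a cons list. It is not Hydra's ConsList: the class name, the use of null for the empty list, and the helper methods are simplifying assumptions made for this example.

```java
public final class ConsSketch<T> {
  public final T head;
  public final ConsSketch<T> tail; // null marks the empty list in this sketch

  private ConsSketch(T head, ConsSketch<T> tail) {
    this.head = head;
    this.tail = tail;
  }

  /** O(1): allocates one cell and reuses the existing tail without copying it. */
  public static <T> ConsSketch<T> cons(T head, ConsSketch<T> tail) {
    return new ConsSketch<>(head, tail);
  }

  /** O(n): walks the cells to count them. */
  public static <T> int size(ConsSketch<T> list) {
    int n = 0;
    for (ConsSketch<T> cell = list; cell != null; cell = cell.tail) {
      n++;
    }
    return n;
  }

  public static void main(String[] args) {
    ConsSketch<Integer> base = cons(2, cons(3, null));
    ConsSketch<Integer> a = cons(1, base);
    ConsSketch<Integer> b = cons(0, base);
    // Both new lists point at the same tail cells; nothing was copied
    System.out.println(a.tail == base && b.tail == base); // prints "true"
    System.out.println(size(a) + " " + size(base));       // prints "3 2"
  }
}
```

Building an n-element list this way takes n constant-time cons steps, whereas copying the whole list before each append would cost O(n) per step.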

Two boundaries exist where plain JDK collections still appear:

  • Internal sort scratch buffers in algorithms that need O(1) random access (e.g. Sort, SortOn, Transpose). These never escape the function and return a ConsList to the caller.
  • LinkedHashMap in JSON output (hydra.json.JsonEncoding.ObjectBuilder), to preserve insertion-order key emission. This is a deliberate user-visible ordering choice, not a bug.
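The ordering difference behind the LinkedHashMap choice can be seen with plain JDK classes. This sketch does not use hydra.json.JsonEncoding.ObjectBuilder; the class, method names, and sample keys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class KeyOrderDemo {
  /** Builds a map whose keys are inserted in a non-alphabetical order. */
  public static Map<String, Integer> sample() {
    Map<String, Integer> m = new LinkedHashMap<>();
    m.put("name", 1);
    m.put("id", 2);
    m.put("label", 3);
    return m;
  }

  /** Joins the keys in the map's iteration order. */
  public static String keys(Map<String, Integer> m) {
    return String.join(",", m.keySet());
  }

  public static void main(String[] args) {
    Map<String, Integer> inserted = sample();
    Map<String, Integer> sorted = new TreeMap<>(inserted);
    System.out.println(keys(inserted)); // prints "name,id,label" (insertion order)
    System.out.println(keys(sorted));   // prints "id,label,name" (sorted by key)
  }
}
```

A key-sorted map would reorder the fields on output, which is why an insertion-ordered map is the natural fit when emitted JSON should follow the order in which keys were added.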

The Java coder also emits these helpers automatically when lowering Hydra term-level list/map/set literals, so generated code in dist/java/ is consistent with the runtime's choice. See encodeTermInternal's _Term_list, _Term_map, and _Term_set arms in Coder.java.

Union Types and Visitors

Union types (sum types) are represented using the visitor pattern. For example, the Element type is a tagged union of Vertex and Edge:

public abstract class Element<V> {
  private Element () {}

  public abstract <R> R accept(Visitor<V, R> visitor) ;

  public interface Visitor<V, R> {
    R visit(Vertex<V> instance) ;
    R visit(Edge<V> instance) ;
  }

  public interface PartialVisitor<V, R> extends Visitor<V, R> {
    default R otherwise(Element<V> instance) {
      throw new IllegalStateException("Non-exhaustive patterns when matching: " + (instance));
    }
    default R visit(Vertex<V> instance) { return otherwise((instance)); }
    default R visit(Edge<V> instance) { return otherwise((instance)); }
  }

  public static final class Vertex<V> extends hydra.pg.model.Element<V> {
    public final hydra.pg.model.Vertex<V> value;
    // ... constructor, equals, hashCode, accept
  }

  public static final class Edge<V> extends hydra.pg.model.Element<V> {
    public final hydra.pg.model.Edge<V> value;
    // ... constructor, equals, hashCode, accept
  }
}

See Element.java for the complete class. The Visitor interface is for pattern matching over the alternatives, and PartialVisitor is a convenient extension which supplies a default (via otherwise) for alternatives not matched explicitly.
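The difference between the two visitor styles is easiest to see at a call site. The following is a self-contained sketch that mirrors the shape of the generated code with a simplified, hypothetical union (ElementSketch, with String payloads instead of the real Vertex and Edge types); it is not the generated Element class.

```java
public abstract class ElementSketch {
  private ElementSketch() {}

  public abstract <R> R accept(Visitor<R> visitor);

  public interface Visitor<R> {
    R visit(Vertex instance);
    R visit(Edge instance);
  }

  public interface PartialVisitor<R> extends Visitor<R> {
    default R otherwise(ElementSketch instance) {
      throw new IllegalStateException("Non-exhaustive patterns when matching: " + instance);
    }
    default R visit(Vertex instance) { return otherwise(instance); }
    default R visit(Edge instance) { return otherwise(instance); }
  }

  public static final class Vertex extends ElementSketch {
    public final String value;
    public Vertex(String value) { this.value = value; }
    @Override public <R> R accept(Visitor<R> visitor) { return visitor.visit(this); }
  }

  public static final class Edge extends ElementSketch {
    public final String value;
    public Edge(String value) { this.value = value; }
    @Override public <R> R accept(Visitor<R> visitor) { return visitor.visit(this); }
  }

  /** Exhaustive match: the compiler requires a visit method for every alternative. */
  public static String describe(ElementSketch e) {
    return e.accept(new Visitor<String>() {
      @Override public String visit(Vertex v) { return "vertex " + v.value; }
      @Override public String visit(Edge ed) { return "edge " + ed.value; }
    });
  }

  /** Partial match: only Vertex is handled; otherwise() covers the rest. */
  public static boolean isVertex(ElementSketch e) {
    return e.accept(new PartialVisitor<Boolean>() {
      @Override public Boolean otherwise(ElementSketch instance) { return false; }
      @Override public Boolean visit(Vertex v) { return true; }
    });
  }

  public static void main(String[] args) {
    System.out.println(describe(new Vertex("v1"))); // prints "vertex v1"
    System.out.println(isVertex(new Edge("e1")));   // prints "false"
  }
}
```

If otherwise is not overridden, an unhandled alternative falls through to the default implementation and throws IllegalStateException at runtime.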

The Rewriting and Reduction classes are good examples of pattern matching in action, and there are simpler examples in VisitorTest.java.

Future enhancements

These are recommendations from #233 that haven't been adopted yet, recorded here so the design intent survives any future re-evaluation. They are deliberate non-goals today, not bugs.

Gated on a Java 21 minimum

The current minimum Java version for hydra-java and its generated code is Java 11. The visitor pattern shown above is the most ergonomic encoding of sum types compatible with that floor. Several worthwhile improvements become available if the minimum is raised to Java 21.

Sealed classes + pattern-matching switch (JEP 441, Java 21)

Today's generated union types use an abstract base class with nested subclasses and a Visitor/PartialVisitor for dispatch. Java 21's sealed hierarchies combined with pattern-matching switch expressions would let consumers write:

String label = switch (term) {
    case Term.Literal l    -> "literal: " + l.value;
    case Term.Variable v   -> "var: "     + v.value;
    case Term.Function f   -> "function";
    case Term.Application a -> "app";
    // ... compiler enforces exhaustiveness; missing cases are a compile error
};

Compared to today's PartialVisitor (which throws at runtime on unhandled cases), this would give compile-time exhaustiveness checking, remove the accept/visit boilerplate from every consumer site, and align the Java emission with what equivalent Hydra code looks like in Haskell, Scala, Python (match/case), and the Lisp dialects. The generated classes would need sealed/permits keywords; downstream code would migrate from visitor implementations to switch blocks. For the trade-off analysis and references to issue #233's JAVA-SEALED-SWITCH and JAVA-FUNCTIONAL-MATCH recommendations, see the branch plan in feature_233_edsls-plan.md.

Records for product types (Java 16+, record patterns in 21)

Generated record types currently use explicit fields plus hand-rolled equals/hashCode/constructors. Record declarations (previewed in Java 14, final since Java 16) would collapse those into a single line per type, with structural deconstruction available in switch patterns as of Java 21:

public record Field(Name name, Term term) { }

// And in a consumer:
case Field(var name, var term) -> ...

Raising the floor to Java 21 affects every downstream consumer of generated Hydra code (the bindings/java/* adapters, hydrapop, demo projects, external integrations that haven't been surveyed). The benefit is substantial but the cost is a coordinated platform bump that needs explicit buy-in. Until then, hydra-java stays on Java 11 and the visitor pattern remains the canonical encoding.