Sample to show how to process different types of Avro subjects in a single topic #98

buddhike wants to merge 8 commits into aws-samples:main
Conversation
nicusX left a comment
Many thanks for the contribution.
I would suggest some changes to make it simpler to understand, and also to avoid hinting at any not-so-good practices.
The `.gitignore` in the subfolder is not required.
There is a `.gitignore` at the top level. If anything is missing there, please update that one.
> * Flink API: DataStream API
> * Language: Java (11)
>
> This example demonstrates how to serialize/deserialize Avro messages in Kafka when one topic stores multiple subject types.
Explain that this is specific to the Confluent Schema Registry.
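For context, the Confluent-specific mechanism that allows several Avro types in one topic is the subject name strategy. A minimal sketch of the relevant serializer settings (the registry URL is illustrative; the property key and strategy class are from the Confluent serializer):

```java
import java.util.Properties;

// Sketch: Confluent Avro serializer settings that register one subject per
// record name instead of per topic, which is what permits multiple Avro
// types in a single topic. The schema.registry.url value is illustrative.
Properties producerConfig = new Properties();
producerConfig.setProperty("schema.registry.url", "http://localhost:8081");
producerConfig.setProperty("value.subject.name.strategy",
        "io.confluent.kafka.serializers.subject.RecordNameStrategy");
```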
> * Flink version: 1.20
> * Flink API: DataStream API
> * Language: Java (11)
We usually add the connectors used in the example to this list. In this case it's also important to mention that the example uses Avro with the Confluent Schema Registry.
> This example uses Avro-generated classes (more details [below](#using-avro-generated-classes)).
>
> A `KafkaSource` produces a stream of Avro data objects (`SpecificRecord`), fetching the writer's schema from AWS Glue Schema Registry. The Avro Kafka message value must have been serialized using AWS Glue Schema Registry.
I think you mean Confluent Schema Registry?
> env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
> env.enableCheckpointing(60000);
> }
> env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
This is not really required
> env.execute("avro-one-topic-many-subjects");
> }
>
> private static void setupAirQualityGenerator(String bootstrapServers, String sourceTopic, String schemaRegistryUrl, Map<String, Object> schemaRegistryConfig, StreamExecutionEnvironment env) {
Even though having the data generator within the same Flink app works, we are deliberately avoiding this in all of the examples. The reason is that building jobs with multiple dataflows is strongly discouraged.
We avoid any bad practice in the examples, so as not to suggest it may be a good idea.
I reckon it's more complicated, but you can add a separate module with a standalone Java application that generates data. Something similar to what we do in this example, even though in that case it's Kinesis.
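A standalone generator module could be sketched roughly as below. The class name, topic, and JSON payload are illustrative; in this sample the value serializer would instead be the Confluent `KafkaAvroSerializer` configured with `RecordNameStrategy`, so each record type gets its own subject within the same topic.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of a standalone data-generator application, separate from the Flink job.
// All names here are illustrative.
public class DataGenerator {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                // In the real generator this would be an Avro SpecificRecord instance
                producer.send(new ProducerRecord<>("sensor-data", "room-1", "{\"temperature\": 21.5}"));
                Thread.sleep(1000L);
            }
        }
    }
}
```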
> * strategies and event time extraction. However, for those scenarios to work
> * all subjects should have a standard set of fields.
> */
> class Option {
Maybe you can use `org.apache.flink.types.SerializableOptional<T>`, which comes with Flink.
The `Option` type in this PR is a container type to hold any possible deserialized value. `SerializableOptional<T>` is for optional values, so I guess it would not be the right choice here? Am I missing something? 👀 🙏🏾
BTW: I could have used `Object` instead of creating an `Option` type with an `Object` value field. However, having the `Option` type helps if we want to generate watermarks in the source operator using a common timestamp field.
Is there a better way to do this?
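To make the discussion concrete, a minimal sketch of such a container (field and accessor names are illustrative, not taken from the PR): it wraps any deserialized record together with a common event timestamp, so one watermark strategy can apply across all subject types.

```java
// Sketch of a serializable container for any deserialized Avro record.
// Exposing a common timestamp lets a single WatermarkStrategy work
// regardless of the concrete subject type.
public class Option implements java.io.Serializable {
    private final Object value;        // the deserialized Avro record
    private final long eventTimestamp; // common timestamp field across subjects

    public Option(Object value, long eventTimestamp) {
        this.value = value;
        this.eventTimestamp = eventTimestamp;
    }

    public Object getValue() {
        return value;
    }

    public long getEventTimestamp() {
        return eventTimestamp;
    }
}
```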
> }
>
> // Custom deserialization schema for handling multiple generic Avro record types
> class OptionDeserializationSchema implements KafkaRecordDeserializationSchema<Option> {
Please move this to a top-level class for readability.
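As a top-level class, the schema could be shaped roughly like this sketch, assuming the Kafka connector's `KafkaRecordDeserializationSchema` interface; the actual registry lookup and deserialization logic from the PR is elided.

```java
import java.io.IOException;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Sketch of OptionDeserializationSchema as a top-level class. The body of
// deserialize() is elided; it would resolve the writer's schema from the
// registry, deserialize record.value(), and wrap the result in an Option.
public class OptionDeserializationSchema implements KafkaRecordDeserializationSchema<Option> {

    @Override
    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<Option> out) throws IOException {
        // ... deserialize against the writer's schema and emit an Option ...
    }

    @Override
    public TypeInformation<Option> getProducedType() {
        return TypeInformation.of(Option.class);
    }
}
```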
> }
> }
>
> class RecordNameSerializer<T> implements KafkaRecordSerializationSchema<T>
> env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
>
> Properties applicationProperties = loadApplicationProperties(env).get(APPLICATION_CONFIG_GROUP);
> String bootstrapServers = Preconditions.checkNotNull(applicationProperties.getProperty("bootstrap.servers"), "bootstrap.servers not defined");
The code building the dataflow is a bit hard to follow.
I would suggest doing what we tend to do in other examples:
- In the runtime configuration, use a PropertyGroup for each source and sink, even if some configurations are repeated.
- Instantiate each Source and Sink in a local method, passing the `Properties` which contains all the configuration for that specific component. Extract specific properties, like the topic name, within the method rather than directly in `main()`.
- Build the dataflow by just attaching the operators one after the other, using intermediate stream variables only when it helps readability.
- Avoid having methods that attach operators to the dataflow. Practically, any method which expects a `DataStream` or `StreamExecutionEnvironment` as a parameter should be avoided.
- If an operator implementation, like a map or a filter, is simple, try using a lambda and inlining it. If the operator implementation is complex, externalize the implementation to a separate class.

See examples here.
We are not following these patterns in all examples, but we are trying to converge as much as possible.
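The "instantiate the source in a local method" suggestion can be sketched as below, assuming the Flink `KafkaSource` builder API and the `OptionDeserializationSchema` from this PR; the property keys and method name are illustrative.

```java
import java.util.Properties;

import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.util.Preconditions;

// Sketch: the source is built in a local method that receives only its own
// Properties group; topic and bootstrap servers are extracted here, not in main().
public class SourceFactory {
    static KafkaSource<Option> createKafkaSource(Properties sourceProperties) {
        String bootstrapServers = Preconditions.checkNotNull(
                sourceProperties.getProperty("bootstrap.servers"), "bootstrap.servers not defined");
        String topic = Preconditions.checkNotNull(
                sourceProperties.getProperty("topic"), "topic not defined");

        return KafkaSource.<Option>builder()
                .setBootstrapServers(bootstrapServers)
                .setTopics(topic)
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setDeserializer(new OptionDeserializationSchema())
                .build();
    }
}
```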
@nicusX Thanks for reviewing this PR. I've addressed your points. Could you please take another look? 🙏🏾
@nicusX Do you think this is a good approach to share run configurations with IntelliJ users?
Uhm, I would avoid anything IntelliJ-specific. We are avoiding it in all the other examples; people may have different setups. We give instructions instead.
nicusX left a comment
Splitting out the data generator makes it much more readable.
I added some comments about a few things to clean up.
I would also remove the IntelliJ configuration. The instructions are more than sufficient.
> <maven.compiler.source>${target.java.version}</maven.compiler.source>
> <main.class>com.amazonaws.services.msf.StreamingJob</main.class>
>
> <scala.binary.version>2.12</scala.binary.version>
> <artifactId>flink-connector-kafka</artifactId>
> <version>${kafka.clients.version}</version>
> </dependency>
> <dependency>
I don't think this dependency is required. It's a super-old dependency containing the AWS connectors from before they were part of the Apache Flink project.
> <dependency>
> <groupId>org.apache.flink</groupId>
> <artifactId>flink-runtime-web</artifactId>
> <version>${flink.version}</version>

> <dependency>
> <groupId>org.apache.flink</groupId>
> <artifactId>flink-connector-base</artifactId>
> <version>${flink.version}</version>
> StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>
> if (isLocal(env)) {
> env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new org.apache.flink.configuration.Configuration());
Creating the local environment with the web UI is no longer required. With 1.20, the UI is created automatically when running locally, as long as the flink-runtime-web dependency is included.
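If that behavior holds for 1.20, the local branch can be dropped entirely; a sketch:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: with flink-runtime-web on the classpath, this single call is enough
// both locally (where the web UI starts automatically) and on the cluster.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
```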
> </filters>
> <transformers>
> <transformer
> implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
This is not needed for a plain Java application.
> <goal>shade</goal>
> </goals>
> <configuration>
> <artifactSet>
This is not needed for a plain Java application.
It will actually prevent the packaged application from logging.
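For the standalone application, a minimal shade configuration without the artifact-set exclusions could look like this sketch (the plugin version is omitted on purpose, assuming it is managed by the parent or `pluginManagement`):

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```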
> <exclude>org.apache.logging.log4j:*</exclude>
> </excludes>
> </artifactSet>
> <filters>
> <commons.version>1.9.0</commons.version>
> </properties>
>
> <repositories>
This is not required.
`io.confluent:kafka-avro-serializer` is in Maven Central.