
Commit 9abc2ec

jubrad and claude committed
test: add PrivateLink cloudtest with Toxiproxy simulation
Adds cloudtest infrastructure for testing Kafka PrivateLink connections using
Toxiproxy as a network proxy to simulate VPC endpoint routing. Includes two
tests:

- `test_privatelink_e2e_connectivity`: validates basic connectivity through a
  simulated PrivateLink path, tests failure detection when the proxy is
  disabled, and recovery when it is re-enabled.
- `test_privatelink_pattern_matching`: patches Redpanda to advertise an
  AZ-specific broker address, then verifies that MATCHING rules route
  post-metadata traffic through the AZ-specific proxy. The default proxy is
  disabled after bootstrap to prove pattern matching works.

Also adds `doc/developer/testing-confluent-privatelink.md` with a
step-by-step guide for manual testing against Confluent Cloud PrivateLink
using a scratch VM with dnsmasq DNS overrides. Fixes the cloudtest `reset`
script to clean up configmaps and vpcendpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
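For reference, the failure/recovery simulation in these tests comes down to
flipping a proxy's `enabled` flag through Toxiproxy's HTTP admin API. A rough
by-hand equivalent is sketched below; port 8474 is Toxiproxy's default admin
port, and the proxy name `kafka` is illustrative rather than the name the
tests actually use:

```bash
# Disable the proxy: connections through the simulated VPC endpoint now fail
curl -s -X POST localhost:8474/proxies/kafka -d '{"enabled": false}'

# Re-enable it: traffic through the simulated endpoint should recover
curl -s -X POST localhost:8474/proxies/kafka -d '{"enabled": true}'
```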
1 parent a1fc2ad, commit 9abc2ec

4 files changed: 1,212 additions, 0 deletions

doc/developer/testing-confluent-privatelink.md (+331 lines)
# Testing Confluent Cloud PrivateLink Connectivity

This guide walks through setting up a scratch VM to test Kafka connectivity
to a Confluent Cloud cluster over AWS PrivateLink. This is useful for
debugging transport-level issues (SNI, TLS, broker routing) without needing
a full Materialize environment.

## Prerequisites

- A Confluent Cloud dedicated cluster with PrivateLink enabled
- A Confluent Cloud API key for the cluster
- Access to `bin/scratch` in the Materialize repo

## 1. Deploy a scratch instance

```bash
bin/scratch create <name>
```

## 2. Set up a VPC endpoint to Confluent

Create a VPC endpoint in the scratch instance's VPC pointing to the Confluent
PrivateLink service. You'll need the VPC endpoint service name from the
Confluent Cloud networking settings (e.g.,
`com.amazonaws.vpce.us-east-1.vpce-svc-006836f18cb2d0819`).

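You can create the endpoint in the console or, as a sketch, from the CLI. The
VPC, subnet, and security group IDs below are placeholders; the security group
attached to the endpoint must allow inbound TCP 9092 from the scratch VM.

```bash
aws ec2 create-vpc-endpoint \
  --vpc-id <VPC_ID> \
  --vpc-endpoint-type Interface \
  --service-name <VPCE_SERVICE_NAME> \
  --subnet-ids <SUBNET_ID> \
  --security-group-ids <SG_ID>
```
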
### Allow access from the scratch VM's AWS account

Confluent requires you to explicitly allow each AWS account that connects via
PrivateLink. In the Confluent Cloud console, go to your cluster's networking
settings and add the scratch VM's AWS account ID to the PrivateLink access
list. Without this, the VPC endpoint will appear "available" in AWS but all
traffic will be silently dropped.

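If you're not sure which AWS account the scratch VM lives in, print it from
the VM itself:

```bash
aws sts get-caller-identity --query Account --output text
```
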
### Enable the correct AZs

AWS creates a separate ENI per AZ, each with its own private IP. Confluent
brokers in a given AZ are only reachable through that AZ's ENI; traffic to
the wrong AZ's ENI will black-hole.

Confluent requires you to choose at least 3 AZs when configuring PrivateLink.
However, **only the AZs where your Materialize compute is running will
actually carry traffic.** When creating the VPC endpoint in AWS, you only
need to enable subnets in the AZs that match your compute placement.

For the scratch VM test, enable the VPC endpoint in the AZ where the scratch
instance is running. For a production Materialize deployment, enable the AZs
where your cluster replicas are scheduled.

Note each AZ's ENI private IP and which AWS AZ it's in. You can find these in
the AWS console under the VPC endpoint's "Subnets" tab, or via:

```bash
# One lookup per ENI: prints each ENI's AZ and private IP
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids <VPCE_ID> \
  --query 'VpcEndpoints[0].NetworkInterfaceIds' --output text | tr '\t' '\n' | \
  xargs -I{} aws ec2 describe-network-interfaces --network-interface-ids {} \
  --query 'NetworkInterfaces[*].[AvailabilityZone,PrivateIpAddress]' --output table
```

You'll need to map each Confluent AZ name (e.g., `use1-az1`) to the correct
ENI IP. Confluent uses AWS AZ IDs (`use1-az1`), which are stable physical
identifiers, but each AWS account maps them to different zone names
(`us-east-1a`, `us-east-1b`, etc.), so verify the mapping carefully.

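One way to check the ID-to-name mapping for the account you're using:

```bash
aws ec2 describe-availability-zones \
  --query 'AvailabilityZones[*].[ZoneId,ZoneName]' --output table
```
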
## 3. SSH into the scratch VM

```bash
bin/scratch ssh <name>
```

## 4. Configure DNS with dnsmasq

Confluent PrivateLink requires that the cluster's DNS hostnames resolve to the
VPC endpoint IP. Since the scratch VM doesn't have Confluent's private hosted
zones, we use dnsmasq to override DNS resolution.

### Install dnsmasq

```bash
sudo apt-get update && sudo apt-get install -y dnsmasq
```

### Disable systemd-resolved stub listener

systemd-resolved binds to port 53, which conflicts with dnsmasq.

```bash
sudo mkdir -p /etc/systemd/resolved.conf.d
sudo tee /etc/systemd/resolved.conf.d/noresolvstub.conf <<EOF
[Resolve]
DNSStubListener=no
DNS=127.0.0.1
EOF
```

### Point resolv.conf at dnsmasq

```bash
sudo rm /etc/resolv.conf
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
```

### Configure dnsmasq upstream DNS

The VPC DNS resolver must be configured as the upstream so that non-Confluent
DNS queries still work. On AWS, this is typically `169.254.169.253`.

```bash
echo "server=169.254.169.253" | sudo tee /etc/dnsmasq.d/upstream.conf
```

### Add Confluent PrivateLink DNS overrides

Confluent PrivateLink uses AZ-specific DNS subdomains for broker routing.
Each Confluent AZ has its own VPC endpoint ENI, and brokers in that AZ are
only reachable through the matching ENI. The DNS must route each AZ's broker
hostnames to the correct ENI IP.

Confluent broker addresses follow the pattern:

- Bootstrap/API: `lkc-XXXXX.<ENDPOINT_DOMAIN>:9092` (no AZ prefix, can use any ENI)
- Brokers: `b0-lkc-XXXXX.<AZ>.<ENDPOINT_DOMAIN>:9092` (AZ-specific, must use that AZ's ENI)

Replace `<ENDPOINT_DOMAIN>` with the Confluent endpoint domain and each
`<AZ*_IP>` with the ENI private IP for that specific Confluent AZ.

```bash
sudo tee /etc/dnsmasq.d/confluent-privatelink.conf <<EOF
# AZ-specific: each AZ's brokers MUST resolve to that AZ's VPC endpoint ENI.
# Brokers are only reachable through their own AZ's ENI.
address=/<AZ1>.<ENDPOINT_DOMAIN>/<AZ1_IP>
address=/<AZ2>.<ENDPOINT_DOMAIN>/<AZ2_IP>
address=/<AZ3>.<ENDPOINT_DOMAIN>/<AZ3_IP>

# Base domain: bootstrap and API addresses have no AZ prefix.
# These can route through any AZ's ENI.
address=/<ENDPOINT_DOMAIN>/<AZ1_IP>
EOF
```

For example, with endpoint domain `dom8pmk29rw.us-east-1.aws.confluent.cloud`
and three AZs:

```bash
sudo tee /etc/dnsmasq.d/confluent-privatelink.conf <<EOF
# AZ-specific: broker traffic must go through the correct AZ's ENI
address=/use1-az1.dom8pmk29rw.us-east-1.aws.confluent.cloud/172.31.10.100
address=/use1-az4.dom8pmk29rw.us-east-1.aws.confluent.cloud/172.31.20.100
address=/use1-az6.dom8pmk29rw.us-east-1.aws.confluent.cloud/172.31.30.100

# Bootstrap/API: no AZ in hostname, can go through any ENI
address=/dom8pmk29rw.us-east-1.aws.confluent.cloud/172.31.10.100
EOF
```

Note: dnsmasq wildcard matching is suffix-based. `address=/foo.example.com/IP`
matches `foo.example.com` and `anything.foo.example.com`. The more specific
AZ entries take precedence over the base domain entry, so a broker address
like `b0-lkc-825730.use1-az1.dom8pmk29rw.us-east-1.aws.confluent.cloud`
resolves to the `use1-az1` ENI IP, while the bootstrap address
`lkc-825730.dom8pmk29rw.us-east-1.aws.confluent.cloud` resolves to the
base domain IP.

### Restart services

```bash
sudo systemctl restart systemd-resolved
sudo systemctl restart dnsmasq
```

### Verify

```bash
nslookup google.com                             # upstream works
nslookup lkc-XXXXX.<ENDPOINT_DOMAIN>            # bootstrap -> base IP
nslookup b0-lkc-XXXXX.<AZ1>.<ENDPOINT_DOMAIN>   # AZ1 broker -> AZ1 IP
nslookup b0-lkc-XXXXX.<AZ2>.<ENDPOINT_DOMAIN>   # AZ2 broker -> AZ2 IP
nslookup b0-lkc-XXXXX.<AZ3>.<ENDPOINT_DOMAIN>   # AZ3 broker -> AZ3 IP
```

Each AZ lookup should return that AZ's ENI IP, not the base domain IP.

## 5. Install Kafka tools

Depending on the Ubuntu release, the `kafkacat` package may install the binary
as `kcat` (used below) or as `kafkacat`.

```bash
sudo apt-get install -y kafkacat default-jre-headless
curl -L https://archive.apache.org/dist/kafka/3.9.0/kafka_2.13-3.9.0.tgz | tar xz
```

## 6. Configure Kafka authentication

```bash
cat > /tmp/kafka.properties <<EOF
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<API_KEY>" password="<API_SECRET>";
EOF
```

## 7. Verify connectivity

```bash
BOOTSTRAP="<CLUSTER_ID>.<ENDPOINT_DOMAIN>:9092"

# List brokers and topics
kcat -b "$BOOTSTRAP" \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanisms=PLAIN \
  -X sasl.username=<API_KEY> \
  -X sasl.password='<API_SECRET>' \
  -L
```

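If DNS and the ENI routing are correct, the metadata listing should show each
broker under its AZ-specific hostname, roughly along these lines (broker count
and numbering will differ):

```
Metadata for all topics (from broker -1: sasl_ssl://lkc-XXXXX.<ENDPOINT_DOMAIN>:9092/bootstrap):
 6 brokers:
  broker 0 at b0-lkc-XXXXX.<AZ1>.<ENDPOINT_DOMAIN>:9092
  broker 1 at b1-lkc-XXXXX.<AZ2>.<ENDPOINT_DOMAIN>:9092
  ...
```
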
## 8. Create a test topic

Confluent Cloud doesn't allow auto-topic creation. Since the cluster is
private, you can't create topics from outside the VPC. Use `kafka-topics.sh`
from the scratch VM:

```bash
kafka_2.13-3.9.0/bin/kafka-topics.sh \
  --bootstrap-server "$BOOTSTRAP" \
  --command-config /tmp/kafka.properties \
  --create --topic test-privatelink --partitions 3 --replication-factor 3
```

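To confirm the topic landed with the expected partition count:

```bash
kafka_2.13-3.9.0/bin/kafka-topics.sh \
  --bootstrap-server "$BOOTSTRAP" \
  --command-config /tmp/kafka.properties \
  --describe --topic test-privatelink
```
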
## 9. Produce test data

```bash
for i in $(seq 1 1000); do
  echo "{\"id\": $i, \"ts\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\", \"value\": $((RANDOM % 10000))}"
done | kcat -b "$BOOTSTRAP" \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanisms=PLAIN \
  -X sasl.username=<API_KEY> \
  -X sasl.password='<API_SECRET>' \
  -P -t test-privatelink
```

### Verify

```bash
kcat -b "$BOOTSTRAP" \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanisms=PLAIN \
  -X sasl.username=<API_KEY> \
  -X sasl.password='<API_SECRET>' \
  -C -t test-privatelink -e -q | wc -l
```

## 10. Test from Materialize

Once data is flowing, create the source in your Materialize environment:

```sql
CREATE SECRET confluent_api_secret AS '<API_SECRET>';

-- The AVAILABILITY ZONES here only need to include the AZs where your
-- Materialize compute is running. Confluent requires 3 AZs on their side,
-- but only the AZ(s) matching your compute placement will carry traffic.
CREATE CONNECTION confluent_privatelink
TO AWS PRIVATELINK (
    SERVICE NAME '<VPCE_SERVICE_NAME>',
    AVAILABILITY ZONES ('<AZ1>', '<AZ2>', '<AZ3>')
);

-- Wait for the VPC endpoint to become available. You must also add the
-- Materialize AWS account to the Confluent PrivateLink access list.
-- Then:

CREATE CONNECTION confluent_kafka TO KAFKA (
    AWS PRIVATELINKS (
        '<CLUSTER_ID>.<ENDPOINT_DOMAIN>:9092'
            TO confluent_privatelink (AVAILABILITY ZONE '<AZ>'),
        '*<AZ1>*' TO confluent_privatelink (AVAILABILITY ZONE '<AZ1>'),
        '*<AZ2>*' TO confluent_privatelink (AVAILABILITY ZONE '<AZ2>'),
        '*<AZ3>*' TO confluent_privatelink (AVAILABILITY ZONE '<AZ3>')
    ),
    SASL MECHANISMS 'PLAIN',
    SASL USERNAME '<API_KEY>',
    SASL PASSWORD SECRET confluent_api_secret,
    SECURITY PROTOCOL 'SASL_SSL'
);

CREATE SOURCE test_privatelink_source
FROM KAFKA CONNECTION confluent_kafka (
    TOPIC 'test-privatelink'
);

CREATE TABLE test_privatelink_tbl
FROM SOURCE test_privatelink_source (
    REFERENCE "test-privatelink"
)
FORMAT BYTES
ENVELOPE NONE;

SELECT COUNT(*) FROM test_privatelink_tbl;
```

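If the endpoint never validates or the count stays empty, the status relations
in `mz_internal` are a quick first check (relation names assume a recent
Materialize build):

```sql
-- Did the AWS PrivateLink endpoint come up?
SELECT * FROM mz_internal.mz_aws_privatelink_connection_statuses;

-- Is the source healthy, and what was its last error?
SELECT name, status, error
FROM mz_internal.mz_source_statuses
WHERE name = 'test_privatelink_source';
```
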
## Troubleshooting

### Transport failure on connection validation

Check which AZ endpoints are reachable. Not all VPC endpoint ENIs may have
active NLB targets behind them:

```bash
# From inside the environmentd pod
for ip in <IP1> <IP2> <IP3>; do
  timeout 3 bash -c "echo > /dev/tcp/$ip/9092" 2>&1 && echo "$ip: OPEN" || echo "$ip: CLOSED"
done
```

If only some IPs are reachable, ensure the Confluent PrivateLink is enabled
for all the AZs you specified, or change the bootstrap rule to use a working AZ.

### Enable librdkafka debug logging

From a Materialize SQL session:

```sql
ALTER SYSTEM SET log_filter = 'info,librdkafka=debug';
```

Alternatively, enable the Kafka debug log-filter variant for your organization
in LaunchDarkly.

Then recreate the connection to trigger validation. Check environmentd logs
for SSL handshake details, connection state transitions, etc. Reset afterwards:

```sql
ALTER SYSTEM RESET log_filter;
```
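
While the debug filter is active, one way to watch the handshake from outside
the pod is with `kubectl` (the namespace and pod name here are placeholders):

```bash
kubectl logs -n <NAMESPACE> <ENVIRONMENTD_POD> -f | grep -iE 'ssl|sasl|broker'
```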
