Skip to content

vitess: add vschemas with vindexes#8634

Merged
beautifulentropy merged 17 commits intomainfrom
vschemas-and-vindexes
Mar 13, 2026
Merged

vitess: add vschemas with vindexes#8634
beautifulentropy merged 17 commits intomainfrom
vschemas-and-vindexes

Conversation

@jsha
Copy link
Copy Markdown
Contributor

@jsha jsha commented Feb 19, 2026

In order to exercise sharding logic in Vitess, we need our integration test environment to use sharded keyspaces, which means we need to use vindexes. This PR enables vindexes and changes DB initialization in support. It does not yet shard the keyspaces since doing so causes some consistency issues I still need to debug.

DB initialization changes:

  • Stop using sql-migrate inside the boulder container.
  • Make the bmariadb and bvitess containers responsible for applying schemas when they start up.
  • Use different database / keyspace names for different test environments: boulder_sa / boulder_sa_next. This allows the two to coexist, which means we don't need to down --volumes the bmariadb container when running locally and switching from ./tn.sh to ./t.sh.
  • The bmariadb image has support for applying SQL schemas from a directory. Use that. The vttestserver has support for applying SQL schemas and vschemas from a directory.

Foreign keys: we had one foreign key constraint (serials.registrationID -> registrations), but we probably don't actually want this in the Vitess world. It makes things more complicated with no benefit. I removed it from the SQL schemas.

test/db.go: In vttestserver we can't delete more than 10k rows, so I changed deleteEverythingInAllTables to delete in a loop. Incidentally I cleaned up some stuff about foreign keys and 1 = 1 that is no longer needed. Also, I changed deletion calls in sa_test.go to use the shared deletion functions.

Deletion from overrides was failing under this setup with Error 1105 (HY000): VT09015: schema tracking required. I concluded this was because the table had no primary key. I changed its UNIQUE KEY to a PRIMARY KEY.

test/vars/vars.go: calculate DSNs based on the BOULDER_CONFIG_DIR variables, to select between boulder_sa and boulder_sa_next (etc.).

@jsha jsha marked this pull request as ready for review February 27, 2026 04:58
@jsha jsha requested a review from a team as a code owner February 27, 2026 04:58
@jsha jsha requested a review from aarongable February 27, 2026 04:58
@github-actions
Copy link
Copy Markdown
Contributor

@jsha, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

@jsha, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

@jsha
Copy link
Copy Markdown
Contributor Author

jsha commented Feb 27, 2026

One thing worth noting about this approach: while it's kind of nice that the bmariadb and bvitess containers handle their own SQL schema setup, it means that syntax errors for the .sql files get buried in the logs for those two containers, and need e.g. docker compose logs bvitess to see them.

Copy link
Copy Markdown
Contributor

@aarongable aarongable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great for this PR to come with some explicit documentation pointing towards the mariadb and vitess docs around loading schemas from disk. I had to dig for a bit until I found mariadb's docker-entrypoint process_init_files and finally understood how the schemas were getting loading into mariadb.

Overall LGTM, it took me a minute to wrap my head around how things work in this new world but the result is much cleaner and simpler than the status quo. I like it.

Comment thread sa/db/01-boulder_sa_next.sql Outdated
Comment thread sa/db/02-users_next.sql
Comment thread sa/vtschema/README.md Outdated
Comment thread test/vtcomboserver/run.sh Outdated
Comment thread sa/db/01-boulder_sa_next.sql
Comment thread sa/db/01-incidents_sa.sql
aarongable
aarongable previously approved these changes Mar 3, 2026
@aarongable
Copy link
Copy Markdown
Contributor

Hmm, I thought the integration failure was a flake, but the exact same failure happened on my retry. Looks like it might be real.

@jsha
Copy link
Copy Markdown
Contributor Author

jsha commented Mar 4, 2026

The error is from revocation_test:297:

			if !strings.Contains(err.Error(), "urn:ietf:params:acme:error:serverInternal") ||
				!strings.Contains(err.Error(), "SCT embedding") {
				t.Fatal(err)
			}

This is testing that when we cause CT to reject a certificate (so we never create a final certificate), we can still revoke it. As part of the setup, we check that (a) the certificate issuance really did fail, and (b) the error message said something about SCT embeddeding.

In fact we got some errors like these:

2026-03-04T05:20:55.555677+00:00Z 87a3d59626fe unknown boulder-wfe2[694]: 6 boulder-wfe2 Vmx34w POST /acme/finalize/ 398 500 14973 0.0.0.0 JSON={"Slug":"398/343","InternalErrors":["rpc error: code = Unknown desc = persisting linting precert to database: rpc error: code = Unknown desc = failed to insert *sa.certificateStatusModel: context deadline exceeded (also, while rolling back: sql: transaction has already been committed or rolled back)"],"Error":"500 :: serverInternal :: Error finalizing order","ua":"eggsampler-acme/v3 Go-http-client/1.1","Extra":{"KeyType":"ECDSA P-256"},"Identifiers":[{"type":"dns","value":"9d232e.com"}]}

and these:

2026-03-04T05:20:55.566687+00:00Z 87a3d59626fe unknown boulder-ra[656]: 6 boulder-ra hYHGmA [AUDIT] Certificate request - error JSON={"ID":"RwYsuZHRkdvSMeE8Cf7Tl_ZA16uJMnbUOUb6v8K8Pus","Requester":403,"OrderID":348,"VerifiedFields":["subject.commonName","subjectAltName"],"Identifiers":[{"Ident":{"type":"dns","value":"df35d8.com"},"Authz":"356","Challenge":"http-01","Validated":"2026-03-04T05:20:40Z"}],"NotBefore":"0001-01-01T00:00:00Z","NotAfter":"0001-01-01T00:00:00Z","RequestTime":"2026-03-04T05:20:40.604393426Z","ResponseTime":"2026-03-04T05:20:55.566657086Z","Error":"rpc error: code = Unknown desc = getting SCTs: ra.SCTProvider.GetSCTs timed out after 892 ms","PreviousCertificateIssued":"0001-01-01T00:00:00Z","UserAgent":"eggsampler-acme/v3 Go-http-client/1.1"}

Also, with the addition of some improved error returns in the failOrder path I saw:

boulder-wfe2[694]: 6 boulder-wfe2 1i_XSA POST /acme/finalize/ 400 500 3054 0.0.0.0 JSON={"Slug":"400/336","InternalErrors":["getting SCTs: failed to get 2 SCTs, got 6 error(s): ct submission to \"A1 Current\" (\"http://boulder.service.consul:4600\") failed: rpc error: code = Unknown desc = got HTTP status \"400 Bad Request\"; ct submission to \"A2 Current\" (\"http://boulder.service.consul:4603\") failed: rpc error: code = Unknown desc = got HTTP status \"400 Bad Request\"; ct submission to \"B2\" (\"http://boulder.service.consul:4605\") failed: rpc error: code = Unknown desc = got HTTP status \"400 Bad Request\"; ct submission to \"C1\" (\"http://boulder.service.consul:4606\") failed: rpc error: code = Unknown desc = got HTTP status \"400 Bad Request\"; ct submission to \"D1\" (\"http://boulder.service.consul:4607\") failed: rpc error: code = Unknown desc = got HTTP status \"400 Bad Request\"; ct submission to \"B1\" (\"http://boulder.service.consul:4604\") failed: rpc error: code = Unknown desc = got HTTP status \"400 Bad Request\""],"Error":"500 :: serverInternal :: Error finalizing order :: Unable to meet CA SCT embedding requirements","ua":"eggsampler-acme/v3 Go-http-client/1.1","Extra":{"KeyType":"ECDSA P-256"},"Identifiers":[{"type":"dns","value":"b95747.com"}]}
 boulder-ra S5A7EQ [AUDIT] Persisting failed order JSON={"error":"rpc error: code = Unknown desc = error updating order error field:%!(EXTRA db.ErrDatabaseOp=failed to exec orders: context deadline exceeded) (also, while rolling back: sql: transaction has already been committed or rolled back)","order":339,"prob":"problemType:\"serverInternal\" detail:\"Error finalizing order :: Unable to meet CA SCT embedding requirements\" httpStatus:500","requester":397}

I'm able to reproduce locally, intermittently. Still needs more investigation. Also I'm noticing a couple of things in the output from TestRevocation:

  • crl-storer spams the logs a lot, making it hard to find relevant log lines.
  • we call t.Fatal in a goroutine, which we're not supposed to do.

It was causing timeouts under vitess.
@jsha
Copy link
Copy Markdown
Contributor Author

jsha commented Mar 7, 2026

After a bunch of running in a loop, I concluded that this bug was triggered by the parallelism in TestRevocation. Two kinds of certs (precert, final cert), times three revocation reasons, times four methods of revocation, resulted in 24 parallel issuances.

I don't understand why 24 parallel issuances caused the symptoms we saw. After adding some debug prints, it appears that on certain runs, several concurrent requests to AddCertificate would error out with invalid connection (which was preceded in the logs by [mysql] read tcp ... i/o timeout). Seems like several of the open connections to bvitess just stalled out at once, for over 14 seconds.

I'd still like to understand this better, but given that only this test with its high parallelism runs into the problem, I decided to reduce the parallelism for now.

@jsha
Copy link
Copy Markdown
Contributor Author

jsha commented Mar 7, 2026

Okay, now I think I understand the cause if not the mechanism. As @beautifulentropy previously found, vttestserver passes --queryserver-config-transaction-cap 4 to vtcomboserver, and doesn't allow overriding that. That's why our current run.sh on main has --queryserver-config-transaction-cap 80.

The issue can be more reliably reproduced with the below sa unittest. This fails if the parallelism is 4, passes if the parallelism is 3. And passes, even with high parallelism, on main.

diff --git i/sa/sa_test.go w/sa/sa_test.go
index b2a0054d5..c67d3ebf2 100644
--- i/sa/sa_test.go
+++ w/sa/sa_test.go
@@ -23,6 +23,7 @@ import (
 	"slices"
 	"strconv"
 	"strings"
+	"sync"
 	"testing"
 	"time"
 
@@ -94,7 +95,7 @@ func initSA(t testing.TB) (*SQLStorageAuthority, clock.FakeClock) {
 	t.Helper()
 	features.Reset()
 
-	dbMap, err := DBMapForTest(vars.DBConnSA)
+	dbMap, err := DBMapForTestWithLog(vars.DBConnSA, blog.StdoutLogger(7))
 	if err != nil {
 		t.Fatalf("Failed to create dbMap: %s", err)
 	}
@@ -395,6 +396,40 @@ func TestGetSerialMetadata(t *testing.T) {
 	test.AssertEquals(t, m.Expires.AsTime(), timestamppb.New(hourLater).AsTime())
 }
 
+func TestAddPrecertificateParallel(t *testing.T) {
+	sa, clk := initSA(t)
+
+	reg := createWorkingRegistration(t, sa)
+
+	issue := func() {
+		_, testCert := test.ThrowAwayCert(t, clk)
+
+		ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
+		defer cancel()
+
+		// Add the cert as a precertificate
+		regID := reg.Id
+		issuedTime := mustTimestamp("2018-04-01 07:00")
+		t.Logf("AddPrecertificate")
+		_, err := sa.AddPrecertificate(ctx, &sapb.AddCertificateRequest{
+			Der:          testCert.Raw,
+			RegID:        regID,
+			Issued:       issuedTime,
+			IssuerNameID: 1,
+		})
+		if err != nil {
+			t.Errorf("failed to add precertificate: %v", err)
+		}
+		t.Logf("AddPrecertificate success")
+	}
+
+	var wg sync.WaitGroup
+	for i := 0; i < 4; i++ {
+		wg.Go(issue)
+	}
+	wg.Wait()
+}
+
 func TestAddPrecertificate(t *testing.T) {
 	ctx := context.Background()
 	sa, clk := initSA(t)

What I still don't understand is exactly what is supposed to happen, when we hit the transaction cap. What happens in practice, if you run that unittest with parallelism 4, is that it opens all four transactions, does the selects, and then hangs on the inserts until the queries time out.

All that said, I am reasonably satisfied with the understanding I've reached here, and consider this PR ready for review.

aarongable
aarongable previously approved these changes Mar 9, 2026
Comment thread test/integration/revocation_test.go
Copy link
Copy Markdown
Member

@beautifulentropy beautifulentropy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the change looks great, just a couple nits and one actual issue.

Comment thread sa/vtschema/boulder_sa/vschema.json Outdated
Comment thread test/vars/vars.go Outdated
Comment thread test/integration/revocation_test.go
Comment thread test/vtcomboserver/Dockerfile Outdated
Comment thread test/db.go Outdated
Comment thread test/db.go Outdated
@beautifulentropy beautifulentropy merged commit d2c1c53 into main Mar 13, 2026
31 checks passed
@beautifulentropy beautifulentropy deleted the vschemas-and-vindexes branch March 13, 2026 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants