diff api tweaks by mitchellwrosen · Pull Request #58 · unisoncomputing/share-api

mitchellwrosen · 2025-04-24T15:59:12Z

Overview

This PR amends the namespace diff and contribution diff endpoints to explicitly model two additional states:

Diff is still being computed
Diff could not be computed

Additionally, it modifies the contribution diff endpoint to defer to the background diff computing worker, rather than eagerly try to compute the diff itself in the foreground. A diff error (due to incoherent decl) is now explicitly stored, too, rather than logged n' tossed.

The background worker was fixed to acquire a row lock on a contribution_diff_queue row and hold the lock while computing the diff. This necessitated refactoring the diff implementation to run in a single transaction. I also amended a few of the slower parts of the diff to pipeline queries where possible, though many things could not be parallelized very well. Computing a diff could still be slow; more dedicated performance work is probably warranted.

There's one loose ends that should be addressed in follow-up PRs:

The "namespace diff" endpoint (where in the UI do we cause that to be hit?) still computes namespace diffs in the foreground, which can time out. Ideally, these too would be computed asynchronously by workers, but there's currently only infrastructure in place for contributions' diffs to be enqueued on a job queue.

ChrisPenner · 2025-05-01T19:01:43Z

@@ -0,0 +1,13 @@
+-- Drop and re-create the contribution_diff_queue table without a primary key constraint on contribution_id.
+-- This allows us to insert redundant rows while a diff is being computed, rather than blocking due to row lock.


I think I'm missing the "why" of why we want redundant rows while the diff is being computed?

If it's already in the queue should we perhaps just ON CONFLICT DO NOTHING?

Otherwise seems like we'll just have a bunch of workers doing redundant work churning on the same stuff.

Guessing I'm just missing something here :)

In the current implementation, a worker select-for-update and performs the entire job in one transaction before deleting the job row. So, there actually is no opportunity for a worker to grab a job, attempt to perform it, and fail.

However, I'm not sure we want to rely on being able to complete a job transactionally, forever. So, in the current implementation, merely visiting a contribution diff endpoint will result in a new contribution_diff_queue row in the case that the diff isn't already computed, which is intended to cover the case that a job was previously enqueued, but failed.

But there's no way to INSERT ON CONFLICT DO NOTHING if the row we are trying to insert has a uniqueness constraint, and the row exists, with a held lock. We have to block until the transaction holding the lock completes in order to determine whether the uniqueness constraint is violated (even though we have ON CONFLICT DO NOTHING).

Whoops, I'm realizing that the disadvantage of this in-between design is that we can enqueue one job multiple times that gets picked up and worked on in parallel, which we don't want.

Ok, I think the two reasonable options, then, are:

Keep existing design for job queue, but keep uniqueness constraint, and remove line in "get contribution diff" endpoint that tries to enqueue a job if the diff hasn't been computed.

Move to a design that involves claiming work explicitly in the database, with a recorded timestamp, and infer that a job that began X minutes ago but hasn't been completed is safe to start again.

I sorta lean towards the first option, for simplicity.

Yeah, I'm going to go with the first. The second is more robust but would require building a little heartbeating mechanism, and feels like overkill for now.

The first one sounds good, however what happens for all the old existing diffs? If we don't enqueue anything when you load the page won't those diffs simply never be computed?

I'm no expert on PG row locking, is there a Transaction Isolation level where you could run a quick query to see if the row already exists, then proceed with an insert with "ON CONFLICT DO NOTHING" in a follow-up transaction if it appears it doesn't already exist?

ChrisPenner · 2025-05-01T19:03:36Z

+import System.Clock qualified as Clock

-- | Check every 10 minutes if we haven't heard on the notifications channel.
+-- | Check every 30 seconds if we haven't heard on the notifications channel.


This is fine, just curious if something motivated the change or if there's an issue with the PG notify stuff? if the PG Notify stuff is working then this is just a fallback :)

Oh, I just thought that after a full web server restart it would be odd for a worker to wait for 10 entire minutes before checking deciding to peek at the work queue. But we can revert, it's a corner case either way.

Yeah that's true, for that we should just change the startup behaviour, maybe start with a notification on each channel when we first initialize the TVar in Share.Postgres.Notifications; that way we check every channel for work immediately on startup :)

Agreed, I'll do that and bump this back up to 10 minutes

ChrisPenner · 2025-05-01T22:56:09Z

+-- | Ideally, we'd use this – a runtime with lookup actions in transaction, not IO. But that will require refactoring to
+-- the runtime interface in ucm, so we can't use it for now. That's bad: we end up unsafely running separate
+-- transactions for inner calls to 'codeLookup' / 'cachedEvalResult', which can lead to deadlock due to a starved
+-- connection pool.


Good catch on this one, swapping to use the same transaction or pipelining or something in the runtime could be a pretty significant speedup if we can get it to work.

ChrisPenner · 2025-05-01T22:57:05Z

+        unisonRuntime
+      }
+
+badAskUnliftCodebaseRuntime ::


Can you add a note, either an altered version of the note from codebaseRuntime' or something, just to explain why this one is "bad" :)

ChrisPenner · 2025-05-01T22:57:44Z

+-- transactions for inner calls to 'codeLookup' / 'cachedEvalResult', which can lead to deadlock due to a starved
+-- connection pool.
+codebaseRuntime' :: Runtime Symbol -> CodebaseEnv -> IO (CodebaseRuntime (PG.Transaction e))
+codebaseRuntime' unisonRuntime CodebaseEnv {codebaseOwner} = do


Just a nit; but could we name this codebaseRuntimeTransaction or something like that? I'm dubious of "prime" names after the confusion it's caused in UCM, so I definitely prefer more descriptive names, even if longer :)

ChrisPenner · 2025-05-01T22:59:26Z

-            case mayTermAndType of
-              Just termAndType -> do
-                UnliftIO.atomically $ UnliftIO.modifyTVar cacheVar $ \CodeLookupCache {termCache, ..} ->
+codeLookupForUser :: TVar CodeLookupCache -> UserId -> CL.CodeLookup Symbol (PG.Transaction e) Ann


Nice work refactoring this 👍🏼

ChrisPenner · 2025-05-01T23:16:55Z

+    let hydrateTerms ::
+          UserId ->
+          BiMultimap Referent Name ->
+          PG.Transaction e (Map Name (TermReferenceId, (Term Symbol Ann, Type Symbol Ann)))
+        hydrateTerms codebaseUser termReferents = do
+          let termReferenceIds = Map.mapMaybe Referent.toTermReferenceId (BiMultimap.range termReferents)
+          termIds <-
+            PG.pFor termReferenceIds \refId ->
+              (refId,) <$> DefnsQ.pipelinedExpectTermId refId
+          v2Terms <-
+            PG.pFor termIds \(refId, termId) ->
+              (refId,) <$> DefnsQ.pipelinedExpectTermById codebaseUser refId termId
+          v1Terms <-
+            for v2Terms \(refId, (term, typ)) ->
+              (refId,) <$> Codebase.convertTerm2to1 (Reference.idToHash refId) term typ
+          pure v1Terms


Good call pipelining this, I'm not positive but I'd suspect this is a large number of the overall queries we end up doing in a diff. I really wonder whether what the path to doing even better here looks like; presumably best would be if we don't need to convert 2 to 1 at all, which would be a big UCM change :'(

ChrisPenner · 2025-05-01T23:19:59Z

+        hydrateTypes codebaseUser typeReferences = do
+          let typeReferenceIds = Map.mapMaybe Reference.toId (BiMultimap.range typeReferences)
+          typeIds <-
+            PG.pFor typeReferenceIds \refId ->


I'm not positive, but I bet we could do even better than pipelining here and collect all the typeIds in a single batch query rather than a pipelined for.

ChrisPenner · 2025-05-01T23:31:44Z

+-- | Mitchell says: this could/should just have replaced 'expectNamespaceIdsByCausalIdsOf', but that function has
+-- many callers, so having two temporarily eases the transition.


Hrmm, I wonder maybe the better option might be to absorb pUnrecoverableEitherMap into the QueryA class, that way you can just have a QueryA m constraint here and wouldn't need to touch any callsites.

I admit, the transaction types are getting away from me a bit as the number of them slowly grows 😓

ChrisPenner · 2025-05-01T23:40:55Z

The "namespace diff" endpoint (where in the UI do we cause that to be hit?)

I don't believe we have a page for this yet, but may add one at some point, cc @hojberg .

ChrisPenner · 2025-05-01T23:43:08Z

Computing a diff could still be slow; more dedicated performance work is probably warranted.

You clearly did a ton of work pipelining where possible, in lots of spots which may even speed up term rendering and stuff too 🙌🏼

Q: do you have a sense (even perhaps a vague one) of how the pipelining changes have affected diff performance? Have you pushed a gnarly diff up to staging and timed the difference for instance? I think that'd be good info to have, especially since writing all this pipeline stuff and forcing things into Applicative style is annoying, it'd be nice to know how much we should be prioritizing it in the future.

mitchellwrosen · 2025-05-06T14:45:57Z

Q: do you have a sense (even perhaps a vague one) of how the pipelining changes have affected diff performance? Have you pushed a gnarly diff up to staging and timed the difference for instance? I think that'd be good info to have, especially since writing all this pipeline stuff and forcing things into Applicative style is annoying, it'd be nice to know how much we should be prioritizing it in the future.

I do have a vague sense that it's making things noticeably faster. I feel the applicative style is just sort of a stopgap solution to rewriting queries to return entire result sets in one go rather than N+1 style one-at-a-time. But for the very minimal time investment I think it's worth it.

It's actually kind of hard to make a big gnarly diff to try on staging – you can't just grab two releases of a project (they don't share a history). But in my small experiments, things are faster.

ChrisPenner · 2025-05-06T21:49:14Z

It's actually kind of hard to make a big gnarly diff to try on staging – you can't just grab two releases of a project (they don't share a history). But in my small experiments, things are faster.

If you've got a diff you like to test with on prod, you can just download and push it up to staging, having good test data is worth the time, that's what staging is for after all :)

ChrisPenner · 2025-05-06T21:55:17Z

@@ -0,0 +1,3 @@
+-- Delete all previously-computed namespace diffs, because the diff payload is different now (we explicitly store
+-- errors).
+TRUNCATE namespace_diffs;


Depending on whether you choose to re-queue diffs when the page is loaded and it's not cached, you may want to repopulate the queue with all existing contribution diffs after wiping them out.

If you have it set to requeue on page load then no need

ChrisPenner · 2025-05-06T22:03:52Z

 pipelined p = Transaction (Hasql.pipeline (unPipeline p))

+-- | Like fmap, but the provided function can throw a recoverable error by returning 'Left'.
+pEitherMap :: (a -> Either e b) -> Pipeline e a -> Pipeline e b


Ah I see, we don't expose e in QueryA, if you have ideas on re-organizing these classes let me know :)

I'm wondering if maybe having the e as part of the Transaction type is a bad idea and maybe we should just have a tryTransaction :: ExceptT e Transaction a -> m (Either e a) which knows to roll-back instead; then I guess we'd just use Compose (Either e) Pipeline for applicative situations?

At any rate that's a conversation for outside of this PR :)

ChrisPenner · 2025-05-06T22:05:08Z

+-- outstanding work.
 notifs :: TVar (Set ChannelKind)
-notifs = unsafePerformIO $ STM.newTVarIO mempty
+notifs = unsafePerformIO $ STM.newTVarIO allChannels


Nice, thanks for doing this 👍🏼

ChrisPenner

Looks good to go!

mitchellwrosen added 17 commits April 8, 2025 12:53

overhaul contribution diff worker

1402503

generalize CodebaseRuntime

35c8ccf

make diffCausals a single transaction

275055d

compute diff while holding lock on job row

acf63d1

break diffCausals up

6c7d2ad

model diff error and diff still computing states explicitly in the API

3318aa0

⅄ main → diff-api-tweaks

738c7c1

ping background worker in contribution diffs endpoint

e0dee5e

flatten json

f87012e

⅄ main → diff-api-tweaks

9fd2d38

hydrate terms with more parallelism

1fb0602

tweak indentation

d1cb4d4

fetch type decls in parallel

ffc243c

⅄ main → diff-api-tweaks

165d503

⅄ main → diff-api-tweaks

126e1e2

store namespace diff errors

169c1e5

change "errorKind" to "tag" in error json

76fe8d8

mitchellwrosen marked this pull request as ready for review May 1, 2025 18:56

mitchellwrosen requested a review from ChrisPenner May 1, 2025 18:56

ChrisPenner reviewed May 1, 2025

View reviewed changes

Comment thread src/Share/BackgroundJobs/Diffs/ContributionDiffs.hs Outdated

ChrisPenner reviewed May 1, 2025

View reviewed changes

Comment thread src/Share/BackgroundJobs/Diffs/ContributionDiffs.hs

ChrisPenner reviewed May 1, 2025

View reviewed changes

mitchellwrosen added 3 commits May 5, 2025 09:45

⅄ main → diff-api-tweaks

afde807

don't ping background worker in get contribution diff endpoint

4c256d1

check notification channels on startup

7ed3d75

address more PR comments

d6e926c

ChrisPenner reviewed May 6, 2025

View reviewed changes

ChrisPenner approved these changes May 6, 2025

View reviewed changes

⅄ main → diff-api-tweaks

c5c79f8

ChrisPenner merged commit 0c2e10a into main May 6, 2025
4 checks passed

ChrisPenner deleted the diff-api-tweaks branch May 6, 2025 23:12

		@@ -0,0 +1,13 @@
		-- Drop and re-create the contribution_diff_queue table without a primary key constraint on contribution_id.
		-- This allows us to insert redundant rows while a diff is being computed, rather than blocking due to row lock.

		-- \| Mitchell says: this could/should just have replaced 'expectNamespaceIdsByCausalIdsOf', but that function has
		-- many callers, so having two temporarily eases the transition.

Conversation

mitchellwrosen commented Apr 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Uh oh!

ChrisPenner May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ChrisPenner May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ChrisPenner May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ChrisPenner May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ChrisPenner commented May 1, 2025

Uh oh!

ChrisPenner commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mitchellwrosen commented May 6, 2025

Uh oh!

ChrisPenner commented May 6, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ChrisPenner left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mitchellwrosen commented Apr 24, 2025 •

edited

Loading

ChrisPenner May 1, 2025 •

edited

Loading

ChrisPenner May 1, 2025 •

edited

Loading

ChrisPenner May 1, 2025 •

edited

Loading

ChrisPenner May 1, 2025 •

edited

Loading

ChrisPenner commented May 1, 2025 •

edited

Loading