Skip to content

diff api tweaks#58

Merged
ChrisPenner merged 22 commits into
mainfrom
diff-api-tweaks
May 6, 2025
Merged

diff api tweaks#58
ChrisPenner merged 22 commits into
mainfrom
diff-api-tweaks

Conversation

@mitchellwrosen

@mitchellwrosen mitchellwrosen commented Apr 24, 2025

Copy link
Copy Markdown
Contributor

Overview

This PR amends the namespace diff and contribution diff endpoints to explicitly model two additional states:

  • Diff is still being computed
  • Diff could not be computed

Additionally, it modifies the contribution diff endpoint to defer to the background diff computing worker, rather than eagerly try to compute the diff itself in the foreground. A diff error (due to incoherent decl) is now explicitly stored, too, rather than logged n' tossed.

The background worker was fixed to acquire a row lock on a contribution_diff_queue row and hold the lock while computing the diff. This necessitated refactoring the diff implementation to run in a single transaction. I also amended a few of the slower parts of the diff to pipeline queries where possible, though many things could not be parallelized very well. Computing a diff could still be slow; more dedicated performance work is probably warranted.

There's one loose ends that should be addressed in follow-up PRs:

  • The "namespace diff" endpoint (where in the UI do we cause that to be hit?) still computes namespace diffs in the foreground, which can time out. Ideally, these too would be computed asynchronously by workers, but there's currently only infrastructure in place for contributions' diffs to be enqueued on a job queue.

@mitchellwrosen mitchellwrosen marked this pull request as ready for review May 1, 2025 18:56
@mitchellwrosen mitchellwrosen requested a review from ChrisPenner May 1, 2025 18:56
@@ -0,0 +1,13 @@
-- Drop and re-create the contribution_diff_queue table without a primary key constraint on contribution_id.
-- This allows us to insert redundant rows while a diff is being computed, rather than blocking due to row lock.

@ChrisPenner ChrisPenner May 1, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing the "why" of why we want redundant rows while the diff is being computed?

If it's already in the queue should we perhaps just ON CONFLICT DO NOTHING?

Otherwise seems like we'll just have a bunch of workers doing redundant work churning on the same stuff.

Guessing I'm just missing something here :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, a worker select-for-update and performs the entire job in one transaction before deleting the job row. So, there actually is no opportunity for a worker to grab a job, attempt to perform it, and fail.

However, I'm not sure we want to rely on being able to complete a job transactionally, forever. So, in the current implementation, merely visiting a contribution diff endpoint will result in a new contribution_diff_queue row in the case that the diff isn't already computed, which is intended to cover the case that a job was previously enqueued, but failed.

But there's no way to INSERT ON CONFLICT DO NOTHING if the row we are trying to insert has a uniqueness constraint, and the row exists, with a held lock. We have to block until the transaction holding the lock completes in order to determine whether the uniqueness constraint is violated (even though we have ON CONFLICT DO NOTHING).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I'm realizing that the disadvantage of this in-between design is that we can enqueue one job multiple times that gets picked up and worked on in parallel, which we don't want.

Ok, I think the two reasonable options, then, are:

  • Keep existing design for job queue, but keep uniqueness constraint, and remove line in "get contribution diff" endpoint that tries to enqueue a job if the diff hasn't been computed.
  • Move to a design that involves claiming work explicitly in the database, with a recorded timestamp, and infer that a job that began X minutes ago but hasn't been completed is safe to start again.

I sorta lean towards the first option, for simplicity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm going to go with the first. The second is more robust but would require building a little heartbeating mechanism, and feels like overkill for now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one sounds good, however what happens for all the old existing diffs? If we don't enqueue anything when you load the page won't those diffs simply never be computed?

I'm no expert on PG row locking, is there a Transaction Isolation level where you could run a quick query to see if the row already exists, then proceed with an insert with "ON CONFLICT DO NOTHING" in a follow-up transaction if it appears it doesn't already exist?

import System.Clock qualified as Clock

-- | Check every 10 minutes if we haven't heard on the notifications channel.
-- | Check every 30 seconds if we haven't heard on the notifications channel.

@ChrisPenner ChrisPenner May 1, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, just curious if something motivated the change or if there's an issue with the PG notify stuff? if the PG Notify stuff is working then this is just a fallback :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just thought that after a full web server restart it would be odd for a worker to wait for 10 entire minutes before checking deciding to peek at the work queue. But we can revert, it's a corner case either way.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's true, for that we should just change the startup behaviour, maybe start with a notification on each channel when we first initialize the TVar in Share.Postgres.Notifications; that way we check every channel for work immediately on startup :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'll do that and bump this back up to 10 minutes

Comment thread src/Share/BackgroundJobs/Diffs/ContributionDiffs.hs Outdated
Comment thread src/Share/BackgroundJobs/Diffs/ContributionDiffs.hs
Comment thread src/Share/Codebase.hs
Comment on lines +196 to +199
-- | Ideally, we'd use this – a runtime with lookup actions in transaction, not IO. But that will require refactoring to
-- the runtime interface in ucm, so we can't use it for now. That's bad: we end up unsafely running separate
-- transactions for inner calls to 'codeLookup' / 'cachedEvalResult', which can lead to deadlock due to a starved
-- connection pool.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on this one, swapping to use the same transaction or pipelining or something in the runtime could be a pretty significant speedup if we can get it to work.

Comment thread src/Share/Codebase.hs
unisonRuntime
}

badAskUnliftCodebaseRuntime ::

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a note, either an altered version of the note from codebaseRuntime' or something, just to explain why this one is "bad" :)

Comment thread src/Share/Codebase.hs Outdated
-- transactions for inner calls to 'codeLookup' / 'cachedEvalResult', which can lead to deadlock due to a starved
-- connection pool.
codebaseRuntime' :: Runtime Symbol -> CodebaseEnv -> IO (CodebaseRuntime (PG.Transaction e))
codebaseRuntime' unisonRuntime CodebaseEnv {codebaseOwner} = do

@ChrisPenner ChrisPenner May 1, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit; but could we name this codebaseRuntimeTransaction or something like that? I'm dubious of "prime" names after the confusion it's caused in UCM, so I definitely prefer more descriptive names, even if longer :)

Comment thread src/Share/Codebase.hs
case mayTermAndType of
Just termAndType -> do
UnliftIO.atomically $ UnliftIO.modifyTVar cacheVar $ \CodeLookupCache {termCache, ..} ->
codeLookupForUser :: TVar CodeLookupCache -> UserId -> CL.CodeLookup Symbol (PG.Transaction e) Ann

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work refactoring this 👍🏼

Comment on lines +812 to +827
let hydrateTerms ::
UserId ->
BiMultimap Referent Name ->
PG.Transaction e (Map Name (TermReferenceId, (Term Symbol Ann, Type Symbol Ann)))
hydrateTerms codebaseUser termReferents = do
let termReferenceIds = Map.mapMaybe Referent.toTermReferenceId (BiMultimap.range termReferents)
termIds <-
PG.pFor termReferenceIds \refId ->
(refId,) <$> DefnsQ.pipelinedExpectTermId refId
v2Terms <-
PG.pFor termIds \(refId, termId) ->
(refId,) <$> DefnsQ.pipelinedExpectTermById codebaseUser refId termId
v1Terms <-
for v2Terms \(refId, (term, typ)) ->
(refId,) <$> Codebase.convertTerm2to1 (Reference.idToHash refId) term typ
pure v1Terms

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call pipelining this, I'm not positive but I'd suspect this is a large number of the overall queries we end up doing in a diff. I really wonder whether what the path to doing even better here looks like; presumably best would be if we don't need to convert 2 to 1 at all, which would be a big UCM change :'(

hydrateTypes codebaseUser typeReferences = do
let typeReferenceIds = Map.mapMaybe Reference.toId (BiMultimap.range typeReferences)
typeIds <-
PG.pFor typeReferenceIds \refId ->

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not positive, but I bet we could do even better than pipelining here and collect all the typeIds in a single batch query rather than a pipelined for.

Comment thread src/Share/Postgres/Hashes/Queries.hs Outdated
Comment on lines +308 to +309
-- | Mitchell says: this could/should just have replaced 'expectNamespaceIdsByCausalIdsOf', but that function has
-- many callers, so having two temporarily eases the transition.

@ChrisPenner ChrisPenner May 1, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, I wonder maybe the better option might be to absorb pUnrecoverableEitherMap into the QueryA class, that way you can just have a QueryA m constraint here and wouldn't need to touch any callsites.

I admit, the transaction types are getting away from me a bit as the number of them slowly grows 😓

@ChrisPenner

Copy link
Copy Markdown
Member

The "namespace diff" endpoint (where in the UI do we cause that to be hit?)

I don't believe we have a page for this yet, but may add one at some point, cc @hojberg .

@ChrisPenner

ChrisPenner commented May 1, 2025

Copy link
Copy Markdown
Member

Computing a diff could still be slow; more dedicated performance work is probably warranted.

You clearly did a ton of work pipelining where possible, in lots of spots which may even speed up term rendering and stuff too 🙌🏼

Q: do you have a sense (even perhaps a vague one) of how the pipelining changes have affected diff performance? Have you pushed a gnarly diff up to staging and timed the difference for instance? I think that'd be good info to have, especially since writing all this pipeline stuff and forcing things into Applicative style is annoying, it'd be nice to know how much we should be prioritizing it in the future.

@mitchellwrosen

Copy link
Copy Markdown
Contributor Author

Q: do you have a sense (even perhaps a vague one) of how the pipelining changes have affected diff performance? Have you pushed a gnarly diff up to staging and timed the difference for instance? I think that'd be good info to have, especially since writing all this pipeline stuff and forcing things into Applicative style is annoying, it'd be nice to know how much we should be prioritizing it in the future.

I do have a vague sense that it's making things noticeably faster. I feel the applicative style is just sort of a stopgap solution to rewriting queries to return entire result sets in one go rather than N+1 style one-at-a-time. But for the very minimal time investment I think it's worth it.

It's actually kind of hard to make a big gnarly diff to try on staging – you can't just grab two releases of a project (they don't share a history). But in my small experiments, things are faster.

@ChrisPenner

Copy link
Copy Markdown
Member

It's actually kind of hard to make a big gnarly diff to try on staging – you can't just grab two releases of a project (they don't share a history). But in my small experiments, things are faster.

If you've got a diff you like to test with on prod, you can just download and push it up to staging, having good test data is worth the time, that's what staging is for after all :)

@@ -0,0 +1,3 @@
-- Delete all previously-computed namespace diffs, because the diff payload is different now (we explicitly store
-- errors).
TRUNCATE namespace_diffs;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on whether you choose to re-queue diffs when the page is loaded and it's not cached, you may want to repopulate the queue with all existing contribution diffs after wiping them out.

If you have it set to requeue on page load then no need

Comment thread src/Share/Postgres.hs
pipelined p = Transaction (Hasql.pipeline (unPipeline p))

-- | Like fmap, but the provided function can throw a recoverable error by returning 'Left'.
pEitherMap :: (a -> Either e b) -> Pipeline e a -> Pipeline e b

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, we don't expose e in QueryA, if you have ideas on re-organizing these classes let me know :)

I'm wondering if maybe having the e as part of the Transaction type is a bad idea and maybe we should just have a tryTransaction :: ExceptT e Transaction a -> m (Either e a) which knows to roll-back instead; then I guess we'd just use Compose (Either e) Pipeline for applicative situations?

At any rate that's a conversation for outside of this PR :)

-- outstanding work.
notifs :: TVar (Set ChannelKind)
notifs = unsafePerformIO $ STM.newTVarIO mempty
notifs = unsafePerformIO $ STM.newTVarIO allChannels

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for doing this 👍🏼

@ChrisPenner ChrisPenner left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to go!

@ChrisPenner ChrisPenner merged commit 0c2e10a into main May 6, 2025
4 checks passed
@ChrisPenner ChrisPenner deleted the diff-api-tweaks branch May 6, 2025 23:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants