diff api tweaks#58
Conversation
| @@ -0,0 +1,13 @@ | |||
| -- Drop and re-create the contribution_diff_queue table without a primary key constraint on contribution_id. | |||
| -- This allows us to insert redundant rows while a diff is being computed, rather than blocking due to row lock. | |||
There was a problem hiding this comment.
I think I'm missing the "why" of why we want redundant rows while the diff is being computed?
If it's already in the queue should we perhaps just ON CONFLICT DO NOTHING?
Otherwise seems like we'll just have a bunch of workers doing redundant work churning on the same stuff.
Guessing I'm just missing something here :)
There was a problem hiding this comment.
In the current implementation, a worker select-for-update and performs the entire job in one transaction before deleting the job row. So, there actually is no opportunity for a worker to grab a job, attempt to perform it, and fail.
However, I'm not sure we want to rely on being able to complete a job transactionally, forever. So, in the current implementation, merely visiting a contribution diff endpoint will result in a new contribution_diff_queue row in the case that the diff isn't already computed, which is intended to cover the case that a job was previously enqueued, but failed.
But there's no way to INSERT ON CONFLICT DO NOTHING if the row we are trying to insert has a uniqueness constraint, and the row exists, with a held lock. We have to block until the transaction holding the lock completes in order to determine whether the uniqueness constraint is violated (even though we have ON CONFLICT DO NOTHING).
There was a problem hiding this comment.
Whoops, I'm realizing that the disadvantage of this in-between design is that we can enqueue one job multiple times that gets picked up and worked on in parallel, which we don't want.
Ok, I think the two reasonable options, then, are:
- Keep existing design for job queue, but keep uniqueness constraint, and remove line in "get contribution diff" endpoint that tries to enqueue a job if the diff hasn't been computed.
- Move to a design that involves claiming work explicitly in the database, with a recorded timestamp, and infer that a job that began X minutes ago but hasn't been completed is safe to start again.
I sorta lean towards the first option, for simplicity.
There was a problem hiding this comment.
Yeah, I'm going to go with the first. The second is more robust but would require building a little heartbeating mechanism, and feels like overkill for now.
There was a problem hiding this comment.
The first one sounds good, however what happens for all the old existing diffs? If we don't enqueue anything when you load the page won't those diffs simply never be computed?
I'm no expert on PG row locking, is there a Transaction Isolation level where you could run a quick query to see if the row already exists, then proceed with an insert with "ON CONFLICT DO NOTHING" in a follow-up transaction if it appears it doesn't already exist?
| import System.Clock qualified as Clock | ||
|
|
||
| -- | Check every 10 minutes if we haven't heard on the notifications channel. | ||
| -- | Check every 30 seconds if we haven't heard on the notifications channel. |
There was a problem hiding this comment.
This is fine, just curious if something motivated the change or if there's an issue with the PG notify stuff? if the PG Notify stuff is working then this is just a fallback :)
There was a problem hiding this comment.
Oh, I just thought that after a full web server restart it would be odd for a worker to wait for 10 entire minutes before checking deciding to peek at the work queue. But we can revert, it's a corner case either way.
There was a problem hiding this comment.
Yeah that's true, for that we should just change the startup behaviour, maybe start with a notification on each channel when we first initialize the TVar in Share.Postgres.Notifications; that way we check every channel for work immediately on startup :)
There was a problem hiding this comment.
Agreed, I'll do that and bump this back up to 10 minutes
| -- | Ideally, we'd use this – a runtime with lookup actions in transaction, not IO. But that will require refactoring to | ||
| -- the runtime interface in ucm, so we can't use it for now. That's bad: we end up unsafely running separate | ||
| -- transactions for inner calls to 'codeLookup' / 'cachedEvalResult', which can lead to deadlock due to a starved | ||
| -- connection pool. |
There was a problem hiding this comment.
Good catch on this one, swapping to use the same transaction or pipelining or something in the runtime could be a pretty significant speedup if we can get it to work.
| unisonRuntime | ||
| } | ||
|
|
||
| badAskUnliftCodebaseRuntime :: |
There was a problem hiding this comment.
Can you add a note, either an altered version of the note from codebaseRuntime' or something, just to explain why this one is "bad" :)
| -- transactions for inner calls to 'codeLookup' / 'cachedEvalResult', which can lead to deadlock due to a starved | ||
| -- connection pool. | ||
| codebaseRuntime' :: Runtime Symbol -> CodebaseEnv -> IO (CodebaseRuntime (PG.Transaction e)) | ||
| codebaseRuntime' unisonRuntime CodebaseEnv {codebaseOwner} = do |
There was a problem hiding this comment.
Just a nit; but could we name this codebaseRuntimeTransaction or something like that? I'm dubious of "prime" names after the confusion it's caused in UCM, so I definitely prefer more descriptive names, even if longer :)
| case mayTermAndType of | ||
| Just termAndType -> do | ||
| UnliftIO.atomically $ UnliftIO.modifyTVar cacheVar $ \CodeLookupCache {termCache, ..} -> | ||
| codeLookupForUser :: TVar CodeLookupCache -> UserId -> CL.CodeLookup Symbol (PG.Transaction e) Ann |
There was a problem hiding this comment.
Nice work refactoring this 👍🏼
| let hydrateTerms :: | ||
| UserId -> | ||
| BiMultimap Referent Name -> | ||
| PG.Transaction e (Map Name (TermReferenceId, (Term Symbol Ann, Type Symbol Ann))) | ||
| hydrateTerms codebaseUser termReferents = do | ||
| let termReferenceIds = Map.mapMaybe Referent.toTermReferenceId (BiMultimap.range termReferents) | ||
| termIds <- | ||
| PG.pFor termReferenceIds \refId -> | ||
| (refId,) <$> DefnsQ.pipelinedExpectTermId refId | ||
| v2Terms <- | ||
| PG.pFor termIds \(refId, termId) -> | ||
| (refId,) <$> DefnsQ.pipelinedExpectTermById codebaseUser refId termId | ||
| v1Terms <- | ||
| for v2Terms \(refId, (term, typ)) -> | ||
| (refId,) <$> Codebase.convertTerm2to1 (Reference.idToHash refId) term typ | ||
| pure v1Terms |
There was a problem hiding this comment.
Good call pipelining this, I'm not positive but I'd suspect this is a large number of the overall queries we end up doing in a diff. I really wonder whether what the path to doing even better here looks like; presumably best would be if we don't need to convert 2 to 1 at all, which would be a big UCM change :'(
| hydrateTypes codebaseUser typeReferences = do | ||
| let typeReferenceIds = Map.mapMaybe Reference.toId (BiMultimap.range typeReferences) | ||
| typeIds <- | ||
| PG.pFor typeReferenceIds \refId -> |
There was a problem hiding this comment.
I'm not positive, but I bet we could do even better than pipelining here and collect all the typeIds in a single batch query rather than a pipelined for.
| -- | Mitchell says: this could/should just have replaced 'expectNamespaceIdsByCausalIdsOf', but that function has | ||
| -- many callers, so having two temporarily eases the transition. |
There was a problem hiding this comment.
Hrmm, I wonder maybe the better option might be to absorb pUnrecoverableEitherMap into the QueryA class, that way you can just have a QueryA m constraint here and wouldn't need to touch any callsites.
I admit, the transaction types are getting away from me a bit as the number of them slowly grows 😓
I don't believe we have a page for this yet, but may add one at some point, cc @hojberg . |
You clearly did a ton of work pipelining where possible, in lots of spots which may even speed up term rendering and stuff too 🙌🏼 Q: do you have a sense (even perhaps a vague one) of how the pipelining changes have affected diff performance? Have you pushed a gnarly diff up to staging and timed the difference for instance? I think that'd be good info to have, especially since writing all this pipeline stuff and forcing things into Applicative style is annoying, it'd be nice to know how much we should be prioritizing it in the future. |
I do have a vague sense that it's making things noticeably faster. I feel the applicative style is just sort of a stopgap solution to rewriting queries to return entire result sets in one go rather than N+1 style one-at-a-time. But for the very minimal time investment I think it's worth it. It's actually kind of hard to make a big gnarly diff to try on staging – you can't just grab two releases of a project (they don't share a history). But in my small experiments, things are faster. |
If you've got a diff you like to test with on prod, you can just download and push it up to staging, having good test data is worth the time, that's what staging is for after all :) |
| @@ -0,0 +1,3 @@ | |||
| -- Delete all previously-computed namespace diffs, because the diff payload is different now (we explicitly store | |||
| -- errors). | |||
| TRUNCATE namespace_diffs; | |||
There was a problem hiding this comment.
Depending on whether you choose to re-queue diffs when the page is loaded and it's not cached, you may want to repopulate the queue with all existing contribution diffs after wiping them out.
If you have it set to requeue on page load then no need
| pipelined p = Transaction (Hasql.pipeline (unPipeline p)) | ||
|
|
||
| -- | Like fmap, but the provided function can throw a recoverable error by returning 'Left'. | ||
| pEitherMap :: (a -> Either e b) -> Pipeline e a -> Pipeline e b |
There was a problem hiding this comment.
Ah I see, we don't expose e in QueryA, if you have ideas on re-organizing these classes let me know :)
I'm wondering if maybe having the e as part of the Transaction type is a bad idea and maybe we should just have a tryTransaction :: ExceptT e Transaction a -> m (Either e a) which knows to roll-back instead; then I guess we'd just use Compose (Either e) Pipeline for applicative situations?
At any rate that's a conversation for outside of this PR :)
| -- outstanding work. | ||
| notifs :: TVar (Set ChannelKind) | ||
| notifs = unsafePerformIO $ STM.newTVarIO mempty | ||
| notifs = unsafePerformIO $ STM.newTVarIO allChannels |
There was a problem hiding this comment.
Nice, thanks for doing this 👍🏼
Overview
This PR amends the namespace diff and contribution diff endpoints to explicitly model two additional states:
Additionally, it modifies the contribution diff endpoint to defer to the background diff computing worker, rather than eagerly try to compute the diff itself in the foreground. A diff error (due to incoherent decl) is now explicitly stored, too, rather than logged n' tossed.
The background worker was fixed to acquire a row lock on a
contribution_diff_queuerow and hold the lock while computing the diff. This necessitated refactoring the diff implementation to run in a single transaction. I also amended a few of the slower parts of the diff to pipeline queries where possible, though many things could not be parallelized very well. Computing a diff could still be slow; more dedicated performance work is probably warranted.There's one loose ends that should be addressed in follow-up PRs: