[DEPRIORITIZED][AAQ-765] Retry LLM generation when AlignScore fails#399
[DEPRIORITIZED][AAQ-765] Retry LLM generation when AlignScore fails#399lickem22 wants to merge 6 commits into
Conversation
| asession=asession, | ||
| exclude_archived=True, | ||
| ) | ||
| response.debug_info["past_failure"] = failure_reason |
There was a problem hiding this comment.
Added that in case it works after the second try to understand why it failed first.
| return response | ||
|
|
||
|
|
||
| def is_unable_to_generate_response(response: QueryResponse) -> bool: |
There was a problem hiding this comment.
Added this function retry only if that condition is met.
|
|
||
|
|
||
| @backoff.on_predicate( | ||
| backoff.expo, |
There was a problem hiding this comment.
What backoff.expo does is basically waiting a little more everytime the function is reran in an exponential way just to handle the load better.
There was a problem hiding this comment.
Should we just have a logic that retries once, instead of adding a config (num retries) we don't know if we'll use 🤔 ?
There was a problem hiding this comment.
I guess it depends on how useful the approach is, since we haven't done any analysis to see how well it works.
But personnally I think since it doesn't add a dependency (backoff being used by litellm), and since the only code we would change if we retry just once is the decorator and the config variable, the cost is pretty low, so we can just keep it.
| You are a helpful question-answering AI. You understand user question and answer their \ | ||
| question using the REFERENCE TEXT below. | ||
| """ | ||
| RETRY_PROMPT_SUFFIX = """ |
There was a problem hiding this comment.
Added a suffix to the prompt to incorporate failure reason
suzinyou
left a comment
There was a problem hiding this comment.
Thanks Carlos! I think we need to think and validate the prompt a bit. Should we discuss in the next tech session?
|
|
||
| metadata = metadata or {} | ||
| prompt = RAG.prompt.format(context=context, original_language=original_language) | ||
| if "failure_reason" in metadata and metadata["failure_reason"]: |
There was a problem hiding this comment.
How about we create a new arg, "retry=False"?
There was a problem hiding this comment.
The downside is
- We would have to create it for all the parent functions and
- We need both
is_retryandmetadata["failure_reaon"]to actually do retry.
But I think it would be easier to understand the code, and we won't be hiding any unexpected actions! What do you think?
Something like
if is_retry:
if "failure_reason" not in metadata:
raise ValueError("failure_reason is required for retry requests")
prompt = RAG.retry_prompt.format(
context=context,
original_language=original_language,
failure_reason=metadata["failure_reason"],
)There was a problem hiding this comment.
My initial understanding was that we are using this to try the functionality. What if we keep it like this while testing, and if it turns out to be something we want to keep, the we will explicitly set it as a functionality by addind the is_retry parameter. What do you think?
| RETRY_PROMPT_SUFFIX = """ | ||
| If the response above is not aligned with the question, please rectify this by \ | ||
| considering the following reason(s) for misalignment: "{failure_reason}". | ||
| Make necessary adjustments to ensure the answer is aligned with the question. | ||
| """ |
There was a problem hiding this comment.
Right now, we are only passing failure_reason which is response.debug_info["factual_consistency"]["reason"],
but we should also include the LLM response in this prompt..
There was a problem hiding this comment.
Also, shouldn't the prompt define what we mean by alignment?
There was a problem hiding this comment.
That makes sense. To be honest, I was just having a go at updating the prompt to take the output into consideration. I am not exactly an expert in prompt engineering. Should we discuss that in a tech session?
|
|
||
|
|
||
| @backoff.on_predicate( | ||
| backoff.expo, |
There was a problem hiding this comment.
Should we just have a logic that retries once, instead of adding a config (num retries) we don't know if we'll use 🤔 ?
3858453 to
a5a0ac9
Compare
3858453 to
a5a0ac9
Compare
Reviewer: @amiraliemami
Estimate: 40mins
Ticket
Fixes:AAQ-675
Description
Goal
The goal of this PR is to allow retrying LLM response again when AlignScore fails because of a low score N times (default is 0).
Changes
The following changes have been made:
QueryResponseErrorand the error is low alignment score. Also allowed to add the previous failure raison inresponse.debug_info["past_failure"]Future Tasks (optional)
How has this been tested?
Testing this is tricky because for this change to be observed we need LLM response to work but AlignScore to fail, and finding these cases are not straighforward.
Was tested two ways:
First way is :
ALIGN_SCORE_THRESHOLDto an unrealistic score (example 1.5). That way AlignScore fails/searchwithgenerate_lllm_responseset totrue. with a content and a question relevant for the content. An example is content:Here we are going to talk about pineapples because of their pine shapes and the applee like tastand question:
Are apple related to pineapples,Second way was by still setting
ALIGN_SCORE_THRESHOLDto a value >1 but then adding a logic in the code to make sure the value ofALIGN_SCORE_THRESHOLDis reduced to a reasonable value everytime the LLM response is regenerated to make sure that after the second retry the AlignScore check passes. So, in second run,ALIGN_SCORE_THRESHOLDshould be less than 0.8. This approach is not straightforward. I am open to more efficient ways of testing this feature.Checklist
Fill with
xfor completed.(Delete any items below that are not relevant)
scripts/.github/workflows/