Hi Physical Intelligence team,
First of all, thank you for the incredible work on Pi0.7 and the highly insightful technical report!
I was studying the sequence packing strategy for inference-time CFG and noticed a potential discrepancy between the text description and the provided attention mask figure titled "Inference with metadata CFG" in Fig. 19.
The Text Description
The text states that for efficient inference, both positive and negative examples are packed into the same sequence to construct an attention tree with two branches, "which do not attend to one another." This matches standard CFG logic, where the conditional and unconditional passes are computed in parallel and in isolation from each other.
The Visual Discrepancy
However, if we look closely at the attention mask matrix (specifically the row for Text Prompt (+)):
Starting from the causal mask triangle of Text Prompt (+) and looking horizontally to the left, there are solid grey blocks indicating active attention over both the Text Prompt (-) and Flow actions (-) columns.
My Questions:
Is this simply a minor graphical typo in the figure? (i.e., should the area to the left of the Text Prompt (+) triangle be completely blank/masked out to ensure strict isolation?)
Or is there a non-trivial architectural reason why the positive text prompt is designed to attend to the negative text tokens and actions? If so, wouldn't this break the isolation between the two CFG branches, since the positive branch's output would then depend on the negative branch?
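To make the question concrete, here is a minimal sketch of the mask I would expect from the text description. It is only an illustration under my own assumptions, not the report's implementation: the segment names, token counts, and the shared prefix are hypothetical, the segment order is inferred from the negative columns sitting to the left of Text Prompt (+), and I use plain causal attention everywhere for simplicity (the actual action tokens may attend differently within their own block).

```python
# Hypothetical packed layout: (segment name, token count, branch).
# Names and lengths are illustrative assumptions, not taken from the report.
SEGMENTS = [
    ("prefix", 3, "shared"),    # assumed trunk shared by both branches
    ("text_neg", 2, "neg"),     # Text Prompt (-)
    ("action_neg", 2, "neg"),   # Flow actions (-)
    ("text_pos", 2, "pos"),     # Text Prompt (+)
    ("action_pos", 2, "pos"),   # Flow actions (+)
]


def build_mask(segments):
    """Return per-token labels and a boolean mask where mask[i][j] means
    query token i may attend to key token j."""
    labels = []
    for name, length, branch in segments:
        labels.extend([(name, branch)] * length)
    n = len(labels)
    mask = [
        [
            # Causal, and the key must be in the shared trunk or the same branch.
            j <= i and labels[j][1] in ("shared", labels[i][1])
            for j in range(n)
        ]
        for i in range(n)
    ]
    return labels, mask


def cross_branch_leaks(labels, mask):
    """List (query, key) index pairs where one branch attends to the other."""
    leaks = []
    for i, row in enumerate(mask):
        for j, allowed in enumerate(row):
            branches = {labels[i][1], labels[j][1]}
            if allowed and branches == {"pos", "neg"}:
                leaks.append((i, j))
    return leaks


labels, mask = build_mask(SEGMENTS)

# Render the mask: '#' = attends, '.' = masked out.
for (name, _), row in zip(labels, mask):
    print(f"{name:>10} " + "".join("#" if allowed else "." for allowed in row))

print("cross-branch leaks:", cross_branch_leaks(labels, mask))
```

In this sketch the Text Prompt (+) rows are blank under the Text Prompt (-) and Flow actions (-) columns, which is what I would expect the figure to show if the two branches are strictly isolated.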
Looking forward to your clarification! Thanks again for the great work.