Congrats on the insightful paper!

I noticed a few points in the appendix figure that I find a bit confusing, and I have two questions:
- Since the 'Training Long Language Model' step uses a context length of only 224k, why does the model still achieve high accuracy when the context length reaches 512k?
- When the number of distractors is set to 5, the distribution of the NIAH results appears unusual: the 224k context length performs better than the 64k context length, which differs from what is typically seen in NIAH results for other models.
Looking forward to your insights on these points.
Best regards.