Fix IndexError when running with splitted text chunks #500
JiwaniZakir
left a comment
The fix correctly addresses the original double bug — [] * N always produces [], and wrapping it in [...] produced [[]] regardless of length — but introduces a subtle aliasing issue. Using [[]] * len(self._ents_cands_by_doc) creates a list where every element is a reference to the same inner list object. If downstream code ever calls .append() or mutates one of those inner lists in-place (e.g., self._ents_cands_by_shard[i].append(...)), every slot will be affected simultaneously. The safe pattern is a list comprehension: [[] for _ in range(len(self._ents_cands_by_doc))]. This same concern applies to all four call sites changed in task.py.
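A minimal sketch of the three initializations, runnable in a plain Python session. Here n_docs is a hypothetical stand-in for len(self._ents_cands_by_doc), and the "cand" string is only placeholder data, not anything from task.py:

```python
# n_docs is a hypothetical stand-in for len(self._ents_cands_by_doc).
n_docs = 3

# Original code: [] * n_docs is [], so the result is always [[]].
original = [[] * n_docs]

# Patched code: correct length, but every slot is the same list object.
aliased = [[]] * n_docs
aliased[0].append("cand")

# List comprehension: correct length, one independent list per slot.
independent = [[] for _ in range(n_docs)]
independent[0].append("cand")

print(original)     # [[]]
print(aliased)      # [['cand'], ['cand'], ['cand']]
print(independent)  # [['cand'], [], []]
```

If the inner lists are only ever replaced by assignment (slot[i] = [...]) and never mutated in place, the aliased form behaves correctly, but the comprehension removes the dependency on that assumption.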
The new test in test_entity_linker.py only asserts on docs[0], leaving the other three documents unverified despite being included specifically to exercise the split-chunk scenario. Adding at least one assertion on a doc with entities that span a shard boundary (or confirming entity count correctness across all four docs) would give the test more regression value. The function name also uses splitted, which should be split.
This is a bug report and also a potential fix to the bug.
Description
After splitting a long text and feeding each chunk into an LLM to prompt for the entity linking task, the _get_prompt_data function raises "IndexError: list index out of range" when trying to access self._ents_cands_by_shard[i_doc].
The line self._ents_cands_by_shard = [[] * len(self._ents_cands_by_doc)] appears intended to initialize a list of a given length, but it does not do that: for example, [[] * 3] evaluates to [[]] rather than [[], [], []]. Please review the change at your discretion.
Before applying the change: error log
After applying the change: all external tests passed
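The failure can be reproduced outside the library with plain lists. This is only a sketch: n_docs and the local variable name are hypothetical stand-ins for len(self._ents_cands_by_doc) and self._ents_cands_by_shard, and no library code is involved.

```python
# n_docs is a hypothetical stand-in for len(self._ents_cands_by_doc).
n_docs = 3

# Original initialization: [] * n_docs is [], so the outer list has one slot.
ents_cands_by_shard = [[] * n_docs]
print(len(ents_cands_by_shard))  # 1

# Indexing any document past the first one fails.
error_message = None
try:
    ents_cands_by_shard[1]
except IndexError as err:
    error_message = str(err)
print(error_message)  # list index out of range
```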
Types of change
Bug fix
Checklist
I ran all tests in tests and usage_examples/tests, and all new and existing tests passed. This includes external tests (pytest ran with --external) and GPU tests (pytest ran with --gpu).