Commit 7db517a
fix: Reset private state correctly in sitemap parsers (#1938)
## Description
Fixes two attribute-name typos in `src/crawlee/_utils/sitemap.py` where
the underscore prefix was missing on assignment, so each statement
silently created a new public attribute instead of resetting the
intended private state:
- `_XMLSaxSitemapHandler.endElement` set `self.current_tag = None`
instead of `self._current_tag = None`. The handler therefore never left
the "inside a tracked tag" state, so stray text between elements kept
being appended to the buffer and a duplicate close tag could re-process
stale buffer contents.
- `_TxtSitemapParser.flush` set `self.buffer = ''` instead of
`self._buffer = ''`. Reusing the parser after `flush()` concatenated the
leftover URL with the next chunk, yielding corrupted URLs like
`https://b.com/https://c.com/`.
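
The failure mode in both bullets can be reproduced with a minimal sketch. The classes below (`LocHandlerSketch`, `TxtParserSketch`) and their `fixed` flag are hypothetical stand-ins written for illustration, not the actual crawlee implementations; they only assume that the real classes keep their state in `_current_tag` and `_buffer` as described above.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LocHandlerSketch(ContentHandler):
    """Hypothetical SAX handler that collects <loc> text (not the crawlee class)."""

    def __init__(self, *, fixed: bool) -> None:
        super().__init__()
        self._fixed = fixed
        self._current_tag = None
        self._buffer = ''
        self.urls = []

    def startElement(self, name, attrs):
        if name == 'loc':
            self._current_tag = name
            self._buffer = ''

    def characters(self, content):
        # Text is buffered only while a tracked tag is open.
        if self._current_tag is not None:
            self._buffer += content

    def endElement(self, name):
        if name == self._current_tag:
            self.urls.append(self._buffer.strip())
        if self._fixed:
            self._current_tag = None
        else:
            self.current_tag = None  # typo: creates a new public attribute


class TxtParserSketch:
    """Hypothetical newline-delimited URL parser (not the crawlee class)."""

    def __init__(self, *, fixed: bool) -> None:
        self._fixed = fixed
        self._buffer = ''

    def process_chunk(self, chunk: str) -> list[str]:
        self._buffer += chunk
        # Everything before the last newline is complete; the rest stays buffered.
        *complete, self._buffer = self._buffer.split('\n')
        return [line.strip() for line in complete if line.strip()]

    def flush(self) -> list[str]:
        leftover = self._buffer.strip()
        if self._fixed:
            self._buffer = ''
        else:
            self.buffer = ''  # typo: the private buffer is never cleared
        return [leftover] if leftover else []


XML = b'<urlset><url><loc>https://a.com/</loc><lastmod>2024</lastmod></url></urlset>'

buggy_handler = LocHandlerSketch(fixed=False)
xml.sax.parseString(XML, buggy_handler)
print(buggy_handler._current_tag)  # 'loc': the handler never left the tracked tag
print(buggy_handler._buffer)       # 'https://a.com/2024': stray text was appended

buggy_txt = TxtParserSketch(fixed=False)
buggy_txt.process_chunk('https://a.com/\nhttps://b.com/')
buggy_txt.flush()
print(buggy_txt.process_chunk('https://c.com/\n'))  # ['https://b.com/https://c.com/']
```

Constructing either class with `fixed=True` takes the corrected branch, after which `_current_tag` is `None` once the `<loc>` element closes and the reused TXT parser returns `['https://c.com/']`.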
Both fixes add the missing underscore. Regression tests cover the state
reset in the XML handler and the buffer corruption in the TXT parser.

1 parent fe9464c commit 7db517a
2 files changed
Lines changed: 37 additions & 3 deletions
[Diff body not preserved; only hunk positions survive.]
- First file: one line replaced at line 125 and one line replaced at line 159.
- Second file: line 9 replaced by new lines 9-17 (1 deletion, 9 additions), and 26 lines added as new lines 358-383.