Bug report
Bug description:
Hello,
I am currently debugging this issue.
I have noticed that the bug can be reproduced when the problematic file is truncated to 9 GiB, but it does not occur when the file is truncated to 8 GiB.
The problem seems to be that the offset of the next member is computed incorrectly. It seems to point 512 B past the correct TAR header, which, in this case, lands in the data for the extended attributes, such as 30 mtime=1752348[...].
One of the differences seems to be this part of the code (cpython/Lib/tarfile.py, lines 1562 to 1569 in 47b01da), which is not hit in the working case:
    if "size" in pax_headers:
        # If the extended header replaces the size field,
        # we need to recalculate the offset where the next
        # header starts.
        offset = next.offset_data
        if next.isreg() or next.type not in SUPPORTED_TYPES:
            offset += next._block(next.size)
        tarfile.offset = offset
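To make the recalculation above concrete, here is a minimal standalone sketch of how the next header offset depends on whichever value ends up in the size attribute. The helper names round_up_to_block and next_header_offset are hypothetical and only mimic the block rounding used by tarfile; offset_data = 0 is an arbitrary placeholder, not a value taken from the failing archive.

```python
BLOCKSIZE = 512  # tar archives are organized in 512-byte blocks


def round_up_to_block(count: int) -> int:
    """Round count up to the next multiple of BLOCKSIZE (hypothetical helper)."""
    blocks, remainder = divmod(count, BLOCKSIZE)
    if remainder:
        blocks += 1
    return blocks * BLOCKSIZE


def next_header_offset(offset_data: int, size: int) -> int:
    """Offset of the header following a member whose data starts at offset_data."""
    return offset_data + round_up_to_block(size)


# The two candidate sizes from the non-working case in this report.
REALSIZE = 9663676416  # GNU.sparse.realsize
PAX_SIZE = 9602318848  # size

# offset_data = 0 is a placeholder; only the difference matters here.
delta = next_header_offset(0, REALSIZE) - next_header_offset(0, PAX_SIZE)
print(delta)
```

The two candidate values place the next header tens of megabytes apart, so the offset is only right if the size attribute holds the number of bytes actually stored in the archive at that point.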
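While looking into the code that runs just before the size check, i.e., into _apply_pax_info (cpython/Lib/tarfile.py, lines 1615 to 1634 in 47b01da), I noticed that there is no defined order in which the size is applied, even though it can be set by more than one header key!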
    def _apply_pax_info(self, pax_headers, encoding, errors):
        """Replace fields with supplemental information from a previous
           pax extended or global header.
        """
        for keyword, value in pax_headers.items():
            if keyword == "GNU.sparse.name":
                setattr(self, "path", value)
            elif keyword == "GNU.sparse.size":
                setattr(self, "size", int(value))
            elif keyword == "GNU.sparse.realsize":
                setattr(self, "size", int(value))
            elif keyword in PAX_FIELDS:
                if keyword in PAX_NUMBER_FIELDS:
                    try:
                        value = PAX_NUMBER_FIELDS[keyword](value)
                    except ValueError:
                        value = 0
                if keyword == "path":
                    value = value.rstrip("/")
                setattr(self, keyword, value)
In the non-working case, the PAX headers look like this:
{'GNU.sparse.major': '1',
'GNU.sparse.minor': '0',
'GNU.sparse.name': 'userdata',
'GNU.sparse.realsize': '9663676416',
'atime': '1752349406.975921575',
'ctime': '1752349534.57652562',
'mtime': '1752349534.57652562',
'size': '9602318848'}
I.e., the size member first gets set to GNU.sparse.realsize and then to size. The debug output looks like this:
[_apply_pax_info] SET SIZE to: 9663676416 from key: GNU.sparse.realsize
[_apply_pax_info] SET SIZE to: 9602318848 from key: size
[_apply_pax_info] SET key to: 1752349534.5765257 from key: mtime
Is it specified that the order of the PAX headers must always be this way? Otherwise, one might just as well encounter them like this:
{'atime': '1752349406.975921575',
'ctime': '1752349534.57652562',
'mtime': '1752349534.57652562',
'size': '9602318848',
'GNU.sparse.major': '1',
'GNU.sparse.minor': '0',
'GNU.sparse.name': 'userdata',
'GNU.sparse.realsize': '9663676416'}
and either one of these orders would be a bug.
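The order dependence can be reproduced without an archive. The following is a simplified sketch, not the tarfile implementation: apply_size_fields is a hypothetical function that keeps only the size-related branches of the loop and returns the final value.

```python
from typing import Optional


def apply_size_fields(pax_headers: dict) -> Optional[int]:
    """Mimic only the size-related branches: the last matching key wins."""
    size = None
    for keyword, value in pax_headers.items():
        if keyword in ("GNU.sparse.size", "GNU.sparse.realsize", "size"):
            size = int(value)
    return size


# Same key/value pairs, two insertion orders (dicts preserve insertion order).
realsize_first = {
    "GNU.sparse.realsize": "9663676416",
    "size": "9602318848",
}
size_first = {
    "size": "9602318848",
    "GNU.sparse.realsize": "9663676416",
}

print(apply_size_fields(realsize_first))  # 9602318848
print(apply_size_fields(size_first))      # 9663676416
```

Both dictionaries compare equal, yet the resulting size differs, which shows that the outcome depends purely on iteration order.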
The working case does not have this ambiguity:
{'GNU.sparse.major': '1',
'GNU.sparse.minor': '0',
'GNU.sparse.name': 'userdata',
'GNU.sparse.realsize': '8589934592',
'atime': '1752349538.445543898',
'ctime': '1752351104.53673501',
'mtime': '1752351104.53673501'}
The debug output looks like this:
[_apply_pax_info] SET SIZE to: 8589934592 from key: GNU.sparse.realsize
[_apply_pax_info] SET key to: 1752351104.536735 from key: mtime
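One plausible explanation for why only the 9 GiB case carries a size record, which is not verified in this report: the size field of a ustar header holds 11 octal digits, so larger stored sizes have to be carried in a pax size record. The sketch below only checks that limit; needs_pax_size is a hypothetical helper, and the stored size of the working archive is not shown above, so it is not tested here.

```python
# Largest value representable in the 11 octal digits of the ustar size field.
USTAR_SIZE_MAX = 0o77777777777


def needs_pax_size(stored_size: int) -> bool:
    """True if stored_size does not fit into the ustar size field (hypothetical helper)."""
    return stored_size > USTAR_SIZE_MAX


print(USTAR_SIZE_MAX == 8 * 2**30 - 1)  # True: just below 8 GiB
print(needs_pax_size(9602318848))       # True: the size value from the non-working case
```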
I.e., even if there is no ordering problem, there already are different semantics for the TarInfo.size member, as in one case it will contain GNU.sparse.realsize and in the other it will contain [PAXHeader.]size.
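One way to avoid both the ordering problem and the mixed semantics would be to resolve the two meanings separately instead of writing them into one attribute. This is only a sketch and not necessarily what a fix in tarfile should look like: ResolvedSizes and resolve_sizes are hypothetical names, and treating both GNU.sparse.realsize and GNU.sparse.size as the expanded file size is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class ResolvedSizes(NamedTuple):
    stored: Optional[int]  # bytes occupied in the archive (pax "size"), if present
    real: Optional[int]    # expanded size of the sparse file, if present


def resolve_sizes(pax_headers: dict) -> ResolvedSizes:
    """Look keys up explicitly, so dict iteration order no longer matters."""
    stored = int(pax_headers["size"]) if "size" in pax_headers else None
    real = None
    # Assumption: both GNU keys describe the expanded size; prefer realsize.
    for key in ("GNU.sparse.realsize", "GNU.sparse.size"):
        if key in pax_headers:
            real = int(pax_headers[key])
            break
    return ResolvedSizes(stored, real)


non_working = {"GNU.sparse.realsize": "9663676416", "size": "9602318848"}
reordered = {"size": "9602318848", "GNU.sparse.realsize": "9663676416"}
working = {"GNU.sparse.realsize": "8589934592"}

print(resolve_sizes(non_working))
print(resolve_sizes(reordered))
print(resolve_sizes(working))
```

With explicit lookups, both orderings of the non-working headers give the same result, and the caller can decide which of the two values to use for the offset calculation and which to expose as the file size.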
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Linked PRs
TarInfo.size to the real file size for sparse files #136622