Retain up to microsecond precision when importing sas7bdat files #330
belegdol wants to merge 3 commits into
Conversation
force-pushed from 8fd1d3f to 2ab3a38
|
Hi, thanks for the PR. I think the change is straightforward, but I need a bit more context. Explain why you need this in the first place and submit a SAS file with fractional seconds as an example; it needs to be included in the test as well (check the contributing guidelines: an issue should be discussed before sending a PR). 1- Changes to Readstat are not allowed, but I do not see any at the moment, so I guess it is fine. |
|
Hi @ofajardo, apologies for the confusing description and for the lack of an issue. The description refers to the state before I submitted #332. As such, bullet point 2 can be ignored here and can be discussed in #332. |
|
OK, sounds good. Look at my comments in the other PR as well. For this one it would be important to provide a SAS example and to write a simple test that checks that the microseconds are read as expected. Probably one file and one test would be enough for both PRs. |
|
I have a test and a test file ready. Unfortunately, |
|
Hi, the vectorized version must stay by default, because the non-vectorized version is very slow and I had complaints about it in the past; that is why the vectorized version was put in place. If there is no graceful way to make the higher precision work with the vectorized version, one option is not to capture the high precision by default but to activate it via an argument, and in that case avoid the vectorized path. I am saying this off the top of my head, as I have not looked at it yet. But in my experience so far, precision higher than seconds is rare (I have not seen it myself), so making it optional is a reasonable option. |
|
Does this still qualify as vectorized?
From 186d2b17589f85bf95b9c8f05a2c6f4379aed277 Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Wed, 13 May 2026 16:40:19 +0200
Subject: [PATCH] Working concept for SAS using long long
---
pyreadstat/_readstat_parser.pyx | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..f2fb7ba 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -199,6 +199,8 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
cdef int secs
cdef double msecs
cdef int usecs
+ cdef long long days_usecs
+ cdef long long unix_to_origin_usecs
cdef object mydat
# For polars we are going to return an epoch from unix origin,
@@ -233,13 +235,16 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
return mydat.date()
elif var_format == DATE_FORMAT_DATETIME:
if output_format == "polars":
- # we want to return seconds from unix
+ unix_to_origin_usecs = <long long> (unix_to_origin_secs * 1e6)
+ # we want to return microseconds from unix
if file_format == FILE_FORMAT_STATA:
- # tstamp is in millisecons
- return (tstamp/1000) - unix_to_origin_secs
+ # tstamp is in milliseconds
+ return (tstamp * 1e3) - unix_to_origin_usecs
else:
# tstamp in seconds
- return tstamp - unix_to_origin_secs
+ days_usecs = <long long> (floor(tstamp) * 1e6)
+ usecs = <int> (round(tstamp % 1 * 1e6))
+ return days_usecs - unix_to_origin_usecs + usecs
if file_format == FILE_FORMAT_STATA:
# tstamp is in millisecons
@@ -1107,7 +1112,7 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
if var_format == DATE_FORMAT_DATE:
date_cols.append(column)
if datetime_cols:
- data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+ data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='us'))
if date_cols:
data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
--
2.54.0

Or does the entire double to timestamp conversion need to happen in |
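One detail of the patch above worth spelling out: the intermediates are declared `long long`, presumably because a timestamp expressed in microseconds overflows a signed 32-bit `int` for anything more than about 36 minutes away from the epoch. A plain-Python check of the scale (not part of the patch):

```python
# Why `long long` rather than `int` for microsecond counts: a signed 32-bit
# integer tops out at 2**31 - 1 microseconds, roughly 36 minutes from the epoch,
# so epoch-based microsecond values need a 64-bit type.
print((2**31 - 1) / 1e6 / 60)   # ~35.8 minutes until int32 overflow
```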
|
Another option would be to limit the feature to milliseconds and document that anything beyond that might be imprecise. |
|
Milliseconds can be an option. As described, one problem is the speed: it has to at least match the current implementation. Another problem with increasing the precision by default is that it may limit the oldest date that can be processed, and that is an issue when you have old datetimes in your data; in that sense, seconds was very good. Please check for that as well. |
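For a rough sense of the date-range concern above: with datetimes held as 64-bit integer microsecond counts from the Unix epoch (which is what polars' `time_unit='us'` uses), the representable span is enormous, so the binding constraint on old dates remains the double stored in the source file rather than the output representation. A quick check under that assumption:

```python
# Span of an int64 microsecond epoch, the representation behind time_unit='us':
# about +/- 292,000 years around 1970, far beyond any real dataset.
print((2**63 - 1) / 1e6 / (365.25 * 86400))   # ~292271 years
```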
|
I will investigate. As things are currently in the master branch, the polars path will happily read fractional seconds from SAS files; it is just that the results might differ in the last bit. Only the pandas path was explicitly discarding the fraction, which is what this PR changes. |
|
It looks like the way I created the file has no influence on the precision challenges. At pyreadstat/pyreadstat/_readstat_parser.pyx line 232 (commit 1adde46), 1829-09-11T10:47:37.282617 becomes 1829-09-11T10:47:37.282618. Before the subtraction, the floating point value is -4111996342.7173829079; after the subtraction it becomes -4427615542.7173824310. The former rounds up whereas the latter rounds down. It appears that one needs to convert to integer microseconds before the subtraction for maximum precision, as done in #330 (comment), but this moves the majority of the operations out of the vectorized polars code.

Given the nature of storing time as a floating point number, loss of precision due to rounding is expected at some point. How much is acceptable? Microsecond precision is only guaranteed for around 142 years around 1960 for SAS; anything beyond that is imprecise by nature. Using a test file with milliseconds only would pass the tests, but this feels like cheating.

Regarding the performance, is there a particular file you would like me to benchmark? Is the large dataset mentioned in the readme publicly available? |
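To make the effect concrete, here is a small plain-Python sketch of the numbers quoted above; the 315619200 s offset between the SAS origin (1960-01-01) and the Unix epoch is an assumption of the sketch, consistent with the two reported doubles:

```python
# Sketch only, not pyreadstat code. tstamp is the SAS value (seconds from
# 1960-01-01) for 1829-09-11T10:47:37.282617; offset is the assumed
# SAS-origin-to-Unix-epoch distance in seconds.
tstamp = -4111996342.7173829079
offset = 315619200.0

# Microseconds read directly from the original double: 282617.
print(round(tstamp % 1 * 1e6))

# Subtracting the offset first increases the magnitude of the double, the
# fractional part loses precision, and the same field now rounds to 282618.
print(round((tstamp - offset) % 1 * 1e6))

# Converting to integer microseconds *before* subtracting keeps 282617.
usecs = int(tstamp // 1) * 1_000_000 + round(tstamp % 1 * 1e6)
print((usecs - int(offset) * 1_000_000) % 1_000_000)

# The "~142 years" figure quoted above lines up with 2**52 microseconds:
print(2**52 / 1e6 / (365.25 * 86400))   # ~142.7 years
```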
|
Here is the vectorised version for reference:
From 32b083c82485f9472246d0cad15abbc8c0a2892e Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Thu, 14 May 2026 01:24:12 +0200
Subject: [PATCH] vectorised version
---
pyreadstat/_readstat_parser.pyx | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..ef73a27 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -1107,7 +1107,14 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
if var_format == DATE_FORMAT_DATE:
date_cols.append(column)
if datetime_cols:
- data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+ data_frame = data_frame.with_columns(
+ [
+ pl.from_epoch(
+ (pl.col(c) % 1 * 1e6).round().cast(pl.Int64) + (pl.col(c).floor() * 1e6).cast(pl.Int64),
+ time_unit='us')
+ for c in datetime_cols if data_frame[c].len() > 0
+ ]
+ )
if date_cols:
data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
--
2.54.0 |
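The expression above can be tried outside pyreadstat on a toy frame; the column name and values below are made up, and the timestamps are assumed to already be seconds relative to the Unix epoch with the fractional part carrying the sub-second precision, which appears to be what transform_datetime returns for the polars path:

```python
# Toy check of the vectorised expression from the patch (made-up column/values;
# inputs are assumed to be Unix-epoch seconds as floats).
import polars as pl

df = pl.DataFrame({"mydatetime": [0.0, 1.5, 1715600419.282617]})
out = df.with_columns(
    pl.from_epoch(
        (pl.col("mydatetime") % 1 * 1e6).round(0).cast(pl.Int64)
        + (pl.col("mydatetime").floor() * 1e6).cast(pl.Int64),
        time_unit="us",
    )
)
print(out)   # last row keeps its microseconds: ...:19.282617
```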
|
It turns out that generating a test file with precision limited to milliseconds only is not enough. I generated one with 1000 datetimes between years 1700 and 2200, and I still got one mismatch: |
Datetime values in SAS, SPSS and Stata are stored as floating point numbers. Any operation risks introducing a rounding error. Use integer math in order to preserve the original interpretation.
|
I managed to get it working. |
As things stand now, time and datetime precision when importing .sas7bdat files is limited to whole seconds. With this change, up to microsecond precision is possible.
I am submitting this as a draft PR for the following reasons: