Retain up to microsecond precision when importing sas7bdat files #330

Open

belegdol wants to merge 3 commits into Roche:dev from belegdol:increased_datetime_time_precision

Conversation

@belegdol commented May 8, 2026

As things stand now, time and datetime precision when importing .sas7bdat files is limited to whole seconds. With this change, up to microsecond precision is possible.
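For illustration, a minimal sketch of what this means on the pandas path; the file and column names below are placeholders, not files shipped with pyreadstat:

import pyreadstat

# Read a SAS dataset whose datetime column stores fractional seconds.
# "datetimes_usecs.sas7bdat" and "mydatetime" are illustrative names only.
df, meta = pyreadstat.read_sas7bdat("datetimes_usecs.sas7bdat")

ts = df["mydatetime"].iloc[0]
# Without this change the fraction is discarded, so ts.microsecond is always 0;
# with it, the stored fraction (up to microseconds) should be retained.
print(ts, ts.microsecond)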

I am submitting this as a draft PR for the following reasons:

  1. the changes are (partially) dependent on upstream ReadStat changes, which are included here for reproducibility; otherwise one needs SAS for Linux to generate appropriate files
  2. Once the format length is read properly for 32-bit files, like the ones created with SAS for Windows, the current approach of comparing the entire format name against a list will break whenever a format width is defined. Is there a particular reason for this approach? Ignoring the format width altogether would definitely be simpler. I can include that in this change if you would be willing to consider it
  3. Right now the same code is used for SPSS files, which I have no files to test with. If the change needs to be made SAS-specific, please let me know.

@belegdol force-pushed the increased_datetime_time_precision branch from 8fd1d3f to 2ab3a38 on May 12, 2026 08:03
@belegdol (Author)

Point 1 is actually independent; I removed it.
Point 2 should be mitigated by #332.
Point 3 I do not know how to test.

@belegdol marked this pull request as ready for review on May 12, 2026 08:07
@ofajardo (Collaborator)

Hi,

Thanks for the PR. I think the change is straightforward, but I need a bit more context. Please explain why you need this in the first place and submit a SAS file with fractional seconds as an example; it needs to be included in the tests as well (check the contributing guidelines: issues should be discussed before sending a PR).

1- Changes to ReadStat are not allowed, but I do not see any at the moment, so I guess it is fine.
2- Explain why "the current approach of comparing the entirety of the format name with a list will break whenever a format width is defined". I am not happy to change the current behavior of the package, so there must be a very strong reason to drop a mechanism that has worked for years by now.
3- At the very minimum, tests should run without errors (they do for me). There are sample SPSS and STATA files in the test_data folder. However, we would need to generate SPSS and STATA test files with fractional-second information (a sketch of one possible way to do that follows).
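For example, one way to produce such files could be to write them with pyreadstat itself, assuming its writers round-trip the sub-second part; the paths below are placeholders:

import pandas as pd
import pyreadstat

# Build a tiny frame with a sub-second timestamp and write SPSS/STATA copies.
# Whether the writers keep the fraction is an assumption to be verified,
# not something established here.
df = pd.DataFrame({"mydatetime": pd.to_datetime(["2018-03-04 10:47:37.282617"])})
pyreadstat.write_sav(df, "test_data/sample_datetime_usecs.sav")
pyreadstat.write_dta(df, "test_data/sample_datetime_usecs.dta")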

@belegdol (Author)

Hi @ofajardo,

Apologies for the confusing description and for the lack of an issue. The description refers to the state before I submitted #332. As such, bullet point 2 can be ignored here and discussed in #332.
The reason for needing this is to read the maximum precision available, as well as to ensure the closest possible parity with https://github.com/saurfang/spark-sas7bdat/commits/master/. There should be no difference between creating a Spark data frame natively with spark-sas7bdat and creating a pandas dataframe with pyreadstat and subsequently converting it to a Spark dataframe.
I will look into the test errors, and can also likely provide a test SAS file. An SPSS one could be a problem since I do not currently have access to SPSS.

@ofajardo (Collaborator)

OK, sounds good. Look at my comments in the other PR as well. For this one it would be important to provide a SAS example and to write a simple test that checks that the microseconds are read as expected. Probably one file and one test would be enough for both PRs.

@belegdol changed the base branch from master to dev on May 12, 2026 20:13
@belegdol (Author)

I have a test and a test file ready. Unfortunately, the polars path is failing on datetimes because the microseconds hit the limit of float64 precision. I will see if I can get the conversion to work in a way that the tests pass.
Interestingly enough, the time column, which is not vectorized, passes fine.
The pandas route works for the datetime column as well. Would disabling the vectorisation for datetimes be worth the increased precision?

@ofajardo (Collaborator) commented May 13, 2026

Hi, the vectorized version must stay as the default, because the non-vectorized version is very slow and I had complaints about it in the past; that is why the vectorized version was put in place.

If there is no graceful way to make the higher precision work with the vectorized version, one option is not to capture the high precision by default but to activate it with an argument and, in that case, avoid the vectorized path. I am saying this off the top of my head, as I have not seen it. But in my experience so far, precision higher than seconds is rare (I have not seen it myself), so making it optional is a possibility.
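Purely as an illustration of that suggestion, the call site could look roughly like this; the keyword name is invented here and is not an existing pyreadstat argument:

import pyreadstat

# Hypothetical opt-in (invented name): False keeps the fast vectorized,
# seconds-based path; True takes the slower scalar path and keeps the fraction.
df, meta = pyreadstat.read_sas7bdat(
    "sample.sas7bdat",
    keep_subsecond_precision=True,  # not a real pyreadstat argument
)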

@belegdol (Author)

Does this still qualify as vectorized?

From 186d2b17589f85bf95b9c8f05a2c6f4379aed277 Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Wed, 13 May 2026 16:40:19 +0200
Subject: [PATCH] Working concept for SAS using long long

---
 pyreadstat/_readstat_parser.pyx | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..f2fb7ba 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -199,6 +199,8 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
     cdef int secs
     cdef double msecs
     cdef int usecs
+    cdef long long days_usecs
+    cdef long long unix_to_origin_usecs
     cdef object mydat
 
     # For polars we are going to return an epoch from unix origin, 
@@ -233,13 +235,16 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
             return mydat.date()
     elif var_format == DATE_FORMAT_DATETIME:
         if output_format == "polars":
-            # we want to return seconds from unix
+            unix_to_origin_usecs = <long long> (unix_to_origin_secs * 1e6)
+            # we want to return microseconds from unix
             if file_format == FILE_FORMAT_STATA:
-                # tstamp is in millisecons
-                return (tstamp/1000) - unix_to_origin_secs
+                # tstamp is in milliseconds
+                return (tstamp * 1e3) - unix_to_origin_usecs
             else:
                 # tstamp in seconds
-                return tstamp - unix_to_origin_secs
+                days_usecs = <long long> (floor(tstamp) * 1e6)
+                usecs = <int> (round(tstamp % 1 * 1e6))
+                return days_usecs - unix_to_origin_usecs + usecs
 
         if file_format == FILE_FORMAT_STATA:
             # tstamp is in millisecons
@@ -1107,7 +1112,7 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
                 if var_format == DATE_FORMAT_DATE:
                     date_cols.append(column)
             if datetime_cols:
-                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='us'))
             if date_cols:
                 data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
 
-- 
2.54.0

Or does the entire double to timestamp conversion need to happen in from_epoch()? The code above works, but it definitely adds complexity.

@belegdol (Author)

Another option would be to limit the feature to milliseconds and document that anything beyond that might be imprecise.
Millisecond precision is what I have actually seen in the wild, and it is also as much precision as you get by calling datetime() in SAS.

@ofajardo (Collaborator)

Milliseconds can be an option.

As described, one problem is speed: it has to at least match the current implementation.

Another problem with increasing the precision by default is that it may limit the oldest date that can be processed, which is an issue when you have old datetimes in your data; in that sense, second precision was very good. Please control for that as well.

@belegdol (Author)

I will investigate. As things currently stand in the master branch, the polars path will happily read fractional seconds from SAS files; it is just that the results might differ in the last bit. Only the pandas path was explicitly discarding the fraction, which is what this PR changes.
My test file happens to expose the imprecision in the polars route. There is a slight chance that the way I generated it is partially responsible for that, though.
Regarding the date range, you are absolutely correct. At large distances from 1960 there are not enough mantissa bits for microsecond precision. Ideally, I would reproduce how SAS renders the values, if that is possible without losing performance.

@belegdol (Author) commented May 14, 2026

It looks like the way I created the file has no influence on the precision challenges.
While I have managed to develop a vectorized version of the code in #330 (comment), it still produces an off-by-one-microsecond datetime for one of the 100 values tested. At first glance it looks like the subtraction in

return (tstamp/1000) - unix_to_origin_secs

changes the LSB enough to influence the rounding: 1829-09-11T10:47:37.282617 becomes 1829-09-11T10:47:37.282618. Before the subtraction the floating point value is -4111996342.7173829079; after the subtraction it becomes -4427615542.7173824310. Looking at the fractional seconds of the absolute values, the former rounds up whereas the latter rounds down, which is what flips the displayed microsecond. It appears that one needs to convert to integer microseconds before the subtraction for maximum precision, as done in #330 (comment), but this moves the majority of the operations out of the vectorized polars code.
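The flip can be reproduced in plain Python with the values quoted above (315619200 is the number of seconds between the SAS origin 1960-01-01 and the Unix epoch):

# Values taken from the mismatching row described above.
unix_to_origin_secs = 315_619_200        # 1960-01-01 to 1970-01-01, in seconds
tstamp = -4111996342.7173829079          # SAS value for 1829-09-11T10:47:37.282617

shifted = tstamp - unix_to_origin_secs   # larger magnitude, so a coarser float64 grid

# Extract the fractional microseconds as in the sketches above
# (x % 1 yields the positive fraction in Python); the bit lost in the
# subtraction is enough to flip the rounded microsecond.
print(round(tstamp % 1 * 1e6))           # 282617
print(round(shifted % 1 * 1e6))          # 282618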
Given the nature of storing time as a floating point number, loss of precision due to rounding is expected at some point. How much is acceptable? For SAS, microsecond precision is only guaranteed for around 142 years around 1960; anything beyond that is imprecise by nature.
Using a test file with milliseconds only would pass the tests, but this feels like cheating.
Regarding the performance, is there a particular file you would like me to benchmark? Is the large dataset mentioned in the readme publicly available?
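For reference, the ~142-year figure follows directly from the width of the float64 mantissa:

# 53 mantissa bits mean integers up to 2**53 microseconds are exactly
# representable; that is roughly 285 years, or about +/-142 years around
# the SAS origin of 1960-01-01.
exact_span_years = 2**53 / 1e6 / 86400 / 365.25
print(exact_span_years / 2)  # ~142.7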

@belegdol (Author)

Here is the vectorised version for reference:

From 32b083c82485f9472246d0cad15abbc8c0a2892e Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Thu, 14 May 2026 01:24:12 +0200
Subject: [PATCH] vectorised version

---
 pyreadstat/_readstat_parser.pyx | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..ef73a27 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -1107,7 +1107,14 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
                 if var_format == DATE_FORMAT_DATE:
                     date_cols.append(column)
             if datetime_cols:
-                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+                data_frame = data_frame.with_columns(
+                    [
+                        pl.from_epoch(
+                            (pl.col(c) % 1 * 1e6).round().cast(pl.Int64) + (pl.col(c).floor() * 1e6).cast(pl.Int64),
+                            time_unit='us')
+                        for c in datetime_cols if data_frame[c].len() > 0
+                    ]
+                )
             if date_cols:
                 data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
 
-- 
2.54.0

@belegdol (Author)

It turns out that generating a test file with precision limited to milliseconds only is not enough. I generated one with 1000 datetimes between the years 1700 and 2200, and I still got one mismatch: 1825-05-31T22:55:23.626 ended up as 1825-05-31T22:55:23.625999 when imported from SAS with the vectorized code. Same as previously, the subtraction changed the seventh digit after the radix point: -4247082276.3740000725 became -4562701476.3740005493.
It would appear that for polars one would have to explicitly reduce the precision to milliseconds (rounding to the nearest millisecond) in order to ensure a complete match; a sketch of what that could look like follows.
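A sketch of that idea, reusing the vectorized expression from the patch above but reducing the precision to whole milliseconds; this is an illustration, not code from this PR:

import polars as pl

# Toy frame of unix-epoch seconds (float64) with a millisecond fraction,
# mirroring the column contents that dict_to_dataframe sees in the patch above.
df = pl.DataFrame({"dt": [-4562701476.374, 1520160457.282]})

# Convert to integer milliseconds (fraction and whole part separately, as in
# the patch) before building the datetime, so float64 noise below the
# millisecond cannot flip the displayed value.
df = df.with_columns(
    pl.from_epoch(
        (pl.col("dt") % 1 * 1e3).round().cast(pl.Int64)
        + (pl.col("dt").floor() * 1e3).cast(pl.Int64),
        time_unit="ms",
    )
)
print(df)  # the first row should show 1825-05-31 22:55:23.626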

belegdol added 2 commits May 14, 2026 18:14

Datetime values in SAS, SPSS and STATA are stored as a floating point number. Any operation risks introducing a rounding error. Use integer math in order to preserve original interpretation.
@belegdol (Author) commented May 14, 2026

I managed to get it working. tests/test_narwhalified.py ran in 3.081s with this PR applied on top of master vs 3.367s on master. If you have a heavier dataset to test, please let me know.
Please note that for this PR to work on dev, my other PR needs to be merged first; otherwise the formats will not be recognised properly. Alternatively, the changes can be applied on top of master.
