Retain up to microsecond precision when importing sas7bdat files #330

Open

belegdol wants to merge 3 commits into Roche:dev from belegdol:increased_datetime_time_precision

Conversation

@belegdol commented May 8, 2026

As things stand now, time and datetime precision when importing .sas7bdat files is limited to whole seconds. With this change, up to microsecond precision is possible.
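For illustration, a minimal sketch of what this means on the pandas path; the file and column names below are placeholders, not files shipped with pyreadstat:

import pyreadstat

# Read a SAS dataset whose datetime column stores fractional seconds.
# "datetimes_usecs.sas7bdat" and "mydatetime" are illustrative names only.
df, meta = pyreadstat.read_sas7bdat("datetimes_usecs.sas7bdat")

ts = df["mydatetime"].iloc[0]
# Without this change the fraction is discarded, so ts.microsecond is always 0;
# with it, the stored fraction (up to microseconds) should be retained.
print(ts, ts.microsecond)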

I am submitting this as a draft PR for the following reasons:

  1. the changes are (partially) dependent on upstream ReadStat changes, which are included here for reproducibility; otherwise one needs SAS for Linux to generate appropriate files
  2. Once the format length is read properly for 32-bit files, like the ones created with SAS for Windows, the current approach of comparing the entire format name against a list will break whenever a format width is defined. Is there a particular reason for this approach? Ignoring the format width altogether would definitely be simpler. I can include that in this change if you would be willing to consider it
  3. Right now the same code is used for SPSS files, which I have no files to test with. If the change needs to be made SAS-specific, please let me know.

@belegdol force-pushed the increased_datetime_time_precision branch from 8fd1d3f to 2ab3a38 on May 12, 2026 08:03
@belegdol (Author)

Point 1 is actually independent; I removed it.
Point 2 should be mitigated by #332.
Point 3 I do not know how to test.

@belegdol marked this pull request as ready for review on May 12, 2026 08:07
@ofajardo (Collaborator)

Hi,

Thanks for the PR. I think the change is straightforward, but I need a bit more context. Please explain why you need this in the first place and submit a SAS file with fractional seconds as an example; it needs to be included in the tests as well (check the contributing guidelines: issues should be discussed before sending a PR).

1- Changes to ReadStat are not allowed, but I do not see any at the moment, so I guess it is fine.
2- Explain why "the current approach of comparing the entirety of the format name with a list will break whenever a format width is defined". I am not happy to change the current behavior of the package, so there must be a very strong reason to drop a mechanism that has worked for years by now.
3- At the very minimum, tests should run without errors (they do for me). There are sample SPSS and STATA files in the test_data folder. However, we would need to generate SPSS and STATA test files with fractional-second information (a sketch of one possible way to do that follows).
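For example, one way to produce such files could be to write them with pyreadstat itself, assuming its writers round-trip the sub-second part; the paths below are placeholders:

import pandas as pd
import pyreadstat

# Build a tiny frame with a sub-second timestamp and write SPSS/STATA copies.
# Whether the writers keep the fraction is an assumption to be verified,
# not something established here.
df = pd.DataFrame({"mydatetime": pd.to_datetime(["2018-03-04 10:47:37.282617"])})
pyreadstat.write_sav(df, "test_data/sample_datetime_usecs.sav")
pyreadstat.write_dta(df, "test_data/sample_datetime_usecs.dta")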

@belegdol (Author)

Hi @ofajardo,

Apologies for the confusing description and for the lack of an issue. The description refers to the state before I submitted #332. As such, bullet point 2 can be ignored here and discussed in #332.
The reason for needing this is to read the maximum precision available, as well as to ensure the closest possible parity with https://github.com/saurfang/spark-sas7bdat/commits/master/. There should be no difference between creating a Spark data frame natively with spark-sas7bdat and creating a pandas dataframe with pyreadstat and subsequently converting it to a Spark dataframe.
I will look into the test errors, and can also likely provide a test SAS file. An SPSS one could be a problem since I do not currently have access to SPSS.

@ofajardo (Collaborator)

OK, sounds good. Look at my comments in the other PR as well. For this one it would be important to provide a SAS example and to write a simple test that checks that the microseconds are read as expected. Probably one file and one test would be enough for both PRs.

@belegdol changed the base branch from master to dev on May 12, 2026 20:13
@belegdol (Author)

I have a test and a test file ready. Unfortunately, the polars path is failing on datetimes because the microseconds hit the limit of float64 precision. I will see if I can get the conversion to work in a way that the tests pass.
Interestingly enough, the time column, which is not vectorized, passes fine.
The pandas route works for the datetime column as well. Would disabling the vectorisation for datetimes be worth the increased precision?

@ofajardo (Collaborator) commented May 13, 2026

Hi, the vectorized version must stay as the default, because the non-vectorized version is very slow and I had complaints about it in the past; that is why the vectorized version was put in place.

If there is no graceful way to make the higher precision work with the vectorized version, one option is not to capture the high precision by default but to activate it with an argument and, in that case, avoid the vectorized path. I am saying this off the top of my head, as I have not seen it. But in my experience so far, precision higher than seconds is rare (I have not seen it myself), so making it optional is a possibility.
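Purely as an illustration of that suggestion, the call site could look roughly like this; the keyword name is invented here and is not an existing pyreadstat argument:

import pyreadstat

# Hypothetical opt-in (invented name): False keeps the fast vectorized,
# seconds-based path; True takes the slower scalar path and keeps the fraction.
df, meta = pyreadstat.read_sas7bdat(
    "sample.sas7bdat",
    keep_subsecond_precision=True,  # not a real pyreadstat argument
)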

@belegdol (Author)

Does this still qualify as vectorized?

From 186d2b17589f85bf95b9c8f05a2c6f4379aed277 Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Wed, 13 May 2026 16:40:19 +0200
Subject: [PATCH] Working concept for SAS using long long

---
 pyreadstat/_readstat_parser.pyx | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..f2fb7ba 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -199,6 +199,8 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
     cdef int secs
     cdef double msecs
     cdef int usecs
+    cdef long long days_usecs
+    cdef long long unix_to_origin_usecs
     cdef object mydat
 
     # For polars we are going to return an epoch from unix origin, 
@@ -233,13 +235,16 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
             return mydat.date()
     elif var_format == DATE_FORMAT_DATETIME:
         if output_format == "polars":
-            # we want to return seconds from unix
+            unix_to_origin_usecs = <long long> (unix_to_origin_secs * 1e6)
+            # we want to return microseconds from unix
             if file_format == FILE_FORMAT_STATA:
-                # tstamp is in millisecons
-                return (tstamp/1000) - unix_to_origin_secs
+                # tstamp is in milliseconds
+                return (tstamp * 1e3) - unix_to_origin_usecs
             else:
                 # tstamp in seconds
-                return tstamp - unix_to_origin_secs
+                days_usecs = <long long> (floor(tstamp) * 1e6)
+                usecs = <int> (round(tstamp % 1 * 1e6))
+                return days_usecs - unix_to_origin_usecs + usecs
 
         if file_format == FILE_FORMAT_STATA:
             # tstamp is in millisecons
@@ -1107,7 +1112,7 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
                 if var_format == DATE_FORMAT_DATE:
                     date_cols.append(column)
             if datetime_cols:
-                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='us'))
             if date_cols:
                 data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
 
-- 
2.54.0

Or does the entire double to timestamp conversion need to happen in from_epoch()? The code above works, but it definitely adds complexity.

@belegdol (Author)

Another option would be to limit the feature to milliseconds and document that anything beyond that might be imprecise.
Millisecond precision is what I have actually seen in the wild, and it is also as much precision as you get by calling datetime() in SAS.

@ofajardo (Collaborator)

Milliseconds can be an option.

As described, one problem is speed: it has to at least match the current implementation.

Another problem with increasing the precision by default is that it may limit the oldest date that can be processed, which is an issue when you have old datetimes in your data; in that sense, second precision was very good. Please control for that as well.

@belegdol (Author)

I will investigate. As things currently stand in the master branch, the polars path will happily read fractional seconds from SAS files; it is just that the results might differ in the last bit. Only the pandas path was explicitly discarding the fraction, which is what this PR changes.
My test file happens to expose the imprecision in the polars route. There is a slight chance that the way I generated it is partially responsible for that, though.
Regarding the date range, you are absolutely correct. At large distances from 1960 there are not enough mantissa bits for microsecond precision. Ideally, I would reproduce how SAS renders the values, if that is possible without losing performance.

@belegdol (Author) commented May 14, 2026

It looks like the way I created the file has no influence on the precision challenges.
While I have managed to develop a vectorized version of the code in #330 (comment), it still produces an off-by-one-microsecond datetime for one of the 100 values tested. At first glance it looks like the subtraction in

return (tstamp/1000) - unix_to_origin_secs

changes the LSB enough to influence the rounding: 1829-09-11T10:47:37.282617 becomes 1829-09-11T10:47:37.282618. Before the subtraction the floating point value is -4111996342.7173829079; after the subtraction it becomes -4427615542.7173824310. Looking at the fractional seconds of the absolute values, the former rounds up whereas the latter rounds down, which is what flips the displayed microsecond. It appears that one needs to convert to integer microseconds before the subtraction for maximum precision, as done in #330 (comment), but this moves the majority of the operations out of the vectorized polars code.
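The flip can be reproduced in plain Python with the values quoted above (315619200 is the number of seconds between the SAS origin 1960-01-01 and the Unix epoch):

# Values taken from the mismatching row described above.
unix_to_origin_secs = 315_619_200        # 1960-01-01 to 1970-01-01, in seconds
tstamp = -4111996342.7173829079          # SAS value for 1829-09-11T10:47:37.282617

shifted = tstamp - unix_to_origin_secs   # larger magnitude, so a coarser float64 grid

# Extract the fractional microseconds as in the sketches above
# (x % 1 yields the positive fraction in Python); the bit lost in the
# subtraction is enough to flip the rounded microsecond.
print(round(tstamp % 1 * 1e6))           # 282617
print(round(shifted % 1 * 1e6))          # 282618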
Given the nature of storing time as a floating point number, loss of precision due to rounding is expected at some point. How much is acceptable? For SAS, microsecond precision is only guaranteed for around 142 years around 1960; anything beyond that is imprecise by nature.
Using a test file with milliseconds only would pass the tests, but this feels like cheating.
Regarding the performance, is there a particular file you would like me to benchmark? Is the large dataset mentioned in the readme publicly available?
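For reference, the ~142-year figure follows directly from the width of the float64 mantissa:

# 53 mantissa bits mean integers up to 2**53 microseconds are exactly
# representable; that is roughly 285 years, or about +/-142 years around
# the SAS origin of 1960-01-01.
exact_span_years = 2**53 / 1e6 / 86400 / 365.25
print(exact_span_years / 2)  # ~142.7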

@belegdol (Author)

Here is the vectorised version for reference:

From 32b083c82485f9472246d0cad15abbc8c0a2892e Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Thu, 14 May 2026 01:24:12 +0200
Subject: [PATCH] vectorised version

---
 pyreadstat/_readstat_parser.pyx | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..ef73a27 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -1107,7 +1107,14 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
                 if var_format == DATE_FORMAT_DATE:
                     date_cols.append(column)
             if datetime_cols:
-                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+                data_frame = data_frame.with_columns(
+                    [
+                        pl.from_epoch(
+                            (pl.col(c) % 1 * 1e6).round().cast(pl.Int64) + (pl.col(c).floor() * 1e6).cast(pl.Int64),
+                            time_unit='us')
+                        for c in datetime_cols if data_frame[c].len() > 0
+                    ]
+                )
             if date_cols:
                 data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
 
-- 
2.54.0

@belegdol (Author)

It turns out that generating a test file with precision limited to milliseconds only is not enough. I generated one with 1000 datetimes between the years 1700 and 2200, and I still got one mismatch: 1825-05-31T22:55:23.626 ended up as 1825-05-31T22:55:23.625999 when imported from SAS with the vectorized code. Same as previously, the subtraction changed the seventh digit after the radix point: -4247082276.3740000725 became -4562701476.3740005493.
It would appear that for polars one would have to explicitly reduce the precision to milliseconds (rounding to the nearest millisecond) in order to ensure a complete match; a sketch of what that could look like follows.
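A sketch of that idea, reusing the vectorized expression from the patch above but reducing the precision to whole milliseconds; this is an illustration, not code from this PR:

import polars as pl

# Toy frame of unix-epoch seconds (float64) with a millisecond fraction,
# mirroring the column contents that dict_to_dataframe sees in the patch above.
df = pl.DataFrame({"dt": [-4562701476.374, 1520160457.282]})

# Convert to integer milliseconds (fraction and whole part separately, as in
# the patch) before building the datetime, so float64 noise below the
# millisecond cannot flip the displayed value.
df = df.with_columns(
    pl.from_epoch(
        (pl.col("dt") % 1 * 1e3).round().cast(pl.Int64)
        + (pl.col("dt").floor() * 1e3).cast(pl.Int64),
        time_unit="ms",
    )
)
print(df)  # the first row should show 1825-05-31 22:55:23.626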

belegdol added 2 commits May 14, 2026 18:14

Datetime values in SAS, SPSS and STATA are stored as a floating point number. Any operation risks introducing a rounding error. Use integer math in order to preserve original interpretation.
@belegdol (Author) commented May 14, 2026

I managed to get it working. tests/test_narwhalified.py ran in 3.081s with this PR applied on top of master vs 3.367s on master. If you have a heavier dataset to test, please let me know.
Please note that for this PR to work on dev, my other PR needs to be merged first; otherwise the formats will not be recognised properly. Alternatively, the changes can be applied on top of master.
