airdos04c parser time and memory optimization #66
Conversation
Pull request overview
Optimizes the AIRDOS04C log parser for improved runtime/memory by avoiding per-block event lists and constructing the output via NumPy, and adds basic parsing-duration instrumentation in the spectral record processing task.
Changes:
- Refactor AIRDOS04C parsing to compute high-energy histograms in a single pass and build the DataFrame from a preallocated NumPy array.
- Add optional UNIX time alignment via a `$TIME`-derived offset when computing `time_ms`.
- Add a timing printout around `parse_log_to_unified(...)` in the async spectral record task.
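As a rough illustration of the refactor described above, the per-block event lists can be replaced by a single `np.histogram` pass per block that fills a preallocated array, which then becomes the DataFrame payload directly. This is a hedged sketch, not the parser's actual code: the names `N_HIGH_BINS`, `ADC_MAX`, and `blocks` are illustrative stand-ins.

```python
import numpy as np
import pandas as pd

# Illustrative constants; the real parser defines these on the class.
N_HIGH_BINS = 8
ADC_MAX = 1024
bin_edges = np.linspace(0, ADC_MAX, N_HIGH_BINS + 1)

# Fake per-block event energies standing in for parsed log blocks.
blocks = [np.array([10, 500, 900]), np.array([100, 200])]

# Preallocate the output once instead of growing Python lists per block.
channel_arr = np.zeros((len(blocks), N_HIGH_BINS), dtype=np.int32)
for i, events in enumerate(blocks):
    # One histogram pass per block; no intermediate event lists are retained.
    channel_arr[i], _ = np.histogram(events, bins=bin_edges)

# The array backs the DataFrame directly, avoiding per-row dict construction.
df = pd.DataFrame(channel_arr, columns=[f"channel_{i}" for i in range(N_HIGH_BINS)])
```

The preallocation matters because appending to Python lists and converting at the end both copies the data and holds two representations in memory at once.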
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `backend/DOSPORTAL/tasks/spectral_records.py` | Adds parsing duration measurement/log output around unified parsing. |
| `backend/DOSPORTAL/services/parsing/parsers/airdos_04c.py` | Refactors the AIRDOS04C parser to reduce allocations and speed up DataFrame construction. |
Comments suppressed due to low confidence (1)
backend/DOSPORTAL/services/parsing/parsers/airdos_04c.py:160
`metadata` no longer includes `channel_columns`. Other parsers consistently provide this (e.g., `parsers/airdos_04a.py:102-115`, `parsers/geodos_1024_v1.py:88-101`), and some callers look for it before falling back to column-name scanning. Please add `"channel_columns": channel_names` back to keep the unified parsing metadata consistent across parsers.
```python
metadata: dict[str, object] = {
    "header": raw_header,
    "format": self.format_name,
    "version": self.version,
    "records_count": int(len(df)),
    "channels_count": int(len(channel_names)),
    "skipped_rows": int(skipped_rows),
    "time_range_ms": [
        float(df["time_ms"].min()),
        float(df["time_ms"].max()),
    ],
    "high_energy_bin_edges": bin_edges,
    "high_energy_adc_max": self.ADC_MAX,
    "start_unix_s": int(start_unix_s) if start_unix_s is not None else None,
}
```
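A minimal sketch of the fix this comment asks for, assuming `channel_names` is still in scope where the dict is built (the surrounding keys are abbreviated here):

```python
# Stand-in for the channel_names list built earlier in the parser.
channel_names = [f"channel_{i}" for i in range(4)]

metadata: dict[str, object] = {
    "channels_count": int(len(channel_names)),
    # ... other keys as in the snippet above ...
    "channel_columns": channel_names,  # restored key other parsers also emit
}
```

Keeping the key means downstream callers can look up column names directly instead of pattern-matching on `channel_*` column names.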
```python
total_channels = self.LOW_CHANNELS + self.N_HIGH_BINS
channel_names = [f"channel_{i}" for i in range(total_channels)]
```
`total_channels` is recomputed at line 141 but the value isn't used afterward (the earlier `total_channels` is what drives `channel_names`/`channel_arr`). This looks like a leftover from the previous implementation; removing the redundant assignment would reduce confusion.
```python
import time
_t0 = time.perf_counter()
parsed = parse_log_to_unified(combined)
print(f"Parsing done in {time.perf_counter() - _t0:.3f}s (record {record.id}, {len(raw_files)} files)")
```
The new timing instrumentation uses `print(...)` inside this background task. For production/async task runs this can be noisy and bypasses structured logging/levels; prefer using the module logger (e.g., `debug`/`info`) for this timing metric so it can be filtered/collected consistently.
Suggested change:

```diff
- print(f"Parsing done in {time.perf_counter() - _t0:.3f}s (record {record.id}, {len(raw_files)} files)")
+ logger.debug(
+     "Parsing done in %.3fs (record %s, %s files)",
+     time.perf_counter() - _t0,
+     record.id,
+     len(raw_files),
+ )
```
```python
df = pd.DataFrame(channel_arr, columns=channel_names)
df.insert(0, "particle_count", np.array(particle_count_list, dtype=np.int32))
df.insert(0, "time_ms", np.array(time_list, dtype=np.float64))
```
`time_ms` is no longer normalized to start at 0.0 (other parsers do `time_ms -= min(time_ms)` and tests expect `df["time_ms"].min() == 0.0`). With the AIRDOS04C fixture values (~7.5e9 ms), this will break consumers that assume relative time. Suggest subtracting the minimum from the `time_ms` array/column after it's built, and ensure `metadata["time_range_ms"]` reflects the normalized range.
Suggested change:

```diff
- df = pd.DataFrame(channel_arr, columns=channel_names)
- df.insert(0, "particle_count", np.array(particle_count_list, dtype=np.int32))
- df.insert(0, "time_ms", np.array(time_list, dtype=np.float64))
+ time_ms_arr = np.array(time_list, dtype=np.float64)
+ time_ms_arr -= time_ms_arr.min()
+ df = pd.DataFrame(channel_arr, columns=channel_names)
+ df.insert(0, "particle_count", np.array(particle_count_list, dtype=np.int32))
+ df.insert(0, "time_ms", time_ms_arr)
```
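To illustrate the second half of the comment, keeping `metadata["time_range_ms"]` consistent with the normalized column could look like this sketch (the `time_list` values are made up to mimic the ~7.5e9 ms fixture scale mentioned above):

```python
import numpy as np

# Illustrative raw timestamps on the absolute-ms scale seen in the fixture.
time_list = [7.5e9, 7.5e9 + 1000.0, 7.5e9 + 2500.0]

time_ms_arr = np.array(time_list, dtype=np.float64)
time_ms_arr -= time_ms_arr.min()  # normalize so relative time starts at 0.0

# Derive the metadata range from the normalized array, not the raw values,
# so it matches what consumers see in df["time_ms"].
time_range_ms = [float(time_ms_arr.min()), float(time_ms_arr.max())]
```

Computing the range after the subtraction guarantees the metadata and the column can never drift apart.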
No description provided.