airdos04c parser time and memory optimization #66

Merged
cisar2218 merged 2 commits into master from update-parsers on Apr 28, 2026
Conversation

@cisar2218 (Contributor)

No description provided.

Copilot AI review requested due to automatic review settings April 28, 2026 22:02

Copilot AI left a comment


Pull request overview

Optimizes the AIRDOS04C log parser's runtime and memory use by avoiding per-block event lists and constructing the output via a preallocated NumPy array, and adds basic parsing-duration instrumentation to the spectral record processing task.

Changes:

  • Refactor AIRDOS04C parsing to compute high-energy histograms in a single pass and build the DataFrame from a preallocated NumPy array.
  • Add optional UNIX time alignment via $TIME-derived offset when computing time_ms.
  • Add timing printout around parse_log_to_unified(...) in the async spectral record task.
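The single-pass refactor the first bullet describes can be sketched roughly as below. The constants (`LOW_CHANNELS`, `N_HIGH_BINS`, `ADC_MAX`) and the block-tuple shape are assumptions for illustration only; the actual values and parsing logic live in `airdos_04c.py`:

```python
import numpy as np
import pandas as pd

# Hypothetical constants mirroring the parser class attributes.
LOW_CHANNELS = 16
N_HIGH_BINS = 8
ADC_MAX = 1024

def build_frame(blocks):
    """Single pass over parsed blocks: histogram high-energy events directly
    into a preallocated array instead of accumulating per-block event lists.

    Each block is assumed to be (time_ms, low_counts, high_events)."""
    total_channels = LOW_CHANNELS + N_HIGH_BINS
    channel_arr = np.zeros((len(blocks), total_channels), dtype=np.int32)
    time_arr = np.empty(len(blocks), dtype=np.float64)
    particle_counts = np.empty(len(blocks), dtype=np.int32)
    bin_edges = np.linspace(0, ADC_MAX, N_HIGH_BINS + 1)
    for i, (time_ms, low_counts, high_events) in enumerate(blocks):
        time_arr[i] = time_ms
        channel_arr[i, :LOW_CHANNELS] = low_counts
        # Bin the high-energy ADC values straight into the row.
        hist, _ = np.histogram(high_events, bins=bin_edges)
        channel_arr[i, LOW_CHANNELS:] = hist
        particle_counts[i] = len(high_events) + int(np.sum(low_counts))
    df = pd.DataFrame(channel_arr,
                      columns=[f"channel_{i}" for i in range(total_channels)])
    df.insert(0, "particle_count", particle_counts)
    df.insert(0, "time_ms", time_arr - time_arr.min())  # normalize to start at 0.0
    return df
```

The preallocated `channel_arr` lets pandas adopt one contiguous int32 block instead of assembling the frame column by column from Python lists.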

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Reviewed files:

  • backend/DOSPORTAL/tasks/spectral_records.py — Adds parsing duration measurement/log output around unified parsing.
  • backend/DOSPORTAL/services/parsing/parsers/airdos_04c.py — Refactors the AIRDOS04C parser to reduce allocations and speed up DataFrame construction.
Comments suppressed due to low confidence (1)

backend/DOSPORTAL/services/parsing/parsers/airdos_04c.py:160

  • metadata no longer includes channel_columns. Other parsers consistently provide this (e.g., parsers/airdos_04a.py:102-115, parsers/geodos_1024_v1.py:88-101), and some callers look for it before falling back to column-name scanning. Please add channel_columns: channel_names back to keep the unified parsing metadata consistent across parsers.
        metadata: dict[str, object] = {
            "header": raw_header,
            "format": self.format_name,
            "version": self.version,
            "records_count": int(len(df)),
            "channels_count": int(len(channel_names)),
            "skipped_rows": int(skipped_rows),
            "time_range_ms": [
                float(df["time_ms"].min()),
                float(df["time_ms"].max()),
            ],
            "high_energy_bin_edges": bin_edges,
            "high_energy_adc_max": self.ADC_MAX,
            "start_unix_s": int(start_unix_s) if start_unix_s is not None else None,
        }
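For context, the caller-side fallback this comment alludes to could look like the following. The `get_channel_columns` helper is hypothetical, written only to illustrate why dropping `channel_columns` from the metadata still works but forces the slower column-name scan:

```python
import pandas as pd

def get_channel_columns(df: pd.DataFrame, metadata: dict) -> list:
    """Prefer the explicit metadata key; fall back to column-name scanning."""
    cols = metadata.get("channel_columns")
    if cols:
        return list(cols)
    # Fallback: scan the DataFrame for the conventional channel_ prefix.
    return [c for c in df.columns if c.startswith("channel_")]
```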


Comment on lines 141 to 142
total_channels = self.LOW_CHANNELS + self.N_HIGH_BINS
channel_names = [f"channel_{i}" for i in range(total_channels)]


Copilot AI Apr 28, 2026


total_channels is recomputed at line 141 but the value isn’t used afterward (the earlier total_channels is what drives channel_names/channel_arr). This looks like a leftover from the previous implementation; removing the redundant assignment would reduce confusion.

import time
_t0 = time.perf_counter()
parsed = parse_log_to_unified(combined)
print(f"Parsing done in {time.perf_counter() - _t0:.3f}s (record {record.id}, {len(raw_files)} files)")

Copilot AI Apr 28, 2026


The new timing instrumentation uses print(...) inside this background task. For production/async task runs this can be noisy and bypass structured logging/levels; prefer using the module logger (e.g., debug/info) for this timing metric so it can be filtered/collected consistently.

Suggested change
print(f"Parsing done in {time.perf_counter() - _t0:.3f}s (record {record.id}, {len(raw_files)} files)")
logger.debug(
    "Parsing done in %.3fs (record %s, %s files)",
    time.perf_counter() - _t0,
    record.id,
    len(raw_files),
)

Comment on lines +136 to +138
df = pd.DataFrame(channel_arr, columns=channel_names)
df.insert(0, "particle_count", np.array(particle_count_list, dtype=np.int32))
df.insert(0, "time_ms", np.array(time_list, dtype=np.float64))

Copilot AI Apr 28, 2026


time_ms is no longer normalized to start at 0.0 (other parsers do time_ms -= min(time_ms) and tests expect df["time_ms"].min() == 0.0). With the AIRDOS04C fixture values (~7.5e9 ms), this will break consumers that assume relative time. Suggest subtracting the minimum from the time_ms array/column after it’s built, and ensure metadata["time_range_ms"] reflects the normalized range.

Suggested change
df = pd.DataFrame(channel_arr, columns=channel_names)
df.insert(0, "particle_count", np.array(particle_count_list, dtype=np.int32))
df.insert(0, "time_ms", np.array(time_list, dtype=np.float64))
time_ms_arr = np.array(time_list, dtype=np.float64)
time_ms_arr -= time_ms_arr.min()
df = pd.DataFrame(channel_arr, columns=channel_names)
df.insert(0, "particle_count", np.array(particle_count_list, dtype=np.int32))
df.insert(0, "time_ms", time_ms_arr)

@cisar2218 cisar2218 merged commit 369771e into master Apr 28, 2026
12 checks passed