Commit 92fe668

Cinco de mayo (#312)
* Finish removing IBasicCSVParser
* Create csv::internals::speculative
* Make CSVParserCore templated
* Fix -Werror=reorder
* Simplified RowCollection. Removed unsync access methods.
* Add append_rows() to RowCollection
* Refactor TSD to accept batches
* 1gb/s

      C:\CSV Datasets\metadata.tsv>csv_bench metadata.tsv
      Parsing took (including disk IO): 11.4867
      Dimensions: 9074875 rows x 58 columns
      Parser worker threads: 12
      Speculative chunks: 1210 ambiguous=0 probability_model=0 size_heuristic=0 repairs=0 assumed_quoted=0 assumed_unquoted=1210
      Columns: strain virus gisaid_epi_isl genbank_accession genbank_accession_rev sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex Nextstrain_clade pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted date_updated sampling_strategy database clade_nextstrain clade_who Nextclade_pango immune_escape ace2_binding missing_data divergence nonACGTN coverage rare_mutations reversion_mutations potential_contaminants QC_missing_data QC_mixed_sites QC_rare_mutations QC_snp_clusters QC_frame_shifts QC_stop_codons QC_overall_score QC_overall_status frame_shifts deletions insertions substitutions aaSubstitutions clock_deviation

* Remove CSVRowOutput
* Clean-up BOM stripping logic
* Get rid of tuple stuff
* Restore previous parser hot loop
* Reorganize code
* Make threading fully runtime customizable
* Update csv_read_scheduler.hpp
* Reorganize namespaces
* Update driver.cpp
* Code deduplication
* Test strengthening
* Added more tests
* Update test_read_csv_file.cpp
* Fix poor performance with quoted values
* Split up memory optimization parts
* Fix TSan issue
* More cleanup
* Improve Python benchmarks
* Update benchmarks
* Updated docs

  Removed speculative_parsing() method. It now automatically turns on for a specific threshold.
* Up timeout for TSD stress test
1 parent 1fc5331 commit 92fe668

131 files changed

Lines changed: 16085 additions & 4601 deletions


.claude/rules/csv_reader_rules.md

Lines changed: 0 additions & 39 deletions
This file was deleted.

.github/workflows/cmake-multi-platform.yml

Lines changed: 4 additions & 4 deletions
@@ -63,7 +63,7 @@ jobs:
 python-version: '3.x'

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-${{ matrix.c_compiler }}-std${{ matrix.cxx_standard }}-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}
@@ -127,7 +127,7 @@ jobs:
 submodules: recursive

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-single-threaded-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}
@@ -167,7 +167,7 @@ jobs:
 python-version: '3.x'

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-emscripten-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}
@@ -228,7 +228,7 @@ jobs:
 submodules: recursive

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-simd-${{ matrix.simd_mode }}-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}

.github/workflows/compat-edge-cases.yml

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ jobs:
 python-version: '3.x'

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-msvc-no-zc-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}
@@ -95,7 +95,7 @@ jobs:
 mingw-w64-x86_64-ninja

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-mingw-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}

.github/workflows/coverage.yml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ jobs:
 submodules: recursive

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-coverage-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}

.github/workflows/sanitizers.yml

Lines changed: 7 additions & 7 deletions
@@ -49,7 +49,7 @@ jobs:
 submodules: recursive

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-sanitizer-${{ matrix.sanitizer }}-std${{ matrix.cxx_standard }}-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}
@@ -81,7 +81,7 @@ jobs:

 - name: Upload sanitizer logs
 if: failure()
-uses: actions/upload-artifact@v4
+uses: actions/upload-artifact@v7
 with:
 name: linux-sanitizer-logs-${{ matrix.sanitizer }}-std${{ matrix.cxx_standard }}
 path: build/Testing/
@@ -103,7 +103,7 @@ jobs:
 uses: ilammy/msvc-dev-cmd@0b201ec74fa43914dc39ae48a89fd1d8cb592756

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-msvc-asan-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}
@@ -130,7 +130,7 @@ jobs:

 - name: Upload MSVC AddressSanitizer logs
 if: failure()
-uses: actions/upload-artifact@v4
+uses: actions/upload-artifact@v7
 with:
 name: windows-msvc-asan-logs-std20
 path: build/msvc-asan/Testing/
@@ -145,7 +145,7 @@ jobs:
 submodules: recursive

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-valgrind-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}
@@ -177,7 +177,7 @@ jobs:

 - name: Upload Valgrind results
 if: failure()
-uses: actions/upload-artifact@v4
+uses: actions/upload-artifact@v7
 with:
 name: valgrind-results
 path: build/Testing/
@@ -194,7 +194,7 @@ jobs:
 submodules: recursive

 - name: Cache CMake FetchContent dependencies
-uses: actions/cache@v4
+uses: actions/cache@v5
 with:
 path: ${{ env.CSV_FETCHCONTENT_BASE_DIR }}
 key: fetchcontent-${{ runner.os }}-o3-coverage-${{ hashFiles('CMakeLists.txt', 'tests/CMakeLists.txt') }}

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ bin/
 build/

 # Build: Python
+__pycache__/
 *.pyc
 *.pyd

.gitmodules

Lines changed: 0 additions & 3 deletions
@@ -1,6 +1,3 @@
 [submodule "tests/data"]
 path = tests/data
 url = https://github.com/vincentlaucsb/csv-data.git
-[submodule "python/pybind11"]
-path = python/pybind11
-url = https://github.com/pybind/pybind11.git

.travis.yml

Lines changed: 0 additions & 86 deletions
This file was deleted.

AGENTS.md

Lines changed: 5 additions & 1 deletion
@@ -2,7 +2,7 @@

 Architectural overview for AI assistants working with this codebase.

-> **Maintenance rule:** Whenever this file is changed, update both `CLAUDE.md` and `ARCHITECTURE.md` in the same directory to reflect relevant changes. `CLAUDE.md` is a bullet-point summary and `ARCHITECTURE.md` is the top-level architecture index; both must stay in sync with this guidance.
+> **Maintenance rule:** `AGENTS.md` is the canonical AI-agent guidance. When this file changes, update `ARCHITECTURE.md` in the same directory if the architecture index or durable engineering guidance needs to reflect the change. `CLAUDE.md` is only a compatibility shim that points here.

 ## Critical: single_include/csv.hpp Is A Shim

@@ -52,6 +52,7 @@ For detailed file mapping, parser data flow, and component relationships, see `A
 8. **Opportunistic rewrites/refactors are allowed when they are safe and justified.** Keep them separated from build-fix urgency where possible, and avoid bundling unrelated churn with compiler triage unless explicitly requested.
 9. **When proposing changes that affect compile-time behavior, explain the tradeoff clearly.** Call out any impact to codegen, performance, portability, and readability before applying the change.
 10. **If a build fix appears to require more than ~3 files or ~60 changed lines, pause and confirm scope first.** Provide a short justification before expanding further.
+11. **`CSVReader::iterator` is intentionally single-pass.** Do not cache all `RawCSVDataPtr` chunks to make it behave like a forward iterator; that defeats bounded-memory streaming for large CSV files. Algorithms that need multi-pass access should first materialize rows into a container such as `std::vector<CSVRow>`.

 See `tests/AGENTS.md` for test strategy, checklist, and conventions.
@@ -71,6 +72,9 @@ See `tests/AGENTS.md` for test strategy, checklist, and conventions.
 - **Consolidation:** If a `.cpp` would be under ~100 lines *and* the split causes excessive comment duplication between the two files, prefer a single `.hpp` with definitions marked `inline` (free functions and methods alike). Do not use `CSV_INLINE` for consolidated definitions — `CSV_INLINE` expands to empty in multi-header mode, which would produce ODR violations across TUs. Do not consolidate just for brevity — only when duplication is the dominant cost.
 7. **Prefer LF (`\n`) line endings for tracked source, test, CMake, and Markdown files.** When you touch a file with mixed line endings, normalize the edited file to LF unless there is a file-specific reason not to. Avoid introducing mixed CRLF/LF endings in the same file.
 8. **Keep preprocessor directives flush left.** `#define`, `#if`, `#ifdef`, `#else`, and `#endif` should start at column 0. Code inside multi-line macros should be indented exactly as the equivalent non-macro code would be; do not add extra indentation just because it lives inside a macro body.
+9. **Keep constructor initializer lists in declaration order.** C++ initializes bases and members in declaration order, not initializer-list order. When adding or editing a constructor, order its initializer list to match the class declaration exactly so GCC/Clang `-Wreorder` stays clean and readers do not infer a false initialization dependency.
+10. **Internal folder namespaces should match folder structure.** When adding or moving files under `include/internal/`, place their contents in the matching nested namespace when practical. For example, `include/internal/speculative/` maps to `csv::internals::speculative`, and `include/internal/parser/` maps to `csv::internals::parser`. Do not churn existing files solely for this rule unless the namespace move is part of an intentional architecture cleanup.
+11. **Do not accidentally pass large objects by value.** Use `const&` for observation, `&` for mutation, and `&&` / by-value-with-an-explicit-`std::move` for ownership transfer. If passing a large object by value is intentional, make the consuming semantics obvious at the call site or add a brief comment.

 ### Rules for Comments
 1. **Always update or remove incorrect comments.**
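
As a minimal sketch of the new initializer-order rule (rule 9 in the hunk above), the class below keeps its initializer list in the same order as its member declarations. `Buffer` and `demo_init_order` are hypothetical names used only for illustration and are not part of this codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical class: size_ is derived from data_, so data_ must be
// constructed first. Members are always initialized in declaration order,
// regardless of the order written in the initializer list.
class Buffer {
public:
    explicit Buffer(std::string data)
        : data_(std::move(data)),  // declared first, initialized first
          size_(data_.size())      // safe: data_ is already constructed here
    {}

    std::size_t size() const { return size_; }
    const std::string& data() const { return data_; }

private:
    // If these two declarations were swapped, size_ would be initialized
    // before data_ and would read an unconstructed string (undefined
    // behavior); GCC/Clang report the mismatched list via -Wreorder.
    std::string data_;
    std::size_t size_;
};

// Usage sketch.
void demo_init_order() {
    Buffer b("abc");
    assert(b.size() == 3);
    assert(b.data() == "abc");
}
```

Writing the list in declaration order makes the dependency of `size_` on `data_` visible to the reader instead of leaving it implicit in the class layout.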

ARCHITECTURE.md

Lines changed: 4 additions & 2 deletions
@@ -7,6 +7,7 @@ Primary architecture document:

 Subsystem deep-dive:
 - include/internal/THREADSAFE_DEQUE_DESIGN.md
+- BOMStrippingRefactor.md

 Operational/testing guidance:
 - AGENTS.md
@@ -23,7 +24,9 @@ Notes:
 - Private member naming should prefer trailing underscores; when editing mixed-style code, normalize the touched region toward that convention.
 - Prefer LF (`\n`) line endings for tracked source, test, CMake, and Markdown files; when touching a file with mixed endings, normalize it to LF unless there is a file-specific reason not to.
 - Keep preprocessor directives flush left; `#define`, `#if`, `#ifdef`, `#else`, and `#endif` should start at column 0, and code inside multi-line macros should be indented as if the macro wrapper were not present.
-- Compatibility macros defined in `common.hpp` must only be referenced after including `common.hpp`. See AGENTS.md and CLAUDE.md for details.
+- Keep constructor initializer lists in the same order as base/member declarations so GCC/Clang `-Wreorder` remains clean and initialization dependencies stay obvious.
+- Internal folder namespaces should match folder structure when practical; for example `include/internal/parser/` maps to `csv::internals::parser`.
+- Compatibility macros defined in `common.hpp` must only be referenced after including `common.hpp`. See AGENTS.md for details.
 - API constraints should be user-friendly: do not over-constrain templates unless needed for correctness, safety, or a measured performance win.
 - `CSVReader` is intentionally non-copyable and move-enabled; use explicit ownership transfer patterns (`std::move`, `std::unique_ptr`) at API boundaries.
 - Respect existing compile-time compatibility macros (`IF_CONSTEXPR`, `CONSTEXPR_VALUE`, etc.) unless correctness requires change.
@@ -32,4 +35,3 @@ Notes:
 - When changing compile-time behavior, explicitly document tradeoffs (codegen, performance, portability, readability).
 - If a build fix appears to require more than ~3 files or ~60 changed lines, pause and confirm scope first.
 - Apply the 5/2 anti-duplication rule: if equivalent behavior exists in 2+ code paths and each copy is ~5+ meaningful lines, extract a shared helper; if duplication remains, document why and keep regression coverage for each path.
-
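
The ARCHITECTURE.md note that `CSVReader` is non-copyable and move-enabled, together with the parameter-passing guidance in AGENTS.md, can be sketched with a stand-in type. `Reader`, `source_length`, `adopt`, `take`, and `demo_ownership` are hypothetical names; the real `CSVReader` interface is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a non-copyable, move-enabled resource.
class Reader {
public:
    explicit Reader(std::string source) : source_(std::move(source)) {}
    Reader(const Reader&) = delete;             // copying is disallowed
    Reader& operator=(const Reader&) = delete;
    Reader(Reader&&) = default;                 // moving is allowed
    Reader& operator=(Reader&&) = default;
    const std::string& source() const { return source_; }

private:
    std::string source_;
};

static_assert(!std::is_copy_constructible<Reader>::value, "Reader must not be copyable");
static_assert(std::is_move_constructible<Reader>::value, "Reader must be movable");

// Observation only: take a const reference, no copy and no ownership change.
std::size_t source_length(const Reader& r) { return r.source().size(); }

// Ownership transfer through std::unique_ptr taken by value:
// the caller must write std::move at the call site.
std::unique_ptr<Reader> adopt(std::unique_ptr<Reader> r) { return r; }

// Ownership transfer of the object itself through an rvalue reference.
Reader take(Reader&& r) { return std::move(r); }

// Usage sketch.
void demo_ownership() {
    std::unique_ptr<Reader> owned(new Reader("data.csv"));
    assert(source_length(*owned) == 8);

    std::unique_ptr<Reader> adopted = adopt(std::move(owned));
    assert(owned == nullptr);                  // the moved-from unique_ptr is empty
    assert(adopted->source() == "data.csv");

    Reader moved = take(std::move(*adopted));  // contents now live in `moved`
    assert(moved.source() == "data.csv");
}
```

Each transfer is visible at the call site as `std::move`, which is the "explicit ownership transfer" property the note asks for.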
