Seems like pandas/PyTables append may be a lot slower than writing into a new file. (Or else the code that rewrites the table when incoming strings are longer than the existing column width is getting hit a lot.)
The sort step should probably pre-count lines per PUMA as part of its stats, and maybe also record max string lengths for the columns that need them. Then we can preallocate file sizes and write into them, instead of appending.
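A rough sketch of what that pre-count pass could look like, using only the standard library. The function name `collect_stats`, the column names `NAICSP` and `AGEP`, and the sample rows are all made up for illustration; only the idea of counting per PUMA and tracking max string lengths comes from the note above.

```python
import csv
import io
from collections import Counter, defaultdict


def collect_stats(rows, key="PUMA", string_cols=("NAICSP",)):
    """Single pass over dict-like rows.

    Returns (row count per key, max encoded string length per key and column).
    `key` and `string_cols` are hypothetical defaults, not the real schema.
    """
    counts = Counter()
    max_len = defaultdict(lambda: defaultdict(int))
    for row in rows:
        k = row[key]
        counts[k] += 1
        for col in string_cols:
            # Measure bytes, not characters, since fixed-width string
            # columns are sized in bytes.
            n = len(row[col].encode("utf-8"))
            if n > max_len[k][col]:
                max_len[k][col] = n
    # Convert nested defaultdicts to plain dicts for easier serialization.
    return counts, {k: dict(v) for k, v in max_len.items()}


# Tiny made-up sample standing in for the real input file.
sample = io.StringIO(
    "PUMA,NAICSP,AGEP\n"
    "00101,7225,34\n"
    "00102,611M1,51\n"
    "00101,4MS,19\n"
)
counts, max_len = collect_stats(csv.DictReader(sample))
print(dict(counts))
print(max_len)
```

If we stay on HDF5, these numbers could presumably be fed to `HDFStore.append` via its `expectedrows` and `min_itemsize` arguments, which should at least avoid the string-width rewrite path. Whether that is enough to fix the slowdown is untested.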
Probably should (also?) consider using Feather or Parquet instead of HDF5.