Archiving
Archiving means placing multiple files into an archive, a file that contains other files.
There are two archive formats we use for archiving:
- 7zip archives: extension 7z
- tar archives: extension tar
Advantages/Disadvantages:
- 7zip doesn't store user/group information; extracted files are owned by whoever extracts them. tar stores user/group information, which can be restored on extraction.
- 7zip doesn't handle hardlinks as hardlinks; the files are duplicated. tar stores only one copy of hardlinked data, and the links are stored as links.
- 7zip allows a different compression scheme per file. tar files are uncompressed and must be compressed as a whole by a single bitstream compressor (see below). Compressed tar files have the extension tar.XXX, where XXX represents the compressor used. A compressed tar file is called a tarball.
- 7zip allows easier modification of an archive (deleting/renaming/adding). Compressed tar files must be decompressed and recreated manually (technically 7zip does this as well, but the program does it for you).
- 7zip supports symbolic links but doesn't like broken symbolic links; tar handles them.
- Listing the contents of a 7zip file is instant; a compressed tar file must be decompressed to list its contents.
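The hardlink behaviour of tar can be checked with a small experiment. The sketch below is an illustration only: the `demo` directory and its file names are made up, and it assumes a standard `tar` is on the path. It creates a 1 MiB file plus a hardlink to it, archives both, and prints the archive size, which should be close to one copy of the data rather than two.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir demo
# 1 MiB of data, then a second name (hardlink) for the same data.
dd if=/dev/zero of=demo/data.bin bs=1024 count=1024 2>/dev/null
ln demo/data.bin demo/data_link.bin

# Archive the directory; tar stores the data once and the link as a link.
tar -cf demo.tar demo

# Verbose listing: one of the two names is shown as a link to the other.
tar -tvf demo.tar

# Size of the archive in bytes (roughly 1 MiB, not 2 MiB).
archive_bytes=$(wc -c < demo.tar | tr -d ' ')
echo "archive size: ${archive_bytes} bytes"
```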
Bitstream compressors:
- lzma2: slow compression, best compression for binary data (images, track files), medium-speed decompression that can be multithreaded
- bzip2: slow compression, best compression most of the time for text files, slow decompression, single threaded only
- deflate (gzip): used for nifti and mgh files. Superseded, don't use.
- zstd: fast compression, can be multithreaded, excellent compression given high speed, very fast decompression
- lz4: ultra fast compression, can be multithreaded, compression not as good, ultra fast decompression
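To make the tar.XXX naming concrete, here is a minimal sketch of how a tar stream can be piped through one of the compressors above. It is not one of the shared archive scripts. The function name `make_tarball` is hypothetical, and it assumes the usual command-line tools (`xz` for lzma2, `bzip2`, `gzip`, `zstd`, `lz4`) are installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# make_tarball DIR COMPRESSOR
# Hypothetical helper: tars DIR and pipes the stream through COMPRESSOR,
# naming the result DIR.tar.<ext> after the compressor used.
make_tarball() {
    local dir=$1 compressor=$2 ext
    case "$compressor" in
        xz)    ext=xz  ;;   # lzma2
        bzip2) ext=bz2 ;;
        gzip)  ext=gz  ;;   # deflate; listed only for completeness
        zstd)  ext=zst ;;
        lz4)   ext=lz4 ;;
        *) echo "unknown compressor: $compressor" >&2; return 1 ;;
    esac
    # Each tool reads stdin and writes stdout when called with -c.
    tar -cf - "$dir" | "$compressor" -c > "$dir.tar.$ext"
    # Report the name of the tarball that was written.
    echo "$dir.tar.$ext"
}

# Example (not run here): make_tarball freesurfer zstd
```

Each compressor runs at its default level here; levels and thread counts are set with the tool's own options.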
When creating archives, we generally want to do a few things:
- Make an archive of a root directory that contains files, for example freesurfer.7z for a directory called freesurfer.
- Keep each archive no bigger than about 10TB. If a directory contains many medium or large child directories, archive each child directory separately. It is then easier to transfer or recover individual directories than one huge >10TB file, and less data is lost if an archive is corrupted.
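One way to decide between a single archive and one archive per child directory is to look at the child directory sizes first. The following is a minimal sketch, assuming standard `du` and `sort`; the function name `list_child_sizes` is made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# list_child_sizes ROOT
# Prints "<size in KiB><TAB><path>" for each immediate child directory of
# ROOT, largest first.
list_child_sizes() {
    local root=$1 child
    for child in "$root"/*/; do
        [ -d "$child" ] || continue   # no child directories: glob stays literal
        du -sk "${child%/}"           # -s: one total per directory, -k: KiB
    done | sort -rn
}

# Example (not run here): list_child_sizes /path/to/project
```

If a few children dominate the total, archiving each child separately keeps the individual archives small.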
When to use tar or 7z: 7z is better for most operations once the archive is created, and it is good for directories that hold a mix of compressible and already-compressed data. For example, a FreeSurfer output will have the following directories:
- label: text files, compressible
- mri: .mgz, .m3z, and .nii.gz files are already compressed, so not compressible further
- stats: text files, compressible
- surf: compressible
So, we add label, stats, and surf with a slow compressor to save space, then add mri without compression for speed.
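The two-pass idea can be sketched with the 7-Zip command line. This is a hypothetical illustration, not the contents of the shared scripts: it assumes the 7-Zip binary is available as `7zz`, and the function names `run_cmd` and `archive_freesurfer_dir` are made up. By default it only prints the commands; set `RUN=1` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_cmd CMD...
# Executes CMD when RUN=1, otherwise prints it (dry run).
run_cmd() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# archive_freesurfer_dir DIR [LEVEL]
# Hypothetical sketch of the two passes described above.
archive_freesurfer_dir() {
    local dir=$1 level=${2:-5}
    local archive="$dir.7z"
    # Pass 1: compressible directories with LZMA2 at the chosen level.
    run_cmd 7zz a -m0=lzma2 -mx="$level" "$archive" \
        "$dir/label" "$dir/stats" "$dir/surf"
    # Pass 2: already-compressed images, stored without compression (-mx=0).
    run_cmd 7zz a -mx=0 "$archive" "$dir/mri"
}

# Dry run: archive_freesurfer_dir freesurfer 3
```

Printing the commands first makes it easy to check the paths and level before any archive is written.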
To make a 7zip archive:
`~/kg98/Shared/archive_scripts/directory_7zz.job <foo> [level]`
Level is the compression level for the slow lzma2 compressor. It can be 1 (fastest), 3, 5, 7, or 9 (slowest). Anywhere from 1 to 7 is recommended; 9 is too slow.
If you have text files with extension .tsv, .csv, or .txt, you can create an archive with those compressed using bzip2:
`~/kg98/Shared/archive_scripts/directory_7zz_csvtxt_bzip2.job <foo> [level]`
To make a .tar.zst tarball:
`~/kg98/Shared/archive_scripts/directory_tarzst.job <foo> [level]`
Level here can be any number from 1 (fast) to 19 (slow). Recommended range is from 3 to 15.
When these scripts complete, they make a .done file for each archive, for example one ending in .7z.done for a 7z archive. To delete the directories that were archived in the current working directory:
`~/kg98/Shared/archive_scripts/remove_done_7z.sh` for 7z archives
`~/kg98/Shared/archive_scripts/remove_done_tarzst.sh` for tar.zst archives
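Before deleting anything, it can help to preview which directories have a completed archive. The sketch below is not the removal script itself and deletes nothing. It assumes, without having confirmed it, that each marker is named `<directory>.<suffix>.done` and sits next to the directory and its archive; the function name `list_done_dirs` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# list_done_dirs [SUFFIX]
# Prints directories in the current working directory that have both a
# non-empty archive NAME.SUFFIX and a marker NAME.SUFFIX.done beside them.
# ASSUMPTION: markers are named <directory>.<suffix>.done. Check this
# against the real files before relying on the output.
list_done_dirs() {
    local suffix=${1:-7z} marker name
    for marker in ./*."$suffix".done; do
        [ -e "$marker" ] || continue       # glob matched nothing
        name=${marker%."$suffix".done}     # strip ".<suffix>.done"
        name=${name#./}                    # strip leading "./"
        if [ -d "$name" ] && [ -s "$name.$suffix" ]; then
            echo "$name"
        fi
    done
}

# Examples (not run here):
#   list_done_dirs          # 7z archives
#   list_done_dirs tar.zst  # tar.zst tarballs
```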
To make the file listings for all archives in the current directory and below, run:
`~/kg98/Shared/archive_scripts/all_archives_listings.sh`