Commit 3f1364d (parent: 64ead93)

Update characterize_data.py documentation.

Update documentation: long file names and per-cell character limits in spreadsheets; usage of uv to run the script directly from the GitHub repository; duplicate file handling in per_series mode.

1 file changed: Python/scripts/characterize_data.py (21 additions, 13 deletions)
@@ -17,11 +17,6 @@
 #
 # =========================================================================
 
-#
-# Run the script directly from GitHub without downloading it using uv (https://github.com/astral-sh/uv):
-# uv run https://raw.githubusercontent.com/InsightSoftwareConsortium/SimpleITK-Notebooks/refs/heads/main/Python/scripts/characterize_data.py -h
-#
-
 #
 # Provide inline script metadata per PEP 723 (https://peps.python.org/pep-0723/)
 # /// script
@@ -775,7 +770,7 @@ def characterize_data(argv=None):
 -------
 To run the script one has to specify:
 1. Root of the data directory.
-2. Filename of csv output, can include relative or absolute path.
+2. Filename of csv output.
 3. The analysis type to perform per_file or per_series. The latter indicates
 we are only interested in DICOM files.
 
@@ -830,6 +825,11 @@ def characterize_data(argv=None):
 python characterize_data.py ../../Data/ Output/generic_image_data_report.csv per_file \
 --configuration_file ../../Data/characterize_data_user_defaults.json 2> errors.txt
 
+You can also run the script directly from GitHub without downloading it or explicitly creating
+a virtual Python environment using the uv Python package and project manager
+(https://github.com/astral-sh/uv):
+uv run https://raw.githubusercontent.com/InsightSoftwareConsortium/SimpleITK-Notebooks/refs/heads/main/Python/scripts/characterize_data.py -h
+
 Output:
 ------
 The output from the script includes:
@@ -893,16 +893,24 @@ def xyz_to_index(x, y, z, thumbnail_size, tile_size):
 tile_size =
 print(df["files"].iloc[xyz_to_index(x, y, z, thumbnail_size, tile_size)])
 
-Caveat:
-------
+Caveats:
+--------
 When characterizing a set of DICOM images, start by running the script in per_file
-mode. This will identify duplicate image files. Remove them before running using the per_series
-mode. If run in per_series mode on the original data the duplicate files will not be identified
-as such, they will be identified as belonging to the same series. In this situation we end up
-with multiple images in the same spatial location (repeated 2D slice in a 3D volume). This will
-result in incorrect values reported for the spacing, image size etc.
+mode. This will identify duplicate images at the file level. Remove them before running
+in per_series mode. If run in per_series mode on data with duplicate files they may
+not be identified as such as they may be identified as belonging to the same series.
+In this situation we end up with multiple images in the same spatial location
+(repeated 2D slice in a 3D volume). This will result in incorrect values reported for the
+spacing, image size etc.
 When this happens you will see a WARNING printed to the terminal output, along the lines of
 "ImageSeriesReader : Non uniform sampling or missing slices detected...".
+
+When file paths are very long and the number of files in a series is large the total
+per cell character count in the "files" column may exceed the cell limits of some
+spreadsheet applications. The limit for Microsoft Excel is 32,767 characters and for
+Google Sheets it is 50,000 characters. When opened with Excel, the contents of the cell are
+truncated and this will corrupt the column layout. The data itself is valid and can be read
+correctly using Python or R.
 """
 # Maximal number of points for which scatterplots are saved in pdf format,
 # otherwise png. Threshold was deterimined empirically based on rendering
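The first caveat in the hunk above recommends removing duplicate files before a per_series run. A generic way to locate byte-identical duplicates is to group files by a content hash; this is a sketch of that common technique, not characterize_data.py's own mechanism.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def find_duplicate_files(root):
    """Group files under root by SHA-256 of their bytes; keep groups of 2+."""
    by_hash = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]

# Demonstrate on a throwaway directory containing one duplicated "image".
root = tempfile.mkdtemp()
for name, payload in [("a.dcm", b"pixels"), ("b.dcm", b"pixels"), ("c.dcm", b"other")]:
    with open(os.path.join(root, name), "wb") as f:
        f.write(payload)
dupes = find_duplicate_files(root)
print([os.path.basename(p) for p in dupes[0]])  # ['a.dcm', 'b.dcm']
```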
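As a sanity check on the spreadsheet caveat above, the following sketch (synthetic paths, hypothetical file name) shows that a "files" cell longer than Excel's 32,767-character limit survives a CSV round trip intact when read with Python's csv module.

```python
import csv
import os
import tempfile

# Build a synthetic "files" cell well past Excel's 32,767-character
# per-cell limit: many comma-joined file paths.
long_cell = ",".join(
    f"/very/long/path/to/study/series/image_{i:05d}.dcm" for i in range(1000)
)
assert len(long_cell) > 32767

path = os.path.join(tempfile.mkdtemp(), "report.csv")  # hypothetical name
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["files"])
    writer.writerow([long_cell])

# The csv module returns the full cell; Excel would silently truncate it.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
round_tripped = rows[1][0]
print(round_tripped == long_cell)  # True
```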
