| 1 | +# Data formats |
| 2 | +Python can handle many data formats "out of the box" using its standard |
| 3 | +library. Reading and writing CSV and XML are illustrated, as well as |
| 4 | +reading binary data generated by, e.g., C programs. |
| 5 | + |
| 6 | +## What is it? |
| 7 | + |
| 8 | +1. Binary data |
| 9 | + |
| 10 | + * `write_doubles.c`: C code to write a sequence of `double` to a binary |
| 11 | + file (square roots of integers). |
| 12 | + * `read_doubles.c`: C code to read a binary file containing a sequence |
| 13 | + of `double`, and print those in ASCII representation to verify the |
| 14 | + contents of the binary file. |
| 15 | + * `Makefile`: to build the C programs. |
| 16 | + * `read_doubles.py`: reads sequences of 8 bytes, unpacks them into |
| 17 | + a Python variable, and prints them in ASCII to standard out. |
| 18 | + * `doubles.bin`: binary file. |
| 19 | + * `variable_length_arrays.c`: C code to write a number of variable |
| 20 | + length arrays as binary data. The length of each array is given as |
| 21 | + a four-byte unsigned integer, and is followed by eight-byte little |
| 22 | + endian encoded double precision floating point values. |
| 23 | + * `read_variable_length_array.py`: Python script to read and print |
| 24 | + variable length arrays. |
| 25 | + * `write_bin_records.py`: Python script that writes a record consisting |
| 26 | + of a fixed length string and an integer. |
| 27 | + * `read_bin_records.py`: Python script that reads a record consisting |
| 28 | + of a fixed length string and an integer. |
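
The variable length array layout described above (a four-byte unsigned length followed by eight-byte little endian doubles) can be handled with the standard library `struct` module. The following is a minimal sketch, not the code in `read_variable_length_array.py`: the function names are hypothetical, and it assumes the length field is little endian as well, which the description does not state.

```python
import io
import struct


def write_arrays(stream, arrays):
    """Write each array as a uint32 length followed by float64 values."""
    for values in arrays:
        # '<' selects little endian without padding; 'I' is a 4-byte
        # unsigned int (assumed byte order), 'd' an 8-byte double
        stream.write(struct.pack('<I', len(values)))
        stream.write(struct.pack(f'<{len(values)}d', *values))


def read_arrays(stream):
    """Yield one list of floats per array until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            break
        if len(header) < 4:
            raise ValueError('truncated length field')
        (length,) = struct.unpack('<I', header)
        payload = stream.read(8 * length)
        if len(payload) != 8 * length:
            raise ValueError('truncated array data')
        yield list(struct.unpack(f'<{length}d', payload))


# round trip through an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
write_arrays(buffer, [[1.0, 2.5], [], [3.0, 4.0, 5.0]])
buffer.seek(0)
arrays = list(read_arrays(buffer))
print(arrays)
```

For a real file, open it with `open(file_name, 'rb')` and pass the file object to `read_arrays`.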
| 29 | + |
| 30 | +1. CSV files |
| 31 | + |
| 32 | + * `write_csv.py`: uses the standard library `csv` module to create |
| 33 | + a CSV file with four columns and five rows. |
| 34 | + * `read_csv.py`: reads a CSV file (e.g., `data.csv`) that has two |
| 35 | + columns, `name` and `weight`, and prints the values to standard output. |
| 36 | + It uses the `csv.Sniffer` class to detect the CSV dialect. |
| 37 | + * `data.csv`: example file to use with `read_csv.py`. |
| 38 | + * `read_commented_csv.py`: illustrates how to handle files that |
| 39 | + are not strictly CSV because they start with a comment header. |
| 40 | + * `data_commented_tabs.csv`: tab separated CSV file |
| 41 | + * `data_commented_commas.csv`: comma separated CSV file |
| 42 | + * `data_commented_semicolon.csv`: semicolon separated CSV file |
| 43 | + * `read_csv_rows.py`: illustrates the default CSV reader |
| 44 | + * `agt_parser.py`: parser for a file format that is partially CSV. |
| 45 | + * `agt_data`: data files to parse using `agt_parser.py`. |
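
Dialect detection as mentioned for `read_csv.py` could look roughly like the sketch below. It is an illustration rather than the script itself: the sample data is made up, and restricting the candidate delimiters to commas, semicolons and tabs is an assumption.

```python
import csv
import io

# made-up sample with the two columns described above, semicolon separated
text = 'name;weight\nalice;61.5\nbob;78.2\ncarol;55.0\n'
stream = io.StringIO(text)

# let the sniffer guess the dialect from the first part of the file,
# then rewind so the reader sees the header line again
sample = stream.read(1024)
dialect = csv.Sniffer().sniff(sample, delimiters=',;\t')
stream.seek(0)

rows = list(csv.DictReader(stream, dialect=dialect))
for row in rows:
    print(row['name'], float(row['weight']))
```

The same pattern works with a file opened via `open(file_name, newline='')` in place of the `io.StringIO` object.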
| 46 | + |
| 47 | +1. JSON files |
| 48 | + |
| 49 | + * `average_age.py`: computes the average age of "people" stored in a JSON |
| 50 | + file. |
| 51 | + * `average_age_functional.py`: computes the average age of "people" |
| 52 | + stored in a JSON file in functional style. |
| 53 | + * `people.json`: JSON file containing personal information. |
| 54 | + |
| 55 | +1. XML files |
| 56 | + |
| 57 | + * `write_xml.py`: creates XML that has a root-level `blocks` element, |
| 58 | + containing `block` elements that are named (by attribute), and can |
| 59 | + have `item` elements, where the latter contain a text element. |
| 60 | + Use the `-h` option to see how to specify parameters. |
| 61 | + XML is generated using the `xml.dom.minidom` module in the standard |
| 62 | + library. |
| 63 | + * `read_xml.py`: reads an XML file in the format described above, |
| 64 | + and writes each item, preceded by its block's name, to standard |
| 65 | + output. However, this program can also deal with nested blocks, i.e., |
| 66 | + an XML file where a `block` element can contain another `block` |
| 67 | + element. |
| 68 | + The SAX parser in the `xml.sax` module is used for parsing the XML. |
| 69 | + A `ContentHandler` is implemented, and a `Block` class is used for |
| 70 | + data representation. |
| 71 | + * `blocks.xml`: example XML file. |
| 72 | + * `nested_blocks.xml`: example XML file containing nested block elements. |
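
Parsing nested `block` elements with a SAX `ContentHandler` can be sketched as follows. This is a simplified illustration, not `read_xml.py`: it tracks block names on a plain stack instead of using a `Block` class, the handler name is hypothetical, and it assumes that the block's name is stored in an attribute called `name`.

```python
import xml.sax


class BlockHandler(xml.sax.ContentHandler):
    """Collect (block name, item text) pairs, supporting nested blocks."""

    def __init__(self):
        super().__init__()
        self.block_names = []  # stack of names of the enclosing blocks
        self.items = []        # (block name, item text) pairs found so far
        self._text = None      # text fragments of the item being read

    def startElement(self, name, attrs):
        if name == 'block':
            # attribute name 'name' is an assumption
            self.block_names.append(attrs.get('name', ''))
        elif name == 'item':
            self._text = []

    def characters(self, content):
        # the parser may deliver the text of one element in several pieces
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == 'item':
            text = ''.join(self._text).strip()
            self.items.append((self.block_names[-1], text))
            self._text = None
        elif name == 'block':
            self.block_names.pop()


xml_data = (b'<blocks><block name="a"><item>x</item>'
            b'<block name="b"><item>y</item></block>'
            b'<item>z</item></block></blocks>')
handler = BlockHandler()
xml.sax.parseString(xml_data, handler)
for block_name, item in handler.items:
    print(block_name, item)
```

The stack makes sure that an item following a nested block is attributed to the outer block again.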
| 73 | + |
| 74 | +1. Text as binary |
| 75 | + |
| 76 | + * `line_indexer.py`: indexes a text file, i.e., it produces a CSV file |
| 77 | + with two columns, the first the file position of the start of each line |
| 78 | + in the text file, the second the length of that line, line endings |
| 79 | + exclusive. The text file is read in binary mode. |
| 80 | + * `index.txt`: example file to index. |
| 81 | + * `read_line_index.py`: test script that takes a text file as input, |
| 82 | + a file position and a line length, and prints the characters read |
| 83 | + to standard output for verification, quoted by '|'. |
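
The indexing idea described above can be sketched in a few lines. This is an illustration under assumptions, not the code of `line_indexer.py`: the function name and sample text are made up, and both `\n` and `\r\n` are treated as line endings.

```python
import csv
import io


def index_lines(stream):
    """Yield (position, length) per line, length excluding line endings."""
    position = 0
    for line in stream:
        yield position, len(line.rstrip(b'\r\n'))
        # advance by the full line, line ending included
        position += len(line)


# in-memory stand-in for a text file opened in binary mode
text = io.BytesIO(b'alpha\nbe\r\n\ngamma')
index = list(index_lines(text))

# write the index as a two-column CSV
out = io.StringIO()
csv.writer(out).writerows(index)

# verification in the spirit of read_line_index.py: seek, read, quote by '|'
position, length = index[1]
text.seek(position)
line = text.read(length).decode('ascii')
print('|' + line + '|')
```

Reading in binary mode matters here: positions are byte offsets, so they remain valid for `seek` regardless of the text encoding or line ending convention.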
| 84 | + |
| 85 | +1. CSV, HDF5 and JSON combination |
| 86 | + |
| 87 | + For more (and better) examples of reading and writing, see |
| 88 | + the [DataStorage](https://github.com/gjbex/training-material/tree/master/DataStorage/Hdf5) section. |
| 89 | + * `data_generator.py`: this script will generate numerical data (integer |
| 90 | + and floating point) using specified random number distributions. The |
| 91 | + column names, types, distributions, and the parameters they require |
| 92 | + are specified in a JSON file that is read. The generated data is |
| 93 | + written to a file in either CSV or HDF5 format. |
| 94 | + * `mixed_data.json`: example JSON definition file for the data table to |
| 95 | + be generated. |
| 96 | + |
| 97 | +1. `VCD` files |
| 98 | + |
| 99 | +## Other formats |
| 100 | + |
| 101 | +Example code for using NetCDF can be found in the [DataStorage](https://github.com/gjbex/training-material/tree/master/DataStorage/NetCDF) section. |