The CSVToFHIR converter Data Contract file is a JSON-based configuration file that defines the CSVToFHIR conversion process. A single data contract file supports multiple CSVToFHIR conversions for a single tenant.
```json
{
  "general": {},
  "fileDefinitions": {}
}
```

| Key Name | Description | Required |
|---|---|---|
| general | Contains general settings for the CSVToFHIR service which apply to all file definitions, such as tenant id, timezone, etc. | Y |
| fileDefinitions | Defines the CSVToFHIR mapping configuration for each CSV source file | Y |
```json
{
  "general": {
    "timeZone": "US/Eastern",
    "tenantId": "tenant1",
    "assigningAuthority": "default-authority",
    "streamType": "live",
    "emptyFieldValues": [
      "empty",
      "\\n"
    ],
    "regexFilenames": true
  }
}
```

| Key Name | Description | Required |
|---|---|---|
| timeZone | The default timezone to apply to datetime values as necessary. The timezone must be a valid tz database/IANA value | Y |
| tenantId | The customer tenant id | Y |
| assigningAuthority | The default assigning authority/system of record, applied to code values where needed | N |
| streamType | Indicates if the incoming data is "historical" or "live" | Y |
| emptyFieldValues | Additional field values which are treated as "empty" or NULL | N |
| regexFilenames | Determines if the filename to fileDefinition matching will be regex based or simple string comparison. Default: False | N |
- timeZone is a valid value as specified by pytz.common_timezones
- streamType is either "historical" or "live"
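The constraints above can be checked up front. The following is a minimal, illustrative validator for the `general` section; the helper name and the exact error messages are assumptions, and the real timezone check against `pytz.common_timezones` is noted but omitted to keep the sketch dependency-free:

```python
# Hypothetical validator for the "general" section of a data contract.
# Key names come from the spec above; the helper itself is illustrative.

REQUIRED_GENERAL_KEYS = {"timeZone", "tenantId", "streamType"}
VALID_STREAM_TYPES = {"historical", "live"}

def validate_general(general: dict) -> list:
    """Return a list of problems found in the general section."""
    problems = []
    missing = REQUIRED_GENERAL_KEYS - general.keys()
    if missing:
        problems.append(f"missing required keys: {sorted(missing)}")
    if general.get("streamType") not in VALID_STREAM_TYPES:
        problems.append("streamType must be 'historical' or 'live'")
    # A full implementation would also verify that timeZone appears in
    # pytz.common_timezones (omitted here to stay stdlib-only).
    return problems

general = {
    "timeZone": "US/Eastern",
    "tenantId": "tenant1",
    "streamType": "live",
}
print(validate_general(general))  # []
```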
The top-level key within a fileDefinition serves as the fileDefinition name. This name is matched against the input CSV filename using either a case-insensitive string match or a case-sensitive regex (see the general.regexFilenames setting).
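The matching behavior can be pictured with a small sketch. Assumptions: the non-regex mode is treated here as a case-insensitive substring match, and the first matching definition wins; the actual implementation may differ in both respects:

```python
import re

# Illustrative filename-to-fileDefinition matching.
# Assumption: non-regex mode is a case-insensitive substring match.

def find_file_definition(filename, file_definitions, regex_filenames=False):
    for name in file_definitions:
        if regex_filenames:
            if re.search(name, filename):          # case-sensitive regex
                return name
        elif name.lower() in filename.lower():     # case-insensitive string match
            return name
    return None

defs = {"Patient": {}, "Encounter": {}}
print(find_file_definition("patient_2023.csv", defs))  # Patient
```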
Two methods of providing a fileDefinition for a file are supported: inline and external.

Inline: the file definition is provided as the value of the filename pattern key:
```json
{
  "fileDefinitions": {
    "Patient": {
      "comment": "patient demographic fields",
      "fileType": "csv",
      "valueDelimiter": ",",
      "convertColumnsToString": true,
      "resourceType": "Patient",
      "groupByKey": "patientId",
      "skiprows": [2],
      "headers": [],
      "tasks": []
    }
  }
}
```

| Key Name | Description | Required |
|---|---|---|
| fileType | The type of source file. Supports "csv" or "fixed-width". Defaults to "csv" | N |
| valueDelimiter | The value, or field, delimiter used in the "CSV" file. Defaults to "," | N |
| comment | Provides an additional description/comment for the file definition | N |
| convertColumnsToString | When true converts all input columns to Python's "str" data type. If False, Pandas will infer the datatype. Defaults to True. | N |
| resourceType | The target FHIR resource type. | Y |
| groupByKey | The field used to associate the record with other records in separate CSV payloads | Y |
| skiprows | Skip rows from the CSV file. The value can be an integer to skip that many lines from the top, or an array to skip rows with those indexes (0-based), e.g. [2, 3] will skip rows 3 and 4 of the file (including headers) | N |
| headers | Provides a header record for a CSV source file without a header. Column names reflect the target record format. When fileType=fixed-width, headers is a required field, and should be a dictionary of type <col_name>:<col_width> | N |
| tasks | List of tasks to execute against the CSV source data, prior to FHIR conversion. | N |
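The interaction of `skiprows` and `headers` for a headerless CSV can be sketched as follows. This is an illustrative, stdlib-only reader; the real converter reads files with pandas, so behavior may differ in detail:

```python
import csv
import io

# Sketch of applying skiprows/headers when reading a CSV source.
# The real implementation uses pandas; this is stdlib-only for clarity.

def read_csv_source(text, headers=None, skiprows=None, delimiter=","):
    """Read CSV text into a list of dicts, honoring headers/skiprows.

    skiprows may be an int (skip N leading rows) or a list of 0-based
    row indexes to drop (counted including any header row).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if isinstance(skiprows, int):
        rows = rows[skiprows:]
    elif skiprows:
        rows = [r for i, r in enumerate(rows) if i not in set(skiprows)]
    if headers is None:                 # first remaining row is the header
        headers, rows = rows[0], rows[1:]
    return [dict(zip(headers, r)) for r in rows]

data = "patientId,sex\njunk,junk\n1,F\n2,M\n"
print(read_csv_source(data, skiprows=[1]))
# [{'patientId': '1', 'sex': 'F'}, {'patientId': '2', 'sex': 'M'}]
```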
External: reference an external JSON file that contains the fileDefinition model. The path can be absolute or relative to the main data contract:
```json
{
  "fileDefinitions": {
    "Patient": "external-patient-file-definition.json"
  }
}
```

- resourceType is a valid FHIR resource type name
- tasks definitions align with pipeline task function implementations
- if fileType is fixed-width, headers are mandatory
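For fixed-width files, the `<col_name>:<col_width>` headers dictionary drives column extraction. A minimal sketch of that slicing (the parser itself is illustrative; the real converter delegates fixed-width reading to pandas):

```python
# Illustrative parser for fileType="fixed-width", where headers is a
# dictionary of <col_name>:<col_width>.

def parse_fixed_width(line: str, headers: dict) -> dict:
    record, pos = {}, 0
    for name, width in headers.items():
        record[name] = line[pos:pos + width].strip()
        pos += width
    return record

headers = {"patientId": 6, "sex": 1, "birthDate": 10}
print(parse_fixed_width("000123F1980-01-01", headers))
# {'patientId': '000123', 'sex': 'F', 'birthDate': '1980-01-01'}
```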
```json
{
  "name": "add_constant",
  "comment": "adds a default ethnic system code to the source data",
  "params": {
    "name": "ethnicitySystem",
    "value": "http://terminology.hl7.org/CodeSystem/v3-Ethnicity"
  }
}
```

| Key Name | Description | Required |
|---|---|---|
| name | The task name | Y |
| comment | Additional comment/documentation for the task | N |
| params | Dictionary of task parameters | N |
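The task structure lends itself to a simple dispatch loop: each entry's `name` selects a task function and `params` is unpacked as keyword arguments. The registry and the list-of-dicts record model below are assumptions for illustration; the real pipeline operates on pandas DataFrames:

```python
# Hedged sketch of task dispatch: "name" selects a function, "params" are
# passed as keyword arguments. The registry and the list-of-dicts record
# model are illustrative only.

def add_constant(records, name, value):
    """Add a new column with a constant value to every record."""
    return [{**r, name: value} for r in records]

TASK_REGISTRY = {"add_constant": add_constant}

def run_tasks(records, tasks):
    for task in tasks:
        fn = TASK_REGISTRY[task["name"]]
        records = fn(records, **task.get("params", {}))
    return records

tasks = [{"name": "add_constant",
          "params": {"name": "ssnSystem",
                     "value": "http://hl7.org/fhir/sid/us-ssn"}}]
print(run_tasks([{"patientId": "1"}], tasks))
```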
| Task Name | Description | Parameters | Examples |
|---|---|---|---|
| add_constant | Creates an additional column with a constant value assigned | `name`: constant name used as the new column name<br>`value`: the constant value | `{"name": "add_constant", "params": {"name": "ssnSystem", "value": "http://hl7.org/fhir/sid/us-ssn"}}` |
| add_row_num | | | |
| append_list | | | |
| build_object_array | | | |
| change_case | | | |
| compare_to_date | | | |
| conditional_column | Creates a new column by mapping the values from a source column to a target value. Supports inline mappings as a dictionary, and external mappings using a file name. If no mapping is found, the "default" value is used when present; otherwise the existing source value is kept | `source_column`: the source column for the new conditional column<br>`condition_map`: maps values from the source column to the desired target values, or a filename that contains the mappings<br>`target_column`: the new target column | Inline mapping: `{"name": "conditional_column", "params": {"source_column": "raceText", "target_column": "raceCode", "condition_map": {"american indian": "1002-5", "asian": "2028-9", "black": "2054-5", "pacific islander": "2076-8", "white": "2106-3", "default": "2131-1"}}}`<br>External file map: `{"name": "conditional_column", "params": {"source_column": "raceText", "target_column": "raceCode", "condition_map": "race.csv"}}` |
| conditional_column_update | | | |
| condition_column_with_prerequisite | | | |
| convert_to_list | | | |
| copy_columns | Copies one or more source columns to a target column | `columns`: list of column(s) to copy<br>`target_column`: name of the column to be created<br>`value_separator`: character used when multiple columns are concatenated. Defaults to " " | |
| filter_to_columns | | | |
| find_not_null_value | | | |
| format_date | Formats date string values within a column to a target format | `columns`: the column name(s) to update<br>`date_format`: the date format to apply to the column(s). Defaults to "%Y-%m-%d" | `{"name": "format_date", "params": {"columns": ["dateOfBirth"], "date_format": "%Y-%m-%d"}}` |
| map_codes | Maps 'codes' values to a target representation. map_codes supports inline mappings as a dictionary, and external mappings using a file name. If a "default" mapping is provided, any value that does not match another mapping key is given this value | `code_map`: contains a mapping from source value to target value for a given set of fields, or the name of a file which contains the mappings | Internal data contract mapping: `{"name": "map_codes", "params": {"code_map": {"sex": {"default": "unknown", "F": "female", "M": "male", "O": "other"}}}}`<br>External mapping: `{"name": "map_codes", "params": {"code_map": {"sex": "sex.csv"}}}` |
| rename_columns | Renames column(s) | `column_map`: a dictionary which maps the source column names to the target column names | `{"name": "rename_columns", "params": {"column_map": {"hospitalId": "assigningAuthority", "givenName": "nameFirstMiddle", "familyName": "nameLast", "sex": "gender", "dateOfBirth": "birthDate"}}}` |
| replace_text | | | |
| remove_whitespace_from_columns | | | |
| set_nan_to_none | | | |
| split_column | | | |
| split_row | Splits a record "row" on a column or columns, creating N additional rows for each column included within the split operation. Creates additional columns for the "label" and "value" | `columns`: the column(s) to split on<br>`split_column_name`: the column name or header used for the "label" column<br>`split_value_column_name`: the column name or header used for the "value" column | `{"name": "split_row", "params": {"columns": ["height", "weight", "bmi"], "split_column_name": "observationCodeText", "split_value_column_name": "observationValue"}}` |
| validate_value | Validates the value of the column against the provided regex to confirm a complete match. An alternative value can be provided which will be used in case the regex does not match | `column_name`: name of the column to validate<br>`regex`: regex to validate against<br>`no_match_replacement`: replacement value if the regex does not match; defaults to None | `{"name": "validate_value", "params": {"column_name": "company email", "regex": "^[a-zA-Z0-9_.]+@company\\.com", "no_match_replacement": "invalid company email"}}` |
| join_data | Takes a secondary file (csv or fixed-width) and joins the supplementary data with the primary dataframe based on a common joining key | `secondary_data_source`: path to the secondary data file; can be relative to the data contract directory or absolute<br>`join_type`: {'left', 'right', 'outer', 'inner', 'cross'}, which correspond roughly to the join types in relational databases by the same name. See the "how" parameter of pandas.DataFrame.merge: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html<br>`join_on`: key used to correlate the two data sets. The key must be named exactly the same in both datasets<br>`source_type`: csv or fixed-width. Default: csv<br>`reader_params`: any additional parameters that need to be passed to pandas for reading the secondary file. Default: None | `{"name": "join_data", "params": {"secondary_data_source": "/path/to/secondary/file.csv", "join_type": "outer", "join_on": "MRN", "source_type": "csv", "reader_params": {"some_panda_reader_param": "param value"}}}` |
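Several of the task behaviors in the table above can be sketched as plain-Python functions over list-of-dict records. These are illustrative only: the real pipeline operates on pandas DataFrames, and the function signatures, the `input_formats` list for date parsing, and the use of `re.fullmatch` for the "complete match" rule are assumptions:

```python
import re
from datetime import datetime

# Illustrative sketches of task behaviors; real signatures may differ.

def conditional_value(source_value, condition_map):
    """conditional_column: map a value, fall back to 'default', else keep it."""
    if source_value in condition_map:
        return condition_map[source_value]
    return condition_map.get("default", source_value)

def format_date(value, date_format="%Y-%m-%d",
                input_formats=("%m/%d/%Y", "%Y-%m-%d", "%Y%m%d")):
    """format_date: normalize a date string; unparseable values pass through."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value, fmt).strftime(date_format)
        except ValueError:
            continue
    return value

def split_row(record, columns, split_column_name, split_value_column_name):
    """split_row: one wide row becomes N label/value rows."""
    base = {k: v for k, v in record.items() if k not in columns}
    return [{**base, split_column_name: col, split_value_column_name: record[col]}
            for col in columns]

def validate_value(value, regex, no_match_replacement=None):
    """validate_value: require a complete regex match, else substitute."""
    return value if re.fullmatch(regex, value) else no_match_replacement

print(conditional_value("asian", {"asian": "2028-9", "default": "2131-1"}))  # 2028-9
print(format_date("01/31/1980"))                                            # 1980-01-31
print(split_row({"patientId": "1", "height": "180"}, ["height"],
                "observationCodeText", "observationValue"))
print(validate_value("jane@company.com", r"[a-zA-Z0-9_.]+@company\.com"))
```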
CsvToFHIR uses the smart_open library to read the data contract and any referenced file definitions within it. This allows CsvToFHIR to seamlessly support data contract files stored in external cloud storage such as S3, Azure Blob Storage, etc. (see the smart_open documentation for a full list of supported platforms).
To use an external cloud storage vendor, additional dependencies may be required which are not automatically installed by CsvToFHIR. For example, to support Azure storage, install the azure extras package for smart_open: pip install smart_open[azure]. Again, see the smart_open documentation for additional information and examples.
A sample configuration to use a data contract stored in Azure would look like:
```shell
export mapping_config_directory=azure://my_bucket/my_prefix/
export mapping_config_file_name=data-contract.json
```