Data Contract

The CSVToFHIR converter Data Contract file is a JSON-based configuration file that defines the CSVToFHIR conversion process. A single data contract file supports multiple CSVToFHIR conversions for a single tenant.

Resource Mapping Specification

Top Level Keys

{
  "general": {},
  "fileDefinitions": {}
}
  • general (required): General settings for the CSVToFHIR service which apply to all file definitions, such as tenant id and timezone
  • fileDefinitions (required): The CSVToFHIR mapping configuration for each CSV source file

General

{
  "general": {
    "timeZone": "US/Eastern",
    "tenantId": "tenant1",
    "assigningAuthority": "default-authority",
    "streamType": "live",
    "emptyFieldValues": [
      "empty",
      "\\n"
    ],
    "regexFilenames": true
  }
}
  • timeZone (required): The default timezone applied to datetime values as necessary. The timezone must be a valid tz database/IANA value
  • tenantId (required): The customer tenant id
  • assigningAuthority (optional): The default assigning authority/system of record, applied to code values where needed
  • streamType (required): Indicates whether the incoming data is "historical" or "live"
  • emptyFieldValues (optional): Additional field values which are treated as "empty" or NULL
  • regexFilenames (optional): Determines whether filename-to-fileDefinition matching is regex based or a simple string comparison. Defaults to false

Validations

  • timeZone is a valid value as specified by pytz.common_timezones
  • streamType is either historical or live

FileDefinition

The top-level key within a FileDefinition serves as the FileDefinition name. This name is matched against the input CSV file using either string match (case-insensitive) or regex (case-sensitive) [see general.regexFilenames setting].
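For example, with general.regexFilenames set to true, a fileDefinition key can be a regular expression matched against the incoming filename. A minimal sketch, in which the pattern, tenant, and field values are illustrative rather than taken from a real contract:

```json
{
  "general": {
    "tenantId": "tenant1",
    "timeZone": "US/Eastern",
    "streamType": "live",
    "regexFilenames": true
  },
  "fileDefinitions": {
    "patient_\\d{8}\\.csv": {
      "resourceType": "Patient",
      "groupByKey": "patientId",
      "tasks": []
    }
  }
}
```

Because regex matching is case-sensitive, a file named Patient_20230101.csv would not match the pattern above, while patient_20230101.csv would.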

Two methods of providing a fileDefinition for a file are supported: inline and external.

Inline

The fileDefinition is provided inline as the value of the filename pattern key:

{
 "fileDefinitions": {
    "Patient": {
     "comment": "patient demographic fields",
      "fileType": "csv",
      "valueDelimiter": ",",
      "convertColumnsToString": true,
      "resourceType": "Patient",
      "groupByKey": "patientId",
      "skiprows": [2],
      "headers": [],
      "tasks": []
    }
  }
}
  • fileType (optional): The type of source file: "csv" or "fixed-width". Defaults to "csv"
  • valueDelimiter (optional): The value (field) delimiter used in the CSV file. Defaults to ","
  • comment (optional): An additional description/comment for the file definition
  • convertColumnsToString (optional): When true, converts all input columns to Python's "str" data type; when false, Pandas infers the datatype. Defaults to true
  • resourceType (required): The target FHIR resource type
  • groupByKey (required): The field used to associate the record with other records in separate CSV payloads
  • skiprows (optional): Rows to skip in the CSV file. The value can be an integer to skip that many lines from the top, or an array of 0-based row indexes to skip; e.g. [2, 3] skips rows 3 and 4 of the file (including headers)
  • headers (optional): Provides a header record for a CSV source file without a header; column names reflect the target record format. When fileType is "fixed-width", headers is required and must be a dictionary of <col_name>: <col_width> entries
  • tasks (optional): A list of tasks to execute against the CSV source data prior to FHIR conversion
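As an illustration, a fixed-width file definition declares headers as a <col_name>: <col_width> dictionary. The definition name, column names, and widths below are assumptions for the sketch, not values from a real contract:

```json
{
  "fileDefinitions": {
    "PatientFixed": {
      "fileType": "fixed-width",
      "resourceType": "Patient",
      "groupByKey": "patientId",
      "headers": {
        "patientId": 10,
        "nameLast": 20,
        "nameFirstMiddle": 20,
        "birthDate": 8
      },
      "tasks": []
    }
  }
}
```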

External

References an external JSON file that contains the fileDefinition model. The path can be absolute or relative to the main data contract.

{
 "fileDefinitions": {
    "Patient": "external-patient-file-definition.json"
  }
}

Validations

  • resourceType is a valid FHIR resource type name
  • tasks definitions align with pipeline task function implementations
  • if fileType is fixed-width, headers is mandatory

Tasks

{
 "name": "add_constant",
 "comment": "adds a default ethnic system code to the source data",
 "params": {
    "name": "ethnicitySystem",
    "value": "http://terminology.hl7.org/CodeSystem/v3-Ethnicity"
 }
}
  • name (required): The task name
  • comment (optional): Additional comment/documentation for the task
  • params (optional): A dictionary of task parameters

Supported Tasks

Each supported task is listed below, with its description, parameters, and examples where available.
add_constant: Creates an additional column with a constant value assigned.
Parameters:
  • name: the constant name, used as the new column name
  • value: the constant value
Example:
{
  "name": "add_constant",
  "params": {
    "name": "ssnSystem",
    "value": "http://hl7.org/fhir/sid/us-ssn"
  }
}
add_row_num

append_list

build_object_array

change_case

compare_to_date
conditional_column: Creates a new column by mapping the values from a source column to a target value. Supports inline mappings as a dictionary, and external mappings using a file name. If a mapping is not found, the "default" value is used if present; otherwise the existing source value is kept.
Parameters:
  • source_column: the source column for the new conditional column
  • condition_map: maps values from the source column to the desired target values, or a filename that contains the mappings
  • target_column: the new target column
Inline mapping:
{
  "name": "conditional_column",
  "params": {
    "source_column": "raceText",
    "target_column": "raceCode",
    "condition_map": {
      "american indian": "1002-5",
      "asian": "2028-9",
      "black": "2054-5",
      "pacific islander": "2076-8",
      "white": "2106-3",
      "default": "2131-1"
    }
  }
}

External file map:
{
  "name": "conditional_column",
  "params": {
    "source_column": "raceText",
    "target_column": "raceCode",
    "condition_map": "race.csv"
  }
}

conditional_column_update

condition_column_with_prerequisite

convert_to_list
copy_columns: Copies one or more source columns to a target column.
Parameters:
  • columns: a list of column(s) to copy
  • target_column: the name of the column to be created
  • value_separator: the character used when multiple columns are concatenated; defaults to " "
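The source does not include a copy_columns example; a sketch using the documented parameters (the column names are illustrative):

```json
{
  "name": "copy_columns",
  "params": {
    "columns": [
      "nameFirst",
      "nameMiddle"
    ],
    "target_column": "nameFirstMiddle",
    "value_separator": " "
  }
}
```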
      
filter_to_columns

find_not_null_value
format_date: Formats date string values within a column to a target format.
Parameters:
  • columns: the column name(s) to update
  • date_format: the date format to apply to the column(s); defaults to "%Y-%m-%d"
Example:
{
  "name": "format_date",
  "params": {
    "columns": [
      "dateOfBirth"
    ],
    "date_format": "%Y-%m-%d"
  }
}

map_codes: Maps "code" values to a target representation. map_codes supports inline mappings as a dictionary, and external mappings using a file name. If a "default" mapping is provided, any value that does not match another mapping key is given this value.
Parameters:
  • code_map: a mapping from source value to target value for a given set of fields, or the name of a file which contains the mappings
Internal data contract mapping:
{
  "name": "map_codes",
  "params": {
    "code_map": {
      "sex": {
        "default": "unknown",
        "F": "female",
        "M": "male",
        "O": "other"
      }
    }
  }
}

External mapping:
{
  "name": "map_codes",
  "params": {
    "code_map": {
      "sex": "sex.csv"
    }
  }
}

rename_columns: Renames column(s).
Parameters:
  • column_map: a dictionary which maps the source column names to the target column names
Example:
{
  "name": "rename_columns",
  "params": {
    "column_map": {
      "hospitalId": "assigningAuthority",
      "givenName": "nameFirstMiddle",
      "familyName": "nameLast",
      "sex": "gender",
      "dateOfBirth": "birthDate"
    }
  }
}

replace_text

remove_whitespace_from_columns

set_nan_to_none

split_column
split_row: Splits a record "row" on a column or columns, creating N additional rows for each column included within the split operation. Creates additional columns for the "label" and "value".
Parameters:
  • columns: the column(s) to split on
  • split_column_name: the column name or header used for the "label" column
  • split_value_column_name: the column name or header used for the "value" column
Example:
{
  "name": "split_row",
  "params": {
    "columns": [
      "height",
      "weight",
      "bmi"
    ],
    "split_column_name": "observationCodeText",
    "split_value_column_name": "observationValue"
  }
}

validate_value: Validates the value of the column against the provided regex to confirm a complete match. An alternative value can be provided, which is used when the regex does not match.
Parameters:
  • column_name: the name of the column to validate
  • regex: the regex to validate against
  • no_match_replacement: the replacement value if the regex does not match; defaults to None
Example:
{
  "name": "validate_value",
  "params": {
    "column_name": "company email",
    "regex": "^[a-zA-Z0-9_\\.]+@company\\.com",
    "no_match_replacement": "invalid company email"
  }
}
      
join_data: Takes a secondary file (CSV or fixed-width) and joins the supplementary data with the primary dataframe based on a common joining key.
Parameters:
  • secondary_data_source: the path to the secondary data file; can be absolute or relative to the data contract directory
  • join_type: one of "left", "right", "outer", "inner", "cross", which correspond roughly to the relational database join types of the same name. See the "how" parameter of pandas.DataFrame.merge: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
  • join_on: the key used to correlate the two data sets; the key must be named exactly the same in both datasets
  • source_type: "csv" or "fixed-width"; defaults to "csv"
  • reader_params: any additional parameters to pass to pandas when reading the secondary file; defaults to None
Example:
{
  "name": "join_data",
  "params": {
    "secondary_data_source": "/path/to/secondary/file.csv",
    "join_type": "outer",
    "join_on": "MRN",
    "source_type": "csv",
    "reader_params": {
      "some_panda_reader_param": "param value"
    }
  }
}

Alternative Data Contract Locations

CsvToFHIR uses the smart_open library to read the data contract and any referenced file definitions within it. This allows CsvToFHIR to seamlessly support data contract files stored in external cloud storage such as S3, Azure Blob Storage, etc. (see the smart_open documentation for a full list of supported platforms).

In order to use an external cloud storage vendor, additional dependencies may be required which are not automatically installed by CsvToFHIR. For example, to support Azure storage, install smart_open's azure extras package: pip install smart_open[azure]. Again, see the smart_open documentation for additional information and examples.

A sample configuration to use a data contract stored in azure would look like:

export mapping_config_directory=azure://my_bucket/my_prefix/
export mapping_config_file_name=data-contract.json