Whenever we change a persistence format or a communication format, we need to ensure the change is backwards compatible. Examples of these formats:
- DynamoDB tables
- SQS messages (usually JSON)
- JSON files, e.g. in S3
- Properties files, e.g. instance/table properties
- Arrow files, other binary formats
Bear in mind the old code will still be active for some time after the deployment. This includes:
- ECS tasks on the old version that may continue to run for hours
- EMR clusters with jobs queued with an old version of the jar
- Clients that may not be upgraded for some time
For example, compaction tasks and bulk import jobs will continue to submit state store commits to SQS, and large transactions to S3, with the old version of the code.
This means values generated by the old version need to be readable by the new version, and values generated by the new version need to be readable by the old version as well. This needs to be true regardless of which version we're upgrading from and to.
For every JSON format, every properties file, every DynamoDB table, and every binary format (e.g. Arrow files), ensure that any change does not:
- Remove any pre-existing fields
- Stop populating any pre-existing fields
- Rename any pre-existing fields
- Change any existing structure
- Change the behaviour of existing fields
For the values serialised in a format, ensure that all previously supported values will still result in the expected behaviour after an upgrade. For example, if there's an enum, we should not remove any enum values, because serialised values of that enum may already be stored in a deployed instance.
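As a minimal sketch of the enum case, the following keeps a superseded value so that stored data still parses, and maps it onto the behaviour it should now have. The enum `IngestMode`, its values, and the methods `fromStored` and `effective` are hypothetical names made up for illustration; they are not taken from the codebase.

```java
import java.util.Locale;

/** Hypothetical enum; the names are illustrative and not from the codebase. */
enum IngestMode {
    STANDARD,
    BATCHED,
    /** Superseded by BATCHED, but kept because stored data may still contain it. */
    @Deprecated
    LEGACY_BATCH;

    /** Parses a stored value, accepting every name that has ever been written. */
    static IngestMode fromStored(String stored) {
        return valueOf(stored.trim().toUpperCase(Locale.ROOT));
    }

    /** Maps a retired value onto the behaviour it should have today. */
    IngestMode effective() {
        return this == LEGACY_BATCH ? BATCHED : this;
    }
}
```

Had `LEGACY_BATCH` been deleted instead, `valueOf` would throw `IllegalArgumentException` for any stored value that still uses it.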
For every field we've added, ensure that functionality will be unaffected if that field is missing. Expect the system and its clients to continue to emit values that do not contain that field during the deployment and for hours afterwards.
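One way to picture this is a reader that treats a newly added field as optional. The sketch below assumes the message has already been decoded into a map of field names to values; the class `CommitRequest` and the field names `tableId` and `requestId` are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical message type; the class and field names are illustrative only. */
final class CommitRequest {
    final String tableId;
    /** Added in a later version, so older writers will not set it. */
    final Optional<String> requestId;

    private CommitRequest(String tableId, Optional<String> requestId) {
        this.tableId = tableId;
        this.requestId = requestId;
    }

    /** Builds a request from decoded fields, ignoring any fields it does not know. */
    static CommitRequest fromFields(Map<String, String> fields) {
        String tableId = fields.get("tableId");
        if (tableId == null) {
            throw new IllegalArgumentException("Missing required field: tableId");
        }
        // An absent requestId means the message came from an older writer,
        // so behaviour should be the same as before the field existed.
        return new CommitRequest(tableId, Optional.ofNullable(fields.get("requestId")));
    }
}
```

Because the reader only looks up the fields it knows about, it also tolerates extra fields written by a newer version. A unit test that feeds it a message without `requestId` pins the old format in place.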
Ensure that unit test coverage is still present for support for the format as it was before any change.
This includes all SQS messages, all JSON files held in S3, and all JSON data held in DynamoDB tables.
If instance/table properties are renamed, the old name should still work.
If properties need to be removed, they should be first documented as deprecated, with an explanation of why and recommendations for how to proceed. Deprecated properties should be retained for a significant period. When a property is removed, the system should continue to behave as the user would expect. Nothing should fail because a property has been removed. Ideally the property should still be recognised so the user can be informed of what happened.
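The renaming case could be sketched as a lookup that falls back to the old name. This is only an illustration of the idea, not an existing mechanism: the class `PropertyAliases` and the property names `example.old.name` and `example.new.name` are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical alias lookup; the property names are made up for illustration. */
final class PropertyAliases {
    // Maps each current property name to the name it replaced.
    private static final Map<String, String> NEW_TO_OLD = Map.of(
            "example.new.name", "example.old.name");

    private PropertyAliases() {
    }

    /** Reads a property by its current name, falling back to its old name. */
    static Optional<String> get(Properties properties, String name) {
        String value = properties.getProperty(name);
        if (value != null) {
            return Optional.of(value);
        }
        String oldName = NEW_TO_OLD.get(name);
        if (oldName == null) {
            return Optional.empty();
        }
        String oldValue = properties.getProperty(oldName);
        if (oldValue != null) {
            // Tell the user what happened rather than silently accepting it.
            System.err.println("Property " + oldName + " is deprecated, use " + name);
        }
        return Optional.ofNullable(oldValue);
    }
}
```

When both names are set, this sketch prefers the current name. That is a design choice rather than a requirement of the text above.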
At time of writing we don't have specific mechanisms to manage these changes. When we deprecate a property we can just state this in the property description. If we want to rename any properties we'll need to add a mechanism to support the old name as an alias. We have an issue for these mechanisms:
If we want to change the default value of a property or JSON field, we need to make sure this will not produce unexpected behaviour if the user has an instance deployed with this property unset. For example, if we want to change the default state store implementation, we would need some handling to remember what the default implementation was when a table was created. If we didn't do that, existing tables would suddenly forget their table state, because there's no state stored in the new implementation type.
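One possible form of that handling is to write the resolved default into the table's stored properties at creation time, so that a later change of default does not affect the table. The sketch below is an assumption about how this might look, not a description of an existing mechanism; the class `TableDefaults`, the property name `example.table.statestore.type` and the value `example-default-store` are hypothetical.

```java
import java.util.Properties;

/** Hypothetical helper; the property name and default value are illustrative. */
final class TableDefaults {
    static final String STATE_STORE_PROPERTY = "example.table.statestore.type";
    static final String CURRENT_DEFAULT = "example-default-store";

    private TableDefaults() {
    }

    /** Returns the properties to store for a new table, with the default pinned. */
    static Properties onCreate(Properties userSupplied) {
        Properties stored = new Properties();
        stored.putAll(userSupplied);
        // Record today's default explicitly so the table keeps it
        // even if the default changes in a later version.
        if (stored.getProperty(STATE_STORE_PROPERTY) == null) {
            stored.setProperty(STATE_STORE_PROPERTY, CURRENT_DEFAULT);
        }
        return stored;
    }
}
```

Tables created before such a mechanism existed would have no pinned value, so reading them would still need to fall back to the default that applied at the time.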
We can distinguish between external formats and internal formats. Usually internal formats should not use default values, so that once the data has been created its behaviour will not change. This is particularly relevant for internal SQS messages. For external formats where data is supplied by the user, it can be more practical to use defaults.
We have an issue for a mechanism for this for configuration properties:
DynamoDB tables have a partially defined schema, in the sense that they have a partition key and sort key which are defined when the table is created. We will not be able to change this aspect of the table definition.
We also need to treat the tables as having an implied schema. We need to maintain support for all existing data held in a table. Because different components are deployed in parallel, we also need to assume that old processes will continue to run, reading from and writing to the table. That means any new data needs to be readable by old processes, and old data needs to be readable by new processes. All the usual restrictions on persistent data formats (e.g. JSON) also apply to DynamoDB fields.