exploration report on encryption in IPLD

warpfork · warpfork · commit 32669e484e8b · 2021-02-19T14:22:26.000+01:00
diff --git a/design/history/exploration-reports/2021.01-encryption.md b/design/history/exploration-reports/2021.01-encryption.md
@@ -0,0 +1,262 @@
+Encryption and IPLD, 2021
+=========================
+
+This is an exploration report about the role and relationship of encryption relating to IPLD,
+gathering some thoughts and recent updates in early 2021.
+It's meant to be useful as a reference piece for further discussion at this time.
+
+This document takes input from tons of people.
+It's written by warpfork in the immediate aftermath of close conversation with Mikeal,
+but also has tons of input from Carson of Textile (via the 2021.01.11 IPLD Weekly meeting),
+and also draws on other notes exchanged over time with projects such as the Ceramic Network, 3Box, Peergos, Qri, and others.
+(Even if you don't see your name here, it's likely you've contributed something --
+this topic has just been a long time brewing, so attributing all inspirations completely is now hard!)
+Thank you to all these folks for their efforts.
+
+
+Overview
+--------
+
+People frequently want to implement encryption as part of decentralized systems.
+So, it's no surprise that people also frequently want a story for how encryption and IPLD should interact.
+
+For a long time, IPLD has been agnostic to any sort of encryption.
+(We've been afraid of doing a _wrong_ thing and baking it into specs.)
+Instead, we've asked that people using IPLD figure out how to compose IPLD and encryption on their own.
+It may now be time for this to change, as we gather lots of input from the community.
+
+In this document, we're going to cover three major topic groups:
+
+- 1. A proposal for encryption primitives in IPLD, and a plan for how to use multicodec indicators for encryption!
+- 2. A section about usage conventions we see which have repeatedly emerged, and seem useful, and thus now seem worth identifying and creating vocabulary for.
+- 3. A section for gathering notes about use cases, tradeoff notes, and general cautions about general applied cryptography.
+
+Comments and feedback on each of these topic groups are welcome.
+
+At the end of this document, _we will still expect IPLD to be agnostic to encryption_ in the sense that you can bring your own concepts and layer them in IPLD as you please,
+but we _may_ also have some ideas for cryptographic primitives we might give some extra support and coordination for.
+
+
+Encryption Codecs
+-----------------
+
+The IPLD team is now considering encryption which is signaled by multicodec indicators
+(and thus works anywhere CIDs are used),
+and works in the natural way an IPLD codec is expected to work.
+
+(This is a big change in stance.
+Previously, we've considered it unclear whether codecs are the right place for this.)
+
+There are a couple of details about how we expect this to work which are recent realizations,
+and so this document might be nearly the first description of them:
+
+
+### encryption codecs use multicodec indicators
+
+As stated in the summary above: encryption will use multicodec indicators.
+
+This means we'll reserve new numbers in the multicodec table.
+We'll expect to see values like "AES-GCM" appear in the same table as "DAG-CBOR".
+
+
+### encryption codecs are still codecs of the usual contracts
+
+Codecs which do encryption will look like regular IPLD codecs.
+
+What does this mean?  Well, in our recent improvements to formalizations, we now describe a codec as
+the operation "decode" -- `function (rawByteStream) -> (ipldDataModelNode | error)` --
+and the operation "encode" -- `function (ipldDataModelNode, writeableBytestream) -> (error)`.
+(Loosely.  This is pseudocode, not any particular programming language.)
+
+(Okay, what did _that_ mean?  ;)  ...I'll do it again in plain language.)
+
+The key detail that is important for IPLD's soundness is:
+the encoded data stream must be transformable to a "node" -- which must be describable _entirely_ and _purely_ by the IPLD Data Model --
+and then back again, from that "node" to an encoded data stream.
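+
A minimal sketch of that contract, assuming plain Python values stand in for Data Model nodes and exceptions stand in for the error return. `ToyJsonCodec` and `round_trip` are hypothetical names made up for illustration; this is not DAG-JSON and not the API of any IPLD library.

```python
import io
import json
from typing import Any, BinaryIO

# Hypothetical stand-in for an IPLD Data Model node: plain Python values.
Node = Any


class ToyJsonCodec:
    """Illustrative codec only; not DAG-JSON, not a real IPLD library API."""

    def decode(self, raw: BinaryIO) -> Node:
        # decode: (rawByteStream) -> (node | error); errors surface as exceptions.
        return json.loads(raw.read().decode("utf-8"))

    def encode(self, node: Node, out: BinaryIO) -> None:
        # encode: (node, writeableBytestream) -> (error); note: no other parameters.
        text = json.dumps(node, sort_keys=True, separators=(",", ":"))
        out.write(text.encode("utf-8"))


def round_trip(codec: ToyJsonCodec, data: bytes) -> bytes:
    """Decode bytes to a node, then encode that node back to bytes."""
    node = codec.decode(io.BytesIO(data))
    out = io.BytesIO()
    codec.encode(node, out)
    return out.getvalue()
```

The point of the sketch is the shape of the two signatures: neither one has a slot where key material could be passed.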
+
+Okay, background established.  Now: why does this matter to encryption?
+
+Two reasons:
+
+- that contract means *no additional parameters* are allowed.  So, for encryption, it means keys don't -- *can't* -- enter into this yet.
+- that contract means we always have to be able to transform the encoded form into *something*.
+
+
+#### encryption codecs are defined as destructuring ciphertext
+
+... *not* as yielding cleartext.  This may be unintuitive, but is important.
+
+First, an example: many encryption schemes have two components in their ciphertext:
+some sort of "initialization vector" (commonly known as an "IV"), which is a number unique to that ciphertext;
+and then the ciphertext body itself.
+So, for such an encryption scheme, the relevant IPLD codec would probably produce a _map_, matching this schema:
+
+```ipldsch
+type CodecResult struct {
+	iv   Bytes
+	body Bytes
+} representation map
+```
+
+(The actual serial form may look like anything it wants (likely, some binary length-prefixed format),
+because that's the responsibility of the codec implementation to define;
+this small schema just describes the Data Model view we might expect to be yielded.)
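+
As a rough illustration of such a codec, the sketch below destructures a made-up length-prefixed wire format into the `iv`/`body` map described by the schema. The wire layout (1-byte IV length, 4-byte big-endian body length) and the name `ToyCiphertextCodec` are assumptions invented here, not a proposed format.

```python
import io
import struct
from typing import BinaryIO, Dict

# Hypothetical wire format, invented for this sketch:
#   1 byte IV length | IV bytes | 4-byte big-endian body length | body bytes


class ToyCiphertextCodec:
    """Destructures ciphertext into a map; never touches key material."""

    def decode(self, raw: BinaryIO) -> Dict[str, bytes]:
        data = raw.read()
        if len(data) < 1:
            raise ValueError("truncated: missing IV length")
        iv_end = 1 + data[0]
        if len(data) < iv_end + 4:
            raise ValueError("truncated: missing IV or body length")
        iv = data[1:iv_end]
        (body_len,) = struct.unpack(">I", data[iv_end:iv_end + 4])
        body = data[iv_end + 4:]
        if len(body) != body_len:
            raise ValueError("body length mismatch")
        # The Data Model view: a map with two bytes fields.
        return {"iv": iv, "body": body}

    def encode(self, node: Dict[str, bytes], out: BinaryIO) -> None:
        iv, body = node["iv"], node["body"]
        if len(iv) > 255:
            raise ValueError("IV too long for a 1-byte length prefix")
        out.write(bytes([len(iv)]) + iv + struct.pack(">I", len(body)) + body)
```

Decoding succeeds for any well-formed input, whether or not the caller holds a key.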
+
+This is neat in several ways:
+
+- It means that processing the data into Data Model is always *defined* -- even if you don't have key material.
+	- This in turn means IPLD Selectors, and all sorts of other stuff, *work normally* over encrypted data.
+	  (Not over the cleartext, obviously -- then the encryption wouldn't be doing much, would it?
+	  But their operation is *defined*, so they can be used safely and predictably.)
+- It means we have a way to access the ciphertext.
+	- ... That may not sound like a big deal, but it's been a weird and interesting bugaboo in a lot of other previous proposals about how to fit encryption into IPLD.
+- It means we don't have to solve the problem of how to get key material into a codec.
+	- This is a big deal because it means, well, a bunch of our abstraction layers in IPLD don't... uh, shatter.  Good.
+
+Okay, but how do we get to cleartext then?  Let us proceed to the next section!
+
+#### getting to cleartext when using encryption codecs involves feature detection
+
+Encryption codecs in IPLD libraries will have extra methods on them, and support some kind of "feature detection" to advertise this.
+Those additional methods will accept key material as a parameter, and return an IPLD Data Model Node... of the *cleartext*.
+(I.e., the "node" returned here, and the "node" returned by the codec alone, will be *very* different data.)
+
+How exactly this looks will vary by language and library implementation;
+different languages will have different idioms for doing feature detection.
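+
One way this could look in Python, as a sketch only: feature detection via a runtime-checkable `Protocol`. The names `Decrypter`, `ToyEncryptionCodec`, `PlainCodec`, and `cleartext_if_supported` are hypothetical, and `_toy_xor` is a toy keystream that is NOT secure; a real codec would use a real cipher.

```python
import hashlib
import json
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class Decrypter(Protocol):
    """Hypothetical optional capability; not taken from any IPLD library."""

    def decrypt(self, node: Dict[str, bytes], key: bytes) -> Any: ...


def _toy_xor(key: bytes, iv: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Toy only: NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class PlainCodec:
    """A codec with no decryption capability."""


class ToyEncryptionCodec:
    """A codec that also advertises the extra decrypt method."""

    def encrypt(self, clear_node: Any, key: bytes, iv: bytes) -> Dict[str, bytes]:
        clear = json.dumps(clear_node).encode("utf-8")
        return {"iv": iv, "body": _toy_xor(key, iv, clear)}

    def decrypt(self, node: Dict[str, bytes], key: bytes) -> Any:
        clear = _toy_xor(key, node["iv"], node["body"])
        return json.loads(clear.decode("utf-8"))


def cleartext_if_supported(codec: object, node: Dict[str, bytes], key: bytes) -> Any:
    """Use the extra method when the codec has it; otherwise keep the ciphertext node."""
    if isinstance(codec, Decrypter):
        return codec.decrypt(node, key)
    return node
```

The key only ever appears as an argument to the extra method, never to the plain decode/encode pair.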
+
+
+### key management is out of band
+
+Keys still need to be supplied to the encrypt and decrypt methods of an encryption codec when they are used.
+This key supply and management is something that must be handled "out of band".
+
+We don't have a total strategy for automatic application of keys in large graphs.
+And we probably won't, either.
+We expect that most applications using cryptography will have some key management strategy that is unique to them,
+and will probably _not_ want their IPLD library dictating anything about key management.
+(For example, many complex applications using cryptography may involve key derivation strategies,
+which can even be content or data-organization aware -- we cannot possibly specify such things in IPLD; we need to just accept instructions on that.)
+
+IPLD will be open to future work on library functions for how to handle key management in practice.
+If we can find sufficiently common patterns, they may be worth library features.
+However, we should be comfortable understanding that there may actually not be single answers to key management,
+and the number of features relating to it that belong in IPLD libraries might be correspondingly minimal.
+
+
+### encryption codecs can be used recursively
+
+TODO (this emerges fairly naturally but deserves comment and example)
+
+
+### limitations of this approach
+
+#### double hashing
+
+This approach is roughly "mac-then-encrypt-then-mac" (if you're from the era of crypto education which called things "MAC" rather than "MIC" (which would make much more sense (but, I digress))).
+
+In other words: we hash things twice, and one of the hashes ends up in the output data body (because it's inside the ciphertext).
+
+There's nothing wrong with this (it's certainly cryptographically sound!); it's just slightly excessive and does spend a few bytes.
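+
To make the byte cost concrete, here is a small sketch under assumed details: the inner hash is SHA-256 and is simply prepended to the cleartext before encryption, and the outer hash covers the IV plus ciphertext. That layout, and the names `seal` and `unseal`, are invented for illustration; `_toy_xor` is a toy keystream and NOT secure.

```python
import hashlib
from typing import Tuple


def _toy_xor(key: bytes, iv: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Toy only: NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key: bytes, iv: bytes, cleartext: bytes) -> Tuple[bytes, bytes]:
    """Return (ciphertext body, outer hash)."""
    inner_hash = hashlib.sha256(cleartext).digest()   # first hash, over the cleartext
    body = _toy_xor(key, iv, inner_hash + cleartext)  # that hash travels inside the ciphertext
    outer_hash = hashlib.sha256(iv + body).digest()   # second hash, over the ciphertext
    return body, outer_hash


def unseal(key: bytes, iv: bytes, body: bytes) -> bytes:
    """Decrypt and check the inner hash."""
    plain = _toy_xor(key, iv, body)
    inner_hash, cleartext = plain[:32], plain[32:]
    if hashlib.sha256(cleartext).digest() != inner_hash:
        raise ValueError("inner hash mismatch")
    return cleartext
```

In this layout the overhead is exactly the 32 bytes of the inner digest per block.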
+
+#### selectors don't work over cleartext
+
+Because these codecs don't immediately yield cleartext, selectors applied to data yielded from these codecs won't be working on the cleartext either.
+
+_However_, this isn't necessarily a big problem, and we actually have a good remedy available:
+_ADLs could still be composed with this approach._
+An ADL could handle the key management issues,
+use the codec which is only yielding ciphertext internally (i.e. these nodes would be the ADL "substrate"),
+apply the decryption, and then yield the cleartext as the ADL's output.
+Traversals, selectors, and all the other goodies that are expected to work on IPLD Data Model Nodes could then continue to work upon this data.
+(The key management problem has merely been pushed around, arguably, but critically, it's been pushed out of the area from where multicodec constraints made it unsolvable.)
+
+
+
+Conventions and Usage Patterns around Encryption in IPLD
+---------------------------------------------------------
+
+General notice: there is no single solution to how to compose crypto systems.
+Many tradeoffs exist in design of applications using encryption.
+In some situations, metadata and size and access pattern concealment don't matter;
+in others, they're critically important, and almost any performance penalty is an acceptable trade.
+We can't make these decisions for applications.
+In this document, we'll limit our scope to talking about patterns that we've seen,
+and building some vocabulary around them, and sharing the ideas that seem to have good results.
+
+
+### Desirable traits
+
+Some frequently identified desires when working with encrypted data include:
+
+- ability to use "pinning" services without special integrations or disclosure of key material
+- ability to tersely identify subtrees, e.g. for purposes such as network transfer
+
+These are things which are well-provided for when using IPLD without encryption,
+but require some additional design when using IPLD with encryption, since the link structure of documents is generally itself encrypted.
+
+Mind: these goals are complicated: if they didn't require information that is _encrypted_, they wouldn't be worth special mention in the first place.
+It's very important to be sure you also consider the [cryptography caveats](#introduction-to-cryptography-caveats) when working with these goals.
+
+
+### Pattern: Cleartext Manifest over CIDs of Encrypted Data
+
+Key concepts:
+
+- All content is encrypted at block level (using the systems described in the [Encryption Codecs](#encryption-codecs) section) (so, we have a set of CIDs, all of which have a multicodec indicator that indicates some kind of encryption codec).
+- We still want to be able to pin the whole set, or fetch the whole set using one query.
+
+The solution to this is pretty clear: we want some merkle tree of cleartext IPLD objects, and that tree should just link to the encrypted data CIDs.
+
+#### manifest tree structure can be any form
+
+An interesting trait of the manifest pattern is that to provide its key benefits --
+e.g. being able to refer to the whole set of data at once --
+it doesn't actually _matter_ exactly _what_ tree structure or layout algorithm is used.
+
+HAMTs or Chunky Trees can both be used; or for small enough data, a plain map in a single block.
+Anything that reaches the goals works; there's little or no need to standardize on this.
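+
A minimal sketch of the pattern, using the simplest layout mentioned above (a plain map in a single block). `fake_cid` is a stand-in that is not a real CID encoding, and `build_manifest` and `pin_all` are hypothetical names; a real pinning service would have its own interface.

```python
import hashlib
from typing import Dict, List


def fake_cid(block: bytes) -> str:
    """Stand-in for a CID: hex SHA-256 with a made-up prefix; not a real CID."""
    return "toycid-" + hashlib.sha256(block).hexdigest()


def build_manifest(encrypted_blocks: List[bytes]) -> Dict[str, List[str]]:
    """Cleartext manifest: one map whose list links to every encrypted block."""
    return {"links": [fake_cid(block) for block in encrypted_blocks]}


def pin_all(manifest: Dict[str, List[str]], store: Dict[str, bytes]) -> List[str]:
    """What a key-less pinning service could do: follow links, never decrypt."""
    missing = [cid for cid in manifest["links"] if cid not in store]
    if missing:
        raise KeyError(f"blocks not found: {missing}")
    return list(manifest["links"])
```

Nothing in `pin_all` needs key material: the manifest is cleartext and the blocks stay opaque.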
+
+
+
+Introduction to Cryptography Caveats
+------------------------------------
+
+Designing cryptographic systems is _tricky_ -- to put it mildly.
+
+We can't always offer complete systems and complete guidance to cryptographic work.
+What we can do in IPLD is offer some components and, sometimes, some patterns of suggested use.
+How to put those things together (and do so safely) is still fundamentally the responsibility of the application developer.
+
+We also can't provide a complete introduction and set of coursework on how to compose cryptographic systems!
+Those are educational resources you'll need to find access to elsewhere if you haven't gotten them already.
+
+With all those caveats made, though, we'd like to offer a few pointers to topics you should at least be aware of.
+These topics are also especially relevant to the combination of encryption and IPLD because of how they involve tradeoffs
+(and, some of those tradeoffs are things that inform *why* we don't move certain kinds of features into IPLD specs -- it's because there's more than one way to go about it).
+
+### access patterns of ciphertext can leak hints about the cleartext
+
+// more description of this would be welcome
+
+### size of ciphertext may leak hints about the cleartext
+
+// more description of this would be welcome
+
+### these are example headings, not an exhaustive list
+
+// it's unclear how much we should offer a primer in cryptography
+
+
+
+Postscript: What Actually Happened
+----------------------------------
+
+The conversation about encryption and its relationship to IPLD is probably still not finished
+(but this document is, because as an exploration report, at some point, we call it done; and if the conversation continues, it'll be with a new document).
+
+Encryption discussion is still ongoing in PRs:
+in particular, in https://github.com/ipld/specs/pull/349#issuecomment-763901167 it seems we may be backing away from making multicodec indicators do double-duty,
+and instead using a single multicodec indicator to describe a codec that handles the ciphertext in a standard way,
+then creating a new numeric 'code' field for indicating which cipher mechanism is used, and putting that 'code' field in the codec that's handling the ciphertext.
+
+It will probably remain the case that there will be more than one way to go about encryption when working with IPLD.