Constrained firmware update problem statement
draft-richardson-fud-constrained-update-00

Abstract

This document details the problems of upgrading small devices that need complete firmware replacements, but which do not have enough storage to keep an entire copy of the replacement image. In addition to detailing the specific challenge, a conceptual architecture for a solution is posited involving use of DTLS session resumption tickets with CoAP Block Transfer mode.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on October 18, 2017.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

1.1. Fundamental Postulate
1.2. Terminology
1.3. Challenges

1.3.1. Entire Image Validation
1.3.2. Incremental Validation
1.3.3. Incremental Decryption
1.3.4. Securely transferred
1.3.5. Use of CoAP
1.3.6. Storage in the Network

1.4. Compromises

2. Straw-man proposal

2.1. Things to standardize

3. IANA Considerations
4. Acknowledgements
5. Normative References
Appendix A. Change history
Author's Address

1. Introduction

Class 2 constrained devices typically have a few hundred kilobytes of program store.

A Freescale mc1322x device contains 128K of flash with 96K of RAM (which is mirrored from flash). A CC2538 ("OpenMote") comes with up to 512KB of flash, but options with 256KB and 128KB exist.

A basic build of the Contiki OS containing the simplest CoAP server (Erbium) consumes approximately 60 Kbytes of storage on the Freescale device.

This number does not include any bootstrap or application-level TLS-style security, nor L2-security code, and the application code is not at all sophisticated. Fitting all of these things would likely strain the 96K of space significantly, although there is also 80K of ROM that may be leveraged in some situations.

An OpenMote with 512KB of flash would have little difficulty, but if the application fitted into the 256K device, there would be significant pressure to cost-optimize into the smaller device. Even if that pressure were resisted, at some point the 250K image that could easily be double buffered might grow bigger due to a need to debug some real-life customer problem. It might be necessary to turn some significant portion of the flash memory into storage for debug information.
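The double-buffering limit discussed above is simple arithmetic. The following minimal sketch checks it for the sizes mentioned in this section. The helper name and the reserved_bytes parameter are illustrative assumptions, not part of any existing tool.

```python
KIB = 1024

def can_double_buffer(flash_bytes, image_bytes, reserved_bytes=0):
    """Return True if two full copies of the image fit in flash.

    reserved_bytes is a hypothetical allowance for a bootloader or
    debug storage area; it is not a figure taken from this document.
    """
    return 2 * image_bytes <= flash_bytes - reserved_bytes

# A 250K image can be double buffered in a 512KB part...
print(can_double_buffer(512 * KIB, 250 * KIB))
# ...but not in the 256KB part,
print(can_double_buffer(256 * KIB, 250 * KIB))
# and not once 16K of the 512KB part is set aside for debug information.
print(can_double_buffer(512 * KIB, 250 * KIB, reserved_bytes=16 * KIB))
```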

1.1. Fundamental Postulate

It is therefore postulated that, regardless of the ratio of device capabilities to image size when initially specified, the image size will grow over time to do more things (or do the same things more correctly), and at some point the device will be unable to double buffer image updates. Two things can occur at that point:

the device no longer receives updates.
the device can only be updated via JTAG or other invasive procedure.

A device which is hard to reach, or which requires an update process that depends upon decade-old equipment (laptops running ancient versions of Windows with hard-to-find cabling), is effectively not able to be updated.

The result is devices which present significant security risks because of the difficulty in deploying even the simplest of fixes to them.

1.2. Terminology

Terminology from [RFC7228] is used extensively in this document.

In this document, the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in BCP 14, RFC 2119 [RFC2119] and indicate requirement levels for compliant STuPiD implementations.

1.3. Challenges

The firmware update process must accomplish the following things.

1.3.1. Entire Image Validation

The process MUST be able to cryptographically validate the entire image after writing to flash, and before rebooting into the image.
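As an illustration only, whole-image validation on a device that cannot hold the image in RAM could read the image back from flash one block at a time. The sketch below assumes SHA-256 and a 4096-byte read size, and uses a hypothetical read_block(offset, length) callback standing in for a flash driver. A real device would additionally verify a signature over the expected digest, which is not shown.

```python
import hashlib
import hmac

BLOCK = 4096  # assumed flash read granularity

def image_digest(read_block, image_len):
    """Hash image_len bytes read back from flash, BLOCK bytes at a time,
    so that only one block needs to be held in RAM."""
    h = hashlib.sha256()
    offset = 0
    while offset < image_len:
        n = min(BLOCK, image_len - offset)
        h.update(read_block(offset, n))
        offset += n
    return h.digest()

def image_is_valid(read_block, image_len, expected_digest):
    """Compare the computed digest with the expected one in constant time."""
    return hmac.compare_digest(image_digest(read_block, image_len),
                               expected_digest)

# Usage, with a bytes object standing in for flash contents.
flash = bytes(range(256)) * 40            # 10240 bytes, not block aligned
read = lambda off, n: flash[off:off + n]
expected = hashlib.sha256(flash).digest()
print(image_is_valid(read, len(flash), expected))
```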

1.3.2. Incremental Validation

The process SHOULD be able to incrementally validate blocks as received from the server before writing them to flash. Efficient flash operation may require that a full block of 4K or 128K be received before executing the erase/flash operation.
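One possible way to meet this requirement, shown here purely as a sketch, is a manifest holding one digest per block, which the device obtains over an authenticated channel before the transfer starts. The manifest format, the function names, and the choice of SHA-256 are assumptions made for illustration. For a 128K image in 4K blocks such a manifest would occupy 32 x 32 = 1024 bytes.

```python
import hashlib
import hmac

def make_manifest(image, block_size=4096):
    """Server side: one SHA-256 digest per block of the image."""
    return [hashlib.sha256(image[i:i + block_size]).digest()
            for i in range(0, len(image), block_size)]

def accept_block(index, data, manifest, flash_write):
    """Device side: write a block to flash only if its digest matches."""
    digest = hashlib.sha256(data).digest()
    if not hmac.compare_digest(digest, manifest[index]):
        return False          # reject before any erase/flash operation
    flash_write(index, data)
    return True

# Usage, with a dict standing in for flash.
image = bytes(i % 251 for i in range(3 * 4096))
manifest = make_manifest(image)
flash = {}
ok = accept_block(1, image[4096:8192], manifest, flash.__setitem__)
bad = accept_block(2, b"\x00" * 4096, manifest, flash.__setitem__)
print(ok, bad, sorted(flash))
```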

1.3.3. Incremental Decryption

The image MAY be object encrypted (in addition to transfer encryption), in which case it must be decrypted prior to writing to flash.

1.3.4. Securely transferred

The image MUST be transferred from a content server with confidentiality and integrity protection. A mechanism like (D)TLS or OSCOAP is appropriate. Access control to the content may be by private URL, username/password, or preferably, via ACE token.

1.3.5. Use of CoAP

The image SHOULD be transferred using CoAP, possibly using CoAP Block Transfer. The transfer SHOULD be pausable if bandwidth in an LLN is unavailable. The transfer MUST NOT be a single large Block Transfer, but MUST instead be a series of "bite" sized chunks.
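To give a sense of scale for "bite" sized chunks: in CoAP block-wise transfer the block size is 2^(SZX+4) bytes with SZX between 0 and 6, so a single block carries at most 1024 bytes. The sketch below assumes a 4096-byte chunk (an illustrative figure) and lists the Block2 option values a client would send to fetch it. The function names are hypothetical.

```python
def block_size(szx):
    """Block size in bytes for a given SZX value (0..6)."""
    if not 0 <= szx <= 6:
        raise ValueError("SZX must be in the range 0..6")
    return 1 << (szx + 4)

def block2_requests(chunk_len, szx):
    """Block2 option values needed to fetch chunk_len bytes.

    The option value is (NUM << 4) | (M << 3) | SZX; the M bit is zero
    in a Block2 request.
    """
    size = block_size(szx)
    count = -(-chunk_len // size)          # ceiling division
    return [(num << 4) | szx for num in range(count)]

# A 4096-byte chunk fetched with the largest block size takes 4 requests;
# with 64-byte blocks it takes 64.
print(block2_requests(4096, 6))
print(len(block2_requests(4096, 2)))
```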

1.3.6. Storage in the Network

In a network with many identical nodes, it SHOULD be possible for one node (once upgraded) to offer its own image to another node. This yields significant latency and bandwidth savings in an LLN.

This optional feature is possible only if the image is not transformed in any way that breaks the cryptographic signature.

1.4. Compromises

In an LLN, one node may provide routing (mesh-under or route-over) services to other nodes. It is acceptable that the node is not capable of forwarding packets for the entire duration of the upgrade. In a well-provisioned LLN, alternate routes will be available. In order to reduce the downtime, the protocol SHOULD provide an ability to focus the available upgrade bandwidth on a single node such that it is upgraded as quickly as possible and returned to service.

A network may wish to upgrade leaf nodes first, or may wish to upgrade core nodes first. Each method has advantages, and the choice of which one to do first SHOULD be managed by the operator.

2. Straw-man proposal

This is a sketch of one way of doing the upgrade. The use of DTLS and CoAP is assumed. It is assumed that the device can update flash in 4K blocks, and has approximately 8K of RAM, possibly twice that. The image size is approximately 128K.

The upgrade server provides the image in 4K chunks, accessible at a predictable base URL, such as coaps://example.com/device/1234/chunk/00000 through coaps://example.com/device/1234/chunk/1f000. The last path component of each URL is the hexadecimal offset of the 4K block within the image.
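The URL scheme in the example can be generated mechanically. The sketch below assumes a five-digit lowercase hexadecimal byte offset as the final path component, which is what the two example URLs suggest; the function name is hypothetical.

```python
def chunk_urls(image_len,
               base="coaps://example.com/device/1234/chunk/",
               chunk=4096):
    """One URL per chunk, named by the chunk's byte offset in hex."""
    return [f"{base}{offset:05x}" for offset in range(0, image_len, chunk)]

urls = chunk_urls(128 * 1024)
print(len(urls))    # number of 4K chunks in a 128K image
print(urls[0])
print(urls[-1])
```

For a 128K image this yields 32 URLs, ending at offset 1f000.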

The process starts with the network controller contacting the node to be updated and giving it a way to get the access token from the update server's ACE AS. The client then initiates the DTLS connection, presenting the appropriate token.

The client then stores all the details of the DTLS session in a stable part of its flash, possibly encrypting the information with a randomly generated key that is known only locally. The use of a TPM module or other TEEP mechanism would be appropriate for this data.

The stored information should include: the server's address, the port numbers, the DTLS sequence number, the current block's URL, etc. An appropriate hash of the next block to receive would be recorded. It would not be inappropriate to pre-build much of the IP/UDP/DTLS/CoAP headers that will be used.
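As a rough illustration of how small such a record can be, the sketch below packs the listed items into a fixed 70-byte structure. The field layout, the field names, the use of an IPv6 address, and the trailing CRC-32 (to detect a write interrupted by power loss) are all assumptions; the DTLS key material and any encryption of the record are omitted.

```python
import ipaddress
import struct
import zlib
from collections import namedtuple

ResumeState = namedtuple(
    "ResumeState",
    "server_addr server_port client_port dtls_epoch dtls_seq "
    "next_offset next_hash")

# Network byte order, no padding: IPv6 address, two ports, DTLS epoch,
# DTLS sequence number, offset of the next block, SHA-256 of the next block.
_FMT = "!16sHHHQI32s"

def pack_state(s):
    body = struct.pack(_FMT,
                       ipaddress.IPv6Address(s.server_addr).packed,
                       s.server_port, s.client_port,
                       s.dtls_epoch, s.dtls_seq,
                       s.next_offset, s.next_hash)
    return body + struct.pack("!I", zlib.crc32(body))

def unpack_state(blob):
    body, (crc,) = blob[:-4], struct.unpack("!I", blob[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("stored state is corrupt")
    addr, sport, cport, epoch, seq, off, digest = struct.unpack(_FMT, body)
    return ResumeState(str(ipaddress.IPv6Address(addr)),
                       sport, cport, epoch, seq, off, digest)

state = ResumeState("2001:db8::1", 5684, 49152, 1, 42, 0x3000, bytes(32))
blob = pack_state(state)
print(len(blob), unpack_state(blob) == state)
```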

The client then restarts into a very small recovery code. This code can decrypt the saved information and must be capable of:

1. initiating a CoAP block transfer for a 4K block.
2. receiving the results.
3. validating a hash of the block.
4. flashing the block to the right location.
5. updating the stored information to point to the next block. This may be done using information appended to the transfer itself, or via additional CoAP headers (such as a 2.01 header).
6. "rebooting" to start again.
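The six steps above can be simulated in a few lines. This sketch makes one particular assumption about step 5: each transfer carries the 4K of image data followed by the SHA-256 hash of the next transfer, so the hashes form a chain anchored by the first hash in the stored state. The names build_transfers and recovery_boot, the dict used as stored state, and the fetch callback standing in for the CoAP exchange are all hypothetical.

```python
import hashlib

CHUNK = 4096
HASH_LEN = 32

def build_transfers(image):
    """Server side: build transfers last to first so that each one can
    carry the hash of its successor. Returns (transfers, first_hash)."""
    transfers = {}
    next_hash = bytes(HASH_LEN)            # the last transfer has no successor
    for off in reversed(range(0, len(image), CHUNK)):
        transfers[off] = image[off:off + CHUNK] + next_hash
        next_hash = hashlib.sha256(transfers[off]).digest()
    return transfers, next_hash

def recovery_boot(state, fetch, flash):
    """One pass of the recovery code. Returns True once the image is complete."""
    off = state["next_offset"]
    if off >= state["image_len"]:
        return True
    transfer = fetch(off)                                  # steps 1 and 2
    if transfer is None or \
            hashlib.sha256(transfer).digest() != state["next_hash"]:
        return False                                       # step 3 failed: retry
    data, next_hash = transfer[:-HASH_LEN], transfer[-HASH_LEN:]
    flash[off:off + len(data)] = data                      # step 4
    state["next_offset"] = off + len(data)                 # step 5
    state["next_hash"] = next_hash
    return state["next_offset"] >= state["image_len"]      # step 6: caller reboots

# Usage: four chunks, with a simulated transfer failure on the second boot.
image = bytes(i % 251 for i in range(3 * CHUNK + 100))
transfers, first_hash = build_transfers(image)
state = {"next_offset": 0, "next_hash": first_hash, "image_len": len(image)}
flash = bytearray(len(image))
boots, done = 0, False
while not done:
    boots += 1
    fetch = (lambda off: None) if boots == 2 else transfers.get
    done = recovery_boot(state, fetch, flash)
print(boots, bytes(flash) == image)    # 5 True
```

Because the stored state changes only after a block has been validated and written, a failed boot leaves it untouched and the next boot requests the same block again.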

The design above should guarantee that the transfer can continue where it left off in the event of power failure or other transfer failure that causes a reboot (a watchdog timer would be appropriate).

2.1. Things to standardize

The data block interface between the main operating system in the device and the recovery bootloader is rather specific to implementations, but it may be worth standardizing an abstract description of what should be transferred. The recovery bootloader may be in ROM (and difficult to change) while the main system is in flash and subject to upgrade over a period of decades.

The format of the contents of the block that is transferred each time may need to include meta information (such as the location of the next block), as well as additional meta information to permit validation of the final image. This may need algorithm agility, which may be difficult to accomplish given that the recovery bootloader ROM itself may be unchangeable. It may be necessary to do two-step validation.

3. IANA Considerations

This document details a problem and does not define any specific protocols, so no allocations are defined.

4. Acknowledgements

none yet.

5. Normative References

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC7228]	Bormann, C., Ersue, M. and A. Keranen, "Terminology for Constrained-Node Networks", RFC 7228, DOI 10.17487/RFC7228, May 2014.

Appendix A. Change history

version 00 of document.

Author's Address

Michael Richardson
Sandelman Software Works

EMail: mcr+ietf@sandelman.ca