Spook: Sponge-Based Leakage-Resilient Authenticated Encryption with a Masked Tweakable Block Cipher



Design rationale and motivation
Spook is an Authenticated Encryption scheme with Associated Data (AEAD). Its primary design goals are, jointly, resistance against side-channel analysis and low-energy implementations.
The motivation for the first goal stems from the observation that lightweight devices may be deployed in environments where they can be under the physical control of an adversary, while being responsible for sensitive tasks or serving as the root of critical distributed attacks starting from seemingly non-critical connected objects [RSWO18]. As a result, NIST identified the ability to provide side-channel resistance (and possibly resistance against fault attacks) easily and at low cost as a desirable feature for lightweight cryptography. The motivation for the second goal stems from the observation that energy is a suitable metric to compare the performance of cryptographic algorithms [KDH+12], and a relevant one from the application viewpoint. Energy efficiency is increasingly needed, in particular for battery-operated and energy-harvesting devices, for example in the IoT [MMGD17].
In order to reach these goals, Spook builds on and specializes two main ingredients.
The first ingredient is a leakage-resilient mode of operation that enables efficient side-channel resistance. We use the S1P mode of operation for this purpose [GPPS19], which stands for "Sponge One-Pass" and is the lightweight variant of a sequence of works aiming at high physical security guarantees for (authenticated) encryption [PSV15, BKP+18, BPPS17, BGP+19]. For integrity, S1P reaches the top of the hierarchy of definitions established in [GPPS18], namely Ciphertext Integrity with Misuse and Leakage in encryption and decryption (CIML2), in a liberal model where all the intermediate computations are leaked to the adversary, except for a long-term secret key that is only used twice per encrypted or decrypted message. For confidentiality, S1P reaches security against Chosen Ciphertext Adversaries with misuse-resilience and Leakage in encryption (CCAmL1). Compared to related constructions that additionally achieve CCA security with decryption leakages (i.e., CCAmL2 [GPPS18]), the S1P mode has the significant advantage of being single-pass in both encryption and decryption, which we believe is essential for lightweight implementations. Concretely, S1P also encourages so-called leveled implementations, where (expensive) protections against side-channel attacks are used in a limited way, independently of the message size, while the bulk of the computation can be executed by cheap and weakly protected components.
The second ingredient is the adoption of regular symmetric primitives to operate the S1P mode of operation, namely the Clyde-128 Tweakable Block Cipher (TBC) and the Shadow-512 permutation, both based on simple extensions of the LS-design framework, which targets efficient bitslice implementations [GLSV14]. In order to facilitate leveled implementations, the TBC uses components that can be efficiently masked against side-channel attacks (e.g., with [GMK17] in hardware or [GR17] in software), while the permutation enables fast implementations. These primitives bring two main improvements compared to earlier LS-design proposals. On the one hand, they leverage the tools introduced by Beierle et al. [BCLR17] to prevent the invariant attacks that put several earlier LS-designs at risk [LMR15, TLS16]. On the other hand, they replace the table-based L-boxes of previous LS-designs by word-level L-boxes that can be efficiently implemented as a sequence of rotation and XOR operations, which helps prevent cache attacks [TOS10]. As a result, both Clyde-128 and Shadow-512 enable efficient bitsliced and side-channel resistant implementations on a wide range of platforms (e.g., 32-bit microprocessors, which are increasingly used in mobile applications, as well as dedicated hardware and FPGAs).
The motivations for using two symmetric primitives in S1P are twofold. First, an invertible (tweakable) block cipher is instrumental to reach CIML2 security in the unbounded leakage model [BPPS17]. Second, duplex sponge constructions are generally attractive for efficient AE: they can achieve this functionality in a single pass, are highly flexible, and offer good security bounds in the multi-user setting [BDPA11, DMA17]. Sponge constructions are also believed to provide some level of leakage resilience by design [DEM+17]. Spook combines the advantages of both.
Finally, besides these main features, Spook inherits other interesting properties from the S1P mode of operation: (i) it is secure beyond the birthday bound (with respect to the size n of the TBC), and (ii) it can provide n-bit multi-user security at low cost with a public tweak.

The S1P mode of operation
Notations. We denote the plaintext as $M = M[0] \| M[1] \| \cdots \| M[\ell-1]$, where the size of blocks 0 to $\ell-2$ is r and the size of the last block satisfies $1 \le |M[\ell-1]| \le r$. The associated data A is split into blocks in the same way as the plaintext. We denote the τ-bit nonce as N and the key as K||P, where K is a long-term secret key of n bits and P is a public tweak of n − 1 bits. The secret key K has to be picked uniformly at random in $\{0,1\}^n$. The public tweak P is set to the (n − 1)-bit zero vector when only single-user security is requested, and to P = p||1 when multi-user security is requested (i.e., one bit is used to separate the single-user and multi-user variants). In the multi-user case, p plays the role of a long-term "public key"; it is recommended to pick it uniformly at random, like the long-term secret key K. The S1P[E, π](A, M, N, K||P) mode of operation relies on a Tweakable Block Cipher (TBC) E with n-bit blocks, tweaks and keys, and on an (r + c)-bit permutation π. Our primary parameters are n = 128, r = 256, c = 256 and τ = 128.
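The block split above can be sketched as follows. This minimal Python illustration works on whole bytes (so r = 256 bits becomes 32 bytes); the function names are hypothetical, and returning no blocks for an empty input is an assumption of the sketch rather than something the notation fixes.

```python
def split_blocks(data: bytes, r_bytes: int = 32) -> list:
    """Split data into blocks X[0] || ... || X[l-1].

    Blocks 0 to l-2 are exactly r_bytes long; the last block holds between
    1 and r_bytes bytes. An empty input yields no blocks (assumption).
    """
    return [data[i:i + r_bytes] for i in range(0, len(data), r_bytes)]


def last_block_is_full(data: bytes, r_bytes: int = 32) -> bool:
    # Whether the last block is a full r-bit block, which the mode records
    # through its domain separation bits.
    blocks = split_blocks(data, r_bytes)
    return bool(blocks) and len(blocks[-1]) == r_bytes
```

For instance, a 70-byte message splits into blocks of 32, 32 and 6 bytes, so its last block is partial.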
Conventions. S1P operates over bitstrings (i.e., each piece of manipulated data, namely the plaintext, associated data, ciphertext, keys and nonce, is a sequence of bits). The Spook cipher is, however, defined over bytestrings (i.e., each piece of manipulated data is a sequence of bytes). For encryption, the input bytestrings (plaintext, associated data, keys, nonce) are first mapped to bitstrings using the BMAP function defined next, and the ciphertext is converted back to a bytestring using the inverse of BMAP. The operations are the same for decryption, with the roles of plaintext and ciphertext swapped. BMAP maps bytes to bits in little-endian order. More precisely, it takes as input a sequence of q bytes $(X[0], \dots, X[q-1])$ and outputs a sequence of bits $(Y[0], \dots, Y[8q-1])$, where $Y[8i + j] = \lfloor X[i] / 2^j \rfloor \bmod 2$ for $0 \le i < q$ and $0 \le j < 8$.
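As a minimal sketch, BMAP and its inverse can be written in Python as follows (the names bmap and bmap_inv are ours, chosen for illustration); bit j of byte i lands at position 8i + j, exactly as in the formula above.

```python
def bmap(data: bytes) -> list:
    """Bytes to bits, little-endian within each byte: Y[8i + j] = floor(X[i] / 2^j) mod 2."""
    return [(byte >> j) & 1 for byte in data for j in range(8)]


def bmap_inv(bits: list) -> bytes:
    """Inverse of bmap; the number of bits must be a multiple of 8."""
    if len(bits) % 8:
        raise ValueError("bit length must be a multiple of 8")
    return bytes(
        sum(bits[8 * i + j] << j for j in range(8))
        for i in range(len(bits) // 8)
    )
```

For example, `bmap(b"\x01\x80")` yields a 1 in position 0 (lowest bit of the first byte) and a 1 in position 15 (highest bit of the second byte), with zeros elsewhere.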
As a result, the nonce N, the private part of the key K and (when applicable) the public part of the key p are all 16 bytes long. To obtain the bitstring p (which is 126 bits long) from the corresponding bytestring, the last two bits are discarded after applying BMAP.
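The construction of the (n − 1)-bit public tweak P from the key material can be sketched in Python as below, representing bitstrings as lists of 0/1 values. The function name public_tweak is hypothetical; the logic only follows the rules stated above (zero vector in the single-user case, P = p||1 with p = BMAP(p bytes) minus its last two bits in the multi-user case).

```python
from typing import List, Optional

N_BITS = 128  # n


def bmap(data: bytes) -> List[int]:
    # BMAP: bit j of byte i becomes bit 8i + j.
    return [(byte >> j) & 1 for byte in data for j in range(8)]


def public_tweak(p_bytes: Optional[bytes]) -> List[int]:
    """Return the (n - 1)-bit public tweak P.

    Single-user (p_bytes is None): P is the all-zero (n - 1)-bit vector.
    Multi-user: p = BMAP(p_bytes) with its last two bits dropped (126 bits),
    and P = p || 1.
    """
    if p_bytes is None:
        return [0] * (N_BITS - 1)
    if len(p_bytes) != N_BITS // 8:
        raise ValueError("the public key part p must be 16 bytes long")
    p = bmap(p_bytes)[: N_BITS - 2]
    return p + [1]
```

The trailing 1 is what separates the two variants: a multi-user tweak can never equal the all-zero single-user tweak.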
The encryption. As illustrated in Figure 1, the encryption of the 4-string input (A, M, N, K||P) first derives an n-bit initial seed B using a TBC call $E_K^{P\|0}(N\|0^*)$. The initial seed B is then used as a fresh key for an inner keyed duplex sponge construction, which processes A and M and produces C. Two bits are used for domain separation, in order to distinguish M from A and to mark whether the last blocks of A and M are full r-bit blocks or not. Let U||V be the first 2n − 1 bits of the final state, with $|U| = n$. The tag Z is produced using another TBC call $E_K^{V\|1}(U)$, where the bit 1 appended to V guarantees that this tweak differs from the one used to generate B. The ciphertext is made of $\ell - 1$ blocks of r bits, a final block of length $1 \le |C[\ell-1]| \le r$, and an n-bit tag. We next denote it as
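The following Python sketch only shows where the two TBC calls sit around the keyed duplex; it is a structural illustration under heavy simplifying assumptions, not the S1P algorithms of Appendix A. In particular, toy_tbc (a key/tweak-derived XOR pad) and toy_perm (SHA-512, which is not even a permutation) are hypothetical stand-ins for Clyde-128 and Shadow-512; the duplex places B in the first n bits of the initial state and has no padding or domain separation bits; and data is handled at byte granularity in BMAP order, so "value || bit" for a 128-bit value means overwriting bit 127, the top bit of the last byte. Since τ = n, the padded nonce N||0* is simply N.

```python
import hashlib
import hmac
from typing import Optional, Tuple

N = 16      # n = 128 bits, in bytes
R = 32      # rate r = 256 bits, in bytes
WIDTH = 64  # r + c = 512 bits, in bytes


def toy_tbc(key: bytes, tweak: bytes, block: bytes) -> bytes:
    # Stand-in for Clyde-128 (NOT secure): XOR with a key/tweak-derived pad.
    pad = hashlib.sha256(key + tweak).digest()[:N]
    return bytes(a ^ b for a, b in zip(block, pad))


toy_tbc_inv = toy_tbc  # XOR with a fixed pad is its own inverse


def toy_perm(state: bytes) -> bytes:
    # Stand-in for Shadow-512 (NOT a permutation); only the forward direction is used.
    return hashlib.sha512(state).digest()


def with_last_bit(x: bytes, bit: int) -> bytes:
    # Overwrite bit 127 (top bit of byte 15 under BMAP): (n - 1)-bit value || bit.
    out = bytearray(x[:N])
    out[N - 1] = (out[N - 1] & 0x7F) | (bit << 7)
    return bytes(out)


def duplex(seed: bytes, ad: bytes, data: bytes, decrypt: bool) -> Tuple[bytes, bytes, bytes]:
    # Simplified keyed duplex: no padding, no domain separation (unlike Spook).
    state = bytearray(toy_perm(seed + bytes(WIDTH - N)))
    for i in range(0, len(ad), R):
        for j, x in enumerate(ad[i:i + R]):
            state[j] ^= x
        state = bytearray(toy_perm(bytes(state)))
    out = bytearray()
    for i in range(0, len(data), R):
        blk = data[i:i + R]
        res = bytes(a ^ b for a, b in zip(blk, state))
        out += res
        plain = res if decrypt else blk
        for j, x in enumerate(plain):
            state[j] ^= x  # these state bytes now equal the ciphertext block
        state = bytearray(toy_perm(bytes(state)))
    # U = first n bits, V = next n - 1 bits (its last bit is overwritten later).
    return bytes(out), bytes(state[:N]), bytes(state[N:2 * N])


def s1p_encrypt(key: bytes, p_tweak: bytes, nonce: bytes, ad: bytes, msg: bytes):
    seed = toy_tbc(key, with_last_bit(p_tweak, 0), nonce)  # B = E_K^{P||0}(N)
    ct, u, v = duplex(seed, ad, msg, decrypt=False)
    tag = toy_tbc(key, with_last_bit(v, 1), u)  # Z = E_K^{V||1}(U)
    return ct, tag


def s1p_decrypt(key: bytes, p_tweak: bytes, nonce: bytes, ad: bytes,
                ct: bytes, tag: bytes) -> Optional[bytes]:
    seed = toy_tbc(key, with_last_bit(p_tweak, 0), nonce)
    msg, u, v = duplex(seed, ad, ct, decrypt=True)
    u_star = toy_tbc_inv(key, with_last_bit(v, 1), tag)  # U* = (E_K^{V||1})^{-1}(Z)
    return msg if hmac.compare_digest(u_star, u) else None
```

Decrypting with the same key, tweak, nonce and associated data returns the original message, while flipping any bit of the tag makes s1p_decrypt return None, because U* then differs from U.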

Clyde-128, a Tweakable LS-Design
The S1P mode of operation requires a TBC.We use the Tweakable LS-Design (TLS-design) framework introduced as part of the SCREAM authenticated encryption candidate to the CAESAR competition for this purpose [GLS + 14].TLS-designs are tweakable variants of the LS-designs which specify a family of bitslice ciphers aimed at efficient masked implementations [GLSV14].
Such ciphers work on n = (s·l)-bit states, where s is the size of the S-box and l is the size of the L-box. We denote the full cipher state as x, a state row as x[i, *] (0 ≤ i < s) and a state column as x[*, j] (0 ≤ j < l). Concretely, we will consider s = 4 and l = 32. Although the internal representation of the data is an (s·l)-bit matrix, the cipher operates over bitstring inputs and outputs. The mapping between a bitstring B and the corresponding bit matrix
From an implementation viewpoint, the S-boxes and L-boxes are defined such that they can always be executed with simple operations on the rows (typically corresponding to processor words).
In summary, TLS-designs update the n-bit state x by iterating $N_s$ steps, each made of two rounds (so $N_r = 2N_s$). One significant advantage of these designs is their simplicity: they can be described in a few lines, as illustrated in Algorithm 1, where µ denotes the plaintext, TK a combination of the master key K and the tweak T that we call the tweakey [JNP14], $W(r)$ the round constants, and S and L an s-bit S-box and an l-bit L-box (specified next).
Algorithm 1: TLS-design with l-bit L-box and s-bit S-box.
We use SCREAM's lightweight tweakey scheduling algorithm [GLS+14]. It takes the n-bit key K and the n-bit tweak T as input. The tweak is divided into n/2-bit halves: $T = t_0 \| t_1$. Then, three different tweakeys are used every three steps as follows:
The tweakeys can also be computed on the fly using a simple linear function φ, corresponding to multiplication by a primitive element in GF(4) (such that $\varphi^2(x) = \varphi(x) \oplus x$ and $\varphi^3(x) = x$):
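The on-the-fly tweakey computation can be sketched in Python under two explicit assumptions: that the tweakey of step σ is $TK(\sigma) = K \oplus \varphi^{\sigma \bmod 3}(T)$, and that φ acts on the pair of 64-bit halves as $\varphi(t_0, t_1) = (t_0 \oplus t_1, t_0)$. The latter is one concrete choice satisfying both identities stated above; the function names are hypothetical.

```python
from typing import Tuple

Pair = Tuple[int, int]  # (t0, t1): the two n/2-bit halves of a 128-bit value


def phi(t: Pair) -> Pair:
    # Assumed representation of multiplication by a primitive element of GF(4).
    t0, t1 = t
    return (t0 ^ t1, t0)


def tweakey(key: Pair, tweak: Pair, step: int) -> Pair:
    # TK(step) = K xor phi^(step mod 3)(T): only three distinct tweakeys occur.
    t = tweak
    for _ in range(step % 3):
        t = phi(t)
    return (key[0] ^ t[0], key[1] ^ t[1])
```

Because $\varphi^3$ is the identity, an implementation only needs to apply φ once per step to move from one tweakey to the next, which is what makes the on-the-fly computation cheap.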

Shadow-512, a Multiple LS-Design
The S1P mode of operation also requires a (larger) permutation.We use a simple variant of the LS-designs that we denote as mLS-designs (standing for multiple LS-designs) for this purpose.In summary, mLS-designs mix multiple LS-designs thanks to an additional diffusion layer.
Such ciphers work on n = (m·s·l)-bit states, where m is the number of LS-designs considered, s is the size of the S-box and l is the size of the L-box. Taking similar notations as for TLS-designs, we denote the full cipher state as x, each (s·l)-bit substate corresponding to an LS-design as a bundle x[b, *, *] (0 ≤ b < m), a bundle row as x[b, i, *] (0 ≤ i < s) and a bundle column as x[b, *, j] (0 ≤ j < l). Concretely, we will consider m = 4, s = 4 and l = 32. Although the internal representation of the data is an (m·s·l)-bit state, the cipher operates over bitstring inputs and outputs. The mapping between a bitstring B and the corresponding state
In summary, mLS-designs update the n-bit state x by iterating $N_s$ steps, each made of two rounds (round A and round B) that respectively apply an L-box to the rows of each bundle independently, and a diffusion layer mixing the rows of different bundles (on top of the S-box layer). An accurate description is given in Algorithm 2, where µ denotes the input, $W(r)$ the round constants, S and L an s-bit S-box and an l-bit L-box, and D an m-bit diffusion layer.
Algorithm 2: mLS-design with l-bit L-boxes and s-bit S-boxes.
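The step structure of an mLS-design can be sketched in Python as a skeleton with the components passed in as functions. This is a hypothetical illustration, not Shadow-512: sbox_layer, lbox, diffusion and add_constant are placeholders, adding one constant at the end of each round is an assumption, and the L-box is applied to single rows as in the generic description (Spook's interleaved L-box acts on pairs of rows instead).

```python
from typing import Callable, List

Bundle = List[int]    # the s rows of one bundle, each an l-bit word
State = List[Bundle]  # the m bundles


def mls_permutation(
    state: State,
    n_steps: int,
    sbox_layer: Callable[[Bundle], Bundle],
    lbox: Callable[[int], int],
    diffusion: Callable[[List[int]], List[int]],
    add_constant: Callable[[State, int], State],
) -> State:
    m, s = len(state), len(state[0])
    x = [list(bundle) for bundle in state]  # work on a copy of the input
    for step in range(n_steps):
        # Round A: S-box layer, then the L-box on each row of each bundle.
        x = [sbox_layer(bundle) for bundle in x]
        x = [[lbox(row) for row in bundle] for bundle in x]
        x = add_constant(x, 2 * step)
        # Round B: S-box layer, then D mixing row i across the m bundles.
        x = [sbox_layer(bundle) for bundle in x]
        for i in range(s):
            mixed = diffusion([x[b][i] for b in range(m)])
            for b in range(m):
                x[b][i] = mixed[b]
        x = add_constant(x, 2 * step + 1)
    return x
```

The only cross-bundle interaction happens in round B, which is what distinguishes an mLS-design from m independent LS-designs run side by side.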

Clyde-128 and Shadow-512 components
We now describe the components S, L and D and the round constants used in Clyde-128 and Shadow-512.Both ciphers are designed to enable simple implementations based on 32-bit word-level operations.For the S-box, we provide its circuit representation (which can be applied in parallel to the 32 bits of a word).For the L-box and diffusion layer, we provide a sequence of 32-bit operations.We denote the bitwise AND as and the left rotation of a word x by α bits as rot(x, α).

S-box.
We use a variant of the 4-bit S-box used in Skinny [BJK+16], modified by replacing the NOR gates by AND gates. It is given in Table 1, with numbers representing bitstrings encoded in little-endian order, that is, $x = \sum_{i=0}^{3} 2^i x_i$:
x      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
S(x)   0  8  1 15  2 10  7  9  4 13  5  6 14  3 11 12
It has linear and differential probabilities $2^{-2}$ and algebraic degree 3. Concretely, y = S(x) can be implemented serially with 4 AND gates and 4 XOR gates in both the direct and inverse directions. In the direct direction, it has an AND depth of two, and the first two (and last two) AND gates can be computed in parallel. An inverse implementation (with 4 AND gates and 4 XOR gates) is given in Appendix C.
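To illustrate the bitsliced evaluation, the Python sketch below applies the S-box to 32 columns at once, word xi holding bit i of 32 independent inputs. The gate equations are an assumption on our part: they form one circuit with 4 AND gates, 4 XOR gates and AND depth two that reproduces Table 1 exactly (matches_table checks all 16 inputs), but they are not necessarily the circuit of the specification.

```python
SBOX = [0, 8, 1, 15, 2, 10, 7, 9, 4, 13, 5, 6, 14, 3, 11, 12]  # Table 1


def sbox_bitsliced(x0: int, x1: int, x2: int, x3: int):
    """Evaluate 32 S-boxes at once: word xi holds bit i of 32 inputs."""
    y1 = (x0 & x1) ^ x2  # the first two ANDs are independent of each other
    y0 = (x3 & x0) ^ x1
    y3 = (y1 & x3) ^ x0  # the last two ANDs only need y0, y1: AND depth two
    y2 = (y0 & y1) ^ x3
    return y0, y1, y2, y3


def matches_table() -> bool:
    # Put input value v in lane v (bit position v of each word), run the
    # bitsliced circuit once, and read every lane back out.
    xs = [0, 0, 0, 0]
    for v in range(16):
        for i in range(4):
            xs[i] |= ((v >> i) & 1) << v
    ys = sbox_bitsliced(*xs)
    return all(
        sum(((ys[i] >> v) & 1) << i for i in range(4)) == SBOX[v]
        for v in range(16)
    )
```

On 32-bit words, each of the eight gates is a single word-level instruction, which is the efficiency argument behind the bitslice design.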

L-box.
We use an interleaved L-box that applies jointly to pairs of 32-bit words and has branch number 16 over those pairs. Denoting the two words on which it is applied as x and y:
where circ denotes the circulant matrix whose first line is given in hexadecimal notation, so that the number $b = \sum_{i=0}^{31} 2^i b_i$ corresponds to the row vector $(b_0, \dots, b_{31})$. Concretely, this L-box can be efficiently implemented (in the direct and inverse directions) with six word-level (left) rotations and six 32-bit XORs per word as follows:
The inverse implementation is in Appendix D. Note that this L-box requires a minor modification of the TLS-designs and mLS-designs in Sections 3.2 and 3.3. For TLS-designs, the loop
for 0 ≤ i < s: x[i, *] ← L(x[i, *]),
which applies L independently on each word, has to be replaced by the following one:
for 0 ≤ i < s/2: (x[2i, *], x[2i + 1, *]) ← L(x[2i, *], x[2i + 1, *]).
A similar change applies to mLS-designs. The L notation reflects this application to pairs of words. In LS-designs, such interleaved L-boxes are only applicable to S-boxes with an even number of bits.
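The reason a circulant matrix leads to rotations and XORs can be checked with a short Python sketch. It assumes the convention $M[i][j] = c_{(j - i) \bmod 32}$ for circ(c) (row 0 is the first line c, each further row is a cyclic shift of it) and the bit encoding $b = \sum 2^i b_i$ from the text; the constant c is left as a free parameter, so this is not Clyde's actual L-box, only the general identity it relies on.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, a: int) -> int:
    """Left rotation rot(x, a) of a 32-bit word."""
    a %= 32
    return ((x << a) | (x >> (32 - a))) & MASK32


def circ_mul(x: int, c: int) -> int:
    """Row vector x times circ(c), computed as an XOR of rotations of x.

    With M[k][j] = c_{(j - k) mod 32}, output bit j is
    XOR_k c_k x_{(j - k) mod 32}, i.e. bit j of XOR over set bits k of rot(x, k).
    """
    y = 0
    for k in range(32):
        if (c >> k) & 1:
            y ^= rotl32(x, k)
    return y


def circ_mul_reference(x: int, c: int) -> int:
    # Direct matrix-vector product over GF(2) with M[i][j] = c_{(j - i) mod 32}.
    y = 0
    for j in range(32):
        bit = 0
        for i in range(32):
            bit ^= (x >> i) & (c >> ((j - i) % 32)) & 1
        y |= bit << j
    return y
```

The cost of the rotation form is one rotation and one XOR per set bit of c, which is why first lines with few set bits give cheap word-level L-boxes.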

Security analysis and claims
Our claims are in the single-key setting.Related-key attacks should be avoided at the protocol level.

The S1P mode of operation
The black box security analysis of S1P is proven under the assumptions that the TBC is a secure tweakable strong pseudo-random permutation and that the permutation is a random permutation.CIML2 is proven under the additional assumption that the long-term key of the TBC cannot be leaked (but all other intermediate values can be leaked in full).CCAmL1 security is proven under a bounded leakage assumption.We refer to [GPPS19] for details on these assumptions and proofs.
Based on the above, the single-user security claims of the mode are summarized in Table 2. These bounds imply that the leakage security of Spook depends on the concrete strength of its implementation. Informally, the CIML2 bound guarantees that message integrity reduces to the side-channel security of the TBC. The best forgery attack is a Differential Power Analysis (DPA) against Clyde-128, the complexity of which is expected to be smaller than 2 n n 2 . The CCAmL1 analysis is more subtle, but essentially guarantees that the confidentiality of long messages reduces to the confidentiality of single-block messages. The best attacks are then a DPA against Clyde-128 (as for CIML2) and a Simple Power Analysis (SPA) against the ephemeral secrets (i.e., the secret part of the permutation state), the complexity of which is expected to be smaller than $2^{n/2}$. The security claims for the multi-user variant of S1P are summarized in Table 3. Note that no additional restrictions are imposed on the message length. The security bounds in both tables only depend on the total number of message and associated data blocks to encrypt.

The Clyde-128 (tweakable) block cipher
The security of Clyde-128 against linear and differential attacks can be analyzed with the wide-trail strategy [DR01]. Two rounds activate 16 S-boxes and the linear/differential probability of our S-box is $2^{-2}$. As a result, eight rounds (four steps) lead to a bound of $(2^{-2})^{4 \cdot 16} = 2^{-128}$ on the probability of the best linear/differential characteristics. Our recommended parameters add four rounds (two steps) as a margin against improvements of these standard attacks. According to the upper bound in [BCC11], at least five rounds of Clyde-128 are necessary to reach the maximum algebraic degree. We expect that the recommended twelve rounds (six steps) should prevent algebraic cryptanalysis [CP02] and related attacks (e.g., cube [DS09] or division property [Tod15]); our experiments in this regard did not reveal any weaknesses.
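The counting behind the wide-trail bound can be written as a one-line Python helper (the name is hypothetical). It assumes, as stated above, that every two consecutive rounds activate at least 16 S-boxes, each contributing a probability of at most $2^{-2}$, and returns the base-2 logarithm of the bound.

```python
def wide_trail_log2_bound(rounds: int, active_per_two_rounds: int = 16,
                          log2_sbox_prob: int = -2) -> int:
    # Each full pair of rounds contributes at least `active_per_two_rounds`
    # active S-boxes, each with probability at most 2^log2_sbox_prob.
    return (rounds // 2) * active_per_two_rounds * log2_sbox_prob
```

With the defaults, eight rounds give $2^{-128}$, matching the text; the same counting is what the four extra recommended rounds build a margin on.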
Besides, and following the security arguments recently put forward in [BCLR17], the round constants are chosen so as to maximize the dimension of the smallest invariant subspace over the linear layer that contains all round constants.To achieve this, we need at least ten rounds.This ensures that no invariants exist simultaneously for the S-box layer and the L-box layer.
Note that while Clyde-128 is built from the TLS-designs introduced in [GLS + 14] and analyzed as an ideal TBC in [GPPS19], the way it is used in Spook implies that its tweak input is either constant (as a zero vector or a public value) or pseudo-random and out of adversarial control.So while a standard TBC would require security against chosen-tweak attacks, the number of rounds selected for Clyde-128 only corresponds to single-key and random-tweak security, which is the minimum required for the analysis of the S1P mode of operation to hold.Chosen-tweak security for Clyde-128 could be obtained by doubling the number of rounds, following the approach in [GPPR11].

The Shadow-512 permutation
The exact requirements for the Shadow-512 permutation are more difficult to specify.
A minimum is to reach 128-bit security against linear cryptanalysis. This can be analyzed by considering the super S-box structure of Shadow-512. Two rounds activate 16 S-boxes and four rounds activate 16 × 4 S-boxes thanks to the branch number of the diffusion layer. Hence, a probability bound of $2^{-128}$ for the best linear characteristic is reached after four rounds.
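Written out, the four-round count for Shadow-512 combines the active S-boxes with the S-box probability as follows:

$$
\underbrace{16 \times 4}_{\text{active S-boxes over four rounds}} = 64,
\qquad
\left(2^{-2}\right)^{64} = 2^{-128}.
$$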
Another minimum requirement is to reach algebraic degree 128. According to the upper bound in [BCC11], this requires at least five rounds of Shadow-512.
Besides, one important requirement for the permutation in the analysis of the S1P mode of operation is that it ensures collision resistance for the 255 bits that are used to generate the tag. Hence, a more specific requirement is to prevent truncated differentials with probability larger than $2^{-128}$ for those 255 bits. A conservative heuristic for this purpose is to require that no differential characteristic has probability better than $2^{-385}$, which is the case after twelve rounds (six steps).

Primary candidate and variants
Underlying primitives.We consider two sets of parameters for the Clyde-128 TBC and Shadow-512 permutation.The recommended parameters are 12 rounds for Clyde-128 and 12 rounds for Shadow-512.We additionally provide aggressive parameters, with 12 rounds for Clyde-128 and 8 rounds for Shadow-512, as an interesting target for cryptanalysis.Note that our reference implementations and test vectors are based on the recommended parameters.
Full algorithm. We denote as Spook[128, 512, su] the AEAD algorithm operating the S1P mode in the single-user setting with Clyde-128 as TBC and Shadow-512 as permutation, and as Spook[128, 512, mu] its multi-user version. Based on these notations, we define our:
• Primary candidate as Spook[128, 512, su] with recommended parameters.
We recall that the only difference between the single-user and multi-user versions is that the public tweak is fixed to zero in the first case (i.e., the key is limited to 128 secret bits), while p is picked uniformly at random in the second one (i.e., the key is made of 128 secret bits and 126 public bits).
We additionally define two versions of Spook with a 384-bit state. They are obtained by turning the 512-bit permutation into a 384-bit one: we define Shadow-384 as a 3LS-design (rather than a 4LS-design) where the diffusion layer (a, b, c) = D(x, y, z) is specified as:
The rest of the permutation and all the other elements of the mode are adapted so that r = 128, with the same numbers of rounds for the recommended and aggressive parameters, leading to our:
• Second variant as Spook[128, 384, su] with recommended parameters.
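The parameter choices of this section can be summarized in a small Python configuration sketch. The class and function names are hypothetical; the values come from the text (12 Clyde-128 rounds in both parameter sets, 12 or 8 Shadow rounds for recommended or aggressive parameters, r = 256 for the 512-bit state and r = 128 for the 384-bit state), and the capacity is simply the state width minus the rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpookParams:
    name: str
    state_bits: int  # width of the Shadow permutation (r + c)
    rate_bits: int   # r
    clyde_rounds: int
    shadow_rounds: int

    @property
    def capacity_bits(self) -> int:
        return self.state_bits - self.rate_bits


def spook_params(state_bits: int, multi_user: bool = False,
                 aggressive: bool = False) -> SpookParams:
    rate_bits = {512: 256, 384: 128}[state_bits]
    setting = "mu" if multi_user else "su"
    return SpookParams(
        name=f"Spook[128, {state_bits}, {setting}]",
        state_bits=state_bits,
        rate_bits=rate_bits,
        clyde_rounds=12,                        # identical in both parameter sets
        shadow_rounds=8 if aggressive else 12,  # recommended: 12, aggressive: 8
    )
```

Both state sizes keep a 256-bit capacity; the 384-bit versions trade rate (throughput) for a smaller permutation.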

Design trade-offs: advantages and limitations
Spook is an AEAD algorithm with state-of-the-art guarantees in the black-box setting. It ensures beyond-birthday security with respect to the size of its long-term key, can be extended to multi-user security with a public tweak, and provides nonce misuse-resilience in the sense of Ashur et al. [ADL17].
Thanks to its one-pass structure, Spook should allow efficient implementations on a wide range of platforms.Its design is in particular well-suited to 32-bit software implementations (thanks to an intensive exploitation of 32-bit word-level operations), and to dedicated hardware and FPGA implementations (thanks to the low gate complexity of its underlying components).
Spook provides excellent opportunities to mitigate physical attacks efficiently thanks to its leakage-resilient features.In particular, the general rationale behind its design enables leveled implementations, where the Clyde-128 TBC is well protected against side-channel attacks and the Shadow-512 (or Shadow-384) permutation is implemented with cheaper protections (or even no protections at all).It is in the specific contexts where physical attacks are an important concern that Spook is expected to exhibit significant performance gains compared to modes without leakage-resilient features.Concretely, protecting the TBC can be achieved thanks to the masking countermeasure, both in hardware [GMK17] and in software [GR17].For this purpose, Clyde-128 is designed both with low AND complexity (as previous LS-designs) and low AND depth (which is important to limit the latency of so-called glitch-free implementations [NRS11, FGP + 18]).As for the permutation, low-latency / low-energy implementations in the sense of [KDH + 12] are natural candidates in hardware, while some minimum countermeasures to prevent SPA (e.g., low-order masking, or time randomization [VMKS12]) should be sufficient in software.For this purpose, the Shadow-512 (or Shadow-384) permutation is designed with low-latency components.Leveled implementations of Spook can also benefit from pre-computing the (expensive) generation of fresh seeds.
The main price to pay for the leakage-resilient features of Spook is that it suffers from some overheads in case of short messages.This seems unavoidable in any mode leveraging a re-keying process.However, and as evaluated in [BGP + 19], these overheads are amortized as soon as the data to process is a few blocks long, and the gains of leveled implementations can reach factors 10 to 100 (e.g., in energy) if a high physical security level is required by an application.
A secondary drawback is the need for two primitives (a TBC and a permutation), which implies a larger cost (i.e., area) in hardware. However, this drawback vanishes for the intended use cases, since leveled implementations require implementations with different physical security levels anyway. Also, when uniformly (un)protected implementations are considered, the use of the same S-box and L-box in Clyde-128 and Shadow-512 (or Shadow-384) should allow resource sharing.
We conclude by listing a couple of additional interesting features of Spook.
First, the S1P mode is compatible with solutions for encrypting long messages segmented into several smaller packets, as for example proposed by Bertoni et al. [BDPA11] and formalized by Hoang et al. [HRRV15]. Such a "session feature" can be used as a partial tagging mechanism that allows the decryption of long messages when only limited memory is available (i.e., smaller than the size of the message), and saves the execution of one TBC call per segment (i.e., the highly protected and therefore more expensive part of a leveled implementation of S1P). As such modes are not directly compatible with the NIST API, we leave their discussion for a separate report.
Second, since it leverages a re-keying process, the Spook algorithm inherently provides good resistance against various Differential Fault Attacks, as discussed in [MSGR10, DEM+17].
Finally, an inverse-free variant of Spook can be obtained by doing the tag verification in the direct sense.It only satisfies CIML1 in the unbounded leakage model, yet can provide good concrete security against bounded leakages if the tag verification is sufficiently protected (e.g., masked).
The decryption. To decrypt the 4-string input (A, C, N, K||P), the mode first derives the initial seed B via $E_K^{P\|0}(N\|0^*)$, as when encrypting. It then runs the inner keyed duplex sponge construction on A and C to derive M and the (2n − 1)-bit truncated state U||V. Finally, it makes an inverse TBC call $U^* = (E_K^{V\|1})^{-1}(Z)$ and outputs M if and only if $U^* = U$. More precisely, S1P[E, π].Enc and S1P[E, π].Dec are specified in Appendix A, Algorithms 3 and 4. The different cases that the S1P mode can encounter are also illustrated in Appendix B.
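The difference between the inverse tag check used in decryption and the direct check of the inverse-free variant described above can be illustrated with a toy invertible TBC in Python. Everything here is hypothetical: toy_tbc (XOR, rotation and modular addition with key/tweak-derived values) is not Clyde-128 and is not secure; it only needs to be a bijection for each key and tweak.

```python
import hashlib
import hmac

MASK128 = (1 << 128) - 1


def _pads(key: bytes, tweak: bytes):
    # Derive two 128-bit values from the key and tweak.
    d = hashlib.sha256(key + tweak).digest()
    return int.from_bytes(d[:16], "little"), int.from_bytes(d[16:], "little")


def _rotl128(x: int, a: int) -> int:
    return ((x << a) | (x >> (128 - a))) & MASK128


def toy_tbc(key: bytes, tweak: bytes, x: int) -> int:
    # Toy invertible TBC, NOT Clyde-128: xor, rotate by 7, add modulo 2^128.
    p1, p2 = _pads(key, tweak)
    return (_rotl128(x ^ p1, 7) + p2) & MASK128


def toy_tbc_inv(key: bytes, tweak: bytes, y: int) -> int:
    p1, p2 = _pads(key, tweak)
    return _rotl128((y - p2) & MASK128, 128 - 7) ^ p1


def _eq(a: int, b: int) -> bool:
    # Constant-time comparison of two 128-bit values.
    return hmac.compare_digest(a.to_bytes(16, "little"), b.to_bytes(16, "little"))


def verify_with_inverse(key: bytes, v_tweak: bytes, u: int, z: int) -> bool:
    # S1P decryption: U* = (E_K^{V||1})^{-1}(Z); accept iff U* = U.
    return _eq(toy_tbc_inv(key, v_tweak, z), u)


def verify_inverse_free(key: bytes, v_tweak: bytes, u: int, z: int) -> bool:
    # Inverse-free variant: recompute E_K^{V||1}(U) and compare it with Z.
    return _eq(toy_tbc(key, v_tweak, u), z)
```

Since the TBC is a permutation for each key and tweak, both checks accept exactly the same (U, Z) pairs. What differs is which direction of the cipher must be implemented and what the comparison exposes: the inverse check compares U, which is recomputed from Z, whereas the direct check compares a freshly computed tag with Z, which is why the text ties the inverse-free variant to a protected tag verification.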

Figure 2: Different cases of the S1P mode of operation.

Table 1: S-box in table representation.

Table 2: Single-user security claims.