Small universal deterministic authenticated encryption for the internet of things

Lightweight cryptography was developed in response to the increasing need to secure devices for the Internet of Things. After significant research effort, many new block ciphers have been designed targeting lightweight settings, optimizing efficiency metrics which conventional block ciphers did not. However, block ciphers must be used in modes of operation to achieve more advanced security goals such as data confidentiality and authenticity, a research area given relatively little attention in the lightweight setting. We introduce a new authenticated encryption (AE) mode of operation, SUNDAE, specially targeted for constrained environments. SUNDAE is smaller in implementation area than other known lightweight modes, such as CLOC, JAMBU, and COFB. Unlike these modes, however, SUNDAE is designed as a deterministic authenticated encryption (DAE) scheme, meaning it provides maximal security in settings where proper randomness is hard to generate, or secure storage must be minimized due to expense. Unlike other DAE schemes, such as GCM-SIV, SUNDAE can be implemented efficiently both on constrained devices and on the servers communicating with those devices. We prove SUNDAE secure relative to its underlying block cipher, and provide an extensive implementation study, with results in both software and hardware, demonstrating that SUNDAE offers improved compactness and power consumption in hardware compared to other lightweight AE modes, while simultaneously offering comparable performance to GCM-SIV on parallel high-end platforms.


Introduction
As computing on increasingly small devices becomes widespread, enabling security, and in particular cryptography, in such constrained environments becomes critical. Recognizing the fact that cryptographic algorithms optimized for high-performance computing are not necessarily optimal for constrained environments, various research and standardization efforts have set out to explore lightweight cryptography, such as ISO/IEC 29192, CRYPTREC, the CAESAR competition [CAE16], and NIST's lightweight project [nis17].
Lightweight block cipher design is one of the most mature research areas, with constructions going back to 2007, optimizing for a variety of efficiency goals such as latency [BJK+16], area [CDK09], and energy [BBI+15]. Although optimizing block cipher design is an important first step, block ciphers on their own are only building blocks, and should be used in modes of operation to achieve security. In particular, ensuring data confidentiality and authenticity is done using an authenticated encryption (AE) mode of operation.
Even if a block cipher is ideally suited to a given environment when considered in isolation, it could be used in an AE mode of operation which erases many of the block cipher's benefits. In fact, AE modes are often not designed to account for the different requirements imposed by lightweight settings. The mode might require two separate, independent keys, as with SIV [RS06], a state size of at least thrice that of the underlying block size, as with COPA [ABL+13], or multiple initial block cipher calls before it can start processing data, like in EAX [BRW04].
Exceptions include the following AE modes CLOC [IMGM14], JAMBU [WH16], and COFB [CIMN17a], which try to reduce state size and number of block cipher calls to optimize for short messages. However, the challenges imposed by constrained environments are not limited to efficiency constraints, as fundamental security assumptions might be difficult to guarantee as well. For example, devices might lack proper randomness sources, or have limited secure storage to maintain state, in which case they might not be able to generate the nonces necessary to ensure that modes such as CLOC, JAMBU, and COFB maintain security. In such cases, algorithms which provide more robust security are better, such as nonce-misuse resistant AE [RS06], as they do not fail outright in the wrong conditions.
An efficient nonce-misuse resistant dedicated instantiation of SIV called GCM-SIV was proposed by Gueron and Lindell at CCS 2015 [GL15]. While it attains very competitive performance in software on recent Intel architectures, it requires full multiplications in $\mathrm{GF}(2^{128})$, which makes the scheme unattractive in hardware and on resource-constrained platforms. The importance of good implementation characteristics on all platforms was already pointed out in [MM12]: the same cryptographic algorithms used on the small devices of the Internet of Things also have to be employed on the servers that are communicating with them. Crucially, however, the few designs explicitly aiming at being simultaneously efficient on lightweight as well as high-performance platforms, such as [BMR+13, LPTY16], do not provide nonce-misuse resistant authenticated encryption. In this paper, we aim to address this gap.

Contributions
We introduce an AE mode of operation, SUNDAE, which
1. competes with CLOC and JAMBU in number of block cipher calls for short messages,
2. improves over those algorithms and COFB in terms of state size,
3. provides maximal robustness to a lack of proper randomness or secure state, and
4. simultaneously offers good implementation characteristics on lightweight and high-performance platforms.
SUNDAE is designed to be a deterministic authenticated encryption mode [RS06], which means that as long as its input is unique, it maintains both data confidentiality and authenticity. If inputs are repeated, then only that fact is leaked. SUNDAE processes inputs of the form (A, M ), where A is associated data which need not be encrypted, and M is plaintext data to be encrypted. If M is empty, then SUNDAE becomes a MAC algorithm. If needed, nonces are included as the first x bits of associated data, where x is a parameter fixing the nonce's length per key.
SUNDAE's structure is based on SIV [RS06]; however, it is optimized for lightweight settings: it uses one key, consists of a cascade of block cipher calls, and its only additional operations consist of XOR and multiplication by fixed constants. The use of efficient intermediate functions is inspired by GCBC [Nan09]. Using an n-bit block cipher, aside from storage for the key, CLOC requires 2n-bit state, JAMBU 1.5n-bit state, and COFB 1.5n-bit state, whereas SUNDAE only uses an n-bit state. SUNDAE's performance is fundamentally limited by the fact that it requires two block cipher calls per data block, hence SUNDAE works best for communication which consists of short messages. For a message consisting of one block of nonce, associated data, and plaintext, COFB uses 3 block cipher calls, CLOC requires 4, JAMBU 5, and SUNDAE 5 as well (which can be reduced to 4 if one block cipher call can be precomputed). However, SUNDAE's strength lies in settings where communication outweighs computational costs: if the combination of associated data and plaintext is never repeated, the nonce is no longer needed, and communication or synchronization costs are reduced, in addition to reducing the block cipher calls to 4. SUNDAE is inherently serial, and although the client side is important, it is not everything, especially given GCM-SIV's excellent performance using AES-NI on Haswell and Skylake. Even though parallel modes inherently profit most from modern parallel architectures, the Comb scheduling technique proposed in [BLT15] can mitigate this issue even for serial modes, at least on the server side. Therefore, we can afford to deploy a serial approach to design a novel mode of operation.
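The SUNDAE call counts quoted above can be reproduced with a small counting sketch. It assumes one initial block cipher call, one call per associated-data block, and two calls per plaintext block (one in the MAC pass, one in the OFB pass), and it assumes that the precomputable call is the initial one. The function name and parameters are illustrative and not part of the specification.

```python
def sundae_calls(ad_blocks: int, msg_blocks: int,
                 precomputed_initial: bool = False) -> int:
    """Count block cipher calls for one SUNDAE encryption (illustrative model).

    Assumed cost model: 1 initial call, 1 call per associated-data block,
    2 calls per plaintext block.
    """
    initial = 0 if precomputed_initial else 1
    return initial + ad_blocks + 2 * msg_blocks


# One nonce block plus one associated-data block, one plaintext block.
with_nonce = sundae_calls(ad_blocks=2, msg_blocks=1)
# Same message with the initial call precomputed.
with_precomputation = sundae_calls(2, 1, precomputed_initial=True)
# No nonce: one associated-data block, one plaintext block.
without_nonce = sundae_calls(ad_blocks=1, msg_blocks=1)
print(with_nonce, with_precomputation, without_nonce)
```

Under this model the three cases give 5, 4, and 4 calls, matching the figures stated in the text.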

Related Work
Aside from lightweight block ciphers, many other primitives have been optimized for the lightweight setting, such as stream ciphers [CP08] and hash functions [BKL+13, GPP11], as well as dedicated designs achieving authenticated encryption, such as Grain-128a [ÅHJM11]. Furthermore, permutation-based cryptography provides an approach to designing authenticated encryption with better trade-offs suited to lightweight settings [DEMS14, ABB+14a, BDP+14a, BDP+14b, AJN14, BDPV11, ABB+14b]. Applications which have the flexibility to choose the underlying primitive will often find the better choice in using permutation-based cryptography. However, there are settings where for legacy reasons one is restricted to using block ciphers; our focus is on designing a scheme for such settings.
When run with empty plaintext data, SUNDAE looks similar to MAC algorithms such as GCBC [Nan09] and CBCR [ZWZW11].

Notation
Unless specified otherwise, all sets are finite. If $\mathsf{X}$ is a set, then $\mathsf{X}^n$ is the set of length-$n$ sequences of $\mathsf{X}$, $\mathsf{X}^{\le q}$ the set of sequences of $\mathsf{X}$ of length not greater than $q$ including the empty sequence, and $\mathsf{X}^*$ the set of finite-length sequences of $\mathsf{X}$. If $X \in \mathsf{X}^*$, then $|X|$ is its length. Given $X, Y \in \mathsf{X}^*$, concatenation of $X$ and $Y$ is denoted $X \,\|\, Y$, or simply $XY$ when no confusion arises.
The notation $x \mapsto y$ is used to denote a function which maps the symbol $x$ on the left to the symbol $y$ on the right. If $f$ is a function with domain $\mathsf{X} \times \mathsf{Y}$, then we write $f(X, Y)$ and $f_X(Y)$ interchangeably, and use the notation $f_X$ to denote the function obtained by fixing the first input of $f$ to $X$.
Throughout the paper, $n$ denotes block size. The set of blocks is $\{0,1\}^{\le n}$, and $\mathsf{B} := \{0,1\}^n$ denotes the subset of complete blocks, with all other blocks called incomplete. The element $0^n \in \mathsf{B}$ denotes the complete block consisting of only zeroes, and the function $\mathrm{pad} : \{0,1\}^{\le n} \to \mathsf{B}$ pads an incomplete block $X$ with a 1 followed by $n - |X| - 1$ zeroes, and leaves complete blocks as-is:
$$\mathrm{pad}(X) = \begin{cases} X \,\|\, 1\,0^{\,n-|X|-1} & \text{if } |X| < n, \\ X & \text{otherwise.} \end{cases}$$
The empty string is denoted $\varepsilon$. Given two equal-length elements $X, Y \in \{0,1\}^*$, $X \oplus Y$ denotes their bitwise XOR. If $X \in \{0,1\}^*$, then $\lfloor X \rfloor_m$ denotes truncating $X$ to the $m$ most significant bits of $X$.
Splitting a non-empty string $X$ into blocks is done by computing its block length $\ell_X$, which is the smallest integer greater than or equal to $|X|/n$, and processing $X$ as $X[1]\,X[2] \cdots X[\ell_X] := X$, where $|X[i]| = n$ for $i < \ell_X$ and $0 < |X[\ell_X]| \le n$.
The set of complete blocks can be viewed as a finite field by mapping strings to polynomials over finite fields. For a positive divisor $i$ of $n$, we map the bits $b_0, \ldots, b_{n-1}$ to $i$ elements $a_0, \ldots, a_{i-1}$ of $\mathrm{GF}(2)[x]/(m)$ for a fixed irreducible polynomial $m(x)$ of degree $n/i$ over $\mathrm{GF}(2)$. Given such a mapping and $X \in \mathsf{B}$, we let $2 \times X$ and $4 \times X$ denote multiplication by $x$ and $x^2$ of all $a_0, \ldots, a_{i-1}$ in their polynomial basis, respectively.
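The string operations defined above (padding, block splitting, truncation, XOR) can be sketched as follows. This is a minimal illustration that assumes byte-aligned strings and $n = 128$; the helper names are hypothetical, and bit-level inputs would need a different representation.

```python
from typing import List

N_BYTES = 16  # block size n = 128 bits, assumed for illustration


def pad(block: bytes) -> bytes:
    """Append a single 1 bit and then zeroes to an incomplete block."""
    if len(block) > N_BYTES:
        raise ValueError("a block holds at most n bits")
    if len(block) == N_BYTES:
        return block  # complete blocks are left as-is
    return block + b"\x80" + bytes(N_BYTES - len(block) - 1)


def split_blocks(x: bytes) -> List[bytes]:
    """Split a non-empty string into blocks; only the last may be incomplete."""
    if not x:
        raise ValueError("splitting is defined for non-empty strings")
    return [x[i:i + N_BYTES] for i in range(0, len(x), N_BYTES)]


def truncate(x: bytes, m_bytes: int) -> bytes:
    """Keep the m most significant bytes of x."""
    return x[:m_bytes]


def xor(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("XOR is defined for equal-length strings")
    return bytes(p ^ q for p, q in zip(a, b))


blocks = split_blocks(bytes(33))  # block length 3, last block has 1 byte
print(len(blocks), len(blocks[-1]), pad(blocks[-1]).hex())
```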
Concrete instantiations for n = 64, 128 are proposed in Sect. 5.2, optimized for block ciphers with 4-bit or 8-bit S-boxes, respectively.
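To make the $2 \times X$ and $4 \times X$ operations concrete, the sketch below treats a 128-bit block as $i = 16$ elements of $\mathrm{GF}(2^8)$ and multiplies each by $x$. The reduction polynomial $x^8 + x^4 + x^3 + x + 1$ (the AES polynomial) is an assumption chosen only for illustration; the instantiation actually proposed in Sect. 5.2 may use a different mapping.

```python
def xtime(b: int) -> int:
    """Multiply one GF(2^8) element by x, reducing modulo the assumed
    polynomial x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def times2(block: bytes) -> bytes:
    """2 x X: multiply every byte of the block by x."""
    return bytes(xtime(b) for b in block)


def times4(block: bytes) -> bytes:
    """4 x X: multiply every byte by x^2, i.e. apply times2 twice."""
    return times2(times2(block))


print(times2(bytes([0x57])).hex(), times4(bytes([0x57])).hex())
```

Both operations need only shifts and conditional XORs, which is the kind of cost profile the text asks of the intermediate functions.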
The function E : K × B → B denotes a block cipher, with K the set of keys. The expression a ? b : c evaluates to b if a is true and c otherwise.

Specification
SUNDAE consists of an encryption algorithm enc and a decryption algorithm dec. It is parametrized by a block cipher $E : \mathsf{K} \times \mathsf{B} \to \mathsf{B}$, which fixes a key set $\mathsf{K}$ and block size $n$, and a representation of $\mathsf{B}$ as a finite field. The encryption algorithm enc takes as input a key $K \in \mathsf{K}$, associated data $A \in \{0,1\}^*$, and a message $M \in \{0,1\}^*$. It outputs a ciphertext $C \in \{0,1\}^{n+|M|}$, where the first $n$ bits of the ciphertext are interpreted as a tag. The decryption algorithm dec takes as input a key $K \in \mathsf{K}$, associated data $A \in \{0,1\}^*$, and a ciphertext $C \in \{0,1\}^n \times \{0,1\}^*$, and outputs $M \in \{0,1\}^{|C|-n}$, or the error symbol $\bot$ if verification is not successful. The encryption and decryption algorithms are such that for all $K \in \mathsf{K}$ and $A, M \in \{0,1\}^*$, $\mathrm{dec}_K(A, \mathrm{enc}_K(A, M)) = M$.
The key $K \in \mathsf{K}$ must be generated as specified by the underlying block cipher $E$, which usually involves choosing $K$ uniformly at random from $\mathsf{K}$. After fixing a key, each pair $(A, M)$ of associated data and message input should be unique; associated data can be repeated if the message is changed, and message input may be repeated if the associated data is changed. Caution must be taken so that intermediate values used during encryption and decryption are not leaked. In particular, unverified plaintext from the decryption algorithm should not be released [ABL+14]. Finally, proper operation of SUNDAE requires changing keys well before the bound from Thm. 1 becomes void.
Alg. 1 and Alg. 2 provide pseudocode for enc and dec respectively, and Fig. 1 gives a diagram of encryption. All block cipher calls are performed with a fixed key K. Both encryption and decryption algorithms only use the "forward" block cipher E K , hence the block cipher inverse is not needed.
Figure 1: Diagrams of SUNDAE encryption and authentication. The initial block cipher call changes depending upon the presence of associated and plaintext data. The multiplication × is by 2 or 4, depending on the length of the last blocks.
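The overall data flow (MAC over associated data and plaintext, then OFB keyed by the tag) can be sketched as below. This is an illustration under several loudly stated assumptions, not a reproduction of Alg. 1 or Alg. 2: the block cipher is replaced by keyed BLAKE2b (a PRF stand-in, not a permutation), the initial block encodes the presence of $A$ and $M$ in its two leading bits, incomplete last blocks are multiplied by 2 and complete ones by 4, and doubling is byte-wise modulo the AES polynomial. All inputs are byte-aligned.

```python
import hashlib
import hmac

N = 16  # block size in bytes (n = 128), assumed


def E(key: bytes, block: bytes) -> bytes:
    # Stand-in for E_K (assumption): keyed BLAKE2b, not a real block cipher.
    return hashlib.blake2b(block, key=key, digest_size=N).digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def pad(block: bytes) -> bytes:
    return block if len(block) == N else block + b"\x80" + bytes(N - len(block) - 1)


def times2(block: bytes) -> bytes:
    # Assumed byte-wise doubling modulo x^8 + x^4 + x^3 + x + 1.
    return bytes(((b << 1) ^ (0x1B if b & 0x80 else 0)) & 0xFF for b in block)


def mac(key: bytes, ad: bytes, msg: bytes) -> bytes:
    # Assumed encoding: leading bits flag non-empty associated data / message.
    flags = (0x80 if ad else 0) | (0x40 if msg else 0)
    v = E(key, bytes([flags]) + bytes(N - 1))
    for data in (ad, msg):
        blocks = [data[i:i + N] for i in range(0, len(data), N)]
        for idx, blk in enumerate(blocks):
            v = xor(v, pad(blk))
            if idx == len(blocks) - 1:
                # Assumed rule: x2 for an incomplete last block, x4 otherwise.
                v = times2(v) if len(blk) < N else times2(times2(v))
            v = E(key, v)
    return v


def stream(key: bytes, tag: bytes, length: int) -> bytes:
    # OFB: iterate E_K starting from the tag, using forward calls only.
    out, v = b"", tag
    while len(out) < length:
        v = E(key, v)
        out += v
    return out[:length]


def enc(key: bytes, ad: bytes, msg: bytes) -> bytes:
    tag = mac(key, ad, msg)
    return tag + xor(msg, stream(key, tag, len(msg)))


def dec(key: bytes, ad: bytes, ct: bytes):
    if len(ct) < N:
        return None
    tag, body = ct[:N], ct[N:]
    msg = xor(body, stream(key, tag, len(body)))
    # Release the plaintext only after the recomputed tag matches.
    return msg if hmac.compare_digest(mac(key, ad, msg), tag) else None


key = b"k" * 16
ct = enc(key, b"header", b"hello world")
print(dec(key, b"header", ct), dec(key, b"tampered", ct))
```

Note that decryption never inverts the cipher and returns nothing when the tag check fails, in line with the caution about unverified plaintext.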

Intuition and Proof Overview
SUNDAE is analyzed as a deterministic authenticated encryption (DAE) algorithm [RS06], and therefore must achieve authenticity, and confidentiality up to repetition of inputs. Although our formal analysis considers confidentiality and authenticity simultaneously, here we give an intuitive explanation which considers the two goals separately. SUNDAE generally follows the MAC-then-encrypt paradigm, much like SIV [RS06], since SUNDAE processes associated data and plaintext first with a MAC algorithm, and then uses the MAC algorithm output as the "IV" input to the stream cipher OFB [Nat80].
However, the use of a single key for both the MAC algorithm and stream cipher means that SIV's analysis does not carry over. Furthermore, although SUNDAE exhibits similarities with GCBC [Nan09] and OFB, the analyses of those schemes have limited applicability to SUNDAE as the combination of deterministic encryption and authentication introduces complications which do not arise when trying to achieve the goals separately. Therefore a proof of SUNDAE requires new arguments, albeit using techniques from throughout the literature.
Our formal argument starts with the simplifying step of applying a PRP-PRF switch to analyze SUNDAE with a uniform random function ρ. Although this limits our analysis to the birthday bound, SUNDAE's security is in any case limited by birthday bound attacks, for the same reasons the CBC and OFB encryption modes are.
Confidentiality. After the PRP-PRF switch, confidentiality can be argued as follows. Since plaintext is encrypted into ciphertext using the stream cipher OFB, confidentiality is maintained if the stream cipher output looks uniformly random to the adversary. In the proof we end up setting aside the fact that plaintext is XORed with the stream cipher output to produce ciphertext since we are considering chosen-plaintext adversaries; the resulting simplified construction is denoted enc-stream in the proof.
OFB maintains security if its "IV" is unpredictable to the adversary. In the case of SUNDAE, the IV corresponds to the tag, hence intuitively, confidentiality will be maintained if the tag is unpredictable. Unlike OFB, we need to take into account the MAC algorithm which produces the tag. Complications arise due to the fact that associated data and plaintext data are processed similarly, with the main method of domain separation being the intermediate functions.
A large part of the formal argument is proving that SUNDAE's domain separation works. To do so, we calculate the probability that any two ρ-inputs collide in a meaningful way -a meaningless ρ-collision would be one where the adversary keeps the prefix of two different queries the same, resulting in the same ρ-input since SUNDAE is deterministic. As long as there are no meaningful collisions, it is easy to argue SUNDAE's confidentiality. Many of the details in the calculation of the ρ-collision bound have little to do with SUNDAE or with the specific intermediate functions that we choose, hence we abstract away details of the argument and prove a more general result in Sect. 4.10.
To connect SUNDAE with the abstract analysis of Sect. 4.10, we recast SUNDAE into different notation in Sect. 4.5 so that the intermediate functions become explicit. This way SUNDAE can be viewed as a cascade of ρ-queries, alternated with intermediate function calls. The sequence of intermediate function calls is denoted I(A, M ), which makes explicit the fact that they only depend on the associated data and plaintext input. Then, following Patarin's method, we focus on analyzing transcripts of interactions between adversaries and SUNDAE, thereby fixing adversarial input, and as a result intermediate functions.
Each transcript then gets converted into a graph in Sect. 4.8.3, which characterizes the relationship between all the ρ-inputs: each node of the graph represents an intermediate function, and if you follow the graph from the "root" node to a leaf node, while applying ρ while going from node to node, you will have executed a call to SUNDAE.
Once the connection between transcripts and graphs has been established, Sect. 4.8.4 describes the types of collisions that can occur, with the important ones being "structural" and "accidental". A structural collision is one which would happen if I(A, M ) were poorly designed, by, for example, using the same intermediate functions for two unrelated inputs. Analysis of accidental collisions is done in Sect. 4.10.
Authenticity. Consider an adversary which somehow produces a forgery (C, T). This means it found a tag T such that the output of the MAC algorithm during the (C, T)-decryption call equals T. In particular, intuitively, it would have had to have found a pre-image or second pre-image of the underlying MAC algorithm, since, by definition, C was never output by a previous encryption query (otherwise it would not be a valid forgery). The bulk of the formal argument involves showing that it is in fact difficult for the adversary to produce such an event.
We introduce an intermediate construction to arrive at our conclusion. First, the decryption algorithm of SUNDAE is rewritten in terms of enc-stream and stream, the latter essentially describing OFB mode. After removing the plaintext XOR, we end up with the oracles (enc-stream, dec-stream). Looking more closely at dec-stream, one sees that its internal stream call could be recreated with the enc-stream output that is available to the adversary, since any dec-stream call that uses a tag input which is equal to some previous enc-stream output will make the exact same ρ-calls as that previous enc-stream query. It is only when a dec-stream call is made which uses tag input which is unrelated to all previous enc-stream output, or when a sufficiently long dec-stream call is made, that new ρ-queries are made. Using this knowledge, we introduce the intermediate construction dec-stream*, which tries to recreate dec-stream as best as possible using newly generated uniform random values if necessary to recreate missing ρ-calls.
The argument on the difficulty of finding pre-images and second pre-images is easy to reason about with dec-stream*, as shown in Sect. 4.7. The rest of the authenticity proof focuses on bounding the distance between (enc-stream, dec-stream) and the intermediate construction with dec-stream*. Then, as explained above for confidentiality, as long as no ρ-collision occurs, the adversary will not be able to distinguish SUNDAE from the intermediate world which uses dec-stream*. The general result from Sect. 4.10 is then re-used.

Security Definitions, and Statements
For the security definitions we will need the following concepts. Oracles and adversaries are probabilistic algorithms. Given two sequences of oracles $(f_1, \ldots, f_\mu)$ and $(g_1, \ldots, g_\mu)$, we denote the advantage of an adversary $\mathbf{A}$ in distinguishing $(f_1, \ldots, f_\mu)$ from $(g_1, \ldots, g_\mu)$ by
$$\Delta_{\mathbf{A}}(f_1, \ldots, f_\mu \,;\, g_1, \ldots, g_\mu) := \left| \mathbf{P}\left[\mathbf{A}^{f_1, \ldots, f_\mu} \to 1\right] - \mathbf{P}\left[\mathbf{A}^{g_1, \ldots, g_\mu} \to 1\right] \right|,$$
where the notation $\mathbf{A}^{O_1, \ldots, O_\mu} \to 1$ indicates that $\mathbf{A}$ outputs 1 when interacting with oracles $O_1, \ldots, O_\mu$. A uniformly distributed random function (URF) over $\mathsf{X}$ is a random variable uniformly distributed over the set of all functions on $\mathsf{X}$. Let $\$_{i,j,k}$ represent a family of independent URFs, which define the ideal oracle $\$$ used below.
Definition 1 (DAE Security). Adversary $\mathbf{A}$'s DAE advantage against SUNDAE is defined as
$$\mathrm{DAE}(\mathbf{A}) := \Delta_{\mathbf{A}}(\mathrm{enc}_K, \mathrm{dec}_K \,;\, \$, \bot),$$
where $K$ is chosen uniformly at random from $\mathsf{K}$ and $\bot$ is an oracle that always outputs $\bot$. Letting $(O_1, O_2)$ denote the oracles $\mathbf{A}$ interacts with, the adversary may not query $O_2(A, C)$ if some previous query $O_1(A, M)$ returned $C$.
We follow the concrete security paradigm [BDJR97] by explicitly describing SUNDAE's security in terms of adversarial resources. An adversary's associated data block length cost is the sum of the block lengths of the associated data that it queries to either SUNDAE's encryption or decryption algorithms. Plaintext and ciphertext block length costs are defined similarly, with an adversary's total block length cost defined as the sum of its associated data, plaintext, and ciphertext block length costs. If $\sigma_A$, $\sigma_P$, and $\sigma_C$ denote $\mathbf{A}$'s associated data, plaintext, and ciphertext block length costs, then $\mathbf{A}$ makes at most $N_E$ block cipher calls indirectly via SUNDAE. SUNDAE is a mode of operation, hence its security relies on the quality of its underlying block cipher, defined as follows.
Definition 2 (PRP Advantage). The PRP advantage of an adversary $\mathbf{A}$ against a block cipher $E : \mathsf{K} \times \mathsf{X} \to \mathsf{X}$ is $\Delta_{\mathbf{A}}(E_K \,;\, \pi)$, where $K$ is chosen uniformly at random from $\mathsf{K}$, and $\pi$ is chosen uniformly at random from the set of all permutations on $\mathsf{X}$.
Any adversary $\mathbf{A}$ against SUNDAE can be converted into an adversary $\mathbf{A}_E$ against the block cipher as follows: adversary $\mathbf{A}_E$ runs $\mathbf{A}$, and each time $\mathbf{A}$ makes a query to SUNDAE, $\mathbf{A}_E$ recreates SUNDAE encryption or decryption with its own oracle, either $E_K$ or $\pi$, according to SUNDAE's definition.
Theorem 1. Let $\mathbf{A}$ be an adversary making at most $q$ $\mathrm{enc}_K$ queries and $q_v$ $\mathrm{dec}_K$ queries, with block length costs of at most $\sigma_A$, $\sigma_P$, and $\sigma_C$ for associated, plaintext, and ciphertext data, respectively, then
In the following sections we go through the formal arguments proving the above theorem. Sect. 4.11 finally summarizes all the results and computes the above bound.

Proof Notation
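The size of a set $\mathsf{S}$ is indicated interchangeably by $|\mathsf{S}|$ and $\#\mathsf{S}$. Given sets $\mathsf{J}$ and $\mathsf{X}$, the set of all mappings from $\mathsf{J}$ to $\mathsf{X}$ is denoted $\mathsf{X}^{\mathsf{J}}$. The set of all injective mappings is denoted $\partial\mathsf{X}^{\mathsf{I}}$.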

Switching to URFs.
Let $(\mathrm{enc}[F], \mathrm{dec}[F])$ represent SUNDAE's encryption and decryption algorithms with the block cipher calls $E_K$ replaced by the function $F : \mathsf{B} \to \mathsf{B}$.
Lemma 1. Let $O_1, O_2$ be any oracles, and let $\rho$ be a URF over $\mathsf{B}$. For any adversary $\mathbf{A}$ with block length costs of at most $\sigma_A$, $\sigma_P$ and $\sigma_C$ for associated data, plaintext, and ciphertext respectively, we have where $\mathbf{A}_E$ is the standard model reduction described in Section 4.2, and $N_E$ from Eq. (7) is an upper bound on the total number of block cipher calls $\mathbf{A}$ makes.
Proof. Let π denote a permutation chosen uniformly from the set of permutations over B.
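Applying the triangle inequality we get
$$\Delta_{\mathbf{A}}(\mathrm{enc}[E_K], \mathrm{dec}[E_K] \,;\, O_1, O_2) \le \Delta_{\mathbf{A}}(\mathrm{enc}[E_K], \mathrm{dec}[E_K] \,;\, \mathrm{enc}[\pi], \mathrm{dec}[\pi]) + \Delta_{\mathbf{A}}(\mathrm{enc}[\pi], \mathrm{dec}[\pi] \,;\, \mathrm{enc}[\rho], \mathrm{dec}[\rho]) + \Delta_{\mathbf{A}}(\mathrm{enc}[\rho], \mathrm{dec}[\rho] \,;\, O_1, O_2).$$
The first term equals $\Delta_{\mathbf{A}_E}(E_K \,;\, \pi)$ and the second term $\Delta_{\mathbf{A}_E}(\pi \,;\, \rho)$. The first term is exactly $\mathbf{A}_E$'s PRP-advantage against $E_K$, and the second term is bounded above by the PRP-PRF switching lemma [BR06]. Knowing that $\mathbf{A}_E$ makes at most $N_E$ queries to its oracles, we have our desired bound.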
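To get a feel for the cost of the PRP-PRF switch, the sketch below evaluates the switching term for a given number of block cipher calls. It assumes the common form of the switching lemma, $N(N-1)/2^{n+1}$; the constant used in the paper's own bound may differ, and the function name is illustrative.

```python
from fractions import Fraction


def switching_bound(num_calls: int, n: int) -> Fraction:
    """Upper bound on distinguishing a random permutation from a random
    function on n-bit blocks with num_calls queries (assumed standard form)."""
    return Fraction(num_calls * (num_calls - 1), 2 ** (n + 1))


# With 64-bit blocks, 2^32 calls already push the term to just under 1/2.
print(float(switching_bound(2 ** 32, 64)))
# With 128-bit blocks, the same number of calls leaves a large margin.
print(float(switching_bound(2 ** 32, 128)))
```

This is the reason keys have to be changed well before the number of processed blocks approaches $2^{n/2}$.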
After applying the PRP-PRF switch, we have that DAE(A) is bounded by  Below we describe the above steps in detail, and Fig. 2 illustrates the steps in a diagram.

Alternative Description of SUNDAE
Step 1 of enc: From Messages to Intermediate Functions. The first step starts by splitting $A$ and $M$ into blocks, if non-empty, to get $A[1]\,A[2] \cdots A[\ell_A] = A$ and $M[1]\,M[2] \cdots M[\ell_M] = M$. Then each block is augmented with a bit to indicate whether it is a final block or not. We let split denote the operation of mapping $(A, M)$ to the sequence of augmented blocks The augmented blocks output by split are subsequently used as the first parameter in the function where $f$ is defined as Recall that $f((\delta, X), Y)$ and $f_{\delta,X}(Y)$ are equivalent. We write the operation mapping an input $(A, M)$ to a sequence of intermediate functions from $\mathsf{B}$ to $\mathsf{B}$ as $I(A, M)$. If $A \neq \varepsilon$ and $M \neq \varepsilon$, we have that where values $X \in \{0,1\}^n$ are interpreted as constant functions mapping any element in $\mathsf{B}$ to $X$. Similarly, if $M$ is non-empty, if $A$ is non-empty, and finally $I(\varepsilon, \varepsilon) = (0^n)$. Let
Step 2 of enc: Applying ρ. The algorithm's second step applies $\rho$ to the sequence of intermediate functions specified by $I(A, M)$. Given $x = (x_1, x_2, \ldots, x_\ell)$ where each $x_i$ is a function from $\mathsf{B}$ to $\mathsf{B}$, define the cascade of $\rho$ with $x$ to be the function ρ from $\mathsf{B}$ to $\mathsf{B}$ defined by applying $x_1$ followed by $\rho$, followed by $x_2$, and so forth: Let
Step 3 of enc: chopxor. Define $\mathrm{chopxor}_Y(X)$ to be Then $\mathrm{chopxor}_M(X)$ represents the final step of enc. Letting enc-stream denote the first two steps of enc, we have Using the above notation, we define where $\ell_C$ is the block length of $C$. Then dec can be described as
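The three steps can be mimicked with a lazily sampled random function standing in for ρ. The sketch below is hedged: it assumes the cascade ends with a ρ-call, that the first function in the sequence is a constant function (so the starting value is irrelevant), and that $\mathrm{chopxor}_Y(X)$ truncates $X$ to $|Y|$ bits and XORs it with $Y$. The class and function names are hypothetical.

```python
import os
from typing import Callable, Dict, Sequence

N = 16  # block size in bytes, assumed


class LazyURF:
    """Uniform random function on blocks, sampled lazily on first use."""

    def __init__(self) -> None:
        self.table: Dict[bytes, bytes] = {}

    def __call__(self, x: bytes) -> bytes:
        if x not in self.table:
            self.table[x] = os.urandom(N)  # fresh uniform output
        return self.table[x]


def cascade(rho: Callable[[bytes], bytes],
            fs: Sequence[Callable[[bytes], bytes]],
            start: bytes = bytes(N)) -> bytes:
    """Apply x_1, then rho, then x_2, then rho, and so forth."""
    v = start
    for f in fs:
        v = rho(f(v))
    return v


def chopxor(y: bytes, x: bytes) -> bytes:
    """Assumed definition: truncate x to the length of y, then XOR with y."""
    if len(x) < len(y):
        raise ValueError("x must be at least as long as y")
    return bytes(a ^ b for a, b in zip(x, y))


rho = LazyURF()
constant = lambda _v: bytes(N)          # a constant function, like 0^n
xor_block = lambda v: bytes(b ^ 0xAA for b in v)  # toy intermediate function
tag = cascade(rho, [constant, xor_block])
print(tag == cascade(rho, [constant, xor_block]), len(rho.table))
```

Repeating a query re-uses the stored ρ-outputs, which is why queries sharing a prefix produce identical ρ-inputs.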

Eliminating chopxor
that is, the enc-stream-equivalent of $\$$, then Define $\mathbf{A}_{\mathrm{chopxor}}$ to be the adversary interacting with an oracle $(O_1, O_2)$, which is either (enc-stream, dec-stream) or $(\$_s, \bot)$, that starts by running $\mathbf{A}$, and for each query where $\mathbf{A}_{\mathrm{chopxor}}$ never queries dec-stream$(A, T \,\|\, \mathrm{chopxor}_M(S))$ after having made the query enc-stream$(A, M) = T \,\|\, S$. This means the bound on the distance between the oracles (enc-stream, dec-stream) and $(\$_s, \bot)$ with $\mathbf{A}_{\mathrm{chopxor}}$ is a bound on the distance between enc and $\$$ with $\mathbf{A}$. Hence we focus on bounding $\Delta_{\mathbf{A}_{\mathrm{chopxor}}}(\text{enc-stream}, \text{dec-stream} \,;\, \$_s, \bot)$.

Bounding Forgery Probability With dec-stream *
Note that dec-stream computes its output based on a query to enc-stream and a query to stream. If $T$ is a tag previously output by enc-stream, then one can recreate stream$(T)$ just by knowing what the previous enc-stream output is. If $T$ is not a tag previously output by enc-stream, then with high probability stream$(T)$ will just be a stream of uniform random output. The idea behind dec-stream* is to capture this behaviour: given only access to $\$_s$ and $\$_s$'s previous outputs, dec-stream* will try to mimic dec-stream's behaviour as closely as possible.
Each one of $\$_s$'s outputs $S_1, S_2, \ldots, S_q$ can be viewed as a sequence of complete blocks, hence given a value $X \in \mathsf{B}$, determining whether $X$ has been output by $\$_s$ is a question of finding $i, j$ such that $S_i[j] = X$. The oracle dec-stream* will have access to all of $\$_s$'s past outputs, and uses $\$_s$ and the function stream*$(T)$, defined as follows. On input $(A, T \,\|\, C)$ with $|C| = \ell$, the algorithm then it pads with uniform random bits to reach an output length of $\ell^* n$ bits. If there is no such $i$, then stream* outputs $\ell^* n$ uniform random bits. We add the requirement that Then we have is the same as bounding the probability that $\mathbf{A}_{\mathrm{chopxor}}$ forces dec-stream* to output non-$\bot$ output when interacting with $(\$_s, \text{dec-stream}^*)$.
Consider adversary $\mathbf{A}^*$ given access to only $\$_s$, that runs $\mathbf{A}_{\mathrm{chopxor}}$, forwards all of $\mathbf{A}_{\mathrm{chopxor}}$'s $\$_s$ queries to its own $\$_s$ oracle, and perfectly simulates dec-stream* queries using $\$_s$. The probability that two $\$_s$ queries collide in their first $n$ bits of output (i.e., colliding tags) is at most $q^2/2^n$. Similarly, the probability that the first $n$ bits of output of a new $\$_s$-query collide with the input to some past stream*-query is at most $q q_v/2^n$. By excluding these bad events, each $\$_s$-output is uniquely identified by its first $n$ bits, therefore if meaning $\mathbf{A}^*$ has either found a pre-image or a second pre-image for $\lfloor \$_s \rfloor_n$, which occurs with probability at most $q_v/2^n$, hence
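The lookup behaviour described for stream* can be sketched as follows. This is an illustrative reading only: it assumes that past outputs are stored as lists of complete blocks, that the first matching position $(i, j)$ is used, and that the blocks following the match are returned and topped up with fresh uniform blocks. The exact condition in the original definition is not recoverable from the text above, and the names are hypothetical.

```python
import os
from typing import List, Sequence

N = 16  # block size in bytes, assumed


def stream_star(tag: bytes,
                past_outputs: Sequence[Sequence[bytes]],
                num_blocks: int) -> List[bytes]:
    """Continue a past output after `tag` if one exists, else sample fresh."""
    for blocks in past_outputs:
        for j, blk in enumerate(blocks):
            if blk == tag:
                # Re-use the blocks that followed the matching block ...
                reused = list(blocks[j + 1:j + 1 + num_blocks])
                # ... and pad with uniform random blocks up to num_blocks.
                fresh = [os.urandom(N) for _ in range(num_blocks - len(reused))]
                return reused + fresh
    # No past output contains the tag: everything is fresh randomness.
    return [os.urandom(N) for _ in range(num_blocks)]


past = [[b"T" * N, b"a" * N, b"b" * N]]
out = stream_star(b"T" * N, past, 3)
print(out[:2] == [b"a" * N, b"b" * N], len(out))
```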

Focusing on Transcripts
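We are left with bounding $\Delta_{\mathbf{A}_{\mathrm{chopxor}}}(\text{enc-stream}, \text{dec-stream} \,;\, \$_s, \text{dec-stream}^*)$. We apply Patarin's method, which we briefly review below.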

Patarin's Method
When an adversary $\mathbf{A}$ interacts with an oracle $O : \mathsf{X} \to \mathsf{Y}$, it produces a sequence of inputs and outputs to the oracle $t = ((x_1, y_1), \ldots, (x_q, y_q))$. We let $O_t$ denote the event that $O(x_i) = y_i$ for $i = 1, \ldots, q$, $\mathbf{A}_t$ the event that $\mathbf{A}$ produces inputs $x_1, x_2, \ldots, x_q$ given oracle outputs $y_1, y_2, \ldots, y_q$, and $\mathbf{A}^O = t$ the event that the interaction between $\mathbf{A}$ and $O$ produces transcript $t$. Note that in the events defined above, the order of the queries specified by the transcript $t$ is important. Adversarial advantage in distinguishing two oracles can be bounded by looking at the difference in transcript probabilities. Initially formalized by Patarin [Pat91, Pat08], re-introduced by Chen and Steinberger [CS14], and to a certain extent independently discovered by Bernstein [Ber05], and Chang and Nandi [Nan06, CN08], the following lemma allows us to mostly focus on computing transcript probabilities to establish our results.
Lemma 2 (Patarin). Let $\mathbf{A}$ be an adversary attempting to distinguish oracle $O_1$ from oracle $O_2$, both with input domain $\mathsf{X}$ and output domain $\mathsf{Y}$. Let $\mathsf{T} \subset (\mathsf{X} \times \mathsf{Y})^*$ denote the set of transcripts $t$ such that $\mathbf{P}[\mathbf{A}_t] > 0$, and say that $\mathsf{T}$ can be partitioned into a set of good transcripts $\mathsf{T}_{\mathrm{good}}$ and a set of bad transcripts $\mathsf{T}_{\mathrm{bad}}$. If there exists $\epsilon$ such that for all $t \in \mathsf{T}_{\mathrm{good}}$, We apply Patarin's method by describing an event $\rho\text{-coll}_t$ such that for all $t$ in some set $\mathsf{T}_{\mathrm{good}}$, where $\overline{\rho\text{-coll}_t}$ is the complement of $\rho\text{-coll}_t$, so that If we can find $\epsilon$ and a set $\mathsf{T}_{\mathrm{good}}$ such that $\mathbf{P}[\rho\text{-coll}_t] \le \epsilon$ for all $t \in \mathsf{T}_{\mathrm{good}}$, then the above equation allows us to apply Lemma 2, which is Patarin's method. To arrive at this point, we introduce further terminology to describe the transcripts.
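One way to read the argument, offered as a hedged sketch rather than the paper's exact statement: assume $O_1$ is the real pair (enc-stream, dec-stream), $O_2$ is the ideal pair $(\$_s, \text{dec-stream}^*)$, and Lemma 2 takes its usual form, requiring $\mathbf{P}[O_{1,t}] \ge (1-\epsilon)\,\mathbf{P}[O_{2,t}]$ on good transcripts, where $O_{k,t}$ denotes the event that $O_k$ answers consistently with $t$. If, conditioned on no collision, the real world produces $t$ with the ideal-world probability, then for every good $t$:

```latex
\begin{align*}
\mathbf{P}\bigl[O_{1,t}\bigr]
  &\ge \mathbf{P}\bigl[O_{1,t} \,\big|\, \overline{\rho\text{-coll}_t}\bigr]
       \cdot \mathbf{P}\bigl[\overline{\rho\text{-coll}_t}\bigr] \\
  &=   \mathbf{P}\bigl[O_{2,t}\bigr]
       \cdot \bigl(1 - \mathbf{P}[\rho\text{-coll}_t]\bigr) \\
  &\ge (1 - \epsilon)\,\mathbf{P}\bigl[O_{2,t}\bigr],
\end{align*}
% so that, under the assumed form of Lemma 2,
\[
\Delta_{\mathbf{A}}(O_1 \,;\, O_2)
  \;\le\; \epsilon + \mathbf{P}\bigl[\mathbf{A}^{O_2} \in \mathsf{T}_{\mathrm{bad}}\bigr].
\]
```

The remaining work is then to bound $\mathbf{P}[\rho\text{-coll}_t]$ on good transcripts and the probability of a bad transcript in the ideal world.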

Transcript Description
A transcript $t$ of $\mathbf{A}_{\mathrm{chopxor}}$'s interaction only contains enc-stream and dec-stream output, which hides the calls to stream made by dec-stream; similarly, when $\mathbf{A}_{\mathrm{chopxor}}$ interacts with $(\$_s, \text{dec-stream}^*)$, calls to stream* are hidden. We augment $\mathbf{A}_{\mathrm{chopxor}}$'s transcripts to include the hidden output. As done in previous work [CS14, GPT15, MRV15], we release the hidden output to $\mathbf{A}_{\mathrm{chopxor}}$ after all queries have been made, but before $\mathbf{A}_{\mathrm{chopxor}}$ outputs its decision. Each transcript consists of $q$ $O_1$-queries (representing either enc-stream or $\$_s$), $q_v$ $O_2$-queries (representing either dec-stream or dec-stream*), and $q_v$ $H$-queries (representing either hidden stream or stream* output), denoted where $Y^-_i$ is either $\bot$ or a message, and $\ell^-_i$ is the block length of $C^-_i$. Note that in both worlds, $O_2(A, T \,\|\, C)$ can be written as Furthermore, the transcript's probability is non-zero only if for all successful forgeries, and $\bot$ otherwise. We can replace all $O_2$ queries by $O_1$ and $H$ queries, and maintain the same transcript probability: where

Viewing Transcripts as Graphs
Our goal is to have ρ-coll t describe the event that two ρ-inputs collide in a transcript t.
To do so, we extract the graph of queries made to ρ, illustrated for a single query in Fig. 2. App. B works through an example of the conversion from transcript to graph we describe in this section.
Definition 3. Given a set of sequences $X_i$, we define the graph induced by the $X_i$ as follows. The nodes are all non-empty subsequences of the $X_i$, each labelled by their last element. Two nodes, $v_1$ and $v_2$, are connected by a directed edge $v_1 \to v_2$ if $v_1$ is a prefix of $v_2$, differing by one element.
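The induced graph of Definition 3 can be built in a few lines. The sketch assumes that "subsequence" means "prefix", which is how the edge condition reads; nodes are represented as tuples and the label of a node is its last element. The function name is illustrative.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple

Node = Tuple[Hashable, ...]


def induced_graph(sequences: Iterable[Sequence[Hashable]]
                  ) -> Tuple[Set[Node], Set[Tuple[Node, Node]]]:
    """Return (nodes, edges) of the graph induced by the given sequences."""
    nodes: Set[Node] = set()
    edges: Set[Tuple[Node, Node]] = set()
    for seq in sequences:
        for k in range(1, len(seq) + 1):
            node = tuple(seq[:k])          # non-empty prefix of the sequence
            nodes.add(node)
            if k > 1:
                # Edge from the prefix that is one element shorter.
                edges.add((tuple(seq[:k - 1]), node))
    return nodes, edges


nodes, edges = induced_graph([("a", "b", "c"), ("a", "b", "d")])
print(len(nodes), len(edges), sorted(node[-1] for node in nodes))
```

Sequences that share a prefix share the corresponding nodes, which mirrors the fact that SUNDAE queries with a common prefix trigger the same ρ-calls.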
Given a transcript $t$, define $G_t$ as the graph induced by the following sequences: The graph $G_t$ ideally captures the only information that the adversary should learn, namely that inputs to SUNDAE with the same prefixes will result in the same ρ-calls, but all outputs are unrelated to each other. Since SUNDAE internally maps each node of $G_t$ to a ρ-input, SUNDAE will maintain the graph's structure if no two ρ-inputs collide. SUNDAE maps $G_t$ to ρ-input via its intermediate functions, denoted $f$ in Sect. 4.5. We introduce the graph $G^I_t$ to describe the adversary's view after application of $f$: $G^I_t$ is a graph induced by the following sequences: Note that $G_t$'s nodes are sequences of blocks in $\mathsf{B}$, while $G^I_t$'s nodes are sequences of functions defined on $\mathsf{B}$.
There is a natural function induced by $I$ mapping $G_t$ to $G^I_t$ as follows. Let $v$ be a node in $G_t$. If $v$ is a sequence of length one of the form $((0, X))$, then map it to the node $((X))$ in $G^I_t$. Otherwise find any $(A^\pm_i, M^\pm_i)$ such that $v$ is a subsequence of $\mathrm{split}(A^\pm_i, M^\pm_i)$, then map $v$ to the corresponding subsequence of $I(A^\pm_i, M^\pm_i)$. This mapping is well-defined, since if $v$ is a subsequence of both $\mathrm{split}(A^\pm_i, M^\pm_i)$ and $\mathrm{split}(A^\pm_j, M^\pm_j)$, then the corresponding subsequences in $I(A^\pm_i, M^\pm_i)$ and $I(A^\pm_j, M^\pm_j)$ will equal each other as well. Furthermore, edges between nodes are preserved since the mapping preserves subsequences.
Applying $\rho$ to $G^I_t$ means applying the cascade ρ to all paths from root nodes to leaves, where each path is interpreted as a vector with root node as the first component, and leaf node as the last component. Given an arbitrary node $v$, its corresponding ρ-input is defined as $v$ applied to the cascade of $\rho$ with the sequence of nodes on the path connecting $v$ to a root node. For example, say there is a path $(v_0, v_1, v_2, v)$ connecting $v$ to the root node $v_0$, then the ρ-input corresponding to $v$ is defined as The ρ-inputs corresponding to nodes in $G^I_t$ are denoted by the label $\chi : v \mapsto \chi_v$, a random variable over $\mathsf{B}^{\mathsf{V}}$ dependent on $\rho$.

Comparing Transcript Probabilities
Given a transcript t, there are four types of collisions that could occur:
1. t contains an output block in $\{0^n, 100^{n-2}, 010^{n-2}, 110^{n-2}\}$, allowing the adversary to determine ρ's output on any of those inputs,
2. t contains colliding output blocks, meaning there exist $i, i', j, j'$ such that either $i \neq i'$ or $j \neq j'$, and $S^{\pm}_i[j] = S^{\pm}_{i'}[j']$,
3. when mapping nodes from G t to nodes in G I t two ρ-inputs inevitably collide through poor design of f , which we call a structural collision, and
4. when applying ρ to G I t two ρ-inputs collide, in which case we call the collision accidental.
Define the set T bad so that it includes all transcripts satisfying condition 1 or condition 2. We naturally have that T good is the complement of T bad . Proposition 2 below analyzes the probability of a bad transcript occurring when interacting with ($ s , dec-stream * ). The last two events cannot be described purely in terms of t and lead to the following definition.
Definition 4 (ρ-coll t ). Event ρ-coll t occurs if there are two different nodes v and w of G t that map to two different nodes v′ and w′ in G I t , respectively, under I's induced mapping, such that χ v′ = χ w′ .
As long as t is not in T bad , each single-element sequence ((0, S ± i [j])) gets placed as a distinct root node in G t without children. Then, under the I-induced mapping, those single-element sequences get mapped to elements ((S ± i [j])), which again become distinct root nodes in G I t without children. Finally, if ρ-coll t does not hold, then each ρ-output seen by the adversary is the result of ρ queried with an input which does not collide with any other ρ-input, thereby establishing the following proposition, and hence also the statement given in (39).
Proof. Since $ s 's outputs are all uniform and independently distributed, the chance that one of its output blocks is in 0 n , 100 n−2 , 010 n−2 , 110 n−2 is 4σ P /2 n , and the probability that two of its output blocks collide is σ 2 P /2 n+1 . Similarly, dec-stream * either repeats $ s output, or generates independent, uniform random output, in which case a bad transcript occurs with probability at most 4σ C /2 n + σ 2 C /2 n+1 .
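To get a feeling for the magnitude of the bad-transcript term, the sketch below evaluates the expression 4σ/2^n + σ²/2^(n+1) from the proof exactly. The function name `bad_transcript_bound` is a hypothetical helper, and the numeric parameters in the usage line are arbitrary illustrations rather than values taken from the paper.

```python
from fractions import Fraction


def bad_transcript_bound(sigma, n):
    """Evaluate 4*sigma/2^n + sigma^2/2^(n+1) as an exact fraction.

    sigma: number of output blocks (sigma_P or sigma_C in the text)
    n:     block size in bits
    """
    special_block = Fraction(4 * sigma, 2 ** n)       # block hits one of the 4 reserved values
    collision = Fraction(sigma ** 2, 2 ** (n + 1))    # two output blocks collide
    return special_block + collision


# Illustrative only: 2^32 blocks with a 128-bit block cipher.
example = float(bad_transcript_bound(2 ** 32, 128))
```

The quadratic term dominates once σ exceeds 8, which reflects the usual birthday-type behaviour of the bound.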

Bounding ρ-coll t Probability
As explained above, ρ-coll t could occur either due to a structural or an accidental collision. A structural collision occurs when the I-induced mapping from G t to G I t maps two different subsequences generated by split to the same subsequence generated by I. If two different message sequences get mapped to the same intermediate function sequence, then a ρ-input collision is guaranteed to occur if the ρ-input calculations start from the same constant. However, as long as the mappings $(\delta, X) \mapsto f_{\delta,X}$ are injective, meaning if $(\delta, X) \neq (\delta', X')$ then $f_{\delta,X} \neq f_{\delta',X'}$ as functions, the I-induced mapping will be injective as well.
Note that (X, δ) → f (X,δ) is in fact injective, since if X = X , one can find Y such that Structural collisions only occur when the intermediate functions are not injective. Since we have chosen injective functions in our design such collisions never occur in transcripts produced by SUNDAE.
Nevertheless, there could still be accidental collisions among the ρ-inputs when ρ is applied to G I t . In Sec. 4.10 we derive a general bound which can be used to analyze this case, which we subsequently apply.
For example, consider the transcript consisting of the following elements: O 1 (a 1 , p 1 ||p 5 ) = s 6 , s 7 , s 8 O 1 (a 2 ||a 4 , p 2 ) = s 9 , s 10 (56) For convenience we assume that all the a i 's and p i 's are mutually distinct and full blocks so that they are queried with the same initial IV.
In Figure 3, we construct the graphs G t and G I t for this transcript, and illustrate all events of interest that can occur. The transcript is in T bad if one of the 2 bad events occur in the induced graph G I t : 1. either one of the s i 's is of the form * * 0 n−2 , or 2. if for stream outputs of different O 1 queries we have s i = s j .
A structural collision occurs when we have some $f_{\delta,x} = f_{\delta',y}$ for $(\delta, x) \neq (\delta', y)$. In that case the structures of G t and G I t are no longer isomorphic. For example in Figure 3, if f 0,a0 = f 1,a1 , then the nodes corresponding to this function collapse to one single node which is then connected to f 0,p1 by a dotted edge as shown in the figure. However, the functions chosen in SUNDAE ensure that such collisions never occur.
The 4th type of collisions denoted by the event ρ-coll t occurs when the labels of two different nodes χ i and χ j are accidentally equal due to the randomness induced by the URF ρ. In the following sub-section, we concentrate on finding the probability that this event occurs in a given directed graph G I t when ρ is chosen at random from all functions from B → B.

Bounding ρ-Input Collisions
Let G be a graph which is a directed forest, meaning G consists of disjoint directed trees. Let V be the set of nodes of G and R ⊂ V the set of root nodes. Say that each node in G is labelled by a function from X to X. We denote the function at a node v ∈ V as $\hat{v}$. Hence the application of the function at v to x ∈ X is written $\hat{v}(x)$. Furthermore, given a subset S ⊂ V , we write $\hat{S}$ to denote the set of functions underlying the nodes of S. In particular, $|\hat{S}| \leq |S|$, with equality if and only if each node in S represents a different function. A sibling set S of G is the set of children of some other node in G. The only nodes in G which are not part of a sibling set are the root nodes. In particular, R along with all of G's sibling sets forms a partition of V .
As before, ρ : X → X can be applied to G by applying the cascade to each path in G starting from a root node. We let χ ∈ X V denote the ρ-inputs associated to the nodes V .
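The cascade can be sketched in a few lines. The code below assumes that root nodes carry fixed values and that a child's ρ-input is obtained from its parent's by applying ρ followed by the child's own function, as in the relation χ_w = ŵ(ρ(χ_v)) used in the proof of Proposition 3. All names (`rho_inputs`, `has_collision`, `lazy_random_function`) and the dictionary-based forest representation are hypothetical choices for this illustration.

```python
import random


def rho_inputs(parent, node_fn, root_value, rho):
    """Compute the rho-input chi[v] of every node in a directed forest.

    parent:     dict mapping each non-root node to its parent
    node_fn:    dict mapping each non-root node w to its function w_hat: X -> X
    root_value: dict mapping each root node to its fixed value in X
    rho:        callable X -> X
    """
    chi = {}

    def visit(v):
        if v not in chi:
            if v in root_value:
                chi[v] = root_value[v]
            else:
                # chi_w = w_hat(rho(chi_v)) for the edge v -> w
                chi[v] = node_fn[v](rho(visit(parent[v])))
        return chi[v]

    for v in list(root_value) + list(node_fn):
        visit(v)
    return chi


def has_collision(chi):
    """True if two different nodes received the same rho-input."""
    return len(set(chi.values())) < len(chi)


def lazy_random_function(n_bits, seed=None):
    """Lazily sampled uniformly random function on n_bits-bit integers."""
    rng = random.Random(seed)
    table = {}

    def rho(x):
        if x not in table:
            table[x] = rng.getrandbits(n_bits)
        return table[x]

    return rho
```

Running `rho_inputs` repeatedly with fresh outputs of `lazy_random_function` and counting how often `has_collision` returns true gives an empirical estimate of the collision probability for a small forest.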
Consider the event that no two ρ-inputs collide, or equivalently, that χ is in ∂X V . Our goal is to characterize the probability of this event in terms of G as follows.
Proposition 3. Let R denote G's root nodes, and say that $|\hat{R}| = |R|$. Let $S_1, S_2, \ldots, S_\tau$ denote an enumeration of G's sibling sets, and let
Proof. We know that therefore lower bounding this probability can be done by focusing on the labels in $\partial X^V$ which occur with non-zero probability. We call such labels valid.
Since G's root nodes are fixed values, if two root nodes represent the same value, then no label in $\partial X^V$ will be valid. Therefore we must use the fact that $|\hat{R}| = |R|$, since otherwise it is impossible for no two ρ-inputs to collide. Furthermore, a label $x \in \partial X^V$ is only valid if $x_v = \chi_v$ for all v ∈ R, hence we restrict our attention to such labels.
Since the only randomness present in χ is that provided by ρ, validity of a label x is determined via equations relating ρ to x, as imposed by G's edges. If node v is connected by an edge to node w, then w's label is calculated from v's label by applying ρ and then $\hat{w}$, or in other words, $\chi_w = \hat{w}(\rho(\chi_v))$, which is equivalent to $\rho(\chi_v) \in \hat{w}^{-1}(\chi_w)$. In particular, letting $C_v$ denote the set of children of a node v ∈ V , (60) where the children sets $C_v$ have been replaced by the sibling sets $S_1, S_2, \ldots, S_\tau$. Therefore the probability of a valid label can be lower bounded by $1/|X|^{\tau}$ (which reaches equality if the functions $\hat{w}$ are bijective), and we get We lower bound the number of valid labels as follows. Consider the possible labels for the first sibling set $S_1$. We know that $x_r$ must equal $\hat{r}$ for all root nodes r ∈ R. Since $x_v$ with $v \in S_1$ cannot equal $x_r$ for r ∈ R, we know that $x_v \in X \setminus \hat{R}$. Hence there are at most $\partial(X \setminus \hat{R})^{S_1}$ possible labellings of $S_1$, and at least $\Gamma_1$ valid ones. After fixing a labelling of $S_1$, the labels for $S_2$ must be taken from a set of size $|X| - |R| - |S_1|$, and so $\Gamma_2$ lower bounds the number of possible valid labellings of $S_2$. Continuing like this for the other sibling sets, we have that $\Gamma_i$ lower bounds the number of possible labellings for $S_i$, and the total number of valid labels for all of G is bounded below by the product of the $\Gamma_i$.

Proposition 4. If all functions in S i are bijective for all i, then
Proof. Let $w \in S_i$ and consider some set $Y \subset X$ such that $|Y| = |X| - N_i$. Since $\hat{w}$ is bijective, $\hat{w}^{-1}(x_w)$ is a singleton set, therefore for any $x \in \partial Y^{S_i}$ there is at most one element in the set $\bigcap_{w \in S_i} \hat{w}^{-1}(x_w)$. Call the element $\alpha_x$ if it is present. Then the mapping from x for which $\alpha_x$ exists to $\alpha_x$ must be injective since $\hat{w}^{-1}$ is injective for all $w \in S_i$. Furthermore, the mapping from $\alpha \in X$ to $\partial X^{S_i}$ defined by $\alpha \mapsto (w \mapsto \hat{w}(\alpha))_{w \in S_i}$ is injective as well since if $\alpha \neq \alpha'$, then $\hat{w}(\alpha) \neq \hat{w}(\alpha')$ for any $w \in S_i$. Therefore The set $\{\alpha \in X \mid (w \mapsto \hat{w}(\alpha))_{w \in S_i} \in \partial Y^{S_i}\}$ can be rewritten as w).
Applying the above to the collision probability of χ, we get Prop. 4 allows us to focus on analyzing collisions among the functions (X, δ) → f (X,δ) . We have for any X and X , coll X (f (X,0) , f (X ,1) ) ≤ 1, and if |X| < n and |X | = n, w). First, note that C i is non-zero only if its associated sibling set has size greater than one. Sibling sets of size greater than one can only be created when a query is made, and for each query, either an existing sibling set becomes larger, or a new sibling set is created. This means there are at most q + q v sibling sets of size greater than one. The size of the sibling sets is also bounded above by q + q v , therefore

Collecting the Results to Compute the Bound of Theorem 1
Sect. 4.4 applies the PRP-PRF switch to get that DAE(A) is bounded above by allowing us to focus on ∆ A (enc, dec ; $, ⊥). Sect. 4.6 proceeds to eliminate the chopxor function, concluding that Then the intermediate construction dec-stream * is introduced in Sect. 4.7, to establish that Using the bounds given in equations (72), (73), and (74), we have that the DAE security of SUNDAE is upper bounded by To bound the term ∆ A chopxor (enc-stream, dec-stream; $ s , dec-stream * ) we make use of Patarin's method. To this end, we divided the set of transcripts seen by the adversary into good and bad transcripts. We have established in Sect. 4.8.4 that for all t ∈ T good , By using (71), the above becomes The probability of a bad transcript occurring when A chopxor is interacting with the intermediate oracles $ s and dec-stream * is bounded above in Sect. 4.8.4, thus we can apply Lem. 2 in a straightforward manner to get Thus, adding the above bounds, we get that the DAE security of SUNDAE is upper bounded by

In Software: Embedded and Server-Side
SUNDAE is designed to have little overhead besides its block cipher calls. Besides an n-bit state, it only requires two XORs per block and one or two finite field multiplications with a constant per message. Regarding performance, we expect serial software implementations of SUNDAE to run at half the speed of the underlying block cipher.
Setting. Our study considers embedded and high-performance parallel software implementation possibilities for SUNDAE with the following exemplary choices.
Block Cipher: We use AES [DR02] which is widely standardized and deployed in practice.
Platforms: As a case study for embedded devices, we use the Cortex-A57 core of a Samsung Exynos 7420 CPU (ARMv8 platform). For the server side, an Intel Core i7-6700 CPU (Skylake microarchitecture) was used. On both architectures, the cryptographic instruction support for AES is used, and key scheduling is precomputed. On each platform, SUNDAE is implemented serially with minimum overhead and in a parallel way for maximum performance.
Message Lengths: Performance data is provided for message lengths of ℓ = 2^b bytes, with 6 ≤ b ≤ 11, covering most typical use cases, and in particular also illustrating SUNDAE's performance for relatively short inputs. To evaluate SUNDAE's efficiency when parallelization possibilities from multiple input streams arise, we also implemented it using the Comb scheduling strategy [BLT15] when instantiated with both fixed length messages and a message length mix according to a typical Internet packet size distribution [BLT15], in which around 40% of all packet lengths are short (below 100 bytes) and another 40% are moderately long (around 1500 bytes), hence emphasizing the importance of good performance for shorter messages.
Performance measurements. All measurements were taken on a single core. For the Intel platform, the CPU was a Core i7-6700 CPU at 3.4 GHz with Turbo Boost disabled. On ARM, a single Cortex-A57 core of the Samsung Exynos 7420 was used. The reported performance numbers were obtained as the median of 91 averaged timings of 200 measurements each [KR11]. The performance of AES, both in serial and parallel implementations, is provided as a baseline. Our implementation results are summarised in Table 1. All performance numbers are given in cycles per byte (cpb). The last column, denoted "mix", gives performance for the Internet message length distribution outlined previously.
Discussion. Table 1 confirms that serial software implementations of SUNDAE achieve roughly half the throughput of the underlying block cipher with almost no extra overhead on both Intel and ARM platforms: on Intel, SUNDAE is around 3% slower than two passes of CBC; on ARM, around 7%. SUNDAE's performance for short message lengths is only around 11% worse than for longer messages. Compared to the single-pass nonce-dependent COFB, SUNDAE has an overhead of 60% for short and 80% for long messages on Intel, and 35% for short and 80% for long messages on ARM. Parallel implementations for processing multiple input streams are also possible, making full use of Intel and ARM's cryptographic instruction pipelines. Intel's AES-NI encryption instruction has a latency of 4 and an inverse throughput of 1 on the Skylake microarchitecture, whereas ARM's AES instructions come with a combined latency of 3 and an inverse throughput of 1 on the Cortex-A57. However unlike on Intel, they share the same pipeline as the logical and byte shuffling instructions (XOR,VEXT), which limits the performance gain for most block cipher modes, which alternate these with block cipher invocations.
As further reference points, we compare SUNDAE's performance to the nonce-dependent CLOC and JAMBU. As reported in [Iwa16], CLOC runs on Skylake at 2.82 cpb for long and 7.81 cpb for 64-byte messages. For short messages, SUNDAE performs better, whereas for longer messages, CLOC benefits from being a one pass, but two call, scheme, which allows limited use of the pipeline. Performance data for JAMBU on Skylake has been reported in [BLT16] as 5.5 cpb for long and 6.8 cpb for 64-byte messages, similar to SUNDAE's performance. However, recall that, unlike SUNDAE, CLOC, COFB and JAMBU all depend on nonce freshness for security.
Even in comparison with GCM-SIV, which is nonce misuse-resistant but targets high-end platforms, SUNDAE offers similar performance using the Comb technique: GCM-SIV runs at around 1.2 cpb for 2 KB messages [GL15], with SUNDAE performing at 1.3 cpb.

ASIC Implementation
Lightweight implementations of CLOC [IMGM14], SILC [IMG+14] and AES-OTR [Min14], with AES-128 as the underlying block cipher, have already been published in [BBM16]. As with all rate 1/2 modes like CLOC and SILC, and also some rate 1 modes like OTR [BBM16], there is a need for offline storage of message blocks in order to read them twice. Note that this was also assumed in the lightweight implementation of the above modes in [BBM16]. In that work, the authors use the 8-bit serial implementation of AES given in [MPL+11]. The authors implemented the above modes for two typical use cases: (a) aggressive and (b) conservative. The aggressive design implemented a version of the circuit that only catered to a limited set of sizes of the plaintext and associated data. For example, the aggressive circuit was only designed to process user inputs in which the associated data was empty and the length of the plaintext was an integral multiple of the block size of the underlying block cipher. The intermediate outputs produced by the circuit were stored offline, and an external processor made them available at the input buses as required by the design. This relaxed many of the storage requirements in the circuit, and so the circuit occupied lower gate area. The conservative circuit had no such constraints and was designed to handle all types of user inputs within certain bounds (up to 8 blocks of associated data and 256 blocks of plaintext). All outputs of intermediate modules were stored in additional registers in the circuit, as a result of which its gate area was significantly larger. In our implementation of SUNDAE, we do not distinguish between different use cases: our implementation is able to handle inputs of all sizes within the aforementioned bounds. Furthermore, the mode of operation has been designed in a manner that does not require temporary storage of any intermediate results; as a result, no additional storage elements are required for this purpose.
This essentially corresponds to the conservative use case of [BBM16]. We implement the mode using the block ciphers AES-128 and Present [BKL+07]. For AES, we use the Atomic-AES architectures developed in [BBR16a,BBR16b]. These are 8-bit serial architectures for AES meant for accommodating both encryption and decryption on the same platform. We use the Atomic AES v2.0 architecture, which is smaller in area than the circuit in [MPL+11] by around 200 gate equivalents (GE). Furthermore, we try to do away with the requirement of an additional register to perform field doublings and quadruplings by changing the structure of the finite field. Instead of performing doubling over $GF(2^{128})$, we perform 8 doublings over $GF(2^{16})/\langle x^{16} + x^5 + x^3 + x + 1 \rangle$ in the following way. Note that f now becomes a function that can be easily implemented using a bytewise shift register as shown in Figure 4. The diagram shows a glimpse of the datapath of the design. The numbered boxes denote byte-sized registers, and those colored grey denote scan registers. We refer to [BBR16b] for a detailed functional description of the circuit. As can be seen, in addition to the original AES circuit, only three 8-bit xor gates, an 8-bit two-input multiplexer, and an 8-bit two-input AND gate are required.
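The idea of replacing one doubling over GF(2^128) by eight parallel doublings over GF(2^16) can be sketched in software as follows. The reduction polynomial x^16 + x^5 + x^3 + x + 1 is taken from the text; the byte ordering (byte i holding the coefficient of x^i for each of the eight bit positions) is an assumption made for this example and may differ from the ordering used in the actual circuit. The function names are hypothetical.

```python
REDUCTION_TAPS = (0, 1, 3, 5)   # x^16 = x^5 + x^3 + x + 1 in the quotient ring


def double_gf16(a):
    """Multiply a 16-bit field element by x modulo x^16 + x^5 + x^3 + x + 1."""
    a <<= 1
    if a & 0x10000:
        a ^= 0x1002B            # clears bit 16 and adds x^5 + x^3 + x + 1
    return a


def double_bytewise(state):
    """Double eight GF(2^16) elements at once on a 16-byte state.

    Assumed layout: bit b of state[i] is the coefficient of x^i of the
    b-th field element, so doubling is a one-position byte shift with
    the outgoing byte fed back at the tap positions.
    """
    if len(state) != 16:
        raise ValueError("state must contain exactly 16 bytes")
    top = state[15]
    out = [0] + list(state[:15])
    for i in REDUCTION_TAPS:
        out[i] ^= top
    return out


def bit_slice(state, b):
    """Extract the b-th GF(2^16) element from the byte-sliced state."""
    return sum(((byte >> b) & 1) << i for i, byte in enumerate(state))


def quadruple_bytewise(state):
    """Quadrupling is two consecutive doublings."""
    return double_bytewise(double_bytewise(state))
```

In this layout, position 0 simply receives the outgoing byte, so only the taps at positions 1, 3 and 5 need an actual XOR, which is consistent with the three 8-bit xor gates mentioned above.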
When the signal FMODE is low, the output of the AND gate is zero and all the xor gates are essentially bypassed (logically), and the circuit behaves as it should during the encryption cycles, with data flowing along the blue path between the registers. However, when FMODE is high, the circuit computes f in the next clock cycle. Since quadrupling is f ∘ f , it can be computed by setting FMODE to logic high for two consecutive cycles.
Timing. The Atomic-AES architecture takes 246 clock cycles to encrypt one block of plaintext. Note that the first 16 cycles are used for loading the plaintext/key onto the registers, and the last 16 cycles are used to produce the ciphertext in a bytewise manner. Therefore, if the mode of operation calls for 2 consecutive encryption operations (as in lines 8, 17, 23 of Algorithm 1), then in the last 16 cycles of the first encryption, the intermediate ciphertext can be xored with the associated data/plaintext and loaded onto the registers to start the next encryption cycle. As a result, a total of n consecutive encryptions can be done in 246 + (n − 1) · 230 cycles. Let L* be the number of blocks in the associated data and the initial b 1 ||b 2 ||0^(n−2) block, and L be the number of blocks in the plaintext. We break up the analysis into subcases:
1. Both |A| = 0 and |M| = 0: In this case L* = 1 and L = 0. Only one block is encrypted, which takes 246 cycles.
2. |A| = 0 but |M| ≠ 0: In this case L* = 1 and L ≥ 1. The initial b 1 ||b 2 ||0^(n−2) block and the first L − 1 plaintext blocks are processed in T 2 = 246 + (L − 1) · 230 cycles. Then F 2 = 1 or 2 cycles are used for the doubling/quadrupling. The last plaintext block and all the plaintext blocks in the second pass require T 3 = 246 + L · 230 cycles. The total time taken to process one user input is T 2 + T 3 + F 2 cycles.
3. |A| ≠ 0 but |M| = 0: In this case L* ≥ 2 and L = 0. The initial L* − 1 blocks can be encrypted in T 1 = 246 + (L* − 2) · 230 cycles. This is interrupted by the field doubling/quadrupling, which takes F 1 = 1 or 2 cycles according to whether the last associated data block is non-integral or not. Thereafter, the last associated data block takes another T 2 = 246 cycles. The total time taken is T 1 + F 1 + T 2 cycles.
4. Both |A| ≠ 0 and |M| ≠ 0: In this case L* ≥ 2 and L ≥ 1. The initial L* − 1 blocks can be encrypted in T 1 = 246 + (L* − 2) · 230 cycles. This is interrupted by the field doubling/quadrupling, which takes F 1 = 1 or 2 cycles according to whether the last associated data block is non-integral or not. Thereafter, the last associated data block and the first L − 1 plaintext blocks are processed in T 2 = 246 + (L − 1) · 230 cycles. Then F 2 = 1 or 2 cycles are used for the doubling/quadrupling. The last plaintext block and all the plaintext blocks in the second pass require T 3 = 246 + L · 230 cycles. So the total time taken to process one user input is T 1 + T 2 + T 3 + F 1 + F 2 cycles.
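The four subcases can be collected into a small cycle-count calculator. The sketch below uses only the constants 246 and 230 and the expressions for T 1 , T 2 , T 3 given above; the function names and the convention of passing F 1 and F 2 directly as arguments (1 for a doubling, 2 for a quadrupling) are hypothetical choices for this illustration.

```python
FIRST_BLOCK_CYCLES = 246   # stand-alone encryption of one block
NEXT_BLOCK_CYCLES = 230    # each further back-to-back encryption


def consecutive(n):
    """Cycles needed for n >= 1 consecutive encryptions."""
    return FIRST_BLOCK_CYCLES + (n - 1) * NEXT_BLOCK_CYCLES


def sundae_aes_cycles(l_star, l, f1=1, f2=1):
    """Total cycles to process one input on the Atomic-AES based circuit.

    l_star: blocks of associated data plus the initial block (>= 1)
    l:      plaintext blocks (>= 0)
    f1, f2: 1 or 2 cycles for the doubling/quadrupling steps
    """
    if l_star < 1 or l < 0:
        raise ValueError("need l_star >= 1 and l >= 0")
    if l_star == 1 and l == 0:
        return consecutive(1)                          # case 1
    total = 0
    if l_star >= 2:
        total += consecutive(l_star - 1) + f1          # T1 + F1
        if l == 0:
            return total + consecutive(1)              # case 3: T2 = 246
    total += consecutive(l) + f2                       # T2 + F2
    total += consecutive(l + 1)                        # T3 (second pass)
    return total
```

For instance, with one full associated data block and one full plaintext block (`l_star=2, l=1, f1=2, f2=2`) the formulae give 246 + 2 + 246 + 2 + 476 = 972 cycles.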

Present.
For the implementation with Present as the underlying block cipher, we use the 4-bit serial implementation proposed in [RPLP08]. Since this is a nibble-based implementation of the Present block cipher, we define the doubling and quadrupling operations in a nibble-wise fashion. That is to say, if d 0 , d 1 , . . . , d 15 are the 16 nibbles of the 64-bit block, we perform 4 doublings over $GF(2^{16})/\langle x^{16} + x^5 + x^3 + x + 1 \rangle$ for each of the 4 bit- . Also, one encryption using the nibble-serial architecture in Present takes 567 cycles, out of which 20 cycles are needed to load the 80-bit key in a nibble-wise fashion. Hence, n consecutive encryptions take 567 + (n − 1) · 547 cycles, and the expressions for T 1 , T 2 , T 3 change accordingly. It is obvious that the construction with Present offers much less throughput than AES. However, Present is one of the most well analyzed lightweight block ciphers and has been adopted as a standard in ISO/IEC 29192-2. As seen in Table 2, the total circuit area for Present-SUNDAE is only around 1450 GE, which makes it ideal for lightweight platforms.

Results. In Table 2 we present the synthesis results for the designs. The following design flow was used: first the design was implemented in VHDL. Then, a functional verification was done using Mentor Graphics Modelsim software. The designs were synthesized using the standard cell library of the 90nm logic process of STM (CORE90GPHVT v 2.1.a) with the Synopsys Design Compiler, with the compiler being specifically instructed to optimize the circuit for area. A timing simulation was done on the synthesized netlist. The switching activity of each gate of the circuit was collected while running post-synthesis simulation. The average power was obtained using Synopsys Power Compiler, using the back annotated switching activity.
As can be seen in the table, our implementation of AES-SUNDAE occupies around 2524 GE and is around 600 GE smaller than the aggressively designed CLOC and SILC circuits, synthesized with the same standard cell library. Since it is only fair to compare our design with the conservative CLOC/SILC/AES-OTR designs, we see that here too our implementation outperforms the CLOC, SILC and AES-OTR circuits by around 1800, 1700 and 4220 GE respectively. Also, Present-SUNDAE only occupies around 1452 GE which makes it ideal for deployment in lightweight platforms.
In Table 3, we present the area-wise breakup of the various components of the circuit. For AES-SUNDAE we see that around 85% of the area is occupied by the encryption logic (AES core) alone. This area includes the three additional circuit elements to compute the field doubling. Around 4% of the area is required for the length counters that encode and keep track of the number of blocks of associated data and plaintext currently processed. The remaining area is required for the control logic for routing signals to and out of the encryption core. For Present-SUNDAE, we have roughly the same area distribution, but the percentage contributions are different since the Present core is much smaller than AES.

Comparison with JAMBU and COFB
These figures also compare favorably with modes like JAMBU. The state size in JAMBU is one and a half times the block size of underlying block cipher. So when implemented with AES-128 as the underlying cipher, the mode requires a state of 192 bits. Only 128 bits can be accommodated in the data register of AES and so an additional 64 bit register is needed, which requires around 300 GE. So, if we use the Atomic-AES architecture to design the circuit for JAMBU we can estimate that the area required would be approximately 2100 (AES core) + 100 (Length counter) + 250 (control logic) + 300 (64 bit Register) ≈ 2750 GE, which is still around 250 GE more than the circuit for AES-SUNDAE.
COFB [CIMN17b,CIMN17a] is an AE mode designed by Chakraborti et al. at CHES 2017. It aims at reducing the hardware area over and above the underlying block cipher by using an n/2-bit mask (where n is the block size of the underlying block cipher). Since most AE modes use an n-bit mask, this reduces the storage requirement by n/2 flip-flops over many standard AE schemes. However, this still needs an additional register of n/2 bits to store and update the mask, and so COFB, like JAMBU, has an effective internal state of 1.5n bits. Even so, the design is suited for lightweight implementation using the Atomic-AES architecture. We note the following:
• Mask Logic: To implement COFB, we need a mask register of size n/2 = 64 bits. The register needs to be initialized by an intermediate 64-bit value ∆, and 3 possible updates are specified: (a) multiplication by the primitive element α of $GF(2^{64})/\langle x^{64} + x^4 + x^3 + x + 1 \rangle$, (b) multiplication by 1 + α, and (c) multiplication by (1 + α)^2. Multiplication by α will need only 3 xor gates, whereas multiplication by 1 + α requires 64 + 3 = 67 xor gates. Multiplication by (1 + α)^2 can be achieved by doing multiplication by 1 + α over 2 successive cycles. So at different points of time in the encryption cycle, the register would have to load one of 3 values: the constant ∆, the result of multiplication by α, or the result of multiplication by 1 + α. The most hardware-effective way of implementing it is using a register with scan flip-flops along with an additional 64-bit multiplexer. So in total, we can estimate that the mask logic will require around 350 (scan register) + 128 (multiplexer) + 134 (xor gates) = 612 GE.
• Linear Mixing: COFB uses the linear mixing function G over the 128-bit AES state. This function is essentially the same as the g 2 function used in CLOC. As a result, the AES state register can be tweaked like a columnwise shift register to achieve this functionality, in exactly the same way as it was done in [BBM16, Fig. 1(a)]. The tweak requires a 32-bit multiplexer and an equal number of xor gates. Thus the modified AES circuit would require 2060 + 64 (multiplexer) + 64 (xor gates) = 2188 GE.
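The area estimates in this comparison are simple sums of component figures, which the following sketch reproduces. All gate-equivalent values are copied from the text (including the 2524 GE reported for AES-SUNDAE in the results); the dictionary names and the helper `total_ge` are hypothetical and only serve to make the arithmetic explicit.

```python
AES_SUNDAE_GE = 2524  # synthesized area reported for AES-SUNDAE

# Estimated JAMBU area on the Atomic-AES architecture
jambu_parts = {
    "AES core": 2100,
    "length counter": 100,
    "control logic": 250,
    "64-bit register": 300,
}

# Estimated COFB mask logic
cofb_mask_parts = {
    "scan register": 350,
    "64-bit multiplexer": 128,
    "xor gates (67)": 134,
}

# Estimated AES circuit modified for COFB's linear mixing
cofb_aes_parts = {
    "AES circuit": 2060,
    "32-bit multiplexer": 64,
    "xor gates (32)": 64,
}


def total_ge(parts):
    """Sum the gate-equivalent estimates of all listed components."""
    return sum(parts.values())


jambu_total = total_ge(jambu_parts)             # 2750 GE
cofb_mask_total = total_ge(cofb_mask_parts)     # 612 GE
cofb_aes_total = total_ge(cofb_aes_parts)       # 2188 GE
jambu_overhead = jambu_total - AES_SUNDAE_GE    # 226 GE, i.e. roughly 250 GE
```

The multiplexer and xor figures correspond to about 2 GE per bit, which is the implicit per-gate cost behind the estimates quoted above.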

Throughput
We can also estimate the throughput of the various modes when using the Atomic AES architecture, using the fact that n consecutive blocks are encrypted in 246 + 230(n − 1) cycles. We tabulate the number of cycles-per-byte (CpB) required to operate the modes as a function of message length in Table 4. Among all rate 1/2 modes, AES-SUNDAE performs best when dealing with short messages less than 16 plaintext blocks, whereas all rate 1/2 modes asymptotically reach a constant CpB value for longer messages. Among the rate 1 modes, the performance of SUNDAE is comparable with OTR for short messages of 1 to 4 blocks, whereas COFB performs best overall.
The labels of the nodes are the last elements of their sequence. Fig. 5 displays the resulting graph, where only the node labels are displayed and the C i [j] are not included.