Lightweight and Side-channel Secure 4 × 4 S-Boxes from Cellular Automata Rules

Abstract. This work focuses on side-channel resilient design strategies for symmetric-key cryptographic primitives targeting lightweight applications. In light of NIST's lightweight cryptography project, design choices for block ciphers must consider not only security against traditional cryptanalysis, but also side-channel security, while adhering to low area and power requirements. In this paper, we explore design strategies for substitution-permutation network (SPN)-based block ciphers that make them amenable to low-cost threshold implementations (TI), a provably secure strategy against side-channel attacks. The core building blocks for our strategy are cryptographically optimal 4 × 4 S-Boxes, implemented via repeated iterations of simple cellular automata (CA) rules. We present highly optimized TI circuits for such S-Boxes that consume nearly 40% less area and power compared to popular lightweight S-Boxes such as those of PRESENT and GIFT. We validate our claims via implementation results on ASIC using 180nm technology. We also present a comparison of TI circuits for two popular lightweight linear diffusion layer choices: bit permutations and MixColumns using almost-maximum-distance-separable (almost-MDS) matrices. We finally illustrate design paradigms that combine the aforementioned TI circuits for S-Boxes and diffusion layers to obtain fully side-channel secure SPN block cipher implementations with low area and power requirements.
Keywords: Cellular Automata · Optimal S-Box


Introduction
Lightweight cryptography has gained great momentum with the proposal of a number of efficient symmetric-key cryptographic primitives in recent years. Design choices for lightweight cryptography typically focus on optimizing one or more essential implementation-based criteria, including (but not limited to) area, power, and throughput. At the same time, these primitives must also satisfy the basic security requirements against well-known cryptanalytic attacks such as linear [MY93] and differential [BS91] cryptanalysis. Lightweight block ciphers follow various design principles, amongst which the substitution-permutation network (SPN) is highly popular. An SPN structure typically comprises several rounds, where each round has three operational layers: (a) a layer of nonlinear substitution-boxes (S-Boxes), (b) a linear permutation layer, and (c) a round-key XOR. The impetus on lightweight cryptography has been further enhanced by NIST's recent announcement of a lightweight cryptography project [MBTM17], seeking design choices targeting a variety of devices and applications. In particular, the announcement lists

Overview of Our Contributions and Techniques
The main contributions of this paper are briefly summarized below:
• Lightweight and Side-channel Secure Design Strategies for S-Boxes. In this paper, we use cellular automata in order to design such nonlinear functions with inherently lightweight implementations. A cellular automaton is a finite state machine whose state transitions are based on simple local rules. Prior studies have extensively analyzed the scope of realizing complex functions via repeated iterations of these simple rules [Wol83,Wol84b,Wol84a]. A recent work by Picek et al. [PMY+17] explores the possibility of designing cryptographically optimal 4 × 4 S-Boxes from such simple 4 × 1 CA-based rules. The idea is to iterate over a single instance of the CA rule, while cyclically shifting the input bits, to obtain one output bit of an S-Box at a time. In this work, we take a step further and explore the possibility of designing cryptographically optimal 4 × 4 S-Boxes from CA rules, while also ensuring that such S-Boxes give rise to side-channel secure TI circuits with low area footprint and power consumption. The main design principle for the TI circuit remains the same: we protect the core CA rule by decomposing the input and output bits into as few shares as possible, and then iterate over this core unit by cyclically permuting the input bits. We demonstrate that a significant proportion of the resulting S-Boxes achieve cryptographically optimal properties, and give rise to distinct classes based on their implementation overheads and amenability to TI designs. We also demonstrate additional optimizations on the most lightweight of these S-Box classes by exploiting the decomposability of its CA rule into smaller Boolean functions. Our implementation results on ASIC (180nm technology) show that the most lightweight TI circuit among all CA-based S-Boxes has a 49.42% smaller area footprint and consumes 52.3% less power as compared to the best-known TI of the PRESENT S-Box [PMK+11]. The same TI circuit also leads to a 35.36% smaller area footprint and consumes 44.46% less power as compared to a highly optimized TI of the GIFT S-Box.
Ashrujit Ghoshal, Rajat Sadhukhan, Sikhar Patranabis, Nilanjan Datta, Stjepan Picek and Debdeep Mukhopadhyay
• Lightweight and Side-channel Secure Design Strategies for Permutation Layers. Permutation layers provide the much needed diffusion in any block cipher construction, and are hence important for side-channel security. Two main classes of permutation layers dominate nearly all lightweight SPN constructions: bit permutations and almost-maximum-distance-separable (almost-MDS) permutations. Examples of the former include PRESENT [BKL+07] and GIFT [BPP+17], while an example of the latter strategy is Midori [BBI+15]. In this paper, we present a comparative analysis of the area and power overheads corresponding to TI designs for both choices of permutations. Such a comparative analysis allows a designer to analyze the pros and cons of choosing either of these strategies with respect to a given application.
• Combining it All Together. Finally, we present a trade-off analysis between the design choices for the S-Box and permutation layers as components in an SPN structure. We first observe that our CA-based S-Boxes have a branch number of 2 (as opposed to 3 for the PRESENT S-Box), and also lack the bad-output-good-input (BOGI) property exhibited by the GIFT S-Box [BPP+17]. This makes it practically infeasible to combine these S-Boxes with bit-permutation layers in a full SPN structure and necessitates almost-MDS permutation layers. Interestingly, it turns out that the area and power savings from our CA-based S-Boxes outweigh the additional area and power requirements for an almost-MDS permutation layer over a bit permutation layer, particularly when implemented for side-channel security via TI. With these observations, we propose using CA-based S-Boxes in conjunction with almost-MDS mappings as a new design-for-security strategy for designing lightweight block ciphers that are amenable to low-area and low-power TI designs.

Paper Organization
The rest of this paper is organized as follows. In Section 2, we introduce the notation and present background material on cryptographic properties of S-Boxes, threshold implementations (TI), cellular automata (CA) and their properties, and relevant measurement units for area footprint and power consumption of CMOS devices. Section 3 presents direct-shared TI circuits for cryptographically optimal 4 × 4 S-Boxes obtained via repeated iterations of local CA rules, along with the area and power overheads for the same on ASIC platforms (180nm technology). Section 4 further refines these TI circuits by reducing the number of shares to achieve even lower area footprint and power consumption. Section 5 compares bit permutations and MixColumns using almost-MDS matrices in terms of their amenability to low-cost TI designs. This section also presents design paradigms for combining TI for S-Boxes and diffusion layers to achieve lightweight and fully side-channel secure block cipher implementations. Finally, Section 6 summarizes the major findings of the paper and discusses possible future research directions.

Cryptographic Optimality and Representation of S-Boxes
In the standard cryptographic nomenclature, a substitution box (abbreviated as S-Box), is a nonlinear n × m Boolean function. In the rest of the paper, we consider only S-boxes that have the same number of inputs and outputs, i.e., n × n S-boxes. Here, we briefly describe some important cryptographic properties of S-boxes.
• Algebraic Degree. To define the algebraic degree of an S-Box, we use the algebraic normal form (ANF) representation of a Boolean function $f$, i.e., its representation as a polynomial in $\mathbb{F}_2[x_0, \ldots, x_{n-1}]/(x_0^2 - x_0, \ldots, x_{n-1}^2 - x_{n-1})$:
$$f(x) = \bigoplus_{u \in \mathbb{F}_2^n} c_u \, x^u, \qquad c_u \in \mathbb{F}_2, \quad x^u = \prod_{i=0}^{n-1} x_i^{u_i}.$$
The algebraic degree $\deg f$ of a Boolean function $f$ is defined as the number of variables in the largest product term of the function's ANF having a non-zero coefficient [Car10a]. The algebraic degree $\deg F$ of an S-Box $F$ is the maximum algebraic degree of all non-zero linear combinations of the coordinate functions (i.e., component functions) of $F$ [Car10b]. Ideally, a cryptographically useful S-Box should have high algebraic degree to resist algebraic attacks [MPC04].
• Balancedness. Let $F$ be a function from $\mathbb{F}_2^n$ into $\mathbb{F}_2^n$. Then, $F$ is balanced if it takes every value of $\mathbb{F}_2^n$ exactly once.
• Nonlinearity. The nonlinearity of an n × n S-Box $F$ equals the minimum nonlinearity of all its component functions $v \cdot F$, where $v \in \mathbb{F}_2^{n*}$ [Nyb93,Car10b]:
$$N_F = 2^{n-1} - \frac{1}{2} \max_{a \in \mathbb{F}_2^n,\; v \in \mathbb{F}_2^{n*}} |W_F(a, v)|,$$
where $W_F(a, v) = \sum_{x \in \mathbb{F}_2^n} (-1)^{v \cdot F(x) \oplus a \cdot x}$ is the Walsh-Hadamard transform [Car10b] of the function $F$ and $a \cdot b$ is the usual inner product of $a, b \in \mathbb{F}_2^n$ that equals $a \cdot b = \bigoplus_{i=1}^{n} a_i b_i$. We use the notation $\mathbb{F}_2^{n*}$ to denote the non-zero elements of the vector space $\mathbb{F}_2^n$. The nonlinearity of any (n, n) function $F$ is bounded above by the covering radius bound:
$$N_F \le 2^{n-1} - 2^{\frac{n}{2} - 1}.$$
• Differential Uniformity. Let $F$ be an S-Box from $\mathbb{F}_2^n$ into $\mathbb{F}_2^n$ with $a \in \mathbb{F}_2^n$ and $b \in \mathbb{F}_2^n$. We define the difference distribution table of $F$ with respect to $a$ and $b$ as:
$$D_F(a, b) = \{x \in \mathbb{F}_2^n : F(x) \oplus F(x \oplus a) = b\}.$$
The entry at position $(a, b)$ corresponds to the cardinality of the difference distribution table $D_F(a, b)$ and is denoted as $\delta_F(a, b)$. The differential uniformity $\delta_F$ is then defined as [Nyb94]:
$$\delta_F = \max_{a \neq 0,\; b} \delta_F(a, b).$$
• Differential Branch Number. Let $F$ be an S-Box from $\mathbb{F}_2^n$ into $\mathbb{F}_2^n$. We define the differential branch number of $F$ as:
$$BN_F = \min_{a,\; b \neq a} \left( wt(a \oplus b) + wt(F(a) \oplus F(b)) \right),$$
where $wt(a)$ denotes the Hamming weight of $a$. Throughout this paper we use the term branch number to denote the differential branch number.
In order to resist linear and differential cryptanalysis attacks, a balanced S-Box should ideally have high nonlinearity and low differential uniformity. In particular, a 4 × 4 S-Box is said to be cryptographically optimal if it is bijective, has nonlinearity equal to 4, and differential uniformity equal to 4 [LP07].
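The properties defined above can be checked mechanically for any 4 × 4 S-Box. The following is a minimal sketch rather than the tooling used for the paper: the function names are our own, and the PRESENT S-Box table is used only as a familiar test input.

```python
def parity(v):
    """XOR of all bits of a non-negative integer."""
    return bin(v).count("1") & 1

def is_bijective(sbox):
    """Balancedness for an n x n S-Box: every output value occurs exactly once."""
    return sorted(sbox) == list(range(len(sbox)))

def nonlinearity(sbox, n=4):
    """N_F = 2^(n-1) - max|W_F(a, v)| / 2 over all a and all non-zero v."""
    size = 1 << n
    max_abs = 0
    for v in range(1, size):
        for a in range(size):
            w = sum(1 - 2 * (parity(v & sbox[x]) ^ parity(a & x)) for x in range(size))
            max_abs = max(max_abs, abs(w))
    return (size >> 1) - (max_abs >> 1)

def differential_uniformity(sbox, n=4):
    """Largest DDT entry over all non-zero input differences a."""
    size = 1 << n
    best = 0
    for a in range(1, size):
        counts = [0] * size
        for x in range(size):
            counts[sbox[x] ^ sbox[x ^ a]] += 1
        best = max(best, max(counts))
    return best

def branch_number(sbox, n=4):
    """min over a != b of wt(a ^ b) + wt(F(a) ^ F(b))."""
    size = 1 << n
    return min(bin(a ^ b).count("1") + bin(sbox[a] ^ sbox[b]).count("1")
               for a in range(size) for b in range(size) if a != b)

def is_optimal(sbox):
    """Optimality of a 4 x 4 S-Box: bijective, nonlinearity 4, differential uniformity 4."""
    return (is_bijective(sbox) and nonlinearity(sbox) == 4
            and differential_uniformity(sbox) == 4)

# S-Box of PRESENT, used here only as a well-known test input.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

if __name__ == "__main__":
    print(is_optimal(PRESENT_SBOX), branch_number(PRESENT_SBOX))
```

The same functions apply unchanged to any list of 16 nibble values, including S-Boxes generated from CA rules.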

Threshold Implementation: A Countermeasure to SCA
Here, we provide a brief overview of Threshold Implementation along with a simple example and a brief discussion on the importance of this countermeasure to resist side-channel attacks.

Countermeasures against SCA
There exist various countermeasures against side-channel power attacks which have been proposed over the years. A general approach focuses on decreasing the information gathered from traces.
• Noise Addition. Introducing external noise in the side-channel, shuffling the operations or inserting dummy operations in cryptographic implementations are often used as a countermeasure against side-channel attacks. The basic objective is to reduce the signal-to-noise ratio (SNR), and thereby decrease the information gathered from traces. Many works on this topic explicitly focus on improving the statistical distribution of these delays. Still, as shown by Durvaux et al.
• Dynamic and Differential CMOS Logic. Tiri et al. [TV04a] proposed Sense Amplifier Based Logic (SABL), a logic style that uses a fixed amount of charge for every transition, including the degenerated events in which a gate does not change state. In every cycle, a SABL gate charges a total capacitance with a constant value. SABL is based on two principles: (i) it is a Dynamic and Differential Logic (DDL) and therefore has exactly one switching event per cycle (independent of the input value and sequence) and (ii) during a switching event, it discharges and charges the sum of all the internal node capacitances together with one of the balanced output capacitances. Some special constant-power implementations such as Wave Dynamic Differential Logic (WDDL) [TV04b] are based on SABL and have close to constant power consumption. However, this comes at a huge overhead in area, time, and power consumption.
• Leakage Resilience. Another countermeasure, typically applied at the system level, focuses on restricting the number of usages of the same key for an algorithm. However, the generation and synchronization of new keys poses a major practical problem. Dziembowski et al. introduced a technique called leakage resilience [DP08], which relocates this problem to the protocol level by introducing an algorithm to generate these keys. This approach can be extended such that several different keys (chunks) are used with the same input text. Nevertheless, both of these techniques drastically decrease the performance of a system, and hence are not practical for real-world implementations.
• Masking. One of the most efficient and powerful approaches to thwart DPA is masking [CJRR99,GP99], which aims to break the correlation between the power traces and the intermediate values of the computations. This method achieves security by randomizing the intermediate values using secret sharing and carrying out all the computations on the shared values.

Threshold Implementation: A Brief Overview
Threshold Implementation (TI) is a widely used masking technique proposed by Nikova et al. [NRR06] as a countermeasure against Differential Power Attacks (DPA) [KJJ99]. What sets TI apart from most masking techniques is the security it guarantees even in non-ideal circuits, where glitches have been shown to result in leakage in more conventional masking schemes [MPO05]. Initially, the proposals on TI dealt solely with first-order DPA security, but it was later extended to protect against higher-order DPA attacks as well [BGN+14, RBN+15]. More recently, the pitfalls in the multivariate setting of the higher-order TI scheme were solved in [RBN+15]. TI works under extremely relaxed assumptions on the underlying leakage, which are more achievable in practical scenarios. It offers provable security and allows the construction of secure circuits that are practical in size. Additionally, designing a TI does not require many design iterations in practice. TI is a Boolean masking technique based on secret sharing and secure multi-party computation.
In order to achieve the mentioned security, a TI design must satisfy the following properties:
• Uniformity. All intermediate shares are required to be uniformly distributed. This ensures decoupling of intermediate states from the mean of the leakages, which is an essential requirement to counteract first-order DPA. It suffices to check uniformity at the inputs and the outputs of each of the functions [Bil15]. In case no direct uniform sharing is found, uniformity can be achieved either through correction terms by using more input shares, or by re-masking, i.e., adding randomness after the non-uniform computation.
• Non-completeness. Any combination of $d$ or fewer component functions $f_i$ of $f$ must be independent of at least one input share $x_i$ in order to achieve $d$-th order non-completeness. For protection against first-order DPA, 1st-order non-completeness is required, i.e., every function must be independent of at least one input share. Non-completeness ensures that the side-channel security of the final circuit is not affected by glitches. Since glitches can only occur in component functions and each individual component function $f_i$ lacks knowledge of at least one share $x_i$, glitches cannot reveal any additional information.
• Correctness. Applying the component functions to a valid shared input must always yield a valid sharing of the correct output.

A Simple Example of a Threshold Implementation
We illustrate the concept of TI using a simple example of a two-bit multiplier circuit computing $a = xy$. The following is a uniform sharing of the circuit [GDC17] with 1st-order non-completeness using four input and output shares.
where the output shares $a_1, a_2, a_3, a_4$ are computed as:
The number of input and output shares can be further reduced using random bits (see [Bil15] for details).
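The three TI properties can be tested exhaustively on a small sharing. The sketch below does not reproduce the four-share sharing of [GDC17]; as a stand-in it uses the classic three-share direct sharing of a single AND gate, which is correct and non-complete but not uniform. All function names are illustrative assumptions.

```python
from itertools import product

def share_and(x, y):
    """Three-share direct sharing of a = x AND y; x and y are 3-tuples of share bits."""
    x1, x2, x3 = x
    y1, y2, y3 = y
    a1 = (x2 & y2) ^ (x2 & y3) ^ (x3 & y2)  # never touches x1 or y1
    a2 = (x3 & y3) ^ (x1 & y3) ^ (x3 & y1)  # never touches x2 or y2
    a3 = (x1 & y1) ^ (x1 & y2) ^ (x2 & y1)  # never touches x3 or y3
    return (a1, a2, a3)

def xor_all(bits):
    """Recombine shares into the unshared bit."""
    result = 0
    for b in bits:
        result ^= b
    return result

SHARINGS = list(product((0, 1), repeat=3))

def is_correct():
    """The output shares must recombine to x AND y for every input sharing."""
    return all(xor_all(share_and(x, y)) == (xor_all(x) & xor_all(y))
               for x in SHARINGS for y in SHARINGS)

def is_non_complete():
    """Output share i must not change when input share i of x or of y is flipped."""
    for x in SHARINGS:
        for y in SHARINGS:
            out = share_and(x, y)
            for i in range(3):
                x_flipped = tuple(b ^ (j == i) for j, b in enumerate(x))
                y_flipped = tuple(b ^ (j == i) for j, b in enumerate(y))
                if share_and(x_flipped, y)[i] != out[i]:
                    return False
                if share_and(x, y_flipped)[i] != out[i]:
                    return False
    return True

def is_uniform():
    """For each unshared (x, y), all 4 valid output sharings must occur equally often."""
    for xv, yv in product((0, 1), repeat=2):
        counts = {}
        for x in SHARINGS:
            for y in SHARINGS:
                if xor_all(x) == xv and xor_all(y) == yv:
                    out = share_and(x, y)
                    counts[out] = counts.get(out, 0) + 1
        # 16 input sharings spread over 4 valid output sharings: 4 each if uniform
        if len(counts) != 4 or set(counts.values()) != {4}:
            return False
    return True

if __name__ == "__main__":
    print(is_correct(), is_non_complete(), is_uniform())
```

The failed uniformity check for this three-share variant is the reason a uniform sharing needs either more shares, as in the four-share example, or fresh randomness.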

Cellular Automata
Cellular Automata (CA) are parallel computational models used in order to simulate and analyze various discrete complex systems. A cellular automaton consists of a regular grid (lattice) of cells. The grid may be in any finite number of dimensions. For each cell, a set of cells called its neighborhood is defined relative to the specified cell. Each cell is in one of a finite number of states. Typically, at every time step all the cells update their states synchronously. The state update is governed by a local rule which is applied to the neighborhood of every cell.
CA as Vectorial Boolean Function. In this paper, we restrict ourselves to periodic boundary one-dimensional Boolean cellular automata, i.e., the case where every cell is in state 0 or 1 and the lattice is a linear array. A Periodic Boundary CA (PBCA) with $n$ input cells $F : \mathbb{F}_2^n \to \mathbb{F}_2^n$ is defined for all $x \in \mathbb{F}_2^n$ as:
$$F(x_1, \ldots, x_n) = (f(x_1, \ldots, x_d), f(x_2, \ldots, x_{d+1}), \ldots, f(x_n, x_1, \ldots, x_{d-1})),$$
where $f$, a Boolean function on $d$ variables ($d \le n$), is called the local rule. Thus, a CA can be seen as a vectorial Boolean function (S-Box) where each coordinate function $f_i$ corresponds to the local rule $f$ applied to the neighborhood $(x_i, \ldots, x_{i+d-1})$, with indices taken cyclically. The vectorial Boolean function $F$ of a CA is also called the CA global rule. We note that CA-based S-Boxes are widely used today, since the nonlinear transformation χ in Keccak is a PBCA with $n = 5$ cells and local rule $f$ defined as:
$$f(x_i, x_{i+1}, x_{i+2}) = x_i \oplus ((1 \oplus x_{i+1}) \cdot x_{i+2}).$$
Besides being used in Keccak, the same rule is also used in the Panama [DC98], RadioGatún [BDPA06], Subterranean [CDGP93], and 3Way [DGV94] ciphers. Unfortunately, despite being a very small rule that can be efficiently implemented, it results in optimal S-Boxes only for dimension 3 × 3 and is bijective only for odd dimensions.
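The global rule of a PBCA is short enough to simulate directly. The following sketch (function names are our own) applies a local rule to every cyclic neighborhood and checks the bijectivity of the χ rule for a few lattice sizes.

```python
from itertools import product

def pbca(rule, d, bits):
    """Apply a d-variable local rule to every cyclic neighborhood of the lattice."""
    n = len(bits)
    return tuple(rule(*(bits[(i + k) % n] for k in range(d))) for i in range(n))

def chi_rule(x0, x1, x2):
    """Local rule of chi: x_i XOR ((NOT x_{i+1}) AND x_{i+2})."""
    return x0 ^ ((1 ^ x1) & x2)

def is_bijective(rule, d, n):
    """The global rule is bijective iff all 2^n images are distinct."""
    images = {pbca(rule, d, bits) for bits in product((0, 1), repeat=n)}
    return len(images) == 1 << n

if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, is_bijective(chi_rule, 3, n))
```

For $n = 4$ the inputs 0000 and 0101 both map to 0000, which is a concrete instance of the non-bijectivity in even dimensions noted above.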

Area Overhead and Power Consumption Results
The CMOS technology used for all ASIC implementation results reported in this paper is 180nm. Each implemented circuit is taken through the RTL-to-GDS2 flow to estimate the area overhead and power consumption. We used Synopsys Design Compiler version I-2013.12-SP5-4 for synthesis and Synopsys IC-Compiler version J-2014.09-SP1 for placement and routing of the design. For simulation we used Synopsys VCS version I-2014.03-SP1-1. The standard cell library TSL18FS120 from Tower Semiconductor Ltd. is used for physical design. The area overhead for all implemented circuits is measured in terms of gate equivalents (GE), where a GE in our case is equal to the lowest area occupied by a 2-input NAND gate of 1x drive in 180nm technology.
The total power consumption of a CMOS device is given by:
$$P_{total} = P_{static} + P_{dynamic},$$
where $P_{static}$ and $P_{dynamic}$ denote the static and dynamic power consumption of the device.
In this paper, we concentrate on the dynamic power consumption that originates from the switching activity of the circuit:
$$P_{dynamic} = \alpha \cdot C \cdot V^2 \cdot f,$$
where $\alpha$ is the switching factor (the probability of a bit switching from 0 to 1), $C$ is the switched capacitance, $V$ is the voltage, and $f$ is the clock frequency. In our approach, we aim to use a simple structure of CA-based elements, which reduces the area and consequently the capacitance (since capacitance depends on the area). As the capacitance reduces, $P_{dynamic}$ also reduces since the other factors do not increase.
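The proportionality between switched capacitance and dynamic power can be made concrete with a few lines of arithmetic. All numeric values in this sketch are hypothetical placeholders chosen for illustration; they are not measurements from this paper.

```python
def dynamic_power(alpha, capacitance, voltage, frequency):
    """P_dynamic = alpha * C * V^2 * f, in watts for SI inputs."""
    return alpha * capacitance * voltage ** 2 * frequency

# Hypothetical operating point (illustrative only, not taken from the paper).
ALPHA = 0.1          # switching factor
CAPACITANCE = 1e-12  # farads
VOLTAGE = 1.8        # volts
FREQUENCY = 10e6     # hertz

baseline = dynamic_power(ALPHA, CAPACITANCE, VOLTAGE, FREQUENCY)
smaller = dynamic_power(ALPHA, CAPACITANCE / 2, VOLTAGE, FREQUENCY)

if __name__ == "__main__":
    print(smaller / baseline)
```

With the other factors held fixed, halving the switched capacitance halves the dynamic power.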

Lightweight S-Boxes from Cellular Automata Rules
In this section, we illustrate our cellular automata (CA)-based design strategies for obtaining 4 × 4 S-Boxes that are area and power-efficient, and also amenable to low-cost TI. The idea is to choose a local CA rule, which is essentially a 4 × 1 Boolean function, such that it has a low-cost equivalent implementation in hardware. The 4 × 4 S-Box mapping is obtained by applying the same CA rule to four different (cyclic) permutations of the input bits. This allows for an iterative implementation in hardware, with the CA rule implemented once in the data-path, and the control unit applying a cyclically shifted variant of the input bits in each clock cycle to obtain the corresponding output bit. We first describe the De Bruijn graph-based technique to choose the local CA rule, and subsequently enumerate certain cryptographically optimal S-Boxes obtained with this procedure. We also classify these S-Boxes in terms of their amenability to low-area and low-power TI, and present optimized TI designs for representatives from each class.

Choosing the CA Rule
Given a 4 × 1 CA rule $f$, the corresponding 4 × 4 S-Box is given by:
$$F(x_1, x_2, x_3, x_4) = (f(x_1, x_2, x_3, x_4), f(x_2, x_3, x_4, x_1), f(x_3, x_4, x_1, x_2), f(x_4, x_1, x_2, x_3)).$$
We focus on choosing CA rules that ensure that the corresponding S-Box is bijective. The test for injectivity of the global map of a one-dimensional CA was shown to be decidable in [AP72], while the test for surjectivity for the same was shown to have a quadratic-time algorithm in [Sut91], using De Bruijn graphs. These graphs provide a convenient way to describe configurations of linear CAs. We follow these principles to identify local 4 × 1 CA rules, which in turn guarantee that the resultant 4 × 4 S-Box is bijective. The detailed technique for choosing such a CA rule is as follows.
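The construction of a 4 × 4 S-Box from a single 4 × 1 rule can be sketched as follows. The encoding is an assumption made for illustration: the rule is stored as a 16-bit truth table, inputs are 4-bit integers with the first variable as the most significant bit, and the example rules are not the representative rules of this paper.

```python
def rotl4(x, i):
    """Cyclically rotate a 4-bit value left by i positions (0 <= i <= 3)."""
    return ((x << i) | (x >> (4 - i))) & 0xF

def ca_sbox(truth_table):
    """Build the S-Box of a rule f given as a 16-bit truth table.

    Bit v of truth_table is f evaluated on the 4-bit input v. Output bit i
    (counted from the most significant bit) is f applied to the input rotated
    left by i, i.e. f on the i-th cyclic shift of the input variables.
    """
    sbox = []
    for x in range(16):
        y = 0
        for i in range(4):
            y = (y << 1) | ((truth_table >> rotl4(x, i)) & 1)
        sbox.append(y)
    return sbox

# Illustrative rules only: the projection onto the first variable, and the constant 0.
PROJECTION_RULE = 0xFF00  # f(x1, x2, x3, x4) = x1
CONSTANT_RULE = 0x0000    # f = 0

if __name__ == "__main__":
    print(ca_sbox(PROJECTION_RULE))
    print(ca_sbox(CONSTANT_RULE))
```

The projection rule yields the identity permutation, which is bijective but far from optimal, while the constant rule collapses every input to zero.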

De Bruijn Graph Representation
For any CA with an $n$-variable local rule $f : \mathbb{F}_2^n \to \mathbb{F}_2$, the associated De Bruijn graph is a directed graph $G = (V, E)$, where every vertex $v \in V$ is labeled with an $(n - 1)$-bit string. There exists an edge $e$ from vertex $v_1$ to vertex $v_2$ if the first $(n - 2)$ bits of the label of $v_2$ are the same as the last $(n - 2)$ bits of the label of $v_1$. For example, the De Bruijn graph with $n = 4$ has an edge from $v_1 = 010$ to $v_2 = 100$, as the first two bits of $v_2$ are 10, which is the same as the last two bits of $v_1$. Quite evidently, $|V| = 2^{n-1}$ and $|E| = 2 \cdot 2^{n-1} = 2^n$ (observe that each vertex has exactly two incoming and two outgoing edges).
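The vertex and edge counts stated above are easy to confirm by building the graph explicitly. This is a small sketch with string-labeled vertices; the function name is our own.

```python
from itertools import product

def de_bruijn_graph(n):
    """De Bruijn graph for an n-variable rule: vertices are (n-1)-bit strings."""
    vertices = ["".join(bits) for bits in product("01", repeat=n - 1)]
    # Edge u -> v exists when the last n-2 bits of u equal the first n-2 bits of v.
    edges = [(u, v) for u in vertices for v in vertices if u[1:] == v[:-1]]
    return vertices, edges

if __name__ == "__main__":
    vertices, edges = de_bruijn_graph(4)
    print(len(vertices), len(edges))
    print(("010", "100") in edges)
```

Each edge $(u, v)$ corresponds to the $n$-bit string formed by $u$ followed by the last bit of $v$, which is exactly one input of the local rule.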

Generating Optimal 4 × 4 S-Boxes from De Bruijn Graphs
Given a De Bruijn graph $G = (V, E)$ with $|V| = 2^{n-1}$, a CA local rule may be derived by associating each edge of this graph with a bit $b \in \{0, 1\}$. Since there are $2^n$ edges, the total number of possible CA rules that can be associated with this graph is $2^{2^n}$. In particular, for $n = 4$, the total number of such CA rules is $2^{2^4} = 2^{16}$. Each such rule gives rise to a unique 4 × 4 function. An exhaustive search of these functions yields 1 536 bijective functions, which are our candidate S-Boxes. Finally, we test these functions for cryptographic optimality in terms of their nonlinearity and differential uniformity, which narrows down our search space to 512 candidate S-Boxes, which may be further sub-classified into four affine-equivalent classes, namely $G_3$, $G_4$, $G_5$, and $G_6$. Details of these S-Boxes have been reported previously in [MPLJ17].
We would like to point out that the number of possible CA-based rules is $2^{16}$ (as $n = 4$) in our case. Hence, instead of the De Bruijn graph representation, we could have also used a simple brute-force approach. However, for higher values of $n$, where brute-force search may not be feasible, a systematic approach with De Bruijn graphs is a good choice.
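For $n = 4$, the brute-force alternative mentioned above fits in a few lines. The sketch below is our own illustration, not the search code behind the reported counts; it enumerates all $2^{16}$ truth tables under the assumed encoding (bit $v$ of the table is $f$ on the 4-bit input $v$) and keeps the rules whose S-Box is a permutation.

```python
def rotl4(x, i):
    """Cyclically rotate a 4-bit value left by i positions (0 <= i <= 3)."""
    return ((x << i) | (x >> (4 - i))) & 0xF

# Precompute the four cyclic shifts of every 4-bit input.
ROTATIONS = [[rotl4(x, i) for i in range(4)] for x in range(16)]

def sbox_of(truth_table):
    """S-Box whose i-th output bit is the rule applied to the i-th cyclic shift."""
    return [sum(((truth_table >> ROTATIONS[x][i]) & 1) << (3 - i) for i in range(4))
            for x in range(16)]

def bijective_rules():
    """Return all 16-bit truth tables whose CA-based S-Box is a permutation."""
    found = []
    for tt in range(1 << 16):
        # A bijective S-Box has balanced coordinates, so the rule itself must be balanced.
        if bin(tt).count("1") != 8:
            continue
        if len(set(sbox_of(tt))) == 16:
            found.append(tt)
    return found

if __name__ == "__main__":
    rules = bijective_rules()
    print(len(rules))  # to be compared against the count of bijective functions in the text
```

Filtering the resulting list with a nonlinearity and differential uniformity check would give the optimal candidates; that step is left out here to keep the sketch short.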

Classification of Cryptographically Optimal CA-based 4 × 4 S-Boxes
Our next step is to classify the 512 cryptographically optimal CA-based 4 × 4 S-Boxes into certain classes, such that each category comprises S-Boxes that are expected to have similar area and power overhead in hardware, as well as similar TI circuit representations. As it turns out, each of these quantities is closely related to the nature of the algebraic normal form (ANF) representation of the S-Boxes. Given that each S-Box under consideration has optimal algebraic degree 3, we use the following facts from [BGN+14]:
• CA-based S-Boxes with the same number of cubic, quadratic, and linear terms in their ANF form have similar area footprint and expected power consumption in hardware.
• CA-based S-Boxes with the same number of cubic, quadratic, and linear terms in their ANF form have nearly identical TI circuits owing to their nearly identical algebraic structure.
Based on this rationale, we classify the S-Boxes depending on the number of linear, quadratic, and cubic terms present in the ANF of the S-Box. According to this classification, we have obtained 12 S-Box classes as shown in Table 1. We also list the CA rules corresponding to representative optimal S-Boxes for each class. Note that class (a, b, c) comprises optimal S-Boxes with a cubic terms, b quadratic terms, and c linear terms, respectively. We also summarize the cryptographic properties of these representative S-Boxes in Table 2, and compare them with the cryptographic properties of popular 4 × 4 S-Boxes that include the S-Boxes of PRESENT, GIFT, Skinny, Piccolo, Noekeon, Midori and Prince.
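The class label $(a, b, c)$ can be computed from the ANF of a rule with a Möbius transform. In the sketch below we assume that the class is read off the ANF of the 4 × 1 local rule; the example rule is hypothetical and is not one of the representative rules of Table 1.

```python
def anf_coefficients(truth_table):
    """Moebius transform: truth table (list of 16 bits) to ANF coefficients."""
    coeffs = list(truth_table)
    for i in range(4):
        for u in range(16):
            if u & (1 << i):
                coeffs[u] ^= coeffs[u ^ (1 << i)]
    return coeffs

def anf_class(rule):
    """Return (cubic, quadratic, linear) term counts of a 4-variable rule."""
    table = [rule((v >> 3) & 1, (v >> 2) & 1, (v >> 1) & 1, v & 1) for v in range(16)]
    counts = {1: 0, 2: 0, 3: 0}
    for u, coefficient in enumerate(anf_coefficients(table)):
        degree = bin(u).count("1")  # number of variables in monomial u
        if coefficient and degree in counts:
            counts[degree] += 1
    return (counts[3], counts[2], counts[1])

def example_rule(x, y, z, w):
    """Hypothetical rule: xyz + xy + zw + y + w (one cubic, two quadratic, two linear)."""
    return (x & y & z) ^ (x & y) ^ (z & w) ^ y ^ w

if __name__ == "__main__":
    print(anf_class(example_rule))
```

Constant and degree-4 terms are ignored by this count, in line with the three-component class labels used in the text.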

Threshold Implementations of CA-based S-Boxes
We now describe direct sharing-based TI circuits for the aforementioned classes of CA-based S-boxes, and compare their relative area overheads and power consumption results.

TI of CA-based S-Boxes with Examples
Since each of the representative S-Boxes listed above has algebraic degree equal to 3, we adopt the direct 4-to-4 non-complete sharing method for cubic functions originally proposed in [Bil15] to obtain the corresponding TI circuits for each of the corresponding CA rules. We explicitly depict two of the most area-efficient and low-power TI circuits below. These correspond to the representative CA rules for the S-Box classes (1, 2, 2) and (1, 3, 1), respectively. Note that $\{X_j, Y_j, Z_j, W_j\}_{j \in [1,4]}$ denote the shares for the input bits $X$, $Y$, $Z$ and $W$, respectively, while $\{f_j\}_{j \in [1,4]}$ denotes the shares for the output $f$ of the CA rule.
Class: (1, 2, 2), CA-Rule:
Class: (1, 3, 1), CA-Rule:
Figure 3.1 illustrates the hardware architecture for the direct-sharing based TI circuit corresponding to a given CA rule. The main components of the architecture are the shift registers (cyclic) for the shares corresponding to the input variables, the core block implementing the TI circuit for the CA rule, and the demultiplexer gates that are used to output one bit per clock cycle. Note that the counter bits are dependent only on the clock signal; in particular, they are independent of the other intermediate share values, and hence need not themselves be shared. A comparison of the area and power consumption for the direct sharing-based TI circuits for all representative S-Boxes is given in Table 3.
The following trend is evident from the hardware implementation results:
Observation 1. If (i) $a_1 < a_2$, or (ii) $a_1 = a_2$ and $b_1 + c_1 < b_2 + c_2$, then the TI of an S-Box belonging to class $(a_1, b_1, c_1)$ has lower area and power consumption than that of an S-Box of class $(a_2, b_2, c_2)$. On the other hand, in the case where $a_1 = a_2$ and $b_1 + c_1 = b_2 + c_2$, there is no such obvious trend. This could be attributed to certain optimizations made by the design compiler during synthesis.

Comparison with Direct-Shared TI for Other Popular S-Boxes
Now, we provide a comparative study of our S-Boxes with a class of lightweight S-Boxes that includes PRESENT, GIFT, Skinny, Piccolo, Noekeon, Midori, and Prince. Note that the first six CA-based S-Box representatives (for classes (1, 2, 2) through (1, 5, 3)) in Table 3 have TI circuits with lower area footprint as compared to all the other S-Boxes. Additionally, the power consumption for nearly all CA-based TI circuits is significantly lower. Note that in the direct-shared TI, each input and output variable is four-shared, which leads to a significant area overhead. It is possible to minimize the area overheads of these circuits even further by reducing the number of shares in each case. This is achieved by a technique referred to as composite TI, which we describe in the next section.

Composite TI: Optimizing TI Circuits for Low Area and Power
In this section, we present composite TI, a generic technique that allows for highly optimized TI designs of CA rules in comparison to direct sharing techniques. A similar technique has been used in [PMK+11] to obtain a highly optimized TI for the PRESENT S-Box. The idea is to express each 4 × 1 CA rule of algebraic degree 3 as a composition of Boolean sub-functions of degree 2 each. We then proceed by identifying uniform and non-complete sharings for these degree-2 sub-functions, and subsequently cascading them. In order to maintain non-completeness, the cascading must ensure that the TI circuits for the two sub-functions are separated by registers. This can be illustrated using the following instance. Suppose that a CA rule $f(X)$ can be expressed as a composition of two sub-rules $g(A)$ and $h(X)$, where $A$ denotes the intermediate output of $h(X)$. Now, consider a uniform first-order 3-sharing of $h$, denoted as $A_1 = h_1(X_1, X_2)$ and $A_2 = h_2(X_2, X_3)$, that are fed subsequently to the sharing of $g$. Here $h(X) = h_1(X_1, X_2) \oplus h_2(X_2, X_3)$. Note that the share function $g_1(A_1, A_2)$ can also be written as $g_1(X_1, X_2, X_3)$, in which case a glitch in this function produces a leakage dependent on all the shares of $X$. This is avoided by partitioning the nonlinear operations with a register that disallows the propagation of a glitch affecting all the shares of an unmasked value. We illustrate the decomposition strategy for the representative S-Boxes of the classes (1, 2, 2) and (1, 3, 1), which are the most area and power-efficient among all the S-Box classes (see Table 3).
We begin by illustrating a decomposition of the representative CA rule for the S-Box class (1, 2, 2).
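The decomposition step can be sanity-checked in software before any sharing is designed. The sketch below uses a hypothetical cubic rule and a hypothetical two-stage split (it is not the representative rule or the decomposition of this paper); it verifies that the stages compose to the original function and that each stage has algebraic degree at most 2, so that each can be three-shared on its own side of the register.

```python
from itertools import product

def degree(func, k):
    """Algebraic degree of a k-variable Boolean function via the Moebius transform."""
    table = [func(*bits) for bits in product((0, 1), repeat=k)]
    for i in range(k):
        for u in range(1 << k):
            if u & (1 << i):
                table[u] ^= table[u ^ (1 << i)]
    return max((bin(u).count("1") for u, c in enumerate(table) if c), default=0)

def f(x, y, z, w):
    """Hypothetical cubic rule: xyz + xy + zw + y + w."""
    return (x & y & z) ^ (x & y) ^ (z & w) ^ y ^ w

def h(x, y):
    """First stage, computed before the register: b = xy."""
    return x & y

def g(b, y, z, w):
    """Second stage, computed after the register: bz + b + zw + y + w."""
    return (b & z) ^ b ^ (z & w) ^ y ^ w

def composed(x, y, z, w):
    """Cascade of the two quadratic stages."""
    return g(h(x, y), y, z, w)

if __name__ == "__main__":
    same = all(f(*bits) == composed(*bits) for bits in product((0, 1), repeat=4))
    print(same, degree(f, 4), degree(h, 2), degree(g, 4))
```

In hardware, the intermediate value b would be stored in a register in shared form between the two stages, which is what prevents a glitch from combining all shares of an input.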

Decomposition for CA-based S-Box Class (1, 2, 2)
The next step is to obtain a uniform three-sharing for the decomposed functions $b_1$, $b_2$, and $b_3$. We first present a nomenclature of the shares for the various input variables and decomposed functions.
The three-shared TI circuit is now illustrated below:

Decomposition for CA-based S-Box Class (1, 3, 1)
We now illustrate a decomposition of the representative CA rule for the S-Box class (1, 3, 1). Once again, while the original rule $f$ has algebraic degree 3, each of the decomposed functions $b_1$, $b_2$, and $b_3$ has degree 2.
We now present a uniform three-sharing for the decomposed functions $b_1$, $b_2$, and $b_3$. The nomenclature of the shares for the various input variables and decomposed functions is the same as described above.

Hardware Results for Composite TI of CA-based S-Boxes
In this section, we compare the area and power requirements of the composite TI circuits described above. We also compare these results with composite TI for all the other lightweight S-Boxes mentioned in the previous section. The architecture for the composite TI circuit is illustrated in the corresponding figure. Table 4 reveals that the smallest composite TI circuit among the CA-based S-Boxes has the smallest area footprint and the lowest power consumption among all the S-Boxes compared. In fact, our CA-based S-Box has a 35.36% smaller area footprint and consumes 44.46% less power as compared to the highly optimized composite TI of the GIFT S-Box, which is the best among all the existing lightweight S-Boxes.

Side-channel Leakage Resistance Evaluation using TVLA
We conclude this section by presenting a side-channel evaluation of the best TI circuit among all CA-based S-Boxes, corresponding to the representative CA rule for the class (1, 3, 1). The evaluation was performed by implementing the TI circuit on a Virtex-5 FPGA on a SASEBO-GII board. The programming file for our design was generated using Xilinx ISE 14.7; the "Keep Hierarchy" constraint was kept on while generating the programming file in order to prevent optimizations over module boundaries. We

Area and Power Efficient Threshold Implementations for SPN Block Ciphers
In this section, we provide a brief discussion on lightweight TI designs for the other major component of an SPN block cipher, namely, the linear diffusion layer. We then discuss how our CA-based S-Boxes may be combined with such diffusion layers to achieve lightweight TI circuits for full block ciphers.

[...] (used in Midori) are essentially bit permutations, and hence no additional overhead is required during TI design of these operations (see Table 5).

MixColumns using Almost-MDS. Another lightweight choice for obtaining diffusion is the MixColumns operation using almost-MDS matrices. The following is the most lightweight 4 × 4 almost-MDS matrix:

\[
\begin{pmatrix}
0 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 \\
1 & 1 & 1 & 0
\end{pmatrix}
\]

This matrix is used in the block cipher Midori. We implemented a TI circuit for multiplying a 4 × 1 state vector with the aforementioned matrix. Note that a straightforward TI circuit must protect 8 XOR gates (two per row of the matrix). In our implementation, we reduce the overhead to 7 XOR gates as follows: we first compute the XOR of all input vector elements (this requires 3 XORs), and then XOR this sum with one input element per row (4 further XORs) to obtain the desired output. The area and power requirements for the same are reported in Table 5.
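The 7-XOR trick can be checked with a short sketch. The code below compares a row-by-row product over GF(2) with the shared-sum variant, and applies the map share by share, which is valid because the operation is linear. The function names are hypothetical, and state elements are modelled as small integers combined with bitwise XOR.

```python
from functools import reduce
from operator import xor

# Rows of the almost-MDS matrix given in the text.
M = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]

def mix_naive(x):
    """Row-by-row product: three terms per row, i.e. 2 XORs per row, 8 in total."""
    return [reduce(xor, (x[j] for j in range(4) if M[i][j])) for i in range(4)]

def mix_shared_sum(x):
    """7-XOR variant: 3 XORs for the total, then 1 XOR per output element."""
    total = x[0] ^ x[1] ^ x[2] ^ x[3]
    return [total ^ xi for xi in x]

def mix_three_shares(shares):
    """Linear layer under three-share masking: process each share on its own."""
    return [mix_shared_sum(s) for s in shares]

def recombine(shares):
    """XOR the three shares element-wise to recover the unshared vector."""
    return [a ^ b ^ c for a, b, c in zip(*shares)]
```

Because XORing the total with an element cancels that element, each output equals the XOR of the other three inputs, which is exactly the corresponding matrix row.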

Combining it All Together
In this section, we propose two design paradigms for combining the CA-based optimal S-Boxes with the aforementioned diffusion layer choices to achieve SPN block ciphers with low-area and low-power TI circuits. The first of these paradigms focuses only on optimizing the area and power of the TI circuit, without regard for throughput. In the second design paradigm, we also incorporate throughput as an additional performance criterion for the TI circuit.

Design Paradigm-1: Focus on Area and Power Only
In this design paradigm, we adopt the SPN block cipher structure of GIFT (which is conceptually identical to that of PRESENT), in the sense that a layer of n 4 × 4 S-Boxes (typically, n = 16) is followed by a bit permutation layer. The S-Box is chosen to be one of the two CA-based S-Boxes (corresponding to classes (1, 2, 2) and (1, 3, 1)), or is the original GIFT/PRESENT S-Box. Note that these CA-based S-Boxes (i) have branch number equal to 2 and (ii) do not possess the BOGI (Bad Output Good Input) property defined in [BPP+17]. This observation tells us that, to withstand linear and differential cryptanalysis, the number of rounds required for an SPN block cipher using our CA-based S-Box with bit permutations would be considerably higher than for an equivalent cipher using the GIFT/PRESENT S-Box with bit permutations. More specifically, to achieve linear and differential probability of at most 2^{-80} (assuming an 80-bit key size as used in PRESENT), we would require 40 rounds. This is because in 40 rounds there are at least 40 active S-Boxes (at least one per round), and the maximum differential probability of the S-Box is 2^{-2}, which gives a bound of (2^{-2})^{40} = 2^{-80}. We note that the aforementioned derivation of the number of rounds is an estimate based solely on the resistance of the cipher against linear and differential analysis. In order to achieve security against other advanced cryptanalytic techniques, additional rounds may be necessary. However, such additions would primarily affect the throughput of the design rather than the area or power consumption. This does not violate the principles of the first design paradigm, which primarily targets efficiency in terms of area and power, without many restrictions on throughput. In other words, our CA-based S-Boxes act as viable alternatives to the GIFT/PRESENT S-Boxes in applications where area and power consumption are the primary targets for optimization.
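The round estimate above is a simple counting argument, sketched below under the stated assumptions: at least one active S-Box per round and a maximum differential probability of 2^{-2} per S-Box. The helper name and its parameters are hypothetical, and the sketch covers only this bound, not resistance against other attacks.

```python
import math

def min_rounds(target_log2, sbox_mdp_log2, active_per_round=1):
    """Rounds needed so that the characteristic bound reaches 2^target_log2.

    target_log2      -- log2 of the target probability bound (e.g. -80)
    sbox_mdp_log2    -- log2 of the S-Box maximum differential probability
    active_per_round -- guaranteed minimum number of active S-Boxes per round
    """
    needed_active = math.ceil(target_log2 / sbox_mdp_log2)
    return math.ceil(needed_active / active_per_round)

# Parameters taken from the text: 2^{-80} target, 2^{-2} per active S-Box.
rounds = min_rounds(-80, -2)
```

A diffusion layer that guarantees more active S-Boxes per round lowers the count proportionally, which is the motivation for the second design paradigm.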
From the implementation results summarized in Table 6, one can observe that the area requirement and power consumption are lowest in this design paradigm for the SPN block cipher that combines the CA-based S-Box representing class (1, 3, 1) with a bit permutation.

Design Paradigm-2: Focus on Area and Power with Reasonable Throughput
In this design paradigm, we adopt an SPN block cipher structure with the following design choices:
• We use standard bit permutations in conjunction with the S-Boxes of PRESENT and GIFT.
• We use a standard bit permutation followed by a MixColumns operation using an almost-MDS matrix in conjunction with our CA-based S-Boxes, and with the S-Boxes of Midori and Skinny.
Note that the use of a MixColumns operation with an almost-MDS matrix achieves significant diffusion in each round, ensuring a significant reduction in the number of rounds (and hence, an improved throughput) as compared to the previous design paradigm. If we use the same bit permutation and almost-MDS matrix as used in Midori, 16 rounds would be sufficient to achieve the desired security. This follows from the analysis of Midori itself, which has 16 rounds, uses an S-Box with the same branch number (= 2) and the same linear and differential characteristics as our (1, 3, 1) S-Box, and uses the same almost-MDS matrix. Hence, in this case, it is natural to expect that 16 rounds would provide the same cryptanalytic resistance as Midori. From Table 7, we observe that block ciphers with the CA-based S-Box representing class (1, 3, 1) and a bit permutation followed by almost-MDS MixColumns retain a reasonable throughput of 43.85 MBps, which is comparable with the throughputs of PRESENT and GIFT (61.41 MBps and 71.42 MBps respectively). Moreover, even though the CA-based S-Box is used in conjunction with the almost-MDS matrix, the area and power savings from the choice of S-Box make up for the additional overhead due to the MixColumns layer. In fact, the overall area requirement for the CA-based S-Box with almost-MDS MixColumns as the diffusion layer is 2 466.54 GE, which is the lowest among all the constructions considered here.

Scope for Non-optimal CA-Based S-Boxes: An Exploration
In the aforementioned analysis, we have primarily focused on CA-based S-Boxes that have optimal cryptographic properties with respect to their nonlinearity and differential uniformity. Cryptographic optimality is typically essential for good diffusion: intuitively, using an optimal S-Box in a block cipher construction (as opposed to a non-optimal one) reduces the overall number of rounds required to achieve the desired linear and differential probabilities. This often outweighs the potential area savings afforded by non-optimal S-Box variants. An exception to this intuitive rule is the GIFT S-Box [BPP+17], which is non-optimal yet allows high throughput, while also being significantly more lightweight than the PRESENT S-Box. The reason for this is the existence of a unique BOGI permutation that compensates for the non-optimality of the GIFT S-Box itself. To explore similar possibilities with respect to CA-based S-Boxes, we examined each of the 1 024 possible non-optimal bijective 4 × 4 CA-based S-Boxes. Our exploration led to the following observations:
• Out of the 1 024 non-optimal bijective CA-based S-Boxes, 112 S-Boxes have an area overhead comparable to that of the most lightweight candidate among their optimal counterparts.
• Each of the 1 024 non-optimal bijective CA-based S-Boxes lacks strong cryptographic properties. To be more specific, these S-Boxes either have nonlinearity 0 or 2 (which is highly undesirable) or have linear and differential characteristics greater than or equal to 2^{-1.414}.
• Finally, and most crucially, none of the 1 024 non-optimal CA-based S-Boxes exhibit the BOGI property of the GIFT S-Box.
From the aforementioned observations, we conclude that with respect to CA-based S-Boxes, optimality is an essential criterion with respect to both cryptanalytic resistance and throughput. In other words, non-optimal CA-based S-Boxes seem to offer no benefits over their optimal counterparts.
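The optimality criterion used in this exploration can be tested mechanically. The sketch below computes the differential uniformity and nonlinearity of a 4 × 4 S-Box given as a lookup table, and labels it optimal when it is bijective with differential uniformity 4 and nonlinearity 4 (the usual definition for 4-bit S-Boxes). The function names are hypothetical; the well-known PRESENT S-Box table is used as an example input since the CA-based tables are not listed in this section.

```python
def differential_uniformity(sbox):
    """Largest DDT entry over all nonzero input differences."""
    n = len(sbox)
    best = 0
    for dx in range(1, n):
        counts = [0] * n
        for x in range(n):
            counts[sbox[x] ^ sbox[x ^ dx]] += 1
        best = max(best, max(counts))
    return best

def parity(v):
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(v).count("1") & 1

def nonlinearity(sbox):
    """Minimum distance of nonzero component functions to affine functions."""
    n = len(sbox)
    max_abs = 0
    for b in range(1, n):      # nonzero output mask
        for a in range(n):     # any input mask
            w = sum(1 - 2 * (parity(a & x) ^ parity(b & sbox[x]))
                    for x in range(n))
            max_abs = max(max_abs, abs(w))
    return n // 2 - max_abs // 2

def is_optimal_4bit(sbox):
    """Bijective, differential uniformity 4, nonlinearity 4."""
    return (sorted(sbox) == list(range(16))
            and differential_uniformity(sbox) == 4
            and nonlinearity(sbox) == 4)

# PRESENT S-Box lookup table (example input only).
PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
```

An exhaustive search such as the one described above would apply a check of this kind to every candidate table and keep those that fail it but remain bijective.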

Conclusions and Discussions
In this paper, we present highly optimized TI circuits for cryptographically optimal 4 × 4 S-Boxes obtained from CA rules. We classify such CA-based S-Boxes into 12 categories based on their amenability to low-area and low-power TI, and present direct sharings for representative S-Boxes from each class. The architecture for our implementation direct-shares the local CA rule, and iterates over it to obtain SCA-resistant S-Box implementations. Subsequently, we reduce the number of shares further via functional decomposition of CA rules, to obtain composite TI circuits with even lower area footprint and power consumption. Our implementation results on ASIC (180nm technology) show that the most lightweight TI circuit among all CA-based S-Boxes has a 49.42% smaller area footprint and consumes 52.3% less power than the best-known TI of the PRESENT S-Box. The same TI circuit also has a 35.36% smaller area footprint and consumes 44.46% less power than a highly optimized TI of the GIFT S-Box. Finally, this TI circuit also passes the TVLA test over 1 000 000 power traces. Subsequently, we present TI circuits for bit permutations and for MixColumns using almost-MDS matrices, with hardware results naturally favoring the former for lightweight applications. We finally present design paradigms for SPN block ciphers that combine TI circuits for our CA-based S-Boxes with TI circuits for bit permutations (and optionally, for MixColumns operations) for full-fledged side-channel resistance. In particular, the use of a TI-protected MixColumns operation offers a practical trade-off between area and power savings on the one hand and reasonable throughput requirements on the other.
An apparent disadvantage inherent to any CA-based S-Box design strategy is the reduction in throughput due to its iterative nature. One possible workaround is to operate the target device at higher clock frequencies, keeping in mind that local CA rules are usually simple combinational circuits, and hence afford designs with higher critical frequencies. Additionally, with respect to TI circuits, iterative architectures seem to minimize the possibility of additional leakages resulting from correlations among the output bits, since these bits are processed in different clock cycles. A more thorough exploration of the pros and cons of such iterative S-Box design principles is an interesting direction for future work. Extending our design principles to TI circuits for 5 × 5 and 8 × 8 S-Boxes is another intriguing direction for future research.