Constructing Low-latency Involutory MDS Matrices with Lightweight Circuits

MDS matrices are important building blocks providing diffusion functionality for the design of many symmetric-key primitives. In recent years, continuous efforts have been made to construct MDS matrices with small area footprints in the context of lightweight cryptography. Recently, Duval and Leurent (ToSC 2018/FSE 2019) reported some 32 × 32 binary MDS matrices with branch number 5 that can be implemented with only 67 XOR gates, whereas the previously known lightest ones of the same size cost 72 XOR gates. In this article, we focus on the construction of lightweight involutory MDS matrices, which are even more desirable than ordinary MDS matrices, since the same circuit can be reused when the inverse is required. In particular, we identify some involutory MDS matrices which can be realized with only 78 XOR gates with depth 4, whereas the previously known lightest involutory MDS matrices cost 84 XOR gates with the same depth. Notably, the involutory MDS matrix we find is much smaller than the AES MixColumns operation, which requires 97 XOR gates with depth 8 when implemented as a block of combinatorial logic that can be computed in one clock cycle. However, with respect to latency, the AES MixColumns operation is superior to our 78-XOR involutory matrices, since the AES MixColumns can be implemented with depth 3 by using more XOR gates. We prove that the depth of a 32 × 32 MDS matrix with branch number 5 (e.g., the AES MixColumns operation) is at least 3. Then, we enhance Boyar's SLP-heuristic algorithm with circuit depth awareness, such that the depth of its output circuit is limited. Along the way, we give a formula for computing the minimum achievable depth of a circuit implementing the summation of a set of signals with given depths, which is of independent interest.
We apply the new SLP heuristic to a large set of lightweight involutory MDS matrices, and we identify a depth 3 involutory MDS matrix whose implementation costs 88 XOR gates, which is superior to the AES MixColumns operation with respect to both lightweightness and latency, and enjoys the extra involution property.


Introduction
The development of pervasive computing and the demand for low-cost security have stimulated intensive research on the design of lightweight symmetric-key cryptographic algorithms. This often boils down to the search for lightweight yet cryptographically strong diffusion and confusion components.
In practice, the diffusion components are typically realized with linear operations, whose functionality, loosely speaking, is to spread the internal dependencies as much as possible. The so-called Maximum Distance Separable (MDS) matrices are probably the most preferable diffusion building blocks. When MDS matrices are used as the diffusion layers in iterative block ciphers, it is possible to achieve a desired number of differentially or linearly active non-linear elements with a relatively small number of rounds, which leads to low-latency designs. Moreover, designs with MDS matrices typically enjoy simple and clear security proofs, as is the case for AES [DR02]. Actually, it is exactly the elegant security proof offered by AES that initiated the wide application of MDS matrices in the design of symmetric-key primitives.
However, it is not an easy task to find lightweight MDS matrices, and an MDS matrix may be too costly for a design targeting resource-constrained devices. In such situations, the designers compromise by employing almost MDS matrices [BBI+15, Ava17], or linear operations that can be realized with several bitwise XORs [BJK+16], or even bit-level permutations which can be implemented with proper wiring [BKL+07]. Such a design strategy more often than not leads to a significant increase in the number of rounds, and complicates the security proof remarkably. Therefore, it is an important endeavor to construct lightweight MDS matrices. In particular, lightweight involutory MDS matrices are preferable, since the same circuit can be reused when the inverse is required. Actually, the idea of reusing involutory components in both encryption and decryption has already been applied in some designs [BR00, SPR+04, BCG+12].

Related work
If the chip area is the sole consideration, one promising approach proposed by Guo, Peyrin, and Poschmann to reduce the implementation footprint is to find a lightweight matrix $A$ such that $A^k$ is MDS [GPP11, GPPR11]. The implementation of $A^k$ can be obtained by recursively "executing" the implementation of $A$ $k$ times. Then no matter how complex $A^k$ is, the cost is determined by $A$ completely. However, this approach comes at the expense of an increased number of clock cycles, which is not desirable in low-latency applications. Therefore, in this work, we focus on lightweight constructions where the full MDS matrix is implemented as a block of combinatorial logic that can be computed in one clock cycle. We refer the reader to [GPP11, TTKS18, AF14, Ber13, GPV17, WWW12, CLM16] for more information on the recursive constructions.
The initial attempts to find lightweight MDS matrices where the full matrix is implemented mainly focus on the selection of matrix entries enjoying low hardware footprints [SKOP15, BKL16, LS16, LW16, LW17, SS16a, SS16b, SS17, JPST17, ZWS18, GLWL16]. This line of work is a great step forward in our ability to construct lightweight MDS matrices and can be categorized as local optimization. In particular, with the knowledge of which kinds of entries are better, one can construct MDS matrices from some special classes of matrices, such as circulant, Hadamard, or Toeplitz matrices [SKOP15, LS16, SS16b]. Some of these constructions lead to involutory MDS matrices. In particular, Sim et al. observed that involutory MDS matrices can be implemented with almost the same cost as non-involutory ones under some specific metric, the latter being usually non-lightweight when the inverse matrix is required [SKOP15]. Note that here the entries of a matrix are not restricted to finite field elements, and can be general linear transformations. Actually, the idea of using general linear transformations led to notable improvements at the time [BKL16, LW16].
So far, we have a fairly deep understanding of the problem with respect to local optimizations. Hence recent work tends to deal with the problem at a more essential level, viewing it as the well-known shortest linear straight-line program (SLP) problem and optimizing globally. Indeed, this approach results in more accurate estimations of the cost of hardware implementations. In [KLSW17], Kranz et al. show that the AES MixColumns matrix can be implemented with only 97 two-input XOR gates ($\mathbb{F}_2 \times \mathbb{F}_2 \to \mathbb{F}_2$) with Boyar's tool [BMP13] based on an SLP heuristic, while the previous best implementation costs 103 XOR gates [JPST17]. Recently, in ToSC 2018/FSE 2019, Duval and Leurent reported some 32 × 32 binary MDS matrices which can be implemented with only 67 XOR gates by searching through a set of circuits ordered by hardware cost and optimizing globally [DL18], whereas the previously known lightest ones of the same size cost 72 XOR gates [KLSW17].

Our Contribution
First, we slightly generalize the structure of the involutory MDS matrix $M_{KLSW}$ (which costs 84 XOR gates) proposed by Kranz, Leander, Stoffelen, and Wiemer [KLSW17], and try to construct an involutory MDS matrix $G$ of the generalized form with fewer 1's than $M_{KLSW}$ in its binary form, based on some educated guesses. After applying the SLP heuristic [BMP13] to $G$, it turns out that $G$ can be implemented with only 80 XOR gates.
Then we further generalize the structure of $G$ to a family of 4 × 4 matrices whose entries are powers of a given 8 × 8 binary matrix $A$. We show that every involutory matrix in this family can be completely determined by 6 parameters taking integer values. We search through a restricted range of matrices generated by these 6 parameters, and identify some involutory MDS matrices which can be implemented with only 78 XOR gates, while the previous best result requires 84 XOR gates.
Finally, we prove that the depth of a 32 × 32 MDS matrix with branch number 5 (e.g., the AES MixColumns operation) is at least 3. Then we augment Boyar's SLP-heuristic algorithm [BMP13] with circuit depth awareness to limit the depths of its output circuits. Along the way, we give a formula for computing the minimum achievable depth of a circuit implementing the summation of a set of signals with given depths, which is of independent interest. By applying this tool, we search through a large set of lightweight involutory MDS matrices and identify one which can be implemented with 88 XOR gates and whose circuit depth reaches the lower bound 3. A summary of the best matrices we find is given in Table 1. We also synthesize the matrices from Table 1 with three different technology libraries (NanGate 45 nm, SMIC 65 nm and TSMC 28 nm). In all cases, our matrices exhibit a lower area footprint. For example, the 97-XOR AES MDS matrix takes 154.811996 $\mu\mathrm{m}^2$ when synthesized with the NanGate 45 nm technology (194 GE), while our 88-XOR matrix takes 140.447996 $\mu\mathrm{m}^2$ (176 GE). Hence, our 88-XOR matrix enjoys three advantages over the AES MDS matrix: it is involutory; its depth is 3 (the depth of the 97-XOR AES MDS matrix is 8); and its area footprint is lower. Moreover, we make all of our code and results (matrices in binary representations with their actual implementations) publicly available at https://github.com/siweisun/involutory_mds

Organization
In Sect. 2, we give some preliminaries on finite fields and MDS matrices. The metrics used in this work for measuring the circuit cost are given in Sect. 3. In Sect. 4, we show how to construct a lighter involutory matrix by generalizing a previously known involutory MDS matrix. In Sect. 5, we consider further generalizations and search through a large set of matrices to find lighter involutory MDS matrices. In Sect. 6, we prove a theorem on the lower bound of the circuit depth of a 32 × 32 MDS matrix with branch number 5, and enhance Boyar's SLP-heuristic algorithm to find lightweight involutory MDS matrices whose depths reach the lower bound. Section 7 concludes the paper.

Preliminaries
Let $R$ be an arbitrary ring, and $M_k(R)$ be the set of all $k \times k$ matrices whose entries are drawn from $R$. Thus, $M_k(\mathbb{F}_{2^n})$ denotes the set of all $k \times k$ matrices over the finite field of $2^n$ elements, and $M_k(GL(n, \mathbb{F}_2))$ is the set of all $k \times k$ matrices whose elements are taken from the general linear group $GL(n, \mathbb{F}_2)$ formed by all invertible $n \times n$ matrices over $\mathbb{F}_2$. A matrix $A \in M_k(GL(n, \mathbb{F}_2))$ can be represented as an $nk \times nk$ binary matrix, which we call the binary representation of $A$. We use $I_n$ and $O_n$ to denote the $n \times n$ identity matrix and zero matrix over $\mathbb{F}_2$ respectively. We omit the subscript $n$ whenever it is obvious from the context. Given a vector $x$ in $\mathbb{F}_2^{nk}$, we denote by $\omega_n(x)$ the number of non-zero $n$-bit chunks in $x$. When $n = 1$, we simply write $\omega_1(x)$ as $\omega(x)$, which is the well-known Hamming weight of $x$. The branch number $B_n(A)$ of $A \in M_{nk}(\mathbb{F}_2)$ is defined as $\min_{x \in \mathbb{F}_2^{nk} \setminus \{0\}} \{\omega_n(x) + \omega_n(Ax)\}$.
Definition 1. An invertible $nk \times nk$ binary matrix $A$ is MDS over $k$ $n$-bit words if and only if $B_n(A) = k + 1$. Furthermore, if an MDS matrix $A$ satisfies $A = A^{-1}$, then we call it an involutory MDS matrix.
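To make the definition of $B_n(A)$ concrete, the following minimal Python sketch evaluates it exhaustively on a toy matrix. The helper names and the 4 × 4 toy matrix (built from 2 × 2 blocks, so $n = 2$ and $k = 2$) are illustrative assumptions and do not come from the paper; an exhaustive search of this kind is only feasible for very small sizes.

```python
from itertools import product

def chunk_weight(bits, n):
    """omega_n: number of non-zero n-bit chunks in a tuple of bits."""
    return sum(any(bits[i:i + n]) for i in range(0, len(bits), n))

def mat_vec(matrix, x):
    """Matrix-vector product over F_2 (matrix rows are lists of 0/1)."""
    return tuple(sum(a & b for a, b in zip(row, x)) % 2 for row in matrix)

def branch_number(matrix, n):
    """B_n(A), computed by trying every non-zero input (small sizes only)."""
    size = len(matrix)
    return min(
        chunk_weight(x, n) + chunk_weight(mat_vec(matrix, x), n)
        for x in product((0, 1), repeat=size)
        if any(x)
    )

# Hypothetical toy matrix [[I, I], [I, C]], where C = [[0, 1], [1, 1]]
# is the companion matrix of x^2 + x + 1.
TOY = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]

# The toy matrix is MDS over two 2-bit words exactly when this equals k + 1 = 3.
print(branch_number(TOY, 2))
```

For the 32 × 32 matrices studied in the paper this brute-force search would have to visit $2^{32} - 1$ inputs, which is why a rank-based criterion is more practical.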
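Definition 2 (Characteristic polynomial [Wan03]). The characteristic polynomial of a binary matrix $A \in M_m(\mathbb{F}_2)$ is $\det(xI + A) \in \mathbb{F}_2[x]$.
Definition 3 (Minimal polynomial). A polynomial $f \in \mathbb{F}_2[x]$ is the minimal polynomial of $A$ if and only if $f(A) = 0$, and for any $g \in \mathbb{F}_2[x]$ with $g(A) = 0$, $f$ divides $g$. Note that a minimal polynomial of $A \in M_m(\mathbb{F}_2)$ can be reducible.
It is trivial to verify that the characteristic polynomial of $f$'s companion matrix is $f$.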
Lemma 2 ([BR99, LW16]). Let $L$ be a matrix in $M_k(M_n(\mathbb{F}_2))$. Then $L$ is an MDS matrix (with branch number $k + 1$) if and only if all square sub-matrices of $L$ (formed from its $n \times n$ blocks) are of full rank.
Lemma 2 is employed in this paper to check the MDS property of our candidate lightweight matrices.
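The criterion of Lemma 2 can be turned into a direct test. The sketch below is one possible way to do so, assuming the matrix is given as a $k \times k$ array of $n \times n$ binary blocks; the function names and the two toy inputs are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations

def rank_gf2(rows):
    """Rank over F_2 of a matrix whose rows are integer bitmasks."""
    basis = []  # each stored row has a leading bit that later rows lack
    for row in rows:
        for b in basis:
            row = min(row, row ^ b)  # clears b's leading bit in row if set
        if row:
            basis.append(row)
    return len(basis)

def block_rows(blocks, row_idx, col_idx, n):
    """Bitmask rows of the block sub-matrix selected by row_idx and col_idx."""
    rows = []
    for i in row_idx:
        for r in range(n):
            bits = [bit for j in col_idx for bit in blocks[i][j][r]]
            rows.append(int("".join(map(str, bits)), 2))
    return rows

def is_mds(blocks, n):
    """Lemma 2: every square block sub-matrix must have full rank."""
    k = len(blocks)
    for size in range(1, k + 1):
        for row_idx in combinations(range(k), size):
            for col_idx in combinations(range(k), size):
                rows = block_rows(blocks, row_idx, col_idx, n)
                if rank_gf2(rows) != size * n:
                    return False
    return True

I2 = [[1, 0], [0, 1]]
C = [[0, 1], [1, 1]]  # companion matrix of x^2 + x + 1 (toy choice)

print(is_mds([[I2, I2], [I2, C]], 2))   # all five sub-matrices are invertible
print(is_mds([[I2, I2], [I2, I2]], 2))  # the full matrix is singular
```

For $k = 4$ there are 69 square block sub-matrices to test, so this check stays cheap for 32 × 32 candidates.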

Metrics
We estimate the hardware cost of a linear operation as the number of two-input XOR gates ($\mathbb{F}_2 \times \mathbb{F}_2 \to \mathbb{F}_2$) required in its implementation, where the implementation can be described as a sequence of XOR and assignment operations $x_i \leftarrow x_{a_i} \oplus x_{b_i}$ with $a_i, b_i < i$. However, for a given linear operation, it is NP-hard to obtain the minimum number of XOR gates required [BMP08, BMP13], and only metrics giving upper bounds are available. The metrics used in this paper are listed in the following.
Direct XOR Count. Given a matrix $A \in M_{nk}(\mathbb{F}_2)$, the direct XOR count $DXC(A)$ of $A$ is $\omega(A) - nk$, that is, the number of 1s in the matrix $A$ minus $nk$. This corresponds to a naive implementation of $A$, where each row of $A$ is implemented as is. $DXC(A)$ is essentially the same as the Hamming weight $\omega(A)$ of $A$ up to a constant shift.
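As a small illustration of the Direct XOR Count, the following Python sketch computes the Hamming weight and the DXC of a binary matrix. The example input is the 8 × 8 companion matrix of $x^8 + x^2 + 1$ used later in the paper; the function names are illustrative assumptions.

```python
def hamming_weight(matrix):
    """Number of 1s in a binary matrix given as a list of 0/1 rows."""
    return sum(sum(row) for row in matrix)

def dxc(matrix):
    """Direct XOR count: Hamming weight minus the number of rows (nk)."""
    return hamming_weight(matrix) - len(matrix)

def companion_x8_x2_1():
    """8 x 8 companion matrix of x^8 + x^2 + 1 over F_2."""
    a = [[0] * 8 for _ in range(8)]
    for i in range(1, 8):
        a[i][i - 1] = 1  # sub-diagonal of ones
    a[0][7] = 1          # constant term of the polynomial
    a[2][7] = 1          # x^2 term of the polynomial
    return a

A = companion_x8_x2_1()
print(hamming_weight(A), dxc(A))  # 9 ones, so a single XOR in the naive circuit
```

A row with $w$ ones costs $w - 1$ XOR gates when implemented on its own, which is where the subtraction of the row count comes from.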
Global Optimization. Given a matrix $A \in M_{nk}(\mathbb{F}_2)$, we can obtain an estimation of its hardware cost by finding a good linear straight-line program corresponding to $A$ with state-of-the-art automatic tools based on a certain SLP heuristic [BMP13], and this metric is denoted as $SLP(A)$. Note that this is so far the most accurate estimation that is practical for 32 × 32 binary matrices.
In this work, eventually the hardware cost is estimated with Global Optimization.However, before applying the Global Optimization, we first try to construct lighter involutory MDS matrices with fairly low Direct XOR Count (i.e., matrices with low Hamming weights).Finally, we would like to mention that there are other metrics (such as the Sequential XOR Count [JPST17]) in the literature, and we refer the reader to [DL18] for a clear discussion of the comparisons and limitations of different metrics.
Besides the circuit area (measured by the number of XOR gates required for an implementation), another important metric of an implementation is the latency, which constrains the clock frequency at which the circuit can operate. The latency of an implementation can be characterized by its depth.
Definition 5. Let $M$ be an $m \times m$ binary matrix. Then the function $f : \mathbb{F}_2^m \to \mathbb{F}_2^m$, $x \mapsto Mx$, can be implemented with a finite number of XOR gates. The critical path of such an implementation is defined as the path between an input and an output involving the maximum number of XOR gates, and the depth of the implementation is the number of XOR gates involved in the critical path.
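Both metrics, XOR count and depth, can be read off a straight-line program of the form $x_i \leftarrow x_{a_i} \oplus x_{b_i}$. The sketch below assumes a hypothetical encoding in which the inputs are signals $0, \ldots, \text{num\_inputs} - 1$ and every operation appends a new signal; it is an illustration, not the authors' tooling.

```python
def slp_cost_and_depth(num_inputs, ops):
    """Return (XOR count, depth) of a straight-line program.

    ops[i] = (a, b) defines signal number num_inputs + i as x_a XOR x_b,
    where a and b must refer to signals that already exist.
    """
    depth = [0] * num_inputs  # input signals have depth 0
    for a, b in ops:
        if not (0 <= a < len(depth) and 0 <= b < len(depth)):
            raise ValueError("operand used before it is defined")
        depth.append(max(depth[a], depth[b]) + 1)
    # The maximum is taken over all signals; it equals the critical path
    # as long as every intermediate signal feeds some output.
    return len(ops), max(depth)

# y = x0 + x1 + x2 + x3, computed as a chain and as a balanced tree.
chain = [(0, 1), (4, 2), (5, 3)]
tree = [(0, 1), (2, 3), (4, 5)]
print(slp_cost_and_depth(4, chain))  # (3, 3)
print(slp_cost_and_depth(4, tree))   # (3, 2)
```

The two programs use the same number of gates but differ in depth, which is the trade-off between area and latency discussed in this section.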

Our Constructions
By applying the subfield construction [BNN+10, KPPY14] to the involutory MDS matrix proposed by Sarkar et al. [SS16b], Kranz et al. obtain the most lightweight involutory MDS matrix known so far in $M_4(M_2(\mathbb{F}_{2^4}))$, whose binary representation is a 32 × 32 matrix denoted by $M_{KLSW}$.
The involutory MDS matrix $M_{KLSW}$ can be regarded as a matrix in $M_4(GL(8, \mathbb{F}_2))$ of a block form that we refer to as (1). We generalize (1) and try to find lightweight involutory MDS matrices $G$ whose blocks are built from powers of an 8 × 8 binary matrix $A$ with integer exponents $i$, $j$, $k$ and $l$. According to Observation 1, to make $G$ involutory, we need $A^{i+k} + A^j = O_8$.
First, our goal is to find an involutory matrix $G$ such that $DXC(G)$ is small. Since $DXC(G)$ is determined by the Hamming weights of the powers of $A$ appearing in $G$ (plus the constant $48 - 32$), and heuristically $\omega(A^t)$ increases with $|t|$ when $A$ is very sparse, we prefer instantiations of $i$, $l$, $j$ and $k$ such that $|i|$, $|l|$, $|j|$ and $|k|$ (the exponents of $A$ appearing in $G$) are small. According to [BKL16] (see Table 7 of [BKL16]), $DXC(A) \geq 2$ if the characteristic polynomial of $A$ is an irreducible polynomial of degree 8. Therefore, we only consider $A$ whose characteristic polynomial is reducible. We find that a suitable choice is
$$A = \begin{pmatrix} 0&0&0&0&0&0&0&1 \\ 1&0&0&0&0&0&0&0 \\ 0&1&0&0&0&0&0&1 \\ 0&0&1&0&0&0&0&0 \\ 0&0&0&1&0&0&0&0 \\ 0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&1&0&0 \\ 0&0&0&0&0&0&1&0 \end{pmatrix} \tag{2}$$
which is the companion matrix of $x^8 + x^2 + 1$, whose characteristic polynomial is $x^8 + x^2 + 1 = (x^4 + x + 1)^2$.
It is easy to verify that the minimal polynomial of $A$ is also $x^8 + x^2 + 1$ according to Definition 3. Hence $A^8 + A^2 + I = 0$ and thus $A^{8+d} + A^{2+d} + A^d = 0$ for any integer $d$. Therefore, solving the resulting equations, where $A^{2i+k} = A^{i+j}$ according to Observation 1, gives the solutions of $l$, $i$, and $k$. We can enumerate all solutions and pick one which minimizes $4|l| + 2|i| + 2|k| + 2|i + j|$. By applying Boyar's SLP-heuristic algorithm to one such solution, we obtain an implementation of $G$ with only 80 XOR gates, which breaks the record of 84 XOR gates [KLSW17]; the actual implementation can be found in Table 2.

More Generalizations
The above result motivates us to consider a more generalized form $M = (A^{\ell_{ij}})_{1 \leq i, j \leq 4}$, where $A$ is the companion matrix shown in Equation (2), and $\ell_{ij}$ are integers for $1 \leq i, j \leq 4$.
Without loss of generality, we work with a normalized form of $M$. Since $M$ is involutory and thus $M^2 = I$, we can deduce relations among the exponents $\ell_{ij}$, which lead to Equation (3). According to Equation (3), the matrix $M$ can be completely determined by the parameters $\ell_{12}$, $\ell_{13}$, $\ell_{14}$, $r$, $s$ and $t$. Therefore, we inspect all $(\ell_{12}, \ell_{13}, \ell_{14}, r, s, t) \in \mathbb{Z}^6$ satisfying the conditions (5). Finally, we identify 5550 involutory MDS matrices whose Hamming weights are within the range from 148 to 172. We apply Boyar's SLP-heuristic algorithm to all these matrices to obtain their lightweight implementations, and the results are summarized in Table 3.
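The above approach produces many equivalent matrices. For instance, let
$$M = \begin{pmatrix} I & A^{\ell_{12}} & A^{\ell_{13}} & A^{\ell_{14}} \\ A^{\ell_{12}+s+t} & I & A^{\ell_{14}+s} & A^{\ell_{13}+t} \\ A^{\ell_{13}+r+t} & A^{\ell_{14}+r} & I & A^{\ell_{12}+t} \\ A^{\ell_{14}+r+s} & A^{\ell_{13}+r} & A^{\ell_{12}+s} & I \end{pmatrix},$$
which is parameterized by $(\ell_{12}, \ell_{13}, \ell_{14}, r, s, t)$. If we exchange the second row and third row, and then exchange the second and third column, we obtain
$$M' = \begin{pmatrix} I & A^{\ell_{13}} & A^{\ell_{12}} & A^{\ell_{14}} \\ A^{\ell_{13}+r+t} & I & A^{\ell_{14}+r} & A^{\ell_{12}+t} \\ A^{\ell_{12}+s+t} & A^{\ell_{14}+s} & I & A^{\ell_{13}+t} \\ A^{\ell_{14}+r+s} & A^{\ell_{12}+s} & A^{\ell_{13}+r} & I \end{pmatrix},$$
corresponding to the parameter $(\ell_{13}, \ell_{12}, \ell_{14}, s, r, t)$. Obviously, $M'$ is an involutory MDS matrix if and only if $M$ is involutory and MDS. In addition, from any implementation of $M$, we can derive an implementation of $M'$ with the same circuit size and depth. Hence, we regard the two parameters as equivalent. A list of equivalent parameters is given in Table 4, whose Transformation column is the cycle notation of a permutation $\pi$ over $\{1, 2, 3, 4\}$. The parameter in the same row is obtained by permuting the columns and rows of $M$ according to $\pi$. Taking the 4th row for example, we have $\pi = (2, 4, 3)$, and the transformation turns $M$ into
$$\begin{pmatrix} I & A^{\ell_{13}} & A^{\ell_{14}} & A^{\ell_{12}} \\ A^{\ell_{13}+r+t} & I & A^{\ell_{12}+t} & A^{\ell_{14}+r} \\ A^{\ell_{14}+r+s} & A^{\ell_{12}+s} & I & A^{\ell_{13}+r} \\ A^{\ell_{12}+s+t} & A^{\ell_{14}+s} & A^{\ell_{13}+t} & I \end{pmatrix},$$
from which we can see that $(\ell_{13}, \ell_{14}, \ell_{12}, s, t, r)$ and $(\ell_{12}, \ell_{13}, \ell_{14}, r, s, t)$ are equivalent.
However, such equivalences are not visible to Boyar's tool [BMP13] due to its heuristic nature, where the orders of the rows and columns do matter. That is, Boyar's tool may output circuits with different sizes and depths for two equivalent matrices. Therefore, in our experiment, we still need to search through all matrices we generated, and pick the ones with better implementations. One of the best matrices we find is $H$, corresponding to the parameter $(0, 0, 4, 0, 2, 2)$, where $A$ is the companion matrix of $x^8 + x^2 + 1$ shown in Equation (2). The actual implementation of $H$ is given in Table 5.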
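The algebra behind this section can be checked numerically. The Python sketch below verifies $A^8 + A^2 + I = 0$ for the companion matrix of $x^8 + x^2 + 1$ and then tests the involution property for two parameters. The block layout used in `build` is an assumption: it is the arrangement of exponents that is consistent with the parameter equivalences described above, not a layout copied from the authors' code, and all function names are hypothetical.

```python
def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def mat_mul(x, y):
    """Product of two square binary matrices over F_2."""
    n = len(x)
    return [[sum(x[i][k] & y[k][j] for k in range(n)) % 2 for j in range(n)]
            for i in range(n)]

def mat_add(x, y):
    return [[a ^ b for a, b in zip(r, s)] for r, s in zip(x, y)]

def companion():
    """Companion matrix of x^8 + x^2 + 1 over F_2."""
    a = [[0] * 8 for _ in range(8)]
    for i in range(1, 8):
        a[i][i - 1] = 1
    a[0][7] = a[2][7] = 1
    return a

A = companion()

def power(e):
    """A^e for any integer e; A^-1 = A^7 + A follows from A^8 + A^2 + I = 0."""
    base = A
    if e < 0:
        a7 = identity(8)
        for _ in range(7):
            a7 = mat_mul(a7, A)
        base, e = mat_add(a7, A), -e
    result = identity(8)
    for _ in range(e):
        result = mat_mul(result, base)
    return result

def build(params):
    """32 x 32 matrix for (l12, l13, l14, r, s, t); ASSUMED block layout."""
    l12, l13, l14, r, s, t = params
    exps = [
        [0, l12, l13, l14],
        [l12 + s + t, 0, l14 + s, l13 + t],
        [l13 + r + t, l14 + r, 0, l12 + t],
        [l14 + r + s, l13 + r, l12 + s, 0],
    ]
    blocks = [[power(e) for e in row] for row in exps]
    return [sum((blocks[i][j][row] for j in range(4)), [])
            for i in range(4) for row in range(8)]

def is_involutory(m):
    return mat_mul(m, m) == identity(len(m))

zero = [[0] * 8 for _ in range(8)]
print(mat_add(mat_add(power(8), power(2)), identity(8)) == zero)
print(is_involutory(build((0, 0, 4, 0, 2, 2))))    # parameter given for H
print(is_involutory(build((0, -2, -2, 2, 4, 6))))  # parameter given in Sect. 6
```

Under this layout, the off-diagonal blocks of $M^2$ cancel in pairs, and the diagonal blocks reduce to $I + A^{2\ell_{12}+s+t} + A^{2\ell_{13}+r+t} + A^{2\ell_{14}+r+s}$. The sum of the three powers vanishes when the exponents have the form $\{d, d+2, d+8\}$, which is the case for both parameters above ($\{2, 4, 10\}$).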

Searching for Low-latency Involutory MDS Matrices
In the previous section, we identified an involutory MDS matrix which can be implemented with 78 XOR gates and whose circuit depth is 4. Although this matrix is good with respect to lightweightness, we find that it is inferior to the AES MixColumns operation in terms of latency. The lightest implementation (97 XOR gates) of the AES MixColumns operation is of depth 8, and if we increase the number of XOR gates, the AES MixColumns can be implemented with depth 3. In the following, we show that depth 3 is optimal.
Theorem 1. The circuit depth of an MDS matrix $A \in M_4(GL(8, \mathbb{F}_2))$ with branch number 5 is at least 3.
Proof. Suppose, to the contrary, that $A = (A_{i,j})_{1 \leq i, j \leq 4}$ is an MDS matrix with branch number 5 whose circuit depth is at most 2. Since an output of depth at most 2 is the XOR of at most $2^2 = 4$ inputs, each of the $4 \times 8 = 32$ rows of $A$ contains at most four 1's. Then the Hamming weight of each row of every 8 × 8 submatrix $A_{i,j}$ is 1. Otherwise, there is one row of some submatrix $A_{i,j}$ whose Hamming weight is 0, which contradicts our assumption that $A$ is MDS (see Lemma 2). Moreover, each column of $A_{i,j}$ contains only one 1. Otherwise we can identify two linearly dependent rows, which is a contradiction to the MDS property. Therefore, $A_{i,j}$ is a permutation matrix. Now let us consider a 16 × 16 submatrix $A'$ formed by a 2 × 2 array of the blocks $A_{i,j}$. The Hamming weight of each row and each column of $A'$ is 2. Thus, the sum of the $2 \times 8 = 16$ rows of $A'$ is the zero vector, meaning that $A'$ is not invertible. This is a contradiction to the MDS property of $A$.
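The last step of the proof can be observed experimentally: any 16 × 16 matrix assembled from four 8 × 8 permutation blocks has rows that sum to zero and is therefore singular. The following Python sketch samples such matrices at random; the helper names and the random seed are arbitrary illustrative choices.

```python
import random

def rank_gf2(rows):
    """Rank over F_2 of a matrix whose rows are integer bitmasks."""
    basis = []
    for row in rows:
        for b in basis:
            row = min(row, row ^ b)  # clears b's leading bit in row if set
        if row:
            basis.append(row)
    return len(basis)

def random_permutation_block(n, rng):
    """Rows of a random n x n permutation matrix, as bitmasks."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [1 << p for p in perm]

def two_by_two(n, rng):
    """A 2n x 2n matrix built from four random permutation blocks."""
    p = [[random_permutation_block(n, rng) for _ in range(2)] for _ in range(2)]
    return [(p[i][0][r] << n) | p[i][1][r] for i in range(2) for r in range(n)]

def xor_of_rows(rows):
    total = 0
    for row in rows:
        total ^= row
    return total

rng = random.Random(2019)
for _ in range(5):
    rows = two_by_two(8, rng)
    # Every column holds exactly two 1s, so the rows XOR to zero
    # and the rank is strictly below 16.
    print(xor_of_rows(rows), rank_gf2(rows))
```

Since such a submatrix can never have full rank, Lemma 2 rules out the MDS property whenever all blocks are permutation matrices.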
Therefore, our goal is to find lightweight involutory matrices whose circuit depth is 3. Hopefully, we can identify one that is lighter than the MixColumns operation of AES, which does not enjoy the involutory property. For a given 32 × 32 matrix, Boyar's SLP-heuristic algorithm [BMP13] is virtually the best tool available for finding a lightweight implementation. However, Boyar's algorithm aims at minimizing the number of XOR gates of an implementation regardless of its circuit depth, so it is not directly applicable in our scenario.
Given a set of input signals and a set of linear predicates represented as a binary matrix, Boyar's algorithm repeatedly picks two signals according to some rules, adds them together as a new signal, and puts this new signal into the signal set. Intuitively, after each iteration the signal set becomes "closer" to the set of linear predicates according to a notion of distance. The algorithm stops if and only if the distance becomes 0, that is, the set of signals computes the set of linear predicates.
In the following, we enhance Boyar's algorithm with circuit depth awareness. Basically, we modify Boyar's algorithm by only picking signals which are not going to exceed a specified depth bound, and by defining a new notion of distance which takes the circuit depth into account. The details are presented in Algorithm 1, where the subroutine Pick() picks two elements from the current signal set $S$ such that when the exclusive-or of these two elements is put into the signal set $S$, the sum of the values in the new distance vector $\Delta$ is minimized among all possible choices of the two elements, and ties are resolved by maximizing the Euclidean norm of $\Delta$. This strategy is exactly the same as Boyar's method [BMP13], except that the distances in $\Delta$ are computed according to our new definition presented in the following.
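The selection rule of Pick() can be sketched in a few lines of Python. The sketch assumes that the distance vector resulting from each candidate pair has already been computed; computing those distances is the expensive part of the heuristic and is not shown. The candidate values in the example are made up for illustration.

```python
def pick_best(candidates):
    """Choose the pair whose new distance vector has the smallest sum.

    candidates maps a pair (i, j) of signal indices to the distance vector
    obtained after adding S[i] XOR S[j] to the signal set. Ties on the sum
    are broken by the largest Euclidean norm; comparing sums of squares is
    equivalent to comparing norms.
    """
    def key(pair):
        delta = candidates[pair]
        return (sum(delta), -sum(d * d for d in delta))
    return min(candidates, key=key)

# Hypothetical distance vectors for three candidate pairs.
candidates = {
    (0, 1): [1, 1, 2],  # sum 4, squared norm 6
    (0, 2): [0, 2, 2],  # sum 4, squared norm 8
    (1, 2): [1, 2, 2],  # sum 5
}
print(pick_best(candidates))  # (0, 2)
```

Preferring the larger norm among equal sums favours vectors in which some target is already very close, such as a distance of 0 in the example.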

Algorithm 1: SLP heuristic with bounded circuit depth
Input: An $m \times n$ binary matrix $M$ representing $m$ linear predicates in $n$ variables, i.e., $(y_1, \ldots, y_m)^T = M (x_1, \ldots, x_n)^T$, and a positive integer $H$
Output: A sequence of signals $S_l$ in which every signal has depth at most $H$, and for any $y_k$ with $1 \leq k \leq m$, $y_k$ can be computed by one element in $S_l$

Let $S$ be a sequence of signals. For any linear predicate $f$, we define $\delta_H(S, f)$ as the minimum number of additions (XOR gates) required to implement $f$ with input signals from $S$, such that the depth of the implementation is not greater than $H$. We call $\delta_H(S, f)$ the $H$-distance from $S$ to $f$. Note that our notion of distance is different from Boyar's in that if $\delta_H(S, f) = k$, we not only require that $f$ can be obtained by $k$ additions, but also that there exists an implementation with $k$ additions within depth $H$. If $f$ cannot be implemented within depth $H$, we have $\delta_H(S, f) = \infty$. In what follows, we use $\delta(S, f)$ to denote the distance defined in Boyar's work [BMP13], where the circuit depth is not considered.
In Algorithm 1, we need a method to compute the minimal circuit depth of $v_1 + \cdots + v_k$, where the depths of the $v_i$'s are known. Note that there are many different ways of implementing $v_1 + \cdots + v_k$ which lead to different circuit depths, as illustrated in Fig. 1. To deal with this, we prove the following theorem.

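Theorem 2. Let $v_1, \ldots, v_n$ be $n$ signals with depths $d_1, \ldots, d_n$ respectively. Then the lower bound of the depth of the circuit implementing $z = v_1 + \cdots + v_n$ is $\lceil \log_2 \sum_{i=1}^{n} 2^{d_i} \rceil$. Moreover, there is always a circuit implementing $z$ with depth $\lceil \log_2 \sum_{i=1}^{n} 2^{d_i} \rceil$, i.e., the lower bound is always achievable.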
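The depth formula is easy to evaluate with integer arithmetic. The Python sketch below computes $\lceil \log_2 \sum_i 2^{d_i} \rceil$ and compares it with the depth reached by a greedy merge of the two shallowest signals. The greedy merge is a swapped-in, Huffman-style strategy chosen for brevity; it is not the sorting procedure used in the paper's proof, and the function names are assumptions.

```python
import heapq

def min_depth(depths):
    """ceil(log2(sum of 2^d)) for a non-empty list of depths, integers only."""
    total = sum(1 << d for d in depths)
    return (total - 1).bit_length()

def greedy_depth(depths):
    """Depth of the sum when the two shallowest signals are XORed repeatedly."""
    heap = list(depths)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + 1)
    return heap[0]

# Depths 2, 0 and 3, as in the caption of Fig. 1: 2^2 + 2^0 + 2^3 = 13.
print(min_depth([2, 0, 3]), greedy_depth([2, 0, 3]))  # 4 4
```

With these depths, adding the two shallowest signals first gives depth 4, whereas adding the two deepest signals first gives depth 5, so the order of additions matters.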
Proof. We prove the lower bound by induction on $n$, the number of terms in the summation. For $n = 1$ and $n = 2$, Theorem 2 holds obviously. Assuming that it holds for all $k < n$, we show in the following that it also holds for $k = n$.
Without loss of generality, any implementation of $z$ ends with an addition $z = z_a + z_b$, where $z_a$ and $z_b$ are the sums of two disjoint non-empty subsets of $\{v_1, \ldots, v_n\}$ whose union is the whole set. Then $depth(z) = \max\{depth(z_a), depth(z_b)\} + 1$. According to the induction hypothesis, $2^{depth(z_a)}$ and $2^{depth(z_b)}$ are at least the sums of $2^{d_i}$ over the signals contributing to $z_a$ and $z_b$ respectively. Therefore, we obtain $2^{depth(z)} \geq 2^{depth(z_a)} + 2^{depth(z_b)} \geq \sum_{i=1}^{n} 2^{d_i}$, that is, $depth(z) \geq \lceil \log_2 \sum_{i=1}^{n} 2^{d_i} \rceil$.
Next, we show that the lower bound is achievable. First, we sort the set $\{v_1, \ldots, v_n\}$ of signals by non-decreasing depth. Then, we remove the leftmost two signals with the same depth, and insert the signal of their sum into the depth-ordered list. Without loss of generality, we assume that $\{v_1, \ldots, v_n\}$ is already in order, and $depth(v_1) = depth(v_2)$. After we update the set according to the above rule, we have a new set of signals $\{v_1 + v_2, v_3, \ldots, v_n\}$ with $2^{depth(v_1 + v_2)} + 2^{depth(v_3)} + \cdots + 2^{depth(v_n)} = 2^{depth(v_1)} + \cdots + 2^{depth(v_n)}$.
We repeat the above operations until we obtain a set of signals $\{z_1, \ldots, z_m\}$ whose depths $q_1 < q_2 < \cdots < q_m$ are pairwise distinct. Now, we are ready to give the implementation achieving the lower bound. First, if $m > 1$, we add $z_1$ and $z_2$ and obtain $z_{m+1} = z_1 + z_2$ with $depth(z_{m+1}) = q_2 + 1$; then we add $z_{m+1}$ and $z_3$ and obtain $z_{m+2}$ with $depth(z_{m+2}) = q_3 + 1$; and so on. Finally, we add $z_{2m-2}$ and $z_m$ and obtain $z$, which implements $v_1 + \cdots + v_n$ with depth $q_m + 1 = \lceil \log_2 (2^{q_1} + \cdots + 2^{q_m}) \rceil$. Otherwise, $m = 1$ and $2^{depth(v_1)} + \cdots + 2^{depth(v_n)}$ is exactly a power of 2. In this case, we have $depth(z) = q_1 = \log_2 (2^{depth(v_1)} + \cdots + 2^{depth(v_n)})$.
In our algorithm, initially $S$ is the sequence of all input signals. We maintain a list $\Delta$ to track the $H$-distances of the output signals from $S$. At the same time, we keep a list $D$ such that $D[i]$ is the circuit depth of $S[i]$. At each iteration, we pick two different elements from $S$ with Pick($S$, $D$, $H$). Basically, we create a new element for $S$ whose circuit depth is not greater than $H$ by adding the two elements returned by Pick(), which minimizes the sum of the new $H$-distances, where ties are resolved by maximizing the Euclidean norm of the new $\Delta$. This strategy is the same as Boyar's SLP heuristic, and we refer the reader to [BMP13] for more information. Our algorithm is best illustrated by running through a toy example.
Example 4. Let the set of input signals be $\{x_1, x_2, x_3, x_4, x_5\}$, and let the target linear predicates be represented as a binary matrix. We execute Algorithm 1 with $H = 2$.
We apply this algorithm to all matrices we generated in Sect. 5, and the lightest one achieving the lower bound of the circuit depth (i.e., 3) that we find is $Q$, corresponding to the parameter $(0, -2, -2, 2, 4, 6)$, where $A$ is the companion matrix of $x^8 + x^2 + 1$ shown in Equation (2). The actual implementation of $Q$ is given in Table 6.
Remark. In Sects. 4-6, we only show the best matrices we find. We present a summary of all other results we obtained in Supplementary materials A and B, where we only show the parameter resulting in the better circuit when equivalences are encountered. Moreover, the raw data and source code are also submitted as supplementary material along with the paper.

Conclusion
In this work, we find the lightest 32 × 32 involutory MDS matrices with branch number 5 known so far by searching through a large set of matrices whose entries are powers of the companion matrix of $x^8 + x^2 + 1$. Moreover, we enhance Boyar's SLP heuristic with

Figure 1: Two implementations of the same summation $v_1 + v_2 + v_3$ with different circuit depths, where the depths of $v_1$, $v_2$ and $v_3$ are 2, 0, and 3 respectively.

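Table 1: A summary of the results. All matrices shown in the table are 32 × 32 binary matrices, and $M_k(R)$ is the set of all $k \times k$ matrices whose entries are drawn from $R$. The SLP column is obtained by applying Boyar's SLP heuristic [BMP13], and SLP* means that the result is obtained by applying a modified version of Boyar's SLP heuristic with circuit depth awareness presented in Sect. 6.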

Table 3: A summary of the results. The first row means that we identify a set of 18 matrices whose Hamming weight and DXC are 148 and 116 respectively. The maximal and minimal XOR gate counts of these matrices after applying Boyar's SLP heuristic are both 80, and the minimum circuit depth is 4.

Table 4: A list of equivalent parameters, where the Transformation column corresponds to certain column and row permutations explained in the following.