Towards Low-Latency Implementation of Linear Layers

Lightweight cryptography features a small footprint and/or low computational complexity. Low-cost implementations of linear layers usually play an important role in lightweight cryptography. Although it has been shown by Boyar et al. that finding the optimal implementation of a linear layer is a Shortest Linear Program (SLP) problem and NP-hard, there exist a variety of heuristic methods to search for near-optimal solutions. This paper considers the low-latency criteria and focuses on the heuristic search of lightweight implementation for linear layers. Most of the prior approaches iteratively combine the inputs (of linear layers) to reach the output, which can be regarded as the forward search. To better adapt the low-latency criteria, we propose a new framework of backward search that attempts to iteratively split every output (into an XOR of two bits) until all inputs appear. By bounding the time of splitting, the new framework can find a sub-optimal solution with a minimized depth of circuits. We apply our new search algorithm to linear layers of block ciphers and find many low-latency candidates for implementations. Notably, for AES MixColumns, we provide an implementation with 103 XOR gates and a depth of 3, which is among the best hardware implementations of the AES linear layer. Besides, we obtain better implementations in XOR gates for 54.3% of the 4256 Maximum Distance Separable (MDS) matrices proposed by Li et al. at FSE 2019. We also achieve an involutory MDS matrix (in M_4(GL(8, F_2))) whose implementation uses the lowest number (i.e., 86) of XOR gates with minimum depth.


Introduction
In recent years, lightweight cryptography has been applied in many fields, such as the Internet of Things (IoT) and Radio-Frequency IDentification (RFID) tags. Their security has been a central focus for researchers because various restrictions lead to new security threats [DGB19]. Generally, lightweight cryptography ensures secure encryption and expands cryptographic applications to devices with limited resources in circuit size, power consumption, and latency.
There are many criteria for designing lightweight cryptographic primitives, and the most popular is the number of gate equivalents (GE) required to implement a cryptographic algorithm. As it nicely approximates the complexity of digital electronic circuits, there is a growing body of work solely concentrating on decreasing the GE (see [BP10, BMP13, KLSW17, DL18, BFI19, TP19, XZL+20, LXZZ21] for an incomplete list). Meanwhile, another criterion called latency is also crucial and has been attracting more and more attention, since it not only impacts the throughput of encryption/decryption, but also plays an important role in the low-energy consideration of ciphers [BBI+15].
Generally speaking, research on lightweight cryptography falls into two directions. The first direction focuses on designing new ciphers that are supposed to be efficient in either hardware (i.e., by logical gates) or software (i.e., on microprocessors) implementations. The community has devoted a lot of effort to this direction and has proposed plenty of structures [BKL+07, BSS+13, BBI+15, ZBL+15, BJK+16, BPP+17, DL18, LMMR21].
The second direction tries to optimize the implementation of given ciphers, which has drawn a lot of attention as well. On the one hand, it is of more practical significance. For example, the Advanced Encryption Standard (AES) [DR20] has been widely used in practice, and its round function has been frequently used in the design of other cryptographic primitives (e.g., AEGIS [WP13] and ForkAES [ARVV18]); thus, an efficient implementation will directly reduce the cost of deploying AES and the primitives that employ its round function. On the other hand, the optimizing approach can aid the design of lightweight ciphers. For example, Li et al. applied a heuristic optimization tool to the cost evaluation of their proposed lightweight Maximum Distance Separable (MDS) matrices for linear layers [LSL+19]. This paper follows the second line of work and focuses on the hardware implementation of linear layers, which provide diffusion for many cryptographic primitives.
The linear layer of a cipher can be represented as the multiplication (over F_2) of a matrix and a vector. For an m × n binary matrix A, given inputs x = (x_0, x_1, ..., x_{n−1})^T, the outputs y = (y_0, y_1, ..., y_{m−1})^T can be calculated by y = Ax. We give an example with a matrix A: inputs x = (x_0, x_1, x_2, x_3, x_4)^T and outputs y = (y_0, y_1, y_2)^T. In the worst case w.r.t. the cost, the implementation can be performed by the procedure described in Figure 1-left, requiring 8 XOR gates. An optimized implementation is given in Figure 1-right, saving half of the XOR gates. The above optimization can be formulated as the problem of finding the smallest number of linear operations necessary to compute a set of linear forms, which is called the Shortest Linear Program (SLP) problem. Although it has been shown that the SLP problem is NP-hard [BMP08], in practice, we can build circuit implementations of linear layers using a variety of heuristics [Paa97, BP10, BMP13, KLSW17, BFI19, LSL+19, TP19, XZL+20, BFI21, LXZZ21].
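As a toy illustration of the kind of saving the SLP problem captures, the following sketch evaluates a binary linear layer both naively and with shared intermediate values. The 3 × 5 matrix is illustrative only, not the matrix of Figure 1.

```python
# Hypothetical example matrix (not the paper's Figure 1 matrix).
A = [
    [1, 1, 1, 0, 0],   # y0 = x0 + x1 + x2
    [1, 1, 0, 1, 0],   # y1 = x0 + x1 + x3
    [1, 1, 0, 1, 1],   # y2 = x0 + x1 + x3 + x4
]

def naive_linear_layer(x):
    """Direct evaluation: a row of Hamming weight w costs w - 1 XOR gates."""
    y, xors = [], 0
    for row in A:
        terms = [xi for a, xi in zip(row, x) if a]
        acc = terms[0]
        for t in terms[1:]:
            acc ^= t
            xors += 1
        y.append(acc)
    return y, xors

def factored_linear_layer(x):
    """Reuse t0 = x0 ^ x1 and t1 = t0 ^ x3 across outputs: 4 gates, not 7."""
    t0 = x[0] ^ x[1]        # shared by all three outputs
    t1 = t0 ^ x[3]          # shared by y1 and y2
    return [t0 ^ x[2], t1, t1 ^ x[4]], 4
```

Both functions agree on every input, but the factored form uses 4 XOR gates instead of 7; finding the cheapest such factoring is exactly the SLP problem.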
The first heuristic approach employs the strategy originating in Paar's work [Paa97] and can be regarded as the forward search. Paar's method encodes input bits by one-hot encoding and optimizes the matrix by iteratively combining different columns. Paar's method has been improved in a number of follow-up works; notably, Boyar et al. proposed the BP algorithm [BP10], and depth-aware variants were later given in [LSL+19, BFI21]. When executing the BP algorithm with a depth bound, these variants only pick the choices that do not exceed the bound. In this selection, some choices that have low priority in their heuristics may actually lead to a better implementation. We illustrate the case with the following example. Suppose that the objective matrix is M_P. As shown in Table 1-left, Li et al. find an implementation with 11 XOR gates and depth 3. Based on their heuristics, y_4 is always generated with depth 3. This disables y_4 from being used in subsequent calculations; otherwise, the result would be beyond the bound of minimum depth. However, we find that y_4 can be generated with depth 2 and be used in subsequent calculations to reduce the number of XOR gates (see Table 1-right). This shows that there is a gap between existing algorithms and the lowest number of XOR gates with respect to the minimum depth.
Therefore, we propose the backward framework, which follows completely different heuristics. Our method is inspired by the Constant Matrix Multiplication (CMM) problem, which is defined as finding a solution using additions, subtractions, and shifts to compute the multiplication by an m × n constant matrix M over Z. Kumm et al. proposed an algorithm called RPAG-CMM in [KHZ17] to solve the CMM problem with the minimum depth, which caught our attention. However, in our experiments on binary matrices, RPAG-CMM generates more XOR gates than the LSL algorithm. Therefore, we study a specialized version of RPAG-CMM for binary matrices and propose a new heuristic algorithm based on the backward framework, which is more relevant to the SLP problem. The main algorithm will be introduced in Algorithm 2.

Our Contributions
In this paper, we investigate a new strategy of backward search. Concretely, rather than combining the inputs to reach the outputs, our new method attempts to iteratively split the outputs until all the input values appear. As shown in Table 1-right, for the matrix M_P, the minimum depth of y_0, y_1, y_2, y_3 is 3, while that of the others is less than 3. We give the method to compute the minimum depth in Subsection 2.1. Thus, we first consider these four values. We have y_0 = y_4 ⊕ t_7, y_1 = y_4 ⊕ x_5, y_2 = y_4 ⊕ x_6, and y_3 = y_5 ⊕ t_7. Because y_4, y_5, t_7 are not input values, we split them into t_8, y_6, x_2, x_4, x_5. Finally, we split t_8, y_6 by t_8 = x_2 ⊕ x_3 and y_6 = x_0 ⊕ x_1. The complete process can be seen in Subsection 3.1. The above process leads to a new implementation with 9 XOR gates and depth (defined by the longest path from the input to the output) 3. It should be noted that the new method can bound the depth in the search, which contributes to solutions that take both latency and GE into consideration. For some matrices, the framework can cover more implementations than previous algorithms with minimum depth in a limited time, as illustrated in detail in Subsection 3.4.
Then, we apply the strategy to linear layers of block ciphers and find many low-latency candidates for implementations. The results can be seen in Table 2 and Table 3. For the 11 linear layers that we analyzed, we find 8 matrices that match the best known XOR counts with minimum depth and improve 3 matrices in XOR gates. For the 26 lightweight matrices proposed in the literature, we find 9 matrices that match the best known XOR counts with minimum depth and improve 8 matrices in XOR gates. Notably, for AES MixColumns, we achieve an implementation with 103 XOR gates and depth 3. This is the same as the best low-latency result recently reported in [BFI21]. We also apply the algorithm to the 4256 MDS matrices proposed by Li et al. in [LSL+19], and achieve better implementations in XOR gates for 54.3% of them (see Table 8). The smallest matrix among them can be implemented with 86 XOR gates and depth 3, while the previous result is 88 XOR gates in [LSL+19].
Last but not least, we synthesize the above results using two different ASIC libraries, namely TSMC 90 nm and NanGate 45 nm. Our implementation of the AES MixColumns has lower power and latency in both libraries than those in [LSL+19, XZL+20, LXZZ21, BFI21]. All the source code and results of this paper are available at https://github.com/QunLiu-sdu/Towards-Low-Latency-Implementation.

Organization
In Section 2, we give some basic notations and metrics. In Section 3, we formally propose our algorithm and give some examples of its use. All the results and hardware implementations are given in Section 4. Finally, we conclude and propose future research directions in Section 5.

Notations
Let F_2 be the field with two elements, whose additive and multiplicative identities are respectively denoted as 0 and 1. Let F_2^n be the vector space of all n-dimensional vectors over F_2, and (F_2^n)^ℓ the vector space of all ℓ-dimensional vectors over F_2^n. We use M_ℓ(F_2^n) to denote the set of all ℓ × ℓ matrices over F_2^n, and M_ℓ(GL(n, F_2)) to denote the set of all ℓ × ℓ matrices whose elements are taken from the general linear group GL(n, F_2) formed by all invertible n × n matrices over F_2. Note that we abuse M_1(F_2^n) or M_n(F_2) to denote the set of all n × n binary matrices.
For a vector x, let ω_n(x) be the number of its non-zero n-bit chunks. In particular, ω_1(x) denotes the Hamming weight of x. For any linear layer of a cipher associated to an m × n binary matrix A, given inputs x = (x_0, x_1, ..., x_{n−1})^T, the outputs y = (y_0, y_1, ..., y_{m−1})^T of the linear layer can be calculated by y = Ax, and y_i can be computed by

y_i = a_{i,0} x_0 ⊕ a_{i,1} x_1 ⊕ ... ⊕ a_{i,n−1} x_{n−1},

where each coefficient a_{ij} is the entry of matrix A at the i-th row and j-th column. We can then associate y_i with a binary vector N_{y_i} = (a_{i,0}, a_{i,1}, ..., a_{i,n−1}). We use "node" to denote such a binary vector. That is, the node N_{x_i} is a unit node and N_{y_j} is a target node. For three nodes N_{y_{i1}}, N_{y_{i2}}, and N_{y_i}, we say N_{y_{i1}} and N_{y_{i2}} generate N_{y_i} if N_{y_i} = N_{y_{i1}} ⊕ N_{y_{i2}}, with ⊕ the element-wise addition. For a node N_{y_i}, we define its depth D(N_{y_i}) as the maximum number of XOR operations on a path from unit nodes to N_{y_i}. Obviously, D(N_{y_i}) ≥ D(N_{y_{i1}}) + 1 and D(N_{y_i}) ≥ D(N_{y_{i2}}) + 1. We define the minimum depth of a node N_{y_i} as

d_{y_i} = ⌈log_2 ω_1(N_{y_i})⌉.   (1)

Given an m × n binary matrix A over F_2, the minimum depth of A is defined as

d_A = max_{0 ≤ i ≤ m−1} d_{y_i}.   (2)

Note that we can treat the implementation of the matrix A as a graph. Thus, the depth of the implementation is the critical path length of the graph.
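Equations (1) and (2) can be sketched directly in code (assuming, as in the examples of Section 3, that the minimum depth of a node reduces to the ceiling of the base-2 logarithm of its Hamming weight):

```python
from math import ceil, log2

def min_depth(node):
    """Equation (1): a node XORing w inputs needs a binary XOR tree of
    depth at least ceil(log2(w)); unit nodes (w <= 1) have depth 0."""
    w = sum(node)
    return 0 if w <= 1 else ceil(log2(w))

def matrix_min_depth(A):
    """Equation (2): the matrix depth is the maximum over all rows."""
    return max(min_depth(row) for row in A)

# a row of weight 5 lies in L_3, since 2^2 < 5 <= 2^3
assert min_depth([1, 1, 1, 0, 1, 1, 0]) == 3
```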

Metrics
In this subsection, we recall some metrics for a matrix over F_2, which are helpful for the proposed solver of the minimum-depth SLP problem.

The direct-XOR (d-XOR) [KPPY14].
Given an m × n binary matrix A over F 2 , the d-XOR count is defined as ω 1 (A) − m, where we define ω 1 (A) as the Hamming weight of A, i.e., the number of 1's in A.

The sequential-XOR (s-XOR) [JPST17].
Let A ∈ GL(n, F 2 ) be an invertible matrix.Assume x 0 , x 1 , ..., x n−1 are the n input values of A. It is always possible to perform a sequence of XOR instructions x i = x i ⊕ x j with 0 ≤ i, j ≤ n − 1, such that the n input values are updated to the n output values.The s-XOR count of A is defined as the minimal number of XOR instructions to update the inputs to the outputs.

The general-XOR (g-XOR) [XZL + 20].
Given an m × n binary matrix A over F_2, the implementation of A can be viewed as a sequence of XOR operations x_i = x_{j1} ⊕ x_{j2}, where 0 ≤ j_1, j_2 < i and i = n, n + 1, ..., t − 1. The g-XOR count is defined as the minimal number of operations x_i = x_{j1} ⊕ x_{j2} that compute the m outputs completely. Since the d-XOR is intuitive and easy to compute, it has been adopted in the design of new lightweight diffusion layers. For further optimization, the s-XOR and the g-XOR are used in evaluating matrices. The difference is that the g-XOR can generate new values while the s-XOR always renews original values. For example, for computing x_0 ⊕ x_1, the s-XOR performs x_0 = x_0 ⊕ x_1, overwriting an input, while the g-XOR can store the result in a new value.

Global optimization. For an m × n binary matrix A over F_2, we can obtain an estimation of its hardware cost by finding a good linear straight-line program corresponding to A with state-of-the-art automatic tools based on certain SLP heuristics [BMP13]. Using different heuristics, global optimization leads to better results than local optimization [KLSW17]. In this paper, global optimization corresponds to optimizing with respect to the g-XOR metric.
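The distinction between s-XOR and g-XOR programs can be sketched as follows (a simplified model for illustration, not the formal definitions above):

```python
def s_xor(regs, i, j):
    """s-XOR instruction x_i = x_i ^ x_j: updates a register in place,
    so the program always works on the n original registers."""
    regs[i] ^= regs[j]
    return regs

def g_xor(vals, j1, j2):
    """g-XOR operation x_t = x_{j1} ^ x_{j2}: appends a fresh value,
    leaving the original values untouched for later reuse."""
    vals.append(vals[j1] ^ vals[j2])
    return vals
```

For computing x_0 ⊕ x_1, `s_xor([x0, x1], 0, 1)` overwrites x_0, while `g_xor([x0, x1], 0, 1)` keeps both inputs available.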

Backward Search
In this section, we formally present the new search framework and algorithm. The new framework provides another intuitive perspective on solving the SLP problem with respect to achieving the minimum depth. We find that other methods always ignore some choices because of their rules. Our new framework can keep such choices (ignored by other methods) and can potentially achieve better solutions. We only use the g-XOR metric in our framework.
First, we introduce the backward search framework and provide an example in Subsection 3.1. The example can help readers understand our framework; we begin with the example and then formalize the process. Then, we propose five heuristic rules for splitting nodes in the low-area backward strategy in Subsection 3.2, and discuss their priorities (i.e., which rule is used if multiple ones match) in Subsection 3.3. Finally, we compare the LSL algorithm with our framework in Subsection 3.4, to explain the advantages of the backward search for low-latency implementation compared to the forward one.
The notations used in this section are as follows. For convenience, we write y instead of N_y for a node.
• A: An m × n binary matrix to be implemented.
• x_i: The unit node with the i-th bit set.
• y_i: The target node of A.
• X: The set of unit nodes.
• d_{y_i}: The minimum depth of y_i.
• d_A: The maximum value of the minimum depth over the rows of A.
• L_d: The set of all nodes with minimum depth d.
• W: The working set containing target nodes.
• P: The predecessor set containing the predecessor nodes.
• E: The edge set containing the edges used to generate the graph.
• s: The current depth of W, used to determine which nodes will be considered.

The Backward Strategy
The backward strategy can be regarded as a search from outputs to inputs by iteratively splitting nodes. Given an implementation of a matrix A, {x_i = x_{j1} ⊕ x_{j2}} with 0 ≤ j_1, j_2 < i and i = n, n + 1, ..., every non-unit node x_i (i ≥ n) can be split into the two nodes x_{j1} and x_{j2}. If we use new nodes to split the target nodes into unit nodes, we can also find an implementation of A. For convenience, we call p and q the predecessors of w if and only if w = p ⊕ q holds. The backward strategy is given in Algorithm 2. We use the example M_P from Section 1 to cast some light on our framework.

Initialization. In M_P, y_i represents the i-th row of the matrix, and x_i is the unit node with the i-th bit set. We put each row of M_P into the working set W. The unit set X contains all the unit nodes. Thus, we have the predecessor set P = ∅, W = {y_0, y_1, y_2, y_3, y_4, y_5, y_6}, and X = {x_0, x_1, x_2, x_3, x_4, x_5, x_6}. The nodes in W need to be split. Then, we use Equation (1) to obtain the minimum depth of each node: d_{y_0} = d_{y_1} = d_{y_2} = d_{y_3} = 3, d_{y_4} = d_{y_5} = 2, and d_{y_6} = 1. By Equation (2), the minimum depth of M_P is d_{M_P} = 3.
Step 1. The current depth is s = d_{M_P} = 3. For y_i ∈ W, if the minimum depth d_{y_i} of y_i satisfies d_{y_i} < s, we move it from W to P. Therefore, W = {y_0, y_1, y_2, y_3} and P = {y_4, y_5, y_6}.
Step 7. Because W = ∅, we set W ← P, P ← ∅, and s = s − 1 = 0. Now, W = {x_0, x_1, x_2, x_3, x_4, x_5, x_6}. All the target nodes have been split into unit nodes, and we finish the search.

Now, we formalize our framework. First, we state the problem that we focus on. The SLP problem is defined as follows.
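Split relations of this kind can be replayed symbolically: if every node is viewed as the set of input indices it XORs, then a split u = p ⊕ q is valid exactly when u is the symmetric difference of p and q. The toy trace below uses a hypothetical 4-input target, not the paper's M_P:

```python
def xor_nodes(p, q):
    """XOR over F_2 on index sets = symmetric difference of the sets."""
    return p ^ q

# hypothetical target node over inputs x0..x3
y = frozenset({0, 1, 2, 3})
t0 = frozenset({0, 1})
t1 = frozenset({2, 3})

# backward splits: y -> (t0, t1), then t0 -> (x0, x1), t1 -> (x2, x3)
assert xor_nodes(t0, t1) == y
assert xor_nodes(frozenset({0}), frozenset({1})) == t0
assert xor_nodes(frozenset({2}), frozenset({3})) == t1
# three splits = three XOR gates; y sits at depth 2 = ceil(log2(4))
```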

Definition 1 ([BMP08]). The Shortest Linear Program (SLP) problem is defined as finding a solution with the minimum number of XORs to compute the multiplication of an m × n constant matrix A over F_2.
Then, we extend the SLP problem by considering the depth of the solution. The backward framework aims at solving the SLP problem while achieving the minimum depth. In other words, we give a solution of the SLP problem in which the depth of each node is not greater than the minimum depth of A.
Definition 2. The backward framework is an approach to search for a solution of the SLP problem that starts from the target nodes, iteratively chooses a node, and splits it into two nodes until all nodes are unit nodes.
Our framework returns a directed graph formed by nodes and edges connecting pairs of nodes. In a directed graph, each edge has an orientation, from one node to another. If there exists an edge from p to q, we say that q has a predecessor node p. The in-degree of a node is defined as the number of edges whose destination is that node. As our framework always splits one node into two predecessor nodes, the in-degree of every node is either 0 or 2. This gives rise to the following property.
Property 1. The backward framework returns a directed graph. In the graph, the in-degree of each node is 0 or 2: every unit node has in-degree 0, and every non-unit node has in-degree 2 and represents an XOR gate.
We use the set E to save the graph. The implementation is encoded in E in the form (p, u), which means that there exists an edge from p to u. Normally, a graph is defined by a tuple of sets, one for edges and one for vertices (i.e., the nodes). However, for the sake of brevity, we omit the set of vertices, since every edge (p, u) explicitly implies that the graph contains the two nodes p and u. For each non-unit node u, there exist two nodes p and q such that (p, u) and (q, u) are saved in E. In Step 2 of the example, we have y_0 = y_4 ⊕ t_7; thus, we put (y_4, y_0) and (t_7, y_0) into E.
For a non-unit node u, the splitting method uses two predecessor nodes a and b to split u by u = a ⊕ b. The depth of a graph is defined as the number of non-unit nodes involved in its critical path. If we choose appropriate predecessors, we can ensure that the SLP problem is solved while achieving the minimum depth.
This contradicts 2^{d−1} < ω_1(y). Therefore, we must have y_1 ∈ L_{d−1} or y_2 ∈ L_{d−1}.

Proposition 1 can help us to execute the splitting process. We use an example to illustrate it. Suppose that y ∈ L_3. The Hamming weight ω_1(y) of y is 5, 6, 7, or 8. If y_1, y_2 ∈ L_1, then ω_1(y_1) and ω_1(y_2) are not greater than 2, so y = y_1 ⊕ y_2 is impossible. Thus, we must have y_1 ∈ L_2 or y_2 ∈ L_2.

We use W to save the nodes that need to be split. Note that the splitting method may generate new nodes; we put them into P. We recommend reusing the nodes in P to reduce the number of XOR gates. Our heuristics in Subsection 3.2 aim to reuse non-unit nodes based on the current states of W and P. If we cannot match any heuristic, a function DefaultSplit() is used to split the nodes by a default method (see Algorithm 1).
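This observation can be checked exhaustively on small vectors: among splits y = y_1 ⊕ y_2 in which both predecessors stay strictly below the depth of y, at least one predecessor must sit exactly one level below. A brute-force sketch, assuming the ⌈log_2 ω_1(·)⌉ depth formula:

```python
from math import ceil, log2
from itertools import product

def mdepth(v):
    """Minimum depth of a node given as a 0/1 coefficient tuple."""
    w = sum(v)
    return 0 if w <= 1 else ceil(log2(w))

# check every node y of depth >= 2 over 6 inputs and every split of it
n = 6
for y in product([0, 1], repeat=n):
    d = mdepth(y)
    if d < 2:
        continue
    for y1 in product([0, 1], repeat=n):
        y2 = tuple(a ^ b for a, b in zip(y, y1))
        # consider only depth-respecting splits (both predecessors below d)
        if max(mdepth(y1), mdepth(y2)) < d:
            assert d - 1 in (mdepth(y1), mdepth(y2))
```

If both predecessors had depth at most d − 2, their combined weight would be at most 2^{d−1}, contradicting 2^{d−1} < ω_1(y).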

Proposition 2. Given an m × n binary matrix A and its minimum depth d_A, we can always find a graph based on the backward framework where each target node y_i can be split into unit nodes and the depth of each node is not greater than d_A.
Proof. The target nodes are y_0, y_1, ..., y_{m−1}. We put them into the working set W and initialize the predecessor set P ← ∅. Based on Proposition 1, for each node y_i in W, if y_i is not a unit node, we can always find two predecessors t_p and t_q where d_{t_p} and d_{t_q} are less than d_{y_i}. Through repeating the process, we split all the nodes in W into P. Next, we treat P as W and continue to split the nodes in W. The above process stops only when no nodes need to be split. Finally, every target node will be split into unit nodes.
Proposition 2 ensures that our framework always works. Note that the proof of the proposition reflects the processes of the backward framework. It is easy to see that this method is suitable for low-latency implementation. For an m × n binary matrix A, the minimum depth of A is d_A. Using the backward strategy, we can use d_{y_i} to denote the minimum depth of each target node y_i (0 ≤ i ≤ m − 1) and have y_i ∈ L_{d_{y_i}}. When selecting predecessor nodes, we always choose them from L_k (k < d_{y_i}), which easily reaches the bound of minimum depth. We give the following steps, and the complete process is in Algorithm 2.
1. Initialize the target set W and the predecessor set P.
2. If X ∪ W = X, return the implementation E.

As the processes of splitting and generating nodes are randomized, re-running the computation at different times leads to different results. Hence, it is difficult to determine how long to wait for the best solution. We execute the algorithm for a limited and reasonable time (several days) to collect many implementations and select the best one among them. This strategy is quite similar to many previous search approaches, e.g., [KLSW17, TP19, XZL+20, BFI21].
For a matrix, the initial information includes the target nodes and unit nodes. All the target nodes are put into W. Next, we generate new predecessor nodes or use existing nodes to split the nodes in W. Which predecessor nodes are used depends on different strategies. Since the selection of the predecessors is randomized, it is possible that we cannot find a good implementation even after a long time. Thus, we provide five heuristic rules (see Subsection 3.2). The rules help us find sub-optimal results. It is easy to transform our implementation into a circuit. The phrase "a is split into b and c" means a = b ⊕ c.
Given the current depth s, we only search for predecessor nodes in L_k (0 ≤ k < s), where L_k is the set of all nodes with minimum depth k. The loop condition X ∪ W ≠ X in Algorithm 2 indicates that there is at least one non-unit node in W, which will be split according to our strategy. In Step 7 of the example, the loop condition no longer holds, and the algorithm returns the result. In addition, for splitting the nodes with fewer XOR gates, we give a function Search() to describe how the algorithm uses the heuristics to split nodes and update parameters; more details can be seen in Algorithm 3. The purpose of the heuristics is to reduce the number of XOR gates within a reasonable time.
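A minimal sketch of the outer loop of Algorithm 2, with nodes as index sets and only a DefaultSplit-style fallback (the five heuristic rules and the randomization are omitted), might look as follows:

```python
from math import ceil, log2

def mdepth(node):
    return 0 if len(node) <= 1 else ceil(log2(len(node)))

def default_split(node):
    """Split a node's support into two balanced halves, so both
    predecessors have minimum depth at most mdepth(node) - 1."""
    items = sorted(node)
    half = (len(items) + 1) // 2
    return frozenset(items[:half]), frozenset(items[half:])

def backward_search(targets):
    """Split target nodes until only unit nodes remain; return the
    edge set E of (predecessor, node) pairs encoding the circuit."""
    work, done, edges = set(targets), set(), []
    while any(len(w) > 1 for w in work):
        pred = set()
        for w in work:
            if len(w) <= 1 or w in done:
                continue
            p, q = default_split(w)
            edges += [(p, w), (q, w)]
            done.add(w)
            pred.update((p, q))
        work = pred   # W <- P, as in Step 7 of the example
    return edges
```

Each split node corresponds to one XOR gate, so the gate count of this default circuit is half the number of edges; the heuristic rules exist precisely to reuse predecessors and lower that count.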

Heuristics of Splitting Nodes
In this subsection, we present our heuristic algorithm, which takes the working set W, the predecessor set P, and the current depth s in Algorithm 2 and outputs the best candidate node to be split together with the splitting scheme. We present several splitting rules based on W, P, and s and the corresponding splitting actions. The heuristic search is performed by matching one of the rules and conducting the corresponding actions. Most examples are from M_P.
Algorithm 3: Search(). Input: W, P, and s in the framework.


Discussion on the Priority
Another problem is deciding which rule takes precedence. If we make bad choices, the number of nodes in P will increase, which may lead to a worse implementation. Our strategy is a greedy one that chooses the best candidate in the current state rather than the globally optimal choice. This method helps us return a feasible solution in a reasonable time. Thus, we need to investigate the priority between different rules. For this problem, we conduct a series of experiments: we modify the order of different rules, give them a hierarchical priority, and compare the results. Meanwhile, we give the theoretical costs of each rule in Table 4. Ideally, the output nodes could be generated without any cost; however, this cannot happen, as each non-unit node has two predecessors and costs one XOR gate.
Rule 1 is matched first because it removes one node from W and adds one node to P without additional XOR operations. Rule 2 needs one XOR gate. Besides, Rule 2 does not generate new predecessors, while Rule 3 generates a new predecessor. Thus, we prefer Rule 2 to Rule 3 if both of them can be matched.
In the worst case, i.e., Rule 5, a node is split into two new predecessor nodes. Rule 4 is preferred over Rule 5 because two nodes in W are split into only 3 new predecessors. Actually, Rule 4 can be regarded as the combination of Rule 3 and Rule 5; thus, in some cases, Rule 3 and Rule 4 can be regarded as the same. Therefore, the relation of the rules' priorities is Rule 1 > Rule 2 > Rule 3 ≥ Rule 4 > Rule 5.
Candidates with the same priority. In the running of our strategy, it is inevitable that several candidates with the same priority may be chosen. For the case of a tie, one possible solution is to record all the candidates and try them sequentially. However, this may lead to a large memory requirement; sometimes, we cannot even exhaustively search all possible candidates in a reasonable time. We use an alternative method: a random selection that randomly picks a candidate to speed up the search process.

Comparison of Backward and Forward Search for Low-latency Implementation
In this subsection, we explain the advantage of the backward search for low-latency implementation over the forward one. The advantage is that the backward framework ensures that each node reaches the minimum depth, which holds for all matrices, whereas the forward algorithms cannot (see Section 1). This feature affects whether a node can be used to generate new nodes. Thus, for some matrices, the framework can cover more implementations than previous algorithms with minimum depth in a limited time. We give a further explanation as follows.
First, we review the very effective forward search algorithm proposed by Boyar et al. in [BP10]. Given a set of unit nodes and target nodes as a binary matrix, for generating every row y_i (0 ≤ i ≤ m − 1) of the matrix, the BP algorithm places all unit nodes {x_0, x_1, ..., x_{n−1}} into the base set B and initializes an m-integer vector Dist, which keeps track of the distances of each target node from B. The vector Dist is [δ(B, y_0), δ(B, y_1), ..., δ(B, y_{m−1})], where δ(B, y_i) indicates the minimum number of XOR gates required to obtain y_i from B. Then, the algorithm repeatedly picks two nodes from B according to some rules, adds them together as a new node, and puts this new node into B. The rules are described as follows:

1. Perform XOR on every unique pair of nodes in B to generate a candidate node. The candidate is used to re-evaluate the Dist vector and calculate the new distances.

2. Select the candidate with the smallest total distance and put the corresponding node into B. Meanwhile, if a pair can generate a target node, choose it first.
3. In the case of a tie, use the Euclidean norm of Dist to determine which candidate is better.
Intuitively, after each iteration, the base set becomes closer to the target nodes, as reflected by the reduction of Dist. The algorithm stops executing if and only if Dist is the zero vector, that is, when the base set can compute all the target nodes. The LSL algorithm [LSL+19] enhances the BP algorithm with circuit-depth awareness: it maintains a set that keeps track of the circuit depth of the nodes in B, and at each iteration it only picks two elements from B that generate a new node which never exceeds the depth bound.
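On tiny instances, the flavor of this forward search can be sketched as follows. Nodes are bitmasks of input indices; this is a heavily simplified illustration of the BP style, not the authors' actual tool. In particular, the exact distance δ is computed here by brute-force breadth-first search:

```python
from itertools import combinations

def delta(base, y):
    """Fewest XORs of elements of `base` needed to reach bitmask y
    (None if unreachable). Brute-force BFS, viable only for tiny n."""
    layer, seen, d = set(base), set(base), 0
    while y not in layer:
        layer = {a ^ b for a in layer for b in base} - seen
        if not layer:
            return None
        seen |= layer
        d += 1
    return d

def bp_step(base, targets):
    """One greedy BP-style iteration: add the XOR of the pair of base
    elements that minimizes the summed distances to the targets."""
    best = None
    for a, b in combinations(sorted(base), 2):
        cand = base | {a ^ b}
        total = sum(delta(cand, y) for y in targets)
        if best is None or total < best[0]:
            best = (total, a ^ b)
    return base | {best[1]}
```

Starting from the unit nodes {1, 2, 4} (bitmasks of x_0, x_1, x_2) with targets 3 = x_0 ⊕ x_1 and 7 = x_0 ⊕ x_1 ⊕ x_2, one step adds the node 3 and drops the total distance from 3 to 1.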
However, there are some noteworthy issues in the above depth-aware algorithms. In the LSL algorithm, we always choose the pair of nodes in B that reduces the distance the most. In other words, if a pair cannot reduce the distance, or only reduces it slightly, it is considered a bad choice and may never be chosen. We surprisingly find that the known forward search approaches for low latency always omit some good implementations, while the backward approach can cover them. Of course, our approach also abandons choices that appear to be bad in the current state; it might miss good implementations since it cannot perform an exhaustive search. Nevertheless, from the experimental results, the new strategy enables us to find better implementations in multiple cases.
We take a more comprehensive example to illustrate this. For the matrix M_C used in Camellia [AIK+00], we provide two implementations. The first is generated by the LSL algorithm and requires 20 XOR gates; Table 5 shows the depth, new nodes, and distances. It can be seen that the distance reduces faster in the first half of the execution, which is related to the heuristic rules. Next, we use our backward framework to generate the implementation shown in Table 6, which needs only 19 XOR gates thanks to our new strategy.
Only y_0, y_1, and y_7 contain the combination x_0 ⊕ x_3. Meanwhile, x_3 ⊕ x_6 is contained in y_0, y_1, y_7, and y_3. In this case, the priority of x_0 ⊕ x_3 is always lower than that of x_3 ⊕ x_6. We now explain why x_0 ⊕ x_3 will never be chosen for M_C by the LSL algorithm. After choosing x_3 ⊕ x_6, the total distance is 32 and the distances of y_0, y_1, y_3, y_7 are reduced. When performing the next XOR, x_0 ⊕ x_3 does not reduce any distance: x_3 ⊕ x_6 limits the effect of x_0 ⊕ x_3. Actually, no heuristics based on the BP algorithm will make this choice even though it leads to fewer XOR gates.
Our framework can avoid such an issue. For a target node y, we calculate its minimum depth d_y and choose predecessors from L_{d_y−1}. Thus, the choice x_0 ⊕ x_3 can be used.
Moreover, our algorithm has another good feature that distinguishes it from LSL. A reusable node is one that we do not need to generate again. This kind of node can be found easily by the backward framework based on our splitting Rule 2 and Rule 3: the hope is that fewer nodes are needed to split the nodes in the working set W. The LSL algorithm does not pay attention to this feature; it tries to find the nodes which bring the base closer to the target. This is not to say that the LSL algorithm never reuses nodes, but the difference may explain why our algorithm performs well for some matrices and not for others. For example, in M_C, t_8 = [0, 1, 0, 0, 1, 0, 1, 1] is well suited to the implementation. In Table 6, t_8 is used four times; therefore, we use only four nodes in L_2 to split all the target nodes, whereas there are six nodes in L_2 in Table 5.

Hardware Implementation
Our algorithm aims at finding implementations optimized in circuit size, power consumption, and latency. In this section, we synthesize existing implementations and show their performance in hardware. We first provide the implementations of AES MixColumns; through three metrics (area, power, and latency), we can discuss which implementation is better.
The AES results are synthesized with two different ASIC libraries, TSMC 90 nm and NanGate 45 nm (Table 9). Logic synthesis is performed with Synopsys Design Compiler version D-2010.03-SP1 (using the compile_ultra -no_autoungroup command), and simulation is done in Mentor Graphics ModelSim SE v10.2c. The table shows that our AES Mixcolumns implementation outperforms the results of the LSL and BFI algorithms in hardware, and has lower power and latency than the results from [XZL+20] and [LXZZ21], which is crucial for devices with limited resources. Besides, we synthesize the result for R using the NanGate 45 nm library; the results, also listed in Table 9, are better than those from [LSL+19] and [BFI21]: the best matrix found by [LSL+19], with 88 XOR gates and depth 3, needs 176 GE, while the R we find needs 172 GE.

Conclusion
In this paper, we investigate a new framework of heuristic search for the implementation of a given linear layer. Our approach iteratively splits the output bits until all the input bits appear, which is well suited to the low-latency criteria. Our new framework contributes:
• an implementation of AES Mixcolumns with 103 XOR gates and depth 3, which is among the best hardware implementations of the AES linear layer with minimum depth;
• better implementations for 54.3% of the matrices proposed in [LSL+19], among which we find an involutory MDS matrix with fewer XOR gates (i.e., 86, saving 2 from the state-of-the-art result) under the minimum-depth constraint.
Although the backward framework handles the low-latency problem naturally, it remains important to further reduce the number of XOR gates without any constraints (i.e., no depth limitation). In addition, our research provides a new tool for the construction of lightweight MDS matrices: there should exist matrices more compatible with our algorithm, and thus with better minimum-depth implementations, which we leave as promising future work.

Proposition 1. For any y ∈ L_d (d ≥ 1), there exist y_1 and y_2 with y_1 ∈ L_{d−1} or y_2 ∈ L_{d−1} such that y_1 ⊕ y_2 = y.

Proof. Based on Equation (1) and the definition of L_d, we have ⌈log_2(ω_1(y))⌉ = d. Therefore, 2^{d−1} < ω_1(y) ≤ 2^d. For d = 1, Proposition 1 holds obviously. We consider the case d ≥ 2. If the minimum depth d_{y_i} (i ∈ {1, 2}) of y_1 or y_2 satisfies d_{y_i} ≥ d, then y ∈ L_d does not hold. To ensure that each node reaches its minimum depth, we only consider the case d_{y_1}, d_{y_2} < d. Assume that y_1 ∉ L_{d−1} and y_2 ∉ L_{d−1}. Without loss of generality, let y_1, y_2 ∈ L_{d'} with d' ≤ d − 2; then ω_1(y) ≤ ω_1(y_1) + ω_1(y_2) ≤ 2^{d−2} + 2^{d−2} = 2^{d−1}, contradicting ω_1(y) > 2^{d−1}.

Algorithm 2 (backward search framework). Input: an m × n binary matrix A. Output: The
3. If W = ∅, treat P as the target set to split, set s ← s − 1, and go to Step 2.
4. Use heuristics to split. If it succeeds, go to Step 3.
5. Use the default method to split. Go to Step 3.
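Proposition 1 can be sanity-checked empirically: splitting the support of y evenly gives two halves of weight at most 2^{d−1} (since ω_1(y) ≤ 2^d), and because ω_1(y) > 2^{d−1}, at least one half has weight above 2^{d−2} and thus lies exactly in L_{d−1}. A quick check over random nodes (function names are ours):

```python
import random
from math import ceil, log2

def depth(y: int) -> int:
    """Minimum depth of node y: ceil(log2 of its Hamming weight)."""
    w = bin(y).count("1")
    return ceil(log2(w)) if w > 1 else 0

def proposition1_holds(y: int) -> bool:
    """Split the support of y into two halves of size at most 2**(d-1);
    verify that at least one half lands exactly in L_{d-1}."""
    d = depth(y)
    bits = [i for i in range(y.bit_length()) if (y >> i) & 1]
    y1 = sum(1 << b for b in bits[: (len(bits) + 1) // 2])
    y2 = y ^ y1
    assert y1 ^ y2 == y
    return depth(y1) == d - 1 or depth(y2) == d - 1

random.seed(1)
for _ in range(500):
    y = random.getrandbits(16)
    if bin(y).count("1") >= 2:       # nodes with d >= 1
        assert proposition1_holds(y)
print("Proposition 1 verified on random 16-bit nodes")
```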

Figure 2: The different splitting rules.

Boyar et al. proposed a new algorithm, the BP algorithm, in [BP10]. It maintains a set Base holding all the values generated so far; it repeatedly picks two values from Base according to certain rules, XORs them into a new value, and inserts this new value into Base, optimizing the route search by dedicatedly choosing which values to combine. The BP algorithm has a series of variants (see [VSP18, RTA18, TP19]). Later, Xiang et al. decomposed the matrix A into a product of several elementary matrices, based on which the search can be significantly improved [XZL+20]. However, these algorithms cannot solve another problem: how to take circuit depth into account and optimize the matrices towards the minimum depth. Li et al. provided a solution by adding a depth constraint to the BP algorithm, called the LSL algorithm [LSL+19], which was further improved by Banik et al. as the BFI algorithm [BFI21].
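For contrast with the backward framework, the forward BP-style loop described above can be sketched as follows. This is a greedy toy version with a brute-force distance measure, not a faithful reimplementation of [BP10]; names and the 3-bit example are ours.

```python
from itertools import combinations

def dist(t: int, base: list) -> int:
    """XOR gates still needed to reach t from base (brute force)."""
    if t in base:
        return 0
    for k in range(2, len(base) + 1):
        for sub in combinations(base, k):
            acc = 0
            for b in sub:
                acc ^= b
            if acc == t:
                return k - 1
    return len(base)  # unreachable; cannot happen if base spans the inputs

def bp_search(targets, n_inputs):
    """Greedy forward search: grow Base from the unit vectors, always adding
    the XOR of the pair that minimizes the total remaining distance."""
    base = [1 << i for i in range(n_inputs)]
    gates = []
    while any(dist(t, base) > 0 for t in targets):
        best = None
        for a, b in combinations(base, 2):
            v = a ^ b
            if v in base:
                continue
            score = sum(dist(t, base + [v]) for t in targets)
            if best is None or score < best[0]:
                best = (score, a, b, v)
        _, a, b, v = best
        base.append(v)
        gates.append((a, b, v))
    return gates

# 3-bit toy linear layer: y0 = x0^x1, y1 = x1^x2, y2 = x0^x1^x2
gates = bp_search([0b011, 0b110, 0b111], 3)
print(len(gates))  # 3 XOR gates suffice
```

The depth-constrained LSL and BFI variants restrict which pairs may be combined at each level; the backward framework of this paper instead starts from the outputs and splits downwards.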

Table 1: The implementations of M_P (left: forward algorithm; right: backward algorithm).

Table 2: The number/depth of implementation cost of matrices from [KLSW17]. Except for the last row, every matrix is consistent with the choice in [KLSW17]. a The results take the number of XOR gates into account with respect to the minimum depth. b The results only take the number of XOR gates into account. c, d We show the lowest one from all the results.

Table 3: The number/depth of implementation cost of matrices with depth 3 in [DL18].

Table 4: The costs of predecessors (columns: Rule, Output nodes, Gates).

Table 5: The implementation of M_C using the LSL algorithm.

Table 6: The implementation of M_C using our framework.

Table 8: The optimized results of matrices with depth limitation from [LSL+19] (columns: HW^a, Size, Depth, Number of matrices, Optimizations^b, Maximum^c). a The Hamming weight of the matrices. b The number of matrices that have fewer XOR gates than the results from [LSL+19]. c The maximum number of reduced XOR gates.

Table 9: Synthesized results using two different ASIC libraries.