More Inputs Makes Difference: Implementations of Linear Layers Using Gates with More Than Two Inputs

Lightweight cryptography extends cryptographic applications to devices with limited resources. Low-area implementations of linear layers usually play an essential role in lightweight cryptography. Previous works have provided plenty of methods to generate low-area implementations using 2-input xor gates for various linear layers. However, it is still challenging to search for smaller implementations using xor gates with more than two inputs. This paper, inspired by Banik et al., proposes a novel approach to construct a large number of lower-area implementations with (n + 1)-input gates based on given implementations with n-input gates. Based on this approach, we present the corresponding search algorithms for n = 2 and n = 3, which means that we can efficiently convert an implementation with 2-input xor gates or 3-input xor gates into lower-area implementations with 3-input xor gates or 4-input xor gates, respectively. With these search algorithms, we improve the area of the previous implementations of linear layers for many block ciphers. For example, we achieve a better implementation with 4-input xor gates for AES MixColumns, which only requires 243 GE in the STM 130 nm library, while the previous public result is 258.9 GE. Besides, we obtain better implementations for all 5500 lightweight matrices proposed by Li et al. at FSE 2019, and their area is decreased by about 21% on average.


Introduction
In recent years, lightweight cryptography has been a significant trend in many fields, such as the Internet of Things (IoTs) and Radio-Frequency IDentification tags (RFID). Lightweight cryptography means a low-cost implementation, where the cost covers the circuit area, latency, power consumption, and so on. It extends cryptography applications to devices with limited resources. Security has been a core area of concern for researchers, as various limitations have led to new security threats.
Generally speaking, research on lightweight cryptography falls in two directions. The first direction focuses on designing new ciphers that are intended to have efficient implementations, such as PRESENT [BKL + 07], LED [GPP11], MIDORI [BBI + 15], and SAND [CFS + 22]. The second direction tries to optimize the implementations of given ciphers, which has also drawn much attention. For example, the Advanced Encryption Standard (AES) [DR20] has been widely used in practice. Its round function has been frequently used in designing other cryptographic primitives (e.g., AEGIS [WP13] and Rocca [SLN + 21]). Therefore, an efficient implementation will directly reduce the cost of deploying AES and of the primitives using its round function.
In practice, we can build circuit implementations of linear layers using a variety of heuristics (e.g., [Paa97, BP10, BMP13, KLSW17, LSL + 19, TP20, XZL + 20, BFI21, LXZZ21, LWF + 22] for an incomplete list) originating in Paar's work [Paa97]. Most of those previous works only consider the circuits using 2-input gates. However, most standard cell libraries of CMOS logic processes have dedicated gates that support 3-input xor gates or 4-input xor gates ([BPMC18, RMTA20, BFI21, BDK + 21]). Meanwhile, it should be noted that using gates with more than two inputs may give rise to some more efficient circuits. We first consider the most popular criterion for lightweight implementations, gate equivalents (GE). In this paper, we use two ASIC libraries (see Table 1), adopted from [BDK + 21]. We give an example with a matrix $M_1 = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}$, the inputs ⃗ x = (x 0 , x 1 , x 2 , x 3 ) T , and the outputs ⃗ y = (y 0 , y 1 ) T . The optimal solution with the minimum number of XOR operations to compute ⃗ y = M 1 ⃗ x can be performed by the procedure described by Figure 1-left, requiring three gates. In STM 130 nm library, the circuit needs 9.99 GE and cannot be improved by 2-input xor gates. However, if we construct the circuit with 3-input gates (see Figure 1-right), it only needs 9.32 GE.
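The example can be checked mechanically. The following Python sketch evaluates both circuits for M 1 on all 16 inputs and compares their areas. The per-gate costs are not quoted from Table 1; they are inferred here from the totals 9.99 GE (three 2-input gates) and 9.32 GE (two 3-input gates) stated above, and all function names are illustrative.

```python
from itertools import product

# Per-gate areas inferred from the totals in the text (assumption):
# 3 XOR2 gates = 9.99 GE and 2 XOR3 gates = 9.32 GE.
GE_XOR2 = 9.99 / 3
GE_XOR3 = 9.32 / 2

M1 = [[1, 1, 1, 0],
      [0, 1, 1, 1]]

def reference(x):
    """Compute y = M1 * x over F_2 directly from the matrix."""
    return tuple(sum(a * b for a, b in zip(row, x)) % 2 for row in M1)

def circuit_xor2(x):
    """Three 2-input gates: the shared term x1 ^ x2 is computed once."""
    t = x[1] ^ x[2]
    return (x[0] ^ t, t ^ x[3])

def circuit_xor3(x):
    """Two 3-input gates, one per output."""
    return (x[0] ^ x[1] ^ x[2], x[1] ^ x[2] ^ x[3])

all_inputs = list(product([0, 1], repeat=4))
both_correct = all(
    reference(x) == circuit_xor2(x) == circuit_xor3(x) for x in all_inputs
)
area_xor2 = 3 * GE_XOR2
area_xor3 = 2 * GE_XOR3
print(both_correct, round(area_xor2, 2), round(area_xor3, 2))
```

The 3-input version uses one gate fewer and no longer shares the intermediate value x 1 ⊕ x 2 , yet it is smaller because two 3-input gates cost less than three 2-input gates under these costs.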
Despite the potential advantage of circuits using gates with more than two inputs, it is still challenging to design an approach to find suitable circuits. Baksi et al. [BDK + 21] directly searched for the circuits by adapting the BP algorithm proposed in [BP10] to 3-input xor gates. It is the first heuristic method that takes them into account, and it performs well in the STM 130 nm library. However, this approach may significantly increase the search space due to the larger number of gate inputs, leading to a long running time until a proper circuit is found. In [BFI21], Banik et al. proposed another strategy that attempts to convert a circuit with 2-input gates into another one with 3-input gates. This strategy can benefit from the known heuristic algorithms that produce circuits with 2-input gates. We call it the BFI algorithm and give a brief introduction. The algorithm starts with the BP algorithm to generate many circuits with 2-input gates, and then transforms them into ones with 3-input gates. This strategy reduces the search space and obtains many circuits for linear layers with smaller area than before. Nevertheless, we note that, firstly, it is still unknown whether it can be generalized to gates with more than 3 inputs, and secondly, the one-to-one transformation (one circuit with 2-input gates into one circuit with 3-input gates) may reduce the search space too much. It is somewhat wasteful that a suitable circuit is only transformed into a single circuit with 3-input gates. Addressing these two points is therefore a worthwhile research direction.

Our Contributions
In this paper, we follow the line of work of Banik et al. and propose two more general and efficient algorithms to reduce the circuit areas of linear layers. Then, we instantiate them to optimize the existing matrices with 3/4-input xor gates.
New algorithms to optimize the circuit areas with 3/4-input xor gates. This paper proposes two novel algorithms that can construct a quantity of lower area implementations with (n + 1)-input gates based on the given implementations with n-input gates. The first method is the transform algorithm. It can convert a circuit using gates with no more than n inputs into a lower area circuit using gates with no more than n + 1 inputs. As a transform framework, we can utilize it to convert circuits using gates with more inputs. The second method is the graph extending algorithm. It can produce massive equivalent circuits with low area and depth for a given circuit. The algorithm utilizes more information hidden in the given circuit and can be applied after any existing algorithms to optimize the results further.
Based on the novel algorithms, we instantiate the corresponding search algorithms EGT2 for n = 2 and EGT3 for n = 3, which means that we can efficiently transform an implementation with 2-input xor gates and 3-input xor gates to lower-area implementations with 3-input xor gates and 4-input xor gates, respectively.

Application to many linear layers of block ciphers.
We apply EGT2 and EGT3 to several linear layers from the literature, including matrices already used in different ciphers [DR20, CMR05, JNP15, Ava17, BBI + 15, BCG + 12, ADK + 14, Ava17, BJK + 16, AIK + 00]. With the help of these search algorithms, we improve the previous implementations of linear layers for many block ciphers according to the area. The results are listed in Table 2 and Table 3, where XZLBZ is a heuristic algorithm proposed in [XZL + 20]. For the thirteen linear layers in the tables below, we optimize eight matrices in circuit areas and obtain four with the same circuit areas. Notably, for AES MixColumns, we achieve a circuit with 243 GE, better than the previous best result (258.9 GE) reported in [BDK + 21].
We also apply our algorithms to 5500 lightweight matrices proposed by Li et al. in [LSL + 19] and obtain better circuits for all the matrices than all known state-of-the-art results. Figure 2 shows the comparison between the GE concerning different algorithms. On average, for each matrix, the circuit area is decreased by about 21%.
Additionally, we synthesize different implementations of AES MixColumns in hardware (see Table 4). The results show that our implementation achieves a better area and reduces power at the cost of slight and reasonable growth of latency.

Organization
In Section 2, we give some basic notations and definitions. Then, we propose the transform algorithm in Section 3. In Section 4, we propose the graph extending algorithm and give some examples. Next, in Section 5, we combine two algorithms and instantiate EGT2 and EGT3 algorithms. Finally, we conclude and propose future research directions in Section 6.
The results consist of the circuit area (GE) and the gates. We use "(m)" to represent m 2-input xor gates, use "(m, p)" to represent m 2-input xor gates and p 3-input xor gates, and use "(m, p, q)" to represent m 2-input xor gates and p 3-input xor gates and q 4-input xor gates. a Using 2-input xor gates. b Using 2/3-input xor gates. c Using 2/3/4-input xor gates.

Notations
Let F 2 be the finite field with two elements 0 and 1 and F n 2 be the vector space of all n-dimensional vectors over F 2 . M m×n denotes an m × n matrix over F 2 . wt(M ) denotes the Hamming weight of a matrix M over M m×n , which counts the number of 1's contained in M .

g ϵ -XOR Metric
For any linear layer of a cipher associated to an m × n binary matrix M , given inputs ⃗ x = (x 0 , x 1 , ..., x n−1 ) T , the outputs ⃗ y = (y 0 , y 1 , ..., y m−1 ) T of the linear layer can be computed by ⃗ y = M⃗ x, and y i (0 ≤ i ≤ m − 1) can be computed by $y_i = \bigoplus_{j=0}^{n-1} a_{ij} x_j$, where each coefficient a ij is the entry of M at the i-th row and j-th column. We first recall some metrics to compute M⃗ x with fewer 2-input xor gates for a matrix M over M m×n .
Definition 2 (s-XOR [JPST17]). It is always possible to perform a sequence of XOR gates x i = x i ⊕ x j with 0 ≤ i, j ≤ n − 1, such that the inputs are updated to the outputs. The s-XOR count of M is defined as the minimal number of updating operations.
Definition 3 (g-XOR [XZL + 20]). The circuit of M can be viewed as a sequence of XOR gates x i = x j1 ⊕ x j2 where 0 ≤ j1 , j2 < i. The g-XOR count is defined as the minimal number of operations x i = x j1 ⊕ x j2 that compute the m outputs completely.
Since the d-XOR metric is intuitive and easy to compute (i.e., the number of 1's in M ), it has been adopted to design new lightweight diffusion layers. The s-XOR and g-XOR metrics are used in evaluating matrices for further optimization. The difference is that the g-XOR can generate new values while the s-XOR continuously renews original values. For example, in the procedure of computing x 0 ⊕ x 1 , the s-XOR performs x 0 = x 0 ⊕ x 1 or x 1 = x 0 ⊕ x 1 , whereas the g-XOR stores the result in a new value.
Next, we use a new metric for optimization. We define the ϵ-operation (ϵ ∈ N) as an operation containing ϵ consecutive 2-input xor gates. 1-operation represents a 2-input xor gate, 2-operation represents a 3-input xor gate, 3-operation represents a 4-input xor gate, and so on. Different operations may have different costs (i.e., GE in hardware). λ ϵ is defined as the cost of the ϵ-operation. Then, we expand the definition of g-XOR. The definition is similar to the one in [BDK + 21].
Definition 4 (g ϵ -XOR). The g ϵ -XOR count of M is defined as $\min \sum_{i=1}^{\epsilon} \lambda_i e_i$, where e i counts the number of i-operations.
If λ 1 = 1, g 1 -XOR is the g-XOR. The g ϵ -XOR metric can use 1-operation, 2-operation, ..., ϵ-operation to optimize matrices. A circuit built from 1-operations, 2-operations, ..., ϵ-operations is called a circuit with the g ϵ -XOR metric. For convenience, we use XOR2, XOR3, and XOR4 to represent the 1-operation, 2-operation, and 3-operation, respectively. Another metric is the circuit depth. The critical path of a circuit is defined as the path between an input and an output involving the maximum number of gates. The circuit depth is the number of gates on the critical path.
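To make the two metrics concrete, the following Python sketch computes the weighted gate count λ 1 e 1 + λ 2 e 2 + ... and the depth of a circuit. The dictionary representation (each gate mapped to the tuple of its inputs) and the cost values are assumptions made only for this illustration; the two circuits are the ones for M 1 from the introduction.

```python
from collections import Counter

def gate_counts(circuit):
    """e[i] = number of i-operations, i.e. gates with i + 1 inputs."""
    return Counter(len(preds) - 1 for preds in circuit.values())

def cost(circuit, lam):
    """Weighted gate count: sum of lam[i] * e[i]."""
    return sum(lam[i] * count for i, count in gate_counts(circuit).items())

def depth(circuit):
    """Number of gates on the longest path from an input to any node."""
    memo = {}
    def node_depth(node):
        if node not in circuit:          # primary input
            return 0
        if node not in memo:
            memo[node] = 1 + max(node_depth(p) for p in circuit[node])
        return memo[node]
    return max(node_depth(node) for node in circuit)

lam = {1: 1.0, 2: 1.4}                   # hypothetical costs of XOR2 and XOR3

with_xor2 = {"t": ("x1", "x2"), "y0": ("x0", "t"), "y1": ("t", "x3")}
with_xor3 = {"y0": ("x0", "x1", "x2"), "y1": ("x1", "x2", "x3")}

print(cost(with_xor2, lam), depth(with_xor2))
print(cost(with_xor3, lam), depth(with_xor3))
```

With these hypothetical costs, the XOR3 circuit is better in both metrics: it has a lower weighted cost and depth 1 instead of 2.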

Directed Acyclic Graph
A graph is formed by nodes and by edges connecting pairs of nodes. In the case of a directed graph, each edge has an orientation from one node to another. A Directed Acyclic Graph (DAG) is a directed graph that has no cycles. The in-degree of a node is defined as the number of edges that end at this node, and the out-degree of a node is defined as the number of edges whose origin is the node. We use in(u) and out(u) to represent the in-degree and out-degree of u, respectively. Moreover, the input set I(u) and the output set O(u) are used to save the nodes relevant to u. If there exists an edge from u to v, we will put v into O(u) and put u into I(v). We have in(u) = |I(u)| and out(u) = |O(u)|.
Definition 5 (Reachability Relation). The reachability relation can be formalized as a partial order ⪯ on the nodes of the DAG. In this partial order, two nodes u and v are ordered as u ⪯ v exactly when a directed path exists from u to v in the DAG.
Definition 6 (Reachability Set). Given a directed acyclic graph G, the reachability set R u of the node u (u ∈ G) is defined as the set in which each node v satisfies v ∈ G and u ⪯ v. The reachability set R G of the graph G is defined as the set containing all the R u .
The definition of the reachability set shows that if one node v ∈ R u , the path from u to v must exist. A path consists of multiple consecutive edges. Then, we introduce the topological ordering, which is used to sort the nodes.
Definition 7 (Topological Ordering). The topological ordering T G of a directed acyclic graph G is an ordering of its nodes into a sequence. For every edge, the start node of the edge occurs earlier in the sequence than the ending node.

Transform Algorithm
In this section, we introduce the generalized algorithm for converting a circuit with gates up to n inputs into a circuit with gates up to n + 1 inputs. Reducing the number of XOR2 gates is helpful for linear layers. The circuit area can be reduced by finding the minimum number of XOR2 gates. However, there is a gap between the above circuit and the smallest circuit area. To adapt to different situations, we limit the types of gates. If we use g ϵ -XOR metric, only i-input (i ≤ ϵ + 1) xor gates can be used. For example, the g 2 -XOR metric means that only 2/3-input xor gates can be used. Transforming circuits with gates from n inputs into n + 1 inputs means that the metric used in the circuits is changed from g n−1 -XOR metric to g n -XOR metric. The cost of the circuit is changed from $\sum_{i=1}^{n-1} \lambda_i e_i$ to $\sum_{i=1}^{n} \lambda_i e'_i$, where e i and e ′ i are the number of i-operations before and after the transformation. Notably, for the AES MixColumns in the library named STM 130 nm, we can decrease the circuit area to 255 GE with the g 2 -XOR metric (see Table 5) and to 243 GE with the g 3 -XOR metric (see Table 6).
Searching for circuits means finding a DAG from all the unit nodes to all the target nodes. Besides, the depth of a graph is defined as the number of edges involved in its critical path. According to Liu et al., every unit node has the in-degree 0, and every non-unit node has the in-degree n (n ≥ 2) [LWF + 22]. The following property shows the features of nodes in DAG.
Property 1. The circuit with g ϵ -XOR metric can be converted into a DAG, in which every unit node has in-degree 0 and every non-unit node has in-degree n (2 ≤ n ≤ ϵ + 1) and represents an n-input xor gate (i.e., an (n − 1)-operation).

Transforming DAG
Given the available operations, 1-operation, 2-operation, . . ., ϵ-operation, and their cost λ 1 , λ 2 , . . ., λ ϵ , we can obtain the cost of a DAG: $\sum_{i=1}^{\epsilon} \lambda_i e_i$, where e i (i ≤ ϵ) counts the number of nodes with in-degree i + 1. The core idea of the transform algorithm is to reduce the cost by removing nodes from the DAG. Our first concern is what happens when we remove nodes from a DAG. Removing u means that we delete the edges from u to O(u) and from I(u) to u and add the edges from every node in I(u) to every node in O(u). Meanwhile, the in-degree of every node in O(u) increases by in(u) − 1.
Next, we discuss which nodes can be removed to reduce the cost of the DAG. Suppose that we have a directed acyclic graph G with g ϵ -XOR metric. We define S u as the reduced cost by removing u from G. S u is computed from the change of the in-degrees of the nodes in O(u) and depends on the hardware library. Thus, S u > 0 means that we can benefit from removing u.
A general approach is to compute the reduced cost S u of every node in G and choose the node with the maximum reduced cost. The transformation stops when no S u is greater than 0. However, given two nodes u and v, it is difficult to determine which is better. Different gates and libraries lead to different costs, and different metrics also influence the comparison. For the case with the fixed library and metric, we can usually find a standard to compare them. In Section 5, we give the comparisons with g 2 -XOR and g 3 -XOR metrics.
Notably, not all the nodes can be removed. We can only use the i-operation (i ≤ ϵ), which means that the in-degree of a node u is not greater than ϵ + 1. Thus, we have the following proposition.
Proposition 1. Suppose that the circuit is with the g ϵ -XOR metric and in(u) = j. Only when the in-degree k of every node in O(u) is not greater than ϵ + 2 − j, can we remove u.
Proof. Removing u means that we delete the edges from u to O(u) and from I(u) to u and add the edges from every node in I(u) to every node in O(u). After removing u, the in-degree of every node in O(u) will increase by j − 1. In the g ϵ -XOR metric, the in-degree of every node is not greater than ϵ + 1. Thus, we have $k + j - 1 \le \epsilon + 1$, i.e., $k \le \epsilon + 2 - j$.
Therefore, we can propose the generalized transform algorithm.
1. Suppose that we need to transform the graph G from g ϵ−1 -XOR metric into g ϵ -XOR metric. Initialize the set A G , in which every node u meets that the reduced cost S u > 0.
2. Compute the reduced cost S u for every node u ∈ A G , choose the node v with the maximum value S v , and remove v from G. As described above, the transformation stops when no node with S u > 0 remains.
Because many operations are related to the specific libraries and gates, we instantiate two algorithms, EGT2 and EGT3, based on the following graph extending algorithm in two libraries, STM 90 nm and STM 130 nm. These two instances show that our algorithms perform well.
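The removal operation and the condition of Proposition 1 can be sketched in Python as follows. The circuit is represented as a dictionary from each gate to the tuple of its inputs, which is an assumption made for this illustration, as are the function names. The sketch further assumes that u shares no input with its successors (shared inputs would cancel under XOR) and that u is not a target node.

```python
def can_remove(circuit, u, eps):
    """Proposition 1: with in(u) = j, every successor of u must have
    in-degree k <= eps + 2 - j so that it stays <= eps + 1 afterwards."""
    j = len(circuit[u])
    successors = [v for v, preds in circuit.items() if u in preds]
    return all(len(circuit[v]) <= eps + 2 - j for v in successors)

def remove_node(circuit, u):
    """Return a new circuit in which u is merged into each of its successors.

    Every edge u -> v is replaced by edges from all inputs of u to v.
    Assumes that u and v have no common input (no cancellation).
    """
    merged = {}
    for v, preds in circuit.items():
        if v == u:
            continue
        if u in preds:
            preds = tuple(p for p in preds if p != u) + circuit[u]
        merged[v] = preds
    return merged

# u feeds two XOR2 gates; under the g_2-XOR metric (eps = 2) it can be removed.
circuit = {"u": ("a", "b"), "v1": ("u", "c"), "v2": ("u", "d")}
allowed_g2 = can_remove(circuit, "u", eps=2)
allowed_g1 = can_remove(circuit, "u", eps=1)
after = remove_node(circuit, "u")
print(allowed_g2, allowed_g1, after)
```

After the removal, both successors have in-degree 3, i.e., each has grown by in(u) − 1 = 1, which is exactly the quantity bounded in the proof.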

Graph Extending Algorithm
In this section, we propose the graph extending algorithm, which is a local optimization algorithm. There are many local techniques based on specific reduction rules (see [TP20,LXZZ21]). However, these rules cannot cover all the cases. We find some cases in which no reduction rules are effective. Therefore, we first show an example and propose our algorithm formally. The new algorithm converts a circuit into many circuits and fully utilizes the circuit's information. Note that new circuits may have fewer gates and a smaller depth.
We take the circuits with XOR2 as an example. The graph extending algorithm also holds for other operations (e.g., XOR3 and XOR4). For a matrix M over M m×n , a circuit of M can be seen as a sequence of l XOR2 gates t i = t j ⊕ t k where i = n, n + 1, ..., n + l − 1 and j, k < i. We use an operation t i,j,k instead of t i = t j ⊕ t k for convenience. We say that the implementation of t i is (t j , t k ) and t j and t k are the predecessor nodes of t i in the DAG. The circuit is represented as seq = t n,j0,k0 , t n+1,j1,k1 , ..., t n+l−1,j l−1 ,k l−1 . Next, we introduce the issue briefly.
The implementations of t 16 and t 17 are changed. We observe that t 13 is not used in any operations and t 13 is not the output value, which is redundant. Thus, we can remove t 13,9,10 , and the circuit only needs nine 2-input xor gates. The example shows that there is still room for further improvements, and it is not enough to consider whether a node can be removed by checking the out-degree of the node. We need to explore more features of circuits. Our graph extending algorithm can optimize a given circuit and generate many equivalent circuits. The algorithm can be used in any heuristics to optimize the generated circuits. The application to multi-input gates can be seen in the next section.

Single Graph and Extended Graph
The reachability set and the topological ordering have been introduced. Our graph extending algorithm uses them to optimize a given DAG. We give some necessary explanations.
The definition of the reachability set of u shows which nodes are generated by u. Note that if we have u ⪯ v and v ⪯ w, u ⪯ w holds. This property can help us to find the reachability set quickly. We initialize two temporary sets temp 1 and temp 2 . For each node u ∈ G, we execute the following steps.
3. For each node v ∈ temp 1 , we put v into R u , put O(v) into temp 2 , and remove v from temp 1 . When temp 1 = ϕ, we let temp 1 = temp 2 and temp 2 = ϕ. Then, go to Step 2.
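The layer-by-layer procedure with temp 1 and temp 2 can be sketched in Python as below. Since only the last step is spelled out here, the initialization (starting from O(u)) and the convention that u itself is left out of R u are assumptions of this sketch, and the name reachability_set is illustrative rather than the authors' GetReachabilitySet().

```python
def reachability_set(out_sets, u):
    """Collect every node reachable from u by following output edges.

    out_sets maps a node to the set O(node) of its direct successors.
    """
    reached = set()
    temp1 = set(out_sets.get(u, ()))      # assumed initialization: O(u)
    while temp1:
        temp2 = set()
        for v in temp1:
            if v not in reached:          # skip nodes that were handled before
                reached.add(v)
                temp2 |= set(out_sets.get(v, ()))
        temp1 = temp2                     # continue with the next layer
    return reached

# Output sets of the three-gate circuit t = x1 + x2, y0 = x0 + t, y1 = t + x3.
out_sets = {
    "x0": {"y0"}, "x1": {"t"}, "x2": {"t"}, "x3": {"y1"},
    "t": {"y0", "y1"},
}
r_x1 = reachability_set(out_sets, "x1")
r_x0 = reachability_set(out_sets, "x0")
print(sorted(r_x1), sorted(r_x0))
```

Skipping nodes that are already in the set is what exploits transitivity: a node reached along two different paths is expanded only once.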
The procedures will be performed iteratively for each node in G and obtain the reachability set R G . We use GetReachabilitySet() to calculate the reachability set of a graph. Another definition is the topological ordering. The ordering implies the reachability relation. If one node a occurs earlier than b in the topological ordering, we can infer that the relationship b ⪯ a does not hold. The problem of finding a topological ordering can be solved in linear time by Kahn's algorithm [Kah62]. The strategy is as follows:
1. Suppose that graph G contains n nodes. The topological ordering T is ϕ, and S contains the nodes with in-degree 0.
2. If S ≠ ϕ, go to Step 3. Otherwise, we stop the search procedures. If T contains n nodes, return T . If T contains m nodes (m < n), return error.
3. For node v ∈ S, we append v to T , remove v and the corresponding edges from G, and let the in-degree of nodes in O(v) decrease by 1. Next, we recheck all the nodes in G and put the nodes with in-degree 0 into S. Then, go to Step 2.
We use TopologicalOrdering() to represent the strategy. In Step 2, we have two possible outputs. If the algorithm returns T , we can get the topological ordering. However, if the algorithm returns error, there exist cycles in G. A cycle is defined as a path from one node back to itself. A node in a cycle cannot occur in T since its in-degree is always greater than 0. If a cycle exists in the graph, the graph is a wrong circuit for the linear layer: because each node has only one implementation in the graph, the nodes in the cycle cannot be calculated from the unit nodes.
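A compact version of Kahn's algorithm is sketched below in Python. It differs from the steps above in two implementation details that are choices of this sketch: it keeps a queue of ready nodes instead of rechecking all nodes, and it returns None in place of error when a cycle is found. The function name and the graph representation are illustrative.

```python
from collections import deque

def topological_ordering(nodes, out_sets):
    """Return a topological ordering of nodes, or None if a cycle exists."""
    in_degree = {v: 0 for v in nodes}
    for u in nodes:
        for v in out_sets.get(u, ()):
            in_degree[v] += 1
    ready = deque(v for v in nodes if in_degree[v] == 0)   # the set S
    order = []                                             # the ordering T
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in out_sets.get(u, ()):      # delete the edges leaving u
            in_degree[v] -= 1
            if in_degree[v] == 0:
                ready.append(v)
    # Nodes on a cycle never reach in-degree 0 and are missing from order.
    return order if len(order) == len(nodes) else None

nodes = ["x0", "x1", "x2", "x3", "t", "y0", "y1"]
out_sets = {"x0": ["y0"], "x1": ["t"], "x2": ["t"], "x3": ["y1"],
            "t": ["y0", "y1"]}
order = topological_ordering(nodes, out_sets)
cyclic = topological_ordering(["a", "b"], {"a": ["b"], "b": ["a"]})
print(order, cyclic)
```

Each node and each edge is processed once, which gives the linear running time mentioned above.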
We introduce two types of graphs used in the graph extending algorithm. The definitions of the single graph and extended graph are shown as follows.
Definition 8. The single graph is a directed graph in which each non-unit node has only one implementation. The extended graph is a directed graph in which each node can have more than one implementation.
In the single graph, each non-unit node has one implementation. We can add new implementations to the nodes in the single graph to generate another graph, the extended graph, in which the in-degree of every non-unit node can be more than 2.

Graph Extending Algorithm
In [LXZZ21], the authors check all the reduction rules and remove some nodes which are only used once. In our terminology, these are the nodes with out-degree 1. If the out-degree is greater than 1, their rules do not work. However, as is shown in Subsection 4.1, there still exist redundant nodes hiding in the graph. Our algorithm is performed as follows:
1. Generate the extended graph from a single graph.
2. Split the extended graph into many single graphs.
3. Remove redundant nodes from every single graph.
4. Delete wrong graphs.
Overall, the goal of our algorithm is to generate many equivalent graphs and optimize them. The circuit of a linear layer can be treated as a single graph. We generate the extended graph by adding different implementations. Then, the extended graph can be split into many single graphs. After removing redundant nodes, we can choose the best one from the proper graphs. We take the XOR2 as an example. Other gates can also be used. The following section will use XOR3 to generate the extended graph. We explain every step in detail, introduce complete procedures, and provide an example of the matrix M P used in Subsection 4.1.
Generate the extended graph with XOR2. From Property 1, we know that the unit node has the in-degree 0, and the non-unit node has the in-degree 2 and represents one XOR2 gate. However, in previous work, each node had only one implementation, and other implementations were ignored. Suppose that we have unit nodes {x 0 , x 1 , x 2 } and the sequence, t 2 can also be generated by t 2 = t 1 ⊕ x 0 . If out(t 0 ) is 0, we can remove t 0 from the circuit. Note that out(t 0 ) and out(t 1 ) are changed in the above steps. Thus, the transformation helps consider other implementations in the single graph.
Algorithm 1 gives a method to generate the extended graph. First, for each non-unit node u, we try to find different implementations of u, where some nodes must not be utilized. For example, we will not use a node to generate its predecessor nodes, i.e., if u = a ⊕ b, we do not use u to generate a or b, since this may lead to cycles in the graph. The nodes that can generate u are called the available nodes. We use the available set A u to save the available nodes of u. For a node u ′ (u ′ ≠ u), either u ′ ∈ R u or u ′ ∈ A u holds. The parameter op in the algorithm decides which operation can be used. This paper only uses XOR2 and XOR3. op = 2 means using XOR2 to generate the extended graph. op = 3 means that we use XOR3. More operations can be used in the algorithm, and we omit them for the sake of brevity. In this section, we only consider the XOR2 gates. This means that we will try all the combinations of two nodes in A u to generate u. If we find a new implementation u = p ⊕ q, we will add edges from p to u and from q to u. Thus, the in-degree of every non-unit node is a multiple of 2, which guarantees that the extended graph can be split. Finally, we get the extended graph G e .
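The search for additional XOR2 implementations can be illustrated with the following Python sketch. It is not the authors' Algorithm 1: it represents every node by a bitmask over the inputs, takes as available nodes all nodes outside the reachability set of u, and tests all pairs. The three-gate sequence used as input (t 0 = x 0 ⊕ x 1 , t 1 = x 1 ⊕ x 2 , t 2 = t 0 ⊕ x 2 ) is a hypothetical example chosen for this sketch, and all names are illustrative.

```python
from itertools import combinations

def node_values(inputs, circuit):
    """Map each node to a bitmask of the inputs it XORs together.

    circuit must list its gates in topological order.
    """
    value = {x: 1 << i for i, x in enumerate(inputs)}
    for node, preds in circuit.items():
        acc = 0
        for p in preds:
            acc ^= value[p]
        value[node] = acc
    return value

def descendants(circuit, u):
    """Nodes that depend on u (the reachability set of u, without u)."""
    reached, frontier = set(), {u}
    while frontier:
        frontier = {v for v, preds in circuit.items()
                    if v not in reached and frontier & set(preds)}
        reached |= frontier
    return reached

def extra_implementations(inputs, circuit, u):
    """Pairs (p, q) of available nodes with p ^ q == u, besides the given one."""
    value = node_values(inputs, circuit)
    blocked = descendants(circuit, u)
    available = [v for v in value if v != u and v not in blocked]
    found = []
    for p, q in combinations(available, 2):
        if value[p] ^ value[q] == value[u] and {p, q} != set(circuit[u]):
            found.append((p, q))
    return found

inputs = ["x0", "x1", "x2"]
circuit = {"t0": ("x0", "x1"), "t1": ("x1", "x2"), "t2": ("t0", "x2")}
new_t2 = extra_implementations(inputs, circuit, "t2")
new_t0 = extra_implementations(inputs, circuit, "t0")
print(new_t2, new_t0)
```

For t 2 the sketch finds the second implementation (x 0 , t 1 ). For t 0 it finds nothing, because t 2 lies in the reachability set of t 0 and is therefore not available, which is the cycle-avoidance rule described above.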
We can use GetReachabilitySet(G s ) to obtain the reachability set (see Table 7). Then, we use GenerateExtendedGraph(G s , 2) to obtain the extended graph G e . If a node has different implementations, we use "{}" to represent them. For example, {t 10,4,5 , t 10,14,15 } means that t 10 can be generated by (t 4 , t 5 ) or (t 14 , t 15 ). We find that five nodes have different implementations: t 10 , t 14 , t 15 , t 16 , t 17 .
Algorithm 1 (fragment)
... and u has not the implementation (w, v, p) then
Add (w, v, p) for u in G e ▷ Adding a new implementation for u
end if
end for
end if
end if
end for
return G e
Split the extended graph. In the extended graph G e , some nodes have many implementations, i.e., every implementation can generate the corresponding node. Thus, a large number of single graphs can be generated by using the property. However, the complexity increases exponentially as the number of implementations of nodes increases. Theoretically, it may exceed the existing computing power in generating single graphs from the G e . Therefore, we provide two methods to split the extended graph G e .
We define n i as the number of implementations of each node t i in G e . Suppose that there are k nodes in G e . The number N of single graphs can be computed by $N = \prod_{i=1}^{k} n_i$. Then, we define the limitation N ′ as the maximum number of single graphs. If N ≤ N ′ , we split the extended graph using the complete split method. If N > N ′ , we use the partial split method.
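The count N and a complete split can be sketched in Python as follows. The extended graph is represented as a dictionary from each non-unit node to the list of its implementations; this representation, the function names, and the small example are assumptions of this sketch. The partial split method is not covered.

```python
from itertools import product
from math import prod

def count_single_graphs(extended):
    """N = product of the numbers of implementations of all nodes."""
    return prod(len(impls) for impls in extended.values())

def complete_split(extended):
    """Yield every single graph: one implementation chosen per node."""
    nodes = list(extended)
    for choice in product(*(extended[n] for n in nodes)):
        yield dict(zip(nodes, choice))

# Hypothetical extended graph: only t2 has two implementations.
extended = {
    "t0": [("x0", "x1")],
    "t1": [("x1", "x2")],
    "t2": [("t0", "x2"), ("x0", "t1")],
}
n_graphs = count_single_graphs(extended)
singles = list(complete_split(extended))
print(n_graphs, singles)
```

Because N is a product, it grows exponentially with the number of nodes that have several implementations; for instance, five nodes with two implementations each already give 2^5 = 32 single graphs.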
Property 2. Given a DAG, if out(u) = 0, u must be the target node or the redundant node.
If a non-unit node with the out-degree 0 is not the target node, we call it the redundant node. Removing the redundant nodes from the graph can decrease the number of XOR2 gates. In the example of M P , t 13 in Equation (3) is the redundant node and can be removed. We give the procedures to remove redundant nodes as follows.
1. Suppose that we need to remove redundant nodes in G. We set a variable success = 1. It indicates whether a node has been removed.
2. If success = 0, we stop the procedures. Otherwise, we set success = 0 and check every non-input node in G.
3. If v with out(v) = 0 is not the target node, we remove v and the corresponding edges from G, and set success = 1. Next, the out-degree of each node in I(v) decreases by 1. Then, go to Step 2.
We use the function RemovingRedundantNodes() to remove redundant nodes. Note that there is a loop in the function. When we remove one node, we set success = 1 and recheck the graph because the out-degree of each node in I(v) changes. New redundant nodes may occur.
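A minimal Python sketch of this loop is given below. The dictionary representation of the graph and the function name are assumptions of the sketch; the flag success plays the same role as in the steps above.

```python
def remove_redundant_nodes(circuit, targets):
    """Repeatedly delete non-target gates whose output is never used."""
    circuit = dict(circuit)               # work on a copy
    success = True
    while success:
        success = False
        # Nodes that currently feed at least one gate (out-degree > 0).
        used = {p for preds in circuit.values() for p in preds}
        for v in list(circuit):
            if v not in used and v not in targets:
                del circuit[v]            # v is redundant
                success = True            # out-degrees changed: check again
    return circuit

# b is unused; once b is gone, a becomes unused as well.
chain = {"a": ("x0", "x1"), "b": ("a", "x2"), "y": ("x0", "x2")}
reduced = remove_redundant_nodes(chain, targets={"y"})
print(reduced)
```

The example needs two passes: a only becomes redundant after b has been deleted, which is why the graph has to be rechecked after every successful removal.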
In the example of 32 single graphs, each graph includes 10 non-unit nodes. We apply the function RemovingRedundantNodes() to all single graphs. 18 graphs still have 10 non-unit nodes, 12 graphs have 9 non-unit nodes, and 2 graphs have 8 non-unit nodes.
Although we try to avoid this case (e.g., the available set), the cycle may occur in single graphs. We use the function TopologicalOrdering() to execute the procedure. If TopologicalOrdering() returns error, cycles must exist in the graph, and we delete the graph. After the step, we call the left graphs the reduced graphs. They are the results of our graph extending algorithm. We can choose the best one from all the reduced graphs or further optimize them with more gates.
In the example of M P , 16 graphs are finally left, in which 12 graphs have 10 non-unit nodes, and 4 graphs have 9 non-unit nodes. We show the graphs with 9 non-unit nodes in Table 8. Now our graph extending algorithm is finished. After the graph extending algorithm, we obtain different reduced graphs. Some of them have less cost in hardware. If we plan to optimize a single graph further, the reduced graph will provide more precise information.
In the above example, if we take the number of 2-input xor gates and the depth into account, we choose the first and third reduced graphs with depth 3 from Table 8. They are the best circuits after our graph extending algorithm. The complete graph extending algorithm is shown in Algorithm 2. We can use other operations (e.g., XOR3) to generate the extended graph. We will discuss it in the next section.

Applications
In this section, we instantiate the transform algorithms. With the help of the graph extending algorithm, we propose two algorithms to optimize the given circuit using XOR3 and XOR4 gates, respectively. We start from a circuit with XOR2 gates. For the g ϵ -XOR metric with ϵ ≥ 4, we can follow similar procedures, and thus we do not discuss them in this section. The source codes are available at https://github.com/QunLiu-sdu/Using-Gates-with-More-Than-Two-Inputs.
Algorithm 2 ExtendGraph2()
Input: A single graph G s
Output: The set G 2 containing all the reduced graphs
G e = GenerateExtendedGraph(G s , 2) ▷ The extended graph
▷ Generating the single graphs
for each G r ∈ G 2 do ▷ Removing additional nodes
    G r = RemovingRedundantNodes(G r )
end for
for each G r ∈ G 2 do ▷ Deleting wrong graphs
    if TopologicalOrdering(G r ) = error then

Transforming Gates from 2 Inputs into 3 Inputs
We first focus on the g 2 -XOR metric starting the circuits with XOR2 gates, i.e., we try to convert gates from 2 inputs into 3 inputs. In the problem, we can use 1-operation and 2-operation. λ 1 and λ 2 represent the corresponding cost. Our goal is to find min(λ 1 e 1 + λ 2 e 2 ).
If v = a ⊕ b ⊕ c, we say that v has implementation (a, b, c). We consider a case where the out-degree of one node is 1, which is discussed in [BFI21].
Note that if v i is the output value, we will lose the output signal of the circuit after the merge procedure.
According to our transform algorithm, more nodes can be removed. Suppose that we have the circuit u = a ⊕ b, v 1 = u ⊕ c, v 2 = u ⊕ d, and out(u) is 2. The circuit area is 3λ 1 . If 3λ 1 − 2λ 2 > 0 holds, we can remove u and let I(u) point to O(u), which turns the two successors into 3-input xor gates. The new circuit is v 1 = a ⊕ b ⊕ c, v 2 = a ⊕ b ⊕ d, and the area of the new circuit is 2λ 2 < 3λ 1 . We can remove more nodes based on the above cases to reduce the cost. Thus, we give the following proposition.

Proposition 2.
Let N be the maximum value such that (N + 1)λ 1 − N λ 2 > 0 holds. Given a circuit with XOR2 gates, the cost can be reduced by removing the nodes with out-degree n (n ≤ N ).
Proof. Suppose that out(u) is n (n ≤ N ). There are n + 1 related XOR2 gates. One of them is used to generate u. Others are used to generate the nodes in O(u). Thus, removing u will delete (n + 1) XOR2 gates and add n XOR3 gates. If (n + 1)λ 1 − nλ 2 > 0, the cost of the circuit is reduced.
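The upper bound in Proposition 2 can be computed with the same loop that Algorithm 3 uses. The sketch below is illustrative only: the function name upper_bound is hypothetical, the costs 2.5 GE and 4.5 GE are the UMC 55 nm values quoted in the Hardware Implementations section, and the pair (2.0, 2.5) is an invented library used solely to show a larger bound. The guard on λ 2 ≤ λ 1 is an added assumption, since the loop would not terminate otherwise.

```python
def upper_bound(lam1, lam2):
    """Largest N with (N + 1) * lam1 - N * lam2 > 0, or 0 if none exists.

    lam1 is the cost of an XOR2 gate and lam2 the cost of an XOR3 gate.
    """
    if lam2 <= lam1:
        # Assumption: an XOR3 gate is strictly more expensive than an XOR2
        # gate; otherwise the inequality holds for every N.
        raise ValueError("expected lam2 > lam1")
    n = 0
    # Same test as in Algorithm 3: can the bound be raised from n to n + 1?
    while (n + 2) * lam1 - (n + 1) * lam2 > 0:
        n += 1
    return n


n_umc55 = upper_bound(2.5, 4.5)         # costs quoted for UMC 55 nm
n_too_expensive = upper_bound(2.0, 4.5)  # hypothetical: 2 * lam1 < lam2
n_cheap_xor3 = upper_bound(2.0, 2.5)     # hypothetical cheap XOR3 gate
```

With the UMC 55 nm costs only nodes of out-degree 1 are worth removing, and when 2λ 1 < λ 2 the bound is 0, so no node is removed.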
Proposition 2 shows which nodes can be removed in the g 2 -XOR metric. For convenience, the maximum value N is called the upper bound. If a node has out-degree n (n ≤ N ), we can remove it by deleting n + 1 XOR2 gates and adding n XOR3 gates. A remaining problem is how to determine the priority when nodes with different out-degrees can be removed. The following proposition solves this problem.
Proposition 3. Suppose that the upper bound is N . If out(u) = m and out(v) = n (n < m ≤ N ), removing v will reduce more cost than removing u. That is, S v > S u .
Proof. We can remove u by deleting m + 1 XOR2 gates and adding m XOR3 gates. The saved cost S u is (m + 1)λ 1 − mλ 2 . We can also remove v by deleting n + 1 XOR2 gates and adding n XOR3 gates. The saved cost S v is (n + 1)λ 1 − nλ 2 . We have S v − S u = (m − n)(λ 2 − λ 1 ) > 0, since n < m and an XOR3 gate costs more than an XOR2 gate (λ 2 > λ 1 ). Thus, S v > S u holds.
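A small numerical check of Proposition 3 may help. The sketch below evaluates the saved cost for several out-degrees and orders candidate nodes accordingly. The function names and the node labels are hypothetical, and the costs λ 1 = 2.0 and λ 2 = 2.5 are invented so that several out-degrees are removable at once; they do not come from any library in the paper.

```python
def saved_cost(out_degree, lam1, lam2):
    """Saving from removing a node: (n + 1) XOR2 gates out, n XOR3 gates in."""
    return (out_degree + 1) * lam1 - out_degree * lam2


def removal_priority(out_degrees, lam1, lam2):
    """Order candidate nodes by decreasing saved cost."""
    return sorted(
        out_degrees,
        key=lambda node: saved_cost(out_degrees[node], lam1, lam2),
        reverse=True,
    )


# Hypothetical candidates with out-degrees 3, 1 and 2.
candidates = {"u": 3, "v": 1, "w": 2}
order = removal_priority(candidates, lam1=2.0, lam2=2.5)
gap = saved_cost(1, 2.0, 2.5) - saved_cost(3, 2.0, 2.5)
```

The node with the smallest out-degree comes first, and the gap between two savings equals (m − n)(λ 2 − λ 1 ), which is why the search can simply increase n from 1 to N.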
We propose our algorithm based on the graph extending algorithm, called EGT2 (see Algorithm 3). ExtendGraph2() is used to obtain the set G 2 that contains many reduced graphs. For each graph G r in G 2 , we execute the following procedures:
1. Compute the upper bound N . We use U to save the nodes that cannot be removed: u ∈ U means that u is a target node or u represents an XOR3 gate.
2. Set n = 1. Then, Step 3 is repeatedly executed. Each time we finish Step 3, we set n = n + 1. If n ≤ N , we continue to execute Step 3. Otherwise, we stop the procedures and put the new graph G r into G 3 .
3. Check the nodes in the topological ordering. If the out-degree of a node u is n, u is not a target node, and O(u) ∩ U = ϕ, we remove u by adding n XOR3 gates and deleting n + 1 XOR2 gates based on Proposition 2. Then, we put the nodes in O(u) into U. Here, O(u) ∩ U ≠ ϕ means that at least one node in O(u) has already been optimized into an XOR3 gate and cannot be merged again.
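The steps above can be sketched as follows. This is an illustrative reading of Steps 1 to 3, not the authors' implementation: the dict-of-tuples circuit representation, the function name remove_low_fanout_nodes, and the example circuit are assumptions, and the upper bound N is passed in as max_out_degree rather than computed. The extra check that skips nodes sharing an input with a consumer is also an added safeguard, because a shared signal would cancel under XOR.

```python
from graphlib import TopologicalSorter


def remove_low_fanout_nodes(gates, targets, max_out_degree):
    """Merge XOR2 gates into XOR3 gates for out-degrees 1..max_out_degree.

    gates maps each gate output to the tuple of its inputs; names that are
    not keys are circuit inputs. targets are outputs that must be kept.
    """
    gates = dict(gates)    # work on a copy
    locked = set(targets)  # the set U: targets and newly created XOR3 gates
    order = [v for v in TopologicalSorter(gates).static_order() if v in gates]

    for n in range(1, max_out_degree + 1):
        for u in order:
            if u not in gates or u in locked:
                continue
            consumers = [w for w, ins in gates.items() if u in ins]  # O(u)
            if len(consumers) != n or locked.intersection(consumers):
                continue
            # Added safeguard: a shared input would cancel after merging.
            if any(set(gates[u]) & set(gates[w]) for w in consumers):
                continue
            u_inputs = gates.pop(u)
            for w in consumers:
                merged = []
                for x in gates[w]:
                    merged.extend(u_inputs if x == u else (x,))
                gates[w] = tuple(merged)  # w is now an XOR3 gate
            locked.update(consumers)
    return gates


# Hypothetical circuit: u = a + b, w = u + c, y = w + d, with target y.
circuit = {"u": ("a", "b"), "w": ("u", "c"), "y": ("w", "d")}
reduced = remove_low_fanout_nodes(circuit, targets={"y"}, max_out_degree=1)
```

In this example u is removed and w becomes a 3-input gate on a, b and c, so three XOR2 gates are replaced by one XOR2 gate and one XOR3 gate. After that, w is in U and is not merged any further.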
We explain why O(u) ∩ U = ϕ in Step 3 is necessary. Suppose that we have the circuit: in which out(v) = 1 and out(u) = 2. When n = 1, we can remove v and have the new circuit: t u,a,b , t 3 w,u,c,d , t y,p,u . Next, we have U = U ∪ {w}. Then, we set n = 2. However, we cannot remove u since O(u) = {w, y} and w represents an XOR3 gate. Thus, u can be removed only when O(u) ∩ U = ϕ holds.
Because of the topological ordering property, in Step 3, the next node that needs to be checked can always be generated by the checked nodes. If we combine all the checked nodes into a new graph, a path must exist from the unit nodes to every non-unit node, and no nodes can be removed with the current upper bound n. Every non-unit node u in the new graph is in one of three states:
• out(u) is greater than the current upper bound n. We cannot remove u in the current state based on Proposition 2.
• out(u) is m (m < n). Based on Proposition 3, we have checked u in the previous procedures. If u has been left, we must have O(u) ∩ U ≠ ϕ or u ∈ U.
• out(u) is n. We have checked u in Step 3 and cannot remove u.
In addition, according to Proposition 3, we first check the nodes with out-degree 1. When we have checked all the nodes, Step 3 is finished, and we set n = n + 1. If a node matches the condition when we check it, we remove the node, delete the corresponding edges, and add the corresponding edges to the graph. We use n XOR3 gates to replace n + 1 XOR2 gates. The saved cost S v is (n + 1)λ 1 − nλ 2 . We set the upper bound N = 0 if 2λ 1 < λ 2 , because the cost of two XOR2 gates is then less than that of one XOR3 gate. Our algorithm therefore does not apply to every library: we only take the libraries with 2λ 1 > λ 2 into account.

Algorithm 3 EGT2()
Input: A single graph G s
Output: A set G 3 containing all the reduced graphs with 2/3-input xor gates
  G 2 = ExtendGraph2(G s )    ▷ Containing the reduced graphs with XOR2 gates
  N ← 0    ▷ The upper bound
  while ((N + 1) + 1)λ 1 − (N + 1)λ 2 > 0 do
    N ← N + 1
  end while
  for each G r ∈ G 2 do    ▷ Removing nodes
    T = TopologicalOrdering(G r )
    The set U containing all the target nodes in G r
    n ← 1
    while n ≤ N do
      for each node u in T do
        if u ∉ U, out(u) = n, and O(u) ∩ U = ϕ then
          We remove u, delete corresponding edges, and add n operations in G r
          Put the nodes in O(u) into U
        end if
      end for
      n ← n + 1
    end while
    G 3 ← G 3 ∪ {G r }
  end for
  return G 3
The new circuit area is only 26.6 GE (one XOR2 gate and five XOR3 gates).
For a given circuit with XOR2 gates, we use Algorithm 3 to convert the initial graph into many graphs with 2/3-input xor gates and save them in G 3 . For each G r in G 3 , we use a new function ExtendGraph3() to generate the extended graph G e with 2/3-input xor gates. The new function is similar to ExtendGraph2() but uses GenerateExtendedGraph(G r , 3) instead of GenerateExtendedGraph(G r , 2). Then, we split G e into many single graphs. For each new single graph, we optimize it with XOR4 gates.
Similar to the above section, we discuss which nodes can be removed. In our algorithm, t 4 u,a,b,c,d represents an XOR4 gate. We propose three circuit types, which can be transformed into 4-input xor gates.

Type 1.
If t u,p,q and t 3 p,a,b,c are contained in the circuit, we can obtain t 4 u,q,a,b,c by removing p. We say that p matches Type 1 (see Figure 4-left).

Type 2.
If t 3 u,p,q,w and t p,a,b are contained in the circuit, we can obtain t 4 u,a,b,q,w by removing p. We say that p matches Type 2 (see Figure 4-middle).
We take Type 1 and Type 2 into account because the nodes in Type 3 can be transformed into Type 2. The following observations help us to simplify the analysis.
Observation 1. If p matches Type 1, in(p) = 3. If p matches Type 2, in(p) = 2. Thus, if p matches one type, it never matches another type.
Observation 2. Suppose that we have a node p and its output set O(p). For any node q ∈ O(p), if p matches one type, other nodes in I(q) will never match another type.
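The two types can be told apart from in-degrees alone, as Observation 1 states. The following sketch illustrates this for the out-degree-1 case. It assumes the same hypothetical dict-of-tuples representation (gate output mapped to its inputs), the function names are invented, and Type 3 is not handled.

```python
def consumers(gates, p):
    """Return O(p): the gates that take p as an input."""
    return [u for u, ins in gates.items() if p in ins]


def match_type(gates, p):
    """Return 1 or 2 if p matches Type 1 or Type 2 with out(p) = 1, else None."""
    if p not in gates:
        return None
    outs = consumers(gates, p)
    if len(outs) != 1:
        return None
    fan_in = (len(gates[p]), len(gates[outs[0]]))
    if fan_in == (3, 2):  # XOR3 feeding an XOR2: Type 1
        return 1
    if fan_in == (2, 3):  # XOR2 feeding an XOR3: Type 2
        return 2
    return None


def merge_into_xor4(gates, p):
    """Remove p and turn its single consumer into an XOR4 gate."""
    if match_type(gates, p) is None:
        raise ValueError("p matches neither Type 1 nor Type 2")
    gates = dict(gates)
    (u,) = consumers(gates, p)
    p_inputs = gates.pop(p)
    gates[u] = tuple(x for x in gates[u] if x != p) + p_inputs
    return gates


type1 = {"p": ("a", "b", "c"), "u": ("p", "q")}
type2 = {"p": ("a", "b"), "u": ("p", "q", "w")}
merged1 = merge_into_xor4(type1, "p")
merged2 = merge_into_xor4(type2, "p")
```

In both examples the result is a single 4-input gate for u, corresponding to t 4 u,q,a,b,c and t 4 u,a,b,q,w up to the order of the inputs.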
The above types only consider the case where the out-degree of a node is 1. Next, we discuss the case in which the out-degree is greater than 1. Suppose that p matches Type 1 or Type 2 and O(p) = {u 1 , u 2 , . . . , u n } (n ≥ 2). If p matches Type 1, we can use n XOR4 gates instead of an XOR3 gate and n XOR2 gates. Figure 4-left shows the case where out(p) is 2. If p matches Type 2, there exist two cases:
• u i (1 ≤ i ≤ n) has the in-degree 2;
• u i (1 ≤ i ≤ n) has the in-degree 3.
For the first case, we additionally use an XOR3 gate instead of an XOR2 gate (see Figure 4-middle). For the second case, we additionally use an XOR4 gate instead of an XOR3 gate (see Figure 4-right). To decide which nodes can be removed, we obtain the following proposition by extending Proposition 2.
Proposition 4. Let N 1 be the maximum value such that λ 2 + N 1 (λ 1 − λ 3 ) > 0 holds and N 2 be the maximum value such that λ 1 + λ 2 − λ 3 − (N 2 − 1) · min((λ 2 − λ 1 ), (λ 3 − λ 2 )) > 0 holds. The cost can be reduced if
• the node matches Type 1 and has the out-degree m (m ≤ N 1 ), or
• the node matches Type 2 and has the out-degree n (n ≤ N 2 ).
Proof. Suppose that p matches Type 1 and out(p) is m (m ≤ N 1 ). There are m related XOR2 gates and a related XOR3 gate. The XOR3 gate is used to generate p. The other gates are used to generate new nodes from p. Thus, we can remove p by adding m XOR4 gates and deleting m XOR2 gates and an XOR3 gate. If λ 2 + m(λ 1 − λ 3 ) > 0, the circuit area decreases.
Suppose that p matches Type 2 and out(p) is n (n ≤ N 2 ). If n = 1, we can use an XOR4 gate instead of an XOR2 gate and an XOR3 gate. Then, the circuit area decreases by λ 1 + λ 2 − λ 3 > 0. If p is also used to generate a new node u, there are two different cases. If in(u) = 2 (u ∈ O(p)), we will use an XOR3 gate instead of an XOR2 gate. If in(u) = 3 (u ∈ O(p)), we will use an XOR4 gate instead of an XOR3 gate. Thus, we choose min((λ 2 − λ 1 ), (λ 3 − λ 2 )). The rest of the proof is similar to that of Proposition 2, and we do not repeat it.
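The two bounds of Proposition 4 can be evaluated numerically as sketched below. The helper names are hypothetical. The values λ 1 = 2.5 and λ 2 = 4.5 are the UMC 55 nm costs quoted in the Hardware Implementations section, while both values used for the XOR4 cost λ 3 are invented for illustration. The sketch assumes λ 1 < λ 2 < λ 3 , so that each inequality stops holding once n is large enough.

```python
def largest_n(holds, limit=1000):
    """Largest n >= 0 such that holds(1), ..., holds(n) are all true."""
    n = 0
    while n < limit and holds(n + 1):
        n += 1
    return n


def type_bounds(lam1, lam2, lam3):
    """Return (N1, N2) from Proposition 4 for XOR2/XOR3/XOR4 costs."""
    n1 = largest_n(lambda n: lam2 + n * (lam1 - lam3) > 0)
    extra = min(lam2 - lam1, lam3 - lam2)
    n2 = largest_n(lambda n: lam1 + lam2 - lam3 - (n - 1) * extra > 0)
    return n1, n2


# lam3 is hypothetical in both calls; it is not a value from the paper.
bounds_a = type_bounds(2.5, 4.5, 6.0)
bounds_b = type_bounds(2.5, 4.5, 5.0)
```

With the more expensive hypothetical XOR4 gate both bounds are 1, whereas the cheaper one raises N 2 to 4, which shows how strongly the reachable out-degrees depend on the library.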
Another question is how to determine the priority when nodes with different out-degrees can be removed. Proposition 3 provides a solution. Suppose that we have two nodes u and v, and out(u) = m, out(v) = n (m < n). We discuss this on a case-by-case basis.
• If u and v do not match the same type, we simply deal with them in order. Based on Observation 2, we have O(u) ∩ O(v) = ϕ, so the operations on u and v are independent.
• If u and v match the same type, we have S u > S v because m < n. The proof is similar to that of Proposition 3.
Thus, we can search for the two types separately. We provide an algorithm called EGT3 (see Algorithm 4). The procedures are as follows.
2. For each graph G r in G 3 , run the function GenerateExtendedGraph(G r , 3) to obtain the extended graph G e , split it into different single graphs, and put all the single graphs into G 4 .
3. Compute N 1 and N 2 . Let n 1 = 1 and n 2 = 1. We use U to save the nodes that cannot be optimized: u ∈ U means that u is a target node or u represents an XOR4 gate.
4. If n 1 > N 1 and n 2 > N 2 , we stop the procedures. Otherwise, go to Step 5.
5. Check the nodes in the topological ordering. If u matches Type 1, out(u) = n 1 (n 1 ≤ N 1 ), O(u) ∩ U = ϕ, and u ∉ U, we can remove u by adding n 1 XOR4 gates and deleting n 1 XOR2 gates and an XOR3 gate. Then, we let n 1 = n 1 + 1 and go to Step 6.
6. Check the nodes in the topological ordering. If u matches Type 2, out(u) = n 2 (n 2 ≤ N 2 ), O(u) ∩ U = ϕ, and u ∉ U, we can remove u. We check O(u) and for each node u i ∈ O(u),
   - add an XOR3 gate and delete an XOR2 gate (u i is the first case in Type 2);

Applying Our Algorithms to Many Proposed Matrices
In this section, we apply EGT2 and EGT3 to several linear layers from the literature, including matrices used in many ciphers [DR20, CMR05, JNP15, Ava17, BBI + 15, BCG + 12, ADK + 14, Ava17, BJK + 16, AIK + 00]. The results in ASIC1 are shown in Table 2 and the results in ASIC2 are shown in Table 3. We list the results in [XZL + 20], [BDK + 21], and [BFI21]. Note that [XZL + 20] only searches for circuits with the g 1 -XOR metric. Then, we run the XZLBZ algorithm and optimize the circuits using the BFI, EGT2, and EGT3 algorithms.
It is worth noting that the circuit with the minimum number of 2-input xor gates may perform worse under the g i -XOR metric (i > 1). For example, in Table 3, we use a circuit for AES MixColumns with 96 XOR2 gates instead of 92 gates. The area of the 96-gate circuit can be reduced to 255 GE by EGT2, while that of the 92-gate circuit cannot.
As the procedures of splitting the matrix into elementary matrices and generating s-XOR sequences are randomized, rerunning the computations at different times leads to different results. Hence, it is difficult to determine how long we should wait to achieve the best solution. We execute the XZLBZ algorithm for a limited and reasonable time (e.g., 24 hours) to collect many implementations and select the best one among them. This strategy is quite similar to many previous search approaches, e.g., [KLSW17, TP20, XZL + 20, BFI21].
In our experiments, many previous results can be optimized. Notably, for the matrix used in AES MixColumns, we achieve a circuit with 255 GE in ASIC2, while the best previous result is 258.9 GE [BDK + 21] in the same library. Moreover, we can decrease the circuit area to 243 GE with the help of XOR4 gates. This shows the effectiveness of our strategy.

Hamming weight. "best of BFI" represents the proportion of matrices for which the BFI algorithm obtains the best results, and "best of EGT2" represents the proportion for which our EGT2 obtains the best results. "BP", "BP+BFI", "BP+EGT2", and "BP+EGT3" represent the average circuit area of a matrix. Note that the BP algorithm uses the g 1 -XOR metric, both the BFI algorithm and EGT2 use the g 2 -XOR metric, and EGT3 uses the g 3 -XOR metric.
The results show significant room for improvement. On the one hand, the method using XOR3 gates can optimize all the circuits generated by the BP algorithm. On the other hand, our algorithms perform better than BFI. In ASIC1, BFI obtains the best results for about 35% of the matrices. In ASIC2, the proportion decreases to 1.3%. EGT2 and EGT3 always obtain the best results.

Hardware Implementations
Our algorithms aim at searching for low-area circuits. For other hardware metrics (e.g., latency and power consumption), it is insufficient to simply estimate them gate by gate. These criteria are closely related to the standard cell library [BFI21,BDK + 21]. In this respect, we synthesize the implementations with the UMC 55 nm library (λ 1 = 2.5 GE, λ 2 = 4.5 GE). The logic synthesis is performed with Synopsys Design Compiler version R-2020.09-SP4 (using the compile_ultra and compile_ultra -no_autoungroup commands), and simulation is done in Mentor Graphics ModelSim SE v10.2c. All the results are shown in Table 4. We give some explanations. There are four different circuits to implement AES MixColumns, which are from different tools based on different metrics.
Type 1 is from [LXZZ21] and is the best result with g 1 -metric. It only needs 91 XOR2 gates with depth 7.
Type 2 is the circuit found by our algorithms based on Type 1. We always focus on the area and generate an optimized circuit with 61 XOR2 gates and 15 XOR3 gates.
Type 3 is from [LWF + 22] and is one of the best results with respect to the minimum depth (another best result is from [BFI21]). It needs 103 XOR2 gates with depth 3. We provide it to show the comparison in latency.
Type 4 is from the synthesizer starting from Type 1. The synthesizer takes a tradeoff between multiple metrics. We provide it to show the comparison with the previous synthesizers.
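As a rough cross-check of the gate counts above, the area of Type 1 and Type 2 can be estimated by multiplying each gate count by its cost in the UMC 55 nm library. The sketch below is only a gate-count estimate with a hypothetical helper name; the synthesized figures in Table 4 come from the synthesizer and need not coincide with it.

```python
def estimated_area(gate_counts, gate_costs):
    """Sum of (number of gates) x (cost per gate in GE)."""
    return sum(count * gate_costs[kind] for kind, count in gate_counts.items())


# Costs quoted for the UMC 55 nm library (lambda_1 and lambda_2).
UMC55_COSTS = {"XOR2": 2.5, "XOR3": 4.5}

type1_gates = {"XOR2": 91}               # circuit from [LXZZ21]
type2_gates = {"XOR2": 61, "XOR3": 15}   # circuit after our algorithms

area_type1 = estimated_area(type1_gates, UMC55_COSTS)
area_type2 = estimated_area(type2_gates, UMC55_COSTS)
saving = area_type1 - area_type2
```

The counts are consistent with 15 removals of out-degree-1 nodes: each removal deletes two XOR2 gates and adds one XOR3 gate, saving 2λ 1 − λ 2 = 0.5 GE, for an estimated total of 7.5 GE.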
For AES MixColumns, our implementation achieves the best area and a reduction of energy consumption (1.49 uW) at the cost of a slight and reasonable growth of latency (0.13 ns). The goal of our algorithms is to achieve implementations with the smallest area, and our results show that considering gates with more inputs offers more possibilities.

Conclusion
In this paper, we propose the transform algorithm and the graph extending algorithm, and then combine the two algorithms to instantiate EGT2 and EGT3 with the g 2 -XOR and g 3 -XOR metrics, respectively. The two instantiated algorithms can be used to further optimize the circuit area of matrices in hardware. Our methods contribute to
• better circuits of AES MixColumns with 243 GE (with g 3 -XOR in ASIC2), which improves on all previous results;
• better circuits of several linear layers from the literature;
• better circuits for 100% of the matrices proposed in [LSL + 19].
Though our algorithms can easily reduce the circuit area with local optimization, it is still important to perform the procedures with global optimization based on heuristics like [BDK + 21]. In addition, our research provides a new tool for the construction of lightweight matrices. There should exist some matrices that are more compatible with our algorithms and thus have better circuits with the g ϵ -XOR metric, which we leave as future work.