Low-Latency Boolean Functions and Bijective S-boxes

In this paper, we study the gate depth complexity of (vectorial) Boolean functions in the basis {NAND, NOR, INV} as a new metric, called latency complexity, to mathematically measure the latency of Boolean functions. We present efficient algorithms to find all Boolean functions with low latency complexity, to determine the latency complexity of (vectorial) Boolean functions, and to find all circuits with the minimum latency complexity for a given Boolean function. Then, we present another algorithm to build bijective S-boxes with low latency complexity; with respect to computation cost, this algorithm outperforms the previous methods of building S-boxes. As a result, for latency complexity 3, we present n-bit S-boxes for 3 ≤ n ≤ 8 with linearity $2^{n-1}$ and uniformity $2^{n-2}$ (except for 5-bit S-boxes, for which the minimum achievable uniformity is 6). Besides, for latency complexity 4, we present several n-bit S-boxes for 5 ≤ n < 8 with linearity $2^{n-2}$ and uniformity $2^{n-4}$.


Introduction
Studying properties of all n-bit Boolean functions is not an easy task when n > 5. One of the hardware properties of (vectorial) Boolean functions to study is their latency. Finding the minimum latency for hardware implementation of a Boolean function with a synthesizer is not possible if the number of inputs is high and if we only use the truth table of the function. One approach to achieve a lower latency is providing a gate-level netlist for the synthesizer to implement the Boolean function. However, finding a gate-level netlist that provides the lowest latency for implementing a Boolean function is usually not an easy task when the number of input bits to the functions is high (even for 5-bit S-boxes). Typically, gate depth complexity defined as the minimum length of the longest path from an input bit to an output bit within all possible implementations of a (vectorial) Boolean function, is considered to mathematically model the lowest latency for implementing that function. But, as a result of [BMP08], determining the gate depth complexity of a Boolean function is an NP-hard problem and consequently finding low-latency implementation of (larger) Boolean functions stays challenging.
Furthermore, the latency metric of a circuit is practically quite complicated, since different gates have different delays and it is dependent on several parameters that vary under different operating conditions (such as driving power, voltage, and temperature) and more importantly under different technologies that the circuit will be implemented. Even if the parameters are optimized for achieving the low-latency implementation, the latency of different logic gates covers a wide range. Therefore, it is not realistic to consider the gate depth complexity alone as a metric for the minimum latency of a Boolean function.

Our Contributions
In this paper, in Section 3, we use a new metric to have a model that is closer to reality than in the case of gate depth complexity. We call it latency complexity which is the gate depth complexity in the basis of {NAND, NOR, INV} (see Definition 8). We present a unique structure (circuit) to model the implementation of any Boolean function with the latency complexity of d and we show that the latency complexity is invariant under the extended bit permutation equivalence.
In Section 4, we study the latency complexity of (vectorial) Boolean functions. We first show that the latency complexity of (vectorial) Boolean functions stays invariant over the extended bit permutation equivalence. Then we discuss the computation cost of the search for determining the latency complexity of a given Boolean function and present several techniques to make it faster. Applying these techniques, we compute the latency complexity of all n-bit Boolean functions for up to n = 5. Besides, in Subsection 4.3 we present an algorithm to find all Boolean functions with low-latency complexity (d ≤ 4) for n ∈ {6, 7, 8}. Furthermore, we explain several speed-up techniques on the search for computing the latency complexity of a given Boolean function. We present another algorithm to find all possible structures to implement a Boolean function with a gate depth of the same as its latency complexity. We use this algorithm to determine the latency complexity of the previously known S-boxes in symmetric cryptography.
By applying the low-latency Boolean functions, we look for the existence of bijective n-bit S-boxes with a given latency complexity and study their cryptographic properties (precisely, their linearity, uniformity, and algebraic normal form degree) in Section 5. While the previous algorithm for building S-boxes [Can07,MB19] is useful for classifying n-to m-bit Boolean functions under linear or affine equivalences, it is not efficient for our need (which uses the extended bit permutation equivalence). Therefore, we introduce a new algorithm for building S-boxes that outperforms the previous method with respect to computation cost. As a result, for latency complexity 3, we present n-bit S-boxes for 3 ≤ n ≤ 8 with linearity $2^{n-1}$ and uniformity $2^{n-2}$ (except for 5-bit S-boxes, for which the minimum achievable uniformity is 6). Besides, we find several n-bit S-boxes for 5 ≤ n < 8 with latency complexity 4, linearity $2^{n-2}$, and uniformity $2^{n-4}$.
In Section 6, we describe an approach to optimize the suggested structures produced by our algorithm to find a circuit with the lowest latency in a real ASIC hardware implementation. We apply our approach to find efficient implementations for previously known and newly introduced S-boxes that are minimized with respect to the latency and then its area.
All the results for latency complexity of Boolean functions and S-boxes are published publicly at https://gitlab.science.ru.nl/shahramr/LowLatencySBoxes.git.

Related Works
Designing cryptographic primitives with low latency in hardware is still a young and emergent research topic. The first work in this area was [KNR12] by Knezevic, Nikov and Rombouts, which compared the latency properties of multiple (lightweight) block ciphers. Immediately after, the first dedicated low-latency block cipher, called PRINCE, was introduced by Borghoff et al. [BCG+12]. Designing low-latency primitives continued with the block cipher QARMA by Avanzi [Ava17]; Gimli, a high-performance cross-platform cryptographic permutation by Bernstein et al. [BKL+17]; PRINCEv2, an updated version of PRINCE by Bozilov et al. [BEK+20]; Orthros, a pseudorandom function (PRF) by Banik et al. [BIL+21]; and SPEEDY, a family of block ciphers by Leander et al. [LMMR21].
There are also some works focusing on the latency of particular cryptographic building blocks only. For instance, in [BFP19], Boyar, Find and Peralta present techniques for finding small low-depth circuits for cryptographic functions. In [LSL+19], Li et al. show how to construct involutory low-latency Maximum Distance Separable (MDS) matrices.
A recent work in the direction of determining circuit complexity of functions is [Sto16] that presents a SAT-solver-based technique to optimize the implementations of some S-boxes with respect to different criteria such as the gate depth complexity. While this technique gives one solution for implementing small S-boxes, it is unable to find the complexity in some cases, especially when the input size of the S-box is larger than 5.
In [BCBP03], Biryukov et al. presented an efficient algorithm to check if two functions are equivalent and another algorithm to find the representative in the linear equivalence or the affine equivalence class. Later in [MB19], De Meyer and Bilgin improved the algorithm for mappings of n-to m-bit with m < n. Since we will deal with the extended bit permutation equivalence and not the linear or affine equivalence, we modify these algorithms according to the properties of extended bit permutation equivalence. While these algorithms perform efficiently for linear or affine equivalence, we experienced that they are not suitable for (extended) bit permutation equivalence. Details of the modification and our solution for these problems are explained later.
In [BMD+20], Bilgin et al. present techniques to construct S-boxes with low-latency masked variants for use in side-channel countermeasures, which essentially requires low multiplicative depth and low gate complexity. However, this is not directly related to the development of low-latency symmetric primitives in general, as the requirements are quite different and sometimes even direct opposites. While in regular cryptographic S-boxes non-linear gates are beneficial for area and latency over the linear gates, in masked S-boxes linear operations are optimal and non-linear gates are the primary cost factor [BMD+20].

Basics and Notations
In this section, we introduce the necessary basics related to Boolean functions and their implementations based on the logic circuits. We assume that the reader has some, but not necessarily extensive, familiarity with these concepts.
We use $\mathbb{Z}_n$ to denote the finite set $\{0, 1, \ldots, n-1\}$, that is, the set of non-negative integers smaller than n. By $\mathbb{F}_2$, we denote the finite field of two elements, $\{0, 1\}$, and call it the binary field; the addition of this field is denoted by ⊕ and called XOR. By $\mathbb{F}_2^n$, with n being a positive integer, we denote the binary vector space of dimension n and call it the space of n-bit vectors.
Let $a \in \mathbb{F}_2^n$; then by $a[i]$ with $i \in \mathbb{Z}_n$, we denote the i-th element of a, i.e., $a = (a[0], \ldots, a[n-1])$. Note that in this paper, we always count starting from 0. Let $a, b \in \mathbb{F}_2^n$ be two n-bit binary vectors. We use ⟨a, b⟩ to denote the inner product between a and b, which is defined as $\langle a, b \rangle = \bigoplus_{i=0}^{n-1} a[i]\, b[i]$. Also, by hw(a), we denote $\sum_{i=0}^{n-1} a[i]$, which is called the Hamming weight of a. To denote the concatenation of two vectors $a \in \mathbb{F}_2^n$ and $b \in \mathbb{F}_2^m$, we use $(a \| b)$, that is, $(a[0], \ldots, a[n-1], b[0], \ldots, b[m-1])$. To make it easier and more space-efficient to display a binary vector $a \in \mathbb{F}_2^n$, instead of displaying all its binary elements, we show it by its corresponding integer value in $\mathbb{Z}_{2^n}$. Precisely, we use the simple mapping from $\mathbb{F}_2^n$ to $\mathbb{Z}_{2^n}$ that maps any $a \in \mathbb{F}_2^n$ to $\sum_{i=0}^{n-1} a[i] \cdot 2^{n-i-1}$.
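The notation above can be illustrated with a minimal Python sketch. The helper names (`to_int`, `to_bits`, `inner_product`, `hamming_weight`, `concat`) are hypothetical and chosen only for this illustration; the sketch assumes bit vectors are stored as tuples with a[0] first, so that a[0] is the most significant bit of the integer value.

```python
def to_int(a):
    """Map (a[0], ..., a[n-1]) to sum of a[i] * 2^(n-i-1); a[0] is the MSB."""
    n = len(a)
    return sum(bit << (n - i - 1) for i, bit in enumerate(a))


def to_bits(value, n):
    """Inverse of to_int: recover the n-bit vector of an integer in Z_{2^n}."""
    return tuple((value >> (n - i - 1)) & 1 for i in range(n))


def inner_product(a, b):
    """Inner product over F_2: XOR of the bitwise products a[i] * b[i]."""
    return sum(x & y for x, y in zip(a, b)) & 1


def hamming_weight(a):
    """Number of ones in the vector a."""
    return sum(a)


def concat(a, b):
    """Concatenation (a || b) of two bit vectors."""
    return tuple(a) + tuple(b)
```

For instance, under this convention the vector (1, 0, 1, 1) is displayed as the integer 11.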

Boolean Functions
The functions from the vector space $\mathbb{F}_2^n$ to the binary field $\mathbb{F}_2$ are called Boolean functions with n variables or simply n-bit Boolean functions. The number of n-bit Boolean functions is $2^{2^n}$, and this number is too large to study the properties of each n-bit Boolean function when n > 5. For this reason, determining and studying those Boolean functions satisfying the target conditions is not feasible through an exhaustive computer search.¹ Therefore, it is necessary to find solutions that make it easier to study properties of Boolean functions or find Boolean functions satisfying the target properties. In the following, we briefly explain the necessary terms and notations of Boolean functions used in this paper.
The truth table is the most basic way to represent a Boolean function. Let f be an n-bit Boolean function; then the truth table of f is a binary vector $T_f \in \mathbb{F}_2^{2^n}$ such that for any $x \in \mathbb{F}_2^n$, $T_f[x]$ shows the value of f(x). Among the other classical representations of Boolean functions, the one most often used in cryptography is the algebraic normal form (ANF), which is the n-variable polynomial representation over $\mathbb{F}_2$ of the form $f(x) = \bigoplus_{I \in \mathbb{F}_2^n} a_I\, x^I$, where $x^I = \prod_{i=0}^{n-1} x_i^{I[i]}$, i.e., the corresponding monomial for the $x_i$ variables with I[i] = 1. Note that each $a_I$ is a binary value and every coordinate $x_i$ appears in this polynomial with exponent at most 1. It is well known that the ANF representation is unique² and can be computed from the given truth table with a complexity of $n \cdot 2^n$ operations.
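As an illustration of the truth-table-to-ANF conversion, the following is a minimal sketch using the standard in-place Möbius (butterfly) transform, which uses n · 2^(n-1) XOR operations and is thus within the stated bound. The function names are hypothetical, and the example function x0x1 ⊕ x0x2 ⊕ x2 is chosen only for illustration. Truth tables and ANF coefficients are indexed by the integer encoding of x and I, respectively.

```python
def anf_from_truth_table(truth_table):
    """Return the ANF coefficients a_I (indexed by the integer value of I)."""
    size = len(truth_table)
    n = size.bit_length() - 1
    if size != 1 << n:
        raise ValueError("truth table length must be a power of two")
    coeffs = list(truth_table)
    step = 1
    while step < size:
        # Butterfly step: fold the lower half of each block into the upper half.
        for start in range(0, size, 2 * step):
            for k in range(start, start + step):
                coeffs[k + step] ^= coeffs[k]
        step *= 2
    return coeffs


def algebraic_degree(coeffs):
    """Maximum Hamming weight of I over all monomials with a_I = 1."""
    return max((bin(i).count("1") for i, c in enumerate(coeffs) if c), default=0)


# Truth table of f(x0, x1, x2) = x0*x1 + x0*x2 + x2 over F_2 (x0 is the MSB).
example_tt = [0, 1, 0, 1, 0, 0, 1, 1]
example_anf = anf_from_truth_table(example_tt)
```

In this example the nonzero coefficients sit at indices 1, 5, and 6, which correspond to the monomials x2, x0x2, and x0x1, so the function is quadratic.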
The Hamming weight of a Boolean function's truth table is called its weight, which is the number of $x \in \mathbb{F}_2^n$ with f(x) = 1. Balanced Boolean functions are the ones whose weight is equal to $2^{n-1}$, i.e., the function maps half of the $x \in \mathbb{F}_2^n$ to 1 and the other half to 0.
Definition 1 (Algebraic Degree). For an n-bit Boolean function f, the algebraic degree is the maximum Hamming weight of all occurring monomials in the ANF representation of the function, which we denote by deg(f), i.e., $\deg(f) = \max\{\mathrm{hw}(I) : I \in \mathbb{F}_2^n,\ a_I \neq 0\}$.
Linear Boolean functions are those Boolean functions for which, for any $a, b \in \mathbb{F}_2^n$, we have f(a ⊕ b) = f(a) ⊕ f(b). Each n-bit linear Boolean function can be represented as $\ell_\alpha(x) = \langle \alpha, x \rangle$ with corresponding $\alpha \in \mathbb{F}_2^n$. Note that each of these functions with α ≠ 0 is balanced.
Separating Boolean functions by their algebraic degree, the ones with degree one, two, or three are called affine, quadratic, and cubic functions, respectively. Affine functions, which are the extension of linear functions by a constant XOR in the output, can be displayed as ⟨α, x⟩ ⊕ c with corresponding $\alpha \in \mathbb{F}_2^n$ and $c \in \mathbb{F}_2$.

Vectorial Boolean Function
While Boolean functions map n-bit vectors to a one-bit value, vectorial Boolean functions map n-bit vectors to m-bit vectors. To specify the input and output bit size of these functions, we call them n-to m-bit vectorial Boolean functions, and when the input and output bit sizes are the same, we simply call them n-bit vectorial Boolean functions. Clearly, the vectorial Boolean functions include the Boolean functions which correspond to m = 1. In cryptography, vectorial Boolean functions are usually called S-boxes which provide confusion in the cipher. The S-boxes play a primary role in the key-alternating block ciphers, especially in the Substitution-Permutation-Network (SPN) ones.
¹ Suppose that determining whether a 6-bit Boolean function has the target properties takes one microsecond ($10^{-6}$ seconds). Then, it takes about $2^{44}$ seconds (about half a million years) to visit all the 6-bit Boolean functions.
² Precisely, the ANF polynomial in $\mathbb{F}_2[x_0, \ldots$ …
Let F be an n-to m-bit Boolean function; then the Boolean functions $f_0, \ldots, f_{m-1}$ defined by $F(x) = (f_0(x), \ldots, f_{m-1}(x))$ for every $x \in \mathbb{F}_2^n$ are called coordinate functions of F. Also, for every non-zero $\alpha \in \mathbb{F}_2^m$, the Boolean function x → ⟨α, F(x)⟩ is called a component function of F, and we denote this function by ⟨α, F(x)⟩. In this paper, to denote the truth table of F easily, we use an array of $2^n$ elements in $\mathbb{Z}_{2^m}$, i.e., $(F(0), \ldots, F(2^n - 1))$.
As for Boolean functions, the property of balancedness plays a crucial role in vectorial Boolean functions. An n-to m-bit Boolean function F is called balanced if it takes every value of $\mathbb{F}_2^m$ the same number of times, i.e., $2^{n-m}$ times. The balanced n-bit vectorial Boolean functions are the permutations on $\mathbb{F}_2^n$. It is well known in the literature that an n-to m-bit Boolean function F is balanced if and only if all its component functions are balanced (for a proof see, e.g., [Car21]).
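The balancedness criterion can be checked numerically on small examples. The following is a minimal sketch with hypothetical helper names; a vectorial function is given as a list of integers in Z_{2^m}, and the inner product of two integer-encoded vectors is computed as the parity of their bitwise AND. The example tables are arbitrary and serve only as illustrations.

```python
def parity(value):
    """Parity of the number of one bits, i.e., XOR of all bits of value."""
    return bin(value).count("1") & 1


def component(table, beta):
    """Truth table of the component function x -> <beta, F(x)>."""
    return [parity(beta & y) for y in table]


def is_balanced_boolean(truth_table):
    """A Boolean function is balanced if exactly half of its outputs are 1."""
    return 2 * sum(truth_table) == len(truth_table)


def is_balanced_vectorial(table, m):
    """F is balanced if every value of F_2^m is taken equally often."""
    counts = [0] * (1 << m)
    for y in table:
        counts[y] += 1
    return len(set(counts)) == 1


def all_components_balanced(table, m):
    """Check balancedness of all 2^m - 1 nonzero component functions."""
    return all(is_balanced_boolean(component(table, beta))
               for beta in range(1, 1 << m))


permutation_3bit = [1, 4, 7, 2, 5, 0, 3, 6]      # x -> 3x + 1 mod 8
balanced_3_to_2 = [0, 1, 2, 3, 0, 1, 2, 3]       # every value taken twice
unbalanced_2bit = [0, 0, 1, 2]                   # value 3 is never taken
```

On these examples both checks agree, as the stated equivalence predicts.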
The algebraic degree of F is the maximum of the algebraic degrees of all coordinate functions. Hence, we use the same definition for linear, affine, quadratic, and cubic functions of the vectorial Boolean functions.
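Definition 2 (Linearity and Differential Uniformity). For a vectorial Boolean function $F : \mathbb{F}_2^n \to \mathbb{F}_2^m$, the linearity and differential uniformity are defined as $\mathrm{lin}(F) = \max_{\alpha \in \mathbb{F}_2^n,\ \beta \in \mathbb{F}_2^m \setminus \{0\}} \left| \sum_{x \in \mathbb{F}_2^n} (-1)^{\langle \alpha, x \rangle \oplus \langle \beta, F(x) \rangle} \right|$ and as the maximum of $\left| \{ x \in \mathbb{F}_2^n : F(x) \oplus F(x \oplus \alpha) = \beta \} \right|$ over all $\alpha \in \mathbb{F}_2^n \setminus \{0\}$ and $\beta \in \mathbb{F}_2^m$, respectively.
… i.e., on all the $x_i$ variables.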
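To make these two properties concrete, here is a minimal brute-force sketch. It assumes the usual definitions: linearity as the maximum absolute Walsh coefficient over all input masks and all nonzero output masks, and differential uniformity as the largest entry of the difference distribution table over nonzero input differences. The function names are hypothetical, and the PRESENT S-box is used only as a well-known 4-bit test case.

```python
def parity(value):
    """Parity of the number of one bits of an integer."""
    return bin(value).count("1") & 1


def linearity(table, n, m):
    """Maximum absolute Walsh coefficient over alpha and nonzero beta."""
    best = 0
    for beta in range(1, 1 << m):
        for alpha in range(1 << n):
            walsh = sum(1 - 2 * (parity(alpha & x) ^ parity(beta & table[x]))
                        for x in range(1 << n))
            best = max(best, abs(walsh))
    return best


def uniformity(table, n, m):
    """Largest difference-distribution-table entry over nonzero input differences."""
    best = 0
    for alpha in range(1, 1 << n):
        counts = [0] * (1 << m)
        for x in range(1 << n):
            counts[table[x] ^ table[x ^ alpha]] += 1
        best = max(best, max(counts))
    return best


PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
IDENTITY_3BIT = list(range(8))
```

For the PRESENT S-box this gives linearity 8 and uniformity 4, while the 3-bit identity map reaches the worst possible values of 8 for both.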

Equivalences
To study properties of vectorial Boolean functions, it is sometimes easier to partition them by a defined equivalence relation for which the studied properties are invariant. Affine equivalence and extended affine equivalence are the most applied ones in studying vectorial Boolean functions.
Definition 4 (Linear, Affine, and Bit Permutation Equivalences). Two n-to m-bit Boolean functions F and G are called linear equivalent if there exist an n-to n-bit linear bijection $L_{in}$ and an m-to m-bit linear bijection $L_{out}$ in such a way that $F = L_{out} \circ G \circ L_{in}$.
In the same way, affine equivalence and bit permutation equivalence are defined. F and G are called affine equivalent if there exist an n-to n-bit affine bijection $A_{in}$ and an m-to m-bit affine bijection $A_{out}$ in such a way that $F = A_{out} \circ G \circ A_{in}$; and they are called bit permutation equivalent if there exist a bit permutation of n bits $P_{in}$ and a bit permutation of m bits $P_{out}$ in such a way that $F = P_{out} \circ G \circ P_{in}$. Note that the n-bit vectorial Boolean function P is called a bit permutation of n bits if it maps $(x_0, \ldots, x_{n-1})$ to $(x_{\sigma(0)}, \ldots, x_{\sigma(n-1)})$ for some permutation σ of $\mathbb{Z}_n$.
It is clear that if F and G are bit permutation equivalent, then they are also linear and affine equivalent. Besides, if they are linear equivalent, then they are also affine equivalent.
Similar to extending the linear equivalence to the affine equivalence, it is possible to extend the bit permutation equivalence to what is called extended bit permutation equivalence [LP07]. This equivalence is the most important one for our work in this paper.
Definition 5 (Extended Bit Permutation Equivalence). Two n-to m-bit Boolean functions F and G are called extended bit permutation equivalent if there exist a bit permutation of n bits $P_{in}$, a bit permutation of m bits $P_{out}$, $\alpha \in \mathbb{F}_2^n$, and $\beta \in \mathbb{F}_2^m$ in such a way that $F(x) = P_{out} \circ G \circ P_{in}(x \oplus \alpha) \oplus \beta$ for every $x \in \mathbb{F}_2^n$.
Note that the affine equivalence covers all the other above-mentioned equivalences. The algebraic degree, linearity and uniformity are example properties of (vectorial) Boolean functions that are invariant over any of these equivalences.
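A member of an extended bit permutation equivalence class can be generated as in the following minimal sketch. It assumes the form F(x) = P_out(G(P_in(x ⊕ α))) ⊕ β; placing the constant α before rather than after the input bit permutation is an assumption of this illustration and yields the same equivalence classes either way. The helper names and the example table are hypothetical, and a bit permutation is given as a list `perm` where output bit i takes input bit perm[i] (bit 0 being the most significant).

```python
def permute_bits(value, perm, n):
    """Apply a bit permutation: output bit i is input bit perm[i] (bit 0 = MSB)."""
    bits = [(value >> (n - 1 - i)) & 1 for i in range(n)]
    out = 0
    for i in range(n):
        out = (out << 1) | bits[perm[i]]
    return out


def apply_ebp(table, n, m, p_in, p_out, alpha, beta):
    """Return the table of F(x) = P_out(G(P_in(x ^ alpha))) ^ beta."""
    return [permute_bits(table[permute_bits(x ^ alpha, p_in, n)], p_out, m) ^ beta
            for x in range(1 << n)]


G_EXAMPLE = [1, 4, 7, 2, 5, 0, 3, 6]  # an arbitrary 3-bit permutation
F_EXAMPLE = apply_ebp(G_EXAMPLE, 3, 3, [2, 0, 1], [1, 2, 0], alpha=5, beta=3)
```

Since bit permutations and constant additions are bijections, the transformed table of a bijective S-box is again a bijection, which the example reflects.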
It is common to consider the lexicographically smallest function in an equivalence class as the representative one. In this paper, we also use the same definition for representatives.
In [BCBP03], Biryukov, De Cannière, Braeken and Preneel presented an efficient algorithm to check if two functions are equivalent together with another algorithm to find the representative in the linear or in the affine equivalence class. Later in [MB19], De Meyer and Bilgin improved the algorithm for mappings of n-to m-bit with m < n. Since in this paper, we deal with the extended bit permutation equivalence and not the linear or affine equivalence, we modify these algorithms according to the properties of extended bit permutation equivalence. While these algorithms perform efficiently for the case of linear and affine equivalences, we experienced that they are not suitable for (extended) bit permutation equivalence. We discuss the details about this problem later in Subsection 5.1.

Implementation of Boolean Functions
To mathematically model the costs for implementing a Boolean circuit for some specific applications, some terms are defined that are known as the complexity of the implementation. In the following, we bring the general definition of these complexities. In all of these definitions, we consider that G is the set of all allowed gates to use, which is usually called the basis of implementation, e.g., all the gates with fan-in number of at most two. 3 The basis must provide completeness property with the meaning that based on the given type of gates in this basis, it is possible to build any (vectorial) Boolean function. The most common basis is G = {XOR, AND, INV}; but, since XOR itself can be realized based on three AND and two INV gates, G ′ = {AND, INV} is also a complete basis.
Definition 6 (Gate Count Complexity). It is the smallest number of gates required to compute the function while type of each used gate must be included in G.
Even though different types of gates have different implementation (area) costs, this definition is typically considered the first simplified estimation for the minimum area cost for hardware implementation of a function. To achieve a reasonable estimation of the area cost, it is common to consider only the gates with a fan-in number of one or two.
Definition 7 (Gate Depth Complexity). It is the minimum value for the longest path (concerning the number of gates used in the path) from any input to any output for implementing the function. Note that type of each used gate must be included in G.
It is clear from its definition that the gate depth complexity of a vectorial Boolean function is the maximum of gate depth complexity for each of its coordinate Boolean functions.
Note that if any gate with any fan-in number is allowed, then every function can be implemented by a circuit with gate depth at most 3, e.g., by using conjunctive or disjunctive normal form expression of the function in which there are one AND, one OR, and probably one INV gates in each path from any input to any output. However, this can lead to a G that is usually not available in practice. Again, it is typical to use only gates with a fan-in of at most 2.
Similar to the gate count complexity, in the case of the gate depth complexity, even though different types of gates have different implementation (delay) costs, this definition is usually considered the first estimation for the minimum delay cost for hardware implementation of a function. But, as we discuss later in detail, this oversimplified metric does not provide an appropriate estimation for the delay cost. It is mainly because of the wide range of delays of different types of gates (even with the same fan-in and fan-out numbers). Therefore, it needs a modification to be used as a close estimation for the delay cost of implementing a function.
Generally, any costs for hardware implementation of (vectorial) Boolean functions are invariant over bit permutation equivalence. This is because the bit permutations are realized by wiring, which means it costs a negligible value at most. Therefore, any two (vectorial) Boolean functions that are different only by a permutation of the input bits and a permutation of the output bits have the same implementation cost and the same hardware complexity in practice.
Moreover, since the cost of an inverter gate (shown by INV and in some literature is called the NOT operation) is comparably smaller than any other gates, it is reasonable to consider that the cost of implementing two (vectorial) Boolean functions which are different with a constant addition in the input or the output, is not very much different. Another reason for this consideration is that each (vectorial) Boolean function is implemented as a combinatorial circuit. Its input and output wires are usually connected to other combinatorial circuits. Hence, the INV gates in the input or output bits can be combined with the gates in the previous or the following combinatorial circuits. Note that if there is a layer of registers or a layer of buffers, such as in the round based or unrolled implementation architectures, respectively, the INV gates in the input and the output of a combinatorial circuit can be combined with the registers or the buffers (by changing the BUF gate to an INV gate). Therefore, by accepting a small tolerance, it is usually considered that the hardware implementation costs are invariant over the extended bit permutation equivalence.
As a result of [BMP08], we know that determining any of these complexities for a Boolean function is considered to be an NP-hard problem. The SAT-solver-based tool by Stoffelen [Sto16] finds a single solution for implementing small S-boxes. But, notice that it does not provide all the solutions for implementing the S-box, or it cannot find the complexity in the case when S-box size is larger than 5.
To use the functionality of the gates in equations, we use the signs $\wedge$, $\vee$, $\overline{\wedge}$, and $\overline{\vee}$ to denote the operation of the AND, OR, NAND, and NOR gates, respectively. Besides, we use ¬f to denote the inverted value for the output of a Boolean function f, and we use $\bar{x}$ to indicate the inverted value of the input x. Moreover, for simplicity, from now on, we do not mention the fan-in number of the gates, unless it is more than 2.

Latency Complexity of Boolean Functions
Due to the wide range of delays of logic gates provided by the applied ASIC library for implementation, it is not realistic to consider the gate depth complexity as a metric for the minimum latency of a Boolean function. For instance, in almost all the libraries, a 2-bit XOR or XNOR gate has a latency of about twice the latency for other gates with a fan-in number of 2. Another example is the difference in the latency of the gates with different fan-in numbers; the latency of the gates with a higher number of fan-ins is larger than the latency for similar gates but with fewer fan-ins. On the other hand, except for the INV gate, whose fan-in number is one, the 2-bit NAND gate and the 2-bit NOR gate have the minimum latency in almost all the ASIC libraries. To make a view of the latency for different gates, we list latency and area of all logic gates (with fan-out number one) in NanGate 15 nm and 45 nm Open Cell Libraries with typical operating conditions in Table 1.
To have a more accurate metric for the latency, we use the gate depth complexity by restricting ourselves to only INV, 2-bit NAND and 2-bit NOR gates. But since the INV gate has comparably lower latency than the other two gates, and more importantly, because of the reason explained later in Proposition 1, we do not consider the INV gates in the gate count of the implementation. We define the latency complexity of a (vectorial) Boolean function as in the following definition.
Definition 8 (Latency Complexity). It is the minimum value for the longest path (concerning the number of only NAND and NOR gates) from any input to any output for implementing the function while the set of allowed gates to use is G = {INV, NAND, NOR}.
Similar to the case for gate depth complexity, the latency complexity of a vectorial Boolean function is the maximum of latency complexity for each of its coordinate functions.
It is noteworthy to mention that using the basis G = {INV, NAND, NOR} is equivalent to using G ′ = {INV, NAND} or G ′′ = {INV, NAND, NOR, AND, OR}. Since both AND and OR are slower than both NAND and NOR, we exclude them from the basis. However, including NOR makes it possible to have a more simple structure for low-latency implementation of a Boolean function that is explained in detail in Proposition 1.
In the following example, we explain how to count the gate depth of implementation when we are computing its latency complexity.
Example 1 (Latency Complexity of MUX2). The circuit shown in Figure 1a is an instance for implementing the function $f(x_0, x_1, x_2) = (x_0 \wedge x_1) \vee (\bar{x}_0 \wedge x_2)$ using the gates in G = {INV, NAND, NOR}. Note that f is a balanced function with the ANF representation $x_0 x_1 \oplus x_0 x_2 \oplus x_2$. Besides, it represents the functionality of a multiplexer (MUX2) in which $x_0$ acts as the selector to choose either $x_1$ or $x_2$.
The circuit implies that $y = \neg\big(\neg(x_0 \,\overline{\wedge}\, x_1) \,\overline{\vee}\, \neg(\bar{x}_0 \,\overline{\wedge}\, x_2)\big)$, and the length of the paths from each input to the output (by counting only NAND and NOR gates) is two. Using the equation $\neg(y_0 \,\overline{\vee}\, y_1) = \bar{y}_0 \,\overline{\wedge}\, \bar{y}_1$, one can simplify the implementation in Figure 1a to $y = (x_0 \,\overline{\wedge}\, x_1) \,\overline{\wedge}\, (\bar{x}_0 \,\overline{\wedge}\, x_2)$, as in the implementation shown in Figure 1b.
Figure 1: Low-latency implementation of a MUX2 and an XOR.
Even though the latency complexity of implementing f is two, to determine it, we need to consider the length of the longest path in all possible implementations for f .
It is an interesting observation that the latency of a MUX2 gate in the typical conditions is usually about 3 times the one for NAND gate (see Table 1). Hence, if the target is to lower the latency, one can use the implementation shown in Figure 1b instead of the MUX2 gate, but then it needs more area compared to the case of using original MUX2 gate. Precisely, depending on the ASIC technology, it needs 15% or 57% more area to reduce the latency by about 30%, which sounds to be a good trade-off.
Example 2 (Latency Complexity of XOR). The XOR function of two variables can be represented as $f(x_0, x_1) = (x_0 \wedge \bar{x}_1) \vee (\bar{x}_0 \wedge x_1)$. Using the same approach as in Example 1, this representation changes to $f(x_0, x_1) = (x_0 \,\overline{\wedge}\, \bar{x}_1) \,\overline{\wedge}\, (\bar{x}_0 \,\overline{\wedge}\, x_1)$ with gate depth of two. The implementation of this function is shown in Figure 1c, which is similar to the one in Figure 1b with an extra INV gate.
The latency complexity of XOR is also two. Considering the fact that the latency of XOR is always more than twice the latency for NAND, it is reasonable to exclude the XOR gate from the gate basis of the latency complexity.
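The two NAND-only forms from Examples 1 and 2 can be verified exhaustively with a minimal sketch such as the following. The function names are hypothetical; each form uses inverters only on primary inputs followed by two levels of 2-input NAND gates.

```python
from itertools import product


def nand(a, b):
    """2-input NAND on bits."""
    return 1 - (a & b)


def inv(a):
    """Inverter on a bit."""
    return 1 - a


def mux2_nand(x0, x1, x2):
    """Two NAND levels: (x0 NAND x1) NAND (INV(x0) NAND x2)."""
    return nand(nand(x0, x1), nand(inv(x0), x2))


def xor_nand(x0, x1):
    """Two NAND levels: (x0 NAND INV(x1)) NAND (INV(x0) NAND x1)."""
    return nand(nand(x0, inv(x1)), nand(inv(x0), x1))


def mux2_reference(x0, x1, x2):
    """x0 selects x1 when it is 1 and x2 when it is 0."""
    return x1 if x0 else x2


mux_ok = all(mux2_nand(*x) == mux2_reference(*x) for x in product((0, 1), repeat=3))
xor_ok = all(xor_nand(a, b) == (a ^ b) for a, b in product((0, 1), repeat=2))
```

Both checks pass on all inputs, which confirms that depth two (counting only NAND and NOR gates) suffices for MUX2 and XOR.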
Proposition 1. Any n-bit Boolean function $f(x_0, \ldots, x_{n-1})$ with latency complexity d can be implemented by a circuit of the structure shown in Figure 2. In this structure, each of the $g_{i,j}$ gates with $0 < i \le d$ and $j \in \mathbb{Z}_{2^{d-i}}$ is either a NAND or a NOR gate, and each input to the gates in the first level is either $a_i$ or its inverted value, $\bar{a}_i$, while each $a_i$ with $i \in \mathbb{Z}_{2^d}$ is chosen from the set $\{x_0, \ldots, x_{n-1}\}$.

Proof.
To prove the proposition, we need to show that any function with latency complexity d can be implemented by 1) using INV gates only in the beginning, i.e., in the depth level 0, and 2) using the output of each gate in the input of only one gate in the next depth level. To show the first part, we use the equations $\neg(y_0 \,\overline{\wedge}\, y_1) = \bar{y}_0 \,\overline{\vee}\, \bar{y}_1$ and $\neg(y_0 \,\overline{\vee}\, y_1) = \bar{y}_0 \,\overline{\wedge}\, \bar{y}_1$.
These equations imply that if there is an INV gate in the output of a NAND or NOR gate, it can be removed from the output and replaced in both inputs by changing the gate type to NOR or NAND gate, respectively. Therefore, in an implementation of a function, if there is an INV gate in the output of a NAND or a NOR gate at depth level i, we can replace it with two INV gates in the depth level of i − 1 by changing the NAND/NOR gate's type. Thus, starting from depth level d, we can bring all the INV gates to the previous depth level and update the implementation. By updating the implementation, we mean changing the corresponding NAND or NOR gate type and removing (if there exist) two repeated INV gates in the depth level i − 1.
To show the second part, assume that the output of a NAND or a NOR gate is used in the input of other gates more than once, i.e., its fan-out number is greater than 1. This can be eliminated by repeating the implementation of the corresponding sub-circuit of this output such that the output of each repetition is used only once. Besides, assume that the output of a gate in the depth level i is used in the depth level i′ with i′ > i + 1. This can also be eliminated by using equations similar to $\neg y \,\overline{\wedge}\, \neg y = y$, i.e., repeating the inverted circuit for y. Therefore, we ensure that the output of a gate in the depth level i is used only once in an input of a gate in the depth level i + 1.
Note that to prove the second part in the above proof, we may repeat implementing a sub-circuit several times, which is not efficient area-wise. We emphasize that the structure of Figure 2 is only for a straightforward representation to implement all the Boolean functions with latency complexity d and not for an optimized low-latency implementation. As it comes later, it helps us to study the latency complexity of Boolean functions. Besides, we describe a method to find the implementation with the lowest latency of a given Boolean function in Subsection 4.4 which uses the implementation following the structure in Figure 2.
In the structure proposed in the previous works, e.g., in [Sto16,BMD + 20], it is considered that the output of gates can be used in the input of several gates and in any of the following depth levels. Unlike our representation, theirs needs more variables and long-expressed equations in the SAT model, which makes solving the SAT-model harder.
As shown in Figure 2, for a low-latency implementation of a function, we only need to use the INV gates once and at the beginning (we call it depth level 0). Since in the unrolled implementation, it is necessary to use a layer of buffers for the input variables of each combinatorial circuit to amplify the voltage of the wires, if the input variable needs to go through an INV gate, there is no need to use such a buffer. This means that, in the depth level 0, each input variable $a_i$ goes through a BUF gate (with the output $a_i$) or through an INV gate (with the output $\bar{a}_i$). Besides, note that the latency of an INV gate is lower than that of a BUF gate. This is the main reason why we do not count the INV gates in the gate count of the longest path.
We emphasize neglecting latency of the depth level 0 is an oversimplification for measuring the latency of a single S-box implementation. In our structure and the ones for previous related works, we consider that for each repeated input variable, one independent wire enters the combinatorial circuit. But, in reality, there are at most two wires for each input variable entering the combinatorial circuit: one goes through a BUF and one through an INV gate, both with a higher fan-out number. This indeed affects and increases the latency of the implementation and should not be neglected. However, without this simplification, it is impossible to present a solution to model this complicated implementation parameter.
We map a NAND gate to 0 and a NOR gate to 1, and we denote by G the vector of the types of all $2^d - 1$ gates. By $\alpha \in \mathbb{F}_2^{2^d}$, we denote whether $a_i$ with $i \in \mathbb{Z}_{2^d}$ goes through an INV gate. More precisely, if α[i] = 1, then $\bar{a}_i$ is used as the input of the gate in depth level 1, otherwise $a_i$ itself. We use $\pi \in \mathbb{Z}_n^{2^d}$ to denote the choice of $x_j$ variables by $a_i$ variables, i.e., $a_i = x_{\pi[i]}$, and by $P : \mathbb{F}_2^n \to \mathbb{F}_2^{2^d}$, we denote the corresponding mapping applied by π, i.e., $P(x) = (x_{\pi[0]}, \ldots, x_{\pi[2^d-1]})$. Besides, we use $I_{i,j}$ (with $0 < i \le d$ and $j \in \mathbb{Z}_{2^{d-i}}$) to denote the corresponding sub-circuit from the inputs $x_0, \ldots, x_{n-1}$ to the output of gate $g_{i,j}$. Therefore, each $I_{i+1,j}$ (with $0 < i < d$ and $j \in \mathbb{Z}_{2^{d-i-1}}$) can be represented by the tuple $(g_{i+1,j}, I_{i,2j}, I_{i,2j+1})$.
To find the latency complexity of an n-bit Boolean function, by knowing that it is not smaller than d, one can try all the possibilities for the structure of Figure 2. To do so, we need to go through all $2^{2^d-1}$ choices for G, all $2^{2^d}$ choices for α, and all $n^{2^d}$ choices for π, which ends up with a total computational complexity of about $2^{2^d \cdot (2+\log_2 n)}$. It is clear that if d > 4, it is impossible to do this computation in practice. Even for d = 4, if n > 4, this computation is not practical.
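The evaluation of a single (G, α, π) tuple, which the exhaustive search above repeats for every candidate, can be sketched as follows. This is only a minimal illustration: the function names, the bottom-up ordering of the gates inside G, and the convention that $x_0$ is the least significant bit of the truth-table index are assumptions made here, not taken from the paper.

```python
def evaluate(G, alpha, pi, x):
    """Evaluate a depth-d NAND/NOR tree on the input bits x.

    G     : 2^d - 1 gate types (0 = NAND, 1 = NOR), listed level by level
            from depth level 1 up to the output gate (assumed ordering).
    alpha : 2^d bits; alpha[i] = 1 means a_i goes through an INV gate.
    pi    : 2^d indices; a_i = x[pi[i]].
    """
    # Depth level 0: pick the variables and apply the optional inverters.
    level = [x[p] ^ a for p, a in zip(pi, alpha)]
    gates = iter(G)
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            u, v = level[j], level[j + 1]
            g = next(gates)
            # NOR for g = 1, NAND for g = 0.
            nxt.append(1 - (u | v) if g else 1 - (u & v))
        level = nxt
    return level[0]


def truth_table(G, alpha, pi, n):
    """Truth table of the implemented n-bit function (x_0 = least significant bit)."""
    return [evaluate(G, alpha, pi, [(t >> k) & 1 for k in range(n)])
            for t in range(2 ** n)]


# Depth-2 example: NAND(NAND(x0, not x1), NAND(not x0, x1)) is the XOR of x0 and x1.
xor_tt = truth_table([0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 1], 2)
```

A brute-force search would call `truth_table` for every tuple and compare the result with the target function, which is exactly why the count of tuples dominates the cost.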
We applied the SAT-solver-based tool presented in [Sto16] for the gate depth optimization. We modified it to find the latency complexity of a Boolean function by replacing the previous model (with more gate types and free wiring style) with the one in the structure of Figure 2. Even though this modification makes the tool faster, it only provides a single solution for low-latency implementation of small Boolean functions, and it cannot find any solutions for full-dependent n-bit Boolean functions with n > 5.

Boolean Functions with Low-Latency Complexity
In this section, we first show that the latency complexity of (vectorial) Boolean functions stays invariant over the extended bit permutation equivalence. Next, we explain several speed-up techniques for the search that computes the latency complexity of a given Boolean function and for the search that finds all Boolean functions with a given latency complexity. Then, we present a general algorithm to find all Boolean functions with a given latency complexity. As a result, we determine the latency complexity of all Boolean functions up to 5 bits, together with the Boolean functions of up to 8 bits with latency complexity at most 4. Afterwards, we present another, faster algorithm to find all possible structures that implement a Boolean function with a gate depth equal to its latency complexity.

Extended Bit Permutation Equivalence
It is clear that a bit permutation or a constant addition in the input or the output of a Boolean function does not change the latency complexity. These properties are explained in detail in the following proposition.
Proposition 2. Let $f_1$ and $f_2$ be two n-bit Boolean functions which are equivalent under extended bit permutation equivalence, i.e., $f_2(x) = f_1(P'(x) \oplus \alpha') \oplus c$ for any $x \in \mathbb{F}_2^n$ with P′ being a bit permutation function with corresponding permutation π′, and $\alpha' \in \mathbb{F}_2^n$, $c \in \mathbb{F}_2$ being constant values. Then the latency complexities of $f_1$ and $f_2$ are the same.
Moreover, if (G, α, π) is one instantiation of implementing $f_1$ in the structure of Figure 2, with P being the mapping applied by π, the corresponding implementation for $f_2$ can be realized by $(G, \alpha \oplus P(\alpha'), \pi \circ \pi')$ if c = 0; and if c = 1, then it can be implemented by $(\overline{G}, \overline{\alpha \oplus P(\alpha')}, \pi \circ \pi')$, where $\overline{G}$ and $\overline{\alpha}$ denote the complement value of G and α, respectively.
Proof. A bit permutation can be realized in hardware by replacing the wires, and a constant addition can be realized by adding INV gates. Therefore, we can modify the implementation for f 1 and use it for f 2 without changing the length of the longest path (by only counting the NAND and NOR gates). Since this is true for any implementation of f 1 , the latency complexity of f 2 is the same as that of f 1 .
To prove the second part, we first consider c = 0. The input of the structure in the case of $f_1$ for a given $x \in \mathbb{F}_2^n$ is P(x), so the corresponding input of the structure for the case of $f_2$ must be $P(P'(x) \oplus \alpha')$. Since P is a linear mapping, $P(P'(x) \oplus \alpha') = P \circ P'(x) \oplus P(\alpha')$. The function $P \circ P'$ can be realized by applying the $\pi \circ \pi'$ mapping. The next part, $P(\alpha')$, can be combined with the INV gates in the depth level of 0. To do this combination, instead of using INV gates corresponding to α, we use INV gates corresponding to $\alpha \oplus P(\alpha')$.
In the case of c = 1, we implement $\neg f_2(x) = f_1(P'(x) \oplus \alpha')$ as explained for the case of c = 0 and then insert an extra INV gate in the output of $\neg f_2$ to realize the implementation of $f_2$. It is possible to replace the INV gate in the depth level of d with two INV gates in the depth level of d − 1 by changing the gate type, i.e., $g_{d,0}$. Repeating this d times, we end up with $2^d$ extra INV gates in the depth level 0 and with all the gate types changed. This means we changed G to $\overline{G}$ and $\alpha \oplus P(\alpha')$ to $\overline{\alpha \oplus P(\alpha')} = \overline{\alpha} \oplus P(\alpha')$.
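The c = 1 argument rests on De Morgan's laws: complementing every gate type and every inverter bit complements the output. A small exhaustive check for d = 2 and n = 2 can be sketched as below; the helper names and the bottom-up gate ordering are assumptions of this sketch, not the paper's code.

```python
from itertools import product


def tree_output(G, alpha, pi, x):
    # Assumed conventions: 0 = NAND, 1 = NOR, gates listed from depth level 1 upward.
    level = [x[p] ^ a for p, a in zip(pi, alpha)]
    gates = iter(G)
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            u, v = level[j], level[j + 1]
            nxt.append(1 - (u | v) if next(gates) else 1 - (u & v))
        level = nxt
    return level[0]


def complement_holds(d, n):
    """Check that (not G, not alpha, pi) implements f xor 1 for every tuple."""
    leaves = 2 ** d
    for G in product((0, 1), repeat=leaves - 1):
        G_bar = [g ^ 1 for g in G]
        for alpha in product((0, 1), repeat=leaves):
            alpha_bar = [a ^ 1 for a in alpha]
            for pi in product(range(n), repeat=leaves):
                for x in product((0, 1), repeat=n):
                    f = tree_output(G, alpha, pi, x)
                    if tree_output(G_bar, alpha_bar, pi, x) != f ^ 1:
                        return False
    return True


check = complement_holds(2, 2)
```

The check only covers a tiny parameter set; the general statement follows from the inductive argument in the proof.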
Proposition 2 suggests that instead of studying all n-bit Boolean functions, it is enough to evaluate the latency complexity of the representative Boolean functions for each equivalence class. Table 2 shows the number of n-bit Boolean functions up to extended bit permutation equivalence for n ≤ 5. There, N 1 denotes the number of all Boolean functions up to the equivalence, N 2 denotes the number of full-dependent Boolean functions up to the equivalence, N 3 denotes the number of balanced Boolean functions up to the equivalence, and N 4 denotes the number of full-dependent and balanced Boolean functions up to the equivalence. By a full-dependent function, we mean the functions in which the output is dependent on all input variables.
To find all the representative functions, we used a technique that is based on the following lemma.
Lemma 1. Let f be an n-bit Boolean function that splits into two halves $f_0$ and $f_1$, both being (n − 1)-bit Boolean functions; i.e., for their corresponding truth tables, we have $T_f = (T_{f_0} \,\|\, T_{f_1})$. If f is an n-bit representative Boolean function in an equivalence, then $f_0$ must be an (n − 1)-bit representative Boolean function in the same equivalence. Besides, $f_0$ must be lexicographically smaller than the representative for $f_1$.
Applying this lemma, for finding all the n-bit representative Boolean functions, we need to use two (n − 1)-bit representative Boolean functions and extend the lexicographically larger one by equivalence. Then, we need to check if the resulting n-bit function is representative. In this case, with the extended bit permutation equivalence, for finding all the n-bit representative Boolean functions, we need to consider about $|N_{1,n-1}|^2 \cdot (n-1)! \cdot 2^n$ possibilities to check if the resulting n-bit Boolean function is representative, where $N_{1,n-1}$ denotes the number of all (n − 1)-bit representative Boolean functions. For example, to find all 5- and 6-bit representative Boolean functions, we need to check the representativeness of about $2^{25}$ and $2^{51}$ functions, respectively.

Possible Speed-Up Techniques
To find all the n-bit Boolean functions with latency complexity d, or to find all possible implementations of a given Boolean function with latency complexity of d, one can compute all possible functions in the structure of Proposition 1. As mentioned before, the computational complexity of this search is about $2^{2^d \cdot (2+\log_2 n)}$. In the following, we explain several techniques to reduce the computational complexity of the search when we want to find 1) all the n-bit Boolean functions with latency complexity d, 2) all possible implementations with the same depth as the latency complexity for a Boolean function. Thereby, we try to remove all the redundant computations through all the possibilities.

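Reduction on the Possibilities for G:
Since the $(g_{d,0}, I_{d-1,0}, I_{d-1,1})$ and $(g_{d,0}, I_{d-1,1}, I_{d-1,0})$ tuples both make the same implementation for $I_{d,0}$, it is enough to check one of these tuples. Furthermore, a similar reduction is valid for implementing the other smaller sub-circuits; i.e., both $(g_{i+1,j}, I_{i,2j}, I_{i,2j+1})$ and $(g_{i+1,j}, I_{i,2j+1}, I_{i,2j})$ tuples make the same implementation for $I_{i+1,j}$. This redundancy can be eliminated by limiting the possibilities for G.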
Note that the multiplication by 2 is because of the number of choices for $g_{i+1,j}$. Hence, the number of reduced possibilities for G after this reduction is equal to 2, 6, 42, $2^{10.8}$, $2^{21.6}$, and $2^{43.3}$ for d equal to 1, 2, . . ., and 6, respectively.
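The listed counts can be reproduced with a short calculation. The sketch below assumes that the number of reduced possibilities at depth i + 1 is twice the number of unordered pairs (with repetition) of depth-i possibilities; this recurrence is inferred from the listed values rather than quoted from the text, and the function name is hypothetical.

```python
from math import comb, log2


def reduced_gate_choices(d):
    """Number of choices for G after removing the input-swap redundancy."""
    count = 2  # depth 1: a single NAND or a single NOR gate
    for _ in range(d - 1):
        # Unordered pair of sub-circuit shapes, times 2 choices for the new gate.
        count = 2 * comb(count + 1, 2)
    return count


counts = [reduced_gate_choices(d) for d in range(1, 7)]
exponents = [round(log2(c), 1) for c in counts]
```

Under this assumption the first three counts are exact integers and the remaining ones match the stated powers of two after rounding to one decimal place.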
Moreover, we know that if (G, α, π) is the corresponding implementation for function f, then we can implement the f ⊕ 1 function by using $(\overline{G}, \overline{\alpha}, \pi)$. Thus, if we are searching for all the Boolean functions with a given latency complexity (up to the extended bit permutation equivalence), we can use this technique to fix the type of a single gate. For simplicity, we fix $g_{d,0}$ to be a NAND gate. We emphasize that this reduction is only valid when we are searching for the Boolean functions with a given latency complexity.
Reduction on the Possibilities for α: Similar to reducing the possibilities for G, we can also reduce the number of possibilities for α. Since the order of inputs to a NAND or a NOR gate does not affect the output, we can reduce the computational complexity by fixing the order of the inputs. Precisely, by the notation of the structure in Figure 2, $a_{2i}$ and $a_{2i+1}$ are the two inputs of the same gate in depth level 1. Swapping these two inputs does not change the implemented function. We can omit this redundant computation by only considering α with α[2i] ≤ α[2i + 1] for any $i \in \mathbb{Z}_{2^{d-1}}$. Therefore, instead of checking all $4^{2^{d-1}} = 2^{2^d}$ possibilities for α, it is enough to check only $3^{2^{d-1}}$ of them.
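The restricted set of α values can be enumerated directly by choosing, for each gate in depth level 1, one of the three admissible inverter patterns. The following is a minimal sketch; the generator name is hypothetical.

```python
from itertools import product


def reduced_alphas(d):
    """Yield every alpha of length 2^d with alpha[2i] <= alpha[2i + 1]."""
    pairs = [(0, 0), (0, 1), (1, 1)]  # (1, 0) is the swapped duplicate of (0, 1)
    for choice in product(pairs, repeat=2 ** (d - 1)):
        yield tuple(bit for pair in choice for bit in pair)


alphas_d3 = list(reduced_alphas(3))
```

For d = 3 this yields $3^4 = 81$ vectors instead of $2^8 = 256$.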
Using these two reductions on G and α, determining all possible implementations with latency complexity d for a given n-bit Boolean function needs checking about $2^{11.7} \cdot n^8$, $2^{23.5} \cdot n^{16}$ and $2^{47} \cdot n^{32}$ possibilities for d = 3, d = 4 and d = 5, respectively. It is clear that if d > 4, it is impossible to do this computation in practice. Even for d = 4, if n > 4, this computation is not practical.

Reduction on the Possibilities for π:
To find all the Boolean functions with a given latency complexity (up to the extended bit permutation equivalence), we can still reduce the number of possibilities for π. The implementations based on (G, α, π) and (G, α, π • π′), with π′ being a permutation of $\mathbb{Z}_n$, build bit permutation equivalent Boolean functions. Therefore, we can reduce the possibilities for π such that if we use π, we do not check any other π • π′ choice. This reduction reduces the computational complexity by a factor of about n!.
Besides, the implementations based on (G, α, π) and (G, α ⊕ P(α′), π), with $\alpha' \in \mathbb{F}_2^n$, build Boolean functions that differ only in a constant addition in the input. Again, we can reduce the possibilities for α such that if we use α, we do not check any other α ⊕ P(α′) choice. Note that using this reduction requires knowing the choice for π. Since we previously reduced the choices for α with the α[2i] ≤ α[2i + 1] restriction for any $i \in \mathbb{Z}_{2^{d-1}}$, this reduction reduces the computational complexity by a factor smaller than $2^n$.
Altogether, even when using all these reductions on the number of possibilities, the search for all the n-bit Boolean functions with latency complexity d (up to the extended bit permutation equivalence) needs to consider more than $\frac{2^{11.7} \cdot n^{8}}{n! \cdot 2^{n}}$, $\frac{2^{23.5} \cdot n^{16}}{n! \cdot 2^{n}}$ and $\frac{2^{47} \cdot n^{32}}{n! \cdot 2^{n}}$ possible (G, α, π) tuples for d = 3, d = 4, and d = 5, respectively. For instance, for finding all the 5-bit Boolean functions with latency complexity 4 and 5, we need to consider more than $2^{49}$ and $2^{109}$ functions, respectively.
Note that the functions built by these tuples are not necessarily representative functions. To achieve the set of representatives, we need to remove the equivalent functions. To do this, we compute the representative for each Boolean function built by these choices and remove the duplicated representatives. We used the simplest method to compute the representative of a Boolean function in the extended bit permutation equivalence: we try all the equivalent Boolean functions by choosing one of the n! bit permutation functions and one of the $2^n$ constants in the input of the function, while we choose the output constant bit in such a way that in the resulting equivalent function, the point 0 is mapped to 0. Hence, computing the representative for each of these Boolean functions costs about $2^n \cdot n!$ operations. Therefore, computing the set of all n-bit representative Boolean functions with latency complexity d needs the same amount of computations as finding all possible implementations for an n-bit Boolean function with the same latency complexity; i.e., $2^{11.7} \cdot n^{8}$, $2^{23.5} \cdot n^{16}$ and $2^{47} \cdot n^{32}$ operations for d = 3, d = 4 and d = 5, respectively.
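The simplest method described above can be sketched as follows. The sketch assumes that the representative is the lexicographically smallest truth table in the class and that $x_0$ is the least significant bit of the truth-table index; both conventions, as well as the function name, are choices made for this illustration.

```python
from itertools import permutations


def representative(tt, n):
    """Smallest truth table equivalent to tt under extended bit permutation equivalence."""
    best = None
    for perm in permutations(range(n)):
        for a in range(2 ** n):
            cand = []
            for t in range(2 ** n):
                # Permute the input bits, then add the input constant a.
                y = 0
                for k in range(n):
                    y |= ((t >> perm[k]) & 1) << k
                cand.append(tt[y ^ a])
            # Choose the output constant so that the point 0 is mapped to 0.
            if cand[0] == 1:
                cand = [b ^ 1 for b in cand]
            cand = tuple(cand)
            if best is None or cand < best:
                best = cand
    return best


rep_and = representative([0, 0, 0, 1], 2)
rep_or = representative([0, 1, 1, 1], 2)
rep_xor = representative([0, 1, 1, 0], 2)
```

The loop visits n! permutations and $2^n$ input constants, matching the stated cost of about $2^n \cdot n!$ candidate functions per representative. In the example, AND and OR fall into the same class, while XOR does not.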
To remove the equivalent Boolean functions, we also tried the approach introduced in [MB19] for determining if two functions are equivalent. We modified their method for the case of extended bit-permutation equivalence instead of the affine equivalence. However, due to the difference in the corresponding equivalences (the affine equivalence is a stronger one than the extended bit-permutation equivalence), for our application in this paper, we find the method in [MB19] slower than the simplest method. We emphasize that their method is only slower here, where we are removing the equivalent Boolean functions up to the extended bit-permutation equivalence. In the case of removing the equivalent n-bit to m-bit vectorial Boolean functions, with a large value for n and a small value for n − m, the modification of the [MB19] method is significantly faster than the simple approach.

Finding Boolean Functions with Low-Latency Complexity
In this subsection, we present an efficient algorithm to find all the n-bit Boolean functions with latency complexity d.
Algorithm 1: Computing F n,d , the set of all full-dependent n-bit Boolean functions with latency complexity d up to the extended bit-permutation equivalence.
Compute the corresponding bit-permutation function P .
To find all the n-bit Boolean functions with latency complexity d, we only need to find them up to the extended bit permutation equivalence. Assume that for each d ′ < d and n ′ ≤ n, we already have the set of all representative and full-dependent n ′ -bit Boolean functions with latency complexity d ′ . We denote each of these sets by F n ′ ,d ′ .
If f is an n-bit Boolean function with latency complexity d, then there exist two Boolean functions $f_0$ and $f_1$ with latency complexity $d_0$ and $d_1$, respectively, such that $f = \overline{f_0 \wedge f_1}$ or $f = \overline{f_0 \vee f_1}$. Since we are only finding the Boolean functions up to extended bit permutation equivalence, it is enough to only consider one of the $f_0 \wedge f_1$ and $f_0 \vee f_1$ cases; here we use $f_0 \wedge f_1$.
Assume that $f_0$ and $f_1$ are $n_0$- and $n_1$-bit full-dependent Boolean functions, respectively. Then we know that for each $k \in \mathbb{F}_2$, there exist $f_k^* \in F_{n_k,d_k}$, a bit permutation function $P_k$ with corresponding permutation $\pi_k \in \mathbb{Z}_n^{n_k}$ such that for i and j with $0 \le i < j < n_k$, we have $\pi_k(i) \ne \pi_k(j)$, $\alpha_k \in \mathbb{F}_2^{n_k}$, and $c_k \in \mathbb{F}_2$ which form $f_k$, i.e., $f_k(x) = f_k^*(P_k(x) \oplus \alpha_k) \oplus c_k$. The restriction on $\pi_k$ is because the Boolean functions are full-dependent, and therefore $\pi_k$ must be a permutation of $\mathbb{Z}_{n_k}$. As explained previously in the speed-up techniques, we can reduce the choices for both $\pi_k$ permutations and both $\alpha_k$ constants. For simplicity, we choose $\pi_0 = (0, \ldots, n_0 - 1)$ and $\alpha_0 = 0$. Besides, we put a restriction on $\pi_1$ that for $0 \le i < j < n_1$, if $n_0 \le \pi_1[i]$ and $n_0 \le \pi_1[j]$, then $\pi_1[i] < \pi_1[j]$. Moreover, we restrict $\alpha_1$ in a way that for $0 \le i < n_1$, if $n_0 \le \pi_1[i]$, then $\alpha_1[i] = 0$.
These reductions on building $F_{n,d}$ decrease the number of possibilities for $\pi_1$ to $\frac{n_0! \cdot n_1!}{(n - n_0)! \cdot (n - n_1)! \cdot (n_0 + n_1 - n)!}$ and for $\alpha_1$ to $2^{n_0+n_1-n}$. All together, if we consider all the cases for $n_0$, $n_1$, $d_0$ and $d_1$, the computational complexity of this algorithm to find all the n-bit Boolean functions with latency complexity d up to the extended bit permutation equivalence is about We recall that the set of Boolean functions built by each of these choices is not the same as $F_{n,d}$. To achieve $F_{n,d}$, we need to remove the equivalent functions. We compute the representatives for each of these Boolean functions, which costs about $2^n \cdot n!$ operations for each function. A pseudo-code for our approach is provided in Algorithm 1.
For comparison, in the case of 5-bit Boolean functions, for latency complexity 4 and 5, we need to consider about $2^{29}$ and $2^{48}$ functions, respectively. Therefore, to compute $F_{5,4}$ and $F_{5,5}$, we need to do about $2^{41}$ and $2^{60}$ computations, respectively. We recall that using the method in Subsection 4.1 with all possible reductions applied, these searches need to consider more than $2^{49}$ and $2^{109}$ functions, while for determining only the representative functions we need about $2^{61}$ and $2^{121}$ computations, respectively.
Similarly, the computational costs of $F_{6,4}$, $F_{7,4}$, $F_{8,4}$ and $F_{6,5}$ are about $2^{49}$, $2^{56}$, $2^{61}$ and $2^{82}$, respectively. Using the above algorithm, we compute all the Boolean functions in $F_{n,d}$ for n ≤ 5 with all d values and for n ≤ 8 with d ≤ 4. However, in the case of $F_{8,4}$, we only take the balanced Boolean functions and then compute their representatives. This makes it possible to reduce the computation cost, and we can find all the balanced Boolean functions.
The number of functions in F n,d sets is shown in Table 3. Besides, we partition the balanced Boolean functions in F n,d with respect to their linearity. Table 6 and Table 5 show the number of full-dependent n-bit balanced Boolean functions categorized with their linearity or ANF degree, and latency complexity.

Finding All Possible Implementations of a Boolean Function
In the following, we present an efficient algorithm to find all the possible implementations of an n-bit full-dependent Boolean function whose latency complexity is d, for the cases that either n ≤ 5, or n ∈ {6, 7, 8} with d ≤ 5. Note that here, we are only interested in the implementations that have the minimum depth in the basis of {NAND, NOR, INV}. Precisely, we are only looking for the implementations with structure in Figure 2 for a depth of d. We recall that to do this, we only need to find structures up to the reductions explained in Subsection 4.2.
For simplicity, we assume that we know the latency complexity of the given Boolean function. Note that this assumption is based on the fact that we can compute the representative of the given n-bit full-dependent Boolean function with $n! \cdot 2^n$ computations and then search whether it is in $F_{n,d}$. For the cases with n ∈ {6, 7, 8} and d > 4, for which we do not have the corresponding $F_{n,d}$ sets, we assume d = 5 and try to find a structure for its implementation; if this is not possible, we conclude that d is higher than 5.
For starting the algorithm, we assume that for each d ′ < d, the set of all n-bit Boolean functions with latency complexity d ′ is already computed. We denote these sets with F * d ′ . Note that these functions are not necessarily full-dependent. It means F * d ′ involves all equivalent Boolean functions for each representative function from F n,d ′ , or each function from F n ′ ,d ′ extended to n bits.
If f is the given n-bit Boolean function with latency complexity d, then there exist two Boolean functions $f_0$ and $f_1$, both with latency complexity less than d and such that $f = \overline{f_0 \wedge f_1}$ or $f = \overline{f_0 \vee f_1}$. We use the following lemma to find all potentially possible $f_0$ and $f_1$ functions to build the structure of f with the minimum depth.
Lemma 2. Let f, $f_0$ and $f_1$ be three n-bit Boolean functions such that In other meaning, for the case
For the first step of the algorithm, by applying the above lemma, we consider each n-bit Boolean function with depth less than d as a potential candidate for $f_0$ and $f_1$, and check if the candidate function, g, satisfies the ¬f ∧ g = ¬f or the ¬f ∨ g = ¬f condition. If so, depending on which condition is fulfilled, we add the function to one of the $A_\wedge$ or $A_\vee$ sets. We provide a pseudo-code of this approach in Algorithm 2.
Since for a representative n′-bit full-dependent Boolean function, there can be at most $2 \cdot 2^{n'} \cdot n'! \cdot \binom{n}{n'}$ equivalent n-bit Boolean functions, the computation cost to complete each of the $A_\wedge$ and $A_\vee$ sets is about That is for d = 4, computation costs are about $2^{16}$, $2^{20}$, $2^{24}$, $2^{28}$ and $2^{31}$, for n = 4, n = 5, n = 6, n = 7 and n = 8, respectively, and for d = 5, these numbers are about $2^{16}$, $2^{30}$, $2^{41}$, $2^{48}$ and $2^{25} \cdot |F_{8,4}|$. Note that this computation cannot exceed $2^{2^n}$, and this upper bound is only reached for d values which are close to the maximum possible latency complexity for the corresponding n. For instance, for n = 4, it happens in the case of d ∈ {4, 5} and for n = 5 it happens only if d = 6.
To reduce the memory usage and make the computation efficient, instead of computing and saving the $F^*_{d'}$ sets, we can compute each element of these sets and check if it is a valid candidate for building up the given function f. Moreover, we do not need to compute the function from $F^*_{d'}$ completely. We only need to compute the function output for the entries that are needed for the corresponding condition check, and as soon as it does not fulfill the condition for one entry, we reject the function. Hence, the memory usage of this computation is only that of saving all of the $F_{n',d'}$ sets together with the $A_\wedge$ and $A_\vee$ sets.
Otherwise, if there is no such pair of functions, we conclude that the latency complexity of f is higher than d. Note that this only happens for d = 5 with n ∈ {6, 7, 8}.
In this recursive algorithm, the bottleneck of our computations is computing A ∧ and A ∨ for f function, i.e., in the depth level d. For other depth levels, even if we have several (f 0 , f 1 ) pairs to find their structure, the computations are comparably smaller.
Note that for the given function, there might be several solutions for representing it with the structure of Figure 2, and the solution is not necessarily unique. Besides, with the above-mentioned algorithm to find possible structures of the given function, there might be cases where the suggested solution does not follow the structure shown in Figure 2. This happens in the cases where the given function f with latency complexity d can be built with two functions $f_0$ and $f_1$ with latency complexities of $d_0 = d - 1$ and $d_1 < d - 1$, respectively. In this case, the sub-structure for $f_1$ has gate depth $d_1$ and is shorter than the one for function $f_0$. Clearly, it does not follow the structure of Figure 2, but it is a simplification of the structure for implementing it.

Latency Complexity of Known S-boxes
In this subsection, we compute the latency complexity of previously introduced S-boxes in the cryptographic primitives and compare their latency complexity together with their cryptographic properties: linearity, uniformity, and algebraic degree. We listed all these S-boxes together with the latency complexity of each coordinate function of the S-boxes in Table 7 that are categorized first by the input size of the S-boxes, and then by their uniformity and linearity. It also shows the algebraic degree (of ANF representation) for each coordinate of the S-boxes and their inverse S-boxes (in the case of bijective ones). Note that in the case of n-bit S-boxes with n ∈ {6, 7, 8}, if the latency complexity of a coordinate is 5 or higher, we denote it by x.
In the case of 3-bit S-boxes, all the S-boxes have a latency complexity of 3; and for 4-bit S-boxes, regardless of their cryptographic properties, their latency complexity is either 4 or 5, except for $\chi_4$ (which is a non-bijective S-box) and the Midori-$s_0$ S-box, whose latency complexity is 3.
Within 5-bit S-boxes, $\chi_5$, also known as the KECCAK S-box, has the minimum latency complexity, 3, with uniformity 8 and linearity 16. However, the S-boxes that achieve the minimum uniformity and linearity, such as the Fides-5 S-box, have a latency complexity of 5.
For the case of 6-bit S-boxes, the only previously introduced S-boxes with a low-latency complexity are the non-bijective χ 6 function with latency complexity 3, uniformity 16 and linearity 32, and Speedy S-box with latency complexity 4, uniformity 8 and linearity 24.
For 7-bit and 8-bit S-boxes, there is no S-box with latency complexity less than 5, except χ 7 (with uniformity 32 and linearity 64) and χ 8 (non-bijective and with uniformity 64 and linearity 128) that both have latency complexity of 4. As results show, there are very few n-bit S-boxes that are optimized for their latency, especially when n > 4.
With this motivation, using the Boolean functions with low-latency complexity found in the previous subsections, we build and introduce some new bijective S-boxes with a low-latency complexity in the following section.

Bijective S-boxes with a Low-Latency Complexity
To build an n-bit bijective S-box with latency complexity of d, as the S-box's coordinate functions, we can use all the balanced Boolean functions equivalent to one of the representative balanced functions in one of the $F_{n',d'}$ sets with d′ ≤ d and n′ ≤ n. One can restrict to only those S-boxes for which each of the coordinate functions is a full-dependent Boolean function, i.e., each output bit of the S-box is dependent on all of the input bits. Another restriction can be to put a limit on the linearity and the uniformity of the S-boxes or on the algebraic degree of the coordinate functions. Note that these restrictions make it possible to find cryptographically stronger S-boxes.
Assume that $F^*$ is the set of all representative functions that satisfy our limits for the target S-boxes, e.g., n-bit S-boxes with latency complexity of d, linearity of at most ℓ and uniformity of at most u. Then any n-bit S-box of our target can be formed as $S = (f_0, \ldots, f_{n-1})$ such that for all $i \in \mathbb{Z}_n$, we have $f_i(x) = f_i^*(P_i(x) \oplus \alpha_i) \oplus c_i$ with $f_i^* \in F^*$. Searching through all the possibilities for $f_i^*$, $P_i$, $\alpha_i$, and $c_i$, we need to consider $(|F^*| \cdot n! \cdot 2^{n+1})^n$ cases. By fixing $\alpha_0 = 0$, $P_0$ to the identity function, and $c_i = 0$ for all $i \in \mathbb{Z}_n$, we can find all the S-boxes up to the extended bit permutation equivalence. Besides, due to the bit permutation in the output bits, we can also fix the order of the coordinate functions, e.g., for each i < j, we fix $f_i$ to be lexicographically smaller than $f_j$. Then the computational complexity of the search is reduced to about $|F^*|^n \cdot (n!)^{n-2} \cdot 2^{n^2-n}$. For instance, to build 6-bit S-boxes, the complexity of this search is about $2^{68} \cdot |F^*|^6$. Even if we restrict ourselves to full-dependent S-boxes with latency complexity of 4 and linearity of 16, then $|F^*| = 1546$ (see Table 6); therefore, we need to consider about $2^{131}$ possibilities. In the following, we present an algorithm to reduce the computational complexity of this search.

Step-By-Step Method for Building Bijective S-boxes
We use the property of the bijective S-boxes together with the definition for linearity.

Lemma 3.
For an n-bit bijective S-box S with linearity ℓ, each of its component functions, namely ⟨α, S⟩ with α ∈ F n 2 \ {0}, is balanced and has a linearity of at most ℓ.
Using Lemma 3 makes it possible to filter out some of the possibilities, only by having some of the coordinate functions. Precisely, assume that $f_0^*$ and $f_1$ are already chosen, then without choosing other coordinate functions, we can check for balancedness and linearity of $f_0^* \oplus f_1$. If $f_0^* \oplus f_1$ is balanced and has a linearity at most ℓ, then we choose the third coordinate function, $f_2$. Again, we can check for balancedness and linearity of $f_0^* \oplus f_2$, $f_1 \oplus f_2$, and $f_0^* \oplus f_1 \oplus f_2$. Continuing in this way, after choosing the last coordinate function, $f_{n-1}$, we can check for balancedness and linearity of the other $2^{n-1} - 1$ component functions. If these $2^{n-1} - 1$ conditions are met, then we have a bijective S-box with linearity at most ℓ, and we can compute its uniformity.
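The per-step filter can be sketched as below. The sketch assumes that linearity is the maximum absolute Walsh coefficient of the function, that truth tables are lists of bits with $x_0$ as the least significant index bit, and that the new coordinate is itself already balanced with linearity at most ℓ; the function names are hypothetical.

```python
def walsh_linearity(tt):
    """Maximum absolute Walsh coefficient, via the fast Walsh-Hadamard transform."""
    w = [1 - 2 * b for b in tt]
    h = 1
    while h < len(w):
        for i in range(0, len(w), 2 * h):
            for j in range(i, i + h):
                w[j], w[j + h] = w[j] + w[j + h], w[j] - w[j + h]
        h *= 2
    return max(abs(v) for v in w)


def is_balanced(tt):
    return 2 * sum(tt) == len(tt)


def new_components_ok(chosen, f, ell):
    """Check f xor c for every non-empty XOR-combination c of the chosen coordinates."""
    combos = [[0] * len(f)]
    for g in chosen:
        combos += [[c ^ b for c, b in zip(combo, g)] for combo in combos]
    for combo in combos[1:]:  # skip the empty combination
        comp = [c ^ b for c, b in zip(combo, f)]
        if not is_balanced(comp) or walsh_linearity(comp) > ell:
            return False
    return True


# 3-bit example: f0 = x0*x1 xor x2, candidate f1 = x0.
f0 = [0, 0, 0, 1, 1, 1, 1, 0]
x0 = [0, 1, 0, 1, 0, 1, 0, 1]
accepted = new_components_ok([f0], x0, 4)
```

With i coordinates already chosen, the loop visits exactly $2^i - 1$ new component functions, so a candidate is rejected as soon as one of them is unbalanced or too linear.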
Assuming that the average probability of satisfying all $2^i - 1$ conditions over all possible choices for $f_i$ is $p_i$, then the computational complexity of this search is about where $N_f \approx |F^*| \cdot n! \cdot 2^n$ and the first division by n! is because the coordinate functions can be ordered due to the output bit permutation. Note that without using this step-by-step choosing of the coordinate functions, the complexity of the search is about $|F^*| \cdot N_f^{n-1} / n!$, which is significantly larger than when we choose the coordinate functions step-by-step.
Moreover, we use the following technique to omit a dominant part of the computations. After choosing $f_0^*$ in step 0, we compute the set of possible choices for $f_1$, $F_1^\dagger(f_0^*) = \{ f = f^*(P(x) \oplus \alpha) \mid f^* \in F^*, \alpha \in \mathbb{F}_2^n, P : \text{bit permutation}, f \oplus f_0^* \text{ fulfills all the conditions} \}$. By fulfilling the conditions by function g, we mean that g is balanced and lin(g) ≤ ℓ. We know that not only $f_1 \in F_1^\dagger(f_0^*)$, but also all other coordinate functions must be in this set; i.e., for each 1 ≤ i < n, $f_i \in F_1^\dagger(f_0^*)$. Note that determining $F_1^\dagger(f_0^*)$, for a given $f_0^*$, needs $N_f$ computations and it includes $N_f \cdot p_1$ Boolean functions.
Therefore, to build the S-box, in step 0, we choose $f_0^* \in F^*$ and compute $F_1^\dagger(f_0^*)$. In step 1, after choosing $f_1 \in F_1^\dagger(f_0^*)$, we compute the set of possible choices for $f_2$, $F_2^\dagger(f_0^*, f_1) = \{ f \in F_1^\dagger(f_0^*) \mid f \oplus f_1 \text{ and } f \oplus f_1 \oplus f_0^* \text{ fulfill all the conditions} \}$. Note that since we only check for $f \in F_1^\dagger(f_0^*)$, it already fulfills the conditions for $f \oplus f_0^*$. Again, we know that not only $f_2 \in F_2^\dagger(f_0^*, f_1)$, but also all the next coordinate functions must be in this set; i.e., for each 2 ≤ i < n, $f_i \in F_2^\dagger(f_0^*, f_1)$. Determining $F_2^\dagger(f_0^*, f_1)$, for given $f_0^*$ and $f_1$, needs $N_f \cdot p_1$ computations and it includes $N_f \cdot p_1 \cdot p_2$ Boolean functions.
In step 2, after choosing $f_2 \in F_2^\dagger(f_0^*, f_1)$, we compute the set of possible choices for $f_3$, $F_3^\dagger(f_0^*, f_1, f_2) = \{ f \in F_2^\dagger(f_0^*, f_1) \mid f \oplus f_2,\ f \oplus f_2 \oplus f_0^*,\ f \oplus f_2 \oplus f_1 \text{ and } f \oplus f_2 \oplus f_1 \oplus f_0^* \text{ fulfill all the conditions} \}$. Again, since we only check for $f \in F_2^\dagger(f_0^*, f_1)$, it already fulfills the conditions for $f \oplus f_0^*$, $f \oplus f_1$, and $f \oplus f_1 \oplus f_0^*$. Besides, we know that not only $f_3 \in F_3^\dagger(f_0^*, f_1, f_2)$, but also all the next coordinate functions must be in this set; i.e., for each 3 ≤ i < n, $f_i \in F_3^\dagger(f_0^*, f_1, f_2)$. Determining $F_3^\dagger(f_0^*, f_1, f_2)$, for given $f_0^*$, $f_1$ and $f_2$, needs $N_f \cdot p_1 \cdot p_2$ computations and it includes $N_f \cdot p_1 \cdot p_2 \cdot p_3$ Boolean functions.
We continue in this way until we choose all the coordinate functions for the S-box and fulfill the conditions. Therefore, the built S-box is a bijection with linearity at most ℓ. Then, we can check the condition on the S-box's uniformity. On average, this technique reduces the computational complexity of building a bijective S-box to Clearly, the modification explained above makes the step-by-step algorithm much faster than its simpler version. We provide a pseudo-code for our new method of building S-boxes in Algorithm 3. This algorithm, in the simplest mode, needs to save all $N_f \approx |F^*| \cdot n! \cdot 2^n$ Boolean functions in the beginning, to reduce the redundant computations in the next steps. However, it is also possible to only save the Boolean functions in the set $F_1^\dagger(f_0^*)$ for each $f_0^* \in F^*$. This way, we need to save only about $N_f \cdot p_1$ Boolean functions, significantly fewer than in the previous way, but on the other hand, we need to repeat computing all $N_f$ equivalent functions $|F^*|$ times. It is noteworthy that the value of $p_1$, which is not easy to compute, is strongly related to the target properties for the S-boxes we are searching for and also to the functions in $F^*$.
Using the Upper Limit on the Uniformity: Similar to using the upper limit on the S-box's linearity, we can use the limit on the S-box's uniformity in the intermediate steps of the algorithm.

Lemma 4.
For an n-bit S-box $S = (f_0, \dots, f_{n-1})$ with uniformity $u$, the uniformity of the sub-S-box $S'_i = (f_0, \dots, f_i)$ with $i < n$ is upper bounded by $\min\{u \cdot 2^{n-i-1}, 2^n\}$.

Applying this lemma, in step $i$ of the step-by-step algorithm, after choosing the coordinate function $f_i$, it is possible to check the uniformity of the sub-S-box $S'_i = (f^*_0, f_1, \dots, f_i)$. Note that for small values of $i$, this condition does not filter the choices for the coordinate function. For instance, when $i = 0$, the uniformity of $f^*_0$ is limited by $\min\{u \cdot 2^{n-1}, 2^n\} = 2^n$, which is a trivial condition that we do not need to check. For $i = 1$, the uniformity of $(f^*_0, f_1)$ is limited by $\min\{u \cdot 2^{n-2}, 2^n\}$, which is non-trivial only if $u = 2$ (the bound is then $2^{n-1}$); i.e., it is meaningful to check this condition if we are looking for APN S-boxes.
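As a sanity check of the bound in Lemma 4, the following sketch computes the uniformity of every sub-S-box of a sample 4-bit permutation and compares it with $\min\{u \cdot 2^{n-i-1}, 2^n\}$. The sample table (the PRESENT S-box) and the convention that $f_0$ is the least significant output bit are assumptions made only for this illustration; the function names are hypothetical.

```python
def uniformity(table, n):
    """Differential uniformity of a function given as a lookup table on n input bits."""
    best = 0
    for a in range(1, 1 << n):
        counts = {}
        for x in range(1 << n):
            b = table[x] ^ table[x ^ a]
            counts[b] = counts.get(b, 0) + 1
        best = max(best, max(counts.values()))
    return best


def sub_sbox(table, i):
    """Keep only the coordinates f_0..f_i (assumed to be the i+1 low-order output bits)."""
    mask = (1 << (i + 1)) - 1
    return [y & mask for y in table]


def lemma4_bound(u, n, i):
    return min(u * 2 ** (n - i - 1), 2 ** n)


# Sample data only: the 4-bit PRESENT S-box.
SAMPLE = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
          0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
n = 4
u = uniformity(SAMPLE, n)
bounds_hold = all(
    uniformity(sub_sbox(SAMPLE, i), n) <= lemma4_bound(u, n, i)
    for i in range(n)
)
```

In a step-by-step search, the same comparison would be used in the other direction: a partial S-box whose uniformity exceeds the bound for the target $u$ can be discarded early.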

Comparison to the Previous Method of Building S-box
In [MB19], the authors used an algorithm to classify n-to-m-bit quadratic balanced Boolean functions with n ≤ 6 up to affine equivalence. This algorithm is the most widely applied one for building S-boxes up to a given equivalence, e.g., [Can07]. Pseudo-code for it is provided in Algorithm 4.
In this algorithm, let F * be the set of all n-bit representative Boolean functions, under the chosen equivalence, that we want to use to build bijective n-bit S-boxes. As the first step, we choose f * 0 from F * and f 1 from F, the set of all functions that are equivalent to one of the functions in F * . Together, these two functions form an n-to-2-bit function (f * 0 , f 1 ) that can be checked for the chosen criteria (if there are any), e.g., balancedness and linearity. Trying this for all choices of f * 0 and f 1 , we obtain a set of n-to-2-bit functions, some of which are equivalent to each other. We remove the equivalent ones, keep one function from each equivalence class as the representative of that class, and put it in a set called R 2 . Note that here, we do not need to take the lexicographically smallest function as the representative.

Algorithm 3: The new method of building n-bit bijective S-boxes.
Data: F * // the set of all n-bit representative balanced Boolean functions with linearity of at most ℓ (and extra criteria such as their latency complexity)
Result: R // the set of all n-bit representative bijective S-boxes with linearity of at most ℓ and uniformity of at most u
. . .
In the second step, we use one function from R 2 and another function from F. Together, these two functions form an n-to-3-bit function that can be checked for the criteria. Again, we try this for all choices of those two functions, which gives a set of n-to-3-bit functions, some of which are equivalent to each other. By removing the equivalent ones and keeping only one per equivalence class, we build a set called R 3 .
Similarly, we perform another n − 3 steps to build R 4 , . . . , R n , the last of which is the set of all n-bit S-boxes up to the equivalence. Note that this algorithm can be applied to different equivalences and not only to linear or affine equivalence.
While this method is efficient for classifying n-to-m-bit Boolean functions under linear or affine equivalence, it is not that efficient for our needs. The main reason is that the equivalence applied there differs from the one applied here. Extended bit permutation equivalence is covered by affine equivalence. Hence, the number of equivalence classes under extended bit permutation equivalence is larger than the number of classes under affine equivalence. Indeed, there is a big difference between these numbers. While the number of n-to-m-bit Boolean functions belonging to an extended bit permutation equivalence class is about $2^{n+m} \cdot n! \cdot m!$ (for simplicity, we do not consider self-equivalent Boolean functions here), under affine equivalence this number is about $2^{n+m} \cdot \prod_{i=0}^{n-1}(2^n - 2^i) \cdot \prod_{i=0}^{m-1}(2^m - 2^i)$. Therefore, the ratio of the probability that two randomly chosen n-to-m-bit Boolean functions are extended bit permutation equivalent to the probability that they are affine equivalent is quite small, about $n! \cdot m! \cdot 2^{-n^2-m^2+2}$. For example, for n = m = 4, this ratio is about $2^{-20}$, and it gets smaller for larger values of n. We emphasize that this estimate does not apply exactly to our case: it assumes randomly chosen Boolean functions, whereas we are dealing with balanced Boolean functions with a limit on their linearity, and it does not consider self-equivalent functions. However, even in our case, the ratio of these two probabilities is very small. Therefore, we find the new method more suitable and efficient for finding bijective S-boxes with small latency complexity. Then, if any S-box can be built, we apply another algorithm to reduce the S-boxes up to the extended bit permutation equivalence. This reduction algorithm is based on the one explained and given in [BCBP03,MB19] for linear and affine equivalence.
We modified that algorithm for the case of extended bit permutation equivalence and made it efficient by exploiting properties of this equivalence that generally do not hold for linear or affine equivalence. For comparison, we applied both methods to build 5-bit S-boxes with latency complexity 3, linearity 16, and uniformity at most 8. For this, we used the balanced Boolean functions in $F_{5,3}$, together with the extensions of the Boolean functions in $F_{4,3}$ and $F_{3,2} \cup F_{3,3}$ to 5-bit Boolean functions. Altogether, up to the bit permutation equivalence, there are only 10 balanced Boolean functions with linearity 16. Using these Boolean functions and their equivalents, our algorithm builds about six thousand S-boxes with linearity 16 and uniformity 6 or 8, which reduce to only 509 S-boxes up to the equivalence. Finding the S-boxes takes 2 minutes, and removing the equivalent S-boxes takes another 2 minutes. We also applied the previous method to build such S-boxes, which takes about 96 minutes. Note that for both computations, we used a single thread on an Intel i7-10610U CPU @ 1.80 GHz.
As the example shows, our method is more suitable for the extended bit permutation equivalence. We should mention that if the number of starting coordinate functions increases (here it was 10), or we search for larger S-boxes, the efficiency of our method increases.
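To make the class-size ratio used in the comparison above concrete, the following sketch evaluates it for given n and m. The affine class size is taken as $2^{n+m}$ times the orders of the general linear groups on n and m bits, which is the standard count and is assumed here to be the quantity the comparison refers to; self-equivalences are ignored, as in the text, and the function names are hypothetical.

```python
from math import factorial, log2, prod


def ebp_class_size(n, m):
    """Approximate size of an extended bit permutation equivalence class."""
    return 2 ** (n + m) * factorial(n) * factorial(m)


def affine_class_size(n, m):
    """Approximate size of an affine equivalence class (assumed standard count)."""
    def gl_order(k):
        # Number of invertible k x k binary matrices.
        return prod(2 ** k - 2 ** i for i in range(k))
    return 2 ** (n + m) * gl_order(n) * gl_order(m)


def log2_ratio_exact(n, m):
    return log2(ebp_class_size(n, m)) - log2(affine_class_size(n, m))


def log2_ratio_approx(n, m):
    """log2 of the closed-form approximation n! * m! * 2^(-n^2 - m^2 + 2)."""
    return log2(factorial(n) * factorial(m)) - n * n - m * m + 2


exact_44 = log2_ratio_exact(4, 4)
approx_44 = log2_ratio_approx(4, 4)
```

For n = m = 4, both values lie close to the exponent of about −20 quoted above, with the closed form slightly below the exact count.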

Results on the Bijective S-boxes with Low-Latency Complexity
To build low-latency S-boxes, we start with the lowest possible limit on the linearity or uniformity of the S-box. If it is not possible to build any S-boxes with such criteria, we increase the target linearity and uniformity and repeat the search for building S-boxes. This way, we can find the best possible S-boxes with respect to linearity and/or uniformity.
In the following, we report the best results with respect to their linearity and uniformity for building n-bit S-boxes. Note that the list for all of these S-boxes is presented in [Ras22]. We emphasize that we only search for the bijective S-boxes and the reported number of Boolean functions and the number of S-boxes are up to the extended bit permutation equivalence. Also, we use F n,d,ℓ as the set of all n-bit full-dependent balanced Boolean functions with a latency complexity at most d and linearity of at most ℓ.

3-bit S-boxes
There is one function in F 3,2,4 which is equivalent to a multiplexer (see Example 1). Using this function, it is possible to build two S-boxes with linearity 8 and uniformity 4.
For latency complexity 3, since all the 3-bit balanced Boolean functions with linearity 4 are included in $F_{3,3,4}$, we can build all the bijective 3-bit S-boxes with linearity 4 and uniformity 2, which are affine equivalent to the inversion in $\mathbb{F}_{2^3}$. Up to the extended bit permutation equivalence, there are 7 such S-boxes. Note that for the last S-box in this list, all the coordinate functions are equivalent up to the extended bit permutation equivalence.
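The properties quoted for the inversion in $\mathbb{F}_{2^3}$ can be checked directly. The sketch below assumes the field is represented modulo $x^3 + x + 1$ and that 0 is mapped to 0; both are conventions chosen for this illustration, and the helper names are hypothetical.

```python
N = 3
MOD = 0b1011  # x^3 + x + 1, an assumed choice of irreducible polynomial


def gf_mul(a, b):
    """Multiplication in GF(2^3) modulo MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << N):
            a ^= MOD
    return r


def gf_inv(a):
    """Field inverse, with the convention 0 -> 0."""
    if a == 0:
        return 0
    return next(b for b in range(1, 1 << N) if gf_mul(a, b) == 1)


def parity(v):
    return bin(v).count("1") & 1


def linearity(table, n):
    """Largest absolute Walsh coefficient over all nonzero output masks."""
    return max(
        abs(sum(1 - 2 * (parity(b & table[x]) ^ parity(a & x))
                for x in range(1 << n)))
        for b in range(1, 1 << n)
        for a in range(1 << n)
    )


def uniformity(table, n):
    """Largest entry of the difference distribution table for nonzero input differences."""
    best = 0
    for a in range(1, 1 << n):
        counts = {}
        for x in range(1 << n):
            d = table[x] ^ table[x ^ a]
            counts[d] = counts.get(d, 0) + 1
        best = max(best, max(counts.values()))
    return best


INVERSION = [gf_inv(x) for x in range(1 << N)]
lin = linearity(INVERSION, N)
unif = uniformity(INVERSION, N)
```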

4-bit S-boxes:
By extending the only function in $F_{3,2,4}$ to a 4-bit Boolean function, it is possible to build one bijective quadratic S-box whose linearity and uniformity are both 16.
For latency complexity 3, there are 4 functions in $F_{4,3,8}$ with linearity 8, all of which have algebraic degree 3. Using these functions, it is possible to build 152 S-boxes with linearity 8 and uniformity 4, both of which are the minimum values for a 4-bit S-box. Next, we combine $F_{4,3,8}$ with the extensions of the Boolean functions in $F_{3,3,4}$ to 4-bit Boolean functions. Using these functions, it is possible to build another 129 S-boxes with linearity 8 and uniformity 4.
It is noteworthy that the inverses of S-box number 73 from the first set and of S-boxes number 32 and 38 from the second set have latency complexity 3, and they can also produce involutive S-boxes. Besides, in the first set, for each of the S-boxes number 22, 41, 42, 55, 80, 99, 136 and 147, the coordinate functions are equivalent to each other, while no S-box in the second set has this property.
Among previously known S-boxes, the only ones with latency complexity 3 are Midori-$s_0$ and the non-bijective $\chi_4$; the former is equivalent to the representative S-box number 32 from the second set mentioned above. However, for the same level of latency complexity, uniformity and linearity, the new S-boxes offer a wide variety with respect to the algebraic degree of the coordinate or component functions.
Moreover, since $F_{4,4,8}$ includes all 4-bit balanced Boolean functions with linearity 8, within latency complexity 4 we can build any 4-bit S-box with linearity 8 and uniformity 4; these are named Golden S-boxes in [LP07].

5-bit S-boxes:
By extending the function in $F_{3,2,4}$ to a 5-bit Boolean function, it is possible to build 13 different bijective S-boxes whose linearity and uniformity are both 32.
$F_{5,3}$ includes 4 balanced Boolean functions with linearity 16, which form $F_{5,3,16}$. Using these functions, it is possible to build 4 S-boxes with linearity 16 and uniformity 6. Note that all the coordinate functions of these S-boxes and of their inverses have algebraic degree 4. Besides, all the coordinate functions of the last S-box are extended bit permutation equivalent to each other.
Using the extended Boolean functions of F 3,3,4 and F 4,3,8 to 5-bit Boolean functions together with the ones in F 5,3,16 does not improve the minimum achievable linearity or uniformity in the S-boxes. It only gives another 9 S-boxes with the same linearity and uniformity values.
For the latency complexity of 4, there are 93 functions in F 5,4,8 . Using these functions, it is possible to build 2514 different S-boxes with linearity 8 and uniformity 2 (due to the large number of these S-boxes, we only provide them in [Ras22]). It is noteworthy to mention that the coordinate functions of these S-boxes have an algebraic degree of 2 or 3, and are equivalent to only 28 functions (out of 93).
Moreover, since all balanced 5-bit Boolean functions with minimum linearity are included in F 5,5,8 , any 5-bit S-box with linearity 8 has latency complexity of at most 5.
Among previously known S-boxes, the only one with latency complexity 3 is $\chi_5$, which is used in KECCAK and has uniformity 8 and linearity 16. For the same level of latency complexity, uniformity 6 or 8, and linearity 16, the new S-boxes offer a wide variety with respect to the algebraic degree of the coordinate or component functions, and their coordinate functions depend on more input variables (while for the $\chi$ function each coordinate always depends on 3 bits).
For the case of S-boxes with uniformity 2 and linearity 8, while the previously known S-boxes all have a latency complexity of 5, our new proposed S-boxes can achieve these properties within the latency complexity of 4. It is noteworthy to mention that the algebraic degree of the coordinate or component functions of these new S-boxes is either 2 or 3; there are also some S-boxes with all quadratic or all cubic coordinates.

6-bit S-boxes:
Using the extension of the Boolean function in $F_{3,2,4}$ to a 6-bit Boolean function, it is possible to build 3 bijective S-boxes whose linearity and uniformity are 64 and 32, respectively. However, these S-boxes are combinations of two parallel 3-bit S-boxes found previously among the 3-bit S-boxes. Excluding S-boxes of this kind, it is possible to build 19 bijective 6-bit S-boxes whose linearity and uniformity are both 64.
$F_{6,3}$ includes 3 balanced Boolean functions with 3 different linearities: 24, 32 and 40. Using these Boolean functions, it is possible to build an S-box with linearity 64 and uniformity 20 in which all the coordinate functions are equivalent to the representative function in $F_{6,3}$ with linearity 32 and algebraic degree 5.
Then, we use the two functions with lower linearity, i.e., $F_{6,3,32}$, together with the extensions of the Boolean functions in $F_{5,3,16}$ to 6-bit Boolean functions. It is possible to build another S-box with the same linearity and uniformity, in which all coordinate functions have an algebraic degree of either 4 or 5.
Next, by combining $F_{6,3,32}$ with the extensions of the Boolean functions in $F_{5,3,16}$ and $F_{4,3,8}$ to 6-bit Boolean functions, we could not find any S-box with the same or better linearity or uniformity. Going one step further, we combine the extensions of the Boolean functions in $F_{3,3,4}$ to 6-bit Boolean functions with the other three sets mentioned above, to see if there is any S-box with better uniformity or linearity. We found 49 new S-boxes (excluding the ones that are a combination of two parallel 3-bit S-boxes) with linearity 32 and uniformity 16. It is noteworthy that for each of the last 5 S-boxes in this list, all the coordinate functions are equivalent to each other and to a single function in $F_{3,3,4}$. Besides, among these 49 S-boxes, there are 15 quadratic ones (i.e., for each of these S-boxes, the algebraic degree of all coordinate functions is 2), which are usually considered suitable for side-channel countermeasures.
For the case of latency complexity 4, we first start to build quadratic bijective S-boxes. There are only 2 quadratic Boolean functions in F 6,4 ; one with linearity 16 and another with linearity 32. Using the one with linearity 16, it is possible to build 4 S-boxes with both linearity and uniformity 32. However, by involving the extension of quadratic 5-bit Boolean functions F 5,4,8 to 6-bit Boolean functions (there are 4 such functions), it is possible to build 908 S-boxes with linearity 16 and uniformity 4. Due to the large number of these S-boxes, we only provide them in [Ras22]. It is noteworthy to mention that in this list, there are two S-boxes that coordinate functions of each S-box are equivalent to each other, namely S-boxes number 496 and 775.
In comparison with the previously known 6-bit S-boxes, the only ones with latency complexity less than 5, are χ 6 and Speedy S-boxes. χ 6 is a non-bijective function with latency complexity 3, linearity 32 and uniformity 16. It is comparable with our 49 bijective S-boxes with a wide variety with respect to the algebraic degree of coordinate or component functions and also with a higher dependency of the coordinate functions on the input variables.
The Speedy S-box has latency complexity 4, linearity 24 and uniformity 8. For the same level of latency complexity, we presented 908 quadratic S-boxes together with a hybrid quadratic-cubic S-box, all with linearity 16 and uniformity 4. While the new S-boxes are better than the Speedy S-box with respect to uniformity or linearity, they have a smaller algebraic degree. However, if the target is a higher algebraic degree, we found several S-boxes with linearity 24 and uniformity 6 in which all coordinate functions have algebraic degree 5. Clearly, these S-boxes have better cryptographic properties than the Speedy S-box. We recall, however, that the design of the Speedy S-box was restricted to two layers of NAND gates with fan-in number 3 or 4 (with the same latency complexity of 4), which gives that S-box a slightly lower latency than our S-boxes.

7-bit S-boxes:
Using the extension of the function in $F_{3,2,4}$ to a 7-bit Boolean function, it is possible to build 92 bijective S-boxes whose linearity and uniformity are both 128. Three of these S-boxes are parallel combinations of a 3-bit S-box with a 4-bit S-box.
For 7-bit S-boxes with latency complexity 3, we repeat an approach similar to the one for 6-bit S-boxes. We first use the extensions of the 6-bit functions in $F_{6,3,32}$ to 7-bit Boolean functions, to see if it is possible to build 7-bit bijective S-boxes. Then, step by step, we use $F_{6,3,32} \cup F_{5,3,16}$, $F_{6,3,32} \cup F_{5,3,16} \cup F_{4,3,8}$ and $F_{6,3,32} \cup F_{5,3,16} \cup F_{4,3,8} \cup F_{3,3,4}$. Except in the last two steps, it is not possible to build any bijective S-boxes. For $F_{6,3,32} \cup F_{5,3,16} \cup F_{4,3,8}$, there are many S-boxes whose linearity and uniformity are both 128. Using $F_{6,3,32} \cup F_{5,3,16} \cup F_{4,3,8} \cup F_{3,3,4}$, there are 1074 S-boxes with linearity 64 and uniformity 32, of which 152 × 7 = 1064 are parallel combinations of a 3-bit S-box with a 4-bit S-box found previously among the 3-bit and 4-bit S-boxes with latency complexity 3. It is noteworthy that the coordinate functions of these S-boxes are all extended bit permutation equivalent to the extensions of one function in $F_{4,3,8}$ and two functions in $F_{3,3,4}$ to 7-bit Boolean functions. Moreover, in the last two S-boxes of this list, all coordinates are extended bit permutation equivalent to each other and to the extension of one function in $F_{3,3,4}$.
For the case of latency complexity 4, we only checked building quadratic bijective S-boxes. Using extension of the single quadratic 6-bit Boolean function in F 6,4,32 to a 7-bit function, it is possible to build 1110 S-boxes with linearity 64 and uniformity 32. Due to the large number of these S-boxes, we only provide them in [Ras22]. One step forward, by involving the extension of four quadratic 5-bit Boolean functions F 5,4,8 to 7-bit Boolean functions, it is possible to build 134 S-boxes with linearity 32 and uniformity 8. It is noteworthy to mention that coordinate functions for two S-boxes of this list (namely S-boxes number 51 and 133) are equivalent to each other.
There are only two previously known 7-bit S-boxes: $\chi_7$ with latency complexity 3, linearity 64 and uniformity 32, and the Wage S-box with a latency complexity higher than 4, linearity 40 and uniformity 8. $\chi_7$ is comparable with the 10 new S-boxes, whose coordinate or component functions have algebraic degree 2 or 3 and depend on more input variables.
Wage S-box has a latency complexity of at least 5. However, for the latency complexity of 4, we presented 134 quadratic S-boxes with linearity 32 and uniformity 8. Clearly, the new S-boxes are better than Wage S-box with respect to the latency complexity, uniformity or linearity, but they have a smaller algebraic degree. If the target is to have an S-box with a latency complexity of 4 and a higher algebraic degree, it should be investigated what is the best achievable linearity and uniformity.

8-bit S-boxes:
Using the extension of the Boolean function in $F_{3,2,4}$ to an 8-bit Boolean function, it is possible to build 221 bijective S-boxes whose linearity and uniformity are both 256. Of these, 40 are a parallel combination of two 4-bit S-boxes or of one 3-bit and one 5-bit S-box, and the other 181 S-boxes are provided in [Ras22].
The only previously known 8-bit S-box with a latency complexity of less than 5 is $\chi_8$, which is a non-bijective S-box with latency complexity 3, linearity 64 and uniformity 128. In the Skinny-8 and CSS S-boxes, and also in their inverses, there are four coordinates whose latency complexity is 3 or 4, but for the other four coordinates it is higher than 4. For latency complexity 3, we introduced 84 new quadratic and bijective S-boxes with linearity 128 and uniformity 64.
Latency Complexity vs. Minimum Achievable Uniformity and Linearity: As reported, up to 8-bit S-boxes, there is no n-bit S-box with a latency complexity of 2 whose both linearity and uniformity are smaller than 2 n ; therefore, to achieve such a property, we need to use S-boxes with a latency complexity of at least 3.
For latency complexity of 3, while there are 3-and 4-bit S-boxes with the minimum linearity and uniformity, for larger S-boxes (with respect to the input size), this is not achievable. Generally, for n-bit S-boxes with 3 ≤ n ≤ 8 and latency complexity 3, the minimum achievable linearity and uniformity are 2 n−1 and 2 n−2 , respectively, (except for 5-bit S-boxes in which the minimum achievable uniformity is 6).
For latency complexity 4, it is possible to achieve the minimum linearity and uniformity for 5-bit S-boxes, which are 8 and 2, respectively. For 5-, 6-, and 7-bit S-boxes, the minimum achievable linearity and uniformity are $2^{n-2}$ and $2^{n-4}$, respectively; this is probably also the case for 8-bit S-boxes, which remains interesting to investigate.

Hardware Implementation of a Low-Latency Structure
While the proposed structure in Figure 2 helps us to study the Boolean functions with latency complexity d, it does not promise the lowest latency in a real hardware implementation. In the following, we explain the reasons behind this statement. Then we describe our approach for optimizing the suggested structures produced by the algorithm in Subsection 4.4 to find a circuit with the lowest latency in an ASIC hardware implementation. We apply our approach to find efficient implementations for previously introduced S-boxes that are minimized with respect to the latency and then its area.

Optimizing the Low-Latency Structure for a Boolean Function
The structures produced by the algorithm in Subsection 4.4 (which usually follow the structure in Figure 2) do not promise the lowest latency in reality because we modeled the latency, a complicated hardware parameter, with an over-simplified metric, the latency complexity. Here, we describe three main reasons that cause latency differences between the structures suggested by the algorithm in Subsection 4.4 for a given Boolean function.

Figure 3: Two different circuits with the minimum gate depth for implementing $f = x_0 \wedge (x_1 \vee x_2)$ used in Example 3.

Different Structures for the Same Function:
We recall that for a given Boolean function, the structure suggested by the algorithm in Subsection 4.4 might not be unique. We emphasize that this algorithm already removes the trivially equivalent structures that are explained in Subsection 4.2. The following example shows that different structures (with the minimum gate depth) for the same function can have different latency values.
Example 3. Let $f(x_0, x_1, x_2)$ be a 3-bit Boolean function with $f = x_0 \wedge (x_1 \vee x_2)$. The latency complexity of this function is two, and it can be implemented using two different structures based on the equations $\overline{\overline{x_0} \vee \overline{(x_1 \vee x_2)}}$ and $\overline{\overline{(x_0 \wedge x_1)} \wedge \overline{(x_0 \wedge x_2)}}$. We depict the corresponding circuits for each of these structures in Figure 3.
While in the second circuit both sub-circuits feeding the last gate have gate depth 1, in the first circuit these sub-circuits have different latency complexities: zero for one of them ($x_0$) and 1 for the other. Besides, in the first circuit, the input $x_0$ is used only once, while in the second circuit it is used twice. This means that in the first circuit, the INV gate for variable $x_0$ has fan-out number 1, but in the second circuit, the BUF gate for the same variable has fan-out number 2.
These differences in the latency of the sub-circuits and in the fan-out numbers of the INV and BUF gates for the input variable $x_0$ cause the first circuit to have a lower latency than the second one.
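The two structures of Example 3 can be checked mechanically. The sketch below encodes each circuit as a nested tuple (an encoding invented for this illustration, with level-0 BUF and INV gates as leaves), verifies both against $f = x_0 \wedge (x_1 \vee x_2)$, and counts how often each level-0 gate is used, which is its fan-out number.

```python
from collections import Counter
from itertools import product

# Leaves are level-0 gates ("BUF", i) or ("INV", i); inner nodes are 2-input gates.
STRUCTURE_1 = ("NOR", ("INV", 0), ("NOR", ("BUF", 1), ("BUF", 2)))
STRUCTURE_2 = ("NAND",
               ("NAND", ("BUF", 0), ("BUF", 1)),
               ("NAND", ("BUF", 0), ("BUF", 2)))


def evaluate(node, x):
    op = node[0]
    if op == "BUF":
        return x[node[1]]
    if op == "INV":
        return 1 - x[node[1]]
    a, b = evaluate(node[1], x), evaluate(node[2], x)
    return 1 - (a & b) if op == "NAND" else 1 - (a | b)


def level0_fanout(node, counts=None):
    """Count how many gate inputs each level-0 BUF/INV gate drives."""
    counts = Counter() if counts is None else counts
    if node[0] in ("BUF", "INV"):
        counts[node] += 1
    else:
        level0_fanout(node[1], counts)
        level0_fanout(node[2], counts)
    return counts


def target(x):
    return x[0] & (x[1] | x[2])


both_correct = all(
    evaluate(STRUCTURE_1, x) == evaluate(STRUCTURE_2, x) == target(x)
    for x in product((0, 1), repeat=3)
)
fanout_1 = level0_fanout(STRUCTURE_1)
fanout_2 = level0_fanout(STRUCTURE_2)
```

Here `fanout_1` reports a single use of the INV gate on $x_0$, while `fanout_2` reports two uses of the BUF gate on $x_0$, matching the discussion above.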
In the aforementioned equations, the inputs are taken directly from the inputs of the combinatorial circuit. If each input of the equations instead comes from another sub-circuit, i.e., $f = f_0 \wedge (f_1 \vee f_2)$, with $f_0$, $f_1$, $f_2$ and $f$ each being an n-bit Boolean function, then the latency difference between the two circuits realizing $\overline{\overline{f_0} \vee \overline{(f_1 \vee f_2)}}$ and $\overline{\overline{(f_0 \wedge f_1)} \wedge \overline{(f_0 \wedge f_2)}}$ can be much bigger than in the case of $f = x_0 \wedge (x_1 \vee x_2)$.
The above example shows one of the biggest sources of latency differences between circuits for the same function, namely a shorter gate depth in one sub-circuit than in the other. However, there might also be small differences in the latency of different circuits whose sub-circuits have the same gate depth. Therefore, for a given function, if there is a structure in which the gate depth of one sub-circuit is smaller than that of the other (such as the first structure in the above example), it is preferred over the other suggested structures. Otherwise, if the gate depths of the sub-circuits are the same in all of the structures, we must consider all the structures.
Gates with Higher Fan-Out Number: In our structures, we assumed that from each variable $x_i$ two wires come to the combinatorial circuit; one goes to a BUF gate and the other goes to an INV gate. The fan-out number of these two gates depends on the number of times that $x_i$ or $\neg x_i$ is used as an input of the NAND or NOR gates in depth level 1. If, in a combinatorial circuit, the BUF or INV gate of some variable has a high fan-out number, this increases the latency of the circuit, which contradicts our assumption that the latency of the INV and BUF gates in gate level 0 of the proposed structure in Figure 2 is much smaller than the latency of the rest of the circuit. Therefore, we should reduce the fan-out number of these BUF and INV gates to reach a lower latency.
The structures suggested by our algorithm are all based on 2-bit NAND and NOR gates with fan-out number 1. In some structures, it might be possible to use such gates with a higher fan-out number to reduce the fan-out number of the BUF and INV gates. Consider the case that there are two sub-circuits with outputs $f_0$ and $f_1$ such that $f_0 = f_1$ for all input values. Then, instead of having a separate sub-circuit for each of $f_0$ and $f_1$, we can keep only one of them by increasing the fan-out number of the last gate in this sub-circuit. Thereby, we reduce the fan-out number of several BUF and INV gates just by increasing the fan-out number of a single gate in a middle depth level. This kind of simplification not only possibly reduces the latency of the implementation, but also reduces its area.
Note that within the suggested structures by our algorithm, there might be different ones that lead to the same updated circuit after these simplifications. In this case, we omit the repeated circuits for the next step.

Different Gate Types in ASIC Libraries:
The proposed structures, all are based on only 2-bit NAND and NOR gates. However, depending on the ASIC technology used for the hardware implementation, there are other logic gates with a higher fan-in number that might be more efficient (with respect to the latency and the area) than its representation with the basis of 2-bit NAND and NOR gates. Therefore, since we exclude the gates with a higher fan-in number, the corresponding circuits for the suggested structures do not necessarily provide the lowest possible latency.
In [LMMR21,Section 2], the authors studied the latency behavior of logic gates and their combinations in the CMOS hardware. There, it is explained in detail that compared to the other gates with the same fan-in number, NAND and OAI gates are the most suitable gates to achieve a low-latency implementation. Hence, we need to adapt the circuits for each suggested structure to apply the other low-latency gates to find the implementation with the lowest possible latency for the corresponding function. We suggest not considering only those 2 gates (with the best latency behavior) but to consider also similar gates (with good latency behavior) such as NOR and AOI gates.
Thereby, for each suggested structure (remaining from the previous step), we suggest trying all possible replacements for the aforementioned gates with the corresponding sub-circuit (in the basis of NAND, NOR, and INV gates) and evaluate latency of the updated circuit. We emphasize that each of these replacements does not necessarily reduce the latency of the implementation, but for achieving the lowest latency, we need to check all combinations of these possible replacements. In Figure 4, we provide the corresponding sub-circuits (in the basis of 2-bit NAND, NOR and INV gates) for NAND, NOR, OAI and AOI gates with fan-in number of 3 and 4.
The possible improvements in this step from using gates with higher fan-in numbers generally depend on the transistor-level design of the gates and the corresponding conditions used in the given library's technology. That a single replacement of a sub-circuit by a gate with a higher fan-in number improves the latency in one technology does not necessarily mean that it does so in another technology. Even in the same technology, an improvement from replacing a sub-circuit with a specific gate type in one circuit does not ensure an improvement from the same kind of replacement in another circuit. Therefore, we suggest trying all possible replacements with gates of higher fan-in numbers separately for each targeted technology.
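Before a replacement is tried, it helps to confirm that the wide gate and the sub-circuit it stands for compute the same function. The sketch below checks one possible depth-2 decomposition for each gate type over 2-bit NAND, NOR and INV gates; these decompositions are assumptions chosen for this illustration and are not necessarily the ones drawn in Figure 4.

```python
from itertools import product


def inv(a):
    return 1 - a


def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


# Reference behaviour of the wide gates (unused arguments are ignored).
WIDE = {
    "NAND3": lambda a, b, c, d: 1 - (a & b & c),
    "NOR3":  lambda a, b, c, d: 1 - (a | b | c),
    "NAND4": lambda a, b, c, d: 1 - (a & b & c & d),
    "NOR4":  lambda a, b, c, d: 1 - (a | b | c | d),
    "OAI21": lambda a, b, c, d: 1 - ((a | b) & c),
    "AOI21": lambda a, b, c, d: 1 - ((a & b) | c),
    "OAI22": lambda a, b, c, d: 1 - ((a | b) & (c | d)),
    "AOI22": lambda a, b, c, d: 1 - ((a & b) | (c & d)),
}

# Assumed decompositions: the INV gates sit at level 0, on the input literals.
DECOMPOSED = {
    "NAND3": lambda a, b, c, d: nand(nor(inv(a), inv(b)), c),
    "NOR3":  lambda a, b, c, d: nor(nand(inv(a), inv(b)), c),
    "NAND4": lambda a, b, c, d: nand(nor(inv(a), inv(b)), nor(inv(c), inv(d))),
    "NOR4":  lambda a, b, c, d: nor(nand(inv(a), inv(b)), nand(inv(c), inv(d))),
    "OAI21": lambda a, b, c, d: nand(nand(inv(a), inv(b)), c),
    "AOI21": lambda a, b, c, d: nor(nor(inv(a), inv(b)), c),
    "OAI22": lambda a, b, c, d: nand(nand(inv(a), inv(b)), nand(inv(c), inv(d))),
    "AOI22": lambda a, b, c, d: nor(nor(inv(a), inv(b)), nor(inv(c), inv(d))),
}


def mismatches():
    """Names of gates whose decomposition differs from the reference on some input."""
    return [
        name for name in WIDE
        if any(WIDE[name](*x) != DECOMPOSED[name](*x)
               for x in product((0, 1), repeat=4))
    ]
```

Note that the OAI22 entry reads as "three NAND gates equal an OAI22 gate with complemented inputs", which is why such a replacement can be paid for by swapping BUF and INV gates at level 0 rather than by adding inverters.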
Including XOR and XNOR Gates: In our studies, to have a better metric for the latency of a circuit, we exclude XOR and XNOR gates from the basis of defining the latency complexity. However, similar to the previous step, we can use equivalent sub-circuits for these two gates to simplify the corresponding circuits for the suggested structures.
In Figure 5, we provide the corresponding sub-circuits for 2-bit XOR and XNOR gates.

Figure 4: The equivalent sub-circuits for NAND, NOR, OAI and AOI gates with fan-in number of 3 and 4 in the basis of 2-bit NAND, NOR and INV gates.

As depicted in the figures, for implementing $f_0$ XOR $f_1$ or $f_0$ XNOR $f_1$ before any of these replacements, we need separate sub-circuits for each of the functions $f_0$, $f_1$, $\neg f_0$ and $\neg f_1$, together with 3 NAND or NOR gates. After the replacement, we need separate sub-circuits only for $f_0$ and $f_1$, together with an XOR or XNOR gate. Note that the latency of an XOR or XNOR gate is higher than the latency of the corresponding equivalent structure (see Example 2). However, because the replacement reduces the fan-out number of the BUF or INV gates in depth level 0, it can possibly reduce the latency of the whole circuit. In addition, these replacements can reduce the area of the implementation significantly.
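The three-gate structures mentioned above can be written out and verified exhaustively. The sketch below gives one NAND-based and one NOR-based form for each of XOR and XNOR, taking $f_0$, $f_1$ and their complements as inputs; these forms are standard identities and are assumed, not confirmed, to match the sub-circuits in Figure 5.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


def xor_nand(f0, f1, nf0, nf1):
    return nand(nand(f0, nf1), nand(nf0, f1))


def xor_nor(f0, f1, nf0, nf1):
    return nor(nor(f0, f1), nor(nf0, nf1))


def xnor_nand(f0, f1, nf0, nf1):
    return nand(nand(f0, f1), nand(nf0, nf1))


def xnor_nor(f0, f1, nf0, nf1):
    return nor(nor(f0, nf1), nor(nf0, f1))


def check():
    """True if all four structures agree with XOR/XNOR on every input pair."""
    for f0, f1 in product((0, 1), repeat=2):
        nf0, nf1 = 1 - f0, 1 - f1
        if xor_nand(f0, f1, nf0, nf1) != f0 ^ f1:
            return False
        if xor_nor(f0, f1, nf0, nf1) != f0 ^ f1:
            return False
        if xnor_nand(f0, f1, nf0, nf1) != 1 - (f0 ^ f1):
            return False
        if xnor_nor(f0, f1, nf0, nf1) != 1 - (f0 ^ f1):
            return False
    return True
```

Each structure needs all four of $f_0$, $f_1$, $\neg f_0$, $\neg f_1$, which is exactly what a single XOR or XNOR gate avoids.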

Vectorial Boolean Functions:
To find the circuit with the minimum latency for implementing a vectorial Boolean function, we can use a similar approach as for a Boolean function. First, using the algorithm in Subsection 6.1 for each coordinate function, we find all the possible structures with minimum gate depth. Then, for each combination of the structures for the coordinate functions, we repeat the aforementioned steps for replacing the gates with a higher fan-out number, or XOR and XNOR gates, or other gate types provided by the ASIC library. In other words, for an n-bit to m-bit vectorial Boolean function, if there are $N_{C_0}, N_{C_1}, \dots, N_{C_{m-1}}$ possible structures for the respective coordinate functions, then we must try all $\prod_{i=0}^{m-1} N_{C_i}$ combinations and check for possible replacements. Completing this approach finds a circuit for implementing the given function with the minimum possible latency.
Note that all of these searches for the circuits that achieve the latency complexity of a (vectorial) Boolean function can be fully automated.
In the following, we take the hybrid cubic-quadratic 6-bit S-box with latency complexity 4, linearity 16 and uniformity 4 presented in Subsection 5.2 as an example to show our approach for using gates with good latency and higher fan-in number in the structures of Boolean functions and S-boxes with low latency complexity.
Example 4 (6-bit S-box with latency complexity 4, linearity 16 and uniformity 4). Consider $(y_0, y_1, y_2, y_3, y_4, y_5) = S(x_0, x_1, x_2, x_3, x_4, x_5)$ as this 6-bit bijective S-box with input bits $x_i$ and output bits $y_i$, $0 \le i < 6$. The coordinate functions of this S-box are equivalent to only two Boolean functions, namely $f_0$ and $f_1$, presented as follows:

Therefore, finding an optimized circuit for a low-latency implementation of this S-box requires finding latency-optimized circuits for each of these representative Boolean functions. Running the algorithm in Subsection 4.4 to find low-latency structures of these functions returns 32 and 18 different simplified circuits for implementing $f_0$ and $f_1$, respectively. Note that these simplifications are based only on basic and trivial logic equations and do not include simplifications based on larger (by fan-in or fan-out) or different gates. In Figure 6, we present one of these simplified circuits for each of $f_0$ and $f_1$.
Regarding the possibility of replacing sub-circuits with larger or different gates, when we consider gates with fan-in number 3 or 4, there are 7 and 6 possible replacements for the structures of $f_0$ and $f_1$ shown in Figure 6, respectively. These replacements are the sub-circuits $(g_{i,2j}, g_{i,2j+1}, g_{i+1,j})$ with $1 \le i \le 3$ and $0 \le j < 2^{3-i}$, excluding $(g_{1,0}, g_{1,1}, g_{2,0})$ for the structure of $f_1$. Note that to find the lowest possible latency for each of these structures in a specific ASIC library, we need to consider all possible combinations of the replacements by the previously mentioned low-latency gates with fan-in number 3 or 4.

In this example, we start by replacing $(g_{3,0}, g_{3,1}, g_{4,0})$ with an OAI22 gate in both structures for $f_0$ and $f_1$. To do this, we need to remove $(g_{3,0}, g_{3,1}, g_{4,0})$ and replace it with an OAI22 gate (denoted by $\mathrm{OAI22}_0$) together with a single INV gate for each of the four inputs of this gate. However, we can omit these four INV gates by complementing all of the $g_{i,j}$ gates with $0 \le i < 3$ and $0 \le j < 2^{4-i}$. Precisely, all $g_{i,j}$ gates with $i \in \{1, 2\}$ change to NAND gates, except $g_{1,0}$ in the structure for $f_1$, which changes to a NOR gate, and all BUF gates in depth level 0 change to INV gates and vice versa.
Next, we can replace each of the sub-circuits $(g_{1,2j}, g_{1,2j+1}, g_{2,j})$ with $0 \le j < 4$ (excluding $(g_{1,0}, g_{1,1}, g_{2,0})$ for the structure of $f_1$) by an OAI22 gate, or by an AOI21 gate for $0 < j < 3$ in the structure of $f_1$. Note that all of these gates were changed to NAND gates after introducing the $\mathrm{OAI22}_0$ gate. Similarly, we need to complement the $g_{0,j}$ gates with $0 \le j < 16$ (except for some of them in the circuit of $f_1$). This leads to the optimized circuit shown in Figure 7.
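The logic behind the OAI22 replacement and the subsequent complementation can be verified on truth tables. The sketch below assumes that the three replaced gates form a NAND-of-NANDs tree, which is consistent with the statement that the OAI22 gate needs one INV gate on each of its four inputs; the actual gate types are those of Figure 6, which is not reproduced here. All function names are hypothetical.

```python
from itertools import product

def inv(a):
    return 1 - a

def nand(a, b):
    return 1 - (a & b)

def nor(a, b):
    return 1 - (a | b)

def oai22(a, b, c, d):
    # OR-AND-INVERT cell: NOT((a OR b) AND (c OR d)).
    return 1 - ((a | b) & (c | d))

def nand_tree(a, b, c, d):
    # Assumed shape of a replaced three-gate sub-circuit.
    return nand(nand(a, b), nand(c, d))

def oai22_replacement_correct():
    # One OAI22 cell with inverted inputs equals the three-gate NAND tree.
    return all(
        nand_tree(a, b, c, d) == oai22(inv(a), inv(b), inv(c), inv(d))
        for a, b, c, d in product((0, 1), repeat=4)
    )

def complementation_rules_correct():
    # De Morgan: complementing a gate swaps NOR and NAND and moves the
    # inversion to its inputs. Repeating this down to depth level 0 is
    # what turns BUF gates into INV gates and vice versa.
    return all(
        inv(nor(a, b)) == nand(inv(a), inv(b))
        and inv(nand(a, b)) == nor(inv(a), inv(b))
        for a, b in product((0, 1), repeat=2)
    )

print(oai22_replacement_correct(), complementation_rules_correct())
```

Pushing the four input inversions into the preceding gate levels in this way adds no gates to the circuit, so the depth saved by the OAI22 cell is not paid for with an extra inverter level.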
Going one step further, by considering XOR or XNOR gates, it is possible to further simplify some of the suggested circuits for the representative function $f_0$. One such simplified circuit is shown in Figure 8. We synthesized this S-box with the NanGate 15 nm and 45 nm OCLs under typical operating conditions in two different modes, behavioral and structural, to show the efficiency of our method for simplifying the low-latency structure. In the structural mode, we used one of the simplified structures and simply synthesized it. In the behavioral mode, we used the look-up table representation of the S-box and let the synthesizer optimize the circuit by running the compile_ultra -incremental command several times. We recall that the optimization in the behavioral mode depends strongly on the methods used in the synthesizer, which in our case is Synopsys Design Compiler.
The results of these syntheses are provided in Table 4. They show that the structure found by our method can have about 25% lower latency than the circuit found by the synthesizer itself in the behavioral mode.
It is noteworthy that we chose the structure used for the synthesis in the following way. First, in the target library, for both Boolean functions $f_0$ and $f_1$, we tried all the possible structures and evaluated their latency. Then, for each function, we chose the structure with the lowest latency and used it in the corresponding coordinate functions of the S-box. Hence, this is a local optimization and not a global one. Even for each chosen structure for $f_0$ and $f_1$, there are different possibilities for assigning their input variables to realize the coordinate functions of the S-box. Precisely, there are 16 and 4 possible choices for the inputs of $f_0$ and $f_1$, respectively, to realize the corresponding coordinate. This means that for the given structures for $f_0$ and $f_1$, there are $2^{20}$ different possibilities for the inputs of the coordinate functions. We only considered the first possibility and used it in the structure given to the synthesizer.
Example 5 (7-bit quadratic S-boxes with latency complexity 4, linearity 32 and uniformity 8). To provide a stronger argument for the efficiency of our method for simplifying the low-latency structure, we implemented all 134 7-bit quadratic S-boxes found in Subsection 5.2 in both behavioral and structural modes. Table 8 shows the results of these syntheses. We recall that in the structural mode, we only try one of the simplified structures of the S-box, which might not be the one with the lowest latency. In some cases, especially for the 15 nm OCL, the synthesizer in the behavioral mode finds a circuit with a lower latency than the circuit we used in the structural mode. In all of these cases, we checked the circuits found by the synthesizer and found that each of them is one of the possible simplifications of the low-latency structure for the corresponding S-box. That means that if we checked all possible simplifications of the low-latency structure of the S-box, we would also encounter the circuit found by the synthesizer.
We conclude this example by noting that trying only a single structure suggested by our method can improve the latency of a given S-box by 22% or 6% on average in the 45 nm and 15 nm OCLs, respectively.

Conclusion and Future Works
In this paper, we mathematically studied the latency of (vectorial) Boolean functions. We introduced the latency complexity metric to measure the latency of Boolean functions. We presented efficient algorithms for 1) finding all Boolean functions with low-latency complexity, 2) determining the latency complexity of the (vectorial) Boolean functions, and 3) finding all the circuits with the minimum latency complexity for a given Boolean function. Then, we presented another efficient algorithm to build bijective S-boxes with low-latency complexity while the previous method of building S-boxes was not suitable for our case.
As a result, for latency complexity 3, we found n-bit S-boxes with $3 \le n \le 8$ whose linearity is $2^{n-1}$ and whose uniformity is $2^{n-2}$ (except for 5-bit S-boxes, for which the minimum achievable uniformity is 6). Besides, we found several 5-, 6-, and 7-bit S-boxes with latency complexity 4 whose linearity is $2^{n-2}$ and whose uniformity is $2^{n-4}$. Our research leaves several directions for future work, some of which we point out here:
• Since determining the gate depth complexity, and accordingly the latency complexity, of a Boolean function is an NP-hard problem, an efficient algorithm that finds an upper bound for the latency complexity would be useful. Such an algorithm would yield a low-latency implementation of the Boolean function (not necessarily one that achieves the latency complexity), which is of interest to a designer who wants to reduce the latency of the implementation.
• Another direction is to complete our search for 6-bit bijective S-boxes with higher algebraic degree and latency complexity 4, and to repeat the search for 7- and 8-bit S-boxes. We could then further investigate the relation between the latency complexity and the minimum achievable linearity and uniformity of the S-boxes within this latency complexity.
• Our algorithm for building bijective S-boxes is not only suitable for finding low-latency S-boxes. With simple modifications, this algorithm can be applied to build S-boxes with other properties and up to other equivalences. An interesting question in this area is to find all 6-bit bijective S-boxes with the minimum linearity, 16, up to affine equivalence.

Table 7: Latency complexity of known S-boxes together with their uniformity, linearity and algebraic degree. In the columns for algebraic degree and latency complexity, in the case of bijective S-boxes, the second arrays are the corresponding values for the inverse S-box.
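For readers who want to recompute the linearity and uniformity values quoted in this section, a brute-force sketch is given below. It assumes the usual definitions: linearity is the maximum absolute Walsh coefficient over all nonzero output masks, and uniformity is the largest entry of the difference distribution table over all nonzero input differences. The function names are hypothetical, and the test vector is the well-known 4-bit S-box of the PRESENT cipher, used purely as an external example and not taken from this paper.

```python
def parity(x):
    # Parity of the number of set bits, i.e. the XOR of all bits of x.
    return bin(x).count("1") & 1

def linearity(sbox, n):
    # Maximum |sum_x (-1)^(b.S(x) XOR a.x)| over all a and all nonzero b.
    best = 0
    for b in range(1, 1 << n):
        for a in range(1 << n):
            walsh = sum(
                1 - 2 * parity((b & sbox[x]) ^ (a & x))
                for x in range(1 << n)
            )
            best = max(best, abs(walsh))
    return best

def uniformity(sbox, n):
    # Maximum number of x with S(x) XOR S(x XOR a) = b, over nonzero a and all b.
    best = 0
    for a in range(1, 1 << n):
        counts = [0] * (1 << n)
        for x in range(1 << n):
            counts[sbox[x] ^ sbox[x ^ a]] += 1
        best = max(best, max(counts))
    return best

# External example: the 4-bit S-box of PRESENT.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

print(linearity(PRESENT_SBOX, 4), uniformity(PRESENT_SBOX, 4))
```

The cost is roughly $2^{3n}$ operations for the linearity and $2^{2n}$ for the uniformity, which is fine for the sizes considered here ($n \le 8$); a fast Walsh-Hadamard transform would be the natural replacement for larger $n$.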