A Framework to Improve the Implementations of Linear Layers

This paper presents a novel approach to optimizing the linear layers of block ciphers using a matrix decomposition framework. We observe that the reduction properties proposed by Xiang et al. (FSE 2020) can be improved. To address their limitations, we propose a new reduction framework with a complete reduction algorithm and a swapping algorithm. Our approach formulates matrix decomposition as a new framework with an adaptive objective function and converts the problem into a Graph Isomorphism (GI) problem. Using the new reduction algorithm, we achieve lower XOR counts and lower depths of quantum implementations under the s-XOR metric. Our results outperform previous works for many linear layers of block ciphers and hash functions; some are even better than the current best g-XOR implementations. For the AES MixColumn operation, we obtain two in-place implementations, one with 91 XOR counts and one with quantum depth 13.


Introduction
In recent years, lightweight cryptography has gained significant attention with the advent of the Internet of Things (IoT) and Radio-Frequency Identification (RFID) technologies. These rapidly developing applications demand high device performance and energy efficiency. As a result, there has been growing interest in reducing the complexity and energy consumption of hardware implementations while ensuring the security of the cryptographic system [DEMS16, BJK+20].
To make the implementation of symmetric ciphers as lightweight as possible, researchers have focused on optimizing metrics including gate equivalents, latency, circuit size, and energy consumption. Since the linear components of a lightweight cipher are often equivalent to a series of XOR operations, the most intuitive requirement is to minimize the XOR count [Paa97]. Similarly, in quantum circuit design, the corresponding metrics involve the depth, the width (the number of qubits), and the gate count (the number of quantum gates). The quantum logic gates include Pauli gates, Hadamard gates, CNOT gates, and others. For the linear layer, the depth of the quantum circuit, the number of qubits, and the number of CNOT gates are the essential metrics for estimating complexity in the standard quantum circuit model [NC01].
In designing the primitives of lightweight cryptography, many researchers concentrate on constructing a low-cost linear matrix while ensuring security and meeting software and hardware performance requirements. Various works such as [SKOP15], [KLSW17], [CTG16], and [LS16] have explored the design of matrices with special structures like circulant, Hadamard, Toeplitz, or involution matrices, aiming to reduce the number of XOR operations. Moreover, Ascon [DEMS16], the winner of the NIST lightweight cryptography competition, employs a permutation that costs two binary XOR operations per bit. Recently, at CRYPTO 2023, El Hirch et al. [EHDRM23] constructed a new linear layer called the "twin column parity mixer", which requires only 3.2 XOR operations per bit and has a bitwise differential branch number of 12 (4 for the linear branch number).
Meanwhile, optimizing the implementation of existing linear layers has been another focus. To formalize the reduction of XOR counts, researchers have established three essential measurement tools for linear layer optimization: d-XOR, g-XOR, and s-XOR counts. The g-XOR and s-XOR counts correspond to two different ways of implementing the linear layer, and their definitions will be introduced in detail in the following sections. Many researchers have designed heuristic algorithms to optimize implementations under these metrics.
Furthermore, finding an optimal implementation in terms of g-XOR counts is equivalent to solving the Shortest Linear Program (SLP) problem, which is NP-hard [BMP13]. Boyar and Peralta proposed a heuristic algorithm (the BP algorithm) that minimizes a distance vector [BP10]. Their algorithm defines a base set and updates the distance vector according to the current base set; during the updates, the smaller the distance vector, the closer the current base set is to the target matrix. The works [BFI19, ME19, TP20] improved its efficiency in different cases, such as dense matrices, by modifying the tie-breaking phase of the BP algorithm. Lin et al. [LXZZ21] introduced a new framework that optimizes the implementation, achieving 91 XOR counts for AES under the g-XOR metric; their framework mainly relies on several reduction rules and iteratively searches for subsequences of the implementation that satisfy the rules. Qun et al. [LWF+22] introduced a backward-search strategy based on the Constant Matrix Multiplication (CMM) problem to minimize the depth of linear layer implementations. For the s-XOR counts, Xiang et al. [XZL+20] proposed a reduction algorithm based on reduction properties, achieving the best result of 92 XORs for AES under the s-XOR metric. Under the s-XOR metric, the implementation can be immediately transformed into an in-place quantum circuit. Moreover, some researchers take other potential metrics into account instead of considering only the XOR counts [LWF+22, HS22].
Regarding quantum circuits, Grover's algorithm provides at most a quadratic speedup for finding the key of a symmetric cipher compared with classical exhaustive key search [Gro96]. Thus, many works focus on improving the quantum implementations of block ciphers. In [GLRS16], a quantum circuit for AES requiring around 3000 to 7000 qubits was proposed for Grover's attack. The works [LPS20, ZWS+20, JNRV20] improved on this by reducing the number of qubits and lowering the circuit depth. Huang and Sun [HS22] proposed a new framework for generating a quantum circuit for the round function of a block cipher, with the linear layer taken from the result of Xiang et al. The work [Max19] achieved an out-of-place quantum implementation with 92 XOR counts, a depth of 22, and a circuit width of 318. Zhu and Huang [ZH22] extended the ideas of Xiang et al. to quantum circuits by utilizing the properties of move-equivalence and exchange-equivalence, obtaining an in-place quantum implementation of the AES MixColumn operation with depth 28 while keeping 92 XOR counts. Recently, at ASIACRYPT 2023 [LPZW23], Liu et al. proposed generic techniques to improve AES quantum implementations, including a new AES MixColumn in-place implementation with 98 XOR counts and depth 16 without ancilla.
Despite the many efforts dedicated to optimizing the XOR counts and the quantum depth of linear layer circuits [DBBV+21, MZ22], the circuit implementations of linear components can still be further optimized.

Our Contribution
Our research presents a more comprehensive framework for implementing the linear layer of a block cipher using heuristic search. The framework mainly employs elementary matrix decomposition, which can convert a sequence directly into a specific in-place implementation of XOR operations without auxiliary registers.
We first identify a gap in the work of Xiang et al. and propose several definitions for describing the reduction of type-3 sequences. Moreover, our approach introduces the concept of equivalent sequences, which allows us to describe the original reduction properties of Xiang et al. within a novel framework. Using the equivalence relationship, we model the problem of optimizing type-3 sequences as a GI problem, which avoids the limitations of previous work and enables the framework to reduce type-3 sequences. The framework also introduces the concepts of the objective function and the prime sequence, which make it flexible enough to optimize implementations with different aims, e.g., reducing the XOR count or the quantum circuit depth. Furthermore, we propose a low-complexity algorithm that determines whether elements of a sequence can be swapped to adjacent positions, based on the pairwise exchange structure of the sequence. These new techniques enable the search for implementations with lower XOR counts and lower quantum circuit depths.
Using the newly proposed reduction algorithm, we obtained superior results for many invertible matrices. In particular, we obtained a new implementation of the AES linear layer with 91 XORs under the s-XOR metric, matching the best result obtained under the g-XOR metric [LXZZ21]. Additionally, we applied our search algorithm to reduce the quantum circuit depth of linear layers by adjusting the objective function. As a result, we obtained a new AES MixColumn in-place quantum implementation with 98 XOR counts and depth 13 without ancilla, which is, to the best of our knowledge, the best such result. These results can be applied to the newest AES quantum implementations in [HS22] and [LPZW23].

Organization
The paper is structured as follows. Section 2 introduces the notation used in this paper and reviews the relevant background. Section 3 describes the method of matrix decomposition and points out its incompleteness. Our new framework for optimizing s-XOR sequences is introduced in Section 4. Section 5 presents the detailed results, including lower XOR counts and lower quantum implementation depths for several linear layers. Finally, Section 6 concludes and discusses future work.

Preliminaries
In this section, we introduce the notation used in this work and review relevant background from algebra and graph theory.

Notation
GL(n, F_2): the general linear group of invertible n × n matrices over the finite field F_2
E^1_i: the i-th type-1 elementary matrix in a sequence
E^3_i: the i-th type-3 elementary matrix in a sequence
E(i ↔ j): the type-1 matrix corresponding to exchanging the i-th row and the j-th row
E(i + j): the type-3 matrix corresponding to adding the j-th row to the i-th row over F_2
A_{m×n}: a matrix with m rows and n columns, with entries from F_2
A_{i,j}: the entry in position (i, j) of matrix A

Relevant Knowledge
Linear Layer. The linear operation in a substitution-permutation network block cipher can be written as a matrix multiplication O = L · U, where L is the matrix form of the linear operation, U is the input bit vector, and O is the output bit vector. This work only considers the case where L is a non-singular square matrix.
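As a minimal illustration, the product O = L · U over F_2 can be computed as below; the 3 × 3 matrix is a toy example of our own, not a linear layer from the paper.

```python
import numpy as np

def apply_linear_layer(L, u):
    """Compute O = L * U with all arithmetic over F_2 (i.e., modulo 2)."""
    L = np.asarray(L, dtype=np.uint8)
    u = np.asarray(u, dtype=np.uint8)
    return (L @ u) % 2

# Toy invertible matrix over F_2 (illustrative only).
L = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
print(apply_linear_layer(L, [1, 0, 1]).tolist())  # -> [1, 1, 1]
```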

Elementary Matrix
Definition 1. An n × n square matrix of type-1, type-2, or type-3 is a matrix obtained from the identity matrix I_{n×n} by performing a single elementary row (or column) operation: row swapping, row multiplication by a non-zero constant, or row addition, respectively.
Note that Gauss-Jordan elimination can transform any non-singular matrix into the identity matrix by performing a series of elementary operations. Also, every elementary transformation can be described as multiplication by an elementary matrix of Definition 1. Formally, we have Theorem 1:

Theorem 1. [Art11] A is non-singular if and only if A is the product of elementary matrices.
Because we are dealing with matrices over the binary field, we only need to consider the multiplication of type-1 and type-3 matrices. Matrix multiplication is not commutative in general; however, for the multiplication of type-1 and type-3 elementary matrices the following property (Property 1) holds: a type-1 matrix can be moved past a type-3 matrix by relabelling indices, i.e., E(i ↔ j) · E(k + l) = E(σ(k) + σ(l)) · E(i ↔ j), where σ is the transposition exchanging i and j.
According to Theorem 1 and Property 1, we can decompose a non-singular matrix into a product sequence of elementary matrices, where the preceding part is a product of type-1 matrices and the subsequent part is a product of type-3 matrices. Since row additions correspond to XOR operations in the implementation of the linear function, reducing the number of type-3 matrices in the product is the key to optimizing the implementation. Next, we introduce three XOR-count metrics to measure the cost of implementing a specific matrix.

Relevant metrics
Definition 2. d-XOR counts. [KPPY14] The d-XOR count of a matrix M ∈ GL(n, F_2), denoted wt_d(M), is wt_d(M) = ω(M) − n, where ω(M) denotes the number of ones in the matrix M.
We only use wt_d(M) to estimate the cost of an implementation. The following g-XOR and s-XOR counts correspond to the optimal numbers of XOR operations for two different implementation styles.
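A quick numeric illustration of Definition 2 (the 3 × 3 matrix is our own toy example): each output row with w ones can be computed with w − 1 XORs, giving wt_d(M) = ω(M) − n in total.

```python
import numpy as np

def d_xor_count(M):
    """d-XOR count wt_d(M) = omega(M) - n: ones in M minus the number of rows."""
    M = np.asarray(M, dtype=np.uint8)
    return int(M.sum()) - M.shape[0]

# Toy matrix: 6 ones over 3 rows -> d-XOR count 3; the identity costs 0.
M = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
print(d_xor_count(M))  # -> 3
```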

Definition 3. g-XOR counts. [XZL+20]
Consider a matrix M_{n×n} over F_2. An implementation of M can be viewed as an XOR sequence made of operations x_i = x_{j_1} ⊕ x_{j_2}, where 0 ≤ j_1, j_2 < i, for i = n, n+1, ..., n+t−1 (t is the length of the XOR sequence). In the implementation, x_0, x_1, ..., x_{n−1} are the input bits, and the output bits are a subset of the x_i with i > n−1. The minimal length t of such an operation sequence that computes the n output bits is defined as the g-XOR count.
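A g-XOR program is a straight-line program over fresh wires; a minimal evaluator sketch (the program and matrix are toy examples of our own, not from the paper):

```python
def run_gxor_program(inputs, program, outputs):
    """Evaluate a g-XOR straight-line program.

    inputs  : input bits x_0 .. x_{n-1}
    program : list of (j1, j2); step i computes the new wire x_{n+i} = x_j1 ^ x_j2
    outputs : wire indices holding the output bits
    """
    x = list(inputs)
    for j1, j2 in program:
        x.append(x[j1] ^ x[j2])
    return [x[i] for i in outputs]

# Toy program with 3 XORs computing (x0^x1, x1^x2, x0^x1^x2);
# the g-XOR count of the underlying matrix is therefore at most 3.
res = run_gxor_program([1, 0, 1], [(0, 1), (1, 2), (3, 2)], [3, 4, 5])
print(res)  # -> [1, 1, 0]
```

Note that each step writes to a fresh wire; this is exactly what distinguishes the g-XOR model from the in-place s-XOR model defined next.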

Definition 4. s-XOR counts. [JPST17]
For a non-singular matrix M ∈ GL(n, F_2), the s-XOR count (sequential XOR count) of M is the minimal number t such that M = P · E^3_t · E^3_{t−1} ⋯ E^3_1, where P is a permutation matrix and each E^3_k is a type-3 elementary matrix. The difference between g-XOR and s-XOR is that an operation under the s-XOR metric stores its result in the input lines. The s-XOR sequence directly corresponds to a series of XOR operations: if E^3_k = E(i + j), the k-th operation is x_i = x_i ⊕ x_j. Consequently, the s-XOR representation offers significant advantages in quantum circuit implementation, such as avoiding additional auxiliary qubits. Moreover, in-place implementation tends to result in a lower T-depth [Max19]. In classical circuits, we only need one instruction to execute x_i = x_i ⊕ x_j, while the g-XOR sequence requires an extra copy instruction and an additional register on platforms that only have 2-operand instructions, typically some microcontrollers of RISC architectures.
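The in-place character of an s-XOR sequence can be checked directly: applying each E(i + j) as x_i ^= x_j on the input registers reproduces M · x. A small sketch with a toy sequence of our own:

```python
import numpy as np

def E3(n, i, j):
    """Type-3 elementary matrix E(i + j): identity plus a 1 in position (i, j)."""
    M = np.eye(n, dtype=np.uint8)
    M[i, j] = 1
    return M

def apply_sxor(bits, seq):
    """Run an s-XOR sequence in place: each step (i, j) performs x_i ^= x_j."""
    x = list(bits)
    for i, j in seq:
        x[i] ^= x[j]
    return x

# Toy check: the sequence [(0,1), (1,0)] implements M = E(1+0) * E(0+1)
# with no permutation and no extra registers.
n, seq = 2, [(0, 1), (1, 0)]
M = np.eye(n, dtype=np.uint8)
for i, j in seq:
    M = (E3(n, i, j) @ M) % 2        # later operations multiply on the left
x = [1, 0]
assert apply_sxor(x, seq) == list((M @ np.array(x, dtype=np.uint8)) % 2)
```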
While implementing a linear layer under the s-XOR metric provides numerous benefits, the s-XOR count is never lower than the g-XOR count [XZL+20]. Regarding the relationship between the d-XOR and s-XOR counts, the s-XOR count is not always smaller: the work [Köl19] presents a matrix in GL(7, F_2) whose s-XOR count is larger than its d-XOR count. Furthermore, researchers employ diverse approaches when studying these metrics, and the results from both perspectives often complement each other. In our heuristic search, we have also matched existing results under the g-XOR metric for many matrices.
After introducing these basic metrics and the matrix decomposition theorems, we can shift our focus to another field of mathematics, namely graph theory. What unites the two domains is the profound interaction between algebraic structures, represented by matrices, and the combinatorial structures of graphs. For instance, matrices derived from graphs, such as adjacency matrices or incidence matrices, can be subjected to various decomposition methods, providing valuable structural insights into the underlying graphs. Furthermore, graph isomorphism enables us to explore matrix similarity, thereby unifying the two areas under the broad umbrella of mathematical structure analysis. The concepts of graph, bipartite graph, and the graph isomorphism problem are as follows.
Undirected Graph and Bipartite Graph. An undirected graph is a concept in graph theory [W+01] consisting of a set of vertices and the edges that connect those vertices. In an undirected graph, the edges have no direction: every edge is bidirectional and can be traversed in either direction between its two endpoints.
An undirected graph G can be presented as a set of vertices V together with a set of edges E. The vertex set can be written V = {v(1), v(2), ..., v(n)}, where n is the total number of vertices, and the edge set E consists of unordered pairs of vertices. Another way of representing a graph is by its adjacency matrix, an n × n matrix where n is the total number of vertices: A_{i,j} = 1 if there is an edge between vertices v(i) and v(j), and A_{i,j} = 0 otherwise. By symmetry, A_{i,j} = A_{j,i} for an undirected graph.
In a bipartite graph, all vertices can be divided into two disjoint sets V_1 and V_2: every edge connects a vertex in V_1 to a vertex in V_2, and there are no edges between vertices within the same part. Formally, a graph G = (V, E) is bipartite if there exists a partition V = V_1 ∪ V_2 with V_1 ∩ V_2 = ∅ such that every edge has one endpoint in V_1 and the other in V_2. It is important to note that in the following discussion we consider a bipartite graph as a special case of an undirected graph. For convenience, we write the vertex set of a bipartite graph G as V = V_1 ∪ V_2, where m_1 and m_2 denote the sizes of the two vertex sets, respectively. By reordering the vertices, without loss of generality, the first m_1 vertices are those in V_1 and the last m_2 vertices are those in V_2. When m_1 = m_2 = m, we can use a square matrix of dimension m to describe the graph. To distinguish it from the adjacency matrix, this matrix is called the relation matrix: R_{i,j} = 1 if there is an edge between vertices v(i) and v(j + m), and R_{i,j} = 0 otherwise.
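A short sketch of building the relation matrix from an edge list (the graph is a toy example of our own; vertices v(1)..v(m) lie in V_1, v(m+1)..v(2m) in V_2):

```python
def relation_matrix(m, edges):
    """Relation matrix of a balanced bipartite graph:
    R[i][j] = 1 iff there is an edge between v(i+1) in V1 and v(j+1+m) in V2."""
    R = [[0] * m for _ in range(m)]
    for a, b in edges:              # edge between v(a) in V1 and v(b) in V2
        R[a - 1][b - 1 - m] = 1
    return R

# Toy graph: V1 = {v(1), v(2)}, V2 = {v(3), v(4)}, edges v1-v3, v2-v3, v2-v4.
print(relation_matrix(2, [(1, 3), (2, 3), (2, 4)]))  # -> [[1, 0], [1, 1]]
```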

Graph Isomorphism Problem
For undirected graphs, an isomorphism is a bijective mapping between the vertex sets of two graphs that preserves adjacency. Formally, two finite undirected graphs G = (V(G), E(G)) and H = (V(H), E(H)) are isomorphic if there exists a bijection f : V(G) → V(H) such that (u, v) ∈ E(G) if and only if (f(u), f(v)) ∈ E(H). In other words, isomorphisms are adjacency-preserving bijections between the vertex sets, and the graph isomorphism problem asks whether two given graphs are isomorphic [Bab18]. Despite being in NP, the graph isomorphism problem is not known to be NP-complete or to lie in P [For96]. However, existing quasi-polynomial algorithms are powerful enough for our problem sizes [HBD17]. In our work, we employ Nauty and Traces [MP14] to solve the GI problem; these tools are well-suited to handling vertex-colored graphs.

Method of Matrix Decomposition
In this section, we begin with a detailed presentation of the methods of Xiang et al. to set the stage for further discussion. For further details on the matrix decomposition method for optimizing the implementation of the linear layer, we refer the reader to [XZL+20].
The work of Xiang et al. presents several ingenious methods to decompose a matrix into a sequence of type-1 and type-3 elementary matrix multiplications:
1. The Gauss-Jordan method shows that there exists a sequence s of elementary row transformations that converts the original invertible matrix L into the identity matrix I, i.e., s · L = I, so that L = s^{−1}. Type-1 and type-3 matrices over F_2 are involutions, so obtaining a decomposition sequence of L is easy. This method is referred to as strategy-1.
2. The method above is based on row swapping and row addition. Similarly, we can decompose the matrix L through column swapping and column addition into another sequence s′. This method is designated strategy-2.
3. Strategy-3 is motivated by reducing the number of type-3 elementary matrices in the decomposition sequence. It first reduces the number of ones in the matrix L through row additions as much as possible; if there are multiple local optima, one operation is chosen at random. When no row addition of any two rows can reduce the number of "1"s in the matrix, the program falls back to strategy-1 or strategy-2 to avoid an infinite loop. Combining strategy-3 with strategy-1 or strategy-2 results in strategy-3-1 and strategy-3-2, respectively.
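Strategy-1 can be sketched as Gauss-Jordan elimination over F_2 that records the elementary operations it performs; this is a minimal illustration of our own (pivot choice, tie-breaking, and the strategy-3 preprocessing of the paper are omitted):

```python
import numpy as np

def gauss_jordan_sequence(L):
    """Reduce an invertible F_2 matrix to I with row swaps and row additions,
    recording the operations. Applying ops in order to L yields I; since every
    elementary matrix here is an involution over F_2, replaying ops in reverse
    on I reconstructs L."""
    M = np.array(L, dtype=np.uint8) % 2
    n = M.shape[0]
    ops = []   # ('swap', i, j) swaps rows i,j; ('add', i, j) does row_i ^= row_j
    for c in range(n):
        p = next(r for r in range(c, n) if M[r, c])   # pivot row for column c
        if p != c:
            M[[c, p]] = M[[p, c]]
            ops.append(('swap', c, p))
        for r in range(n):
            if r != c and M[r, c]:
                M[r] ^= M[c]
                ops.append(('add', r, c))
    assert (M == np.eye(n, dtype=np.uint8)).all()
    return ops

L = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 1]]
print(gauss_jordan_sequence(L))
```

The number of ('add', ...) entries is an upper bound on the s-XOR count of L; the reduction properties discussed next aim to shorten exactly this part of the sequence.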
Assume we obtain, through strategy-3-1 or strategy-3-2, a sequence s of the form s = E^1_{k_1−1} ⋯ E^1_0 · E^3_{k_3−1} ⋯ E^3_0. In order to identify potential reductions, the method exhaustively examines all combinations of two and three type-3 matrices in the given sequence. If these matrices can be swapped to adjacent positions, the algorithm checks whether they satisfy one of seven distinct reduction properties; only if both conditions are met is a reduction made.
To introduce the properties, the notation E(i + j) denotes the type-3 elementary matrix with an additional "1" in position (i, j) compared to the identity matrix, and E(i ↔ j) denotes the type-1 elementary matrix that exchanges the i-th and j-th rows of the identity matrix. We then have the seven reduction properties R1–R7, in which i, j, and k are pairwise distinct integers.
In the properties involving four indices, i, j, k, and l are pairwise distinct integers.
We found that, in addition to the reduction properties proposed by Xiang et al., there are further reduction properties, ranging from a simplest extra identity, through properties similar to those in Property 2, to more complicated ones (note that none of them can be deduced from the existing properties). When the number of parameters in a reduction property grows beyond 3 (e.g., variables i, j, k, l when the number equals 4), more than 50 reduction properties of this kind arise, which means Property 2 is incomplete. Furthermore, the properties proposed by Xiang et al. have another problem besides incompleteness: they contain redundancy. Consider the two equations R3 and R4 in Property 2: their right-hand sides are equal, and the left-hand sides can be obtained from each other by exchanging E(i + k) and E(j + k) via the swapping properties. As a result, we need a new mathematical treatment of these properties that is free of redundancy and describes completely the patterns that can be reduced.
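Reduction identities of this kind are easy to verify numerically. The sketch below checks the simplest one, E(i + j) · E(i + j) = I, and one representative three-matrix identity (our own instance of the R3/R4 family; indices are illustrative):

```python
import numpy as np

def E3(n, i, j):
    """Type-3 elementary matrix E(i + j) over F_2."""
    M = np.eye(n, dtype=np.uint8)
    M[i, j] = 1
    return M

n = 4
I = np.eye(n, dtype=np.uint8)

# Simplest reduction: adding row j to row i twice cancels out over F_2.
assert ((E3(n, 0, 1) @ E3(n, 0, 1)) % 2 == I).all()

# A three-into-two reduction: E(i+k) * E(j+k) * E(i+j) = E(i+j) * E(j+k),
# checked here with (i, j, k) = (0, 1, 2) -- three XORs replaced by two.
lhs = (E3(n, 0, 2) @ E3(n, 1, 2) @ E3(n, 0, 1)) % 2
rhs = (E3(n, 0, 1) @ E3(n, 1, 2)) % 2
assert (lhs == rhs).all()
print("identities hold")
```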
Apart from obtaining a comprehensive reduction algorithm, it is crucial to determine whether non-adjacent type-3 matrices in a sequence can be swapped to adjacent positions. The work of Xiang et al. did not propose an explicit algorithm; the judgment logic in their code is as follows: for E^3_x and E^3_y (x < y) in the sequence, to decide whether they can reach adjacent positions, it only checks whether E^3_x can be moved step by step to position y−1, or E^3_y step by step to position x+1. However, this logic does not capture the full complexity of the sequence and fails to account for the potential difficulties of pairwise exchanges. The problem becomes harder as the circuit width increases, as the following example shows.

Example 1. Suppose the sequence s consists of two subsequences s_1 and s_2, i.e., s = s_1 · s_2, with s_1 = E(y + x)E(y + t_0) ⋯ E(y + t_m) and s_2 = E(u_0 + y)E(u_1 + y) ⋯ E(u_n + y)E(x + y), where the t_i, u_i, x, y are pairwise distinct integers. In this situation, E(y + x) and E(x + y) satisfy R7 of the reduction properties; however, the sequence s cannot be reduced by the algorithm of Xiang et al. We only need to move E(y + x) to the tail of s_1 and E(x + y) to the head of s_2, after which the reduction property can be applied.
To overcome this limitation, we design a new algorithm that accurately determines whether a given pair of matrices can be exchanged to adjacent positions, providing a more reliable way of handling the swapping properties.
By incorporating this new algorithm, we enhance the overall effectiveness and robustness of the reduction process, ensuring that the identified patterns can not only be reduced but also seamlessly moved to adjacent positions. In Section 4, we present an efficient algorithm that completely solves the pairwise exchange problem in O(|s|^2) time.
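The pairwise exchange question ultimately rests on a local condition: when do two adjacent type-3 matrices commute? The sketch below states that condition in our own formulation (not the paper's exact algorithm) and verifies it against actual matrix products: E(a + b) and E(c + d) commute iff neither operation writes a row the other reads, i.e., a ≠ d and c ≠ b.

```python
import numpy as np
from itertools import product

def E3(n, i, j):
    M = np.eye(n, dtype=np.uint8)
    M[i, j] = 1
    return M

def commutes(a, b, c, d):
    """E(a+b) and E(c+d) commute iff a != d and c != b:
    x_a ^= x_b and x_c ^= x_d interfere only when one writes
    a register the other reads."""
    return a != d and c != b

# Brute-force check of the rule for n = 4 over all valid index choices.
n = 4
for a, b, c, d in product(range(n), repeat=4):
    if a == b or c == d:
        continue
    lhs = (E3(n, a, b) @ E3(n, c, d)) % 2
    rhs = (E3(n, c, d) @ E3(n, a, b)) % 2
    assert bool((lhs == rhs).all()) == commutes(a, b, c, d)
print("commutation rule verified")
```

Repeatedly applying this test to adjacent pairs is what lets a swap algorithm decide, in quadratic time, whether two distant type-3 matrices can be brought together.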

A Framework to Optimize s-XOR Sequence
In this section, we redefine the problem, including the definition of matrix decomposition. Starting from the new definitions, we derive a complete reduction algorithm within the framework of matrix decomposition.
Definition 5. A matrix decomposed sequence S of a matrix M ∈ GL(n, F_2) is a sequence of elementary matrices whose product equals M, where every element of the sequence is either a type-1 or a type-3 matrix. Applying Property 1, the sequence S can be transformed into a matrix decomposed ordered sequence, which has the form s = E^1_{k_1−1} ⋯ E^1_0 · E^3_{k_3−1} ⋯ E^3_0 · I. If a sequence contains only type-1 elementary matrices, it is called a type-1 sequence; similarly, a sequence containing only type-3 elementary matrices is called a type-3 sequence. We call the front part E^1_{k_1−1} ⋯ E^1_0 the type-1 sequence of s, and E^3_{k_3−1} ⋯ E^3_0 the type-3 sequence of s. If a part is empty (|s| = 0), it is replaced by the identity matrix I.
Based on the above analysis, a type-1 elementary matrix corresponds to a permutation of wires in the circuit, while a type-3 matrix corresponds to an XOR of two lines of the circuit. The order of the type-3 sequence matches the order of the XOR operations in the circuit. However, since the cost of a permutation in the circuit is negligible and the aim is to reduce the number of type-3 matrices in decomposed sequences, our research focuses on type-3 sequences and ignores the details of type-1 sequences. In the following, the notation |s| denotes the length of the type-3 sequence in the matrix decomposed ordered sequence s.

Definition 6. Two type-3 sequences s_1 and s_2 are equivalent, written s_1 ∼ s_2, if and only if there exist two permutation matrices p_L and p_R such that p_L · s_1 · p_R = s_2.
Since a permutation matrix can be written as a type-1 sequence, the equation of Definition 6 can alternatively be expressed as s_l · s_1 · s_r = s_2, where the type-1 sequences s_l and s_r equal the permutation matrices p_L and p_R, respectively.
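For tiny dimensions, the equivalence of Definition 6 can be checked by brute force over all permutation pairs; the sketch below (with toy sequences of our own) does exactly that, and its factorial cost is why the framework instead reduces the test to a GI instance.

```python
import numpy as np
from itertools import permutations

def seq_value(n, seq):
    """Value of a type-3 sequence: product of E(i+j) matrices over F_2,
    with earlier list entries applied first (rightmost in the product)."""
    M = np.eye(n, dtype=np.uint8)
    for i, j in seq:
        E = np.eye(n, dtype=np.uint8)
        E[i, j] = 1
        M = (E @ M) % 2
    return M

def equivalent(n, s1, s2):
    """Brute-force Definition 6: do permutations pL, pR with pL*A*pR = B exist?"""
    A, B = seq_value(n, s1), seq_value(n, s2)
    I = np.eye(n, dtype=np.uint8)
    perms = [I[list(p)] for p in permutations(range(n))]
    return any(((pL @ A @ pR) % 2 == B).all() for pL in perms for pR in perms)

# E(0+1)E(1+2) and E(2+1)E(1+0) are equivalent: relabel rows/columns in reverse.
print(equivalent(3, [(1, 2), (0, 1)], [(1, 0), (2, 1)]))  # -> True
```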
It is easy to check that the above relation is reflexive, symmetric, and transitive, and therefore it is an equivalence relation. Consequently, we can explore its properties from the perspective of graph or set theory. For two equivalent sequences s_1, s_2, assume their values equal matrices A, B ∈ GL(m, F_2), respectively. Let A = {a_{i,j}}_{m×m} and B = {b_{i,j}}_{m×m} be the relation matrices of the bipartite graphs G_A and G_B. Under these definitions, we introduce the following lemma:

Lemma 1. Given two type-3 sequences s_1 and s_2, if s_1 and s_2 are equivalent, then the bipartite graphs G_A and G_B are isomorphic, where A = s_1 · I and B = s_2 · I.
Proof. Assume s_1 and s_2 are type-3 sequences over GL(m, F_2); then the matrices A and B are square matrices of size m. If s_1 and s_2 are equivalent, there exist vertex permutations σ_L and σ_R such that p_L · A · p_R = B, where p_L and p_R are the matrix forms of the permutations. The permutation σ_L is obtained from p_L by setting σ_L(i) = j when the entry of p_L in position (i, j) equals 1.
Then it holds that B_{i,j} = A_{σ_R(i),σ_L(j)} for i, j = 1, 2, ..., m. According to the definition of the bipartite graph above, adjacency is preserved under this relabelling of the two vertex parts. In other words, we can construct a bijective mapping f : V(G_A) → V(G_B) such that any two vertices v(i) and v(j) in G_A are adjacent if and only if their images f(v(i)) and f(v(j)) are adjacent in G_B.
However, the converse of this lemma does not hold, because an isomorphic mapping between the bipartite graphs G_A and G_B may cross the vertex partition sets V_1 and V_2; such a mapping cannot be transformed into the form of Definition 6. We show a simple example.

Example 2. Given two type-3 sequences s_1, s_2, the corresponding matrix values of s_1 and s_2 are as follows. Figure 1 shows the bipartite graphs G_A and G_B generated by s_1 and s_2.
In fact, there exist p_L and p_R such that p_L · s_1 · p_R = (s_2 · I)^T. Furthermore, in many other situations, the isomorphic mapping will map vertices in V_1 to V_2 and vertices in V_2 to V_1. We can use two techniques to handle this situation. The first is to use a directed graph instead of an undirected one: when transferring the invertible matrix to a directed graph, if matrix A has an entry a_{i,j} = 1, then the directed bipartite graph has an edge from i ∈ V_2 to j ∈ V_1. The directed edges restrict the mapping within V_1 and V_2, since the in-degree of every vertex in V_2 is zero while it is greater than zero in V_1.
The second technique, which is the one actually applied in our work, is to color the graph. In graph theory, a vertex-colored graph treats the color of each vertex as an extra attribute. Formally, a graph with |V| nodes is said to be colored if each node is labeled with a positive integer not greater than |V| [JKMT03]. Returning to Example 2, we color the vertices in V_1 with color 1 (blue) and, simultaneously, the vertices in V_2 with color 2 (red). The resulting colored undirected graphs are no longer isomorphic, because for two vertex-colored graphs an isomorphic mapping must preserve both the edges and the vertex colors. Figure 2 shows the vertex-colored graphs of Example 2. We denote the colored graphs by G̃_A and G̃_B, coloring V_1 with color 1 (blue) and V_2 with color 2 (red). After the above discussion, we obtain the following theorem:

Theorem 2. Two type-3 sequences s_1 and s_2 are equivalent if and only if the vertex-colored bipartite graphs G̃_A and G̃_B are isomorphic.

Proof. According to Lemma 1, we only need to prove the converse direction. Assume G̃_A is isomorphic to G̃_B; then we can construct permutations σ_L and σ_R from a bijective mapping f : V(G̃_A) → V(G̃_B). Because the graphs are vertex-colored, f splits into two parts: one whose preimages and images are vertices v(i) with i ≤ m, and the other covering the remaining vertices. The theorem then follows by the argument of Lemma 1.

Definition 7. A type-3 sequence s is a composite sequence if there exists another type-3 sequence s′ such that s ∼ s′ and f(s′) < f(s); otherwise, we call s a prime sequence. Here f is the objective function we want to minimize, mapping a sequence s to a real number.
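To illustrate the colouring technique above, here is a brute-force vertex-coloured isomorphism check for tiny relation matrices (our own sketch; the framework itself delegates this to Nauty/Traces). Colouring V_1 and V_2 differently forces the bijection to respect the two sides, ruling out the cross-side mappings of Example 2.

```python
from itertools import permutations

def colored_bipartite_iso(RA, RB):
    """Brute-force: are the vertex-coloured bipartite graphs with relation
    matrices RA, RB isomorphic? The colours confine the mapping to a pair
    of permutations, one on V1 (colour 1) and one on V2 (colour 2)."""
    m = len(RA)
    idx = list(range(m))
    for pL in permutations(idx):          # relabelling of V1
        for pR in permutations(idx):      # relabelling of V2
            if all(RA[i][j] == RB[pL[i]][pR[j]] for i in idx for j in idx):
                return True
    return False

RA = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
RB = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]   # RA with rows 0,1 and columns 0,1 relabelled
print(colored_bipartite_iso(RA, RB))     # -> True
```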
To better explain the meaning of prime sequences and the objective function, up to Section 5.2 we use as a running example the search for the minimum number of XORs needed to implement an invertible linear matrix. In this setting, the objective function is f(s) = |s|, where |s| is the number of type-3 matrices in the sequence s.
An equivalence relation partitions a set into equivalence classes.
To study which sequences can be optimized, we use the concepts of prime and composite sequences introduced above. Denote by S_{l,n} the set of all sequences of length l with square matrix size n (n also equals the width of the circuit). For a given objective function f, denote by P^n_f the set of all prime sequences with square matrix size n. Under a fixed objective function f and matrix size n, the subscript and superscript can be omitted without ambiguity; in this setting, the notations P and P^n_f are interchangeable.
In the case of optimizing s-XOR counts, it is evident that a sequence s with |s| = l cannot be decomposed into a shorter sequence s′ if and only if s ∈ P. Moreover, we can design a trivial exhaustive algorithm to search all of P. However, the size of P becomes excessively large as the parameters scale up; in such cases, equivalence classes are a more effective tool. The set P can be divided into P = P_0 ∪ P_1 ∪ ... ∪ P_t, where each P_i is an equivalence class under the relation we defined. Thus we store only one element of every equivalence class as a representative, and denote the resulting set by P̃^n_f (P̃ for short). Algorithm 1 is a trivial exhaustive search for P̃^n_f within S_{1,n} ∪ S_{2,n} ∪ ... ∪ S_{K,n}, where K is an integer large enough to cover all prime sequences. Since the size of S_{K,n} is n^K(n − 1)^K and the worst-case complexity of practical isomorphism algorithms is O(2^n), the complexity of Algorithm 1 can be estimated as O(n^{2K} · 2^n · |P̃|). In theory, to find an equivalent sequence of smaller length, we could exhaustively search P with the objective function f(s) = |s|: for any s ∈ S_{l,n}, we check whether there exists p ∈ P such that s ∼ p. However, it is unrealistic to search the whole of P in GL(n, F_2) for large n, since the size of P can be roughly estimated by the following corollary.
Proof. The work in [Köl19] proved a similar result; the difference is that we consider two permutation matrices simultaneously. Essentially, every p ∈ P is the optimal type-3 sequence decomposition of a corresponding invertible matrix M′ ∈ GL(n, F_2) under the objective function f. Assume M′ has two matrix decomposed sequences s_1, s_2 and there exist two prime sequences p_1 and p_2 such that s_1 ∼ p_1 and s_2 ∼ p_2 with p_1, p_2 ∈ P. According to the definition of prime sequence, and since s_1 and s_2 decompose the same matrix M′, we have p_1 ∼ s_1 ∼ s_2 ∼ p_2, and hence p_1 ∼ p_2. So there exists one and only one equivalence class whose elements are equivalent to every matrix decomposed sequence of a given invertible matrix. In other words, we can find one and only one equivalence class representing a specific invertible matrix.
Then consider any two different invertible matrices M_1 and M_2 with M_p · M_1 = M_2, where M_p is a permutation matrix. Assume p_1, p_2 are the prime sequences representing M_1 and M_2, respectively. Then there exist permutation matrices p_L and p_R such that p_L · M_1 · p_R = M_2, i.e., p_1 ∼ p_2. In other words, we can use the same equivalence class to represent the matrices M_1 and M_2. For one invertible matrix M_1, we can obtain n! different invertible matrices M_2 by performing the n! different row permutations, and all of them belong to the same equivalence class. This suggests that the number of distinct equivalence classes is upper-bounded by the total number of invertible matrices divided by n!. When we consider row and column permutations simultaneously, some invertible matrices are generated repeatedly; as a result, we obtain the bound stated in the corollary. The corollary demonstrates that obtaining the complete P̄, i.e., the optimal sequences of all matrices in GL(n, F_2), requires an enormous amount of data when n = 16 or n = 32. To tackle this challenge, instead of searching for the global optimum, we focus on local optima: we search for reductions that are optimal for every subsequence of length at most l_m (where l_m is a parameter denoting the maximum length of a subsequence considered for optimization) that can be swapped to adjacent positions.
The crucial ingredient of the algorithm is the function QueryBetterSequence(s), QBS(s) for short. It takes as input a sequence s of type-3 elementary matrix multiplications from GL(n, F_2) whose length equals l_m, and returns a quadruple (flag_reduce, p, p_L, p_R). If flag_reduce is true, the function has reduced s to p such that f(p) < f(s) and s = p_L · p · p_R. Otherwise, no sequence p satisfies f(p) < f(s).
In Algorithm 2, we rely on a query function QBS() and a swapping check algorithm SwapCheck(), which we elaborate on subsequently. Essentially, the algorithm iterates over all feasible discontinuous subsequences of length l_m using a pointer array ptr and the function MovePointer() to update it. Specifically, MovePointer(ptr) attempts to move the last element of the pointer array ptr. If the current pointer cannot move (that is, it reaches the boundary or equals the value of the following pointer minus 1), it tries to move the pointer before it, and so on, until no pointer can be moved. If the subsequence can be swapped to be contiguous and can be reduced to a new sequence of smaller length, the original sequence s is updated accordingly.
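Under our reading of the text, MovePointer enumerates exactly the l_m-element index combinations of the sequence. The following Python sketch (our own naming, not the paper's code) advances the last movable pointer and repacks the pointers after it:

```python
def move_pointer(ptr, seq_len):
    """One MovePointer step, as we read it from the text: advance the last
    pointer that has room (its boundary is the next pointer minus 1, or the
    end of the sequence for the last pointer), then repack the tail tightly.
    Returns (still_looping, new_ptr); together with the starting pointer array
    {0, 1, ..., l_m - 1} this enumerates all index combinations."""
    ptr = list(ptr)
    m = len(ptr)
    for k in range(m - 1, -1, -1):
        limit = seq_len - 1 if k == m - 1 else ptr[k + 1] - 1
        if ptr[k] < limit:
            ptr[k] += 1
            for t in range(k + 1, m):          # repack the pointers after k
                ptr[t] = ptr[t - 1] + 1
            return True, ptr
    return False, ptr  # all pointers exhausted
```

For a sequence of length 5 and l_m = 3, this visits all C(5, 3) = 10 subsequences, matching the l^{l_m}-style iteration count used in the complexity discussion.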
In order to create the function QBS() in the program, it is crucial to leverage the equivalence property and an algorithm for solving the GI problem. For a given length l_m, after generating P̄ through Algorithm 1, a given sequence s ∈ S_{l_m,n} can be optimized by the following steps. If the sequence s is equivalent to some p ∈ P̄ with f(p) < f(s), we can optimize s by using p instead, since s = p_L · p · p_R, where p_L and p_R are permutation matrices that consume no XOR operations. As a result, we can build the query function assumed above by precomputing the GI problem to obtain p_L and p_R. We use an example in Appendix A to demonstrate the process.

Algorithm 2 Reduction algorithm for s ∈ S_{l,n}
Input: the sequence s to be reduced; the function QBS(); the length l_m of the subsequences in each iteration
Output: the reduced sequence from s
1: initialize ptr := {0, 1, 2, ..., l_m − 1}
2: initialize loop := true
3: while loop do
4:   s′ := the subsequence of s indexed by ptr
5:   (flag_swap, pos_insert) := SwapCheck(ptr, s)
6:   (flag_reduce, p, p_L, p_R) := QBS(s′)
7:   if flag_reduce and flag_swap then
8:     Remove the items of s with index in ptr
9:     Insert p at pos_insert
10:    Update s based on p_L and p_R
11:    Reset ptr := {0, 1, 2, ..., l_m − 1}
12:  end if
13:  (loop, ptr) := MovePointer(ptr)
14: end while
15: return s
The function SwapCheck(ptr, s) in Algorithm 2 judges whether specific elements of the sequence can be swapped to adjacent locations through Property 3. If the swapping flag is true, it also returns the indices of the elements after swapping. We have to stress, however, that the efficiency of this part dramatically affects the efficiency of the entire reduction algorithm: although the exchange algorithm used by Xiang et al. is incomplete, its efficiency is fully guaranteed. For a type-3 sequence s = E^3_0 E^3_1 ... E^3_n, the pointer array is ptr = {i_0, i_1, ..., i_{l_m−1}} with i_0 ≤ i_1 ≤ ... ≤ i_{l_m−1}, and the insert position index is pos_insert. In our actual use, the exchange algorithm can be divided into three methods, the third of which is the generalized swapping algorithm (Algorithm 3).
For a given sequence s and target element index array a (of size m), we only need to consider the elements that are not our targets, namely the elements in the interval s′ = s[a_0, a_{m−1}] spanned by the first and last target indices. Subsequently, a pointer is moved in an orderly fashion, with the current element marked as t. If t happens to be an element of ptr, no further action is required. Otherwise, there are only two options: move it out of s′ to the left or to the right. One essential guideline is to keep the left side of s′ consisting of target elements only, so we first try to move t to the left. If this fails, the next step is to move t to the right; when moving t to the right, we only need to determine whether it can be exchanged with all elements on its right.
If not, we must continue the comparison based on a depth-first traversal. In the worst case, where every element has to be moved, at most |s′| comparisons are made per element; hence, the complexity of the overall algorithm can be estimated as O(|s|^2). The pseudocode of Algorithm 3 can be found in Appendix B.
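The paper does not restate Property 3 here, so the sketch below encodes our assumption of what it says: two XOR operations x_i ^= x_j and x_k ^= x_l can be exchanged iff neither reads the other's target (i ≠ l and k ≠ j). On top of that rule we sketch a deliberately conservative swap check — cruder than Algorithm 3, which falls back to a depth-first search instead of giving up — that tries to slide every non-target element out of one side of the window spanned by the targets:

```python
def commutes(a, b):
    """Adjacent type-3 operations x_i ^= x_j and x_k ^= x_l can be exchanged
    iff neither reads the other's target: i != l and k != j. This symmetric
    rule is our reading of Property 3."""
    (i, j), (k, l) = a, b
    return i != l and k != j

def can_make_adjacent(seq, targets):
    """Conservative swap check: every non-target element inside the window
    spanned by the target indices must commute past all window elements on
    one side (then it slides out that side, leaving the targets contiguous).
    Sound, but it rejects some cases the paper's Algorithm 3 would accept."""
    lo, hi = min(targets), max(targets)
    for idx in range(lo, hi + 1):
        if idx in targets:
            continue
        op = seq[idx]
        slides_left = all(commutes(seq[p], op) for p in range(lo, idx))
        slides_right = all(commutes(op, seq[p]) for p in range(idx + 1, hi + 1))
        if not (slides_left or slides_right):
            return False
    return True
```

For example, in E(0+1) E(2+3) E(0+1) the middle operation touches disjoint wires and can slide out either way, whereas in E(0+1) E(1+0) E(0+1) it reads and writes the targets' wires and is stuck.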
We must point out that when applying Method 3, overly complex matrices can cause efficiency issues (when the matrix size is 32 bits or more). The reason is that in the worst case of Algorithm 2, one optimization is performed each time all possible pointer values are traversed. Assume the complexity of the SwapCheck() algorithm is t_0 and t_1 reductions have been made; then the complexity of the algorithm can be estimated as O(l^{l_m} · t_0 · t_1). Thus, it is imperative to strike a balance and use different methods for different matrices.
In conclusion, we have proposed a more general reduction framework for optimizing the objective function of a type-3 matrix sequence by solving the GI problem. The framework precomputes the function QueryBetterSequence(s) to judge whether a type-3 sequence of length l_m can be reduced. It applies the elimination strategy-3-1 or strategy-3-2 proposed by Xiang et al., which continuously creates sequences, and for each sequence it checks all subsequences that can be swapped to adjacent positions. This approach ensures that all subsequences of length l′ (l′ ≤ l_m) remain optimal, thereby avoiding the limitations of Xiang et al.'s original algorithm highlighted in Section 3, namely insufficiency and redundancy.

Applications
In this section, we apply our improved algorithm to a variety of invertible matrices and obtain enhanced implementations under XOR counts and quantum circuit depth, respectively. For optimizing XOR counts, we conduct a comprehensive comparison with existing algorithms, including those proposed by Xiang et al., Paar, and Boyar. Our new algorithm achieves the best results known to us for most of the matrices considered, including 16 × 16 and 32 × 32 matrices. Notably, we improve the AES MixColumn implementation to 91 XORs (Table 4) from the previous result of 92 [XZL + 20], which equals the best implementation under g-XOR counts [LXZZ21].
In the previous section, we introduced the definition of the objective function, which, for ease of understanding, we regarded as the length of the type-3 sequence. However, the definition of the objective function is very flexible. For example, we can change the objective function to the quantum circuit depth corresponding to the type-3 sequence. We apply our algorithm to many ciphers' linear layers as well, and Table 5 provides a second AES MixColumn implementation with a depth of 13, a considerable improvement over the previous implementation of depth 28 [ZH22]. For some complex matrices, our algorithm still faces efficiency issues, including most 64 × 64 matrices and 32 × 32 matrices with high d-XOR counts.

Low XOR counts implementation under s-XOR metric
We applied the algorithm described in Section 4 to various matrices, including the AES linear layer, to search for implementations of the linear layer with fewer XOR operations. For each matrix, we obtained the corresponding result by running the process on a 64-core computer for over five days. In most cases, there is a significant improvement compared to the original results. However, while this method is more comprehensive than the one proposed by Xiang et al., it does have efficiency limitations. In particular, when the initial sequence generated by [XZL + 20] involves approximately 500 or more XOR operations, the efficiency gap between the two algorithms becomes pronounced. In such cases, when the matrix size exceeds 16 × 16, we must switch from the swapping strategy of Method 3 in Section 4 to Method 2. Moreover, the parameter l_m is set to 4 here.
To compare efficiency, we include results from recent studies in Table 2. To distinguish between the two methods of implementation, namely s-XOR and g-XOR implementations, we divide them into separate sub-tables for comparison. Results marked with (*) indicate that they are also the state of the art known to us under g-XOR counts.

Optimizing the depth of quantum implementations of linear layers
The method in Section 4 is a universal optimization framework that enables adapting the objective function to search for low-depth implementations of linear layers in quantum circuits. This is achieved by setting the objective function f(s) to the depth of the sequence s, so that P̄ stores the type-3 sequences of lowest depth realizing the corresponding matrices. In Section 5.2, we set f(s) = M · depth(s) + |s|, where M is a number large enough that the circuit depth is optimized first and the XOR count second. Specifically, M = 10^{⌈log_10 |s|⌉+2} in our experiments.
In the case of a type-3 sequence, it corresponds to a sequence of XOR operations. While the order of this operation sequence should generally be maintained, it is not absolute: the only requirement is that the values of x_i and x_j on which each operation depends stay the same. From this perspective, a type-3 sequence can be viewed as a graph in which each operation is a node. Each node involves x_i and x_j and is connected to the previous nodes that involve either x_i or x_j. The nodes in each layer share the same quantum circuit depth, and the objective function becomes the depth of this graph. The following example shows how the function f works on a quantum circuit: Example 3. Consider the type-3 sequence E(4 + 5)E(7 + 5)E(1 + 2)E(2 + 4)E(1 + 3)E(0 + 4)E(1 + 2). We initialize a counter array with zeros to store the depths and traverse the sequence starting from the rightmost operation, which is applied first. The first operation, E(1 + 2), finds that the depths stored in counter[1] and counter[2] are both 0, so the depth of E(1 + 2) is set to 1. The same happens for E(0 + 4). However, when we reach E(1 + 3), counter[1] is 1 and counter[3] is 0; in this case, we take the maximum value and add one, giving E(1 + 3) depth 2. By continuing this process, we determine the depth of the entire sequence, which in this example is 3. Figure 3 shows how it works.
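The counter-array procedure of Example 3 is easy to sketch in Python. In this illustration (our own code, under the assumption that each CNOT occupies both of its wires at its layer, and with the operations of Example 3 listed in execution order, i.e., rightmost matrix factor first), we also evaluate the combined objective f(s) = M · depth(s) + |s| from the text:

```python
import math

def circuit_depth(ops):
    """Depth of a type-3 (CNOT) sequence given in execution order: each
    operation x_i ^= x_j lands at layer max(counter[i], counter[j]) + 1
    and occupies both wires at that layer."""
    counter = {}
    depth = 0
    for i, j in ops:
        layer = max(counter.get(i, 0), counter.get(j, 0)) + 1
        counter[i] = counter[j] = layer
        depth = max(depth, layer)
    return depth

def objective(ops):
    """The combined objective f(s) = M * depth(s) + |s| of Section 5,
    with M = 10 ** (ceil(log10(|s|)) + 2)."""
    M = 10 ** (math.ceil(math.log10(len(ops))) + 2)
    return M * circuit_depth(ops) + len(ops)

# Example 3, written in execution order (reverse of the printed product):
seq = [(1, 2), (0, 4), (1, 3), (2, 4), (1, 2), (7, 5), (4, 5)]
```

Here `circuit_depth(seq)` reproduces the depth of 3 from Example 3, and with |s| = 7 we get M = 1000, so f(s) = 3007: any sequence of smaller depth wins regardless of its XOR count, and ties are broken by length.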
We apply this approach to the AES cipher's linear layer and obtain a quantum circuit of depth 13, given in Table 5; the other linear layer results are in Table 3. A comparison with the method proposed in [ZH22] reveals that our implementation exhibits high parallelism. In most layers, the number of parallel operations exceeds 5, whereas in Zhu et al.'s method, a considerable portion of the circuit layers includes fewer than three parallel operations. However, we have to stress that the results in Table 3 do not keep the best XOR counts, and the implementations may use several extra XORs to achieve high parallelism. As mentioned in the introduction, many works use s-XOR sequences to construct AES quantum circuits, since an s-XOR sequence is an in-place implementation of a linear layer and needs no ancilla qubits; this always leads to lower width and full depth in the circuit. Taking the work in [HS22]

Conclusion
In this work, we present a new flexible heuristic search framework for optimizing XOR counts under the s-XOR metric and the depth of quantum circuits. With the new definitions and the new equivalence relation, we solve the reduction problem by solving the GI problem instead of using reduction properties. The new approach avoids the previous limitations and achieves superior results on many invertible matrices. For AES MixColumn, we achieve an implementation with 91 XORs, which equals the best implementation under the g-XOR metric. Furthermore, compared with previous work, the quantum circuit depth has been reduced from 16 to 13 without ancillae. The results show that the new approach is effective and has the potential to improve existing quantum circuits. As future work, it would be interesting to study more effective ways to create the original type-3 sequences instead of using strategy-3-1 or strategy-3-2. Moreover, research on solving the SLP or SLPD problems can potentially be inspired by this work.

A An example in optimizing sequence
To better explain how to optimize a sequence by solving the GI problem, we take a matrix M ∈ GL(4, F_2) and optimize the number of XOR counts of its implementation. Assume the value of M is: Moreover, we obtain a Matrix Decomposed Ordered Sequence of M through strategy 3-1 or strategy 3-2 in Section 3: s = E(0 ↔ 1)E(1 ↔ 2)E(3 + 0)E(0 + 2)E(2 + 1)E(0 + 2). We first consider using the reduction properties proposed by Xiang et al. to optimize the sequence s. As discussed above, we can ignore the type-1 matrices, which consume zero XOR operations. For the type-3 subsequence s′ = E(3 + 0)E(0 + 2)E(2 + 1)E(0 + 2) of s, the algorithm iterates over all subsequences of s′ of lengths 3 and 2. Table 6 shows that the sequence s′ cannot be reduced:

Table 6: All cases of subsequences of s′ with length 3 and length 2

  Case                        Problem
  E(0 + 2)E(2 + 1)E(0 + 2)    Cannot be reduced through Property 2
  E(3 + 0)E(0 + 2)            Cannot be reduced through Property 1
  E(0 + 2)E(2 + 1)            Cannot be reduced through Property 1
  E(2 + 1)E(0 + 2)            Cannot be reduced through Property 1
  other cases                 Cannot be swapped to adjacent position

Now, we use the method in Section 4 to analyze the sequence s′. In the setting of optimizing the sequence length, we first let the objective function be f(s) = |s| and the parameter n = 4. Then, the framework generates all equivalence classes on GL(4, F_2) with Algorithm 1. Assume we obtain one prime sequence p = E(1 + 2)E(0 + 2), with corresponding value M_p ∈ GL(4, F_2): When the algorithm iterates over all subsequences of s′ of lengths 3 and 2, it calculates the corresponding matrix (the first case in Table 6): It is found that the bipartite graphs G_{M_p} and G_{M_1} are isomorphic. With the help of Theorem 2 and a GI problem solver, we obtain two permutation matrices p_L and p_R such that M_s = p_L · M_p · p_R, where: Then we have: After substituting the new length-2 sequence for M_s into the original sequence s, we get: s = E(0 ↔ 1)E(1 ↔ 2)E(3 + 0)E(2 + 1)E(0 + 1).
Notice that the optimization process does not use any reduction property.

Figure 1 :
Figure 1: G_A and G_B. There are no permutations p_L and p_R such that p_L · s_1 · p_R = s_2 holds, because the number of ones in each row of matrix A is not in one-to-one correspondence with matrix B.

Theorem 2 .
For two sequences s_1 and s_2, s_1 ∼ s_2 if and only if G_A is isomorphic to G_B, where A = s_1 · I and B = s_2 · I.
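The equivalence test behind Theorem 2 can be sketched for small matrices without a dedicated GI solver. The snippet below is our own illustration: it decides whether p_L · A · p_R = B for some permutation matrices by brute force over all permutation pairs, which for the bipartite graphs G_A and G_B is exactly color-respecting graph isomorphism; the paper instead hands this to Nauty/Traces.

```python
from itertools import permutations

def equivalent(A, B):
    """Brute-force test of the equivalence relation on matrices over F2:
    A ~ B iff some row permutation p_L and column permutation p_R satisfy
    p_L * A * p_R = B. Matrices are lists of 0/1 rows. Feasible only for
    small n (n! * n! permutation pairs)."""
    n = len(A)
    for rp in permutations(range(n)):
        rows = [A[r] for r in rp]
        for cp in permutations(range(n)):
            if all(rows[r][cp[c]] == B[r][c]
                   for r in range(n) for c in range(n)):
                return True
    return False
```

Consistent with the remark under Figure 1, matrices whose row weights do not match (e.g. the identity versus the all-ones matrix) can never be equivalent, since permutations only reorder the ones.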

Figure 2 :
Figure 2: G_A and G_B, where the vertices filled with diagonal dashed lines represent "blue" vertices and the vertices filled with dots represent "red" vertices. We thereby transform the problem of determining the equivalence of two sequences into determining the isomorphism of two colored graphs. The latter problem can be easily solved with the help of Nauty and Traces [MP14].

as an example, they proposed low-depth quantum circuits for AES, applying the linear layer given by Xiang et al. [XZL + 20]. They used the #Q estimator to bring the implementation depth to 30. Taking our new implementation instead decreases the full depth from 2198 (with the T-depth-4 S-box) to 2072 and from 2312 (with the T-depth-3 S-box) to 2186, while using 216 extra XOR operations. The number of additional CNOT gates is less than 0.1% of the total number of CNOT gates in the original quantum circuit.

Figure 3 :
Figure 3: Conversion of quantum circuits from a type-3 sequence

The size of P̄^n_f is bounded by the following inequality:

Algorithm 1
Input: the size n of the general linear group; the objective function f.
1: initialize P̄ := ∅
2: generate S := S_{1,n} ∪ S_{2,n} ∪ ... ∪ S_{K,n} (K a sufficiently large integer)
3: for all s ∈ S do

Table 2 :
Implementation cost of cipher linear layers under different methods.

Table 3 :
Quantum circuit depth comparison for implementing cipher linear layers under different methods

Table 5 :
The quantum implementation of AES MixColumn with depth of 13 and 98 XORs