Exploiting Weak Diffusion of Gimli: Improved Distinguishers and Preimage Attacks

The Gimli permutation proposed in CHES 2017 was designed for cross-platform performance. One main strategy to achieve such a goal is to utilize a sparse linear layer (Small-Swap and Big-Swap), which occurs every two rounds. In addition, the round constant addition occurs every four rounds and only one 32-bit word is affected by it. The above two facts have been recently exploited to construct a distinguisher for the full Gimli permutation with time complexity 2^{64}. By utilizing a new property of the SP-box, we demonstrate that the time complexity of the full-round distinguisher can be further reduced to 2^{52} while a significant bias still remains. Moreover, for the 18-round Gimli permutation, we could construct a distinguisher even with only 2 queries. Apart from the permutation itself, the weak diffusion can also be utilized to accelerate the preimage attacks on reduced Gimli-Hash and Gimli-XOF-128 with a divide-and-conquer method. As a consequence, the preimage attacks on reduced Gimli-Hash and Gimli-XOF-128 can reach up to 5 rounds and 9 rounds, respectively. Since Gimli is included in the second round candidates in NIST's Lightweight Cryptography Standardization process, we expect that our analysis can further advance the understanding of Gimli. To the best of our knowledge, the distinguishing attacks and preimage attacks are the best so far.


Introduction
preimage resistance, especially for those selected in the second round in NIST's Lightweight Cryptography Standardization process.
Our Contributions. Leveraging the symmetry of Gimli, we propose a distinguisher by tracing both the symmetry in a single internal state and the symmetry between two different internal states. In this way, a distinguisher for the 18-round Gimli permutation can be achieved with only 2 queries. There seems to be a flaw in extending this 18-round distinguisher to full rounds. Therefore, we turn to improving the full-round distinguisher proposed in [GLNP+20], where only the symmetry in a single internal state is traced. By exploiting a new property of the SP-box, we could construct a full-round distinguisher similar to the one in [GLNP+20] with time complexity 2^{52} while the bias is still kept significant.
In addition, the divide-and-conquer method seems to fit well with the weak linear layer of Gimli. Consequently, we are motivated to develop a divide-and-conquer method to accelerate the exhaustive search for preimages of reduced Gimli-Hash and Gimli-XOF-128. For our preimage attack on 5-round Gimli-Hash, 10 rounds of the Gimli permutation are investigated and we have to exhaust a message space of size 2^{256} in less than 2^{128} time in order to gain advantages over the generic preimage attack. For our preimage attack on 9-round Gimli-XOF-128, 9 rounds of the Gimli permutation are investigated and a message space of size 2^{128} has to be traversed in less than 2^{128} time. Without a dedicated analysis of the linear layer and SP-box, the above two attacks are almost impossible. Our results are summarized in Table 1.
To verify the correctness of our attacks, we have implemented the preimage attack on 5-round Gimli-Hash, the preimage attack on 9-round Gimli-XOF-128 and the distinguishing attack on the full Gimli permutation by reducing the size of the Gimli state. The source code is available at https://github.com/LFKOKAMI/SmallGimli.git. The implementation of the distinguisher for the 18-round Gimli permutation is also included.

Preliminaries
In this section, we present some notation, the description of the Gimli permutation and its applications to hashing. Some useful properties of the SP-box discussed in [LIM20] are introduced as well.

Notation
5. 0^n represents an all-zero string of length n.
6. SP represents the application of the 96-bit SP-box.
7. r represents the size of the outer part of the Gimli state.
8. c represents the size of the inner part of the Gimli state.
10. f^{-1} represents the inverse of the Gimli permutation.
For convenience, denote the internal state after r-round permutation by S^r and the input state by S^0. In other words, we have

Description of Gimli
where 0 ≤ i ≤ 5. Moreover, the six 32-bit round constants are denoted by
To represent a column of the Gimli state S^r, we denote the (j + 1)-th column of the

Linear Layer
The linear layer consists of two swap operations, namely Small-Swap and Big-Swap. Small-Swap occurs every 4 rounds starting from the 1st round. Big-Swap occurs every 4 rounds starting from the 3rd round. Small-Swap and Big-Swap are illustrated in Figure 2.
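The swap schedule can be sketched in Python as follows. This is a minimal sketch that assumes the state layout and round numbering of the Gimli specification [BKL+17]: the state is a list of 12 words whose indices 0 to 3 form the top row, and rounds are counted from r = 24 down to r = 1, so that the 1st round corresponds to r = 24 and the 3rd round to r = 22. The function names are illustrative only.

```python
def small_swap(state):
    """Swap the top-row words of columns (0, 1) and of columns (2, 3)."""
    s = list(state)
    s[0], s[1], s[2], s[3] = s[1], s[0], s[3], s[2]
    return s


def big_swap(state):
    """Swap the top-row words of columns (0, 2) and of columns (1, 3)."""
    s = list(state)
    s[0], s[1], s[2], s[3] = s[2], s[3], s[0], s[1]
    return s


def linear_layer(state, r):
    """Apply the swap of round r (r counted from 24 down to 1), if any.

    Assumption: Small-Swap when r % 4 == 0, Big-Swap when r % 4 == 2,
    which matches "every 4 rounds starting from the 1st / 3rd round".
    """
    if r % 4 == 0:
        return small_swap(state)
    if r % 4 == 2:
        return big_swap(state)
    return list(state)


if __name__ == "__main__":
    demo = list(range(12))
    for r in (24, 23, 22, 21):
        print(r, linear_layer(demo, r)[:4])
```

Only the top row is touched, and no swap at all happens in every second round, which is the sparseness that the attacks rely on.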

Gimli-Hash
How Gimli-Hash compresses a message is illustrated in Figure 3. Specifically, Gimli-Hash initializes a 48-byte Gimli state to all-zero. It then reads sequentially through a variable-length input as a series of 16-byte input blocks, denoted by M_0, M_1, · · ·.
Figure 3: The process to compress the message
Each full 16-byte input block is handled as follows:
• XOR the block into the first 16 bytes of the state (i.e. the top row of 4 words).
• Apply the Gimli permutation.
The input ends with exactly one final non-full (empty or partial) block, having b bytes where 0 ≤ b ≤ 15. This final block is handled as follows:
• XOR the block into the first b bytes of the state.
• XOR 1 into the next byte of the state, position b.
• XOR 1 into the last byte of the state, position 47.
• Apply the Gimli permutation.
After the input is fully processed, a 32-byte hash output is obtained as follows:
• Output the first 16 bytes of the state (i.e. the top row of 4 words), denoted by H_0.
• Apply the Gimli permutation.
• Output the first 16 bytes of the state (i.e. the top row of 4 words), denoted by H_1.
As depicted in Figure 3, the state after M_i (i ≥ 0) is injected is denoted by S_i and the 256-bit hash value is the concatenation of (H_0, H_1). Formally, the following relations hold:
In our preimage attacks on Gimli-Hash, two consecutive message blocks will be utilized. To distinguish the states where different message blocks are processed, we further introduce the following notations: when processing M_i, denote the internal state after the r-round permutation by S^r_i and the input state by S^0_i. In other words, we have

Gimli-XOF
In addition to Gimli-Hash, another application of the Gimli permutation called "extendable one-way function" (Gimli-XOF) is specified in the submitted Gimli document [BKL+17]. For completeness, we briefly introduce the construction of Gimli-XOF recommended by the designers for lightweight applications.
Construction. At the squeezing phase, different from Gimli-Hash which generates a fixed-length output of 32 bytes, Gimli-XOF works as follows to generate t bytes of output:
1. Concatenate ⌈t/16⌉ blocks of 16 bytes, each of which is obtained by extracting the first 16 bytes of the state and then applying the Gimli permutation.
2. Truncate the obtained 16⌈t/16⌉ bytes to t bytes.
At the absorbing phase, the so-called two-way fork [BKL+17] is adopted, as specified below:
1. Read the message byte by byte (imagining that there is a device). XOR the byte into the state at the current position and then increase the current position. If the current position exceeds the end of the block (each block can absorb at most 16 bytes at a time), apply the permutation and set the current position back to the first byte.
2. When reaching the "end of data", XOR 1 into the state at the current position and apply the Gimli permutation.
Obviously, the difference between Gimli-Hash and Gimli-XOF at the absorbing phase lies in the padding rule.
To apply our technique, the parameter t is set as 16. In other words, the Gimli permutation is used to generate 128 bits of output. For simplicity, Gimli-XOF with a 128-bit output is denoted by Gimli-XOF-128.
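The absorbing and squeezing rules of Gimli-Hash and Gimli-XOF described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the permutation follows the Gimli specification [BKL+17] as we understand it (SP-box, swaps, and the constant 0x9e377900 XOR r added to the first word every four rounds), state bytes are mapped to 32-bit words in little-endian order, and the `rounds` parameter, which runs only the first `rounds` rounds counted from r = 24 downwards, is a hypothetical convenience for experimenting with reduced versions. The XOF padding implements only the rule stated in the text above.

```python
import struct

MASK = 0xFFFFFFFF


def rotl(v, n):
    """Rotate a 32-bit word left by n bits."""
    return ((v << n) | (v >> (32 - n))) & MASK


def gimli(words, rounds=24):
    """Apply the first `rounds` rounds of Gimli to a list of 12 words."""
    s = list(words)
    for r in range(24, 24 - rounds, -1):
        for j in range(4):  # SP-box on each column
            x, y, z = rotl(s[j], 24), rotl(s[4 + j], 9), s[8 + j]
            s[8 + j] = (x ^ (z << 1) ^ ((y & z) << 2)) & MASK
            s[4 + j] = (y ^ x ^ ((x | z) << 1)) & MASK
            s[j] = (z ^ y ^ ((x & y) << 3)) & MASK
        if r % 4 == 0:  # Small-Swap, then round constant addition
            s[0], s[1], s[2], s[3] = s[1], s[0], s[3], s[2]
            s[0] ^= 0x9E377900 ^ r
        elif r % 4 == 2:  # Big-Swap
            s[0], s[1], s[2], s[3] = s[2], s[3], s[0], s[1]
    return s


def permute(state):
    """Apply Gimli in place to a 48-byte bytearray (little-endian assumed)."""
    state[:] = struct.pack("<12I", *gimli(struct.unpack("<12I", state)))


def gimli_hash(msg):
    """32-byte Gimli-Hash digest, following the block rules in the text."""
    state = bytearray(48)
    full = len(msg) // 16
    for i in range(full):  # full 16-byte blocks
        for k in range(16):
            state[k] ^= msg[16 * i + k]
        permute(state)
    tail = msg[16 * full:]  # final block with b = len(tail) <= 15 bytes
    for k, byte in enumerate(tail):
        state[k] ^= byte
    state[len(tail)] ^= 1
    state[47] ^= 1
    permute(state)
    h0 = bytes(state[:16])
    permute(state)
    return h0 + bytes(state[:16])


def gimli_xof(msg, t=16):
    """t bytes of Gimli-XOF output; t = 16 gives Gimli-XOF-128."""
    state = bytearray(48)
    pos = 0
    for byte in msg:  # byte-wise absorbing
        state[pos] ^= byte
        pos += 1
        if pos == 16:
            permute(state)
            pos = 0
    state[pos] ^= 1  # "end of data" padding
    permute(state)
    out = b""
    while len(out) < t:  # squeezing
        out += bytes(state[:16])
        permute(state)
    return out[:t]


if __name__ == "__main__":
    print(gimli_hash(b"abc").hex())
    print(gimli_xof(b"abc").hex())
```

For a message shorter than 16 bytes, the two functions differ only in the extra XOR into byte 47, which is the padding difference noted above.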

Properties of the SP-box
Suppose (OX, OY, OZ) = SP(IX, IY, IZ). Several properties have been discussed in [LIM20] and we list some useful ones for our attacks.
Property 1. [LIM20] If (IY ≪ 9) ∧ 0x1fffffff = 0, OX will be independent of IX.
Property 2. [LIM20] A random triple (IY, IZ, OX) is potentially valid with probability 2^{-15.5} without knowing IX.
In addition to the above mentioned properties, we provide some extra meaningful properties of the SP-box.
Proof. This can be easily observed from the expressions to calculate , as specified below: Since IY ⊕ IY' = 0 and IZ ⊕ IZ' = 0, the following four relations must hold:
The motivation to investigate Property 5 and Property 6 is to construct the distinguisher for the 18-round Gimli permutation. To improve the full-round distinguisher proposed in [GLNP+20], we will extend Property 6 to Property 9. The motivation for this extension will become clear when the improved full-round distinguisher is described. Therefore, Property 9 will be detailed in Subsection 4.2.
Proof. First of all, consider the generic time complexity to recover the pair (x_0, x_2). For each guessed value of x_0, (x_1, y_1, z_1) can be determined. Since (y', z') are known, based on Property 4, the correctness of the computed (y_1, z_1) can be immediately checked without knowing x_2. According to Property 4, the tuple (y_1, z_1, y', z') is valid with probability 2^{-32}. Since there are at most 2^{32} values of x_0, after all the possible values of x_0 are traversed, one can expect only one solution of x_0 which makes the tuple (y_1, z_1, y', z') valid. Once the tuple is valid, x_2 can be uniquely determined based on Property 4. Consequently, the generic method is a simple exhaustive search for x_0, which requires 2^{32} time. In our method below, x_0 can be efficiently exhausted with the guess-and-determine technique.
For simplicity, let v = x_0 ≪ 24. First of all, consider the relations between (x_0, y_0, z_0) and (y_1, z_1): It can be easily observed that when (y_0, z_0) are known, each bit of (z_1, y_1) can be expressed as follows: are known values over GF(2), which can be calculated according to (y_0, z_0). For convenience, let y = y_1 ≪ 9, z = z_1, x = x_2 ≪ 24. Then, each bit of (z, y) can be expressed as follows: where γ_i, α_i and β_i (0 ≤ i ≤ 31) are known values over GF(2), which can be calculated according to (y_0, z_0).
Consider the relations between (x, y, z) and (y', z'), as specified below: We rewrite the expression of y' as follows: By involving z' into the expression of y', we can obtain that As it can be derived that For simplicity, let Y = y' ⊕ z'. Considering the expression from the bit level, we can derive the following 32 equations: In the above equation system (Eq. 1∼32), (z', Y) are known and (y, z) are linear in the unknown x_0. Our aim is to recover (y, z) in order to recover the unknowns (x_0, x_2). The procedure to solve the above equation system is described as follows: Step 2: The expression of y[i] is as follows: Step 4, the cost of guessing can be evaluated as 2^{0.4}. As a result, the time complexity to traverse all solutions of the above equation system is 2^{5+1+4+0.4} = 2^{10.4}. On the other hand, we neither construct a coefficient matrix nor use Gaussian elimination when solving the above equation system. The unknown variables can be calculated step by step by considering the corresponding expressions, which is very efficient.
As explained at the beginning of the proof, since x_0 can be exhausted in 2^{10.4} time, (x_0, x_2) can be recovered in 2^{10.4} time and the expected number of solutions is 1.
Property 8. Given a random constant value of OX and N uniformly distributed pairs of (IY, IZ), when N is sufficiently large, the expectation of the number of the solutions of IX is N.
Proof. Consider the expressions to compute OX as shown in Equation 33.
Denote the probability that there are 2^s solutions of IX for a given random triple (IY, IZ, OX) by Pr(s). Therefore, As a result, the expectation of the number of solutions of IX denoted by E can be formulated as follows: In addition, according to Property 2, a random triple (IY, IZ, OX) is valid with probability 2^{-15.5}. Thus, we can expect N solutions of IX when N is sufficiently large, e.g. N = 2^{32}. According to experiments, when N = 2^{32}, about 2^{32} (slightly greater than 2^{32}) solutions of (IX, IY, IZ) can be obtained to match a given OX.
As mentioned in the proof, a random triple (IY, IZ, OX) is valid with probability 2 −15.5 based on Property 2. Therefore, it would be meaningful to study how many solutions there will be for (IX, OY, OZ) when there are a large number of uniformly distributed triples (IY, IZ, OX).
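Property 1 and Property 8 can be checked numerically with a short Python sketch. It assumes the SP-box expressions of the Gimli specification [BKL+17], with OX = IZ ⊕ y ⊕ ((x ∧ y) ≪ 3) for x = IX ⋘ 24 and y = IY ⋘ 9. For Property 8, a hypothetical reduced word size n is used so that the search is exhaustive, and the rotations are omitted because they are bijections on the inputs; the helper names are illustrative only.

```python
MASK = 0xFFFFFFFF


def rotl(v, n):
    """Rotate a 32-bit word left by n bits."""
    return ((v << n) | (v >> (32 - n))) & MASK


def sp_box(ix, iy, iz):
    """96-bit SP-box (assumed to follow the Gimli specification)."""
    x, y, z = rotl(ix, 24), rotl(iy, 9), iz
    oz = (x ^ (z << 1) ^ ((y & z) << 2)) & MASK
    oy = (y ^ x ^ ((x | z) << 1)) & MASK
    ox = (z ^ y ^ ((x & y) << 3)) & MASK
    return ox, oy, oz


def ox_ignores_ix(iy):
    """Condition of Property 1: the low 29 bits of IY <<< 9 are zero."""
    return rotl(iy, 9) & 0x1FFFFFFF == 0


def count_ix_solutions(y, z, ox, n):
    """Count x in [0, 2^n) with ox == z ^ y ^ ((x & y) << 3) on n-bit words.

    Here x and y stand for the already rotated inputs (an assumption made
    to keep the reduced-size model simple).
    """
    mask = (1 << n) - 1
    return sum(
        1 for x in range(1 << n) if (z ^ y ^ ((x & y) << 3)) & mask == ox
    )


def total_solutions(ox, n):
    """Sum the number of x solutions over all 2^(2n) pairs (y, z)."""
    size = 1 << n
    return sum(
        count_ix_solutions(y, z, ox, n) for y in range(size) for z in range(size)
    )


if __name__ == "__main__":
    iy = 0x00700000  # IY <<< 9 = 0xE0000000, so Property 1 applies
    print(ox_ignores_ix(iy), sp_box(1, iy, 5)[0] == sp_box(2, iy, 5)[0])
    n = 5
    print(total_solutions(0x15, n), 1 << (2 * n))
```

In the reduced model the count over all pairs (y, z) equals the number of pairs exactly, because every pair (x, y) determines one z for a fixed OX. This is the counting argument behind the expectation of N solutions.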

Improved Distinguishers for Gimli
A well-known powerful distinguisher for the Keccak permutation is the so-called zero-sum distinguisher [AM], where the attacker starts from a middle round and chooses a set of values for the intermediate state so that the sum of the inputs and the sum of the outputs are both zero when computing backwards and forwards. In addition, the common differential distinguisher [BS90] tries to capture some undesirable behaviour of the output difference for a certain input difference. Benefiting from the internal differential [Pey10], which has been applied to the cryptanalysis of Keccak [MPS13, DDS13], we propose a new distinguisher called hybrid zero-internal-differential (ZID) distinguisher for Gimli. Such a new distinguisher is inspired by the zero-sum distinguisher [AM], differential distinguisher [BS90] and internal differential [Pey10], as illustrated in Figure 4. Specifically, we start from a middle round and choose two different intermediate internal states of a specific format. Then, we carefully trace both the symmetry in each internal state and the symmetry between two different internal states generated by the two intermediate internal states.

Deterministic Hybrid ZID Distinguisher for 18-Round Gimli
We begin with the hybrid ZID distinguisher for 18 rounds of the Gimli permutation, which only requires 2 queries to the 18-round Gimli permutation. Starting from S^9, we choose two different values denoted by (A^9, B^9) for S^9 such that the second column and the fourth column share the same values in (A^9, B^9) while the first column and the third column are swapped in (A^9, B^9). In addition, there are extra conditions on state words in the first row of the first and third columns to eliminate the influence of the constant addition. Formally, the conditions are specified below: where c_2 is the round constant used to compute S^9 in the Gimli permutation.
As illustrated in Figure 5, we can trace the evolutions of the internal difference in both directions for A^9 and B^9, respectively. The following relations inside (A^{17}, B^{17}) can be derived, i.e. the last two rows of the second column and the fourth column are swapped for (A^{17}, B^{17}).
In addition, we have the following relations inside (A^{0.5}, B^{0.5}), i.e. the last two rows of the first column and the third column are identical in both (A^{0.5}, B^{0.5}).
As a result, one could construct a distinguisher for 18 rounds of the Gimli permutation, whose data and time complexity are both 2. Such an 18-round distinguisher has been experimentally verified. Note that for a random permutation, it requires at least 1 + 2^2 = 5 queries to find (A^0, A^{18}, B^0  in 2ω consecutive queries where both A^0 and B^0 are not allowed to repeat, our hybrid ZID distinguisher would succeed with probability 1 while a generic method for a random permutation would succeed with probability 2^{-2ω}. This explains the meaningfulness of our 18-round distinguisher. Note that the multiple-of-8 distinguisher [GRR17] for 5-round AES holds with probability 2^{-3} for a random permutation while it holds with probability 1 for 5-round AES. In any case, our distinguisher shows that the symmetry of the Gimli permutation is an issue in the design, which enables us to trace a probability-1 undesirable property covering 18 rounds.

Experiments.
We have implemented the distinguisher for the 18-round Gimli permutation and an example of (A^0, A^{18}, B^0, B^{18}) is given in Table 2. Indeed, we tested 10000 times and each time we could obtain the desired tuple (A^0, A^{18}, B^0, B^{18}) with only 2 queries. [1], we can achieve it with only 2^2 = 4 queries, i.e. first compute forwards from A^0 to A^{24} and then compute backwards from B^{24} to B^0. Therefore, the generic time complexity of the above full-round distinguisher is 4 while our way requires 2^{128} attempts. Hence, our full-round distinguisher is indeed not a reasonable one, though it did reveal a probabilistic property of the full-round Gimli permutation.

Improving the Full-Round Distinguisher
Since our hybrid ZID distinguisher cannot reach full rounds, we turn to improving the distinguisher in [GLNP+20] by extending Property 6. For completeness, we first give a brief description of the full-round distinguisher proposed in [GLNP+20]. It can be found that both the distinguisher in [GLNP+20] and our hybrid ZID distinguisher exploit a very similar structure underlying the Gimli permutation. Specifically, the procedure to construct the full-round distinguisher in [GLNP+20] is as follows: Step 1: Fix the pattern of S^9 and we again use A^9 to represent the value of S^9 for consistency. (1 ≤ i ≤ 2). As the format of A^9 is the same as that of our 18-round distinguisher, we reuse Figure 5 to explain the full-round distinguisher in [GLNP+20].
Supposing there are g (< 96) bit conditions on A^0 in the new setting, the generic time complexity to find such (A^0, A^{24}) would be 2^g. If we could find such a pair in less than 2^g time, a distinguisher is obtained.
The motivation to construct such a distinguisher is that the relations in A^{0.5} are not fully exploited in [GLNP+20]. To exploit such relations, we extend Property 6 as follows.
In this way, under the condition that OY = OY', OZ = OZ' and OX
Assume that the property holds for w = k (1 ≤ k < 31), i.e. the following relations hold for 0 ≤ i ≤ k − 1.
We now prove that it also holds for w = k + 1.
When w = k + 1 ≤ 3, we have
where 0 ≤ i ≤ w − 1 and the indices are considered modulo 32. Finding the desired pair in this way would require 2^{32+w} queries, while it requires 2^{3w+2} queries for a random permutation. To obtain a significant bias, w = 20 is chosen. In this way, we could find the desired (A^0, A^{24}) in 2^{52} time while it requires 2^{62} time for a random permutation. Thus, we succeed in constructing a distinguisher for the full-round Gimli permutation with time complexity 2^{52}.

Experiments.
We have implemented the improved full-round distinguisher by reducing the size of the state word from 32 bits to 16 bits. In this case, the SP-box is accordingly adjusted, as specified below: In our experiments, the six 16-bit round constants are randomly generated. In this way, as displayed in Table 3, we could find a desired pair (A^0, A^{24}) where there are 48 conditions on A^{24} and 10 × 3 + 2 = 32 conditions on A^0. The time complexity of a generic method to find such a pair is 2^{32} while we can find it with time complexity 2^{16+10} = 2^{26}. The correctness of the estimation of the time complexity has been confirmed via experiments.
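The complexity figures quoted for the full-size and the reduced-size distinguisher follow the same arithmetic, which the short Python sketch below reproduces. It assumes, as a generalization of the two cases stated in the text, that for n-bit state words our cost is 2^{n+w} and the generic cost is 2^{3w+2}; the function names are illustrative only.

```python
def distinguisher_costs(word_bits, w):
    """Return (log2 of our cost, log2 of the generic cost).

    Assumption: cost 2^(word_bits + w) for our method versus 2^(3w + 2)
    for a random permutation, matching the 32-bit and 16-bit cases.
    """
    return word_bits + w, 3 * w + 2


def smallest_useful_w(word_bits):
    """Smallest w for which word_bits + w < 3w + 2 holds."""
    return (word_bits - 2) // 2 + 1


if __name__ == "__main__":
    for n, w in ((32, 20), (16, 10)):
        ours, generic = distinguisher_costs(n, w)
        print(f"n={n}, w={w}: ours 2^{ours}, generic 2^{generic}")
    print("break-even w for n=32:", smallest_useful_w(32))
```

Under this model an advantage for 32-bit words appears only from w = 16 onwards, and the choice w = 20 leaves a gap of 2^{10} between the two costs.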

Preimage Attacks on Reduced Gimli-Hash
As can be observed from the above distinguishers for the Gimli permutation, we take great advantage of the weak diffusion. Different from Keccak [BDPA11b], in which the diffusion is strong, the diffusion of Gimli is rather weak. As pointed out by the designers, the avalanche effect requires 10 rounds of the Gimli permutation. Therefore, the divide-and-conquer method may work well to accelerate the preimage finding procedure.

The Generic Preimage Attack on Gimli-Hash
The generic preimage attack on Gimli-Hash is based on a meet-in-the-middle method, as depicted in Figure 6.
Figure 6: Framework of the generic preimage attack
Specifically, consider five message blocks (M_0, M_1, M_2, M_3, M_4) and utilize them to find a preimage for a given hash value. In other words, consider the following sequence of state transitions:
Complexity Evaluation. Obviously, the time complexity at Phase 1 is 2^{128} since a 128-bit value needs to be matched. For Phase 2, the time and memory complexity are both 2^{128}. At Phase 3, the time complexity is 2^{128} since 2^{256} pairs need to be generated in order to match the 256-bit inner part of S_2. Consequently, the time and memory complexity of the generic attack on Gimli-Hash are both 2^{128}.

The Preimage Attack with Divide-and-Conquer Methods
Our attack procedure is slightly different from the generic one. To gain advantages, Phase 1 has to be finished in less than 2^{128} time. In addition, at Phase 2, we only choose 1 random value for (M_3, M_4) by considering the padding in S_4. In this way, the inner part of S_2 is fixed and only takes one value. Then, at Phase 3, instead of only choosing 2^{128} values for (M_0, M_1), our aim is to exhaust all the 2^{256} possible values of (M_0, M_1) in less than 2^{128} time to match the 256-bit inner part of S_2 obtained at Phase 2. Finally, compute M_2 in the same way as in the generic attack. Since (M_0, M_1) can take 2^{256} possible values, it is expected that Phase 2 is performed only a few times. Obviously, the main obstacle in our method is how to achieve Phase 1 and Phase 3 efficiently, i.e. in less than 2^{128} time. In the following description of our preimage attack on 5 rounds of Gimli-Hash, Phase 1 is called Finding a Valid Inner Part and Phase 3 is called Matching the Inner Part. If the two phases can be finished in less than 2^{128} time, advantages over the generic attack are obtained.
Specifically, when the Gimli permutation is reduced to n rounds, Finding a Valid Inner Part is equivalent to the following problem: Given the outer part of S^0 and S^n (n ≤ 24), how to find a solution of the inner part of S^0 to match the given outer part of S^n?
For Matching the Inner Part, since two message blocks need to be considered, we distinguish the states by S_0 and S_1 as depicted in Figure 3 for convenience. Then, it is equivalent to the following problem: Given the inner part of S^0_0 and S^n_1, how to find a solution of the outer part of S^0_0 and S^0_1 to match the given inner part of S^n_1?

The Preimage Attack on 5-Round Gimli-Hash
In this section, how to mount the preimage attack on 5-round Gimli-Hash will be introduced. We only focus on Finding a Valid Inner Part and Matching the Inner Part.

Finding a Valid Inner Part
As illustrated in Figure 7, the corresponding procedure can be divided into 4 steps.
Figure 7: Generate a valid inner part for the preimage attack on 5-round Gimli-Hash

Complexity Evaluation. At Step 1, the time and memory complexity are both 2^{64}. At Step 2, it is necessary to match a 128-bit value of S^3[1, 2][0, 1] based on a meet-in-the-middle method. Therefore, it is required to try 2

Matching the Inner Part
Before describing how to match a given inner part by utilizing two message blocks, we will pre-compute some tables in order to reduce the whole complexity.
Pre-computing Tables. As shown in Figure 8, based on Property 1, the following facts can be observed:
As for the pre-computation, the time complexity and memory complexity are 2^{64} and 2^{64+1} = 2^{65}, respectively. Consequently, taking the complexity to find a valid inner part into account, the time complexity and memory complexity of the preimage attack on 5-round Gimli-Hash are 2^{96} and 2^{64} × 2 = 2^{65}, respectively.
To demonstrate the correctness of our preimage attacks, we provide a practical preimage attack on 2-round Gimli-Hash in Appendix A.

Experiments.
To further confirm the correctness of the time complexity of the preimage attack on 5-round Gimli-Hash, we have implemented our methods to find a valid inner part and to match a given inner part by reducing the size of the state word from 32 bits to 8 bits. In this case, the SP-box is accordingly adjusted, as specified below: According to the experiments, the whole procedure needs to be repeated only a few times in order to find a valid inner part or to match a given inner part. In each repetition, the number of attempts to find a valid inner part is upper bounded by 2^{16} and the number of attempts to match a given inner part is upper bounded by 2^{24}, thus confirming our estimation.

Preimage Attacks on Round-Reduced Gimli-XOF-128
When the above preimage attack on Gimli-Hash is extended to more rounds, we are faced with an obstacle caused by the degrees of freedom, i.e. at least two message blocks are needed and they should be traversed in less than 2^{128} time to match a given hash value.
As can be observed in our method, benefiting from the weak diffusion of the linear layer of Gimli, we can efficiently utilize the divide-and-conquer technique to divide the space of two message blocks into several smaller ones and then find solutions in each smaller space via an exhaustive search. Finally, the solutions in each smaller space are combined and further verified to match the given hash value. When it comes to more rounds, it is difficult to divide the space of two message blocks into smaller ones. Thus, turning the exhaustive search in a large space into the exhaustive search in several smaller spaces cannot be applied anymore. In addition, to control two consecutive message blocks when the number of rounds of the Gimli permutation is reduced to n, the difficulty is almost equivalent to an attack on 2n rounds of the Gimli permutation, by allowing the attacker to control a 128-bit value in the intermediate state.
To test how far our divide-and-conquer method can go for reduced Gimli, we consider another application of the Gimli permutation to hashing, namely the "extendable one-way function", which has been specified in the submitted Gimli document. Considering the existing preimage attacks on SHAKE-128 [GLS16] and Ascon-XOF-64 [DEMS19], we believe it meaningful to investigate the preimage resistance of Gimli-XOF-128. In addition, since the size of one message block is 128 bits when neglecting the padding rule, the attacker only needs to focus on how to efficiently exhaust one message block rather than two message blocks in less than 2^{128} time. In other words, the attack on n rounds of Gimli-XOF-128 is equivalent to an attack on n rounds of the Gimli permutation.
Similar to the method to turn the 6-round semi-free-start collisions into collisions in [LIM20], to efficiently mount the preimage attack on reduced Gimli-XOF-128, some conditions will be added. Specifically, when the target is n rounds of Gimli, an equivalent problem to find the preimage of Gimli-XOF-128 can be described as below:
As will be shown, the main idea to finish the two tasks is almost the same. Therefore, in our description, we will start from Matching the Outer Part and then move to Fulfilling Conditions.

The Preimage Attack on 9-Round Gimli-XOF-128
The two phases of the preimage attack on 9-round Gimli-XOF-128 will be described in this section. First of all, some tables will be pre-computed to reduce the whole time complexity. An illustration of our preimage attack on 9-round Gimli-XOF-128 is shown in Figure 9.

Figure 9: Illustration of the preimage attack on 9-round Gimli-XOF-128
Obviously, the time and memory complexity to construct the four tables are 2^{64} and 4 × 2^{64} = 2^{66}, respectively.
Matching the Outer Part. After pre-computation, how to find a solution of the outer part of S^0 under the condition that S^0 satisfies Equation 40 can be specified as follows.
Complexity Evaluation. Taking the first three steps into account, the time and memory complexity to construct T_12 are 2^{96} and 2^{64}, respectively. For Step 4, the time complexity is 2^{56+32} = 2^{88}. As the matching probability in S^5[0][0, 1, 2, 3] is 2^{-128} and there are 2^{64+56} = 2^{120} pairs, it is expected that the whole procedure will be carried out 2^8 times. Therefore, by taking the pre-computation into account, the time complexity and memory complexity are 2^{104} and 2^{66}, respectively.
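The way the individual costs combine into the total of 2^{104} can be traced with a few lines of Python. This is only a bookkeeping sketch of the exponents quoted above; the variable names are hypothetical, and the model (one table construction plus Step 4 per repetition, plus a one-time pre-computation) is our reading of the evaluation.

```python
from math import log2


def log2_sum(*exponents):
    """log2 of a sum of powers of two given by their exponents."""
    return log2(sum(2.0 ** e for e in exponents))


PRECOMPUTATION = 64      # building the four tables
TABLE_T12 = 96           # Steps 1 to 3 per repetition
STEP4 = 56 + 32          # Step 4 per repetition
PAIRS = 64 + 56          # candidate pairs per repetition
MATCH_BITS = 128         # matching probability 2^-128

# Expected number of repetitions: 2^(128 - 120) = 2^8.
repetitions = MATCH_BITS - PAIRS

# Cost of one repetition is dominated by the 2^96 table construction.
per_repetition = log2_sum(TABLE_T12, STEP4)

# Total time: pre-computation plus all repetitions.
total = log2_sum(PRECOMPUTATION, per_repetition + repetitions)

if __name__ == "__main__":
    print(f"repetitions: 2^{repetitions}")
    print(f"per repetition: 2^{per_repetition:.3f}")
    print(f"total: 2^{total:.3f}")
```

The 2^{88} cost of Step 4 and the 2^{64} pre-computation change the exponent only in the third decimal place, so the total rounds to 2^{104}.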

Fulfilling Conditions
In the experiment, S^0[1][·] is set as all 0 and S^0[2][·] is assigned a random value in each iteration. In each iteration, the goal is to exhaust all 2^{32} values of S^0[0][·] and check whether S^{8.5}[0][·] can be matched. According to the experiments, the 2^{32} values can be traversed with time complexity 2^{16+8} = 2^{24}. After only a few iterations, we can always match the given value of S^{8.5}[0][·], thus confirming our estimation.
As the basic idea is to exhaust S^0[0][·] in both the phase to fulfill the conditions and the phase to match the outer part, our experiments demonstrate the correctness of the time complexity of our divide-and-conquer method.

Conclusion
Due to the weak diffusion of the Gimli permutation, a novel hybrid zero-internal-differential distinguisher is constructed for the 18-round Gimli permutation, which requires as few as 2 queries. Moreover, by considering the distinguisher for the full Gimli permutation from a different perspective, based on a novel property of the SP-box, we could reduce the time complexity of the full-round distinguisher in [GLNP+20] from 2^{64} to 2^{52}. To further exploit the weak diffusion, we propose a divide-and-conquer method to accelerate the preimage finding procedure for both Gimli-Hash and Gimli-XOF-128. As a result, the theoretical preimage attack on Gimli-Hash can reach up to 5 rounds, while it can reach up to 9 rounds for Gimli-XOF-128. To the best of our knowledge, our distinguishing attacks and preimage attacks are the best thus far.

A The Preimage Attack on 2-Round Gimli-Hash
In this section, how to mount a preimage attack on 2-round Gimli-Hash with a practical time complexity is explained. It should be emphasized that like the generic preimage attack, our preimage attack is over 5 message blocks.

A.1 Finding a Valid Inner Part
For a better understanding of our attack, refer to Figure 10. The corresponding attack procedure is described as follows. Based on Property 2, there will be 2^{(32−15.5)×2} = 2^{33} possible combinations between T_20 and T_21. Similarly, there are 2^{33} combinations between T_22 and T_23. Consequently, the time complexity and memory complexity to find a valid inner part are 2^{33} and 2^{17+1} = 2^{18}, respectively.

A.2 Matching the Inner Part
As illustrated in Figure 11, the corresponding procedure to match a given inner part by utilizing two message blocks can be described as follows.