Eﬀicient MILP Modelings for Sboxes and Linear Layers of SPN ciphers

. Mixed Integer Linear Programming (MILP) solvers are regularly used by designers for providing security arguments and by cryptanalysts for searching for new distinguishers. For both applications, bitwise models are more reﬁned and permit to analyze properties of primitives more accurately than word-oriented models. Yet, they are much heavier than these last ones. In this work, we ﬁrst propose many new algorithms for eﬃciently modeling any subset of F n 2 with MILP inequalities. This permits, among others, to model diﬀerential or linear propagation through Sboxes. We manage notably to represent the diﬀerential behaviour of the AES Sbox with three times less inequalities than before. Then, we present two new algorithms inspired from coding theory to model complex linear layers without dummy variables. This permits us to represent many diﬀusion matrices, notably the ones of Skinny -128 and AES in a much more compact way. To demonstrate the impact of our new models on the solving time we ran experiments for both Skinny -128 and AES. Finally, our new models allowed us to computationally prove that there are no impossible diﬀerentials for 5-round AES and 13-round Skinny -128 with exactly one input and one output active byte, even if the details of both the Sbox and the linear layer are taken into account.


Introduction
In symmetric-key cryptography, a popular technique for proving resistance against classical attacks is to model the behaviour of the cipher as a Mixed Integer Linear Programming (MILP) problem and solve it by some MILP solver. This method was applied for the first time by Mouha et al. [MWGP11] and by Wu and Wang [WW11] for finding the minimum number of differentially and linearly active Sboxes and provides in such a way a proof of resistance against these two classical attacks. Since then, the use of MILP not only by designers but also by cryptanalysts has increased, the advantage being that it is relatively easy to translate the cryptanalytic problem into linear constraints and use the available solvers to solve it.
For Substitution Permutation Networks (SPN), it is possible to get easy and relatively small models when searching for properties at the word level. For example, MILP is often used to search for a lower bound on the number of active Sboxes by exploring truncated differentials. However, wordwise models do not apply to all kind of ciphers (e.g. Present [BKL + 07], Gift [BPP + 17]) and most importantly they are much less accurate than models at bit level. For example, modeling the propagation of the differences bitwise and looking inside the Sboxes could yield longer impossible differentials than those discovered when only truncated differences are analyzed [ST17b].
Sun et al. [SHW + 14a, SHW + 14b] were the first to propose bit-oriented modelings for SPN ciphers. A non-trivial problem in doing so is to find an efficient representation of the valid differential propagations through an Sbox. Indeed, Sboxes are non-linear Boolean vectorial functions, therefore modeling their differential properties with R-linear inequalities is not natural. Several approaches have been suggested to solve this problem. In the original works of Sun et al. [SHW + 14b, SHW + 14a] two different methods were notably proposed for modeling an Sbox. The first of them is a geometrical approach that consists in representing all possible transitions (a S − → b) through an n-bit Sbox as points (a, b) ∈ R 2n and then computing the H-representation of the convex hull of this set, that is all the geometric faces of the smallest convex containing it. The second method, based on a logical condition approach, consisted in representing by linear inequalities some conditional differential properties of the Sbox. Unfortunately, the problem of these two methods is that they are not efficient for large (e.g. 8-bit) Sboxes.
To solve this problem for large Sboxes, Abdelkhalek et al. [AST + 17] observed that generating a minimal number of constraints in logical condition modeling can be converted into the problem of minimizing the product-of-sum representation of Boolean functions. This last problem is well-studied and algorithms for solving it exist, for example the Quine-McCluskey (QM) [Qui52,Qui55,McC56] or the Espresso [BSVMMH84] algorithms. In this way, Abdelkhalek et al. managed for the first time to generate linear constraints for 8-bit Sboxes, notably for the Sboxes of AES [aes01] and Skinny-128 [BJK + 16]. While the number of linear constraints for the Sbox of Skinny provided in [AST + 17] is as low as 372, the same method yields 8302 linear inequalities for the Sbox of AES, a modeling that is often too heavy to be used in practice.
Efficiently representing the Sboxes is a crucial part of the modeling process. But, a bad modeling of the diffusion layer can render the optimization process very slow or even impractical. Indeed, with the exception of some ciphers, e.g. Present [BKL + 07], where the linear layer is just a bit-permutation, the diffusion is usually ensured by XOR gates. Yet, the XOR operation, while linear in F 2 models very badly in R. So, bitwise modeling of heavy linear layers, that need many XORs to be represented, can lead to impractical systems with many linear inequalities or with many dummy variables.
The use of dummy variables in MILP models is a popular approach that needs to be discussed. To the best of our knowledge, there are three ways of using dummy variables: linear layer branch number modeling in wordwise modelings as introduced in [MWGP11], modeling x 1 ⊕ · · · ⊕ x n = 0 with a dummy integer variable t as x 1 + · · · + x n = 2 · t and finally using a dummy binary variable a for recording an intermediate state in the computation, for example modeling the above computation as x 1 ⊕ · · · ⊕ x k ⊕ a = 0 and a ⊕ x k+1 ⊕ · · · ⊕ x n = 0. For Sbox modelings, when the construction of the Sbox is known and well suited, only the latter can be helpful. However in general, in order to minimize the running time of the MILP solver, it is important to minimize the number of integer variables. In this paper we study bitwise modelings which need by definition many MILP integer variables. We hence restricted ourselves to not introducing dummy variables and leave the use of dummy integer variables for improving performance as an open problem.

MILP Modeling for Boolean functions and Sboxes
The methods to be introduced in this section can be applied to characterize any set P ⊂ {0, 1} m with R-linear inequalities. As any such subset P ⊂ {0, 1} m can be seen as the support of some Boolean function f P : F m 2 → F 2 operating on m bits, the inequalities representing P model the constraint f P (x 0 , . . . , x m−1 ) = 1. However, as our main target is the modeling of differential, linear or other behaviours through an Sbox S : F n 2 → F n 2 , and such behaviours can be represented (as will be explained below) by some Boolean function, we will describe some of our techniques, without loss of generality, through the spectrum of the differential behaviour of an Sbox. In what follows, a (differential) transition x → y through the Sbox will be seen as a vector of F m 2 , involving m = 2n binary variables and will be represented, depending on the context, either as (x 0 , . . . , x n−1 , y 0 , . . . , y n−1 ) or as (x 0 , . . . , x m−1 ). The Hamming weight of a vector x ∈ F m 2 , i.e. the number of its non-zero bits, will be denoted by wt(x). Finally, we will often use the notation [a, b] to represent the set of all integers k such that a ≤ k ≤ b.

Modeling Boolean functions and DDTs
For many problems, as for example the search for good differential characteristics, bitwise modelings are more often adapted than wordwise ones, as they are more precise and permit to follow information propagation at the bit level. In such models, a binary variable is assigned to each bit of a differential characteristic. A variable has the value 1 if the corresponding bit has a difference in the characteristic -in which case it is called activeand 0 otherwise. In a MILP model, in order to follow how the information is propagated through the different components of the cipher, each different layer has to be efficiently modeled. For Sboxes, this is typically done by looking at the Difference Distribution Table  (DDT), that is a 2 n × 2 n table given by When one only cares whether a differential propagation is possible or not, it is enough to model the Boolean function However, if one wants to take into account the different integer values of the DDT, the authors of [AST + 17] propose to encode this information by modeling instead the Boolean functions F 2n As one can see, modeling a DDT is equivalent to modeling some specific Boolean function. From now on, we will focus on modeling general Boolean functions but all our examples and applications will be focused on DDTs. In all these examples we will be only interested in whether differential propagations are possible or not, without caring about their concrete probability. However, one has to remember, that all these methods can be applied to any other Boolean function and in particular to other cryptanalysis techniques whose properties can be described through some table, as are the linear and the boomerang cryptanalysis [Wag99].

State of the art
Given the truth table of a Boolean function f , the question is how to efficiently model the constraint f (x) = 1 by a system of R-linear inequalities. This problem can then be divided into two sub-problems: Problem 1 How to generate a (possibly large) set of inequalities on variables x 0 , . . . , x m−1 that correctly models f ?
Problem 2 How to choose a (typically much smaller) subset of this set of inequalities that still correctly represents f but leads to more efficient MILP models?
For differential cryptanalysis, the above general problem corresponds to modeling the fact that (x 0 , . . . , x n−1 ) → (x n , . . . , x 2n−1 ) is a possible transition in a DDT. To solve Problem 1, two different approaches were proposed in 2014 by Sun et al. [SHW + 14a, SHW + 14b]. The first is a geometrical one and consists in computing the H-representation of the convex hull of the set of possible transitions. The second one is based on logical condition modeling. Below, we briefly explain these two methods.
The method of the H-representation of the convex hull consists, as its name suggests, in computing the H-representation of the convex hull of all possible points a ∈ F m 2 such that f (a) = 1 seen as vectors of R m . Taking then the (m − 1)-dimensional faces of the convex hull yields a correct set of inequalities. The H-representation can be for example computed through an algebra computer system such as Sage [The20] and gives a system of linear inequalities excluding all impossible points (e.g. all points a such that f (a) = 0). The second method proposed by Sun et al. and called logical condition modeling is based on the idea that each impossible point can be removed by a single inequality in a simple way. Indeed, consider an impossible point a = (a 0 , . . . , a m−1 ). Then, the inequality only discards this point a. Lets take as example the DDT of PrintCipher. As it can be seen from Table 1, (0x1,0x6) is an impossible transition through the DDT. By writing the input and output in a bitwise manner we have that (100) (011). The above formula gives the inequality −x 0 + x 1 + x 2 + y 0 − y 1 − y 2 ≥ −2 that is satisfied by all points in F 6 2 but (0x1,0x6). This method can then be applied to all impossible transitions x → y and yields easily a system of inequalities containing as many constraints as the number of impossible transitions through the DDT, or as the number of zeros of the Boolean function in the general case.
However, as mentioned in [AST + 17], both these methods that provide a solution for Problem 1 have the disadvantage of not being efficient for modeling 8-bit Sboxes. For the first one, computing the H-representation of the convex hull for such big Sboxes is nearly impossible, while the second method yields a very high number of initial inequalities with by construction no hope for a correct subset for Problem 2. For example, the Skinny-128 Sbox has 54067 impossible transitions and this number of corresponding inequalities is too high to represent the Sbox for any related MILP problem.
[AST + 17] made an important step forwards in the logical condition modeling direction, by translating the problem of searching for good inequalities for 8-bit Sboxes into the classical problem of minimization of the product-of-sum representation of the related Boolean function and by using the Quine-McCluskey (QM) algorithm to solve it. This method permits to solve at once the two steps of the Sbox modeling: Find many good inequalities (the prime implicants in the QM vocabulary) and keep among them a good representative set. In the case of QM this representative set corresponds to the minimal number of equations. The problem however of QM is that in practice it needs high memory ressources and it can be slow. For this reason, Abdelkhalek et al. used another algorithm, called the Espresso algorithm, a heuristic method for minimizing the number of terms in a product-of-sum representation. Espresso is not guaranteed to find the minimum, and it usually doesn't in the case of 8-bit Sboxes, but its solutions are good enough to be used in practice.
No matter the method used for solving Problem 1, one must choose among the initial set of inequalities a good representative set for representing the support of the Boolean function. This is what we called as Problem 2. As mentioned in [ST17a], determining how many and which inequalities to keep is not an evident decision. This step is however necessary, as in both methods from [SHW + 14a, SHW + 14b] and all the new methods that we are going to present, the number of generated inequalities is high and has an important impact on the optimization time. For example, in the convex-hull method, the number of linear inequalities that Sage returns is typically quite high, containing notably many redundant ones. The authors of [SHW + 14b] applied then a greedy algorithm for solving Problem 2. At each step, this algorithm adds to the solution set the best possible inequality, that is the inequality removing the highest number of points among those that have not been removed yet. A nice approach for solving this step was later given by Sasaki and Todo in [ST17a]. They proposed to model the problem of minimizing the set of inequalities that remove all the impossible propagation points as a MILP problem itself and solve it by some solver. More precisely, their method consists in assigning a binary variable z i to each inequality found by solving Problem 1. Then for each impossible point p, it is required that at least one inequality removing p is chosen with i s.t. ineq. i removes p z i ≥ 1. Finally the MILP solver is used for minimizing i z i and i z i = 1 gives a solution of Problem 2.
It appears however in [ST17a] that the smallest subset of inequalities found by this approach that correctly models a DDT, will not necessarily provide the overall best performance when running a complete cipher modeling. Moreover, this auxiliary MILP problem can be too heavy when the initial set of inequalities is large. In our experiments, we have found the greedy approach from [SHW + 14b] to provide better subsets for performance even if they are a bit larger. We hence used it in the applications. However, we consider the minimization approach from [ST17a] in solving Problem 2 to be a good benchmarking indicator for methods solving Problem 1.
In the remaining part of the section we analyze in depth the problem of efficiently modeling a Boolean function with MILP inequalities and we concentrate on Problem 1. Indeed our goal is to generate efficient algorithms for modeling a Boolean function by using the smallest number of inequalities as an indicator for the quality of the method.
We propose different methods for doing so. The first method, based on generating better inequalities from the H-representation is applicable for up to 12-bit Boolean functions and gives better results than all previous methods for up to 6-bit Sboxes. Unfortunately, this method is not applicable for 8-bit Sboxes. For this reason, we develop other methods for larger Boolean functions and Sboxes. With these methods, described in Sections 2.4 and 2.5 we manage to model 8-bit Sboxes with a much smaller number of inequalities than what was done before.
In the remaining part of this section, and to facilitate comprehension, all methods will be described through the DDT modeling application. Nevertheless, none of these techniques is DDT-specific and they can be applied directly for modeling any Boolean function.

Convex hull techniques for up to 6-bit Sboxes
When the computation of the H-representation of the convex hull of all possible transitions in a DDT is computationally feasible, this H-representation provides by definition a set of inequalities modeling the possible differential transitions through the Sbox. However, as we will show, it is possible to compute many other linear inequalities from this initial set by simply adding up some of them. The reason for doing so is to generate potentially better inequalities than the ones directly given by the convex hull, where better means that the new inequalities remove more impossible transitions than the initial ones do.
Indeed, if a possible differential transition z = (x, y) ∈ {0, 1} m , with m = 2n satisfies the k inequalities C 1 , . . . , C k : c 0 z 0 + · · · + c m−1 z m−1 + b ≥ 0, with ∈ [1, k] then it obviously also satisfies the inequality produced by simply summing up the initial k inequalities and denoted in the sequel as Of course, most of the inequalities produced by randomly summing k inequalities from the H-representation of the convex hull, do not present any interest, as they will very probably be satisfied by the whole space {0, 1} m . In order to produce meaningful new linear inequalities from the H-representation of the convex hull, we noticed that if k hyperplanes of the H-representation share a vertex on the cube {0, 1} m , (i.e. a possible transition), then the addition of the k corresponding inequalities will probably yield an interesting new constraint, given that its hyperplane intersects with the cube at least on this particular vertex. By "interesting" we mean here that the new inequality C new will remove a different (potentially larger) set of impossible transitions than the original inequalities. This idea is illustrated in Figure 1 (Appendix A) and described by Algorithm 1.
Algorithm 1 takes as input a set S pos corresponding to the possible transitions through an Sbox S and a parameter k that indicates the number of inequalities to be added together each time. Then it starts by generating the convex hull corresponding to S pos by using for example the inequality_generator() function of the sage.geometry.polyhedron class of the Sage computer algebra system [The20]. This gives us an initial set of inequalities C set of the form c 0 x 0 + · · · + c m−1 x m−1 + b ≥ 0. Then, for every point p in S pos we add together any k inequalities C 1 , . . . , C k that this point satisfies with a zero left-hand side, i.e. p belongs to the hyperplanes defined by those inequalities. We add in C set the new inequality only if it is meaningful, in the sense that it removes a new set of impossible transitions.
In practice, Algorithm 1 is rather fast for 4-bit Sboxes -a few minutes for k = 2 and a few hours at most for k = 3 -and the set of constraints C set obtained that way does not have too many elements, which allows fast minimization in solving Problem 2. For example, Sage returns 327 linear inequalities for the Sbox of Present. By applying Algorithm 1 with k = 2 we get a little bit less than 500 inequalities, and for k = 3 we get a little bit less than 700 inequalities.
We applied Algorithm 1 to solving Problem 1 for different Sboxes from the literature and benchmarked it by solving Problem 2 with the method of [ST17a] to obtain a minimal modeling. The results are summarized in Table 2. We took k = 2 to run Algorithm 1, apart for Twine, Pride, Serpent S3 and Serpent S7 where taking k = 3 gave slightly better results. The case k = 1, corresponds to the method of [ST17a]. Even as already said, the exact minimum is not necessarily what one has to take to minimize the solving time of the global problem, it is however a good indicator of the quality of the method used to solve Problem 1 and permits to compare the different methods between them. We nonetheless also give the results with the greedy approach from [SHW + 14a] and k = 1 for Algorithm 1 Compute a set of inequalities from possible transitions. for all {C 1 , . . . , C k } ∈ P(H set ) such that p belongs to the hyperplanes of C 1 , . . . , C k do 6: if C new removes a new set of impossible transitions then 8: end if 10: end for 11: end for 12: end procedure completeness. As can be seen from Table 2 but also for all the Sboxes we tested for n ≤ 6, running the algorithm with k > 1 always gave better results than with k = 1.
Unfortunately, for larger Sboxes, and notably for n = 8 computing the convex hull is computationally hard. For this reason, we describe in the following sections, alternative methods that can be used for modeling 8-bit Sboxes.

Logical condition techniques for 8-bit Sboxes
In this first section, we show that one can easily derive simple inequalities to remove spaces of the form a⊕Prec(u) inside the DDT. Let again m = 2n where n is the bit-size of the Sbox.
Otherwise x ⊕ a u, so there exists some ∈ supp(a) I such that x = 1 − a .
• If ∈ I, x = 1 and i∈I x i ≥ 1.
Here m = 2 × 4 = 8. For better visualizing points in the DDT, we will see F 8 2 as F 4 2 × F 4 2 . Further, we will use the following bit ordering. For a point [α, β] ∈ F 4 2 × F 4 2 , the index 0 will correspond to the least significant bit (LSB) of α, 3 will correspond to the most significant bit (  We can verify from the DDT in Appendix B that all these points correspond indeed to invalid transitions through the DDT.

An efficient algorithm for searching for spaces of the form a ⊕ Prec(u)
Given a set of impossible transitions P, Algorithm 2 finds all subsets of the form a⊕Prec(u) excluding those that are subsets of others. For each a ∈ P, it builds spaces a ⊕ Prec(u) ⊆ P by progressively incrementing the weight of u, with u such that a ⊕ u ∈ P and supp(u) ∩ supp(a) = ∅ and checking for all v u, wt(v) = wt(u) − 1 whether a ⊕ Prec(v) has already been identified as a subset of P.
We show next that Algorithm 2 and the Quine-McCluskey algorithm are strongly related. The Quine-McCluskey algorithm has two steps. Given a set of points P, the first step finds a set S of subspaces a ⊕ Prec(u) ⊆ {0, 1} m such that ∀(a ⊕ Prec(u)) ⊆ P, ∃E ∈ S : a ⊕ Prec(u) ⊆ E, and ∀E, F ∈ S, E ⊆ F.

end for
We start by grouping all u such that a ⊕ Prec(u) ⊆ P and supp(a) ∩ supp(u) = ∅ by their weight in sets U.

9:
for all p ∈ P do 10: This is exactly what does Algorithm 2. The second step of QM then searches for a minimal set S ⊆ S, in the sense that: This formulation of the Quine-McCluskey algorithm is very different from the one encountered in the literature and in the best of our knowledge, it is the first time that this algorithm is presented in this way. We use this presentation as it is well suited for understanding the link between the two methods.
The important thing for us is that we principally need this first step to find good modelings. This corresponds to solving Problem 1. The second step of the QM algorithm, corresponds to providing a solution for Problem 2, by minimizing the number of terms with the objective of finding a good circuit for a Boolean function, but not necessarily a good MILP modeling. Moreover, this second step is computationally harder than the first one and acts as a bottleneck when using QM as a black box inequality generator for MILP modelings. Indeed, it is much faster to use Algorithm 2 alone for solving Problem 1 together with a greedy algorithm or a MILP-based algorithm for solving Problem 2. This is not only faster but can also provide a significantly lower number of inequalities.
We notably applied Algorithm 2 for generating an initial set of inequalities for the 8-bit Sboxes of AES and Skinny-128. We then obtained 70336 initial inequalities for AES and 8829 for Skinny-128. The running time was of around 15 minutes for AES and 2 hours for Skinny-128 where 90% of the time was spent not for finding new inequalities but for removing not interesting ones (lines 26-28 in Algorithm 2). The difference in these running times is explained by the fact that as the DDT of Skinny-128 is very sparse, there are many more possible spaces a ⊕ Prec(u) than for the DDT of AES. After this, to find a representative set among these initial inequalities (Problem 2 ) we set up a minimization problem and solved it with Gurobi. While for Skinny-128 the problem was solved in just a few seconds providing us with the global minimal (which is thus the same as the QM algorithm), for AES the problem didn't reach the minimum even after 1900975 seconds of search on a 8-core laptop. However, the solution provided by Gurobi, even if not the minimal one is much better than the one given by Espresso, as it can be seen in Table 3. Furthermore, according to the authors of [AST + 17], QM itself cannot be applied to AES because of its memory complexity. While for Sboxes with sparse DDTs, as the one of Skinny-128, the method of this section provides a quite compact modeling, for Sboxes with low differential uniformity (i.e. highest value in the DDT), as the one of AES the number of obtained inequalities is quite high for practical applications. For this reason, we provide in the following sections new methods for modeling large Sboxes that outperform in most of the cases the methods provided up to now.

Modeling an Sbox with inequalities issued from balls B(d, c)
We saw in the previous section that each set of impossible transitions of the form a⊕Prec(u) provided a single inequality for removing all points in the set. What we learned in particular from this is that it is interesting to group impossible transitions in sets having a particular algebraic description and get inequalities out of them. In this section, we investigate a different type of sets and show how to use their mathematical description to get nice inequalities. More precisely, we show that points lying in a ball B(d, c) of radius d centred at a point c ∈ F m 2 , can be removed together by a simple inequality.

Definition 1.
A ball of F m 2 of radius d centred at c ∈ F m 2 is the subset of all points whose Hamming distance from the center c is at most d: To illustrate this idea, we start with an example of the most simple case -balls of radius 1. It can be checked that all five points of the above ball can be removed by We can construct similar examples for any ball of dimension d > 1. As we will show now, for any dimension m and any point c ∈ F m 2 , it is possible to construct an inequality that removes all points of the ball B(d, c).
holds if and only if x ∈ B(d, c).
Proof. Notice here that

Distorted balls
When searching for inequalities removing impossible transitions for a DDT, we have to be sure that the corresponding ball does not contain any possible transitions that we would mistakenly remove. In Sboxes used in practice, notably those with a low differentially uniformity, as the number of non-zero coefficients is usually large, removing entire balls does not usually work, as the number of balls for which all points correspond to impossible transitions is typically very small. While the above method works well for sparse DDTs as the one of Skinny-128, for the Sbox of Present, no plain ball, even of radius 1, can be removed with this method. We will show now, that we can still extract an inequality from a ball for which we have removed from its edge all possible transition points. We call such a ball distorted.
Example 3. Consider again Example 2. Inequality (1 − x 0 ) + x 1 + x 2 + x 3 ≥ 2 removed all five points of the ball B (1, (1, 0, 0, 0)). Suppose now that we want to remove all the above points except from (0, 0, 0, 0) and (1, 0, 1, 0). Then, intuitively it is enough to increase a little bit the coefficient a 0 corresponding to point (0, 0, 0, 0), that is the coefficient of (1 − x 0 ), to be sure that when x 0 = 0 then a 0 (1 − x 0 ) ≥ 2. As for the other points of the ball x 0 = 1, this change does not have any impact on them. In the same way, we can increase the coefficient a 2 before x 2 , to be sure to keep (1, 0, 1, 0). One can check that 2(1 − x 0 ) + x 1 + 2x 2 + x 3 ≥ 2 removes indeed the three remaining points of the ball.
The next proposition formalises the previous example to distorted balls of dimension d.
2 be a ball of radius d from which we remove the set of points Q = (c ⊕ Prec(q)) S(d, c) for some q ∈ F m 2 . Here p ∈ Q represents a possible transition towards the edge of the ball. We define a ∈ Q m such that -If x ∈ Q then x ⊕ c q and there exists some j such that x j = c j , q j = 0 and a j = 1. Then Remark 1. For d = 1, Proposition 3 implies that for any ball of radius 1 and for any subset Q of points on its edge, it is possible to create an inequality that removes the points in the ball minus the set Q. For d > 1 the situation is a little bit different, as removing some points from the edge can also remove other points that we would like to keep. Despite this, inequalities created this way are usually interesting as they can permit to remove different points together.
The inequalities provided up to now can remove points in a single ball. While this provides a simple and fast algorithm by itself by simply going through all balls into a DDT and writing down the corresponding inequalities, in practice we only used this method in combination with a more powerful method that we detail in the next section. This new method permits us to remove points belonging to the union of three different balls of radius 1. We call this process merging.

Merging three balls of radius d = 1
This method has similarities with the adding inequalities method used in Algorithm 1, except that we combine inequalities obtained with the distorted balls approach. We start by computing distorted balls of radius 1 centred on impossible transitions. However, simply adding inequalities obtained from neighbouring distorted balls often results in "bad" inequalities, in the sense that they do not discard many impossible transitions. To get a more efficient method, we found that when slightly changing one of the distorted balls, adding 2 to the right-hand side of the sum of the three inequalities gives a better inequality. We now present in detail how to get such an inequality.
Let I be the set of all impossible transitions through the Sbox and let a, b and c be three distinct points in I such that Let B (1, a), B(1, b) and B(1, c) be the corresponding balls of radius 1. We denote by the possible transition points inside each ball. Finally, consider the sets A proof for Proposition 4 and an example on the Sbox of Present is provided in Appendix C. The method is summarized in Algorithm 3.
The sets R a , R b and R c in Algorithm 3 correspond to the points inside each ball that have to be kept when writing down the equation for the corresponding distorted ball. More precisely, R a = P a ∪ {c}, R b = P b ∪ Q 1 and R c = P c ∪ Q 2 ∪ Q 3 . It is interesting to note that there is no symmetry between b and c: if one changes the roles of b and c, then the new inequality created will remove a different subset of points from the three balls, giving thus a different inequality for our collection.
We applied Alg. 3 together with Alg. 2 and Proposition 3 to create a large set of inequalities for the Sboxes of Skinny-8 and AES and applied a MILP minimization problem to find a small set of inequalities to represent each Sbox. The results can be visualized in Table 4. The resulting MILP problem took only a few seconds to terminate for Skinny-8 while the optimization could not be terminated for AES. Even though, the number of inequalities that we got by stopping the optimization process after a few days of computation provided us with a much lower number than the best previous result. Of course, by pushing the optimization further, it is possible to get even less inequalities for AES but one has to remember that obtaining the absolute minimum does not usually lead to the quickest solving time.
Last, we also applied the combination of the three above methods (Alg. 2, Alg. 3 and Proposition 3) to all 4-bit Sboxes of Table 2 and for all of them we obtained the same number of inequalities as with Algorithm 1.
for all p ∈ P a do 7:

Comparing different techniques for Sbox modeling
Using 5 or 6-bit Sboxes inside a cipher is less common than using 4-bit and 8-bit ones. However, some designs, e.g. Keccak, Ascon or Fides among others, use such Sboxes for different reasons each: Good masking properties or optimal resistance against differential cryptanalysis among others. Indeed, APN permutations, that is permutations whose DDTs are only composed of 0s and 2s and that have the best possible differential properties, exist for 5 and 6 bits. We tested our algorithms on some Sboxes of this size, mainly for permitting comparison between the different developed methods. Indeed, these sizes, not as small as 4 bits and not as large as 8 bits are ideal for permitting all of the algorithms to run (even the ones based on the computation of the convex hull) and for providing non-trivial comparisons. The results are summarized in Table 5. The first method, that we note as Convex Hull, corresponds to the method based on the H-representation of the convex hull provided by Sun et al. in [SHW + 14a].
As one can see from the above table, Algorithm 1 and the combination of Algorithm 2, Algorithm 3 and Proposition 3 are the ones giving always the best results. Algorithm 1 is almost always better but it has the disadvantage that it cannot be applied to larger dimensions.

Linear layer modeling
As seen in the previous section, modeling in the context of MILP valid propagations through 8-bit Sboxes may lead to large systems of R-linear inequalities. One would think that modeling a linear layer is much easier. While this is true for simple linear layers as the ones of Present or Gift that consist in simple bit-permutations, modeling other linear layers can also lead to large systems of R-linear inequalities. This is in particular due to the difficulty of efficiently modeling the XOR operation that is the major component of most diffusion layers. This is explained in the next subsection.

XOR modeling
Modeling a linear layer is often related to how the XOR operation is modeled. The following proposition gives an idea of how difficult it can be to efficiently model a linear layer without dummy variables.
to characterise the set of its solutions in F n 2 . Proof. First, we can exhibit such a set of inequalities. Indeed, for each a ∈ F n 2 such that a 0 ⊕ a 1 ⊕ . . . ⊕ a n−1 = 1, we have seen in Section 2 that we can write down an inequality that only eliminates a from the set of possible solutions. Since there are exactly 2 n−1 such points, we have 2 n−1 inequalities modeling x ∈ F n 2 x 0 ⊕ . . . ⊕ x n−1 = 0 . Let us suppose now that there exists an inequality c · x + d ≥ 0 that eliminates at the same time two points a and b such that a i = 0 and b i = 1 for some i ∈ [0, n − 1]. Let e be the vector (0, . . . , 1, . . . , 0) with e i = 1. Then, This means that if an inequality eliminates two different points on the cube {0, 1} n , it necessarily also eliminates two points with Hamming distance 1.
Moreover, two points a and b such that wt(a ⊕ b) = 1 cannot both be solutions of x 1 ⊕ . . . ⊕ x n = 0. This explains why we need as many R-linear inequalities as points to eliminate, i.e. 2 n−1 . Therefore, a naïve way to model a linear layer is to model each XOR operation in the way showed in Proposition 5. We will often refer to this as the naïve method. The rest of this section is dedicated to the presentation of more efficient ways for modeling general linear layers.

General modeling
When modeling a mathematical operation for MILP with a system of linear inequalities, the input and output variables play the same role inside each inequality. This shows that modeling a matrix M means modeling the kernel of A = (M |I), where I is the identity matrix. Indeed, for any matrix M with entries in F 2 , One can then model the equation given by each row of A with the XOR modeling. But as we have just seen, the number of constraints for modeling one XOR operation grows exponentially with the number of involved variables. Since our goal is to model the kernel of A and since it is known that for any invertible matrix P ∈ GL n (F 2 ), Ker(P · A) = Ker A, the idea is to find a matrix P ∈ GL n (F 2 ) such that the rows of P · A have minimum Hamming weight and induce thus a minimal number of XOR operations. More precisely, this means finding an invertible matrix P that minimizes where wt(P · A) i, corresponds to the Hamming weight of the i-th row of P · A. The question is now how to find such a matrix P . A first idea would be to search for minimum weight codewords in the linear code generated by the rows This matrix has branch number 4 which means that for all x ∈ F 4 2 , wt(x · (M |I)) ≥ 4. Hence we cannot hope for a more minimal modeling of this linear operation based on the XOR modeling than the one given by (M |I). However, let us take the example of the Skinny MixColumns operation given by the matrix M Skinny below. In that case, the code generated by (M Skinny |I) has minimum distance 2 and it is equivalently generated by the matrix A Skinny obtained by adding in F 2 the fourth line to the first line of (M |I). While the naïve XOR modeling of (M Skinny |I) would have needed 2 3 + 2 + 2 2 + 2 2 = 18 inequalities, using the above matrix for the XOR modeling only requires 14 inequalities.
To demonstrate that this representation is more efficient in practice compared to the naïve approach, we computed the time it takes for the Gurobi Optimizer [GO20] to reach the minimum number of active Sboxes over several rounds of Skinny-128 for the two different modelings of MixColumns. In this experiment, in order to emphasize the impact of the linear layer modeling and to avoid correlations with the modelings of the other parts of the cipher, we used a very simple modeling for the 8-bit Sbox. This Sbox modeling, introduced in [AST + 17] under the name of arbitrary Sbox mode, needs only 2n inequalities for the whole Sbox but only models the following behaviour: if at least one input bit is active then at least one output bit has to be active and vice versa. The timings (in seconds) of this experiment can be visualised in Table 6. We emphasize that the reported timings is the time for the solver to reach what we already know to be the minimum number of active Sboxes. Indeed, as even with the improved modeling the number of inequalities still remains high, our MILP solver takes too long for terminating and thus proving that the found upper bound on the minimum number of active Sboxes is tight. The minimum number of active Sboxes for Skinny has been computed by its designers in [BJK + 16] thanks to wordwise modelings. It is obvious from this table that the new modeling of the Skinny linear layer reduces importantly the solving time. We propose now a new algorithm, Algorithm 4, that given the matrix A = (M |I), finds a matrix P that minimizes Eq. (4). First, the matrix P is initialized to the identity matrix. Then, the algorithm proceeds in a row-wise manner and searches at each step to replace the current row with a better one. To start with, it searches to replace the first row of A with a codeword of the form m · A, m ∈ x ∈ F n 2 x 1 = 1 and wt(m · A) < wt(A 1, ).
After this first step, the matrix P is updated as The algorithm then searches for a replacement for the second row of the matrix P · A in the same way and updates the matrix P if a lower weight codeword has been found. Algorithm 4, whose time complexity is n2 n−1 multiplications of a n-bit binary vector with a matrix with n rows and an arbitrary number of N columns, finds notably the previous result for Skinny (14 inequalities). We also applied Algorithm 4 to the Aes MixColumns. The naïve XOR modeling gives 2176 inequalities for this matrix and Algorithm 4 does not improve this quantity.
We develop now a different idea that permits in some cases to significantly improve the number of inequalities needed for modeling the linear layer, compared to both the naïve approach and Algorithm 4. Algorithm 4 Given a binary matrix A of size n × N returns P ∈ GL n (F 2 ) minimizing Eq. (4).

Changing the Sbox modeling for improving the linear one
The idea of this approach consists in changing the Sbox modeling. Indeed, if we find an invertible block-diagonal matrix (with blocks having the size of the Sbox) where b is the number of words on which MixColumns operates (e.g. b = 4 for the AES, Skinny or Midori), then changing the modeling of the Sbox S into the modelings of allows for a new, potentially better XOR modeling of the MixColumns operation with the equation (M |I)Q = 0. Once a convenient matrix Q has been found, modeling (M |I)Q = 0 can be done using Algorithm 4. In summary, with this method, the problem of finding a good XOR modeling for the linear layer boils down to finding a matrix P ∈ GL n (F 2 ) and a block-diagonal matrix Q ∈ GL 2n (F 2 ) such that P (M |I)Q minimizes (4).
We propose Algorithm 5 for finding such matrices P and Q. The idea of this algorithm is to iterate alternating searches for P and Q, hoping that modifying one of P or Q will allow further optimization. This algorithm then takes as parameter the number of desired iterations p ∈ N. In practice however, for all of our experiments, we have never needed p ≥ 2.
Applying Algorithm 5 on the AES MixColumns operation with parameter p = 1 gives P and Q such that the quantity of Equation (4) initially at 2176 drops down to 1088. However, Algorithm 5 does not give better results for Midori or Algorithm 4 for Skinny.
Matrices P and Q for the AES are given in Appendix E.
To measure at which point such a change in the modeling of the linear layer can be an improvement for the running time, we ran an experiment for the AES similar to the Algorithm 5 Given a matrix A of size n × N , find P ∈ GL n (F 2 ) and block-diagonal Q with 2b blocks such that (P, Q) minimizes n i=1 wt((P · A · Q) i ).
end for 10: return P, Q 16: end procedure one above for Skinny-128. One important difference is that wordwise modelings for the AES have not been able to provide tight lower bounds for the minimum number of active Sboxes because of byte multiplications needed in the MixColumns operation. For example in [MWGP11], the authors used the branch number of the MixColumns operation to obtain lower bounds. In Table 7, we hence give the lowest upper bound on the minimum number of active Sboxes reached with the improved linear layer modeling after the given number of seconds and the same upper bound reached by the naïve linear layer after the given number of seconds. For a discussion on the complexity of Algorithms 4 and 5, see Appendix D.

Modeling of affine equivalent Sboxes
As seen in this section, to apply Algorithm 5, instead of modeling the cipher's original Sbox S, one will need to model one or more Sboxes that are linearly equivalent to S. Therefore, it is necessary to ensure that the modeling of these linearly equivalent Sboxes is not (much) worse than the modeling for S. This leads to the following more general question: "How does the modeling of an Sbox gets affected by affine equivalence?" In our experiments with AES, the only needed affine equivalent Sbox -Sbox AES • Q 0 where Q 0 is given in Appendix E -had a very similar modeling to the original one Sbox AES , leading notably to almost the same number of final inequalities. It is however not clear for us what happens in the general case and we believe that this constitutes an interesting open problem. For example, it could be useful to know for any given Sbox whether one can compute an affine equivalent Sbox that can be modeled with much less inequalities. A related question is whether a lower bound on the number of needed inequalities in the modeling can be found accross the equivalence class of an Sbox.

Other applications
Besides AES, Skinny and Midori we also applied Algorithms 4 and 5 to the linear layers of some other block ciphers. The obtained results are summarized in Table 8.

Impact of the new modelings on the solving time
In this section we analyze the impact of our new modelings on the running time of a MILP optimization problem. We chose to perform our experiments on the problem of deciding whether a differential (δ in , δ out ) is possible, which in practice consists in finding a possible differential characteristic, if this one exists. This kind of computation is used in practice in impossible differential cryptanalysis. We will provide more details on the full problem in Section 5. The goal of these experiments is to quantify in terms of time the impact of both the Sbox and the MixColumns modelings. For this, we considered two different modelings for the Sbox: the logical condition modeling method of Section 2.4 and a combination of the  methods of Sections 2.4 and 2.5. For the linear layer we also considered two cases: in the first case we model the MixColumns matrix via the naïve XOR modeling and in the second we model it with the improved XOR modeling given by Algorithms 4 and 5. We then considered all four different combinations of these Sbox and MixColumns modelings. For launching the experiment, we chose a random set of input and output difference pairs (δ in , δ out ) and asked Gurobi to verify whether the differential (δ in , δ out ) is possible. We ran this experiment for Skinny-128 and the AES and present in Table 9 the average computation times for each cipher. Quantiles and other statistics are available in Appendix G. For the AES, the computation time is dramatically higher when the input and output are sparse (have much more zeros than ones) than when they are dense, hence the need for two different tables. Our intuition for this fact is that there are many more possible differential characteristics inside the differential when the input and the output are dense, which makes it easy for the solver to find one in a few seconds. For the application on AES with sparse ends, we also indicate to our models which bytes are active and inactive in the first two and last two rounds. This reduces the running time of the solver by approximately a factor 10.
As we can see, for Skinny-128 the change of modeling for the linear layer has a much bigger impact on the running time than the change of the Sbox modeling. There are in our opinion three possible reasons for that: First, the number of rounds under study for Skinny, 32, is high enough for the modeling of the linear layers to have impact. Then, as one can see from Table 4, the difference between the two Sbox modelings is not significant. Finally, the number of inequalities per Sbox is rather low (< 400), which gives the linear layer modeling more impact as well.
For the AES, the inverse phenomenon happens: the linear layer modeling does not have an important impact whereas the Sbox modeling divides the running time by 2. Again, we think that a possible explanation of this is that the number of rounds, 5, is too low for the linear layer to a have an important impact, the difference between the two Sbox modelings is significant (see Table 4) and the number of inequalities per Sbox is rather high (≈ 3000) making that the Sbox computations have a higher proportion in the global process.

Applications on impossible differential cryptanalysis
Impossible differential cryptanalysis is a powerful and well-studied cryptanalytic technique introduced independently by Knudsen [Knu98] and Biham et al. [BBS99]. Its principle consists in finding impossible differentials, i.e. an input difference δ in and an output difference δ out that cannot be connected. Any impossible differential attack starts with the discovery of an impossible differential (δ in , δ out ) covering a maximal number of rounds. Traditionally this was done with the U-method [KHS + 03] or its extensions [LLWG14,WW12]. However, in 2017, Sasaki and Todo [ST17b], showed that MILP can be successfully used to prove resistance against impossible differential attacks or for discovering new impossible differential distinguishers. For proving (partial) resistance against impossible differential cryptanalysis with MILP, and as briefly explained in Section 4, one can choose a set of input and output pairs and then solve a MILP differential propagation problem with the given cipher model for each of those pairs. The chosen set is typically composed of all possible inputs and outputs with exactly one active byte, i.e. for which exactly one byte has a difference. When all of those computations result in a valid differential transition, showing thus that the input and output differences can be connected, we consider that resistance against impossible differential cryptanalysis has been partially proven, where partially applies to the fact that the input and output spaces are restricted. However, as seen in Section 4, a single computation with a fixed input and output difference can be too long for permitting to do such kind of proofs for a large meaningful enough set of input/output pairs. To overcome this problem Sasaki and Todo introduced in Section 5 of [ST17b], the Differential Possibility Equivalence technique. This technique reduced the number of MILP instances to solve and permitted to drastically reduce the overall running time of this process. In what follows we briefly present this technique. Then we show how to further improve this method for decreasing the running time further. We applied this technique to provide partial resistance against differential cryptanalysis for 5-round AES and 13-round Skinny-128.

The Differential Possibility Equivalence technique
We consider for the rest of this section that an R-round MILP model is composed of R non-linear layers and begins and ends with a non-linear layer. It is then also obviously composed of R − 1 linear layers.
Suppose that we search for R-round impossible differentials for an SPN cipher. Suppose further that the pairs of input and output differences are restricted in a set S and that we have computed a possible differential characteristic for R rounds x 0 R − → y 0 . The idea of the Differential Possibility Equivalence technique is to exploit this possible differential characteristic to discard other possible transitions (x, y) ∈ S without computing at each time a new R-round differential characteristic x R − → y from scratch. To do so, one chooses r in and r out such that r in + r out < R and gets the differences x and y in the already computed path One then discards all (x, y) ∈ S such that the transitions x rin − − → x and y rout −−→ y are possible, which means computing two smaller differential paths on r in and r out rounds. Sasaki and Todo introduced this technique in [ST17b] with x and y respectively right before and right after the first and last non-linear layers (hence with r in = r out = 1), finding x and y by directly looking at the DDT of the Sboxes used in those non-linear layers. However, it can be interesting to try the same with other values for r in and r out , using MILP models for checking differential paths x rin − − → x and y rout −−→ y. The choice of r in and r out for finding a minimal running time depends on the cipher and the running times for computing differential characteristics for R, r in and r out respectively rounds. Moreover, one should avoid to check differential transitions x rin − − → x and y rout −−→ y when they have a low probability to be possible. In the search for impossible differentials with exactly one active input and one active output byte, it appears that it is rather efficient to restrain the checks for (x, y) such that x shares the same active byte as x 0 and y shares the same active byte as y 0 . Indeed, for the AES with R = 5, r in = r out = 2, the computation of one R-round differential characteristic x 0 R − → y 0 allows to discard all the other pairs (x, y) with the same active bytes. For Skinny-128 with R = 13, r in = r out = 2, for 2 checks x 2 − → x or y 2 − → y, one possible transition gets discarded on average.

Applications to Skinny-128 and AES
In [ST17b] the authors used MILP for proving the maximal number of rounds for which impossible differentials exist for many different designs, if the input and output differences are restricted inside one word (bit, nibble or byte, depending on the design). Most of the analyzed ciphers are based on 4-bit Sboxes while for the analyzed designs using 8-bit Sboxes the details of the Sboxes are not taken into account. The only exception is Midori-128 but whose Sboxes are constructed from two 4-bit ones.
Our new modelings for both big Sboxes and linear layers and the above generalization of the Differential Possibility Equivalence technique permitted us to complement the work done in [ST17b] for 8-bit based ciphers, by taking in particular the Sbox details into account. For both 5-round AES and 13-round Skinny-128 1 , for each pair of input/output bytes (i, j), we ran our models with the Differential Possibility Equivalence technique on the set with 255 × 255 elements with byte i active in input and byte j active in output. Those computations provide us with proofs for partial resistance against impossible differential cryptanalysis.

Conclusion
In this paper, we presented new techniques for improving MILP models simulating differential properties through Sboxes and MixColumns operations. We found better modelings for various Sboxes with sizes ranging from 4 to 8 bits with three different approaches: adding inequalities given by the H-representation of the convex hull of possible transitions, packing impossible transitions in affine subspaces of the form a ⊕ Prec(u) or packing them in (possibly "merged") distorted balls. Our techniques for Sbox modeling are very general: they basically give efficient algorithms to represent any subset of {0, 1} n with a small number of inequalities. Those techniques could then have applications beyond differential cryptanalysis. We also introduced a link between the modeling of F 2 -linear operations and the search for a basis with minimal weight codewords inside F 2 -linear codes. We gave algorithms based on this link to reduce the inequality cost of MixColumns modelings. We then gave insights on how those techniques impact the performance of computing a differential characteristic with a fixed input and output for Skinny-128 and the AES. Those two cases respectively demonstrate the benefits of better modelings for the MixColumns operations and for the Sboxes. Finally, we applied those techniques to proving partial resistance of 13-round Skinny and 5-round AES against impossible differential cryptanalysis with the specific properties of the Sboxes fully taken into account.
Lastly, our techniques could be applied in a straightforward manner to the search for best differential characteristics in various ciphers as explained in [AST + 17] and to similar problems of linear cryptanalysis. The optimal way of solving Problem 2 to improve performance remains an open problem.

B DDT of Present
We also denote the basis vectors as i ↓ e i = (0, · · · , 1 , · · · , 0) for all i ∈ [0, m − 1]. Finally, j b and j c will denote the indices such that a ⊕ b = e j b and a ⊕ c = e jc .
We then have the two following lemmas: Lemma 1. The points a, b, c and a ⊕ b ⊕ c do not belong to Q.
-If b ⊕ e i ∈ P b , x ∈ Q 3 ⊆ Q.
-If b ⊕ e i ∈ Q 1 , a ⊕ e i ∈ P a and x = (a ⊕ e i ) ⊕ a ⊕ c ∈ Q 2 ⊆ Q.
In conclusion, we always have in this last case that A(x) + B(x) + C(x) ≤ 7. Then we also compute the three sets Q 1 , Q 2 and Q 3 , each one containing possible transitions but also impossible transition points that will unfortunatly not be discarded by the new inequality. We highlight these points in blue unless they are already in red. By following notation we get that Following the technique of Proposition 3 for each of the three distorted balls, we get the following three linear constraints:
Finally, the running time of Algorithm 5 is (p + 1) times the running time of Algorithm 4.

E MixColumns modeling for the AES
Algorithm 5 applied on the AES MixColumns outputted the matrices of Table 11. The lines are represented as hexadecimal numbers with the first column element being the least significant bit (i.e. the rightmost one).

F Analyzed linear layers
We provide here a brief description of the linear layers of the ciphers Anubis, Aria and Saturnin whose modeling for MILP we analyzed in Section 3.
• Anubis is a 128-bit block cipher designed by Barreto   cipher whose diffusion layer consists in applying a 16 × 16 binary matrix (given in Figure 3) to different parts of the state. The quantity of Equation (4) is 2048 and is not improved by our algorithms.