Separable Statistics and Multidimensional Linear Cryptanalysis

Abstract. Multidimensional linear cryptanalysis of block ciphers is improved in this work by introducing a number of new ideas. Firstly, formulae are given to compute approximate multidimensional distributions of the encryption algorithm's internal bits. Conventional statistics like the LLR (logarithmic likelihood ratio) are not suited to Matsui's Algorithm 2 for large-dimension data, as the observation may depend on too many cipher key bits. So, secondly, a new statistic which reflects the structure of the cipher round is constructed instead. Thirdly, computing the statistic values that fall into a critical region is presented as an optimisation problem, for which an efficient algorithm is suggested. The algorithm works much faster than brute-forcing all relevant key bits to compute the statistic. An attack on 16-round DES was implemented. We obtain an improvement over Matsui's attack on DES in data and time complexity while keeping the success probability the same. With 2^41.81 plaintext blocks and success rate 0.83 (computed theoretically) we found 2^41.46 key-candidates (close to the theoretically predicted number 2^41.81) for the 56-bit DES key. The search tree used to compute the statistic values which fall into the critical region incorporated 2^45.45 nodes in the experiment; since creating a node is far cheaper than a trial encryption, this cost is at least theoretically lower than that of the final brute force. To get success probability 0.85, which is a fairer comparison to Matsui's results, we would need 2^41.85 data and to brute-force 2^41.85 key-candidates. That compares favourably with the 2^43 achieved by Matsui.


Introduction
Linear Cryptanalysis is a statistical approach to the cryptanalysis of symmetric ciphers. It is a known-plaintext attack which does not require any special plaintext/ciphertext pairs and is therefore a very important tool in practical cryptanalysis. It was introduced by Matsui in [20,21] as an attack on DES. Davies and Murphy came up with another approach to statistical cryptanalysis in [5]. Linear Cryptanalysis exploits the fact that an xor of certain plaintext, ciphertext and key bits is zero with some a priori computed probability p different from 1/2. Such combinations were called linear approximations in [20]. The probability itself depends in some way on the cipher key bits. The method is more efficient if p is far from 1/2; one says a linear approximation is more biased in this case.
The attack is characterised by the number of necessary plaintext/ciphertext pairs (data complexity), by the complexity of ranking relevant sub-keys according to the value of a statistic together with the size of the final brute force (time complexity), and by the success probability. Two variations, Algorithm 1 and Algorithm 2, were suggested in [20]. Algorithm 1 uses R-round approximations, while Algorithm 2 uses (R−1)- or (R−2)-round approximations to attack an R-round cipher. In Algorithm 2 an observation on linear approximations may depend on some key bits from the first and the last rounds of the cipher, and the linear approximations themselves are generally more biased. So one may recover more cipher key bits at a lower price; in other words, the method requires fewer plaintext/ciphertext pairs and is more efficient.
For 16-round DES, Matsui shows how to determine candidates for relevant key bits or key-bit linear combinations by Algorithm 2 with n = 2^43 plaintext/ciphertext 64-bit blocks and success probability 0.85; then 2^43 encryptions are performed to find the correct key [21]. The success probability was found experimentally for the 8-round cipher with 10^4 attack applications and then extrapolated to 16-round DES. Two 14-round linear approximations were used there together.
The expression linear approximation, though well settled in the current literature on cryptanalysis, does not seem quite precise. Formally, one can say it measures how the ciphertext is approximated by the plaintext. However, this intuition is not very helpful within the subject. We believe that linear cryptanalysis and its modifications are based on the view that a linear approximation, or more generally any string x of the encryption algorithm's internal bits, is an ordinary random variable and approximates nothing. Rather, the a priori computed p, or more generally a probability distribution, is an approximation to the real probability (distribution). The latter is a complicated function in many (or all) cipher key bits. However, the approximate probability (distribution) p commonly depends on a small set of the cipher key bits (linear combinations of the key bits), and that makes the method work. For this reason we put that expression in quotation marks in what follows. A more detailed discussion of the meaning of a "linear approximation", and of why using several approximations to the same x does not improve the cryptanalysis, is in Appendix 1 below.
Only a few improvements relating to DES have been published since Matsui's work. In [19] a chosen-plaintext linear attack was suggested, and in [7] the time complexity of the attack's first stage was reduced by using the Fast Fourier Transform. It was found experimentally in [11,12] that the time complexity of Matsui's attack on DES may be decreased with a better ranking of the values of relevant sub-key bits, though the data complexity and success probability remain the same. The success probability was determined experimentally with 21 attack applications, which does not seem enough to justify the figure 0.85.
How to improve Algorithm 1 with more than two "linear approximations" whose distributions depend on the same key bits was shown in [18]. In [2] a framework for using many "linear approximations" considered statistically independent was proposed, though no practical cryptanalysis of 16-round DES was presented; there the sub-keys relevant to the observations on "linear approximations" were considered disjoint, as in [21]. Linear cryptanalysis was further extended in different ways in [14,1,15], see [17]. For instance, [1,15] made use of multidimensional analysis instead of one-dimensional. A good survey of publications on using multiple "linear approximations" is in [16]. Recently, a series of papers on linear cryptanalysis of PRESENT was published, see for instance [6,4,3]. Most of the methods are based on the assumption that the "linear approximations" are statistically independent, which may be true only to some extent. On the other hand, no general methods for computing joint a priori distributions (approximate joint distributions) of multiple "linear approximations" in block ciphers were published before. If the a priori distribution is unknown, it looks difficult to predict the success probability of the relevant statistical attacks. An attack with low success probability has limited usefulness even if it has low complexity. The same limitation holds for the multidimensional linear cryptanalysis of [15]. The present work remedies this deficiency by giving formulae to compute multidimensional probability distributions in Feistel ciphers. The method presented in Section 10 of this work is general and based on a clear mathematical foundation. It is a direct generalisation of how Matsui calculated the probabilities of his "linear approximations" in [20]. The approach is applicable to any round ciphers. These formulae may be further analysed to derive useful information on how the key bits involved in the trail affect those distributions. That was done in Section 11 in the case of DES.
Similar ideas were used earlier to compute joint probability distributions of some particular bits and to study how those distributions depend on the cipher key, for DES in [5,9] and for PRESENT in [8]. Those methods are based on a number of heuristic assumptions and simplifications. In particular, the calculation in [8] was done for a 15-round cipher, where the key bits involved in each round on the trail are the same.
In theory it is possible, as suggested in [15,6], to derive an approximate multidimensional distribution from the correlations of linear functions defined on its domain. For instance, that may be a span of a set of strong "linear approximations". The known distributions of those "linear approximations" are not exact, and the accuracy of the approximations and the number of the key bits involved depend on the chosen trails. So to get an applicable joint distribution the trails have to be somehow compliant with each other; otherwise the final multidimensional distribution may depend on too many key bits. From the point of view of the present work, the approach may require an approximate description of the cipher in question by using a big trail (an appropriate auxiliary event in the terminology of Section 10.4) which incorporates all trails for particular "linear approximations".
Another direction is to study joint distributions of correlations between "linear approximations" as functions of a randomised cipher key, see the latest version of [4] and the references therein. As that seems a difficult line to follow, the analysis is based on various hypotheses on the joint behaviour of the above correlations, which are difficult to justify for a specific cipher as well. To get a practical key-recovery attack it is more natural to assume that the cipher key one wants to find is fixed while the available plaintexts are randomly generated, as in the original linear cryptanalysis and other statistical attacks such as [5]. We rather need to know how the a priori distributions (approximate a priori distributions) depend on the cipher key in exact terms, and that is achieved in the present work for Feistel ciphers.
Two open problems related to Algorithm 2 were posed in [2]. First, how to merge data (find the cipher key) from analysing different "linear approximations" efficiently. Second, how to compute the success probability as a function of the number of available plaintexts and the number of trials in the search phase. A solution to these problems was found in [28]. In particular, an attack on 16-round DES with 2^43 data, the same number of final brute-force trials, and success probability 0.89 was described there. The probability was predicted by theoretical means, and the prediction was found correct experimentally for a similar method in the case of 8-round DES with 10^5 method applications. The attack uses the 10 best 14-round "linear approximations", considered statistically independent. The distributions of those "linear approximations" and the observations on them depend on 53 DES key bits. By solving a particular optimisation problem (stated in its generality in Section 8 of the present work) one finds a set of 2^40 53-bit key-candidates at a price of ≈ 2^40 computations, that is, without brute-forcing 2^53 values of the statistic. The probability that the correct 53-bit sub-key is in this set is 0.89. The present work is a far-reaching generalisation of [28]. Instead of "linear approximations", certain projections (sub-strings of bits or multidimensional linear functions) of the encryption internal states are used. In contrast with [28], we do not assume here that the projections are statistically independent. We are able to compute their approximate joint a priori distributions and therefore, among other things, predict the success probability of the attack correctly. We implemented our method and obtained an improvement over Matsui's result on 16-round DES in data and time complexity while the success probability remains the same, see Section 4.

Feistel Cipher and DES
The methods introduced in this paper are general and applicable to many ciphers. In Sections 11 and 12 we show how they work for DES as an example. In this section DES details are given. Let X_0, X_1 be plaintext blocks of bit-length r each and K_i, i = 1, . . ., R round keys of bit-length s. Then for i = 1, . . ., R the pair of blocks X_{i−1}, X_i is the input to the i-th round of the encryption algorithm, where X_{i+1}, X_i is the output and X_{i+1} = X_{i−1} ⊕ F(X_i, K_i) for the round function F. In the case of DES we have r = 32 and s = 48, and the number of encryption rounds is 16. We keep the notation of [20]; in particular, all bit-string entries are numbered from right to left, starting with 0. In the case of DES the key bits are numbered as in its specification: k_i, where i = 1, . . ., 63 and i ≢ 0 mod 8. Besides, we ignore the initial permutation. See [29] for the DES specification.
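To make the round transformation concrete, here is a minimal sketch of a generic Feistel network following the rule X_{i+1} = X_{i−1} ⊕ F(X_i, K_i); `toy_F` is a hypothetical stand-in round function, not DES's expansion/S-box/permutation F:

```python
def toy_F(x: int, k: int, r: int = 32) -> int:
    """Stand-in round function: xor the round key, then rotate left by 5.
    DES's real F uses expansion, eight S-boxes and a permutation."""
    mask = (1 << r) - 1
    v = (x ^ k) & mask
    return ((v << 5) | (v >> (r - 5))) & mask

def feistel_encrypt(x0: int, x1: int, round_keys):
    """Apply R Feistel rounds: X_{i+1} = X_{i-1} XOR F(X_i, K_i)."""
    for k in round_keys:
        x0, x1 = x1, x0 ^ toy_F(x1, k)
    return x0, x1

def feistel_decrypt(y0: int, y1: int, round_keys):
    """Invert the rounds by running the round keys in reverse order."""
    for k in reversed(round_keys):
        y0, y1 = y1 ^ toy_F(y0, k), y0
    return y0, y1
```

Decryption needs no inverse of F, only the reversed key order, which is the defining convenience of the Feistel structure.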

The Problem
Let x be a vectorial random variable which incorporates some bits from the encryption's first-round output and some input bits to the last round, as x = (X, Y) in Fig. 2. As in Matsui's linear cryptanalysis, an approximate distribution of x may be computed a priori from the encryption algorithm specification. It commonly depends on a relatively low number of the cipher key bits (linear combinations of the key bits), denoted key in Fig. 2; see Section 11 for how this dependence looks for a particular vector x_1 in DES. On the other hand, the observation on x depends on the available plaintext/ciphertext blocks and some key bits from the first and the last rounds, denoted Key in Fig. 2. Assume one guesses the relevant key bits K = (key, Key). If the guess was correct, then the observation follows the a priori distribution (correct key assumption). If not, then the observation follows a distribution which is close to the uniform distribution. We assume it is uniform (wrong key assumption), ignoring the case when the guess on Key was correct but the guess on key was not. In that case the observation usually follows a permuted a priori distribution; at least that is true in [20] and in our experiments with DES, see Section 11. Those assumptions were used by many authors before, and their correctness is supported by the experiments with DES in the present work. The setting described here is the setting of multidimensional linear cryptanalysis, which originates from [1]. However, that work does not suggest any way to compute joint a priori distributions of the encryption algorithm's internal bits, and so the method was not implemented.
Theoretically, according to [1,15], one can use a logarithmic likelihood ratio (LLR) statistic, which depends on both the distribution and the observation, so it depends on K. That provides the most powerful statistical test to distinguish correct and incorrect values [24]. However, the method is not efficient if the size of K (the number of linearly independent combinations in K) is large. In this case one has to rank 2^|K| values of the key bits involved according to the value of the statistic.
At the same time the distributions of some projections (sub-vectors or, generally, any functions) h_i(x) and the observations on them may depend on a much lower number of key bits K_i = (key_i, Key_i). That holds for DES, see Section 11 below, and may hold for other modern block ciphers based on small S-boxes. A good key schedule, e.g., one where the round keys are not linear functions of the main key, may reduce the negative effect of small S-boxes in this sense. However, that requires a separate investigation.
For DES the values of K_i are linear projections of a K-value. The sub-keys K_i which affect the distributions and the observations for the projections h_i may partly coincide or be linearly dependent. In this paper we consider how, by observing the values of several projections h_i(x), to reconstruct a set of K-candidates which contains the correct value with a prescribed success probability. We show that this can be accomplished by solving an optimisation problem efficiently, without brute-forcing the values of K. We also answer what the size of the set of K-candidates is. To this end we will use a novel statistic which reflects the structure of the cipher round. The statistic is a linear combination of LLR statistics for the different projections, and we do not require them to be statistically independent.

Our Contributions
This paper contains the following contributions.
1. An approximate probabilistic description of Feistel ciphers is suggested in Section 10, and a convolution-type formula for computing the approximate probability distribution of multidimensional random variables x constructed from internal bits of the encryption algorithm is derived there [26,25].
5. Our approach allows one to find the number of necessary plaintext/ciphertext blocks, given a desired success probability and the number of K-candidates to brute-force.
6. The attack was implemented for 16-round DES, see Section 12, and it provides an improvement over Matsui's results. We used two independent separable statistics, each based on 14 ten-bit projections, with 54 DES key bits involved overall; see the next section for a summary.

Summary of the Attack for DES
We compute approximate a priori distributions of 14-round DES input/output 14-bit vectors x_1, x_2 in Section 11. The vector x_1 incorporates all variables relevant to Matsui's best "linear approximation" after adding some more variables; then x_2 is produced from x_1 by DES symmetry.
For each x_t, t = 1, 2, some 14 of its 10-bit projections h_ti = h_ti(x_t) are used. The a priori distribution of h_ti depends on 3 key bits (linear combinations of the key bits). The observation on the projection is a function of the plaintext/ciphertext bits and the round keys from the first and the last rounds. It depends on at most 18 key bits. An LLR statistic LLR_ti is constructed for each projection h_ti. It depends on a sub-key of size at most 21, see Section 12.1. So at most 2^21 values of the sub-key are ranked by the statistic LLR_ti. That gives weights for the values of those sub-keys.
The vector s_t = (LLR_t1, . . ., LLR_t14) depends on the cipher key bits and may have two multivariate normal distributions: one for a correct guess and another for an incorrect guess on those key bits. Following the Neyman-Pearson approach, we construct another LLR statistic to distinguish these normal distributions. The final LLR statistic S_t = Σ_{i=1}^{14} ω_ti LLR_ti depends linearly on the values of LLR_ti. So it is separable, see Section 7. That property is important for the tree-search algorithm in Section 8.
Two separable statistics coming from x_1 and x_2 are considered independent, see the next section for a discussion of this assumption. They are used simultaneously. One sets two thresholds z_1 and z_2 (in fact, z_1 = z_2 = z, as s_1, s_2, and therefore S_1, S_2, are equally distributed) and defines the critical region S_1 > z_1, S_2 > z_2, in order to provide a desirable success probability (e.g., 0.83) for the attack. That defines a statistical test. Overall, there are 54 independent key bits which affect the distributions and observations for all the projections h_ti, see Section 12.2. The total weight of a 54-bit sub-key is a tuple of two real numbers, each of them a linear combination of the weights of the (≤ 21)-bit sub-keys for the relevant projections h_ti. The 54-bit key-candidates that pass the test are computed with the tree-search algorithm in Section 8, by creating 2^45.45 nodes in the experiment. Creating a node is a very simple operation. An important feature of the method is that there is no need to check all 2^54 sub-key values to decide. The number of DES 56-bit key-candidates is 2^41.46 (close to the theoretically predicted number 2^41.81). In order to mount the attack one needs n = 2^41.81 plaintext/ciphertext blocks.
To get success probability 0.85, which is a fairer comparison to Matsui's results, we would need n = 2^41.85 and to brute-force 2^41.85 keys.
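The threshold z can be derived from a prescribed success probability: per Section 7, under the correct key each statistic S_t is distributed approximately as N(u, u), a wrong key gives approximately N(−u, u), and with two independent statistics and a common threshold z the success probability is P(N(u, u) > z)^2. A sketch of this computation (the value u = 9.0 below is illustrative, not the one arising in the DES attack):

```python
from statistics import NormalDist
from math import sqrt

def threshold_for_success(u: float, p_success: float, n_stats: int = 2) -> float:
    """Find z with P(N(u, u) > z)^n_stats = p_success,
    where N(u, u) has mean u and variance u."""
    p_single = p_success ** (1.0 / n_stats)
    # P(S > z) = p_single  =>  z = u + sqrt(u) * Phi^{-1}(1 - p_single)
    return u + sqrt(u) * NormalDist().inv_cdf(1.0 - p_single)

def false_positive_rate(u: float, z: float, n_stats: int = 2) -> float:
    """Probability that a wrong sub-key (S_t ~ N(-u, u)) passes all tests."""
    q = 1.0 - NormalDist(mu=-u, sigma=sqrt(u)).cdf(z)
    return q ** n_stats

u = 9.0                                 # illustrative value only
z = threshold_for_success(u, 0.83)      # threshold giving success rate 0.83
```

The false-positive rate then drives the expected number of wrong key-candidates passing the test, i.e. the size of the final brute force.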

Assumptions
In this section we summarise the assumptions underlying the statistic and the attack.
1. A sample of n plaintext blocks, generated uniformly and independently, and their encryptions with the same cipher key are available.
2. Correct key assumption. Under correct relevant key bits the distribution of the encryption internal vectors (e.g., the 14-bit vectors x_1, x_2 in Section 11 in the case of DES) involved in the attack is close to an approximate distribution computed a priori by Theorem 1. Technically, we assume X_0, X_1, . . ., X_R are uniformly distributed and an event C has happened. Depending on the event, one may compute the exact or approximate probabilities of various events in the encryption algorithm. The exact probabilities depend on all key bits and are difficult to handle, while the approximate probabilities may depend on only a few key bits.
3. Incorrect key assumption. Under incorrect relevant key bits the distribution of the above vectors (e.g., the 14-bit vectors x_1, x_2 in Section 11 in the case of DES) is close to the uniform distribution. Those distributions may vary, so relaxing the assumption in line with [3] may lead to an improvement.
4. We use limit distributions (produced by the Central Limit Theorem) of the main statistic S in Section 7.2 in order to compute the success probability and the attack complexity.
5. In this particular cryptanalysis of DES the vectors x_1, x_2 in Section 11 incorporate different internal bits of the encryption and are therefore considered statistically independent. So two separable statistics, considered independent, are used. Ideally, we would need the joint distribution of x_1, x_2, which is a 28-bit vector. Though feasible, that would take more time. On the other hand, we wanted to demonstrate that our method is flexible and that several independent separable statistics may be used simultaneously.

Separable Statistics
Let x be a vectorial random variable and let an observation ν = (ν_1, . . ., ν_m) on m projections, which are sub-vectors or generally any functions of x, be available. Here ν_i denotes a vector of observations on the outcomes of the projection h_i(x). We do not assume the projections are statistically independent. In this cryptanalysis ν_i is a function of the available plaintext/ciphertext blocks and the key bits Key_i. By a statistic we mean a function which depends on the observation ν. A statistic S(ν) is called separable if it can be represented as

S(ν) = Σ_{i=1}^m S_i(ν_i),     (1)

where the S_i(ν_i) are statistics computed for the different h_i(x). This property allows analysing the data ν in parts, by analysing the ν_i separately. The notion was introduced in [23] to study statistical tests which distinguish discrete distributions. In this cryptanalysis the statistic S_i(ν_i) depends on the a priori distribution of h_i(x), and therefore on the key bits key_i, besides the observation ν_i and the key bits Key_i. So S_i(ν_i) depends on K_i. The statistic S(ν) depends on the available plaintext/ciphertext blocks and the key bits K, which incorporate all K_i. That defines the statistic's domain. In fact, the S_i are weighted LLR statistics for h_i(x).
To get the main statistic S(ν), the Neyman-Pearson approach is applied again in Section 7.2. We will write S(K) and S_i(K_i) instead of S(ν) and S_i(ν_i) to stress the dependence of the statistics on the sub-keys K and K_i respectively. So (1) may be written as

S(K) = Σ_{i=1}^m S_i(K_i).

One decides that a value of K is correct if S(K) > z for some threshold z. That defines the critical region. If the distribution of S(K) is known, then the value z is determined by a prescribed success probability. One can also determine the average number of wrong values of K which pass the test. That defines the complexity of the final key search. The values of K_i which agree on common key bits or, more generally, on common linear subspaces of the key bits are to be combined to get a value of K which falls into the critical region. That is an instance of the optimisation problem described in Section 8. An efficient algorithm to solve it is introduced in Section 8.2. The algorithm walks over a search tree, creating new nodes if certain linear inequalities, implications of S(K) > z, are satisfied, and takes advantage of the fact that the statistic is separable. The computation cost is much lower than 2^|K|. One may use several statistically independent x, so several statistics of that kind may be used simultaneously.
Another statistic is derived in Appendix 3. It is based on a more direct application of the Neyman-Pearson approach. However, it is separable only for statistically independent projections. That is not true for the bunches of projections (28) and (29) in this cryptanalysis of DES, as all the projections inside each bunch are statistically dependent. Therefore the second statistic does not fit well within this cryptanalysis and will not be used.

Notation
Let x be a random variable with N outcomes denoted 1, 2, . . ., N. Assume x may have two probability distributions: P = (p_1, . . ., p_N) and Q = (q_1, . . ., q_N) for non-zero p_i, q_i. Let V = (V_1, . . ., V_N) denote the outcome frequencies for x after n trials, so that Σ_{j=1}^N V_j = n. In other words, V_j is the number of times the outcome j was hit.

Main Statistic
Let x follow the distribution P. Then P_i = (p_i1, . . ., p_iN_i) denotes the distribution of h_i(x), where p_ib = Σ_a p_a and the sum is over a such that h_i(a) = b. Similarly, if x is distributed according to Q, then Q_i = (q_i1, . . ., q_iN_i) is the distribution of h_i(x). For each i and b we have p_ib, q_ib ≠ 0. We consider the LLR (logarithmic likelihood ratio) statistic for h_i:

LLR_i(ν_i) = Σ_{b=1}^{N_i} ν_ib ln(q_ib / p_ib).     (2)

According to the Neyman-Pearson lemma [24], LLR_i provides the most powerful test to distinguish the distributions P_i and Q_i by observing independent samples. By a standard argument, see for instance [1], for independently generated samples we get LLR_i(ν_i) = Σ_{t=1}^n R_it, where the R_it are independent identically distributed random variables. R_it takes the value ln(q_ib/p_ib) with probability p_ib for b = 1, . . ., N_i, or with probability q_ib for b = 1, . . ., N_i, depending on whether x follows P or Q. In this cryptanalysis the independence is provided by the independence of the plaintext blocks, see Section 6.
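The statistic (2) is a plain weighted sum over the frequency vector; a minimal sketch (the distributions P, Q and the frequencies below are made up for illustration):

```python
from math import log

def llr(freqs, p, q):
    """LLR_i(nu_i) = sum_b nu_ib * ln(q_ib / p_ib).
    Positive values favour Q, negative values favour P."""
    assert len(freqs) == len(p) == len(q)
    return sum(v * log(qb / pb) for v, pb, qb in zip(freqs, p, q))

# Toy 4-outcome projection; P is uniform, Q slightly biased.
P = [0.25, 0.25, 0.25, 0.25]
Q = [0.28, 0.22, 0.26, 0.24]
nu = [280, 220, 260, 240]   # observed frequencies after n = 1000 trials
score = llr(nu, P, Q)       # positive here: the sample looks like Q
```

In the attack the frequency vector ν_ti, and hence the score, is recomputed for each guess of the relevant sub-key bits.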
Let μ_iP, σ_iP denote the expectation and the variance of R_it under the condition that ν_i follows the distribution P_i. By [1], if the distributions P_i and Q_i are close enough, then μ_iP ≈ −μ_iQ and σ_iP ≈ σ_iQ. In this section we will prove a more general statement.
Let s(ν) = (LLR_1(ν_1), . . ., LLR_m(ν_m)). Then, by the argument above, s(ν) = Σ_{t=1}^n R_t, where R_t = (R_1t, . . ., R_mt) are independent identically distributed vectorial random variables. The expectation of R_t under the condition that ν follows the distribution P is μ_P = (μ_1P, . . ., μ_mP). Let C_P denote the covariance matrix of R_t. If the distributions P and Q are close enough, then μ_Q ≈ −μ_P and C_P ≈ C_Q by the following lemma.

Lemma 1. Let q_a = p_a + ε_a, where |ε_a| ≤ δ p_a for all a. Then μ_iQ + μ_iP = O(δ^3), μ_iQ = O(δ^2), and c_ijQ − c_ijP = O(δ^3) for all i, j.
Proof. By definition, μ_iQ = Σ_{b=1}^{N_i} q_ib ln(q_ib/p_ib) and μ_iP = Σ_{b=1}^{N_i} p_ib ln(q_ib/p_ib). We remark that q_ib = p_ib + ε_ib, where ε_ib = Σ_{h_i(a)=b} ε_a. By expanding the logarithm,

ln(q_ib/p_ib) = ln(1 + ε_ib/p_ib) = ε_ib/p_ib − ε_ib^2/(2 p_ib^2) + O(δ^3).     (3)

Then, since Σ_b ε_ib = 0,

μ_iP = −Σ_b ε_ib^2/(2 p_ib) + O(δ^3),   μ_iQ = Σ_b ε_ib^2/(2 p_ib) + O(δ^3),

and so μ_iQ + μ_iP = O(δ^3) and μ_iQ = O(δ^2). Let x have the distribution P. By c_ijP we denote an entry of C_P, the covariance between R_it and R_jt. One can see that R_it takes the value ln(q_ib/p_ib), where b = h_i(a), with probability p_a. By definition,

c_ijP = Σ_a p_a ln(q_{i,h_i(a)}/p_{i,h_i(a)}) ln(q_{j,h_j(a)}/p_{j,h_j(a)}) − μ_iP μ_jP,

and similarly for c_ijQ with q_a in place of p_a. So

c_ijQ − c_ijP = Σ_a ε_a ln(q_{i,h_i(a)}/p_{i,h_i(a)}) ln(q_{j,h_j(a)}/p_{j,h_j(a)}) − (μ_iQ μ_jQ − μ_iP μ_jP),

where μ_iQ μ_jQ = μ_iP μ_jP + O(δ^5) by the above argument. By (3), ln(q_ib/p_ib) = O(δ), and by the condition |ε_a| ≤ δ p_a. Therefore c_ijQ − c_ijP = O(δ^3). That proves the lemma.
By the Central Limit Theorem, for large enough n the vector s(ν) is distributed as a multivariate normal random variable N(nμ_P, nC_P) or N(nμ_Q, nC_Q). To distinguish between P and Q by observing the value of ν one may distinguish between the normal distributions above. Assume the matrices C_P and C_Q are invertible. That always happens in our experiments with DES, though the determinants are fairly small. Then the normal distributions have densities. A normalised logarithmic likelihood ratio statistic is

S(ν) = ln [ f_{N(nμ_Q, nC_Q)}(s(ν)) / f_{N(nμ_P, nC_P)}(s(ν)) ], up to normalisation.     (4)

Generally, it is a quadratic function of s(ν). As C = C_Q ≈ C_P, the statistic is approximately linear. Indeed, let μ = μ_Q. We take into account that μ_P ≈ −μ, and by expanding the brackets in the expression for S(ν) we get

S(ν) ≈ s(ν) C^{-1} μ^T = Σ_{i=1}^m S_i(ν_i),     (5)

where S_i(ν_i) = ω_i LLR_i(ν_i) for some coefficients ω_i, the entries of C^{-1} μ^T. Therefore the approximation (5) to the statistic S(ν) is separable. That property will be used in the search algorithm in Section 8.2 and in the cryptanalysis of DES, see Section 12. Denote u = n μ C^{-1} μ^T; then u > 0. The expectation of s(ν) is nμ (under Q) and its covariance matrix is nC. So the expectation of S(ν) is ≈ ±u and its variance is ≈ u. So if x follows Q, then S(ν) is distributed approximately as N(u, u). If x follows P, then S(ν) is distributed approximately as N(−u, u).
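Once μ and C are known, the weights ω = C^{-1}μ^T and the parameter u = nμC^{-1}μ^T of the limit distributions N(±u, u) follow from one linear solve. A dependency-free sketch for a hypothetical two-dimensional case (the per-trial means and covariances below are invented, not values from the DES attack):

```python
def weights_and_u(mu, C, n):
    """omega = C^{-1} mu^T and u = n * mu C^{-1} mu^T for a 2x2 C.
    Under Q the statistic S = sum_i omega_i LLR_i is approx N(u, u);
    under P it is approx N(-u, u)."""
    (a, b), (c, d) = C
    det = a * d - b * c                      # C assumed invertible
    inv = [[d / det, -b / det], [-c / det, a / det]]
    omega = [inv[0][0] * mu[0] + inv[0][1] * mu[1],
             inv[1][0] * mu[0] + inv[1][1] * mu[1]]
    u = n * (mu[0] * omega[0] + mu[1] * omega[1])
    return omega, u

# Hypothetical per-trial means/covariances of (R_1t, R_2t) under Q:
mu = [2e-4, 1.5e-4]
C = [[1e-3, 2e-4], [2e-4, 8e-4]]
omega, u = weights_and_u(mu, C, n=2**20)
```

For the 14-dimensional case of the DES attack one would use a general linear solver; u > 0 is guaranteed whenever C is positive definite.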
A heuristic argument to justify the distributions of the statistic S(ν) was given in this section. Though the distributions work well in our experiments with DES, it is an open problem to obtain a rigorous proof.

Optimization Problem
In this section we give an algorithm to solve a particular optimisation problem. This algorithm is used in Section 9 to construct the set of K-values such that S(K) > z.
Let A_i, i = 1, . . ., m be matrices of size r_i × n over the binary finite field, of rank r_i, where the r_i are relatively low in comparison with n. Note that in this section n denotes the number of variables, not the number of plaintext blocks. Let X = (x_1, . . ., x_n) be a vector of n unknowns. We consider a system of inclusions (a system of MRHS equations according to [25])

A_i X ∈ {a_i1, . . ., a_it_i},  i = 1, . . ., m,     (6)

where {a_i1, . . ., a_it_i} are given vectors of length r_i over the same field. Let S_i be a weight function on the right-hand-side vectors in (6). If a is a vector of length r_i and a ∉ {a_i1, . . ., a_it_i}, then we set S_i(a) = −∞. The function S_i may be vectorial, defined over the real numbers including −∞, and it should be of the same dimension for every i. Let A be a matrix composed of a basis of the space generated by the rows of all A_i. To simplify the notation we assume that rank(A) = n. The problem is to find all values of X such that the vectorial inequality

Σ_{i=1}^m S_i(A_i X) > z     (7)

holds for some vectorial threshold z. One can consider this problem over any field; in other words, the entries of X may take values from any field. The only limitation is that the numbers of vectors on the right-hand sides of (6) are finite. In the case of a finite field the problem may be solved by brute force, by trying all values of X. We now suggest a method that works faster. The general case rank(A) ≤ n is reducible to the case rank(A) = n by rewriting (6) in new variables Y = AX.
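A brute-force reference solver makes the problem statement concrete over GF(2): weights are stored only for the allowed right-hand sides, so a missing vector plays the role of S_i(a) = −∞. The system below is a hypothetical toy instance, not one arising from DES; the tree-search algorithm of Section 8.2 is what avoids enumerating all 2^n values of X.

```python
from itertools import product

def mat_vec_gf2(A, x):
    """A_i X over GF(2): A is a list of rows (tuples of 0/1 entries)."""
    return tuple(sum(r * v for r, v in zip(row, x)) % 2 for row in A)

def brute_force_solve(systems, z):
    """systems: list of (A_i, S_i), where S_i maps allowed right-hand-side
    vectors to weights; a vector missing from S_i counts as -infinity
    (the inclusion (6) fails).  Returns all X with sum_i S_i(A_i X) > z."""
    n = len(systems[0][0][0])
    sols = []
    for x in product((0, 1), repeat=n):
        total = 0.0
        for A, S in systems:
            w = S.get(mat_vec_gf2(A, x))
            if w is None:          # A_i X is not an allowed right-hand side
                break
            total += w
        else:
            if total > z:
                sols.append(x)
    return sols

# Toy instance with hypothetical weights in X = (x1, x2, x3):
systems = [
    ([(1, 0, 0)], {(0,): 1.0, (1,): 3.0}),        # A1 X = x1
    ([(0, 1, 0)], {(1,): 2.0}),                   # A2 X = x2
    ([(0, 0, 1), (1, 1, 0)], {(1, 0): 5.0}),      # A3 X = (x3, x1 + x2)
]
solutions = brute_force_solve(systems, z=9.0)     # [(1, 1, 1)]
```

Here only X = (1, 1, 1) satisfies all three inclusions, with total weight 3 + 2 + 5 = 10 > 9.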

Example of the Problem
Let a system of 3 MRHS equations in the variables X = (x_1, x_2, x_3), with weight functions S_i, i = 1, 2, 3, all of dimension 1, be given.
One is to find all x_1, x_2, x_3 such that the inequality (7) holds. The solution is x_1, x_2, x_3 = 1, 1, 1.

Algorithm
The algorithm is described in terms of linear functions rather than vectors. Thus the A_i X are vectorial linear functions and AX is a basis of the linear space generated by the entries of all A_i X. Assume a sequence of subspaces generated by sets of linearly independent basis functions T_j such that

T_1 ⊂ T_2 ⊂ . . . ⊂ T_r.     (9)

One can assume that T_{j−1} is a subset of T_j and T_r = AX. The choice of (9) affects the time complexity of the algorithm below. In particular, it is important to keep the growth of the dimension stable, for instance dim T_j − dim T_{j−1} = 1.

1. (precomputation) For each j, i one defines the subspace T_ji = T_j ∩ A_i X by its basis T_ji. One can assume that T_ri = A_i X. For each value T_ji = a_i, the maximum of S_i achieved upon that fixation of T_ji is stored. We denote that maximum by d_ji(a_i).
If T_ji = 0, then the maximum is denoted d_ji. Formally,

d_ji(a_i) = max S_i(a), where the maximum is taken over the right-hand-side vectors a in (6) consistent with the fixation T_ji = a_i.

For each j and i one keeps 2^|T_ji| ≤ 2^r_i real numbers d_ji(a_i).
2. We set T_0 = 0. Then we start the search with j = 1 and implement the following recursive step. Let for some j ≥ 1 the value of T_{j−1} = b be already determined. We will determine a value for T_j. Take any value T_j = a that extends the value T_{j−1} = b. For each i, as T_ji ⊆ T_j, compute the value T_ji = a_i and look up d_ji(a_i). One checks the inequality

Σ_{i=1}^m d_ji(a_i) > z.     (10)

Let (10) hold. If j = r, then to find the solution X one solves the system of linear equations a = AX; in this case a_i = A_i X and S_i(A_i X) = d_ri(a_i), and (7) holds.
Another value for T_r is then examined, or one backtracks, that is, j ← j − 1, and one repeats the step.
If j < r, then j ← j + 1 and one repeats the step. If (10) does not hold, then another value for T_j is examined or one backtracks.
The algorithm is an adaptation of a gluing-type algorithm from [27]. It is justified by the following lemma.
Lemma 2. Let 1 ≤ j ≤ r and let the value T_j = a be an extension of the value T_{j−1} = b. Then

Σ_{i=1}^m d_ji(a_i) ≤ Σ_{i=1}^m d_{j−1,i}(b_i).

Proof. By the definition of a_i and d_ji(a_i), the value T_j = a is an extension of the value T_{j−1} = b; in other words, T_j = a implies T_{j−1} = b, so every right-hand-side vector consistent with T_ji = a_i is consistent with T_{j−1,i} = b_i. Hence d_{j−1,i}(b_i) ≥ d_ji(a_i) for any i. That implies the statement.
By Lemma 2, the inequality Σ i=1..m S i (A i X) > z implies the inequalities (10) for any 1 ≤ j ≤ r, as d ri (a i ) = S i (A i X), where a i = A i X, a = T r = AX. Therefore a value of X which satisfies (7) is never rejected by the decision rule (10) for any j = 1, . . ., r.

Example of the Problem Solution

The search tree is presented in Fig. 3; we demonstrate how it is constructed. To construct the first node one sets x 1 = 0 and checks the relevant inequality. It is false, so one backtracks, sets x 1 = 1 and checks again. This is true, so one extends to x 1 , x 2 = 10 and checks. This is false, so one backtracks, puts x 1 , x 2 = 11 and checks. That is true, so x 1 , x 2 , x 3 = 111 is the only solution to the problem. The complexity is determined by the number of constructed nodes. The tree in Fig. 3 incorporates 6 nodes besides the root, so one checks 6 inequalities, whereas brute force requires checking the 8 inequalities (8).
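A minimal Python sketch of the search on a toy instance may help. The statistics, their dependency sets and the threshold below are invented for illustration, and the maxima d ji are recomputed on the fly instead of being precomputed and stored as in step 1:

```python
from itertools import product

# Toy instance: 3 key bits, three statistics, each depending on a subset of bits.
# deps[i] lists the bit positions statistic S_i reads; tables[i] maps those bit
# values to the statistic value. All numbers are invented for illustration.
deps = [(0,), (0, 1), (1, 2)]
tables = [
    {(0,): -1.0, (1,): 2.0},
    {(0, 0): 0.5, (0, 1): -0.5, (1, 0): -2.0, (1, 1): 1.5},
    {(0, 0): -1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 2.5},
]
r, m = 3, 3

def d(j, i, prefix):
    """Max of S_i over all completions of the first j fixed bits.
    In the real algorithm these maxima are precomputed and stored."""
    return max(
        tables[i][tuple((prefix + rest)[p] for p in deps[i])]
        for rest in product((0, 1), repeat=r - j)
    )

def search(z):
    """Backtracking search: keep a prefix only while sum_i d_ji > z."""
    solutions, stack = [], [()]
    while stack:
        prefix = stack.pop()
        j = len(prefix)
        if sum(d(j, i, prefix) for i in range(m)) <= z:
            continue  # reject this value of T_j and backtrack
        if j == r:
            solutions.append(prefix)
        else:
            stack += [prefix + (0,), prefix + (1,)]
    return solutions

# Brute force over all 2^3 assignments, for comparison.
brute = [x for x in product((0, 1), repeat=3)
         if sum(tables[i][tuple(x[p] for p in deps[i])] for i in range(3)) > 4.0]
print(search(4.0), brute)
```

Lemma 2 guarantees that pruning a prefix whenever Σ i d ji ≤ z never discards a solution, so the pruned search returns exactly the values found by brute force.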

Application in Cryptanalysis
Let a number of statistically independent vectors x t be given along with their projections h ti (x t ), i = 1, . . ., m t . For instance, x 1 , x 2 are the 14-bit vectors (24) and (27) in the cryptanalysis of DES below. They depend on different internal bits of the encryption and therefore may be considered independently distributed. We use some of their 10-bit linear projections.

Let n plaintext/ciphertext pairs be available. The observation on h ti (x t ) is a string of frequencies ν ti of length N ti ; in this cryptanalysis of DES, N ti = 2 10 . Let us denote K ti = (key ti , Key ti ), where key ti are the key bits which affect the a priori distribution of h ti (x t ), and Key ti are the key bits from the first and the last round keys which affect the observation on h ti (x t ). Therefore K ti are linear functions (at least in the case of DES) in unknown cipher key bits. Let K be a list of linearly independent functions in all K ti . For the DES cryptanalysis with x 1 , x 2 we have rank(K) = 54.
For each possible value K ti one computes the value S ti (K ti ) = ω ti LLR ti (ν ti , K ti ) by (2). One then combines the values of K ti into a value of K such that (11) holds for all t and some thresholds z t , to be defined later from a prescribed success probability.
One can easily represent all the inequalities (11) together as one vectorial inequality (7). Therefore the algorithm from Section 8.2 is applicable.
We call a value of K which passes the test (11) a K-candidate. After the test each K-candidate is extended to a key-candidate (a 56-bit key in the case of DES). All such key-candidates are then brute forced. The method succeeds if (11) is true for the correct value of K. We now analyse the success probability of the method and the number of K-candidates.

Success probability and the number of K-candidates
Assume the value of K is correct. Then the value of K ti is correct too. The observation on every h ti (x t ) has a distribution derived from the a priori distribution of x t . The statistic S t (K) on the left hand side of (11) has the normal distribution N(u t , u t ) for every t if x t follows the a priori distribution. Here u t = n µ t C t −1 µ t T , where nµ t and nC t are the expectation vector and covariance matrix of the vectorial random variable s t (ν t ) constructed with LLR statistics for h ti (x t ), i = 1, . . ., m t , see Section 7.2. For each t the success is not to miss the correct value of K t . The probability of success is computed by (12), where N(u t , u t ) denotes a random variable as well. As the x t are independent, the success probability of the whole method is Π t (1 − β t ).
If the value of K is incorrect, we assume that all K ti are incorrect; the number of K-values for which this is not true is negligible. So one can assume that the observation on every h ti (x t ) is uniformly distributed and the statistic S t (K) has the normal distribution N(−u t , u t ). The fraction of incorrect K which pass the test for one t is given by (13), and the fraction of incorrect K which pass the test for all t is Π t (1 − α t ), as the x t are independent.
The number of K-candidates is on average 2 rank(K) Π t (1 − α t ). So the number of cipher key values to brute force, that is the number of key-candidates, is 2 56 Π t (1 − α t ) in the case of DES. Assume one wants to brute force 2 s key-candidates with maximum success probability. One then searches for z t such that Π t (1 − α t ) = 2 s−56 , maximising the success probability Π t (1 − β t ).
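A sketch of this threshold selection for t identical statistics, with made-up values of u and s and a bisection inverse of the normal CDF (not the paper's code):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p, lo=-40.0, hi=40.0):
    """Inverse normal CDF by bisection (illustrative, not high precision)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy numbers: u stands for n*mu*C^{-1}*mu^T of one statistic; s is the log2
# budget of 56-bit key-candidates to brute force. Both values are made up.
u, s, t = 9.0, 42.0, 2

# Under a wrong key S ~ N(-u, u), so Pr(S > z) = Phi(-(z + u)/sqrt(u)) = 1 - alpha.
# Choose z so that (1 - alpha)^t * 2^56 = 2^s.
frac = 2.0 ** ((s - 56.0) / t)          # required 1 - alpha per statistic
z = -u - math.sqrt(u) * phi_inv(frac)
# Under the correct key S ~ N(u, u), so Pr(S > z) = Phi((u - z)/sqrt(u)) = 1 - beta.
success = phi((u - z) / math.sqrt(u)) ** t
print(round(z, 3), round(success, 3))
```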

Multivariate Probability Distribution in Feistel Ciphers
Based on the analysis of the encryption algorithm we get a priori probability distributions of internal bits in Feistel Ciphers, see Section 2 for the definitions.

Notation
Let Y be a bit string of some length; then Y {i, j, . . ., k} = Y [i] ⊕ Y [j] ⊕ . . . ⊕ Y [k] denotes the xor of the indicated bits of Y , while Y [i, j, . . ., k] denotes the string of those bits.

Multivariate Distributions
Assume the plaintext X 0 , X 1 is taken uniformly at random from the set of all 2r-bit strings and the cipher key we want to recover is fixed. The ciphertext X R+1 , X R and any internal bits of the encryption algorithm are then random variables. Given strings of indices (masks) Ω 0 , Ω 1 , Ω R , Ω R+1 , our goal is to compute the a priori distribution of the vectorial random variable Z defined by (15), which is to be used in this cryptanalysis below. The sought distribution depends on the cipher key, and its exact calculation is a very difficult task. Instead, we will construct an approximation to that distribution which depends on a lower number of the key bits, as was done for one-bit "linear approximations" in [20].

Exact Probabilistic Description of a Feistel Cipher
Let X 0 , X 1 , . . ., X R+1 now be random, independently generated r-bit blocks and K 1 , . . ., K R fixed round keys of bit-length s. Let us consider the event C. The event C depends on the whole cipher key, so it is difficult to calculate Pr(E|C) by this formula. Instead, a relaxed version of (16) will be used.

Approximate Probabilistic Description of a Feistel Cipher
We define a larger event C Γ , so that C implies C Γ , see for instance (17) below, and then put Pr(E|C) ≈ Pr(E|C Γ ) = Pr(E, C Γ )/Pr(C Γ ). That is an approximate description of the cipher. It depends on the event C Γ ; obviously, by taking another event we get another approximate description of the cipher. As our goal is to compute an approximate distribution of (15), a relevant event C Γ is to be taken. This approach was already implicitly used by Matsui in [20] to compute the probability of his "linear approximations", see Section 10.5.

The accuracy of approximate descriptions defined this way is unclear; it is even unclear how to measure that accuracy. Two important parameters play a role: the quality of the distribution and the number of the key bits which affect the distribution (and the number of the key bits which affect the observation in the case of Matsui's Algorithm 2). The quality of the distribution may be measured by its Euclidean distance to the uniform distribution; that measure is called quadratic imbalance in [1]. Intuitively, a better approximate distribution should depend on a larger set of the key bits, see for instance Section 11.1, where another, marginally better, approximate distribution of x 1 defined by (24) is constructed. However, using such distributions may reduce the efficiency of an attack as they may depend on a significantly larger set of the key bits. At the same time, by an informal argument in Appendix 1, any two very good approximate distributions are essentially the same; in particular, they essentially depend on the same key bits. That is in accordance with the experiments in this paper: by computing approximate distributions for the same vector with different trails, one gets either a uniform distribution or distributions which are very close to each other. This probably means that using more than one approximate distribution will not provide any advantage, though that requires further investigation. In any case, the approach gives good results in practice, both in the original linear cryptanalysis [20] and in the present work.
For Z defined by (15) and a bit string A of the same length, we will derive a formula to compute the exact value of Pr(Z = A|C Γ ) for C Γ defined by (17). We see that Pr(C Γ ) = 2 −Σ i |Γ i | . One says Γ = (Γ 1 , . . ., Γ R ) are output masks for multivariate round approximations (called round sub-vectors here) in R consecutive rounds respectively, and ∆ i are input masks. The sequence of Γ i , ∆ i defines a trail, see Section 10.6 for definitions. Trails are classically used to compute probability distributions of one-bit "linear approximations" for DES in [20]. The approximate distribution of (15) does not depend on the input masks ∆ i in the internal rounds, that is for i = 2, . . ., R − 1, if the trail satisfies some natural conditions, see Section 10.6. Such trails will be called regular. We remark that the probability Pr(Z = A|C Γ ) depends only on the key bits involved in the right hand sides of (17).

Regular Trails
The sequence (19) is called a trail. The members of the trail are called round sub-vectors; they are vectorial functions in X i . The index subsets ∆ i , Γ i are called input and output masks for the round sub-vectors (a "linear approximation" for the round function in the terminology of [20]), see Fig. 4. The distributions of round sub-vectors are easy to derive from the definition of the round function. Our goal is to compute the joint distribution (15) of some input and output bits of an R-round Feistel cipher by using a certain trail. Let K i [Λ i ] and X i [Θ i ] denote the round key bits and input bits relevant to the function F i [Γ i ]. For instance, in the case of DES the key bits K i [23, . . ., 18] and input bits X i [16, . . ., 11] are relevant to F i [24, 18, 7, 29]. We call the trail (19) regular if the conditions (20) hold, where Γ 0 = Γ R+1 = ∅. We recall that the endpoints of the trail are fixed as we are to compute the distribution of (15). It is easy to check the following statement.
For R = 3 a regular trail exists if and only if Ω 3 \ Θ 3 ⊆ Ω 1 and Ω 1 \ Θ 1 ⊆ Ω 3 . Generally, there is a large variety of auxiliary events C Γ , or equivalently, regular trails for computing approximations to the actual distribution of (15). Those trails produce generally different distributions, and, in particular, the distributions may depend on different key bits.

Convolution Formula for the Distribution
Assume a regular trail (19), where Γ = (Γ 1 , . . ., Γ R ) are output masks. We now produce a convolution type formula to calculate an approximate distribution of the vector (21) for that trail. That is, we give a formula to calculate Pr(Z = A|C Γ ), where C Γ is defined by (17). Lemma 4 below states that the distribution does not depend on ∆ i for i = 2, . . ., R − 1. To simplify notation, we put Γ 0 = ∅, Γ R+1 = ∅ and denote by q i the probability distribution of round sub-vectors. In DES, if only non-adjacent S-boxes are involved in the trail (19), then by the definition of F i we have q i (b, a, k) = q i (b ⊕ k[∆ i ], a, 0); we denote the latter by q i (b ⊕ k[∆ i ], a).
Theorem 1. Let X 0 , . . ., X R be distributed independently and uniformly at random, and let (19) be a regular trail. Then (22) holds, where the sum is over the intermediate values A 2 , . . ., A R−1 .
Proof. By conditional and total probability formulas,

and as the events
are equivalent. We took into account that the event C Γ is defined by (17). By E 1 and E 2 we denote the two events above. By the definition of a regular trail, the two events involve no common variables, so E 1 , E 2 are independent as they depend on different bits of X i , i = 1, . . ., R. We can now split the latter probability into a product. As Pr(C Γ ) = 2 −Σ i |Γ i | , we get (22). That finishes the proof.

Distribution Properties
The conditions of Theorem 1 are satisfied in the setting above; that is an extension of the conditions upon which the distribution of a one-bit "linear approximation" was computed by Matsui in [20]. We now study properties of regular trails and the relevant distributions.
Lemma 4. Let (19) be a regular trail, then the distribution (22) does not depend on ∆ i .
Proof. The statement follows as all other terms in (22) do not depend on ∆ i .
Lemma 4 implies that to reduce calculation cost one can take ∆ i = Θ i ∩ (Γ i−1 ∪ Γ i+1 ), i = 2, . . ., R − 1, for a regular trail (19); that produces the same distribution by (22). We call a regular trail (19) reduced if this equality holds for all i = 2, . . ., R − 1. It is not difficult to see that if the trail is not reduced, then one can construct another trail which gives the same distribution for (21), or the distribution itself degenerates into the distribution of a sub-vector of (21).
We say that H i holds when the corresponding condition on the masks is satisfied. Similarly to the proof of Lemma 4, one proves
Lemma 5. Let (19) be a regular trail and let H i , H i+1 or H i , H i+2 hold simultaneously for some i. Then (22) provides a uniform distribution.

Recurrent Formula
The computation with Theorem 1 might be tedious for R = 14 or 15. So one can use a convolution type formula based on splitting the encryption into two parts. Let 1 < i < R and let the approximate distributions of the two halves be already computed, based on events C Γ′ , C Γ″ , where Γ′ = (Γ 1 , . . ., Γ i ) and Γ″ = (Γ i+1 , . . ., Γ R ). Then the approximate distribution of Z is computed by Corollary 1.

Corollary 1 is proved by splitting the product in (22), summing the first part over A 2 , . . ., A i−1 and the second part over A i+2 , . . ., A R−1 , and applying the theorem again.
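As an illustration of this kind of recurrent gluing, here is a toy sketch that combines two distributions sharing a middle sub-vector. The dictionaries, the name glue, and the conditional-independence normalisation are our simplifications, not the exact formula of Corollary 1:

```python
from collections import defaultdict

# Toy distributions over (left endpoint, middle) and (middle, right endpoint),
# as dicts from bit-tuples to probabilities. All numbers are invented.
p_top = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.15, (1, 1): 0.35}
p_bot = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.10, (1, 1): 0.40}

def glue(p1, p2):
    """Combine Pr(left, mid) and Pr(mid, right) into Pr(left, right), assuming
    left and right are independent given mid:
    Pr(left, right) = sum_mid Pr(left, mid) * Pr(right | mid)."""
    mid_marginal = defaultdict(float)
    for (m, _), p in p2.items():          # marginal of the middle sub-vector
        mid_marginal[m] += p
    out = defaultdict(float)
    for (l, m1), v1 in p1.items():
        for (m2, r), v2 in p2.items():
            if m1 == m2 and mid_marginal[m1] > 0:
                out[(l, r)] += v1 * v2 / mid_marginal[m1]
    return dict(out)

p_ends = glue(p_top, p_bot)
print(p_ends, round(sum(p_ends.values()), 10))
```

Summing over the shared middle value is the essence of the recurrent formula; the real computation runs over the masked round sub-vectors rather than single bits.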
Fig. 5 shows theoretical and empirical a priori distributions for the 10-bit block X 2 [24,18,7,29], X 7 [16,14], X 8 [24,18,7,29] of 6-round DES internal bits. The approximate theoretical distribution was computed with Corollary 1 using an appropriate trail; it depends on 3 key bits. The empirical distribution was produced by encrypting 2 39 randomly and independently generated 64-bit plaintext blocks for one randomly chosen cipher key. The distributions oscillate around 2 −10 ≈ 0.000976 and are very close, almost indistinguishable. We obtained a number of such figures besides this one, and all of them look similar.
11 Multinomial Distributions for 14-round DES

One of the two best "linear approximations" for 14-round DES found by Matsui in [20] is X 2 {24, 18, 7} ⊕ X 15 {15} ⊕ X 16 {24, 18, 7, 29}. We included all those bits in (24) and added some more bits, as intuitively a probability distribution on larger vectors reveals more information on the cipher key. However, this increases the number of key bits key 1 involved in the distribution and the number of key bits Key 1 from the first and the last round keys involved in the observation. One is to keep the number of the key bits involved relatively low. Adding some new bits does not increase the number of the key bits involved, while adding some others really does. For instance, adding X 2 [29] keeps the number of the key bits the same, while adding X 15 [16,14,13,12,11] enlarges key 1 by K 15 [23,21,20,19,18] and Key 1 by 30 bits from K 16 . We got the 14-bit string x 1 = (X 2 [24,18,7,29], X 15 [16,15,14,13,12,11], X 16 [24,18,7,29]). (24)
The approximate a priori distribution of x 1 was computed by using Theorem 1 and Corollary 1 with the trail shown in Table 1. The computation took only a few seconds on a common computer.
The distribution depends on the value of a 7-bit string denoted key 1 . The distribution is a permutation of the distribution for which key 1 is the zero string, by the following lemma.

Proof. We prove the lemma by applying Theorem 1. To this end we denote the relevant sub-vectors by a, b, c, d, which are 1-, 4-, 2- and 6-bit strings respectively, and let X[5, 4, 3, 2, 1, 0] denote all input bits to the respective S-box. Remark that input/output bits are numbered from right to left as in [20]. By Theorem 1 with the trail shown in Table 1, the sum is over the 1-bit variables A 3 , A 4 , . . . We transform the expression for Pr(x 1 = A 2 , A 15 , A 16 ) by introducing new variables and applying those properties, and find that the probability depends on k, k 15 only through key 1 . That implies the lemma.

In the known-plaintext attack we do not observe the bits of (24); they are internal to the encryption algorithm and depend on the first and the last round keys. (24) can be computed by X 2 [24,18,7,29] = X 0 [24,18,7,29] ⊕ F 1 [24,18,7,29], while X 16 [24,18,7,29] is a part of the ciphertext. Thus the observation depends on some plaintext/ciphertext bits, 36 bits of the last round key K 16 and 6 bits of the first round key K 1 . As some key bits repeat, the observation effectively depends on a 39-bit sub-key denoted Key 1 . In theory, one can apply the multidimensional linear analysis developed in [15]. The Likelihood Ratio statistic will then depend on key 1 , Key 1 : overall 44 key bits and one linear combination of the key bits. That makes 2 45 variants of key 1 , Key 1 to rank by the value of the statistic and will not give any advantage over Matsui's analysis of DES, even if one uses the Fast Fourier Transform to compute the statistic.
By DES symmetry one gets the distribution of x 2 = (X 15 [24,18,7,29], X 2 [16,15,14,13,12,11], X 1 [24,18,7,29]), (27) which depends on K {4,6,8,10,12,14} [22] ⊕ K {5,9,13} [44] and K 2 [23,22,21,20,19,18], while X 1 [24,18,7,29] is a part of the plaintext. As above, we cannot afford to use x 2 directly in multidimensional linear analysis. Instead of x 1 , x 2 , two bunches of their 10-bit projections will be defined in this section. We get overall 28 14-round input/output sub-vectors, whose multinomial distributions will be used to attack 16-round DES later in this paper. As x 1 and x 2 depend on disjoint sets of the encryption algorithm internal bits, they are here considered independently distributed. The observations on the two bunches of 10-bit sub-vectors (28) and (29) below are considered independent too.

A natural question to ask is how to find the best possible strings of bits (such as x 1 and x 2 for 14-round DES) which provide the most efficient key-recovery attack. In the original linear cryptanalysis that means the best "linear approximations" for the full or truncated cipher. This seems a very difficult problem as we do not have a ready measure for this superiority. The problem is not completely solved even in Matsui's linear cryptanalysis, as his algorithm in [22] ignores dependencies between "linear approximations" for different S-boxes in one DES round, so there is a theoretical possibility of finding even better "linear approximations". In multidimensional linear cryptanalysis the situation is more complicated as the number of parameters increases; one of the most important, besides the quality of the distribution itself, is the number of the key bits (or key bit linear combinations) involved. A related problem is, given a string x of the encryption internal bits, to find a superior trail to compute an approximate distribution for x. That problem looks easier: an informal argument in Appendix 1 shows that all good trails provide essentially the same approximation, which essentially depends on the same key bits.

Another Trail
Another approximate distribution of x 1 was computed by using another trail, shown in Table 2. It has a negligibly larger quadratic imbalance with respect to the uniform distribution; quadratic imbalance is the Euclidean distance between two distributions, see [1]. Therefore the new distribution is more powerful (though marginally) when it comes to deciding on incorrect cipher key bits. However, in the trail presented in Table 2 the masks ∆ i , Γ i are generally larger sets than the relevant masks in Table 1. So this approximation depends on a significantly larger number of the key bits, and the trade-off seems negative for the efficiency of the attack. For those reasons the distribution is not used in the present analysis. We do not know how to find the best trail, but the one in Table 1 works well in practice. By the argument in Appendix 1 there is essentially only one good approximation to the distribution of x 1 .
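The quadratic imbalance used for this comparison is straightforward to compute. A minimal sketch; the function name and the toy distribution are ours, and note some authors use the squared distance instead:

```python
import math

def imbalance(p):
    """Euclidean distance between a distribution p (list of probabilities)
    and the uniform distribution on the same support, called quadratic
    imbalance in [1]."""
    u = 1.0 / len(p)
    return math.sqrt(sum((pa - u) ** 2 for pa in p))

# Toy 4-point distributions; the one closer to uniform has smaller imbalance.
p_biased = [0.40, 0.30, 0.20, 0.10]
p_flat = [0.25, 0.25, 0.25, 0.25]
print(imbalance(p_biased), imbalance(p_flat))  # the uniform one gives 0.0
```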

First Bunch of 14-round Input/Output Sub-Vectors
Instead of x 1 we use the projections X 2 [24,18,7,29], X 15 [i, j], X 16 [24,18,7,29], (28) for different i, j ∈ {16, 15, 14, 13, 12, 11}, except i = 16, j = 11, where the distribution of (28) is uniform. When it is not uniform, the distribution generally depends on 3 key bits K {3,5,7,9,11,13} [22] ⊕ K {4,8,12} [44], K 15 [i′, j′], where K 15 [i′, j′] denotes a key-mask for X 15 [i, j] in the 15-th round. However, if i, j incorporates 16 or 11 then the distribution depends on 2 key bits, and all of them are permutations of the same distribution. For instance, the observation on X 2 [24,18,7,29], X 15 [16,15], X 16 [24,18,7,29] depends on 12 bits of K 16 and 6 bits of K 1 , that is at most 18 key bits. Therefore one is to examine the values of at most 20 key bits and one linear combination of the key bits. That makes up to 2 21 variants of the observation and distribution on (28), and that number is affordable. In practice, because of repeated key bits in the key schedule, there are between 2 17 and 2 21 variants depending on i and j.

Second Bunch of 14-round Input/Output Sub-Vectors
By DES symmetry, the 10-bit projections X 15 [24,18,7,29], X 2 [i, j], X 1 [24,18,7,29], (29) of x 2 may be used for the reason above; the distribution of (29) is analogous.

12 Implementation Details for 16-round DES

Two independent separable statistics constructed from the above projections of x 1 and x 2 are used. The statistics are identically distributed as a one-variate normal random variable N(u, u) for u = nµC −1 µ T , where µ and C are computed from the a priori distribution of x 1 (the same for x 2 ). We fix the required success probability 0.85. Then we find the threshold z such that the number of plaintext/ciphertext pairs n equals the number of 56-bit keys for the final brute force. To this end we solve the system in which β 1 and α 1 are defined by (12) and (13); in particular, t = 2 and α 1 = α 2 = 0.99257519589049966079368. Due to bad planning when generating random plaintext blocks for our experimental implementation on full DES, we generated 16 independent sets of 2 39 random messages and their corresponding ciphertexts (with a fixed key). Our experimental implementation of the attack was therefore run with n = 7 × 2 39 ≈ 2 41.81 . We have published supplemental material in [10]. From there one can download: the probability distributions of x 1 and x 2 ; the actual values of the statistics from our experiment, S 1 (K) and S 2 (K); the weights ω i , i = 1, . . ., 14; the vector of means µ for LLR i (ν i ), i = 1, . . ., 14; the covariance matrix C and its inverse C −1 .
With n = 2 41.81 plaintext/ciphertext pairs the expectation of LLR 1 for the correct k 1 is 3.23905, for an incorrect one −3.23905. The experimental value for the correct key is 1.57123; it is presented by the vertical line in Fig. 6. There are 23370 values higher than that. We remark that using only h 1 in the cryptanalysis is not efficient enough: one would have to brute force 2 36 × 23371 > 2 50.5 key-candidates before finding the correct 56-bit DES key, which gives no advantage over Matsui's results. The same is true for the other 27 projections.
The projections are taken in an order defined by how many K i their key bits are relevant to. We say a key bit x is relevant to K i if the rank of K i (as a set of linear functions) drops upon fixing x to a constant. For instance, x 2 is relevant to 14 (the maximal number) of the K i , etc. To construct the search tree one first chooses a sequence T 1 , T 2 , . . ., T 54 , where T j+1 is produced from T j by adding one unknown key bit which is relevant to the most of the K i and which is not in T j . The choice is not unique; we use the order defined by (30), that is T 1 = {x 2 }, T 2 = {x 2 , x 19 }, T 3 = {x 2 , x 19 , x 60 }, . . . The choice of the T j significantly affects the size of the search tree. The numbers of examined values of T j (tree nodes at level j), j = 38, . . ., 54, are presented in log 2 scale in Fig. 7.
The overall number of nodes is 2 45.45 ≪ 2 54 , so the complexity of finding K-candidates is much lower than brute forcing all values of K. The final number of K-candidates is 2 39.46 , so the number of 56-bit DES keys to brute force is 2 41.46 , again close to what was predicted by our theory. Constructing one node requires a few bit xors and a few additions of low-precision real numbers, see Section 9. So the search tree complexity (constructing 2 45.45 nodes) is lower in bit operations than the final brute force of 2 41.46 DES keys. In fact, our implementation works slower than that, as the source code is not as optimised as DES libraries are and we need to access external memory where the precomputation results, that is the numbers d ji (a), are kept; DES encryption, by contrast, is very straightforward. One needs to keep around 2 31 low-precision real numbers (less than 9GB of memory).
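The memory figure is easy to check, assuming (our assumption; the paper only says low precision) that each of the 2 31 reals d ji (a) takes 4 bytes:

```python
# Sanity check of the memory estimate: 2^31 values at 4 bytes each.
n_values = 2 ** 31
bytes_total = n_values * 4
print(bytes_total / 10 ** 9)  # 8.589934592 GB, under the 9 GB quoted
```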

Possible Improvements
There are several directions in which to improve the method, both practically and theoretically.
1. Obviously, one may get better results by using larger strings x = (X, Y ) of the encryption internal bits, see Fig. 2.
2. There are several practical ways to reduce the number of nodes in the search tree, e.g. by taking a larger threshold z in (10) at low levels (low j) of the tree. However, those methods do not guarantee the theoretical success probability, as Lemma 2 no longer applies. By choosing a different z j for each level, each slightly less than the statistic for the correct key, we managed to reduce the number of visited nodes to below 2 34 . Clearly the success probability in this case is smaller, but it indicates that these methods may work in practice.
3. The number of nodes in the search tree may probably be further reduced by choosing another sequence of T j .It is still unclear how to do that in general.
4. Other statistics for the projections h 1 , . . ., h m may be used. For instance, let the key bits key i affect the a priori distribution of h i , and Key i affect the observation on h i .

Then we define the statistic LLR ∗ i . We here neglect that Key i and key i may have some key bits in common. Using LLR ∗ i instead of LLR i looks better in practice and is in line with Matsui's analysis. However, the distribution of (LLR ∗ 1 , . . ., LLR ∗ m ) is unknown and therefore the success probability of the method is difficult to predict. One can probably compute it experimentally for a truncated cipher and then extrapolate to the full-round cipher, as was similarly done by Matsui in [21,22]. Also, one can choose any subset of K i instead of key i in the definition of LLR ∗ i above.

Conclusion
Detailed contributions of the present work are presented in Section 4; the three main points are summarised below as well. Firstly, the use of separable statistics. Secondly, an efficient algorithm for computing the statistic values by solving an optimisation problem based on gluing together partial information on the cipher key, so there is no need to compute the statistic for all values of the key bits involved. Thirdly, formulae for joint approximate distributions of internal bits in Feistel ciphers. Only in the end are they combined into a concrete attack on DES improving on Matsui's results. The combination of the first two may be of independent interest for reducing the time complexity of key recovery attacks on other ciphers.

Appendix 2. On Matsui's Probability Calculation
In this section we show that the distribution of the one-bit "linear approximations" used in [20] may be computed based only on X 0 , X 1 , . . ., X R+1 being independently and uniformly generated, under the condition of the auxiliary event C Γ for some Γ i . We will do that in the case of 3-round DES, represented in Figure 4 of Matsui's work [20]; the general case is similar. We want to compute the distribution of (18), where the sum is over binary a, b, c, d. Let F̃ 1 , F̃ 3 be produced from F 1 , F 3 by the substitution X 1 {15} = c, X 3 {15} = d. The probability is split into a product by independence, and the last term in the product is 1/4.

Appendix 3. Another Statistic
We will use the notation introduced in Section 7.1. Let us denote ν(n) = (ν 1 , . . ., ν m ), a vector of length M = Σ i=1..m N i , the concatenation of the ν i (n). We can write ν(n) = R 1 + . . . + R n , where the R t are independent random variables, identically distributed as ν(1). Assume that x has the distribution P . Then by µ P and C P we denote here the expectation and covariance matrix of ν(1); remark that the symbols µ P and C P are used in Section 7.2 in a different context. We have µ P = (µ P,1 , . . ., µ P,m ), where µ P,i = (Σ h i (a)=1 p a , . . ., Σ h i (a)=N i p a ) is the expectation of ν i (1). We can split the matrix C P into blocks C ij ; such a block is the covariance matrix of ν i (1) and ν j (1). By the definition of covariance, C ij [b, c] = Σ h i (a)=b, h j (a)=c p a − (Σ h i (a)=b p a )(Σ h j (a)=c p a ). If h i (x), h j (x) for i ≠ j are independent random variables, then C P is block diagonal, because the off-diagonal blocks C ij are zero matrices; the diagonal blocks C ii are the covariance matrices of the ν i (1).
By the Central Limit Theorem the distribution of ν(n) tends to the multivariate normal distribution N(nµ P , nC P ) with expectation vector nµ P and covariance matrix nC P . Similarly, if x has the distribution Q, then ν(n) tends to N(nµ Q , nC Q ). To decide which distribution, P or Q, is correct by observing the value of ν(n), one can apply the Neyman–Pearson approach. However, as the matrices C P , C Q are singular, the normal distributions do not have densities.
A standard solution is to consider a truncation of ν(n) = (ν 1 , . . ., ν m ). For instance, let ν ′ (n) = (ν ′ 1 , . . ., ν ′ m ), where ν ′ i is produced from ν i by dropping one entry of the latter. Recall that ν i is the vector of observations of the values of h i (x). Then by µ ′ P and C ′ P we denote the expectation and covariance matrix of ν ′ (1) when x follows the distribution P . If h i (x), h j (x) for i ≠ j are independent random variables, then C ′ P is block diagonal and invertible. Similarly, when x follows the distribution Q and the h i (x), h j (x), i ≠ j, are independent, C ′ Q is block diagonal and invertible.
Therefore, one constructs an LLR statistic S to distinguish two multivariate normal distributions; in that case S is a separable statistic. However, as the projections (28) are dependent (the same is true for (29)), the statistic S is not applicable within this cryptanalysis.
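Separability is easy to see in the simplest case: with a shared diagonal covariance, the LLR between two multivariate normals is a sum of independent per-coordinate terms. A minimal pure-Python sketch (the distributions and the point x are made up; the paper's covariances are only block diagonal, but the same splitting happens block by block):

```python
import math

# Two normals N(mu_p, diag(var)) and N(mu_q, diag(var)) with a shared
# diagonal covariance. All numbers are invented for the demonstration.
mu_p = [1.0, -0.5, 2.0]
mu_q = [-1.0, 0.5, 0.0]
var = [4.0, 1.0, 9.0]

def log_density(x, mu, var):
    """Log density of N(mu, diag(var)) at x."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
        for xi, m, v in zip(x, mu, var)
    )

def llr_total(x):
    return log_density(x, mu_p, var) - log_density(x, mu_q, var)

def llr_coordinate(i, xi):
    """The i-th additive term of the separable statistic (constants cancel)."""
    return ((xi - mu_q[i]) ** 2 - (xi - mu_p[i]) ** 2) / (2.0 * var[i])

x = [0.3, 0.1, 1.7]
# The full LLR equals the sum of the per-coordinate terms.
assert abs(llr_total(x) - sum(llr_coordinate(i, xi) for i, xi in enumerate(x))) < 1e-12
print(llr_total(x))
```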

Figure 3: The Search Tree
By induction, Pr(C) = 2 −rR . The exact probability of an event E which happens in the encryption is Pr(E|C).

Figure 4: Masks in one Feistel round

Figure 5: Theoretical and empirical distributions in DES
