Vectorized linear approximations for attacks on SNOW 3G

SNOW 3G is a stream cipher designed in 2006 by ETSI/SAGE, serving in 3GPP as one of the standard algorithms for data confidentiality and integrity protection. It is also included in the 4G LTE standard. In this paper we derive vectorized linear approximations of the finite state machine in SNOW 3G. In particular, we show one 24-bit approximation with a bias around 2^-37 and one byte-oriented approximation with a bias around 2^-40. We then use the approximations to launch attacks on SNOW 3G. The first approximation is used in a distinguishing attack with an expected complexity of 2^172, and the second one can be used in a standard fast correlation attack resulting in key recovery with an expected complexity of 2^177. If the key length in SNOW 3G were increased to 256 bits, these results show that there would be academic attacks on such a version faster than exhaustive key search.


Introduction
SNOW 3G is a word-oriented stream cipher used as the core of the 3GPP Confidentiality and Integrity Algorithms UEA2 & UIA2 for UMTS and LTE networks [UEA06a]. It is one member of the SNOW family, with two predecessors, SNOW 1.0 [EJ00] and SNOW 2.0 [EJ02]. SNOW 1.0 was submitted to the NESSIE project in 2000 by Ekdahl and Johansson; the improved version SNOW 2.0 was published in 2002 and selected as an ISO standard in 2005. The SNOW ciphers consist of a linearly updated part, an LFSR (Linear Feedback Shift Register), and a nonlinear part referred to as an FSM (Finite State Machine). They are all based on operations on 32-bit words, making them quite efficient in both software and hardware environments. SNOW 3G differs from SNOW 2.0 by introducing a third 32-bit register in the FSM and a second 32-bit S-box application to update this register. This presumably makes SNOW 3G a much harder target for an attack compared to SNOW 2.0.
Just as for other stream ciphers, the class of linear approximation attacks, like distinguishing attacks and correlation attacks, is a main threat to the SNOW ciphers. The basic idea of these attacks is to approximate the nonlinear blocks used in the cipher with linear expressions and then derive linear relationships between output values from different time instances. In a distinguishing attack, a cryptanalyst tries to derive samples from the keystream and find evidence, using statistical tools such as hypothesis testing, that the sample sequence does not behave like a truly random sequence. When the linear relationship also involves symbols from the LFSR states, the correlation between the keystream and the LFSR states can be exploited to recover the key, which is the foundation of a correlation attack.
Several distinguishing attacks and correlation attacks have been proposed on SNOW 1.0 and SNOW 2.0, where the basic idea is to approximate the FSM part. In [CHJ02], a distinguishing attack on SNOW 1.0 with complexity 2^100 was proposed using linear masking to get a binary approximation with a bias of 2^-8.3, which became one reason for the rejection of SNOW 1.0 from the NESSIE project. To resist this attack, the authors improved the design and proposed SNOW 2.0, which is, however, still vulnerable to some distinguishing attacks. A distinguishing attack based on a linear masking method with complexity 2^230 and an improved version with complexity 2^174 were proposed in [WBDC03] and [NW06a], respectively. In [LLP08] and [ZXM15], correlation attacks on SNOW 2.0 were proposed with complexity 2^204.38 and 2^164.15, respectively. For SNOW 3G, however, no significant attack of this type has ever been published.
The cryptographic security of SNOW 3G has been studied in depth. As an algorithm appearing in a main standard, it was thoroughly evaluated by the standardization consortium before its adoption; some evaluation information appears in [UEA06b]. There are several side-channel attacks and fault attacks targeting specific implementations of the algorithm, such as the attacks in [DC09] and [BHNS10]. There are also attacks targeting the initialization phase of versions of SNOW 3G with a reduced number of initialization rounds [BPSZ10b, BPSZ10a]. At FSE 2006, Nyberg and Wallén [NW06b] examined linear distinguishing attacks on SNOW 2.0, but devoted one section to SNOW 3G. The best linear approximation of the FSM they found had a bias of 2^-274, and they also derived an upper bound of 2^-96 for any binary linear approximation. Here the bias as given in the paper was recalculated, now expressed using the Squared Euclidean Imbalance, as is common for non-binary linear approximations. Note that [NW06b] considered only binary approximations; the key to improvements is to use approximations over larger alphabets.
In this paper, we give one distinguishing attack and one correlation attack on SNOW 3G by finding efficient linear approximations of the nonlinear part of the FSM. We derive a 24-bit linear approximation, by truncating and masking three consecutive keystream words, with bias 2^-37.37, and we further derive an 8-bit approximation from the 24-bit one with bias 2^-40.97. The 24-bit approximation is then employed to launch a distinguishing attack requiring a keystream length of around 2^172. This strongest and largest 24-bit approximation cannot be used in a correlation attack, but the derived 8-bit approximation, which is linear over GF(2^8), can be used to give a correlation attack with complexity around 2^177. This is, to the best of our knowledge, the first significant result on attacking the full SNOW 3G. In particular, if the key length in SNOW 3G were increased to 256 bits, these results show that there would be academic attacks on such a version faster than exhaustive key search.
The distributions for the noise related to the different linear approximations given in the paper will be publicly available to download when this work has been accepted for publication.
The rest of this paper is organized as follows. We briefly describe the design and structure of SNOW 3G in Section 2 and then elaborate on the process of finding linear approximations of the FSM in Section 3. In Section 4, we give the experimental verification of the approximations by running the cipher to collect a large number of samples. In Section 5, we give the attacks, a distinguishing attack and a correlation attack, based on the vectorized linear approximations derived in Section 3, and in Section 6 we conclude the paper.

Description of SNOW 3G
In this section, we present a brief description of the SNOW 3G algorithm. We first note that a stream cipher like SNOW 3G takes as input a secret key K, which is a 128-bit value, and a public value known as the IV (initialization vector), which is also 128 bits long.
For each such pair of key and IV, (K, IV), the algorithm produces an output sequence, usually called the keystream, denoted z^(t), t = 1, 2, .... In SNOW 3G, each keystream symbol is a 32-bit word, so we write z^(t) ∈ GF(2^32), t = 1, 2, .... Furthermore, each pair should produce a unique keystream sequence, and the typical operation of such a stream cipher is to produce many different keystreams for many different public IV values, while keeping the key fixed.
The overall schematic of the SNOW 3G algorithm is shown in Figure 1. Just as SNOW 1.0 and SNOW 2.0, it consists of a linear part, the LFSR, and a nonlinear part, the FSM. The FSM is used to break the linearity of the LFSR contribution. For more details on the design of SNOW 3G, we refer to the original design document [UEA06a].

Figure 1: The keystream generation phase of the SNOW-3G stream cipher
The LFSR part consists of 16 cells denoted (s_0, s_1, ..., s_15), each containing 32 bits and giving 512 bits in total. Every value in a cell is considered as an element of GF(2^32) and the LFSR sequence is defined by the generating polynomial

P(x) = α x^16 + x^14 + α^{-1} x^5 + 1 ∈ GF(2^32)[x],

where α is a root of a degree-4 polynomial over GF(2^8) specified in the design document [UEA06a]. If we denote the state at clock t as (s_0^(t), s_1^(t), ..., s_15^(t)), the state is updated by

s_i^(t+1) = s_{i+1}^(t), for i = 0, 1, ..., 14,
s_15^(t+1) = α s_0^(t) ⊕ s_2^(t) ⊕ α^{-1} s_11^(t),

where ⊕ denotes a bitwise XOR operation. Note that α and α^{-1} are two constants in GF(2^32) and the update consequently includes two multiplications in this field.
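The LFSR update above can be sketched as follows. This is a structural illustration only: mul_a and div_a stand in for multiplication by α and α^{-1} in GF(2^32), using an arbitrary placeholder reduction constant rather than the real tower-field tables from the specification.

```python
# Structural sketch of the SNOW 3G LFSR update in keystream mode:
# s15_new = alpha * s0  XOR  s2  XOR  alpha^{-1} * s11.
# mul_a / div_a are PLACEHOLDERS for multiplication by alpha and
# alpha^{-1} in GF(2^32); the constant 0xA9AA55A5 is arbitrary and is
# NOT the real SNOW 3G reduction.

MASK32 = 0xFFFFFFFF
FEEDBACK = 0xA9AA55A5  # arbitrary odd constant, stand-in only

def mul_a(w):
    # "multiply by alpha": shift left, conditionally reduce
    out = (w << 1) & MASK32
    if w & 0x80000000:
        out ^= FEEDBACK
    return out

def div_a(w):
    # exact inverse of mul_a (the low bit tells whether the reduction
    # fired, since FEEDBACK has its low bit set)
    if w & 1:
        return ((w ^ FEEDBACK) >> 1) | 0x80000000
    return w >> 1

def lfsr_clock(state):
    """One clock of the 16-cell LFSR; returns the new state list."""
    fb = mul_a(state[0]) ^ state[2] ^ div_a(state[11])
    return state[1:] + [fb]
```

Note that div_a is constructed as the exact inverse of mul_a, mirroring the fact that α and α^{-1} are inverse constants in the field.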
The FSM part has three internal 32-bit registers R1, R2 and R3, connected by some linear and nonlinear operations. The FSM takes two words from the LFSR part as inputs, s_15 and s_5, and its output is xored with s_0 to produce a 32-bit keystream word, giving the following formula for the generation of the keystream:

z^(t) = (s_15^(t) ⊞ R1^(t)) ⊕ R2^(t) ⊕ s_0^(t).

Here ⊞ denotes integer addition modulo 2^32. The registers in the FSM are then updated through the following steps:

R1^(t+1) = R2^(t) ⊞ (R3^(t) ⊕ s_5^(t)),
R2^(t+1) = S_1(R1^(t)),
R3^(t+1) = S_2(R2^(t)),

where S_1, S_2 are substitution boxes (S-boxes) composed of four bytewise substitutions followed by the MixColumn operation of Rijndael. Below we give the details of how they are constructed, using little-endian byte order. The 32-bit registers in the FSM can be expressed as four parallel bytes. Let W = (w_0, w_1, w_2, w_3) be the input to the substitution boxes, with w_0 being the least and w_3 the most significant byte. The operations of the two S-boxes are as follows.
S-Box S_1: S_1 is a 32-bit to 32-bit mapping operating on four bytes. Bytes are interpreted as elements of GF(2^8) defined by the polynomial x^8 + x^4 + x^3 + x + 1. The underlying 8-bit S-box S_R(x) is the Rijndael AES S-box [DR13]. In general, S_1 applies S_R to each byte and then the MixColumn operation of Rijndael. Let R = (r_0, r_1, r_2, r_3) be the four-byte output R = S_1(W). Then

r_0 = (S_R(w_0) • 2) ⊕ (S_R(w_1) • 3) ⊕ S_R(w_2) ⊕ S_R(w_3),
r_1 = S_R(w_0) ⊕ (S_R(w_1) • 2) ⊕ (S_R(w_2) • 3) ⊕ S_R(w_3),
r_2 = S_R(w_0) ⊕ S_R(w_1) ⊕ (S_R(w_2) • 2) ⊕ (S_R(w_3) • 3),
r_3 = (S_R(w_0) • 3) ⊕ S_R(w_1) ⊕ S_R(w_2) ⊕ (S_R(w_3) • 2), (1)

where • denotes multiplication in the field defined above.

S-Box S_2: S_2 is also a 32-bit to 32-bit mapping operating on four bytes. Bytes are again interpreted as elements of GF(2^8), but this time defined by the polynomial y^8 + y^6 + y^5 + y^3 + 1. The underlying 8-bit S-box S_Q(x) is another 8-to-8-bit S-box, derived from the Dickson polynomials. S_2(W) has the same form as (1), with S_R replaced by S_Q and the multiplications performed in the field defined by y^8 + y^6 + y^5 + y^3 + 1. (2)

Like other stream ciphers, SNOW 3G uses an initialization phase during which the cipher is clocked without producing output, to fully mix the key and IV into the LFSR state and the FSM registers. During the initialization phase, the key and the IV, each consisting of four 32-bit words, are first loaded into the LFSR state and the registers in the FSM are initialized to zero. Then the cipher runs 32 times with the output from the FSM fed back into the LFSR instead of producing keystream output. After the initialization, the cipher enters the keystream mode: the first output word from the FSM is discarded and the following FSM outputs form the keystream. Since the attacks in this paper only use the keystream mode, we do not give the details of the initialization mode but refer the interested reader to the design document [UEA06a].
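The structure of S_1 (bytewise S-box followed by the Rijndael MixColumn) can be sketched as below. The S_R table here is a placeholder (identity) rather than the real 256-entry AES S-box; mul2 and mul3 implement multiplication by 2 and 3 in the Rijndael field, with reduction constant 0x1B corresponding to x^8 + x^4 + x^3 + x + 1.

```python
# Sketch of the S_1 structure: bytewise S-box, then the AES MixColumn
# matrix circ(2, 3, 1, 1), multiplications in the Rijndael field.

def mul2(b):
    """Multiply by x (i.e., by 2) in GF(2^8), reduction constant 0x1B."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def mul3(b):
    """Multiply by 3 = x + 1 in GF(2^8)."""
    return mul2(b) ^ b

S_R = list(range(256))  # PLACEHOLDER: identity; the real AES S-box in SNOW 3G

def S1(word):
    # little-endian byte split: w0 is the least significant byte
    w = [(word >> (8 * i)) & 0xFF for i in range(4)]
    s = [S_R[x] for x in w]
    r0 = mul2(s[0]) ^ mul3(s[1]) ^ s[2] ^ s[3]
    r1 = s[0] ^ mul2(s[1]) ^ mul3(s[2]) ^ s[3]
    r2 = s[0] ^ s[1] ^ mul2(s[2]) ^ mul3(s[3])
    r3 = mul3(s[0]) ^ s[1] ^ s[2] ^ mul2(s[3])
    return r0 | (r1 << 8) | (r2 << 16) | (r3 << 24)
```

With the identity placeholder for S_R, the output of S1 on the bytes (0xdb, 0x13, 0x53, 0x45) reproduces the well-known AES MixColumn test column (0x8e, 0x4d, 0xa1, 0xbc).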

Approximations of the FSM
A main class of attacks on stream ciphers are the so-called linear distinguishing attacks and (fast) correlation attacks. They both build on the idea of approximating some nonlinear operations as linear ones, thereby introducing some approximation noise. The simplest form uses bitwise approximations and has a strong connection to linear cryptanalysis of block ciphers. Recent work in cryptanalysis of stream ciphers has shown that approximations on larger alphabets can improve the attacks considerably; e.g., [ZXM15] used the terminology large-unit linear approximations.
For stream ciphers built around an LFSR, it makes sense to provide approximations that are linear with respect to some binary algebraic structure, such as GF(2)^n or GF(2^n). Since the LFSR part is linearly described in GF(2)^n or possibly GF(2^n), the main obstacle is to approximate the FSM. We will return in Section 5 to the question of how to use an approximation in attacks on the full cipher.
The FSM part in SNOW 3G takes inputs from s_15, s_0 and outputs z^(t), with t varying. It also contains three unknown values in the registers R1, R2 and R3. As such, these need to be cancelled, and a linear approximation of the FSM can thus be described as a linear expression including only s_0 and z^(t) for different t values. Such an expression is a good approximation if the corresponding expression has a biased distribution. So in general, we are interested in finding an expression of the form

E^(t) = ⊕_{i∈I} (A_i z^(t+i) ⊕ B_i s_0^(t+i)),

for some time set I, where operations are in GF(2)^n (or GF(2^n)), the coefficients A_i, B_i are matrices mapping to m bits, and the inputs are considered as column vectors. In order to determine the quality of an approximation, we consider the m-bit random variables E^(t) defined by the above expression.
Each such random variable has the same distribution, denoted D. The quality of the linear approximation is measured by the bias of the distribution, which can be defined in many ways. Using the SEI (Squared Euclidean Imbalance) as defined in [BJV04], the bias of the distribution D is computed as

ε(D) = 2^m Σ_{x=0}^{2^m − 1} (Pr[D = x] − 1/2^m)^2.

We note that when the bias is measured in SEI, the number of samples required to distinguish samples drawn from D from the uniform distribution is in the order of 1/ε(D).
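As a small illustration, the SEI of a distribution given as a probability vector can be computed directly from the definition:

```python
def sei(dist):
    """Squared Euclidean Imbalance of a distribution given as a list
    of probabilities summing to 1 over an alphabet of size len(dist)."""
    m = len(dist)
    u = 1.0 / m
    return m * sum((p - u) ** 2 for p in dist)

# toy example: a slightly biased 2-bit distribution
d = [0.26, 0.24, 0.25, 0.25]
eps = sei(d)
samples_needed = 1.0 / eps  # order of samples to distinguish from uniform
```

For the uniform distribution the SEI is 0, and a larger SEI means fewer samples are needed in a distinguisher.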
We are now ready to investigate how to find expressions of the above form with a large bias.

A 24-bit linear approximation of the FSM
In this first approach, we target an approximation with an alphabet size as large as possible, in order to get a bias as large as possible. The novel parts consist in determining how to build the approximation and how to efficiently compute the bias when the alphabet size and the number of involved variables are large.
Let R1, R2, R3 be the contents of the FSM registers at time t. Then consider word-oriented expressions on three consecutive keystream words. Let us introduce the following notation applying to three byte-oriented 32-bit vectors, where i, j, k index corresponding bytes of A, B, and C, respectively. Here A[i] denotes the i-th byte of a 32-bit byte-oriented vector, for i = 0, 1, 2 or 3. We then consider a three-byte sampling of the form given below.

Contribution from the LFSR
Basically, we want to achieve a linear approximation where the bias is as large as possible, and multiplying z^(t+1) by L_1^{-1} is chosen so that the noise related to the R1 register has a very small influence.
As already indicated in the above formula, our linear approximation involves the noise variables N_tot^(t), N_1^(t), N_2^(t). Since their distributions are independent of t, we simplify notation by writing N_tot, N_1, N_2, respectively. Note also that N_1 and N_2 are independent.

Computation of the 24-bit noise distributions for N 1 and N 2
Computing the distribution of N_1 is trivial: we simply run over all possible values of R1[0] and the involved LFSR byte. The computation for N_2 is trickier, and below we explain how we do it. We can rewrite N_2 in terms of the bytes R2[i], w_i and (R3[i] ⊕ y_i), with the corresponding coefficients, for i = 0, 1, 2, 3. The value c_i is the input carry value that comes from the arithmetic addition of the previous byte(s), i.e., bytes 0 to i − 1, of the three terms R2[i], w_i and (R3[i] ⊕ y_i). Note that the carry value satisfies 0 ≤ c_i ≤ 2 and that the first byte has c_0 = 0.
The idea behind the computation technique is to compute 256 24-bit distributions of the triple (A, B, C), conditioned on the value of R2[0] ∈ {0, ..., 255}. Let us denote those distributions by D_{R2[0]}(A, B, C). Then the distribution table of N_2 is constructed as follows: we initialize the distribution table of N_2 with all zeroes; then, for each combination of R2[0], x_0, A, B, C (in total 2^40 choices), we add the corresponding probability mass to the appropriate entry of the table. Also note that we can compute the distributions D_{R2[0]}(A, B, C) one by one for each value of R2[0] and add each to the accumulating distribution of N_2; thus, we do not need to store all of them in RAM simultaneously.
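The accumulate-per-condition idea can be shown in miniature on 4-bit words, with a toy carry noise N = (x + y mod 16) ⊕ x ⊕ y standing in for the real N_2: one conditional distribution per value of the conditioning variable is computed and folded into the accumulator, so the conditionals never need to be stored simultaneously.

```python
# Toy illustration (4-bit words) of accumulating conditional
# distributions: condition on x (like R2[0] in the text), compute D_x,
# fold it into the accumulator, and discard it.

M = 16
acc = [0.0] * M
for x in range(M):                  # condition on x
    cond = [0.0] * M                # D_x(N), built for this x only
    for y in range(M):
        n = ((x + y) % M) ^ x ^ y   # toy "carry noise"
        cond[n] += 1.0 / (M * M)    # uniform weight for each (x, y)
    for v in range(M):              # fold into the accumulator
        acc[v] += cond[v]

# sanity check: a direct double loop gives the same distribution
direct = [0.0] * M
for x in range(M):
    for y in range(M):
        direct[((x + y) % M) ^ x ^ y] += 1.0 / (M * M)
```

The same pattern scales to the 24-bit case, where each conditional table is large and keeping only one at a time is what makes the memory usage manageable.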

Computation of sub-noises D_{R2[0]}(A, B, C)
What remains is to show how to compute D_{R2[0]}(A, B, C) for a given value of R2[0], which can be done as follows.
Let us introduce intermediate distributions E_{k,*,o_k}, which are the distributions of the three bytes (A_k, B_k, C_k), where the arithmetic addition in the expression for C has input carry value c_k and output carry value o_k, for a fixed R2[0]. The first 3 distributions, for k = 0 and o_0 = 0, 1, 2, can be computed by trying all possible values of R3[0], y_0, w_0 in time O(2^24). Note that the value R2[0] is fixed and the input carry is c_0 = 0. The next 3 × 3 distributions, for k = 1 and each combination of input carry c_1 and output carry o_1, are computed in the same way. We continue this way to compute all these distributions. The distribution D_{R2[0]}(A, B, C) can then be derived through a series of convolutions of distributions that match in the input/output carries, and summations, where × denotes a XOR-convolution of the 24-bit distribution tables and + is a point-wise arithmetic summation of the probabilities. A single XOR-convolution can be done with the Fast Walsh-Hadamard Transform (FWHT), having time complexity O(N log N) where, in our case, N = 2^24.
We can speed up the computations as follows. Note that for k = 0 we only have three distribution tables E_{k=0,*,o_0=0,1,2}, corresponding to the three possible output carry values. After we have computed the distributions for k = 1, we can perform the partial convolutions and combine E_{k=0,*} and E_{k=1,*} such that we again have only three distributions to keep, corresponding to the three possible values of the carry from the first two bytes into the third byte. That is, E_{k=0-1,*,o_1=0,1,2} combines the distributions over the first two bytes. We proceed until k = 3, and then the total distribution is the point-wise sum of the final three distributions E_{k=0-3,*,o_3=0,1,2}.
Also note that the FWHT is a linear transformation. Convolution in the time domain corresponds to point-wise multiplication in the frequency domain, and summation in the time domain corresponds to summation in the frequency domain. Thus, there is no need to switch between the time and frequency domains for mixed summation and convolution operations; we can do most of the above in the frequency domain without switching.
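The XOR-convolution via FWHT used above can be illustrated as follows; fwht is its own inverse up to a factor 1/N, and a convolution becomes a point-wise product in the transform domain.

```python
def fwht(a):
    """In-place Fast Walsh-Hadamard Transform of a length-2^k list;
    applying it twice gives the original scaled by len(a)."""
    h, n = 1, len(a)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def xor_convolve(p, q):
    """Distribution of X ^ Y for independent X ~ p, Y ~ q."""
    n = len(p)
    fp, fq = fwht(list(p)), fwht(list(q))
    prod = [x * y for x, y in zip(fp, fq)]
    return [v / n for v in fwht(prod)]
```

For N = 2^24 tables this is exactly the O(N log N) XOR-convolution mentioned above, and several convolutions and summations can be chained in the transform domain before transforming back once.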

Computation results and bias values
We have implemented the above computation method, computed the distributions of N_1 and N_2, and then the 24-bit total noise distribution of N_tot = N_1 ⊕ N_2 for the proposed approximation. Note that the approximation takes into account the full 32-bit adders. Regarding the corresponding biases, ε(i × N_tot) denotes the bias of the resulting distribution when summing i independent random variables distributed as N_tot using bitwise XOR; in particular, we obtained ε(N_tot) ≈ 2^-37.37 and ε(4 × N_tot) ≈ 2^-163.

An 8-bit approximation
As will be clear later, the 24-bit approximation cannot be used in a straightforward manner in a fast correlation attack. For such a case, one would need an approximation that can be completely described over a finite field.
From the 24-bit noise distribution computed in the previous subsection, we can further derive an 8-bit approximation with operations in the Rijndael field GF(2^8), with the noise now denoted N'_tot. The approximation has the following form, where Γ, Λ are nonzero constants in GF(2^8). For each possible choice of Γ, Λ ∈ GF(2^8), we compute the 8-bit distribution of N'_tot directly from the given 24-bit distribution of N_tot and then compute the corresponding bias. Searching through all choices of Γ and Λ would normally imply a computational complexity of O(2^48), but we can reduce it to roughly O(2^40) by the following technique. In the loop over Γ, we first precompute a joint 16-bit distribution with complexity O(2^24); then we loop over Λ and use the precomputed joint distribution to derive the 8-bit distribution of N'_tot. The best choice of constants turns out to be Γ = 0x9c and Λ = 0x08 (alternatively, Λ = 0x18 gives an equally good approximation), and the resulting bias of N'_tot is 2^-40.97.
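A sketch of folding a larger distribution down to 8 bits with GF(2^8) masks. The gmul routine is standard multiplication in the Rijndael field; the byte roles in project (Γ·a ⊕ Λ·b ⊕ c) are an illustrative assumption, not the paper's exact masking positions.

```python
def gmul(a, b):
    """Multiplication in GF(2^8) with the Rijndael polynomial
    x^8 + x^4 + x^3 + x + 1 (reduction constant 0x1B)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return r

def project(dist24, gamma, lam):
    """Fold a 24-bit distribution (sparse dict: index (a<<16)|(b<<8)|c
    -> probability) to the 8-bit distribution of gamma*a ^ lam*b ^ c,
    multiplications in GF(2^8). Byte roles are an assumption here."""
    out = [0.0] * 256
    ga = [gmul(gamma, x) for x in range(256)]   # precomputed mask tables
    lb = [gmul(lam, x) for x in range(256)]
    for idx, p in dist24.items():
        a, b, c = idx >> 16, (idx >> 8) & 0xFF, idx & 0xFF
        out[ga[a] ^ lb[b] ^ c] += p
    return out
```

Looping over all (Γ, Λ) with such precomputed mask tables is the kind of reuse that brings the search cost down from O(2^48) toward O(2^40).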

Experimental verification
In this section, we experimentally verify the correctness of the bias derived in the previous section. The experimental verification can be done by running the cipher and collecting a large number of samples. For distributions of smaller sizes, one can fully verify them by experimentally determining the exact distribution, i.e., checking that every probability value in the distribution is correct; for larger distributions, this might not be computationally possible. Instead, we can use them in a hypothesis test and in this way demonstrate that they can be used in a distinguisher. This will be the case for the 24-bit approximation from Section 3.
We consider deciding between the uniform distribution and the noise distribution derived in Section 3 as the sample distribution, by hypothesis testing. We follow the hypothesis testing approach as formulated in information theory, see [CT12]. It is centered around the divergence (or relative entropy, or Kullback-Leibler distance), denoted D(P_X||P_Y), between two distributions P_X and P_Y over the same alphabet, defined as

D(P_X||P_Y) = Σ_i P(X = i) log (P(X = i) / P(Y = i)).

The relative entropy is used to measure the distance between two distributions: the closer the distributions are, the smaller D(P_X||P_Y) is. If the distributions are the same, D(P_X||P_Y) = 0.
Furthermore, if we have a sequence of n sample symbols x = x_0, x_1, ..., x_{n−1} from the same alphabet A, then we can count the number of occurrences of each symbol a ∈ A, denoted N(a|x), forming the type (or empirical distribution, or sample distribution) by assigning P(X = a) = N(a|x)/n.
Let us denote the uniform distribution by P_U and the noise distribution derived in Section 3 by P_N. Assume we collect n samples x from an unknown distribution P_X. Then the hypothesis testing can be modeled with two hypotheses:

H_0: P_X = P_N,  H_1: P_X = P_U. (5)

In our case we are considering 24-bit distributions, so |A| = 2^24, and the sample distribution is denoted P_{X^n}, with n the number of sample symbols. We use the Neyman-Pearson lemma to make the optimum decision for the hypothesis testing, according to the distances from P_{X^n} to P_U and P_N, respectively. The decision problem is a maximum-likelihood test, and writing out the log-likelihood ratio leads to the decision rule

decide H_0 if D(P_{X^n}||P_U) − D(P_{X^n}||P_N) > 0, otherwise decide H_1. (6)

With the hypothesis-testing problem defined above and P_N being the 24-bit noise distribution from the previous section, we build a distinguisher to decide the underlying sample distribution. We run 64 parallel SNOW 3G instances with random initial states, each clocked 2^40 times, and collect the targeted samples. At each clock t, we combine the keystream and LFSR bytes forming the xor-sum in (3) into a 24-bit integer and increase the occurrence count of the corresponding entry in the distribution table. We also collect the three least significant bytes of z^(t) and concatenate them into a 24-bit variable, which is regarded as a comparison sample drawn from a uniform distribution.
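The divergence-based decision rule can be sketched as below: compute the empirical distribution from the counts and compare its distances to the uniform and noise distributions.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def decide(sample_counts, p_noise):
    """Neyman-Pearson style decision: return 'noise' (H_0) when the
    empirical distribution is closer to p_noise than to uniform."""
    n = sum(sample_counts)
    p_emp = [c / n for c in sample_counts]
    p_uni = [1.0 / len(sample_counts)] * len(sample_counts)
    return 'noise' if kl(p_emp, p_uni) - kl(p_emp, p_noise) > 0 else 'uniform'
```

The quantity kl(p_emp, p_uni) − kl(p_emp, p_noise) is exactly the difference tracked in Figure 2: it fluctuates around 0 for short samples and becomes stably positive once enough samples are collected.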
After this process, we have the tables of occurrences of all possible 24-bit values and their corresponding probabilities. These sample distributions are then tested by the decision rule given in (6) to decide which distribution they follow, by calculating the distances to the uniform distribution P_U and the noise distribution P_N, respectively. There are two types of errors affecting the correctness of the distinguisher: TYPE I errors, where a noise distribution is guessed to be random; and TYPE II errors, where a uniform distribution is falsely guessed to be the biased one.
Figure 2 shows the distances of one sample sequence to the uniform distribution and the noise distribution, and their difference, for different sample lengths. We can see that with an increasing sample length, the distances to the uniform and noise distributions, which are quite close, are both decreasing. Even so, we can still distinguish the distribution by the difference of the two distances, i.e., D(P_X||P_U) − D(P_X||P_N). While fluctuating around 0 in the beginning, the difference becomes stably positive after length 2^39.58, indicating that the sample distribution is closer to the noise distribution and we get the correct guess.
Figure 3 shows the TYPE I and TYPE II error probabilities of the distinguisher for different sample lengths. We can see that, while fluctuating, the error probabilities become smaller as the number of samples increases, indicating that the guesses become more accurate. From length 2^40, we can distinguish the samples with large success probability, while at length 2^41.5 there are no errors in our 21 sample sequences. The result matches well with the bias ε obtained in Section 3 and the conclusion that O(1/ε) samples are needed to distinguish the distribution from random.

Attacks based on the new vectorized linear approximations
We are now ready to use our vectorized linear approximations of the FSM to launch attacks. We recall that the approximation on three bytes derived in Section 3 has the form given there, where (n_0, n_1, n_2) denotes the noise in the 24-bit linear approximation. In Section 3 we computed the bias of this noise to be about 2^-37 and experimentally verified it. We first consider how to launch a distinguishing attack with this 24-bit approximation in the next subsection. Fast correlation attacks are considered in Subsection 5.2.

A distinguishing attack
In a distinguishing attack, we build an algorithm that takes a sequence as input and determines, with a small error probability, whether the sequence stems from the considered keystream generator or is a truly random sequence. A potential application would be a case where only two messages m, m' are possible, and from the ciphertext c only, we would like to determine which message was sent. One then forms a candidate keystream by computing c ⊕ m and inputs it to the distinguisher. If the distinguisher finds that this candidate keystream is likely to have been generated by the keystream generator, it is likely that the sent message was m.
The basic idea for finding a distinguishing attack in our scenario is to completely remove the contribution from the LFSR part, leaving only a linear function of known output symbols as a sample from a noisy distribution.After collecting enough samples, one can distinguish the considered keystream from a truly random sequence.
Considering the relationship in (7), we would like to cancel the LFSR contribution, and for simplicity we write s_0^(t) simply as s^(t). It is easy to verify the following theorem.
Theorem 1. If one can find t_1, t_2, t_3 such that s^(0) ⊕ s^(t_1) ⊕ s^(t_2) ⊕ s^(t_3) = 0, then S^(t) ⊕ S^(t+t_1) ⊕ S^(t+t_2) ⊕ S^(t+t_3) = (0, 0, 0).

Proof. Since s^(0) ⊕ s^(t_1) ⊕ s^(t_2) ⊕ s^(t_3) = 0, the equation still holds under any time shift, i.e., s^(t) ⊕ s^(t+t_1) ⊕ s^(t+t_2) ⊕ s^(t+t_3) = 0. Each term of S^(t) is such a shifted LFSR symbol, so the xor-sum of the values at times t, t + t_1, t + t_2, t + t_3 is 0 for every term as well.

Assuming that we have found t_1, t_2, t_3 satisfying Theorem 1, we can create samples from a biased distribution by computing samples x^(t) as the xor of the samples at times t + t_0, t + t_1, t + t_2, t + t_3, where t_0 = 0. The samples x^(t) are then drawn from a noisy distribution, namely the distribution of the sum of 4 noise variables distributed as N_tot. This was previously computed to have a bias of ε(4 × N_tot) = 2^-163, and hence it requires in the order of 2^163 keystream symbols to distinguish the samples from a uniform distribution.
The remaining problem is to examine the computational complexity of finding t_1, t_2, t_3 satisfying s^(0) ⊕ s^(t_1) ⊕ s^(t_2) ⊕ s^(t_3) = 0. The sequence s^(t) is generated using the feedback polynomial P(x) = αx^16 + x^14 + α^{-1}x^5 + 1 ∈ GF(2^32)[x]. We are thus seeking a weight-4 multiple K(x) of the feedback polynomial where all nonzero coefficients are equal to one. We may first argue about the expected degree q of such a polynomial. If we consider all t ≤ q, we can create about q^3/6 different combinations s^(0) ⊕ s^(t_1) ⊕ s^(t_2) ⊕ s^(t_3), expressed in the initial state. Since there are 2^512 possible such combinations, we can expect that we need to go to a degree q such that roughly q^3/6 ≈ 2^512, resulting in q ≈ 2^172.
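The expected-degree estimate can be checked numerically: solving q^3/6 ≈ 2^512 for q gives log_2 q ≈ 171.5, i.e., q ≈ 2^172.

```python
import math

# expected degree q of the first weight-4 multiple: q^3 / 6 ~ 2^512,
# so log2(q) = (512 + log2(6)) / 3
log2_q = (512 + math.log2(6)) / 3
```

This matches the keystream length of around 2^172 required by the distinguishing attack.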
Finally, we need an efficient way to find a weight-4 multiple. Here we use a slight generalization of the algorithm proposed by Löndahl and Johansson in [LJ14]. The algorithm solves the problem with a computational complexity of only around 2^d, where d = log_2 q, and similar storage. It uses the idea of duplicating the desired multiple into many instances and then finding one of them with very large probability. The solution for the SNOW 3G case can be described as follows. Assume that K(x) is the weight-4 multiple of lowest degree and that its degree is around 2^d, as expected. Algorithm 1 considers all weight-4 multiples up to degree 2^{d+b}, where b is a small integer, but will only find those that include two monomials x^{i_1} and x^{i_2} such that φ(x^{i_1} + x^{i_2} mod P(x)) = 0, where φ(·) denotes the d least significant bits.
Algorithm 1 Finding a multiple of P(x) with weight 4 and all nonzero coefficients equal to one
Input: a polynomial P(x), a small integer b
Output: a polynomial multiple K(x) = P(x)Q(x) of weight 4 and expected degree 2^d, with nonzero coefficients set to one

1. From P(x), create all residues x^{i_1} mod P(x), for 0 ≤ i_1 < 2^{d+b}, and put (x^{i_1} mod P(x), i_1) in a list L_1. Sort L_1 according to the residue value of each entry.
2. Create all residues x^{i_1} + x^{i_2} mod P(x) such that φ(x^{i_1} + x^{i_2} mod P(x)) = 0, for 0 ≤ i_1 < i_2 < 2^{d+b}, and put them in a list L_2. Here φ(·) denotes the d least significant bits. This is done by merging the sorted list L_1 with itself and keeping only residues with φ(x^{i_1} + x^{i_2} mod P(x)) = 0. The list L_2 is sorted according to the residue value.
3. In the final step, merge the sorted list L_2 with itself to create a list L, keeping only combinations with x^{i_1} + x^{i_2} + x^{i_3} + x^{i_4} = 0 mod P(x).
As K(x) has weight 4, any polynomial x^{i_1}K(x) also has weight 4, and since we consider all weight-4 multiples up to degree 2^{d+b}, we will consider 2^{d+b} − 2^d such weight-4 polynomials, i.e., about 2^d(2^b − 1) duplicates of K(x). As the probability for a single weight-4 polynomial to satisfy the condition φ(x^{i_1} + x^{i_2} mod P(x)) = 0 can be approximated as around 2^-d, there is a large probability that at least one duplicate x^{i_1}K(x) survives Step 2 of Algorithm 1 and is included in the output. Further details and experimental verification can be found in [LJ14].
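A toy version of the search over GF(2)[x] illustrates Algorithm 1 on a small binary polynomial; the real attack works with the GF(2^32) feedback polynomial and sorted-list merging instead of the dictionaries used here.

```python
# Toy weight-4 multiple search: find exponents a < b < c < d with
# x^a + x^b + x^c + x^d == 0 (mod P(x)) over GF(2), by grouping pair
# residues x^i + x^j mod P(x) that pass the phi-filter (d_bits least
# significant bits equal to zero) and matching two disjoint pairs.

def residues(p_int, deg, limit):
    """x^i mod P(x) for 0 <= i < limit; P given as a bitmask of degree deg."""
    r, out, top = 1, [], 1 << deg
    for _ in range(limit):
        out.append(r)
        r <<= 1
        if r & top:
            r ^= p_int
    return out

def weight4_multiple(p_int, deg, limit, d_bits):
    res = residues(p_int, deg, limit)
    mask = (1 << d_bits) - 1
    pairs = {}                              # residue value -> index pairs (L_2)
    for i in range(limit):
        for j in range(i + 1, limit):
            v = res[i] ^ res[j]
            if v & mask == 0:               # the phi-filter from Step 2
                pairs.setdefault(v, []).append((i, j))
    for lst in pairs.values():              # Step 3: match equal residues
        for a in range(len(lst)):
            for b in range(a + 1, len(lst)):
                idx = set(lst[a]) | set(lst[b])
                if len(idx) == 4:           # four distinct exponents
                    return sorted(idx)
    return None
```

For example, with P(x) = x^4 + x + 1 (bitmask 0b10011), the multiple P(x)(x + 1) = x^5 + x^4 + x^2 + 1 has weight 4, so a solution exists below degree 16.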
Regarding complexity, we note that the tables are all of size around 2^d. Creating L_1 costs roughly 2^d, and creating L_2 costs about the same, as we only access entries in L_1 with the same d least significant bits. For a sufficiently large b, one iteration of Algorithm 1 succeeds with high probability.
To conclude, we have described a distinguishing attack on SNOW 3G for which we need a keystream length of around 2^172 and similar complexity. It uses a precomputation step of complexity around 2^172, and the required memory is of the same size.

A fast correlation attack
A fast correlation attack is a key recovery attack and hence a much stronger attack than a distinguishing attack. It tries to recover the key by exploiting the correlation between the keystream and the output of the LFSR states, which always exists for nonlinear functions [Sie84]. It is commonly modeled as a decoding problem in GF(2)^n or GF(2^n), with the observed keystream samples y = (y_0, y_1, ..., y_{N−1}) being a noisy version of the LFSR sequence u = (u_0, u_1, ..., u_{N−1}) through a discrete memoryless channel (DMC) with nonuniform noise e = (e_0, e_1, ..., e_{N−1}), i.e., y_i = u_i + e_i for 0 ≤ i ≤ N − 1. In a preprocessing stage, one searches for parity checks that involve only a smaller number l' < l of initial state symbols. If we collect N' such parity checks, we can construct a new [N', l'] code, with U = (U_0, U_1, ..., U_{N'−1}) being the output of a converted LFSR with l' states and Y = (Y_0, Y_1, ..., Y_{N'−1}) being the noisy version of U through a noisier channel with noise E = (E_0, E_1, ..., E_{N'−1}). Since l' < l, the decoding complexity is reduced. The remaining work is then to solve the decoding problem efficiently, which we describe in detail in the processing stage.
As for the complexity of the preprocessing stage, with the k-tree algorithm in [Wag02] employed to find such parity check equations, the time/space complexity is O(k·2^(n(l−l′)/(1+log k))) and the lists are of size O(2^(n(l−l′)/(1+log k))), with n being the size of the finite field; n = 8 in our case. Note that ρ^(1+log k) such tuples can be found with ρ times as much work as finding a single solution, i.e., O(ρk·2^(n(l−l′)/(1+log k))) for time and space and O(ρ·2^(n(l−l′)/(1+log k))) for the size of each list, as long as ρ ≤ 2^(n(l−l′)/(log k(1+log k))).
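The k-tree idea for k = 4 can be sketched as follows, with toy parameters (16-bit values and four lists of 256 random entries, far smaller than the attack's actual lists over GF(2^8)^(l−l′)): merge pairs of lists on the low half of the bits, then merge the results on all bits.

```python
import random

random.seed(1)
n, half = 16, 8
low = (1 << half) - 1
L = [[random.getrandbits(n) for _ in range(2**half)] for _ in range(4)]

def merge(A, B, mask):
    """Keep a ^ b for all pairs (a, b) with (a ^ b) & mask == 0."""
    idx = {}
    for b in B:
        idx.setdefault(b & mask, []).append(b)
    return [a ^ b for a in A for b in idx.get(a & mask, [])]

L12 = merge(L[0], L[1], low)               # xors vanishing on the low 8 bits
L34 = merge(L[2], L[3], low)
solutions = merge(L12, L34, (1 << n) - 1)  # full collision -> xor is 0

print(len(solutions) > 0 and all(s == 0 for s in solutions))
```

Each merge keeps the list sizes roughly constant while doubling the number of constrained bits, which is the source of the 1/(1 + log k) exponent above.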

Processing Stage: Decoding the [N′, l′] code
We now move to the process of decoding the [N′, l′] code, following the method in [ZXM15] and using the Fast Walsh Transform (FWT) to accelerate the decoding process. The main idea is to build a distinguisher I(û), defined for a guess û = (u_0, u_1, ..., u_(l′−1)) of the first l′ LFSR states, where i denotes the i-th tuple. I(û) is biased for the correct guess, since only the noise term E_i then remains.
The next step is to check the balancedness of I(û) for every guess û in order to find the correct key. First, the correlations c(⟨a, I⟩) of the Boolean functions ⟨a, I⟩, i.e., the inner products of a and I with a ∈ GF(2)^n, are obtained, and then the SEI of I(û) can be derived as Δ(û) = Σ_{a ∈ GF(2)^n} c^2(⟨a, I⟩) according to [NH07]. We can then verify whether I(û) is biased or not and further recover the key. To compute the correlations c(⟨a, I⟩) efficiently, the method in [LV04] can be used. First, the vectorial Boolean function I is split into n linearly independent Boolean functions I_1, ..., I_n, each expressed as I_j = ⟨w_j, û⟩ ⊕ ⟨v_j, Y_i⟩, where w_j ∈ GF(2)^(nl′) and v_j ∈ GF(2)^n are two binary coefficient vectors. Then the FWT can be used to compute the correlation of each I_j. It is stated in [ZXM15] that the total correlation can be further derived by the Piling-up Lemma. We refer to [ZXM15] for a more detailed description of this process, and we use the complexity formulas from there.
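For a single Boolean component, the FWT computes the correlations for all linear masks at once. A minimal sketch, with a small toy function standing in for the actual components I_j:

```python
def fwt(a):
    """In-place Walsh-Hadamard transform; a has length 2^m."""
    h, n = 1, len(a)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

m = 4
g = lambda u: (u & (u >> 1)) & 1           # toy Boolean function of m bits
f = [(-1) ** g(u) for u in range(2**m)]    # sign representation of g
W = fwt(f[:])             # W[w] = sum_u (-1)^(g(u) XOR <w, u>)
corr = [w / 2**m for w in W]               # correlation for every mask w
print(max(abs(c) for c in corr))
```

One transform of length 2^m replaces 2^m separate correlation computations, which is exactly the speed-up exploited in the decoding step.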
For SNOW 3G, we use the 8-bit linear approximation, which has a bias of around 2^(−40.970689). We first rewrite the LFSR sequence symbols as linear functions of the 64 initial state bytes, with a new and more complex generating polynomial. Then the preprocessing stage described before is used to generate the parity check equations with parameters l = 64, k = 4. The SEI of the k = 4 folded noise variables is around 2^(−163.877500). We tested different choices of l′ and found that l′ = 20 gives the lowest total complexity. The number of parity check equations required in this case is m_k = 2^171.67. The time/space complexity of preprocessing is ρk·2^(n(l−l′)/(1+log k)) = 2^176.56, and the required keystream length is 2^176.56. With a decoding complexity of n(m_k + l′n·2^(l′n)) + 2^(n+l′n) = 2^174.75, the first 20·8 = 160 bits of the LFSR initial state can be recovered. Therefore, the time/memory/data/pre-computation complexities are all upper bounded by 2^176.56.
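As a back-of-the-envelope check, the figures above can be reproduced numerically. The sample-count rule m_k ≈ 2·l′·n·ln 2 / SEI and the k-tree repetition scaling used below are our assumptions for illustration, not formulas taken verbatim from [ZXM15]:

```python
import math

n, l, lp, k = 8, 64, 20, 4       # field bits, LFSR cells, guessed cells, k
sei_fold = 4 * -40.970689        # log2 SEI of the xor of k = 4 noise bytes

# Parity checks needed (assumed rule: m_k ~ 2 * l' * n * ln2 / SEI)
mk = math.log2(2 * lp * n * math.log(2)) - sei_fold
base = n * (l - lp) / (1 + math.log2(k))  # k-tree exponent n(l-l')/(1+log k)
rho = mk / (1 + math.log2(k))             # repetitions to get m_k tuples
pre = rho + math.log2(k) + base           # log2 time/space of preprocessing
dec = math.log2(n * (2**mk + lp * n * 2**(lp * n)) + 2**(n + lp * n))

print(round(mk, 2), round(pre, 2), round(dec, 2))
```

Under these assumptions the computed exponents land within about 0.01 of the quoted 171.67, 176.56, and 174.75.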

Potential Correlation Attack using a 16-bit approximation
In Section 3, we obtained the 24-bit and 8-bit linear approximations with biases 2^−37 and 2^−41, respectively. We would obviously like to use the 24-bit approximation to launch a correlation attack. But as explained before, the 24-bit approximation is not defined over a finite field and cannot be used directly in a fast correlation attack. We now report some findings on building a 16-bit approximation by an experimental method based on the 8-bit approximation derived before. Specifically, we concatenate two consecutive 8-bit symbols into one 16-bit symbol, i.e., (y^(t), y^(t+1)). The theoretical distribution of such a pair of bytes is too computationally expensive to compute, but since we know the bias for a single byte we can derive a bound on the bias.
We run a large number of SNOW 3G instances in parallel with random initial states and collect (y^(t), y^(t+1)) according to (4) at each clock, obtaining 2^53 16-bit samples in total. We then record the occurrences of each entry in the distribution table and obtain a bias of 2^(−36.8293). However, this estimate is not very accurate, and more samples would be needed for full confidence. Even so, it gives a general estimate of the bias. After collecting these samples, we used the method in Section 4 to distinguish 16-bit and 8-bit samples between the uniform distribution and the respective 16-bit and 8-bit distributions we derived.
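Estimating the SEI from empirical samples follows the standard recipe: count occurrences, form the empirical distribution, and sum the squared deviations from uniform. A toy 4-bit version (the biased source here is artificial, not SNOW 3G output):

```python
import math, random

random.seed(7)
m = 4
M = 2**m

def sample():
    """Toy biased source: symbol 0 is slightly favoured over uniform."""
    return 0 if random.random() < 1 / M + 0.01 else random.randrange(M)

N = 200_000
counts = [0] * M
for _ in range(N):
    counts[sample()] += 1

# SEI estimate: 2^m * sum_x (p_hat(x) - 2^-m)^2
sei = M * sum((c / N - 1 / M) ** 2 for c in counts)
print(math.log2(sei))   # close to the true SEI, roughly 2^-3.7 here
```

For a 16-bit alphabet the table simply has 2^16 cells; the estimator is the same, but reliable estimates of a tiny SEI like 2^−38 need on the order of 1/SEI samples, hence the 2^53 samples used above.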
Table 1 shows the Type I errors and probabilities for the two cases under different sample lengths. We can see directly from the table that the error probability for the 16-bit distinguishing is much smaller than for the 8-bit one, indicating that the bias of the 16-bit approximation is larger. Considering that the error probabilities become 0 for the 16-bit and 8-bit distinguishers at lengths 2^42 and 2^46, respectively, we can make a rough estimate that the bias of the 16-bit approximation is around 2^−38. We now briefly explain how to use this 16-bit approximation in a correlation attack. First we point out that any output of the LFSR at clock t, u^(t), can be written as a linear function of the initial states (u_0, u_1, ..., u_63). At the next clock t + 1, the value in the (i + 1)-th cell is shifted to the i-th cell for 0 ≤ i ≤ 62, and only the 63-rd cell is updated with the new value u′_63 = Σ_{i=0}^{62} γ_i u_i, where the γ_i are the feedback coefficients of the LFSR over GF(2^8). Then y^(t+1) can be expressed accordingly. Assume that, by the k-tree algorithm, we have found a k-tuple combination, say k = 4, with (t_1, t_2, t_3, t_4) that maps the xor-sum of the output to the first l′ 8-bit symbols in the LFSR, i.e., ⊕_{j=1}^{4} y^(t_j). We aim to build the other part of the 16-bit symbol by computing ⊕_{j=1}^{4} y^(t_j+1) from (9). From the above, ⊕_{j=1}^{4} y^(t_j+1) can be mapped to the xor-sum of (u_0, u_1, ..., u_{l′}). Here (E, E′) can be regarded as following the 16-bit distribution obtained at the beginning of this subsection. Compared to the 8-bit correlation attack, where the output is mapped to the first l′ states, the 16-bit symbol is mapped to the first l′ + 1 states, i.e., one more state is involved. If we find N′ such tuples, we can build a new [N′, l′ + 1] code with the 16-bit approximation as the noisy channel. We then proceed to the decoding stage to recover the l′ + 1 states.
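The bookkeeping of u^(t) as a linear function of the initial symbols can be sketched as follows. The field reduction polynomial and the feedback taps here are placeholders for illustration; SNOW 3G's actual γ_i and field representation are fixed by its specification.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1 (a toy choice)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

L = 64
gamma = [(3 * i + 1) % 255 + 1 for i in range(L)]  # placeholder taps

# coeff[i] = coefficients of cell i as a linear function of u_0..u_63
coeff = [[1 if j == i else 0 for j in range(L)] for i in range(L)]

def clock(coeff):
    """One LFSR step: shift cells down; cell 63 gets sum gamma_i * cell_i."""
    fb = [0] * L
    for i in range(L):
        for j in range(L):
            fb[j] ^= gf_mul(gamma[i], coeff[i][j])
    return coeff[1:] + [fb]

for _ in range(5):
    coeff = clock(coeff)

# After 5 clocks the output cell is still the original symbol u_5
print(coeff[0][:8])   # [0, 0, 0, 0, 0, 1, 0, 0]
```

Clocking the coefficient vectors once more is exactly how the symbol at t + 1 is re-expressed over the initial state, which is why the 16-bit symbol touches one extra state cell.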
Next, let us check the complexity. For the preprocessing phase, we can still use the k-tree method to find parity check equations, but now through a different DMC with smaller noise. Using the method described before, the best result in terms of complexity is obtained for l′ = 19. Now 2^159.72 parity check equations are required, with time/space complexity 2^175.24, and the required keystream length is 2^175.24. For the decoding process, the complexity is 2^176.06. From these results we see that while the total complexity is still upper bounded by roughly the same order, around 2^176, the required number of parity check equations drops from 2^171.67 to 2^159.72. Note that these complexity estimates are based only on experimental results and do not come with full confidence.

Conclusion
In this paper, we propose a distinguishing attack and a correlation attack on SNOW 3G using new linear approximations over larger alphabets. We first derive a 24-bit and an 8-bit linear approximation of the FSM and verify them by experimental tests using hypothesis testing. We then use the derived approximations to launch a distinguishing attack and a correlation attack. For the distinguishing attack, we find a weight-4 multiple of the generating polynomial to cancel out the contribution from the LFSR and distinguish the corresponding keystream sample sequence with complexity around 2^172. For the correlation attack, we use the 8-bit approximation to recover 160 bits of the secret key with complexity around 2^177. As far as we know, these are the first distinguishing and correlation attacks on SNOW 3G. If the key length in SNOW 3G were increased to 256 bits, these results show that there would be academic attacks on such a version faster than exhaustive key search.
A possible way to improve the results and achieve a higher bias would be to consider even larger alphabets in the approximations, but this would require much more demanding simulations to find and verify the biases of the different choices of approximation. Another interesting direction would be to launch distinguishing attacks based directly on the weight-4 recurrence relation for the LFSR, which has nonzero coefficients that are not all one. Such an attack would remove the requirement of a very long keystream and could instead be applied to a large set of short keystreams generated under different IVs.

Figure 2: Distances to the uniform distribution and noise distribution

The sum ⊕_{j=1}^{4} y^(t_j+1) can be written as Σ_i c′_i u_i ⊕ E′, with the c′_i being some new coefficients and E′ = ⊕_{j=1}^{4} e^(t_j+1). The 16-bit approximation can then be expressed as (Σ_i c_i u_i, Σ_i c′_i u_i) ⊕ (E, E′).

Table 1: Errors and error probabilities (in brackets) under different sample lengths