A new SNOW stream cipher called SNOW-V

In this paper we are proposing a new member in the SNOW family of stream ciphers, called SNOW-V. The motivation is to meet an industry demand of very high speed encryption in a virtualized environment, something that can be expected to be relevant in a future 5G mobile communication system. We are revising the SNOW 3G architecture to be competitive in such a pure software environment, making use of both existing acceleration instructions for the AES encryption round function as well as the ability of modern CPUs to handle large vectors of integers (e.g. the Advanced Vector Extensions AVX from Intel). We have kept the general design from SNOW 3G, in terms of linear feedback shift register (LFSR) and Finite State Machine (FSM), but both entities are updated to better align with vectorized implementations. The LFSR part is new and operates at 8 times the speed of the FSM. We have furthermore increased the total state size by using 128-bit registers in the FSM, we use the full AES encryption round function in the FSM update, and, finally, the initialization phase includes a masking with key bits at its end. The result is an algorithm generally much faster than AES-256 and with expected security not worse than AES-256.


Introduction
Stream ciphers have always played an important part in securing the various generations of 3GPP mobile telephony systems, starting with the GSM system employing the A5 suite of ciphers, continuing with the use of SNOW 3G as the secondary algorithm in UMTS, and more recently as the primary algorithm in LTE, for both integrity and confidentiality. When we now turn to the next generation system, called 5G, we see some fundamental changes in system architecture and security level that in many cases invalidate the previous algorithms. We will focus on the LTE (or 4G, as it is commonly called) system when describing the current state in link protection for mobile systems.
The basis for the link security in all 3GPP generations of mobile telephony systems is a shared secret key between the device (commonly called the User Equipment, UE) and the home network, the Mobile Network Operator that the user has a service agreement with, and from whom the user receives the credentials in the form of a UICC with a USIM application (often referred to as the SIM-card). The shared key is stored in the Home Subscriber Server (HSS) and in the Secure Element on the UICC. From this key, through a rather complicated set of key derivations, the home network and the UE both agree on new keys to be used for integrity and confidentiality protection of the control channel, and confidentiality protection of the user data channel. The 4G system defines three different possible algorithms for integrity (EIAx) and confidentiality (EEAx), based on three different primitives: SNOW 3G [SAG06], AES [oST01], and ZUC [SAG11]. The algorithms used in UMTS and LTE all use the 128-bit key size, and are depicted in Table 1.

UMTS (Integrity, Encryption) | LTE (Integrity, Encryption)
Table 1: Base algorithms used in UMTS and LTE for integrity and confidentiality.
The SNOW family of stream ciphers started with the SNOW [EJ01] proposal in the European project NESSIE, a call for new primitives. Two attacks [HR02,CHJ02] were soon discovered and the design was subsequently updated to the SNOW 2.0 [EJ02] design. Attacks on SNOW 2.0 are discussed further in section 3. The ETSI Security Algorithm Group of Experts (SAGE) modified the SNOW 2.0 design and proposed the resulting cipher SNOW 3G as one of the algorithms protecting the air interface in 3GPP telecommunication networks.
Although sufficient for the 4G system, these EIA and EEA algorithms face some challenges in the 5G environment. For the 5G system, the 3GPP standardization organization is looking towards increasing the security level to 256-bit key lengths [SA318]. For ExA1 and ExA2, this does not immediately appear to be a problem, since both the underlying primitives (AES and SNOW) are specified for 256-bit keys. ZUC is currently only specified and evaluated under 128-bit key strength, but another version, ZUC-256, supporting 256-bit keys has recently been presented [Bin]. However, since the design of the radio and core network will also fundamentally change in the 5G system, there are other challenges. Many of the network nodes will become virtualized [3GP] and thus the ability to use specialized hardware for the cryptographic primitives will be reduced. Many newer processors from both Intel and ARM now include instructions to accelerate AES, and it will be fairly easy to reach encryption speeds of 20-25 Gbps for EIA2 and EEA2, but for the stream ciphers SNOW and ZUC, we need to look for other solutions. Current benchmarks on SNOW 3G give approximately 6-7 Gbps in a pure software implementation, which is far too low for the targeted speed of 10 Gbps in the 5G system (see, e.g., [ITU17]).
In this paper we revise the SNOW 2.0/SNOW 3G architecture to be competitive in a pure software environment, relying on both the acceleration instructions for the AES round function as well as the ability of modern CPUs to handle large vectors of integers (e.g. the Advanced Vector Extensions AVX from Intel). We have kept most of the design from SNOW 3G, in terms of linear feedback shift register (LFSR) and Finite State Machine (FSM), but both entities are updated to better align with vectorized implementations. We have also increased the total state size by going from 32-bit registers to 128-bit registers in the FSM. Each clocking of SNOW-V (V for Virtualization) now produces 128 bits of keystream.
We also propose an AEAD (Authenticated Encryption with Associated Data) operational mode to provide both confidentiality and integrity protection. The keystream width of 128 bits makes the authentication framework of GMAC [Dwo07] very easy to adapt to SNOW-V.
This paper is organized as follows. In section 2, we present the new design, including pseudocode. In section 3, we give a brief security analysis, describing most of the common attack approaches and how they apply to SNOW-V. In section 4, hardware implementation aspects are given and in section 5 the corresponding treatment of software implementations is given. Section 6 considers software performance results and implementation aspects using future SIMD instruction sets. In section 7 we describe how authentication can be included, in an AEAD mode of operation, and the paper ends with conclusions in section 8.
The design
SNOW-V follows the design pattern of previous SNOW versions and consists of an LFSR part and an FSM part. The overall schematic is shown in Figure 1. The LFSR part is now a circular construction consisting of two shift registers, each feeding into the other. The FSM has three 128-bit registers and two instances of a single AES encryption round function.
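Each cell represents an element in $\mathbb{F}_{2^{16}}$, but LFSR-A and LFSR-B have different generating polynomials. The elements of LFSR-A are generated by the polynomial

$$g^A(x) = x^{16} + x^{15} + x^{12} + x^{11} + x^{8} + x^{3} + x^{2} + x + 1 \in \mathbb{F}_2[x] \qquad (1)$$

and the elements of LFSR-B are generated by

$$g^B(x) = x^{16} + x^{15} + x^{14} + x^{11} + x^{8} + x^{6} + x^{5} + x + 1 \in \mathbb{F}_2[x]. \qquad (2)$$

When we consider these elements of $\mathbb{F}_{2^{16}}$ as words, the $x^0$ position will be the least significant bit in the word. Let $\alpha \in \mathbb{F}^A_{2^{16}}$ be a root of $g^A(x)$ and $\beta \in \mathbb{F}^B_{2^{16}}$ be a root of $g^B(x)$. At time $t \ge 0$ we denote the states of the LFSRs as $(a^{(t)}_{15}, a^{(t)}_{14}, \ldots, a^{(t)}_{0})$, $a^{(t)}_i \in \mathbb{F}^A_{2^{16}}$, and $(b^{(t)}_{15}, b^{(t)}_{14}, \ldots, b^{(t)}_{0})$, $b^{(t)}_i \in \mathbb{F}^B_{2^{16}}$, respectively for LFSR-A and LFSR-B. Referring to Figure 1, the elements $a^{(t)}_0$ and $b^{(t)}_0$ are the elements to first exit the LFSRs. The LFSRs produce sequences $a^{(t)}$ and $b^{(t)}$, $t \ge 0$, which are given by the expressions

$$a^{(t+16)} = b^{(t)} + \alpha a^{(t)} + a^{(t+1)} + \alpha^{-1} a^{(t+8)} \bmod g^A(\alpha), \qquad (3)$$
$$b^{(t+16)} = a^{(t)} + \beta b^{(t)} + b^{(t+3)} + \beta^{-1} b^{(t+8)} \bmod g^B(\beta), \qquad (4)$$

where the initial states of the LFSRs are given by $(a^{(15)}, a^{(14)}, \ldots, a^{(0)})$ and $(b^{(15)}, b^{(14)}, \ldots, b^{(0)})$.
We would like to emphasize the notation here: $a^{(t)}$ means the symbol produced by the linear recursion in Equation 3 at time $t$, whereas $a_i$, $0 \le i \le 15$, are the values of the cells in LFSR-A at time $t$. In the case of $\alpha$ and $\beta$, the notations $\alpha^{-1}$ and $\beta^{-1}$ denote the inverses in the respective implemented fields.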
As the reader might notice, we are a bit sloppy in Equation 3 and Equation 4 and apply the field addition operation between elements of different fields, but it should be interpreted as an implicit bit pattern preserving conversion between the fields.
Each time we update the LFSR part, we clock LFSR-A and LFSR-B 8 times, i.e., 256 bits of the total 512-bit state will be updated in a single step, and the two taps $T1$ and $T2$ will have fresh values. In Appendix A we give the proof that this circular construction gives the maximum cycle length of $2^{512} - 1$.
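The LFSR update described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the feedback constants are our encoding of $g^A$ and $g^B$ as 16-bit patterns (an assumption to be checked against Equations 1 and 2), the tap positions follow our reading of Equations 3 and 4, and the helper names `mul_x`, `mul_x_inv`, `clock_lfsr`, `lfsr_update` and `taps` are hypothetical.

```python
# Assumed encodings: low 16 bits of g^A and g^B, and the full polynomials
# shifted right by one bit (used when multiplying by the inverse of the root).
GA, GA_INV = 0x990F, 0xCC87
GB, GB_INV = 0xC963, 0xE4B1

def mul_x(v, c):
    """Multiply a 16-bit field element by the root (alpha or beta)."""
    v <<= 1
    return (v ^ c) & 0xFFFF if v & 0x10000 else v

def mul_x_inv(v, c):
    """Multiply a 16-bit field element by the inverse of the root."""
    return (v >> 1) ^ c if v & 1 else v >> 1

def clock_lfsr(a, b):
    """One clock of both LFSRs; a[0] and b[0] are the cells that exit first."""
    u = b[0] ^ mul_x(a[0], GA) ^ a[1] ^ mul_x_inv(a[8], GA_INV)
    v = a[0] ^ mul_x(b[0], GB) ^ b[3] ^ mul_x_inv(b[8], GB_INV)
    return a[1:] + [u], b[1:] + [v]

def lfsr_update(a, b):
    """One LFSR update step: eight clocks of both registers."""
    for _ in range(8):
        a, b = clock_lfsr(a, b)
    return a, b

def taps(a, b):
    """Pack T1 = (b15..b8) and T2 = (a7..a0) into 128-bit integers,
    with b8 and a0 as the least significant 16-bit parts."""
    t1 = sum(b[8 + i] << (16 * i) for i in range(8))
    t2 = sum(a[i] << (16 * i) for i in range(8))
    return t1, t2
```

After one call to `lfsr_update`, the lower eight cells of each register hold what used to be the upper eight cells, and the upper eight cells hold freshly computed values, which is why both taps change in every step.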
The tap $T1$ is formed by considering $(b_{15}, b_{14}, \ldots, b_8)$ as a 128-bit word where $b_8$ is the least significant part. Similarly, $T2$ is formed by considering $(a_7, a_6, \ldots, a_0)$ as a 128-bit word where $a_0$ is the least significant part. The mapping is pictured in Figure 2, and the expressions are given by

$$T1^{(t)} = (b^{(8t)}_{15}, b^{(8t)}_{14}, \ldots, b^{(8t)}_{8}),$$
$$T2^{(t)} = (a^{(8t)}_{7}, a^{(8t)}_{6}, \ldots, a^{(8t)}_{0}).$$

We will now turn to the FSM. The FSM takes the two blocks $T1$ and $T2$ from the LFSR part as inputs and produces a 128-bit keystream as output. $R1$, $R2$, and $R3$ are 128-bit registers, $\oplus$ denotes a bitwise XOR operation, and $\boxplus_{32}$ denotes an addition with carry, but split up into four 32-bit additions. So the four 32-bit parts of the 128-bit words are added with carry, but the carry does not propagate from a lower 32-bit word to the higher. The output $z^{(t)}$ at time $t \ge 0$ is given by the expression

$$z^{(t)} = (R1^{(t)} \boxplus_{32} T1^{(t)}) \oplus R2^{(t)}.$$

Registers $R2$ and $R3$ are updated through a full AES encryption round function as shown in Figure 3, see [oST01] for details. Let us denote the AES encryption round function by $AES^R(IN, KEY)$. Then the update expressions for the registers are given by

$$R1^{(t+1)} = \sigma(R2^{(t)} \boxplus_{32} (R3^{(t)} \oplus T2^{(t)})),$$
$$R2^{(t+1)} = AES^R(R1^{(t)}, C1),$$
$$R3^{(t+1)} = AES^R(R2^{(t)}, C2),$$

where $\sigma$ is a byte-oriented permutation given by $\sigma = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]$.

The values of the two round key constants $C1$ and $C2$ are set to zero. The mapping between the 128-bit registers and the state array of the AES round function follows the definition in [oST01], and is pictured in Figure 4.
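For readers without access to AES instructions, the AES encryption round used in the FSM can be sketched in plain Python. This is an illustrative sketch only: the state is a list of 16 bytes in the column-major order of [oST01], the mapping from a 128-bit register to that list (Figure 4) is not modelled, and the names `aes_enc_round`, `SBOX`, `_gmul` and `_xtime` are hypothetical.

```python
def _xtime(x):
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    x <<= 1
    return (x ^ 0x11B) if x & 0x100 else x

def _gmul(x, y):
    """Multiply two elements of GF(2^8)."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x = _xtime(x)
        y >>= 1
    return r

def _build_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for v in range(256):
        inv = 0 if v == 0 else next(c for c in range(1, 256) if _gmul(v, c) == 1)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = _build_sbox()

def aes_enc_round(state, key):
    """One AES encryption round on 16 bytes in column-major order:
    SubBytes, ShiftRows, MixColumns, then XOR with the round key."""
    sub = [SBOX[x] for x in state]
    # ShiftRows: row r of output column c comes from input column (c + r) mod 4.
    shifted = [sub[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]
    out = []
    for c in range(4):
        col = shifted[4 * c:4 * c + 4]
        for r in range(4):
            out.append(_gmul(col[r], 2) ^ _gmul(col[(r + 1) % 4], 3)
                       ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return [o ^ k for o, k in zip(out, key)]
```

With $C1 = C2 = 0$, the FSM would call this function with an all-zero `key`, so the round reduces to SubBytes, ShiftRows and MixColumns.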
Then the initialization consists of 16 steps where the cipher is updated in the same way as in the running-key mode, with the exception that the 128-bit output $z$ is not an output but is xored into the LFSR structure to positions $(a_{15}, a_{14}, \ldots, a_8)$ in every step. Additionally, at the two last steps of the initialization phase, we xor the key into the $R1$ register, inspired by [HK18]. We also limit the keystream length to a maximum of $2^{64}$ for a single pair of key and IV vectors, and each key may be used with a maximum of $2^{64}$ different IV vectors.
The pseudocode in Algorithm 1 clarifies the procedure.
Algorithm 1: SNOW-V initialization
This completes the description of SNOW-V, and the full algorithm can be summarized in the pseudocode in Algorithms 2, 3, and 4.

Security analysis
The main and most important design criterion is the security of the design. This section contains a brief analysis for a number of possible standard attack approaches. Before going into the details of various attacks, we need to have a clear picture of the expected security. We have the target of providing 256-bit security in SNOW-V, by which we mean that we claim that the total cost of finding the secret key given some keystreams is not significantly smaller than $2^{256}$ simple operations. The use of the algorithm is limited to keystreams of length at most $2^{64}$ and we also limit the number of different keystreams that are produced for a fixed key to be at most $2^{64}$. There seem to be no use cases where it makes sense to violate this limitation. Although attacks beyond these limits are certainly of academic interest, an attack claiming to break the cipher should meet this requirement.
We also frequently compare with AES-256 in the GCM mode. We note that exhaustive key search of AES-256 requires computational cost around $2^{256}$. However, if used in the GCM mode, it actually takes complexity (and data) around $2^{64}$ to distinguish such keystreams from random. For SNOW-V, we claim that the security is never worse than the security of AES-256 in the GCM mode, for any kind of attack on the algorithmic level.

Initialization attacks through MDM/AIDA/cube attacks
Stream ciphers always have an initialization phase before producing keystream bits, during which the key and IV are loaded and a number of rounds (in the SNOW-V case, we use 16 rounds) are processed to fully mix the key and IV until the state becomes random-like. It should be difficult for the cryptanalyst to predict the generated keystream or to get some information about the initial key according to the output after initialization. It is therefore vital to make sure that the key/IV loading has no fatal flaws and that the number of initialization rounds is chosen carefully, so as to avoid both wasted resources (too many rounds) and weaknesses (too few rounds).
A chosen IV attack is one type of attack targeting this problem [Mj06,EJT07], in which the adversary attempts to build a distinguisher that exposes randomness failures in the output by selecting and running through certain IV values. The rationale behind this idea is that: 1) the cipher can be regarded as a succession of "black box" Boolean functions $f_i$ with the keystream as the output and key/IV as the input, and 2) any monomial coefficient in the algebraic normal form (ANF) representations of these Boolean functions should appear to be 1 (or 0) with probability 1/2 if $f_i$ is drawn uniformly at random (see [Sta13] for more details). In this attack, the adversary fixes the key and a subset of IV bits and runs through all possible values of the non-fixed IV bits. The truth tables of the Boolean functions can be obtained after that, which are further used to compute the monomial coefficients in the ANF and compared with expected values. The best and most commonly used monomial is the maximum degree monomial (MDM) and the corresponding test is called the MDM test. In [Sta10] one even allows setting arbitrary key values to build a non-randomness detector to further check whether the initialization is robust enough. It should be noted that the MDM test and AIDA (algebraic IV differential attack)/cube distinguishers [Vie07,DS09] are various forms of using higher order differentials [Lai94] on stream ciphers.
We employ the greedy MDM test algorithm in [Sta10] to test the SNOW-V initialization. We start with the worst 3-bit set, under which the randomness result deviates the most from the expected value, and gradually increase to a 24-bit set. Every time we add one more bit from the remaining bits, we select the bit leading to the worst randomness result. We continue in this way until we get a 24-bit set (sets with more bits can be tested on more powerful computers). Figure 5 shows the maximum number of initialization rounds failing the MDM test under different bit set sizes. The results for 1, 2 and 3-bit sets are based on exhaustive search, while for the sets with other sizes, the results are based on a greedy search from the initial worst 3-bit set. It can be seen that roughly the first 7 rounds out of 16 fail the MDM test. One can also note that the number of rounds that the MDM test can detect grows very slowly with the size of the set of key/IV bits that are exhausted. In an attack, one could consider sets of sizes up to 64 bits. This indicates that the 16 initialization rounds in SNOW-V should be enough for the cipher and that the output of the cipher has become random-like after the initialization. It also indicates that significantly reducing the number of rounds might be dangerous.
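The core computation of the MDM test can be illustrated on a toy Boolean function. The sketch below is not the greedy algorithm of [Sta10]; it only shows how XORing an output bit over all values of a chosen bit set yields the coefficient of the maximum degree monomial in those bits. The function name `mdm_coefficient` and the toy function `toy` are hypothetical.

```python
from itertools import product

def mdm_coefficient(f, cube, fixed=0):
    """XOR f over all assignments of the bit positions in `cube`, with every
    other input bit taken from `fixed`. The result is the ANF coefficient of
    the maximum degree monomial in the cube bits (for the fixed remainder)."""
    acc = 0
    for values in product((0, 1), repeat=len(cube)):
        x = fixed
        for pos, bit in zip(cube, values):
            x = (x & ~(1 << pos)) | (bit << pos)
        acc ^= f(x) & 1
    return acc

def toy(x):
    """Toy function x0*x1*x2 + x0*x3 + x1 over GF(2)."""
    b = [(x >> i) & 1 for i in range(4)]
    return (b[0] & b[1] & b[2]) ^ (b[0] & b[3]) ^ b[1]

# The monomial x0*x1*x2 is present in the ANF, x2*x3 is not.
print(mdm_coefficient(toy, (0, 1, 2)), mdm_coefficient(toy, (2, 3)))
```

For a well-mixed cipher output bit, such coefficients should look like fair coin flips across many output positions; a coefficient that is always 0 (or always 1) for some bit set signals that the corresponding number of initialization rounds fails the test.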

Other initialization attacks
Another attack possibility is to launch a differential attack, either in the IV bits only, or in combination with key bits. The latter would then lead to a related-key attack. Since the initialization contains 16 rounds, each including two applications of the AES encryption round function, the differential would have to go through a lot of highly nonlinear operations, which makes this approach unlikely to succeed.
Finally, a further option is the slide attack [BW99]. Such sliding properties have been considered on previous versions in the SNOW family [KY11]. The idea is to have the same initial state for two different key/IV pairs in different time instances. Then they will produce the same keystream with the difference of a shift in time. Since the required IV values vary with the choice of key bits, it is questionable whether such an approach is useful at all in cryptanalysis, but at least it indicates that the cipher is not to be considered as a random function of both the key and IV. For SNOW-V such properties would still be much more difficult to find, due to the update of 128-bit blocks in each time instance and the use of the FP(1)-mode [HK18] in the initialization.

Time/Memory/Data tradeoff attacks
A Time/Memory/Data tradeoff (TMD-TO) attack is a generic method of inverting ciphers by balancing between spent time, required memory and obtained data, which can be much more efficient and applicable than an exhaustive key search attack. Some stream ciphers are vulnerable to TMD-TO attacks, and their effective key lengths (e.g., $n$ bits) could then be reduced towards the birthday bound (i.e., $n/2$), typically happening if the state size is small. A well known such attack on A5/1 was given in [BSW01].
The TMD-TO attacks have two phases: a preprocessing phase, during which the mapping table from different secret keys or internal states to keystreams is computed and stored with time complexity $P$ and memory $M$; and a real-time phase, when attackers have intercepted $D$ keystreams and search them in the table with time complexity $T$, expecting to get some matches and further recover the corresponding input. By balancing between parameters $P$, $D$, $M$, and $T$ under some tradeoff curves, attackers can launch attacks according to their available time, memory and data resources. The most popular tradeoffs are the Babbage-Golic (BG) [Bab95,Gol97] and Biryukov-Shamir (BS) [BS00] tradeoffs, with curves $TM = N$, $P = M$ with $T \le D$; and $TM^2D^2 = N^2$, $P = N/D$ with $T \ge D^2$, respectively, where $N$ is the size of the input space. Attackers can try to reconstruct the internal state at a specific time or recover the secret key.
The rationale behind the TMD-TO attacks that try to reconstruct the internal state is that in many stream ciphers, the internal state update process is invertible, which means that if an attacker manages to reconstruct an internal state at any specific time, it can not only obtain subsequent new keystreams by running the cipher forwards, but also recover previous states iteratively and further get the underlying secret key by running backwards. But for the SNOW-V case, attackers have no obvious ways to reconstruct the internal state, since SNOW-V has a large internal state with 896 bits (2 × 256-bit LFSRs + 3 × 128-bit registers), which is 3.5 times the secret key length. The best attack complexity achieved is under the BG tradeoff with point $T = M = D = N^{1/2} = 2^{448}$, which is still much worse than the exhaustive key search attack. Actually, SNOW-V satisfies the rule derived from TMD-TO attacks in [Gol97] and widely applied in the design of new ciphers, that the size of the internal state should be at least twice the size of the secret key to get the expected security level.
Moreover, in SNOW-V, attackers would gain little even if they reconstructed an internal state. While computing subsequent new keystreams corresponding to that specific IV is still possible, they cannot trivially recover the secret key or keystreams under other IV values. This is due to the key masking of the register $R1$ in the last two rounds of initialization, which is an instantiation of the FP(1)-mode introduced in [HK18].
Attackers can also try to recover the secret key directly. To do so, some mappings from different key/IV pairs to generated keystream segments are first pre-computed and stored [HS05,DK08]. If attackers get some keystream data under different secret keys corresponding to these IV values, they can search them in the table, hoping for a collision, and further recover some of the secret keys directly. The tradeoff curves are the same as those for recovering the internal state, except that $N$ is now the size of the set of all possible $(K, IV)$ pairs. In the SNOW-V case, the sizes of the key and IV spaces are $2^{256}$ and $2^{128}$, respectively. The typical points for BG and BS attacks are $T = D = M = 2^{192}$ and $T = 2^{256}$, $D = M = 2^{128}$, which are both unrealistic to achieve in practice. One might object that the effective size of the key in the first tradeoff is reduced from 256 to 192 bits, but actually, no cipher, including AES-256, can be immune to this as long as its IV size is smaller than its key size. In any case, the corresponding multikey attacks on AES-256 are not more costly.
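The tradeoff points quoted in this subsection can be checked with a few lines of arithmetic on exponents. The sketch below works with base-2 logarithms and uses the standard forms of the curves, $TM = N$ for BG and $TM^2D^2 = N^2$ with $T \ge D^2$ for BS; the helper names are hypothetical.

```python
from fractions import Fraction

def bg_balanced_point(log2_n):
    """Babbage-Golic curve TM = N with T = M = D: each exponent is log2(N)/2."""
    return Fraction(log2_n, 2)

def on_bs_curve(log2_t, log2_m, log2_d, log2_n):
    """Biryukov-Shamir curve T * M^2 * D^2 = N^2 with the constraint T >= D^2."""
    on_curve = log2_t + 2 * log2_m + 2 * log2_d == 2 * log2_n
    return on_curve and log2_t >= 2 * log2_d

# State recovery: N is the number of internal states.
state_bits = 2 * 256 + 3 * 128          # two LFSRs plus three FSM registers
key_bits, iv_bits = 256, 128

print(state_bits, Fraction(state_bits, key_bits))     # state size and ratio to key size
print(bg_balanced_point(state_bits))                  # exponent of T = M = D
# Key recovery: N is the number of (K, IV) pairs.
print(bg_balanced_point(key_bits + iv_bits))          # exponent of T = D = M
print(on_bs_curve(256, 128, 128, key_bits + iv_bits))
```

The state-recovery exponent is far above 256, while the key-recovery points depend only on the key and IV sizes, which is why the same figures apply to any cipher with a 256-bit key and a 128-bit IV.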

Linear distinguishing attacks and correlation attacks
Traditionally, the main threat against stream ciphers has been various types of linear attacks, either in the form of distinguishing attacks on the keystreams, or state recovery attacks through correlation attacks. The basic foundations of correlation attacks can be found in papers like [CJS01,CJM02] and an overview of distinguishing attacks is to be found in [HJB09].
The basic technique for these types of attacks is to use linear approximations of the nonlinear operations used in the cipher and then derive a linear relationship between output values from different time instances. Such a relationship will then hold only as a very rough approximation, which in turn can be thought of as a linear function of some given output bits being considered as a sample drawn from a nonuniform distribution. This approach may give a distinguishing property for the keystream. If the relationship also involves state bits, the same arguments may give samples that are highly noisy observations of state bits, which in turn may be linear combinations of the original initial state. This may give a way to recover the state and that is the foundation of a correlation attack.
For SNOW 2.0, several distinguishing attacks and correlation attacks have been proposed [NW06,ZXM15]. The basic idea has been to approximate the FSM part through linear masking and then to cancel out the contributions of the registers by combining expressions for several keystream words. We should note that this kind of attack tends to require an extremely long keystream. Also, no significant attack of this type on SNOW 3G has been published. We now consider a similar approach for making some basic arguments on SNOW-V.
Since we always set $C1 = C2 = 0$ we can simplify the notation of the output function and the update by dropping the round key argument and writing $AES^R(X)$. A linear approximation of the FSM would then try to cancel out the contribution from the registers, leaving keystream symbols and the LFSR contribution. Assume that the value of the registers at some time $t$ is $(\hat{R}1, \hat{R}2, \hat{R}3)$. The keystream block at time $t$ and, after one update of the registers, the next keystream block at time $t + 1$ can then be written in terms of these values and the tap values. Let us now consider $\boxplus_{32}$ being approximated by $\oplus$ and the $AES^R(X)$ operation approximated as $X \cdot M$ for some $128 \times 128$ binary matrix $M$. Then we could express the keystream blocks as linear functions of the register values, the tap values and noise terms. Here each random variable $N_i$ represents the noise introduced by approximating the $i$th $\boxplus_{32}$ by writing $\boxplus_{32} = \oplus + N_i$. Similarly, $N'_i$ represents the noise introduced by writing $AES^R(X) = X \cdot M + N'_i$ for the $i$th approximated AES round function. These expressions can be rewritten in matrix form, where $T$ is the contribution from the $T1$, $T2$ values and $N$ is the sum of all noise values $N_i$, $N'_j$.
Examining the matrix, one sees that it is not possible to have reduced rank for any meaningful approximation matrix $M$. So we conclude that, based on the direct approach of approximation as above, it would require 4 consecutive keystream blocks in order to cancel the contribution from the registers in the FSM. Such an approach would then involve even more noise variables, and one of them will have the form $N_j \cdot M^2$, where $N_j$ is the noise from approximating one $\boxplus_{32}$ with $\oplus$. Such a linear approximation of the FSM could then be used in a correlation attack. However, since such an attack would need to use a combination of several linear approximations from different time instances that would add the corresponding noise, it does not seem to be a fruitful way of attacking the cipher as the noise will be very strong. If one would devise a distinguishing attack, one would instead have to cancel the contribution from the LFSR part, which again will give a noise very close to the uniform distribution. We do not see a path to identify a strongly biased approximation in this way.

Algebraic attacks
In an algebraic attack the attacker derives a number of nonlinear equations in either unknown key bits or unknown state bits and solves the system of equations. In general, the problem of solving a system of nonlinear equations is not known to be solvable in polynomial time (even for quadratic equations), but some special cases might be solved efficiently [CKPS00].
For SNOW 2.0 there was a very interesting algebraic attack on a simplified version, given in [BG05]. However, due to the use of three FSM registers instead of two, applying such an approach on SNOW-V does not give such a nice quadratic system as in [BG05].
So for a general algebraic attack, we should either target the key or the state. For the latter, one would need to use equations from 7 keystream blocks to be able to solve for the 7 · 128-bit internal state. That would involve nonlinearity from 11 AES encryption round functions and 13 $\boxplus_{32}$ operations. Instead, targeting the key bits would require stepping through the equations of the 16 initialization rounds together with the equations of two keystream blocks. Both these approaches give systems of nonlinear equations that appear to be much more difficult to solve than the corresponding equations for AES-256. This is due to the use of the $\boxplus_{32}$ operation.

Guess-and-determine attacks
In a guess-and-determine attack one guesses part of the state and, from the keystream equations, determines the value of other parts of the state. The goal is to guess as few bits as possible and determine as many as possible through keystream equations. For the case of SNOW-V, the equation $z^{(t)} = (R1^{(t)} \boxplus_{32} T1^{(t)}) \oplus R2^{(t)}$ involves three unknown values, each of size 128 bits. In order to determine some state bits, one then has to guess two of them, i.e. guessing 256 bits. Then looking at the equation for $z^{(t+1)}$, it would require the guess of one more 128-bit value. This indicates that a guess-and-determine attack would not be successful.
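The counting argument above can be made concrete with a small sketch of the output equation. Here 128-bit values are Python integers, the lane-wise addition treats them as four independent 32-bit words, and the function names (`add32`, `keystream_block`, `determine_r2`) are hypothetical. The point is only that a single keystream block is consistent with every guess of two of the three unknowns.

```python
MASK32 = 0xFFFFFFFF

def add32(x, y):
    """Add two 128-bit integers as four independent 32-bit words:
    carries do not propagate from one 32-bit lane to the next."""
    out = 0
    for lane in range(4):
        shift = 32 * lane
        s = ((x >> shift) & MASK32) + ((y >> shift) & MASK32)
        out |= (s & MASK32) << shift
    return out

def keystream_block(r1, r2, t1):
    """Output equation z = (R1 [+]32 T1) xor R2."""
    return add32(r1, t1) ^ r2

def determine_r2(z, r1_guess, t1_guess):
    """Given z and guesses for R1 and T1 (256 guessed bits), R2 is determined."""
    return z ^ add32(r1_guess, t1_guess)

# Any pair of guesses yields some R2 that explains the observed block.
z = keystream_block(0x0123456789ABCDEF << 64, 0xFEDCBA9876543210, 0xFFFFFFFF)
r2 = determine_r2(z, r1_guess=42, t1_guess=7)
print(keystream_block(42, r2, 7) == z)
```

Because every guess is consistent, one keystream block alone does not filter out any of the $2^{256}$ candidates, which is the reason further equations (and further guesses) are needed.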

Other attacks
We have not made any specific design choices to explicitly support implementations that should protect against side-channel attacks and fault attacks. So such attacks, if relevant for an application, have to be considered when the algorithm is implemented. In particular, information leakage from the CPU in a software implementation must be carefully considered.

Hardware implementation aspects
When designing new algorithms targeting existing systems, reusability of hardware components is important to reduce area and cost of the ASICs. Many systems dealing with network communication security implement some form of AES acceleration, either in a specialized ASIC or as specialized CPU instructions. SNOW-V leverages this co-existence by using two full AES encryption rounds as the main nonlinear element. A hardware implementation of SNOW-V can utilize either one or two external AES cores, if present, or implement its own AES encryption rounds in a stand-alone design for maximum speed. Although a 128-bit implementation is straightforward from the algorithm description, it has some drawbacks when we only have one single external AES core available, as is the case in many constrained implementations. In this section we will consider how to implement SNOW-V using a single AES core with a 64-bit hardware architecture. We will refer to the 64-bit and 128-bit hardware implementations as 64-SNOW-V and 128-SNOW-V, respectively.

SNOW-V 64-bit Hardware Architecture
In this section we propose a 64-bit hardware architecture where SNOW-V requires a single AES encryption core (external or built-in), and each clocking of 64-SNOW-V produces 64 bits of the keystream.
Cons: an additional 64-bit delay register $D$ is needed; the logic needs 5 additional 64-bit multiplexers; two clocks are needed to produce 128 bits of keystream, which halves the speed.
Pros: a single AES encryption core is needed; 64 bits of keystream are produced at each clock; all basic operations in both FSM and LFSR, such as XOR and ADD, are now halved in size.
In order to utilize a single AES core the FSM update function should be split into two steps. The main critical path is the AES encryption round, which means that while splitting the FSM into two stages we should avoid any extra logic on the input and output signals of the AES core. Thus, input to and output from the AES core must be registers.
Let us split all 128-bit registers and all 128-bit signals of the FSM block, say $X$, into two 64-bit halves as $X_a$ (low) and $X_b$ (high). We also assume that the tap values $T1$ and $T2$ from the LFSRs arrive in 64-bit chunks, such that on every even clock the FSM gets $T1_a$ and $T2_a$, and on every odd clock $T1_b$ and $T2_b$.
In Figure 6 we propose a possible way to split the FSM such that it contains the two circuits for even and odd steps, 0 and 1, respectively (excluding the gates needed for initialization). One can notice that after these two steps the contents of the registers $R1$, $R2$, $R3$ are updated to new 128-bit values $R1'$, $R2'$, $R3'$, and are ready to process the next 128 bits of data with the same two steps. The above two circuits are then combined into a single circuit using multiplexers.
In Figure 7 the complete hardware architecture for 64-bit SNOW-V is presented. There are 6 64-bit multiplexers in total, and we denote the control signals to them by $M_1, \ldots, M_6$, respectively. There are also 5 64-bit AND gates, the purpose of which is to either bypass the signal or block it. Those AND blocks are controlled by four signals $G_A$, $G_Z$, $G_K$, $G_F$, the latter controlling two of the 64-bit AND gates.
Critical path. Our primary assumption is that the AES encryption round would be the main critical path (MCP). However, one can easily determine that the secondary critical path (SCP) would be the sequence MUX-ADD-XOR-AND-XOR over 2x32-bit integers, denoted by red wires in Figure 7. Thus, when selecting 32-bit adders one should make sure that they are fast enough so that the MCP is sustained.
The algorithm has 3 stages:
Stage 1 - Loading. The design is constructed in such a way that the registers do not need to have any RESET signal. Instead, all registers will be sequentially loaded with the key and IV, and the remaining registers will be zeroized, during this stage.
The stage begins with a strobe signal on LOAD, and the first 64-bit chunk of data is expected on the IN DATA bus. In total, the stage expects to receive 8 64-bit words, one per clock, in the following order: $\{iv_0, iv_1, k_0, k_1, 0, 0, k_2, k_3\}$.
In this stage, the control unit should block the AND gates by setting $G_Z = G_A = 0$, and set $M_6 = 1$, in order to concatenate LFSRs A and B into a single large LFSR while shifting in the initialization data. In order to zeroize the FSM registers, the control unit should block $G_F = 0$ and also enforce the multiplexer inputs $M_4 = 1$, $M_5 = 0$. $G_K$ is set to 0.
After the 8 clocks where the key and IV are loaded, we proceed to stage 2.
Stage 2 - Initialization. In this stage, the FSM works in the same way as when it produces keystream output symbols, i.e. the multiplexer control signals switch according to the even/odd clock cycle as explained previously. The LFSRs are connected together by setting $G_Z = G_A = 1$ and switching $M_6 = 0$ to disable any external input.
Note that we placed the AND gating after the registers $R3_a$, $R3_b$, so that we do not add extra depth to the critical path of the AES core, hence these registers will not be zeroized. To overcome this problem the control unit generates $G_F = 0$ in the first clock of this stage, and then sets $G_F = 1$ until the end of stage 2. We keep $G_K = 0$ for the first 28 clocks. In the remaining 4 clocks we need to XOR the key $K$ into $R1$ according to the initialization procedure. So we enable $G_K = 1$ and expect to receive $\{k_0, k_1, k_2, k_3\}$ consecutively from the input bus IN DATA. After this, the circuit is ready to produce keystream words.

Stage 3 - Keystream generation. Both the LFSR and the FSM operate normally. In this stage the control unit detaches the Z signal from being fed into LFSR-A by setting $G_Z = 0$. The input bus is also detached by setting $G_K = M_6 = 0$.
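The three stages can be summarized as a small control-word schedule. The following is a minimal sketch, not the control unit of Figure 7: the type `snowv_ctrl`, the function `snowv_ctrl_at`, and the clock counting from the LOAD strobe are hypothetical names and conventions introduced for illustration. The even/odd toggling of $M_1, \ldots, M_5$ is omitted, and the values of $G_A$ and $G_F$ in stage 3 are assumptions, since the text only states that both parts "operate normally".

```c
#include <stdbool.h>

/* Control signals named in the text (M1..M5 toggling is not modelled). */
typedef struct {
    bool g_a, g_z, g_k, g_f; /* AND-gate enables: 1 = pass, 0 = block */
    bool m6;                 /* 1 = shift IN DATA into the joined LFSRs */
} snowv_ctrl;

enum { LOAD_CLOCKS = 8, INIT_CLOCKS = 32, KEY_MASK_CLOCKS = 4 };

/* Control word for clock t, with t = 0 at the LOAD strobe (assumed convention). */
static snowv_ctrl snowv_ctrl_at(unsigned t)
{
    snowv_ctrl c = { false, false, false, false, false };

    if (t < LOAD_CLOCKS) {
        /* Stage 1: all gates blocked, key/IV/zero words shifted in. */
        c.m6 = true;
    } else if (t < LOAD_CLOCKS + INIT_CLOCKS) {
        /* Stage 2: LFSRs coupled, Z fed back, external input disabled. */
        unsigned s = t - LOAD_CLOCKS;
        c.g_a = true;
        c.g_z = true;
        c.g_f = (s != 0);                             /* 0 only in the first clock */
        c.g_k = (s >= INIT_CLOCKS - KEY_MASK_CLOCKS); /* key XOR in last 4 clocks  */
    } else {
        /* Stage 3: Z and the input bus detached (G_Z = G_K = M6 = 0). */
        c.g_a = true; /* assumption: LFSRs stay coupled            */
        c.g_f = true; /* assumption: FSM feedback remains enabled  */
    }
    return c;
}
```

With this counting, the key words $k_0, \ldots, k_3$ are expected on clocks 36 to 39, and the first keystream clock is clock 40.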

Theoretical Analysis of 64/128-bit SNOW-V in Hardware
The area will be estimated in terms of gate equivalents (GE), where 1 GE is the size of a NAND gate. The speed will be estimated in Gigabits per second (Gbps), based on known speed results of AES circuits. We will use the GE values given in [Sam00] for 1-speed technology elements.
For comparison with AES, we will use one of the more recent results from [UMHA16] where an area-speed optimized AES-128 (10 rounds) on NanGate 15nm technology runs with the speed 71.19 Gbps and has the area 17232 GE.This means that having the same design, AES-256 (14 rounds) would run with the speed of 50.85 Gbps.
Our basic assumption is that the AES core is the critical path of the SNOW-V circuit. Thus, if SNOW-V utilizes a single AES core as above, the speed of 64-SNOW-V could be as high as 356 Gbps. The speed of 128-SNOW-V with two AES cores could then be as high as 712 Gbps. What remains is to calculate the hardware cost of SNOW-V, excluding the external AES core, but including the cost of integration into that external AES core. We will also exclude the control unit, as it can be implemented with very few gates and latches, and every implementation will have slightly different needs for control and ready signaling.
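The speed figures above follow from simple scaling of the 71.19 Gbps AES-128 result. The sketch below reproduces the arithmetic under two assumptions that are ours: throughput scales inversely with the number of rounds, and SNOW-V emits one word (64 or 128 bits) per AES-round delay. The function names are hypothetical.

```c
/* Throughput of a round-iterated AES design rescaled to another round count. */
static double scale_rounds(double gbps, int rounds_from, int rounds_to)
{
    return gbps * rounds_from / rounds_to;
}

/* SNOW-V throughput when one AES-round delay yields bits_per_clock output bits.
 * AES-128 processes 128 bits in 10 round delays, which fixes the round rate. */
static double snowv_gbps(double aes128_gbps, int bits_per_clock)
{
    double giga_rounds_per_s = aes128_gbps * 10.0 / 128.0;
    return giga_rounds_per_s * bits_per_clock;
}
```

For example, `scale_rounds(71.19, 10, 14)` gives about 50.85 Gbps for AES-256, while `snowv_gbps(71.19, 64)` and `snowv_gbps(71.19, 128)` give about 356 and 712 Gbps.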
State Registers. For 64-SNOW-V, there are 512 registers for the LFSR and 6x64+64 registers for the FSM. Since our 64-bit implementation does not require complex latches (e.g., no RESET), we can use the simplest D-latch with Q-output only from [Sam00] [FD1Q]. The total cost is 960 * 4.33 = 4157 GE.
For the 32-bit arithmetic adders we suggest using, for example, a Han-Carlson 32-bit adder, as it has a low area overhead (15%-25% larger than Ripple-Carry adders) and a very small delay of O(log(n)), which is important in order to keep the critical path upper bounded by the AES round function. We can estimate these components as 4 x (30 FADD3 + 2 HADD2) + 20% = 4 * (30 * 6.33 + 2 * 3.67) * 1.20 = 947 GE for 64-SNOW-V and 1894 GE for 128-SNOW-V.
LFSR Update logic involves two circuits for the feedback functions. The 16-bit field multiplications by $\alpha$, $\alpha^{-1}$, $\beta$, $\beta^{-1}$ can be done with 8 XORs in each case, since the Hamming weight of both $g^A(\alpha)$ and $g^B(\beta)$ is 8.
However, let us have a closer look at how each bit of, e.g., $a_{16}$ is calculated. Each bit $a_{16}[i]$, $14 \ge i \ge 1$, depends unconditionally on four bits, namely

$$a_{16}[i] = b_0[i] \oplus a_0[i-1] \oplus a_1[i] \oplus a_8[i+1]. \qquad (11)$$

The end bits are easy to work out too. Some of the bits of $a_{16}$ also depend on $a_0[15]$ and $a_8[0]$, due to the multiplication with $\alpha$ and $\alpha^{-1}$. Table 2 gives a full overview of the dependencies for both $a_{16}$ and $b_{16}$.

This means that in order to compute $a_{16}[i]$, we have to XOR 4, 5, or 6 different input bits. For example, in Table 2 we see that $a_{16}[13]$ depends only on the basic input bits in Equation 11, and the XOR gate needs 4 inputs:

$$a_{16}[13] = b_0[13] \oplus a_0[12] \oplus a_1[13] \oplus a_8[14].$$

On the other hand, $a_{16}[11]$ needs to XOR 6 inputs:

$$a_{16}[11] = b_0[11] \oplus a_0[10] \oplus a_1[11] \oplus a_8[12] \oplus a_0[15] \oplus a_8[0],$$

since the multiplications with $\alpha$ and $\alpha^{-1}$ both influence that bit.
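The number of XOR inputs per feedback bit can be recomputed from the field polynomials alone. The sketch below is ours and rests on stated assumptions: bit $i$ picks up $a_0[15]$ exactly when bit $i$ of gA is set, and picks up $a_8[0]$ exactly when bit $i$ of the constant for $\alpha^{-1}$ is set, where that constant is derived as (gA with the $\alpha^{16}$ term restored) shifted right by one. The end bits 0 and 15 are assumed to have three unconditional inputs instead of four. The names `inv_const` and `xor_inputs` are hypothetical.

```c
#include <stdint.h>

#define G_A 0x990fu /* g^A(alpha) without the alpha^16 term, from the text */
#define G_B 0xc963u /* g^B(beta) without the beta^16 term, from the text   */

/* Constant XORed in when dividing by the field generator:
 * (x^16 + g) / x, valid because the constant term of g is 1. */
static uint16_t inv_const(uint16_t g)
{
    return (uint16_t)((0x10000u | g) >> 1);
}

/* Number of input bits XORed to form bit i (0..15) of the feedback word. */
static int xor_inputs(uint16_t g, int i)
{
    int n = (i == 0 || i == 15) ? 3 : 4;  /* unconditional inputs          */
    n += (g >> i) & 1;                    /* top bit of word 0 via alpha    */
    n += (inv_const(g) >> i) & 1;         /* low bit of word 8 via alpha^-1 */
    return n;
}
```

Under these assumptions `xor_inputs(G_A, 13)` is 4 and `xor_inputs(G_A, 11)` is 6, matching the two examples in the text; calling the function with `G_B` gives the corresponding counts for $b_{16}$.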
Integration into an external AES Engine requires input multiplexers for 128 bits of the plaintext and 128 bits of the round key. However, the AES round keys C1 and C2 are zero, so we can use 128 AND gates for the round key instead. In total we get 128 MUX2 + 128 AND2 = 128 * (2.33 + 1.33) = 468 GE for 64-SNOW-V. 128-SNOW-V requires two such integration circuits.
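The three area figures stated explicitly so far can be rechecked with a few lines of arithmetic. This is only a restatement of the numbers quoted from [Sam00] in the text; the macro and function names are hypothetical, and the LFSR update logic is not included because no closed-form figure is given for it here.

```c
/* Gate-equivalent costs quoted in the text from [Sam00]. */
#define GE_FD1Q  4.33 /* D-latch with Q output only */
#define GE_FADD3 6.33 /* full adder                 */
#define GE_HADD2 3.67 /* half adder                 */
#define GE_MUX2  2.33 /* 2-input multiplexer        */
#define GE_AND2  1.33 /* 2-input AND gate           */

/* State registers: one latch per state bit. */
static double ge_registers(int bits)
{
    return bits * GE_FD1Q;
}

/* n 32-bit adders, each 30 full + 2 half adders, plus 20% Han-Carlson overhead. */
static double ge_adders(int n)
{
    return n * (30 * GE_FADD3 + 2 * GE_HADD2) * 1.20;
}

/* One integration circuit: 128 MUX2 for the data and 128 AND2 for the round key. */
static double ge_aes_integration(void)
{
    return 128 * (GE_MUX2 + GE_AND2);
}
```

For 64-SNOW-V, `ge_registers(960)`, `ge_adders(4)` and `ge_aes_integration()` round to 4157, 947 and 468 GE, respectively.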
In case we decide to implement SNOW-V with its own internal AES EncRound, the hardware cost could be as small as 16 AES SBoxes, plus some logic for MixColumn. Also note that in this case the critical path decreases, since we only need the forward SBox and thus any outer multiplexing logic for a combined forward and inverse SBox can be removed. This could lead to a potential speed-up for 128-SNOW-V.
The part MixColumn of AES encryption round, applied to the AES state {r i,j } for 0 ≤ i, j ≤ 3, is the following matrix multiplication.
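For reference, the standard AES MixColumn [oST01] multiplies each state column by the circulant matrix with first row (02, 03, 01, 01) over GF($2^8$) modulo $x^8 + x^4 + x^3 + x + 1$. The following minimal sketch applies it to one column; the function names are ours, and how the column bytes map onto the indices of $\{r_{i,j}\}$ is left open here.

```c
#include <stdint.h>

/* Multiply by x (i.e. by 02) in GF(2^8) modulo the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v >> 7) * 0x1b));
}

/* In-place MixColumn on a single 4-byte column. 03*v is computed as 02*v ^ v. */
static void mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];

    c[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    c[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}
```

Since multiplication by 02 costs only a shift and a conditional XOR, and 03 adds one more XOR, the whole operation reduces to XOR gates in hardware.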
Summarizing the above, we can derive the comparison given in Table 3.

Software implementation aspects
One important change in future telecom networks is the virtualization of the network functions. This puts new requirements on the crypto algorithms used to protect the traffic, in that they need to execute fast in a pure software implementation on modern CPUs. According to [ITU17], the minimum requirements related to the 5G radio interface are 10 Gbps uplink and 20 Gbps downlink, at peak data rates. Classical encryption algorithms cannot reach these speeds in pure software without any hardware support. Nowadays, most CPU vendors provide large registers and vectorized SIMD instructions, such as the AVX2 set of instructions (intrinsics), which operate on registers of up to 256 bits. Typical instructions include such functions as XOR, AND, nADD32, etc., applied to long registers, where, depending on the instruction, a single register can be treated as a vector of 8/16/32/64-bit values.
AES is one of the most widely used crypto algorithms and it has received special support from CPU vendors in the form of SIMD instructions (AES-NI for Intel) that make it possible to execute AES quite fast even on user-grade laptops. The ciphers SNOW 3G and ZUC, standardized in 4G, and other ciphers (to our knowledge) cannot reach speeds close to that of AES when AES-NI is used.
SNOW-V is designed to perform very fast in software, with the aim of utilizing currently available SIMD instructions. However, even without AES-NI, SNOW-V can be implemented quite efficiently with 16 64-bit registers. Our take-away is that if a given platform supports AES-NI then other SIMD instructions are also likely to be supported. If AES-NI is not available then AES-256 will be much slower than SNOW-V, and actually slower than SNOW 3G as well. This section is written in Intel intrinsics notation, but similar implementations can likely be made on other CPUs, e.g. AMD and ARM. A comprehensive guide to Intel's intrinsics can be found in [Int18].
The FSM part of SNOW-V is quite straightforward to implement using 128-bit registers __m128i and the AES-NI intrinsic function _mm_aesenc_si128(). For the 4 parallel arithmetic additions one can use _mm_add_epi32().
The key to an efficient implementation of the LFSRs is choosing the right data structures. We propose to store the content of the two LFSRs in two 256-bit registers __m256i hi, lo, such that:

$$\mathtt{hi} = (b_{15}, \ldots, b_8, a_{15}, \ldots, a_8), \qquad \mathtt{lo} = (b_7, \ldots, b_0, a_7, \ldots, a_0).$$

To perform a single LFSR update (8 steps), we only need to calculate new values for one register, hi_new=update(lo, hi), while the other register update is a copy, lo_new=hi.

Let gA=0x990f represent the generating polynomial $g^A(\alpha)$ of the field $\mathbb{F}^A_{2^{16}}$, without the term $\alpha^{16}$. Then multiplication of x by $\alpha$ in $\mathbb{F}^A_{2^{16}}$ can be done as follows: we first shift x<<1, then, based on bit 15 of the original x, we XOR the result with gA. This may be done with only 4 instructions, using 16-bit values:

    mul_alpha(uint16 x, uint16 gA) := (x<<1) xor ( ((signed int16)x >> 15) and gA)

Note that the condition whether to XOR with gA or not is implemented with the help of the 16-bit mask = (signed int16)x >> 15, where the mask is created by the arithmetic shift of the signed x to the right by 15 positions. The arithmetic right shift propagates the sign bit (bit 15), thus forming the mask 0xffff if bit 15 was 1, and 0x0000 otherwise.
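The mask trick can be written as a portable scalar C function. This is a minimal sketch rather than the reference implementation: it forms the mask by negating bit 15 instead of using an arithmetic right shift, because right-shifting a negative signed value is implementation-defined in C. The result is the same 0xffff/0x0000 mask described above.

```c
#include <stdint.h>

#define G_A 0x990fu /* g^A(alpha) without the alpha^16 term, from the text */

/* Multiply x by the field generator: shift left by one, then reduce by g
 * if bit 15 of the original x was set. */
static uint16_t mul_alpha(uint16_t x, uint16_t g)
{
    /* 0xffff if bit 15 of x is set, 0x0000 otherwise. */
    uint16_t mask = (uint16_t)(0u - (unsigned)(x >> 15));

    return (uint16_t)((x << 1) ^ (mask & g));
}
```

For instance, `mul_alpha(0x0001, G_A)` is 0x0002 (no reduction), while `mul_alpha(0x8000, G_A)` is 0x990f, because the shifted-out bit is replaced by the polynomial.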
The above trick can be applied to the combined 256-bit vector lo = $(b_7, \ldots, b_0, a_7, \ldots, a_0)$ to multiply the low half by $\alpha$ from the first base field $\mathbb{F}^A_{2^{16}}$ and the high half by $\beta$ from the second base field $\mathbb{F}^B_{2^{16}}$, simultaneously. Here we need to use _mm256_srai_epi16(), which performs an arithmetic right shift on the 16 16-bit signed integers held in the combined 256-bit register lo. The AND operation must then use a constant whose low 8 x 16-bit values are gA=0x990f and whose high half contains gB=0xc963.
A similar idea is applied for the multiplication of hi by $\alpha^{-1}$ and $\beta^{-1}$. In our reference implementation we found a way to do this with only 4 instructions with the help of the non-trivial intrinsic _mm256_sign_epi16(); however, if that intrinsic is not available then there is an alternative solution with 5 instructions.
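A scalar counterpart of the inverse multiplication is sketched below. It is not the _mm256_sign_epi16() construction of the reference implementation, only the underlying field operation: shift right by one and, if the shifted-out bit 0 was set, XOR in the constant representing the inverse of the generator. That constant is derived here as (g with the degree-16 term restored) shifted right by one, which is our derivation and gives 0xcc87 for gA and 0xe4b1 for gB. The helper `roundtrip_ok` is hypothetical and only checks that the two multiplications cancel.

```c
#include <stdint.h>

/* x * alpha: shift left, reduce by g if bit 15 was set. */
static uint16_t mul_alpha(uint16_t x, uint16_t g)
{
    uint16_t mask = (uint16_t)(0u - (unsigned)(x >> 15));
    return (uint16_t)((x << 1) ^ (mask & g));
}

/* x * alpha^-1: shift right, add (x^16 + g)/x if bit 0 was set. */
static uint16_t mul_alpha_inv(uint16_t x, uint16_t g)
{
    uint16_t g_inv = (uint16_t)((0x10000u | g) >> 1);
    uint16_t mask = (uint16_t)(0u - (unsigned)(x & 1u));
    return (uint16_t)((x >> 1) ^ (mask & g_inv));
}

/* Returns 1 if alpha^-1 * (alpha * x) == x for every 16-bit x. */
static int roundtrip_ok(uint16_t g)
{
    for (uint32_t x = 0; x <= 0xffffu; x++) {
        if (mul_alpha_inv(mul_alpha((uint16_t)x, g), g) != (uint16_t)x)
            return 0;
    }
    return 1;
}
```

The round-trip check succeeds for both 0x990f and 0xc963, which relies on the constant term of each polynomial being 1.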
The results of the above two steps should be XORed together with the values at tap offsets 1 and 3 for LFSRs A and B, respectively.The latter part is just byte shuffling that can be done with _mm256_blend_epi32() and _mm256_alignr_epi8(), three instructions in total.
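To show how the pieces fit together, the following scalar sketch clocks both LFSRs word by word. It is an illustration under assumptions, not the vectorized reference code: the recurrences $a_{16} = b_0 \oplus \alpha a_0 \oplus a_1 \oplus \alpha^{-1} a_8$ and $b_{16} = a_0 \oplus \beta b_0 \oplus b_3 \oplus \beta^{-1} b_8$ are taken from our reading of the SNOW-V definition (consistent with the tap offsets 1 and 3 mentioned above), and the inverse constants 0xcc87 and 0xe4b1 are derived from gA and gB rather than quoted from the text. The type and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint16_t a[16], b[16]; } lfsr_state;

static uint16_t mul_x(uint16_t v, uint16_t g)
{
    return (uint16_t)((v << 1) ^ ((uint16_t)(0u - (unsigned)(v >> 15)) & g));
}

static uint16_t mul_x_inv(uint16_t v, uint16_t g_inv)
{
    return (uint16_t)((v >> 1) ^ ((uint16_t)(0u - (unsigned)(v & 1u)) & g_inv));
}

/* One clock of both LFSRs: compute the feedback words, then shift. */
static void lfsr_step(lfsr_state *s)
{
    uint16_t a_new = (uint16_t)(s->b[0] ^ mul_x(s->a[0], 0x990f)
                                ^ s->a[1] ^ mul_x_inv(s->a[8], 0xcc87));
    uint16_t b_new = (uint16_t)(s->a[0] ^ mul_x(s->b[0], 0xc963)
                                ^ s->b[3] ^ mul_x_inv(s->b[8], 0xe4b1));

    memmove(s->a, s->a + 1, 15 * sizeof s->a[0]);
    memmove(s->b, s->b + 1, 15 * sizeof s->b[0]);
    s->a[15] = a_new;
    s->b[15] = b_new;
}

/* One "LFSR update" in the sense of the text: 8 consecutive steps. */
static void lfsr_update8(lfsr_state *s)
{
    for (int i = 0; i < 8; i++)
        lfsr_step(s);
}
```

After `lfsr_update8`, words 0 to 7 of each LFSR hold the former words 8 to 15, which is the scalar view of the register copy lo_new=hi.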
Performance results

In Software
The natural algorithm to compare with is AES-256, implemented with AES-NI intrinsics.We have done a number of performance tests of SNOW-V and AES-256 (CBC) on a user-grade laptop with i7-8650U CPU @1.90GHz with Turbo Boost up to @4.20GHz, testing each algorithm on a single thread and with different sizes of the input plaintext.Before each encryption process, we perform a key/IV setup procedure for both SNOW-V and AES-256.The results are presented in Table 4.

Use case scenarios
For a large plaintext SNOW-V is around 6 times faster than AES-256, even with an AES-NI implementation of AES-256. Some block cipher modes (e.g. CTR) can be parallelized, so in order to reach a speed similar to SNOW-V running on 1 CPU, AES requires at least 6 CPUs.
Let us consider the scenario with short fragments, where a large message is split into short messages, say 2048 bytes each, and sent over the channel. The encryption is performed with the same key K and a different IV for each fragment ($IV_1$, $IV_2$, etc.). In this case there is a generic approach to speed up any encryption algorithm by precomputing the keystreams for $(K, IV_1, IV_2, \ldots)$. This technique can be applied to both AES and SNOW-V, and from Table 4 we conclude that SNOW-V outperforms AES-256 also in this case.
The only scenario where SNOW-V is slower than AES-256 (in a single core setup) is when AES-256 performs the key setup only once and then uses the same context to prepare keystream for various IVs, at a speed of 9 Gbps. The relevant modes of operation are, e.g., OFB, CTR, and GCM. In this case, SNOW-V is slower than AES-256 when the plaintext size is less than approximately 64 bytes.

Future AVX512
AVX512 is a new set of intrinsics utilizing wider 512-bit registers, and a subset of the AVX512 instructions is currently only available on high-end Intel CPUs. It is expected to be supported by consumer-grade CPUs in the near future. In this new set of intrinsics, there is an instruction to perform 4 AES encryption rounds in parallel, _mm512_aesenc_epi128(), which would speed up AES by a factor of approximately 4.
SNOW-V will benefit from AVX512 as well. In the FSM, where we have to apply two AES encryption rounds, a double XOR, and a double ADD4x32 (today all done over 128-bit registers), we can in the future use wider registers, and the number of instructions could approximately be halved.
For the LFSRs, the new intrinsics will shrink the number of instructions as well. For example, AVX512 has the function _mm512_ternarylogic_epi32(), which implements any user-defined 3-input Boolean function. Hence, an expression like XOR(XOR(a, b), c) can be substituted with a single _mm512_ternarylogic_epi32(a, b, c, 0x96).
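The constant 0x96 is simply the truth table of the three-input XOR. The sketch below derives such an 8-bit immediate for an arbitrary Boolean function, assuming the usual convention that bit number (a<<2 | b<<1 | c) of the immediate holds f(a, b, c). The helper names are hypothetical and no AVX512 hardware is needed to run it.

```c
#include <stdint.h>

typedef int (*bool3_fn)(int a, int b, int c);

static int xor3(int a, int b, int c) { return a ^ b ^ c; }
static int maj3(int a, int b, int c) { return (a & b) | (a & c) | (b & c); }

/* Build the 8-bit truth table used as the immediate of a ternary-logic op. */
static uint8_t ternlog_imm8(bool3_fn f)
{
    uint8_t imm = 0;

    for (int idx = 0; idx < 8; idx++) {
        int a = (idx >> 2) & 1, b = (idx >> 1) & 1, c = idx & 1;
        if (f(a, b, c) & 1)
            imm = (uint8_t)(imm | (1u << idx));
    }
    return imm;
}
```

Here `ternlog_imm8(xor3)` evaluates to 0x96, since XOR of three bits is 1 exactly for the input patterns 001, 010, 100 and 111.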
Both the FSM and the LFSR would utilize only half of a 512-bit register while the number of instructions is reduced. Note that the second half of the registers can be used to run another SNOW-V instance in parallel, with its own key and IV.
Thus, as a rough estimate, the speed of SNOW-V could be increased by a factor of 2 to 4.

AEAD mode of operation
The GMAC integrity and authentication algorithm specified in [Dwo07] can easily be adapted to work with SNOW-V to define an AEAD mode of operation. We will use the notation from [Dwo07] in the following. In GCM, an unspecified block cipher is used in counter mode to encrypt the plaintext. Additionally, the block cipher is used to produce the final authentication tag $T$, and to derive the key $H$ used in the function $\mathrm{GHASH}_H$. When using SNOW-V together with the $\mathrm{GHASH}_H$ algorithm, the key $H$ is the very first keystream output $z^{(0)}$. Then we continue to encrypt the $n$ plaintext blocks using the keystream outputs $z^{(1)}, \ldots, z^{(n)}$, feeding the ciphertext blocks into $\mathrm{GHASH}_H$. Finally, we use the keystream output $z^{(n+1)}$ as the final masking for the tag, similarly to the encrypted value of $J_0$ in [Dwo07].
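The allocation of keystream blocks described above can be stated compactly. The following sketch only encodes which 128-bit keystream output serves which purpose for a message of n plaintext blocks; the enum and function names are hypothetical, and neither GHASH nor the cipher itself is implemented here.

```c
#include <stddef.h>

typedef enum {
    KS_HASH_KEY, /* z^(0): used as the GHASH key H          */
    KS_ENCRYPT,  /* z^(1) .. z^(n): XORed with plaintext    */
    KS_TAG_MASK, /* z^(n+1): final masking of the tag       */
    KS_UNUSED    /* anything beyond z^(n+1) is not consumed */
} ks_role;

/* Role of keystream block j when encrypting n plaintext blocks. */
static ks_role keystream_role(size_t j, size_t n)
{
    if (j == 0)
        return KS_HASH_KEY;
    if (j <= n)
        return KS_ENCRYPT;
    if (j == n + 1)
        return KS_TAG_MASK;
    return KS_UNUSED;
}
```

A message of n blocks therefore consumes n + 2 keystream blocks in total, two more than plain encryption would.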
SNOW-V works as described in Section 2 with a single change. During initialization of the LFSRs, we set the lower part of LFSR-B to the following hex values:

$$(b_7, b_6, \ldots, b_0) = (\mathtt{6D6F}, \mathtt{6854}, \mathtt{676E}, \mathtt{694A}, \mathtt{2064}, \mathtt{6B45}, \mathtt{7865}, \mathtt{6C41}).$$

The hex values are the UTF-8 encoding of the names of the authors. An overview of how SNOW-V is used together with the $\mathrm{GHASH}_H$ algorithm is shown in Figure 8. The padding of the Additional Authenticated Data (AAD), the way the length of the AAD and the length of the ciphertext C are concatenated, and all other restrictions on plaintext length and change of IV from [Dwo07] remain. We have only defined a new way to derive the counter mode keystream, and the additional key and XOR value needed in the GCM algorithm.
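The constant can be decoded with a few lines of C. The sketch below assumes that each 16-bit word is serialized with its least significant byte first, starting from $b_0$; this byte order is our assumption, chosen to match the least-significant-byte-first convention of the test vectors. The array and function names are hypothetical.

```c
#include <stdint.h>

/* aead_b[i] holds b_i; the text lists the words from b_7 down to b_0. */
static const uint16_t aead_b[8] = {
    0x6C41, 0x7865, 0x6B45, 0x2064, 0x694A, 0x676E, 0x6854, 0x6D6F
};

/* Write the 16 bytes of (b_0, ..., b_7), low byte first, as a C string. */
static void aead_constant_bytes(char out[17])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = (char)(aead_b[i] & 0xff);
        out[2 * i + 1] = (char)(aead_b[i] >> 8);
    }
    out[16] = '\0';
}
```

With this byte order the buffer contains the ASCII string "AlexEkd JingThom".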

Conclusions
A new 128-bit stream cipher called SNOW-V is presented. It follows the design principles of the previous ciphers in the SNOW family, but leverages the AES round function instruction support found in many modern CPUs. Both hardware and software implementation aspects are discussed, and in particular a very compact 64-bit implementation using a single AES encryption round is given as an implementation alternative. Theoretical arguments are presented that imply a very high speed, reaching above 700 Gbps, for a full 128-bit implementation. In single core implementations in software, SNOW-V outperforms AES by a factor of approximately 6 for plaintext lengths above 2kB. Basic cryptanalysis of the new design is presented and SNOW-V is argued to be resistant against these attacks. Finally, an AEAD mode of operation based on the well-known GCM scheme is given. Test vectors and reference implementations are given in the Appendices.
A Remarks about the maximum period of the LFSR structure

We can denote the LFSRs' state at time $t \ge 0$ by the elements $a^{(t)}_i, b^{(t)}_i$, $0 \le i \le 15$. The update of all elements except the last one in each LFSR is determined by a shift, that is $a^{(t+1)}_i = a^{(t)}_{i+1}$ and $b^{(t+1)}_i = b^{(t)}_{i+1}$ for $i = 0, 1, \ldots, 14$, and the corresponding binary state transition submatrix for such an update is the identity matrix $M_I$ of size $16 \times 16$. As for $a_{15}$, $b_{15}$, we can rewrite them in polynomial form. Suppose the bases for the finite fields A and B are $(1, \alpha, \ldots, \alpha^{15})$ and $(1, \beta, \ldots, \beta^{15})$, respectively; then every state element can be expressed as a polynomial with respect to the corresponding basis.

B Test Vectors
This section presents test vectors for SNOW-V with three different keys and IVs.The vectors are written with the least significant byte of the 128-bit word appearing to the left in the row.
For the keys, the lower 128-bit part is written on the first row, followed by the high part on the second row.

Fig. 3: Internal functions of the AES encryption round function.

Fig. 4: Mapping between a 128-bit register value and the state array of the AES round function.

Fig. 5: The maximum number of initialization rounds failing the MDM test under different bit set sizes.

Fig. 6: Splitting of FSM into two steps in order to utilize only one AES core.

Fig. 7: Hardware architecture of 64-bit SNOW-V with a single AES core.

Table 2: Bit dependencies due to multiplications for $a_{16}$ and $b_{16}$.

Table 3: Theoretical comparison of four SNOW-V versions vs AES-256 in hardware.

Table 4: Performance comparison of SNOW-V and AES-256, both with AVX2.

Fig. 8: How SNOW-V is used together with $\mathrm{GHASH}_H$ to enable AEAD.
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff iv= ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Initialization z= ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff