Rocca: An Efficient AES-based Encryption Scheme for Beyond 5G

Abstract. In this paper, we present an AES-based authenticated-encryption with associated-data scheme called Rocca, with the purpose of meeting the speed and security requirements of 6G systems. To achieve ultra-fast software implementations, the basic design strategy is to take full advantage of the AES-NI and SIMD instructions, as in the AEGIS family and Tiaoxin-346. Although Jean and Nikolić have generalized the way to construct efficient round functions using only one round of AES (aesenc) and the 128-bit XOR operation and have found several efficient candidates, there still seems to be potential for further improvement regarding speed and state size. In order to minimize the critical path of one round, we exclude the case of applying both aesenc and XOR in cascade within one round. By introducing a cost-free block permutation in the round function, we are able to search for candidates in a larger space without sacrificing performance. Consequently, we obtain more efficient constructions with a smaller state size than the candidates by Jean and Nikolić. Based on the newly discovered round function, we carefully design the corresponding AEAD scheme with 256-bit security by taking several reported attacks on the AEGIS family and Tiaoxin-346 into account. Our AEAD scheme can reach 150 Gbps, which is almost 5 times faster than the AEAD scheme of SNOW-V. Rocca is also much faster than other efficient schemes with 256-bit key length, e.g. AEGIS-256 and AES-256-GCM. As far as we know, Rocca is the first dedicated cryptographic algorithm targeting 6G systems, i.e., a 256-bit key length and a speed of more than 100 Gbps.


Background
The fifth-generation mobile communication systems (5G) have been launched in several countries for commercial services since 2020. Besides, research on beyond-5G or 6G has already started in some research institutes. The first white paper on 6G [LaL19] was published in 2019 by the 6Genesis project, which is mainly organized by the University of Oulu in Finland. The white paper raises several requirements for 6G systems. For the data transmission speed, it says that 6G achieves more than 100 Gbps, which is more than 10 times faster than that of 5G.
For the 4G system, SNOW 3G [SAG06], AES [Nat01], and ZUC-128 [SAG11] are employed as the underlying cryptographic algorithms to ensure confidentiality and integrity. They are specified as 128-EEA1 (EIA1), 128-EEA2 (EIA2), and 128-EEA3 (EIA3), respectively, and are also the selected cryptographic algorithms for the 5G system as 128-NEA1 (NIA1), 128-NEA2 (NIA2), and 128-NEA3 (NIA3). However, for the 5G system, the 3GPP standardization organization requires the security level to be increased to 256-bit key lengths. In 2018, ZUC-256 [The18] was proposed as the 256-bit key version of ZUC-128. ZUC-256 revises ZUC-128 only in the initialization phase and in the MAC generation phase. With this revision, ZUC-256 raises the security level against key-recovery attacks from 128 bits to 256 bits. On the other hand, the encryption/decryption speed is not much improved because the key-stream generation phase is the same as in ZUC-128, and a structural weakness was found [YJM20]. At FSE 2020, Ekdahl et al. proposed SNOW-V, the 256-bit key version of SNOW 3G, and showed that SNOW-V achieves more than 38 Gbps in an AEAD (Authenticated Encryption with Associated Data) mode on OpenSSL [EJMY19]. The performance of SNOW-V is sufficient for use in the 5G system.
However, when taking requirements in 6G systems into account, we have to tackle some challenges. The biggest one is the encryption/decryption speed. For 6G systems, as the data transmission speed is expected to reach more than 100 Gbps, we have to design a cryptographic algorithm with the encryption/decryption speed of more than 100 Gbps, which is at least three times faster than SNOW-V. Besides, achieving 256-bit security against key-recovery attacks is essential as in 5G systems [3GP18]. In addition, due to the increase of data transmissions in 6G systems, it is necessary to ensure at least 128-bit security against distinguishing attacks while SNOW-V only claims 64-bit security against distinguishing attacks. Therefore, there is no doubt that a new cryptographic algorithm is needed in 6G systems.
For symmetric-key primitives targeting high-performance applications, there are several interesting cryptographic algorithms. The most tempting ones are those employing AES-NI [Gue10,Corb], which is a new AES instruction set equipped on many modern CPUs from Intel and AMD. Some SoCs for mobile devices are also equipped with an instruction set for AES [arm21], and more and more SoCs will support such instructions by the time 6G systems are realized. Hence, employing AES-NI seems reasonable in designing cryptographic algorithms for 6G systems. The AEGIS family and Tiaoxin-346, two submissions to the CAESAR competition [cae18], belong to this category, and AEGIS-128 has been selected in the final portfolio for high-performance applications. The round functions of the AEGIS family and Tiaoxin-346 are quite similar. Specifically, they are based only on one AES round and the 128-bit XOR operation, each of which is realized with a single SIMD (Single Instruction, Multiple Data) instruction. As a result, both the AEGIS family and Tiaoxin-346 are competitive in terms of encryption/decryption speed in a pure software environment when compared with many other primitives.
Jean and Nikolić generalized the method to design efficient round functions as that used in AEGIS and Tiaoxin-346 in [JN16]. After a thorough search, they discovered round functions that can achieve a faster speed than any of the round functions adopted in the AEGIS family and Tiaoxin-346 and provide the 128-bit security against forgery attacks. However, they did not propose a concrete AEAD scheme [JN16].
Obviously, AEGIS-128, AEGIS-128L and Tiaoxin-346 do not meet the security requirement of the 256-bit key length in 6G systems. In addition, according to our experiments, AEGIS-256 does not reach more than 100 Gbps (See Sect. 5). However, this line of research leaves open the potential to design a faster cryptographic algorithm based on AES round functions for 6G.

Our Design
In this paper, we present an AES-based encryption scheme with a 256-bit key and 128-bit tag called Rocca, which provides both a raw encryption scheme and an AEAD scheme with a 128-bit tag. The goal of Rocca is to meet the requirement in 6G systems in terms of both performance and security. For performance, Rocca achieves an encryption/decryption speed of more than 100 Gbps in both raw encryption scheme and AEAD scheme. For security, Rocca can provide 256-bit and 128-bit security against key-recovery attacks and forgery attacks, respectively.
Optimized AES-NI-Friendly Round Function. To achieve such a dramatically fast encryption/decryption speed, Rocca is designed for a pure software environment that fully supports both the AES-NI and SIMD instructions. The design of the round function of Rocca is inspired by the work of Jean and Nikolić [JN16]. To further increase its speed and reduce the state size, we explore a new class of AES-based structures. Specifically, we take the following different approaches.
- To minimize the critical path of the round function, we focus on the structure where each 128-bit block of the internal state is updated by either one AES round or one XOR. In contrast, Jean and Nikolić also consider applying both aesenc and XOR in cascade within one round, and the most efficient structures in [JN16] belong to that class.
- We introduce a permutation between the 128-bit state words of the internal state in order to increase the number of possible candidates while keeping efficiency, since executing such a permutation is a cost-free operation in the target software. This was not taken into account in [JN16].
We search for round functions that can ensure 128-bit security against forgery attacks within our class of general constructions, as in [JN16]. Consequently, we succeed in discovering more efficient constructions with a smaller state size than those in [JN16]. The internal state of Rocca consists of eight 128-bit words and its round function is composed of four aesenc instructions and four 128-bit XOR operations, which is significantly faster than those of the AEGIS family, Tiaoxin-346 and Jean and Nikolić's structure [JN16].
Encryption and Authentication Scheme. To resist the statistical attack in [Min14], generating each 128-bit ciphertext block additionally requires one AES round, whereas it is generated with simple quadratic boolean functions in the AEGIS family and Tiaoxin-346. However, this incurs little overhead with AES-NI (See Sect. 3). Moreover, a study on the initialization phases of both reduced AEGIS-128 and Tiaoxin-346 has been reported recently [LIMS21]. To further increase the resistance against the reported attacks, the placement of the nonce and the key in the initial state is carefully chosen in our scheme.
Performance. The encryption/decryption speed of Rocca is dramatically improved compared with other AES-based encryption schemes. Rocca is more than three and four times faster than SNOW-V and SNOW-V-GCM, respectively, i.e. the speed reaches 215 and 178 Gbps, respectively. Compared to other schemes with a 256-bit key, Rocca is more than five times faster than AEGIS-256 and more than three times faster than AES-256-GCM in our evaluations (See Sect. 5 and Appendix A). Moreover, Rocca is also faster than AEGIS-128, AEGIS-128L, and Tiaoxin-346 even though Rocca provides a higher security level. To the best of our knowledge, Rocca is the first dedicated cryptographic algorithm targeting 6G systems and we hope it can inspire future designs.

Version History
(4) Fixed a typo in Algorithm 1 regarding decryption.

Organization
This paper is organized as follows. We first present the specification of Rocca in Sect. 2. Then, we describe the design rationale, such as the general construction based on AES-NI, criteria for performance and security, and how to find efficient round functions in Sect. 3. In Sect. 4, we provide the details of security evaluations against possible attacks on Rocca. Sect. 5 shows our software implementation results. Finally, we conclude this paper in Sect. 6.

Preliminaries
In this section, the notations and the specification of our designs will be described.

Notations
The following notations will be used in the paper. Throughout this paper, a block means a 16-byte value. For the constants $Z_0$ and $Z_1$, we utilize the same ones as Tiaoxin-346 [Nik14].

$A(X)$: The AES round function without the constant addition operation, as defined below:

$$A(X) = \mathrm{MixColumns} \circ \mathrm{ShiftRows} \circ \mathrm{SubBytes}(X),$$

where MixColumns, ShiftRows and SubBytes are the same operations as defined in AES.

6. $|X|$: The length of $X$ in bits.
7. $0^l$: A zero string of length $l$ bits.
8. $X \| Y$: The concatenation of $X$ and $Y$.
9. $R(S, X_0, X_1)$: The round function used to update the state $S$.
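As a minimal sketch of the notation $A(X)$, the following Python code composes SubBytes, ShiftRows and MixColumns on a 16-byte block. The function names, the column-major byte order (the one used in FIPS 197) and the on-the-fly S-box generation are assumptions of this illustration and not part of the specification; a real implementation would rely on aesenc rather than on table arithmetic.

```python
def _xtime(a: int) -> int:
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def _gmul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = _xtime(a)
        b >>= 1
    return r


def _rotl8(v: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((v << n) | (v >> (8 - n))) & 0xFF


def _sbox_entry(x: int) -> int:
    """AES S-box: field inverse (0 maps to 0) followed by the affine map."""
    inv = 1
    for _ in range(254):  # x^254 is the inverse of x in GF(2^8)
        inv = _gmul(inv, x)
    return inv ^ _rotl8(inv, 1) ^ _rotl8(inv, 2) ^ _rotl8(inv, 3) ^ _rotl8(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def A(block: bytes) -> bytes:
    """One AES round without the round-key (constant) addition."""
    if len(block) != 16:
        raise ValueError("a block is a 16-byte value")
    s = [SBOX[b] for b in block]  # SubBytes
    # ShiftRows: byte (row r, column c) is stored at index 4*c + r
    s = [s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]
    out = []
    for c in range(4):  # MixColumns, one column at a time
        a0, a1, a2, a3 = s[4 * c:4 * c + 4]
        out += [
            _gmul(a0, 2) ^ _gmul(a1, 3) ^ a2 ^ a3,
            a0 ^ _gmul(a1, 2) ^ _gmul(a2, 3) ^ a3,
            a0 ^ a1 ^ _gmul(a2, 2) ^ _gmul(a3, 3),
            _gmul(a0, 3) ^ a1 ^ a2 ^ _gmul(a3, 2),
        ]
    return bytes(out)
```

XORing a 16-byte value into the output of `A` gives one regular AES round, which is the operation an aesenc instruction performs.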

The Round Update Function
The input of the round function R(S, X 0 , X 1 ) of Rocca consists of the state S and two blocks (X 0 , X 1 ). Denoting the output by S new , the update S new ← R(S, X 0 , X 1 ) is defined as follows: Figure 1 illustrates the round function.

Specification of Rocca
Rocca is an authenticated-encryption with associated-data scheme composed of four phases: initialization, processing the associated data, encryption and finalization. The input consists of a 256-bit key $K_0 \| K_1 \in \mathbb{F}_2^{128} \times \mathbb{F}_2^{128}$, a 128-bit nonce $N$, the associated data $AD$ and the message $M$. The output is the corresponding ciphertext $C$ and a 128-bit tag $T$. Define $\overline{X} = X \| 0^l$ where $l$ is the minimal non-negative integer such that $|\overline{X}|$ is a multiple of 256. In addition, write $\overline{X}$ as $\overline{X} = \overline{X}_0 \| \overline{X}_1 \| \ldots \| \overline{X}_{|\overline{X}|/256 - 1}$ with $|\overline{X}_i| = 256$. Further, each $\overline{X}_i$ is written as the concatenation of two 128-bit blocks.
Initialization. First, $(N, K_0, K_1)$ is loaded into the state $S$ in the following way: Here, two 128-bit constants $Z_0$ and $Z_1$ are encoded as 16-byte little endian words and loaded into $S[2]$ and $S[3]$ respectively. Then, 20 iterations of the round function $R(S, Z_0, Z_1)$ are applied to the state $S$. After the 20 iterations of the round function, the two 128-bit keys are XORed with the state $S$ in the following way:
Processing the associated data. If $AD$ is empty, this phase will be skipped. Otherwise, $AD$ is padded to $\overline{AD}$ and the state is updated as follows:
Encryption. The encryption phase is similar to the phase that processes the associated data. If $M$ is empty, the encryption phase will be skipped. Otherwise, $M$ is first padded to $\overline{M}$ and then $\overline{M}$ is absorbed with the round function. During this procedure, the ciphertext $C$ is generated. If the last block of $M$ is incomplete and its length is $b$ bits, i.e. $0 < b < 256$, the last block of $C$ is truncated to the first $b$ bits. A detailed description is shown below:
Finalization. After the above three phases, the state $S$ again passes through 20 iterations of the round function $R(S, |AD|, |M|)$ and then the tag is computed in the following way: The lengths of the associated data and the message are encoded as 16-byte little endian words and stored into $|AD|$ and $|M|$, respectively.
A formal description of Rocca can be seen in Algorithm 1 and the corresponding illustration is shown in Figure 2.
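The padding and length-encoding conventions above can be sketched in Python as follows. This is only an illustration under the assumption that all inputs are byte-aligned; the helper names `pad_to_256_bits`, `split_blocks` and `encode_bit_length` are hypothetical and do not come from the specification.

```python
def pad_to_256_bits(x: bytes) -> bytes:
    """Append the minimal number of zero bytes so that the length
    becomes a multiple of 32 bytes (256 bits)."""
    return x + bytes(-len(x) % 32)


def split_blocks(x_padded: bytes) -> list:
    """Split a padded input into 256-bit chunks, each returned as a
    pair of 16-byte blocks that can be fed to R(S, X0, X1)."""
    if len(x_padded) % 32 != 0:
        raise ValueError("input must be padded to a multiple of 256 bits")
    return [
        (x_padded[i:i + 16], x_padded[i + 16:i + 32])
        for i in range(0, len(x_padded), 32)
    ]


def encode_bit_length(x: bytes) -> bytes:
    """Encode the bit length |X| as a 16-byte little-endian word."""
    return (8 * len(x)).to_bytes(16, "little")


message = b"hello"
padded = pad_to_256_bits(message)
blocks = split_blocks(padded)
# The ciphertext computed over `padded` would then be truncated
# back to len(message) bytes, i.e. to the first b bits of the last block.
```

Because padding is only applied to non-empty inputs with an incomplete last chunk, an empty input stays empty, which matches the rule that the corresponding phase is skipped.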

Fig. 2: The procedure of Rocca
A raw encryption scheme. If the phases of processing the associated data and finalization are removed, a raw encryption scheme is obtained.
Security claims. Rocca provides 256-bit security against key-recovery attacks and 128-bit security against distinguishing and forgery attacks in the nonce-respecting setting (see footnote 5). We do not claim its security in the related-key and known-key settings.
The message length for a fixed key is limited to at most $2^{128}$ and we also limit the number of different messages that are produced for a fixed key to be at most $2^{128}$. The length of associated data of a fixed key is up to $2^{64}$.

General Construction
SIMD instruction. The prime design goal of Rocca is to meet the requirements of processing/transmission speed for 6G applications, namely more than 100 Gbps [LaL19]. In order to realize fast encryption/decryption speed (hereafter simply called "speed") in a pure software environment, we take full advantage of the SIMD instructions and AES-NI, both of which are available on most modern CPUs. The SIMD instructions contain fundamental instructions such as XOR and AND and can execute them on 32/64/128-bit units as one instruction. AES-NI is a special set of SIMD instructions, which was first rolled out by Intel [Cora] and is available on modern processors. AES-NI can execute AES about 10 times faster than a non-AES-NI implementation in parallelizable modes such as CTR mode. In this paper, we utilize aesenc, which is one of the instructions of AES-NI and performs one regular (not the last) round of AES on an input state $S$ with a subkey $K$:

$$\mathrm{AESENC}(S, K) = (\mathrm{MixColumns} \circ \mathrm{ShiftRows} \circ \mathrm{SubBytes}(S)) \oplus K.$$

The execution speed of these instructions can be evaluated by latency and throughput, where latency is the number of clock cycles required to execute a single instruction and throughput is the number of clock cycles that must elapse before the same instruction can be issued again. The latter is important when considering parallel execution. Table 1 shows the latency and throughput of aesenc [RTL] on each architecture. Among existing architectures, we focus on the latest one, the Intel Ice Lake series, which has the fastest AES-NI: the latency and throughput of aesenc are 3 and 0.5, respectively. Figure 3 illustrates an example of the parallel execution of aesenc on Intel Ice Lake.

Footnote 5: We updated the claimed security against distinguishing attacks from the ToSC version [SLN+21] for the following reasons. The most well-known and popular distinguishing attack on the keystream seems to be the linear attack. Such a distinguishing attack often requires a large number of plaintexts. If the data complexity exceeds the time complexity to find the key with Grover's algorithm, we view such an attack as invalid in the quantum setting. Therefore, regarding the distinguishing attack, we only claim 128-bit security in the quantum setting, and a meaningful distinguishing attack in the classical setting should have data complexity below $2^{128}$.
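To make the difference between latency and throughput concrete, the sketch below estimates when a batch of aesenc instructions finishes. It assumes a deliberately simplified pipeline model (one instruction issued every `throughput` cycles, each result available `latency` cycles after issue, no other bottleneck); the function names are hypothetical.

```python
from fractions import Fraction


def finish_cycle_independent(n: int, latency, throughput) -> Fraction:
    """Cycle at which the last of n mutually independent instructions
    completes when they are issued back to back."""
    if n <= 0:
        return Fraction(0)
    return Fraction(latency) + (n - 1) * Fraction(throughput)


def finish_cycle_dependent(n: int, latency) -> Fraction:
    """Cycle at which a chain of n instructions completes when each one
    needs the result of the previous one."""
    return max(n, 0) * Fraction(latency)


# Ice Lake figures quoted in the text: latency 3, throughput 0.5
parallel = finish_cycle_independent(6, 3, Fraction(1, 2))
chained = finish_cycle_dependent(6, 3)
print(parallel, chained)
```

Under this model, six independent aesenc instructions complete after 5.5 cycles, whereas a chain of six dependent ones needs 18 cycles, which is why the round function is built from aesenc calls that do not depend on each other.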
Employing one AES round as an underlying component for future designs has a great merit for performance compared to employing other cryptographic primitives. Many software products and libraries support AES-NI natively, e.g., OpenSSL. Thus, it seems very reasonable that devices connected to 6G services will still support such instructions. SNOW-V also takes advantage of AES-NI for the same reason. Specifically, we focus on the AEGIS family [WP13] and Tiaoxin-346 [Nik14], which are permutation-based authenticated encryption schemes using AES round functions and were submitted to the CAESAR competition [cae18]. These allow full parallelization and can achieve outstanding speed compared to AES-CTR.
However, since a linear bias has been pointed out in the ciphertext blocks of AEGIS-256 [Min14], it seems insecure to adopt a similar quadratic boolean function to generate the ciphertexts, especially when aiming at 256-bit security. This fact motivates us to design a different way to generate the ciphertext blocks, and we finally choose to involve one AES round in generating each ciphertext block. This is efficient thanks to parallel calls to AES-NI. Moreover, a study on the initialization phases of both reduced AEGIS-128 and Tiaoxin-346 has been reported recently [LIMS21]. To further increase the resistance against the reported attacks, the placement of the nonce and the key in the initial state is carefully chosen in our scheme, an aspect that is little discussed for AEGIS and Tiaoxin-346.

Efficient AES-Based Round Function.
Round functions of AEGIS family [WP13] and Tiaoxin-346 [Nik14] consist of the 128-bit XOR operation and one AES round that is executed by the processor instruction aesenc. Jean and Nikolić have generalized the way to construct efficient round functions using only the one AES round (aesenc) and 128-bit XOR and have found several more efficient candidates [JN16]. Figure 4 shows the general construction of the round function considered in [JN16].
To push the efficiency of their structures further, we explore a new class of AES-based structures, shown in Fig. 5, that minimizes the critical path of one round. Specifically, we only consider applying either aesenc or a 128-bit XOR to each block in one round, where aesenc takes a state block or message block as the input of AddRoundKey and the 128-bit XOR takes state blocks or message blocks as inputs, as shown in Figure 5. Moreover, we apply a block permutation to the state blocks, which was not considered by Jean and Nikolić (See Fig. 4). This sufficiently increases the number of possible candidates. Indeed, as described in a later section, it enables us to find more efficient constructions than Jean and Nikolić's results, which are not covered by their target classes. It should be emphasized that executing the block permutation at register size is a cost-free operation, that is, the permutation only changes the order of blocks. More strictly, a permutation needs some temporary registers. However, these registers hardly affect the speed if the total number of registers used by the scheme is lower than 16, which is the total number of xmm registers available on almost all modern CPUs. Hence, applying a block permutation does not affect the speed of the round function. For a block that is input into aesenc or XOR, we use a one-block right rotation as in [JN16].

Criteria for Performance and Security
For designing efficient round functions, we need to choose several parameters such as the number of aesencs, the number of inserted message blocks, and a block permutation for our structure in Fig. 5. We clarify requirements of performance and security for target applications to choose these parameters.
Requirements for Performance. To theoretically estimate speed, we utilize a metric called rate, which was proposed by Jean and Nikolić [JN16]. For our general construction of Fig. 5, the rate is estimated as the ratio (# of aesencs)/(# of inserted 128-bit message blocks) in one round. Since a smaller rate leads to a more efficient design [JN16], we should design a round function whose rate is as small as possible. The rate is the most important parameter for speed.
The number of aesencs in one round is also an important factor in maximizing efficiency. Jean and Nikolić claim that, for an efficient design, the number of aesencs in one round should be close to the (latency)/(throughput) ratio [JN16], e.g. if the latency and throughput of aesenc are 3 and 0.5, respectively, the number of aesencs in one round should be 6. The reason is that when the number of aesencs is less than the (latency)/(throughput) ratio, there are empty cycles in the processing of aesenc. On the other hand, if the number of aesencs equals the (latency)/(throughput) ratio, there are no empty cycles, as shown in Figure 3. Since our target architecture is Ice Lake, the number of aesencs in a round should be 6.
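The (latency)/(throughput) heuristic can be written down directly. The following sketch, with hypothetical function names, counts the aesenc issue slots available within one latency window and the slots left empty by a given round function; it illustrates the heuristic only and is not a cycle-accurate model.

```python
from fractions import Fraction


def ideal_aesenc_count(latency, throughput) -> Fraction:
    """Number of aesenc issue slots within one latency window."""
    return Fraction(latency) / Fraction(throughput)


def empty_slots(num_aesenc: int, latency, throughput) -> Fraction:
    """Issue slots left unused when one round contains num_aesenc aesencs."""
    return max(Fraction(0), ideal_aesenc_count(latency, throughput) - num_aesenc)


# Ice Lake figures quoted in the text: latency 3, throughput 0.5
ideal = ideal_aesenc_count(3, Fraction(1, 2))
unused_with_four = empty_slots(4, 3, Fraction(1, 2))
unused_with_six = empty_slots(6, 3, Fraction(1, 2))
print(ideal, unused_with_four, unused_with_six)
```

With these figures the window holds 6 slots, so a round with 4 aesencs leaves 2 slots free for other aesenc calls, and a round with 6 leaves none.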
Another important factor related to speed is the number of blocks of the round function, namely the state size. A smaller state size significantly improves efficiency because it reduces the number of registers used for encryption and simplifies the whole encryption process. We experimentally confirmed that reducing the number of blocks increases speed when the rate is the same. Table 2 shows our experimental results comparing three types of round functions of rate 2 with 8, 9, and 10 blocks, each of which is measured on an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz with 16 GB of RAM. Details of these round functions are given in Appendix B. Besides, a smaller state size is preferable for deployment on wider classes of devices while keeping efficiency. This is because some CPUs, such as ones from AMD, do not support large registers such as those of AVX512, and a process requiring many registers tends to become more complicated on these CPUs. Since the number of blocks of SNOW-V, which is our reference point, is 7, the state size should be competitive.
Requirements for Security. Since evaluating the resistance to all possible attacks for all possible candidates is practically infeasible, we focus on the security against the forgery attack by internal collision as the security criterion when finding candidates, as in [JN16]. Specifically, we impose 128-bit security against the forgery attack on our design, i.e. our security requirement is that there are no internal collisions with a probability of more than $2^{-128}$. Throughout this paper, "forgery attack" means a universal forgery in the nonce-respecting setting.
To evaluate the probability of the internal collision, we search for the lower bound on the number of active S-boxes with a Mixed Integer Linear Programming (MILP) solver [MWGP11]. Since the maximum differential probability of an S-box is $2^{-6}$, it is sufficient to guarantee the security against internal collisions if there are 22 active S-boxes, as it gives $2^{-6 \times 22} < 2^{-128}$ as an estimate of the differential probability. The security against other possible attacks is evaluated after the whole design is fixed and is described in Sect. 4.
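The threshold of 22 active S-boxes follows from a one-line computation. The sketch below, whose function names are hypothetical, returns the smallest number of active S-boxes for which the estimated trail probability drops strictly below the target, assuming each active S-box contributes a factor of at most $2^{-6}$.

```python
def min_active_sboxes(security_bits: int, sbox_prob_log2: int = 6) -> int:
    """Smallest n such that 2^(-sbox_prob_log2 * n) < 2^(-security_bits)."""
    return security_bits // sbox_prob_log2 + 1


def trail_probability_log2(active_sboxes: int, sbox_prob_log2: int = 6) -> int:
    """Upper-bound estimate (as a base-2 exponent) of the probability of a
    differential trail with the given number of active S-boxes."""
    return -sbox_prob_log2 * active_sboxes


needed_128 = min_active_sboxes(128)
needed_256 = min_active_sboxes(256)
print(needed_128, trail_probability_log2(needed_128))
print(needed_256, trail_probability_log2(needed_256))
```

For a 128-bit target this gives 22 active S-boxes (probability estimate $2^{-132}$), and the same function gives 43 for a 256-bit target.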
Summary of Our Criteria. Requirements for AES-based round function are as follows.
For speed.
Requirement 1. A round function with as low a rate as possible, which leads to faster speed.
Requirement 2. The number of aesencs in one round is close to 6.
Requirement 3. A round function with a smaller number of blocks (around 7).
For security.
Requirement 4. 128-bit security against the forgery attack by internal collision, i.e. the lower bound on the number of active S-boxes is 22.
For comparison, Table 3 shows parameters of the round function in the AEGIS [WP13] family, Tiaoxin-346 [Nik14] and structure by Jean and Nikolić [JN16].

Finding Efficient Structures
We choose several parameters such as the number of aesencs, the number of inserted message blocks, and a block permutation to meet the requirements given above.
Our Approach. According to Table 3, the most efficient design is Jean and Nikolić's structure, whose rate is 2. However, its state size is quite large for our requirement. In our experiments, round functions with a smaller rate require a larger number of blocks to meet the security requirement. Indeed, we cannot find any structure of rate 2 with fewer than 12 internal blocks among Jean and Nikolić's constructions (Fig. 4) [JN16]. To address this, our approach is as follows.
- To expand the set of possible candidates while keeping efficiency, we introduce a block permutation on the state blocks in the round function, while Jean and Nikolić did not consider any permutation. It should be emphasized that executing the block permutation at register size is a cost-free operation.
- To further improve efficiency, we focus on the structure in which only either aesenc or XOR is applied to each block in one round, in order to minimize the critical path of the round function.
Search Targets. When the number of inserted message blocks is $m$, the number of aesencs in one round should be $(6 - m)$ to satisfy requirement 2, because $m$ aesencs are used to generate ciphertext blocks in our design to resist the linear bias (details in Section 3.5). Considering requirement 1 (rate = 2), the only choice of $m$ is 2, thus the number of aesencs is 4. Following requirement 3, we consider the case where the number of blocks is from 6 to 8. Besides, we also consider the case of rate = 1.5, which cannot satisfy requirement 2, because a lower-rate round function might be more efficient even if it does not meet requirement 2. Table 4 shows our candidates for the round function. We evaluate the lower bounds on the number of active S-boxes for Candidates-1, 2, 3, 4, 5, and 6 with a MILP solver. We can conduct exhaustive searches for Candidates-1, 2, 4, and 5, while exhaustive searches for Candidates-3 and 6 are infeasible because the numbers of candidates are too large, reaching $2^{26.23}$ and $2^{25.91}$ for Candidates-3 and 6, respectively. Thus, we randomly search $2^{19.93}$ candidates for both Candidate-3 and 6.
Table 5 compares the speed of the round functions of Rocca and other primitives, where speed is estimated as the average over 1000000 executions of the round function with 64kB messages on an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz with 16 GB of RAM. Our round function is the fastest one and its number of blocks is smaller than that of the ones whose rate is 2 or 3.
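The choice $m = 2$ can be checked by enumeration. The sketch below assumes, as in the paragraph above, that $m$ of the 6 aesenc slots are spent on ciphertext generation and the remaining $6 - m$ sit in the round function, so that the rate is $(6 - m)/m$; the function name and the search bound are hypothetical.

```python
from fractions import Fraction


def message_blocks_for_rate(rate, total_aesenc: int = 6) -> list:
    """Return every number m of inserted message blocks such that the round
    function uses (total_aesenc - m) aesencs and has exactly the given rate."""
    target = Fraction(rate)
    return [
        m
        for m in range(1, total_aesenc)
        if Fraction(total_aesenc - m, m) == target
    ]


rate_2 = message_blocks_for_rate(2)
rate_1_5 = message_blocks_for_rate(Fraction(3, 2))
print(rate_2, rate_1_5)
```

Only $m = 2$ (hence 4 aesencs in the round function) yields rate 2, and no integer $m$ yields rate 1.5 with 6 aesencs in total, which is consistent with the remark that rate 1.5 cannot satisfy requirement 2.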
It should be mentioned that a comparison of the speed of round functions does not always directly reflect the speed of the whole design. This is because the overhead of the ciphertext generation depends on the structure of the round function, especially on the empty cycles in the processing of XOR/aesenc.

Loading the Nonce and Key
It has been pointed out by Liu et al. that there is one useless round in Tiaoxin-346, which can be seen by expressing the internal states in terms of the nonce and the key at the initialization phase [LIMS21]. The main reason is that the nonce and the key are not well diffused, i.e. after a certain number of rounds, the internal state can be expressed in terms of $A(N)$ and the key. To avoid this in Rocca, we carefully investigate how to place the nonce and the key. In Rocca, the initial state is loaded as follows: After a one-round update, the state $(S[0], \ldots, S[7])$ becomes: It can be observed that $N$ is XORed with $K_0$ and $K_1$, respectively. Moreover, $N$ is involved in the expressions of each state block in a very different way, which avoids the useless rounds and, at the same time, strengthens the resistance against the key-recovery attacks applied to round-reduced AEGIS-128 and Tiaoxin-346 as described in [LIMS21]. Further evidence can be seen from the expressions of the state blocks after 3 rounds of update, as shown below:

Generating the Ciphertext Blocks
In both AEGIS and Tiaoxin-346, each ciphertext block is computed with a simple quadratic boolean function of several internal state blocks, where the number of AND operations is 1. However, such a way of generating the output seems to be insecure against the statistical attack proposed in [Min14], especially for a scheme targeting 256-bit security. At the initial design phase, we tried many possible combinations to compute each ciphertext block with a similar quadratic boolean function. However, with the MILP-based method [ENP19] to automatically evaluate the security against this statistical attack, the lower bound for the time complexity is always below $2^{128}$, which is far smaller than $2^{256}$. Therefore, new strategies are essential for Rocca.
The basic idea is to utilize a complex nonlinear function and finally the AES round function is chosen as the only nonlinear function. Due to the parallel way to perform the AES round function, such a way is indeed rather efficient and can simultaneously strengthen the security of our scheme. To reduce the overall overheads, computing each ciphertext block only utilizes 1 aesenc.
The basic principle to choose the state blocks to compute the ciphertext is that the state blocks (S[…], S[2], S[3], S[4]), which will be shifted to (S[4], S[5]) after some rounds. A detailed study of the security of our choice can be found in the following section.
We emphasize that the overhead of executing these two aesencs is small since we can assign them to empty cycles of aesenc in the round function.

Differential Attack
The differential attack is one of the possible attacks on the initialization phase of Rocca. Specifically, the differences in the nonce (and key) can propagate to the ciphertext. If there is a differential characteristic with a high probability, it can be exploited for the differential attack. Hence, we compute the lower bound on the number of active S-boxes in the initialization phase to evaluate the resistance against the differential attack. To compute the lower bound, we utilize the MILP-aided method proposed by Mouha et al. [MWGP11] and focus on byte-wise truncated differences. We evaluate it in both the single-key setting, where differences can only be injected into the nonce, and the related-key setting, where differences can be injected into the key and the nonce. Table 6 shows the lower bounds on the number of active S-boxes for up to 14 rounds in the single-key setting and up to 11 rounds in the related-key setting in the initialization phase. Since the maximal differential probability of the S-box of AES is $2^{-6}$, it is sufficient to guarantee the security against differential attacks if there are 43 active S-boxes, as it gives $2^{-6 \times 43} < 2^{-256}$ as an estimate of the differential probability. As shown in Table 6, there are 54 active S-boxes over 6 rounds in the single-key setting and 44 active S-boxes over 7 rounds in the related-key setting in the initialization phase. It should be emphasized that we do not claim security in the related-key setting, although we evaluated the number of active S-boxes in the related-key setting.
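As a worked check of the figures quoted above, using only the per-S-box bound $2^{-6}$ and the active S-box counts stated in the text, the estimated trail probabilities are:

```latex
\begin{align*}
  \text{threshold:}\quad & 2^{-6 \times 43} = 2^{-258} < 2^{-256},
      \qquad 2^{-6 \times 42} = 2^{-252} > 2^{-256},\\
  \text{single-key, 6 rounds:}\quad & 2^{-6 \times 54} = 2^{-324},\\
  \text{related-key, 7 rounds:}\quad & 2^{-6 \times 44} = 2^{-264}.
\end{align*}
```

Both estimates lie below $2^{-256}$ after far fewer rounds than the 20 rounds of the initialization phase.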
Since there is a large security margin, we expect that Rocca can resist differential attacks in the initialization phase.
Table 6: The lower bound for the number of active S-boxes in the initialization phase, where AS sk and AS rk mean an active S-box in the single-key setting and in the related-key setting, respectively.

Forgery Attack
It has been shown in [Nik14] that the forgery attack is a main threat to constructions like Tiaoxin-346 and AEGIS, as only a one-round update is used to absorb each block of associated data and message. Such a concern has been taken into account in our design phase, as reported in Sect. 3. Specifically, in the forgery attack, the aim is to find a differential trail where the attackers can arbitrarily choose differences in the associated data and expect that such a choice of differences leads to a collision in the internal state after several rounds. The resistance against this attack vector can be efficiently evaluated with an automatic method [MWGP11]. As Rocca is based on the AES round function, it suffices to prove that the number of active S-boxes in such a trail is at least 22, as the length of the tag is 128 bits. With the MILP-based method, it is found that the lower bound is 24. Consequently, Rocca can provide 128-bit security against the forgery attack.

Integral Attack
Integral attacks are among the most efficient attacks on round-reduced AES. Recently, Liu et al. presented some attacks [LIMS21] on round-reduced AEGIS-128 and Tiaoxin-346 based on the integral distinguisher on 4-round AES. To understand the security of our construction, it is necessary to evaluate the resistance against integral attacks. Similar to [LIMS21], the internal state is first expressed in terms of the initial state, and then we study the expressions.
For simplicity, denote the state after $r$ iterations of the round function at the initialization phase by $S_r$. In addition, when writing the expressions, we omit the constants and use $A(X)$ to represent that $X$ passes through one AES round, i.e. $A(X)$ can represent $A(X \oplus \epsilon)$ where $\epsilon$ is a 128-bit constant. In this way, the internal state $S_4$ can be expressed as follows: $A(A(N))$, $(A(A(N))) \oplus N$, ... As our construction can provide 256-bit security, it is necessary to evaluate the case when $N$ traverses all the $2^{128}$ possible values under the same 256-bit key. According to [LIMS21], some terms in the expressions can be eliminated by adding proper conditions and the expressions can be simplified. However, according to the expression of $S_4[3]$, when $N$ takes all the possible values, it is impossible that $S_4[3]$ will also take all the $2^{128}$ possible values. In other words, the multiset of $S_4[3]$ tends to be unstructured. Therefore, by considering the propagation of $S_4[3]$ and the way to compute the ciphertext, we believe that 20 rounds are sufficient to resist integral attacks.
On the other hand, consider the expressions for $S_6$, as shown below:

As the internal state consists of 8 blocks and the output in each round only leaks 256 bits of information, the attackers need to consider at least 4 consecutive rounds in order to recover the whole secret internal state.
Guess-and-determine attack. The guess-and-determine attack is a common tool to achieve state recovery. Consider four consecutive rounds at the encryption phase and denote the 4 internal states used to generate the ciphertexts by $S_t$, $S_{t+1}$, $S_{t+2}$ and $S_{t+3}$, respectively. In this case, the attackers can compute
Assuming the message blocks are all zero, we thus have
Therefore, the attackers at least need to consider the following 1024 nonlinear boolean equations in terms of 1024 boolean variables $(S_t[0], \ldots, S_t[7])$ in order to recover the secret state:
where $\alpha_i \in \mathbb{F}_2^{128}$ $(0 \le i \le 7)$ are known constants. It is obvious that the attackers should not completely guess 2 state blocks, as the time complexity of the guess will be $2^{256}$. A clever way is to guess a column and a diagonal of the state blocks, which fits well with the form of the outputs. Such a strategy will allow attackers to guess at most 8 columns and diagonals. However, one AES round is involved only in the conditions imposed by $(\alpha_0, \alpha_1, \alpha_3)$, i.e. the clever way is only applicable to these conditions. For the remaining conditions, two AES rounds are involved, which implies that the attackers at least need to guess a complete 128-bit block due to the full diffusion. For such reasons, we believe the time complexity of the guess-and-determine attack cannot be lower than $2^{256}$.
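The bit counts used in this argument can be written out explicitly. The following is a sketch under the standard AES assumption that a column or a diagonal of a 128-bit block consists of 4 bytes, i.e. 32 bits:

```latex
\[
  \underbrace{8 \times 128}_{\text{state bits}} = 1024
  = \underbrace{4 \times 256}_{\text{output bits over 4 rounds}},
  \qquad
  \underbrace{2 \times 128}_{\text{two full blocks}} = 256
  = \underbrace{8 \times 32}_{\text{8 columns/diagonals}}.
\]
```

The first equality is why 4 consecutive rounds are the minimum that yields as many equations as unknowns. The second shows that guessing either 2 full blocks or 8 columns and diagonals already amounts to 256 guessed bits, i.e. a cost of $2^{256}$.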
Algebraic attack. It is well known that the S-box of AES can be expressed as a set of quadratic boolean equations if the input zero is discarded. Therefore, the above equation system can be described as quadratic boolean equations by introducing extra intermediate variables to represent the outputs of the S-box for each AES round function. Notice that for different ciphertext blocks $(\alpha_0, \ldots, \alpha_7)$, the attackers have to introduce different variables due to the big difference between the equations. Although the system of equations is overdefined, the number of equations is only slightly larger than the number of variables, and the number of variables is much larger than 256. As far as we know, such a system of equations cannot be solved with time complexity lower than $2^{256}$.

The Linear Bias
Exploiting the fact that the output (keystream) of AEGIS is quadratic in terms of several state blocks and only a one-round update is used to process each message block, Minaud proposed a statistical attack [Min14] on the keystream of AEGIS-256. Such an attack was improved in [ENP19] with an automatic search method based on [SSS+19]. Specifically, the authors first utilized a simple truncated model and evaluated the minimal number of active S-boxes. It is found that for AEGIS-128, AEGIS-128L and AEGIS-256, all the results obtained in the simple truncated model suggest they are insecure against such a statistical attack. However, when searching for compatible linear trails at the bit level, almost all of them are incompatible. In addition, the results obtained in the refined model are far larger than those obtained in the simple truncated model.
To evaluate the resistance of our construction against such a statistical attack, we also adopted the simple truncated model as in [ENP19]. According to our results, the best case is to consider 4 consecutive rounds and the minimal number of active S-boxes is 38, which suggests that the time complexity of the distinguishing attack is at least $2^{228}$. Achieving 42 active S-boxes without affecting the performance is ambitious, and we believe 38 is enough to resist such an attack considering the big gap between the truncated model and the bitwise model as reported in [ENP19]. To further verify whether there is a compatible linear trail for the best solution obtained with the truncated model, we also implemented the bitwise model, where there is no additional constraint on the input mask and output mask of the S-box except the trivial infeasible pairs caused by the zero input mask or zero output mask. When searching for a compatible linear trail based on the truncated pattern, it is soon shown to be infeasible. One main reason is that, compared with the attack on AEGIS-256 requiring 2 consecutive rounds, this statistical attack on Rocca requires 4 consecutive rounds, which makes contradictions in the solutions obtained with the simple truncated model occur more easily when verified with the bitwise model. Taking this fact into account, we further believe Rocca is secure against this attack vector.
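The step from 38 active S-boxes to the $2^{228}$ estimate can be made explicit with the usual linear-cryptanalysis accounting. The sketch below assumes that each active AES S-box contributes a correlation of at most $2^{-3}$ in absolute value and that the complexity of a distinguisher scales as the inverse square of the trail correlation $c$:

```latex
\[
  |c| \le \left(2^{-3}\right)^{38} = 2^{-114},
  \qquad
  c^{-2} \ge 2^{2 \times 114} = 2^{228}.
\]
```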

The State-recovery Attack Using the Decryption Oracle
In a recent work [HII+22], it is shown that, by using a trivial decryption oracle, it is possible to recover the full internal state after the initialization phase with time complexity $2^{128}$. Indeed, such a state-recovery attack was observed by the designers of AEGIS-256, and it is unavoidable if the tag size is small. However, what matters is to prevent further key-recovery attacks after the internal state is recovered in such a way. In AEGIS-256, this is ensured by using a keyed permutation for the initialization phase. In this revised version, we simply use a key feed-forward operation to prevent the further key-recovery attack, because the attackers cannot invert the initialization phase without knowing the key even if the state after this phase is fully known.

Other Attacks
While there are many attack vectors for block ciphers, their application to Rocca is restricted, as the attackers only learn partial information about the internal state from the ciphertext blocks. In other words, reversing the round update function is impossible in Rocca without guessing many secret state blocks. For this reason, only the above potential attack vectors are taken into account. In addition, due to the usage of the constants $(Z_0, Z_1)$ at the initialization phase, the attack based on the similarity of the four columns of the AES state is also excluded.

No Claims
We do not claim the security of our scheme in the nonce-misuse setting, and it seems trivial to achieve state recovery in this setting, as the output is computed with only a one-round update function at the encryption phase. In addition, we do not claim the security of our scheme in the related-key and known-key settings, which are far from meaningful in real-world applications. For the attacks on the initialization phase, we emphasize that the attackers can only derive information from the restricted outputs and cannot know the full secret internal state.

Software Implementation
According to [ITU17], the target peak data rates for 5G communication are 10 Gbps for uplink and 20 Gbps for downlink. SNOW-V [EJMY19] is a new member of the SNOW family designed for 5G communication with 256-bit key support, and it achieves 58.25 Gbps on an Intel(R) Core(TM) i7 8650U CPU @1.90GHz in encryption-only mode. In the next generation (i.e. 6G), the target peak data rate is further increased to between 100 Gbps and 1 Tbps [LaL19]. In order to realize this high peak data rate, a new encryption algorithm is required.
We evaluate the performance of Rocca and show that Rocca can achieve 160 Gbps when encrypting data of large size. Modern CPUs are equipped with a dedicated instruction set for AES called AES New Instructions (AES-NI). As Rocca has the AES round function as its component, we can optimize the implementation by utilizing AES-NI. Specifically, we use _mm_aesenc_si128() for the AES round function. For XORing two 128-bit values, we use _mm_xor_si128(). We also compare the performance with existing algorithms and demonstrate Rocca's advantage in terms of performance. All evaluations were performed on a PC with an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz and 32GB RAM. For a fair comparison, we integrated Rocca as well as SNOW-V, Tiaoxin-346 and AEGIS into OpenSSL (3.1.0-dev) and measured their performance. We used the SNOW-V reference implementation with SIMD, which was given in [EJMY19]. For Tiaoxin-346 and AEGIS, we used the implementations available at https://github.com/floodyberry/supercop. The results are given in Table 7, and all performance results are given in Gbps. Since in TLS data is divided into chunks of $2^{14} = 16384$ bytes or less before it is encrypted, the values in Table 7 are close to what we expect in practice. As shown, Rocca is 4.16 times faster than SNOW-V and 3.10 times faster than AES-256-CTR when processing 16384-byte messages. It also outperforms both 128-bit algorithms which we tested. In encryption-only mode, the initialization is performed once and only the encryption is iterated. In AEAD mode, by contrast, the initialization, associated data addition, encryption, tag generation and finalization are iterated. Here, the size of associated data is fixed to 13 bytes. In the case of Rocca, the round function is iterated 20 times each in the initialization and the finalization, which is equivalent to processing 1280 bytes of input. As a result, we expect a 1280/16384 ≈ 8% overhead relative to the encryption-only mode for 16384 bytes of input.
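The overhead estimate above can be reproduced with a short calculation. The sketch below uses illustrative names of our own and assumes that one iteration of the round function corresponds to processing 32 bytes of input, in line with the 256 bits of output per round mentioned in the security analysis.

```python
# Illustrative constants; the 32-byte figure is an assumption stated in the lead-in.
ROUNDS_INITIALIZATION = 20
ROUNDS_FINALIZATION = 20
BYTES_PER_ROUND = 32            # assumed: one round handles a 256-bit block
TLS_MAX_CHUNK_BYTES = 2 ** 14   # 16384-byte chunks, as in the text


def equivalent_overhead_bytes():
    """Input size whose processing cost matches initialization plus finalization."""
    return (ROUNDS_INITIALIZATION + ROUNDS_FINALIZATION) * BYTES_PER_ROUND


def relative_overhead(payload_bytes):
    """Overhead of AEAD mode relative to encryption-only mode for one payload."""
    return equivalent_overhead_bytes() / payload_bytes


print(equivalent_overhead_bytes())                       # 1280
print(f"{relative_overhead(TLS_MAX_CHUNK_BYTES):.1%}")   # 7.8%
```

This counts only the extra rounds. It does not model the function-call overhead of the initialization, tag generation and finalization, nor the cost of absorbing the associated data.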
Additional overhead will be incurred by calling functions for the initialization, tag generation and finalization. The performance results on other CPUs are given in Appendix A, and Rocca achieves the best performance on the other CPUs as well. The performance can be further improved by using new instruction sets and/or optimizing the implementation. The new instruction set AVX512 contains _mm512_aesenc_epi128(), which runs four 128-bit AES round functions in parallel. As Rocca uses four AES round functions in one state update, replacing four calls to _mm_aesenc_si128() with one call to _mm512_aesenc_epi128() can improve the performance by up to 4 times.
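For readers without access to AES-NI, the operation performed by one aesenc call can be modelled in portable Python. This is a minimal reference sketch and not the implementation measured in this paper: the helper names (xtime, gf_mul, build_sbox, mix_column, aesenc) are our own, and the 16-byte state is assumed to be in the usual AES column-major order. It follows the standard definition of the instruction: ShiftRows, SubBytes, MixColumns, then XOR with the round key.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with the AES reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):          # x^254 is the inverse of x (0 maps to 0)
            inv = gf_mul(inv, x)
        s = 0x63
        for shift in range(5):        # inv XOR its left rotations by 1..4 bits
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s)
    return sbox


SBOX = build_sbox()


def mix_column(col):
    """Apply the AES MixColumns matrix to one 4-byte column."""
    return [gf_mul(col[r], 2) ^ gf_mul(col[(r + 1) % 4], 3)
            ^ col[(r + 2) % 4] ^ col[(r + 3) % 4] for r in range(4)]


def aesenc(state, round_key):
    """Model of one AES round: ShiftRows, SubBytes, MixColumns, key XOR."""
    # Byte at row r, column c is stored at index r + 4*c (column-major).
    shifted = [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]
    subbed = [SBOX[b] for b in shifted]
    mixed = []
    for c in range(4):
        mixed.extend(mix_column(subbed[4 * c:4 * c + 4]))
    return bytes(m ^ k for m, k in zip(mixed, round_key))


print(aesenc(bytes(16), bytes(16)).hex())  # sixteen bytes of 0x63
```

The model is meant for checking test vectors and reasoning about the round function, not for speed. On the CPUs considered here, the whole of `aesenc` above is a single instruction.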

Conclusions
To fulfill the basic requirements on the speed and security in 6G systems, i.e. 100 Gbps and 256-bit security, we are motivated to further study the generalized method to construct round functions based on parallel calls to the AES round function, which was first studied by Jean and Nikolić at FSE 2016. As a result, an efficient AES-based AEAD scheme called Rocca is proposed, whose construction is only based on the AES round function and the 128-bit XOR operation supported by the SIMD instructions on modern CPUs. In addition, we have performed a thorough study to understand the security of Rocca. According to the software implementation, Rocca can reach 150 Gbps in the AEAD mode, which is more than four times faster than SNOW-V designed for 5G systems. To the best of our knowledge, Rocca is the first dedicated scheme targeting 6G systems and it also shows the potential to reach the basic requirements in such systems.
As future work, a parallelizable mode of Rocca would be interesting and beneficial both for environments equipped with multiple cores and for environments without AES-NI support.

A Software Implementation Results on Other CPUs
We show software implementation results on other CPUs in Tables 8 to 10. The evaluations were performed on Windows 10 Pro 21H1 for Table 8, Windows 10 Pro 21H2 for Table 9 and macOS Big Sur 11.4 for Table 10. Although the difference in environments affects the performance of some algorithms (e.g. AEGIS-256, AES-256-CTR and ChaCha20), Rocca shows competitive performance in all environments.
We also evaluate the performance on Android and iOS, implemented with ARM NEON intrinsics. The results are shown in Tables 11 to 13. Note that the implementation of SNOW-V is not optimized and the shown results can be further improved by optimizing the implementation. In the original paper, Ekdahl et al. [EJMY19] showed that SNOW-V can achieve 23.59 Gbps on the Apple A11 SoC. Rocca achieves very competitive performance on recent mobile platforms. The performance is improved on the newer platforms (i.e. Snapdragon 888 and A15 Bionic) and further improvement is expected in the future.
Figs. 6, 7, and 8 show the round functions in Table 2 whose numbers of blocks are 8, 9, and 10, respectively. The round function with 8 blocks is the same as that of Rocca. The other 2 round functions, with 9 and 10 blocks, are simple extensions of it.
Fig. 6: The round function whose # of blocks is 8.