Orthros: A Low-Latency PRF

We present Orthros, a 128-bit block pseudorandom function. It is designed with a primary focus on the latency of fully unrolled circuits. For this purpose, we adopt a parallel structure comprising two keyed permutations. The round function of each permutation is similar to that of Midori, a low-energy block cipher; however, we thoroughly revise it to reduce latency and introduce different rounds to significantly improve cryptographic strength in a small number of rounds. We provide a comprehensive, dedicated security analysis. In hardware, Orthros achieves the lowest latency among state-of-the-art low-latency primitives. For example, using the STM 90nm library, Orthros achieves a minimum latency of around 2.4 ns, while other constructions such as PRINCE, Midori-128 and QARMA9-128-σ0 achieve 2.56 ns, 4.10 ns and 4.38 ns, respectively.


Low-Latency Encryption
Lightweight cryptography is a subfield of symmetric-key cryptography that studies cryptographic core functions usable under strong resource constraints. Hardware circuit size is a typical metric, and numerous proposals, such as GIFT [BPP+17], KATAN [DDK09], LED [GPPR11], Piccolo [SIH+11], PRESENT [BKL+07], and SIMON [BSS+13], perform particularly well on this metric. Other metrics exist, and among them, latency has been receiving increasing attention. Latency affects the response time of encryption or authentication, and a small latency is highly desirable for applications that require instant response, such as encryption of memory buses, storage systems, automotive communication and industrial control networks. Gaining throughput is generally possible with common signal processing techniques (pipelining and parallel processing), while achieving a low latency remains a challenge [KNR12].
To our knowledge, the first lightweight block cipher with an explicit focus on latency is PRINCE, proposed by Borghoff et al. [BCG+12]. PRINCE is a 64-bit block cipher comprising multiple round functions to significantly reduce the number of rounds while keeping a moderate complexity of each round. Another proposal is QARMA by Avanzi [Ava17], a family of low-latency tweakable block ciphers (TBCs) [LRW02] that adopts the design strategy of PRINCE. Mantis [BJK+16] is another family of low-latency TBCs. Midori is a family of block ciphers proposed by Banik et al. [BBI+15]. It primarily aims to reduce energy; however, its latency is also quite small.
Existing work on low-latency encryption has focused on invertible primitives, i.e., (tweakable) block ciphers. We started with the question of whether this is the only approach, namely, whether we can do better by not requiring an invertible primitive. Motivated by this question, we initiated a study on designing a low-latency (non-invertible) pseudorandom function (PRF) consisting of parallel keyed permutations. We study a sum of two block ciphers denoted as $C = E_K(M) \oplus E'_K(M)$, where $E, E' : \mathcal{K} \times \mathcal{M} \to \mathcal{M}$ are different n-bit block ciphers with a key space $\mathcal{K}$ and a message space $\mathcal{M} = \{0, 1\}^n$. Since $E$ and $E'$ can be computed in parallel, the critical path length of a fully unrolled circuit is the maximum of the two instead of their sum. The resulting function has an n-bit block and is not invertible in general.
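As a minimal illustration of the sum structure described above, the following sketch uses two hypothetical 3-bit lookup tables (they are not Orthros components) to show that the XOR of two permutations is generally not invertible, and that the unrolled latency is governed by the slower branch. The names `E1`, `E2`, `sum_prf` and `critical_path` are assumptions made for this example, and the final XOR gate is ignored in the latency estimate.

```python
# Toy 3-bit example with hypothetical tables (not Orthros components).
E1 = [0, 1, 2, 3, 4, 5, 6, 7]  # stands in for E_K
E2 = [3, 6, 1, 4, 7, 2, 5, 0]  # stands in for E'_K, also a permutation


def sum_prf(m):
    """C = E_K(M) xor E'_K(M); the two lookups are independent of each other."""
    return E1[m] ^ E2[m]


def critical_path(delay_e1, delay_e2):
    """Latency of the parallel unrolled circuit, ignoring the final XOR:
    the maximum of the two branch delays, not their sum."""
    return max(delay_e1, delay_e2)


outputs = [sum_prf(m) for m in range(8)]
is_invertible = len(set(outputs)) == 8  # False here: outputs collide
```

In this toy case the outputs collide heavily, which is only meant to show non-invertibility; it says nothing about the PRF quality of a real instantiation.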
The sum of permutations is not new; it has been adopted in the designs of RIPEMD-160 [DBP96] and Grøstl [GKM+09]. In addition, the sum of permutations has been extensively studied in the context of provable security (see Section 3.1). In particular, the result of Dai et al. [DHT17] suggests that it can ideally achieve n-bit PRF security, i.e., it is indistinguishable from a truly random function up to $O(2^n)$ complexity. However, this requires that $E_K$ and $E'_K$ behave as computationally secure block ciphers, more formally, as (computationally) independent pseudorandom permutations (PRPs). Instead of requiring this, we explore the setting in which $E$ and $E'$ are rather weak as stand-alone block ciphers, using a small number of very simple rounds. The point is that the outputs of $E$ and $E'$ are never given in clear; hence we can hope that each covers the weaknesses of the other, and consequently that their sum can tolerate dedicated attacks as a PRF.

Our Design
Based on the aforementioned considerations, we present Orthros, a 128-bit block pseudorandom function (PRF) with a 128-bit key. The overall structure of Orthros is a sum of two SPN-type keyed permutations called Branch1 and Branch2. The round functions of Orthros are based on Midori. Midori is already well suited to low latency; however, our thorough study shows that latency can be improved further by adopting new permutation layers and S-boxes.
In particular, we propose a hybrid use of bit and nibble permutations, i.e., a bit permutation is used for some rounds and a nibble permutation for the rest, while Midori-128 [BBI+15] uses a single linear layer including both a bit permutation and a nibble permutation. Consequently, each branch of Orthros achieves full diffusion in 2.5 rounds and attains 64 active S-boxes over 10 rounds, while Midori-128 requires 3 rounds for full diffusion and 13 rounds for 64 active S-boxes. In addition, the whole of Orthros has more than 64 active S-boxes over only 5 rounds. Importantly, this change of linear layers does not incur any additional hardware cost in an unrolled implementation.
For a PRF, we do not need an involutory S-box, unlike Midori and QARMA. This allows us to develop a new 4-bit S-box that offers roughly half the delay of that of Midori-128 [BBI+15] (see Table 7 on page 14).
Since we do not rely on provable security of Sum of Permutations, we carried out an extensive security analysis on not only the components of Branch1 and Branch2 but also the whole Orthros.
Motivation for using 128-bit PRF. The lack of invertibility limits applications. For example, the classical CBC and XTS (a storage encryption mode) [Dwo10] require the decryption routine of a block cipher. However, many popular modes, e.g., CTR, CMAC [Dwo05] and GCM [NIS07], do not require the decryption routine (this property is also called inverse-freeness). For these modes, Orthros can be used as the cryptographic core. For examples of applications that require low latency, ARM's pointer authentication code

Specification
Orthros is a 128-bit pseudorandom function (PRF) with a 128-bit key, the overview of which is illustrated in Fig 1. On the whole, Orthros consists of two SPN-based 128-bit keyed permutations called Branch1 and Branch2, each composed of an S-layer, P-layer, the round-key addition and the constant addition. The S-layer is the parallel application of a 4-bit S-box and the P-layer is a linear transform (bit or nibble permutation, followed by a matrix multiplication). Moreover, two key scheduling functions called KSF1 and KSF2 based on two different bit permutations are exploited in Branch1 and Branch2, respectively.
In Orthros, a 128-bit plaintext $M$ is first copied to two 128-bit internal states $X^1$ and $X^2$. Then $X^1$ and $X^2$ are given to Branch1 and Branch2, respectively. The 128-bit ciphertext $C$ is the XOR of the outputs of Branch1 and Branch2. More details are given in the following.

Key Scheduling Function
Orthros adopts two bit-permutation-based key scheduling functions called KSF1 and KSF2, which generate $RK^1_r$ and $RK^2_r$ ($0 \le r \le 12$) from the same 128-bit key $K$ for Branch1 and Branch2, respectively. The whitening keys are $RK^1_0$ and $RK^2_0$, which are first XORed with $X^1$ and $X^2$, respectively. $RK^1_r$ and $RK^2_r$ ($1 \le r \le 12$) are the round keys used in the r-th round of Branch1 and Branch2, respectively. The algorithms of KSF1 and KSF2 are shown in Fig 2. The bit permutations $P_{bk1}$ and $P_{bk2}$ used in KSF1 and KSF2 are shown in Table 2.
In Fig 2, when $K$ and $RK^j_r$ are expressed at the bit level, we have $RK^j_r = (rk^j_{r,0}, rk^j_{r,1}, \cdots, rk^j_{r,127})$ and $K = (k_0, k_1, \cdots, k_{127})$, where $rk^j_{r,i}, k_i \in \mathbb{F}_2$ ($0 \le i \le 127$, $j \in \{1, 2\}$). In addition, $rk^j_{r,0}$ and $k_0$ are the most significant bits of $RK^j_r$ and $K$, respectively.
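The following is a minimal sketch of a bit-permutation-only key schedule using the MSB-first bit numbering stated above. Since Fig 2 and Table 2 are not reproduced here, two things are assumptions: that the round keys are obtained by applying the bit permutation repeatedly to the key, and that bit i moves to position perm[i]. The permutation `toy_perm` is a hypothetical stand-in, not $P_{bk1}$ or $P_{bk2}$.

```python
WIDTH = 128


def permute_bits(x, perm):
    """Move bit i of x to position perm[i]; bit 0 is the most significant bit."""
    y = 0
    for i in range(WIDTH):
        if (x >> (WIDTH - 1 - i)) & 1:
            y |= 1 << (WIDTH - 1 - perm[i])
    return y


def round_keys(key, perm, count=13):
    """Assumed structure: RK_0 .. RK_12 come from iterating the permutation."""
    keys = []
    state = key
    for _ in range(count):
        state = permute_bits(state, perm)
        keys.append(state)
    return keys


# Hypothetical permutation for illustration: shift every bit index by one.
toy_perm = [(i + 1) % WIDTH for i in range(WIDTH)]
toy_keys = round_keys(1 << 127, toy_perm)  # key with only k_0 set
```

Because only wiring is involved, such a schedule preserves the Hamming weight of the key in every round key, which the toy run makes visible.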

Round Function of Branch1 and Branch2
In this section, we present the details of Branch1 and Branch2, each of which is a 128-bit keyed permutation consisting of 12 rounds. The round keys (and whitening keys) $(RK^1_r, RK^2_r)$ are first generated by KSF1 and KSF2. After adding the whitening keys, the 128-bit input is processed via Branch1 and Branch2 as follows.
For the first 4 rounds of Branch1 and Branch2, the round function R is described as
$$R = \mathrm{AddConstant} \circ \mathrm{AddRoundKey} \circ \mathrm{matrixMul} \circ P_{brN} \circ \text{S-box},$$
where AddConstant and AddRoundKey represent the constant addition operation and the round key addition operation, respectively. For the following 7 rounds, the round function R is described as
$$R = \mathrm{AddConstant} \circ \mathrm{AddRoundKey} \circ \mathrm{matrixMul} \circ P_{nN} \circ \text{S-box}.$$
The sequence of operations in the last round is $\mathrm{AddConstant} \circ \mathrm{AddRoundKey} \circ \text{S-box}$. As described, the 128-bit outputs of Branch1 and Branch2 are XORed to generate the 128-bit ciphertext. Each component in the round function of Branch1 and Branch2 is described as follows.

S-box (S-box).
A 4-bit S-box is applied to each nibble in parallel in Branch1 and Branch2. The specification of the 4-bit S-box is displayed in Table 1.
Bit and nibble permutations. In the first 4 rounds of Branch1 and Branch2, the bit permutations $P_{br1}$ and $P_{br2}$ are used, respectively. From the 5th round to the 11th round, the nibble permutations $P_{n1}$ and $P_{n2}$ are adopted in each branch, respectively. The details of the permutations $P_{brN}$ and $P_{nN}$, where $N \in \{1, 2\}$, are shown in Table 3 and Table 4, respectively.
Matrix Multiplication (matrixMul). Let $M_b$ be the 4 × 4 matrix over nibbles defined as
$$M_b = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}.$$
Four nibbles $(a_0, a_1, a_2, a_3)$ are updated as follows:
$$(a_0, a_1, a_2, a_3)^T \leftarrow M_b \cdot (a_0, a_1, a_2, a_3)^T,$$
where $(a_0, a_1, a_2, a_3)^T$ denotes a transposition. Specifically, $M_b$ is applied independently to each of the 8 groups of 4 consecutive nibbles $(X_{4i}, X_{4i+1}, X_{4i+2}, X_{4i+3})$, $0 \le i \le 7$, of the state.
AddRoundKey (AddRoundKey). In the r-th round, the internal states in Branch1 and Branch2 are XORed with the corresponding round keys $RK^1_r$ and $RK^2_r$, respectively.
AddConstant (AddConstant). The internal state is XORed with the corresponding round constant in each branch. The specification of the round constants is displayed in Table 10 in Appendix A, where $RC^1_r$ and $RC^2_r$ represent the round constants used in the r-th round of Branch1 and Branch2, respectively.
Figure 4: The round function of Branch1 and Branch2 in the last 8 rounds. The nibble permutation and the matrix multiplication in the last round will be omitted.

Round Constants.
In a similar manner to PRINCE, the round constants are derived from the fraction part of π = 3.1415926.... Specifically, if the fraction part of π = 3.1415926... is expressed as a binary string, it will be 00010100000101011001.... By shifting the binary string left by 3 bits, we obtain the binary string 1010 0000 1010 1100 1..., which corresponds to the nibble string 0xa, 0x0, 0xa, 0xc, and so on.
Pseudocode. The algorithms of Branch1 and Branch2 are shown in Fig 5. In the pseudocode, when the 128-bit internal state is expressed in bits, we have $X = (x_0, x_1, \cdots, x_{127})$ and $x_0$ is the most significant bit of $X$. When the 128-bit internal state is expressed in nibbles, we have $X = (X_0, X_1, \ldots, X_{31})$ and $X_0$ is the most significant nibble of $X$.
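The round-constant derivation described under "Round Constants" can be sketched as follows. The quoted bit string 0001 0100 0001 0101 1001 matches a digit-by-digit 4-bit encoding of the decimal digits 1, 4, 1, 5, 9; treating it that way is an assumption made for this example, as is the helper name `pi_nibbles`. How the resulting nibble stream is split across the constants of Table 10 is not covered.

```python
def pi_nibbles(fraction_digits, shift=3):
    """Encode each decimal digit as 4 bits, drop the first `shift` bits,
    and regroup the remaining bits into nibbles (incomplete tail discarded)."""
    bits = "".join(format(int(d), "04b") for d in fraction_digits)
    bits = bits[shift:]
    count = len(bits) // 4
    return [int(bits[4 * i:4 * i + 4], 2) for i in range(count)]


# Digits of the fraction part quoted in the text: 3.1415926...
nibbles = pi_nibbles("1415926")
```

For the digits quoted in the text, the first four nibbles come out as 0xa, 0x0, 0xa, 0xc, in agreement with the string given above.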
For the constant addition, the 128-bit round constant $RC^j_r$ is expressed in nibbles as $RC^j_r = (RC^j_{r,0}, RC^j_{r,1}, \cdots, RC^j_{r,31})$, for $1 \le r \le 12$ and $1 \le j \le 2$, where $j$ denotes the index of the branch. Similarly, for the round-key addition, the 128-bit round key $RK^j_r$ is expressed in nibbles as $RK^j_r = (RK^j_{r,0}, RK^j_{r,1}, \ldots, RK^j_{r,31})$, for $1 \le r \le 12$ and $1 \le j \le 2$.
Processing. The 128-bit ciphertext C is generated by XORing the output of Branch1 and Branch2. The processing algorithm of Orthros is shown in Fig 6. We provide test vectors of Orthros in Appendix H.
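To make the round structure concrete, here is a structural sketch of one branch and of the final XOR. It is not a reference implementation: the S-box, permutations, round keys and round constants (Tables 1, 3, 4 and 10) are passed in as parameters because they are not reproduced here. Two points are assumptions: the matrix is taken to be the Midori-style binary matrix in which each nibble becomes the XOR of the other three in its group, and both permutations are applied as "element i moves to position perm[i]". All function names are hypothetical.

```python
def matrix_mul(state):
    """Assumed almost-MDS matrix: each nibble becomes the XOR of the other
    three nibbles in its group of 4 consecutive nibbles."""
    out = []
    for c in range(0, 32, 4):
        group = state[c:c + 4]
        total = group[0] ^ group[1] ^ group[2] ^ group[3]
        out.extend(total ^ v for v in group)
    return out


def bit_permute(state, perm):
    """Apply x_{perm[i]} <- x_i on the 128 bits (x_0 is the MSB of nibble 0)."""
    bits = [(n >> (3 - j)) & 1 for n in state for j in range(4)]
    moved = [0] * 128
    for i, b in enumerate(bits):
        moved[perm[i]] = b
    return [moved[4 * k] << 3 | moved[4 * k + 1] << 2
            | moved[4 * k + 2] << 1 | moved[4 * k + 3] for k in range(32)]


def nibble_permute(state, perm):
    """Assumed convention: nibble i moves to position perm[i]."""
    moved = [0] * 32
    for i, n in enumerate(state):
        moved[perm[i]] = n
    return moved


def branch(state, rks, rcs, sbox, bit_perm, nib_perm):
    """rks[0] is the whitening key; rks[r], rcs[r] belong to round r (1..12)."""
    x = [s ^ k for s, k in zip(state, rks[0])]
    for r in range(1, 13):
        x = [sbox[n] for n in x]
        if r <= 4:                      # rounds 1-4: bit permutation
            x = matrix_mul(bit_permute(x, bit_perm))
        elif r <= 11:                   # rounds 5-11: nibble permutation
            x = matrix_mul(nibble_permute(x, nib_perm))
        # round 12: no permutation and no matrix multiplication
        x = [n ^ k ^ c for n, k, c in zip(x, rks[r], rcs[r])]
    return x


def prf(message, params1, params2):
    """Ciphertext = Branch1(M) xor Branch2(M); params are argument tuples."""
    y1 = branch(message, *params1)
    y2 = branch(message, *params2)
    return [a ^ b for a, b in zip(y1, y2)]
```

With identity tables and all-zero keys and constants, a branch reduces to eleven applications of the assumed matrix, which is a convenient sanity check before plugging in the real tables.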
Claimed Security. Orthros claims single-key security, and does not claim any security in related-key and known/chosen-key settings.

General Construction
As described in the introduction, the overall structure of Orthros is a sum of two keyed permutations. This structure and its variants have been extensively studied in the literature [Luc00, BI99, Pat08, DHT17, CLM19, Iwa06]. In particular, if the two keyed permutations of Orthros (Branch1 and Branch2) were independent PRPs, we could claim n-bit provable security, more specifically a PRF advantage of $(q/2^n)^{1.5} + 2\epsilon(q, t + O(q))$ for n = 128, q adaptive queries and time complexity t, where $\epsilon(q, t)$ denotes the PRP advantage of Branch1 and Branch2 with q queries and time t [DHT17]. However, this means that either Branch1 or Branch2 could already be usable as a low-latency PRP, implying that they should have a sufficient security margin against known cryptanalysis. Since each branch never gives its outputs in clear, we expect that a pair of weak permutations can suffice to obtain the desired n-bit secure PRF. Generally, this approach is described as "prove-then-prune" [HKR15]. This means that the provable security reduction no longer holds, and the implication of the security bound is more or less fuzzy and depends on the scheme. In our analysis, we did not find a full-round attack against Branch1 or Branch2 as a PRP; however, they have rather slim margins as standalone block ciphers. Therefore, we bolster our security claim with an extensive security analysis of the whole construction. We note that realizing Branch1 and Branch2 as Even-Mansour ciphers [EM93] can reduce the circuit size thanks to the absence of a key schedule. However, this can provide 2n/3-bit PRF security at best, as proved by Chen et al. [CLM19]. Instead, we adopt bit-permutation-based key scheduling functions for their hardware friendliness.
In order to investigate the initial security of the sum of permutations from the perspective of cryptanalysis, we compare in Appendix B the differential and linear behaviour for two toy ciphers adopting single branch and double branches, respectively. However, it should be emphasized that the security of Orthros never relies on our experiments on the toy ciphers but rather a comprehensive study of Orthros, as will be detailed in Section 4.

Linear Layer
As the underlying matrix in the linear layer, we adopt the 4 × 4 almost MDS binary matrix used in Midori [BBI+15], whose delay is much smaller than that of MDS matrices. However, as discussed in [BBI+15], its diffusion is slower and the lower bound on the number of active S-boxes in each round is smaller than for ciphers employing MDS matrices, due to its lower branch number. To improve the diffusion speed and to increase the number of active S-boxes in each round, we utilize bit and nibble permutations in a hybrid manner. We will see that this makes it possible to guarantee security with a relatively small number of rounds.
Midori-128 [BBI+15] also adopts a bit permutation and a nibble permutation; however, our design is more efficient in terms of diffusion speed and the number of active S-boxes while keeping the same hardware cost in an unrolled implementation. Specifically, we adopt two different linear layers, one based on a bit permutation and one based on a nibble permutation, i.e., a bit permutation is used for some rounds and a nibble permutation for the rest, while Midori-128 [BBI+15] uses a single linear layer including both a bit permutation and a nibble permutation. Importantly, this change of linear layers does not incur any additional hardware cost in an unrolled implementation, as observed by [BCG+12].

Bit Permutation vs Nibble Permutation.
To see the advantage of our hybrid use of bit and nibble permutations over the consistent use of a bit or a nibble permutation, we compare the diffusion effect and the lower bound of the number of active S-boxes of two different 128-bit block SPN structures, which we call SPN-B and SPN-N. Here, SPN-B (SPN-N) consistently uses a bit (nibble) permutation. Both use the same 4-bit S-box and the matrix as those used in a branch of Orthros. In addition, we used the full-diffusion property of S-box when evaluating the diffusion effect.

Diffusion.
To evaluate the minimum number of rounds that achieves the full diffusion for each SPN-B and SPN-N, we look into the propagation of one active input bit and count the upper bounds of the number of active bits through each operation. Here, we only need to consider the number of active bits after S-box and matrixMul as the remaining operations do not affect its value. The upper bounds of the number of active bits after each operation over some rounds are shown in Table 5.
According to Table 5, SPN-B and SPN-N require at least 2.5 rounds (2 rounds plus S-box) and 4 rounds, respectively, i.e., the optimal numbers of rounds for the full diffusion of SPN-B and SPN-N are estimated as 2.5 and 4 rounds, respectively. We conclude that there is a clear gap between bit permutations and nibble permutations in terms of diffusion. Note that the class of bit permutations covers all nibble permutations. Thus, to achieve 2.5-round full diffusion, we need to find a class of bit permutations that is not included in the class of nibble permutations.
Active S-box. Mixed Integer Linear Programming (MILP) is used to obtain the lower bound of the number of active S-boxes in each round. Unfortunately, it is computationally infeasible to estimate the lower bound of the number of active S-boxes of SPN-B, except for a very small number of rounds, due to the large search space of 128-bit bitwise differential and linear trails. This problem with the use of bit permutations was also pointed out by the designer of QARMA-128 [Ava17]. As a consequence, QARMA-128 does not claim a lower bound of active S-boxes. Indeed, it was infeasible to compute a lower bound of the number of active S-boxes for more than 5 rounds of SPN-B, even with a computer equipped with 48 cores and 256 GB RAM. Therefore, SPN-B can only guarantee a very small number of active S-boxes for a moderately large number of rounds (say 10), since we have to combine the bounds obtained for a small number of rounds, which generally yields a loose bound. For example, the lower bound for 8 rounds is obtained by adding the 4-round bound to itself. In our experiment, the best lower bound for 4 rounds is 16; therefore, the obtained bound for 8 rounds is only 32. On the other hand, for SPN-N, it is feasible to obtain a tight lower bound of the number of active S-boxes for up to 8 rounds, using the aforementioned 48-core computer. The evaluation of one candidate requires about 2 days on the same computer. As a result, we found a class of nibble permutations that attain 60 active S-boxes over 8 rounds. We will explain the details in the following section.
Conclusion. Based on the above discussions, we conclude that a structure employing a single permutation cannot achieve fast full diffusion and a guaranteed large number of active S-boxes simultaneously. Hence, we decided to use a bit permutation for the first few rounds and a nibble permutation for the rest of the rounds. Consequently, Orthros reaches full diffusion after the first 2.5 rounds and guarantees at least 64 active S-boxes over 10 rounds.

Finding Optimal Bit Permutations for Diffusion.
We take a two-step approach to find a class of bit permutations that achieves 2.5-round full diffusion for SPN-B, which turns out to be optimal. Let $S$ denote the 128-bit internal state of SPN-B. It is also viewed as a 4 × 8 two-dimensional nibble array:
$$S = \begin{pmatrix} S_0 & S_4 & \cdots & S_{28} \\ S_1 & S_5 & \cdots & S_{29} \\ S_2 & S_6 & \cdots & S_{30} \\ S_3 & S_7 & \cdots & S_{31} \end{pmatrix}.$$
Note that $S$ can also be viewed as a 16 × 8 two-dimensional binary array by seeing each column of $S$ as a 16-bit sequence. Hereafter, bit-cell means a binary cell in the 16 × 8 array and nibble-cell means a nibble cell in the 4 × 8 array, that is, $S_i$.
We first reduce the search space of target 128-bit permutations, as it is computationally infeasible to test all $128! \approx 2^{716.16}$ candidates. Specifically, we focus on a class satisfying Condition 1 to efficiently find the class of bit permutations having the 2.5-round full diffusion property.
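The size of this search space is easy to check numerically. The short sketch below computes the base-2 logarithm of n! for the bit-permutation case (n = 128) and, for comparison, the nibble-permutation case (n = 32) that appears later in this section; the helper name `log2_factorial` is an assumption of this example.

```python
import math


def log2_factorial(n):
    """Return log2(n!) using the log-gamma function, since lgamma(n + 1) = ln(n!)."""
    return math.lgamma(n + 1) / math.log(2)


bit_perm_space = log2_factorial(128)     # all 128-bit bit permutations
nibble_perm_space = log2_factorial(32)   # all 32-nibble permutations
```

Both values agree with the exponents quoted in the text when rounded to two decimal places.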

Condition 1.
For each nibble-cell $S_i$ ($0 \le i \le 31$), its 4 bits are mapped to the bit-cells in different columns after applying the bit permutation.
A detailed description of Condition 1 can be found in Appendix C.1. Condition 1 can be justified as follows. If a bit permutation satisfies Condition 1, the 4 output bits of each S-box are mapped to 4 different groups of inputs to the binary matrix. Observe that in the matrixMul operation, the binary matrix $M_b$ is independently applied to 8 different groups of inputs, each of which consists of 4 consecutive nibbles. This is expected to be a key feature for fast diffusion, since 1 active input bit is expanded to 4 active bits via the S-box, and the 4 active bits are subsequently expanded to 12 active bits after the bit permutation and matrixMul operations. This matches the upper bound of the number of active bits after a one-round permutation, as shown in Table 5.
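The 1 to 4 to 12 expansion argued above can be replayed on sets of active bit positions. The sketch below is an upper-bound model only (no cancellations are tracked). It assumes that bit i lies in nibble i // 4 and in column i // 16, that the S-box has full diffusion, and that the binary matrix maps each nibble to the XOR of the other three in its column. `toy_bit_perm` is a hypothetical permutation built to send the four bits of every nibble to four different columns; it is not one of the permutations of Table 3.

```python
def toy_bit_perm(i):
    """Hypothetical permutation: bit b of a nibble moves b columns to the right."""
    col, row, bit = i // 16, (i % 16) // 4, i % 4
    return 16 * ((col + bit) % 8) + 4 * row + bit


def maps_nibble_bits_to_distinct_columns(perm):
    """Check that the 4 bits of every nibble land in 4 different columns."""
    return all(len({perm(4 * n + j) // 16 for j in range(4)}) == 4
               for n in range(32))


def sbox_layer(active):
    """Full-diffusion S-box: one active input bit activates the whole nibble."""
    return {4 * (i // 4) + j for i in active for j in range(4)}


def perm_layer(active, perm):
    return {perm(i) for i in active}


def matrix_layer(active):
    """Each active bit spreads to the same bit position of the other 3 rows."""
    out = set()
    for i in active:
        col, row, bit = i // 16, (i % 16) // 4, i % 4
        out.update(16 * col + 4 * r + bit for r in range(4) if r != row)
    return out


after_sbox = sbox_layer({0})
after_perm = perm_layer(after_sbox, toy_bit_perm)
after_matrix = matrix_layer(after_perm)
```

Starting from a single active bit, the three sets have sizes 4, 4 and 12, matching the counts in the paragraph above; the identity permutation fails the column check.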
In the class of the bit permutations satisfying Condition 1, we obtain the following sufficient condition for the 2.5-round full diffusion. It should be emphasized that for each active bit in the nibble-cell S i (0 ≤ i ≤ 31), after applying the S-box and the bit permutation satisfying Condition 1, there will exist 4 different groups of inputs to the binary matrix, each of which will contain exactly one active nibble. Therefore, after matrixMul there will be 12 active nibbles. We add the following condition on these 12 active nibbles.

Condition 2.
After applying the bit permutation, in each column of the 4 × 8 array, there exist at least 2 nibble-cells containing the bits coming from those in the active 12 nibbles.
The proof that Condition 2 is sufficient for the 2.5 round full diffusion in the class of the bit permutations satisfying Condition 1 is described in Appendix C.2.
The bit permutations of Table 3 used in Branch1 and Branch2 each satisfy both Condition 1 and Condition 2, i.e., they attain 2.5-round full diffusion.

Finding Good Nibble Permutation for Active S-boxes.
As in the case of bit permutations, we take the following two-step approach to find nibble permutations that can activate as many S-boxes as possible over a certain number of rounds, which is expected to outperform that used in Midori-128. To explain our approach, we use the same 4 × 8 array to express the 128-bit state as above.
In the first step, we reduce the search space by focusing on the class of nibble permutations satisfying Condition 3, since it is computationally infeasible to estimate the lower bound of the number of active S-boxes for all $32! \approx 2^{117.66}$ permutations. Condition 3 is chosen to achieve fast diffusion for differences and linear masks.
Condition 3.
For each column $(S_{4i}, S_{4i+1}, S_{4i+2}, S_{4i+3})$ in the 4 × 8 array, after applying the nibble permutation, its nibbles are mapped to four nibble-cells in different columns.

Subhadeep Banik and Takanori Isobe and Fukang Liu and Kazuhiko Minematsu and Kosei Sakamoto 49
In the second step, we first randomly choose 7,000 nibble permutations satisfying Condition 3. Then, we compute the lower bound of the number of active S-boxes after 5, 6, 7 and 8 rounds for these nibble permutations. To efficiently find good permutations among these nibble permutations, we first find a class of nibble permutations that have the best lower bound after 5 rounds. Then, we focus on this class and evaluate the lower bound of the number of active S-boxes after 6 rounds, which will be repeated until the 8th round.
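The sampling step can be sketched with simple rejection sampling. This is only an illustration of drawing candidates that satisfy Condition 3; the MILP evaluation of active S-boxes is not reproduced. The convention that nibble-cell i moves to position perm[i], the function names, and the use of rejection sampling itself are assumptions of this example rather than the procedure actually used.

```python
import random


def satisfies_condition3(perm):
    """Each source column (S_4c .. S_4c+3) must land in 4 different columns."""
    return all(len({perm[4 * c + r] // 4 for r in range(4)}) == 4
               for c in range(8))


def sample_condition3(rng, max_tries=1_000_000):
    """Shuffle until a permutation of the 32 nibble positions satisfies Condition 3."""
    perm = list(range(32))
    for _ in range(max_tries):
        rng.shuffle(perm)
        if satisfies_condition3(perm):
            return list(perm)
    raise RuntimeError("no candidate found within max_tries")


def sample_candidates(count, seed=0):
    """Draw `count` candidates from a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [sample_condition3(rng) for _ in range(count)]


candidates = sample_candidates(5, seed=1)
```

Each candidate would then be passed to the active S-box evaluation for 5 rounds, and only the best ones kept for the 6-, 7- and 8-round evaluations, as described above.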
As a result, we find three nibble permutations that can achieve 60 active S-boxes over 8 rounds. Table 6 shows the comparison of the lower bound of the number of active S-boxes for Midori-128 and our structure. Compared with Midori-128, our nibble permutations guarantee a much larger number of active S-boxes after 6 rounds, about a factor of 1.5.

Hybrid Use of Bit and Nibble Permutations.
Consequently, we obtain a set of bit and nibble permutations. For Orthros, we pick a bit permutation and a nibble permutation from the set, and use the bit permutation for the first 4 rounds in order to achieve fast full diffusion. Specifically, it achieves full diffusion in 2.5 rounds, while Midori-128 requires 3 rounds. For the remaining 8 rounds, we use the nibble permutation to guarantee a large number of active S-boxes. Indeed, 10 rounds of Orthros starting from the 3rd round, i.e., the 3rd to the 12th round, achieve 64 active S-boxes. We note that Midori-128 needs 13 rounds to obtain 64 active S-boxes; thus the gain is 3 rounds.

S-box
We search for a small-delay and lightweight 4-bit S-box which fulfills the following requirements: (1) the maximal probability of a differential is $2^{-2}$, (2) the maximal absolute bias of a linear approximation is $2^{-2}$ and (3) full diffusion, i.e., any input bit difference diffuses to all output bits. We use a metric called depth [BBI+15] to estimate the path delay of S-boxes.
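The three requirements can be checked mechanically for any 4-bit table. The sketch below is a straightforward exhaustive check, not the search procedure itself; the function names are assumptions of this example. Since Table 1 is not reproduced here, the demonstration uses the PRESENT S-box, which is known to meet the first two bounds; no claim is made here about whether it meets the third.

```python
def max_differential_probability(sbox):
    """Largest probability of a nonzero-input differential."""
    best = 0
    for dx in range(1, 16):
        counts = [0] * 16
        for x in range(16):
            counts[sbox[x] ^ sbox[x ^ dx]] += 1
        best = max(best, max(counts))
    return best / 16


def parity(v):
    return bin(v).count("1") & 1


def max_linear_bias(sbox):
    """Largest absolute bias over input masks a and nonzero output masks b."""
    best = 0
    for a in range(16):
        for b in range(1, 16):
            agree = sum(parity(a & x) == parity(b & sbox[x]) for x in range(16))
            best = max(best, abs(agree - 8))
    return best / 16


def has_full_diffusion(sbox):
    """Every single-bit input difference can reach every output bit."""
    for i in range(4):
        reach = 0
        for x in range(16):
            reach |= sbox[x] ^ sbox[x ^ (1 << i)]
        if reach != 0xF:
            return False
    return True


PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
```

A candidate passes requirements (1) and (2) when both functions return 0.25, i.e. $2^{-2}$, and requirement (3) when `has_full_diffusion` returns True.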

Definition 1. (depth):
The depth is defined as the sum of the sequential path delays of basic operations, namely AND, OR, NAND, NOR and NOT.
Following the assumption of [BBI+15], our search assumes that the depths of XOR, AND/OR, NAND/NOR, and NOT are weighted as 2, 1.5, 1 and 0.5, respectively, and that the gate counts of NOT, NAND/NOR, AND/OR and XOR/XNOR are estimated as 0.5, 1, 1.5 and 2 Gate Equivalents (GEs), respectively. We search over the set of all 4-bit S-boxes, whose size is $2^{44.3}$, sort them in increasing order of depth, and check whether they satisfy our security requirements.
We remark that our construction does not require the involution property of the S-box, unlike Midori's $Sb_1$. This allows us to expand the number of possible candidates from $2^{25.5}$ (the number of all involutory 4-bit S-boxes) to $2^{44.3}$. As a result, we found an S-box (see Table 1) whose depth and gate size are the lowest and the smallest in our search. Specifically, the depth is 3.5 and the area is 19 GE under the aforementioned assumption of [BBI+15]. The S-box can be expressed as follows, where inputs and outputs are defined as $\{x_0, x_1, x_2, x_3\}$ and $\{y_0, y_1, y_2, y_3\}$, and $x_3$ and $y_3$ are the most significant bits.
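The two candidate counts quoted above can be reproduced by counting permutations and involutions of a 16-element set. The sketch below uses the standard recurrence for the number of involutions, I(n) = I(n-1) + (n-1) I(n-2); the function names are assumptions of this example.

```python
import math


def count_involutions(n):
    """Number of permutations of n elements that are their own inverse."""
    prev, cur = 1, 1  # I(0), I(1)
    for k in range(2, n + 1):
        prev, cur = cur, cur + (k - 1) * prev
    return cur


all_sboxes_log2 = math.log2(math.factorial(16))       # all 4-bit bijective S-boxes
involutions_log2 = math.log2(count_involutions(16))   # involutory ones only
```

The values come out at roughly 44.25 and 25.46, consistent with the rounded exponents 44.3 and 25.5 in the text.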

Compared to the S-box of Midori-128, the depth and area can be reduced to 3.5 and 10.7 GE from 4 and 12 GE, respectively, when synthesized with the standard cell library of the STM 90nm CMOS logic process (as shown in Table 7) with area optimization. The table also shows detailed comparisons with the S-boxes of the QARMA and PRINCE families when the circuit is optimized with respect to area as well as delay. The S-box of Orthros performs well when optimized for either metric. Note that $\sigma_0$ does not have the full diffusion property.

Key Scheduling Function
To minimize the hardware cost, key scheduling functions of Orthros are realized by only bit permutations, whose hardware overhead, such as area and delay, is essentially free. We use a class of bit permutations that satisfy both Condition 1 and 2 in the key scheduling functions of Branch1 and Branch2 as shown in Table 2, although it cannot guarantee the full diffusion property as there is no S-box and Matrix in the key scheduling functions. One reason to introduce two different key scheduling functions for each branch is to increase the hardness of the key-recovery attack, as will be discussed in the next section.

Differential/Linear Attack
To evaluate the resistance against differential and linear attacks, one way is to obtain the lower bound of the number of differentially and linearly active S-boxes in each round, which can be efficiently computed with a MILP-based method [MWGP11]. In the following, we present lower bounds of the number of differentially and linearly active S-boxes for Branch1, Branch2 and the whole of Orthros. Since the maximal differential and linear probability of the S-box is $2^{-2}$, 64 active S-boxes are sufficient to guarantee security against differential and linear attacks, as they give $2^{-2 \times 64} = 2^{-128}$ as an estimate of a differential probability. In our evaluation, we only consider the single-key setting.
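The threshold of 64 active S-boxes follows from a one-line calculation, written out below. The symbol $a$ for the number of active S-boxes is introduced only for this illustration, and the bound treats the S-box transitions along a trail as independent, which is the usual heuristic assumption rather than something proved here.

```latex
\begin{align*}
\Pr[\text{trail with } a \text{ active S-boxes}] &\le \left(2^{-2}\right)^{a} = 2^{-2a}, \\
2^{-2a} \le 2^{-128} &\iff a \ge 64, \\
a = 64 &\;\Longrightarrow\; 2^{-2 \times 64} = 2^{-128}.
\end{align*}
```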

As discussed in Section 3.2, we observed that it is computationally infeasible to obtain a lower bound for more than 5 rounds starting from the first round of Branch1, Branch2 and Orthros, even with a computer equipped with 48 cores and 256 GB RAM. The search space of 128-bit bitwise differential and linear trails for the first 4 rounds is huge. On the other hand, for the last 8 rounds of Branch1 and Branch2 starting from the 5th round, where the nibble permutation is adopted, we can obtain tight lower bounds of the number of active S-boxes for the nibble-wise differential and linear trails. In addition, we can obtain tight lower bounds of the number of active S-boxes of 5 rounds of Orthros starting from the 5th round, i.e., 5 to 9 rounds.
In our evaluation, each of Branch1, Branch2 and Orthros is first divided into two parts, i.e., the first 4 rounds and the remaining 8 rounds. We compute a lower bound of the number of active S-boxes for each part. The lower bound of Orthros is obtained by the sum of those of Branch1 and Branch2.
The corresponding results are displayed in Table 8. Table 8 shows that there are at least 68 active S-boxes in 5 rounds of Orthros starting from the 5th round, i.e., rounds 5 to 9. In addition, the last 10 rounds of Branch1 and Branch2, comprising 2 bit-permutation rounds and 8 nibble-permutation rounds, i.e., rounds 3 to 12, have at least 64 (= 4 + 60) active S-boxes. Although we do not claim any security for Branch1 and Branch2 as full-fledged block ciphers, each has a sufficient number of active S-boxes in the full 12 rounds. It should be emphasized that the lower bounds for Orthros in Table 8 are not tight, i.e., the full rounds of Orthros actually include more active S-boxes. This is because the number of active S-boxes of Orthros after 10 rounds is computed as the sum of those of Branch1 and Branch2. Besides, those of the first 4 rounds and the last 8 rounds are obtained independently. Thus, we expect that the full-round Orthros can resist differential and linear attacks.

Impossible Differential Attack
Resistance to the impossible differential attack can be estimated from the number of rounds required for full diffusion. In the forward direction, both Branch1 and Branch2 require 2.5 rounds for full diffusion, while they require 5 rounds in the backward direction. Consequently, we expect that there is no probability-one impossible differential characteristic over 8 rounds of either Branch1 or Branch2. Since Orthros takes the sum of the outputs of Branch1 and Branch2, we believe that the number of rounds of an impossible differential characteristic for Orthros is much lower than that of Branch1 and Branch2.
To obtain actual impossible differential characteristics, we utilize the MILP-aided automatic search tool proposed by Sasaki and Todo [ST17]. Taking the DDT (differential distribution table) of the S-box into consideration, we searched for bit-wise impossible differential characteristics of Orthros that have one active bit in both the plaintext and the ciphertext. The details of modeling S-boxes are given in Appendix D.2. As a result, we found 3/5/5-round impossible differential characteristics of Orthros, Branch1 and Branch2, respectively, as shown in Appendix D.3. Thus, we expect that the full-round Orthros is secure against impossible differential attacks.

Integral Attack
We present integral distinguishers on round-reduced Orthros. Since the division property [Tod15, TM16] was proposed, it has become an efficient tool to evaluate the resistance against integral attacks. Moreover, with the development of the MILP model for the bit-based division property [XZBL16], the attacker now only needs to focus on modeling the propagation of the division property.
The round function of Orthros consists of a nonlinear layer (S-box), a bit/nibble permutation layer, another linear layer (MatrixMul) and the constant/key addition. To model the propagation of the division property through each component, we only need to consider the S-box and the binary matrix used in MatrixMul. The bit/nibble permutation only influences the coordinates of the variables used in the MILP model. The modeling of the S-box and the binary matrix is described in Appendix D.4. Based on our model, the longest integral distinguisher covers at most 7 rounds with 127 active bits in the input. For example, when the most significant bit of the plaintext is constant and the remaining 127 bits take all possible $2^{127}$ values, all output bits of 7-round Orthros are balanced.
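To make the notion of a balanced output concrete, the following Python sketch checks the zero-sum property on a toy function. It is an illustration only: the helper `xor_sum`, the 4-bit map `toy_perm` and the chosen bit positions are assumptions introduced here, and the code is unrelated to the actual Orthros round function or to the MILP-based division property search.

```python
def xor_sum(f, active_bits, constant):
    """XOR f(x) over all x whose active bits take every value; other bits stay fixed."""
    mask = 0
    for pos in active_bits:
        mask |= 1 << pos
    base = constant & ~mask          # clear the active positions in the constant part
    total = 0
    for t in range(1 << len(active_bits)):
        x = base
        for j, pos in enumerate(active_bits):
            if (t >> j) & 1:
                x |= 1 << pos
        total ^= f(x)
    return total


# Hypothetical 4-bit bijection (x -> 5x + 3 mod 16), used purely as a toy example.
toy_perm = [(5 * x + 3) % 16 for x in range(16)]

# All 4 input bits active: a bijection hits every 4-bit value once, so every output bit is balanced.
all_active = xor_sum(lambda x: toy_perm[x], [0, 1, 2, 3], 0)

# A single active bit through the identity map is not balanced.
one_active = xor_sum(lambda x: x, [0], 0)
```

An output bit is balanced exactly when the corresponding bit of the returned sum is zero. For 127 active bits the set cannot be enumerated, which is why the distinguishers in this section are derived with the division property rather than by direct evaluation.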

Remark.
Although there are several 7-round integral distinguishers, it is difficult to mount a key-recovery attack on 8 rounds of Orthros. This is different from usual key-recovery attacks on block ciphers, where the attacker is able to add several rounds after the integral distinguisher and guess partial key bits to decrypt the ciphertext. It is quite costly to guess the key bits and reverse the ciphertext for Orthros since the final output is the sum of the outputs of two branches, i.e., the attacker further needs to guess the output of the other branch.

Invariant Subspace Attack
Beierle et al. [BCLR17] showed that an invariant subspace attack can be mounted on a block cipher if one finds a non-trivial invariant for the substitution layer whose linear space is invariant under the linear layer matrix $L$ and contains all the differences between the round keys. For block ciphers without a dedicated key schedule function, say when the $i$-th round key $rk_i = k \oplus rc_i$ is simply the xor of the master key and the $i$-th round constant, the differences between round keys are the differences between round constants. If $D$ denotes the set of all round constant differences, the authors of [BCLR17] computed $W_L(D)$, the smallest $L$-invariant subspace containing $D$. If $\dim(W_L(D)) \ge n - 1$, where $n$ is the block size of the cipher in bits, then the authors showed that there is no non-trivial invariant of the substitution layer, provided that the S-box is well designed and does not have any linear component.
If this condition is not satisfied, one must further investigate the properties of the substitution layer. The authors then showed that, for every subspace $Z$ of the 0-linear space of the invariant of the substitution layer $S$, the invariant $g$ takes the same value on each coset of $Z$ in $\mathbb{F}_2^n$ and also on each element of the set $S(Z)$. To show that $g$ is trivial, the authors computed the S-box layer at some points in $Z$ and hoped that all cosets would be hit when evaluating $S$ at a few points in $Z$. If $g$ takes the same value on all the corresponding cosets, we can conclude that $g$ must be a constant function and thus trivial. This can be done if we take $Z = W_L(D)$ and if $\dim(W_L(D))$ is close to $n$, since one would only need to hit $2^{n - \dim(W_L(D))}$ cosets.
Since our construction uses a key schedule function for both branches, we cannot directly construct the set $D$ as the set of round constant differences. Moreover, we use four different linear layers (two in each branch) in our construction. However, for any randomly chosen value of the secret key $k$ one can construct the sets $D_1, D_2, D_3, D_4$, one for the differences of the round keys used with each of the four linear layers $L_1, L_2, L_3, L_4$, and then try to compute $\dim(W_{L_i}(D_i))$ for each $i$. We found that the linear matrices composed of a bit permutation and a matrix multiplication (used in the first to 4th rounds of each branch; call them $L_1$ (left) and $L_3$ (right)) have extremely high multiplicative orders, around $2^{48}$ to $2^{60}$, and thus it is not directly possible to find $W_L(D)$ for these matrices. Thus we limited ourselves to finding $W_L(D) = \sum_{c \in D} \langle L^i(c),\ i \le 10000 \rangle$ for $L_1$, $L_2$ and $L_3$ (where $\langle \cdot \rangle$ denotes the subspace generated by the constituent vectors). We did an experiment with 1000 randomly chosen keys and computed the spaces $W_{L_i}(D_i)$. The dimension of these spaces is almost always more than 127 for $L_1$, $L_3$ and always more than 123 for $L_2$, $L_4$. Whenever the dimension of these spaces was less than 127, we ran Algorithm 1 of [BCLR17] to see if all cosets are hit when evaluating the S-layer. For all choices of the random key, we find that all the cosets are always hit, and thus we conclude that it is highly unlikely that an invariant subspace attack can be mounted on our construction.
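The computation of $\dim(W_L(D))$ can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the experiments: vectors over $\mathbb{F}_2$ are stored as integers, the matrix $L$ is given by the list `cols` of its column images, and the function names and the 8-bit cyclic-shift example are introduced here only for demonstration. Unlike the truncated search with $i \le 10000$, the sketch iterates until the span is closed under $L$, which is feasible only for small dimensions.

```python
def apply_linear(cols, v):
    """Multiply the GF(2) matrix with column images `cols` by the bit-vector v."""
    out, i = 0, 0
    while v:
        if v & 1:
            out ^= cols[i]
        v >>= 1
        i += 1
    return out


def reduce_vec(basis, v):
    """Reduce v against an echelon basis indexed by leading-bit position."""
    while v:
        lead = v.bit_length() - 1
        if lead not in basis:
            return v
        v ^= basis[lead]
    return 0


def invariant_closure_dim(cols, diffs):
    """Dimension of the smallest L-invariant subspace containing all vectors in diffs."""
    basis = {}
    stack = list(diffs)
    while stack:
        v = stack.pop()
        r = reduce_vec(basis, v)
        if r:                                    # v enlarges the span
            basis[r.bit_length() - 1] = r
            stack.append(apply_linear(cols, v))  # its image under L must be covered too
    return len(basis)


# Toy example: L is a cyclic shift on 8 bits (bit i is sent to bit i+1 mod 8).
shift_cols = [1 << ((i + 1) % 8) for i in range(8)]
dim_single_bit = invariant_closure_dim(shift_cols, [0b00000001])
dim_two_bits = invariant_closure_dim(shift_cols, [0b00000011])
```

In the toy example, a single-bit difference generates the whole space (dimension $n = 8$), whereas the two-bit difference only generates the even-weight vectors (dimension $n - 1 = 7$), the boundary case of the criterion $\dim(W_L(D)) \ge n - 1$.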

Other Attacks
We have evaluated the security of Orthros against meet-in-the-middle [SA09], yoyo [RBH17], exchange [BR19] and mixture-differential [Gra18] attacks. The details are described in Appendix E. Moreover, in Appendix F we discuss the difficulty of key recovery from statistical distinguishers for Orthros, as well as a key-recovery attack framework for the initial design of Orthros.

Hardware Evaluation
Since our target construction is a low-latency PRF, the most useful hardware evaluation of the design is a fully unrolled circuit that optimizes the signal delay from the input to the output ports. Such a circuit can evaluate the PRF in a single clock cycle, and naturally the clock frequency can be increased until the clock period is just above the total critical path of the circuit, affording a maximum throughput of (block size)/(critical path) bits per second. To perform a fair evaluation we compare our construction with two other low-latency primitives that provide at least a 128-bit block and 128-bit security. The first is Midori-128 and the second is QARMA9-128-σ0 (in [Ava17], QARMA with 9 forward and backward rounds was recommended for applications targeting 192-bit security). QARMA9-128-σ0 is a particular instantiation of QARMA with 9 forward and backward rounds and a low-delay S-box σ0. For an added comparison, we also include the 64-bit block cipher PRINCE in our results. In addition, we benchmark some permutation-based constructions that can be used as a PRF. For example, the Kangaroo12-XOF [BDP + 18], which is based on the 12-round Keccak-f[1600] permutation, can be used as a PRF: one could absorb the key and plaintext into the permutation state and extract 128 bits from the resulting XOF. Let us call this construction Kangaroo12-PRF[1600]. Since any design based on a 1600-bit state would naturally be hardware-intensive, we also consider a lightweight version of the above construction, Kangaroo12-PRF[400], based on the 12-round Keccak-f[400] permutation. We can also use the Subterranean-Deck function [DMMR20] to extract a 128-bit MAC from a 128-bit key and message. We also benchmark this design, which we call Subterranean-PRF. It is even more lightweight, as it has only a 257-bit state.
We found that across all libraries, Orthros outperforms even PRINCE (see Tables 9, 17, 18, 19) when it comes to the absolute signal delay between the input/output ports. We remark that PRINCE is a 64-bit block cipher.
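As a small worked example of the throughput bound, the ratio of block size to critical path can be evaluated directly. The helper name `throughput_gbps` is an assumption introduced here; the latencies are the approximate STM 90nm figures quoted in the abstract (around 2.4 ns for Orthros and 2.56 ns for PRINCE), so the resulting numbers are indicative only.

```python
def throughput_gbps(block_bits, critical_path_ns):
    """Maximum throughput of a one-cycle unrolled circuit, in Gbit/s (bits per nanosecond)."""
    return block_bits / critical_path_ns


# Approximate figures from the abstract (STM 90nm library).
orthros_gbps = throughput_gbps(128, 2.4)   # 128-bit block, about 2.4 ns
prince_gbps = throughput_gbps(64, 2.56)    # 64-bit block, 2.56 ns
```

Under these figures the bound is roughly 53 Gbit/s for Orthros and 25 Gbit/s for PRINCE, which shows how the larger block amplifies the latency advantage when throughput is considered.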
For a fair evaluation we adhered to the following design flow for all the ciphers:
1. The RTL source codes for the circuit of the ciphers were first written in the Verilog HDL, and a functional simulation was done using the Modelsim software to ensure correctness.
2. The RTL codes were synthesized by the Synopsys Design Vision circuit compiler, with the compiler command set to compile_ultra. No other optimizations are done at this stage. For this process we used the standard cell libraries of the following CMOS logic processes: a) STM 90nm, b) TSMC 90nm, c) Nangate 45nm and d) Nangate 15nm.
3. A timing simulation was done on the synthesized netlist to confirm the correctness of the design, by comparing the output of the timing simulation with known test vectors.
4. The switching activity of each gate of the circuit was collected while running post-synthesis simulation. The average power was obtained using Synopsys Power Compiler, using the back-annotated switching activity.
5. Step 2 outputs the critical path of the circuit. We repeat steps 2-4 (for each of the libraries) but this time by asking the circuit compiler to constrain the total signal delay between the input/output ports to some value less than the critical path computed in step 2.
6. We repeat the above processes, with progressively lower values of total signal delay, until the circuit compiler is unable to construct a circuit with the given delay. We stop the flow at this point.

All results have been tabulated in Tables 9, 17, 18, 19 (due to space constraints, Tables 17, 18, 19 showing results for the Nangate 15nm, TSMC 90nm and Nangate 45nm processes are moved to Appendix G).
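The iterative part of the flow (steps 5 and 6) amounts to a simple loop that tightens the delay constraint until synthesis fails. The Python sketch below illustrates only this control flow. The `synthesize` callback is a hypothetical stand-in for a synthesis run, and the mock area model and all numbers in it are invented placeholders, not an interface of any real tool and not measured results.

```python
from typing import Callable, List, Optional, Tuple


def sweep_delay(synthesize: Callable[[int], Optional[float]],
                start_delay_ps: int, step_ps: int) -> List[Tuple[int, float]]:
    """Tighten the delay constraint step by step until synthesis fails."""
    results = []
    target = start_delay_ps
    while target > 0:
        area = synthesize(target)   # None means the constraint could not be met
        if area is None:
            break
        results.append((target, area))
        target -= step_ps
    return results


def make_mock_synthesizer(min_delay_ps: int, base_area: float):
    """Hypothetical stand-in for a synthesis run; the area model is invented."""
    def synthesize(target_ps: int) -> Optional[float]:
        if target_ps < min_delay_ps:
            return None
        # Invented model: area grows as the constraint approaches the minimum delay.
        return base_area * (1.0 + min_delay_ps / target_ps)
    return synthesize


# Placeholder parameters: sweep from 3000 ps downwards in 200 ps steps.
points = sweep_delay(make_mock_synthesizer(2400, 1000.0), 3000, 200)
```

Each recorded pair corresponds to one row of an area/delay table, and the last successful target is the minimum latency reported for that library.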

Why area increases with decrease in latency:
A cell library typically has several drive strengths of cells that implement a given logic function. These drive strengths correspond to the capacitive load that a cell can drive without excessive delay and with acceptable signal characteristics. Thus when we force the circuit compiler to construct a circuit with lower delay, it starts introducing gates of higher drive strength, which typically occupy more area while offering the same functionality. For example, in the TSMC 90nm library, 2-input xor gates of drive strength 1, 2, 4 occupy around 2.5, 3, 5 GE respectively. Thus it is natural for the area of a circuit to progressively increase as we constrain the circuit compiler to construct circuits of increasingly lower delay, as shown in Tables 9, 17, 18, 19. For better illustration, we provide area vs. delay plots (see Fig. 7). The plots tell us that not only does Orthros perform around 40% better across all standard cell libraries in terms of the absolute delay value, it also outperforms QARMA9-128-σ0 and Midori-128 in achieving a) lower area and power consumption for a given delay budget and b) lower or competitive delay for a given area/power budget.

Conclusions
We have presented a new low-latency PRF with a 128-bit block, dubbed Orthros. The design is essentially a sum of keyed permutations, which has been studied in the context of provable security. We found that this design is well suited to a low-latency cryptographic primitive; this is intuitive, but to our knowledge it has never been seriously considered before. We made it real by thoroughly revising the current state-of-the-art low-latency, lightweight building blocks, together with an extensive security analysis and comprehensive hardware benchmarks.
For further directions, it would be interesting to extend our design, say by having more branches (with even simpler round functions or fewer rounds), to reduce latency even further. Software performance and related-key/side-channel security would also be interesting topics.

Table 10 shows the round constants of Branch1 and Branch2.

Figure 8: (Left) The toy cipher using a single branch and (Right) that using double branches, where WK and RK are the whitening key and round key, respectively.

B Toy Ciphers
The unique feature of our design is the use of two parallel branches (effectively block ciphers). In order to investigate the generic security of this design, we introduce two toy ciphers using a single branch and double branches, as shown in Fig. 8. We focus on the maximal differential probability (MDP) [BS91] and linear bias [Mat94] as they are two of the most fundamental security metrics. For the underlying branches, we consider both SPN and Feistel structures.
Experiments for an SPN-based toy cipher. First, we consider the case of an SPN with a 16-bit internal state and 16-bit round keys. The basic design of this SPN-based toy cipher is similar to Midori. The round function is composed of the following operations: S-box, Shuffle, Mix and AK. The state is organized as a 4 × 4 two-dimensional Boolean array A. The (4j + i)-th bit of the internal state is placed at A[i][j] (0 ≤ i, j ≤ 3). For the AK operation, a random 16-bit round key is XORed with the 16-bit internal state. Since the construction of the underlying block cipher used in Orthros is somewhat similar to that of Midori, we adopt the same Shuffle and Mix operations as in Midori in order to construct a 16-bit toy cipher.
To compute the MDP over a certain number of rounds for each construction, we first generate random values for the round keys. Then, the whole block cipher can be viewed as a large 16-bit S-box. By exhausting all possible $2^{16} \times (2^{16} - 1)/2$ input pairs, we count the number of occurrences of each possible output difference and obtain the maximum frequency of an output difference, which we denote by $CNT_0$. In this way, the MDP is calculated as $CNT_0/2^{16}$. With this method, we carried out 100 experiments, computed the MDP for each experiment, and took the maximum over all experiments. The results are displayed in Table 11. Table 11 shows that the maximal value becomes stable after about 4 rounds in the construction using double branches. For the construction using a single branch, it becomes stable in about 5 rounds.
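The counting procedure can be sketched in Python as follows. This is a minimal illustration assuming the cipher under a fixed key is available as a lookup table `perm` of size $2^n$; the function names and the 8-bit random permutation are assumptions introduced here, and the toy ciphers themselves are not reproduced. The sketch follows the normalization of the text (unordered pairs, divided by $2^n$), so a differential that holds for every pair yields $2^{-1}$.

```python
from collections import Counter
import random


def max_pair_count(perm):
    """CNT_0: largest number of unordered input pairs sharing one (input, output) difference."""
    size = len(perm)
    best = 0
    for in_diff in range(1, size):
        counts = Counter()
        for x in range(size):
            y = x ^ in_diff
            if x < y:                      # count each unordered pair once
                counts[perm[x] ^ perm[y]] += 1
        best = max(best, max(counts.values()))
    return best


def mdp(perm):
    """MDP normalized as in the text: CNT_0 divided by the number of inputs."""
    return max_pair_count(perm) / len(perm)


# Toy usage: a random 8-bit permutation stands in for the keyed toy cipher.
rng = random.Random(2021)
toy = list(range(256))
rng.shuffle(toy)
toy_mdp = mdp(toy)

# Sanity check: the identity map has a deterministic differential.
identity_mdp = mdp(list(range(16)))
```

For a 16-bit table the same loop visits about $2^{31}$ pairs, which is feasible but slow in pure Python; the 8-bit size is chosen only to keep the example fast.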
Table 11: The maximal differential probability of SPN-based toy ciphers.

Rounds   Max Prob. (Single Branch)   Max Prob. (Double Branches)
1        $2^{-3}$                    $2^{-1}$
2        $2^{-8}$                    $2^{-6}$
3        $2^{-8}$                    $2^{-5.7}$
4        $2^{-11.3}$                 $2^{-12.3}$
5        $2^{-12.5}$                 $2^{-12.5}$
6        $2^{-12.5}$                 $2^{-12.5}$
7        $2^{-12.5}$                 $2^{-12.5}$
8        $2^{-12.5}$                 $2^{-12.5}$

Experiments for a Feistel-based toy cipher. For the case of a Feistel-based cipher, we consider 4-GFS (generalized Feistel structure with 4 sub-blocks), as shown in Fig. 9. The round function consists of two parallel 4-bit S-boxes with random round keys, where the S-box is the same as that used in Orthros. We carried out 100 experiments and computed the maximum of the MDP over all experiments, for both the constructions using a single branch and double branches. The corresponding results are displayed in Table 12. It shows that the maximal value becomes stable after about 7 rounds in the construction using double branches. For the construction using a single branch, it becomes stable in about 10 rounds.

Figure 9: 4-GFS. Dotted lines denote (random) round keys.

Table 12 (Feistel-based toy ciphers, rounds 6 to 12; columns as in Table 11).

Rounds   Max Prob. (Single Branch)   Max Prob. (Double Branches)
6        $2^{-9.6}$                  $2^{-11.1}$
7        $2^{-9.4}$                  $2^{-12.5}$
8        $2^{-11}$                   $2^{-12.5}$
9        $2^{-12}$                   $2^{-12.5}$
10       $2^{-12.5}$                 $2^{-12.5}$
11       $2^{-12.5}$                 $2^{-12.5}$
12       $2^{-12.5}$                 $2^{-12.5}$

Experiments on linear masks. Similar experiments have also been performed to evaluate the maximal linear bias for the toy ciphers. Due to the high time complexity of accurately computing the maximal linear bias, we turn to calculating it in a probabilistic way. Specifically, we randomly choose some input and output masks and select the maximal linear bias among them. For the SPN-based toy cipher, it is found that the maximal linear bias ($2^{-6.4}$) becomes stable after 4 rounds if using a single branch, while it becomes stable in 3 rounds for double branches. For the GFS-based toy cipher, the maximal linear bias ($2^{-6.4}$) becomes stable in 9 rounds and 6 rounds for a single branch and double branches, respectively.
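The probabilistic estimate of the maximal linear bias can be sketched as follows. This is an illustrative Python sketch, assuming the keyed cipher is given as a lookup table and that the bias of a mask pair is measured as $|p - 1/2|$, where $p$ is the fraction of inputs on which the two masked parities agree. The function names, the number of trials and the random toy permutation are assumptions made here; the normalization used for the reported values may differ.

```python
import random


def parity(x):
    """Parity of the set bits of x."""
    return bin(x).count("1") & 1


def linear_bias(perm, in_mask, out_mask):
    """|Pr[in_mask . x == out_mask . perm(x)] - 1/2| over all inputs x."""
    size = len(perm)
    agree = sum(1 for x in range(size)
                if parity(x & in_mask) == parity(perm[x] & out_mask))
    return abs(agree / size - 0.5)


def sampled_max_bias(perm, trials, rng):
    """Largest bias found among randomly chosen non-zero mask pairs."""
    size = len(perm)
    best = 0.0
    for _ in range(trials):
        in_mask = rng.randrange(1, size)
        out_mask = rng.randrange(1, size)
        best = max(best, linear_bias(perm, in_mask, out_mask))
    return best


# Toy usage: a random 8-bit permutation stands in for the keyed toy cipher.
rng = random.Random(7)
toy = list(range(256))
rng.shuffle(toy)
estimate = sampled_max_bias(toy, 200, rng)
```

Because only a sample of mask pairs is examined, the result is a lower bound on the true maximal bias, which is the sense in which the estimate is probabilistic.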

Summary.
In our experiments, both the maximal differential probability and the linear bias of the double branches reach a stable value in a smaller number of rounds than those of the single branch. Of course, the scale of the experiments is limited and more theoretical support is desirable. However, these results suggest that the double-branch structure enhances the security of the single branch. We also emphasize that the security of Orthros is never ensured on the basis of such a simple simulation. Instead, a comprehensive study is performed.

C Detailed Explanations for Conditions 1 and 2

C.1 Condition 1
Fig. 10 shows the transition of the state after applying $P_{bk1}$, which satisfies Condition 1. The state is represented as a 16 × 8 bit array, where $S_0, S_1, S_2, S_3$ are the nibbles forming the first column of the state. Let us focus on the leftmost column. After applying a bit permutation satisfying Condition 1, Fig. 10 shows that, for 0 ≤ i ≤ 3, the 4 bits of $S_i$ are mapped to different columns. The same applies to the remaining 7 columns.

C.2 Proof of Condition 2
For each active bit in the nibble cell $S_i$ (0 ≤ i ≤ 31) in the input, after applying the S-box and the bit permutation satisfying Condition 1, there exist 4 different groups of inputs to the binary matrix, each of which contains exactly one active nibble, as shown in Fig. 11. Therefore, there are 12 active nibbles after MatrixMul in the first round, which activate 12 nibbles located in the same positions in the second round after the S-box operation, as depicted in Fig. 12. Hence, there are 12 active nibbles (48 bits in total) in the state after the S-box (with full-diffusion property) operation in the second round, independent of the position of the one active bit in the input to the first round. After applying a permutation satisfying Condition 2 to these 48 bits, in each column of the 4 × 8 array there exist at least 2 nibble cells containing bits coming from these 48 bits, as shown in Fig. 13 at bit level and in Fig. 14 at nibble level. In other words, after applying the bit permutation satisfying Condition 2 in the second round, in each column of the nibble array there are at least 2 nibbles dependent on the one active bit in the input to the first round. When the MatrixMul operation is further applied to each column of the 4 × 8 nibble array, the values of all four nibbles in each column therefore depend on the one active bit in the input to the first round. However, it cannot be guaranteed that the value of each bit depends on the one active bit. Thus, only after further applying the S-box with full-diffusion property in the third round do all 128 bits become dependent on the one active bit. This means that full diffusion is achieved in 2.5 rounds.
Figs. 11, 12, 13 and 14 give an example illustrating the 2.5-round diffusion.

D.1 DDT of S-box
The aforementioned DDT of our S-box is shown in Table 13, where $d_{in}$ and $d_{out}$ denote the input and output difference of the 4-bit S-box, respectively.

D.2 Modeling S-box
Based on the method of [SHW + 14], we derive the following 37 linear inequalities from Table 13,

D.4 Modeling for Division Property
Modeling S-box. Similar to Section 4.2, one could build a table to describe the propagation of the division property through S-box, as shown in Table 14.
In this table, $u$ and $v$ denote the input and output division property of the S-box, respectively. The entry at $(u, v)$ is * when the propagation $u \to v$ is possible. Otherwise, the propagation is impossible. Based on the method proposed by [SHW + 14], such a table is equivalent to the linear inequalities shown below, where $u = (u_0, u_1, u_2, u_3)$ and $v = (v_0, v_1, v_2, v_3)$.

E.2 Yoyo and Mixture-Differential Attacks
The yoyo attack was first introduced by Biham et al. [BBD + 99]. Recently, it has been applied to the cryptanalysis of AES by Rønjom et al. [RBH17], where generic attacks on up to 3 rounds of SPNs have been discussed. Since 2-round AES can be viewed as a 1-round SPN with the concept of the super S-box, distinguishing attacks on up to 6 rounds of AES are derived in [RBH17]. However, there is one major step in the yoyo attack, that is, the attacker needs to make a decryption query. For the design of Orthros, since the final output is the xor sum of the outputs of Branch1 and Branch2, it is infeasible to make a decryption query. In addition, due to the fast diffusion of the bit permutation and the fact that each branch adopts a different bit permutation, it is quite difficult to construct an efficient super S-box for Orthros in the first four rounds. For these reasons, we believe that Orthros is resistant against such an attack.
At Asiacrypt 2019, a different view of the yoyo attack on AES, called the exchange attack, was proposed by Bardeh and Rønjom [BR19]. It does not require decryption queries and can reach up to 6 rounds of AES. However, due to the similarity in the underlying idea between the yoyo attack and the exchange attack, we believe that resistance against the yoyo attack implies resistance against the exchange attack.
The mixture differential introduced by Grassi [Gra18] is an efficient tool to analyze reduced-round AES, as shown by its contribution to the recent progress in key-recovery attacks on 5-round AES [BDK + 18, DKRS20]. An important factor which makes the mixture differential efficient is that AES adopts a word-wise permutation. Due to the effect of the bit permutation in the first four rounds, we are not able to find a useful mixture differential for Orthros.

F Difficulty of Key-Recovery Attacks
As stated several times, the unique feature of Orthros (as a cryptographic primitive) is that it takes the sum of two branch outputs. To mount a key-recovery attack with a statistical distinguisher for block ciphers, such as a differential/linear/integral distinguisher, it is common to append a few rounds after the rounds covered by the distinguisher, and to guess partial key bits by partially decrypting the ciphertext and verifying whether the distinguishing property holds. However, such a common strategy is quite hard to apply to Orthros, since the attacker additionally needs to guess the output of each branch in order to reverse the ciphertext. In addition, the attacker must construct two distinguishers for two different branches with the same plaintext set simultaneously in order to append a few rounds after the distinguishable rounds. Even if it is possible to construct a distinguisher for one block cipher with an advanced cryptanalysis method as discussed above, it would be challenging to construct two different distinguishers for two different block ciphers for the same plaintext set simultaneously. In such a situation, we think that generally the most promising direction is to find integral distinguishers. This will be discussed later.
Another attack strategy is to prepend some rounds before the distinguishable rounds. However, this implies that there exists a distinguisher for each branch simultaneously. When extending two distinguishers backwards to the plaintext, the attacker can derive which key bits should be guessed in order to compute the desired value of the intermediate internal state of both branches. Since each branch adopts a different linear layer in its round function, and the key schedules of the two branches also differ, many key bits have to be guessed. Moreover, the whitening keys in the two branches are different as well, which further increases the complexity of prepending rounds before a distinguisher. For better understanding, we present a framework to recover the secret key by extending an integral distinguisher backwards.

A Framework for Recovering the Secret Key. This framework was once applied to a preliminary version of Orthros. Therefore, we omit the details of the design and explain the high-level idea. First, we denote the states after the S-box layer in the first round of Branch1 and Branch2 by $XL_{0.5}$ and $XR_{0.5}$, respectively. Suppose there is a set of bit positions denoted by PSet ⊆ {i | 0 ≤ i ≤ 127} and the size of PSet is PSize. In addition, let us denote the final output after $r$ rounds of (an old version of) Orthros by $C_r$, which is the sum of the outputs of (old versions of) Branch1 and Branch2. Suppose there is an integral distinguisher such that the sum of $C_r$ is zero over a suitable set of values of $XL_{0.5}$ and $XR_{0.5}$, i.e., the set of values whose bits located at the positions belonging to PSet take all the possible $2^{PSize}$ values and whose remaining (128 − PSize) bits take constant values.
To mount a key-recovery attack, the attacker first derives from PSet the active bits in the plaintext. Specifically, if i ∈ PSet, the $(4\lfloor i/4 \rfloor)$-th, $(4\lfloor i/4 \rfloor + 1)$-th, $(4\lfloor i/4 \rfloor + 2)$-th and $(4\lfloor i/4 \rfloor + 3)$-th bits of the plaintext are all active. Let ASize be the number of active bits in the plaintext. The attacker then prepares a plaintext set whose active bits take all possible values and makes encryption queries to the $r$-round Orthros. It is easily seen that the sum of the ciphertexts is zero. The attacker records the corresponding $2^{ASize}$ pairs of plaintext and ciphertext in a table.
Suppose the whitening keys used by Branch1 and Branch2 are the same. In this case, the attacker guesses the $2^{ASize}$ different values of the whitening key bits that are xored with the active bits in the input. For each guess, the attacker partially knows the corresponding $XL_{0.5}$ and $XR_{0.5}$ and can divide the plaintext set into $2^{ASize - PSize}$ different subsets according to the value of the non-active bits of $XL_{0.5}$ and $XR_{0.5}$ via the constructed integral distinguisher. For each subset, the attacker computes the sum of the corresponding ciphertexts. For the correct key, the sum will be zero for all subsets. However, for a wrong key, the sum is zero for a subset with probability $2^{-128}$. Therefore, the attacker can recover the key bits by checking the sum of the ciphertexts for the plaintexts in each subset.
Consider the case when different whitening keys are used for Branch1 and Branch2. In this case, when the attacker guesses the key bits in the left branch to obtain the corresponding $2^{ASize - PSize}$ subsets of the plaintexts, the sum of the ciphertexts for the plaintexts in each subset is not determined even if the guess is correct. This is because the set of $XR_{0.5}$ behaves randomly for each subset of plaintexts obtained according to the guess of the whitening key used in Branch1.
In summary, this framework converts an $r$-round distinguisher into an $r$-round key-recovery attack on Orthros. Therefore, a long distinguisher should be prevented in our design.
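The derivation of the active plaintext bits from PSet is a nibble-wise expansion, which the following Python sketch illustrates. The function name `active_plaintext_bits` and the example positions are assumptions introduced here for illustration.

```python
def active_plaintext_bits(pset):
    """Expand each position i in PSet to the four bits 4*floor(i/4) .. 4*floor(i/4)+3."""
    active = set()
    for i in pset:
        base = 4 * (i // 4)
        active.update(range(base, base + 4))
    return sorted(active)


# Example: positions 0 and 5 lie in the nibbles covering bits 0-3 and 4-7.
bits = active_plaintext_bits({0, 5})
asize = len(bits)   # ASize, the number of active plaintext bits
```

Since every position in PSet activates its whole nibble, ASize is always a multiple of 4 and satisfies PSize ≤ ASize ≤ 4 · PSize.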