The design of Xoodoo and Xoofff

This paper presents Xoodoo, a 48-byte cryptographic permutation with excellent propagation properties. Its design approach is inspired by Keccak-p, while it is dimensioned like Gimli for efficiency on low-end processors. The structure consists of three planes of 128 bits each, which interact per 3-bit columns through mixing and nonlinear operations, and which otherwise move as three independent rigid objects. We analyze its differential and linear propagation properties and, in particular, prove lower bounds on the weight of trails using the tree-search-based technique of Mella et al. (ToSC 2017). Xoodoo's primary target application is in the Farfalle construction that we instantiate for the doubly-extendable cryptographic keyed (or deck) function Xoofff. Combining a relatively narrow permutation with the parallelism of Farfalle results in very efficient schemes on a wide range of platforms, from low-end devices to high-end processors with vector instructions.


Introduction
Designing a symmetric cryptographic primitive involves careful trade-offs between performance and security. For the former, an interesting challenge is to achieve excellent performance on a wide range of targets, from the low-end processors used in embedded devices to high-end server processors. This is especially useful when such a mixture of devices has to interact. For the latter, while the security of a primitive cannot be measured and relies on public scrutiny by skilled cryptanalysts, a good design typically starts with a round function that mixes bits and frustrates the propagation of differences and linear correlations as quickly as possible.
A way to achieve performance on a wide range of targets is to combine a high level of parallelism with a relatively small building block. On a low-end target, the computation is done serially with a small footprint, while a high-end processor can fully exploit its capabilities by evaluating multiple instances of the building block in parallel. As a concrete example, the Farfalle construction is a permutation-based mode of use that allows a very high level of parallelism [BDH+17]. Instantiated with a permutation of relatively small size, it thus offers such potential.
In [BDH+17], the authors define Kravatte by instantiating Farfalle with the 200-byte-wide permutation Keccak-p[1600] with 6 rounds. In general, Kravatte is very fast on a wide range of platforms, but there are some exceptions due to the large width of the permutation. First, on low-end processors the computation is slowed down by swapping the permutation state in and out of registers. This effect is independent of the input and output lengths. Second, Kravatte with short input and output has a relatively large overhead per byte: the cost of computing, say, a MAC over a message of 20 bytes is the same as for a MAC over a 199-byte message. It therefore makes sense to consider instantiating Farfalle with a narrower permutation.
Xoodoo is an iterated permutation inspired by Keccak-p and Gimli [BKL+17, BDPA11b], with a novel structure consisting of three planes of 4 × 32 bits each. The three planes interact per 3-bit columns through a column parity mixer [SD18] and a degree-2 nonlinear operation, while they move as three independent rigid objects for dispersion.
We show that Xoodoo scores very well with respect to avalanche metrics and has excellent differential propagation and correlation properties. In particular, we prove lower bounds on the weight of trails using the tree-search-based technique of Mella et al. [MDA17], although using finer-grained units than what was proposed for Keccak-p. The increased symmetry, the two dispersion layers and the involutive nature of the nonlinear layer make Xoodoo easier to analyze than Keccak-p. In particular, this allows us to obtain better trail bounds.
Finally, this paper analyzes Xoofff, its rolling functions and the properties of Xoodoo in the light of what is needed for Xoofff to be secure. From a user perspective, Xoofff is an efficient deck function that can be used for building stream ciphers, MAC functions and full-featured authenticated encryption schemes, as proposed in [BDH+17]. We provide benchmarks on some low-end and high-end processors.

Design philosophy
In this section, we discuss the design philosophy underlying Xoofff, in the more general context of building cryptographic schemes in a modular way.
Typically, one specifies a cryptographic scheme as a mode on top of a primitive, where one can prove the scheme secure on the condition that the primitive is secure (or ideal, or random, ...). We define a primitive as a cryptographic object that cannot be proved secure, but rather one that has the objective of being secure. This objective is expressed as a security claim, and this claim can be used by cryptanalysts to challenge the primitive. In the security proof of the mode, the statements in the security claim are assumed to be true, and so the security of the scheme is conditional on the validity of these statements.
In the case of this paper, what is the mode that is provably secure assuming the primitive is secure, and what is the primitive that must be cryptanalyzed? Here, the primitive is Xoofff and the provably secure modes are modes on top of Xoofff, such as those defined in [DHAK18a].
Xoofff is a primitive but, like many other primitives, it has been built in a modular way. It makes use of the Farfalle construction and uses as building blocks the Xoodoo permutation and two rolling functions, see Figure 1. The idea behind Farfalle is not to build a secure function assuming the underlying building blocks are secure (or ideal or random). Instead, the idea is to use building blocks we know how to design in order to build an efficient cryptographic function that is useful for encryption, authentication and authenticated encryption. Of course, the propagation and algebraic properties of the underlying components are relevant and interesting, but the object to be cryptanalyzed is Xoofff and there is no security claim on the building blocks. In particular, there is no security claim on Xoodoo.
One can compare this with the way block ciphers, MAC functions or stream ciphers have been built. Here are some examples.
• Key-alternating block ciphers repeat a round function interleaved with round key addition [DR02]. The round keys are generated by a key schedule that takes the cipher key as input. No security claims are made on the round function or the key schedule.
• Tweakable block ciphers can be built using the tweakey framework [JNP14]. Also here, there is no security reduction to underlying parts, e.g., the so-called tweakey schedule or the round function.
• The MAC function Pelican-MAC, an application of the Alred framework, is built from AES and a building block similar to Xoodoo: the permutation that consists of 4 unkeyed rounds of AES [DR10, DR14]. This permutation has many structural properties (e.g., symmetry, impossible differentials) not present in a random permutation, but there is no security claim on this building block, only on the function Pelican-MAC itself.
• The Salsa stream cipher can be seen as a permutation, the so-called Salsa core, in a certain mode [Ber08b]. The Salsa core has structural properties due to a very symmetric round function and the absence of round constants. But there is no security claim on this permutation: the construction prevents exploiting the symmetry properties, and the assumed security and target of cryptanalysis is the Salsa stream cipher as a whole.

Outline
In Section 2, we define the Xoodoo permutation and in Section 3 the Xoofff deck function. Optimization techniques and benchmarks are presented in Section 4. The design rationale of the permutation is given in Section 5, while that of the deck function can be found in Section 6. We detail the techniques for searching differential and linear trails in Section 7. Finally, we conclude in Section 8.

Xoodoo specification
Xoodoo is a family of permutations parameterized by its number of rounds n_r and denoted Xoodoo[n_r].
Xoodoo has a classical iterated structure: it iteratively applies a round function to a state. The state consists of 3 equally sized horizontal planes, each one consisting of 4 parallel 32-bit lanes. Equivalently, the state can be seen as a set of 128 columns of 3 bits, arranged in a 4 × 32 array. The planes are indexed by y, with plane y = 0 at the bottom and plane y = 2 at the top. Within a lane, we index bits by z. The lanes within a plane are indexed by x, so the position of a lane in the state is determined by the two coordinates (x, y). The bits of the state are indexed by (x, y, z) and the columns by (x, z). Sheets are the arrays of three lanes on top of each other and are indexed by x. The Xoodoo state is illustrated in Figure 2.
The permutation consists of the iteration of a round function R_i that has 5 steps: a mixing layer θ, a plane shifting ρ_west, the addition of round constants ι, a non-linear layer χ and another plane shifting ρ_east.
We specify Xoodoo in Algorithm 1, completely in terms of operations on planes, and thereby use the notational conventions specified in Table 1. We illustrate the step mappings in a series of figures: the χ operation in Figure 3, the θ operation in Figure 4, and the ρ_east and ρ_west operations in Figure 5.
The round constants C_i are planes with a single non-zero lane at x = 0, denoted c_i. We specify the value of this lane for indices −11 to 0 in Table 2 and refer to Appendix A for the specification of the round constants for any index.
Finally, in many applications the state must be specified as a 384-bit string s with the bits indexed by i. The mapping from the three-dimensional indexing (x, y, z) to i is given by i = z + 32(x + 4y).
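As a quick sanity check, this coordinate-to-index mapping and its inverse can be sketched in a few lines of Python (the function names are ours, chosen for illustration):

```python
def bit_index(x, y, z):
    """Flat index of state bit (x, y, z): i = z + 32*(x + 4*y),
    with x in 0..3, y in 0..2 and z in 0..31."""
    return z + 32 * (x + 4 * y)

def coords(i):
    """Inverse mapping: recover (x, y, z) from the flat index i in 0..383."""
    z = i % 32
    x = (i // 32) % 4
    y = i // 128
    return x, y, z
```

For instance, the first bit of the lane at (x, y) = (1, 0) gets index 32, and the first bit of plane y = 1 gets index 128; the mapping is a bijection between the 384 coordinates and the indices 0..383.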

Xoofff specification and security claim
Xoofff is a deck function obtained by applying the Farfalle construction on Xoodoo[6] and two rolling functions: roll_Xc for rolling the input masks and roll_Xe for rolling the state. We specify them with operations on the lanes of the state, following the conventions of Table 1 and Table 3. The input mask rolling function roll_Xc updates a state A with a linear recurrence on its lanes, while the state rolling function roll_Xe updates it with a nonlinear one. Formally, Xoofff = Farfalle[Xoodoo[6], roll_c, roll_e] with the following parameters:
• roll_c = roll_Xc and
• roll_e = roll_Xe.
We make the following security claim on Xoofff.
Claim 1. Let K = (K_0, ..., K_{u−1}) be an array of u secret keys, each uniformly and independently chosen from Z_2^κ with κ < 384. Then, the advantage of distinguishing the array of functions Xoofff_{K_i}(·) with i ∈ Z_u from an array of random oracles is at most

(uN + u(u−1)/2)/2^κ + N/2^192 + M(M−1)/2^385 + 2𝒩√(u/2^κ) + 2𝒩/2^96.  (1)

Here,
• N is the classical computational complexity expressed in the (computationally equivalent) number of executions of Xoodoo[6],
• 𝒩 is the quantum computational complexity expressed in the (equivalent) number of quantum oracle accesses to Xoodoo[6], and
• M is the online or data complexity expressed in the total number of input and output blocks processed by Xoofff_{K_i}(·).
In (1), the first term accounts for the effort to find one of the u secret keys by exhaustive search, and for the probability that two keys are equal. The second term expresses that recovering the accumulator or any rolling state inside Xoofff must be as hard as recovering 192 secret bits. The third term expresses the effort to find a collision in the accumulator.
The fourth and fifth terms only apply if the adversary has access to a quantum computer. The fourth term accounts for a quantum search (or quantum amplitude amplification) to find one of the u keys [Gro96, BHMT02]. The probability of success after 𝒩 iterations, with 𝒩 the quantum computational complexity, is sin²((2𝒩 + 1)θ) with θ = arcsin √(u/2^κ). We upper bound this by 2𝒩√(u/2^κ). The fifth term similarly accounts for a quantum search of a 192-bit secret.
Note that we assume that Xoofff is implemented on a classical computer. In other words, we do not make claims w.r.t. adversaries who would make quantum superpositions of queries to the device implementing Xoofff and holding its secret key(s).
We restrict keys to the uniform distribution to keep our claim simple and to avoid pathological cases that would not offer good security. In the multi-user setting, we require the keys to be independently drawn. If an adversary can manipulate K_i, such as with so-called unique keys that consist of a long-term key with a counter appended, we recommend hashing the key and the counter with a proper hash function.
We do support the use of variable-length keys in the multi-user setting, where we assume that a key of a given length is selected uniformly from the strings of that length. The claimed distinguishing bound then becomes slightly more complex and is given in Equation (2), with L the array of the distinct key lengths in use and u_l the number of keys of length l.

Implementation aspects
In this section, we first report on some possible optimizations, then we give benchmarks for Xoofff.

Optimizations
Naturally, the lanes of Xoodoo coincide with words on 32-bit processors. All the operations in Algorithm 1 can be implemented using bitwise logical operations and rotations. The χ and θ steps can be implemented in a way that minimizes temporary storage. The step χ is specified as a number of parallel computations but can be serialized to allow in-place processing with no computational penalty. In particular, the following sequence of operations performs χ, where each assignment uses the already updated planes: A_0 ← A_0 + (¬A_1 ∧ A_2), then A_1 ← A_1 + (¬A_2 ∧ A_0), then A_2 ← A_2 + (¬A_0 ∧ A_1).

For the step θ, one can exploit the fact that the θ-effect E added to a sheet x depends only on the parity of the sheet at x − 1. One can proceed as follows. First, one computes the lane x = 1 of the θ-effect E (denoted E_1) from the parity of the sheet at x = 0 and stores it in a temporary 32-bit register R. Then, one adds R = E_1 to the sheet at x = 1.
To compute E_2 from the parity of the sheet at x = 1, we notice that the sheet at x = 1 no longer has its original value: all its lanes got E_1 added to them. Hence, one can reuse the register R = E_1 and add to it the three lanes of the sheet at x = 1, so that the occurrences of E_1 cancel out and R correctly holds the parity before θ. One proceeds similarly to compute E_3 and then E_0, each time reusing the register that contained the previous lane of the θ-effect.

Benchmarks
We implemented Xoodoo and Xoofff, and we benchmarked them on different processors: the 32-bit ARM Cortex-M0 and Cortex-M3, and two mainstream desktop Intel processors with the Skylake and SkylakeX architectures. The difference between the last two is that Skylake supports 256-bit vector instructions (AVX2), while SkylakeX offers 512-bit vector instructions (AVX-512).
On the Cortex-M3, we fit the state in 12 registers and use 2 registers for temporary variables. Furthermore, we can use the free rotations that this platform provides, so that no explicit rotation instruction is needed. The state gets globally rotated, and this can be corrected at the end, as done by Schwabe et al. [SYY12]. With this technique, one round takes 49 cycles, i.e., about 1.02 cycles/byte per round, plus 12 cycles for the global correction at the end. In Table 4, we assume that this global correction and the function-call overhead are amortized over 18 rounds, as this can be done in the expansion phase of Xoofff.
The Cortex-M0 limits bitwise logical operations to the first 8 registers and does not support free rotations. This translates into more transfers between the first and the last registers, as well as explicit rotations. This trend is visible in Table 4, where we display the computational effort for one round of Xoodoo, as well as the round of other permutations [Log, SS]. Skylake refers to an Intel Core i5-6500 processor at 3.2 GHz and SkylakeX to an Intel Core i7-7800X processor at 3.5 GHz. In both cases, the benchmark was performed with the Turbo Boost feature disabled.

Xoodoo's internal parallelism can be exploited with vector instructions, where one plane fits in a 128-bit register and several operations can be done plane-wise. Naturally, vector instructions are particularly well suited to the parallelism provided by Xoofff's Farfalle construction. Here the size of the vector instruction determines the maximum number of Xoodoo instances that can be computed in parallel: 8 for Skylake (AVX2) and 16 for SkylakeX (AVX-512). In addition, the AVX-512 instruction set allows arbitrary three-input bitwise logical operations in one instruction, as well as rotations. The former is used to implement χ and the parity computation in θ, and the latter speeds up θ, ρ_east and ρ_west. This explains why Xoodoo and Xoofff are faster on SkylakeX than on Skylake even at the same level of parallelism. We present benchmarks of Xoofff in Table 5 with simple use cases.

Design rationale of Xoodoo
Xoodoo is an iterated permutation that is strongly inspired by Keccak-p: it is bit-oriented and its round function uses similar operations, see Section 5.1. It prevents high-probability differential trails and high-correlation linear trails by adopting the wide trail strategy, see Section 5.2. We discuss in Section 5.3 that Xoodoo enjoys the benefits of weak alignment like Keccak-p does. These benefits include negligible clustering of trails in differentials or correlations and the inapplicability of classes of attacks such as truncated differentials [BDPA11a]. Moreover, the Xoodoo avalanche characteristics we report in Section 5.4 show that the combination of the wide trail strategy with weak alignment results in very fast diffusion. We discuss the high degree of symmetry of the round function in Section 5.5 and the choice of round constants that remove that symmetry in Section 5.6. Finally, in Section 5.7 we explain how we gravitated to the Xoodoo round function structure starting from that of Keccak-p, and Section 5.8 reports on how we arrived at the choice of the rotation constants.

The Xoodoo round function in a nutshell
The Xoodoo round function uses the five step mappings θ, ρ_west, ι, χ and ρ_east. Four of them are very symmetric, as they operate on the bits of the state in the same way, independently of their position. More formally, we say that a given step α is translation-invariant over a direction (x, y, z) if it commutes with a translation by (x, y, z), i.e., if α ∘ τ_{(x,y,z)} = τ_{(x,y,z)} ∘ α, where τ_{(x,y,z)} is a (cyclic) translation of the state by (x, y, z).
The nonlinear layer χ is an instance of the transformation χ that was described and analyzed in [Dae95]. In Xoodoo it operates in parallel on 3-bit columns and as such forms a layer of 4 × 32 3-bit S-boxes. In general, χ has algebraic degree two [Dae95, Section 6.9], with interesting consequences for the analysis (see Section 5.2.3). For 3-bit units, χ is involutive and hence equal to its own inverse. Consequently, r rounds of Xoodoo or of its inverse cannot have an algebraic degree higher than 2^r. The χ layer is translation-invariant in all directions.
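Both properties are easy to confirm exhaustively. The sketch below (the helper names `chi3`, `anf` and `degree` are ours) tabulates the 3-bit χ, checks that it is an involution, and recovers the algebraic degree of its coordinate functions from their algebraic normal form via the Möbius transform:

```python
def chi3(c):
    """3-bit chi on one column c, with bit y of c the bit of plane y."""
    a0, a1, a2 = c & 1, (c >> 1) & 1, (c >> 2) & 1
    b0 = a0 ^ ((a1 ^ 1) & a2)   # complement-AND structure of chi
    b1 = a1 ^ ((a2 ^ 1) & a0)
    b2 = a2 ^ ((a0 ^ 1) & a1)
    return b0 | (b1 << 1) | (b2 << 2)

def anf(tt):
    """Moebius transform: truth table -> algebraic normal form coefficients."""
    c = list(tt)
    for i in range(3):
        for x in range(8):
            if x >> i & 1:
                c[x] ^= c[x ^ (1 << i)]
    return c

def degree(bit):
    """Algebraic degree of output bit `bit` of chi3."""
    coeffs = anf([(chi3(x) >> bit) & 1 for x in range(8)])
    return max(bin(x).count("1") for x in range(8) if coeffs[x])
```

Running it shows that `chi3` is a permutation of the 8 column values, that `chi3(chi3(c)) == c` for all of them, and that each of the three output bits has algebraic degree exactly 2.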
The mixing layer θ is a column parity mixer [SD18] that operates as follows. It builds the parity plane by adding the three planes, and computes from it the θ-effect plane by making two shifted copies of the parity plane and adding them. Then it adds the θ-effect to each of the three planes. The θ layer is invertible, has order 32 and its inverse is dense. Like χ, it is translation-invariant in all directions.
The dispersion layer in Xoodoo consists of two steps. As the state bits interact only within columns, both in the parity-plane computation in θ and in χ, the bits of the columns need to be dislocated between every application of θ and of χ. For that reason, after each χ layer we have the so-called Eastern shift ρ_east, and after each θ layer the Western shift ρ_west. Both shift the planes, treating them as rigid objects. Clearly, ρ_east and ρ_west are translation-invariant in all directions with y = 0, i.e., all horizontal directions. Their breaking up of the columns results in weak alignment [BDPA11a].
The translation-invariance of the step mappings in horizontal directions results in high symmetry. We destroy this symmetry by the classical addition of round constants in the step ι. In trail search the round constants can be ignored, and we can take advantage of this high amount of symmetry.
The round function is defined as R_i = ρ_east ∘ χ ∘ ι ∘ ρ_west ∘ θ. For the study of propagation, we often rephase the rounds as starting with ρ_east and ending with χ, and we group the sequence of linear mappings in λ = ρ_west ∘ θ ∘ ρ_east, so that the rephased round function becomes χ ∘ ι ∘ λ. We call λ the linear layer. When studying propagation through the rounds, we denote the input to λ of the first round by a_0, its output by b_0, the output of χ by a_1 and so on. One can think of b as before χ and a as after χ. The values of a and b with the same index i are connected through λ.
The round numbering starts from negative numbers, with the last round having index 0. The reason is to avoid slide-like attacks when Xoodoo instances with different numbers of rounds are used in a single construction.

Difference and correlation propagation
In this section, we report on the difference and correlation propagation in Xoodoo. For a detailed description of the trail search we refer to Section 7.

Differential probability and trails
In many use cases we are interested in the differential propagation probabilities (DP) of a cryptographic primitive. In the case of Xoodoo specifically, this is essential for the security of the compression phase of Xoofff. In particular, we would like to characterize the distribution of DP(∆_in, ∆_out) values over all input differences ∆_in and output differences ∆_out of our permutation, where DP(∆_in, ∆_out) is the fraction of input pairs with difference ∆_in that results in difference ∆_out after the primitive. For iterated cryptographic permutations and block ciphers, this is a hard problem. However, we can gain understanding by studying differential trails, as the DP of a differential is the sum of the DP values of its trails.
An n-round differential trail is the concatenation of n round differentials (a_i, a_{i+1}) and is fully specified by the sequence (a_0, a_1, ..., a_n). We say a pair (α, β) follows a trail when its initial difference is a_0 and the difference after round i is a_i for all i ≤ n. We apply the rephasing introduced in Section 5.1 and use a redundant representation of trails, where we also include the differences after the linear layer: (a_0, b_0, a_1, b_1, ..., b_{n−1}, a_n). Clearly b_i = λ(a_i), and each differential (b_{i−1}, a_i) over χ imposes a number of conditions on the members of the pair (α, β). We call this number the restriction weight w_r. It follows that the restriction weight of a trail is the sum of the restriction weights of its round differentials. The restriction weight allows approximating the DP of a trail: if the conditions are independent, the DP of the trail is 2^{−w_r}. We report on the distribution of trail weights for 3 rounds in Table 6 at the end of this section.
Even in the absence of low-weight trails, one may have differentials (∆_in, ∆_out) with high DP if there are very many differential trails from ∆_in to ∆_out, or if there are differential trails whose DP is much higher than 2^{−w_r} due to dependencies between the round differential conditions. The study of these two aspects is closely related to that of alignment, which we treat in Section 5.3.
A differential over χ is only possible if b_i and a_{i+1} have the same column activity pattern, i.e., the set of active columns must be the same. As shown in Section 5.2.3 below, the restriction weight equals twice the number of active columns in b_i, or equivalently, in a_{i+1}. It follows that the restriction weight of an n-round trail (a_0, b_0, a_1, ..., b_{n−1}, a_n) is fully determined by the sequence (a_1, b_1, ..., b_{n−1}) and is given by w_r(a_1) + Σ_{1≤i<n} w_r(b_i). We call such a sequence a differential trail core, as in [DA12]. A trail core can be extended to an n-round differential trail by pre-pending a couple (a_0, b_0) with b_0 compatible through χ with a_1 and appending a value a_n compatible through χ with b_{n−1}. It follows that a trail core Q represents in total 2^{w_r(a_1)} × 2^{w_r(b_{n−1})} trails, all with the same weight. In our analysis, we bound the weight of trail cores. In the sequel, we use w(·) as a shortcut notation for w_r(·) when clear from the context.

Correlation and linear trail cores
Similarly to differential probability, we are interested in the input-output correlation properties of a cryptographic primitive f. In particular, we would like to characterize the distribution of C(u_out^T f(x), u_in^T x), i.e., the correlation between linear combinations of output bits u_out^T f(x) and linear combinations of input bits u_in^T x, over all values of the output (linear) mask u_out and the input (linear) mask u_in. For iterated cryptographic permutations and block ciphers, this is a hard problem. Here too, we can gain understanding by studying linear trails, as a correlation C(u_out^T f(x), u_in^T x) is the sum of the (signed) correlation contributions of its trails. In the sequel we will, for readability, slightly abuse terminology by speaking about correlations between masks.
An n-round linear trail is the concatenation of n single-round correlations. A correlation over round i is defined by a mask a_i at its output and a mask a_{i+1} at its input, and we denote its correlation value C(a_i, a_{i+1}).
As for differential trails, we use a redundant representation by including the masks after the linear layer. To make notation consistent with differential trails, we rephase the rounds as starting with χ and ending with λ. However, as linear propagation is naturally studied from the output to the input, the trail first encounters λ and then χ of each round. A mask a_i at the output of λ maps to a mask b_i = λ^T(a_i) before λ. In this way a_i fully determines b_i via the linear layer λ. Our trails look like (a_0, b_0, a_1, b_1, ..., b_{n−1}, a_n), where a_0 is the mask after the last round and a_n the mask before the first round. Note that the transposition denotes the following operation on a linear mapping: when the linear mapping µ is expressed as the multiplication by the matrix M, the transpose of µ, or µ^T, is the linear mapping given by the multiplication by M^T. It follows that λ^T = ρ_east^{−1} ∘ θ^T ∘ ρ_west^{−1}, since the inverse of a bit transposition matrix is its transpose.

The correlation contribution of a linear trail is the product of its round correlations. Similarly to differential trails, we define a correlation weight for a round correlation w_c(b_i, a_{i+1}) by C²(b_i, a_{i+1}) = 2^{−w_c(b_i, a_{i+1})}, and we define the weight of a trail as the sum of the weights of its round correlations. Also here we may have large input-output correlations (u_out, u_in) even in the absence of low-weight trails if there are very many linear trails from u_out to u_in and their signed correlation contributions combine constructively. This is again covered by our treatment of alignment in Section 5.3.
The correlation weight of a round correlation (a_i, a_{i+1}) is determined by the correlation weight of the mask couple (b_i, a_{i+1}) over χ. This correlation is only non-zero if b_i and a_{i+1} have the same column activity pattern and, as shown in Section 5.2.3 below, the correlation weight equals twice the number of active columns in b_i, or equivalently, in a_{i+1}. It follows that the correlation weight of an n-round trail (a_0, b_0, ..., b_{n−1}, a_n) is fully determined by the sequence (a_1, b_1, ..., b_{n−1}) and is given by w_c(a_1) + Σ_{1≤i<n} w_c(b_i); we call such a sequence a linear trail core. We also use w(·) as shortcut notation for w_c(·) when it is clear from the context that we are dealing with a linear trail.

Properties of χ
χ can be defined generically, operating on n bits arranged in a circle. The differential and correlation propagation properties of this generic χ are non-trivial and have been described in [Dae95, Section 6.9]. Thanks to the fact that in Xoodoo χ operates on 3-bit circles, formed by the columns, the propagation of differences and masks through it can be specified very compactly. We do this in Proposition 1.
We do not give a proof of this proposition as it can easily be checked exhaustively. Proposition 1 has several corollaries:

Corollary 1. For fixed (difference or mask) a, the compatible (difference or mask) b values form an affine space of dimension 2 and vice versa.
Corollary 2. The restriction weight of a differential over χ is equal to two times the number of active columns in b, or equivalently in a.

Corollary 3. The correlation weight of a mask couple over χ is equal to two times the number of active columns in b, or equivalently in a.
These three corollaries simplify trail analysis in comparison with that of Keccak-p. Thanks to the first one, both forward and backward extension can make use of linear algebra. Thanks to the last two corollaries, we can replace the weight and the minimum reverse weight (see [DA12]) by the number of active columns.
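Corollaries 1 and 2 can themselves be verified by building the full difference distribution table of the 3-bit χ. The sketch below (our own illustration, with `chi3` and `ddt_row` as illustrative names) checks that every nonzero input difference is compatible with exactly 4 output differences, each reached by 2 of the 8 inputs, so DP = 2^{−2} and the restriction weight is 2 per active column, and that the 4 compatible outputs form an affine space (for a 4-element set this is equivalent to its XOR being zero):

```python
def chi3(c):
    """3-bit chi on one column; bit y of c is the bit of plane y."""
    a0, a1, a2 = c & 1, (c >> 1) & 1, (c >> 2) & 1
    return ((a0 ^ ((a1 ^ 1) & a2))
            | ((a1 ^ ((a2 ^ 1) & a0)) << 1)
            | ((a2 ^ ((a0 ^ 1) & a1)) << 2))

def ddt_row(da):
    """For input difference da, count how often each output difference occurs."""
    row = {}
    for a in range(8):
        db = chi3(a) ^ chi3(a ^ da)
        row[db] = row.get(db, 0) + 1
    return row
```

Iterating over the seven nonzero input differences confirms the picture: each row of the DDT has 4 entries of value 2, and the 4 compatible output differences XOR to zero.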

Trail weight distributions
As we will detail in Section 7, we have determined all 3-round trails up to weight 50, both linear and differential, and we list them in Table 6. The minimum weight for both types of trails is 36. The trails of weight 36 and 38 are simply due to the 3-plane structure of Xoodoo and are described in Section 7.6. Apart from them, there are no 3-round trails below weight 44. When extending the 3-round trails to 6 rounds, none were found. Since trail weights are even, this results in a lower bound of 2T_3 + 4 = 104 on the weight of 6-round trails, with T_3 = 50 the weight up to which we scanned, for both differential and linear trails. The lower bounds on trail weights are summarized in Table 7.

Alignment
In [BDPA11a], Bertoni et al. investigated an aspect of round functions called alignment. Alignment is related to the propagation of activity patterns through the linear layer of a round function. We summarize what alignment means in the context of round functions that have an S-box layer as non-linear layer. The s-bit S-boxes partition the bits of the state into subsets that are processed by the same S-box, and we call these boxes. When applying a difference ∆, a mask u or, in general, any state-sized binary pattern, we can define its corresponding (box) activity pattern. The activity pattern corresponding to a concrete pattern specifies for each box whether it contains only zero bits or at least one bit with value 1. In the former case we call the box passive, in the latter active. Given an activity pattern ā with n active boxes, there are (2^s − 1)^n concrete patterns a compliant with ā. We will treat activity patterns as sets of concrete patterns and write a ∈ ā. Clearly, an invertible S-box layer preserves the (box) activity patterns of differences and of linear masks (and those of most other propagating structures). This is not true in general for a linear layer.
If the linear layer maps the elements of ā to many different activity patterns, we say it has weak alignment. Otherwise, if it maps large fractions of the elements of ā to a small set of activity patterns, we speak of strong alignment. This can be applied to different types of patterns, but the most important are differences and (linear) masks. Their propagation through the linear layer is governed by different laws, but they behave in very similar ways (see Sections 5.2.1 and 5.2.2). We denote by λ(ā) the set of states b with b = λ(a) and a ∈ ā.

Strength of alignment manifests itself in the number of trail cores in truncated differentials [Knu94]. A truncated differential is defined by a couple of activity patterns (∆_in, ∆_out) and is the set of all differentials with input difference in ∆_in and output difference in ∆_out. A trail is in (∆_in, ∆_out) if its initial difference is in ∆_in and its final difference is in ∆_out. Let us now consider a truncated differential (∆_in, ∆_out) over χ ∘ λ ∘ χ. The number of trail cores in this truncated differential is the number of elements in λ(∆_in) ∩ ∆_out. In the case of weak alignment, the elements of λ(∆_in) will by definition have many different activity patterns and hence, for any ∆_out, the set λ(∆_in) ∩ ∆_out will be small. This implies that truncated differentials, and a fortiori ordinary differentials, will have a small number of trail cores. In the case of strong alignment, λ(∆_in) ∩ ∆_out may be large and (truncated) differentials may have many trails. An analogous reasoning holds for correlations and linear trails, where clustering depends on the alignment of λ^T instead.
With a similar (but not identical) reasoning, it can be shown that in the case of weak alignment the conditions imposed by the round differentials in a trail tend to be independent, while strong alignment increases the risk of dependence, as observed in the plateau trails of Rijndael [DR07].
In Xoodoo the boxes are 3-bit columns and an activity pattern has the shape of a plane. Clearly, the linear layer maps the 2^3 − 1 = 7 single-column patterns (both differences and masks) to 7 different output activity patterns. This is weak alignment. We experimented with randomly generating many activity patterns a with n ≤ 10 active columns, and for the vast majority of cases the 7^n elements of λ(a) had 7^n different activity patterns.

Avalanche behavior
When reporting on (reduced-round) cryptographic functions, one often mentions criteria such as full diffusion, avalanche and the strict avalanche criterion (SAC) [WT85]. These criteria are useful in estimating the vulnerability of the cipher to certain attacks. Each of these criteria is binary: it is either met, or it is not. Typically, for an iterated cipher one reports the number of rounds required to satisfy it. In this section we define metrics that allow evaluating in a more fine-grained way how the function realizes these criteria through the rounds.

Definition of avalanche metrics
Concretely, we compute the avalanche probability vector of a cryptographic primitive F for some input difference ∆: a vector P_∆F where component i is the probability that bit i of the output of F flips due to the input difference ∆. For clarity, we specify the generation of the avalanche probability vector in Algorithm 2. After M samples, the expected standard deviation of the elements of P_∆F is 1/√M. So for high precision, M must be chosen large enough. In our experiments, we took M = 250000.
Algorithm 2 Computation of the avalanche probability vector P_∆F.
Parameters: a transformation F over Z_2^b, an input difference ∆ and a number of samples M.
Output: the avalanche probability vector P_∆F.
  Initialize a b-bit vector p of probabilities p_i to all zeroes
  for M randomly generated states A do
    Compute B = F(A) + F(A + ∆)
    for all state bit positions i do
      if bit i of B is 1 then increment p_i
  return p/M

From the avalanche probability vector P_∆F we extract three metrics, each one measuring an aspect of the difference at the output of F due to a given input difference ∆. In the following we write p_i for P_∆F[i].
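As a minimal sketch of Algorithm 2, the following computes the avalanche probability vector for a hypothetical 16-bit χ-like toy map (toy_F, b = 16 and M = 2000 are our choices for illustration; the paper uses Xoodoo and M = 250000):

```python
import random

def rotl16(x, r):
    return ((x << r) | (x >> (16 - r))) & 0xFFFF

def toy_F(x):                            # hypothetical chi-like toy map on 16 bits
    return x ^ (~rotl16(x, 1) & rotl16(x, 2) & 0xFFFF)

def avalanche_vector(F, b, delta, M, seed=1):
    rng = random.Random(seed)
    counts = [0] * b
    for _ in range(M):
        A = rng.getrandbits(b)
        B = F(A) ^ F(A ^ delta)          # output difference
        for i in range(b):
            counts[i] += (B >> i) & 1    # did output bit i flip?
    return [c / M for c in counts]       # P_dF[i] = Pr[output bit i flips]

P = avalanche_vector(toy_F, 16, 1, 2000)
```

For this toy map and ∆ = 1, output bit 0 always flips, bits 1 and 2 flip with probability about 1/2 and all other bits never flip, which the vector P reflects.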
Avalanche dependence: the number of output bits that may flip, defined as:
D_av(F, ∆) = Σ_i (1 − δ(p_i)),
with δ(x) equal to 1 if x = 0 and 0 otherwise. This metric generalizes full diffusion, which is satisfied if D_av(F, ∆) = b for all ∆ with Hamming weight 1.
Avalanche weight: the expected Hamming weight of the output difference, defined as:
w_av(F, ∆) = Σ_i p_i.
Clearly w_av(F, ∆) ≤ D_av(F, ∆). This metric generalizes the avalanche criterion, which is satisfied if w_av(F, ∆) ≈ b/2 for all ∆ with Hamming weight 1.
Avalanche entropy: the uncertainty about whether output bits flip, defined as an entropy:
H_av(F, ∆) = Σ_i (−p_i log_2 p_i − (1 − p_i) log_2(1 − p_i)).
This metric generalizes SAC, which is satisfied if H_av(F, ∆) ≈ b for all input differences ∆ with Hamming weight 1.
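The three metrics are direct functions of the vector P_∆F; a minimal transcription of the definitions (the example vector is ours):

```python
import math

# The three avalanche metrics computed from an avalanche probability vector.
def avalanche_metrics(P):
    D = sum(1 for p in P if p != 0)        # dependence: bits that may flip
    w = sum(P)                             # weight: expected output difference weight
    H = sum((-p * math.log2(p) - (1 - p) * math.log2(1 - p))
            if 0 < p < 1 else 0.0
            for p in P)                    # entropy of the per-bit flip indicators
    return D, w, H

D, w, H = avalanche_metrics([1.0, 0.5, 0.5, 0.0])
```

A bit with p_i ∈ {0, 1} contributes nothing to the entropy, while a bit with p_i = 1/2 contributes a full bit, matching the inequality w_av ≤ D_av noted above.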

Reporting on the avalanche properties of Xoodoo
We report on the performance of Xoodoo with respect to the three avalanche metrics described above, for particular input differences and for different numbers of (forward and inverse) rounds. For each of these metrics, we report the worst-case value: the minimum taken over all individual input differences of a given type. We give the results in Table 8.
In Table 8, we follow the convention of a_i and b_i introduced in Section 5.2.1 and we apply the difference at stage a_0. As b_i = λ(a_i) with λ linear, applying a difference ∆ at a_0 is equivalent to applying a difference λ(∆) at b_0. Clearly, the avalanche scores at stage a_i report on the difference after i Xoodoo rounds. If i is positive, these are forward rounds; if i is negative, these are inverse rounds. Avalanche scores at stages b_i add one linear layer λ to it. For the differences applied at a_0 or b_0, we consider:
δ_a Single-bit differences at a_0.
δ_K Orbitals at the input/output of θ. These are 2-bit differences both at a_0 and b_0.
The avalanche behavior of a cipher gives a good indication of the number of rounds that certain structural distinguishers can cover. For example, as a rule of thumb, it is hard to find impossible differentials that span more than two times the number of rounds it takes to reach full diffusion. From Table 8 we can see that strict avalanche is reached after 3.5 rounds in the forward direction and after 2 rounds in the backward direction.

Symmetry
A plane in Xoodoo can be seen as an infinite state that is periodic in two directions: period 4 in the direction of the x-axis and period 32 in the direction of the z-axis. Put otherwise: it is invariant under translations over any vector in the two-dimensional lattice with basis vectors (4, 0) and (0, 32). We express this lattice as ⟨(4, 0), (0, 32)⟩ and we call it the Xoodoo lattice Ξ. Differences and masks propagate irrespective of the round constants, so that symmetry can be maintained during propagation.
This effect also exists in Keccak-p and is called Matryoshka [BDPA11b]: states (differences or masks) of Keccak-p[25 × 2^n] with symmetry ∀(x, y, z) : A[x, y, z] = A[x, y, z + 2^j] map to states of Keccak-p[25 × 2^{n−j}]. The invariance of Xoodoo with respect to all horizontal translations results in a two-dimensional Matryoshka property. A symmetric state of Xoodoo can be expressed with respect to a lattice V: ∀(x, y, z) and v ∈ V : A_y[(x, z)] = A_y[(x, z) + v]. If we take V to be the Xoodoo lattice Ξ, this describes a regular Xoodoo state. If V is a lattice that has Ξ as a sub-lattice, we have a state with additional symmetry. Each symmetric state maps to a state in a smaller instance of Xoodoo, with an equal step χ and variants of ρ_west, ρ_east and θ.
Each lattice V that has Ξ as a sub-lattice defines a symmetry class S_V that forms a subset of the state values. A state a is in the symmetry class S_V if it is invariant with respect to any translation along V and there exists no lattice V′ with V ⊂ V′ such that a is invariant with respect to V′. The symmetry classes form a partition of the state space.
We can exhaustively specify the symmetry classes by the bases of their lattices, where the first element of the basis is of the form (0, 2^e) with 0 ≤ e ≤ 5. For a basis with first vector (0, 2^e), the second vector is in the following range:
• (4, 0), (2, 0) and (1, 0): exist for all e;
• (2, 2^{e−1}) and (1, 2^{e−1}): exist for e > 0;
• (1, 2^{e−2}) and (1, 3 · 2^{e−2}): exist for e > 1.
We count here 6 × 3 + 5 × 2 + 4 × 2 = 36 lattices, including Ξ itself. The symmetry class S_V of the lattice V = ⟨(1, 0), (0, 1)⟩ can further be subdivided into 2 symmetry classes due to the fact that all-0 or all-1 planes give rise to shift-invariance along the y-axis. The two classes are the one with three equal planes and the one with different planes. We can model this split by extending the lattice vectors with a y-component and adding a third lattice vector. The 3-equal-plane lattice can now be specified as ⟨(1, 0, 0), (0, 0, 1), (0, 1, 0)⟩. The 36 other lattices just get the additional vector (0, 3, 0). For readability, we will stick to the two-dimensional representation and ignore the y-component. So there are 37 symmetry classes in total. For 36 of these classes, the elements exhibit some symmetry within the boundaries of the Xoodoo state. For one class this is not the case: namely S_Ξ. This class contains about 2^384 − 3 × 2^192 of the 2^384 state values, so the vast majority.
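The case analysis above can be checked mechanically by enumerating the listed bases (a direct transcription of the ranges, nothing more):

```python
# Enumerate the sublattice bases from the case analysis above.
bases = []
for e in range(6):                        # first basis vector (0, 2^e)
    second = [(4, 0), (2, 0), (1, 0)]     # exist for all e
    if e > 0:
        second += [(2, 2**(e - 1)), (1, 2**(e - 1))]
    if e > 1:
        second += [(1, 2**(e - 2)), (1, 3 * 2**(e - 2))]
    bases += [((0, 2**e), v) for v in second]
```

The enumeration yields the 36 distinct lattices counted in the text, including the Xoodoo lattice Ξ = ⟨(4, 0), (0, 32)⟩ itself.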

Round constants
Thanks to their shift-invariance and invertibility, applying any step mapping of the round function other than ι to a state in a symmetry class S_V results in a state in the same symmetry class S_V. The symmetry classes are hence invariant subsets [LAAZ11]. Moreover, the union of any subset of the 37 symmetry classes is also an invariant subset. As n disjoint subsets can be grouped into two non-empty subsets in 2^{n−1} − 1 ways, all steps except ι have the same 2^36 − 1 invariant subsets. This property would carry over to a round function variant without ι and thus to such a Xoodoo[n_r] variant irrespective of the number of rounds.
We chose the round constants to destroy the shift-invariance of the round function and to remove all these 2^36 − 1 invariant subsets. For this reason, we chose them to be in S_Ξ, so that the addition of a round constant maps any state not in S_Ξ to a state in S_Ξ. As such, the round function cannot have any of the 2^36 − 1 subsets as invariant subsets. As S_Ξ contains more than half of the state values, it is not possible to group the symmetry classes into two equal subsets. This allows us to exclude the case of a subset mapping to its complement.
The effects of the round constants in two (or more) consecutive rounds might compensate each other. For that reason, we have opted for round constants with support in a single lane (x, y) = (0, 0), so that the subsequent application of θ makes them propagate to other lanes. Moreover, to avoid attacks that exploit the equality of rounds, such as slide attacks [BW99], the round constants depend on the round number.
Naturally, Xoodoo[6] is a permutation, and for any permutation the union of the elements in any subset of its cycles forms an invariant subset. For a random permutation these invariant subsets would not carry enough structure to be exploitable in Farfalle. Investigating which cycle-behavior properties of a permutation are relevant in the context of Farfalle, and whether such behavior is present in Xoodoo[6], are interesting topics for future research.
Finally, to allow efficient implementation on ARM Cortex-M3 processors, the round constants span at most 4 consecutive positions along the z-axis.

The making of Xoodoo
For the design of Xoodoo, we started from Keccak-p and aimed for a 384-bit permutation. We decided to use a nonlinear layer similar to Keccak-p's χ, but on 3 bits instead of 5 to match the factor 3 in 384. In any case, χ needs to be applied to an odd number of bits, otherwise it is not invertible. Another option would have been to aim for a 320-bit permutation and to stick to Keccak-p's χ on 5 bits, but we thought that 384 bits would be better suited.
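The invertibility constraint can be checked directly: a complement-and-AND layer in the style of χ, applied to an n-bit circular register, is a permutation for odd n but not for even n (the enumeration helper is ours):

```python
# chi on an n-bit circular register: b[i] = a[i] XOR (NOT a[i+1] AND a[i+2]),
# indices taken mod n.
def chi(a, n):
    bit = lambda x, i: (x >> (i % n)) & 1
    return sum((bit(a, i) ^ ((1 - bit(a, i + 1)) & bit(a, i + 2))) << i
               for i in range(n))

images3 = {chi(a, 3) for a in range(8)}    # 3-bit columns: a permutation
images4 = {chi(a, 4) for a in range(16)}   # even size: not invertible
```

On 3 bits all 8 inputs map to 8 distinct outputs, whereas on 4 bits some outputs collide (e.g. 0000 and 0101 both map to 0000).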
Next, we opted for 32-bit lanes. Another option could have been to take 128-bit lanes and rewrite the algorithm into equivalent operations on 32-bit words using the bit interleaving technique [BDP+12], but we found that a structure with 3 planes of 4 × 32 bits was easier to describe.
For the mixing layer, we opted for a column parity mixer, similar to Keccak-p's θ. For this layer to have a dense inverse [SD18], θ needs to work on columns of odd size. This made it clear that both χ and θ would need to work on 3-bit columns.
For dispersion, Keccak-p uses two operations applied before χ: π, which moves lanes, and ρ, which translates lanes; but it has no dispersion layer between χ and θ. In the case of Xoodoo, however, we need two dispersion layers to avoid overlaps between the mixing and the nonlinear layers: one before χ and one before θ. So the idea of using ρ_east and ρ_west came early in the design. Another option would have been to have χ operate on skewed columns, but this was equivalent and just more complicated to describe than the two dispersion layers. Moreover, it seemed that ρ mappings that shift planes rather than lanes would be sufficient, and so would be simpler than those in Keccak-p. We definitely liked the idea of three independent rigid objects interacting through χ and θ.
Initially, we thought that we could get away with one of the dispersion layers moving only along the x-axis, i.e., only moving lanes without any shifts, hence saving on the number of rotations in the implementation. However, it turned out that this was not enough, and we therefore added some translation along z to balance the work between the two ρ steps.
Finally, we fixed all the rotation offsets as described in the next section.

Choosing the shift offsets in the light of trails
Once we had defined the general structure of Xoodoo, we set up experiments to find good shift offsets for the linear layer. Specifically, the family we investigated is as in Algorithm 1, where
• the θ-effect is computed as E ← P ≪ (1, t_1) + P ≪ (t_3, t_2),
• in ρ_west, the translation of A_2 is A_2 ← A_2 ≪ (0, w_1), and
• in ρ_east, the translation of A_2 is A_2 ← A_2 ≪ (e_0, e_1),
for parameters t_1, t_2, t_3, w_1, e_0 and e_1.
In short, we chose the shift offsets of ρ_east, θ and ρ_west such that the number of trails of weight below 44 is minimized, both for differential and for linear trails. Actually, only the so-called inherent trails remain below 44; see Section 7.6. Let us detail the decision process.
We restricted the offsets in ρ_west and ρ_east from the start to limit their computational cost. As only relative shifts count, we start by not shifting plane A_0 at all. For the other two planes, we set out to limit the number of cyclic lane shifts, as they are expensive on some platforms. In an early phase, we limited shifts in ρ_west and ρ_east to shifts along the x-axis (with an offset of the form (s, 0)) and shifts along the z-axis (of the form (0, t)). In particular, A_1 would undergo a shift (1, 0) in ρ_west and a shift (0, 1) in ρ_east, and A_2 a shift (0, w_1) in ρ_west and (e_0, 0) in ρ_east. In that way, both ρ_west ∘ ρ_east and ρ_east ∘ ρ_west shift any pair of planes with respect to each other over an offset of the form (s, t) with s ≠ 0 and t ≠ 0. However, our propagation experiments immediately revealed systematic low-weight trails. Adding a shift of A_2 along the z-axis in ρ_east made these trails go away, even with the offset e_1 = 8 that is cheaper on some platforms.
For θ, we needed to select the two offsets in the computation of the θ-effect. We decided to fix t_3 = 1, meaning that both affected columns are in the same lane. This allows computing the θ-effect with just one additional register (see Section 4.1).
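With t_3 = 1 and the final Xoodoo offsets (1, 5) and (1, 14), the θ-effect can be sketched on planes of four 32-bit lanes (a simplified model, ours, for illustration; a state with zero column parity is in the kernel of θ and is left unchanged):

```python
# Simplified model: a plane is four 32-bit lanes; A <<< (s, t) shifts the
# lanes by s positions in x and rotates each lane by t positions in z.
def rotl32(v, r):
    r %= 32
    return ((v << r) | (v >> (32 - r))) & 0xFFFFFFFF

def shift_plane(p, s, t):
    return [rotl32(p[(x - s) % 4], t) for x in range(4)]

def theta(planes):
    # Column parity, then the theta-effect with both x-offsets equal to 1.
    parity = [planes[0][x] ^ planes[1][x] ^ planes[2][x] for x in range(4)]
    effect = [a ^ b for a, b in zip(shift_plane(parity, 1, 5),
                                    shift_plane(parity, 1, 14))]
    return [[lane ^ effect[x] for x, lane in enumerate(plane)]
            for plane in planes]

in_kernel = [[0xDEADBEEF, 1, 2, 3], [0xDEADBEEF, 1, 2, 3], [0, 0, 0, 0]]
not_kernel = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

Because both shifts of the parity plane share the x-offset 1, an implementation only needs one extra register holding the shifted parity, as noted above.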
Loops are sequences of 32 odd columns such that the θ-effect cancels out; see Section 7.3.2 for a more formal definition. Given the number of odd columns, the weight of a trail containing a loop is well above our target, so this was not considered a problem. Yet, we required that t_1 − t_2 be odd, as otherwise loops with fewer than 32 odd columns would exist.
There remained the choice of the values of t_1, t_2, w_1 and e_0 in the light of differential and linear trails. For this part, we refer to Section 7 for the terminology. We proceeded in two steps. First, we looked for 3-round trail cores where the input to θ is in the kernel in both rounds. In this case, θ acts as the identity, and this process is thus independent of the values of t_1 and t_2. This allowed us to select good candidates for w_1 and e_0. Second, we extended the search to all 3-round trail cores to select good values for t_1 and t_2.
1. The Vortex is an inherent trail core with weight 36 that is in the kernel (see Table 9).
2. Outside the kernel, i.e., without any constraints on the input of θ, the best trails are the Single-orbital fan and the θ^2-glide. We looked for tuples (t_1, t_2, w_1), with w_1 in the set above, such that no other trails below weight 44 would exist.
After this selection process, we were left with a list of about 20 candidate tuples for (t_1, t_2, w_1). We finally selected a single tuple from this list on the basis of performance with respect to the avalanche criteria, as illustrated in Table 8.

Design rationale of Xoofff
In this section, we first discuss the rolling functions, then give a rationale for the number of rounds in the different permutations.
Both rolling functions operate as 12-stage feedback shift registers (FSRs), with the lanes mapping to the 32-bit stages. We can define an infinite sequence V of stages V_i with i ≥ 0. The initial state/mask consists of the first 12 stages, and the state/mask after j iterations of the rolling function consists of stages j to j + 11. The mapping of these 12 stages to the lanes of the state/mask at iteration t is as follows: A_{x,y} = V_i with i = t + y + 3x. The first 12 stages V_0 to V_11 are the initial value of the mask/state. All subsequent stages V_t are defined by a recursion of the type V_t ← F(V_{t−1}, …, V_{t−12}). Clearly, this is the operation of an FSR.

The rolling function roll Xc
The rolling function roll_Xc is a lightweight invertible linear FSR of maximum order operating on the entire 384-bit state, constructed as proposed by Granger et al. As a consequence, each non-zero mask value will be in a cycle of length 2^384 − 1. The zero mask value is a fixed point of our rolling function. We think the probability that a user key K maps to such a mask value is negligible.
In the stage representation, the recursion is such that V_{j+12} depends only on V_j and V_{j+1}. This allows the parallel computation of up to 11 subsequent iterations: given V_j to V_{j+11}, we can compute V_{j+12} to V_{j+22} in parallel. An important purpose of roll_Xc is to avoid affine spaces of large dimension. This aspect is discussed in more detail by Bertoni et al.
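The dependency structure can be illustrated with a hypothetical feedback function (the exact roll_Xc recursion is in the specification; only the fact that V_{j+12} depends on V_j and V_{j+1} alone matters here):

```python
def rotl32(v, r):
    return ((v << r) | (v >> (32 - r))) & 0xFFFFFFFF

# Hypothetical feedback, for illustration only: the new stage depends
# solely on the two oldest known stages.
def F(v0, v1):
    return rotl32(v0, 13) ^ v1

V = [(i * 0x01010101) & 0xFFFFFFFF for i in range(12)]   # stages V_0 .. V_11

seq = list(V)                                    # sequential: one stage at a time
for _ in range(11):
    seq.append(F(seq[-12], seq[-11]))

batch = [F(V[j], V[j + 1]) for j in range(11)]   # all 11 new stages at once
```

Since every one of the 11 new stages reads only the initially known stages V_0 to V_11, the batch computation matches the sequential one, which is what enables a parallel (e.g. SIMD) evaluation.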

The rolling function roll Xe
The rolling function roll_Xe is a lightweight invertible non-linear FSR. It is non-linear to resist the state-recovery attacks described in [CFG+18], which work if roll_e is linear, p_b is Keccak-p with 6 rounds and the adversary has a very long sequence of output blocks.
In the stage representation of roll_Xe, the recursion is such that V_{j+12} depends only on V_j, V_{j+1} and V_{j+2}. This allows the parallel computation of up to 10 subsequent iterations: given V_j to V_{j+11}, we can compute V_{j+12} to V_{j+21} in parallel.
The recursion contains a bitwise product of two stages for non-linearity, two linear terms for diffusion and a constant term to remove symmetry and avoid fixed points. The algebraic normal form (ANF) of roll^i_Xe(A) is non-linear. Informally, the criterion for roll_Xe is that the degree and the number of monomials in this ANF grow sufficiently fast with i to thwart attacks like the aforementioned ones.
The ANF of roll^{i+1}_Xe(A) can be computed iteratively from that of roll^i_Xe(A). At every iteration, the non-linear term in (3) introduces 32 fresh products, contributing to the increase of the algebraic degree. The linear terms and the constant contribute to the increase of the number of monomials. Due to the fact that the bits of V_j are independent of V_{j−1} to V_{j−10}, monomial growth is relatively slow. Still, as the aforementioned attacks require a very long sequence of output blocks, we believe this is sufficient.
Since roll_Xe is non-linear, its selection process is not as straightforward as that of roll_Xc. We applied the following tests to arrive at our choice:

Fixed points
First, we test for cycles of length one (i.e., fixed points), preferring candidates that have the fewest. Since a single iteration of the rolling function merely moves all of the lanes by one position and calculates a new lane, the necessary and sufficient condition for a state to be a fixed point is that all lanes comprising the state, including the newly calculated lane, are equal. For 32-bit (and smaller) lane sizes, it is feasible to enumerate all fixed points simply by testing every possible lane value.
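The fixed-point test thus reduces to checking, for every lane value v, whether the feedback of the all-v state equals v. A toy sketch with 8-bit lanes and a hypothetical feedback of the stated shape (one bitwise product, linear terms, a constant; the real roll_Xe uses 32-bit lanes):

```python
LANE_BITS = 8
MASK = (1 << LANE_BITS) - 1

def rotl(v, r):
    return ((v << r) | (v >> (LANE_BITS - r))) & MASK

# Hypothetical feedback function: product + linear terms + constant.
def feedback(v0, v1, v2):
    return ((v0 & v1) ^ rotl(v2, 3) ^ v0 ^ 7) & MASK

# A state is a fixed point iff all lanes, including the new one, are equal,
# so it suffices to test every possible lane value.
fixed_points = [v for v in range(1 << LANE_BITS) if feedback(v, v, v) == v]
```

For this particular toy candidate the condition v ^ rotl(v, 3) == 7 has no solution (the left-hand side always has even parity), so the list of fixed points is empty, which is the preferred outcome in the selection procedure.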

Short cycles
Then we test for short cycles induced by various symmetric states, preferring candidates for which no cycles could be found. The tested states are:
• alternating bit patterns 10101... and 01010...

Monomial count
Then we run simulations to gain an understanding of the number of monomials present in the algebraic normal form of the rolling function after n iterations, preferring candidates that contain the greatest number of monomials in the fewest number of iterations. We repeat this simulation for monomials of degree two, three, four and greater. We describe the degree-two monomial count test in Algorithm 3; the higher degrees are a generalization of this algorithm, except with a random sampling of monomial coordinates to make the execution time more practical.
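In the spirit of Algorithm 3, the presence of a degree-two monomial x_a·x_b in the ANF of a map can be tested by XORing the map over the four inputs {0, e_a, e_b, e_a + e_b}; a toy sketch on a hypothetical 8-bit map (ours, with a chi-like degree-2 ANF):

```python
from itertools import combinations

N = 8

def rotl8(x, r):
    return ((x << r) | (x >> (N - r))) & 0xFF

def toy_map(x):                          # hypothetical map: bit i is x_i ^ (x_{i-1} & x_{i-2})
    return x ^ (rotl8(x, 1) & rotl8(x, 2))

# The ANF coefficient of x_a * x_b in each output bit equals the XOR of the
# map over {0, e_a, e_b, e_a + e_b} (a second-order derivative at 0).
count = 0
for a, b in combinations(range(N), 2):
    coeff = (toy_map(0) ^ toy_map(1 << a) ^ toy_map(1 << b)
             ^ toy_map((1 << a) | (1 << b)))
    if coeff != 0:                       # some output bit contains x_a * x_b
        count += 1
```

For this toy map the ANF of output bit i contains exactly the monomial x_{i−1}·x_{i−2}, so the count equals the 8 adjacent index pairs mod 8.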

Addressing Farfalle and Kravatte attacks
In this subsection we discuss Xoofff in the light of the attack paths on Farfalle and Kravatte identified in [BDH+17]. Specifically, our choice for the number of rounds in p_b, p_c, p_d and p_e follows essentially the same rationale as for the number of rounds in Kravatte Achouffe, with 6 rounds for each of the four permutations. Moreover, our choice of the two rolling functions was guided by the experience with Kravatte. We will follow the outline of Sections 5 and 7.4 of [BDH+17].

Accumulator collisions
Clearly, as the accumulator has width 384, collisions can be found generically with an expected complexity of 2^192 Xoofff executions, the so-called birthday bound. Section 5.1 of [BDH+17] discusses three non-generic methods to achieve collisions.
The first two methods are finding sets of input blocks that contribute 0 to the accumulator, and input block swapping that leaves the accumulator unchanged. Both methods strongly depend on the properties of the mask rolling function. In [BDH+17, Section 5.1], an extensive analysis provides evidence that a maximum-order linear rolling function with a characteristic polynomial that is not too sparse has the properties that make these methods infeasible. This is the case for the mask rolling function roll_Xc. Note that in Kravatte the mask rolling function operates on only 320 bits of the 1600-bit mask, thereby leaving 1280 bits untouched, while roll_Xc is a maximum-order linear mapping operating on the full 384-bit state.
The third method is the exploitation of differentials in p_c, and the choice of Xoodoo[6] for p_c is actually motivated by this method. In this method one applies message pairs that differ in two blocks by the same difference ∆, and a collision occurs iff the differences cancel in the accumulator. This happens with probability
CP(∆) = Σ_γ DP(∆, γ)^2.
We call CP(∆) the collision probability of ∆. Applying n pairs with the same difference ∆ gives success probability nCP(∆). In general, if we apply a set of messages X = {X^(1), X^(2), …, X^(n)}, the success probability of having two messages collide is CP(X) = Σ_{i<j} CP(X^(i) + X^(j)). Finding a set X that maximizes its collision probability is in general a hard problem. Section 7.4.1 of [BDH+17] discusses how to find a set X and an estimate of CP(X) for Kravatte based on the properties of Keccak-p[6, n_r]. When working out a similar analysis for Xoofff, we found a method to increase the success probability for generating collisions. We explain our method in the next subsection.

An improved collision generating method
We describe the method for a 6-round permutation, but it can trivially be generalized to any number of rounds. As in [BDH+17, Section 7.4.1], we make the assumption that the differentials over 6 and 5 rounds with the highest DP are dominated by a single trail. We believe this is the case for both Xoodoo and Keccak-p thanks to weak alignment. Hence, in a pair of colliding messages, the differences ∆ follow the same trail in both active blocks to end up in the same difference γ. Denoting trails solely by the differences b_i at the input of χ (with b_0 = λ(∆)), the trail followed by the differences in the two active blocks is (b_0, b_1, b_2, b_3, b_4, b_5) in both cases. The value of γ is not important, as long as it is the same in both trails. There are 2^{w(b_5)} possible values for γ, hence applying M/4 two-block message pairs with input difference ∆ leads to the following success probability:
M 2^{−(2+2w(b_0)+2w(b_1)+2w(b_2)+2w(b_3)+2w(b_4)+w(b_5))}.
In [BDH+17, Section 7.4.1] this success probability is slightly increased by embedding the pairs in a larger structure exploiting multiple input differences that have a high CP. Our new method does something similar, but with a more spectacular increase of the success probability: it removes the contribution of b_0 to the weight altogether.
In Xoofff, we arrange the two-block inputs in an affine space V = U + q that we specify at the input of χ of the first round, with U a vector space and q an offset. U is the vector space that spans all the bits in the columns of the two blocks where a_1 = λ^{−1}(b_1) is active, with all bits in the other (passive) columns fixed to 0. The number of active columns in the two active blocks is w(a_1) and hence the dimension of V is 3w(a_1). Knowing the basis of U at the input of χ, it is straightforward to obtain the basis at the input of λ by simply applying λ^{−1} to the basis vectors.
Note that, when applied to Kravatte, we do the same but with rows taking the place of columns. In Kravatte there is no one-to-one correspondence between the (minimum reverse) weight of a_1 and the dimension of U, except that the latter is at most 5 times the minimum reverse weight of a_1.
Since V takes all the possible values in each active column and χ is bijective, the set T obtained by applying χ to the elements of V is an affine space with the same basis as U, and it can only differ from V in its offset. The point is now that T contains |T|/2 = |V|/2 pairs with difference a_1 in both blocks, which we denote as a_1‖a_1. We construct these pairs as follows. For each element u ∈ T, construct u* = u + (a_1‖a_1). As (a_1‖a_1) ∈ U, we have u* ∈ T. This gives a pair with difference a_1‖a_1. We can do this for any element u ∈ T, leading to a total of |T|/2 pairs. Note that the attacker cannot identify these pairs a priori due to the presence of the secret masks, but their mere presence in T contributes to the probability of a collision that is easily identifiable a posteriori.
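The pairing argument can be checked on a toy vector space (basis, offset and difference are ours): a coset T = U + q contains exactly |T|/2 unordered pairs with any fixed nonzero difference d ∈ U:

```python
# Toy check: the coset T = U + q pairs up into |T|/2 unordered pairs of
# any fixed nonzero difference d in U.
basis = [0b0001, 0b0110, 0b1010]        # independent vectors, so dim(U) = 3
U = {0}
for v in basis:
    U |= {u ^ v for u in U}             # span of the basis

q = 0b0100                              # arbitrary offset
T = {u ^ q for u in U}                  # the coset U + q

d = 0b0111                              # a nonzero element of U
pairs = {frozenset((u, u ^ d)) for u in T}
```

Each element u ∈ T is matched with u ^ d, which is again in T because d ∈ U; counting unordered pairs gives |T|/2.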
So, we apply |V| messages and have |V|/2 pairs at the input of χ of the second round with a difference b_1 in both active blocks. We have basically linearized χ of the first round by applying inputs forming a column-aligned affine space. Clearly, the first difference b_0 vanishes from the equation, as does γ, and our attack is effectively based on the trail core (b_1, b_2, b_3, b_4, b_5). When applying M/2 two-block inputs with M/2 a multiple of |V|, the success probability now becomes:
M 2^{−(2+2w(b_1)+2w(b_2)+2w(b_3)+2w(b_4)+w(b_5))}.
With the current trail weight bounds, found in Table 7, we would obtain an upper bound of Pr(collision) ≈ M 2^{−(2+54+56)} = M 2^{−112}, higher than the term in our security claim. However, our 4- and 5-round bounds for Xoodoo are not tight, as they are just a side effect of the absence of trails with weight below 54 and 56 respectively. Improving the bounds to weights of, say, 15 per round would be enough to decrease this term well below M 2^{−128}. Our current bounds on 3 and 6 rounds suggest this can be done by adapting our trail scanning software to 4- and 5-round trails, and we consider this future work.
We also consider the following more theoretical questions to be interesting research problems:
• Is it possible to linearize (some active columns of) the non-linear layer χ of the second round and, if so, what is the cost in terms of data complexity?
• How does the success probability behave for values of M that are not large enough to form an affine space that covers all active columns of a_1?
• Can one increase the success probability by constructing structures with multiple input differences that are horizontal shifts of each other?

Properties of mask derivation
The purpose of the mask derivation is to derive the 384-bit mask k from a variable-size key K. To counter attacks that swap input blocks i and i + δ, the adversary should have no effective way to predict the value of k + roll^δ_Xc(k) by guessing part of the mask k or key K. Regarding collision attacks as described in the previous section, it shall be hard for the adversary to reduce the required dimension of the vector space U by n after guessing fewer than n bits of the masks (or linear combinations thereof) of the two active blocks.
We have addressed these requirements by having roll Xc operate on the full state and having the non-linear layer χ after the diffusion mapping θ in the round function.

Attacks solely based on outputs
Clearly, two or more blocks of output give enough information to determine the value of the output mask k and the rolling state, independently of the compression phase or the input that was applied. Extracting it, however, should be computationally difficult. When performing an algebraic attack using two or more output blocks, the adversary must solve a system of equations with unknown variables spread over two full instances of Xoodoo[6]. The best reference on attacks on the expansion phase is Chaigneau et al. [CFG+18], which discusses attacks on a preliminary version of Kravatte that we will denote as Kravatte′. They used the following techniques:

Meet-in-the-middle
They express bits of the intermediate state after q rounds of p_e as polynomials of bits of the rolling state roll^j_e(y) on the one hand and as polynomials of the output mask k on the other, using the knowledge of an output block z_j. The number of monomials in y is limited by the algebraic degree of q rounds of p_e, and the number of monomials in k is limited by the algebraic degree of n − q inverse rounds of p_e. As the inverse of χ in Xoodoo has algebraic degree only 2, this technique would likely work better on Xoofff than on Kravatte′.
Linearization
They convert non-linear equations to a system of linear equations by considering the monomials as independent variables, so-called monomial variables.

Elimination of monomials by exploiting linear recurrence
If roll_e is linear, the bits of roll^j_e(y) satisfy a linear recurrence equation. This allows eliminating the monomial variables in roll^j_e(y) from the system of linear equations above, leaving only monomial variables in k.
If roll_Xe were linear, the attacks of [CFG+18] would probably work much better on Xoofff than on Kravatte′: in Xoofff the inverse of χ has lower degree than in Kravatte′ and the state is smaller. However, roll_Xe is not linear, and it has been designed with these attacks in mind.

Attacks using input-output pairs
In [BDH+17, Section 5.4] an attack is described that exploits the outputs of a large set of inputs that result in an affine space in the accumulator. In a way it skips the application of p_c and roll_c by restricting the value of the input blocks in each position to two values. If the dimension D of this affine space is equal to the degree d of p_e ∘ p_d, the sum of the outputs is independent of the input mask. If D > d this sum is zero, and if D = d − 1, each bit of the output sum is a linear function of the mask. The former two cases can lead to distinguishing attacks, the latter to key recovery. These attacks impose a lower bound on the degree of p_e ∘ p_d. In Xoofff these are 12 rounds of Xoodoo and hence the degree approaches the maximum value 383.
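The degree argument can be checked on a toy Boolean function: XORing a degree-d function over an affine space of dimension D > d gives zero, while for D = d it need not (the function and spaces are ours):

```python
# f has algebraic degree 2; XORing it over a 3-dimensional affine space
# (a third-order derivative) must give 0.
def f(x):
    x0, x1, x2, x3 = ((x >> i) & 1 for i in range(4))
    return (x0 & x1) ^ x2 ^ x3 ^ 1

basis = [0b0011, 0b0100, 0b1001]        # independent: dimension D = 3 > d = 2
offset = 0b0010

total = 0
for m in range(1 << len(basis)):        # enumerate the affine space
    x = offset
    for i, v in enumerate(basis):
        if (m >> i) & 1:
            x ^= v
    total ^= f(x)

# Over a space of dimension D = d = 2 the sum need not vanish:
total2 = f(0b0000) ^ f(0b0001) ^ f(0b0010) ^ f(0b0011)
```

The sum over the affine space is the D-th order derivative of f, which vanishes identically as soon as D exceeds the algebraic degree; this is exactly the property the attack exploits and the lower bound on the degree of p_e ∘ p_d prevents.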
In [CFG+18] this attack was improved by guessing output mask bits and peeling off 2 rounds at the end. Still, with 12 rounds of Xoodoo, even peeling off 4 rounds would require applying a set of 2^256 chosen inputs.
All remaining attacks on Xoofff require some distinguisher in an Even-Mansour-like structure, where the input and output masks serve as secret keys and the permutation consists of 18 rounds of Xoodoo. This is the realm of attacks based on classical distinguishers such as differential and linear cryptanalysis, truncated differentials, impossible differentials, boomerang and rectangle attacks, integral cryptanalysis, and of course invariant subspace and nonlinear invariant attacks. The challenge for the majority of the above attacks is that 18 rounds need to be bridged with some distinguisher. In the light of the fact that Xoodoo reaches SAC after 3.5 rounds, that Xoodoo has weak alignment and that for 4 rounds or more low-weight differential and linear trails are nowhere to be seen, finding such a distinguisher would be a major breakthrough. Moreover, note that an attacker only has access to the forward cipher, as Xoofff, or Farfalle in general, simply has no inverse.

Trail analysis
In this section, we prove lower bounds on the weight of differential and linear trails using a computer-aided approach. We base ourselves on the techniques presented by Mella et al. in [MDA17]. The results are in Section 5.2.4.

Unifying differential and linear trail search
Given the strong similarity between the study of differential and linear trails, we further unify the notation by defining:
• λ = λ, ρ early = ρ east , θ′ = θ and ρ late = ρ west for differential trails, and
• λ = λ ⊤ , ρ early = ρ −1 west , θ′ = θ ⊤ and ρ late = ρ −1 east for linear trails.
This is illustrated in Figure 6.

General strategy
Thanks to the similarity of our permutation with Keccak-p, we base our approach on that of Mella et al. in [MDA17]. We exhaustively scan the space of 3-round trail cores and use that to prove lower bounds on the weight of trail cores up to 6 rounds. The 3-round trails are in turn obtained by extending 2-round trails forward and backward. We set our parameters such that all trails up to weight T 3 = 50 are generated. Finally, we extend these trails to 6 rounds with the guarantee that any trail with weight ≤ 2T 3 + 2 = 102 will be found, if it exists. In a nutshell, the underlying ideas are the following. The weight of a 3-round trail is w(b 0 ) + w(b 1 ) + w(b 2 ). A naive way to generate all trails up to some weight 6n would be to generate all patterns b with weight below 2n and then extend forward and backward to 3-round trails. The number of such patterns however grows very fast with the weight: a pattern of weight 2n has n active columns, each taking one of 7 nonzero values, so there are (128 choose n) · 7^n patterns with weight 2n. E.g., for 2n = 10, this is already about 2^42. The number of patterns can be drastically reduced by considering symmetry. As both χ and λ are invariant with respect to translations parallel to the planes, the number of patterns reduces roughly by a factor 128. But it still grows very fast with the weight.
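The pattern count above can be checked numerically. The sketch below is only an illustration of the counting argument: it assumes that each active column contributes weight 2 and can take one of 2^3 − 1 = 7 nonzero values, with 4 × 32 = 128 column positions.

```python
from math import comb, log2

# Number of patterns of weight 2n: choose n active columns out of the
# 4 x 32 = 128 columns, each taking one of 2^3 - 1 = 7 nonzero values.
def pattern_count(n: int) -> int:
    return comb(128, n) * 7 ** n

# For 2n = 10 (n = 5), the count is indeed about 2^42.
print(log2(pattern_count(5)))
```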
The weight of a 3-round trail can be expressed alternatively as w(a 1 ) + w(λ(a 1 )) + w(b 2 ) or as w(a 1 ) + w(a 2 ) + w(λ(a 2 )). In the former case, two of the three weights are fully determined by a 1 and in the latter case by a 2 . In both cases, the sum of those two weights is of the form w(a) + w(λ(a)).
As demonstrated in [MDA17] for Keccak-p, we expect for Xoodoo that the number of trails with a given weight per round decreases with the number of rounds. In other words, we expect that the number of 2-round trails with weight below 4n is smaller than that of 1-round trails with weight below 2n. We will hence scan the space of patterns a taking into account the sum w(a) + w(λ(a)) and extend them forward and backward, thereby dramatically reducing the number of patterns to extend.
• The former case implies that 2w(a 1 ) + w(b 1 ) ≤ T 3 + δ. Such trails can be obtained by generating all 2-round trails a 1 −λ→ b 1 satisfying this inequality and then extending each of them forward by finding all states a 2 compatible with b 1 .
• The latter case implies that w(a 2 ) + 2w(b 2 ) < T 3 − δ, and since all weights are even, the condition is equivalent to w(a 2 ) + 2w(b 2 ) ≤ T 3 − δ − 2. Such trails can be obtained by generating all 2-round trails a 2 −λ→ b 2 satisfying this inequality and then extending each of them backward by finding all states b 1 compatible with a 2 .
Note that there are two differences between our general approach and that of Mella et al. First, we do not distinguish between kernel and non-kernel states when generating 2-round trail cores. Second, we allow the generation of 2-round trail cores to be unbalanced between those that are extended forward and backward by allowing δ ≠ 0. We noticed that in our implementation the backward extension takes more time than the forward extension, so by setting δ > 0 we reduce the number of 2-round trails to extend backward, at the cost of an increase of those to be extended forward, and we reach more balanced timings for the two cases.

Generation of 2-round trail cores
We now concentrate on the generation of 2-round trail cores of the form a −λ→ b. We do this by generating state values A at the input of θ, so that we can control the parity of A and exploit the properties of θ. From A, we can compute a = ρ −1 early (A) and b = ρ late (θ(A)). We do this while bounding the cost function αw(a) + βw(b) for α, β ∈ {1, 2} as explained above.

Properties of θ
As θ is a linear layer similar to Keccak's θ function, the following definitions are adapted from those in [BDPA11b]. Note that, as a linear function, the properties of θ are the same whether applied to an absolute state value or to a difference, so we just write "value".
The parity plane (or parity for short) P (A) of a value A is defined as the parity of the columns of A, namely P (A) = A 0 + A 1 + A 2 . A column is even (resp. odd) if its parity is 0 (resp. 1). When the parity of a value is zero (i.e., all its columns are even), we say it is in the column-parity kernel (or kernel for short).
The θ-effect of a value A is E(A) = P (A) ≪ (1, 5) + P (A) ≪ (1, 14). A column of coordinates (x, z) is affected iff the corresponding bit in E(A) is 1; otherwise, it is unaffected. Note that the θ-effect always has an even Hamming weight, so the number of affected columns is even. We define the θ-gap as the number of affected columns divided by two.
An odd column at coordinates (x, z) induces two affected columns at coordinates (x + 1, z + 5) and (x + 1, z + 14). Adding a second odd column at coordinates (x, z + 9) will induce an affected column at (x + 1, z + 23) and cancel the affected column at (x + 1, z + 14). This can be further extended to more odd columns at coordinates (x, z + 9n). Informally, chaining such odd columns, together with their two induced affected columns, makes up a run. When all the columns in a sheet are odd, then the θ-effect cancels and there are no induced affected columns. Informally, we call the set of such odd columns a loop.
To better formalize this, we can make a change of coordinates on the z-axis and use the t coordinate instead: z = 9t ⇔ t = 25z, with indices taken modulo 32 (note that 9 · 25 = 225 ≡ 1 mod 32).
An odd column at coordinates (x, t) induces two affected columns at coordinates (x + 1, t − 3) and (x + 1, t − 2). A run is thus a sequence of odd columns with the same x coordinate and consecutive t coordinates.

Decomposition of a state value around θ
Given a state value A, a bit at position (x, y, z) (or, equivalently, at (x, y, t)) is said to be active if its value is 1. Otherwise, it is passive. We decompose a state value at the input of θ as the sum of basis vectors, called elements, of three different kinds: the parity loops, the parity runs and the orbitals.
Definition 3. A parity loop (or loop for short) is a state value with 32 active bits in a sheet, each in a distinct column.
Definition 4. A parity run (or run for short) is a state value composed of 1 ≤ l ≤ 31 active bits in l different columns of a sheet x with consecutive t coordinates (t 0 , t 0 + 1, . . . , t 0 + l − 1), and of zero or two active bits in each of the (affected) columns (x + 1, t 0 − 3) and (x + 1, t 0 + l − 3).
Definition 5. An orbital is a state value with two active bits in the same column.
Loops and runs generate odd columns, while runs also have affected columns. From the decomposition of the value before θ, the value at the output of θ is easy to determine: a loop and an orbital are invariant through θ, and a run gets the bits in its two affected columns complemented through θ, while the remaining columns remain unchanged. Since θ is linear, the state after θ can be decomposed into the images of the elements through θ, with the same coefficients. So the decomposition into elements tells as much about the state before θ as about the state after θ.
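A minimal, self-contained sketch of how θ acts on the toy state model: as a column parity mixer, it adds the θ-effect to every plane, i.e., it complements all three bits of every affected column. Since an orbital leaves every column even, the state is in the kernel and θ acts as the identity on it. All names below are illustrative assumptions, not the reference code.

```python
# Toy representation: three planes of four 32-bit lanes each.
MASK = (1 << 32) - 1

def rotl32(v, r):
    r %= 32
    return ((v << r) | (v >> (32 - r))) & MASK

def shift_plane(p, t, v):
    # Cyclic shift moving bit (x, z) to (x + t, z + v).
    return [rotl32(p[(x - t) % 4], v) for x in range(4)]

def theta(A):
    # Column-parity mixer: add the theta-effect to every plane, i.e.
    # complement all three bits of every affected column.
    P = [A[0][x] ^ A[1][x] ^ A[2][x] for x in range(4)]
    E = [a ^ b for a, b in zip(shift_plane(P, 1, 5), shift_plane(P, 1, 14))]
    return [[A[y][x] ^ E[x] for x in range(4)] for y in range(3)]

# An orbital (two active bits in one column) leaves every column even,
# so the state is in the kernel and theta acts as the identity on it.
orbital = [[0] * 4 for _ in range(3)]
orbital[0][2] ^= 1 << 7   # active bit at (x=2, y=0, z=7)
orbital[2][2] ^= 1 << 7   # active bit at (x=2, y=2, z=7)
assert theta(orbital) == orbital
```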
Any state value can be expressed as the bitwise sum of elements and, as such, they generate the full state space. However, this decomposition is not unique. To make the decomposition unique, we rely on a number of conventions. These conventions also help bound the weight of the states obtained when combining these elements, by avoiding as much as possible that an active bit turns back to passive on either side of θ. The conventions are as follows. First, all odd columns stem from a unique loop or run. Then, an orbital can only be added to an empty column or to an unaffected odd column with a single active bit at y = 0. Finally, an affected odd column must follow the odd-0 convention:
Definition 6. The odd-0 convention says that an affected odd column must be represented as the sum of an unaffected odd column with a single active bit at y = 0 (of a loop or of a run) and of an affected even column (of a run) chosen accordingly. Figure 7 illustrates this.
Lemma 1. The value A at the input of θ can be uniquely decomposed as a sum of elements.
Proof. Let us define an algorithm that determines the last element of the state, then removes it by adding it back to A. The algorithm can then be applied recursively until all bits are passive.
The algorithm takes as input a value A at the input of θ, computes the parity P (A) and the θ-effect E(A) and then proceeds as follows.
1. First, it looks for the unaffected column with two or more active bits with the highest coordinates using [x, z] lexicographical ordering. If it exists, the algorithm outputs an orbital O by taking the two bits with the highest y coordinates in that column, adds O back to A and recursively starts again.

Breaking down to bit-level units
To construct differential and linear trails, we use the tree traversal technique as defined by Mella et al. in [MDA17, Section 3]. We represent a state value as a set of units, each consisting of a pattern of active bits. By defining an ordering of units, a set of units becomes a unit list. A unit list can be constructed progressively by appending a unit at the end. Finally, we need to define a cost function and a subtree bounding function to define the set of state values we are interested in and to be able to prune the search. Compared to the choice of more macroscopic units by Mella et al. for the bounds in Keccak [MDA17], we decided to define units that activate at most one bit before θ and at most one bit after θ. More specifically, a unit represents an active bit both before and after θ in unaffected columns, or a bit that is active either before or after θ in affected columns. This allows a finer-grained bounding function, where the decision to set a bit can each time lead to pruning the tree and thus potentially save processing time. We thus break down loops, runs and orbitals into a number of bits, called bit units.
A bit unit is first characterized by the type of the element it composes, which can be a loop, a run or an orbital. The ordering is primarily on the type, such that loops come first, then runs and finally orbitals: loop ≺ run ≺ orbital.
• A loop-typed bit unit represents an active bit in a loop and is characterized by its (x, y, z) coordinates. After the type, the ordering is lexicographic on [x, z, y]. Since a loop is bound to a given sheet x, all the bit units composing a loop are consecutive in the unit list.
• A run-typed bit unit represents an active bit in an odd column or in an affected column of a run. It is first characterized by the (x 0 , z 0 ) (or, equivalently, (x 0 , t 0 )) coordinates of the first odd column. We then distinguish the different bit units by their rank r and subrank s, and either by their y coordinate or their value v.
After the type, the ordering is lexicographic on [x 0 , z 0 , r, s, y, v] so that the bit units composing the same run are consecutive. The different bit units are as follows, in this order:
– For the affected column at (x 0 + 1, t 0 − 3), there are three bit units with rank r = 0 and subrank s ∈ {−3, −2, −1}. Since these units represent what happens in an affected column, they are also characterized by a value v ∈ {before, after} telling whether the bit is active before or after θ. The effective position of the active bit is (x 0 + 1, s + 3, t 0 − 3).
– For each odd column, there is a bit unit representing an active bit at (x 0 , y, t 0 + r) for rank r ∈ {0, 1, . . . , l − 1}, subrank s = 0 and a y coordinate. These bits are active both before and after θ.
• An orbital-typed bit unit represents an active bit in an orbital and is characterized by its (x, y, z) coordinates. After the type, the ordering is lexicographic on [x, z, y] so that the bit units composing an orbital are consecutive.
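The unit ordering can be sketched as plain tuple comparison, with the type as the most significant component. The encodings below (including representing v ∈ {before, after} as 0/1) are hypothetical choices of this sketch, not the actual implementation.

```python
# Type ranks implementing loop < run < orbital.
LOOP, RUN, ORBITAL = 0, 1, 2

# Each key is a tuple: the first component is the type, the remaining
# components are the lexicographic tie-breakers described in the text.
def loop_key(x, z, y):
    return (LOOP, x, z, y)

def run_key(x0, z0, r, s, y, v):
    # v encodes {before, after} as 0/1 for affected-column units.
    return (RUN, x0, z0, r, s, y, v)

def orbital_key(x, z, y):
    return (ORBITAL, x, z, y)

units = [orbital_key(0, 3, 1),
         run_key(1, 5, 0, -3, 0, 0),
         loop_key(2, 8, 1),
         run_key(1, 5, 1, 0, 2, 0)]
units.sort()
# Loop units come first, then the two units of the same run (ordered by
# rank), then orbital units.
assert [u[0] for u in units] == [LOOP, RUN, RUN, ORBITAL]
```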
As already said, the ordering of units is defined such that the units that compose an element are consecutive in the unit list. Additionally, the ordering is chosen in agreement with Lemma 1. That is, for a state value represented as a unit list, the last units correspond to the element returned by the algorithm in the proof of Lemma 1.
In a unit list, we distinguish between stable and unstable bits. A stable bit is an active bit that is guaranteed to stay active even if more bit units are added to the unit list. We characterize stable bits as follows.
• If the unit list does not contain orbital-typed bit units, an active bit in a loop or in a run is stable iff its y coordinate is not zero.
• If the unit list contains at least one orbital-typed bit unit, all active bits are stable.
The rationale is as follows. If an active bit in a loop or in a run is added to an unaffected column, it is active both before and after θ. But if the column it sits in becomes affected due to the addition of a run, the active bit is removed from either before or after θ. Similarly, a bit in an even affected column is active at a given side of θ, but a new active bit from a loop or a run can be added to it, effectively changing the side where the old bit is active. However, the odd-0 convention prevents an affected column from being added to an odd column where the active bit is at y > 0 and, vice versa, it prevents an active bit of a loop or a run with y > 0 from being added to an affected column. So once an active bit with y > 0 is added, it cannot be removed. Furthermore, an orbital is stable and comes after other types of elements, so once we start adding orbital-typed bits, all the bits are stable.
The subtree bounding function starts by counting the contribution of the stable bits after translation through ρ −1 early and ρ late . An active bit contributes 2 to the weight only if it lands in a column without any active bits yet. Then the subtree bounding function lower-bounds the contribution of the unstable bits. Notice that an unstable bit will yield an active bit on at least one side of θ. So the subtree bounding function counts the minimum of the contributions of an active bit before or after θ. It also marks the column where the unstable bit lands, on both sides of θ, so that the contribution in a given column cannot be counted more than once.

Extension to 3 and 6 rounds
For every 2-round trail core produced, we extend it forward or backward according to the general strategy outlined in Section 7.2 above.
The extension exploits the fact that the compatible states form an affine space, as shown in Corollary 1. Let us illustrate this for the forward extension, as the backward extension enjoys the same property and the description can be easily adapted to that case. For each active column at the input of χ, we form an affine space of the compatible states at the output of χ. We do this for all active columns, resulting in a description of an affine space at the output of χ: O + ⟨B 1 , B 2 , . . . , B w ⟩. We then transform the offset and basis vectors through λ to get a description of the affine space at the input of the next χ, namely O′ + ⟨B′ 1 , B′ 2 , . . . , B′ w ⟩ with O′ = λ(O) and B′ i = λ(B i ). This way, we can more easily compute (and bound) the weight that is added to the trail when extending it.
To help bound the weight of the trail when extending it, we first triangularize the basis vectors (B i ). The triangularization defines a nested sequence of sets (S i ) of bit positions such that i < j ⇒ S i ⊃ S j and B i has no active bits outside of S i . The search through the affine space also uses the tree search technique of Mella et al. [MDA17].
Here, the units represent the basis vectors and their ordering is according to their indexes, i.e., i < j ⇔ B i ≺ B j . This way, if the current unit list ends with B i , we can lower-bound the weight of all its descendants given that all bits outside of S i+1 cannot be changed.
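Triangularization over GF(2) can be sketched as a row-echelon reduction with basis vectors packed into integers; this is one way to obtain the nested structure described above, not necessarily the exact procedure of the actual code.

```python
# Reduce a list of GF(2) basis vectors (packed as integers) to echelon
# form: after reduction, the leading bit positions strictly decrease, so
# each later vector only has bits in a strictly smaller set of positions.
def triangularize(basis):
    basis = sorted(basis, reverse=True)
    out = []
    while basis:
        pivot = basis.pop(0)
        if pivot == 0:
            continue
        msb = pivot.bit_length() - 1
        # Eliminate the pivot's leading bit from all remaining vectors.
        basis = sorted(((b ^ pivot) if (b >> msb) & 1 else b
                        for b in basis), reverse=True)
        out.append(pivot)
    return out

B = triangularize([0b0110, 0b0011, 0b0101])
leads = [b.bit_length() - 1 for b in B]
# Leading bit positions are distinct and strictly decreasing.
assert leads == sorted(set(leads), reverse=True)
```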
Any 6-round trail with weight at most 2T 3 + 2 = 102 contains, in its first or last three rounds, a 3-round trail with weight at most 51, hence at most T 3 = 50 since all weights are even. So if such a trail exists, we must find it when extending all 3-round trails of weight at most T 3 both backward and forward.

Concrete experiments
We performed the generation of 2-round trail cores and their extension to 3 rounds together. The search for both differential and linear trails of 3 rounds took about 16 core-days on a desktop PC equipped with an Intel® Core™ i5-6500 CPU. We ran the computation in four parts in parallel, split between linear and differential trails and between the forward and backward extensions, with the longest part taking 5 days. The extension to 6 rounds took a few minutes.
The source code of the program to generate and extend trails is available as open source software [DHAK18b]. We wrote it in C++ and used parts of KeccakTools [BDP + 17], in particular the generic tree search code from Mella et al. [MDA17]. We improved the generic tree search code slightly, then instantiated it with the appropriate classes for the generation of 2-round trail cores. The resulting trail cores listed in Table 6 are also given, with differential and linear trails split in different files [DHAK18b]. Each set of trails is provided both in a format easily parsable by the software and in a visual text representation.

Inherent 3-round trails
We now describe trails that are inherent to the very structure of Xoodoo.
Definition 7. We say that a trail is inherent if a trail with the same structure exists for any variant of Xoodoo with a θ that is a column parity mixer, with a χ that allows one-active-bit patterns to propagate to the same one-active-bit pattern, and with ρ west and ρ east that consist of plane shifts.
For convenience, we restrict ourselves to the case where one odd column in θ makes 2 columns affected, but this could easily be generalized to another number of affected columns; the number of active bits in the sequel would then need to be adapted. Table 9 shows the three types of inherent trails with weight up to 38 and Figure 8 schematically depicts them.
In all three types of inherent trail, the nonlinear layer χ of the middle round acts as the identity. This means that we can replace ρ early • χ • ρ late with ρ early • ρ late , and we denote it by ρ both . Moreover, in the nominal case, all active bits in the three χ layers appear in different columns, hence each making one column active. So, we study trails (a 0 , a′ 0 , a 1 , a′ 1 ) that propagate through θ′ • ρ both • θ′ with a′ 0 = θ′(a 0 ), a 1 = ρ both (a′ 0 ) and a′ 1 = θ′(a 1 ). The trail weight equals twice the sum of the Hamming weights of a 0 , a′ 0 (or equivalently a 1 , as ρ both is a mere bit transposition) and a′ 1 . Note that the value of a 0 determines the full trail.
In the first inherent trail the input to θ′ in both rounds is in the kernel, making it vanish. In such trails, a 0 is called a vortex in [DA12]: it is a state that is in the kernel and remains so after applying ρ both . Clearly, as all patterns in the trail have the same Hamming weight, the trail weight is 6 times the Hamming weight of a 0 . Vortices are completely determined by ρ both and, more specifically, by its (two-dimensional) translation offsets for planes 1 and 2. Let us denote these by u 1 and u 2 . Due to the fact that ρ both treats the three planes as rigid structures, there exists a vortex consisting of 3 orbitals that ρ both maps to 3 orbitals. In particular, a 0 and a 1 have active bits in the following positions:
• Plane 0: in 0 and u 1 − u 2 before ρ both , then in 0 and u 1 − u 2 after ρ both
• Plane 1: in 0 and −u 2 before ρ both , then in u 1 and u 1 − u 2 after ρ both
• Plane 2: in −u 2 and u 1 − u 2 before ρ both , then in 0 and u 1 after ρ both
Note that the existence of such a 3-orbital vortex is independent of the dimension of the rigid structures (in our case planes) and so is the weight of the corresponding 3-round trails: 36.
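The counting behind the 3-orbital vortex can be checked mechanically. The sketch below abstracts positions to integers, since only differences of the offsets matter, and uses placeholder values for u 1 and u 2 ; the actual offsets follow from Xoodoo's rotation constants.

```python
# Each active bit is a pair (plane, position); positions are abstracted
# to integers since only differences of the plane offsets matter here.
def column_parities(bits):
    count = {}
    for _, pos in bits:
        count[pos] = count.get(pos, 0) + 1
    return count

def vortex_is_in_kernel(u1, u2):
    # The 3-orbital construction from the text, before rho_both...
    before = {(0, 0), (0, u1 - u2),
              (1, 0), (1, -u2),
              (2, -u2), (2, u1 - u2)}
    # ...and after: rho_both shifts plane 1 by u1 and plane 2 by u2
    # (plane 0 is fixed).
    after = {(y, pos + (0, u1, u2)[y]) for y, pos in before}
    # In the kernel, every occupied column holds exactly two active bits.
    return (all(c == 2 for c in column_parities(before).values()) and
            all(c == 2 for c in column_parities(after).values()))

# The construction works for generic distinct nonzero offsets.
assert vortex_is_in_kernel(3, 7)
assert vortex_is_in_kernel(11, 5)
```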
A single-orbital fan is a trail that has two active bits in the same column in a 0 . Since a 0 is in the kernel, a′ 0 = a 0 . These bits propagate through ρ both , where they land in different columns before θ′. They then expand to 7 bits each after θ′ in a′ 1 , and these 2 × 7 bits can land in 14 different columns. Since a 0 and a′ 0 contain 2 active bits each and a′ 1 contains 14 active bits, the total weight is thus 2 × (2 + 2 + 14) = 36.
A θ^2-glide is a trail with the following structure:
• In a 0 , the state is made of one active bit alone in a column and of two orbitals located in the affected columns induced by the first active bit.
• In a′ 0 , the two orbitals are replaced by a single active bit each, in addition to the first active bit that remains. We choose the positions of the bits in a 0 such that they all arrive in the same plane in a′ 0 .
• Since all the active bits are in the same plane, ρ both acts as a global shift of the whole state, which we can ignore in this discussion and consider that a 1 = a′ 0 .
• Finally, a′ 1 = θ′(a 1 ) = θ′(θ′(a 0 )) up to the global shift. The (θ′)^2 operation applied to the single active bit of a 0 induces 4 affected columns, two going from 1 to 2 active bits, and two going from 0 to 3 active bits, so with a total of 1 + 4 + 6 = 11 active bits.
Since a 0 contains 5 active bits, a′ 0 contains 3 active bits and a′ 1 contains 11 active bits, the total weight is thus 2 × (5 + 3 + 11) = 38.

Conclusions and perspectives
In this paper, we have introduced a novel permutation called Xoodoo and a deck function called Xoofff for concrete encryption and authentication applications. From a cryptographic point of view, we think that the chosen structure and set of operations lead to a design with nice properties. It is easier to analyze the differential and linear trail propagation than in Keccak, and we could make sure that the chosen rotation constants avoid low-weight trails that are not inherent to the structure. Our permutation has a dispersion layer between every mixing and nonlinear layer, whereas in Keccak-p the mixing layer follows the nonlinear layer immediately, causing suboptimal intra-slice effects. This may explain why the minimum weight over 3-round trails is 36 for Xoodoo, while it is 24 for differential trails in Keccak-p[400] [MDA17]. Furthermore, there are no known bounds on linear trails in Keccak-p[400].
From an implementation point of view, we expect that Xoodoo shares with Keccak-p highly efficient hardware implementations (and much smaller than Keccak-p[1600]) and efficient protections against side-channel attacks. Looking back at Table 5, the performance of Xoofff on Skylake(X) processors is excellent, competing with that of the AES in counter mode even though the latter benefits from hardware acceleration and the former uses only general-purpose instructions. On 32-bit processors, Xoofff is much faster than the AES in counter mode. As for the performance on ARM Cortex processors as reported in Table 4, we notice that the performance per round of Xoodoo on ARM Cortex-M3 is similar to that of Gimli, while Gimli is significantly faster on Cortex-M0. However, these values have to be taken with care, as each permutation does not need the same number of rounds when used in a given mode to yield a secure scheme. For instance, Xoodoo needs 6 rounds to guarantee the absence of trails with weight less than 104, while Gimli would need 16 rounds as suggested by [BKL + 17, Table 1].
For future work, defining Xoodoo variants with other dimensions can be useful. For constrained platforms, it could be interesting to have permutations with smaller widths. If we take the birthday bound on the permutation width as the limiting factor and we consider the minimum processor word length to be 32 bits, two particular widths come to mind: 288 = 32 × 3 × 3 for targeting 128-bit security and 192 = 32 × 2 × 3 for 80-bit security. This can be realized by taking planes consisting of 3 and 2 lanes respectively. On the other hand, one can also have "planes" consisting of only a single lane of length 96 and 64 respectively. Efficient implementations on 32-bit platforms can then be derived by rewriting the operations using bit interleaving [BDP + 12]. Towards larger permutations, a 768-bit variant with planes consisting of 4 lanes of 64 bits each would be interesting for sponge-based hashing.

Figure 2: Toy version of the Xoodoo state, with lanes reduced to 8 bits, and different parts of the state highlighted.
The three metrics have values in the range [0 . . . b] and for a random transformation F we have for any input difference ∆: D av (F, ∆) ≈ b, w av (F, ∆) ≈ b/2 and H av (F, ∆) ≈ b.

Algorithm 3: Definition of monomials(F, r, M) for monomials of degree two
Parameters: a b-bit non-linear rolling function F, number of rounds r and number of samples M
Let δ i = 0^i ||1||0^(b−i−1)
Initialize monomial count p to 0
for i = 0 to b − 1 do
  for j = i + 1 to b − 1 do
    for all state bit positions k do
      for M randomly generated states A do
        Compute

Figure 6: Conventions for differential (DC) and linear (LC) trails in the round function.

Table 1: Notational conventions
A y : Plane y of state A
A y ≪ (t, v) : Cyclic shift of A y moving bit in (x, z) to position (x + t, z + v)
A̅ y : Bitwise complement of plane A y
A y + A′ y : Bitwise sum (XOR) of planes A y and A′ y
A y • A′ y : Bitwise product (AND) of planes A y and A′ y

Algorithm 1: Definition of Xoodoo[n r ] with n r the number of rounds
Parameters: Number of rounds n r
for Round

Table 2: The round constants c i with −11 ≤ i ≤ 0, in hexadecimal notation (the least significant bit is at z = 0).

Table 3: Notational conventions for specification of the rolling functions
A y,x ≪ v : Shift of lane A y,x moving bit from x to x + v, setting bits x < v to 0
A y,x + A y′,x′ : Bitwise sum (XOR) of lanes A y,x and A y′,x′
A y,x • A y′,x′ : Bitwise product (AND) of lanes A y,x and A y′,x′

Table 4: Performance of a round of different permutations on Cortex-M3 and -M0

Table 5: Performance of Xoofff on different platforms. The AES figures on Cortex-M0 and -M3 come from

Table 6: The 3-round trail cores. The number of trail cores is up to translations along x and z, see Section 5.5.

Table 7: Weight of the best differential and linear trails (or lower bounds) as a function of the number of rounds.

Let us illustrate strong and weak alignment by applying the simplest difference activity patterns to two of the best-known cryptographic primitives: Rijndael [DR02] and Keccak-p [BDPA11b]. The simplest activity pattern is that with a single active box:
• In Rijndael, boxes are bytes and the linear layer is MixColumns • ShiftRows. ShiftRows is a transposition of boxes and therefore moves inputs with equal activity patterns to outputs with equal activity patterns. This allows us to ignore it and focus on MixColumns. If we apply an input to the MixColumns matrix with a single active byte, all 4 output bytes will be active. This is a consequence of the fact that the MixColumns matrix has branch number 5. So the Rijndael linear layer maps all 2^8 − 1 = 255 patterns with a single active byte at some position to the same 4-byte activity pattern, and this is the case for all 16 positions of the active byte at the input.
• In Keccak-p[400], boxes are 5-bit rows and the linear layer λ = π • ρ • θ maps the 31 single-row patterns to 31 different activity patterns, and this for all row positions [BDPA11a].
Clearly, for this simple case Rijndael has the strongest possible alignment and Keccak-p[400] the weakest possible.If we consider input activity patterns with multiple active boxes the distinction is less extreme but the trend is similar.

• Single-bit toggles such as 10000..., 01000..., etc. and complements 01111..., 10111..., etc.
• Single-bit toggles like the above, but starting with alternating lanes as 0^32 ||1^32 ||0^32 ||... and 1^32 ||0^32 ||1^32 ||...
Note that having an invertible rolling function is desirable since it is conducive to longer cycles, and it makes the short-cycle test more efficient since it obviates the need to keep track of all states that have been seen. Instead, only the initial state needs to be remembered; if the state arrives back at the initial state, then a cycle has been found. In our tests, we ran 10^6 iterations per pattern and candidate.
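The short-cycle test can be sketched as follows. The toy rolling function below is a placeholder standing in for the actual candidates; the point is that, for an invertible function, remembering only the initial state suffices.

```python
MASK = (1 << 32) - 1

def rotl32(v, r):
    r %= 32
    return ((v << r) | (v >> (32 - r))) & MASK

def toy_roll(state):
    # Invertible toy update on a pair of 32-bit lanes: swap the lanes and
    # rotate one of them (a placeholder, not an actual rolling function).
    a, b = state
    return (b, rotl32(a, 1))

def cycle_length(initial, step, max_iter=10**6):
    # Since step is a permutation, the orbit of `initial` is a pure cycle,
    # so we only need to watch for a return to the initial state.
    state = step(initial)
    for n in range(1, max_iter + 1):
        if state == initial:
            return n
        state = step(state)
    return None  # no cycle of length <= max_iter found

# For the toy map, the orbit of (1, 0) closes after 64 steps: applying the
# map twice rotates both lanes by 1, and the lanes are 32 bits wide.
assert cycle_length((1, 0), toy_roll) == 64
```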