Fast AES-Based Universal Hash Functions and MACs Featuring LeMac and PetitMac

Abstract. Ultra-fast AES round-based software cryptographic authentication/encryption primitives have recently seen important developments, fuelled by the authenticated encryption competition CAESAR and the prospect of future high-profile applications such as post-5G telecommunication technology security standards. In particular, Universal Hash Functions (UHF) are crucial primitives used as core components in many popular modes of operation for various use-cases, such as Message Authentication Codes (MACs), authenticated encryption, wide block ciphers, etc. In this paper, we extend and improve upon existing design approaches and present a general framework for the construction of UHFs, relying only on the AES round function and 128-bit word-wide XORs. This framework, drawing inspiration from the design of tweakable block ciphers, allows both strong security arguments and extremely high throughput. The security with regards to differential cryptanalysis is guaranteed thanks to an optimized MILP modelling strategy, while performances are pushed to their limits with a deep study of the details of AES-NI software implementations. In particular, our framework not only takes into account the number of AES-round calls per message block, but also the very important role of XOR operations and the overall scheduling of the computations. We instantiate our findings with two concrete UHF candidates, both requiring only 2 AES rounds per 128-bit message block, and each used to construct a MAC. The first, LeMac, is a large-state primitive that is the fastest MAC to date on modern Intel processors, reaching a performance of 0.068 c/B on Intel Ice Lake (an improvement of 60% in throughput compared to the state-of-the-art). The second MAC construction, PetitMac, provides an interesting memory/throughput tradeoff, allowing good performances on many platforms.


Introduction
Since its standardization, the AES block cipher [DR02] has deeply influenced the design of symmetric-key cryptographic primitives. This trend even accelerated after the introduction in modern CPUs of AES-NI [Gue08], a set of dedicated hardware-accelerated instructions implementing AES encryption and decryption. To benefit from this potential performance boost, designers continued studying operating modes allowing a direct and efficient reuse of the full AES [MV04, RBB03, KR21]. Yet, since AES-NI granularity lies at the round level, many new cryptographic designs actually use the AES round function as a building block, either for hash functions [BBG+08, IAC+08, BD08, GK08], for authenticated encryption schemes [WP14, Nik14, JNPS21, SLN+21, NFI24], for permutations [IIL+23, GM16, KLMR16, BLLS22], or for collision-resistant building blocks [JN16, Nik17a], among other applications. Today, hardware acceleration of the AES round function is widespread in most computing platforms, from high-end Intel/AMD CPUs to microcontrollers for mobile devices, and AES-NI has become even more helpful over processor generations, with reduced latency and increased throughput.
These technological advances allowed many symmetric-key primitives to eventually reach throughput performances under 1 c/B, but new use-cases arise. In particular, sixth-generation mobile communication systems (6G) plan to deliver transmissions with an impressive throughput range of 100 Gbps to 1 Tbps. This puts a lot of pressure on encryption/authentication performances, and AES-NI-based solutions seem very natural. This is the direction taken by the Authenticated Encryption (AE) algorithm Rocca [SLN+21, SLN+22] and its updated version Rocca-S [NFI24], currently the fastest AE on AES-NI platforms and under submission at IETF. Recently, the round function framework of Rocca has been further analysed in a work that presents optimal round function candidates (in terms of speed) within the framework [TSI23].
More generally, there have been significant efforts to design symmetric primitives relying on AES rounds (and the corresponding processor intrinsics), such as AEGIS [WP14], Tiaoxin [Nik14] or Aerion [BLLS22]. We note that most of these primitives have suboptimal throughput on some recent processors. For instance, the optimal candidate in the Rocca round function framework [TSI23] reaches a throughput of 0.104 cycles per byte on Tiger Lake, while the maximum theoretical throughput is 0.0625 cycles per byte for any candidate with the same number of AES rounds per 128-bit message block, as explained in Section 2.2.
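The 0.0625 c/B figure follows from simple counting. As a back-of-the-envelope sketch (assuming an idealized processor that can issue two AESENC per cycle and that AESENC issue is the only bottleneck, as discussed further in Section 2.2):

```python
def max_theoretical_cpb(rate, aes_per_cycle=2, block_bytes=16):
    """Best possible cycles/byte when AESENC issue is the only bottleneck:
    a rate-`rate` design needs `rate` AES rounds per 16-byte message block."""
    return rate / aes_per_cycle / block_bytes

# a rate-2 design: 2 AESENC per 16 bytes, 2 AESENC issued per cycle
print(max_theoretical_cpb(2))  # 0.0625
```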

Universal Hash Functions and Message Authentication Codes
In this paper, we study the construction of (almost) universal hash functions (UHF) based on AES rounds. UHFs take as input a secret key and a plaintext, and map them to a fixed-length digest. Formally, we consider them as a family of functions indexed by a key (choosing a key corresponds to choosing a member of the family), with two different security notions: almost-universal hash functions (ε-AU), and almost-XOR-universal hash functions (ε-AXU), defined as follows:

Definition 1 (ε-AU). A family of functions H_K : ℳ → 𝒟, indexed by a key K ∈ 𝒦, is ε-almost-universal if, for all M ≠ M′:

Pr_K[H_K(M) = H_K(M′)] ⩽ ε.

The ε-AU notion only requires collision resistance on average over a random key. The ε-AXU notion is a stronger variant covering an arbitrary output difference rather than just collisions: for all M ≠ M′ and all differences δ, Pr_K[H_K(M) + H_K(M′) = δ] ⩽ ε. In particular, if H is an ε-AXU family, it is also an ε-AU family.
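As a toy illustration of Definition 1 (with hypothetical parameters, unrelated to the AES-based constructions of this paper), the affine family H_k(m_1, m_2) = k·m_1 + m_2 over Z_p is (1/p)-almost-universal, which can be verified by brute force for a small prime:

```python
from itertools import combinations, product

p = 13  # a toy prime; the key k and both message words live in Z_p

def H(k, m):
    return (k * m[0] + m[1]) % p

# epsilon = max over distinct message pairs of Pr_k[collision]
messages = list(product(range(p), repeat=2))
eps = max(
    sum(1 for k in range(p) if H(k, a) == H(k, b)) / p
    for a, b in combinations(messages, 2)
)
print(eps == 1 / p)  # True: the family is (1/p)-AU
```

When the first message words differ, exactly one key produces a collision; when only the second words differ, no key does, hence ε = 1/p.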
UHF security notions are relatively weak, so they can be fulfilled by purely combinatorial constructions. However, they are quite versatile; in particular, a UHF can be turned into a Message Authentication Code (MAC) with a few extra components.

UHF-based MACs.
A MAC also processes a message and a secret key to generate a tag (a nonce-based MAC would also take as input a non-repeating nonce value), but in order to ensure authenticity/integrity of the message, a stronger security notion is expected. It should be hard for an attacker to construct a forgery on a MAC, i.e. to generate a valid message/tag combination without knowledge of the secret key.
More formally, for a key K, a nonce N and a message M, a nonce-based MAC F consists of a signing algorithm AUTH_K(N, M) that generates a tag T, and a verification algorithm VER_K(N, M, T) that returns "valid" if AUTH_K(N, M) = T and "invalid" otherwise. A (q_m, q_v, t)-adversary against the nonce-based MAC-security of F is an adversary A with access to oracles AUTH_K and VER_K, making at most q_m MAC queries to the AUTH_K oracle, at most q_v verification queries to the VER_K oracle, and running in time at most t. We say that A forges if any of its queries to VER_K returns "valid". The advantage of A against the nonce-based MAC security of F is defined as

Adv_F^MAC(A) := Pr[K ←$ 𝒦 : A^{AUTH_K, VER_K} forges],

where K ←$ 𝒦 denotes that K is chosen uniformly at random from the set 𝒦 of possible keys, and where A is not allowed to ask a verification query (N, M, T) to VER_K if a previous query (N, M) to AUTH_K returned T. Note that A is also not allowed to repeat nonces in AUTH_K queries, but can repeat them in VER_K queries.
MACs are classically built from block ciphers, but also from UHFs. Notably, GMAC [Dwo07] and Poly1305 [Ber05] are two popular MACs based on UHFs, which use polynomial evaluation in a finite field as a UHF. They use the Wegman-Carter-Shoup construction [CW77, Sho96] to build a nonce-based MAC from a UHF. However, it only provides 2^{n/2} security for an n-bit tag with unique nonces, and fails completely when nonces are repeated. The EWCDM construction [CS16] guarantees a significantly higher security, as it was proven [MN17] to provide essentially 2^n security with unique nonces and even 2^{n/2} when nonces are repeated.

Arithmetic UHFs. There has been a significant effort to design fast UHFs based on arithmetic operations: polynomial hashing (GHASH used in GCM [MV04], Poly1305 [Ber05]), NH in UMAC [BHK+99], etc. These constructions can be quite fast, and have a proven security level. For instance, GHASH only requires a single multiplication and a single addition in F_{2^128} for every 128-bit block of plaintext. This is particularly interesting in environments where instructions enabling fast arithmetic in the finite field of size 2^128 are provided, which is the case of most modern processors intended for use in servers and desktop computers. On the other hand, Poly1305 and UMAC rely on integer multiplication.

AES-based UHFs. Dedicated design strategies for block ciphers and hash functions are well known, but dedicated Universal Hash Functions (UHFs) have received less attention. In particular, processors that enable fast arithmetic for UHFs often support fast computation of a full AES round as well. Therefore, in this paper we focus on designing fast UHFs based on the AES round.
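To make the GHASH cost model concrete, here is a bit-level sketch of polynomial hashing in F_{2^128} (a simplified model using the GCM polynomial x^128 + x^7 + x^2 + x + 1, ignoring GHASH's bit-reflected encoding; real implementations would use carry-less multiply instructions such as PCLMULQDQ):

```python
POLY = (1 << 128) | 0x87  # x^128 + x^7 + x^2 + x + 1

def gf_mul(a, b):
    """Multiplication in F_{2^128}: carry-less product, then reduction."""
    res = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            res ^= a << i
    for i in range(res.bit_length() - 1, 127, -1):  # reduce degrees >= 128
        if (res >> i) & 1:
            res ^= POLY << (i - 128)
    return res

def poly_hash(key, blocks):
    """Polynomial-evaluation UHF: acc <- (acc + block) * key per block."""
    acc = 0
    for block in blocks:
        acc = gf_mul(acc ^ block, key)
    return acc
```

Each 128-bit block costs exactly one field multiplication and one XOR, matching the cost counted above.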
In this context, we note the interesting PC-MAC construction of Minematsu and Tsunoo [MT06]. Based on the analysis of the Maximum Expected Differential Probability (MEDP) of 4-round AES by Keliher and Sui [KS07], they consider 4-round AES as an ε-AXU family with ε ≈ 1.18 · 2^{−110} (under the hypothesis that the round keys are independent). Using this as a building block, they construct a MAC with 4 AES rounds per 128-bit block of plaintext, with provable security. Another interesting work is the EliMAC primitive proposed by Dobraunig et al. [DMN23], which uses 11 AES rounds per 128-bit message block (7 rounds can be precomputed in an offline phase, leaving 4 in the online phase).
We thus aim for fewer than 4 AES rounds per block of message, but the achieved security will be heuristic, instead of relying on a formal security proof.

Our contributions
In this paper, we present a family of UHFs that can reach better performances than state-of-the-art UHFs, by exploiting the extremely high throughput of AES-NI instructions, with flexible parameters that can be adapted to future computing platforms. We select candidates from our family with no differential trail of probability higher than 2^{−128}, suggesting that these UHFs are ε-AU with ε ≈ 2^{−128}. Unlike most ε-AU UHFs, the ε-AU property of our candidates is not proved but rather heuristic, as we only ensure that no high-probability differential trails exist. Our construction uses a novel design strategy compared to previous UHFs or collision-resistant round functions, with a (potentially large) internal state separated into two parts: one part updated with non-linear and linear components (the AES round function and 128-bit XORs in our case), influenced by another part updated with linear components only (this second part is not influenced by the first one, to reduce dependencies that would complicate both the instruction scheduling and the automated security analysis). Several fresh message blocks are inserted into the second part at each round so as to ensure a low rate. The general idea is that while a large state indeed complicates the attacker's task, updating it entirely can be costly, and partial updates might lead to better security/performance tradeoffs. Thus, this separation strategy offers more granularity; it draws inspiration from recent (tweakable) block cipher designs, where the tweakey schedule is linear, and from the Panama hash function [DC98].
Although this family is too big to be exhausted in practice, we propose a process to iterate over candidates of the family and select fast and secure ones. We implemented a tool that, given any candidate of this family, automatically computes the number of active S-boxes in the best differential trail, using Mixed Integer Linear Programming (MILP). The MILP model exactly discards linear incompatibilities, improving on previous heuristic approaches [CHP+17]. In addition, our tool can compile candidates of the family on the fly and benchmark them, in order to automatically measure their speed. To our knowledge, this is the first time that on-the-fly benchmarking is performed to filter candidates in an AES-based framework.
To showcase the relevance of our approach, we present an ε-AU UHF candidate whose round function reaches a speed of 0.067 cycles per byte on Intel Tiger Lake (i5-1135G7). In addition, we show the very first candidate with fewer than 2 AES rounds per 128-bit message block and 128-bit collision security, namely 1.75 AES rounds per 128-bit message block.
From ε-AU UHF candidates of our family, we present two new MAC candidates: LeMac and PetitMac. The former is as of today the fastest MAC on modern desktop/server processors, reaching a speed of 0.068 cycles per byte on Intel Ice Lake (Xeon Gold 5320) for 256 kB messages, vs. 0.113 cycles per byte for a MAC based on the round function of [JN16], the fastest state-of-the-art MAC according to our benchmarks. PetitMac is slower, but has a smaller state, and is thus more suitable for lightweight applications. Even though 6G communications (as targeted by the Rocca-S [NFI24] AEAD) would mandate 256-bit keys for post-quantum considerations, both our MAC candidates have 128-bit keys, nonces and tags, as we believe this is largely sufficient for most applications, especially for MACs, which do not suffer from "harvest now, decrypt later" attack strategies. We claim that LeMac and PetitMac both provide 128-bit security in the nonce-respecting setting.
Outline. We start with a detailed description of our design goals, and their interactions with the state-of-the-art, in Section 2. In light of this discussion, we decided to focus our efforts on a specific family of UHFs, which we present in Section 3. Since this family is very large, we reduce the search space using for instance equivalence classes (see Section 4), and we automate the security analysis using MILP-based methods described in Section 5. The results of our search are presented in Section 6. Finally, we use these results to build concrete primitives in Section 7, while Section 8 concludes the paper.

Design Goals and First Observations
While several AES-based constructions exist, we identified places where there remains substantial room for improvement. Below, we describe the goals that our family of UHFs is intended to fulfil; the family itself will be described in Section 3.

AES-based round functions
As already mentioned in the introduction, many designs rest upon the AES round function and the 128-bit XOR to be both secure and efficient, thanks to the AES-NI instruction set in modern processors. Among them, the CAESAR candidates Tiaoxin [Nik14] and AEGIS [WP14] (the latter was selected in the final high-performance portfolio) are competitive AEAD schemes. In terms of throughput, they are outperformed by the building blocks designed by Jean & Nikolić [JN16] and later Nikolić [Nik17a]. More recently, the AEAD proposals Rocca [SLN+21, SLN+22] and Rocca-S [NFI24] target 6G requirements in terms of speed and security. All of these constructions aim at minimizing the so-called rate [JN16], that is, the number of AES rounds per 128-bit message block. Rocca (during Additional Data processing) and one of the schemes of Jean & Nikolić achieve a rate of 2 for 128-bit security. We will adopt a similar strategy and minimize the rate of the round function.
Goal 1. Our ε-AU families should use AES rounds as internal components for high software performance, and preferably at the lowest rate.

Instruction scheduling
Modern processors are superscalar, with out-of-order execution. They can execute several instructions simultaneously, and schedule instructions as soon as the input operands are ready. Moreover, the execution units are pipelined: some instructions take several cycles to process, and an execution unit can start processing a new instruction at every clock cycle, with the output being ready some cycles later [Int24].
There are two main metrics to measure the performance of an instruction i:

Latency: the number of clock cycles between the beginning of i and the return of its result. We denote ℓ(i) the latency of i.

Throughput: the number of instructions that can be processed in a given amount of time. We usually consider the reciprocal throughput, measured in cycles. We denote τ(i) the throughput of i.
Processors are composed of several execution units, accessed by ports denoted p_1, ..., p_k. Each port p_j accepts a certain set of instructions I_j. At cycle t, each execution unit p_j can process an instruction i ∈ I_j, and returns its result ℓ(i) cycles later. At cycle t + 1, the execution unit p_j might process another instruction i′ ∈ I_j (with potentially i′ = i), even though the instruction i of cycle t has not returned its result yet. The throughput of an instruction i corresponds to the number of ports which can process i.
For the AES-based ciphers mentioned in Section 2.1, two types of instructions are extensively used: AES round instructions (e.g. AESENC) and 128-bit XORs. Note that on recent processors one can leverage 512-bit instructions (like VAESENC, which processes four AES rounds in parallel), but in this work we focus on 128-bit instructions, which are now widely available. This setting constitutes a very fair comparison with previous schemes, and we can expect that most AES-based designs will greatly benefit from more AES rounds in parallel.
Each instruction has its own throughput and latency on modern processors [Fog22], but we cannot exploit the full throughput of both types of instructions at the same time, because they share ports, as illustrated by Table 1.

On the number of XOR instructions in AES-based constructions. In the case of Intel processors (Ice Lake and higher), the reciprocal throughput of AESENC is 0.5, and that of XOR is 0.33; AESENC requires ports 0 or 1, and XOR requires ports 0, 1, or 5. In order to fully exploit the throughput of the AESENC instruction, we need to feed ports 0 and 1 only with AESENC instructions at each clock cycle; they then become unavailable for XOR instructions, which can only be assigned to port 5. Consequently, if q AESENC instructions are executed at full throughput, there should be at most q/2 XOR instructions. Similar observations apply to recent processors, and unfortunately, minimizing the number of XORs in AES-based constructions is not a systematic approach¹. For example, Jean & Nikolić [JN16] present rate-2 candidates with 6 AES rounds and 9 128-bit XORs per round, and Rocca's rate-2 round function uses 4 128-bit XORs and 4 AES round instructions. As a consequence, regardless of implementation tricks, full throughput on current modern Intel processors will always remain out of reach for these algorithms.
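The port-sharing argument can be turned into a small counting model. The sketch below (an idealized lower-bound model assuming Ice-Lake-like constraints: fully pipelined units, AESENC on ports 0/1, XOR on ports 0/1/5, one instruction per port per cycle, and no dependencies) computes the minimum number of cycles needed to issue a round's instructions:

```python
def min_cycles(n_aes, n_xor):
    """Cycles to issue n_aes AESENC (ports 0/1) and n_xor XORs (ports 0/1/5),
    ignoring dependencies: AESENC gets priority on ports 0/1, XORs use
    port 5 plus any slots left over on ports 0/1."""
    c = 0
    while True:
        c += 1
        slots01 = 2 * c                    # total slots on ports 0 and 1
        spare01 = max(0, slots01 - n_aes)  # 0/1 slots not used by AESENC
        if n_aes <= slots01 and n_xor <= c + spare01:  # c slots on port 5
            return c

# at full AESENC throughput, q AESENC leave room for only q/2 XORs:
print(min_cycles(4, 2))  # 2 cycles: full AESENC throughput is reached
print(min_cycles(4, 4))  # 3 cycles: the extra XORs break full throughput
```

With 6 AESENC and 9 XORs per round, as in the rate-2 candidates of [JN16], this model gives min_cycles(6, 9) == 5, while 6 AESENC alone would need only 3 cycles.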
Dependency chains. In addition to the throughput analysis, dependency chains affect the performance of AES-based constructions (see Section 3.2.2 of [Int24]). As an example, let us denote R^t_0 the first wire of an AES-based construction at round t. If R^{t+1}_0 depends on R^t_0, the latency to compute R^{t+1}_0 from R^t_0 is the sum of the latencies of each involved instruction. Then, the latency of the round function is at least the latency of computing R^{t+1}_0 from R^t_0. In the decryption mode of Rocca [SLN+21], we found a dependency cycle with 6 cycles of latency, which even increases to 8 cycles in practice on Ice Lake processors. This is explained in Appendix A, and we believe that it leads to a maximal theoretical speed of 0.25 cycles per byte of message on recent Intel processors.
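Dependency-chain bounds are easy to compute. A minimal sketch, under assumptions not stated in the text: Ice-Lake-like latencies of 3 cycles for AESENC and 1 cycle for XOR, and 32 message bytes ingested per Rocca round (two 128-bit blocks):

```python
LATENCY = {"aesenc": 3, "xor": 1}  # assumed per-instruction latencies (cycles)

def chain_latency(ops):
    """A serial dependency chain cannot finish before the sum of its latencies."""
    return sum(LATENCY[op] for op in ops)

def cpb_lower_bound(cycle_latency, bytes_per_round):
    """A dependency loop spanning one round caps the speed: each round must
    wait `cycle_latency` cycles for the previous one around the loop."""
    return cycle_latency / bytes_per_round

print(chain_latency(["aesenc", "aesenc"]))  # 6: e.g. two chained AES rounds
print(cpb_lower_bound(8, 32))               # 0.25 c/B with the measured 8 cycles
```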
Apart from these two points, there are a lot of processor subtleties, which are difficult to exhaustively consider.As a general guideline, we aim at avoiding any pipelining issue and at being as efficient as possible on modern processors.

Goal 2. The instruction scheduling in modern processors should be favorable.
Goal 2 is very reasonable, but is hard to guarantee with pen-and-paper analysis because of the always-evolving, complex, and well-optimized scheduling of modern processors.
In fact, one way to directly evaluate the performance with state-of-the-art instruction scheduling algorithms is to compile and benchmark candidates on the fly. This strategy exploits advanced techniques from compilers (e.g. modern gcc) or processors, and remains future-proof, since it can easily be adapted to future processors.

Goal 3. Our tool should automate the benchmarking of candidates. The automatic benchmarking should be adaptable to all processors.

Security
As explained above, we design a UHF family targeting ε-AU security, and are therefore only interested in collision resistance. UHFs with the stronger ε-AXU security notion can then be built from ε-AU UHFs, but that is out of the scope of our ε-AU UHF family.
In order to facilitate the security analysis of our candidates, we consider that the output of one of our ε-AU UHFs is not a single word, but rather the entire state, composed of multiple 128-bit words. In addition, we consider that the inner state is fully unknown, key-dependent, and of full entropy, so that values of the inner state cannot be exploited to build collisions. Thus, in order to ensure collision resistance, it is sufficient in our case to prevent the existence of high-probability differentials of the shape H(M) + H(M + δ) = 0. We then rely on the following assumption to investigate these.

Assumption 1. The highest probability of a differential trail is a good indication of the highest probability of a differential.
Thanks to Assumption 1, estimating the security level can be done by modeling the differential propagation with a MILP model and this is now a widespread practice [JN16, SLN + 21].
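As a toy illustration of what such a model computes (a word-level simplification on a hypothetical 2-wire instance, not one of our candidates, and exhaustive search instead of MILP): each wire carries an active/inactive flag, a linear-layer output fed by two active wires may become inactive through cancellation, and we minimize the number of active AES calls over all trails that start and end with a zero inner-state difference.

```python
from itertools import product

# Hypothetical toy instance: 2 AES wires, linear layer A = [[1,1],[1,0]]
# over F2 acting on words, one message-difference word injected into wire 0.
A = [[1, 1], [1, 0]]

def linear_options(s):
    """Word-level truncated propagation through A: an output fed by exactly
    one active wire is forced active; fed by two, it may cancel (free)."""
    per_wire = []
    for row in A:
        active_inputs = sum(a & b for a, b in zip(row, s))
        per_wire.append((0, 1) if active_inputs >= 2 else (min(active_inputs, 1),))
    return list(product(*per_wire))

def min_active_collision_trail(rounds):
    """Minimum active AES calls over all `rounds`-round collision trails."""
    dp = {((0, 0), False): 0}  # (state activity, nonzero message used) -> cost
    for _ in range(rounds):
        ndp = {}
        for (s, used), cost in dp.items():
            for x in linear_options(s):      # X = A*S, before the AES layer
                c = cost + sum(x)            # each active AES input is counted
                for m in (0, 1):             # message-difference word activity
                    outs = [(0, x[1]), (1, x[1])] if x[0] & m else [(x[0] ^ m, x[1])]
                    for s2 in outs:          # AES keeps word-level activity
                        key = (s2, used or m == 1)
                        if ndp.get(key, 10**9) > c:
                            ndp[key] = c
        dp = ndp
    return dp.get(((0, 0), True))  # zero output difference = collision

print(min_active_collision_trail(3))  # 3 active AES calls at minimum
```

The MILP models of Section 5 refine this idea to the byte level (counting active S-boxes) and must track exactly such cancellation cases to discard linearly incompatible trails.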

Goal 4. A lower bound on the number of active S-boxes in the differential trails of a candidate should be easily computed with computer-aided tools, such as MILP solvers.
Section 5 will be fully dedicated to our MILP modeling and its optimizations.

A roadmap to achieve these goals
All these guiding principles lead us toward the family of UHFs that we describe in the next section. Our goals are in line with previous works [JN16, SLN+21]: we want a primitive that favors parallel AES calls to optimize scheduling. However, properly taking this into account means carefully considering the number of 128-bit XORs, and in fact minimizing it: the authors of Rocca already observed the negative impact that AES and XOR used "in a cascade way" could have. As a consequence, we limit ourselves to sparse linear layers.
To compensate for the slower diffusion implied by the sparse linear layer, and to broaden our search space, we consider more sophisticated injection techniques inspired by the design of (tweak-)key schedules. This could increase the cost of each round (in particular in terms of memory), but it indeed enables the safe use of very simple round functions. This overall structure is similar to that of Panama [DC98], a hash function attacked in [RVPV02]. It was based on a large "buffer" and a smaller "inner state", the former being linearly updated using message blocks, and the latter being non-linearly updated using data extracted from the buffer. The separation between buffer and inner state was quickly set aside, as several algorithms adopted a similar structure that nevertheless involved a datapath from the inner state to the buffer, e.g. RadioGatún [BDPA06] and LUX [NBK08].
Our whole construction is presented in the next section.

A Specific Family of Universal Hash Functions
In light of the discussion presented in the previous section, we have settled on a specific family of UHFs that is large enough to contain algorithms that are both fast and secure, yet small enough that we can practically explore vast subsets of it.
The idea is to separate the (potentially large) state into two subparts with different roles: an inner part updated with AES rounds and a linear layer, and an outer part updated only with a linear layer and new message blocks. Each round, words of the outer state are XORed into the inner state (but not the other way). The aim is that each message block is XORed several times into the inner state, so that short differential trails leading to collisions do not exist. This construction is similar to many sponge-like constructions, but in our case the linear outer state allows us to save many AES round calls (while sponge-like designs apply the same function to the full state), and is easy to model in MILP. This also resembles a large tweakable block cipher with a large tweak and a linear tweakey schedule. We chose this structure as it has the potential to offer both high throughput (thanks to its reliance on AES rounds, the expensive operations being restricted to one subpart of the state, and the potentially low rate) and high security (thanks to the sparsity of the round function, which makes it easier to use automated tools to check for differential attacks).

Notation
Vectors of 128 bits are called words, blocks, registers or wires depending on the context. Additions are performed bitwise in F_2; they correspond to XORs. The cardinality of a set S is |S|. The number of non-zero elements of a vector v ∈ F_2^n is denoted Supp(v). For any ℓ ∈ N, we denote ⟦0, ℓ⟧ := {0, · · · , ℓ}. Given a field K, we denote ℳ_{a×b}(K) the set of matrices over K of size a × b. We denote ℳ_a(K) (resp. GL_a(K)) the set of matrices over K of size a × a (resp. of invertible matrices over K of size a × a). A diagonal block matrix whose diagonal is made of matrices D_0, · · · , D_ℓ is denoted Diag(D_0, · · · , D_ℓ). In a block matrix definition, ⋆ denotes an arbitrary block.

Overall structure
The UHF family we consider is described in Figure 1. Each wire on the figure represents a 128-bit value. The inner state is on the left-hand side of Figure 1, and the outer linear message-schedule with memory on the right. Overall, our approach can be seen as a standard Substitution Permutation Network (SPN): the inner state (denoted R, with intermediate values S and B within a round) is iteratively updated through a round function built by composing a linear layer with a non-linear one. Between each round, the linear message-schedule ingests several blocks of the input message, and produces an injected value X which is added to B to yield the next state R. The memory registers of the linear message-schedule, which we denote W, keep linear information on previous input message blocks.
Parameters. From now on, by size we always mean the number of 128-bit blocks. Thus, each member of the family is parameterized by the sizes n, m, r of the inner state (R, S, B), of the input message blocks M^t ingested per round, and of the memory W; these sizes can be chosen freely. Note that n also corresponds to the size of the injected value X. Once these sizes are fixed, we define a specific instance by choosing the vector φ and the matrices A, T.
The Boolean vector φ := (φ_0, · · · , φ_{n−1}), of size n, indicates whether a state wire goes through an AES round or not. For any i such that φ_i = 1, the i-th wire of the state is called an AES wire.
The n × n invertible sparse matrix A ∈ ℳ_n(F_{2^128}) is used as the linear layer. By design, we restrict the coefficients of A to {0, 1}, so that A can be viewed as a matrix of ℳ_n(F_2). In particular, the output of the linear layer is only composed of copies and XORs of the 128-bit input words.

[Figure 1: overall structure of the UHF family, showing the inner state, the message words ingested per round, the memory registers, and the injected value X.]
Finally, T is the (r + n) × (r + m) message-schedule transition matrix. T indicates how to compute the n-word injected value X^t and how to update the memory W (of size r). Both are linearly computed using the current memory W^t and the m fresh message words M^t. Similarly to A, we restrict by design the coefficients of T to {0, 1}: T ∈ ℳ_{(r+n)×(r+m)}(F_2).

Notation 1 (Time stamp, coordinates and sequences). As the values of the blocks vary over time, we use superscripts to indicate the clock (with t = 0 as initial clock), while subscripts are reserved for coordinates: for instance, R^t_i stands for the i-th coordinate of R^t, that is, R at time t. We keep plain characters for generic purposes, e.g. the memory W, and use calligraphic letters to denote sequences throughout time, e.g. 𝒲 := (W^t)_{t∈N}. Finally, for any finite subsets T ⊂ N, I ⊂ ⟦0, n − 1⟧ and t ∈ N, we denote sub-sequences and sub-vectors as W^T := (W^t)_{t∈T} and W^t_I := (W^t_i)_{i∈I}.

Round function and message-schedule
Round Function. It is applied to the inner state, and is composed of three layers:

• a linear layer S^t := A(R^t);

• an AES round layer, where one AES round is applied to each AES wire of S^t (i.e. each wire i such that φ_i = 1), yielding B^t;

• an injected-value addition layer, where the injected value X^t of round t, generated by the message-schedule, is added to the state: R^{t+1} := B^t + X^t.
In the AES round layer, the AddKey step is omitted. Thus, by using the AddKey step of the AES-NI instruction, the addition of the injected-value word is free on AES wires.
Message-Schedule. The linear message-schedule has a memory W of size r. Each register contains a linear combination of previous message words. At round t, the m new message words are ingested, the n-long injected value X^t is output, and the memory W^t is updated, in a single transition step:

(W^{t+1} ; X^t) := T · (W^t ; M^t).    (1)

As highlighted by the previous equation, it is convenient to decompose T as a block matrix.
Notation 2 (T decomposition). In the following, given a transition matrix T, we will intensively use the following decomposition and notation:

T = [ T_00 T_01 ; T_10 T_11 ],  so that  W^{t+1} = T_00 W^t + T_01 M^t  and  X^t = T_10 W^t + T_11 M^t,    (2)

where T_00 is of size r × r, T_01 of size r × m, T_10 of size n × r and T_11 of size n × m. Taking advantage of Equations (1) and (2), we can easily express the injected values as (recursive) linear combinations of the input message blocks:

X^t = T_11 M^t + T_10 W^t,  with  W^t = T_00^t W^0 + Σ_{t′<t} T_00^{t−1−t′} T_01 M^{t′}.

The injected-value sequence 𝒳 can therefore be viewed as a family of linear combinations, or equivalently as a matrix where each column represents one of the message blocks that can appear in these combinations. We will often prefer the latter point of view. An injected-value sequence 𝒳 can be obtained from infinitely many matrices T: for instance, infinitely many unused memory registers could be added. It is thus necessary to limit this redundancy as much as possible while exploring the transition matrices T. In the next section, we start by finding a "normal form" for the transition matrices we will study. We then limit our search by defining an equivalence relation between injected-value sequences, and finally present and justify our search space.
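To illustrate the recursion above, here is a sketch of the message-schedule transition on a hypothetical toy instance (r = 2, m = 1, n = 2, with an arbitrarily chosen T; none of these values come from our concrete candidates):

```python
# block decomposition of a toy transition matrix T over F2 (Notation 2)
T00 = [[0, 1], [1, 1]]   # r x r: memory update from memory
T01 = [[1], [0]]         # r x m: memory update from fresh messages
T10 = [[1, 0], [0, 1]]   # n x r: injected value from memory
T11 = [[1], [1]]         # n x m: injected value from fresh messages

def mat_apply(M, v):
    """Multiply a 0/1 matrix by a vector of 128-bit words over F2:
    each output word is the XOR of the selected input words."""
    out = []
    for row in M:
        w = 0
        for coef, word in zip(row, v):
            if coef:
                w ^= word
        out.append(w)
    return out

def xor_vec(a, b):
    return [x ^ y for x, y in zip(a, b)]

def injected_values(messages):
    """Run the schedule: X^t = T10 W^t + T11 M^t, W^{t+1} = T00 W^t + T01 M^t."""
    W = [0, 0]  # zero initial memory (e.g. difference behaviour)
    xs = []
    for Mt in messages:
        xs.append(xor_vec(mat_apply(T10, W), mat_apply(T11, Mt)))
        W = xor_vec(mat_apply(T00, W), mat_apply(T01, Mt))
    return xs

# each injected value is a linear combination of earlier message blocks:
print(injected_values([[1], [2]]))  # [[1, 1], [3, 2]]
```

Since the whole schedule is F2-linear, XORing two message sequences XORs the corresponding injected-value sequences, which is what makes the differential behaviour easy to model.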

A Searchable Space of UHFs

A normal form for transition matrices
The first notable point about transition matrices is that, at clock t, only the space spanned by the memory registers (and not the registers themselves) matters. Indeed, the same information can be recovered from two different spanning families, only in different representation systems. This is illustrated by the following proposition, which is proved in Appendix B.
Proposition 1 (Change of basis for memory registers). Let T be a transition matrix. Let P ∈ GL_r(F_2). Let us define T_P ∈ ℳ_{(r+n)×(r+m)}(F_2) such that:

T_P := [ P T_00 P^{−1}  P T_01 ; T_10 P^{−1}  T_11 ].

Then T_P produces the same sequence 𝒳 as the original matrix T.

Sketch of Proof.
As P is invertible, the rows of (T_00 | T_01) and of (P T_00 | P T_01) = P · (T_00 | T_01) span the same space. The multiplication by P^{−1} on the right of P T_00 and T_10 only adapts the linear operators to the newly chosen spanning family.
For fixed sizes n, m, r, Proposition 1 in particular states that it is sufficient to explore a single representative per similarity class for the top-left block T_00. We recall the following classical results about similarity, which can for instance be found in [DF04, Thm. 14, p. 476] or [Gan90, p. 192].

Proposition 2 (Frobenius Normal Form).
1. Similarity is an equivalence relation over ℳ_r(F). We denote it ∼.
2. For every matrix M ∈ ℳ_r(F), there exists a unique family (P_0, · · · , P_{ℓ−1}) of polynomials such that each P_i divides P_{i+1}, and M ∼ Diag(C(P_0), · · · , C(P_{ℓ−1})), where C(P) denotes the companion matrix of the polynomial P. This representative is called the Frobenius Normal Form or Rational Canonical Form.
According to Proposition 2, it is thus sufficient to exhaust all possible Frobenius Normal Forms, rather than all r × r matrices, for the top left-hand corner. This decreases the search space by a significant factor: for r = 4, there are 20160 ≈ 2^{14.3} matrices in GL_4(F_2), but only 14 ≈ 2^4 equivalence classes².
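The 20160 figure is easy to check, since |GL_r(F_q)| = ∏_{i=0}^{r−1}(q^r − q^i). A quick sketch:

```python
def gl_order(r, q=2):
    """Number of invertible r x r matrices over F_q."""
    count = 1
    for i in range(r):
        count *= q**r - q**i  # choices for row i, outside the span of rows < i
    return count

print(gl_order(4))  # 20160, i.e. about 2^14.3
```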
On top of that, Proposition 1 also allows us to get rid of redundant memory registers, as shown by the following corollary, proved in Appendix B.
Corollary 1. Let T be a transition matrix. Let us denote k = rank(T_00 | T_01). Then, there exists an instance using k memory registers which generates the same sequence 𝒳.

Sketch of Proof.
If (T_00 | T_01) does not have full rank, that is, if k < r, then the memory registers are not independent. An extracted basis (of size k) is thus enough to express any linear combination of the memory registers, with strictly less memory.
Corollary 1 states that after choosing a Frobenius Normal Form for T_00, and any value for T_01, one can immediately look at the rank of the top half (T_00 | T_01). If the top half does not have full rank, the study of the matrix reduces to the study of an instance with strictly less memory (a smaller r). If the search is done by increasing values of r, one can thus only consider top halves with full rank.
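The rank test above is cheap. Here is a sketch of Gaussian elimination over F_2, with rows of the top half packed as integer bitmasks (the example matrix is arbitrary, not taken from a real candidate):

```python
def rank_f2(rows):
    """Rank over F2 of a matrix whose rows are given as integer bitmasks."""
    pivots = {}  # leading-bit position -> reduced row kept for elimination
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead in pivots:
                row ^= pivots[lead]  # eliminate the current leading bit
            else:
                pivots[lead] = row
                break
    return len(pivots)

# r = 3, m = 1: a 3 x 4 top half (T00 | T01) where row2 = row0 + row1,
# so by Corollary 1 only 2 memory registers are really needed
print(rank_f2([0b1010, 0b0110, 0b1100]))  # 2
```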

An equivalence relation for injected-value sequences
Even if we limit redundancies thanks to Proposition 1 and Corollary 1, for most values of , , , the associated space of message-schedules remains too big. In particular, it cannot be exhaustively searched, especially if a MILP problem needs to be optimized for each instance.
To further reduce the explored space, we first restrict ourselves to matrices  for which rank( 11 ) = . Indeed, if rank( 11 ) < , only a strict subspace of the messages at round  impacts the injected values at this round. This does not directly generate collisions, since the unused messages can be stored in memory and used in later rounds. However, this requires extra registers whose only purpose is to store the unused injected messages of previous rounds, increasing the memory size  without increasing the security. More precisely, after a few rounds, such an instance behaves as if exactly  message blocks impacted the injected values at each round, the message-block sequence being slightly slid. So from now on, rank( 11 ) = , and in particular,  ⩾ .

2 Counting the number of equivalence classes is easier using another normal form for similarity.
Secondly, we take into account an adversary with full control over the input differences in message blocks (such as in a chosen-plaintext scenario). From this point of view, the implementation does not matter; only the actual decompositions of all    as linear combinations of   ′  ,  ∈ 0,  − 1 ,  ′ ⩽ , do. In particular, with  degrees of freedom, such an adversary can choose the differences of  independent    , rather than just the differences of  message blocks    . We thus study injected-value sequences up to linear changes of variables of the inputs.
Remark 2. Let  ∈ N ∖ {0}. The lower triangular form of   implies that the equivalence relation preserves the fact that only variables   ′  ,  ′ ⩽ , appear in both   and   .
Proposition 3. Linear equivalence of injected-value sequences, as defined in Definition 4, is an equivalence relation.

Proposition 4. [Proved in Appendix B.]
Let  be a transition matrix such that rank( 11 ) = . Then, up to a wire permutation of the inner state,  produces a sequence  which is linearly equivalent to the sequence produced by ̃︀ .

Sketch of Proof. We introduce the following decompositions: Because rank( 11 ) = , up to a wire permutation of the inner state, we can suppose that  ∈ GL  (F 2 ). With the following lower-triangular change of variables,3 we can rewrite Equation (3), which implies that the sequence  is linearly equivalent to the sequence generated by the transition matrix ̃︀ .

3  0,−1 and  0,−1 are viewed as matrices of dimension  × ; see Remark 1.
We can now present the chosen form for the explored transition matrices.
Theorem 1. Let  be a transition matrix such that rank( 11 ) = . Then, up to a wire permutation of the inner state,  produces a sequence  which is linearly equivalent to the sequence produced by a matrix ̃︀  of the following form, where  is a Frobenius Normal Form matrix.
Proof. First, using Proposition 4, we obtain, up to a wire permutation of the inner state, a transition matrix ̃︀  which produces a linearly equivalent sequence, as in Equation (12). We can then use Proposition 1 in order to put the top left-hand block in Frobenius Normal Form. The multiplication of the lower-left part by  −1 does not change the fact that the first rows of this block are all zero. The lower-right block is not modified, so Id  still appears on its first rows.  ′ thus has the announced form.
The class of matrices presented in Theorem 1 is chosen not only to make the search more efficient, but also for its sparsity, which guarantees a small implementation cost. Indeed, the Frobenius Normal Form constitutes a very sparse representative of a similarity class: it is a sparse matrix (a diagonal block matrix) with sparse non-empty blocks (companion blocks). The chosen form for the lower half is also quite sparse, with the 0 and Id blocks.

Constraints on the linear layer
Regarding the linear diffusion matrix , it should be implementable with a low number of XORs. However, we must ensure that each inner state block at round  will eventually influence all of them. To this end, we use the following metric of diffusion.

Definition 5. Let L be a matrix identical to a binary matrix , except that its coefficients are integers. The diffusion time of  is the smallest integer  such that all coefficients in ( L)  are non-zero. If no such integer exists, we set it to +∞.
We consider integers rather than binary field elements so that additions do not cancel out; this is equivalent to considering the iterations of  in which all XORs in the matrix multiplications are replaced with ORs. Intuitively, this number indicates how many rounds are needed to ensure full diffusion in the inner part, although in some special cases it is not entirely accurate, as there may be bad interactions between non-AES wires and the linear layer . In the case where all wires are AES wires, this metric is exactly the number of rounds which guarantees that every output wire depends on every input wire. In our search space, we generate matrices  under weight constraints, often with a weight of  + 1 or  + 2 so that  can be implemented with 1 or 2 XORs, and we ignore matrices with a high diffusion time: we mostly use a threshold of around 2 ×  in this paper.4
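Definition 5 can be evaluated directly by iterating the matrix with OR semantics, as described above. A minimal sketch (our own helper; an arbitrary iteration cap stands in for +∞):

```python
def diffusion_time(A, max_rounds=64):
    """Smallest t such that the t-th OR-power of binary matrix A has no zero entry.
    Returns None when no such t <= max_rounds exists (treated as +infinity)."""
    n = len(A)
    M = [[1 if A[i][j] else 0 for j in range(n)] for i in range(n)]

    def or_mul(X, Y):
        # boolean matrix product: XORs replaced by ORs, so terms never cancel
        return [[1 if any(X[i][k] and Y[k][j] for k in range(n)) else 0
                 for j in range(n)] for i in range(n)]

    P = [row[:] for row in M]
    for t in range(1, max_rounds + 1):
        if all(all(row) for row in P):   # every output wire depends on every input wire
            return t
        P = or_mul(P, M)
    return None

# circulant "identity + shift" matrix on 4 wires: full diffusion after 3 rounds
L = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
print(diffusion_time(L))  # 3
```

The identity matrix, which never mixes wires, correctly gets an infinite diffusion time under this helper.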

The actual explored space
The search method presented above is optimized but heuristic: we stress that we do not guarantee the minimal sparsity of the studied transition matrices. Still, the explored space contains promising candidates (see Section 6), which could be further optimized later on. Nevertheless, exhaustive search remains out of reach. Equivalence relations on  and  could be used, but would (and in practice do) interfere with the previous ones. Instead, we restrict the weight of  and , as described in Section 4.3 and further in Section 6.

Turning Collision Resistance into a MILP Problem
The search space being established, we now focus on assessing the security of the potential UHF candidates, by building an adapted MILP model and then solving it with an optimizer. A MILP model is composed of three objects: variables, representing either real numbers or (modular) integers;5 constraints, that is, inequalities between Z-affine combinations of variables; and an objective function, a Z-linear combination of variables that needs to be maximized (or minimized) subject to the given constraints. A MILP solver, such as Gurobi [Gur23], takes as input a MILP model and returns, if they exist, values for the variables that both satisfy the constraints and maximize (or minimize) the objective function.

Prior works
The use of MILP modeling for searching for differential trails with the highest probability was brought to light by Mouha, Wang, Gu & Preneel in 2011 [MWGP12]. Several approaches exist depending on the needed level of precision and the available computational power. In theory, by using one MILP variable for each bit of the state at each round, all the non-linear differential transitions could be modeled (at the cost of many constraints). This approach is very costly in practice. For byte-aligned (resp. nibble-aligned) primitives, it is much faster and more practical to assign a MILP variable to each byte (resp. nibble) of the state. Though less precise, such a model makes it possible (if it can be solved efficiently) to determine the minimum number of active S-boxes, from which an upper bound on the probability of the best differential trail can easily be estimated. In the case of AES-based ciphers, this method has become standard, as highlighted by Rocca [SLN + 21] or Deoxys-BC [JNPS21]. Following their lead, we adopt the byte-wise approach.
To do so, we extend Notation 1 so that the byte position appears.

Notation 3.
The second subscript indicates the byte position:   ,ℓ is the ℓ-th byte of    .

Our model
From now on, we assume a candidate has been chosen: , ,  and ,  ,  are now fixed. To these constants, we add , the number of rounds of the primitive to model.

Variables.
Let  ∈ 0,  − 1 be a round number,  ∈ 0,  − 1 be a word number (where the bound  ∈ {, , } depends on the register we look at), and ℓ ∈ 0, 15 be a byte position. We track the differential activity of every byte throughout the rounds by modeling each byte of the state as a binary variable, equal to 0 if the byte is inactive and 1 if it is active. We use lowercase letters to denote the binary variables. We first present basic constraints that will appear in the definition of more advanced ones.

MDS constraint. It models the relation between an input column of bytes, represented as the binary variables (  ) ∈ 0,3 , and an output one, represented as (  ) ∈ 0,3 ∈ {0, 1} 4 , through the AES MDS matrix. With an auxiliary binary variable  and two constraints, it enforces that either 0 or at least 5 of the 8 bytes are active.

Remark 3. In the above constraints, Σ corresponds to an integer sum, not a modulo-2 sum.
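This "either 0 or at least 5 active bytes" rule, and its encoding with one auxiliary binary variable, can be checked exhaustively over all 2^8 activity patterns. A small self-check sketch (function names are our own; the constants 5 and 8 come from the AES MDS branch number and the total byte count):

```python
from itertools import product

def mds_allowed(pattern):
    """An activity pattern on 4 input + 4 output bytes respects the AES MDS
    branch number iff 0 or at least 5 of the 8 bytes are active."""
    s = sum(pattern)
    return s == 0 or s >= 5

def mds_milp_feasible(pattern):
    """MILP encoding with one auxiliary binary d:  sum >= 5*d  and  sum <= 8*d."""
    s = sum(pattern)
    return any(s >= 5 * d and s <= 8 * d for d in (0, 1))

# the combinatorial rule and the MILP encoding agree on every pattern
assert all(mds_allowed(p) == mds_milp_feasible(p)
           for p in product((0, 1), repeat=8))
print("MDS encoding verified")
```

When d = 0 both constraints force all 8 bytes inactive; when d = 1 they force at least 5 active, which is exactly the branch-number condition.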
We can now create constraints for each layer of the round function. Let  ∈ 0,  − 1 .

AES-round layer.
Let  ∈ Supp() so that an AES round is applied on the -th wire. The S-box layer does not change the activity pattern, but the linear layer (ShiftRows and MixColumns) needs to be modeled. For any round  ∈ 0,  − 1 and column index  ∈ 0, 3 , the -th diagonal of    is linked by an MDS relation to the -th column of    . Those relations require an MDS constraint. When  ∉ Supp(), we simply add the constraints   ,ℓ =   ,ℓ for all , ℓ.

Message-schedule.
The 128-bit linear relations between   ,   ,  +1 ,   given by Equation (3) can be modeled with 16 Multiple-XOR constraints (one for each byte).

Injected-value addition. For all 𝑖, 𝑗, ℓ, the relation 𝑌 𝑖+1 ,ℓ =   ,ℓ +   ,ℓ is modeled as a Multiple-XOR. Finally, we add constraints on the inputs/outputs of the UHF, and constraints to take advantage of the inherent symmetries of the AES round function.

Input constraints.
At clock  = 0, the state and memory are fully inactive; the corresponding variables are thus constrained to 0.

Message constraints. If a trail with an inactive first round exists, shifting it by 1 round still yields a valid trail. Moreover, in the AES, any column (resp. row) plays the same role, so any trail can be shifted so that the first difference appears in the byte of index ℓ = 0. By forcing at least one first-round message byte of index 0 to be active, we facilitate the solving process without leaving any trail aside; hence the symmetry constraint. This model will be referred to as our basic model. Additionally, we can add to this model some output and/or linear incompatibility constraints.
Output constraints. We can force the state to be fully inactive at the end. This constraint greatly reduces the MILP solution space. However, it is too strong a constraint when  is small: a differential trail over more rounds but with fewer active S-boxes cannot be captured by the model. In practice, we iteratively increase  to capture more and more trails, until a sufficient number of rounds is reached.

Removing linear incompatibilities.
With the basic model, some obtained activity patterns may not be instantiable into differential trails because of linear incompatibilities, similar to the ones observed on AES [FJP13] or on Deoxys-BC [CHP + 17]. Unlike those ciphers, our message-schedule acts on 128-bit words, which enables us to model the linear incompatibilities with exact constraints.6 In our case, for an AES wire of index , we observe that MC ∘ SR ∘ SB(   ) ⊕    =  +1  . Introducing auxiliary variables, where LIN := MC ∘ SR, we can rewrite this as Equation (8), which is encoded byte-wise with 3-XORs between   ,ℓ (no hat), v ,ℓ , and x ,ℓ . In practice, solving the model is severely slowed down by taking these incompatibilities into account. However, it often increases the minimal number of active S-boxes by a few. Because of the pros and cons of each of these additional constraints, we parameterize our model accordingly. In Section 6, we explain how we parameterized the models to converge toward promising candidates.

Notation 4. Let  ⩾ 1, and let lin and output be two Booleans. We denote by Model(, lin, output) the model corresponding to  rounds, where the linear incompatibility constraints (resp. output constraints) are considered if lin = True (resp. output = True), and not otherwise.

A word on solutions
As already mentioned, a solution to these models consists of an activity pattern which, if it is instantiable, minimizes the number of active S-boxes. There is however no a priori guarantee that it can actually be instantiated as a differential trail. Nevertheless, if it is instantiable, and if all transitions can occur with maximal probability, then the instantiated trail would have a probability of   , where  is the number of active S-boxes, and  = 2 − is the probability associated with the differential uniformity  of the -bit S-box. Thanks to Assumption 1, this upper bound on the probability of the best differential trail enables us to estimate the level of security of any candidate (once the solver terminates). Section 6 presents our experimental results.

Search strategy
In order to find good candidates, we proceed as follows.
1. First, we fix some numerical values for , , , and  := |Supp()|, a maximum number of XORs to implement  and  , and a diffusion time threshold for .
2. Then, we generate random candidates for , , and for  according to Section 4.
3. For each candidate, we solve MILP models for increasing values of , going from quick, restrictive models to slower, more complete ones (see below).
4. For each candidate with a sufficient number of active S-boxes (i.e. more than 24), we generate assembly code corresponding to the round function, and benchmark it on a recent CPU. If the software performance is high enough, we keep the candidate.
5. At last, we select one of the final candidates based on performance/security trade-off, and perform a last MILP solve of Model(, lin=True, output=False) with high  to guarantee the security of the candidate.
In Step 2, in practice,  is generated uniformly at random among vectors of  elements with Hamming weight ;  is generated from a random permutation matrix, of which  zeros are replaced by ones (the implementation of  thus requires at most  XORs); and  is generated by looping over the set of possible Frobenius Normal Forms until the XOR-cost is at most . The rest of the matrix  is generated line by line, making sure that the XOR-cost constraint is satisfied. In our search,  and  are empirically randomized.
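The sampling of the linear layer (a permutation matrix with a few extra ones, so that applying it costs at most that many XORs) can be sketched as follows; the function name and interface are our own:

```python
import random

def random_linear_layer(s, f, rng=random):
    """Binary s x s matrix: a random permutation matrix in which f zero
    entries are flipped to 1, so applying it costs at most f extra XORs."""
    perm = list(range(s))
    rng.shuffle(perm)
    # permutation matrix: row i has its single 1 in column perm[i]
    M = [[1 if perm[i] == j else 0 for j in range(s)] for i in range(s)]
    zeros = [(i, j) for i in range(s) for j in range(s) if M[i][j] == 0]
    for i, j in rng.sample(zeros, f):   # flip f distinct zero entries to 1
        M[i][j] = 1
    return M

M = random_linear_layer(4, 2)
# s ones from the permutation plus f extras
assert sum(sum(row) for row in M) == 4 + 2
```

Candidates produced this way are then filtered by the diffusion-time threshold before any MILP solving.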
In Step 3, by increasing , we go from very restrictive and quickly-solved models to more complete but slower ones. Optionally, Step 4 can be executed before increasing , to discard underperforming candidates and avoid time-consuming MILP solves.

Running the Search.
In practice, we select  ∈ {2, 3, 4, 12, 20}. At each point, if the number of active S-boxes falls under a security threshold, the candidate is discarded. The security threshold is fixed to 20 active S-boxes, but by using lin=True in a later solve, the minimal number of active S-boxes might increase. Between the runs with  = 12 and  = 20, we automatically generate a C implementation, compile it on the fly, and benchmark it. If the speed (in cycles per byte) of the candidate falls under a speed threshold, the candidate is discarded. This threshold depends on the parameters (chosen in Step 1) and on the processor used in the benchmark.
Because our candidates rely on AES-NI instructions, their speed is upper-bounded by  ×  cycles per 128 bits, where  is the reciprocal throughput of AESENC, and  the rate (see Section 2.1). Thus, candidates with a speed close to this bound are considered promising. In our case, we benchmark on an Intel 11th Gen Core i5-1135G7 (Tiger Lake family), with a throughput  = 0.5, and we mainly target round functions with rate  = 2. Those round functions cannot go faster than 1 cycle per 128-bit block, i.e. 0.0625 cycles per byte. For rate-2 candidates, if  = 8, we set the speed threshold to 0.08 cycles per byte; if  < 8, very few candidates are faster than 0.08 cycles per byte, so the threshold is increased accordingly.
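The bound above is a one-line computation; a worked check under the stated figures (function name ours; assumption: one 128-bit block is 16 bytes):

```python
def min_cycles_per_byte(rate, aesenc_rthroughput, block_bytes=16):
    """Lower bound in cycles/byte: `rate` AESENC calls per 128-bit block,
    each costing `aesenc_rthroughput` cycles of reciprocal throughput."""
    return rate * aesenc_rthroughput / block_bytes

# rate-2 with AESENC reciprocal throughput 0.5 (Tiger Lake): 1 cycle per 16 bytes
assert min_cycles_per_byte(2, 0.5) == 0.0625
```

The 0.08 c/B threshold thus leaves a margin of roughly 28% over the theoretical optimum for these parameters.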
For each remaining candidate, we finally solve Model(20, lin=True, output=False) in order to obtain a final bound on the number of active S-boxes.7 This heuristic finds candidates with both good performance and security, but may not be the fastest approach. Still, it is much faster than simpler approaches such as solving the slow-but-accurate Model(high , lin=True, output=False) directly, or benchmarking every candidate before running the fastest Model(low , lin=False, output=True).
Choosing the numerical parameters. We chose the numerical parameters based on the following experimental observations. First, for a fixed rate, increasing  (and therefore ) tends to improve the performance. Moreover, when the other parameters are fixed, increasing  or  tends to increase the security. Finally, we limited the sizes of the state to  +  < 16.
We thus looked for candidates with a high  and multiple memory registers. We also explored lighter candidates, and propose good candidates for smaller values of , typically  ∈ {2, 4, 6}, which are not as fast but might be parallelizable in some scenarios. Finally, for  = 1,  = 1,  = 1, we looked at a rate-2 construction that lies slightly outside the scope of our family, by replacing every message block   0 with odd  by 0.

Results of the search
The results of our search are given in Table 3. For each set of numerical parameters, we give the total number of candidates we considered ("Total"), the number among them that satisfied the security threshold ("After Sec."), and the number among those that also satisfied the speed threshold ("After Speed"). As we can see, the vast majority of the candidates do not satisfy our demanding criteria, but a broad-enough search allows us to find promising candidates. The case of  = 1 is peculiar, as such candidates are inherently slower (they are not parallelizable), which is why the bottom right cell of the table is left empty. The properties of the most promising candidates are given in Table 2. Interestingly, for a fixed rate, a higher weight  usually means a higher speed. Although we did not perform a dedicated search to reduce the number of registers used in rate-2 candidates, we note that PetitMac's round function (Figure 3) requires 6 registers in total, and can be implemented without any additional temporary register.8 This improves on the result of Nikolić [Nik17b, Appendix C], which does not find a rate-2 candidate with fewer than 8 registers; the candidates found there may moreover require additional temporary registers in the implementation (see [TSI23] for similar concerns on Rocca). (Table 2 notes: 1 A message is added every other round. 2 There is 1 inherent XOR in the transition matrix; every other round, the message accounts for 2 additional XORs.)

We thus obtain a MAC whose security relies only on the PRF security of AES and the -AU security of , the former being a standard assumption, and the latter being a consequence of our MILP-based analysis.
-AXU family . We build the family  using the sum hashing construction from [CW77, Proposition 8]. Given two -AXU families  1 :  1 →  and  2 :  2 → , this construction yields an -AXU family . Concretely, we take the AES block cipher as an -AXU family (the -AXU security of AES is a consequence of its security as a PRF), and define the family  accordingly, where each AES instance is keyed independently.  is a 2 −128 -AXU family assuming that the AES is a secure PRF, and the composition of the 2 −128 -AU family  and the 2 −128 -AXU family  yields a 2 −127 -AXU family  ∘ , using the composition result from [Sti92, Theorem 5.6].

EWCDM.
The MAC itself follows the EWCDM construction by Cogliati and Seurin [CS16]. This construction uses a nonce to obtain high security, but it still provides security up to 2 /2 queries if the nonces are repeated (or omitted).

When used with unique nonces, EWCDM was initially proved secure up to 2 2/3 queries, but a more recent result proved security up to essentially 2  queries: assuming that the adversary makes fewer than 2  /67 queries, Mennink and Neves [MN17] proved a corresponding bound. We use the EWCDM construction because it provides significantly higher security than the more common Wegman-Carter-Shoup construction WCS[, ] 1,2 (,  ) =  1 ( ) ⊕  2 ( ). Indeed, Wegman-Carter-Shoup only provides 2 /2 security with unique nonces, and fails completely when nonces are repeated.
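The EWCDM composition, tag = E_K2(E_K1(N) ⊕ N ⊕ H(M)), is easy to express. Below is an illustrative Python sketch in which `toy_prp`, a hash-based stand-in of our own, replaces the AES-128 calls; it is not a permutation and offers none of the PRP guarantees the security proof requires.

```python
import hashlib

def toy_prp(key: bytes, block: bytes) -> bytes:
    # stand-in for AES-128 encryption: hash truncation, illustration only
    return hashlib.sha256(key + block).digest()[:16]

def xor16(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def ewcdm_tag(k1: bytes, k2: bytes, nonce: bytes, hash_of_msg: bytes) -> bytes:
    """EWCDM: tag = E_K2( E_K1(N) xor N xor H(M) ), all values 16 bytes."""
    inner = xor16(xor16(toy_prp(k1, nonce), nonce), hash_of_msg)
    return toy_prp(k2, inner)

tag = ewcdm_tag(b"k" * 16, b"K" * 16, b"n" * 16, b"h" * 16)
assert len(tag) == 16
```

In LeMac and PetitMac, `hash_of_msg` would be the output of the AES-round-based UHF, and both PRP calls would be genuine AES-128 encryptions under independent keys.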
Initialization. While the family is indexed by the secret initial state, we suggest to derive it as follows: the branch with index  is initialized to  Kinit (), where Kinit is a 128-bit secret key and  is the AES-128 block cipher.
LeMac. LeMac is our ultra-fast MAC algorithm. It takes as input a 128-bit nonce and a 128-bit key, and returns a 128-bit digest. It is based on the round function summarized in Figure 2, which corresponds to the fastest promising candidate we found for  = 8. A detailed algorithm is provided in Appendix D.
PetitMac. For cases where the high parallel potential of LeMac might not be an advantage (e.g. on smaller processors), we propose instead PetitMac, which is based on the promising candidate we found for  = 1 (see Table 2).

Security. We claim that LeMac and PetitMac offer 128-bit security in the nonce-respecting model, meaning that an attacker with advantage close to one requires a data complexity close to 2 128 , or a time complexity close to 2 128 . In the nonce-misuse setting, we claim that an attacker with advantage close to one requires a data complexity close to 2 64 , or a time complexity close to 2 128 .

Benchmarks
In order to evaluate the performance of LeMac, we performed comparative benchmarks on several recent hardware architectures from Intel and AMD. We compare LeMac with the following constructions: Rocca [SLN + 21] and Rocca-S [NFI24]; AEGIS128 [WP14] and AEGIS128L [WP13]; Tiaoxin-346 v2 [Nik14]; the rate-2 round function of Jean and Nikolić [JN16], with the same initialization and finalization as LeMac. Rocca, Rocca-S, AEGIS128, AEGIS128L and Tiaoxin-346 are authenticated encryption algorithms, and therefore provide more features than LeMac, but we believe they still offer a reasonable comparison point when used in their associated-data-processing mode.
For the benchmarks, we use a hardware performance counter to measure the number of cycles for the execution of the primitive, with various message lengths. All MACs are compiled with gcc 12.2.0, and we run the code multiple times in order to measure performance when the data and code are loaded in the cache. We use the perf program to set up a performance counter for the number of elapsed cycles,10 and the rdpmc instruction to read the performance counter with low overhead. On Intel CPUs, we obtain the same results using the rdtscp instruction, but on AMD CPUs the counter read by rdtscp (or rdtsc) is independent of the core frequency and does not actually count CPU cycles.
The results are shown in Table 4. We observe that LeMac essentially reaches the maximal possible performance for a rate-2 scheme on these CPU architectures: the Haswell and Skylake architectures compute at most 1 AES round per cycle, corresponding to a limit of 0.125 c/B for rate-2 schemes. Other constructions such as the Jean-Nikolić round function also have rate 2, but they don't allow enough parallelism to reach this bound.
We have also implemented and benchmarked PetitMac in a microcontroller setting. More precisely, our benchmarks were run on the STM32F407VG microcontroller, which is based on the ARM Cortex-M4 processor. For the AES round implementation we used the T-table-based one written in ARMv7-M assembly from [SS16], while we implemented the round function in C code. The code was compiled using arm-none-eabi-gcc 10.3.1 with the -O3 optimization flag, and the processor was clocked at 24MHz to take advantage of zero wait-states. Processing 16384-byte messages required 299509 clock cycles (without the initialization and finalization), leading to 18.3 c/B. This performance places PetitMac as a very competitive MAC on microcontrollers, even though it was not directly designed for that platform (the AES round is probably not the best starting point).
As expected, PetitMac is not competitive on high-end desktops, because we have to perform two sequential AES rounds per input block, and the latency of the AES instruction is the bottleneck.

Conclusion
In this article, we introduced a novel family of extremely fast UHFs, optimized for servers and desktop computers with AES-NI. Our general construction is large enough to contain interesting security/performance tradeoffs, while ensuring a manageable automated security analysis with MILP. Our strategy to search for good candidates within this family is fully automated and adaptable to the performance profiles of future processors. We showcased the validity of our approach by proposing concrete UHFs and corresponding MAC schemes, largely improving over the state-of-the-art on recent processors. Notably, our proposal LeMac is currently by far the fastest MAC in the high-profile use-case of AES-NI platforms.

A On the speed of Rocca in decryption mode
In the decryption mode of Rocca, we found the following dependency chain from [4] to   [4] (with the notation of [SLN + 21, Figure 1]), composed of an AESENC and 3 XOR instructions. We notice that this dependency chain only appears in the decryption mode, since the value  1 of [SLN + 21, Figure 1] needs to be computed from  [4]. On Ice Lake and Tiger Lake architectures, this dependency chain first appears to have a latency of 6 cycles, which implies that the round function cannot be faster than 6 cycles per 2 × 128 bits of message, or equivalently 0.19 cycles per byte.

Bypass delay
We measured the performance of the round function of Rocca in decryption mode on Tiger Lake (i5-1135G7) and measured around 0.34 cycles per byte. We believe that this corresponds to a latency of around 10 cycles. As far as we can tell, there is an additional delay (a bypass delay) when the output of an AES instruction is used as input to a non-AES instruction. In particular, the latency of the dependency chain of Equation (9) increases by 4 cycles (to a total of 10 cycles of latency): the two XOR instructions on the left of Equation (9) take as input respectively  [3] and AESENC(XOR([0], [4]),  [2]), which are both outputs of an AESENC instruction. The latency can be reduced to 8 cycles with a dummy XOR instruction added at the beginning of the round:  [3] ← XOR( [3], 0), so that at the beginning of each round,  [3] is not the direct result of an AESENC encryption. This was tested experimentally and increases the speed to around 0.25 cycles per byte on Tiger Lake (i5-1135G7). It seems that these 8 cycles of latency cannot be reduced, and we therefore believe that Rocca in decryption mode cannot run faster than 8/(2 × 16) = 0.25 cycles per byte on Tiger Lake processors.

B Full Proof for Reducing the Search Space
Proof of Proposition 1. Let us denote, for any  ⩾ 0,    ,    the respective memory registers and round-message at clock  produced by   . By adapting Equation (3) to   , we obtain the corresponding recurrence. By design,  0 = 0 and  0  = 0 because the memory is initialized as such. In particular,  0  =   0 . Let  ⩾ 0 and let us suppose that    =    . Then, by injecting    =    into Equation (10) and simplifying, we get the claimed relation.

Because   is a linear combination of   ′  where  ′ < , this change of variables corresponds to a lower triangular block matrix   (whose diagonal is only made of  blocks).

D Algorithms D.1 LeMac
A complete description of LeMac is provided in Algorithm 1, where  is a key-less AES round. All the subkeys are derived by encrypting a counter with the master key. The state of the UHF is initialized with such subkeys. During the UHF finalization, each branch of the inner state goes through 10 independent AES rounds with subkeys that were derived as encrypted counters. We derive only 18 subkeys, and use them in a rolling fashion in each branch; the idea is to save space by not having to store 10 ×  = 90 different 128-bit subkeys for this final step. Figure 4 shows five rounds of the UHF used in LeMac.

D.2 PetitMac
As with LeMac, we give a complete description of PetitMac in Algorithm 2. The notations are the same as above.All the subkeys are derived by encrypting a counter with the master key.The state of the UHF is initialized with one such subkey.During the UHF finalization, we encrypt the state and the memory registers using 10-round AES, with  subkeys generated in a rolling fashion.We XOR the outputs together, and combine the result with the nonce using the EWCDM construction.

Figure 1 :
Figure 1:  stands for a key-less AES round.Each choice of the size parameters , , , the Boolean values   , and the linear matrices ,  defines an instance of the framework.

Figure 2 :Figure 3 :
Figure 2: Two rounds of the UHF used in LeMac.For more iterations of the round function, refer to Figure 4 in Appendix D.

Figure 4 :
Figure 4: Five rounds of the UHF used in LeMac.

Table 2 :
Table of the retained candidates over different parameter sets. Speeds were measured on an Intel 11th Gen Core i5-1135G7 (Tiger Lake) for different message sizes.

Table 3 :
Number of tested and passed candidates for different settings.Candidates were generated so that they satisfy the diffusion threshold.The search time is given in core hours.The search was performed on Cascade Lake Intel Xeon 5218.

Table 2 also lists PetitMac (see Figure 3 for its round function); it has a rate of 2, and ensures the activation of at least 26 S-boxes during absorption. PetitMac takes as input a 128-bit nonce and a 128-bit key, and returns a 128-bit digest.
This first proves by induction that    =    for any  ⩾ 0. Let  ⩾ 0. According to Equation (10),    =  10  −1    +  11   . Replacing    by    , we obtain:  10  −1    +  11   =  10   +  11   =   ; which proves the announced equality for any  ⩾ 0.

Proof of Corollary 1. If  =  then  generates  and has  memory registers. Let us now suppose that  < . In that case, we can find  ∈ GL  (F 2 ) such that the first  −  rows of (  00 |  01 ) =  ( 00 | 01 ) are all zero. (00−1 |  01 ) naturally shares the same property, and according to Proposition 1,   produces the same round-message sequence. But the  −  first empty rows in   indicate that the first  −  memory registers will be zero at all times  ⩾ 0, and therefore will never impact the output sequence.   can thus be adapted by removing the  −  null rows in the upper half, and removing the corresponding  −  columns in the left-hand half. The obtained matrix  ′ generates the same sequence with  memory registers.

Proof of Proposition 4. By hypothesis, rank( 11 ) = , so at each round, the information of the  independent new message blocks is fully contained in  of the round-value blocks. In other words, there exist  indices  = { 0 , • • • ,  −1 } such that for any ,  , where  ∈ GL  (F 2 ) (and  ∈ ℳ ×(−1) (F 2 )). Up to a wire permutation, let us assume that  = 0,  − 1 . In that case, ( 10 | 11 ) can be decomposed such that  appears in it.