MOE: Multiplication Operated Encryption with Trojan Resilience

In order to lower costs, the fabrication of Integrated Circuits (ICs) is increasingly delegated to offshore contract foundries, making them exposed to malicious modifications, known as hardware Trojans. Recent works have demonstrated that a strong form of Trojan-resilience can be obtained from untrusted chips by exploiting secret sharing and Multi-Party Computation (MPC), yet with significant cost overheads. In this paper, we study the possibility of building a symmetric cipher enabling similar guarantees in a more efficient manner. To reach this goal, we exploit a simple round structure mixing a modular multiplication and a multiplication with a binary matrix. Besides being motivated as a new block cipher design for Trojan resilience, our research also exposes the cryptographic properties of the modular multiplication, which is of independent interest.


Introduction
Most modern cryptographic systems rely on the fundamental assumption that the hardware on which they are implemented is trustworthy. This assumption is, however, violated when the hardware manufacturer becomes malicious, and can mount attacks at the hardware level, including hardware Trojans or hardware counterfeiting. Perhaps most devastating are hardware Trojans [TK10, BHBN14, XFJ+16], which are stealthy malicious modifications of integrated circuits, the heart of any electronic hardware. Hardware Trojans may have catastrophic consequences for security, including deliberately weakened cryptographic devices or malfunction of control systems in nuclear power plants.
Unfortunately, the dangers resulting from malicious hardware have drastically escalated due to the inherently distributed nature of modern hardware manufacturing. Since setting up top-notch foundries that can produce integrated circuits at 14nm or beyond is very costly (estimates suggest almost US $15 billion to set up the next generation fabs [Rib14]), the majority of ICs are produced offshore by untrusted foundries. Currently only 13 foundries control more than 90% of the global IC production market, and none of them is located within the EU [Ins14]. In this situation, as little as a single malicious foundry can have a major negative impact on a large fraction of modern security systems.
In view of this, various types of countermeasures have been introduced in order to mitigate the threats arising from hardware Trojans. At a high level these countermeasures can be classified into two main types. The first type are detection-based solutions which aim at spotting the presence of a Trojan through device inspection [ABK+07, AARP10]. Unfortunately, the effectiveness of Trojan inspection significantly decreases with the growing complexity of the IC, and in most cases becomes infeasible for practically deployed devices. The second type of countermeasures are the so-called prevention-based methods which aim at making the insertion or exploitation of a hardware Trojan more difficult. Examples of preventive countermeasures include split manufacturing [IEGT13] and input scrambling [WS11], but up to now both their design and analysis have been ad hoc and only allow for protection against weak types of Trojans.
Recently, several papers have initiated a more formal approach to designing and studying the security of Trojan countermeasures, see in particular [DFS16, MCS+17, AKM+18]. These works formally define security models, propose novel countermeasures and prove their security against broad and well-defined classes of Trojan attacks. In order to achieve Trojan resilience, their main ingredient is to rely on techniques from Multi-Party Computation (MPC), where the IC internally runs an MPC protocol thereby preventing the adversary from deliberately activating the hardware Trojan. While at a technical level these works use similar ideas (i.e., relying on MPC techniques), the concrete constructions and the level of security that they can achieve are quite different. In particular, while [MCS+17, AKM+18] aim at preventing an adversary from breaking the underlying cryptographic primitive (e.g., by extracting the secret key from the device), in [DFS16] Dziembowski et al. consider a much stronger security definition called robustness. The robustness property guarantees that the device under attack behaves correctly (i.e., it still executes its specification) even when run in an adversarial environment. The latter means that the adversary has full input/output control over the device under attack.
In this paper, we focus on the approach given in [DFS16] as it gives a stronger security guarantee under the reasonable condition that the number of times the device is used in real life is bounded. To illustrate the strength of the model of [DFS16], let us consider a simple example, where a device is implementing an authentication protocol for controlling an aircraft. The security property of [MCS+17, AKM+18] guarantees that a potential hardware Trojan inside the device does not allow an external adversary to take over the control of the aircraft (since even in the presence of the Trojan the authentication device retains its security). However, their security notion does not protect against malfunction, i.e., a malicious manufacturer may plant a Trojan that when activated makes the authentication device always fail. Clearly, such malfunctioning can have devastating consequences for our aircraft example. On the contrary, the countermeasure proposed in [DFS16] prevents such denial-of-service attacks. Our example also illustrates that a security guarantee that holds for a bounded number of executions is meaningful in practical settings, as there is a natural bound on how often the aircraft is going to be used.
In [DFS16] the authors present a general countermeasure based on a passively secure 3-party computation protocol. A general purpose countermeasure allows protecting arbitrary computations implemented on a device against Trojan attacks. Dziembowski et al. achieve this by relying on general purpose MPC techniques, which unfortunately leads to several drawbacks. First, they require a so-called testing phase where prior to using the device, the input/output behaviour of each individual component of the system is intensively tested. For standard block ciphers, this results in rather involved testing, where the entire communication between the parties running a 3-party computation protocol is tested. Second, the increase in overall circuit size and computational overhead incurred by the Trojan countermeasure of [DFS16] compared to an unprotected device is significant. In this work, we investigate whether these two shortcomings can be addressed. To this end, we focus on one of the most important primitives of cryptography, and design an efficient Trojan-resilient block cipher, which combines small computational overheads with minimal requirements on the testing phase.

Our contributions
Our main contribution is the design and the security analysis of MOE, a new block cipher with an innovative structure that allows for efficient computation on secret shares by mostly relying on linear operations. Obviously, if the cipher only consists of operations that are linear with respect to the xor, we cannot hope for it to be secure, so our design idea is to mix linear operations over different underlying groups. After in-depth analysis of the possible choices, we picked two groups, namely $(\mathbb{Z}/2\mathbb{Z})^n$ and $\mathbb{Z}/2^n\mathbb{Z}$. The latter corresponds simply to a modular multiplication, an operation that was used back in the 1990s for the design of IDEA [LM91] and inspired the design of ciphers such as MESH [NRPV04], MMB [DGV93] and WIDEA-n [JM09]. We provide an extended analysis of the cryptographic properties of the modular multiplication by 3 by evaluating its algebraic degree together with its differential and linear properties. We next study how the modular multiplication behaves when associated with the binary matrix multiplication to form one round of encryption. By introducing the notion of change branch number, we are able to compute bounds on the probability of differential characteristics. We also discuss other attacks and evaluate the security of a small-scale version, and conclude that MOE seems resistant against basic attacks.
In addition to this, we show that our design decisions lead to the desired outcome in regard to the implementation of MOE as a Trojan-resilient cipher. More precisely, we consider the methodology introduced in [DFS16] and compare the performance of MOE with that obtained in [BDFS18] for the AES and Mysterion, a bitslice-oriented cipher. The first notable advantage of our cipher is that it allows a simplified testing phase: only the input/output behaviour of the shared circuits implementing MOE must be tested (which can be done with a single plaintext/ciphertext pair) while when implementing the other ciphers all the intermediate computations must be verified (which requires a large dictionary of plaintext/ciphertext pairs). This improvement represents a significant step in the direction of Trojan-resilient block ciphers that can be deployed and tested on-the-fly, as it avoids the previous situation where one either needs a way to monitor all the communications (which makes the test much more expensive in time) or uses a dedicated board with all circuits trusted but one (which allows testing only the output result, but requires plugging and unplugging the circuits under test on this dedicated board). The second improvement brought about by our design is a decrease in the communication complexity, resulting in an increased throughput and better robustness guarantees. The gain is significant compared to the AES, and somewhat smaller compared to ciphers that are already optimized for multiplicative complexity (for other applications like masking). This last point seems to indicate that in terms of communication complexity, the limits that one can reach are similar for block ciphers designed according to very different principles. Finally, a last drawback of [DFS16] was the need for a 3-party protocol. The structure of MOE, and more precisely the linear nature of its round operations, enables a much simpler 2-party protocol which reduces the overall hardware cost.
Similar to [DFS16] our construction requires a small trusted circuitry. The only downside of our result is that this trusted circuitry increases in complexity, albeit only by a small factor. We believe it is an important direction for future work to improve on the design of MOE (or provide alternative block cipher designs) so as to minimize the size of the trusted master. We finally emphasize that our design goals are different from the ones used for MPC-friendly or masking-friendly ciphers, see e.g., [GLSV15, ARS+15, GRR+16, DEG+18, AAB+20, GLR+20]. While the latter mostly focus on minimizing the number of non-linear operations, the use of a small trusted master (responsible for the sharing operations) in our Trojan-resilient circuits implies that we rather focus on minimizing the number of transitions from one group to another. Yet, again, the (pseudo) multiplicative depth of our cipher (defined as the number of such transitions) is close to the actual multiplicative depth in these other ciphers, suggesting that both proposals are close to the minimum required. The latter observation, together with the in-depth investigation of the cryptographic properties of the modular multiplication, is of independent interest.
Structure of the paper. We first describe our new approach to Trojan-resilient block ciphers in Section 2 and compare it with the generic solution of Dziembowski et al. Based on this, we design a new block cipher named MOE that we describe in Section 3. One peculiarity of MOE is the use of modular multiplication as a block cipher ingredient, a choice that we carefully justify in Section 4 by discussing its cryptographic properties. We start by looking at the algebraic degree of this operation and then give an extensive (and to the best of our knowledge, the first) analysis of the differential and linear properties of a special case, that is the multiplication by 3. We then provide a security analysis of our cipher (Section 5), and in particular our analysis of the modular multiplication allows us to prove that no high-probability differential characteristics exist. We conclude by discussing the implementation of MOE on a prototype Printed Circuit Board (PCB) combining four commercial FPGAs and compare the obtained performances with the ones reached with a generic MPC implementation of both the standard AES and a bitslice-oriented cipher.

Our approach to Trojan-resilience
In this section we start by recalling the framework of Dziembowski et al. [DFS16], provide a high-level description of our construction and develop a formalization for analyzing its security. Notice that for now we view the underlying round operations of the MOE cipher as a black-box, and do not dive into the details of our concrete instantiation.

Trojan attack setup
We follow the work of Dziembowski et al. [DFS16] and consider a setup involving three parties: the trusted designer of the device, a trusted tester and a malicious manufacturer. In this setting the designer of the IC sends the IC's specification (in some hardware description language) to the manufacturer who produces the hardware and delivers it for testing and final assembly. The testing is carried out in a trusted environment by a special party called the tester, which checks whether the devices operate according to their intended specification. We only consider black-box testing, i.e., testing whether the input/output behavior of the devices matches their specification. Such a setting of outsourcing the manufacturing process is very common and widely used in modern IC production due to the high financial cost of setting up state-of-the-art chip foundries.
While a vast number of different hardware Trojan attacks has been discussed in the literature, we focus on a well-defined (yet still broad) class of possible attacks that informally can be described as "logical attacks". More concretely, recall that a hardware Trojan consists of a triggering mechanism and a payload. The triggering mechanism activates the hardware Trojan, while the payload describes the malicious behavior that is carried out by the Trojan (e.g., revealing secret keys or malfunctioning of the device). At a very high-level one can distinguish between "physical" and "logical" Trojans. In the first case, the triggering and payload are carried out in some physical way. This may for instance be an activation of the Trojan by running it in an environment of higher temperature, and/or leaking the secrets via physical side-channels. On the other hand, logical Trojans assume that the Trojan triggering is carried out via logical inputs delivered to the device, and the payload is received by the adversary via a logical output. As in [DFS16], we focus in this work on logical Trojans. Logical Trojans include many natural Trojan attacks such as cheat codes (activation is triggered by a hard-to-guess input) and time bombs (activation is triggered after a certain number of executions) [WS11,DFS16].
Let us now describe the framework of Dziembowski et al. in general terms. First, a single malicious manufacturer produces a set of mini-devices $D^j_i$, where every triple $D^j := (D^j_1, D^j_2, D^j_3)$ aims to compute a 3-party protocol and is denoted as a sub-device. These devices supposedly implement the desired functionality and are delivered back to the designer for testing. In the testing the designer checks if the produced mini-devices have the same input/output behavior as the desired functionality that the devices are supposed to implement. After the testing of these sub-devices has been completed successfully, they are assembled together with a so-called trusted master to build the final device $D$. One may think of the trusted master as a coordinator that controls the computation among the different mini-devices and carries out some simple computations. Naturally, we want that these computations are as simple as possible as otherwise the full circuit can directly be implemented using the technology used for building the trusted master. Dziembowski et al. show however that some minimum trusted computations are necessary when we aim to achieve strong Trojan-resilience guarantees (i.e., robustness).

Trojan countermeasure and cipher design

Multiparty computation as a Trojan countermeasure
Let us consider cheat codes as one particularly dangerous Trojan attack. In a cheat code the Trojan gets activated by running the device on a certain hard-to-guess input (the "cheat code"). Once this input is provided the Trojan delivers its payload and the device starts to malfunction. Cheat codes are dangerous because they are nearly impossible to detect by functional testing. This is due to the fact that they are hard-to-guess by design. To effectively prevent cheat codes we may however use techniques from MPC and secret sharing. Consider a setting with 3 parties denoted by $P_1$, $P_2$ and $P_3$, where each party holds its input $x_1$, $x_2$, respectively $x_3$. A secure MPC protocol for a function $f$ allows $P_1, \ldots, P_3$ to securely evaluate $f(x_1, x_2, x_3)$ such that nothing is revealed about the parties' individual inputs except for what is implied by the evaluated output $f(x_1, x_2, x_3)$.
The high-level idea of using multiparty computation to protect against cheat codes is the following. Before the adversarially generated input $x$ enters the device it is shared using a secret sharing scheme (e.g., Shamir's secret sharing) into shares $x_1$, $x_2$ and $x_3$. This secret sharing is done on a trusted master device, and hence is done without influence from the malicious circuit manufacturer. Each of the shares $x_i$ is then given to a mini-device $D^j_i$ that was produced by the manufacturer and supposedly implements the functionality computed by party $P_i$ in the MPC protocol. The devices then emulate a secure function evaluation of $f$. From the security guarantee of the secret sharing scheme and the multiparty computation protocol, it is guaranteed that the mini-devices do not learn anything about the shared input $x$. Thus, this approach takes away the adversary's control over the inputs $x$, thereby preventing targeted activation by a cheat code.
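To make the sharing step concrete, the following is a minimal sketch of what the trusted master might do before handing an input to three mini-devices. It is an illustration only: it swaps in additive (XOR) sharing for the Shamir sharing mentioned above, and the 64-bit block size, the function names and the `cheat_code` value are all assumptions made for the example.

```python
import secrets

N_BITS = 64  # illustrative block size (assumption, not taken from the paper)

def share_xor(x: int, parties: int = 3, n_bits: int = N_BITS) -> list[int]:
    """Split x into `parties` shares whose XOR equals x.

    The first parties-1 shares are uniformly random; the last one is
    chosen so that all shares XOR back to x.
    """
    shares = [secrets.randbits(n_bits) for _ in range(parties - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]

def reconstruct_xor(shares: list[int]) -> int:
    """Recombine XOR shares into the original value."""
    out = 0
    for s in shares:
        out ^= s
    return out

# Hypothetical adversarial input: each mini-device only ever sees one share,
# which is uniformly distributed and independent of the chosen input.
cheat_code = 0xDEADBEEFCAFEF00D
shares = share_xor(cheat_code)
assert reconstruct_xor(shares) == cheat_code
```

Since any two of the three shares are uniformly random, a Trojan inside a single mini-device cannot recognize the trigger value from what it receives.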
The above description gives the high-level idea of the basic construction outlined in [DFS16]. Dziembowski et al. observe however that the MPC approach alone does not suffice to protect against more powerful Trojan attacks. For instance, multiparty computation by itself cannot protect against time-bombs (i.e., Trojans that get activated after a certain number of executions). This is the case because the Trojan activation is now not triggered via an input but just once a certain threshold of activation is reached. Even worse, standard MPC does not protect against malicious devices undoing the effect of secret sharing by just reconstructing the shares via subliminal channels. This motivates the approach of [DFS16] to combine MPC with testing, where the individual devices are tested for correct computation, thereby preventing subliminal channels and time-bombing attacks. Let us continue by providing more details on the approach of [DFS16].
Consider a stateful arithmetic circuit $\Gamma$ describing the desired functionality that we want to outsource to a (possibly) malicious manufacturer A for production. One may think of $\Gamma$ as a specification of an AES block cipher, where the state corresponds to the secret key of the AES. To protect against Trojans, instead of letting A produce a device $D$ that implements $\Gamma$, we first transform $\Gamma$ into a new algorithm $\widehat{\Gamma}$ that is hardened against Trojan attacks. At a high level, $\widehat{\Gamma}$ consists of two components. First, a set of mini-circuits $\Gamma^j_i$, and second a specification of the master circuit M. As discussed above, the role of the master circuit is to manage the communication between the mini-circuits and carry out some simple trusted computation (e.g., the secret sharing of the inputs). Every triple of mini-circuits $(\Gamma^j_1, \Gamma^j_2, \Gamma^j_3)$, with $1 \le j \le \lambda$ and $\lambda$ the number of sub-circuits, performs a 3-party computation of the target functionality. In the computation the trusted M takes the role of carrying out the communication between the parties, and additionally runs some simple trusted pre- and post-processing (essentially secret sharing the inputs and reconstructing the outputs). Looking again at the AES example from above, each of the $\lambda$ sub-circuits would implement a 3-party computation protocol securely evaluating the AES on shared inputs corresponding to the adversarially chosen plaintext. The set of mini-circuits $\{\Gamma^j_i\}_{i,j}$ is then given to the malicious manufacturer A who produces devices $D^j_i$ that implement the corresponding mini-circuits $\Gamma^j_i$. For completeness, Appendix A recalls the 3-party protocol used in [DFS16].
When the designer receives the devices back from the manufacturer, he first carries out some functional testing of each $D^j_i$. This is done by a tester $T_{DSF}$ which runs each $D^j_i$ and interacts with it through its interface by providing as inputs some values that $D^j_i$ would expect when it is run in the real environment. If the input/output behavior of each $D^j_i$ matches with the corresponding specification of $\Gamma^j_i$, then the designer assembles the final device $D$ by combining the mini-devices $D^j_i$ with the trusted master M. Dziembowski et al. then show that the input/output behavior of this final assembled device $D$ is with overwhelming probability identical to the input/output behavior of the trusted specification $\Gamma$ for a limited number of executions, even for adversarially chosen inputs. This security notion on which we focus is called Trojan robustness in [DFS16].

High-level idea of our efficient Trojan resilient block cipher
Our block cipher design follows a very similar approach as the one described above. The main difference however is that due to our novel design of the round function we do not need to apply general purpose protocols for secure 3-party computation. Indeed, since all the operations inside a round of the block cipher are linear and the trusted master carries out the secret sharing and the reconstruction, we can evaluate the round operations directly on secret shares without any interactions between the components. This is possible because for linear secret sharing schemes, linear operations are easy to compute on shared data. Let us now take a closer look at our high-level design shown in Figure 1.
Let MOE denote the cipher that we will detail in the later sections. For now it will only be important that MOE requires operations which are linear over some group. To this end, we have chosen two distinct groups in which this linearity is described. After a study of the different possibilities which we provide in Appendix B, we settled for the following two types of round operations: the multiplication by a matrix of $GL_n(\mathbb{F}_2)$, and the multiplication by an odd integer $\alpha$ modulo $2^n$. To simplify notation we denote these two operations by and , and will write * ∈ { , } as a placeholder for one of them. Moreover, we let $\mathsf{share}_*$ denote the secret sharing operation over the group operation *, and let $\mathsf{reconstruct}_*$ denote the corresponding reconstruction operation. The specification of MOE consists of $\lambda$ sub-circuits, where each sub-circuit $\Gamma^j$ has exactly the same structure (hence we will sometimes abuse notation and omit the parameter $j$). Each sub-circuit $\Gamma$ consists of $2\kappa$ mini-circuits $((\Gamma_{1,0}, \Gamma_{1,1}), \ldots, (\Gamma_{\kappa,0}, \Gamma_{\kappa,1}))$, where each pair of mini-circuits represents a field operation of MOE computed in the shared domain. Note that these $2\kappa$ mini-circuits can be implemented with only 2 physical devices. More precisely, denote by $X^i$ the input to the $i$-th operation of MOE and by $Y^i$ the corresponding output. Let $(X^i_0, X^i_1)$ be a secret sharing of $X^i$ according to the underlying group used by this operation, i.e., $(X^i_0, X^i_1) \leftarrow \mathsf{share}_*(X^i)$. For $b \in \{0, 1\}$, the mini-circuit $\Gamma_{i,b}$ applies the $i$-th operation to its share $X^i_b$ and outputs $Y^i_b$. Notice that due to the linearity of the operation, it holds that $Y^i = \mathsf{reconstruct}_*(Y^i_0, Y^i_1)$. This completes the description of the computation carried out by the (untrusted) mini-circuits $\{(\Gamma_{i,0}, \Gamma_{i,1})\}_i$. The specification of these mini-circuits will be outsourced for production to the malicious manufacturer A, who will return a set of devices $\{(D_{i,0}, D_{i,1})\}_i$. The latter will be used together with the trusted master during the assembly of the final device $D$.
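The claim that each round operation can be evaluated share by share, with no interaction, can be checked on a toy instance. The sketch below is an illustration under assumptions: a 16-bit block, the multiplier 3, and a hypothetical unitriangular binary matrix `ROWS`; none of these are the actual MOE parameters, and all function names are invented for the example.

```python
import secrets

N = 16                 # toy block size in bits (assumption)
MASK = (1 << N) - 1
ALPHA = 3              # odd multiplier, so x -> ALPHA*x mod 2^N is a bijection

def share_add(x):
    """2-out-of-2 additive sharing in Z/2^N Z."""
    x0 = secrets.randbits(N)
    return x0, (x - x0) & MASK

def reconstruct_add(y0, y1):
    return (y0 + y1) & MASK

def share_xor(x):
    """2-out-of-2 sharing in (Z/2Z)^N, i.e. XOR sharing."""
    x0 = secrets.randbits(N)
    return x0, x ^ x0

def reconstruct_xor(y0, y1):
    return y0 ^ y1

def mul_alpha(x):
    """Multiplication by ALPHA modulo 2^N (linear over Z/2^N Z)."""
    return (ALPHA * x) & MASK

def mat_mul_gf2(rows, x):
    """Multiply a binary matrix (one bitmask per row) by the bit-vector x over F_2."""
    y = 0
    for i, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y

# Hypothetical invertible matrix: ones on the diagonal and the sub-diagonal.
ROWS = [(1 << i) | ((1 << (i - 1)) if i else 0) for i in range(N)]

x = 0xBEEF
# Each "mini-circuit" works on its own share only; the master recombines.
x0, x1 = share_add(x)
assert reconstruct_add(mul_alpha(x0), mul_alpha(x1)) == mul_alpha(x)
x0, x1 = share_xor(x)
assert reconstruct_xor(mat_mul_gf2(ROWS, x0), mat_mul_gf2(ROWS, x1)) == mat_mul_gf2(ROWS, x)
# The modular multiplication is not linear for the XOR, which is the
# source of non-linearity when the two operations are mixed.
assert mul_alpha(1 ^ 3) != mul_alpha(1) ^ mul_alpha(3)
```

Moving from one operation to the other requires reconstructing in one group and re-sharing in the other, which in the construction described above is the job of the trusted master.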
As part of each of the $\lambda$ sub-devices $D^j$ (this corresponds essentially to the computation of $\Gamma^j$) the master M will carry out $\mathsf{share}_*$ and $\mathsf{reconstruct}_*$ for each of the $2\kappa$ operations. Since $\mathsf{share}_*$ is a randomized algorithm this means that the master M has access to a trusted source of randomness. This is a strong assumption, but as explained in [DFS16] this randomness can be generated by the untrusted devices thanks to an efficient Trojan-resilient PRG. Moreover, as in [DFS16] the master computes a final majority of the output produced by the $\lambda$ sub-devices. Majority is here defined by computing the majority bitwise which, as discussed in [BDFS18], is necessary to keep the cost of the master low, and is possible because after the testing phase the output of all the sub-circuits is equal to the output of the trusted specification. We finally need to handle the key addition. The simplest option for this purpose is to handle it in the trusted master. In this case a small trusted memory is required. To minimize the complexity of the master, we may, however, also carry out the key addition by distributing key shares among the untrusted mini-devices, and let them carry out the key addition. In this case a secure provisioning phase must distribute the shares of the key to the mini-devices during deployment. The master circuitry M implemented on trusted hardware, together with the mini-devices $\{(D_{i,0}, D_{i,1})\}_i$ that it controls and that were produced by the malicious manufacturer A, forms the final device $D$. In the following, we will sometimes write $C \leftarrow D(K, P)$ for running the final device $D$ with input key $K$ and plaintext $P$ producing ciphertext $C$. Similarly we will use the notation $C \leftarrow D^j(K, P)$ for the execution of the $j$-th sub-device of $D$. Notice that the latter does not involve the computation of the majority that is part of the trusted master circuitry.
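The bitwise majority computed by the master can be sketched in a few lines. This is an illustrative software model only (the function name, the 16-bit width and the sample values are assumptions); a hardware master would compute the same function with one small majority gate per output bit.

```python
def bitwise_majority(outputs: list[int], n_bits: int) -> int:
    """Return the value whose every bit is the majority of that bit
    across the outputs of the sub-devices (an odd count avoids ties)."""
    result = 0
    for bit in range(n_bits):
        ones = sum((c >> bit) & 1 for c in outputs)
        if 2 * ones > len(outputs):
            result |= 1 << bit
    return result

# Five hypothetical sub-device outputs, two of which are corrupted.
outputs = [0x1234, 0x1234, 0xFFFF, 0x1234, 0x0000]
assert bitwise_majority(outputs, 16) == 0x1234
```

As long as more than half of the sub-devices output the correct ciphertext, every bit of the result is correct, regardless of what the faulty sub-devices return.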

Comparison with the generic proposal of Dziembowski et al.
While the construction of Dziembowski et al. has the important advantage of offering a general purpose countermeasure (i.e., a compiler that can protect any computation against Trojan attacks), our construction has the following main advantages, making it particularly appealing for real-world implementations.
• Simplified testing phase: One fundamental limitation of the generic solution of Dziembowski et al. is that the testing phase must verify the correctness of all the intermediate results of the computation (including their communication). The latter is critical to prevent subliminal channels between the mini-devices. For instance, in [DFS16] if we do not test the necessary communication between the parties running the 3-party computation, then the mini-devices may just exchange their secret shares that they have received from the master. This would enable an adversary to easily exploit cheat codes. The main advantage of our construction is that the testing phase is significantly simpler than the one used in [DFS16]. The main reason for this is that we only need to test correctness of the input/output behavior of the entire block cipher instead of testing all individual components produced by the malicious manufacturer. Since the sub-circuits are shared, this further means that the test can be done by storing a single correct plaintext/ciphertext pair.
• Reduced communication complexity: Our construction provides Trojan-resilience with a reduced number of communication rounds between the trusted master and the mini-circuits, which is the main factor limiting the implementation throughput.
• Reduced hardware cost: Our construction can provide Trojan-resilience with two untrusted mini-circuits per sub-circuit while the construction in [DFS16] requires three mini-circuits per sub-circuit as it relies on general MPC techniques.

Protecting against Trojans: robustness vs. security
As discussed in the introduction, in this work we aim to achieve the stronger security guarantee of robustness introduced in [DFS16]. Informally, robustness guarantees that after the testing phase is completed successfully, with high probability the final assembled device $D$ operates according to its specification. This is formalized via a robustness game in Figure 2 that we tailored to the case of a block cipher. The robustness game considers a setting where the untrusted mini-devices $\{(D_{i,0}, D_{i,1})\}_i$ are produced by a malicious manufacturer A. The manufacturer is allowed to arbitrarily change the logical description of these mini-devices. This means that they can implement an arbitrary circuitry $(\widetilde{\Gamma}_{i,0}, \widetilde{\Gamma}_{i,1})$, which deviates from the intended specification $(\Gamma_{i,0}, \Gamma_{i,1})$. Similar to [DFS16] we only consider Trojans that are triggered via logical inputs and payloads that are delivered through the logical outputs. Let us now take a closer look at the robustness game $\mathsf{ROB}_{\mathsf{MOE}}(T, A, \eta, t, \lambda, K)$.
Figure 2 (robustness game, surviving fragment): If $T^{D^1(K,\cdot),\ldots,D^\lambda(K,\cdot)}(1^k, \lambda, t, K) = \mathsf{false}$, then return 0; Let $P_1 \leftarrow A(1^k)$; For $i = 1$ to $\eta$ repeat: [...]
Let A denote a malicious manufacturer, and T be a trusted testing algorithm that will be defined in more detail below. The purpose of T is to test whether the potentially malicious devices $\{(D_{i,0}, D_{i,1})\}_i$ satisfy their corresponding specification. Further, let MOE denote a trusted reference implementation of our cipher and $K$ be a secret key. In addition, the robustness definition is parameterized by the following two parameters: $\eta$ denotes the number of executions for which we want the final assembled implementation to work correctly after testing, and $t$ is the maximal number of tests that we carry out during the testing phase. In [DFS16] it was proven that we can guarantee the implementation $D$ to work correctly for $\eta$ executions, and that we require $\eta < t$. At a high level the robustness game as shown in Figure 2 proceeds in three phases. In the first phase, the adversary produces the untrusted mini-devices $\{(D^j_{i,0}, D^j_{i,1})\}_{i,j}$ which are then assembled together with the trusted master to build the $\lambda$ implementations $D^1, \ldots, D^\lambda$. Notice that each $D^j$ is built from a set of mini-devices $\{(D_{i,0}, D_{i,1})\}_i$ and uses parts of the trusted master M for secret sharing and reconstruction of the intermediate values. In the testing phase, the tester algorithm T checks the correctness of each implementation $D^j$ with respect to a reference implementation MOE. The tester T has oracle access to the partially assembled devices $D^j$ initialized with key $K$ and will be discussed in more detail below. In the last phase, the final device $D$ (see Figure 1, where we essentially add the majority computation as part of the trusted master) is executed $\eta$ times. In each such run, the adversary A is allowed to provide inputs for $D$ in order to, e.g., trigger the hardware Trojan. We say that the adversary A wins the game $\mathsf{ROB}_{\mathsf{MOE}}$ if the game outputs 1.
This happens if A manages to activate the hardware Trojan during one of the η executions despite the testing phase being completed successfully. The hardware Trojan is considered activated if the device D MOE outputs a value that differs from the value that would have been produced by the trusted reference implementation of MOE. Otherwise the game outputs 0 indicating that the adversary lost the game.
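The three phases of the game can be sketched as follows. Everything here is a simplified stand-in chosen for illustration: `reference_moe` is a toy function and not the real MOE, sub-devices are modeled as black-box callables, the tester queries each device once, and the `CheatCodeDevice` deliberately omits the master's secret sharing so that the winning condition becomes visible.

```python
MASK = (1 << 16) - 1   # toy 16-bit block (assumption)

def reference_moe(key, pt):
    """Toy stand-in for the trusted reference cipher; NOT the real MOE."""
    return (3 * (pt ^ key)) & MASK

def bitwise_majority(outputs, n_bits=16):
    result = 0
    for bit in range(n_bits):
        if 2 * sum((c >> bit) & 1 for c in outputs) > len(outputs):
            result |= 1 << bit
    return result

class CheatCodeDevice:
    """Hypothetical Trojaned sub-device that sees the raw plaintext
    (no sharing) and flips one output bit on a trigger input."""
    def __init__(self, trigger):
        self.trigger = trigger
    def __call__(self, key, pt):
        c = reference_moe(key, pt)
        return c ^ 1 if pt == self.trigger else c

def single_pair_tester(sub_devices, key, p_star=0):
    """Simplified testing phase: one query of a fixed plaintext per device."""
    c_star = reference_moe(key, p_star)
    return all(dev(key, p_star) == c_star for dev in sub_devices)

def rob_game(tester, adversary, sub_devices, key, eta):
    """Return 1 if the adversary wins the robustness game, 0 otherwise."""
    if not tester(sub_devices, key):          # testing phase fails
        return 0
    prev = None
    for _ in range(eta):                      # eta adversarial executions
        pt = adversary(prev)
        prev = bitwise_majority([dev(key, pt) for dev in sub_devices])
        if prev != reference_moe(key, pt):    # Trojan activated
            return 1
    return 0

KEY = 0x2B7E
attacker = lambda prev: 0xBEEF               # always sends the trigger
honest = [reference_moe] * 3
trojaned = [CheatCodeDevice(0xBEEF) for _ in range(3)]
print(rob_game(single_pair_tester, attacker, honest, KEY, eta=5))    # 0
print(rob_game(single_pair_tester, attacker, trojaned, KEY, eta=5))  # 1
```

The second call returns 1 precisely because the modeled devices see the adversary's plaintext directly; in the construction described above, the sharing done by the master is what removes this handle.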

Simplified testing via T
One key feature of our Trojan-resilient cipher is that it allows for a simplified testing, where only the cipher's final input/output pairs must be tested. This is in contrast with [DFS16], where all the communication between the mini-devices produced by A needs to be tested for correctness. Our construction achieves this beneficial property due to the fact that there is no direct communication between the mini-devices $D_{i,0}$ and $D_{i,1}$. This is possible because the round operations are linear over some group, and hence they can be computed on secret shared values without communication between $D_{i,0}$ and $D_{i,1}$. In addition, communication between $(D_{i,0}, D_{i,1})$ and $(D_{i+1,0}, D_{i+1,1})$ has to go via a fresh secret sharing (i.e., via the trusted master), and hence $(D_{i,0}, D_{i,1})$ cannot directly communicate with $(D_{i+1,0}, D_{i+1,1})$ either.

Tester $T^{D^1(K,\cdot),\ldots,D^\lambda(K,\cdot)}(1^k, \lambda, t, K)$:
    Let $P^*$ be some plaintext in the plaintext space and $C^* = MOE(K, P^*)$.
    For $j \in [\lambda]$ repeat the following:
        Sample $t_j \leftarrow [t]$ uniformly at random and repeat $t_j$ times:
            Query $P^*$ to the oracle $D^j(K, \cdot)$ and denote by C the returned value.
            If $C \neq C^*$ return false.
    Return true.

Figure 3: The simplified tester T.

Tester $T_{DSF}(1^k, \lambda, t)$:
    Set the initial state of the devices $\{D_i\}_i$ to 0 (e.g., sets the key of the devices).
    For $j \in [\lambda]$ repeat the following:
        Sample $t_j \leftarrow [t]$ uniformly at random and repeat $t_j$ times:
            Sample a random 2-out-of-2 sharing r, s representing the shared input.
            If $View(D^j(r, s)) \neq View(\Gamma^j(r, s))$ return false.
    Return true.

Figure 4: The tester $T_{DSF}$ from [DFS16], slightly adjusted to our notation. $View(D^j(r, s))$ and $View(\Gamma^j(r, s))$ denote the random variable of the input/output behavior of $D^j$, respectively $\Gamma^j$, when run on inputs r, s and when the mini-circuits communicate through the master.

We formalize our testing procedure via the tester T presented in Figure 3, where we focus on the case when the key is stored on the trusted master. It is easy to extend the testing algorithm to the case when the key is stored on the untrusted devices. For comparison, the tester $T_{DSF}$ from [DFS16] is given in Figure 4. The main differences between these two testing algorithms are in the inner part of the for loop. While our simplified tester T only needs to check whether the plaintext/ciphertext pair matches the input/output produced by an honest evaluation of the MOE cipher (i.e., via comparison with the pair $(P^*, C^*)$), the original tester $T_{DSF}$ from [DFS16] is significantly more complicated. In particular, in $T_{DSF}$ we require that the entire input/output communication of the devices $D^j$ matches the corresponding ideal specification $\Gamma^j$. To this end, in Figure 4, $T_{DSF}$ uses the check $View(D^j(r, s)) = View(\Gamma^j(r, s))$.
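The input/output-only test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: devices are modeled as plain callables, `toy_cipher` is a hypothetical stand-in for the reference implementation (it is not MOE), and `make_time_bomb` is a hypothetical Trojan model that misbehaves after a given number of calls.

```python
import secrets

def simplified_tester(devices, reference, key, plaintext, t):
    """Return True iff every device agrees with the reference on `plaintext`
    during a random number t_j in [1, t] of repeated queries."""
    expected = reference(key, plaintext)        # C* = reference(K, P*)
    for device in devices:                      # j ranges over the lambda devices
        t_j = secrets.randbelow(t) + 1          # t_j sampled uniformly from [t]
        for _ in range(t_j):
            if device(key, plaintext) != expected:
                return False
    return True

def toy_cipher(key, plaintext):
    # Hypothetical stand-in for the trusted reference; NOT the MOE cipher.
    return (3 * (plaintext ^ key)) % 2**16

def make_time_bomb(trigger):
    """Hypothetical Trojan: honest for `trigger` calls, then flips one output bit."""
    calls = 0
    def device(key, plaintext):
        nonlocal calls
        calls += 1
        out = toy_cipher(key, plaintext)
        return out ^ 1 if calls > trigger else out
    return device

honest_ok = simplified_tester([toy_cipher] * 3, toy_cipher, 5, 7, 10)
trojan_ok = simplified_tester([toy_cipher, make_time_bomb(0)], toy_cipher, 5, 7, 10)
```

A time bomb with a large `trigger` may survive testing, which is why the number of online executions η has to stay below the test budget t.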

Trojan robustness of MOE
We are now ready to prove the Trojan robustness of our construction MOE, which is summarized in the theorem below.
Theorem 1. Let $t, \eta, \lambda, k \in \mathbb{N}_{>0}$ with η < t be natural numbers. For any malicious manufacturer A and $K \leftarrow \{0,1\}^k$ chosen uniformly at random, we have: $\Pr[ROB_{MOE}(T, A, \eta, t, \lambda, K) = 1] \le (\eta/t)^{\lambda/2}$, where the randomness is taken over the randomness of the $ROB_{MOE}$ game.
Proof. We follow the proof approach of [DFS16]. To this end we first argue that the inputs that the mini-devices $\{D^j_{i,b}\}_{i,j,b\in\{0,1\}}$ receive in the testing phase are identical to what they receive when run as part of the fully assembled device D. More precisely, we denote by $view^j_{i,b}(K, P)$ the random variable that represents the inputs taken by the mini-device $D^j_{i,b}$ when D is run with input plaintext P and secret key K. The randomness of $view^j_{i,b}(K, P)$ is taken over the randomness of the secret sharing done by the trusted master. The next lemma follows from the fact that the mini-devices only receive inputs that went through a secret sharing gate using uniform randomness for the sharing function, with ≡ denoting equivalence of the distributions.

Lemma 1. For any i, j and $b \in \{0,1\}$, for any plaintext inputs $P, P'$ and any secret key K, we have $view^j_{i,b}(K, P) \equiv view^j_{i,b}(K, P')$.

This lemma implies that the view of each mini-device is independent of the plaintext P. Hence, the distribution of inputs that $D^j_{i,b}$ receives in the testing phase is indistinguishable from the view that it receives in the online phase. Given the above, the proof of the theorem follows the proof of Theorem 1 in [DFS16] using a series of hybrid games.
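1. Game $ROB^1_{MOE}(T, A, \eta, t, \lambda, K)$: This is the same as the robustness game defined in Figure 2, except that we replace the mini-devices $D^j_{i,b}$ by some abstract circuit specification $\tilde{\Gamma}^j_{i,b}$ with the same input/output behavior as $D^j_{i,b}$. Hence, we directly get that the probability that A wins in this game is identical to the probability that A wins in $ROB_{MOE}(T, A, \eta, t, \lambda, K)$.
2. Game $ROB^2_{MOE}(T, A, \eta, t, \lambda, K)$: Denote by $\tilde{\Gamma}^j$ the specification of the j-th component consisting of $(\tilde{\Gamma}^j_{1,0}, \tilde{\Gamma}^j_{1,1}), \ldots, (\tilde{\Gamma}^j_{\kappa,0}, \tilde{\Gamma}^j_{\kappa,1})$ and the corresponding parts of the master circuit M that control these mini-circuits. $ROB^2$ differs from $ROB^1$ as follows. While in $ROB^1$ the game outputs 1 when, for some of the η iterations, the ciphertext output by D differs from the one of the reference implementation ($C_i \neq C'_i$), in $ROB^2$ we output 1 when for more than λ/2 of the λ components $\tilde{\Gamma}^j$ the ciphertext output by this component differs from the execution of the correct specification $\Gamma^j$. Since the output is part of the views, we get that: $\Pr[ROB^1_{MOE}(T, A, \eta, t, \lambda, K) = 1] \le \Pr[ROB^2_{MOE}(T, A, \eta, t, \lambda, K) = 1]$.
3. Game $ROB^3_{MOE}(T, A, \eta, t, \lambda, K)$: In this game we replace the malicious input plaintext P provided by A with some fixed plaintext $P'$. By Lemma 1 we get: $\Pr[ROB^2_{MOE}(T, A, \eta, t, \lambda, K) = 1] = \Pr[ROB^3_{MOE}(T, A, \eta, t, \lambda, K) = 1]$.
Using Lemma 4 in [DFS16] we can show that for η < t: $\Pr[ROB^3_{MOE}(T, A, \eta, t, \lambda, K) = 1] \le (\eta/t)^{\lambda/2}$, which concludes the proof of the theorem.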

Eliminating the randomness from the master M
The need for trusted randomness in the master may sound contradictory with the requirement of only using a very limited number of trusted gates. Trojan-secure randomness can however be obtained via a Trojan secure PRG as discussed in [DFS16]. To this end, a Trojan secure PRG can be constructed by XORing the output of multiple untrusted PRGs.
As long as one of the PRGs works correctly, the result of the XORing is guaranteed to be pseudorandom, and hence is sufficient to be used as randomness for internal protocol computations, i.e., for the sharing of intermediate values before entering the mini-devices. Concretely, for our use case we just XOR the results of λ/2 cryptographic PRGs, which achieves a security bound of $(\eta/t)^{\lambda/2}$, similar to the one we target for robustness. Since these PRGs do not need to have randomized inputs, their testing is also simple, as we only need to verify their final outputs rather than their intermediate computations.
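The XOR combiner can be illustrated as follows. This is only a sketch: SHAKE-128 from Python's standard library serves as a stand-in PRG (an assumption, the text does not fix a concrete PRG), and `stuck_prg` is a hypothetical model of a Trojan-infected generator that always outputs zeros.

```python
import hashlib

def shake_prg(seed: bytes, nbytes: int) -> bytes:
    # Stand-in PRG (assumption): SHAKE-128 expands the seed to nbytes.
    return hashlib.shake_128(seed).digest(nbytes)

def stuck_prg(seed: bytes, nbytes: int) -> bytes:
    # Hypothetical faulty PRG: ignores the seed and outputs only zeros.
    return bytes(nbytes)

def xor_combine(prgs, seeds, nbytes):
    """XOR together the outputs of several PRGs, each run on its own seed."""
    out = bytes(nbytes)
    for prg, seed in zip(prgs, seeds):
        block = prg(seed, nbytes)
        out = bytes(a ^ b for a, b in zip(out, block))
    return out

combined = xor_combine([shake_prg, stuck_prg], [b"seed-0", b"seed-1"], 16)
```

Here the combined stream equals the output of the single working PRG, so the stuck generator does not degrade it. The guarantee relies on independent seeds: two identical PRGs run on the same seed would cancel out.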
The main impact of replacing trusted randomness for sharing by randomness generated by a Trojan-resilient PRG is that we move to the computational setting. This means that we only consider PPT adversaries and incur an additional loss of negl(k) in the robustness bound compared to Theorem 1, where k is the security parameter. More precisely, we obtain the following corollary:
Corollary 1. Let $t, \eta, \lambda, k \in \mathbb{N}_{>0}$ with η < t be natural numbers. For any malicious PPT manufacturer A and $K \leftarrow \{0,1\}^k$ chosen uniformly at random, we have: $\Pr[ROB_{MOE}(T, A, \eta, t, \lambda, K) = 1] \le (\eta/t)^{\lambda/2} + negl(k)$, where the randomness is taken over the randomness of the $ROB_{MOE}$ game.

Description of MOE
In this section, we introduce a concrete instance of the approach introduced in Section 2 and propose a new block cipher called MOE. Its name stands for Multiplication Operated Encryption to reflect that all round operations are based on multiplication, namely a multiplication by a binary matrix and a modular multiplication by a constant. In order to mimic the API of the AES, it encrypts blocks of 128 bits using a 128-bit key. The encryption routine consists in iterating 4 times a step function made of 6 operations: $M \circ A_3^{-1} \circ K_{2i+1} \circ M \circ A_3 \circ K_{2i}$, for $0 \le i < 4$. The different operations, all working on the full 128-bit state, are as described below. A graphical representation of one step can be found in Figure 5.
$K_j$ is the key addition, i.e., the function $K_j : x \mapsto x \oplus K \oplus c_j$, where K is the 128-bit master key and the $\{c_j\}_{0 \le j < 9}$ are 128-bit round constants.
$A_3$ interprets the 128-bit internal state as the binary representation of an element of $\mathbb{Z}/2^n\mathbb{Z}$ and multiplies it by 3 modulo $2^n$.
M interprets the 128-bit internal state as an element of $\mathbb{F}_2^n$ and multiplies it with the 128 × 128 binary matrix M.
$A_3^{-1}$ is the compositional inverse of $A_3$ (the multiplication by the inverse of 3 modulo $2^n$).
Note that unlike in many ciphers such as the AES, the last call to M is not omitted. Another key addition ($K_8$) finishes the evaluation of the cipher.
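The round structure can be sketched in Python on a toy state size. Several points are assumptions made for illustration only: the order of the six operations inside a step (key addition, multiplication by 3, matrix, key addition, multiplication by the inverse of 3, matrix), the 8-bit state, and the values of `TOY_KEY`, `TOY_ROWS` and `TOY_CONSTS`, which are hypothetical and not the parameters of MOE.

```python
def mat_vec(rows, x):
    """Multiply a binary matrix (one integer bitmask per row) by x over F_2."""
    y = 0
    for i, row in enumerate(rows):
        y |= (bin(row & x).count("1") & 1) << i   # parity of the selected bits
    return y

def moe_like_encrypt(x, key, rows, consts, n):
    mask = (1 << n) - 1
    inv3 = pow(3, -1, 1 << n)              # inverse of 3 modulo 2^n (Python >= 3.8)
    for i in range(4):                     # 4 steps of 6 operations each
        x ^= key ^ consts[2 * i]           # key addition K_{2i}
        x = (3 * x) & mask                 # A_3
        x = mat_vec(rows, x)               # M
        x ^= key ^ consts[2 * i + 1]       # key addition K_{2i+1}
        x = (inv3 * x) & mask              # inverse of A_3
        x = mat_vec(rows, x)               # M (the last call is not omitted)
    return x ^ key ^ consts[8]             # final key addition K_8

# Hypothetical toy parameters: 8-bit state, unit lower-triangular (hence invertible) matrix.
N = 8
TOY_KEY = 0x5A
TOY_ROWS = [(1 << i) | ((0xB5 * (i + 1)) & ((1 << i) - 1)) for i in range(N)]
TOY_CONSTS = [(37 * j + 11) & 0xFF for j in range(9)]

ciphertexts = {moe_like_encrypt(x, TOY_KEY, TOY_ROWS, TOY_CONSTS, N) for x in range(1 << N)}
```

Since every operation is a bijection on the state, the toy instance maps the 256 plaintexts to 256 distinct ciphertexts.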
To fully specify our proposal, we need to describe the round constants $c_j$ together with the binary matrix M. Since, as we will prove later, a random matrix M has the necessary properties, we do not impose any specific structure on M beyond its invertibility: we simply generate a random matrix and check if it has full rank. To do so, we suggest following the approach proposed by the designers of LowMC [ARS + 15], which uses a self-shrinking generator [MS95] based on the LFSR of the Grain cipher [HJMM08]. The bits produced are used to fill the matrix row by row. We check the invertibility of the matrix once it is fully specified. If it is not invertible, we repeat the process using the next keystream bits until it is. Once the matrix is produced, we use the next bits of the keystream to form the nine 128-bit round constants $c_0, \ldots, c_8$.
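The rejection-sampling loop can be sketched as follows. Note the substitution: this sketch draws bits from Python's `random.Random` purely for illustration, not from the Grain-based self-shrinking generator described above, so it does not reproduce the actual matrix or constants of MOE. The helper names are hypothetical.

```python
import random

def rank_gf2(rows, n):
    """Rank over F_2 of a matrix given as a list of n-bit row bitmasks."""
    rows = list(rows)
    rank = 0
    for bit in range(n):
        pivot = next((r for r in range(rank, len(rows)) if rows[r] >> bit & 1), None)
        if pivot is None:
            continue                                  # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r] >> bit & 1:
                rows[r] ^= rows[rank]                 # eliminate the column elsewhere
        rank += 1
    return rank

def sample_invertible_matrix(n, rng):
    """Fill a matrix row by row and retry until it has full rank."""
    while True:
        rows = [rng.getrandbits(n) for _ in range(n)]
        if rank_gf2(rows, n) == n:
            return rows

rng = random.Random(0)                                # stand-in bit source (assumption)
matrix = sample_invertible_matrix(16, rng)
constants = [rng.getrandbits(16) for _ in range(9)]   # nine round constants drawn afterwards
```

A uniformly random binary matrix is invertible with probability close to 0.29, so only a few attempts are expected.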

Security claim
We claim 127 bits of security as long as the number of plaintext/ciphertext pairs available to the attacker is smaller than $2^{64}$. The claimed security level of MOE is thus reminiscent of the one for FX constructions like PRINCE [BCG + 12] or QARMA [Ava17]. For such ciphers, a generic attack with complexity $2^{k-d}$ exists when $2^d$ plaintext/ciphertext pairs are available, so that attacks are expected to become practical as d gets closer to n/2. Our claim is stronger in that we claim 127 bits of security for all d < 64 rather than 127 − d.
The reason for bounding the data complexity is two-fold. First, the Trojan resilience can only be ensured for a limited number of encryptions, namely the number of tests that have been performed. In this context, it does not make sense to provide full codebook security. Second, limiting the data complexity means we can use fewer rounds to prevent attacks faster than brute-force. It thus improves the performance of our algorithm. Besides, MOE is intended to run on devices with a low throughput (see for example Table 5 in Section 6). Thus, in practice, we do not expect manufacturers to actually enforce this data limitation "manually", it will instead be a side-effect of the speed of the intended platforms.

Comparison with recent design strategies
Especially in the last few years, we have witnessed an increasing number of innovative designs focusing on efficiency in MPC and FHE settings, as well as proof-friendly design strategies to be deployed, e.g., in smart contracts. On the one hand, those designs try to minimize the number of AND operations. On the other hand, and deviating further from traditional symmetric primitives, they are often defined over vector spaces over fields of odd characteristic, in particular over prime fields for odd primes.
Minimizing the number of AND-gates (which can be the number of AND operations per encrypted bit or the AND-depth) is the aim of designs like LowMC [ARS + 15] and RASTA [DEG + 18]. While RASTA achieves this by making a large part of the design nonce-dependent, LowMC uses partial non-linear layers (used previously in Zorro [GGNS13] for efficient masking) and fixed but randomly generated binary linear layers.
Working over larger fields has been brought forward by MiMC [AGR + 16], a cipher that uses a block-wide (or half-block-wide in the case of the Feistel variant) S-box defined as the cube mapping over the binary extension field. Actually, even though the motivation was very different, MiMC can be seen as an iterated version of the KN-cipher [NK95], which to our knowledge is the first cipher to use a large finite field for its design. MiMC was further generalized to GMiMC, deploying a generalized Feistel structure with the cube mapping as the round function [AGP + 19]. Such ciphers all have in common the fact that their round function has a simple univariate representation over the relevant finite field.
The idea of working over non-binary fields was used, e.g., in the cipher Rescue [AAB + 20]. Here, S-boxes are defined as power mappings over prime fields and the linear layer deploys MDS matrices over vector spaces over the prime field.
Combining the idea of partial non-linear layers with making use of larger and potentially non-binary fields is the core of the Hades design strategy [GLR + 20] (and of its Hash instantiation Poseidon [GKK + 19]). Here, the idea is to use some full S-box layer in the first and last rounds to ensure good statistical properties together with a number of partial S-box layers in the middle rounds to ensure diffusion and good algebraic properties.
Recall that the KN-cipher, which was designed to be provably secure against several statistical attacks, was broken with the invention of the interpolation attack [JK97], which makes use of its simple algebraic structure. This is a fate shared by some of its successors.
It turns out that it is non-trivial to analyse the new designs, in particular the ones defined over non-binary prime fields. Interestingly, some of those ciphers could be broken with algebraic attacks (e.g., Jarvis, Friday [ACG + 19]) due to the simplicity of their algebraic structure. For others, more care has to be taken with respect to the components, e.g., the MDS matrices used in Hades designs have to fulfill additional properties as described in [BCD + 20,KR20].
Our approach has some resemblance to the one used for LowMC in that MOE also relies on a large binary matrix generated randomly. However, unlike LowMC, the non-linear transformations we use are dense and have a very high algebraic degree, meaning that algebraic attacks over F 2 are not a threat. On the other hand, MOE differs significantly from algorithms like MiMC or Rescue: in those ciphers, all operations are defined in the same finite field in such a way as to allow the round function to have a simple univariate representation. It is not the case in MOE as the two main operations are defined over different mathematical structures (F n 2 and Z/2 n Z).

Justification of our design decisions: cryptographic properties of modular multiplication and general structure
The aim of this section is to evaluate the cryptographic properties of the multiplication by a constant modulo 2 n (in particular when the multiplier is equal to 3) in order to prove that its use in our proposal leads to a cipher that resists basic attacks.
We introduce the notation $A_\alpha$ given as follows:
Definition 1. Let α be some element of the modular ring $\mathbb{Z}/2^n\mathbb{Z}$. We denote by $A_\alpha$ the following function: $A_\alpha : x \mapsto \alpha \times x \bmod 2^n$. Note that $A_\alpha$ is a permutation if and only if α is odd.
We start by surveying the use of modular multiplication in previous designs (Section 4.1). We next discuss several cryptographic properties of A α , namely the algebraic degree for any odd α (Section 4.2), and, for α = 3, the differential and linear properties (Section 4.3 and Section 4.4). Beyond their significance for our design MOE, these analyses are also of independent interest. The only work formally treating of the cryptographic properties of the multiplication modulo 2 n we are aware of is the study of the so-called S-functions by Mouha et al. in [MVDP11] which focuses only on their differential properties.

A Brief History of Modular Multiplication in Cryptography
The use of modular multiplication in symmetric cryptography is not new: already in the 90s, it was used to build symmetric algorithms. Most prominently, the 64-bit block cipher IDEA [LM91] uses multiplications modulo $2^{16} + 1$ to mix subkeys with the internal state. The other operations it uses are the bitwise XOR and the addition modulo $2^{16}$. Its authors used operations belonging to different algebraic groups in order for them to be incompatible, that is, no pair of operations satisfies a distributive or associative law. This design principle was meant to increase the cryptographic strength. To the authors' credit, no attack against IDEA significantly improving upon brute-force has been found in the single-key model. IDEA inspired various other constructions, namely MESH, MMB, WIDEA-n and MARS [NRPV04, DGV93, JM09, BCD + 98]. Modular multiplication also appeared in other (now broken) ciphers that used it to mix the key with the internal state: Nimbus, MultiSwap and xmx [Mac00,Scr01,MNSV97]. It was also employed in the SHA-3 candidate Shabal [BCCM + 08]. An overview of some characteristics of these different algorithms is given in Table 1. Note that the multiplicative differential is a variant of the differential attack introduced in [BCJW02]. It studies the propagation through the cipher of pairs of the form (x, αx). It allowed attacks against Nimbus, MultiSwap and xmx.
We can learn some lessons from the previous uses of modular multiplication. On one hand, using this operation to mix the key with the internal state may lead to weak keys as in IDEA, WIDEA-n, and xmx. On the other hand, the multiplication with a constant provides interesting cryptographic properties, such as an algebraic degree and non-linearity increase, for a cost which may be as low as a single processor instruction.

Sum representation and algebraic degree
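The computation of $A_\alpha(x)$ can be represented using the binary representations of its operands as the following sum modulo $2^n$: $A_\alpha(x) = \sum_{i=0}^{n-1} \alpha_i \cdot (x \ll i) \bmod 2^n$, where $\alpha_i$ denotes bit number i of α and $x \ll i$ denotes x shifted by i positions towards the most significant bit.
We call it the sum representation. This tool is the core of our proof of the next theorem.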
Theorem 2. The maximum degree that A α can take is equal to n − 1 and it is reached if α is congruent to 3 modulo 4.
Proof. By referring to the sum-representation, we can easily see that bit number i of A α (x) depends linearly on bits x i and non-linearly (in a broad sense) on x 0 to x i−1 so its maximum degree is equal to i. Hence, the most significant bit is at most of degree n − 1. We now consider the modular multiplication by an integer α of the form α = 3 + 4k with k ∈ Z. We start by treating the case k = 0 and then move to the general case.
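We introduce the function $\mu_i(x)$ that corresponds to the value of the carry of bit number i. Namely, writing $y = 3 \times x \bmod 2^n$, the function $\mu : \mathbb{F}_2^n \to \mathbb{F}_2^n$ is given by $\mu(x) = y \oplus x \oplus (x \ll 1)$, so that $y_i = x_i \oplus x_{i-1} \oplus \mu_i(x)$ (with $x_{-1} = 0$). It is easy to see that the coordinates of µ(x) follow the recurrence formula: $\mu_0(x) = \mu_1(x) = 0$ and $\mu_{i+1}(x) = \mathrm{maj}(x_i, x_{i-1}, \mu_i(x))$, where maj(a, b, c) = ab ⊕ bc ⊕ ac denotes the majority function. We observe that the first non-zero carry is $\mu_2(x)$ and is equal to $x_0 x_1$, and in particular $\deg(\mu_2(x)) = 2$. We also have that: $\mu_{i+1}(x) = x_i x_{i-1} \oplus (x_i \oplus x_{i-1})\,\mu_i(x)$. Since $\mu_i(x)$ does not depend on $x_i$, the monomials of $x_i \mu_i(x)$ cannot cancel, and we obtain that: $\deg(\mu_{i+1}) = \deg(\mu_i) + 1$. This gives the following relations linking the degrees of the bits of $A_3$: $\deg(y_0) = \deg(y_1) = 1$ and $\deg(y_i) = i$ for $i \ge 2$. This proves that the maximal degree is reached for α = 3. We remark that $y_i$ contains only one monomial of degree i (the product of $x_0$ to $x_{i-1}$), so we can write $y_i(x)$ as $y_i(x) = x_0 x_1 \cdots x_{i-1} \oplus f_i(x)$, where $f_i$ is of degree at most i − 1.

Let n > 1. We now consider the case k > 0. The mapping $A_\alpha$ can be expressed as $A_\alpha(x) = (3 \times x) \boxplus (4k \times x)$, where ⊞ denotes the addition modulo $2^n$. As previously we denote y = 3 × x, and we also use the notation t = 4k × x and $z = A_\alpha(x) = y \boxplus t$. We also let $h = \min\{i > 1 : \alpha_i = 1\}$.

Using these notations, the value of $A_\alpha(x)$ is given by the modular sum $z = y \boxplus t$. Similarly to the case α = 3, the bit with index i in z is given by $z_i = y_i \oplus t_i \oplus \delta_i$, where $\delta_i$ denotes the carry and follows the relation: $\delta_0 = 0$ and $\delta_{i+1} = \mathrm{maj}(y_i, t_i, \delta_i)$. We observe that $\deg(\delta_{h+1}) = \deg(y_h) = h$, and this is our base case. We make the induction hypothesis that for all j < i we have $\deg(\delta_j) \le j - 1$. To show that the property holds for bit number i, we make the following additional remarks, deduced from the sum representation:
• $\deg(t_i) \le i - 2$ since it depends linearly on $x_{i-2}$ and non-linearly on $x_0$ to $x_{i-3}$,
• $\delta_i = y_{i-1} t_{i-1} \oplus y_{i-1} \delta_{i-1} \oplus t_{i-1} \delta_{i-1}$.
The last term depends only on $x_0$ to $x_{i-2}$, so its degree is lower than or equal to i − 1. We then focus on the first two terms. With the remark made previously, we can rewrite them as: $y_{i-1}(t_{i-1} \oplus \delta_{i-1}) = (x_0 x_1 \cdots x_{i-2} \oplus f_{i-1})(t_{i-1} \oplus \delta_{i-1})$, where $f_{i-1}$ is of degree at most i − 2 and where $t_{i-1}$ and $\delta_{i-1}(x)$ depend respectively on $x_0, \ldots, x_{i-3}$ and $x_0, \ldots, x_{i-2}$. This new expression makes clear that the first two terms of $\delta_i(x)$ are at most of degree i − 1, and so is $\delta_i(x)$. Thus, $z_i = y_i \oplus t_i \oplus \delta_i$ contains the degree i monomial $x_0 x_1 x_2 \cdots x_{i-1}$ (present in $y_i$) that is not canceled since the other terms are of smaller degree. This concludes the proof.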
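Theorem 2 can be checked exhaustively for small n. The following sketch computes the algebraic normal form of each output bit of a multiplication by α with a binary Möbius transform and reports the degrees. The helper names are hypothetical, and the check is a brute-force experiment rather than part of the proof.

```python
def algebraic_degree(truth_table, n):
    """Degree of the ANF of a Boolean function given by its 2^n-entry truth table."""
    anf = list(truth_table)
    for i in range(n):                               # binary Moebius transform
        for x in range(1 << n):
            if x >> i & 1:
                anf[x] ^= anf[x ^ (1 << i)]
    return max((bin(mono).count("1") for mono in range(1 << n) if anf[mono]), default=0)

def coordinate_degrees(alpha, n):
    """Degrees of the n output bits of x -> alpha * x mod 2^n, least significant first."""
    mask = (1 << n) - 1
    return [
        algebraic_degree([((alpha * x) & mask) >> i & 1 for x in range(1 << n)], n)
        for i in range(n)
    ]

degrees_for_3 = coordinate_degrees(3, 6)
reached = all(max(coordinate_degrees(a, 5)) == 4 for a in range(3, 32, 4))
```

For α = 3 and n = 6, bit i has degree i for i ≥ 2, matching the relations used in the proof, and every multiplier congruent to 3 modulo 4 reaches degree n − 1 for n = 5.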

Differential properties
An algorithm for DDT coefficients. The following theorem allows an efficient computation of any ddt coefficient of the multiplication by 3. The full proof is in Appendix C.1 but we also provide a proof sketch below.
Theorem 3. Let $A_3(x) = 3 \times x \bmod 2^n$ for n ≥ 2 and let $D^n_A(a, b)$ be a coefficient of the Difference Distribution Table (DDT) of $A_3$, i.e., $D^n_A(a, b) = \#\{x \in \mathbb{F}_2^n : A_3(x \oplus a) \oplus A_3(x) = b\}$. This quantity can be evaluated in time O(n) by an induction over the bit positions of a and b.

Sketch. We study a function µ which is extended-affine equivalent to $A_3$ rather than $A_3$ itself. We denote its i-th coordinate $\mu_i$. The proof works inductively over the index of the coordinate considered, from the lowest weight to the highest weight. At step i, we count the number of solutions of $\mu_i(x + a) + \mu_i(x) = b$ as a function of the number of solutions obtained at the previous step. The key trick in our proof is to separate the solutions of $\mu_i(x + a) + \mu_i(x) = b$ where $\mu_i(x) = 0$ from those where $\mu_i(x) = 1$. We deduce an inductive formula for computing any coefficient in the DDT of µ. We finally use the extended-affine equivalence of µ and $A_3$ to obtain a similar algorithm for $A_3$.

Sierpinski triangles. The induction described in Theorem 3 has a surprising consequence: the indicator function of the DDT of $\eta(x) = A_3(x) \oplus x$ forms a pattern similar to the Sierpinski triangle, as can be seen in Figure 6. We provide more details in Appendix C.2.

Bounding the DDT coefficients. Theorem 3 will also allow us to bound $D^n_A(a, b)$ using only the value of a. To prove this, we first need the following lemma.
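For small n, the DDT coefficients can be obtained by exhaustive search, which is convenient for cross-checking any faster algorithm. The sketch below is this brute-force baseline in time O(2^n) per row, not the O(n) induction of Theorem 3; it assumes XOR differences on both input and output, and the function name is hypothetical.

```python
from collections import Counter

def ddt_row(a, n, alpha=3):
    """Map each output difference b to #{x : alpha*(x ^ a) ^ alpha*x = b mod 2^n}."""
    mask = (1 << n) - 1
    return Counter(
        ((alpha * (x ^ a)) & mask) ^ ((alpha * x) & mask) for x in range(1 << n)
    )

row_msb = ddt_row(8, 4)      # input difference on the most significant bit, n = 4
row_5 = ddt_row(5, 6)
```

The row for the most significant bit has a single entry of value 2^n, which is the probability-1 differential that any multiplication by an odd constant admits.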
where ν i (a, b) is equal to either 0 or 1 and is given by: Sketch. Theorem 3 provides us with a way of computing D n A (a, b) for any a, b. It works by scanning the bits of a and b from lowest weight to highest weight and updating a starting value by either: 1. setting it to zero, 2. leaving it unchanged, or 3. increasing it, in which case it is at most doubled.
A careful study shows that this case 3 only occurs when ν i (a, b) = 1. The bound follows.
As before, the details of the proof of Lemma 2 are in Appendix C.3. This bound can be exploited via a simple observation stated in the following lemma.
Lemma 3. The following implication always holds for i > 1: If we add the bottom equation from the left and the top one from the right, we obtain $a_{i-1} \oplus a_{i-2} = 0$. Thus, if $a_{i-1} \neq a_{i-2}$, it is impossible for both systems to be satisfied.
A naive combination of this lemma with Lemma 3 would give that $\log_2 D^n_A(a, b) \le 1 + n - |\{i \ge 1 : a_i \neq a_{i-1}\}|$. Indeed, for each position in which $a_i \neq a_{i-1}$, Lemma 2 imposes that the maximum value of the sum of $\nu_i(a, b)$ is decreased by 1. However, this approach does not work out of the box. If $a_{i-2} \neq a_{i-1}$ and $a_{i-1} \neq a_i$, then it is always true that $\nu_i(a, b) + \nu_{i+1}(a, b) \le 2$ and that $\nu_{i+1}(a, b) + \nu_{i+2}(a, b) \le 2$ but it is not always true that Thus, bounding the maximum value of the sum by n minus the number of i such that $a_{i-1} \neq a_i$ is wrong.
Nevertheless, we can write a similar bound if we add the condition that if a i−1 = a i then we do not take into account whether a i = a i+1 . We then take into account the quantity aw(a) defined as: and it then holds that: The number of non-zero coefficients in the NAF of a is called the arithmetic weight of a and is denoted aw(a). For a ∈ F n 2 , aw(a) is the arithmetic weight of the integer with a as its binary representation.
We have shown that it is possible to bound $D^n_A(a, b)$ using the arithmetic weight of a. We formalize this result into the following theorem.
Theorem 4 (Main Bound). Let a be an element in F n 2 and let D n A (a, b) be a coefficient of the ddt of A 3 : x → 3 × x mod 2 n for n > 2. Then, it holds that: In particular, it is possible to bound D n A (a, b) independently from b.
This bound is illustrated in Figure 7(a) where: • the first curve was obtained by actually computing the ddt of A 3 to find the maximum coefficient in each row, and • the second curve was obtained using Theorem 4.
As we can see, the actual maximum is indeed lower than the bound of Theorem 4. Though coarse, this bound is sufficient in practice to serve as the basis for the design of a block cipher as we show in Section 5.1.

Linear properties
Theorem 5. Let (a, b) be a pair of input and output masks. Furthermore, let: i be such ..., b i , 0, ..., 0) and:  be a linear function. The Walsh coefficient Otherwise, it can be deduced from values of W A (., .) for smaller masks using: The proof consists purely in computations given in Appendix C.5.

On the unsuitability of SPN constructions
Our aim is to design a 128-bit cipher where all operations are linear (in different groups) to ease the implementation of secret sharing inside each round. The most costly part in such an implementation is the secret sharing itself, so we want our cipher to require as few sharings/recombinations as possible while remaining secure. A natural approach would consist in using the wide trail strategy to design a secure Substitution-Permutation Network (SPN). However, we demonstrate here that this approach would require a high number of sharings and recombinations to give a secure cipher. By contrast, our proposal, which relies on applying both a multiplication by 3 and a binary matrix on the full state of the cipher, can be proven resistant to differential attacks with a reasonable complexity.

The wide trail argument was introduced by Daemen and Rijmen, who famously used it to design the AES [AES01]. It allows proving a simple bound on the maximum expected probability of a differential trail covering r rounds of an S-Box-based cipher in two steps. First, we show that the maximum probability of a differential for the S-Box used is upper bounded by a certain quantity: $\max_{a \neq 0,\, b} \Pr[S(x \oplus a) \oplus S(x) = b] \le u\,2^{-n}$, where u is the differential uniformity [Nyb94] of S. Then, we show that any differential trail covering r rounds activates at least a(r) S-Boxes and conclude that the expected probability of any single trail covering r rounds is at most $(u\,2^{-n})^{a(r)}$. When the cipher is an SPN, we can use the branching number b of the linear layer to obtain a(2) = b.
We could build an n-bit block cipher using multiplication-based S-Boxes operating on m bits, where m divides n. As we allow ourselves arbitrarily complex linear layers, we can have the optimal bound a(2) = n/m + 1 by building the linear layer from an MDS code.
As we need to be able to compute the DDT of the S-Boxes considered and since we need that m divides n = 128, m = 16 is the maximum size of the S-Boxes we consider.

Using multiplications as S-Boxes.
The simplest approach would consist simply in using multiplications by constants modulo $2^m$ as S-Boxes. The non-linear operation would then correspond to the multiplication by a diagonal matrix of elements in $\mathbb{Z}/2^m\mathbb{Z}$. However, for any odd $\alpha \in \mathbb{Z}/2^m\mathbb{Z}$ (the only case in which the multiplication is a permutation), $\Pr[(\alpha \times (x \oplus 2^{m-1})) \oplus (\alpha \times x) = 2^{m-1}] = 1$, meaning that the differential uniformity of a multiplication by a constant is always maximum. As a consequence, a wide trail argument cannot work; it would only bound the maximum expected differential probability by 1.
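The probability-1 differential on the most significant bit is easy to confirm exhaustively. The sketch below checks it for every odd multiplier with m = 8; the function name is hypothetical.

```python
def msb_differential_holds(alpha, m):
    """True iff flipping the top input bit always flips exactly the top output bit."""
    mask = (1 << m) - 1
    top = 1 << (m - 1)
    return all(
        ((alpha * (x ^ top)) & mask) ^ ((alpha * x) & mask) == top
        for x in range(1 << m)
    )

all_odd_ok = all(msb_differential_holds(alpha, 8) for alpha in range(1, 256, 2))
```

The reason is that flipping the top bit adds or subtracts 2^(m−1), and α × 2^(m−1) is congruent to 2^(m−1) modulo 2^m for odd α, which again only flips the top bit.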

Building S-Boxes with several multiplications.
A simple fix for this issue is to build the S-Boxes by combining two multiplications modulo 2 m interleaved with a multiplication with a binary matrix of size m × m. The multiplication coefficients and the binary matrix could further be optimized to lower the differential uniformity of the whole construction as much as possible. However, while such an approach could be used to build a secure block cipher, it would lead to a higher number of sharing/recombinations than the design strategy we describe in Section 3.
Indeed, as first shown by the attacks targeting SPNs with a "SASAS" structure [BS01, BS10], which were later generalized to more rounds [BKP16], at least 4 S-Box layers are needed to prevent integral attacks. Precisely, the structural attack against SASAS from [BS01,BS10] only needs $2^{2m}$ plaintext/ciphertext pairs which, for m ≤ 16, is well under our cap of $2^{64}$ plaintext/ciphertext pairs. Furthermore, each such round would contain four secret sharings and recombinations, as each round would consist in a key addition, a multiplication in $(\mathbb{Z}/2^m\mathbb{Z})^{n/m}$, a multiplication by a binary matrix of size n × n, another multiplication in $(\mathbb{Z}/2^m\mathbb{Z})^{n/m}$ to finalize the evaluation of the S-Box layer and, finally, another multiplication by a binary matrix of size n × n to provide diffusion between the S-Boxes.
On top of this, several more rounds would need to be appended to this construction to take into account the fact that a parallel layer of S-Boxes lends itself to an efficient key recovery over several rounds. Indeed, the key material at the input/output of each S-Box can be brute-forced separately and then recombined in a second step as done, e.g., with the partial sum technique introduced in [FKL + 01].
Thus, each round requires 4 sharing/recombinations, at least 3 are needed to prevent distinguishers and then at least another 2 such rounds to prevent key recoveries. The total number of sharing/recombinations in each cipher evaluation is thus at least equal to 20. On the other hand, while our proposal requires 2 sharing/recombinations per round and 6 rounds to prevent distinguishers, the full diffusion of its round function implies that 2 additional rounds are sufficient to prevent key recovery. Thus, it needs a total of only 16 sharing/recombinations per encryption.

Security analysis of MOE
In this section, we list the results of our attempts at attacking MOE as well as the justification of its security against various attacks. Our best distinguishers are listed in Table 2. Key recovery attacks are made extremely difficult by the strong diffusion of the M layer and the fact that A 3 operates on the full state. We leave the description of such attacks as an open problem. We additionally assessed experimentally the security of small-scale variants of MOE against differential and linear attacks. The results we obtained are consistent with the argument outlined below and give us confidence in the sanity of our design approach. These experiments are described in Section 5.2.

Proof of security against single-trail differential attacks
In this section, we bound the probability of differential trails in MOE. We show that the majority of binary matrices lead to a cipher that does not have high-probability characteristics after only a few rounds. The starting point of our reasoning is the bound established in Theorem 4: informally, the more changes in the input difference, the lower the probability of the characteristic. Consequently, a lower bound on the number of changes in the vectors in each differential trail implies a lower bound on the probability of all characteristics. This observation made us opt for a step function that alternates A 3 and its inverse for the non-linear layer. In this way, the probability that a differential characteristic covers the 3 operations A −1 3 , M and A 3 can be bounded by a quantity depending on the arithmetic weights of the output of A −1 3 and the input of A 3 . Since these two vectors are related by M , we can reformulate our problem as finding the minimum number of changes present in the input and output of the binary matrix M . Given the similarities of this notion with the one of differential branch number, we denote this as the Change Branch Number (CBN) of M . We denote by C k n the number of n-bit vectors with an arithmetic weight of k. Its value is given by the following theorem.
Theorem 6. Let n be a positive integer. The number of n-bit vectors with an arithmetic weight of exactly k is given by:
$$2^{k+1}\binom{n-k-1}{k} + 2^{k}\binom{n-k-1}{k-1}.$$
Proof. To ease the enumeration, we split the solutions into two sets, depending on whether a change is present in the last 2 bits or not. If not (first set), exactly k changes start in the first n − 2 bits of the vector. If a change is positioned in the last 2 bits (second set), we have k − 1 changes in the first n − 2 bits, and the last change of this smaller vector must start at the (n − 3)-th bit at the latest.⁷ In the following, we denote by $P^{\eta}_{\kappa}$ the number of ways of choosing a valid set of starting indexes for κ changes among η possible indexes.
Let us first consider the first set. To build a solution in this set, we start by choosing the positions of the k changes in the (n − 2)-bit vector ($P^{n-2}_{k}$ possible choices) and then fix their values ("01" or "10"). Once these elements are fixed, all the bits between two changes are uniquely determined: there is only one solution that contradicts neither the total number of changes nor their starting positions. However, the bits following the last change are less constrained and can take 2 values: either all-zero or all-one. This implies that the size of the first set is equal to $P^{n-2}_{k} \times 2^{k} \times 2$.
We now consider the second set. A change is present in the last 2 positions, so after positioning the other k − 1 changes only one solution is possible for the other bits. There are $P^{n-3}_{k-1}$ possibilities for positioning the k − 1 changes, and each of the k changes can take 2 values. Consequently, the size of the second set is $P^{n-3}_{k-1} \times 2^{k}$.
We finally have to determine the value of $P^{\eta}_{\kappa}$. The problem of positioning the start indexes of κ changes in a vector of η bits can be seen as partitioning η − κ bits with the following conditions, where the $x_i$ are integers corresponding to the number of bits separating two change starts:
$$x_0 + x_1 + \cdots + x_\kappa = \eta - \kappa, \qquad x_0 \geq 0, \quad x_\kappa \geq 0, \quad x_i \geq 1 \ \text{for} \ 0 < i < \kappa,$$
where $x_0$ and $x_\kappa$ represent the number of bits before the first change and after the last change, respectively, and so can be null, whereas the other $x_i$ are at least equal to 1 as a change is made of 2 bits. By subtracting 1 from all the $x_i$ with 0 < i < κ, the conditions can be reformulated as follows:
$$x_0 + x_1 + \cdots + x_\kappa = \eta - 2\kappa + 1, \qquad x_i \geq 0 \ \text{for} \ 0 \leq i \leq \kappa.$$
The number of solutions of this equation is given by a classical combinatorial result (counting the ways of writing an integer as an ordered sum of κ + 1 non-negative integers) and is equal to
$$P^{\eta}_{\kappa} = \binom{\eta - \kappa + 1}{\kappa}.$$
This concludes the proof, as it shows that the number of n-bit vectors with exactly k changes is:
$$2^{k+1} P^{n-2}_{k} + 2^{k} P^{n-3}_{k-1} = 2^{k+1}\binom{n-k-1}{k} + 2^{k}\binom{n-k-1}{k-1}.$$
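The counting argument of the proof can be checked by brute force for small n. The sketch below is only an illustration under one assumption: a "change" is taken to be a pair of adjacent differing bits ("01" or "10"), and changes are located by a greedy scan so that they never overlap. The function names are hypothetical. The closed form combines the sizes of the two sets from the proof, with the number of valid placements of κ change starts among η indexes taken as C(η − κ + 1, κ).

```python
from math import comb


def count_changes(bits):
    """Greedy count of non-overlapping 2-bit changes ("01" or "10")."""
    i, changes = 0, 0
    while i + 1 < len(bits):
        if bits[i] != bits[i + 1]:
            changes += 1
            i += 2  # a change occupies two bits
        else:
            i += 1
    return changes


def binom(a, b):
    """Binomial coefficient, defined as 0 outside the range 0 <= b <= a."""
    return comb(a, b) if 0 <= b <= a else 0


def closed_form(n, k):
    """Size of the first set plus size of the second set from the proof."""
    first = binom(n - k - 1, k) * 2 ** (k + 1)   # P(n-2, k) * 2^k * 2
    second = binom(n - k - 1, k - 1) * 2 ** k    # P(n-3, k-1) * 2^k
    return first + second


def brute_force(n, k):
    """Enumerate all n-bit vectors and count those with exactly k changes."""
    total = 0
    for v in range(2 ** n):
        bits = [(v >> j) & 1 for j in range(n)]
        total += count_changes(bits) == k
    return total
```

Summing `closed_form(n, k)` over all k gives 2^n, as expected for a partition of the n-bit vectors.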

From the CBN of a random permutation to the one of M
To estimate the expected CBN of a random linear permutation, we first compute the distribution of the CBN of random (non-linear) permutations using Theorem 6. In a second step, we checked experimentally that this distribution is unchanged when we restrict ourselves to random linear permutations.
Recall that aw(x) is the arithmetic weight of $x \in \mathbb{F}_2^n$, and that we denote the change branch number of a transformation T by CBN(T). The CBN of a random permutation S verifies the following: The approximation stems from the assumption that the input and the output of S are independent. The inequality on aw(x) + aw(S(x)) has to hold for all x so, assuming independence, its probability is the product of the probabilities that it holds for each specific x. These factors are then grouped according to the value of aw(x). The expression of the probability that the change branch number is equal to t is easily deduced.
We experimentally checked this result. Table 3 shows the distribution of the CBN of 500 matrices picked at random from $\mathrm{GL}(24, \mathbb{F}_2)$ along with the expected distribution deduced from Formula 1. The two distributions are very close. We found similar results for values of n up to 28. Consequently, Formula 1 gives an accurate estimate of the CBN one can expect for a random matrix of $\mathrm{GL}(n, \mathbb{F}_2)$.
Table 3: Comparison of the CBN distribution deduced from Formula 1 with the experimental distribution obtained with 500 random matrices of $\mathrm{GL}(24, \mathbb{F}_2)$.
Further, we observe that the distribution reduces to two values when we increase n, as illustrated in Figure 8. For instance, for n = 64, only the values of the CBN equal to 11 and 12 have a meaningful probability. For n = 128, which is the case we are interested in, we have a change branch number equal to 24 roughly 80% of the time and a change branch number equal to 23 for the remaining 20%. Overall, most mappings share the same (or a very close) branch number, and for high values of n the distribution reduces to 2 values. Note that the CBN is upper bounded by the minimal distance. For n = 128, the Gilbert-Varshamov bound gives that a random linear code has with high probability a minimum distance of 31, while we found that the CBN will be 24: our results are consistent with this bound.

From the CBN to a bound on differential characteristics
By combining this CBN bound with Theorem 4, we obtain a bound on the probability of any differential characteristic on one step of MOE. We denote by $\delta_1$, $\delta_2$ and $\delta_3$ the differences at the input of $A_3^{-1}$, $M$ and $A_3$ respectively, and by $\delta_4$ the difference at the output of $A_3$: By iterating the 4 operations $A_3^{-1}$, $M$, $A_3$ and $M$ three times, we obtain a succession of operations for which we can prove that there are no differential characteristics with probability greater than $2^{-66}$. Given the cap on the data complexity we consider, this is sufficient to prevent the existence of distinguishers based on differential characteristics. A reasonable choice is therefore to fix the number of non-linear layers to 8: 6 to be safe against differential characteristics, plus 2 as a security margin.⁸ Recall that M is dense, so that inverting 1 round, even if only partially, will require substantial key guessing, which makes us believe that our choice is sensible.
Furthermore, while we proved that there exist no exploitable differential characteristics covering 6 rounds of MOE, whether there are in fact such characteristics covering only 4 rounds is an interesting open problem. Indeed, in light of our experiments in Section 5.2, such characteristics may not even exist.
So-called multiplicative differentials [BCJW02] have been used to easily find differential trails in some ciphers using modular multiplications. As we prevent the existence of any high probability differential trail, we are in particular safe from those found in this way.

Experimental results on small scale variants of MOE
Unfortunately, we were not able to find a clean argument proving the security of our algorithm against linear attacks as we did for differential attacks. However, due to the fact that all of the operations of MOE operate on the full state, it is very easy to design variants of MOE operating on smaller blocks.
As detailed below, we seized this opportunity and conducted experiments on variants of our cipher with block sizes from 8 to 16. In short, we experimentally computed the maximum coefficient in the DDT and in the LAT of $\mathrm{MOE}^n$ (n being the block size) for different keys, matrices M and round constants. Note that while cryptographers usually work on differential trails (our security argument against differential attacks relies on them as well), this approach deals with differentials and linear approximations directly. Put differently, such experimental results are not directly based on the study of patterns propagating throughout the rounds, and they therefore have the advantage of taking into account the possible clustering of differential or linear trails. In what follows, we describe our approach in more detail.
Recall that we limit the data complexity of the attacks we want to prevent to $2^{n/2}$. The number of plaintext/ciphertext pairs needed to mount a differential distinguisher for a block cipher instance $E_k$ is essentially: and, for a linear attack: Both probabilities are taken over all possible keys. In the differential case, we have: so that we can estimate a bound on the maximum of this probability by computing the maximum coefficient in the DDT of several keyed instances $\mathrm{MOE}^n_{k_i}$ for $k_i \in K$, $K \subset \mathbb{F}_2^n$, and looking at the maximum of these quantities, which we denote $q_d$. As we cannot consider all possible keys, we assume $q_d(\mathbb{F}_2^n) \approx q_d(K)$ for the subset K we experimentally consider. We want to have $D_{\text{differential}} > 2^{n/2}$. We deduce from Equation (2) that this is equivalent to having the inequality $\max_{a \neq 0,\, b} \Pr[E_k(x \oplus a) \oplus E_k(x) = b] < 2^{-n/2}$. Using what we just established, this maximum is bounded by $2^{-n} q_d(\mathbb{F}_2^n) \approx 2^{-n} q_d(K)$. In order to prevent the existence of a differential distinguisher using fewer than $2^{n/2}$ plaintext/ciphertext pairs, it is thus sufficient to have $2^{-n} q_d(K) < 2^{-n/2}$ or, equivalently, $\frac{2}{n} \log_2(q_d(K)) < 1$. Similarly, for the linear case, we look at the maximum coefficient in the LAT of several instances $\mathrm{MOE}^n_{k_i}$ and then keep the maximum of those quantities. We denote the result $q_\ell$, and we assume $q_\ell(\mathbb{F}_2^n) \approx q_\ell(K)$ for the subset K we experimentally consider.
Again, our aim is to have $D_{\text{linear}} > 2^{n/2}$. We deduce from Equation (3) that it is equivalent to having: a condition which we re-write: We have established that an estimate of the maximum of this probability can be bounded with $2^{-n-1} q_\ell(\mathbb{F}_2^n) \approx 2^{-n-1} q_\ell(K)$; it is therefore sufficient to have: which is equivalent to:
To experimentally assess the security of $\mathrm{MOE}^n$ for smaller values of n, we thus proceed as follows. For each n, we looked at 20 different matrices $M_i$ and sets of round constants and, for each such instance, at a set of 20 different master keys denoted $K_i$. We computed the maximum coefficient in the DDT and the LAT of the corresponding permutations and deduced $q_d(K_i)$ and $q_\ell(K_i)$. We then plotted the average, minimum and maximum of $\frac{4}{3n} \log_2(q_\ell(K_i))$ and $\frac{2}{n} \log_2(q_d(K_i))$ in Figures 9 and 10 respectively. As we can see, for 3-round $\mathrm{MOE}^n$, some linear and differential distinguishers with a data complexity under $2^{n/2}$ may exist, as neither $\frac{4}{3n} \log_2(q_\ell(K))$ nor $\frac{2}{n} \log_2(q_d(K))$ is consistently under 1 for the values of n considered. Furthermore, they seem to stabilize just above this number as n increases. Still, we can deduce that any differential or linear distinguisher covering 3 rounds would need an amount of data close to $2^{n/2}$. We stress that such distinguishers could not be improved using a cluster of differential (or linear) trails, as their effect is already taken into account by these experiments.
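As an illustration of the two quantities measured in these experiments, the sketch below computes the largest DDT and LAT coefficients of a small permutation given as a lookup table. It is not the code used for Figures 9 and 10: the function names are hypothetical, and the LAT convention (a signed sum of ±1 over all inputs, so that a coefficient of 2^n means correlation 1) is an assumption. Applied to a single multiplication by 3 on 6 bits, both maxima reach 2^n, which is consistent with the probability-one characteristic of the non-linear layer and shows why several rounds are needed.

```python
from math import log2


def parity(v):
    """Parity of the bits of v."""
    return bin(v).count("1") & 1


def ddt_max(perm):
    """Largest DDT coefficient over all nonzero input differences."""
    size = len(perm)
    best = 0
    for a in range(1, size):
        row = [0] * size
        for x in range(size):
            row[perm[x] ^ perm[x ^ a]] += 1
        best = max(best, max(row))
    return best


def lat_max(perm):
    """Largest |sum_x (-1)^(a.x xor b.S(x))| over nonzero output masks b."""
    size = len(perm)
    best = 0
    for b in range(1, size):
        for a in range(size):
            s = sum(1 - 2 * (parity(a & x) ^ parity(b & perm[x]))
                    for x in range(size))
            best = max(best, abs(s))
    return best


n = 6
mult3 = [(3 * x) % (1 << n) for x in range(1 << n)]  # toy permutation
q_d = ddt_max(mult3)
normalized_d = 2 * log2(q_d) / n  # the quantity compared with 1 in the text
```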
We also see that no linear or differential distinguisher will cover 4 rounds with a data complexity under $2^{n/2}$. Indeed, the two quantities we consider are mostly under 1 for the small block sizes n we experimentally investigated. Further, they decrease as n increases. We can thus expect that a differential or linear distinguisher covering 4 rounds needs much more than $2^{n/2}$ plaintext/ciphertext pairs. This result is consistent with the bound we have put on the differential probability of 6-round MOE when n = 128, as we have shown that no differential trail covers 6 rounds with a probability higher than $2^{-66}$. In fact, it seems that our bound is rather loose and that 4 rounds may actually be sufficient.

Other attacks
While our proof and experimental results cover both differential and linear attacks, other types of attacks exist. We summarize our corresponding results below.

Variants of differential and linear Attacks
As evidenced by our experiments, the clustering of differential or linear trails is not a threat to MOE. The diffusion is very fast (due to the M layer) and does not interact in any particular way with the non-linear operation. As a consequence, this experimental observation is not surprising. These properties of M are also behind the security of MOE against truncated differential attacks.

Impossible differentials
Impossible differential cryptanalysis [Knu98] is a natural attack strategy given that our non-linear layer admits a probability-one differential characteristic and that the cipher uses few rounds compared to most algorithms in the literature. In addition, this technique proved efficient in breaking up to four and a half rounds of the multiplication-based cipher IDEA [BBS99], a result that was for a long time the best attack.
In the following, we present an impossible differential that covers the 7 transformations $M, A_3, M, A_3^{-1}, M, A_3, M$, that is, 3 rounds out of the 8 that define the cipher. We will see that turning this differential into an efficient attack seems complicated, which shows that impossible differential attacks are not a concern for MOE.
As a starting point, remark that for any odd choice of α, and for α = 3 in particular, a 1-bit difference standing in the most significant bit gives the same difference after $A_\alpha$. By extending this characteristic through the linear layer, we obtain a differential that covers the 3 steps $M$, $A_3$ and $M$ and goes from $M^{-1}(0 \cdots 01)$ to $M(0 \cdots 01)$. To build the impossible differential from this, we connect two such differentials by a non-linear step $A_3^{-1}$. Given the structure of the DDT of $A_\alpha$ (see Appendix C.4), it is easy to see that more than 3/4 of the transitions are impossible. Consequently, with high probability, the differential going from $\Delta_X = M^{-1}(0 \cdots 01)$ to $\Delta_Y = M(0 \cdots 01)$ over the steps $M, A_3, M, A_3^{-1}, M, A_3, M$ is impossible. Note here that the construction of the impossible differential holds for any choice of M and regardless of the non-linear layer, as soon as this layer has probability-one (non-trivial) characteristics. However, the non-linear layer impacts the probability that the transition from one probability-one characteristic to the other is impossible. A good point (for the attacker) is that our previous discussion on the DDT (see Section 4.3) implies that checking whether the transition is impossible can be done efficiently.
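The probability-one characteristic used as a starting point can be verified exhaustively on small word sizes. The following sketch (with a hypothetical function name) checks that flipping the most significant bit of the input of x → αx mod 2^n flips exactly the most significant bit of the output. This holds for odd α because α·2^(n−1) ≡ 2^(n−1) modulo 2^n.

```python
def msb_difference_is_preserved(n, alpha):
    """Exhaustively check that the MSB difference goes through
    x -> alpha * x mod 2^n unchanged, for every n-bit input x."""
    mask = (1 << n) - 1
    msb = 1 << (n - 1)
    return all(
        ((alpha * x) & mask) ^ ((alpha * (x ^ msb)) & mask) == msb
        for x in range(1 << n)
    )
```

For instance, `msb_difference_is_preserved(10, 3)` holds, whereas an even multiplier such as 2 cancels the difference instead of preserving it.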
A possible idea to use this differential to mount an attack would be to consider a reduced version of MOE with 4 non-linear steps, with the previous impossible differential positioned at the beginning (see Figure 11).
Figure 11: Impossible differential on a reduced version of MOE with 4 non-linear steps.
The idea would then be to ask for the encryption of pairs of messages with the input difference $\Delta_X = M^{-1}(0 \cdots 01)$. We could test a key by inverting the last linear and non-linear layers from the ciphertext and checking whether the difference is equal to $\Delta_Y = M(0 \cdots 01)$. The good point (again for the attacker) is that the definition of $A_3$ allows guesses to be made at a reduced cost: indeed, τ consecutive output bits of $A_3$ can be computed with the knowledge of τ + 1 consecutive input bits and one bit of carry (this property can be seen on the sum representation of $A_3$). Even if this property offers the possibility to make partial independent key guesses and to combine them later, the total time complexity would still be prohibitive. Since we do not have a strong filter on the ciphertext difference, we would have to examine roughly all the pairs. Each of them would allow us to discard one key. Taking into account the data limitation of $2^{64}$ plaintext/ciphertext pairs, we conclude that the impossible differential is hard to exploit. Indeed, even as a simpler 3-round distinguisher it would require the full code-book (and thus the time needed to compute it) and a negligible amount of memory: for each pair, if the impossible output difference is observed then the permutation observed cannot be 3-round MOE. It is thus impossible to exploit given the limitation we impose on the data complexity.

Invariant subspaces and 0-sums
Multiplications by constants in $\mathbb{Z}/2^n\mathbb{Z}$ exhibit many invariant subspaces in the sense of [LAAZ11] (see below). However, the permutation M very thoroughly disrupts this pattern and it is thus impossible to use such spaces for an invariant subspace attack. It is nevertheless possible to exploit this property along with the low algebraic degree of the low weight bits of $A_3^{-1}$ to obtain a wide array of 0-sum distinguishers against 2-round MOE (where the last call to M has been removed).
Let us show how such distinguishers would work. Let s ≤ n be an integer, let $a \in \mathbb{F}_2^{n-s}$ be a constant, and let $S_s(a) = \{x \| a,\ x \in \mathbb{F}_2^s\}$ be an affine subspace of $\mathbb{F}_2^n$ of dimension s, where the s bits of highest weight take all possible values and where the n − s bits of lowest weight are set to a. Such sets have the following properties:
• $K_j(S_s(a)) = S_s(a')$, where $a'$ is the n − s bits of lowest weight of $a \oplus K \oplus c_j$,
• $A_3(S_s(a)) = S_s(a'')$, where $a''$ is the n − s bits of lowest weight of $A_3(a)$, and
• $A_3^{-1}(S_s(a)) = S_s(a''')$, where $a'''$ is the n − s bits of lowest weight of $A_3^{-1}(a)$.
Using the terminology first introduced in [LAAZ11], these sets are invariant subspaces. However, such structures are completely broken by the $\mathbb{F}_2$-linear layer M and are thus impossible to use to build an invariant subspace attack.
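The stability of the sets S_s(a) under key addition and under multiplication by 3 (or its inverse) can be checked exhaustively for small parameters. The sketch below uses hypothetical helper names and represents x‖a as the integer whose n − s low bits are a. The low bits of a product modulo 2^n depend only on the low bits of the input, and the multiplication is a bijection, which is why the image is again a full set of the same form.

```python
def subspace(n, s, a):
    """S_s(a): the s high bits take all values, the n - s low bits equal a."""
    return {(x << (n - s)) | a for x in range(1 << s)}


def image_under_mult(n, s, a, alpha):
    """Image of S_s(a) under x -> alpha * x mod 2^n."""
    mask = (1 << n) - 1
    return {(alpha * v) & mask for v in subspace(n, s, a)}


def image_under_key_add(n, s, a, k):
    """Image of S_s(a) under x -> x xor k."""
    return {v ^ k for v in subspace(n, s, a)}
```

With n = 10 and s = 4, the image of S_s(a) under multiplication by 3 equals S_s(3a mod 2^6) for every 6-bit constant a.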
Nevertheless, they yield a powerful 0-sum distinguisher for 2-round MOE. Indeed, as we have established in the proof of Theorem 2, the bit at position i in the output of $A_\alpha$ is a function of degree i. This holds in particular for $A_3^{-1}$. The sum of a function of algebraic degree d over any affine space of dimension strictly greater than d is equal to 0. As the image by M of $S_s(a)$ is an affine space of dimension s, it holds that: where i < s and $(e_0, \ldots, e_{n-1})$ is the canonical basis of $\mathbb{F}_2^n$. As a consequence, the permutation R corresponding to MOE reduced to 2 rounds exhibits a wide array of 0-sum distinguishers where the sum over an affine space of dimension s yields a 0-sum over s − 1 bits. Due to the good diffusion of M, it is hard to turn this distinguisher into an attack faster than brute force. Still, these observations lead to the existence of simple 0-sum distinguishers needing $2^s$ chosen plaintexts, a time corresponding to the $2^s$ encryptions and a negligible amount of memory, as only the sum needs to be stored.
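The degree argument behind the 0-sum can be illustrated on a single multiplication by the inverse of 3. The sketch below is a toy experiment rather than the 2-round distinguisher itself: it draws random affine subspaces of dimension s (standing in for the image of S_s(a) by M), XORs the outputs of x → 3^(−1)·x mod 2^n over the subspace, and lets one check that the s − 1 lowest bits of the sum vanish. The helper names are hypothetical, and `pow(3, -1, m)` requires Python 3.8 or later.

```python
import random


def affine_space(basis, offset):
    """All points offset xor (any combination of the basis vectors)."""
    points = {offset}
    for b in basis:
        points |= {p ^ b for p in points}
    return points


def random_affine_space(n, s, rng):
    """Draw s linearly independent vectors of F_2^n and a random offset."""
    while True:
        basis = [rng.randrange(1, 1 << n) for _ in range(s)]
        if len(affine_space(basis, 0)) == 1 << s:  # independent basis
            return affine_space(basis, rng.randrange(1 << n))


def xor_sum_inverse_mult3(n, points):
    """XOR of 3^{-1} * p mod 2^n over all points p."""
    inv3 = pow(3, -1, 1 << n)
    mask = (1 << n) - 1
    acc = 0
    for p in points:
        acc ^= (inv3 * p) & mask
    return acc
```

For s ≥ 2, masking the result with `(1 << (s - 1)) - 1` gives 0 for every subspace drawn, in line with the claim that low-weight output bits have low algebraic degree.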

Slide attacks
This type of attack, introduced in [BW99], targets ciphers using identical round functions. Given that different round constants are added during each round, MOE resists slide attacks. Besides, even in the absence of round constants, distinguishing a slid pair covering $M$, $A_3^{-1}$, $M$, $A_3$ is hard because all operations operate on the full state. The resulting key recovery would also have a high complexity.

Notes on the key-schedule
The key-schedule we use is trivial as it simply adds the round key every time along with some round constants. This leads to a complementation-like property that can speed up an exhaustive search of the key by a factor of 2, much like in the DES. It is the reason why we only claim 127 bits of security against brute-force. Recall also that we do not make any security claim in the related-key setting. We detail this property below.
Consider the encryption of a plaintext P with a key K. We compare this result with the encryption of $P \oplus M(\Delta)$ under the key $K \oplus \Delta \oplus M(\Delta)$, where ∆ is the difference with a single bit set to 1, positioned in the MSB. As can be seen in Figure 12, the difference between the two executions is equal to ∆ at the input of the first non-linear layer. Since any modular multiplication sends ∆ to ∆ with probability one, the difference remains constant after this step. The obtained characteristic is iterative on the 3 steps of key addition, linear layer and modular multiplication, and it spreads with probability one through the whole cipher. Consequently, regardless of the number of rounds, the difference between the two executions can be predicted with probability 1:
As for the DES, we can use this property to speed up a brute-force attack. The attacker starts by asking for the ciphertexts $C$ and $C'$ corresponding to a random plaintext $P$ and to $P' = P \oplus M(\Delta)$. She then exhaustively considers one key out of each pair $(K, K \oplus \Delta \oplus M(\Delta))$ and encrypts P with it. If the obtained ciphertext is equal to $C$, she concludes that her guess is correct. If she obtains $C' \oplus \Delta$, she concludes that the key used in the cipher differs from her guess by $\Delta \oplus M(\Delta)$. If neither of these relations holds, she continues browsing the keys. The number of encryptions needed is thus halved. Note that this attack works for any choice of α used in the modular multiplication, for any choice of M and for any number of rounds. However, we could easily fix this by choosing a non-trivial key schedule, or by using different M matrices in each round.
Figure 12: Complementation-like property of MOE: if we denote ∆ = (0 · · · 01), the differential characteristic depicted here holds with probability 1.
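The complementation-like property can be observed on a toy cipher. The sketch below is not the MOE specification: the block size, the random binary matrix, the round constants, the alternation between multiplication by 3 and by its inverse, and the final key addition are all assumptions chosen only so that each round consists of a key and constant addition, a modular multiplication and an F_2-linear layer. All names are hypothetical.

```python
import random


def parity(v):
    """Parity of the bits of v."""
    return bin(v).count("1") & 1


def apply_matrix(rows, x):
    """Multiply the bit vector x by a binary matrix given as row masks."""
    return sum(parity(r & x) << j for j, r in enumerate(rows))


def toy_encrypt(n, rows, constants, key, p):
    """Toy rounds: key/constant addition, multiplication, linear layer."""
    mask = (1 << n) - 1
    inv3 = pow(3, -1, 1 << n)  # Python 3.8+
    state = p
    for j, c in enumerate(constants):
        state ^= key ^ c
        mult = 3 if j % 2 == 0 else inv3  # alternate the two multiplications
        state = (mult * state) & mask
        state = apply_matrix(rows, state)
    return state ^ key  # final key addition (assumed)


def related_pair(n, rows, key, p):
    """Key and plaintext shifted by the complementation-like difference."""
    delta = 1 << (n - 1)
    m_delta = apply_matrix(rows, delta)
    return key ^ delta ^ m_delta, p ^ m_delta
```

In this toy setting, encrypting the related plaintext under the related key always yields the original ciphertext XORed with the MSB difference, whatever the number of rounds. The check relies only on the linearity of the matrix layer, so the matrix does not even need to be invertible for the experiment.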

Algebraic attacks
As discussed in Section 3.3, multiple ciphers have recently been designed for specific settings where operations over fields of large degree are desirable. Many of these algorithms have been targeted with efficient algebraic attacks that leveraged the low univariate degree or the simple algebraic structure of the round function over the field used. We expect MOE to be safe from such attacks. Indeed, it relies on operations that are defined over different structures, namely $\mathbb{F}_2^n$ and $\mathbb{Z}/2^n\mathbb{Z}$. Thus, its round function does not have a simple representation using polynomials with coefficients in either structure (an argument found, e.g., in [BIP+18]). Such polynomials would be both dense and of high degree, and thus would be unusable to mount an attack against MOE.

Performance evaluation
We conclude the paper with an investigation of the performance reached by MOE in a Trojan-resilient setting, based on the prototype Printed Circuit Board developed in [BDFS18], which implements traditional block ciphers using the Trojan-resilient compiler of [DFS16].
In that work, the authors implemented the AES Rijndael and the bitslice cipher Mysterion [JSV17] thanks to the MPC protocol in [AFL + 16] and performed an analysis of the performances in two steps. First, they confirmed the reduction of the trusted area that such Trojan-resilient circuits allow; second, they quantified the throughput that can be reached for both ciphers, and its impact on the robustness bounds. In this section, we follow the same steps for the analysis of MOE, and use the same hardware as [BDFS18].

Trusted area requirements
The key hypothesis of the previous Trojan-resilient architecture is that the trusted area is small (and at least significantly smaller than the one of the functionality to implement). The rationale behind this hypothesis is that it should be easier to apply heuristic methods for hardware Trojan detection on the trusted master than on a complex circuit. Concretely, this reduction of the trusted area is achieved in two main directions. Firstly, an unprotected implementation needs trusted memories, which take a large proportion of the design area. For example, storing the key and the state in a low-area PRESENT implementation represents half of it [BKL+07]. Our hardware Trojan architecture gets rid of this requirement by storing shares in untrusted components. Secondly, the logic used for the computations (i.e., S-boxes, matrix multiplications, etc.) is also outsourced to untrusted components, leaving only sharing logic in the trusted design. This has the advantage of making the trusted area (per sub-circuit) independent of the target computation (which, for example, would come in handy if a cipher suite had to be implemented in a trusted manner).
Compared to the proposal of Dziembowski et al., the main variation in our case is that the trusted part has to deal with two types of secret sharing operations, namely additions in $\mathbb{F}_2$ and in $\mathbb{Z}/2^n\mathbb{Z}$. Considering a bit-serial interface (which is optimal w.r.t. the size of the master [BDFS18]), the first operation requires a single XOR gate, while the second operation requires a serial adder (composed of a full adder and a register for the carry). As a result, the trusted area per sub-circuit, which is worth 16[GE] when only Boolean sharing is considered, increases to 26[GE] in our case. One interesting additional feature of MOE is that the constant multiplication $A_3$ can also be performed by the trusted party, hence saving a few communication rounds for a minimal additional area cost. Since this operation can be executed based on an addition, the serial adder used for the sharing over $\mathbb{Z}/2^n\mathbb{Z}$ can be recycled to compute $A_3$, given that an additional register is used to shift the serial bits, leading to a total trusted area of 31[GE] per sub-circuit.
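The way a serial adder and one extra register suffice to compute the multiplication by 3 can be modelled in software. The sketch below is a behavioural model under the assumption that bits are processed least significant bit first. It computes 3x as x + 2x, where the shifted operand 2x is obtained by delaying the input stream by one bit. The function name is hypothetical and this is not a description of the actual hardware of [BDFS18].

```python
def serial_times_three(x, n):
    """Compute 3 * x mod 2^n bit-serially, LSB first, as x + (x << 1)."""
    carry = 0    # carry register of the serial adder
    delayed = 0  # extra register: previous input bit, i.e. the stream of 2x
    out = 0
    for i in range(n):
        bit = (x >> i) & 1
        total = bit + delayed + carry  # one full-adder evaluation
        out |= (total & 1) << i
        carry = total >> 1
        delayed = bit
    return out  # the final carry is dropped, giving the reduction mod 2^n
```

Each loop iteration uses one full-adder evaluation and two one-bit registers, matching the description of a serial adder extended with a single shift register.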
For the rest, the majority vote can be implemented identically to the proposal in [BDFS18] since for a given robustness bound, our Trojan resilient cipher requires the same number of sub-circuits as the generic MPC solution.
Considering the variant of our architecture where the key addition is performed by the mini-circuits, and where the trusted party is responsible for the secret sharing/reconstruction and $A_3$ operations together with the majority vote among the sub-circuits, this leads to the values in Table 6. Recalling that a state-of-the-art implementation of the AES requires 2400[GE] [MPL+11], our trusted area in MOE is still one order of magnitude smaller.
We mention that the cost of the multiplications in the untrusted FPGAs is not a primary performance metric in our case. Indeed, since each separate mini-circuit only has to implement a multiplication with a random Boolean matrix or with 3 in $\mathbb{Z}/2^n\mathbb{Z}$, they only consume a small percentage of low-cost commercial FPGAs.

Throughput and Robustness Bounds
Data throughput is the most relevant metric to compare block ciphers in a Trojan-resilient setting: increased throughput enables both to speed up the testing phase (potentially improving the robustness bounds) and to encrypt data at a higher rate during the online phase. For two algorithms running on the same physical support (and in particular, relying on the same communication performances), the data throughput mainly depends on the communication complexity and the majority vote circuit [BDFS18]. Hereunder, the performances of MOE are compared to the ones of AES and Mysterion.
We first evaluate the communication complexity, which is the key factor influencing the time spent in data transfer for a single encryption. For AES and Mysterion, the communication complexity comes from the multiplicative depth of the algorithm to implement. Thanks to the protocol of [AFL+16] used in [BDFS18], each multiplication requires a single field element to be transferred between two untrusted parties (without latency). In addition to the multiplications, the initial sharing and final reconstruction of the plaintext and ciphertext also add one communication round each.
In MOE, the communication complexity is mostly proportional to the number of times one has to switch from $\mathbb{F}_2$ to $\mathbb{Z}/2^n\mathbb{Z}$. Each time this happens, the shares are transferred from the untrusted parts to the trusted ones, opened and reshared before being transmitted to the untrusted parts again. The latter can be performed just as the sharing described in [BDFS18], which is executed in a single cycle. It leads to a communication complexity directly proportional to the number of bits to re-share. The overheads due to the initial and final sharing are similar to the ones of AES and Mysterion.
Table 4 contains the communication complexity of each of the investigated block ciphers (including plaintext sharing and ciphertext reconstruction). Taking advantage of the optimizations in [MBPV05], the execution of the AES S-boxes requires 320 bits to be exchanged between untrusted parties per round, while Mysterion and MOE only require 128 bits to be transferred per round. Additionally counting the initial sharing and final reconstruction, it leads to a total of 3,456 bits to be exchanged on the communication bus for the AES. Thanks to its efficient bitslice representation (with a multiplicative depth of one AND gate per round), this number is reduced to 1,792 bits for Mysterion. It is further reduced to 1,664 for MOE (thanks to the possibility to perform the operation $A_3$ in the master, which saves a few additional communication rounds).
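The totals quoted for the three ciphers can be reproduced with a one-line model: a number of transfer steps times the bits moved per step, plus one 128-bit block each for the initial sharing and the final reconstruction. In the sketch below, only the per-step figures (320 and 128 bits) and the totals come from the text. The step counts (10 for the AES, 12 for Mysterion, 11 for MOE) are assumptions obtained by solving for the value that matches each total, and the names are hypothetical.

```python
BLOCK_BITS = 128  # block size assumed for the sharing and reconstruction steps


def total_bits(steps, bits_per_step, block_bits=BLOCK_BITS):
    """Bits on the bus: per-step transfers plus sharing and reconstruction."""
    return steps * bits_per_step + 2 * block_bits


# Step counts below are back-solved from the totals, not taken from the text.
aes_total = total_bits(steps=10, bits_per_step=320)
mysterion_total = total_bits(steps=12, bits_per_step=128)
moe_total = total_bits(steps=11, bits_per_step=128)
```

Under these assumptions the model gives 3,456, 1,792 and 1,664 bits respectively, so the gap between Mysterion and MOE corresponds to a single 128-bit transfer.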
We follow with evaluations of the data throughput, which depends not only on the communication complexity but also on the ability to perform a majority vote in the trusted party. The latter depends on the number of sub-circuits λ (the evaluations in [BDFS18] set it at 940/λ[Mbps] using the high-speed communication interfaces of the FPGAs). Yet, since this majority vote is only performed once, when producing the ciphertext, its negative impact on the performance is amortized when increasing λ, as illustrated in Table 5. For λ = 1, the majority vote can be skipped, making the encryption throughput directly related to the communication complexity from Table 4. For λ > 1, the impact of the reduced communication complexity is lessened by the longer time needed for the majority vote.
Putting things together, we can compare the robustness bounds that can be achieved with the different ciphers. Namely, by fixing the number of online executions and the time devoted to the testing phase, we can compute the number of sub-circuits λ required to reach a certain robustness bound (say a probability of hardware Trojan attack lower than $2^{-80}$), and therefore the size of the trusted master needed for this purpose.⁹ Table 6 contains the results for testing phases of one and seven days. It mostly confirms theoretical expectations and illustrates the lower number of sub-circuits required by MOE compared to the AES, at the cost of an increased trusted area. Gains over Mysterion are more limited because its throughput is quite comparable to that of MOE.
Note that thanks to its simplification, we can expect that the testing of MOE is performed exactly at these throughputs. By contrast, this may not be the case for AES and Mysterion. As mentioned in the introduction, if all the intermediate values need to be communicated to the tester, this may become the bottleneck of the testing phase. And if a dedicated board is used for the testing (with all mini-circuits but one trusted), it still requires the ability to plug/unplug each tested chip on/from this testing hardware. So in general, we believe the simplified testing phase of MOE makes a significant step in the direction of Trojan-resilient block ciphers that can be deployed and tested on-the-fly. The cost reductions allowed by MOE are further highlighted by comparing the total number of mini-circuits in the Trojan-resilient circuits, since one sub-circuit is made out of three mini-circuits in the generic solution, and only two with MOE.
Note that despite using similar operations, ARX-based block ciphers would also compare negatively with MOE in a Trojan-resilient setting, because of the larger number of transitions between fields they require (which is the limiting factor for the throughput).
⁹ We do not use the bound of Theorem 1 and rather compute $\Pr[\mathrm{ROB} = 1] = \sum_{i=\lambda/2}^{\lambda} \binom{\lambda}{i} \cdot \left(\frac{\eta}{t}\right)^{i} \cdot \left(1 - \frac{\eta}{t}\right)^{\lambda - i}$ directly, which leads to tighter results [DFS16].

Conclusion
In previous works, Trojan-resilience has been approached by adapting current ciphers to fit this specific requirement. In this paper, we have shown that better performance can be obtained if we design an algorithm from the ground up for this purpose. The gains are substantial: a Trojan-resilient implementation of MOE can be up to 2 times faster than one of the AES, and its testing is greatly simplified. Furthermore, MOE is the first cipher of its kind, namely one where all operations are linear. While our aim was to design a cipher optimized for a Trojan-resilient implementation, we can expect the solution we came up with to have applications beyond this use case. For example, its trivial implementation with secret sharing may be useful in the context of multi-party computation and also seems appealing in the context of masking. Another possible research direction could be to use the scalability of the structure of MOE to design block ciphers with other block sizes, sponge permutations or other primitives that would share MOE's easy secret-shared implementability. Finally, we leave as an open problem the further investigation of the linear properties of the multiplication by 3 modulo $2^n$.

A Trojan robust construction from [DFS16]
For completeness, we recall the construction of [DFS16] based on a passively secure 3-party computation protocol. The formal construction is given in Figure 14 and taken from [DFS16]. The construction works as a general compiler that can protect any computation against Trojan attacks. To this end, the compiler takes a specification of an algorithm modeled as an arithmetic circuit Γ as input, and outputs a protected circuit Γ′. At a high level, Γ′ consists of multiple mini-circuits $(\Gamma_1, \Gamma_2, \Gamma_3)$ that take the role of the parties in the passively secure 3-party computation protocol. Since the compiler works gate-by-gate, Figure 14 presents transformations for each single arithmetic operation in Γ. As usual, linear operations are almost free and do not require any interaction between the mini-circuits $\Gamma_i$. On the other hand, the non-linear operations are costly and in particular require interaction between the mini-circuits (the communication is also illustrated in Figure 15). The high communication complexity of the non-linear gates is also the reason for the high cost of the tester $T_{\mathrm{DSF}}$, as essentially the entire communication between the mini-circuits $(\Gamma_0, \Gamma_1, \Gamma_2)$ has to be tested for correctness.

B General theorems about the possible candidate groups
The first important thing we need to figure out is how to implement the linear operations, that is, what are the computations made by the untrusted mini-circuits. This problem can be reformulated as defining a finite Abelian group and selecting one of its automorphisms.
We hereafter refer to the paper Automorphisms of Abelian Groups [HR07] by Hillar and Rhea to figure out the possible options for implementing the linear operations.
The first well-known result that we use is the following: In the remainder of the paper, we focus on the following two extreme cases for p = 2:
• $\forall i \in \{1, \ldots, t\}$, $e_i = 1$. In this case, the automorphism group of $H_2$ corresponds to the set of invertible matrices $\mathrm{GL}_t(\mathbb{F}_2)$.
• $H_2 = \mathbb{Z}/2^n\mathbb{Z}$. The automorphisms of $H_2$ are the multiplications by an integer α such that α mod 2 = 1. These are the modular multiplications by an odd integer.
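The second case can be checked exhaustively for a small modulus. The sketch below, with a hypothetical function name, tests whether x → αx mod 2^n is a bijection. The map is additive for every α, so it is an automorphism of the additive group exactly when it is bijective.

```python
def is_automorphism(alpha, n):
    """True if x -> alpha * x mod 2^n is a bijection of Z/2^nZ.

    The map is always additive, so bijectivity is the only condition."""
    m = 1 << n
    return len({(alpha * x) % m for x in range(m)}) == m
```

For n = 8, the multipliers passing the check are exactly the odd ones.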

C Differential and linear properties of the multiplication by 3 modulo a power of 2

C.1 Proof of Theorem 3 (DDT Formula)
We consider the function $\eta : x \mapsto (3 \times x \bmod 2^n) \oplus x$, which maps $\mathbb{F}_2^n$ to itself. It is extended affine equivalent to the multiplication by 3. When $\eta$ operates on $n$ bits, we use the following notation for its DDT coefficients:

The coordinates of $\eta$ are functions $\eta_i(x)$ such that:

The last line is obtained using the facts that $(x, y, z) \mapsto xy + (x + y)z$ is the majority function and that [...]. Note that $\eta_i$ does not depend on $x_j$ for $j \geq i$, meaning that $\eta_i$ has all of its inputs in $\mathbb{F}_2^i$.

The following notations are used throughout this proof:
• We let $D^n_\eta(a, b)$ be the DDT coefficients of $\eta$ when it operates on $n$ bits.
• We let $S^z_i$ be defined for $z \in \mathbb{F}_2$ and $i \leq n - 1$ as follows:
• We further denote $\lambda^z_i = |S^z_i|$ so that, in particular, $D^n_\eta(a, b) = \lambda^0_{n-1} + \lambda^1_{n-1}$.
• We let $T$ be the function truncating the highest weight bit:

[...] one. Thus, for each $x \in S^0_{n-2} \cup S^1_{n-2}$ we can build a unique $x_{n-2}$ such that $(x_0, \ldots, x_{n-2}) \in S^0_{n-1} \cup S^1_{n-1}$. We conclude that the following holds:

where $|S^0_{n-2}| = |S^1_{n-2}|$, so that: [...] $T(b))$.
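The objects used in this proof can be explored by brute force for small $n$. The sketch below is ours, not the authors' code: it assumes the standard DDT convention $D(a, b) = \#\{x : f(x \oplus a) \oplus f(x) = b\}$, and the helper names and the size $n = 5$ are illustrative. It also checks the stated property that $\eta_i$ does not depend on $x_j$ for $j \geq i$.

```python
def eta(x, n):
    """eta(x) = (3 * x mod 2^n) XOR x on n-bit integers."""
    return ((3 * x) & ((1 << n) - 1)) ^ x

def ddt(f, n):
    """D[a][b] = #{x : f(x ^ a) ^ f(x) = b}, by exhaustive search (assumed convention)."""
    size = 1 << n
    table = [[0] * size for _ in range(size)]
    for a in range(size):
        for x in range(size):
            table[a][f(x ^ a) ^ f(x)] += 1
    return table

def coordinate_ignores_high_bits(n):
    """True if bit i of eta(x) never changes when a bit j >= i of x is flipped."""
    for x in range(1 << n):
        for i in range(n):
            for j in range(i, n):
                if (eta(x, n) >> i) & 1 != (eta(x ^ (1 << j), n) >> i) & 1:
                    return False
    return True

n = 5  # illustrative size (our choice)
D = ddt(lambda x: eta(x, n), n)
print(coordinate_ignores_high_bits(n))
print(all(sum(row) == 1 << n for row in D))
```

Each row of the table sums to $2^n$ by construction, which is a quick sanity check on the enumeration.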

C.2 On the Sierpinski triangle pattern of the DDT of η
The Sierpinski triangle is a fractal named after Wacław Sierpiński, who described it in 1915. It can be seen as a full equilateral triangle with smaller triangle-shaped parts removed. Among the various ways to build it, one consists in shrinking and duplicating shapes. We start with a full equilateral triangle and shrink it to half its size. We take 3 copies of this smaller triangle and position them so that they touch at their corners. This newly formed figure is our new starting point: we shrink it, make 3 copies of it and assemble them, and so on.

When looking at the indicator function of the DDT of $\eta$ (Figure 6 for instance), we can observe a pattern that is very similar to a Sierpinski triangle, with some additional parts. Moreover, the pattern of the DDT for a given value of $n$ looks like a refined version of the pattern obtained for $n - 1$, reminiscent of a more detailed version that could have been obtained by another iteration of the shrinking and duplicating algorithm (see Figure 16 for an illustration of this phenomenon). We only give a very informal idea of this relation:
• From Lemma 4, every coefficient $D^n_\eta(a, b)$ is a function of the coefficient $D^{n-1}_\eta(T(a), T(b))$, in the sense that if the function is nonzero (that is, if there is a motif in the considered area of the DDT at $n$), then it looks like the one we have at $n - 1$. Informally, this is the same as shrinking the motif in order to build the next iteration (as in the Sierpinski triangle construction): the motif in the DDT at $n - 1$ is copied into the DDT at $n$, which is twice as large. Figure 17 depicts the maximum pattern we would obtain if all the motifs of $n - 1$ were present.
• The situation is made more difficult by the additional parts (A and B in Figure 17) that disturb the pattern, so that the resulting DDT would not look like a refined version of the smaller instance. Fortunately, as seen in Lemma 4, some parts are erased (namely a part of C in the left half, and A and C in the right half), and thanks to that we obtain something that meets our expectations (see Figure 18).

Figure 18: When we take into account the null parts, the pattern at n looks very similar to the one at n − 1. The fractal structure is visible.
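The informal relation between the patterns at $n$ and $n - 1$ can be observed on small instances. The following sketch is ours and only illustrative: it assumes that $T$ reduces a value modulo $2^{n-1}$, prints the indicator of the DDT as text, and checks that every nonzero cell at size $n$ truncates to a nonzero cell at size $n - 1$.

```python
def eta(x, n):
    """eta(x) = (3 * x mod 2^n) XOR x on n-bit integers."""
    return ((3 * x) & ((1 << n) - 1)) ^ x

def ddt_support(n):
    """Set of pairs (a, b) such that eta(x ^ a) ^ eta(x) = b for some n-bit x."""
    size = 1 << n
    return {(a, eta(x ^ a, n) ^ eta(x, n)) for a in range(size) for x in range(size)}

def truncate(v, n):
    """T: drop the highest-weight bit of an n-bit value (assumed meaning of T)."""
    return v & ((1 << (n - 1)) - 1)

def support_projects(n):
    """True if every nonzero cell at size n truncates to a nonzero cell at size n - 1."""
    smaller = ddt_support(n - 1)
    return all((truncate(a, n), truncate(b, n)) in smaller for a, b in ddt_support(n))

def render(n):
    """Text picture of the DDT indicator: '#' for nonzero cells, '.' for zero cells."""
    size, support = 1 << n, ddt_support(n)
    return "\n".join(
        "".join("#" if (a, b) in support else "." for b in range(size))
        for a in range(size)
    )

print(render(4))
print(all(support_projects(n) for n in range(2, 7)))
```

Comparing `render(4)` with `render(5)` gives a rough textual counterpart of the refinement illustrated in the figures.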

C.3 Proof of Lemma 2 (DDT bound)
Lemma 5. All the following propositions are true for any $a, b$ in $\mathbb{F}_2^n$ and any $i \geq 2$:

[...] $i$ is equivalent to $x \oplus a \in \lambda^1_i$, meaning that the first two propositions hold in this case. If $b_i = a_i$ then $\delta_i(a, b) \in \{0, 1\}$. Let us consider those cases separately:

We deduce the lemma.

The following theorem tells us under which condition the sequence $\lambda^0_i + \lambda^1_i$ increases.

Theorem 10. Let $i \geq 2$ be some integer. Then the inequality:

holds if and only if the conditions in Equation (8) and Equation (9) are satisfied, that is if:

and if:

Proof. If $\delta_i(a, b) = 1$ or $\delta_i(a, b) = 3$, then $\lambda^0_i + \lambda^1_i = \lambda^0_{i-1} + \lambda^1_{i-1}$. Thus, in order for the sequence to increase, it is necessary that $\delta_i(a, b) \in \{0, 2\}$, which is true if and only if [...] We then consider the cases $\delta_i(a, b) = 0$ and $\delta_i(a, b) = 2$ separately.

This very general theorem is a bit complicated to use in practice. The following corollary gives a simpler, though slightly coarser, interpretation.
Proof. We prove the theorem by induction over $i$ using the following induction hypothesis for any $i \geq 1$:

Let us first consider $i = 1$. In this case, as given in Theorem 3, $\lambda^z_1 = 2$ if and only if $a_0 \oplus a_1 \oplus b_1 = 0$ and $a_0 \oplus b_0 = 0$; it is equal to 0 otherwise. Thus, the total number of solutions is initialized with:
• 0 if $\nu_1(a, b) = 0$, in which case the bound always holds, or
• $\lambda^0_1 + \lambda^1_1 = 4 = 2^{1+\nu_1(a,b)}$ if $\nu_1(a, b) = 1$.

The hypothesis thus holds for $i = 1$. We now assume that the induction holds for some $i \geq 1$. According to Theorem 10, if $\lambda^0_i + \lambda^1_i > \lambda^0_{i-1} + \lambda^1_{i-1}$ then $a_{i-1} = b_{i-1}$ and one of the following two conditions must hold:
• $a_i = b_i$ and $a_{i-1} = a_{i-2}$, or
• $a_i = b_i$ and $a_{i-1} = a_{i-2}$.

We ignore the remainder of the condition in this case as we are only interested in necessary conditions.

As a consequence, if $\nu_i(a, b) = 0$ then at least one of the conditions for $\lambda^0_i + \lambda^1_i > \lambda^0_{i-1} + \lambda^1_{i-1}$ to be true fails to hold. Therefore, $\lambda^0_i + \lambda^1_i = \lambda^0_{i-1} + \lambda^1_{i-1}$ and the induction hypothesis still holds. On the other hand, $\lambda^0_i + \lambda^1_i \leq 2(\lambda^0_{i-1} + \lambda^1_{i-1})$ is always true, so that the induction hypothesis also holds when $\nu_i(a, b) = 1$.

In the end, it is true that $\lambda^1_i + \lambda^0_i \leq 2^{1+\sum_{j=1}^{i} \nu_j(a,b)}$ for all $i \geq 1$ and in particular for $i = n - 1$. Since $D^n_A(a, b) = \lambda^1_{n-1} + \lambda^0_{n-1}$, we obtain the lemma.
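The step of the induction stating that the number of solutions can at most double from one size to the next can be checked numerically on the DDT of $\eta$. This is a sketch under assumptions of ours: we identify $\lambda^0_{n-1} + \lambda^1_{n-1}$ with $D^n_\eta(a, b)$ as in Appendix C.1, we assume that the same identity at size $n - 1$ gives $D^{n-1}_\eta(T(a), T(b))$, and we take $T$ to be the reduction modulo $2^{n-1}$.

```python
def eta(x, n):
    """eta(x) = (3 * x mod 2^n) XOR x on n-bit integers."""
    return ((3 * x) & ((1 << n) - 1)) ^ x

def ddt(n):
    """DDT of eta on n bits: D[a][b] = #{x : eta(x ^ a) ^ eta(x) = b}."""
    size = 1 << n
    table = [[0] * size for _ in range(size)]
    for a in range(size):
        for x in range(size):
            table[a][eta(x ^ a, n) ^ eta(x, n)] += 1
    return table

def doubling_bound_holds(n):
    """Check D^n(a, b) <= 2 * D^(n-1)(T(a), T(b)) for all a, b (T = reduction mod 2^(n-1))."""
    big, small = ddt(n), ddt(n - 1)
    half = 1 << (n - 1)
    return all(
        big[a][b] <= 2 * small[a % half][b % half]
        for a in range(1 << n)
        for b in range(1 << n)
    )

print(all(doubling_bound_holds(n) for n in range(2, 7)))
```

The check reflects the fact that every solution on $n$ bits truncates to a solution on $n - 1$ bits, and that at most two $n$-bit values share the same truncation.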

C.4 On the shape of the DDT
In this section, we give a rough estimate of the proportion of impossible difference transitions of the non-linear layer $A_3$, and show that many of them can be easily characterized by using an affine equivalent function. (Note that our reasoning can easily be generalized to $A_\alpha$ in general.) We denote by $R$ the function reversing the order of the bits in the binary decomposition, namely:

$$R : \mathbb{F}_2^n \to \mathbb{F}_2^n, \quad x = (x_{n-1}, \cdots, x_2, x_1, x_0) \mapsto (x_0, x_1, x_2, \cdots, x_{n-1}).$$

By applying this function to both the input and the output of $A_3$, we obtain a function that is affine equivalent and whose DDT looks more organized: as depicted in Figure 19, it shows wide ranges of null coefficients grouped into squares.

The patterns we observe in the example of Figure 19 seem to indicate that, for any $n$, at least a fraction $\frac{1}{8} + 2\sum_{i=1}^{n} \left(\frac{1}{4}\right)^i$ of the difference transitions are impossible. In the following, we show that they can be easily understood and that we can deduce the same number of null coefficients in the DDT of $A_3$.

We denote by $y$ the output of $R(A_3(R(x)))$. In the sum representation, this gives:

which makes clear that the output bit number $i$ of $R(A_3(R(x)))$ is linear in $x_i$ and non-linear (in a broad sense) in $x_j$ for $j > i$, a property that we rewrite as:

The following observations can be visualized in Figure 20 (for $n = 6$) but are valid for any value of $n$. The two hatched squares numbered 1 in Figure 20 each cover one fourth of the DDT and correspond to two types of impossible transitions: the bottom-left one illustrates that a transition from a (truncated) input difference with the MSB set to 1 never leads to an output difference with the MSB equal to 0, while the top-right square expresses the opposite situation (a difference with a null MSB never leads to an output difference with the MSB set to 1). This impossibility is clear from the expression of the MSB of the output: $y_{n-1} = x_{n-1}$. Consequently, half of the transitions are impossible. The other squares can be explained in a similar way just by referring to Equation (10):
• For two inputs that are equal on the first $i - 1$ MSB and different at position $i$, Equation (10) implies that the output difference in bit $i$ is equal to 1 while the output difference in the bits of higher indices must be zero. Consequently, all the transitions from such an input difference to an output difference with the $i$ first MSB equal to 0 are impossible. In Figure 20, this corresponds to the leftmost squares.
• Likewise, two inputs that do not differ on the $i$ first MSB cannot lead to an output difference where the $i - 1$ first MSB are null while MSB $i$ is set to 1. This case is symmetric to the previous one and defines squares of the same size.
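The MSB rule above can be checked by brute force. The sketch below is ours: the helper names are hypothetical, $n = 6$ mirrors the size used in Figure 20, and the DDT convention $D(a, b) = \#\{x : f(x \oplus a) \oplus f(x) = b\}$ is assumed. It builds the DDT of $x \mapsto R(A_3(R(x)))$, verifies that every transition whose input and output MSB differ has a null coefficient, and reports the overall fraction of null coefficients without comparing it to the estimate discussed in this section.

```python
def reverse_bits(x, n):
    """R: reverse the order of the n bits of x."""
    return int(format(x, "0{}b".format(n))[::-1], 2)

def conjugated(x, n):
    """x -> R(A_3(R(x))) with A_3(x) = 3 * x mod 2^n."""
    mask = (1 << n) - 1
    return reverse_bits((3 * reverse_bits(x, n)) & mask, n)

def ddt(f, n):
    """D[a][b] = #{x : f(x ^ a) ^ f(x) = b}, by exhaustive search (assumed convention)."""
    size = 1 << n
    table = [[0] * size for _ in range(size)]
    for a in range(size):
        for x in range(size):
            table[a][f(x ^ a) ^ f(x)] += 1
    return table

n = 6  # same size as the illustration in Figure 20
D = ddt(lambda x: conjugated(x, n), n)

# Transitions whose input and output differences disagree on the MSB must be impossible.
msb_rule = all(
    D[a][b] == 0
    for a in range(1 << n)
    for b in range(1 << n)
    if (a >> (n - 1)) != (b >> (n - 1))
)
zero_fraction = sum(row.count(0) for row in D) / float(1 << (2 * n))
print(msb_rule, zero_fraction)
```

Since the MSB rule alone already rules out half of the cells, the printed fraction is at least 0.5.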
For $i$ varying from 1 to $n$, this defines a fraction of impossible transitions that is equal to $2\sum_{i=1}^{n} \left(\frac{1}{4}\right)^i$. To conclude and obtain the announced lower bound, we remark that an additional fraction of $\frac{1}{8}$ of the transitions are impossible. They are deduced from the linear equation $y_{n-2} = x_{n-2} \oplus x_{n-1}$ when $x_{n-1} = y_{n-1} = 1$ (squares denoted "a" in Figure 20). Given that $A_3$ is affine equivalent to $R(A_3(R(x)))$, these observations translate into the same number of impossible differentials for $A_3$, with expressions that can be deduced.

We consider the function $\mu : x \mapsto (3 \times x \bmod 2^n) \oplus x \oplus (2 \times x)$, which maps $\mathbb{F}_2^n$ to itself. It is extended affine equivalent to the multiplication by 3. Its coordinates are functions $\mu_i(x)$ such that:

where $(x, y, z) \mapsto xy + (x + y)z$ is the majority function. Note that $\mu_i$ does not depend on $x_j$ for $j \geq i$, meaning that $\mu_i$ has all of its inputs in $\mathbb{F}_2^i$.

Let $i$ be the position of the highest 1 in $b$. Then $W_\mu(a, b) = 0$ unless $a_j = 0$ for all $j \geq i$, as $\mu_i$ does not depend on $x_j$ for $j \geq i$. We can write $W_\mu(a, b)$ as:

We write $\mathbb{F}_2^i$ as the following direct sum:

We deduce the following lemma.
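The role of the majority function in the coordinates of $\mu$ can be made concrete. The sketch below is ours and rests on an assumption: we use the usual carry recurrence of the addition $x + 2x$, namely $c_0 = 0$ and $c_{i+1} = \mathrm{maj}(x_i, x_{i-1}, c_i)$ with $x_{-1} = 0$, and check that bit $i$ of $\mu(x)$ equals $c_i$. The names `maj` and `carry_word` are hypothetical.

```python
def mu(x, n):
    """mu(x) = (3 * x mod 2^n) XOR x XOR (2 * x), truncated to n bits."""
    mask = (1 << n) - 1
    return ((3 * x) ^ x ^ (2 * x)) & mask

def maj(x, y, z):
    """Majority of three bits, written as xy + (x + y)z over F_2."""
    return (x & y) ^ ((x ^ y) & z)

def carry_word(x, n):
    """Word whose bit i is the carry entering position i when adding x and 2x."""
    word, carry, prev = 0, 0, 0  # prev holds x_{i-1}, with x_{-1} = 0
    for i in range(n):
        word |= carry << i
        bit = (x >> i) & 1
        carry = maj(bit, prev, carry)  # carry entering position i + 1
        prev = bit
    return word

n = 8  # illustrative size (our choice)
print(all(mu(x, n) == carry_word(x, n) for x in range(1 << n)))
```

Because the carry entering position $i$ only involves $x_0, \ldots, x_{i-1}$, this also illustrates why $\mu_i$ does not depend on $x_j$ for $j \geq i$.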

C.5.2 Deducing an induction for the multiplication
We have found a recursive formula for the Walsh spectrum of $\mu$:

[...] is a linear function. The Walsh coefficients of $A : x \mapsto 3 \times x \bmod 2^n$ are given by:

where $u^T$ is obtained by transposing the matrix representation of $u$, so that $u^T(x) = x \oplus (x/2)$. As before, we define $i$ as the position of the 1 of highest weight in $b$. Then:

where $L(a \oplus u^T(b))$ is well defined because $a_i = b_i = 1$, so that the 1 of highest weight in $a \oplus u^T(b)$ is at most at position $i - 1$. We simplify $L\left(a \oplus u^T(b) \oplus u^T(T_i(b))\right)$ into:

Furthermore, $u^T(2^{i-1}) = 2^{i-1} + 2^{i-2}$. We deduce Theorem 5, namely that $W_A(a, b)$ can be computed as the following sum:

Theorem 5 explains some visual patterns that we observe in the LAT of $A_3$ in Figure 7b. First, the non-zero coefficients are in $n$ different squares with indices $i$, each defined as the set of points with coordinates $(a, b)$ where $2^i \leq a < 2^{i+1}$ and $2^i \leq b < 2^{i+1}$. These correspond to the condition that $W_A(a, b) = 0$ unless $2^i \leq a < 2^{i+1}$ when $2^i \leq b < 2^{i+1}$. Another implication of the theorem is that for $a$ and $b$ between $2^i$ and $2^{i+1}$ we have $|W_A(a, b)| = |W_A(a + 2^{i-1}, b + 2^{i-1})|$. This can be visualized in Figure 7b: if we cut any of the non-zero squares of the figure into 4 equal squares, the top-left one is equal to the bottom-right one, and the top-right one has the same pattern as the bottom-left one.
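The square structure of the LAT can be verified exhaustively for small $n$. The following sketch is ours and makes two assumptions: the Walsh coefficient is taken as $W_A(a, b) = \sum_x (-1)^{a \cdot x \oplus b \cdot A(x)}$, and $u$ is taken to be $x \mapsto x \oplus 2x \bmod 2^n$, which is consistent with the stated $u^T(x) = x \oplus (x/2)$. It checks that $W_A(a, b) = 0$ whenever the highest set bits of $a$ and $b$ are at different positions, and that $u^T$ is indeed the transpose of $u$.

```python
def parity(v):
    """Parity of the number of set bits of v (dot product over F_2 after masking)."""
    return bin(v).count("1") & 1

def A(x, n):
    """Multiplication by 3 modulo 2^n."""
    return (3 * x) & ((1 << n) - 1)

def walsh(a, b, n):
    """W_A(a, b) = sum over x of (-1)^(a.x + b.A(x)) (assumed sign convention)."""
    return sum(1 - 2 * (parity(a & x) ^ parity(b & A(x, n))) for x in range(1 << n))

def u(x, n):
    """Assumed linear map u: x -> x XOR 2x, truncated to n bits."""
    return (x ^ (2 * x)) & ((1 << n) - 1)

def u_T(x):
    """Transpose of u as given in the text: x -> x XOR (x / 2)."""
    return x ^ (x >> 1)

n = 5  # illustrative size (our choice)
size = 1 << n
squares_ok = all(
    walsh(a, b, n) == 0
    for a in range(size)
    for b in range(size)
    if a.bit_length() != b.bit_length()
)
transpose_ok = all(
    parity(b & u(x, n)) == parity(u_T(b) & x)
    for b in range(size)
    for x in range(size)
)
print(squares_ok, transpose_ok)
```

Here `a.bit_length() == b.bit_length()` is a compact way of saying that $a$ and $b$ lie in the same interval $[2^i, 2^{i+1})$, or are both zero.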