Perfect Trees: Designing Energy-Optimal Symmetric Encryption Primitives

Abstract. Energy efficiency is critical in battery-driven devices, and designing energy-optimal symmetric-key ciphers is one of the goals for the use of ciphers in such environments. In the paper by Banik et al. (IACR ToSC 2018), stream ciphers were identified as ideal candidates for low-energy solutions. One of the main conclusions of this paper was that Trivium, when implemented in an unrolled fashion, was by far the most energy-efficient way of encrypting larger quantities of data. In fact, it was shown that as soon as the number of data bits to be encrypted exceeded 320, Trivium consumed the least amount of energy on STM 90 nm ASIC circuits and outperformed the Midori family of block ciphers even in the least energy-hungry ECB mode (Midori was designed specifically for energy efficiency).


Introduction
Energy efficiency has become an eminent research discipline, particularly in the context of lightweight cryptography [BBI+15, BBR16, BMA+18, BDE+13, KDH+12]. Low-energy encryption solutions are critical, for example, in battery-driven devices that run on tight budgets, like portable devices, medical implants, sensor nodes or active RFID tags. Power and energy are correlated parameters, as energy is essentially the time integral of power, and power is simply the rate of energy consumption. In a nutshell, energy is a measure of the total electrical work done by the battery source during the execution of any operation, i.e., $E = \int P \, dt$.
Hence, a less energy-hungry operation drains the battery less and is important for applications that run on tight energy budgets.
Power and energy consumed in semiconductor circuits come from two principal sources: dynamic and static. Static power is accounted for by the leakage current and other current drawn continuously from the power supply. This type of power is generally not dependent on the frequency of the clock driving the circuit. Dynamic power, on the other hand, is due to the charging and discharging of load capacitances in CMOS circuits. Each 0 → 1 or 1 → 0 transition contributes to the dynamic dissipation, and hence this component varies directly with the clock frequency. Since the energy consumed in an operation is roughly equal to the product of the average power and the time taken for it, the leakage energy increases with any increase in the physical time required to do a task (which can occur if we lower the clock frequency). Dynamic energy, on the other hand, would by a similar logic be independent of the frequency of the signal clocking the circuit.
In this framework, there have been numerous previous works that have investigated the energy efficiency of block ciphers. In [BDE+13, KDH+12], several lightweight block ciphers were evaluated with respect to various hardware performance metrics, with a particular focus on the energy cost. In [BBR16], the authors looked at design strategies like serialization and round unrolling and the effect they have on the energy required to encrypt a single block of data. They concluded that in a low-leakage environment, at high enough frequencies, the energy consumed for encrypting one block of plaintext was actually independent of the clock frequency of the circuit (the authors of [KDH+12] had independently come to the same conclusion). This is because if the leakage power is low, then the lion's share of the energy consumption is due to the dynamic component, which is basically given by the sum total of all the glitches produced in the circuit and is generally independent of the clock frequency. The reader will note that the frequency has to be high enough for the above observation to hold. Otherwise, at lower clock frequencies, the physical time taken to encrypt becomes larger and even a small leakage power results in an energy consumption of the order of the dynamic energy. The total energy then increases monotonically as the frequency decreases.
In [BBR16], it was also shown that the energy consumed for encrypting one block of plaintext with any r-round unrolled implementation (given the above conditions and a low-leakage environment) had the quasi-quadratic form
$$E(r) = (Ar^2 + Br + C) \cdot \left(1 + \left\lceil \frac{R}{r} \right\rceil\right).$$
Here, A, B, C are constants and R is the number of iterations of the round function prescribed for the design. $Ar^2 + Br + C$ denotes the energy consumed per cycle and $1 + \lceil R/r \rceil$ is the total number of clock cycles required to encrypt. This expression was arrived at by the following argument: since an r-round unrolled structure has r copies of the round function circuit connected serially one after the other, the glitches (which are really due to transients at the beginning of the clock cycle) produced due to signal delays in the i-th round function are compounded in the (i + 1)-st round function and are compounded further in the (i + 2)-nd round function (see [BBI+15, Figs. 1, 2, 3, 4]). It was then shown that the power consumed in each round function formed a simple arithmetic sequence. Since the total power consumed is the sum of these r terms of the sequence, it results in a quadratic function in r. Multiplying this by the total time taken to encrypt, i.e., $1 + \lceil R/r \rceil$, gives the required expression. We can see that although an r-round unrolled cipher consumes more energy per cycle for increasing values of r, it takes fewer cycles to complete the encryption operation itself.
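The trade-off in the block cipher model can be made concrete with a short sketch. It evaluates the quasi-quadratic energy expression, assuming a cycle count of 1 + ⌈R/r⌉, for every degree of unrolling and picks the minimum. The constants `A`, `B`, `C` and the round count `R` below are hypothetical values chosen only for illustration; they are not taken from [BBR16].

```python
import math

def cycles(r: int, R: int) -> int:
    """Clock cycles to encrypt one block with r rounds unrolled: 1 + ceil(R / r)."""
    return 1 + math.ceil(R / r)

def energy(r: int, R: int, A: float, B: float, C: float) -> float:
    """Quasi-quadratic model: per-cycle energy (A r^2 + B r + C) times the cycle count."""
    return (A * r * r + B * r + C) * cycles(r, R)

# Hypothetical constants, for illustration only (arbitrary energy units).
A, B, C, R = 1.0, 2.0, 40.0, 16

# Sweep all degrees of unrolling and keep the one with the lowest modelled energy.
best_r = min(range(1, R + 1), key=lambda r: energy(r, R, A, B, C))

for r in (1, 2, 4, 8, 16):
    print(f"r={r:2d}  cycles={cycles(r, R):2d}  energy={energy(r, R, A, B, C):.0f}")
print("energy-optimal r:", best_r)
```

With these particular constants the minimum falls at an intermediate degree of unrolling: the per-cycle cost grows quadratically while the cycle count shrinks only roughly as 1/r.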
In the realm of stream ciphers, no energy model of the sort is currently known. However, in another work by Banik et al. [BMA+18], some broader conclusions about the effects of unrolling stream cipher circuits were made. They show that an unrolled stream cipher circuit that produces multiple keystream bits in one clock cycle is more energy-efficient in an asymptotic sense, i.e., when the encryption of multiple data blocks is considered instead of a single block. In fact, it was shown that for over 320 bits of data, Trivium consumed the least amount of energy on STM 90 nm ASIC circuits and outperformed the Midori block cipher family. For asymptotically large amounts of data, the regular Trivium circuit reached its point of optimality relatively late at r = 160, and at this degree of unrolling it was around 9 times more energy-efficient than Midori64. These findings are reflected in Figure 1, indicating that the baseline Trivium design is a fitting starting point from which new low-energy constructions can be derived.
Figure 1: Energy consumption figures from [BMA+18], obtained using the STM 90 nm cell library process at a clock frequency of 10 MHz. Added to the plot are figures for the energy consumption of Subterranean-Deck and of the designs Trivium-LE(F), Trivium-LE(S) and Triad-LE that we propose in this paper, for the same standard cell library and operating frequency. Figures are reported for short messages (1 to 10 blocks of 64 bits) and longer messages (1 to 100 blocks). Legend entries highlighted in blue and green have a security level of 80 and 128 bits respectively, whereas Triad-LE offers 112-bit security.
The reasons why a heuristic energy model for stream ciphers appears to be harder to conceive are manifold. For one, stream cipher circuits are often not more than a single large register bank whose outputs are fed into a thin combinatorial layer, e.g., in Trivium the state update function only consists of 12 two-input logic gates. This means that for small r the energy consumption of the algorithm is almost entirely determined by the storage elements, i.e., the contribution of the round function circuit is insignificant. Further note that when r is small, the switching activity of the state update function heavily depends on the underlying cell library process and can thus vary widely. Only for large r does the energy consumption of the round function layer become decisive. However, it then becomes increasingly complex to reason about the circuit, as the algebraic complexity of the underlying equations grows unmanageable, thus preventing any deeper analysis of the involved switching activity. This stands in contrast to block ciphers, where the unrolling factor r is usually small and thus the complexity of the round function circuits remains bounded.
Analogously, the reasons why some hardware stream ciphers outperform block ciphers in energy efficiency are also many. Most hardware stream ciphers (like Trivium and the Grain family) are designed with a few register locations at the beginning being untapped, i.e., not used in the register update. This allows for efficient hardware unrolling, so that, unlike in block ciphers, each individual round in these stream ciphers can be implemented in parallel and hence does not increase the circuit depth. As such, the glitches produced in the circuit of round i do not increase the glitches in round i + 1, at least when the circuit is unrolled for small values of r. Perhaps the most important reason is that stream ciphers perform the key-IV setup only once and are then able to encrypt multiple bits of data without having to do it again. For example, an implementation of Trivium that is unrolled r = 128 times would only need $1152/128 = 9$ clock cycles to complete the key-IV setup and takes 100 more cycles to encrypt up to 12800 bits of data. The most energy-efficient implementation of Midori64 (at degree of unrolling r = 2) needs 8 cycles to encrypt every 64-bit block of data, and hence would need $8 \cdot 12800/64 = 1600$ cycles to encrypt the same length of data, which is around 15 times more. Consequently, lightweight stream ciphers are preferable when factors like energy and throughput are concerned.
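The cycle-count comparison above can be reproduced with a few lines. This is only a sketch of the arithmetic in the text: the function names are ours, the 1152 initialization rounds and the 8 cycles per Midori64 block are the figures quoted in the paragraph, and rounding up partial blocks is an assumption.

```python
import math

def trivium_cycles(bits: int, r: int, init_rounds: int = 1152) -> int:
    """Key-IV setup cycles plus keystream cycles for an r-round unrolled Trivium."""
    return math.ceil(init_rounds / r) + math.ceil(bits / r)

def midori64_cycles(bits: int, cycles_per_block: int = 8, block_bits: int = 64) -> int:
    """Cycles for Midori64 at 8 cycles per 64-bit block (the r = 2 figure from the text)."""
    return cycles_per_block * math.ceil(bits / block_bits)

bits = 12800
t = trivium_cycles(bits, r=128)   # 9 setup cycles + 100 keystream cycles
m = midori64_cycles(bits)         # 8 cycles for each of the 200 blocks
print(t, m, round(m / t, 1))
```

The setup cost is paid once, so its share of the Trivium total shrinks as the message grows, whereas the block cipher pays its full per-block latency for every block.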

Contributions
In this paper, we investigate unrolled stream cipher constructions and make some fundamental discoveries about their energy consumption behaviour. More specifically, our contributions can be summarized as follows:
1. Perfect Tree Energy Model. Our first contribution in this paper is to reimplement r-round unrolled stream cipher circuits in a generic, more energy-efficient manner. We shall define shortly the concept of a circuit strand, which basically comprises the logic functions involved in one register update. We demonstrate that, rather than following the approach in [BMA+18], if we adopt a technique in which each strand is implemented separately as a unit and the circuit synthesizer is prevented from performing any inter-strand optimization, then the power consumption increases more slowly with respect to the degree of unrolling r.
Trivium is especially suited for this restricted mode of compilation and reaches its point of optimality in the fully unrolled setting at r = 288. This optimal energy is significantly lower than that of the 160-round circuit reported in [BMA+18] under the same operating environment.
This tessellation enables us to partition the entire circuit into smaller units, namely the strands. Since these are interconnected, it gives rise to a natural tree structure among them in the following way: a strand j is a child node of strand i if the output of j is one of the inputs of i. By observing the variation of the power consumption in these strands, it is possible to deduce a strong correlation between the power consumed by each strand and its position in the above tree. This leads to the definition of a tree-based metric that correlates with the energy consumption of a wide range of stream ciphers, namely the number of perfect trees in the circuit graph, and thus to the proposal of the first formal energy model for stream cipher constructions akin to that for block ciphers in [BBR16].
2. New Energy-Optimal Stream Ciphers. By leveraging the obtained energy model, we are able to show that register tap positions significantly affect the energy efficiency. Hence, our next step is to design new energy-optimal ciphers, where our approach is to change the register tap positions of the original Trivium cipher. However, changing the register tap positions also affects the security, and we carefully chose these positions without decreasing the claimed security level, i.e., the 80-bit security of Trivium. We present two candidates, which we call Trivium-LE(F) and Trivium-LE(S), that consume around 10-15% and 25% less energy than Trivium, respectively. Note that Trivium-LE(F) is conservative with enough security margin, and Trivium-LE(S) is challenging with a thin security margin. As shown in Figure 1, both constructions stand as the currently most energy-efficient encryption primitives in the literature when at least 24 bytes are encrypted. The energy efficiency of Trivium-LE(F) outperforms known ciphers, and the structure is also useful for designing an energy-efficient message authentication code. We present Trivium-LE-MAC, whose update function is inherited from Trivium-LE(F) but in which the message is absorbed instead of a keystream being generated.
It is important to note that our model makes it possible, for the first time, to design stream ciphers for hardware environments that are specifically optimized in terms of energy consumption, as the metric is both simple and widely applicable. We also applied the same strategy to Triad-SC, which supports 112-bit security, because it seems to be the most promising candidate for energy efficiency due to its shorter state size of 256 bits. By altering tap locations, we present one candidate, which we call Triad-LE, that has a lower energy consumption than the original Triad-SC.

Comparison with Other Works
Note that previous major works in the field of energy efficiency [BBI+15, BBR16, BMA+18] were limited in their approach in the sense that their findings were restricted to a 90 nm standard cell library and energy was computed at 10 MHz throughout. This was feasible as 90 nm standard cells have very low leakage, and at a frequency of 10 MHz or higher the contribution of the leakage energy to the total energy consumption was minimal. Since the dynamic component of the energy is constant with respect to frequency, the energy consumption was more or less constant at all frequencies upwards of 1 MHz (see [BBR16, Fig. 1]). However, we present our findings for 4 different standard cell libraries in which the underlying transistors have sizes 90 nm (TSMC), 65 nm (UMC), 45 nm and 15 nm (NanGate) respectively, and therefore we do not ignore leakage energy.
Although, for presentability, we report results at certain fixed frequencies for each library, primarily to bring out the dynamic part of the energy, the energy trends that we present hold across libraries and a wide range of clock frequencies, and we argue this in the paper. When this is not possible due to space constraints, the results are reported at a clock frequency of 10 MHz for the TSMC 90 nm and UMC 65 nm libraries and at 1 GHz for the NanGate libraries. This is done so that the dynamic energy component is the dominant contributor to the total energy consumption (for better comparison with [BBI+15, BBR16, BMA+18, KDH+12]). All energy figures are reported for the encryption of 1.28 Mbits of data and are generated after a timing simulation of around 10000 test vectors on the corresponding post-synthesis netlist.
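The frequency argument can be illustrated with a simple two-component model: dynamic energy is a fixed cost per clock cycle, while leakage energy is leakage power multiplied by the physical time the task takes. The sketch below assumes exactly this model; the per-cycle dynamic energy and the leakage power are hypothetical numbers, not measurements from any of the cell libraries discussed here.

```python
def energy_split(freq_hz: float, cycles: int, e_dyn_per_cycle: float, p_leak: float):
    """Return (dynamic, leakage) energy in joules for a task of `cycles` clock cycles."""
    dynamic = cycles * e_dyn_per_cycle      # independent of the clock frequency
    leakage = p_leak * cycles / freq_hz     # leakage power times physical time
    return dynamic, leakage

# Hypothetical figures, for illustration only: 100 cycles, 1 pJ per cycle, 1 uW leakage.
CYCLES, E_DYN, P_LEAK = 100, 1e-12, 1e-6

for f in (1e5, 1e6, 1e7, 1e9):
    dyn, leak = energy_split(f, CYCLES, E_DYN, P_LEAK)
    share = leak / (dyn + leak)
    print(f"{f:8.0e} Hz  total {dyn + leak:.3e} J  leakage share {share:.1%}")
```

Under this model the total energy flattens out once the clock is fast enough for the leakage term to become negligible, which is why reporting at a high enough fixed frequency isolates the dynamic component.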
Note that in real-world, on-chip implementations of these circuits, there are typically more sources that cost energy, like (a) energy consumed in the clock tree or (b) energy consumed when the device is idling. In this work, we do not focus on these issues, primarily because they are common to all circuits. Instead, our focus will be on the energy consumed by the circuit itself.

Outline
In Section 2, we present the effects that different compiler directives used to synthesize stream cipher circuits have on the energy consumption. Section 3 details the obtained heuristic energy model. In Section 4, we propose energy-optimal Trivium variants and an energy-efficient message authentication code. Subsequently, in Section 5, we study recent Trivium-like, Grain-like and Subterranean-like constructions proposed in the literature and show that our derived energy model works for these designs too. The paper is then concluded in Section 6.

Restricted Circuits
Combinatorially heavy circuits, such as the increasingly complex algebraic state update equations in r-round unrolled stream ciphers, induce synthesis tools to produce optimized architectures in terms of circuit area. They also introduce a gap when it comes to reasoning about the overall energy consumption, which is significantly hindered as the synthesized circuits have mutated into opaque, garbled constructions.
We find that imposing a regular structure, which is exclusively composed of simple combinatorial logic gates and in which the state update function is replicated unaltered across different r in an unrolled setting, yields equivalent if not better power figures for basic as well as more feature-rich cell libraries when compared to the highly optimized circuits of the Synopsys Design Compiler synthesis tool. We define one such structure as follows: each individual logic block, i.e., the logic that computes one register update bit, is a strand. In Trivium, a strand XORs three state bits with the AND of two further state bits.
A feature-rich library with 3-pin linear cells can implement one strand with 3 gates (1 NAND2, 1 XNOR2, 1 XNOR3), hence the entire Trivium combinatorial layer then consists of 10 gates in total (9 for the 3 strands and one 3-input XOR gate for the output function). A simpler library that only consists of 2-pin linear logic elements, such as the NanGate cell library family, requires 14 gates for the combinatorial layer. A full description of Trivium is given in Appendix A.
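The two gate counts can be checked with a small tally. The 3-pin figures (3 gates per strand, one 3-input XOR for the output) are those given in the text. The split used for the 2-pin library (3 XOR2 plus 1 AND2 per strand, and 2 XOR2 for the output function) is our assumption, chosen because it reproduces the stated total of 14.

```python
STRANDS = 3  # Trivium has three update strands

def combinatorial_gates(gates_per_strand: int, output_gates: int) -> int:
    """Gates in one copy of the Trivium state update plus the output function."""
    return STRANDS * gates_per_strand + output_gates

# 3-pin library: NAND2 + XNOR2 + XNOR3 per strand, one XOR3 for the output.
three_pin = combinatorial_gates(gates_per_strand=3, output_gates=1)

# 2-pin library (assumed split): 3 XOR2 + 1 AND2 per strand, 2 XOR2 for the output.
two_pin = combinatorial_gates(gates_per_strand=4, output_gates=2)

print(three_pin, two_pin)
```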
In this respect, we investigate several circuit and compilation directives supported by the Synopsys Design Compiler.
• Regular. The entire circuit is compiled with the regular compile command, which moderately attempts to optimize the synthesis result. In this setup, the synthesizer is free to choose the mapping and the corresponding optimization. The compiler may choose not to respect the boundaries between two strands and make any optimization it deems fit. This is actually equivalent to the implementation strategy of [BMA+18], in which the compiler has the freedom to optimize given the logical representation of the update function.
• Restricted. Same compilation directive as in the regular configuration, i.e., compile; however, the synthesis of the state update function is restricted to the logical mapping, where the state update circuit for r = 1 is simply replicated for higher degrees of unrolling. Under this directive, the compiler puts together each strand separately and is forced to respect the boundaries between 2 strands. Thus, when used as such, the compiled circuit consists of exactly 3r strands for an r-round unrolled construction.
• Ultra. The circuit is synthesized using the compile_ultra directive, which is a high-effort routine that optimizes beyond the entity boundaries and often yields the most area- and latency-efficient constructions. Here too, the compiler may choose not to respect strand boundaries.
One of our empirical findings is that for Trivium circuits compiled under the Restricted directive, the increase in the power consumption (for encrypting a given number of data blocks) is much slower (with respect to the degree of unrolling r) than for circuits compiled under the Regular or Ultra directives. Note that a more fundamental answer to the question of whether the energy figures increase or decrease when a cipher is further unrolled is directly linked to its latency and power consumption.
Let L(r) be the total number of clock cycles required to encrypt a fixed-size plaintext block in the r-round unrolled setting, and denote by P(r) and E(r) the power and energy values respectively. It is crucial to note that L(r) will decrease and P(r) will increase as r increases (this is true for block ciphers too). Finding the value of r which minimizes $E(r) = P(r) \cdot L(r)$ was exactly the problem studied for block ciphers in [BBR16] and for stream ciphers in [BMA+18].
In Figure 2 and Figure 3, we detail the energy and area simulation results for four standard cell libraries (TSMC 90 nm, UMC 65 nm and NanGate 45 and 15 nm) over a wide range of frequencies. The choice of frequencies was library-specific: they were chosen so that the critical path of the circuit was well below the clock period even when the circuit was fully unrolled. This obviates the need for the compiler to use cells of higher drive strength just to get a positive slack (i.e., to ensure a clock period larger than the critical path), which would alter the basic character of the circuit for different values of r and prevent a fair evaluation. Hence, for the faster NanGate library based circuits we used the frequency range 1 MHz to 1 GHz, and for the other libraries we used the range 0.2 MHz to 100 MHz. We find that the circuits compiled in the restricted mode are by far the most energy-efficient of the three. Their energy consumption more or less decreases monotonically for r ≥ 150, which suggests that if r is allowed to vary up to 288, then the fully unrolled cipher, i.e., r = 288, is the best setup for energy-constrained environments (though not always). This empirical observation naturally allows us to segue into the next round of results in Section 3, where we look more closely at the circuits compiled under the restricted mode.

Perfect Tree Energy Model
For the remaining experiments, we investigate unrolled Trivium circuits with r = 288, since they achieve maximum throughput and deliver close to the best energy efficiency for all libraries across a wide range of frequencies. Though it is theoretically possible to unroll more, it would require more silicon area and improve energy efficiency only fractionally. Since the circuits are compiled in restricted mode, it is possible to see how much power each strand consumes. We commence by introducing some notations and definitions that will help us formalize the write-up better.
As mentioned in Section 2, each state update function of Trivium consists of three strands $t_1, t_2, t_3$, i.e.,
$$t_1 = x_{66} \oplus x_{93} \oplus x_{91} x_{92} \oplus x_{171}, \quad t_2 = x_{162} \oplus x_{177} \oplus x_{175} x_{176} \oplus x_{264}, \quad t_3 = x_{243} \oplus x_{288} \oplus x_{286} x_{287} \oplus x_{69}.$$
Definition 2 (i-th Strand). Denote by $t_i(r)$ the strand for equation $t_i$ in the r-th unrolled round with $i \in \{1, 2, 3\}$ and $r \in \{1, \dots, 288\}$ such that each successive $t_i(r)$ can be recursively defined as:
$$t_1(r) = t_3(r-66) \oplus t_3(r-93) \oplus t_3(r-91)\, t_3(r-92) \oplus t_1(r-78),$$
$$t_2(r) = t_1(r-69) \oplus t_1(r-84) \oplus t_1(r-82)\, t_1(r-83) \oplus t_2(r-87),$$
$$t_3(r) = t_2(r-66) \oplus t_2(r-111) \oplus t_2(r-109)\, t_2(r-110) \oplus t_3(r-69),$$
where $t_1(r) = x_{94-r}$, $t_2(r) = x_{178-r}$ and $t_3(r) = x_{1-r}$ whenever $r \le 0$.
Figure 4 shows the power consumed in each of the strands $t_i(r)$ for increasing values of r for 2 of the libraries we experiment with in this paper. We had expected the power in the strands to increase monotonically with r as in block ciphers, but the figure clearly suggests that the increase is far from monotonic. The red marks represent the strands whose power consumption experiences a sudden dip. This observation seemed at first to be counter-intuitive, and so we set about trying to understand this curious phenomenon. We first observed that all $t_1(r)$ (for $1 \le r \le 66$) consume the same power until $t_1(67)$, whose power consumption is considerably larger (note the red to black jump in Figure 4 around r = 66 for $t_1(r)$ for all the libraries). All inputs to $t_1(r)$ (for $1 \le r \le 66$) come directly from the register. Thus in some sense their input nodes are all at a distance 0 from the register. However, one of the inputs of $t_1(67)$ comes from the output of $t_3(1)$ and thus not all its inputs are at distance 0 from the register. This delay imbalance in the input wires gives rise to more glitches in the internal circuitry of $t_1(67)$, and this hints at one of the reasons why it consumes more. Further consider the boundary around r = 93. At r = 94, the power consumption of $t_1(94)$ drops. It is easy to see that all the inputs of $t_1(94)$ are at distance 2 from the register, whereas the inputs of $t_1(93)$ are still unbalanced with respect to the delay from the register. This led us to believe that delay imbalance plays a major role in determining how much power the strands consume.
Through the Looking Glass. In order to verify the above phenomenon, we looked at the internal timing diagrams of both the strand pairs (a) $t_1(66)$ and $t_1(67)$, and (b) $t_1(93)$ and $t_1(94)$, presented in Figure 5 (the circuit was synthesized using the NanGate 45 nm cell library and clocked at 1 GHz). Let us examine $t_1(66)$. The first 2 input pins $x_1, x_{28}$, according to the circuit synthesizer, have an average delay of 0.09 ns from the clock edge at which the new inputs are written on to the registers. As a result, the output of the first XOR gate in the strand, i.e., $x_1 \oplus x_{28}$, is only moderately glitchy. Over 4450 clock cycles this net switches logic level only 2271 times, as found by a post-synthesis timing simulation on the netlist. On the other hand, in $t_1(67)$, $x_{27}$ is at a delay of 0.09 ns whereas the other input $t_3(1)$ is at an average delay of 0.25 ns. The output of the corresponding XOR gate $x_{27} \oplus t_3(1)$ is glitchier than $x_1 \oplus x_{28}$: it switches 4512 times in the same interval. This clearly indicates that $t_1(67)$ consumes more power. Conversely, consider $t_1(93)$. The first 2 input pins $x_1, t_3(27)$ have delays of 0.09 ns and 0.25 ns from the clock edge. Hence the net $x_1 \oplus t_3(27)$ switches around 4665 times in the same interval. However, in $t_1(94)$, the pins $t_3(1), t_3(28)$ have delays of 0.24 ns and 0.25 ns. Hence, many of the glitches produced by them cancel out and the XOR net $t_3(1) \oplus t_3(28)$ switches 2551 times in this interval. This indicates that depth-balanced strands consume less power than unbalanced ones.
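The effect of input skew on a single XOR gate can be reproduced with a toy simulation. This is a deliberately idealized sketch and not the post-synthesis timing simulation used for the figures above: it assumes uniformly random, independent inputs that change once per clock cycle, a gate with no delay of its own, and a hypothetical threshold `min_pulse` below which two arrivals are treated as simultaneous.

```python
import random

def xor_output_toggles(cycles: int, delay_a: float, delay_b: float,
                       min_pulse: float = 0.05, seed: int = 0) -> int:
    """Count output transitions of an idealized XOR gate over `cycles` clock cycles.

    Both inputs take a fresh random value every cycle and reach the gate after
    delay_a and delay_b (ns). Arrivals closer than min_pulse are treated as
    simultaneous (a crude stand-in for the gate filtering very short pulses).
    """
    rng = random.Random(seed)
    a = b = 0
    toggles = 0
    for _ in range(cycles):
        new_a, new_b = rng.getrandbits(1), rng.getrandbits(1)
        changed_a, changed_b = new_a != a, new_b != b
        if abs(delay_a - delay_b) < min_pulse:
            # Balanced arrival: the output moves only if exactly one input changed.
            toggles += int(changed_a != changed_b)
        else:
            # Skewed arrival: each input change reaches the output separately, so two
            # changes cause a glitch (two transitions with no net change).
            toggles += int(changed_a) + int(changed_b)
        a, b = new_a, new_b
    return toggles

cycles = 4450
balanced = xor_output_toggles(cycles, 0.09, 0.09)     # like x_1 and x_28 in t_1(66)
unbalanced = xor_output_toggles(cycles, 0.09, 0.25)   # like x_27 and t_3(1) in t_1(67)
print(balanced / cycles, unbalanced / cycles)
```

Under these assumptions the balanced gate averages about half a transition per cycle and the skewed gate about one, which is the same order as the measured 2271 versus 4512 transitions over 4450 cycles.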

Circuit to Tree
In order to formalize the above phenomenon, we found that the circuit strands are connected naturally in a well-defined graphical topology.Each unrolled strand can be translated into a 5-ary tree with the root node as the output bit whose subtrees are other unrolled strand trees or leaf nodes.
Definition 3 (Unrolled Strand Tree). Let $T_i(r)$ be the 5-ary unrolled strand tree corresponding to the unrolled strand equation $t_i(r)$. The child nodes of the strand $T_i(r)$ are therefore all the 5 nodes $T_j(u)$ for which the corresponding terms $t_j(u)$ are present in its recursive definition as per Definition 2.
Example 1. To make the link between unrolled strand equations and their respective trees clearer, we give 3 examples of varying complexity. The unrolled strand trees $T_3(1)$, $T_3(100)$ and $T_3(200)$ are displayed in Figure 6. Note that terms that appear several times in an unrolled strand equation result in duplicate nodes in the corresponding unrolled strand tree. This is to ensure that the equations are a one-to-one representation of the actual circuit.
We can further classify our unrolled strand trees as either perfect or imperfect according to the following definitions.
Definition 4 (Perfect m-ary Tree). A perfect m-ary tree is a tree in which all non-leaf nodes have m children and all leaf nodes are at the same depth.
Clearly, the unrolled strand trees in Trivium are 5-ary. Further, remark that in Figure 6, $T_3(1)$ and $T_3(200)$ are perfect unrolled strand trees while $T_3(100)$ is imperfect due to having leaf nodes at different depths. In the example in the previous subsection, $T_1(66)$ and $T_1(94)$ were perfect trees whereas $T_1(67)$ and $T_1(93)$ were not. This gives us a very good understanding of the power consumption of strands vis-à-vis the position of the corresponding nodes in the circuit tree graph. A strand evidently consumes less power if the node it occupies in the circuit graph houses a perfect tree.
Let us try to argue this inductively. A tree is 5-ary perfect if and only if all of its 5 child nodes are perfect trees of the same depth. Thus it is easy to see that in a perfect tree all input nodes are at approximately the same average delay from the register. This being so, all perfect trees tend to consume less power. On the other hand, a tree is imperfect if one of its child nodes is imperfect or if its child nodes differ in depth. In the former case, the gate output corresponding to the imperfect child node is considerably more glitchy; in the latter, the inputs arrive with unequal delays. Either way, the excess glitches are carried forward in the parent strand, making its output glitchier and thus causing it to consume more dynamic power. This observation naturally leads us to the next question: is it possible to have a general Trivium-like stream cipher (with tap locations perhaps different from the original Trivium specifications) that is more energy-efficient and also secure at the same time? The translation of the circuit into an equivalent tree topology gives us a quick way to check this. Since perfect trees consume less dynamic power, a variant of Trivium (with different tap locations) is likely to consume less energy if its circuit tree graph has a larger total number of perfect trees.
Let us provide more arguments as to why the above makes sense. Consider two configurations of Trivium, Trivium-A and Trivium-B, with different tap locations (both synthesized in restricted mode). At a degree of unrolling equal to 288, the circuits of both these variants consist of exactly the same number of gates and flip-flops. Since the leakage power in a circuit depends directly on the total silicon area, both these circuits are likely to consume the same leakage power. Furthermore, the circuit graphs of both these variants have exactly the same number of nodes. If, for example, Trivium-A has more perfect trees in the graph than Trivium-B, then it automatically has fewer imperfect trees than Trivium-B, which more or less implies that Trivium-A is likely to be the variant that consumes less dynamic power. Since the leakage power is the same, this means that Trivium-A consumes less total power and hence less total energy. This of course should hold irrespective of the standard cell library used to synthesize the circuit or the frequency of the signal used to clock the circuit.
We can estimate the total number of perfect trees in a generic Trivium configuration. To ease notation, we denote the total number of perfect trees among all strands $t_i(r)$ as $S(T_i)$, such that the total number of perfect trees in the circuit is $S(T) = \sum_i S(T_i)$. More formally, let f be a function from the set of all trees to {0, 1} such that $f(T_i(r)) = 1$ if and only if $T_i(r)$ is a perfect tree, and 0 otherwise; then $S(T_i) = \sum_r f(T_i(r))$. Below, we report the distribution of perfect unrolled strand trees in the original Trivium. In Trivium, we have $S(T_1) = 105$, $S(T_2) = 141$, $S(T_3) = 93$, and hence $S(T) = 339$. Note that there are no perfect unrolled strand trees of depth 4 or larger.
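The count can be checked by brute force: build each unrolled strand tree recursively and test whether all of its leaves lie at the same depth. The sketch below assumes the standard Trivium tap positions ($t_1$: 66, 93, 91·92, 171; $t_2$: 162, 177, 175·176, 264; $t_3$: 243, 288, 286·287, 69), rewritten as round offsets relative to the strand that feeds each register. The `CHILDREN` table and the function names are ours.

```python
from functools import lru_cache

# Children of each strand as (strand index, round offset) pairs. Register 1 is fed
# by t3, register 2 by t1 and register 3 by t2, so a tap at position p of a register
# fed by strand j becomes the child t_j(r - p). Assumes the standard Trivium taps.
CHILDREN = {
    1: [(3, 66), (3, 93), (3, 91), (3, 92), (1, 78)],
    2: [(1, 69), (1, 84), (1, 82), (1, 83), (2, 87)],
    3: [(2, 66), (2, 111), (2, 109), (2, 110), (3, 69)],
}

@lru_cache(maxsize=None)
def perfect_depth(i: int, r: int):
    """Depth of T_i(r) if it is a perfect 5-ary tree, else None.

    Terms with r <= 0 are register bits, i.e. leaves of depth 0.
    """
    if r <= 0:
        return 0
    depths = {perfect_depth(j, r - off) for j, off in CHILDREN[i]}
    if None in depths or len(depths) != 1:
        return None            # an imperfect child, or children of unequal depth
    return depths.pop() + 1

def count_perfect(i: int, rounds: int = 288) -> int:
    """S(T_i): number of perfect trees among T_i(1), ..., T_i(rounds)."""
    return sum(perfect_depth(i, r) is not None for r in range(1, rounds + 1))

counts = {i: count_perfect(i) for i in (1, 2, 3)}
print(counts, sum(counts.values()))   # the total is 339
```

The same function reproduces the examples discussed earlier: $T_1(66)$ and $T_1(94)$ come out perfect (depths 1 and 2), $T_1(67)$ and $T_1(93)$ do not, and $T_3(200)$ is a perfect tree of depth 3.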

Enumerating Perfect Trees
In the following, let us consider a generic Trivium layout in order to determine configurations that yield a high number of perfect trees and consequently lower the power consumption.
Definition 5. Denote by Trivium(X, n) a generic Trivium configuration composed of n chained registers $(X_1, \dots, X_n)$ such that $X_j$ is the j-th register's leftmost forward tap, $X^f_j$ is the feedback tap and $X^{op}_j$ is the output tap. See Figure 7 for a schematic depiction. Note that $X^{op}_j$ is essentially the final tap location of the j-th register (this is required to ensure the one-to-one nature of the Trivium update). The figure does not explicitly show the taps for the AND gates, as we will show that if both the AND taps are between $X_j$ and $X^{op}_j$, then they do not affect the total number of perfect trees in the circuit graph.
Note that this notation corresponds to n update function strands, hence the unrolled strand tree of $t_j(r)$ is $T_j(r)$.
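Example 2. The original Trivium specification composed of three update function strands is congruent to Trivium(X, 3) where $(X_1, X^f_1, X^{op}_1) = (66, 69, 93)$, $(X_2, X^f_2, X^{op}_2) = (69, 78, 84)$ and $(X_3, X^f_3, X^{op}_3) = (66, 87, 111)$, with an additional non-linear gate between the leftmost and output tap in each register.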
Finding configurations that lead to an increased number of perfect trees seems nontrivial, as the search space is enormous. Additionally, a closed-form solution that evaluates the exact number of perfect trees for a given circuit Trivium(X, n) appears equally hard. A brute-force solution consists of individually creating the unrolled strand tree for each equation and checking that all leaf nodes are at the same distance from the root. However, this approach is expensive and hard to optimize apart from ordinary parallelization. Nevertheless, transcribing the problem into a recurrence relation offers some remedy to this issue.
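Lemma 1. Given an arbitrary, generic Trivium(X, n) circuit composed of n registers, the total number of perfect unrolled strand trees $S(T)$ in the fully unrolled setting is given by
$$S(T) = \sum_{j=1}^{n} \sum_{l=1}^{n} \left(g_l(X_j) - f_l(X_j)\right)^+,$$
where $y^+ = \max\{y, 0\}$ and $f_l(X_j)$, $g_l(X_j)$ are recursively defined functions for $1 \le l \le n$ of the form
$$f_l(X_j) = \max\left\{f_{l-1}(X_{j-1}) + X_j^{op},\; f_{l-1}(X_j) + X_{j+1}^f\right\},$$
$$g_l(X_j) = \begin{cases} \min\left\{g_{l-1}(X_{j-1}) + X_j,\; g_{l-1}(X_j) + X_{j+1}^f\right\} & \text{if } g_{l-1}(X_{j-1}) - f_{l-1}(X_{j-1}) > X_j^{op} - X_j, \\ \min\left\{f_{l-1}(X_{j-1}) + X_j^{op},\; g_{l-1}(X_j) + X_{j+1}^f\right\} & \text{otherwise,} \end{cases}$$
such that $f_1(X_j) = 0$ and $g_1(X_j) = \min\{X_j, X_{j+1}^f\}$. The number of perfect trees of depth $t$ for the $j$-th strand is $S(T_j)|_{depth=t} = (g_t(X_j) - f_t(X_j))^+$. Hence the total number of trees of all depths is $S(T_j) = \sum_{l=1}^{n} (g_l(X_j) - f_l(X_j))^+$, and thus the lemma follows.
We remark that since there are n registers indexed 1 to n, the value of $j + 1$ (resp. $j - 1$) refers to addition (resp. subtraction) mod n in the set $\{1, 2, \ldots, n\}$. Further note that a tree is perfect if and only if all its subtrees are perfect.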
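The recurrence can be evaluated in a handful of lines. The sketch below follows the recursion for $f_l$ and $g_l$ as it is derived in Appendix B; the function name `perfect_tree_counts`, the 0-based list layout and the Trivium tap values are our assumptions for illustration.

```python
def perfect_tree_counts(X, Xf, Xop):
    """Per-strand counts S(T_j) from the f_l / g_l recurrence (0-based lists).

    X[j], Xf[j], Xop[j] are the forward, feedback and output taps of register j.
    """
    n = len(X)
    f = [0] * n                                           # f_1(X_j) = 0
    g = [min(X[j], Xf[(j + 1) % n]) for j in range(n)]    # g_1(X_j)
    totals = [max(g[j] - f[j], 0) for j in range(n)]
    for _ in range(2, n + 1):                             # depths l = 2..n
        nf, ng = [0] * n, [0] * n
        for j in range(n):
            p, q = (j - 1) % n, (j + 1) % n
            nf[j] = max(f[p] + Xop[j], f[j] + Xf[q])
            if g[p] - f[p] > Xop[j] - X[j]:
                first = g[p] + X[j]
            else:
                # Too few depth-(l-1) trees in strand j-1: force g_l <= f_l.
                first = f[p] + Xop[j]
            ng[j] = min(first, g[j] + Xf[q])
        f, g = nf, ng
        for j in range(n):
            totals[j] += max(g[j] - f[j], 0)
    return totals


# Assumed taps of the original Trivium.
counts = perfect_tree_counts([66, 69, 66], [69, 78, 87], [93, 84, 111])
print(counts, sum(counts))
```

For the assumed Trivium taps this evaluates to 105, 141 and 93 trees for the three strands, i.e., 339 in total, using only a constant amount of work per register and depth.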
Proof. (Intuition) From Figure 8, we can see that there are certain values of $r$ for which the circuit for $t_j(r)$ produces a perfect depth 1 tree, a perfect depth 2 tree, etc. We define two families of functions $f_t$, $g_t$ such that $f_t(X_j) + 1$ is the minimum value of $r$ for which $t_j(r)$ corresponds to a perfect depth $t$ tree, and similarly $g_t(X_j)$ is the maximum such value of $r$. It stands to reason that the total number of depth $t$ trees produced in this range of $r$ is $g_t(X_j) - f_t(X_j)$. Note that if $g_t(X_j) \le f_t(X_j)$ for some $t$, then there do not exist any perfect depth $t$ trees. It remains to show that $f_t$ and $g_t$ can be recursively defined. The full proof is considerably involved and is given in Appendix B.
To conclude, let us argue why the number of perfect trees is independent of the AND gate taps as long as they are to the right of the leftmost tap $X_j$. We argue with the help of our previous example: $t_1(66)$ corresponds to a perfect tree but $t_1(67)$ is imperfect. This is because, in the process of unrolling, $X_1 + 1$ is the first value of $r$ at which $t_1(r)$ no longer takes all its inputs directly from the register. Thereafter, it does not matter where exactly the AND taps are as long as they are to the right of $X_1$: all subsequent values of $r$ until $X_1^{op}$ continue to produce imperfect trees.
Verification: In order to verify our hypothesis (at least empirically) that (a) the number of perfect trees is actually a good indicator of the energy consumption of a generalized Trivium circuit, and (b) that the above holds irrespective of the cell library used to construct the circuit or the frequency of the signal used to clock it, we performed an extensive simulation experiment. We generated a large number of Trivium circuits with random taps and calculated the number of perfect trees with the help of the recursion formula given above. We synthesized each circuit in restricted mode using the 4 cell libraries used in all of our experiments and computed the total power consumed at a wide range of frequencies. The results are plotted in Figure 9. Not only is there a strong negative correlation between the power consumed (and hence energy) and the number of perfect trees, but the results also hold across libraries and clock frequencies as claimed in Section 3.1. For each cell library the same trend is visible across all frequencies. Similarly, the leakage power of each random Trivium instance is the same and frequency independent (say it is equal to $P_l$), and decreasing the frequency (alternatively, increasing the clock period by $\Delta T$) only increases the physical time required for encrypting a fixed-size plaintext block by an amount proportional to $\Delta T$. It follows that the leakage energy of each Trivium instance increases by an amount proportional to $P_l \cdot \Delta T$ when the frequency is decreased.
Since the dynamic energy is frequency independent, varying only the frequency, all other things remaining the same, is equivalent to translating each energy scatter plot by a constant amount along the Y (energy) axis.
Note that even for configurations with the same number of perfect trees, there may be a slight variation in energy consumption, but this variation becomes negligible as the number of perfect trees increases. This really depends on how badly the imperfect trees are configured in the graph, i.e., configurations with a large number of trees with a wide variation of delays at their input nodes tend to consume more energy. To model such situations when the number of perfect trees is small, one can think of secondary metrics like the distribution $D(x)$ of the number of trees for which the absolute difference between the maximum and minimum depths of the leaves is equal to $x$ (note that $D(0)$ is the number of perfect trees). It is easy to see that configurations for which $D(x)$ is lower for higher values of $x$ (i.e., fewer highly imbalanced trees) are better for energy. Also note that the graph tells us that to get any significant decrease in energy consumption over the original specification of Trivium (around 10-20%), one needs at least 500 perfect trees.
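The translation argument can be written out explicitly. In the following sketch, $E_{dyn}$ (the dynamic energy), $N_c$ (the number of clock cycles needed to encrypt the fixed-size block) and $T_{clk}$ (the clock period) are symbols we introduce for illustration only:

```latex
\begin{align*}
E(T_{clk}) &= E_{dyn} + P_l \cdot N_c \cdot T_{clk}, \\
E(T_{clk} + \Delta T) - E(T_{clk}) &= P_l \cdot N_c \cdot \Delta T .
\end{align*}
```

Under the stated assumption that $P_l$ and $N_c$ are identical for all random instances, the right-hand side does not depend on the tap configuration, so every point of the scatter plot moves by the same amount.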
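As a rough illustration of the secondary metric, the sketch below tabulates $D(x)$ for one configuration by computing the minimum and maximum leaf depth of every unrolled strand tree. The tap values are assumed to be those of the original Trivium, and `depth_range` and `imbalance_distribution` are hypothetical helper names.

```python
from collections import Counter
from functools import lru_cache

# Assumed original-Trivium taps per strand: taps inside register j,
# and the feedback tap of register j+1 that strand j also reads.
REG_TAPS = [(66, 91, 92, 93), (69, 82, 83, 84), (66, 109, 110, 111)]
FB_TAP = [78, 87, 69]


@lru_cache(maxsize=None)
def depth_range(j, r):
    """(min, max) leaf depth of the unrolled strand tree T_j(r), j 0-based."""
    kids = [(0, 0) if r <= t else depth_range((j - 1) % 3, r - t)
            for t in REG_TAPS[j]]
    kids.append((0, 0) if r <= FB_TAP[j] else depth_range(j, r - FB_TAP[j]))
    return 1 + min(k[0] for k in kids), 1 + max(k[1] for k in kids)


def imbalance_distribution(rounds=288):
    """D[x] = number of trees whose max and min leaf depths differ by x."""
    D = Counter()
    for j in range(3):
        for r in range(1, rounds + 1):
            lo, hi = depth_range(j, r)
            D[hi - lo] += 1
    return D


D = imbalance_distribution()
print(sorted(D.items()))
```

Here `D[0]` is the perfect-tree count, and the mass at larger `x` indicates how imbalanced the remaining trees are.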

Post-Routing
Power measurements of integrated systems are usually carried out at the gate level on the post-synthesis netlist and do not account for effects that normally arise after the circuit has been mapped into silicon, which are mainly due to parasitics introduced by interconnects. Hence, in this post-routing setting, the obtained post-synthesis figures need to be reevaluated to obtain a more accurate picture. In our case, we repeated the experiments from Figure 9 post-route using the Cadence Innovus place-and-route implementation system for the TSMC 90 nm cell library and report that the perfect tree model applies in almost the same magnitude as in the post-synthesis setting. On average, the added interconnect circuitry imposes an area penalty of roughly 4-5% and thus does not affect the overall results, as shown in Figure 10.

Energy-Optimal Variants of Trivium
Before we start to look for more energy-efficient Trivium configurations with more perfect trees, let us once again look at the recursion relationship we have just stated. Note that most perfect trees are at depth 1. In order to increase the number of depth 1 perfect trees, we need higher values of $g_1(X_j) = \min\{X_j, X_{j+1}^f\}$, i.e., each tap location should be chosen towards the end of the register. Naturally, it is not possible to choose each tap location for energy-efficiency reasons alone, as the new configuration must be as secure as the original Trivium. Since the search space is large, we decided to follow the following criteria, inherited from the original Trivium:

A: The linear tap locations $X_i$, $X_i^f$ and $X_i^{op}$ are chosen from the multiples of 3, i.e., they are divisible by 3 for all $i$.

B: The locations of the AND gates are fixed such that their two inputs are not divisible by 3. In Trivium, $X_i^{op} - 1$ and $X_i^{op} - 2$ are chosen for all $i$. However, as discussed in the previous section, the impact on the energy consumption is negligible as long as the number of perfect trees is the same. Therefore, we change the AND locations to $X_i + 1$ and $X_i + 2$. Then, the number of perfect trees does not change, and the number of AND gate applications increases along with the number of rounds. Thus, this choice benefits security without increasing the energy consumption.

C: Each tap location $X_i$ and $X_i^f$ is larger than 64, such that a 64× parallel implementation is possible in software.

D: Under the condition where the output of each AND gate is approximated to 0, we denote by $\epsilon$ the maximum correlation in a linear combination of keystream bits. In Trivium, $\epsilon = 2^{-72}$, and it is quite robust against linear attacks because at least $2^{144}$ keystream bits are required. For a cipher targeting 80-bit security, $\epsilon \le 2^{-40}$ is necessary.
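Criteria A to C are purely structural and can be checked mechanically. The following sketch shows one way to filter candidate tap sets; the function name `satisfies_criteria` and the list-based encoding are our assumptions, and criterion D is not covered because it requires a separate correlation search.

```python
def satisfies_criteria(X, Xf, Xop, and_offsets=(1, 2)):
    """Check criteria A, B and C for one candidate configuration.

    X, Xf, Xop: forward, feedback and output taps, one entry per register.
    and_offsets: AND taps relative to X_i (X_i + 1 and X_i + 2 by default).
    """
    # A: all linear taps are multiples of 3.
    crit_a = all(v % 3 == 0 for v in (*X, *Xf, *Xop))
    # B: neither AND input position is divisible by 3.
    crit_b = all((x + d) % 3 != 0 for x in X for d in and_offsets)
    # C: forward and feedback taps are larger than 64.
    crit_c = all(v > 64 for v in (*X, *Xf))
    return crit_a and crit_b and crit_c


# Linear taps assumed to be those of the original Trivium.
print(satisfies_criteria([66, 69, 66], [69, 78, 87], [93, 84, 111]))
```

Because criterion A forces $X_i$ to be a multiple of 3, the positions $X_i + 1$ and $X_i + 2$ satisfy criterion B automatically; the explicit check only matters if other AND offsets are explored.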
In particular, A and B are two of the most important criteria in the design philosophy of Trivium. Thanks to them, we can expect that $\epsilon$ in criterion D is the highest correlation even when the condition that the output of each AND gate is approximated to 0 is removed. The reason is as follows. Under parameters following A and B, the whole cipher is divided into three sub-ciphers, and the sub-ciphers are connected only non-linearly. In D, we first evaluate the correlation under the restriction that the output of each AND gate is always approximated to 0. In other words, only one sub-cipher is active, and the other two are inactive. Of course, this restriction is not exhaustive. However, intuitively, we are unlikely to find a better distinguisher beyond this restriction: if at least one AND gate output $x \cdot y$ is approximated to $x$, $y$, or $x + y$ instead of 0, then at least two sub-ciphers are active. This intuitively increases the number of active AND gates and makes constructing linear distinguishers with high correlation much harder.

Table 1: List of configurations and associated security parameters. $\epsilon$ represents the maximum linear bias. $T$ is the complexity of the guess-and-determine attack. $c$ represents an additional cost required to do Gaussian elimination to solve a set of linear equations to recover the internal state. The first row represents the parameters for the original Trivium.

As a result, the three criteria A, B, and C allow us to reduce the number of candidates to $28534800 \approx 2^{24.8}$, and exhaustive search is possible. We exhaustively searched for the best candidates, i.e., those maximizing the number of perfect trees, for correlation $\epsilon \in \{2^{-72}, 2^{-68}, 2^{-64}, 2^{-60}, 2^{-56}, 2^{-52}, 2^{-48}, 2^{-44}, 2^{-40}\}$ in D. In Table 1, we list the best candidates for each $\epsilon$. In addition, we also applied Maximov and Biryukov's guess-and-determine attack [MB07] to each of the candidates and list the result. This attack exploits the weakness of the multiple-of-3 choice, and it shows that Trivium has 80-bit security but does not have 128-bit security even if the key length is simply extended to 128 bits. Note that this attack has many parameters and scenarios. The complexity listed in Table 1 is for the so-called scenario T1, i.e., the time complexity is minimized under the condition that solving only a linear system is enough to recover the key.
It is clear from the table that an increase in the number of perfect trees is generally accompanied by an increased maximum linear bias and a decrease in the complexity of the guess-and-determine attack. Considering $c \approx 2^{16}$, all parameters would have 80-bit security, but the security margin is very narrow for parameters whose $\epsilon$ is close to $2^{-40}$. The parameter in row 2 is the best one whose correlation is as low as that of the original Trivium, but its number of perfect trees is not over 500.

Trivium-LE(F) and Trivium-LE(S)
Having established a set of potential configurations, we proceed to the proposal of two energy-optimal Trivium-like designs.
A graphical depiction of those parameters is given in Figure 11.This choice gives us a decrease in energy of around 15% over the original Trivium and still provides us with some headway over the margins of security.We therefore propose this parameter set as a more energy-efficient variant of Trivium and call it Trivium-LE(F).
We keep the key-IV setup and initialization routines for Trivium-LE(F) the same as in Trivium.
For completeness, we round off this section with a preliminary security analysis.Since only the tap locations are modified in Trivium-LE(F), all types of attacks against Trivium can be applied against Trivium-LE(F).Three important attacks against Trivium are discussed below.
Linear Distinguishing Attack. In order to achieve 80-bit security, there should not be linear distinguishers whose correlation is higher than $2^{-40}$. As we already discussed in the previous section, the best correlation is $2^{-72}$ when outputs of AND gates are approximated to 0. While it is unlikely to find better distinguishers due to the multiple-of-3 property, we heuristically evaluated the case where these outputs are not approximated to 0. As expected, we could not find linear distinguishers with correlation higher than $2^{-72}$.
Maximov and Biryukov's Guess-and-Determine Attack. This attack mainly exploits the multiple-of-3 property of Trivium, and it should be effective because Trivium-LE(F) also inherits the multiple-of-3 property. This attack first divides the internal state into three sub-states and consists of two phases. In the first phase, we guess one of the three sub-states at some time. In the second phase, assuming that the sub-state is guessed correctly, we recover the rest of the bits, i.e., $288 \times 2/3 = 192$ bits. To this end, we guess outputs of some AND gates and collect keystream bits, which are linearly represented by the internal state. In the so-called scenario T0, no output of any AND gate is guessed. When we use T0 to attack Trivium-LE(F), the time complexity is $c \cdot 2^{74.0}$, which is the same as for the attack against Trivium in the same scenario. However, only 96 linear equations are collected for the second phase, which is not enough to recover the remaining 192 bits. Thus, we need to solve a nonlinear system, for which no efficient algorithm is known. In scenario T1, outputs of some AND gates are guessed to collect enough linear equations to recover the remaining 192 bits. When we use T1 to attack Trivium-LE(F), the time complexity is $c \cdot 2^{81.3796}$, where 48, 45, and 44 outputs of AND gates are guessed for the three registers, respectively. Then, we can collect 192 linear equations for the second phase, and an efficient algorithm such as Gaussian elimination is available. Considering $c \approx 2^{16}$, Trivium-LE(F) is secure enough against this attack.
Cube Attack. Unlike the attacks above, the target of the cube attack is the initialization phase of the cipher. The cube attack was initially introduced in [DS09]. The original attack was experimental and its aim was to find linear or quadratic superpolies. However, since the division-property based cube attack was proposed [TIHM17, TIHM18], a theoretical security estimation has become possible, and nowadays the best cube attacks against Trivium are based on the division-property based method [WHT + 18, HLM + 20, HIJ + 19]. Cube attacks exploit low algebraic degree in the initialization. The first keystream bit is regarded as the output of the Boolean function $f_k(iv)$. To execute cube attacks, the superpoly has to be recovered, and this becomes impossible several rounds after the degree of $f_k(iv)$ reaches 80. In practice, the best cube attack against Trivium reaches 842 rounds [HLM + 20, HSWW20], and the degree reaches 80 in 840 rounds. Therefore, we first investigated the algebraic degree of $f_k(iv)$ by using the bit-based division property [Tod15, TM16], and the left plot in Figure 12 shows the increase in the upper bound of the algebraic degree. Thanks to the changed location of the AND gates, the algebraic degree of Trivium-LE(F) increases faster than that of Trivium, and the degree reaches 80 in 780 rounds. Moreover, to conservatively verify that the degree of the superpoly is high enough, we also investigated the upper bound of the algebraic degree of $f(k, iv)$ by using the bit-based division property. The right plot in Figure 12 shows the increase in the upper bound of the algebraic degree. After about 900 rounds the upper bound is full, i.e., 160, which implies that the degree of the superpoly is unlikely to be low even if we use the 80-dimensional cube. In both cases, $f_k(iv)$ and $f(k, iv)$, Trivium-LE(F) is more secure than Trivium against cube attacks. Thus, we conclude that Trivium-LE(F) has a large security margin against cube attacks.
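The arithmetic behind these figures is short enough to spell out. The following worked computation only combines numbers stated above, with the estimate $c \approx 2^{16}$ taken as given:

```latex
\begin{align*}
\text{bits left after guessing one sub-state} &: 288 \cdot \tfrac{2}{3} = 192, \\
\text{T0: linear equations collected} &: 96 < 192 \quad (\text{a nonlinear system remains}), \\
\text{T1: AND outputs guessed} &: 48 + 45 + 44 = 137, \\
\text{T1: total cost} &: c \cdot 2^{81.3796} \approx 2^{16} \cdot 2^{81.38} = 2^{97.38} > 2^{80}.
\end{align*}
```

Even if the overhead $c$ were ignored entirely, the T1 exponent of about 81.38 would remain slightly above the 80-bit target.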

Figure 12: Increase in algebraic degree with respect to the number of initialization rounds, for Trivium (original) and Trivium-LE(F).
We have found that Trivium-LE(S) is algebraically weaker than Trivium, inasmuch as the algebraic degree of its output bit increases more slowly. It needs 1050 and 1200 rounds for the upper bounds on the degree of $f_k(iv)$ and $f(k, iv)$, respectively, to become full. Compared to the original Trivium, the increase of the degree is about 25% slower. Therefore, we suggest that, for a safe security margin, the number of initialization rounds used with this variant be $288 \times 5 = 1440$. Note that in terms of a 288 times unrolled circuit, this variant takes only 1 extra clock cycle to initialize, and so, asymptotically speaking, the energy consumption does not increase due to this extra cycle. Due to space constraints, we defer the security analysis to Appendix C.
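The cycle count can be checked with a one-line computation. In the sketch below, the helper name `init_cycles` is hypothetical, and the figure of 1152 (that is, $288 \times 4$) initialization rounds for the original Trivium is taken from its specification rather than from this section.

```python
import math


def init_cycles(init_rounds, unroll=288):
    """Clock cycles needed to run init_rounds on an unroll-times unrolled circuit."""
    return math.ceil(init_rounds / unroll)


trivium_cycles = init_cycles(1152)   # original Trivium: 4 x 288 rounds
le_s_cycles = init_cycles(1440)      # suggested for this variant: 5 x 288 rounds
print(trivium_cycles, le_s_cycles, le_s_cycles - trivium_cycles)

# Ratio of the rounds needed to reach degree 80: 1050 here versus 840 in Trivium.
print(1050 / 840)
```

The one additional cycle is a fixed cost per initialization, so its share of the total energy shrinks as the amount of encrypted data grows.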

Trivium-LE-MAC
In addition to the stream ciphers Trivium-LE(F) and Trivium-LE(S), we also propose a message authentication code (MAC) scheme called Trivium-LE-MAC, which is designed by slightly modifying the round function of Trivium-LE(F). To realize a MAC scheme whose energy consumption is competitive with the stream cipher Trivium-LE(F), it should absorb a 1-bit message into the internal state in every round. While the easiest method is simply XORing the 1-bit message with any 1 bit of the internal state, this is not secure enough against forgery attacks. To guarantee forgery security, we evaluated the lower bound on the number of active AND gates with an MILP-based method when two different messages are absorbed. After exhaustive experimentation, we found that a 1-bit message has to be XORed to at least 3 positions of the internal state to be secure against forgery attacks. For example, one possible choice is to XOR the 1-bit message with the three output bits of the state update function, i.e., $t_1$, $t_2$, $t_3$. From an energy perspective, it is advisable that these injections take place as close as possible to the register inputs, i.e., at locations $a_{u_1}$, $b_{u_2}$, $c_{u_3}$ for small values of $u_1$, $u_2$, $u_3$. If we model the message inputs as zero-depth nodes, then each strand $t_i(r)$ corresponds to a 6-ary tree. It is, for example, easy to see that the first $1 \le r \le X_i - u_i$ strand trees for $t_i(r)$ are all depth 1 perfect 6-ary trees. Hence lower values of $u_i$ intuitively make sense.
On the other hand, we chose the injected positions by respecting the multiple-of-3 property in order to efficiently evaluate the resistance against forgery attacks with an MILP-based method. Specifically, in the constructed model, the message difference is only allowed to be injected at clock cycles $3j_1 + j_0$ ($j_1 \ge 0$) when the non-zero difference is first introduced at clock $j_0$. Moreover, the output difference of an active AND gate is always assumed to be 0. The goal is to minimize the number of active AND gates in a trail available for the forgery attack, i.e., a trail in which the difference of the whole internal state becomes zero after a certain number of clocks. We evaluated all possible candidates of the three injected positions $(a_{1+3i_0}, b_{1+3i_1}, c_{1+3i_2})$ where $0 \le i_0 \le 30$, $0 \le i_1 \le 32$ and $0 \le i_2 \le 31$. When the total distance from the first bit of each register is smaller than 69, the maximal number of active AND gates among all the candidates is 72. Thus, we choose the best candidate $(a_1, b_7, c_1)$, which reaches 72 active AND gates while achieving the smallest total distance of 6.
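The size of this search space is easy to reproduce. The sketch below only enumerates the candidate positions and their total distance; it does not reproduce the MILP evaluation of active AND gates, and the function name `injection_candidates` is hypothetical.

```python
from itertools import product


def injection_candidates(max_total_distance=None):
    """List ((u1, u2, u3), distance) for positions (a_{1+3i0}, b_{1+3i1}, c_{1+3i2}).

    distance is the summed offset from the first bit of each register.
    If max_total_distance is given, keep only candidates strictly below it.
    """
    out = []
    for i0, i1, i2 in product(range(31), range(33), range(32)):
        dist = 3 * (i0 + i1 + i2)
        if max_total_distance is None or dist < max_total_distance:
            out.append(((1 + 3 * i0, 1 + 3 * i1, 1 + 3 * i2), dist))
    return out


all_candidates = injection_candidates()
close_candidates = injection_candidates(69)
print(len(all_candidates), len(close_candidates))
print(((1, 7, 1), 6) in close_candidates)
```

Every one of these candidates still has to be scored by the MILP model; the enumeration merely shows that the space is small enough for exhaustive evaluation.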
Algorithm 1 shows the specifications.Note that Trivium-LE-MAC inherits the security level of Trivium-LE(F) against any key-recovery attack, i.e., 80-bit security.However, the tag length is at most 64 bits.In other words, the security level of the integrity is at most 64 bits.

Remark About Authenticated Encryption
Authenticated encryption schemes attract strong interest from both industrial and academic communities, and an authenticated encryption scheme using Trivium-LE(F) would be beneficial to lower energy consumption. In one of the possible constructions, inspired by the duplex sponge construction, a 1-bit message is absorbed into the internal state, and simultaneously, a keystream bit is squeezed to encrypt the 1-bit message. However, since the round function of Trivium-LE(F) is very sparse, the guess-and-determine attack can recover the internal state with practical complexity when attackers can control and observe partial information in the internal state at the same time. Such an event would not happen when the implementation respects the nonce and never releases unverified plaintexts. However, in case such implementation issues happen, attackers can recover the secret key with practical complexity.
We think the risk above should be avoided.Consequently, we suggest the so-called generic construction such as [NRS14], where the authenticated encryption with associated data can be constructed by an IV-based symmetric-key cipher and a message authentication 112-bit security.It counters the guess-and-determine attacks by using one additional AND gate over and above the original architecture of Trivium.The update functions in Triad-SC are asymmetric: t 1 is of the form Hence, T 1 (r)'s are 7-ary trees and T 2 , T 3 (r)'s are 5-ary trees.
We performed a similar experiment for all the above ciphers as for the original Trivium: (a) We synthesized the circuit in restricted mode and recorded the power consumed in each strand. The results presented in Figure 14 show that strands associated with perfect trees consume much less power than the strands with imperfect trees. (b) We generated numerous random instances of these ciphers with different tap locations. For every instance we plot the power consumed against the number of perfect trees in the circuit tree graph. The results are plotted in Figure 13. They are as expected: there is a strong negative correlation between the energy consumed and the number of perfect trees. Due to space constraints, in the figure we plot results only for the TSMC 90 nm cell library (with power measured at 10 MHz); however, extrapolating the results from Section 3.2, we can make a similar argument that the results are neither library nor frequency specific.

Applicability to Grain-128
The concept of a circuit strand seamlessly translates over to Grain-128 [HJMM06], whose round function consists of two distinct strands that update two registers $x_1, x_2, \ldots, x_{128}$ and $y_1, y_2, \ldots, y_{128}$ such that $(x_1, \ldots, x_{128}) \leftarrow (f, x_1, \ldots, x_{127})$ and $(y_1, \ldots, y_{128}) \leftarrow (g, y_1, \ldots, y_{127})$, where $f$ and $g$ are linear and non-linear functions, respectively, as defined in [HJMM06]. Evidently, the complexity of the update function in this family is higher than in Trivium, and so finding a sensible restricted circuit configuration for these complex strands is a harder task. Even if we define the strand as a sub-circuit for the $f$, $g$ functions, it is not immediately clear which configuration of gates is the best way to construct each strand. We can, however, delegate this responsibility to the circuit compiler, so that in restricted mode it still respects the boundary between the strands but chooses the internal structure of each strand independently. In Figure 15, we repeat the experiments from Section 2 by letting the synthesizer choose the circuit for each individual strand of $f$ and $g$ in restricted mode. The results in Figure 15 suggest that, for Grain-128, restricted mode performs at least on par with other synthesis modes, indicating that further optimizations for the restricted mode are possible by finding better circuit configurations. Additionally, by repeating the experiments from Section 3.2, we observe that the number of perfect unrolled strand trees also correlates with the power consumption, although in a weaker form; hence our results are also applicable to stream ciphers whose state update functions are significantly more complicated than in Trivium-like ciphers. Due to space constraints, we refer the reader to Figure 19 and Figure 20 in Appendix F.

Applicability to Subterranean-Deck
Unlike Trivium and Grain, the Subterranean-Deck [DMMR20] stream cipher does not feature a rotating register; instead, in each round each state bit $x_1, x_2, \ldots, x_{257}$ is replaced by the output of a single strand that is replicated 257 times, such that $(x_1, \ldots, x_{257}) \leftarrow (T_{\pi(1)}, T_{\pi(2)}, \ldots, T_{\pi(257)})$ for some permutation $\pi$. This strand can be realized with 3 NOT, 3 NAND, 1 XNOR3 and 1 XOR4 gates for feature-rich cell libraries and with 3 NOT, 3 NAND and 4 XOR2 gates for the more basic libraries. Denote $t_i = x_i + (x_{i+1} + 1) x_{i+2}$. Each $T_i$ consists of 3 sub-strands $t_i$, $t_{i+3}$, $t_{i+8}$ of exactly equal circuit complexity. In restricted mode, if each $T_i$ is compiled separately, then the sub-strand $t_i$ is replicated 3 times, in the strands $T_i$, $T_{i-3}$, $T_{i-8}$. This increases the circuit size three times and also adds unnecessary power consumption. Instead, if we choose $t_i$ (in place of $T_i$) as the minimal unit whose compilation is restricted, then this replication can be avoided, and this is precisely what we do here. This simplicity lends itself well to the restricted mode of synthesis, as shown in Table 2, and this mode is decidedly better energy-wise.
A unique property of this structure is that all the sub-strands $t_i$ that constitute the round function at any level are perfect trees in the corresponding circuit graph. As a result, it is expected that all the sub-strands at a given level consume similar energy. This is borne out by simulations performed on 4-round unrolled Subterranean-Deck as shown in Figure 16.

Figure 16 (caption): The points in red, black, blue and green represent the power consumed by the strands $t_i$ in the first, second, third, and fourth levels of the round function.
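The replication argument can be illustrated with a small model. In the sketch below we assume that $T_i$ is the XOR of its three sub-strands, use 0-based indices modulo 257, and omit the round constant and the permutation $\pi$; the function names are hypothetical.

```python
N = 257  # state size of Subterranean-Deck


def chi_substrand(x, i):
    """t_i = x_i + (x_{i+1} + 1) * x_{i+2} over GF(2), indices mod 257."""
    return x[i % N] ^ ((x[(i + 1) % N] ^ 1) & x[(i + 2) % N])


def strand(x, i):
    """T_i modelled as t_i XOR t_{i+3} XOR t_{i+8} (an assumption of this sketch)."""
    return chi_substrand(x, i) ^ chi_substrand(x, i + 3) ^ chi_substrand(x, i + 8)


def substrand_fanout():
    """Count in how many strands T_* each sub-strand t_i appears."""
    uses = [0] * N
    for i in range(N):
        for off in (0, 3, 8):
            uses[(i + off) % N] += 1
    return uses


print(set(substrand_fanout()))  # every t_i is shared by the same number of strands
```

Compiling each $t_i$ once and sharing its output among the strands that use it is what avoids the threefold duplication described above.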

Triad-LE
Similarly to the design of Trivium-LE(F) and Trivium-LE(S) based on Trivium, we present a low-energy variant of Triad-SC [BIM + 19], which we call Triad-LE, that consumes around 5% less energy than Triad-SC. Triad-SC's update function has a similar structure to Trivium, but the linear tap locations are chosen with respect to the multiple-of-2 property instead of the multiple-of-3 property. Each strand basically has the form , but to enhance the security level, the first strand is of the form ). By comprehensive cryptanalysis and evaluation using the number of perfect trees, the following parameters are adopted in Triad-LE.
Except for these changed tap positions, the specification is inherited from the original Triad-SC. Due to the page limit, we do not show the details of the analysis here; they are provided in Appendix E. As far as the known attacks we applied are concerned, the immunity of Triad-LE is almost equivalent to that of the original Triad-SC.

Conclusion
In this paper, we make some fundamental observations about the energy consumption of hardware-targeted stream ciphers and propose the first heuristic energy model that is based on the novel perfect tree metric. Our model is both simple and applicable to a wide range of stream ciphers, and thus enables designers of future algorithms to specifically optimize for energy consumption. The perfect tree energy model finds direct application in Trivium-LE(F) and Trivium-LE(S), which stand as the most energy-efficient encryption primitives known in the literature with a 10-15% (resp. around 25%) margin to the next best cipher. A complete summary of all measurements is given in Table 3. Finally, we extend the reach of our model beyond stream ciphers and propose a novel, energy-efficient MAC, Trivium-LE-MAC, that can then be used to bootstrap an energy-efficient AEAD mode.

A Trivium
The stream cipher was proposed by De Cannière and Preneel [De 06] and is an eSTREAM portfolio member.The construction is specifically tailored to constrained hardware devices and can thus be efficiently implemented on a small circuit area budget.It features a state of 288 bits and key size of 80 bits alongside an initialization vector of the same length.At the beginning of the initialization phase, the 80-bit secret key K = (k 1 , . . ., k 80 ) and the publicly known 80-bit initial vector IV = (iv 1 , . . ., iv

B Proof of Lemma 1
We prove Lemma 1 by induction.
Base Case: Consider $t_1(r)$ in the original Trivium, whose recursive description is given in Figure 6. We know that for $r = 1, \ldots, 66$ ($= X_1$), $t_1(r)$ can be written as a function of depth 0 nodes of the circuit, i.e., the state variables $x_1, x_2, x_3, \ldots, x_{288}$, and it is easy to see that all $t_1(r)$, $r \in [1, 66]$, are perfect depth 1 trees. For $r = 67$, $t_1(r)$ is expanded as $t_3(1) + x_{27} + x_{28} \cdot x_{29} + x_{105}$. Note that $t_3(1)$ is no longer a depth 0 node, and hence $t_1(67)$ is not a perfect tree. Also consider a slightly modified form of Trivium in which $X_2^f = 62$ (say). In this case, the feedback term of $t_1(r)$ is tapped at position 62, and it is easy to see that $t_1(r)$ is a perfect depth 1 tree only up to $r = 62$, as $t_1(63)$ will involve a $t_1(1)$ term, which is no longer at depth 0. Thus the number of perfect depth 1 trees for $t_1(r)$ in a generalized Trivium circuit has to be the smaller of 66 and 62, i.e., $\min\{X_1, X_2^f\}$. Does this also depend on the tap positions of the two AND gates and the final XOR term $t_3(r - 93)$? The final XOR term must be tapped from the final location of each register to ensure that the state update function is one-to-one. So numerically, $X_j^{op}$ has to be the length of register $j$. Since $X_j$ is an intermediate location and $X_j^{op}$ is the final location of register $j$, we always have $X_j < X_j^{op}$. If we select the tap locations for the AND gates in the range $(X_j, X_j^{op})$, it is easy to see that the perfect depth 1 trees only occur up to the smaller of $X_1$ and $X_2^f$. Even if the tap location of one or both inputs to the AND gate is less than $X_j$, we can simply select the numerically smallest tap location of register $j$ as $X_j$, since in terms of the circuit graph it does not make a difference whether $X_j$ is input to an XOR or an AND gate. However, here we have the AND taps strictly in between $X_j$ and $X_j^{op}$, and so the actual locations of the AND taps do not make a difference. Thus it is easy to see the base case for our recursive formula: $f_1(X_j) = 0$ and $g_1(X_j) = \min\{X_j, X_{j+1}^f\}$.

Inductive
Step: Now let us assume the inductive hypothesis, i.e. g l , f l are as defined in the Lemma statement for t = 1, 2, 3, . . ., l − 1.Consider the equation for t 1 (r) at r = r 0 = f l−1 (X 3 ) + X op 1 and r = r 0 + 1.For conciseness, denote by the symbol α the value of f l−1 (X 3 ) and ∆ = X op 1 − X 1 .It holds (note a 1 , a 2 are the AND gate taps with Note that by the inductive hypothesis, t 3 (α) corresponds to a depth l − 2 tree, whereas t 3 (α + 1) corresponds to a depth l − 1 tree.All other t 3 terms in the above expressions are depth l − 1 trees or greater by the inductive hypothesis.If t 1 (α + (X op 1 − X f 2 )) also corresponds to a depth l − 1 tree, it is easy to see that r 0 + 1 is the first value of r for which t 1 (r) produces a perfect depth l tree.However that is always not the case.It may so happen that t 1 (α + (X op 1 − X f 2 )) still corresponds to a depth l − 2 tree for certain specific instances of the generic Trivium circuit.In such cases the value of r has to be equal to u = f l−1 (X 1 ) + X f 2 + 1 to ensure that the t 1 term in the expression for t 1 (r) also produces a depth l−1 tree by the inductive hypothesis.This is true since t 1 (u−X f 2 ) = t 1 (f l−1 (X 1 )+1) which corresponds to a depth l − 1 tree by the inductive hypothesis.
For t 1 (r) to definitely correspond to a depth l perfect tree both the depth conditions on the above t 3 and t 1 nodes must be satisfied.This leads us to the easy conclusion that the first value of r for which t 1 (r) is a perfect depth r tree is the maximum of f l−1 (X 3 )+X op 1 +1 and f l−1 (X 1 ) + X f 2 + 1. Generalizing over all configurations of n-stage registers, we have Now to prove the recursive expression for g l , consider again t 1 (r) for the generic Trivium circuit for r = r 1 = g l−1 (X 3 ) + X 1 and r = r 1 + 1.For conciseness, denote by the symbol β the value of g l−1 (X 3 ).It holds By the inductive hypothesis t 3 (β + 1), no longer corresponds to a perfect tree of depth l − 1.Assuming that t ) do correspond to perfect depth l − 1 trees, r 1 = g l−1 (X 3 ) + X 1 is of course the largest value of r that produces depth l trees.
There are two assumptions made in the above proof which may not hold for all configurations of the generic Trivium circuit. The first fails if $t_3(\beta - \Delta)$ does not correspond to a perfect depth $l-1$ tree (note that if $t_3(\beta - \Delta)$ is not a perfect depth $l-1$ tree, neither of the AND terms will correspond to depth $l-1$ trees, since their index values are larger than $\beta - \Delta$). The above happens when

The above condition essentially means that the number of perfect depth $l-1$ trees for $t_3(r)$ is less than or equal to $X^{op}_1 - X_1$. This implies that the terms $t_3(r - X_1)$ and $t_3(r - X^{op}_1)$ can never both be of depth $l-1$, which in turn implies that the expression for $t_1(r)$ can never produce a depth $l$ tree. In this case we can simply set $g_l(X_1)$ to some value less than or equal to $f_l(X_1)$ to indicate this impossibility. We can simply pick one of the expressions for $f_l(X_1)$, i.e., $f_{l-1}(X_3) + X^{op}_1$, for this purpose. Combining the two assumptions we can write the new expression as f

The second assumption was that the $t_1$ term also produces a perfect depth $l-1$ tree. For a generic Trivium circuit, this assumption may be false. The term t ). Since we need both depth conditions to be satisfied, we take the minimum of the above two. Generalizing for all n-stage Trivium circuits we have

This completes the proof.

C Trivium-LE(S) Security Analysis
Trivium-LE(S) is another low-energy variant of Trivium and consumes around 25% less energy than Trivium. The tap locations for the AND gates are $X_i + 1$ and $X_i + 2$.
Security Analysis. We inherit the security of Trivium-LE(F) against the TMD tradeoff attack because that attack does not exploit the tap locations. Here, we discuss the security against the correlation attack, Maximov and Biryukov's guess-and-determine attack, and the cube attack.
Linear Distinguishing Attack. For this parameter set, the maximum linear correlation is $2^{-48}$, and about $2^{96}$ bits of data are required to distinguish the keystream from an ideal one. Compared to $2^{144}$ in Trivium or Trivium-LE(F), the security margin is very narrow. However, it is still enough to achieve the claimed security of 80 bits, which is the same as for Trivium.
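The data figures above are consistent with the usual rule of thumb that a linear distinguisher with correlation $\epsilon$ needs on the order of $\epsilon^{-2}$ keystream bits. The following minimal sketch checks the two numbers in the text; the helper name is hypothetical, and the Trivium correlation of $2^{-72}$ is inferred here from its stated data complexity of $2^{144}$ rather than quoted from this section.

```python
def log2_data_complexity(log2_correlation: int) -> int:
    """Return log2 of the required keystream, assuming data ~ correlation**-2."""
    return -2 * log2_correlation


# Trivium-LE(S): correlation 2^-48 gives about 2^96 bits of data.
print(log2_data_complexity(-48))
# Trivium / Trivium-LE(F): correlation 2^-72 (inferred) gives about 2^144 bits.
print(log2_data_complexity(-72))
```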
Maximov/Biryukov's Guess-and-Determine Attack. Similarly to the case for Trivium-LE(F), we evaluated the number of collectable linear equations after guessing some outputs of AND gates. In the scenario T1, the time complexity is $c \cdot 2^{77.0503}$, where

separation. Specifically, we use the following map

$\mathrm{encoder} : (ad, m) \mapsto (ad \,\|\, \mathrm{len}(ad) \,\|\, m \,\|\, \mathrm{len}(m))$,

where len(ad) and len(m) are 7-byte values representing the byte length of ad and m, respectively. This implies that the associated data and the message can each be at most $2^{56}$ bytes long. Note that the AEAD only accepts byte strings and never accepts bit strings whose length is not a multiple of 8. When the encoder is injective, we never have two different pairs (ad, m) that are encoded to the same message. To prove injectivity, we confirm that the decoder corresponding to the encoder is uniquely determined. Assume that a byte string x is the output of the encoder. Moreover, len(x) denotes the byte length of x, and x[i] denotes the i-th byte of x.
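The length-suffixed encoding and its decoder, whose first step is to read the trailing 7-byte length of m, can be sketched as follows. This is only an illustration: the function names are hypothetical, and big-endian length fields are an assumption, since the text does not fix a byte order.

```python
from typing import Tuple

LEN_BYTES = 7  # each length field is 7 bytes, so lengths must be below 2**56


def encode(ad: bytes, m: bytes) -> bytes:
    """Map (ad, m) to ad || len(ad) || m || len(m)."""
    for part in (ad, m):
        if len(part) >= 1 << (8 * LEN_BYTES):
            raise ValueError("input too long for a 7-byte length field")
    # Assumption: lengths are stored big-endian.
    return (ad + len(ad).to_bytes(LEN_BYTES, "big")
            + m + len(m).to_bytes(LEN_BYTES, "big"))


def decode(x: bytes) -> Tuple[bytes, bytes]:
    """Invert encode by reading the length fields starting from the end of x."""
    if len(x) < 2 * LEN_BYTES:
        raise ValueError("encoded string too short")
    # Last 7 bytes: byte length of m.
    m_len = int.from_bytes(x[-LEN_BYTES:], "big")
    m_start = len(x) - LEN_BYTES - m_len
    if m_start < LEN_BYTES:
        raise ValueError("inconsistent message length")
    m = x[m_start:len(x) - LEN_BYTES]
    # The 7 bytes just before m: byte length of ad.
    ad_len = int.from_bytes(x[m_start - LEN_BYTES:m_start], "big")
    if ad_len != m_start - LEN_BYTES:
        raise ValueError("inconsistent associated-data length")
    return x[:ad_len], m
```

Because every step of `decode` is forced by the bytes of x, two distinct pairs (ad, m) cannot share an encoding, which is the injectivity argument made in the text.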
We first extract the last 7-byte of x and get the byte length of m, i.

To begin this section, we try to enumerate the number of perfect trees for a generic Triad-SC architecture.

Definition 6. Denote by Triad(X, 3) a generic Triad-SC configuration composed of 3 chained registers $(X_1, \ldots, X_n)$ such that $X_j$ is the register's leftmost forward tap, $X^f_j$ is a feedback tap, $X^{op}_j$ is the output tap and $Y$, $Z$ are two additional non-linear taps that feed into the output of register $X_1$.

E Triad-LE
Corollary 1. Given an arbitrary, generic Triad(X, 3) circuit composed of 3 registers, the total number of perfect unrolled strand trees $S(T)$ in the fully unrolled setting is given by

where $y^+ = \max\{y, 0\}$ and $f_l(X_j)$, $g_l(X_j)$ are recursively defined functions for $1 \le l \le 3$ of the form

min f l−1 (X j−1 ) + X 3 j + (g l−1 (X j−1 ) − f l−1 (X j−1 )) − (X op j − X j ) + , g l−1 (X j ) + X f j+1 , otherwise.
such that $f_1(X_j) = 0$ and $g_1(X_j) = \min\{X_j, X^f_{j+1}\}$. The number of perfect unrolled strand trees of depth $t$ for the $j$-th strand is $S(T_j)|_{\mathrm{depth}=t} = (g_t(X_j) - f_t(X_j))^+$.
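As a small illustration of the base case and of the counting rule $(g_t - f_t)^+$, the sketch below evaluates both for a single strand. The function names are hypothetical and the tap values in the usage line are illustrative only; the recursion for deeper levels is not reproduced here.

```python
from typing import Tuple


def positive_part(y: int) -> int:
    """y^+ = max{y, 0}."""
    return max(y, 0)


def depth1_bounds(x_j: int, xf_next: int) -> Tuple[int, int]:
    """Base case: f_1(X_j) = 0 and g_1(X_j) = min{X_j, X^f_{j+1}}."""
    return 0, min(x_j, xf_next)


def perfect_tree_count(f_t: int, g_t: int) -> int:
    """Number of perfect depth-t trees for one strand: (g_t - f_t)^+."""
    return positive_part(g_t - f_t)


# Illustrative values only: forward tap 66, next register's feedback tap 78.
f1, g1 = depth1_bounds(66, 78)
print(perfect_tree_count(f1, g1))
```

The positive part matters whenever $g_t \le f_t$: such a strand contributes no perfect trees of that depth rather than a negative count.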
Proof. The addition of an AND gate that feeds into the output of $X_1$ can be regarded as two feedback taps. As a result, the same reasoning as in Proof B applies. This means that two additional feedback terms are required for both $f_l(X_j)$ and $g_l(X_j)$, i.e., $f_{l-1}(X_1) + Y$ and $g_{l-1}(X_1) + Y$ for tap $Y$, and similarly $f_{l-1}(X_2) + Z$ and $g_{l-1}(X_2) + Z$ for tap $Z$.

E.1 Searching for More Energy-Efficient Parameters
We repeated the experiment that we did for generating energy-efficient candidates with the Trivium architecture. Again, since the search space for tap locations is too large, we focused on the following strategy.
A: We keep the multiple-of-2 property of the original Triad-SC (all linear taps are multiples of 2), for the same reason as in Trivium.

B: In Triad-SC, $Y$ and $Z$ were chosen to be equal to $X^{op}_2 - 3$ and $X^{op}_3 - 3$. Given that the search space is large, we decided to stick to this choice.

C: For the taps of the other AND gates, one location was fixed at $X^{op}_j - 1$ (as in the original Triad-SC) and the other location was searched for exhaustively.

D: Under the condition where the output of each AND gate is approximated to 0, we consider the maximum correlation in a linear combination of keystream bits and the number of active AND gates, #ACT. In Triad-SC, #ACT = 96 and the maximum correlation is $2^{-72}$. We inherited this condition.
Similarly to the original attack, we guess outputs of AND gates and collect keystream bits that are linearly represented by the internal state. Unlike Trivium, Triad-LE has an additional AND gate, and we need to guess its outputs as well. Moreover, the output of the additional AND gate is always involved in the keystream, which implies that we cannot collect any linear equation for free. As a result, by guessing 85, 85, 77, and 92 outputs of AND gates, we can recover the 256-bit state, but the required keystream is $2^{159.04}$ and the time complexity is $c \cdot 2^{163.04}$.
Cube Attack. We also investigated the increase in algebraic degree by using the bit-based division property. Note that it is not enough to achieve 112-bit security even if the degree of $f_k(iv)$ is full, i.e., 96. When the corresponding superpoly has low degree, 1-bit secret key information can be efficiently recovered. Therefore, unlike for Trivium-LE, we focus on $f(k, iv)$ and evaluate the increase in its algebraic degree. Figure 18 shows the upper bound of the algebraic degree, which reaches degree 224 in 680 rounds. Considering that the full number of rounds is 1024, Triad-LE has plenty of security margin.

Figure 1: Energy consumption (pJ) chart from Banik et al. [BMA + 18] using the STM 90 nm cell library process at a clock frequency of 10 MHz. Added to the plot are figures for the energy consumption of Subterranean-Deck and of the designs Trivium-LE(F), Trivium-LE(S) and Triad-LE that we propose in this paper, for the same standard cell library and operating frequency. Figures are reported for short messages (1 to 10 blocks of 64 bits) and longer messages (1 to 100 blocks). Legend entries highlighted in blue and green have a security level of 80 and 128 bits respectively, whereas Triad-LE offers 112-bit security.

(a) Trivium-like constructions [De 06] with register output tap locations chosen randomly.
(b) Trivium-like constructions proposed in the literature that have some structural differences in comparison to the original Trivium design. These include the modified Trivium proposed in [MB07], TriviA [CCHN18], Kreyvium [CCF + 18] and Triad-SC [BIM + 19].
(c) Algebraically more complex ciphers with large state update functions such as Grain-128 [HJMM06].
(d) Subterranean-like constructions [DMMR20], which do not exhibit rotating state registers.

Figure 2: Trivium energy measurements for the three synthesis settings for different frequencies and libraries. Note that the energy graphs are noisier for the regular/ultra modes, which indicates that the synthesizer chooses different mapping strategies for varying r.

Figure 3: Trivium area measurements (Gate Equivalent) for the three synthesis settings for all unrolling factors r and cell libraries. Note that the number of clock cycles required to encrypt x bits of data is given by $\frac{1152}{r} + \frac{x}{r}$; hence the encryption of 1.28 MBit of data for r = 288 has a latency of 4449 cycles.
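The latency figure in the caption can be checked with a few lines of arithmetic. The sketch below assumes that both quotients are rounded up to whole clock cycles and that 1.28 MBit means 1,280,000 bits; under those assumptions it reproduces the stated 4449 cycles. The function name is hypothetical.

```python
def encryption_cycles(x_bits: int, r: int, init_rounds: int = 1152) -> int:
    """Clock cycles to initialize and encrypt x_bits with an r-times unrolled circuit.

    Assumption: partial cycles are rounded up (ceiling division).
    """
    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    return ceil_div(init_rounds, r) + ceil_div(x_bits, r)


# 1152 / 288 = 4 initialization cycles, ceil(1_280_000 / 288) = 4445 encryption cycles.
print(encryption_cycles(1_280_000, 288))  # 4449
```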

Figure 4: Power measurements for all the perfect unrolled strand trees for two cell library processes. The red data points indicate unrolled strand equations which correspond to perfect trees. The dashed blue line signifies the transition boundaries between perfect and imperfect trees, i.e., low points represent perfect unrolled strand trees while high points correspond to imperfect trees.

Figure 8: Illustration of finite depth trees in Trivium circuit.

Figure 9: Power measurements of several Trivium(X, 3) circuits vs S(T ) for different libraries and frequencies.The red data points signify the power consumption of original Trivium.

Figure 10: Post-routing power measurements of several Trivium(X, 3) circuits as a function of S(T) using the TSMC 90 nm process. The red data points signify the power consumption of the original Trivium. Note that fewer data points are plotted, as the post-routing workflow is significantly more time-consuming than post-synthesis analysis.
on Algebraic Degree of $f_k(iv)$. on Algebraic Degree of $f(k, iv)$.

Figure 13: Power consumption figures as a function of the number of perfect trees for Trivium-like schemes.All data was obtained using the TSMC 90 nm process at a clock frequency of 10 MHz.Red data points mark the original schemes.

Figure 14: Power consumption measurements for all the unrolled strand trees for the 128-bit key size variation of Trivium proposed by Maximov and Biryukov [MB07], the state update function of the TriviA stream cipher [CCHN18], Kreyvium [CCF + 18] and Triad-SC [BIM + 19]. The measurements were obtained using the TSMC 90 nm process at a frequency of 10 MHz.

Figure 16: Power plot for 4-round unrolled Subterranean-Deck (TSMC 90 nm at 10 MHz). The points in red, black, blue and green represent the power consumed by the strands $t_i$ in the first, second, third, and fourth levels of the round function.

Figure 18: Increase in algebraic degree with respect to the number of initialization rounds.

Table 2: Subterranean-Deck energy measurements for r = 4 for four cell libraries.

mode instead of imposing a predefined set of logic gates. The results in Figure

Table 3: Measurements summary for all investigated stream ciphers.

the National Key R&D Research Program Grant 2017YFB0802504, the National Natural Science Foundation of China Grant 61572482 and the National Cryptography Development Fund Grant MMJJ20170107.

[WHT + 18] Qingju Wang, Yonglin Hao, Yosuke Todo, Chaoyun Li, Takanori Isobe, and Willi Meier. Improved division property based cube attacks exploiting algebraic properties of superpoly. In Hovav Shacham and Alexandra Boldyreva, editors, CRYPTO 2018, Part I, volume 10991 of LNCS, pages 275-305. Springer, Heidelberg, August 2018.

by 80 ) are loaded into the internal state $(x_1, \ldots, x_{288})$. Subsequently, the state update function is run for $4 \cdot 288 = 1152$ iterations. During encryption, the keystream bits $z_i$ are extracted from intermediate values of the state update equations. Both the initialization procedure and the keystream routine are shown in Algorithm 2.
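The initialization and keystream routine described above can be sketched bit by bit as follows. This is a minimal illustration based on the publicly specified Trivium tap positions, not a transcription of Algorithm 2: the function name is hypothetical, the ordering of key and IV bits within the state is an assumption, and the code has not been checked against official test vectors.

```python
from typing import List


def trivium_keystream(key: List[int], iv: List[int], n_bits: int) -> List[int]:
    """Return n_bits of keystream for an 80-bit key and IV given as lists of 0/1."""
    if len(key) != 80 or len(iv) != 80:
        raise ValueError("key and IV must each be 80 bits")
    # Registers of length 93, 84 and 111; s[i] holds x_{i+1}.
    # Assumption: key and IV bits are loaded in list order.
    s = key + [0] * 13 + iv + [0] * 4 + [0] * 108 + [1, 1, 1]
    out = []
    warmup = 4 * 288  # 1152 initialization iterations without output
    for i in range(warmup + n_bits):
        t1 = s[65] ^ s[92]
        t2 = s[161] ^ s[176]
        t3 = s[242] ^ s[287]
        if i >= warmup:
            # Keystream bit taken from intermediate values of the update.
            out.append(t1 ^ t2 ^ t3)
        t1 ^= (s[90] & s[91]) ^ s[170]
        t2 ^= (s[174] & s[175]) ^ s[263]
        t3 ^= (s[285] & s[286]) ^ s[68]
        # Shift each register by one position and insert the feedback bit.
        s = [t3] + s[0:92] + [t1] + s[93:176] + [t2] + s[177:287]
    return out
```

An r-times unrolled hardware implementation computes r of these iterations per clock cycle, which is the setting studied in the paper.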