a lengthy analysis of counter mode and ESP
We have been discussing how to make use of
AES in counter mode with ESP. Several folks have told me that
the discussion is not generally understandable, because of the many
subtle issues involved and because when the discussion moved to the
general IPsec list, from the design team list, we failed to do an
adequate job of establishing context. So, this very lengthy message is
an attempt to provide background and a more thorough discussion of the
various technical issues associated with this complex topic.
Hopefully, the WG as a whole will better understand the issues and be
better positioned to make choices as a result of this analysis.
Steve
----------
What is Counter Mode?
To begin, we need to understand what
counter mode (CM) is, at least relative to the process of encrypting
packets. CM is not new (I believe it dates from about the mid 80's),
although it is newer than the other modes commonly defined and used
for block ciphers (which go back to the late 70's and very early
80's). To the best of my knowledge, CM was first used in a security
protocol standard by the ATM Forum, in the early 90's, when they
adopted it for use with DES.
Note that CM can be used with any block cipher. It is generic, not
unique to AES, although the size of the "counter" is a function of
the block size and thus is somewhat a function of the specific block
cipher employed. In CM a sequence of values is generated based on an
algorithm (the choice of algorithm is one focus of our discussion),
and then encrypted with a key using the basic block (sometimes
referred to as electronic code book, or ECB) mode of the cipher. The
output from this encryption process usually is referred to as
the "key stream." The underlying block cipher is being used to
generate a pseudorandom key stream, which is XORed with the plaintext
to produce ciphertext. Decryption involves encrypting the same
sequence of inputs to produce the same key stream, which is then XORed
with the received ciphertext, to recover the transmitted
plaintext.
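To make the encrypt/decrypt symmetry concrete, here is a minimal Python sketch of the keystream-XOR structure just described. The block_encrypt stand-in is hypothetical (a hash truncated to one block, purely for illustration); a real implementation would use AES or another block cipher in its basic ECB mode.

```python
import hashlib

BLOCK = 16  # 128-bit blocks, as in AES

def block_encrypt(key: bytes, block: bytes) -> bytes:
    # Stand-in for a real block cipher's ECB encryption of one block
    # (hypothetical; a real implementation would use AES here).
    return hashlib.sha256(key + block).digest()[:BLOCK]

def ctr_keystream(key: bytes, nblocks: int) -> bytes:
    # Encrypt the input sequence 0, 1, 2, ... to produce keystream.
    return b"".join(
        block_encrypt(key, i.to_bytes(BLOCK, "big")) for i in range(nblocks)
    )

def ctr_xor(key: bytes, data: bytes) -> bytes:
    # Encryption and decryption are the same operation: XOR with keystream.
    ks = ctr_keystream(key, (len(data) + BLOCK - 1) // BLOCK)
    return bytes(d ^ k for d, k in zip(data, ks))
```

Note that the keystream depends only on the key and the input sequence, never on the data, which is what makes precomputation possible.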
By the way, CM is not the only block cipher mode that operates this
way. Output feedback mode (OFB) does the same thing, from a black box
perspective, i.e., it acts as a pseudorandom number generator (PRNG)
to produce key stream for XORing with data, to effect encryption and
decryption. In both modes, the keystream is completely independent of
the data being encrypted, an important feature that can significantly
improve performance in some contexts. The primary reason for
considering CM vs. OFB, from an engineering perspective, is the
additional performance enhancing opportunities provided by CM, as
described later.
An essential requirement for CM (and OFB) is that the key stream
generated using a specific key not be reused. Stripping off the key
stream is easy for a wiretapper (just XORing), if he knows that reuse
has occurred and has access to ciphertext that was encrypted under the
reused keystream. (It's especially easy if the attacker knows which
pairs of ciphertext outputs reflect the reuse.)
Remember that the keystream is just the result of encrypting a
sequence of inputs provided to CM, one block at a time. Block ciphers
are 1-to-1 and onto mappings. Thus, so long as the sequence of inputs
is unique, under a selected key, the sequence of outputs will be
unique as well. In the IPsec context, this translates into a
requirement to generate unique input value sequences for each SA,
since each SA uses a unique key. (As Scott pointed out, it is
important for the CM spec to note that the key management procedure
must ensure uniqueness for each SA, including the common case of the two
SAs that make up a full-duplex connection, since the adverse effects
are so severe under CM.)
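A tiny example of why keystream reuse is so damaging: XORing two ciphertexts that were encrypted under the same keystream cancels the keystream entirely, handing the wiretapper the XOR of the two plaintexts without any knowledge of the key. (The values below are illustrative only.)

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Two plaintexts mistakenly encrypted under the same keystream.
keystream = bytes(range(16))
p1 = b"TRANSFER $100 TO"
p2 = b"TRANSFER $999 TO"
c1 = xor(p1, keystream)
c2 = xor(p2, keystream)

# The wiretapper XORs the two ciphertexts: the keystream cancels,
# leaving the XOR of the two plaintexts -- no key required.
assert xor(c1, c2) == xor(p1, p2)
```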
Using CM with Packets (and cells)
For use in a packet-oriented environment like IP, each input to the
block cipher algorithm can be divided into two parts: a per-packet
part and an intra-packet part. (It may be convenient, though not
necessary, to think of the per-packet part as a set of high order bits
for the input, and the intra-packet part as the low order bits.) The
per-packet part of the input needs to be unique for each packet, much
the same way as we use an IV for CBC or CFB mode. The intra-packet
part of the input is used to generate keystream for the packet, a
block at a time, for as many blocks as are needed to encrypt the
packet. So long as the per-packet part of the input is unique for each
packet, the intra-packet input can repeat, from one packet to the next
(but not within a packet). This is OK because keystream uniqueness is
guaranteed so long as the whole input to the block cipher is unique;
even if two inputs differ by only one bit, that is enough to guarantee
that the outputs (keystream in this case) will be
distinct.
For example, the intra-packet part of the input could start at 0 and
count up by 1 for each block in a packet, until enough keystream is
generated to encrypt the packet. This means that the intra-packet part
of the input would repeat for each packet, but so long as the
inter-packet input was unique, the keystream generated for each packet
will be unique and thus not pose a security problem.
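A sketch of this input construction, with an assumed (hypothetical) 96/32 split between the per-packet and intra-packet parts of a 128-bit input:

```python
BLOCK_BITS = 128
PER_PACKET_BITS = 96   # hypothetical split: high 96 bits are per-packet
INTRA_BITS = 32        # low 32 bits count blocks within the packet

def cm_input(per_packet: int, block_index: int) -> int:
    # High-order bits identify the packet; low-order bits count blocks,
    # starting at 0 for each new packet.
    assert 0 <= block_index < 2 ** INTRA_BITS
    return (per_packet << INTRA_BITS) | block_index

# Inputs for two consecutive packets of 3 blocks each: the intra-packet
# part repeats (0, 1, 2), but every full 128-bit input is unique.
inputs = [cm_input(pp, b) for pp in (7, 8) for b in range(3)]
assert len(set(inputs)) == len(inputs)
```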
For a receiver to decrypt a received packet, it must generate the same
keystream as the one that the transmitter used to encrypt the packet.
Thus the receiver must know the key being used, and it must be able to
generate the same sequence of inputs for the block cipher. There are
many ways to make this happen, depending on the context.
When the ATM Forum selected CM for encrypting virtual circuits, it
adopted a design that assumes that, most of the time, all cells on a
VC are delivered in order, without loss. This means that if the
transmitter and receiver know the starting value for the input to CM,
and know the algorithm used to generate the sequence of inputs (and of
course, both use the same key), each can generate the same input
sequence to CM, and thus generate the same keystream, without
transmitting any additional data. Moreover, in the ATM context, each
cell is a fixed size (48 bytes plus header), so the amount of
keystream needed to encrypt each cell is constant. This makes it
especially easy for the transmitter and receiver to generate keystream
for each cell in a synchronous fashion. ATM cells have no place to
carry any data for synchronizing the receiver's crypto to the
transmitter, but if one assumes perfect, ordered delivery, this is not
a problem for CM.
One might assume from the ATM example that CM was chosen because it
requires no IV to be carried with each cell. But there are many other
ways to encrypt cells or packets without carrying an IV, so long as
the communication path generally delivers traffic in order and
without loss. For example, some X.25 encryption devices developed in
the 80's used CBC mode, and did not carry a per-packet IV. Each end
transmitted an IV to the other when a call was established and ran CBC
mode continuously on the data portion of each packet in the call. Thus
CM is not uniquely attractive because of an ability to operate without
transmission of explicit, per-packet crypto synch data. If packets can
arrive out of order, then generally there is a need to carry data for
per-packet crypto synch, or to have some other way for a receiver to
know what keystream to create (for a mode like CM).
At this point it is worth noting that the CM design adopted by the ATM
Forum is not primarily based on a counter, per se. Rather it uses a
linear feedback shift register (LFSR) to generate sequences of up to
2^21 inputs to the block cipher (enough for over 2^19 cells when using
DES or 3DES). It also uses a counter as part of the input, to be able
to "jump" forward in the input space, to facilitate
resynchronization. I view this latter feature as a sort of "fix"
for the fact that ATM VCs are not perfect after all. (Later we'll
discuss why input generation using other than simple counters may be
advantageous in some contexts.)
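For illustration only, a toy 4-bit Fibonacci LFSR (not the ATM Forum's actual parameters, which use a much wider register) shows how such a generator walks through every nonzero state in a non-sequential order before repeating:

```python
def lfsr_sequence(seed: int = 1):
    # Toy 4-bit Fibonacci LFSR for the primitive polynomial x^4 + x^3 + 1
    # (illustrative only; the ATM Forum design is much wider).
    state = seed
    while True:
        yield state
        feedback = ((state >> 3) ^ (state >> 2)) & 1  # taps at bits 3 and 2
        state = ((state << 1) | feedback) & 0xF

gen = lfsr_sequence()
states = [next(gen) for _ in range(15)]
# A maximal-length 4-bit LFSR visits all 15 nonzero states before repeating.
assert len(set(states)) == 15 and next(gen) == states[0]
```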
The ATM use of CM is illustrative in that it represents a design for a
specific network context. In the case of IP, we have a different
context and thus a different set of requirements that influence the
design. Specifically, in the IP context, we cannot assume ordered
delivery of packets, nor can we assume no packet loss will occur. This
implies that each packet must carry data to allow the receiver to
synchronize the CM input, to generate the correct keystream for
decrypting received packets.
If we assume that the input consists of two parts, a per-packet part
and an intra-packet part, it is obvious that we do not need to
transmit the intra-packet part with each packet, so long as there is
agreement on the algorithm used to generate the sequence of
intra-packet values. In principle, one could negotiate the
intra-packet sequence generation algorithm for each SA, and
performance advantages could accrue from adopting that approach. For
IPsec, I suggest that we not go all out for increased performance at
the cost of increased complexity. If one chose to allow negotiation of
per-SA, intra-packet sequence generation algorithms, the resulting
complexity would be imposed on all implementations, because of the
need to support a well-defined set of these algorithms at both
transmitter and receiver. Thus I suggest that we stick with the
simpler approach of defining exactly one method for generating the
intra-packet input sequences, to ensure interoperability with minimal
complexity.
Given the above assumption about a pre-defined, standard algorithm for
generating sequences for the intra-packet part of the CM input, we
need to carry only the per-packet part of the CM input in each packet,
much as we carry an IV with each packet in CBC mode. Note that, since
we must carry this value in each packet, the transmitter and receiver
need not agree on how the value is generated. All we require is that
the algorithm used by the transmitter generate a unique sequence of
values under a single key, and that this sequence supports the maximum
number of packets that we can send on an SA (2^32 or 2^64, depending
on whether the ESN option is employed). The receiver need not be aware
of the algorithm used by the transmitter; it is a local matter for the
transmitter to generate the per-packet values and to ensure that they
are unique.
If the transmitter makes an error and duplicates a per-packet CM input
value, the damage is done as soon as the packet with the duplicate
counter input is sent. Even if a receiver could detect that a
duplicate per-packet value had been used, e.g., because the receiver
knew what algorithm the transmitter used to generate these values, it
is not clear what the receiver would do. IPsec has adopted a design
philosophy of not having receivers respond with error messages to what
appear to be problems in received packets. (This is part of the design
philosophy to avoid creating DoS opportunities.) Also, there are
many ways in which such duplicates might escape detection by a
receiver. For example, a transmitter might send each packet twice,
once encrypted using the previous per-packet counter input, and once
with a new per-packet input. This would result in duplicate keystream
for every packet, a passive wiretapper's dream come true! Every
packet would arrive and be decrypted properly by the receiver.
Depending on how the per-packet counter input was managed, this might
go completely unnoticed by the IPsec receiver, or the receiver might
see the replicated packets as duplicated, and just discard them as
part of the usual anti-replay mechanism. A failure of this sort might
well go undetected for some time.
Note that, unlike the IV's used with CBC or CFB modes, there is not
a requirement that the per-packet value used for CM be pseudorandom.
An obvious candidate for such a value is a counter that increments by
1 for each newly transmitted packet. Since the preceding analysis
indicated that generation of the per-packet values can be a purely
local matter for the transmitter, there is not a requirement to
mandate a standard mechanism for generating these values, although the
standard could suggest some good ways of doing this. Also, if the
implementation undergoes security evaluation, as described later, that
evaluation will review the means used by the transmitter to ensure CM
input uniqueness, irrespective of anything we state in an IETF
standard.
CM and Performance
Performance considerations are the primary motivations for using a
counter mode with a block cipher such as AES. The primary reason for
the performance advantage of CM relative to other encryption modes
derives from the fact that it has no feedback loop, e.g., in contrast
to CFB, CBC or even OFB modes. In a mode that has a feedback loop, one
must complete encryption of a block of data before using the result as
an input to encrypting the next block of data. Even when a crypto mode
does not directly encrypt data, e.g., OFB mode, the presence of a
feedback loop in the mode requires that a block of data be processed
all the way through the algorithm before the next block can be
processed.
Modern block ciphers involve multiple "rounds" of computation. In
DES there were 16 rounds; in 3DES, 48 rounds; and for AES (used with
128-bit keys) there are 10 rounds. As one tries to maximize
performance in hardware or software, it is common to "unwind" the
loop implied by the round structure. In hardware, this creates an
opportunity to pipeline the encryption process. For example, in AES,
after a 16-byte block of data has been input to a chip for processing,
one round later another 16-byte block could be input, if there is no
dependence on the encryption of the previous input. Thus, even though
the first block is still being processed and will not be output until
9 more rounds have been completed, one can begin processing the next
input block, if the mode permits.
Thus an encryption mode that allows easy
use of pipelining for encryption (and decryption) permits hardware
implementations to speed up processing by a factor that reflects the
number of rounds used by the basic block cipher, e.g., by a factor of
10 for AES.
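A back-of-the-envelope model of that speedup, assuming AES-128's 10 rounds and one new block entering the pipeline per round:

```python
def rounds_to_encrypt(blocks: int, rounds: int, pipelined: bool) -> int:
    # Without pipelining, each block occupies the core for a full pass
    # through all rounds; with pipelining, a new block enters every round,
    # so after the initial latency one block emerges per round.
    if pipelined:
        return rounds + (blocks - 1)
    return rounds * blocks

# 100 blocks of keystream with a 10-round cipher (AES with 128-bit keys):
assert rounds_to_encrypt(100, 10, pipelined=False) == 1000
assert rounds_to_encrypt(100, 10, pipelined=True) == 109
# For long runs the speedup approaches the round count.
```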
A feedback mode like CBC or CFB does not permit one to easily take
advantage of pipelining, at least not for encrypting a single packet.
In a straightforward implementation, one would have to wait until all
the rounds were completed for encrypting the first block of input,
before starting to encrypt the second block of data from the same
packet, because the encrypted output from the first block is combined
with the next input, to counter some forms of attacks.
There are ways to try to overcome this limitation, e.g., by viewing a
packet not as a single series of blocks of data, but rather viewing it
as several series of blocks, interleaved in a fashion to account for
the number of rounds in the algorithm.
Techniques of this sort make use of multiple algorithm "cores,"
each a complete implementation of the block cipher, and thus makes use
of parallelism as well as pipelining. Modern, high speed crypto chips
often contain several algorithm cores, so this is not an unrealistic
notion. For example, in the ATM case, since each cell contains a
48-byte payload, one could view the payload as composed of two
sequences, each consisting of three 8-byte blocks, interleaved. One could
use two algorithm implementations in parallel, to process each
sequence, thus doubling the effective throughput. Or, the same payload
could be viewed as 4 interleaved sequences of 2 blocks each, for even
greater parallelism and improved throughput. However, this combination
of pipelining and parallelism can get very complicated in a hurry.
More importantly from a protocol standards and interoperability
perspective, a receiver needs to know how the transmitter chose to
"slice" the packet, in order to perform the decryption correctly.
That is because the way the interleaving is performed must be
duplicated by the receiver, in order to effect decryption.
Interleaving is not a transparent way to increase performance; a
receiver must know exactly what interleaving strategy was used by
every transmitter from which it receives packets. This makes it less
desirable in an open system environment, where we strive to remove
barriers to interoperability. As a result complicated ways of trying
to speed up encryption through interleaving of data while using
feedback modes are not usually employed in standard protocols. (IPsec
is striving to reduce the complexity mandated for all implementations,
as evidenced in the new version of IKE, and the complexity of
interleaving would be counter to this effort.) Thus it is common for a
chip with multiple cores to be used to encrypt multiple packets in
parallel, rather than trying to encrypt one packet in parallel.
Note that if multiple packets from the same SA are encrypted in
parallel, e.g., using multiple algorithm cores within a crypto chip,
this internal parallelism can cause system-level problems.
Consider the case where several short packets arrive for transmission
on an SA after encryption has begun on a large packet. Encrypting
multiple packets for the same SA in parallel could cause the short
packets to be encrypted, transmitted, received and decrypted before
the large packet was received at the destination IPsec implementation.
IPsec can tolerate this sort of self-imposed reordering for small
numbers of packets, nominally 64 per SA under the current
recommendations. One can imagine cases where an IPsec receiver might
start discarding large packets that arrive too late on an SA, because
of using packet parallelism on a per-SA basis to improve performance.
Also, TCP performance suffers when packets arrive out of order, so
there is the potential that an attempt to improve IPsec performance,
through internal parallelism on a per-SA basis, could ultimately
degrade performance! Thus developers must exercise care in designing
high speed IPsec devices, and they must understand how crypto chips
work to achieve high speeds.
In CM, taking advantage of pipelining and parallelism for an
individual packet is much easier because of the lack of feedback. One
can load the per-packet value into a core, and begin generating the
keystream for a series of blocks. After the first round has completed,
a new input can be loaded into the same core, to begin generating
another part of the keystream. In the simplest case, one could
pipeline 10 blocks worth of keystream generation for a single packet,
using a single core, taking advantage of the 10-round depth of AES.
The core could generate the sequence of intra-packet inputs, using
whatever algorithm is defined for that purpose.
If we are to make maximum use of a core, at some point it will be
necessary to load a new input into a core for a different packet,
while the core is still generating keystream for another packet, maybe
even for a packet for another SA. Otherwise the core will be idle
while we wait for up to 9 rounds (for AES) for the keystream for the
previous packet to exit. This suggests that a crypto chip has to
maintain some state for each core it contains, to remember which
keystream outputs go with which packets on which SAs, in order to be
able to get the most out of each core. So long as that state is
completely local to the chip, this is feasible.
Since crypto chips often contain multiple cores, it also is possible
to improve throughput for a single packet by using parallelism within
the chip. One might begin generating keystream in two or more cores,
in parallel, for the same packet. This is not too hard for CM, unlike
CBC or CFB mode, because we know what inputs to load into each core to
generate keystream for different blocks within a packet. Also, the way
we choose to do this at a transmitter will be invisible to a receiver,
so long as we follow the same algorithm for generating the
intra-packet input sequences. Thus transmitters and receivers can
employ different numbers of cores, even different numbers of chips, to
generate keystream in a deterministic fashion and still have the
process work properly, with no additional coordination. This is an
important feature: any approach we adopt for using CM in ESP should
allow a transmitter maximum flexibility in optimizing performance,
while ensuring that receivers need not be aware of the optimizations
employed by the transmitter.
For example, if a packet to be encrypted is 64 bytes long (more
properly, 49-64 bytes in length), one could assign two crypto cores to
the task of generating two blocks of keystream each, or assign four
cores the task of generating one block each. In the former case the
whole packet can be encrypted in the time it takes for two rounds of
AES, in the latter case, in the time for one round of AES. All that is
required to make this work is to load the same per-packet value into
each core, and then to direct each core to use the appropriate offset
into the packet for generating the right portion of the keystream.
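A sketch of that transparency property: two "cores" (modeled here as function calls, with a hypothetical hash-based stand-in for the cipher) generate disjoint block ranges of the same packet's keystream, and the concatenation matches what a single core would produce.

```python
import hashlib

BLOCK = 16

def block_encrypt(key: bytes, block: bytes) -> bytes:
    # Stand-in for one ECB encryption (hypothetical; a real core runs AES).
    return hashlib.sha256(key + block).digest()[:BLOCK]

def keystream_range(key: bytes, per_packet: int, first: int, count: int) -> bytes:
    # One "core" generates keystream blocks [first, first+count) for the
    # packet identified by per_packet (high bits) with a 32-bit block index.
    return b"".join(
        block_encrypt(key, ((per_packet << 32) | i).to_bytes(BLOCK, "big"))
        for i in range(first, first + count)
    )

key = b"\x01" * 16
# Two cores, two blocks each, versus one core generating all four blocks:
parallel = keystream_range(key, 42, 0, 2) + keystream_range(key, 42, 2, 2)
serial = keystream_range(key, 42, 0, 4)
assert parallel == serial  # the split is invisible to the receiver
```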
This use of both pipelining and parallelism makes CM very attractive
as we try to create IPsec implementations that can operate at higher
and higher speeds. At the same time that we try to preserve maximum
flexibility for hardware implementations re performance, we don't
want to adopt a scheme that will impose burdens on software
implementations that may be peers. I believe this should be an
important principle in our decisions.
Finally, note that this sort of pipelining does not reduce the delay
associated with multi-round encryption. It will still take 10 rounds
to generate the keystream for a packet using AES with 128-bit keys.
Parallelism helps with delay, and the ability to pre-compute keystream
also can reduce delay. However, pre-computation of keystream in IPsec
may be feasible only in some contexts, e.g., when the number of SAs is
small and thus the burden of storing pre-computed keystream is not too
great.
CM and Overhead
With this background, we can examine the issue that has generated
considerable debate recently, namely how to manage the per-packet
inputs for CM. There is agreement that the intra-packet input should
just be a simple counter. To accommodate the largest possible
jumbogram that IPv6 can represent, a 28-bit value will suffice, and we
can represent it in a 32-bit field, as that is convenient for both
hardware and software implementations.
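The arithmetic behind the 28-bit figure, for the record:

```python
BLOCK = 16  # AES block size in bytes

# The largest IPv6 jumbogram payload is 2**32 - 1 bytes; count the
# 16-byte blocks of keystream needed to cover it (rounding up).
max_blocks = (2 ** 32 - 1 + BLOCK - 1) // BLOCK
assert max_blocks == 2 ** 28   # block indices fit in a 28-bit counter,
assert max_blocks < 2 ** 32    # carried comfortably in a 32-bit field
```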
The disagreement has centered around the form that the per-packet
input to CM should take. As noted earlier, this value need not be
pseudorandom, unlike the recommendation for CBC and CFB modes. Some
concerns have been expressed about using a simple counter here,
because this means that subsequent inputs to CM will have a small
Hamming distance. Conceivably, this might facilitate some forms of
differential cryptanalysis, but AES is generally believed to be
sufficiently strong so that this concern has generally been dismissed.
So, there is agreement that a per-packet counter is sufficient.
If one adopts a very conservative model of encryption security, then
at most 2^64 blocks should be encrypted under a single key, which says
that no more than 64 bits are needed to identify each packet in an SA.
This is exactly the same size as the ESP extended sequence number, so
that value is an obvious candidate for use as the per-packet CM input.
It has the advantage that it is already carried in the ESP header for
each packet, and thus its use does not add any overhead to the ESP
packet structure for crypto synch purposes. The extent of the savings
is subject to interpretation.
CM has the property that one need not pad a packet to the block size
in order to encrypt/decrypt it. Rather, one merely XORs as many bits
as needed with the last block of a packet to encrypt/decrypt it. In
the IPsec context we require that the ciphertext portion of a packet
end on a 4-byte boundary, so this becomes the determining factor for
padding in the context of CM encryption. So, already CM has an
advantage over CBC which requires padding to a block boundary. Today,
using DES or 3DES, this CBC padding adds 4 bytes to every packet, on
average. CBC mode for AES would add 8 bytes of overhead. If we use the
ESP sequence number as the per-packet input to CM, we can gain another
8-byte savings relative to DES/3DES CBC.
There are several ways of looking at the overhead figures. Today we
tolerate an average of 12 bytes of per-packet crypto overhead, using
DES or 3DES CBC, an 8-byte IV plus the average 4-byte padding
overhead. Moving to CM will shave this overhead by an average of
4 bytes (by removing the padding), a 1/3 savings, even if we carry an
explicit 8-byte per-packet CM input. This analysis is accurate, but
simplistic, as it looks only at the overhead due to the crypto mode.
It ignores the overhead imposed by ESP.
The ESP header is 8 bytes long (4-byte SPI and 4-byte explicit
sequence number). Because of the 4-byte alignment requirement for the
end of the ciphertext, and because the last two bytes of the
ciphertext are part of the ESP trailer, an average of 4 bytes of
overhead is added there as well. Then the ICV is attached, in the
common (and recommended) case where integrity is employed. (Because CM
provides NO protection of any form with regard to integrity, an ICV is
critical when this mode is used.) The default ICV algorithms add an
additional 12 bytes to the end of the packet. So, in aggregate, ESP
typically adds 24 bytes to a packet, independent of the use of an
explicit IV. In that context, an explicit 8-byte IV represents 1/4 of
the total overhead added by ESP (24 bytes of pure ESP overhead, plus
an 8-byte IV = 32 bytes.)
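The overhead accounting above, restated as a few lines of arithmetic (using the average figures from the text):

```python
ESP_HEADER = 8        # 4-byte SPI + 4-byte sequence number
AVG_PAD_TRAILER = 4   # average padding plus pad-length/next-header trailer
DEFAULT_ICV = 12      # default truncated-HMAC ICV

base = ESP_HEADER + AVG_PAD_TRAILER + DEFAULT_ICV
assert base == 24          # ESP overhead without an explicit IV
with_iv = base + 8
assert with_iv == 32       # with an explicit 8-byte IV
assert 8 / with_iv == 0.25 # the IV is 1/4 of the total ESP overhead
```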
If there is no compression of an (outer) IPv4 header, and if there are
no options, which is typical, that header is 20 bytes long. A TCP
header, inside the ESP payload, is 20 bytes (again, assuming no
options). A UDP header is 8 bytes. (IPsec does not accommodate
compression of transport layer headers; they are used for SA selection
at the transmitter and for SA checking at the receiver, so any
compression of these headers would become an integral part of security
critical IPsec processing, not a generally desirable outcome.) The
protocol overhead for a packet is 20+20 (or 20+8 for UDP) = 40 (or 28
for UDP) bytes, independent of any application data and prior to
applying ESP. Other than VoIP, for which we have heard varying
estimates of application data size, the typical minimum size packet is
a TCP ACK, which is 40 bytes. The ESP overhead, at 24 bytes, makes for
a minimum encrypted packet size of 64 bytes (sans IV). In that common
context, the 8 added bytes for an explicit IV represent a 12.5%
increase in packet size. Typical packet sizes tend to be much larger,
with an attendant decrease in the percentage overhead attributable to
ESP and to an explicit IV.
In practice it seems that most IPsec traffic is carried in tunnel
mode, which adds another 20-byte inner IP header, which would bring
the IV overhead down to about 9%. (The use of tunnel mode is required
for the common case of remote user access, and for inter-site VPNs. It
also seems to be used as a default in many implementations, where
transport mode would suffice.) I leave it to others to provide
analogous estimates for the VoIP case, with appropriate assumptions
about what compression is applied where, and under what circumstances,
and whether this requires use of any compression algorithms that must
be licensed. But, it seems clear that in a very large number of common
IPsec uses, the overhead associated with carriage of an explicit IV is
fairly small.
None of this analysis is intended to suggest that we should be
cavalier about the overhead associated with IPsec. When I developed
the extended sequence number (ESN) option for AH and ESP, I chose to
not carry an extra 4 bytes of sequence number for several reasons.
There is an obvious desire to maintain backward compatibility for the
on-the-wire format, and to minimize overhead. Perhaps the most
important reason for making the added 4 bytes for the ESN implicit
rather than explicit is to preserve the 8-byte alignment for the
beginning of the payload, consistent with IPv6 requirements. But the
overhead imposed by these 4 bytes would not, by itself, justify the
added complexity that results from making them implicit. In the same
vein, I suggest that we not introduce undue complexity to IPsec in an
effort to save 8 more bytes per packet, since it represents a fairly
small overhead in the vast majority of circumstances.
CM & System Security
When we consider the use of an explicit vs. an implicit IV for CM, we
also need to examine system security issues. Some of these issues are
rather subtle and thus require a detailed explanation, much like the
description of CM and how it can be used to achieve higher
performance.
Management of the ESP sequence number is well defined by RFCs 2402 and
2406, and management of the ESN is defined in 2402bis/2406bis. A separate
counter must be maintained for each SA, and recall that an SA is a
unidirectional connection. The sequence number begins at 0; the first
transmitted packet carries a value of 1, and the value is not allowed
to cycle for an SA. Thus a new SA must be created prior to
transmitting a packet with the sequence number value of 2^32 (or with
the implied value of 2^64 in the case of the ESN option.)
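As a sketch of that rule, a per-SA transmit counter that refuses to cycle might look like this (field widths follow ESP; the class name and interface are invented for illustration):

```python
class EspSequence:
    """Per-SA transmit sequence number that refuses to cycle (a sketch;
    widths follow ESP: 2**32, or 2**64 when the ESN option is in use)."""

    def __init__(self, esn: bool = False):
        self.limit = 2 ** 64 if esn else 2 ** 32
        self.value = 0  # the first transmitted packet carries 1

    def next(self) -> int:
        if self.value + 1 >= self.limit:
            # The SA is exhausted; a new SA (and new key) must be created
            # before any further packets are transmitted.
            raise OverflowError("sequence number would cycle; rekey the SA")
        self.value += 1
        return self.value

seq = EspSequence()
assert seq.next() == 1 and seq.next() == 2
```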
The purpose of the sequence number is to allow a receiver to detect
and reject replayed packets, e.g., as a countermeasure to DoS attacks.
As currently defined and used, e.g., with CBC mode, it has no
implications for the security of the cryptography employed with IPsec.
Thus a failure by a transmitter to properly manage the sequence number
will not result in a violation of the confidentiality, integrity, or
authentication guarantees associated with use of ESP. In this sense,
the anti-replay feature of ESP is a secondary security function.
As a result of this well-defined and limited security use for the
sequence number, its management is outside of the scope of the FIPS
140-1/2 crypto module evaluation criteria. These criteria focus on
security functionality and assurance directly related to the
management of crypto keys, on the correct implementation and operation
of crypto algorithms, and on related aspects of crypto module
security. These evaluation criteria do NOT address security issues
associated with a protocol like IPsec, even though a number of IPsec
implementations make use of crypto modules that have been evaluated
under FIPS 140-1/2. It is appropriate that these IPsec products cite
FIPS 140 evaluation ratings in product brochures even though this
evaluation does not address the other security features of IPsec. (The
evaluation criteria apply to both hardware and software modules and
thus is widely applicable.) Keep this in mind as we discuss
implementation options for management of ESP sequence numbers.
Some crypto chips that have been designed expressly for IPsec manage
the ESP sequence number on chip. This is a good security design
approach which makes it unlikely that the sequence number will be
mismanaged, e.g., that the same sequence number will be assigned to
multiple packets. However, it is not always possible to manage the
sequence number for an SA on a chip. Vendors developing IPsec products
have sometimes found it necessary to use multiple chips, perhaps
spread over more than one (printed circuit) board to provide crypto
support for traffic from one SA. An obvious context in which this has
arisen is in PPVPN products, where many subscriber data streams come
together in a product and the goal is to multiplex these data streams
over a pool of crypto chips. The motivations here are the same ones
that cause ISPs to aggregate traffic at a POP using a hierarchy of
routers and switches. In such contexts it is uneconomical to devote a
crypto chip/module to each subscriber.
Sharing of a pool of crypto resources allows traffic for any SA to be
assigned to any available crypto chip. This requires that the keys for
each SA be available to each crypto chip. Because keys are constant
for the life of an SA, this is easily accomplished by replicating the
key material on each chip. Management of sequence numbers poses a
problem in this context if one tries to manage them on the crypto
chips: this per-SA data changes every time a packet is transmitted on
an SA, and it is generally impractical to update the data and make it
available to all the crypto chips in a timely fashion.
Although the PPVPN context is the one that first brought this issue to
our attention, it is anticipated that there may be other contexts in
which traffic for an SA may be directed to different chips at
different times. It is a generic design strategy for scaling a system
to accommodate higher traffic volumes without dedicating crypto chips
to each possible traffic source.
When this issue was raised on the IPsec mailing list over a year ago,
the advice provided by the WG to vendors who encountered this problem
was to move management of the sequence numbers out of the crypto
chips. For each SA there is a point in the system where the traffic is
mapped to the SA and at that point one can assign the sequence number
to what will become an ESP packet, without any synchronization
problems. This approach does reduce the assurance with which sequence
numbers are managed, relative to on-chip management, but because the
security function offered by sequence numbers is quite limited, this
is not viewed as a problem.
If we make use of the ESP sequence number as the per-packet input to
CM, the situation changes. In this case, the sequence number is a
critical security parameter for the encryption algorithm, as noted
earlier. This is because encrypting two packets under the same key
with the same per-packet CM input results in duplicate keystream and a
serious compromise of confidentiality. Moreover, because the sequence
number is clearly visible in the packet, a wiretapper would know
precisely when this occurred, if it happened. If the ESP sequence
number is used as the per-packet input in CM, then management of the
sequence number falls within the scope of a FIPS 140-1/2 evaluation.
This means that the security boundary for evaluation of a product must
encompass management of the sequence number.
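To make the keystream reuse attack concrete, here is a small Python
sketch. The fixed 16-byte keystream below is a stand-in for the AES
output produced under one key with one repeated counter value; the
plaintexts are of course hypothetical:

```python
# Sketch of why reusing a per-packet CM input is catastrophic: the same
# key + same counter input yields the same keystream for both packets.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

keystream = bytes(range(16))      # stands in for AES-CTR output

pt1 = b"attack at dawn!!"
pt2 = b"retreat at noon!"

ct1 = xor_bytes(pt1, keystream)
ct2 = xor_bytes(pt2, keystream)

# A wiretapper who sees both ciphertexts recovers the XOR of the two
# plaintexts without ever learning the key:
assert xor_bytes(ct1, ct2) == xor_bytes(pt1, pt2)
```

Given the XOR of two plaintexts, standard cryptanalytic techniques
(e.g., crib dragging against likely packet headers) often recover both
messages, which is why this counts as a serious compromise rather than
a theoretical weakness.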
If sequence numbers are managed on crypto chips, this is not a
problem. But, if a product wants to manage sequence numbers off chip,
e.g., to better support scaling in the ways described above, then the
security evaluation boundary will have to encompass a much larger
portion of the IPsec product. In many instances, this will likely
limit the level of assurance at which a product can be evaluated. In
pragmatic terms, the security of the encryption process will almost
certainly be lower, because the per-packet CM input is now managed in
a less secure fashion. Some products might even forgo FIPS evaluation
entirely, because if the security evaluation boundary grows to
encompass most of the product software, any change in that software
voids the evaluation!
This also creates a potentially very odd situation: if an IPsec
product supports both CM and CBC modes, the security of the product
might vary depending on which mode was used. This could arise if the
security critical inputs to the encryption mode were managed in
different places. For example, for CBC mode on-chip IV generation
could be very secure, but off-chip sequence number management would
result in lower assurance for CM, if the sequence number were used as
the per-packet input.
This discussion of security evaluation boundaries is not based on
arbitrary evaluation criteria. It has a very pragmatic basis.
Experience has shown that many, if not most, of the security
vulnerabilities that arise in fielded products are a result of
implementation errors. Increasing the size of the security boundary
either reduces the likely level of assurance that is achieved, or
increases the costs associated with engineering the product (for a
fixed level of assurance). In general, architectural approaches that
make it easy to ensure keystream uniqueness make it more likely that a
product will avoid vulnerabilities due to keystream generation
failures.
To put this issue in perspective, the current ID warns implementers
against generating keystream for more than 2^64 blocks (not packets),
even though we do not know of any specific vulnerabilities that could
be exploited if this limit is exceeded. Yet we understand precisely
what attacks could result if keystream is reused. Thus it makes sense
to adopt approaches to keystream management that do not make it harder
to do this task in a very secure fashion.
This analysis argues for adopting an approach to generating the
per-packet CM input in a fashion that can be uniformly secure,
irrespective of whether the ESP sequence number is managed on chip or
off. For performance reasons, one would like an approach that allows
multiple chips to generate keystream for packets transmitted on the
same SA, while ensuring no keystream reuse and without having to
synchronize values (across chips) that change on a per-packet basis.
If one carries an explicit, per-packet IV, these goals can be
achieved.
There are several ways one can generate the per-packet IV in a secure
fashion across multiple chips. For example, one can assign a distinct,
small ID number to each chip (or each core on a chip), and include
that number as part of a 64-bit IV. This ensures that each chip can
generate a unique IV sequence in a purely local fashion, using the
per-chip (or per-core) ID as part of the IV. There is no need to
specify in a standard what portion of the IV is devoted to a per-chip
ID, since to a receiver the whole IV is opaque. This means that an
IPsec implementation that has just one chip can use the whole IV as a
counter or an LFSR, while a multi-chip implementation can assign as
many bits as are needed to allow for maximum, per-SA parallelism. This
is precisely the sort of opportunity for local optimization, with no
adverse implications for receiver implementation choices, that one
should strive for in a standard.
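As one illustration of this local partitioning, the sketch below
splits a 64-bit IV into a per-chip ID and a chip-local counter. The
8-bit ID width is an assumption for illustration; any split works,
since the receiver treats the whole IV as opaque:

```python
# Hypothetical partition of the 64-bit IV: high bits carry a per-chip
# (or per-core) ID, low bits a chip-local counter. The split is a
# transmitter-local choice and is invisible to the receiver.
CHIP_ID_BITS = 8                     # assumed width, not standardized
COUNTER_BITS = 64 - CHIP_ID_BITS

def make_iv(chip_id, local_counter):
    assert chip_id < (1 << CHIP_ID_BITS)
    assert local_counter < (1 << COUNTER_BITS)
    return (chip_id << COUNTER_BITS) | local_counter

# Two chips can never emit the same IV, with no cross-chip
# synchronization of per-packet state:
assert make_iv(0, 5) != make_iv(1, 5)
```

A single-chip implementation would simply set CHIP_ID_BITS to zero and
use the entire 64 bits as its counter (or LFSR state).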
Counters vs. Linear Feedback Shift Registers (LFSRs)
Finally, there has been discussion of the merits of using counters vs.
LFSR to generate CM input sequences. LFSRs were used in the ATM Forum
standard for CM encryption. They exhibit advantages, relative to
adders, for generating a sequence of inputs in very high performance
hardware. The time required by an LFSR to generate the next value in
the sequence is constant, very small, and the hardware required is
also minimal. In contrast, if one generates a sequence of inputs using
a counter, the counter must allow for time to propagate carries, and
the counter is bigger than an LFSR, for comparable input sizes. (One
can diminish the delay attributable to carry propagation by devoting
more gates to the adder, but it is not likely to be faster than an
LFSR generating the same size sequence, and it will certainly be
bigger in gate count.)
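As an illustration (not the ATM Forum construction), a maximal-length
64-bit Galois LFSR can be sketched in a few lines; the tap mask below
is one standard choice of feedback polynomial:

```python
def lfsr_step(state, taps=0xD800000000000000):
    # 64-bit Galois LFSR, taps at bits 64, 63, 61, 60: a maximal-length
    # configuration with period 2**64 - 1. State must be nonzero.
    # Each step is one shift plus a conditional XOR, which is why the
    # hardware cost and latency are so small: no carry propagation.
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps
    return state

# Successive states are distinct until the full period is exhausted:
s = 0xACE1
seen = set()
for _ in range(1000):
    assert s not in seen and s != 0
    seen.add(s)
    s = lfsr_step(s)
```

The states do not appear in numerical order, but for CM input
generation order is irrelevant; only uniqueness within the key's
lifetime matters.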
Note that these arguments do not apply to software implementations,
especially ones making use of general purpose processors. In those
implementations, an LFSR would generally be slower than use of a
counter, given the availability of a 32 or 64-bit adder in the CPU.
Software implementations are not capable of very high performance, and
so there is no motivation to optimize for use of software with an
LFSR. But it is important to not burden software implementations that
communicate with very high performance hardware.
In the context of ESP, we argued that it is important to use a
well-defined, standard way to generate the intra-packet input
sequences. We also argued that it is preferable to have just one way
to do this. Using an LFSR here would yield performance benefits, but
these benefits may not be great, since the intra-packet sequences need
only be 28-bit values. Also, we want to preserve flexibility for
transmitters with regard to keystream generation. It does not seem
easy to standardize a means for using LFSRs for intra-packet keystream
generation. This leads to the suggestion to use a (28-bit) counter for
this part of the CM input.
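One plausible layout of the resulting 128-bit CM input combines the
64-bit per-packet IV (generated however the transmitter likes) with
the standardized 28-bit intra-packet counter. The exact bit positions
here are illustrative assumptions, not taken from the draft:

```python
# Hypothetical 128-bit AES-CTR input block: 64-bit per-packet IV in the
# high half, 28-bit intra-packet block counter in the low bits, and the
# remaining 36 bits left zero for illustration.
def ctr_block_input(iv64, block_index):
    assert iv64 < (1 << 64)
    assert block_index < (1 << 28)       # 28-bit intra-packet counter
    return (iv64 << 64) | block_index

# Consecutive keystream blocks of one packet differ only in the
# counter; different packets differ in the IV half:
assert ctr_block_input(0xDEADBEEF, 0) + 1 == ctr_block_input(0xDEADBEEF, 1)
```

Note that 28 bits of block counter bound the keystream per packet at
2^28 blocks, far more than any IP packet needs, while leaving the IV
half free for the per-chip partitioning discussed earlier.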
However, generation of per-packet (vs. intra-packet) inputs to CM does
not have to be standardized, as explained in earlier analysis. We also
have seen how some implementations might need to generate a new
per-packet input as fast as an AES implementation can perform one
round, the same speed at which the intra-packet sequences need to be
generated. Thus it is attractive to allow for use of LFSRs to generate
per-packet inputs, i.e., IVs. Note too that because the per-packet
values are bigger (64-bits vs. 28-bits), the performance and gate
count advantages for an LFSR vs. a counter can be greater, at least in
principle, for hardware.
By allowing an implementer the freedom to choose how to generate these
IVs, no constraints (other than uniqueness) are imposed, and there is
no adverse effect on receivers, to whom these values are opaque. Thus
a software implementation that receives packets from a hardware
implementation using an LFSR for IVs is not affected by the
transmitter's implementation choice. The same is true in the other
direction, i.e., use of a counter for IV generation does not affect a
receiver.
The Bottom Line
The preceding, lengthy analysis has indicated why it is preferable to
carry an explicit IV for per-packet synchronization for CM, in a range
of contexts. Using explicit IVs allows product designers flexibility
for improving encryption performance, while maintaining or improving
security, relative to reusing the ESP sequence number as an IV.
Reusing the ESP sequence number in this fashion precludes some designs
that increase performance, as illustrated in the analysis. In some
contexts, e.g., PPVPN designs, reuse of the sequence number will
either make FIPS 140 evaluation harder and more expensive, or will
reduce the security assurance level of products, when this mode is
employed.
The only downside to this choice is the
overhead associated with the IV. This overhead is, on average, less
than what most IPsec implementations currently experience with
DES/3DES CBC, because CM does not require block size padding. On a
percentage basis, the overhead of an explicit IV is typically quite
small. We have been told that in the VoIP context even this small
overhead is a concern, but we have no detailed analysis of the extent
of the overhead relative to overall packet size. We also have been
told that these concerns have led to the development of an alternative
security protocol, SRTP, which is optimized for VoIP. I
believe IPsec should not skew its use of CM to better support any one
application, even an important one like VoIP, especially given the
possibility of using an alternative security protocol.
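A back-of-the-envelope comparison of the per-packet overheads supports
the claim above. The figures are illustrative, not drawn from any
measurement, and ignore the fixed ESP trailer fields common to both
modes:

```python
# Rough overhead comparison: CBC with an 8-byte block (DES/3DES) needs
# an 8-byte IV plus padding to a block boundary (0-7 bytes, 3.5 on
# average); CM needs only the explicit 8-byte IV, with no
# block-alignment padding.
def cbc_overhead(payload_len, block=8):
    pad = (-payload_len) % block
    return 8 + pad                 # IV + padding

def cm_overhead(payload_len):
    return 8                       # explicit IV only

# Averaged over payload lengths spanning one block:
avg_cbc = sum(cbc_overhead(n) for n in range(64, 72)) / 8
assert avg_cbc > cm_overhead(0)    # 11.5 bytes vs. a flat 8 bytes
```

So on average the explicit-IV CM packet is smaller than today's
DES/3DES CBC packet, which is the sense in which the IV overhead is
"less than what most IPsec implementations currently experience."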