Exploring how subtle implementation considerations can expose application-critical information.
Side-channel attacks (SCAs) are fascinating: they're usually either dismissed as theoretical curiosities with prerequisites outside the threat model of most developers, or they're practical attacks like Spectre that affect a huge number of products and require an architectural overhaul from vendors.
This post does not cover SCAs in the realm of electromagnetic emanations, fault injection, or even microarchitectural branch prediction. Instead, it explores timing-based information disclosure techniques that can be readily leveraged at a higher level, against a suite of algorithms deployed in real-world software (with examples focusing upon string comparisons in particular). Note that the concept explored here - using timing SCAs against common algorithms - is neither novel nor an exploit in itself; the algorithms and relevant implementations discussed here are not innately vulnerable, but rather use optimisations that make them insecure for processing sensitive data, a fact that developers often miss when integrating such functionality. Correspondingly, there should be two main takeaways from this post:
This concept was brought to my attention whilst auditing a project that used C's strcmp function to compare sensitive strings. It turned out that some people had written about this exact concept previously in a similar product area to the one I was looking into - check out Microsoft's NETGEAR writeup and Jeff Sutch's writeup about strcmp timing attacks for some interesting further reading.
Searching and comparison algorithms are abundantly littered across the standard library of most modern programming languages - C and PHP have strcmp, C++ has strcmp and std::string::operator==, Python has an equality operator, and so does Rust. So imagine this: you're tasked with writing a function that takes a user's password guess and the correct answer (both as strings), and returns a boolean indicating whether the guess was correct. Generally, people tend to write code like the instance below, which uses a standard language feature (the equality operator) to abstract away from any character/byte-level comparisons.
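A representative sketch of such a function might look like the following (the function name is hypothetical; std::string's operator== does the actual comparison):

```cpp
#include <string>

// Naive password check: delegates the comparison to std::string's
// equality operator (the standard language feature discussed above).
bool check_password(const std::string& guess, const std::string& correct) {
    return guess == correct;
}
```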
Ignore the fact that std::string is used to store sensitive data; that in itself is an issue from a forensic perspective.
In this C++ example, the equality operator is implemented by the standard library that ships with whichever compiler you use to produce an executable; for this example, let's say you are using C++ with libstdc++ 4.5 and expand from there. The equality operator between two variables of std::string type calls std::basic_string's compare member function, which in turn defers to char_traits<char>::compare.
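A paraphrased sketch of that generic comparison loop is shown below (the eq/lt helpers are members of the traits class in the real code; they're inlined here as lambdas so the snippet stands alone):

```cpp
#include <cstddef>

// Paraphrase of the generic character-traits comparison loop: eq tests
// equality, lt tests ordering, and the loop bails on the first mismatch.
template <typename CharT>
int traits_compare(const CharT* p, const CharT* q, std::size_t n) {
    auto eq = [](CharT a, CharT b) { return a == b; };
    auto lt = [](CharT a, CharT b) { return a < b; };
    for (std::size_t i = 0; i < n; ++i, ++p, ++q)
        if (!eq(*p, *q)) return lt(*p, *q) ? -1 : 1; // early-out on mismatch
    return 0; // all n characters matched
}
```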
The key here is really the line if (!eq(*p,*q)) return lt(*p,*q)?-1:1; as it shows that the function adopts an 'early-out' optimisation: as soon as the result is known, the function returns instead of comparing the rest of the string. This early-out optimisation is intended to reduce the CPU cycles spent comparing strings that have already been proven unequal, but in doing so it inadvertently introduces a timing discrepancy between strings that match different numbers of characters. If this discrepancy can be measured by an adversary, the theory goes that they can derive information about the comparison that in many cases should be kept private.
This post aims to be language-agnostic, so let's examine how the standard library comparison functions work for different languages:
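As one illustration, C's strcmp is commonly implemented along the following lines (a paraphrase of typical libc code rather than any specific implementation; the name is changed to avoid shadowing the real function):

```cpp
// Typical shape of strcmp: the loop stops at the first mismatching byte,
// so runtime grows with the length of the matched prefix.
int strcmp_like(const char* s1, const char* s2) {
    while (*s1 && (*s1 == *s2)) { ++s1; ++s2; }
    return static_cast<unsigned char>(*s1) - static_cast<unsigned char>(*s2);
}
```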
Each implementation is effectively the same in including early-out optimisations to save unnecessary CPU cycles. As a result, each of the above standard-library functions is susceptible to the underlying concept explored throughout this post: using timing discrepancies to leak sensitive data. More broadly, such considerations apply to the majority of widely-deployed software, and the impact of timing discrepancies can therefore include leaking application-critical data.
This post will focus upon timing attacks through return times; however, there are lots of other vectors for timing attacks and SCAs in general (hence the emergence of EMSEC as a field of research/protection).
With the above example providing an introduction to timing-based side-channel attacks, it is worth considering that there are other sources of timing differences, such as:
As a result of these and other considerations (e.g., different algorithms being employed by the developer), it is possible that the correlation between a targeted function's return time, $t$, and the application's sensitive data, $d$, may be nonlinear (such as a logarithmic correlation that is harder to identify from a sample of return timestamps but is still a present emanation of information), a factor that is accounted for in the following section.
Once an adversary has identified a possible source of information disclosure via timing discrepancies, they may want to gather evidence that the algorithm's timing genuinely correlates with its input/output. In other words, to prove or exploit a timing discrepancy, a function should be defined to satisfy $\Pr[f(t)=d] \geq 1-l$, with $0 \leq l < 1$ as a noise tolerance which is tweaked based on the range of data over multiple samples.
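To make that criterion concrete, a small (hypothetical) helper can estimate the probability empirically from labelled samples:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct Sample {
    double t; // observed return time
    int d;    // ground-truth sensitive value (e.g., matched element count)
};

// Empirical check of Pr[f(t) = d] >= 1 - l: the fraction of samples on
// which the candidate predictor f recovers the known ground truth.
double predictor_accuracy(const std::vector<Sample>& samples,
                          const std::function<int(double)>& f) {
    if (samples.empty()) return 0.0;
    std::size_t hits = 0;
    for (const Sample& s : samples)
        if (f(s.t) == s.d) ++hits;
    return static_cast<double>(hits) / samples.size();
}
```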
Deriving such a function $f(t)$ that enables timing attacks is essential for an adversary, but depending on their level of access to a target application, they may need to tweak their approach to this:
There are quite a few reasons that an adversary may choose not to leverage the code of a targeted application to build a SCA:
| Source Code Available | Black-box Analysis |
| --- | --- |
| Researcher lacks access to the entire codebase | Unable to map the underlying execution of the application from execution analysis alone |
| Researcher lacks the time/knowledge to learn the codebase | Disassembling the binary is infeasible (e.g., lack of access to an architectural reference manual) |
| Impractical to account for all compiler-level optimisations in the codebase | Unable to identify the full call stack |
| Large combined first/third-party tech stack where noise could be present | |
In these instances, it is possible that - with enough samples of the targeted application's execution - the researcher may still be able to identify a SCA; hence the approach of developing a large 'sample corpus': a repository of execution logs consisting of input values, the application's runtime duration, and any other potentially relevant metadata (e.g., file I/O timings, mutex acquisitions/releases if possible).
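One hypothetical shape for a corpus entry (field names assumed):

```cpp
#include <chrono>
#include <string>
#include <vector>

// A single execution log entry in the sample corpus described above.
struct ExecutionSample {
    std::string input;                 // the value submitted to the application
    std::chrono::nanoseconds runtime;  // observed end-to-end duration
    std::vector<std::string> metadata; // e.g., file I/O timings, lock events
};
```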
The main drawbacks of this approach are that it requires sustained access to a deployment of the application being researched in order to gather execution samples, and that the researcher may never fully understand the origin or nature of the SCA vector, making both remediation efforts and further research difficult without the vendor's cooperation in explaining the vulnerability.
With a typical brute-force attack where $c$ represents the size of the element-set and $n$ represents the length of data being derived, an adversary would need to make $c^n$ attempts (at most), whereas this technique - requiring that an adversary only guess a single element at a time - requires $c \cdot n$ attempts at most. This is a significant time optimisation when predicting long pieces of data like SHA-512 hashes, which can therefore be cracked with over $8.18\cdot10^{149}$ times fewer attempts. The most apparent difference between the computational complexity of eliciting data with conventional vs. SCA-backed brute-force attempts is that conventional brute-forcing is exponential in its difficulty with respect to the length of the data itself ($c^n$), whereas this example of a side-channel-backed attack is linear in its complexity ($c \cdot n$).
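A hypothetical sketch of that linear-complexity attack is below; time_guess stands in for whatever (noise-averaged) timing oracle the adversary has against the target, and bytes are used as the element set ($c = 256$):

```cpp
#include <cstddef>
#include <string>

// Recover a secret one byte at a time: for each position, try every
// candidate byte and keep the one whose rejection took the longest
// (i.e., the one that extended the matched prefix).
std::string recover_secret(std::size_t length,
                           double (*time_guess)(const std::string&)) {
    std::string recovered;
    for (std::size_t pos = 0; pos < length; ++pos) {
        int best_byte = 0;
        double best_time = -1.0;
        for (int candidate = 0; candidate < 256; ++candidate) {
            std::string guess = recovered;
            guess.push_back(static_cast<char>(candidate));
            guess.resize(length, '\0'); // pad so every guess has equal length
            const double t = time_guess(guess);
            if (t > best_time) {
                best_time = t;
                best_byte = candidate;
            }
        }
        recovered.push_back(static_cast<char>(best_byte));
    }
    return recovered; // at most c * n oracle queries instead of c^n
}
```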
For side-channel vulnerabilities that originate from the use of linear comparison functions such as strcmp, the correlation between time and data can be modelled as a traditional linear equation $f(t)=\frac{di}{dt}t+b$, where $b$ is a generic offset (usually the minimum value of $t$ across the sample dataset) and $d=i$, as knowing how many elements matched in a searching process enables the linear-complexity brute-force attack.
The adversary's first goal may be to prove the existence of a correlation between the time taken to return, $t_i$ (referred to as $t$), and the number of elements (e.g., characters, bytes) matched, $i$. If both variables were linearly correlated, a Pearson's correlation coefficient $\rho$ could be devised wherein $\rho_{t_i,i} \neq 0$, with larger coefficients corresponding to attacks that are more stable and require less sample data to divulge internal data (how many elements were successfully compared in this instance).
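A short sketch of computing that coefficient over recorded samples (assuming two equal-length, non-empty vectors):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson's correlation coefficient between return times t and
// matched-element counts i; values near 1 indicate a stable linear SCA.
double pearson(const std::vector<double>& t, const std::vector<double>& i) {
    const std::size_t n = t.size();
    double mean_t = 0.0, mean_i = 0.0;
    for (std::size_t k = 0; k < n; ++k) { mean_t += t[k]; mean_i += i[k]; }
    mean_t /= n; mean_i /= n;
    double num = 0.0, var_t = 0.0, var_i = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        num   += (t[k] - mean_t) * (i[k] - mean_i);
        var_t += (t[k] - mean_t) * (t[k] - mean_t);
        var_i += (i[k] - mean_i) * (i[k] - mean_i);
    }
    return num / std::sqrt(var_t * var_i);
}
```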
Using C's strcmp function (as called from C++) as an example, below is a graph of return times (abstracted away from a specific unit, but initially measured in 'clock ticks' via clock()).
Sampled over 1,000,000,000 strcmp calls per element. Buffers were taken from argv to prevent optimisation and were of equal length.
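A minimal sketch of how such samples could be gathered (an assumed methodology, not the exact harness used for the graph):

```cpp
#include <cstdio>
#include <cstring>
#include <ctime>

// Time strcmp over many iterations for a secret/guess pair supplied via
// argv (which prevents constant folding of the comparison).
int main(int argc, char** argv) {
    if (argc < 3) return 1;
    // Calling through a volatile function pointer stops the compiler from
    // hoisting the comparison out of the loop.
    int (*volatile cmp)(const char*, const char*) = std::strcmp;
    const long iterations = 1000000L;
    volatile int sink = 0;
    const std::clock_t start = std::clock();
    for (long i = 0; i < iterations; ++i)
        sink += cmp(argv[1], argv[2]);
    const std::clock_t end = std::clock();
    std::printf("%ld clock ticks over %ld calls (sink=%d)\n",
                static_cast<long>(end - start), iterations, sink);
    return 0;
}
```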
For this data: $\rho_{t,i} = 0.9878$, indicating a strong correlation between the portion of input data matched and runtime duration. With this confirmation, a predictive function can simply take the form of a line of best fit across these samples: $f(t)=\lfloor\frac{1}{1.23\cdot10^6}t-3.9\rfloor$. Now, when an adversary wants to know how many elements of their brute-force attempt were valid, they just record the execution time of strcmp and feed it through the above function.
This approach is oriented to situations where the adversary has access to an application's code (either as pseudocode, actual source code, or perhaps disassembled machine code dumped from memory) but cannot (or doesn't want to) run the affected sections of the application to produce a collection of samples - this can be for a number of reasons such as resource availability, an inability to properly emulate architectural nuances of the application in its target environment, or some quirk of the SCA vector that lends itself to logical analysis.
In order to pursue this technique, the researcher will generally have access to a decompiled view of the application binary. Otherwise, when only source code or design documents are available, compiler-level optimisations and OS-level nuances (interrupts, context switching, syscalls, etc.) won't be represented, meaning that these artefacts are only really suitable for finding algorithmic issues (like the early-out optimisations covered above).
With a grounded understanding of how instructions are executed, the following assembly code example can be evaluated for timing discrepancies (see the comments on the relevant lines):
```armasm
; Search result explanation:
; value not found ==> R8 holds -1
; value found     ==> R8 holds the address of the value
RESULT RN R8
HEAD   RN R9  ; array head
BACK   RN R10 ; one past the last array item
SIZE   RN R11 ; used for saving the array size
VALUE  RN R12 ; used for saving the value being searched for
       AREA linearSearch, CODE, READONLY
       ENTRY
       EXPORT __main
__main
       ; initialisation
       MOV SIZE, #10      ; array length
       MOV VALUE, #13     ; value to search for
       MOV R1, #4
       MUL R0, SIZE, R1   ; R0 = byte length of the array
       ADR HEAD, ARRAY
       ADD BACK, HEAD, R0 ; HEAD and BACK pointers are initialised
; R0 ==> current value
SEARCH                    ; iterate until the value is found or the array is exhausted
       LDR R0, [HEAD], #4 ; A cache miss could disclose which array is being used.
       CMP R0, VALUE
       BEQ FOUND          ; Branch misprediction penalty could disclose which elements matched.
       CMP HEAD, BACK
       BEQ NOTFOUND
       B SEARCH
FOUND
       SUB RESULT, HEAD, #4
       B STOP             ; Early-out optimisation.
NOTFOUND
       MOV RESULT, #-1
       B STOP
ARRAY  DCD 2, 12, 13, 23, 25, 30, 34, 41, 45, 54
STOP   B STOP
       END
; This code is contributed by Mohammad Amin Sarbazi
```
Noise can originate from a range of sources: in other SCA vectors we would need to account for variables like background EM interference, acoustics/vibrations from environmental sources, or other factors like power consumption. Luckily, whilst analysing timing differentials, the main sources of noise are somewhat limited to three factors:
If the attack vector for your identified SCA is running on another host, there will likely be an increased delay between your request and the vulnerable application's response. This delay is generally considered pseudorandom and can be influenced by a number of things, including:
There are two ways to negate this delay's effect on a SCA's viability: firstly, an adversary can try to find a source of time that was obtained before network transmission and is embedded within the data (either implicitly or explicitly; this data needs to be of a high resolution), or they may attempt to get as close as they can to the network and measure the server's latency on 'dummy packets' before subtracting that value from the timed responses.
The first approach - finding another source of time information within the packet - is generally the most effective way to negate networking effects, as TCP has an optional 'timestamp' packet header with a resolution of between 1ms and 500ms depending on the platform. The latter approach is more of a 'shot in the dark' whose efficacy is usually only stable when the adversary can get very close to the network or $t$ is already large in proportion to the network delay.
If neither of these approaches is tenable for a specific situation, it may be possible to 'cluster' elements into larger sets, wherein $t'_{i'} - t'_{i'-1}$ is a larger portion of the round-trip time (RTT) than its non-clustered counterpart. This can be achieved by considering each 'attempt' to be multiple elements (for time modelling, this effectively increases $c$, decreases $n$, and means that each $i'$ in the new model actually corresponds to multiple $i$ in the old one) so that once a partial match is achieved, the time difference is more apparent. As a result, a cluster size of $k$ makes the element set effectively $c' = c^k$, as $\overline{t}$ remains unchanged but $t'_{i'} = \sum_{j=0}^{k-1} t_{k \cdot i' - j}$. Therefore, $k$ would be tweaked to introduce the largest possible timing discrepancy whilst maintaining the reduced complexity of deriving sensitive data in comparison to conventional linear brute-force attacks.
Network latency isn't always present: in many local privilege escalation (LPE) attacks (e.g., where the attack vector is a kernel-mode driver), there is no TCP/IP overhead, making exploits significantly less complicated for adversaries who simply need to measure how long a hash spent being processed by a kernel-mode application.
Generally, there is an entire stack of technology encompassing the asset which is vulnerable to a SCA (even outside the context of networked applications). Even in exclusively usermode cases, you'd need to deal with CPU context switching, any other code between the request handler and the comparison function, and IPC delays if applicable. While some of these delays are predictable, others involve too many variables to be consistently calculated and subtracted from the time taken for a request to be completed. As with network noise, a good way to account for the impact of extraneous computation is to use the aforementioned 'clustering' technique to ensure that upon finding a match, there is a more noticeable time difference. Otherwise, gathering a large corpus of statistical information may still make it possible to identify relevant pieces of data.
Lots of effort has gone into mitigating timing discrepancies in the vast majority of applications; particularly in languages perceived as memory-safe, protecting against such logical/timing issues is a primary focus for developers. A common technique for mitigating time-based inferences is the use of randomised sleeps throughout sensitive execution. This can be in the form of a sleep(random()) placed before the return statement, or by fixing a function's execution duration. While both of these mitigations are relatively solid in principle, their implementations are often flawed and may rely on pseudorandom number generators (PRNGs) that can be predicted or influenced by the adversary. Where the PRNG cannot be attacked directly, an adversary's next best option is to find some other derivative of the delay, or an artefact of the fixed execution period, that can be accounted for when analysing response times.
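For instance, a fixed-duration wrapper might look like the sketch below (the 5ms budget is an assumed upper bound on the comparison's worst-case runtime):

```cpp
#include <chrono>
#include <string>
#include <thread>

// Pad the sensitive operation out to a constant budget so the
// caller-visible response time no longer depends on the data compared.
bool fixed_duration_check(const std::string& guess, const std::string& secret) {
    const auto start = std::chrono::steady_clock::now();
    const bool result = (guess == secret); // still variable-time internally
    std::this_thread::sleep_until(start + std::chrono::milliseconds(5));
    return result;
}
```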
Generally, patching time-based side-channel attacks can be achieved by asserting constant-time execution regardless of data content. That might mean using the XOR operator to compare strings, and making sure that the number of iterations/instructions executed remains indistinguishable between partial-failure cases. With respect to the main recommendations, cryptographically sound libraries are available for all of the programming languages mentioned here (Rust, C, C++, Python, PHP) and generally have an audited implementation of data comparison/searching functions similar to the following.
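(The snippet below is a minimal illustrative sketch, not the audited code such libraries ship - prefer a vetted primitive like libsodium's sodium_memcmp in production.)

```cpp
#include <cstddef>
#include <cstdint>

// XOR-based constant-time comparison: every byte pair is visited and the
// differences accumulated, so there is no early exit and the loop's
// duration depends only on len, not on where the inputs differ.
bool constant_time_equals(const std::uint8_t* a, const std::uint8_t* b,
                          std::size_t len) {
    volatile std::uint8_t diff = 0;
    for (std::size_t i = 0; i < len; ++i)
        diff |= static_cast<std::uint8_t>(a[i] ^ b[i]);
    return diff == 0;
}
```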
Another viable technique may be to simply hash both values and then perform a linear comparison between the digests, as any timing discrepancy will only reveal the digest byte that differed, and it is generally assumed that adversaries are unable to generate hashes that match their exact preference.
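A sketch of that approach using OpenSSL's one-shot SHA256 helper (an assumed choice of hash and library; link with -lcrypto):

```cpp
#include <cstring>
#include <string>
#include <openssl/sha.h>

// Hash both values first: any timing leak in the subsequent comparison
// only reveals which digest byte differed, which the adversary cannot
// meaningfully steer.
bool hashed_equals(const std::string& guess, const std::string& secret) {
    unsigned char a[SHA256_DIGEST_LENGTH];
    unsigned char b[SHA256_DIGEST_LENGTH];
    SHA256(reinterpret_cast<const unsigned char*>(guess.data()), guess.size(), a);
    SHA256(reinterpret_cast<const unsigned char*>(secret.data()), secret.size(), b);
    // Still variable-time, but over unpredictable digest bytes.
    return std::memcmp(a, b, SHA256_DIGEST_LENGTH) == 0;
}
```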
Additionally, if it isn't possible for an application to avoid side-channel attacks (in which case you should be rethinking the architectural choices that led to such a lack of versatility), it may be possible to limit the scope of such an attack by implementing rate-limiting at a scale aligned with the length of sensitive data which could be derived; by combining this with query exclusivity (i.e., using a mutex/single-thread lock to break parallelism in attacks) and access control even to authentication endpoints, it could take an infeasible amount of time to derive any data of meaning. For example: with a six-digit PIN that resets every five minutes, a linear-prediction timing attack without clustering would take at most 60 attempts, yet with a 30-second rate limit for each attempt and only a single request processed at a time, it would be infeasible to predict the code. Of course, that would come alongside significant performance degradations and should only be implemented as a last-ditch attempt to avoid such attacks (e.g., in the form of a reverse proxy placed in front of vulnerable applications which cannot be patched or discontinued).
While this post focused upon linear searching and the techniques applying directly to it, timing SCAs can originate from a huge variety of operations that interact with sensitive data: data retrieval times can vary between buffers cached by the CPU or split across memory pages (potentially backed by differing hardware with different R/W speeds), and return times (particularly for networked applications) can leak activity taking place elsewhere through mutex/lock acquisition times.
Techniques for identifying, quantifying, and exploiting SCAs are incredibly important for everyone, ranging from those responsible for designing/writing secure software to red teams involved with identifying and exploiting such issues, with blue teams needing the correct background knowledge in order to support the patching of these vulnerabilities.
If you have any useful resources in this area or have noticed an inaccuracy, send me an email (michaellrowley@protonmail.com) and I'll make an edit on this page.