Jump to content

ReDoS: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
See also: pentium bug? not really related.
Citation bot (talk | contribs)
Altered title. Added website. | Use this bot. Report bugs. | Suggested by Dominic3203 | Category:Pattern matching | #UCB_Category 21/28
 
(38 intermediate revisions by 26 users not shown)
Line 1: Line 1:
The '''regular expression denial of service''' ('''ReDoS''')<ref name="ReDoS in OWASP">
{{Short description|Regular expression denial-of-service attack}}
A '''regular expression denial of service''' ('''ReDoS''')<ref name="ReDoS in OWASP">
{{cite web
{{cite web
|author=[[OWASP]]
|author=OWASP
|author-link=OWASP
|date=2010-02-10
|date=2010-02-10
|url=http://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS
|url=http://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS
|title=Regex Denial of Service
|title=Regex Denial of Service
|accessdate=2010-04-16
|access-date=2010-04-16
}}
}}
</ref>
</ref>
is an [[algorithmic complexity attack]] that produces a [[denial-of-service]] by providing a [[regular expression]] that takes a very long time to evaluate. The attack exploits the fact that most [[Regular expression#Implementations|regular expression implementations]] have [[exponential time]] [[worst case complexity]]: the time taken can grow exponentially in relation to input size. An attacker can thus cause a program to spend an unbounded amount of time processing by providing such a regular expression, either slowing down or becoming unresponsive.<ref name="RiverStar">
is an [[algorithmic complexity attack]] that produces a [[denial-of-service]] by providing a [[regular expression]] and/or an input that takes a long time to evaluate. The attack exploits the fact that many<ref>{{Cite journal |last1=Davis |first1=James |last2=Louis |first2=Michael |last3=Coghlan |first3=Christy |last4=Servant |first4=Francisco |last5=Lee |first5=Dongyoon |date=2019 |title=Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions |url=https://davisjam.github.io//files/publications/DavisMichaelCoghlanServantLee-LinguaFranca-ESECFSE19.pdf |journal=The ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering |pages=443–454}}</ref> [[Regular expression#Implementations|regular expression implementations]] have super-linear [[worst-case complexity]]; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.<ref name="RiverStar">{{cite web
|author=RiverStar Software
{{cite web
|author=[[RiverStar Software]]
|author-link=RiverStar Software
|date=2010-01-18
|date=2010-01-18
|url=https://www.riverstarsoftware.com/kb/article/security-bulletin-caution-using-regular-expressions-17.html
|url=https://www.riverstarsoftware.com/kb/article/security-bulletin-caution-using-regular-expressions-17.html
|title=Security Bulletin: Caution Using Regular Expressions
|title=Security Bulletin: Caution Using Regular Expressions
|accessdate=2010-04-16
|access-date=2010-04-16
|archive-date=2011-07-15
}}
|archive-url=https://web.archive.org/web/20110715190436/https://www.riverstarsoftware.com/kb/article/security-bulletin-caution-using-regular-expressions-17.html
</ref><ref name="ModSecurity">
|url-status=dead
{{cite book
}}</ref><ref name="ModSecurity">{{cite book
| last = Ristic
| last = Ristic
| first = Ivan
| first = Ivan
Line 27: Line 30:
| url = https://www.feistyduck.com/books/modsecurity-handbook/index.html
| url = https://www.feistyduck.com/books/modsecurity-handbook/index.html
| isbn = 978-1-907117-02-2
| isbn = 978-1-907117-02-2
| access-date = 2010-04-16
| archive-date = 2016-08-08
| archive-url = https://web.archive.org/web/20160808184945/https://www.feistyduck.com/books/modsecurity-handbook/index.html
| url-status = dead
}}</ref>
}}</ref>


==Description==
==Description==
Regular expression matching can be done by building a [[finite-state automaton]]. Regular expressions can be easily converted to [[nondeterministic finite-state automaton|nondeterministic automata]] (NFAs), in which for each state and input symbol there may be several possible next states. After building the automaton, several possibilities exist:
[[Regular expression]] ("regex") matching can be done by building a [[finite-state automaton]]. Regex can be easily converted to [[nondeterministic finite-state automaton|nondeterministic automata]] (NFAs), in which for each state and input symbol, there may be several possible next states. After building the automaton, several possibilities exist:


* the engine may convert it to a [[Deterministic finite automaton|deterministic finite-state automaton (DFA)]] and run the input through the result;
* the engine may convert it to a [[Deterministic finite automaton|deterministic finite-state automaton (DFA)]] and run the input through the result;
* the engine may try one by one all the possible paths until a match is found or all the paths are tried and fail ("backtracking").<ref name="Crossby&Wallach">
* the engine may try one by one all the possible paths until a match is found or all the paths are tried and fail ("backtracking").<ref name="Crossby&Wallach">{{cite web
{{cite web
|author=Crosby and Wallach, Usenix Security
|author=Crosby and Wallach, Usenix Security
|year=2003
|year=2003
|url=http://www.cs.rice.edu/~scrosby/hash/slides/USENIX-RegexpWIP.2.ppt
|url=http://www.cs.rice.edu/~scrosby/hash/slides/USENIX-RegexpWIP.2.ppt
|title=Regular Expression Denial Of Service
|title=Regular Expression Denial Of Service
|accessdate=2010-01-13
|access-date=2010-01-13
|archive-date=2005-03-01
}}
|archive-url=https://web.archive.org/web/20050301230312/http://www.cs.rice.edu/~scrosby/hash/slides/USENIX-RegexpWIP.2.ppt
</ref><ref name="Sullivan">
|url-status=dead
}}</ref><ref name="Sullivan">
{{
{{
cite web
cite web
Line 48: Line 56:
|url=http://msdn.microsoft.com/en-au/magazine/ff646973.aspx
|url=http://msdn.microsoft.com/en-au/magazine/ff646973.aspx
|title=Regular Expression Denial of Service Attacks and Defenses
|title=Regular Expression Denial of Service Attacks and Defenses
|accessdate=2010-05-06
|access-date=2010-05-06
}}
}}
</ref>
</ref>
Line 60: Line 68:
|last3 = Thielecke | first3 = H.
|last3 = Thielecke | first3 = H.
|title = Static Analysis for Regular Expression Denial-of-Service Attacks
|title = Static Analysis for Regular Expression Denial-of-Service Attacks
|booktitle = Network and System Security
|book-title = Network and System Security
|place = Madrid, Spain
|place = Madrid, Spain
|pages = 135–148
|pages = 135–148
Line 71: Line 79:
The last two algorithms, however, do not exhibit pathological behavior.
The last two algorithms, however, do not exhibit pathological behavior.


Note that for non-pathological regular expressions the problematic algorithms are usually fast, and in practice one can expect them to "[[compiler|compile]]" a regular expression in O(m) time and match it in O(n) time; instead, simulation of an NFA and lazy computation of the DFA have [[Time complexity|O]](m<sup>2</sup>n) worst-case complexity.<ref>Lazy computation of the DFA can usually reach the speed of deterministic automatons while keeping worst case behavior similar to simulation of an NFA. However, it is considerably more complex to implement and can use more memory.</ref> Regular expression denial of service occurs when these expectations are applied to regular expressions provided by the user, and malicious regular expressions provided by the user trigger the worst-case complexity of the regular expression matcher.
Note that for non-pathological regular expressions, the problematic algorithms are usually fast, and in practice, one can expect them to "[[compiler|compile]]" a regex in O(m) time and match it in O(n) time; instead, simulation of an NFA and lazy computation of the DFA have [[Time complexity|O]](m<sup>2</sup>n) worst-case complexity.{{efn|Lazy computation of the DFA can usually reach the speed of deterministic automatons while keeping worst case behavior similar to simulation of an NFA. However, it is considerably more complex to implement and can use more memory.}} Regex denial of service occurs when these expectations are applied to a regex provided by the user, and malicious regular expressions provided by the user trigger the worst-case complexity of the regex matcher.


While regex algorithms can be written in an efficient way, most regular expression engines in existence extend the regular expression languages with additional constructs that cannot always be solved efficiently. Such [[Regular expression#Patterns for non-regular languages|extended patterns]] essentially force the implementation of regular expression in most [[programming language]]s to use backtracking.
While regex algorithms can be written in an efficient way, most regex engines in existence extend the regex languages with additional constructs that cannot always be solved efficiently. Such [[Regular expression#Patterns for non-regular languages|extended patterns]] essentially force the implementation of regex in most [[programming language]]s to use backtracking.


==Examples==
==Examples==
===Malicious regexes===
===Exponential backtracking===
Malicious regexes that get stuck on crafted input can be different depending on the regular expression matcher that is under attack. For backtracking matchers, they occur whenever these factors occur:<ref name="Podcast">
The most severe type of problem happens with backtracking regular expression matches, where some patterns have a runtime that is exponential in the length of the input string.<ref name="Podcast">
{{cite web
{{cite web
|author=Jim Manico and Adar Weidman
|author=Jim Manico and Adar Weidman
Line 83: Line 91:
|url=http://www.owasp.org/index.php/Podcast_56
|url=http://www.owasp.org/index.php/Podcast_56
|title=OWASP Podcast 56 (ReDoS)
|title=OWASP Podcast 56 (ReDoS)
|accessdate=2010-04-02
|access-date=2010-04-02
}}
}}
</ref> For strings of <math>n</math> characters, the runtime is <math>O(2^n)</math>. This happens when a regular expression has three properties:
</ref>
* the regular expression applies repetition ("+", "*") to a complex subexpression;
* for the repeated subexpression, there exists a match which is also a suffix of another valid match.


* the regular expression applies repetition (<code>+</code>, <code>*</code>) to a subexpression;
The second condition is best explained with an example: in the regular expression <code>(a[ab]*)+</code>, both "{{tt|a}}" and "{{tt|aa}}" can match the repeated subexpression <code>a[ab]*</code>. Therefore, after matching "{{tt|a}}", the nondeterministic automaton may try a new match of <code>a[ab]*</code> or a new match of {{tt|a}}. If the input has many consecutive "{{tt|a}}"s, each of them will double the number of possible paths through the automaton. Examples of malicious regexes include the following:
* the subexpression can match the same input in multiple ways, or the subexpression can match an input string which is a prefix of a longer possible match;
* <code>(a+)+</code>
* and after the repeated subexpression, there is an expression that matches something which the subexpression does not match.
* <code>([a-zA-Z]+)*</code>
* <code>(a|aa)+</code>
* <code>(a|a?)+</code>
* <code>(.*a){x}</code> for x > 10


The second condition is best explained with two examples:
All the above are susceptible to the input {{tt|{{not a typo|aaaaaaaaaaaaaaaaaaaaaaaa!}}}}. (The minimum input length might change slightly, when using faster or slower machines.)


* in <code>(a|a)+$</code>, repetition is applied to the subexpression <code>a|a</code>, which can match <code>a</code> in two ways on each side of the alternation.
Other patterns may not cause an exponential behavior, but for long enough inputs (a few hundreds of characters, usually) they could still cause long elaboration times. An example of such a pattern is "<code>a*b?a*x</code>", the input being an arbitrarily long sequence of "{{tt|a}}"s. Such a pattern may also cause backtracking matchers to hang.
* in <code>(a+)*$</code>, repetition is applied to the subexpression <code>a+</code>, which can match <code>a</code> or <code>aa</code>, etc.

In both of these examples we used <code>$</code> to match the end of the string, satisfying the third condition, but it is also possible to use another character for this. For example <code>(a|aa)*c</code> has the same problematic structure.

All three of the above regular expressions will exhibit exponential runtime when applied to strings of the form <math>a...a!</math>. For example, if you try to match them against <code>aaaaaaaaaaaaaaaaaaaaaaaa!</code> on a backtracking expression engine, it will take a significantly long time to complete, and the runtime will approximately double for each extra <code>a</code> before the <code>!</code>.

It is also possible to have backtracking which is polynomial time <math>O(n^x)</math>, instead of exponential. This can also cause problems for long enough inputs, though less attention has been paid to this problem as malicious input must be much longer to have a significant effect. An example of such a pattern is "<code>a*b?a*x</code>", when the input is an arbitrarily long sequence of "<code>a</code>"s.


===Vulnerable regexes in online repositories===
===Vulnerable regexes in online repositories===
So-called "evil" or malicious regexes have been found in online regular expression repositories. Note that it is enough to find an evil ''sub''expression in order to attack the full regex:
So-called "evil" or vulnerable regexes have been found in online regular expression repositories. Note that it is enough to find a vulnerable ''sub''expression in order to attack the full regex:


# [http://regexlib.com/REDetails.aspx?regexp_id=1757 RegExLib, id=1757 (email validation)] - see {{red|red}} part, which is an Evil Regex<br /><code>^([a-zA-Z0-9]){{red|<nowiki>(([\-.]|[_]+)?([a-zA-Z0-9]+))*</nowiki>}}(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$</code>
# [http://regexlib.com/REDetails.aspx?regexp_id=1757 RegExLib, id=1757 (email validation)] - see {{red|red}} part<br /><code>^([a-zA-Z0-9]){{red|<nowiki>(([\-.]|[_]+)?([a-zA-Z0-9]+))*</nowiki>}}(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$</code>
# [http://www.owasp.org/index.php/OWASP_Validation_Regex_Repository OWASP Validation Regex Repository], Java Classname - see {{red|red}} part, which is an Evil Regex<br /><code>^{{red|(([a-z])+.)+}}[A-Z]([a-z])+$</code>
# [http://www.owasp.org/index.php/OWASP_Validation_Regex_Repository OWASP Validation Regex Repository], Java Classname - see {{red|red}} part<br /><code>^{{red|(([a-z])+.)+}}[A-Z]([a-z])+$</code>


These two examples are also susceptible to the input {{tt|aaaaaaaaaaaaaaaaaaaaaaaa!}}.
These two examples are also susceptible to the input <code>aaaaaaaaaaaaaaaaaaaaaaaa!</code>.


===Attacks===
===Attacks===
If a regex itself is affected by a user input, the attacker can inject a malicious regex and make the system vulnerable. Therefore, in most cases, regular expression denial of service can be avoided by removing the possibility for the user to execute arbitrary patterns on the server. In this case, web applications and databases are the main vulnerable applications. Alternatively, a malicious page could hang the user's web browser or cause it to use arbitrary amounts of memory.
If the regex itself is affected by user input, such as a web service permitting clients to provide a search pattern, then an attacker can inject a malicious regex to consume the server's resources. Therefore, in most cases, regular expression denial of service can be avoided by removing the possibility for the user to execute arbitrary patterns on the server. In this case, web applications and databases are the main vulnerable applications. Alternatively, a malicious page could hang the user's web browser or cause it to use arbitrary amounts of memory.

However, if a vulnerable regex exists on the server-side already, then an attacker may instead be able to provide an input that triggers its worst-case behavior. In this case, [[e-mail scanner]]s and [[intrusion detection system]]s could also be vulnerable.

In the case of a web application, the programmer may use the same regular expression to validate input on both the client and the server side of the system. An attacker could inspect the client code, looking for evil regular expressions, and send crafted input directly to the web server in order to hang it.<ref>{{Cite journal |last1=Barlas |first1=Efe |last2=Du |first2=Xin |last3=Davis |first3=James |date=2022 |title=Exploiting Input Sanitization for Regex Denial of Service |url=https://davisjam.github.io//files/publications/BarlasDuDavis-WebREDOS-ICSE2022.pdf |journal=ACM/IEEE International Conference on Software Engineering |pages=1–14|arxiv=2303.01996 }}</ref>

== Mitigation ==
ReDoS can be mitigated without changes to the regular expression engine, simply by setting a time limit for the execution of regular expressions when untrusted input is involved.<ref>{{cite web |title=Backtracking in .NET regular expressions - .NET |url=https://learn.microsoft.com/en-us/dotnet/standard/base-types/backtracking-in-regular-expressions |website=learn.microsoft.com |language=en-us |date=11 August 2023|quote=When using System.Text.RegularExpressions to process untrusted input, pass a timeout. A malicious user can provide input to RegularExpressions, causing a Denial-of-Service attack. ASP.NET Core framework APIs that use RegularExpressions pass a timeout.}}</ref>


ReDoS can be avoided entirely by using a non-vulnerable regular expression implementation. After [[CloudFlare]]'s [[web application firewall]] (WAF) was brought down by a PCRE ReDoS in 2019, the company rewrote its WAF to use the non-backtracking Rust regex library, using an algorithm similar to [[RE2 (software)|RE2]].<ref>{{cite web |title=Making the WAF 40% faster |url=https://blog.cloudflare.com/making-the-waf-40-faster |website=The Cloudflare Blog |language=en |date=1 July 2020}}</ref><ref>{{cite web |date=2007|last=Cox|first=Russ|title=Regular Expression Matching Can Be Simple And Fast |url=http://swtch.com/~rsc/regexp/regexp1.html|access-date=2011-04-20}} &ndash; describes the RE2 algorithm</ref>
However, some of the examples in the above paragraphs are considerably less "artificial" than the others; thus, they demonstrate how a vulnerable regexes may be used as a result of programming mistakes. In this case [[e-mail scanner]]s and [[intrusion detection system]]s could also be vulnerable. Fortunately, in most cases the problematic regular expressions can be rewritten as "non-evil" patterns. For example, <code>(.*a){x}</code> can be rewritten to <code>([^a]*a){x,}</code>.


Vulnerable regular expressions can be detected programmatically by a [[Lint (software)|linter]].<ref>See e.g. {{cite web |last1=Schmidt |first1=Michael |title=RunDevelopment/scslre |website=[[GitHub]] |url=https://github.com/RunDevelopment/scslre |date=30 March 2023}}, {{cite web |title= recheck Introduction |url=https://makenowjust-labs.github.io/recheck/docs/intro/ |last1=TSUYUSATO|first1=Kitsune |language=en}}, and {{cite web |last1=Davis |first1=James |title=vuln-regex-detector/src/detect/README.md |url=https://github.com/davisjam/vuln-regex-detector/blob/master/src/detect/README.md |website=GitHub |language=en}}</ref> Methods range from pure [[static analysis]]<ref>H. Thielecke, A. Rathnayake (2013). "[http://www.cs.bham.ac.uk/~hxt/research/rxxr.shtml Regular expression denial of service (ReDoS) static analysis] {{Webarchive|url=https://web.archive.org/web/20140803030745/http://www.cs.bham.ac.uk/~hxt/research/rxxr.shtml |date=2014-08-03 }}". Retrieved 2013-05-30.</ref><ref>B. van der Merwe, N Weideman (2017). "[https://github.com/NicolaasWeideman/RegexStaticAnalysis Regex Static Analysis]". Retrieved 2017-08-12.</ref> to [[fuzzing]].<ref>{{cite web |title=Fuzzing with Static Analysis {{!}} recheck |url=https://makenowjust-labs.github.io/recheck/docs/internals/fuzzing/#ref-shen-2018 |website=makenowjust-labs.github.io |language=en}}</ref> In most cases, the problematic regular expressions can be rewritten as "non-evil" patterns. For example, <code>(.*a)+</code> can be rewritten to <code>([^a]*a)+</code>. [[Regular expression#Possessive matching|Possessive matching and atomic grouping]], which disable backtracking for parts of the expression,<ref name=es-re>{{cite web|title=Essential classes: Regular Expressions: Quantifiers: Differences Among Greedy, Reluctant, and Possessive Quantifiers|url=https://docs.oracle.com/javase/tutorial/essential/regex/quant.html#difs|website=The Java Tutorials|publisher=[[Oracle Corporation|Oracle]]|access-date=23 December 2016|archive-date=7 October 2020|archive-url=https://web.archive.org/web/20201007183203/https://docs.oracle.com/javase/tutorial/essential/regex/quant.html#difs|url-status=live}}</ref> can also be used to "pacify" vulnerable parts.<ref>{{cite web |title=compose-regexp.js, "Atomic matching" |url=https://github.com/compose-regexp/compose-regexp.js?tab=readme-ov-file#atomic-matching |website=GitHub |date=2 January 2024}}<br>{{cite web |title=tc39/proposal-regexp-atomic-operators |url=https://github.com/tc39/proposal-regexp-atomic-operators |publisher=Ecma TC39 |language=en |date=31 December 2023}}</ref><ref>{{cite web |title=Preventing Regular Expression Denial of Service (ReDoS) |url=https://www.regular-expressions.info/redos.html |website=www.regular-expressions.info}}</ref>
In the case of a web application, the programmer may use the same regular expression to validate input on both the client and the server side of the system. An attacker could inspect the client code, looking for evil regular expressions, and send crafted input directly to the web server in order to hang it.


==See also==
==See also==
Line 122: Line 139:


==References==
==References==
{{reflist}}
{{notelist}}
{{Reflist}}


==External links==
==External links==
* Examples of ReDoS in open source applications:
* Examples of ReDoS in open source applications:
** [http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-3277 ReDoS in DataVault]
** [http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-3277 ReDoS in DataVault] (CVE-2009-3277)
** [http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-3275 ReDoS in EntLib]
** [http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-3275 ReDoS in EntLib] (CVE-2009-3275)
** [http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-3276 ReDoS in NASD CORE.NET Terelik]
** [http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-3276 ReDoS in NASD CORE.NET Terelik] (CVE-2009-3276)
* Some benchmarks for ReDoS
* Some benchmarks for ReDoS
** Achim Hoffman (2010). "[https://github.com/EnDe/ReDoS/ ReDoS - benchmark for regular expression DoS in JavaScript]". Retrieved 2010-04-19.
** Achim Hoffman (2010). "[https://github.com/EnDe/ReDoS/ ReDoS - benchmark for regular expression DoS in JavaScript]". Retrieved 2010-04-19.
** Richard M. Smith (2010). "[http://www.computerbytesman.com/redos Regular expression denial of service (ReDoS) attack test results]". Retrieved 2010-04-19.
** Richard M. Smith (2010). "[http://www.computerbytesman.com/redos Regular expression denial of service (ReDoS) attack test results]". Retrieved 2010-04-19.
* Paper on implementing regular expressions not vulnerable to certain classes of ReDoS
** Russ Cox (2007). "[http://swtch.com/~rsc/regexp/regexp1.html Regular Expression Matching Can Be Simple And Fast]". Retrieved 2011-04-20.
* Tools for detecting ReDoS vulnerabilities
** H. Thielecke, A. Rathnayake (2013). "[http://www.cs.bham.ac.uk/~hxt/research/rxxr.shtml Regular expression denial of service (ReDoS) static analysis]". Retrieved 2013-05-30.
** B. van der Merwe (2016). "[http://www.cs.sun.ac.za/~abvdm/regex.html The regular expression static analysis (ReDos) project page]". Retrieved 2016-08-12.


[[Category:Algorithmic complexity attacks]]
[[Category:Algorithmic complexity attacks]]

Latest revision as of 18:04, 25 October 2024

A regular expression denial of service (ReDoS)[1] is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many[2] regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.[3][4]

Description

[edit]

Regular expression ("regex") matching can be done by building a finite-state automaton. Regex can be easily converted to nondeterministic automata (NFAs), in which for each state and input symbol, there may be several possible next states. After building the automaton, several possibilities exist:

  • the engine may convert it to a deterministic finite-state automaton (DFA) and run the input through the result;
  • the engine may try one by one all the possible paths until a match is found or all the paths are tried and fail ("backtracking").[5][6]
  • the engine may consider all possible paths through the nondeterministic automaton in parallel;
  • the engine may convert the nondeterministic automaton to a DFA lazily (i.e., on the fly, during the match).

Of the above algorithms, the first two are problematic. The first is problematic because a deterministic automaton could have up to states where is the number of states in the nondeterministic automaton; thus, the conversion from NFA to DFA may take exponential time. The second is problematic because a nondeterministic automaton could have an exponential number of paths of length , so that walking through an input of length will also take exponential time.[7] The last two algorithms, however, do not exhibit pathological behavior.

Note that for non-pathological regular expressions, the problematic algorithms are usually fast, and in practice, one can expect them to "compile" a regex in O(m) time and match it in O(n) time; instead, simulation of an NFA and lazy computation of the DFA have O(m2n) worst-case complexity.[a] Regex denial of service occurs when these expectations are applied to a regex provided by the user, and malicious regular expressions provided by the user trigger the worst-case complexity of the regex matcher.

While regex algorithms can be written in an efficient way, most regex engines in existence extend the regex languages with additional constructs that cannot always be solved efficiently. Such extended patterns essentially force the implementation of regex in most programming languages to use backtracking.

Examples

[edit]

Exponential backtracking

[edit]

The most severe type of problem happens with backtracking regular expression matches, where some patterns have a runtime that is exponential in the length of the input string.[8] For strings of characters, the runtime is . This happens when a regular expression has three properties:

  • the regular expression applies repetition (+, *) to a subexpression;
  • the subexpression can match the same input in multiple ways, or the subexpression can match an input string which is a prefix of a longer possible match;
  • and after the repeated subexpression, there is an expression that matches something which the subexpression does not match.

The second condition is best explained with two examples:

  • in (a|a)+$, repetition is applied to the subexpression a|a, which can match a in two ways on each side of the alternation.
  • in (a+)*$, repetition is applied to the subexpression a+, which can match a or aa, etc.

In both of these examples we used $ to match the end of the string, satisfying the third condition, but it is also possible to use another character for this. For example (a|aa)*c has the same problematic structure.

All three of the above regular expressions will exhibit exponential runtime when applied to strings of the form . For example, if you try to match them against aaaaaaaaaaaaaaaaaaaaaaaa! on a backtracking expression engine, it will take a significantly long time to complete, and the runtime will approximately double for each extra a before the !.

It is also possible to have backtracking which is polynomial time , instead of exponential. This can also cause problems for long enough inputs, though less attention has been paid to this problem as malicious input must be much longer to have a significant effect. An example of such a pattern is "a*b?a*x", when the input is an arbitrarily long sequence of "a"s.

Vulnerable regexes in online repositories

[edit]

So-called "evil" or vulnerable regexes have been found in online regular expression repositories. Note that it is enough to find a vulnerable subexpression in order to attack the full regex:

  1. RegExLib, id=1757 (email validation) - see red part
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
  2. OWASP Validation Regex Repository, Java Classname - see red part
    ^(([a-z])+.)+[A-Z]([a-z])+$

These two examples are also susceptible to the input aaaaaaaaaaaaaaaaaaaaaaaa!.

Attacks

[edit]

If the regex itself is affected by user input, such as a web service permitting clients to provide a search pattern, then an attacker can inject a malicious regex to consume the server's resources. Therefore, in most cases, regular expression denial of service can be avoided by removing the possibility for the user to execute arbitrary patterns on the server. In this case, web applications and databases are the main vulnerable applications. Alternatively, a malicious page could hang the user's web browser or cause it to use arbitrary amounts of memory.

However, if a vulnerable regex exists on the server-side already, then an attacker may instead be able to provide an input that triggers its worst-case behavior. In this case, e-mail scanners and intrusion detection systems could also be vulnerable.

In the case of a web application, the programmer may use the same regular expression to validate input on both the client and the server side of the system. An attacker could inspect the client code, looking for evil regular expressions, and send crafted input directly to the web server in order to hang it.[9]

Mitigation

[edit]

ReDoS can be mitigated without changes to the regular expression engine, simply by setting a time limit for the execution of regular expressions when untrusted input is involved.[10]

ReDoS can be avoided entirely by using a non-vulnerable regular expression implementation. After CloudFlare's web application firewall (WAF) was brought down by a PCRE ReDoS in 2019, the company rewrote its WAF to use the non-backtracking Rust regex library, using an algorithm similar to RE2.[11][12]

Vulnerable regular expressions can be detected programmatically by a linter.[13] Methods range from pure static analysis[14][15] to fuzzing.[16] In most cases, the problematic regular expressions can be rewritten as "non-evil" patterns. For example, (.*a)+ can be rewritten to ([^a]*a)+. Possessive matching and atomic grouping, which disable backtracking for parts of the expression,[17] can also be used to "pacify" vulnerable parts.[18][19]

See also

[edit]

References

[edit]
  1. ^ Lazy computation of the DFA can usually reach the speed of deterministic automatons while keeping worst case behavior similar to simulation of an NFA. However, it is considerably more complex to implement and can use more memory.
  1. ^ OWASP (2010-02-10). "Regex Denial of Service". Retrieved 2010-04-16.
  2. ^ Davis, James; Louis, Michael; Coghlan, Christy; Servant, Francisco; Lee, Dongyoon (2019). "Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions" (PDF). The ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering: 443–454.
  3. ^ RiverStar Software (2010-01-18). "Security Bulletin: Caution Using Regular Expressions". Archived from the original on 2011-07-15. Retrieved 2010-04-16.
  4. ^ Ristic, Ivan (2010-03-15). ModSecurity Handbook. London, UK: Feisty Duck Ltd. p. 173. ISBN 978-1-907117-02-2. Archived from the original on 2016-08-08. Retrieved 2010-04-16.
  5. ^ Crosby and Wallach, Usenix Security (2003). "Regular Expression Denial Of Service". Archived from the original on 2005-03-01. Retrieved 2010-01-13.
  6. ^ Bryan Sullivan, MSDN Magazine (2010-05-03). "Regular Expression Denial of Service Attacks and Defenses". Retrieved 2010-05-06. {{cite web}}: External link in |author= (help)CS1 maint: numeric names: authors list (link)
  7. ^ Kirrage, J.; Rathnayake, A.; Thielecke, H. (2013). "Static Analysis for Regular Expression Denial-of-Service Attacks". Network and System Security. Madrid, Spain: Springer. pp. 135–148. arXiv:1301.0849. doi:10.1007/978-3-642-38631-2_11.
  8. ^ Jim Manico and Adar Weidman (2009-12-07). "OWASP Podcast 56 (ReDoS)". Retrieved 2010-04-02.
  9. ^ Barlas, Efe; Du, Xin; Davis, James (2022). "Exploiting Input Sanitization for Regex Denial of Service" (PDF). ACM/IEEE International Conference on Software Engineering: 1–14. arXiv:2303.01996.
  10. ^ "Backtracking in .NET regular expressions - .NET". learn.microsoft.com. 11 August 2023. When using System.Text.RegularExpressions to process untrusted input, pass a timeout. A malicious user can provide input to RegularExpressions, causing a Denial-of-Service attack. ASP.NET Core framework APIs that use RegularExpressions pass a timeout.
  11. ^ "Making the WAF 40% faster". The Cloudflare Blog. 1 July 2020.
  12. ^ Cox, Russ (2007). "Regular Expression Matching Can Be Simple And Fast". Retrieved 2011-04-20. – describes the RE2 algorithm
  13. ^ See e.g. Schmidt, Michael (30 March 2023). "RunDevelopment/scslre". GitHub., TSUYUSATO, Kitsune. "recheck Introduction"., and Davis, James. "vuln-regex-detector/src/detect/README.md". GitHub.
  14. ^ H. Thielecke, A. Rathnayake (2013). "Regular expression denial of service (ReDoS) static analysis Archived 2014-08-03 at the Wayback Machine". Retrieved 2013-05-30.
  15. ^ B. van der Merwe, N Weideman (2017). "Regex Static Analysis". Retrieved 2017-08-12.
  16. ^ "Fuzzing with Static Analysis | recheck". makenowjust-labs.github.io.
  17. ^ "Essential classes: Regular Expressions: Quantifiers: Differences Among Greedy, Reluctant, and Possessive Quantifiers". The Java Tutorials. Oracle. Archived from the original on 7 October 2020. Retrieved 23 December 2016.
  18. ^ "compose-regexp.js, "Atomic matching"". GitHub. 2 January 2024.
    "tc39/proposal-regexp-atomic-operators". Ecma TC39. 31 December 2023.
  19. ^ "Preventing Regular Expression Denial of Service (ReDoS)". www.regular-expressions.info.
[edit]