1001Ferramentas

Rust Regex Validator

Compile a regex using the Rust regex crate syntax subset. Flags unsupported features (lookarounds, backrefs).


Rust's regex crate prioritizes linear-time performance and supports neither lookarounds nor backreferences.

The Rust regex crate: linear-time matching by design

The regex crate, written by Andrew Gallant (burntsushi), is the de facto standard regular-expression library for Rust. Unlike PCRE, PCRE2, Python's re or Java's java.util.regex, it guarantees worst-case matching time that is linear in the size of the input for a given pattern. It never backtracks, which removes the catastrophic-backtracking form of ReDoS. The same property lets search tools such as ripgrep (also by burntsushi) scan gigabytes of source code in seconds.

The trade-off is feature parity. To stay linear, the crate deliberately omits the two features most associated with catastrophic backtracking: backreferences (\1, \2) and lookarounds ((?=...), (?!...), (?<=...), (?<!...)). If you need them, the separate fancy-regex crate wraps regex and falls back to a backtracking engine, at the cost of the linear-time guarantee.

Engine internals: lazy DFA over an NFA Pike VM

The crate compiles a pattern into an NFA and, during matching, opportunistically builds a lazy DFA on top of it: only the states actually visited are materialized, and they live in a bounded cache, so memory stays bounded. A Pike VM simulation of the NFA serves as the fallback. When a pattern contains literal prefixes or substrings, the runtime can also dispatch to memchr, aho-corasick (multi-pattern) and SIMD-accelerated literal scanners. This hybrid approach is a large part of why ripgrep is typically faster than GNU grep and The Silver Searcher on large code trees.
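The core idea behind the linear bound, tracking a set of live NFA states instead of backtracking, can be sketched with the standard library alone. The toy pattern language below (literals, `.` and postfix `*` only) and the names parse_toy, close and full_match are assumptions made for illustration; this is not the crate's implementation.

```rust
/// One pattern element: a literal or `.`, optionally followed by `*`.
#[derive(Clone, Copy)]
struct Item {
    atom: Option<char>, // None means `.` (any character)
    starred: bool,
}

/// Parses a toy pattern made of literals, `.` and postfix `*`.
fn parse_toy(pattern: &str) -> Vec<Item> {
    let mut items: Vec<Item> = Vec::new();
    for c in pattern.chars() {
        match c {
            '*' => {
                if let Some(last) = items.last_mut() {
                    last.starred = true;
                }
            }
            '.' => items.push(Item { atom: None, starred: false }),
            _ => items.push(Item { atom: Some(c), starred: false }),
        }
    }
    items
}

/// Epsilon closure: a starred item may be skipped without consuming input.
fn close(items: &[Item], active: &mut [bool]) {
    for i in 0..items.len() {
        if active[i] && items[i].starred {
            active[i + 1] = true;
        }
    }
}

/// Whole-string match by tracking the set of live NFA states.
/// Each input character costs O(pattern length), with no backtracking.
fn full_match(pattern: &str, input: &str) -> bool {
    let items = parse_toy(pattern);
    let n = items.len();
    let mut active = vec![false; n + 1];
    active[0] = true;
    close(&items, &mut active);
    for c in input.chars() {
        let mut next = vec![false; n + 1];
        for i in 0..n {
            if active[i] && items[i].atom.map_or(true, |a| a == c) {
                // A starred item loops on itself; a plain item advances.
                next[if items[i].starred { i } else { i + 1 }] = true;
            }
        }
        close(&items, &mut next);
        active = next;
    }
    active[n]
}

fn main() {
    println!("{}", full_match("a*b", "aaab"));
    println!("{}", full_match("a*a*a*c", &format!("{}b", "a".repeat(40))));
}
```

Every input character is examined once against every live state, so the total work is bounded by input length times pattern length regardless of how the stars nest. A lazy DFA goes one step further by caching each distinct state set so that repeated sets cost a single table lookup.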

In practice, Regex::new("foo") can take tens of microseconds because it builds the automaton. For repeated matching, cache compiled regexes with once_cell::sync::Lazy, std::sync::LazyLock (stable since Rust 1.80) or the older lazy_static! macro, and never compile inside a hot loop.

Supported syntax: Unicode-first, RE2-compatible

The grammar is intentionally close to Go's regexp and Google's RE2, with extensions:

  • Character classes: [a-z], negation [^0-9], intersection and subtraction inside [...].
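  • Unicode classes (scripts and general categories): \p{Greek}, \p{Letter}, \p{Decimal_Number}. Unicode 15 by default, though the exact version depends on the crate release.
  • Named captures: (?P<year>\d{4}), retrieved by .name("year"). Recent releases also accept (?<year>\d{4}).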
  • Flags: (?i) case-insensitive, (?m) multiline, (?s) dot-matches-newline, (?x) extended (whitespace ignored).
  • Quantifiers: greedy */+/?, lazy *?/+?/??, bounded {n,m}.
  • Anchors: ^, $, \b word boundary, \A / \z absolute start/end.

Performance vs PCRE2, Python and Hyperscan

Benchmarks such as rebar (also from burntsushi) are reported to rank engines roughly as follows for general workloads: Hyperscan (Intel) ≈ Rust regex > Go regexp > PCRE2-JIT > PCRE2 > Python re > Ruby Onigmo. Treat this ordering as a rough guide, since results vary widely by pattern and input. Hyperscan is faster on very specific multi-pattern scenarios (network IDS); Rust wins on general-purpose code search. PCRE2 wins on raw feature count.

A classic anti-pattern that destroys backtracking engines is nested quantification like (a+)+$ or (.*)*. On 30 characters of "aaaaaa..." followed by "b", PCRE explodes into millions of steps; the Rust crate completes in microseconds because the DFA visits each character at most once.

Idiomatic usage and ecosystem

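use once_cell::sync::Lazy;
use regex::Regex;

// Compiled once on first access, then reused by every call to `parse`.
static RE: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap()
});

// Returns the (year, month, day) substrings of the first ISO-style date
// found in `s`, or None when no date is present.
fn parse(s: &str) -> Option<(&str, &str, &str)> {
    let c = RE.captures(s)?;
    Some((c.name("y")?.as_str(),
          c.name("m")?.as_str(),
          c.name("d")?.as_str()))
}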

Use cases where the linear guarantee shines: log parsing at millions of lines/sec, syntax highlighting in editors (Helix, Zed), command-line search tools (ripgrep, fd), HTTP routers (actix-web uses regex for path params), and any service that accepts user-provided patterns. Exposing a backtracking engine such as PCRE2 to untrusted input is a known denial-of-service vector.

Testing and tooling

  • rustexp (rustexp.lpil.uk): an unofficial online playground for the Rust crate.
  • regex101 with the Rust flavor: interactive explanation, but be aware some lookaround syntax may be shown as supported only because the site falls back to PCRE.
  • cargo test: write unit tests asserting both positive and negative matches; the crate is deterministic and reproducible across platforms.
  • regex-syntax crate: parse-only API to validate a pattern without building a matcher, for example in tests or build scripts.

FAQ

Does Rust's regex support backreferences?

No. The core crate refuses patterns with \1, \2, etc., because they require backtracking. If you need them, use the fancy-regex crate, which wraps regex and falls back to a backtracking engine when the pattern demands it.

Is the crate immune to ReDoS?

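Largely, yes. Matching time is linear in the input length for a given pattern, and memory is bounded by the lazy DFA cache, whose size is configurable. Patterns that paralyze PCRE, like (a+)+$, finish in microseconds. For untrusted patterns, also cap the compiled size, because a very large pattern still costs proportionally more to compile and run.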

How does it compare with PCRE2?

PCRE2 offers a much larger feature set (lookarounds, backreferences, subroutines, recursion), and PCRE2-JIT is fast. The Rust crate gives up those features in exchange for linear-time safety and SIMD-accelerated literal scanning. Choose PCRE2 for expressiveness, Rust for safety and code search.

Can I use the same pattern in Go and Rust?

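Mostly yes: both follow RE2-style syntax. Both accept \p{...} Unicode classes, but the Rust crate makes \d, \w, \s and \b Unicode-aware by default (Go keeps them ASCII-only) and adds class set operations such as intersection, so check those constructs when porting. The common subset transfers cleanly.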

Why is my Rust regex slow on the first call?

Regex::new() builds the automaton. Move compilation outside hot loops with once_cell::sync::Lazy. On steady state, the lazy DFA is among the fastest engines available.
