Internet-Draft I-Regexp November 2021
Bormann Expires 17 May 2022 [Page]
Workgroup:
Network Working Group
Internet-Draft:
draft-bormann-jsonpath-iregexp-01
Published:
Intended Status:
Standards Track
Expires:
Author:
C. Bormann
Universität Bremen TZI

I-Regexp: An Interoperable Regexp Format

Abstract

"Regular expressions" (regexps) are a set of related, widely implemented pattern languages used in data modeling formats and query languages that is available in many dialects. This specification defines an interoperable flavor of regexps, I-Regexp.

The present version -01 of this document is a slight update of the original trial balloon, meant to determine whether this approach is useful for the JSONPath WG.

Discussion Venues

This note is to be removed before publishing as an RFC.

Discussion of this document takes place on the JSONpath Working Group mailing list (JSONpath@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/JSONpath/.

Source for this draft and an issue tracker can be found at https://github.com/cabo/iregexp.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 17 May 2022.

Table of Contents

1. Introduction

Data modeling formats (YANG, CDDL) as well as query languages (jsonpath) often need a regular expression (regexp) sublanguage. There are many dialects of regular expressions in use in platforms, programming languages, and data modeling formats.

While regular expressions originally were intended to provide a Boolean matching function, they have turned into parsing functions for many applications, with capture groups, greedy/lazy/possessive variants, etc. Language features such as backreferences allow specifying languages that actually are context-free (Chomsky type 2) instead of the regular languages (Chomsky type 3) that regular expressions are named for.

YANG (Section 9.4.5 of [RFC7950]) and CDDL (Section 3.8.3 of [RFC8610]) have adopted the regexp language from W3C Schema [XSD2]. XSD regexp is a pure matching language, i.e., XSD regexps can be used to match a string against them and yield a simple true or false result. XSD regexps are not as widely implemented as programming language regexp dialects such as those of Perl, Python, Ruby, Go [RE2], or JavaScript (ECMAScript) [ECMA-262]. The latter are often in a state of continuous development; in the best case (ECMAScript) there is a complete specification which however is highly complex (Section 21.2 of [ECMA-262] comprises 62 pages) and evolves on a yearly timeline, with significant additions. Regexp dialects such as PCRE [PCRE2] have evolved to cover a common set of functions available in parsing regexp dialects, offered in a widely available library.

With continuing accretion of complex features, parsing regexp libraries have become susceptible to bugs and performance degradation, in particular those that can be exploited in DoS attacks. The library RE2 that is compatible with Go language regexps strives to be immune to DoS attacks, making it attractive to applications such as query languages where an attacker could control the input. The problem remains that other bugs in such libraries can lead to exploitable vulnerabilities; at the time of writing, the Common Vulnerabilities and Exposures (CVE) system has 131 entries that mention the word "regex" [REGEX-CVE] (not all, but many of which are such bugs, with 23 matches for arbitrary code execution).

Implementations of YANG and CDDL often struggle with providing true XSD regexps; some instead cheat by providing one of the parsing regexp varieties, sometime without even advertising this fact.

A matching regexp that does not use the more complex XSD features (Section 3) can usually be converted into a parsing regexp of many dialects by simply surrounding it with anchors of that dialect (e.g., ^ or \A and $ or \z). If the original matching regexps exceed the envelope of compatibility between dialects, this can lead to interoperability problems, or, worse, security vulnerabilities. Also, features of the target dialect such as capture groups may be triggered inadvertently, reducing performance.

The present specification defines an interoperable regexp flavor for matching, I-Regexp. This flavor is a subset of XSD regexps. It also comes with defined rules for converting the regexp into common parsing regexp dialects.

2. Requirements

I-Regexps should handle the vast majority of practical cases where a matching regexp is needed in a data model specification or a query language expression.

A brief survey of published RFCs yielded the regexp patterns in Appendix A (with no attempt at completeness). These should be covered by I-Regexps, both syntactically and with their intended semantics.

3. Subsetting XSD Regexps

XSD Regexps are relatively easy to implement or map to widely implemented parsing regexp dialects, with a small number of notable exceptions:

4. Formal definition of I-Regexp

The syntax of I-Regexp is defined by the ABNF specification in Figure 1, with the following additional restriction:

This syntax is a subset of that of [XSD2]; the semantics of all the constructs allowed by this ABNF grammar are the same as those in [XSD2].

i-regexp = branch *( "|" branch )
branch = *piece
piece = atom [ quantifier ]
quantifier = ( %x2A-2B ; '*'-'+'
 / "?" ) / ( "{" quantity "}" )
quantity = QuantExact [ "," [ QuantExact ] ]
QuantExact = 1*%x30-39 ; '0'-'9'

atom = NormalChar / charClass / ( "(" i-regexp ")" )
NormalChar = ( %x00-27 / %x2C-2D ; ','-'-'
 / %x2F-3E ; '/'-'>'
 / %x40-5A ; '@'-'Z'
 / %x5E-7A ; '^'-'z'
 / %x7E-10FFFF )
charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
SingleCharEsc = "\" ( %x28-2B ; '('-'+'
 / %x2D-2E ; '-'-'.'
 / "?" / %x5B-5E ; '['-'^'
 / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
 )
charClassEsc = MultiCharEsc / catEsc / complEsc
MultiCharEsc = "\" ( %s"D" / %s"S" / %s"W" / %s"d" / %s"s" / %s"w" )
charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'
 / %x5E-10FFFF ) / SingleCharEsc
catEsc = %s"\p{" charProp "}"
complEsc = %s"\P{" charProp "}"
charProp = IsCategory / IsBlock
IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
    Symbols / Others
Letters = %s"L" [ ( %x6C-6D ; 'l'-'m'
 / %s"o" / %x74-75 ; 't'-'u'
 ) ]
Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
 / %s"i" / %s"o" / %s"s" ) ]
Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
Others = %s"C" [ ( %s"c" / %s"f" / %x6E-6F ; 'n'-'o'
 ) ]
IsBlock = %s"Is" 1*( "-" / %x30-39 ; '0'-'9'
 / %x41-5A ; 'A'-'Z'
 / %x61-7A ; 'a'-'z'
 )
Figure 1

About a third of the complexity of this ABNF grammar comes from going into details on the Unicode IsCategory classes. Additional complexity stems from the way hyphens can be used inside character classes to denote ranges; the grammar deliberately excludes questionable usage such as /[a-z-A-Z]/.

5. Mapping I-Regexp to Regexp Dialects

(TBD; these mappings need to be thoroughly verified.)

5.1. XSD Regexps

Any I-Regexp also is an XSD Regexp [XSD2], so the mapping is an identify function.

5.2. ECMAScript Regexps

Perform the following steps on an I-Regexp to obtain an ECMAScript regexp [ECMA-262]:

  • Replace any MultiCharEsc and dots (.) outside character classes as in Table 1.
  • For any MultiCharEsc that show a charClassEsp (and not a charClassExpr) in the second column of Table 1, replace them inside the charClassExpr of the regexp as per Table 1.
  • For MultiCharEsc that do show a charClassExpr in the second column of Table 1 ("CCE2"), replace them inside charClassExpr of the regexp ("CCE1") as follows:

    • Examine for both charChlassExpr whether it is negative (has a "^" at the start).
    • Strip the brackets and any leading "^" from CCE2, yielding CC2.
    • If CCE2 is not negative, replace the MultiCharEsc by CC2.
    • If CCE2 is negative but CCE1 is not, remove the MultiCharEsc from CCE1 and replace the entire CCE1 by the construct (CCE1 | CCE2).
    • If both CCE1 and CCE2 are negative, fail (see Section 4).
  • Envelope the result in ^ and $.

Note that where a regexp literal is required, this needs to enclose the actual regexp in /.

Table 1: ECMAScript equivalents of MultiCharEsc
I-Regexp ECMAScript equivalent charClassExp
. [^\n\r] f
\d \p{Nd} t
\s [ \t\n\r] f
\w [^\p{P}\p{Z}\p{C}] f
\D \P{Nd} t
\S [^ \t\n\r] f
\W [\p{P}\p{Z}\p{C}] f

The performance can be increased by turning parenthesized regexps (production atom) into (?:...) constructions.

5.3. PCRE, RE2, Ruby Regexps

Perform the same steps as in Section 5.2 to obtain a valid regexp in PCRE [PCRE2], the Go programming language [RE2], and the Ruby programming language, except that the last step is:

  • Envelope the result in \A and \z.

Again, the performance can be increased by turning parenthesized regexps (production atom) into (?:...) constructions.

5.4. << Your kind of Regexp here >>

(Please submit the mapping needed for your favorite kind of regexp.)

6. IANA Considerations

This document makes no requests of IANA.

7. Security considerations

TBD

(Discuss security issues of regexp implementations, both DoS and RCE; this is covered in part in Section 1.)

8. References

8.1. Normative References

[XSD2]
Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes Second Edition", World Wide Web Consortium Recommendation REC-xmlschema-2-20041028, , <https://www.w3.org/TR/2004/REC-xmlschema-2-20041028>.

8.2. Informative References

[ECMA-262]
Ecma International, "ECMAScript 2020 Language Specification", ECMA Standard ECMA-262, 11th Edition, , <https://www.ecma-international.org/wp-content/uploads/ECMA-262.pdf>.
[PCRE2]
"Perl-compatible Regular Expressions (revised API: PCRE2)", n.d., <http://pcre.org/current/doc/html/>.
[RE2]
"RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.", n.d., <https://github.com/google/re2>.
[REGEX-CVE]
"CVE - Search Results", n.d., <https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=regex>.
[RFC7493]
Bray, T., Ed., "The I-JSON Message Format", RFC 7493, DOI 10.17487/RFC7493, , <https://www.rfc-editor.org/info/rfc7493>.
[RFC7950]
Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", RFC 7950, DOI 10.17487/RFC7950, , <https://www.rfc-editor.org/info/rfc7950>.
[RFC8610]
Birkholz, H., Vigano, C., and C. Bormann, "Concise Data Definition Language (CDDL): A Notational Convention to Express Concise Binary Object Representation (CBOR) and JSON Data Structures", RFC 8610, DOI 10.17487/RFC8610, , <https://www.rfc-editor.org/info/rfc8610>.

Appendix A. Regexps and Similar Constructs in Published RFCs

This appendix contains a number of regular expressions that have been extracted from published RFCs based on some ad-hoc matching. Multi-line constructions were not included. All regular expressions validate against the ABNF in Figure 1.

rfc6021.txt  459 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
rfc6021.txt  513 \d*(\.\d*){1,127}
rfc6021.txt  529 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
rfc6021.txt  631 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
rfc6021.txt  647 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
rfc6021.txt  933 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
rfc6021.txt  938 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
rfc6021.txt 1026 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
rfc6021.txt 1031 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
rfc6020.txt 6647 [0-9a-fA-F]*
rfc6095.txt 2544 \S(.*\S)?
rfc6110.txt 1583 [aeiouy]*
rfc6110.txt 3222 [A-Z][a-z]*
rfc6536.txt 1583 \*
rfc6536.txt 1632 [^\*].*
rfc6643.txt  524 \p{IsBasicLatin}{0,255}
rfc6728.txt 3480 \S+
rfc6728.txt 3500 \S(.*\S)?
rfc6991.txt  477 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
rfc6991.txt  525 \d*(\.\d*){1,127}
rfc6991.txt  541 [a-zA-Z_][a-zA-Z0-9\-_.]*
rfc6991.txt  542 .|..|[^xX].*|.[^mM].*|..[^lL].*
rfc6991.txt  571 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
rfc6991.txt  665 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
rfc6991.txt  693 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
rfc6991.txt  725 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
rfc6991.txt  743 [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-
rfc6991.txt 1041 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
rfc6991.txt 1046 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
rfc6991.txt 1099 [0-9\.]*
rfc6991.txt 1109 [0-9a-fA-F:\.]*
rfc6991.txt 1164 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
rfc6991.txt 1169 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
rfc7407.txt  933 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){0,254}
rfc7407.txt 1494 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){4,31}
rfc7758.txt  703 \d{2}:\d{2}:\d{2}(\.\d+)?
rfc7758.txt 1358 \d{2}:\d{2}:\d{2}(\.\d+)?
rfc7895.txt  349 \d{4}-\d{2}-\d{2}
rfc7950.txt 8323 [0-9a-fA-F]*
rfc7950.txt 8355 [a-zA-Z_][a-zA-Z0-9\-_.]*
rfc7950.txt 8356 [xX][mM][lL].*
rfc8040.txt 4713 \d{4}-\d{2}-\d{2}
rfc8049.txt 6704 [A-Z]{2}
rfc8194.txt  629 \*
rfc8194.txt  637 [0-9]{8}\.[0-9]{6}
rfc8194.txt  905 Z|[\+\-]\d{2}:\d{2}
rfc8194.txt  963 (2((2[4-9])|(3[0-9]))\.).*
rfc8194.txt  974 (([fF]{2}[0-9a-fA-F]{2}):).*
rfc8299.txt 7986 [A-Z]{2}
rfc8341.txt 1878 \*
rfc8341.txt 1927 [^\*].*
rfc8407.txt 1723 [0-9\.]*
rfc8407.txt 1749 [a-zA-Z_][a-zA-Z0-9\-_.]*
rfc8407.txt 1750 .|..|[^xX].*|.[^mM].*|..[^lL].*
rfc8525.txt  550 \d{4}-\d{2}-\d{2}
rfc8776.txt  838 /?([a-zA-Z0-9\-_.]+)(/[a-zA-Z0-9\-_.]+)*
rfc8776.txt  874 ([a-zA-Z0-9\-_.]+:)*
rfc8819.txt  311 [\S ]+
rfc8944.txt  596 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7}
Figure 2: Example regular expressions extracted from RFCs

Acknowledgements

This draft has been motivated by the discussion in the IETF JSONPATH WG about whether to include a regexp mechanism into the JSONPath query expression specification, as well as by previous discussions about the YANG pattern and CDDL .regexp features.

The basic approach for this draft was inspired by The I-JSON Message Format [RFC7493].

Author's Address

Carsten Bormann
Universität Bremen TZI
Postfach 330440
D-28359 Bremen
Germany