Network Working Group                                         C. Bormann
Internet-Draft                                    Universität Bremen TZI
Intended status: Standards Track                                 T. Bray
Expires: 8 September 2022                                     Textuality
                                                            7 March 2022


                I-Regexp: An Interoperable Regexp Format
                   draft-bormann-jsonpath-iregexp-03

Abstract

   This document specifies I-Regexp, a flavor of regular expressions
   that is limited in scope with the goal of interoperation across many
   different regular-expression libraries.

About This Document

   This note is to be removed before publishing as an RFC.

   Status information for this document may be found at
   https://datatracker.ietf.org/doc/draft-bormann-jsonpath-iregexp/.

   Discussion of this document takes place on the JSONpath Working Group
   mailing list (mailto:JSONpath@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/JSONpath/.

   Source for this draft and an issue tracker can be found at
   https://github.com/cabo/iregexp.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 8 September 2022.


Bormann & Bray          Expires 8 September 2022                [Page 1]

Internet-Draft                  I-Regexp                      March 2022


Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Requirements
   3.  I-Regexp Syntax
   4.  I-Regexp Semantics
   5.  Mapping I-Regexp to Regexp Dialects
     5.1.  XSD Regexps
     5.2.  ECMAScript Regexps
     5.3.  PCRE, RE2, Ruby Regexps
     5.4.  << Your kind of Regexp here >>
   6.  Motivation and Background
     6.1.  Subsetting XSD Regexps
   7.  IANA Considerations
   8.  Security considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Regexps and Similar Constructs in Recent Published RFCs
   Acknowledgements
   Authors' Addresses

1.  Introduction

   The present specification defines an interoperable regular expression
   flavor, I-Regexp.

   This document uses the abbreviation "regexp" for what are usually
   called regular expressions in programming.  "I-Regexp" is used as a
   noun meaning a character string which conforms to the requirements in
   this specification; the plural is "I-Regexps".


   I-Regexp does not provide advanced regexp features such as capture
   groups, lookahead, or backreferences.  It supports only a Boolean
   matching capability, i.e., testing whether a given regexp matches a
   given piece of text.

   I-Regexp is a subset of XSD regexps [XSD-2].

   This document includes rules for converting I-Regexps for use with
   several well-known regexp libraries.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Requirements

   I-Regexps should handle the vast majority of practical cases where a
   matching regexp is needed in a data model specification or a query
   language expression.

   A brief survey of published RFCs yielded the regexp patterns in
   Appendix A (with no attempt at completeness).  With certain
   exceptions as discussed there, these should be covered by I-Regexps,
   both syntactically and with their intended semantics.

3.  I-Regexp Syntax

   An I-Regexp MUST conform to the ABNF specification in Figure 1.


   i-regexp = branch *( "|" branch )
   branch = *piece
   piece = atom [ quantifier ]
   quantifier = ( %x2A-2B ; '*'-'+'
    / "?" ) / ( "{" quantity "}" )
   quantity = QuantExact [ "," [ QuantExact ] ]
   QuantExact = 1*%x30-39 ; '0'-'9'

   atom = NormalChar / charClass / ( "(" i-regexp ")" )
   NormalChar = ( %x00-27 / %x2C-2D ; ','-'-'
    / %x2F-3E ; '/'-'>'
    / %x40-5A ; '@'-'Z'
    / %x5E-7A ; '^'-'z'
    / %x7E-10FFFF )
   charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
   SingleCharEsc = "\" ( %x28-2B ; '('-'+'
    / %x2D-2E ; '-'-'.'
    / "?" / %x5B-5E ; '['-'^'
    / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
    )
   charClassEsc = catEsc / complEsc
   charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
   CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
   CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'
    / %x5E-10FFFF ) / SingleCharEsc
   catEsc = %s"\p{" charProp "}"
   complEsc = %s"\P{" charProp "}"
   charProp = IsCategory / IsBlock
   IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
       Symbols / Others
   Letters = %s"L" [ ( %x6C-6D ; 'l'-'m'
    / %s"o" / %x74-75 ; 't'-'u'
    ) ]
   Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
   Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
   Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
    / %s"i" / %s"o" / %s"s" ) ]
   Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
   Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
   Others = %s"C" [ ( %s"c" / %s"f" / %x6E-6F ; 'n'-'o'
    ) ]
   IsBlock = %s"Is" 1*( "-" / %x30-39 ; '0'-'9'
    / %x41-5A ; 'A'-'Z'
    / %x61-7A ; 'a'-'z'
    )

                     Figure 1: I-Regexp Syntax in ABNF


   As an additional restriction, charClassExpr is not allowed to match
   [^], which according to this grammar would parse as a positive
   character class containing the single character ^.

   This grammar is essentially that of XSD regexps, without character
   class subtraction and multi-character escapes.

   An I-Regexp implementation MUST be a complete implementation of this
   limited subset.  In particular, full Unicode support is REQUIRED; the
   implementation MUST NOT limit itself to 7- or 8-bit character sets
   such as ASCII and MUST support the Unicode character property set in
   character classes.

   *  *Issues*: The ABNF has been automatically generated and could
      benefit from further polishing.  The ABNF has been verified
      against Appendix A, but a wider corpus of regular expressions will
      need to be examined.  Note that about a third of the complexity of
      this ABNF grammar comes from going into details on the Unicode
      IsCategory classes.  Additional complexity stems from the way
      hyphens can be used inside character classes to denote ranges; the
      grammar deliberately excludes questionable usage such as
      /[a-z-A-Z]/.
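
   As an illustration only, the grammar in Figure 1 is small enough to
   be checked by a short hand-written recursive-descent recognizer.
   The following Python sketch is not part of this specification: the
   name is_iregexp and all helper names are hypothetical, and the
   sketch checks syntax only (it performs no matching).

```python
import re

# Characters that are syntax and therefore cannot be a NormalChar.
_META = set("()*+.?[\\]{|}")
# Characters that may follow the backslash in a SingleCharEsc.
_ESCAPABLE = set("()*+-.?[\\]^nrt{|}")
# catEsc / complEsc: \p{...} or \P{...} with an IsCategory or IsBlock name.
_CLASS_ESC = re.compile(
    r"\\[pP]\{(?:L[lmotu]?|M[cen]?|N[dlo]?|P[cdefios]?|Z[lps]?"
    r"|S[ckmo]?|C[cfno]?|Is[-0-9A-Za-z]+)\}"
)
# quantifier: * + ? or {n}, {n,}, {n,m}
_QUANT = re.compile(r"[*+?]|\{[0-9]+(?:,[0-9]*)?\}")


def is_iregexp(text: str) -> bool:
    """Return True if text matches the i-regexp production of Figure 1."""
    pos, n = 0, len(text)

    def peek(chars: str) -> bool:
        return pos < n and text[pos] in chars

    def take(pattern) -> bool:
        nonlocal pos
        m = pattern.match(text, pos)
        if m:
            pos = m.end()
        return m is not None

    def single_char_esc() -> bool:              # SingleCharEsc
        nonlocal pos
        if pos + 1 < n and text[pos] == "\\" and text[pos + 1] in _ESCAPABLE:
            pos += 2
            return True
        return False

    def cc_char() -> bool:                      # CCchar
        nonlocal pos
        if pos < n and text[pos] not in "-[\\]":
            pos += 1
            return True
        return single_char_esc()

    def cce1() -> bool:                         # CCE1
        nonlocal pos
        if take(_CLASS_ESC):
            return True
        if not cc_char():
            return False
        if peek("-"):
            saved = pos
            pos += 1
            if not cc_char():
                pos = saved                     # hyphen is not a range here
        return True

    def char_class_expr() -> bool:              # charClassExpr, at "["
        nonlocal pos
        pos += 1
        if peek("^"):
            pos += 1                            # negation; this rejects [^]
        if peek("-"):
            pos += 1
        elif not cce1():
            return False
        while cce1():
            pass
        if peek("-"):
            pos += 1
        if peek("]"):
            pos += 1
            return True
        return False

    def atom() -> bool:                         # atom (called with pos < n)
        nonlocal pos
        c = text[pos]
        if c not in _META or c == ".":
            pos += 1
            return True
        if c == "\\":
            return single_char_esc() or take(_CLASS_ESC)
        if c == "[":
            return char_class_expr()
        if c == "(":
            pos += 1
            if regexp() and peek(")"):
                pos += 1
                return True
        return False

    def regexp() -> bool:                       # i-regexp
        nonlocal pos
        while True:
            while pos < n and text[pos] not in "|)":   # branch = *piece
                if not atom():
                    return False
                take(_QUANT)                    # optional quantifier
            if not peek("|"):
                return True
            pos += 1

    return regexp() and pos == n
```

   With this sketch, is_iregexp accepts
   [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5} and \p{IsBasicLatin}{0,255} from
   Appendix A, and rejects \d+ (a multi-character escape), [a-z-A-Z],
   and [^].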

4.  I-Regexp Semantics

   This syntax is a subset of that of [XSD-2].  Implementations which
   interpret I-Regexps MUST yield Boolean results as specified in
   [XSD-2].  (See also Section 5.1.)

5.  Mapping I-Regexp to Regexp Dialects

   (TBD; these mappings need to be further verified in implementation
   work.)

5.1.  XSD Regexps

   Any I-Regexp is also an XSD regexp [XSD-2], so the mapping is the
   identity function.

   Note that a few errata for [XSD-2] have been fixed in [XSD11-2],
   which is therefore also included as a normative reference.  XSD 1.1
   is less widely implemented than XSD 1.0, and implementations of XSD
   1.0 are likely to include these bugfixes, so for the intents and
   purposes of this specification an implementation of XSD 1.0 regexps
   is equivalent to an implementation of XSD 1.1 regexps.


5.2.  ECMAScript Regexps

   Perform the following steps on an I-Regexp to obtain an ECMAScript
   regexp [ECMA-262]:

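   *  For any dot (.) outside a character class (first alternative of
      the charClass production): replace the dot with [^\n\r].

   *  Enclose the result in ^(?: and )$, so that the anchors apply to
      the whole regexp rather than only to the first and last branch of
      a top-level alternation.

   Note that where a regexp literal is required, the actual regexp
   needs to be enclosed in /.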

   The performance of an ECMAScript matcher can be increased by turning
   parenthesized regexps (last choice in production atom) into (?:...)
   constructions.
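
   The steps above can be sketched as a small string transformation.
   The following Python function is an illustration under stated
   assumptions, not a verified mapping: it assumes that its input is
   already a valid I-Regexp, the name iregexp_to_ecmascript and the
   non_capturing parameter are hypothetical, and it places the anchors
   around a non-capturing group so that they apply to every branch of a
   top-level alternation.

```python
def iregexp_to_ecmascript(iregexp: str, non_capturing: bool = True) -> str:
    """Map a valid I-Regexp to the source text of an ECMAScript regexp.

    Assumption: the input conforms to the I-Regexp grammar, so "[" and
    "]" only appear unescaped as character class delimiters.
    """
    out = []
    i, n = 0, len(iregexp)
    in_class = False
    while i < n:
        c = iregexp[i]
        if c == "\\":
            # Copy an escape (such as \. or the \p of \p{Lu}) unchanged.
            out.append(iregexp[i:i + 2])
            i += 2
            continue
        if in_class:
            # Inside [...], a dot or parenthesis is an ordinary character.
            in_class = c != "]"
        elif c == "[":
            in_class = True
        elif c == ".":
            # XSD dot: any character except newline and carriage return.
            out.append("[^\\n\\r]")
            i += 1
            continue
        elif c == "(" and non_capturing:
            # Optional optimization: avoid creating capture groups.
            out.append("(?:")
            i += 1
            continue
        out.append(c)
        i += 1
    return "^(?:" + "".join(out) + ")$"


print(iregexp_to_ecmascript(r"(a|b).[.]\."))
```

   The returned string is the regexp source; in ECMAScript it would be
   passed to the RegExp constructor or, with any / escaped, written
   between / delimiters.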

5.3.  PCRE, RE2, Ruby Regexps

   Perform the same steps as in Section 5.2 to obtain a valid regexp in
   PCRE [PCRE2], the Go programming language [RE2], and the Ruby
   programming language, except that the last step is:

   *  Enclose the result in \A(?: and )\z.

   Again, the performance can be increased by turning parenthesized
   regexps (production atom) into (?:...) constructions.
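
   As a rough illustration of why the placement of the anchors matters
   in these dialects, the following Python sketch uses Python's own re
   module as a stand-in for a PCRE-style engine.  This is an assumption
   made for the example only: Python is not one of the dialects listed
   above, and its re module spells the end-of-text anchor \Z where
   PCRE, RE2, and Ruby use \z.  The helper name matches_whole is
   hypothetical.

```python
import re


def matches_whole(body: str, text: str) -> bool:
    """Boolean match of an already dot-converted regexp body against text."""
    # Group the body so that both anchors apply to every top-level branch.
    return re.search(r"\A(?:" + body + r")\Z", text) is not None


# Anchors without grouping bind only to the first and the last branch.
naive = re.compile(r"\Aa|b\Z")

print(naive.search("xb") is not None)  # True: the branch b\Z matches at the end
print(matches_whole("a|b", "xb"))      # False: "xb" as a whole is neither a nor b
print(matches_whole("a|b", "b"))       # True
```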

5.4.  << Your kind of Regexp here >>

   (Please submit the mapping needed for your favorite kind of regexp.)

6.  Motivation and Background

   Data modeling formats (YANG, CDDL) as well as query languages
   (JSONPath) often need a regular expression (regexp) sublanguage.
   There are many dialects of regular expressions in use in platforms,
   programming languages, and data modeling formats.

   While regular expressions originally were intended to describe a
   formal language, i.e., to provide a Boolean matching function, they
   have turned into parsing functions for many applications, with
   capture groups, greedy/lazy/possessive variants, etc.  Language
   features such as backreferences allow specifying languages that
   actually are context-free (Chomsky type 2) instead of the regular
   languages (Chomsky type 3) that regular expressions are named for.


   YANG (Section 9.4.5 of [RFC7950]) and CDDL (Section 3.8.3 of
   [RFC8610]) have adopted the regexp language from W3C Schema [XSD-2].
   XSD regexp is a pure matching language, i.e., XSD regexps can be used
   to match a string against them and yield a simple true or false
   result.  XSD regexps are not as widely implemented as programming
   language regexp dialects such as those of Perl, Python, Ruby, Go
   [RE2], or JavaScript (ECMAScript) [ECMA-262].  The latter are often
   in a state of continuous development; in the best case (ECMAScript),
   there is a complete specification, which, however, is highly complex
   (Section 21.2 of [ECMA-262] comprises 62 pages) and evolves on a
   yearly timeline, with significant additions.  Regexp dialects such as
   PCRE [PCRE2] have evolved to cover a common set of functions
   available in parsing regexp dialects, offered in a widely available
   library.

   With continuing accretion of complex features, parsing regexp
   libraries have become susceptible to bugs and performance
   degradation, in particular those that can be exploited in Denial of
   Service (DoS) attacks.  The library RE2 that is compatible with Go
   language regexps strives to be immune to DoS attacks, making it
   attractive to applications such as query languages where an attacker
   could control the input.  The problem remains that other bugs in such
   libraries can lead to exploitable vulnerabilities; at the time of
   writing, the Common Vulnerabilities and Exposures (CVE) system has
   131 entries that mention the word "regex" [REGEX-CVE] (many of
   these, though not all, are such bugs; 23 of them match "arbitrary
   code execution").

   Implementations of YANG and CDDL often struggle with providing true
   XSD regexps; some instead cheat by providing one of the parsing
   regexp varieties, sometimes without even advertising this fact.

   A matching regexp that does not use the more complex XSD features
   (Section 6.1) can usually be converted into a parsing regexp of many
   dialects by simply surrounding it with anchors of that dialect (e.g.,
   ^ or \A and $ or \z).  If the original matching regexps exceed the
   envelope of compatibility between dialects, this can lead to
   interoperability problems, or, worse, security vulnerabilities.
   Also, features of the target dialect such as capture groups may be
   triggered inadvertently, reducing performance.

6.1.  Subsetting XSD Regexps

   XSD regexps are relatively easy to implement or map to widely
   implemented parsing regexp dialects, with a small number of notable
   exceptions:


   *  Character class subtraction.  This is a very useful feature in
      many specifications, but it is unfortunately mostly absent from
      parsing regexp dialects.

      Discussion: This absence can often be addressed by translating
      character class subtraction into positive character classes
      (possibly requiring significant expansion) and/or inserting
      negative lookahead assertions (which are not universally supported
      by regexp libraries, most notably not by RE2 [RE2]).  This
      specification therefore opts for leaving out character class
      subtraction.

   *  Multi-character escapes.  \d, \w, \s and their uppercase
      equivalents (complement classes) exhibit a large amount of
      variation between regexp flavors.  (E.g., predefined character
      classes such as \w may be meant to be ASCII only, or they may
      encompass all letters and digits defined in Unicode.  The latter
      is usually of interest in the application of query languages to
      text in human languages, while the former is of interest to a
      subset of applications in data model specifications.)

   *  Unicode.  While there is no doubt that a regexp flavor meant to
      last needs to be Unicode enabled, a number of aspects of this
      need discussion.  Not all regexp implementations that one might
      want to map I-Regexps into support access to the Unicode tables
      needed to execute constructs such as \p{IsCoptic}; for mapping
      into such implementations, a translation needs to be provided.
      Fortunately, the \p/\P feature in general is now quite widely
      available.

      Discussion: The ASCII focus can partially be addressed by adding a
      constraint outside the regexp that the matched text has to be
      ASCII in the first place.  This often is all that is needed where
      regexps are used to define lexical elements of a computer
      language.  This reduces the size of the Unicode tables required in
      such a constrained implementation considerably.  (In Appendix A,
      RFC 6643 contains a lone instance of \p{IsBasicLatin}{0,255},
      which is needed to describe a transition from a legacy character
      set to Unicode.  RFC 2622 contains [[:digit:]], [[:alpha:]],
      [[:alnum:]], albeit in a specification for the flex tool; this is
      intended to be close to \d, \p{L}, \w in an ASCII subset.)
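
   The translation of character class subtraction into a positive
   character class, mentioned in the first discussion item above,
   amounts to subtracting sets of code point ranges.  The following
   Python sketch shows the idea for classes that are already given as
   lists of inclusive code point ranges; parsing of the XSD class
   syntax is omitted, and the names subtract_ranges and render_class
   are hypothetical.

```python
def subtract_ranges(base, removed):
    """Return the parts of the ranges in base not covered by removed.

    Both arguments are lists of inclusive (low, high) code point pairs.
    """
    result = []
    for lo, hi in base:
        pieces = [(lo, hi)]
        for r_lo, r_hi in removed:
            next_pieces = []
            for p_lo, p_hi in pieces:
                if r_hi < p_lo or r_lo > p_hi:
                    next_pieces.append((p_lo, p_hi))      # no overlap
                    continue
                if p_lo < r_lo:
                    next_pieces.append((p_lo, r_lo - 1))  # part below
                if r_hi < p_hi:
                    next_pieces.append((r_hi + 1, p_hi))  # part above
            pieces = next_pieces
        result.extend(pieces)
    return sorted(result)


def _literal(code_point):
    # Escape the characters that are special inside a character class.
    ch = chr(code_point)
    return "\\" + ch if ch in "-[]\\^" else ch


def render_class(ranges):
    """Render code point ranges as a positive character class."""
    if not ranges:
        raise ValueError("an empty character class cannot be expressed")
    parts = []
    for lo, hi in ranges:
        parts.append(_literal(lo) if lo == hi
                     else _literal(lo) + "-" + _literal(hi))
    return "[" + "".join(parts) + "]"


# XSD [a-z-[aeiou]] (lowercase ASCII letters except vowels):
consonants = subtract_ranges([(ord("a"), ord("z"))],
                             [(ord(c), ord(c)) for c in "aeiou"])
print(render_class(consonants))  # [b-df-hj-np-tv-z]
```

   Even this small example turns one range into five, which shows the
   kind of expansion that such a translation can require.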

7.  IANA Considerations

   This document makes no requests of IANA.


8.  Security considerations

   As discussed in Section 6, more complex regexp libraries are likely
   to contain exploitable bugs leading to crashes and remote code
   execution.  There is also the problem that such libraries often have
   hard-to-predict performance characteristics, leading to attack
   vectors that overload an implementation by matching against an
   expensive attacker-controlled regexp.

   I-Regexps have been designed to allow implementation in a way that is
   resilient to both threats; this objective needs to be addressed
   throughout the implementation effort.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [XSD-2]    Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes
              Second Edition", World Wide Web Consortium Recommendation 
              REC-xmlschema-2-20041028, 28 October 2004,
              <https://www.w3.org/TR/2004/REC-xmlschema-2-20041028>.

   [XSD11-2]  Peterson, D., Gao, S., Malhotra, A., Sperberg-McQueen, M.,
              Thompson, H., and P. Biron, "W3C XML Schema Definition
              Language (XSD) 1.1 Part 2: Datatypes", World Wide Web
              Consortium Recommendation REC-xmlschema11-2-20120405, 5
              April 2012,
              <https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405>.

9.2.  Informative References

   [ECMA-262] Ecma International, "ECMAScript 2020 Language
              Specification", ECMA Standard ECMA-262, 11th Edition, June
              2020, <https://www.ecma-international.org/wp-
              content/uploads/ECMA-262.pdf>.

   [PCRE2]    "Perl-compatible Regular Expressions (revised API:
              PCRE2)", n.d., <http://pcre.org/current/doc/html/>.


   [RE2]      "RE2 is a fast, safe, thread-friendly alternative to
              backtracking regular expression engines like those used in
              PCRE, Perl, and Python. It is a C++ library.", n.d.,
              <https://github.com/google/re2>.

   [REGEX-CVE]
              "CVE - Search Results", n.d.,
              <https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=regex>.

   [RFC7493]  Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
              DOI 10.17487/RFC7493, March 2015,
              <https://www.rfc-editor.org/info/rfc7493>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
              RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/info/rfc7950>.

   [RFC8610]  Birkholz, H., Vigano, C., and C. Bormann, "Concise Data
              Definition Language (CDDL): A Notational Convention to
              Express Concise Binary Object Representation (CBOR) and
              JSON Data Structures", RFC 8610, DOI 10.17487/RFC8610,
              June 2019, <https://www.rfc-editor.org/info/rfc8610>.

Appendix A.  Regexps and Similar Constructs in Recent Published RFCs

   This appendix contains a number of regular expressions that have been
   extracted from some recently published RFCs based on some ad-hoc
   matching.  Multi-line constructions were not included.  With the
   exception of some (often surprisingly dubious) usage of multi-
   character escapes, all regular expressions validate against the ABNF
   in Figure 1.

   rfc6021.txt  459 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
   rfc6021.txt  513 \d*(\.\d*){1,127}
   rfc6021.txt  529 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
   rfc6021.txt  631 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6021.txt  647 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
   rfc6021.txt  933 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6021.txt  938 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6021.txt 1026 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6021.txt 1031 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6020.txt 6647 [0-9a-fA-F]*
   rfc6095.txt 2544 \S(.*\S)?
   rfc6110.txt 1583 [aeiouy]*
   rfc6110.txt 3222 [A-Z][a-z]*
   rfc6536.txt 1583 \*
   rfc6536.txt 1632 [^\*].*
   rfc6643.txt  524 \p{IsBasicLatin}{0,255}
   rfc6728.txt 3480 \S+
   rfc6728.txt 3500 \S(.*\S)?
   rfc6991.txt  477 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
   rfc6991.txt  525 \d*(\.\d*){1,127}
   rfc6991.txt  541 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc6991.txt  542 .|..|[^xX].*|.[^mM].*|..[^lL].*
   rfc6991.txt  571 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
   rfc6991.txt  665 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6991.txt  693 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
   rfc6991.txt  725 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6991.txt  743 [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-
   rfc6991.txt 1041 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6991.txt 1046 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6991.txt 1099 [0-9\.]*
   rfc6991.txt 1109 [0-9a-fA-F:\.]*
   rfc6991.txt 1164 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6991.txt 1169 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc7407.txt  933 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){0,254}
   rfc7407.txt 1494 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){4,31}
   rfc7758.txt  703 \d{2}:\d{2}:\d{2}(\.\d+)?
   rfc7758.txt 1358 \d{2}:\d{2}:\d{2}(\.\d+)?
   rfc7895.txt  349 \d{4}-\d{2}-\d{2}
   rfc7950.txt 8323 [0-9a-fA-F]*
   rfc7950.txt 8355 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc7950.txt 8356 [xX][mM][lL].*
   rfc8040.txt 4713 \d{4}-\d{2}-\d{2}
   rfc8049.txt 6704 [A-Z]{2}
   rfc8194.txt  629 \*
   rfc8194.txt  637 [0-9]{8}\.[0-9]{6}
   rfc8194.txt  905 Z|[\+\-]\d{2}:\d{2}
   rfc8194.txt  963 (2((2[4-9])|(3[0-9]))\.).*
   rfc8194.txt  974 (([fF]{2}[0-9a-fA-F]{2}):).*
   rfc8299.txt 7986 [A-Z]{2}
   rfc8341.txt 1878 \*
   rfc8341.txt 1927 [^\*].*
   rfc8407.txt 1723 [0-9\.]*
   rfc8407.txt 1749 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc8407.txt 1750 .|..|[^xX].*|.[^mM].*|..[^lL].*
   rfc8525.txt  550 \d{4}-\d{2}-\d{2}
   rfc8776.txt  838 /?([a-zA-Z0-9\-_.]+)(/[a-zA-Z0-9\-_.]+)*
   rfc8776.txt  874 ([a-zA-Z0-9\-_.]+:)*
   rfc8819.txt  311 [\S ]+
   rfc8944.txt  596 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7}

         Figure 2: Example regular expressions extracted from RFCs

   The multi-character escapes (MCE) or the character classes built
   around them used here can be substituted as shown in Table 1.


                     +===========+==================+
                     | MCE/class | Substitute class |
                     +===========+==================+
                     | \S        | [^ \t\n\r]       |
                     +-----------+------------------+
                     | [\S ]     | [^\t\n\r]        |
                     +-----------+------------------+
                     | \d        | [0-9]            |
                     +-----------+------------------+

                         Table 1: Substitutes for
                        multi-character escapes in
                                 examples

   Note that the semantics of \d in XSD regular expressions is that of
   \p{Nd}; however, this would include all Unicode characters that are
   digits in various writing systems, which is certainly not what is
   meant in the RFCs listed.
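
   The substitutions of Table 1 can be applied mechanically.  The
   following Python sketch is one possible way to do so and covers only
   the three rows of the table; the name substitute_mce is
   hypothetical, and the sketch deliberately refuses \d or \S inside
   any other character class, since those cases need a rewrite of the
   whole class.

```python
# Rows of Table 1 for escapes that occur outside character classes.
SUBSTITUTES = {"\\S": "[^ \\t\\n\\r]", "\\d": "[0-9]"}


def substitute_mce(regexp: str) -> str:
    """Replace the multi-character escapes of Table 1 in an XSD regexp."""
    out = []
    i, n = 0, len(regexp)
    in_class = False
    while i < n:
        if not in_class and regexp.startswith("[\\S ]", i):
            # Second row of Table 1: the class [\S ] as a whole.
            out.append("[^\\t\\n\\r]")
            i += 5
            continue
        c = regexp[i]
        if c == "\\":
            esc = regexp[i:i + 2]
            if esc in SUBSTITUTES:
                if in_class:
                    raise ValueError("unsupported escape in class: " + esc)
                out.append(SUBSTITUTES[esc])
            else:
                out.append(esc)  # other escapes are copied unchanged
            i += 2
            continue
        if in_class:
            in_class = c != "]"
        elif c == "[":
            in_class = True
        out.append(c)
        i += 1
    return "".join(out)


print(substitute_mce(r"\d{4}-\d{2}-\d{2}"))  # [0-9]{4}-[0-9]{2}-[0-9]{2}
```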

Acknowledgements

   This draft has been motivated by the discussion in the IETF JSONPath
   WG about whether to include a regexp mechanism in the JSONPath
   query expression specification, as well as by previous discussions
   about the YANG pattern and CDDL .regexp features.

   The basic approach for this draft was inspired by The I-JSON Message
   Format [RFC7493].

Authors' Addresses

   Carsten Bormann
   Universität Bremen TZI
   Postfach 330440
   D-28359 Bremen
   Germany
   Phone: +49-421-218-63921
   Email: cabo@tzi.org


   Tim Bray
   Textuality
   Email: tbray@textuality.com

