<!--
SPDX-FileCopyrightText: 2014 Alex Shinn
SPDX-License-Identifier: MIT
-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
'http://www.w3.org/TR/REC-html40/strict.dtd'>
<html lang=en-US>
<head>
<!-- This commented out text is for the brittle SRFI tools -->
<!--
</head>
<body>
<H1>Title</H1>
Scheme Regular Expressions
<H1>Author</H1>
Alex Shinn
<H1>Status</H1>
This SRFI is currently in ``draft'' status.
-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="keywords" content="Scheme, regular expressions, programming language, SRFI">
<title>SRFI 115: Scheme Regular Expressions</title>
<style type="text/css">
body {
width: 7in;
margin: 30pt;
}
thead {
font-variant: small-caps;
}
td {
padding-right: 20px;
}
code.proc-def {
font-weight: bold;
color: rgb(120,0,120);
}
.code-example {
background-color: beige;
}
var {
font-weight: bold;
color: rgb(20,20,120);
}
</style>
</head>
<body>
<h1><a name="Title">Title</a></h1>
<div class="title-text">
<p>
Scheme Regular Expressions
<p>
<p>
</div>
<h1><a name="Author">Author</a></h1>
<p>
Alex Shinn
<p>
This SRFI is currently in ``draft'' status.
To see an explanation of
each status that a SRFI can hold, see <a
href="http://srfi.schemers.org/srfi-process.html">here</a>.
To provide input on this SRFI, please
<a href="mailto:srfi minus 115 at srfi dot schemers dot org">mail to
<code><srfi minus 115 at srfi dot schemers dot org></code></a>. See
<a href="../../srfi-list-subscribe.html">instructions here</a> to
subscribe to the list. You can access previous messages via
<a href="mail-archive/maillist.html">the archive of the mailing list</a>.
</p>
<ul>
<li>Received: <a href="http://srfi.schemers.org/srfi-115/srfi-115-1.1.html">2013/10/08</a></li>
<li>Revised: <a href="http://srfi.schemers.org/srfi-115/srfi-115-1.2.html">2013/11/17</a></li>
<li>Revised: <a href="http://srfi.schemers.org/srfi-115/srfi-115-1.3.html">2013/12/08</a></li>
<li>Draft: 2013/10/12-2013/12/12</li>
</ul>
<h1>Table of Contents</h1>
<ul id="toc-table">
<li><a href="#Abstract">Abstract</a></li>
<li><a href="#Issues">Issues</a></li>
<li><a href="#Rationale">Rationale</a></li>
<li><a href="#Types-and-Naming-Conventions">Types and Naming Conventions</a></li>
<li><a href="#Compatibility-Levels-and-Features">Compatibility Levels and Features</a></li>
<li><a href="#Library-Procedures-and-Syntax">Library Procedures and Syntax</a></li>
<li><a href="#SRE-Syntax">SRE Syntax</a></li>
<ul>
<ul>
<li><a href="#SRE_2dSyntax_Basic-Patterns">Basic Patterns</a></li>
<li><a href="#SRE_2dSyntax_Repeating-patterns">Repeating patterns</a></li>
<li><a href="#SRE_2dSyntax_Submatch-Patterns">Submatch Patterns</a></li>
<li><a href="#SRE_2dSyntax_Character-Sets">Character Sets</a></li>
<li><a href="#SRE_2dSyntax_Named-Character-Sets">Named Character Sets</a></li>
<li><a href="#SRE_2dSyntax_Boundary-Assertions">Boundary Assertions</a></li>
<li><a href="#SRE_2dSyntax_Non-Greedy-Patterns">Non-Greedy Patterns</a></li>
<li><a href="#SRE_2dSyntax_Look-Around-Patterns">Look Around Patterns</a></li>
</ul>
</ul>
<li><a href="#Implementation">Implementation</a></li>
<li><a href="#References">References</a></li>
</ul>
<h1><a name="Abstract">Abstract</a></h1>
<p>
This SRFI provides a library for matching strings with regular
expressions described using the SRE "Scheme Regular Expression"
notation first introduced by <a href="#ref-SCSH">SCSH</a>, and
extended heavily by <a href="#ref-IrRegex">IrRegex</a>.
<p>
<p>
<h1><a name="Issues">Issues</a></h1>
<p>
How to integrate with the PCRE regular expression library? The
intention is to make this the primitive notation, and for POSIX
require a separate wrapper such as <code>(pcre->sre <str>)</code>.
Alternately we could allow both in the same API, as in IrRegex,
though this introduces an ambiguity. Finally, we could make this
entirely separate from the PCRE API.
<p>
From SCSH's SREs I've left out the <code>dsm</code> notation which doesn't
seem as though it need be exposed to the user, the
<code>posix-string</code> notation because it's better accomplished with
<code>pcre->sre</code>, and <code>uncase</code> whose exact semantics and
motivation I never quite understood. I also left out the
<code>blank</code> character class since it's a GNU extension without an
accepted Unicode definition.
<p>
| and & are allowed, but the former must be escaped, which looks
fairly ugly. For aesthetics they can also be written <code>or</code> and
<code>and</code>, respectively.
<p>
I've kept most IrRegex extensions, but made many of the non-POSIX
ones optional, designated by the <code>regexp-extended</code> feature, and
<code>backref</code> specifically gets its own feature
<code>regexp-backrefs</code>. I left out the common utility patterns
<code>integer</code>, <code>domain</code>, <code>url</code>, etc., which can easily
enough be included in libraries and unquoted into SREs.
<p>
The => shorthand for named matches used by IrRegex would perhaps
have better been named <-, the more common choice to represent
binding in parsers, leaving => open for the send-to-procedure idiom
used in cond.
<p>
The API uses string indices for start, end and match positions,
which is slow for a UTF8 implementation. However, the reference
implementation uses string cursors for efficient iteration,
minimizing offset conversions, and suffers no penalty if submatch
strings are directly extracted instead of bounds.
<p>
Unicode properties and grapheme handling have no precedent in SRE
implementations, though they have much precedent in other regexp
libraries. Making Unicode the default feels right, but the vast
majority of regexps are likely to want ASCII.
<p> Many Unicode properties as well as Unicode script names that are
available in PCRE are not provided as char-sets here.
<p> SREs with embedded SRFI 14 char-sets can't be written and read
back in portably. R7RS WG2 is considering external syntax
representations, and may include them for SRFI 14 char-sets as well,
making this a non-issue. On the other hand SREs with embedded
compiled regexps, as allowed in SCSH, are not supported, largely to
preserve writeability. Instead you should embed other SREs.
<p> <code>regexp->sre</code> is frequently requested in IrRegex. It
is useful and the only argument against it is that it would require
more memory for compiled regexps (linearly more for most
implementations), but I'll wait to see if it's requested in the
discussion.
<p> Library-level features aren't supported in R7RS.
<p>
There aren't enough examples.
<p>
<p>
<h1><a name="Rationale">Rationale</a></h1>
<p> Regular expressions, coming from a long history of formal language
theory, are today the lingua franca of simple string matching. A
regular expression is an expression describing a regular language,
the simplest level in the Chomsky hierarchy. They have the nice
property that they can match in linear time, whereas parsers for the
next level in the hierarchy require cubic time. This combined with
their conciseness led them to be a popular choice for searching in
editors, tools and search interfaces. Other tools may be better
suited to specific purposes, but it is assumed that any modern language
will provide regular expression support.
<p> SREs were first introduced in SCSH as an s-expression based
alternative to the more common string based description. This
format offers many advantages, including being easier to read and
write (notably with structured editors), easier to compose (with no
escaping issues), and faster and simpler to compile. An efficient
reference implementation of this SRFI can be written in under 1000
lines of code, whereas in IrRegex the full PCRE parser alone
requires over 500 lines.
<p>
<p>
<h1>Procedure Index</h1>
<table>
<tr>
<td><a href="#proc-regexp">regexp</a></td>
<td><a href="#proc-rx">rx</a></td>
<td><a href="#proc-char-set-sre">char-set->sre</a></td>
<td><a href="#proc-valid-sre_3f">valid-sre?</a></td>
</tr>
<tr>
<td><a href="#proc-regexp_3f">regexp?</a></td>
<td><a href="#proc-regexp-matches">regexp-matches</a></td>
<td><a href="#proc-regexp-matches_3f">regexp-matches?</a></td>
<td><a href="#proc-regexp-search">regexp-search</a></td>
</tr>
<tr>
<td><a href="#proc-regexp-fold">regexp-fold</a></td>
<td><a href="#proc-regexp-extract">regexp-extract</a></td>
<td><a href="#proc-regexp-split">regexp-split</a></td>
<td><a href="#proc-regexp-partition">regexp-partition</a></td>
</tr>
<tr>
<td><a href="#proc-regexp-replace">regexp-replace</a></td>
<td><a href="#proc-regexp-replace-all">regexp-replace-all</a></td>
<td><a href="#proc-regexp-match_3f">regexp-match?</a></td>
<td><a href="#proc-regexp-match-count">regexp-match-count</a></td>
</tr>
<tr>
<td><a href="#proc-regexp-match-submatch">regexp-match-submatch</a></td>
<td><a href="#proc-regexp-match-submatch-start">regexp-match-submatch-start</a></td>
<td><a href="#proc-regexp-match-submatch-end">regexp-match-submatch-end</a></td>
<td><a href="#proc-regexp-match-_3elist">regexp-match->list</a></td>
</tr>
</table>
<h1>SRE Syntax Index</h1>
<table><tr>
<td><a href="#proc-_3cstring_3e"><string></a></td>
<td><a href="#proc-seq">seq</a></td>
<td><a href="#proc-_3a">:</a></td>
<td><a href="#proc-or">or</a></td>
</tr><tr><td><a href="#proc-_7c_5c_7c_7c">|</a></td>
<td><a href="#proc-w_2fnocase">w/nocase</a></td>
<td><a href="#proc-w_2fcase">w/case</a></td>
<td><a href="#proc-w_2fascii">w/ascii</a></td>
</tr><tr><td><a href="#proc-w_2funicode">w/unicode</a></td>
<td><a href="#proc-_3f">?</a></td>
<td><a href="#proc-_2a">*</a></td>
<td><a href="#proc-_2b">+</a></td>
</tr><tr><td><a href="#proc-_3e_3d">>=</a></td>
<td><a href="#proc-_3d">=</a></td>
<td><a href="#proc-_2a_2a">**</a></td>
<td><a href="#proc-submatch">submatch</a></td>
</tr><tr><td><a href="#proc-_24">$</a></td>
<td><a href="#proc-submatch-named">submatch-named</a></td>
<td><a href="#proc-_3d_3e">=></a></td>
<td><a href="#proc-backref">backref</a></td>
</tr><tr><td><a href="#proc-_3cchar_3e"><char></a></td>
<td><a href="#proc-_3cstring_3e_29">(<string>)</a></td>
<td><a href="#proc-_2f">/</a></td>
<td><a href="#proc-or">or</a></td>
</tr><tr><td><a href="#proc-_7e">~</a></td>
<td><a href="#proc--">-</a></td>
<td><a href="#proc-and">and</a></td>
<td><a href="#proc-_26">&</a></td>
</tr><tr><td><a href="#proc-any">any</a></td>
<td><a href="#proc-nonl">nonl</a></td>
<td><a href="#proc-ascii">ascii</a></td>
<td><a href="#proc-lower-case">lower-case</a></td>
</tr><tr><td><a href="#proc-lower">lower</a></td>
<td><a href="#proc-upper-case">upper-case</a></td>
<td><a href="#proc-upper">upper</a></td>
<td><a href="#proc-alphabetic">alphabetic</a></td>
</tr><tr><td><a href="#proc-alpha">alpha</a></td>
<td><a href="#proc-numeric">numeric</a></td>
<td><a href="#proc-num">num</a></td>
<td><a href="#proc-alphanumeric">alphanumeric</a></td>
</tr><tr><td><a href="#proc-alphanum">alphanum</a></td>
<td><a href="#proc-alnum">alnum</a></td>
<td><a href="#proc-punctuation">punctuation</a></td>
<td><a href="#proc-punct">punct</a></td>
</tr><tr><td><a href="#proc-symbol">symbol</a></td>
<td><a href="#proc-graphic">graphic</a></td>
<td><a href="#proc-graph">graph</a></td>
<td><a href="#proc-whitespace">whitespace</a></td>
</tr><tr><td><a href="#proc-white">white</a></td>
<td><a href="#proc-space">space</a></td>
<td><a href="#proc-printing">printing</a></td>
<td><a href="#proc-print">print</a></td>
</tr><tr><td><a href="#proc-control">control</a></td>
<td><a href="#proc-cntrl">cntrl</a></td>
<td><a href="#proc-hex-digit">hex-digit</a></td>
<td><a href="#proc-xdigit">xdigit</a></td>
</tr><tr><td><a href="#proc-bos">bos</a></td>
<td><a href="#proc-eos">eos</a></td>
<td><a href="#proc-bol">bol</a></td>
<td><a href="#proc-eol">eol</a></td>
</tr><tr><td><a href="#proc-bow">bow</a></td>
<td><a href="#proc-eow">eow</a></td>
<td><a href="#proc-nwb">nwb</a></td>
<td><a href="#proc-word">word</a></td>
</tr><tr><td><a href="#proc-word_2b">word+</a></td>
<td><a href="#proc-word">word</a></td>
<td><a href="#proc-bog">bog</a></td>
<td><a href="#proc-eog">eog</a></td>
</tr><tr><td><a href="#proc-grapheme">grapheme</a></td>
<td><a href="#proc-_3f_3f">??</a></td>
<td><a href="#proc-_2a_3f">*?</a></td>
<td><a href="#proc-_2a_2a_3f">**?</a></td>
</tr><tr><td><a href="#proc-look-ahead">look-ahead</a></td>
<td><a href="#proc-look-behind">look-behind</a></td>
<td><a href="#proc-neg-look-ahead">neg-look-ahead</a></td>
<td><a href="#proc-neg-look-behind">neg-look-behind</a></td>
</tr>
</table>
<h1><a name="Types-and-Naming-Conventions">Types and Naming Conventions</a></h1>
<p>
We introduce two new types, <code>regexp</code> and
<code>regexp-match</code>, which are disjoint from all other types. We
also introduce the concept of an "SRE," which is not a disjoint type
but is a Scheme object following the specification described below.
<p>
SRFI 14 defines the <code>char-set</code> type, which can be used as
part of an SRE.
<p>
In the prototypes below the following naming conventions imply type
restrictions:
<p>
<ul>
<li><var>char-set</var>: a SRFI 14 character set
<li><var>cset-sre</var>: an sre which corresponds to matching a single character out of a set of characters
<li><var>end</var>: an exact, non-negative integer, defaulting to <code>(string-length str)</code>
<li><var>finish</var>: a procedure <code>(lambda (i regexp-match str obj) ...)</code>
<li><var>obj</var>: any object
<li><var>knil</var>: any object
<li><var>kons</var>: a procedure <code>(lambda (i regexp-match str obj) ...)</code>
<li><var>re</var>: an SRE or pre-compiled regexp object
<li><var>regexp-match</var>: a regexp-match object from a successful match
<li><var>sre</var>: an SRE as described below
<li><var>start</var>: an exact, non-negative integer, defaulting to 0
<li><var>str</var>: a string
<li><var>subst</var>: an sexp describing a substitution template
<li><var>X-or-false</var>: either an object of type X or the false value
<p>
<p>
</ul>
<h1><a name="Compatibility-Levels-and-Features">Compatibility Levels and Features</a></h1>
<p>
We specify a thorough, though not exhaustive, syntax with many
extensions popular in modern regular expression libraries such as
<a href="#ref-PCRE">PCRE</a>. This is because it is assumed in many
cases said libraries will be used as the underlying implementation,
the features will be desirable, and if left unspecified people will
provide their own, often incompatible, extensions.
<p> On the other hand it is acknowledged that not all implementations
will be able to support all extensions. Some are difficult to
implement for DFA implementations, and some, like
<code>backref</code>, are prohibitively expensive for any
implementation. Furthermore, even if an implementation has Unicode
support, its regexp library may not.
<p>
To resolve these differences we divide the syntax into a minimal
core which all implementations are required to support, and
additional extensions. In <a href="#ref-R7RS">R7RS</a> or other
implementations which support <a href="#ref-SRFI-0">SRFI 0</a>
<code>cond-expand</code>, the availability can be tested with the
following <code>cond-expand</code> features:
<p>
<ul>
<li><code>regexp-non-greedy</code> - the non-greedy repetition patterns <code>??</code>, <code>*?</code>, and <code>**?</code> are supported
<li><code>regexp-look-around</code> - the <code>[neg]-look-ahead</code> and <code>[neg]-look-behind</code> zero-width assertions are supported
<li><code>regexp-backrefs</code> - the <code>backref</code> pattern is supported
<li><code>regexp-unicode</code> - regexp character sets support Unicode
</ul>
<p>
The first three simply refer to support for certain SRE patterns.
<p>
<code>regexp-unicode</code> indicates support for Unicode contexts.
Toggling between Unicode and ASCII can be done with the
<code>w/unicode</code> and <code>w/ascii</code> patterns. In a
Unicode context, the named character sets have their full Unicode
definition as described below, grapheme boundaries are "extended
grapheme clusters," and word boundaries are "default word
boundaries" as defined in <a href="#ref-UAX29">UAX #29</a> (Unicode
Text Segmentation). Thus Unicode contexts are equivalent to Level 2
support for regular expressions as defined in Unicode TR-18.
Implementations which provide this feature may still support
non-Unicode characters.
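<p>
As a rough sketch of how these features might be tested, assuming an
implementation that advertises them to <code>cond-expand</code>, the
following chooses between a non-greedy pattern and a fallback that
uses only core syntax. The name <code>quoted-string-sre</code> is
hypothetical and not part of this SRFI.
<pre class="code-example">
(cond-expand
  (regexp-non-greedy
   ;; Non-greedy repetition is available: stop at the first closing quote.
   (define quoted-string-sre '(: #\" (*? any) #\")))
  (else
   ;; Core-only fallback: a run of characters other than the quote.
   (define quoted-string-sre '(: #\" (* (~ #\")) #\"))))
</pre>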
<p>
<p>
<h1><a name="Library-Procedures-and-Syntax">Library Procedures and Syntax</a></h1>
<p>
<dt>(<a name="proc-regexp"><code class="proc-def">regexp</code></a> <var>re</var>) => regexp
<dd class="proc-def"></dd>
<p> Compile a regexp if given an object whose structure matches the
SRE syntax. This may be written as a literal or partial literal
with <code>quote</code> or <code>quasiquote</code>, or may be
generated entirely programmatically. Returns <var>re</var>
unmodified if it is already a regexp. Raises an error if
<var>re</var> is neither a regexp nor a valid representation of an
SRE.
<p> Mutating <var>re</var> may invalidate the resulting regexp,
causing unspecified results if subsequently used for matching.
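<p>
A minimal sketch of programmatic construction, assuming a hypothetical
helper <code>keyword-sre</code> (not part of this SRFI) that builds an
SRE from a list of strings with <code>quasiquote</code>:
<pre class="code-example">
;; keyword-sre is a hypothetical helper for illustration only.
(define (keyword-sre words)
  `(: bow (or ,@words) eow))

(define keyword-re (regexp (keyword-sre '("define" "lambda"))))

(regexp? keyword-re)
=> #t
(regexp-matches? keyword-re "lambda")
=> #t
(regexp-matches? keyword-re "lambdas")
=> #f
</pre>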
<p>
<dt>(<a name="proc-rx"><code class="proc-def">rx</code></a> <var>sre</var> <var>...</var>) => regexp
<dd class="proc-def"></dd>
<p> Macro shorthand for <code>(regexp `(: <var>sre</var> ...))</code>.
May be able to perform some or all computation at compile time if
<var>sre</var> is not unquoted. Note that because of this equivalence
with the procedural constructor <code>regexp</code>, the semantics
of <code>unquote</code> differs from the original SCSH
implementation in that unquoted expressions can expand into any
object matching the SRE syntax, rather than a compiled regexp
object. Further, <code>unquote</code> and
<code>unquote-splicing</code> both expand all matches.
<blockquote style="background:lightgray"> Rationale: Providing a
procedural interface provides for greater flexibility, and without
loss of potential compile-time optimizations by preserving the
syntactic shorthand. The alternative is to rely on eval to
dynamically generate regular expressions. However regexps in many
cases come from untrusted sources, such as search parameters to a
server, or from serialized sources such as config files or
command-line arguments. Moreover many applications may want to keep
many thousands of regexps in memory at once. Given the relatively
heavy cost and insecurity of eval, and the frequency with which SREs
are read and written as text, we prefer the procedural interface.
</blockquote>
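<p>
For illustration, a short sketch of <code>rx</code> with unquoted
sub-patterns, following the equivalence with <code>regexp</code>
stated above. The names <code>digits</code> and
<code>version-re</code> are hypothetical.
<pre class="code-example">
;; An SRE held in an ordinary variable and unquoted into the pattern.
(define digits '(+ numeric))
(define version-re (rx bos ,digits (* "." ,digits) eos))

(regexp-matches? version-re "1.3")
=> #t
(regexp-matches? version-re "1.")
=> #f
</pre>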
<p>
<dt>(<a name="proc-char-set-sre"><code class="proc-def">char-set->sre</code></a> <var>char-set</var>) => sre
<dd class="proc-def"></dd>
<p>
Returns an SRE corresponding to the given SRFI 14 character set.
The resulting SRE expands the character set into notation which does
not make use of embedded SRFI 14 character sets, and so is suitable
for writing portably.
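<p>
A small sketch of one possible use, assuming SRFI 14's
<code>string->char-set</code> is available. The exact form of the
returned SRE is not specified here, so it is only used, not printed.
<pre class="code-example">
(define vowels (string->char-set "aeiou"))   ; SRFI 14 char-set
(define vowel-sre (char-set->sre vowels))    ; writable SRE form

;; The result can be spliced into a larger SRE like any other cset-sre.
(regexp-extract `(+ ,vowel-sre) "queue")
=> ("ueue")
</pre>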
<p>
<dt>(<a name="proc-valid-sre_3f"><code class="proc-def">valid-sre?</code></a> <var>obj</var>) => boolean
<dd class="proc-def"></dd>
<p>
Returns true iff <var>obj</var> can be safely passed to <var>regexp</var>.
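<p>
One way this might be used is to guard compilation of data read from
an untrusted source. The helper <code>compile-untrusted</code> below
is hypothetical, not part of this SRFI.
<pre class="code-example">
;; Returns a compiled regexp, or #f if obj is not a valid SRE.
(define (compile-untrusted obj)
  (and (valid-sre? obj) (regexp obj)))

(regexp? (compile-untrusted '(+ numeric)))
=> #t
(compile-untrusted 42)   ; a bare number is not an SRE
=> #f
</pre>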
<p>
<dt>(<a name="proc-regexp_3f"><code class="proc-def">regexp?</code></a> <var>obj</var>) => boolean
<dd class="proc-def"></dd>
<p>
Returns true iff <var>obj</var> is a regexp.
<p>
<dt>(<a name="proc-regexp-matches"><code class="proc-def">regexp-matches</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => regexp-match-or-false
<dd class="proc-def"></dd>
<p>
Returns a regexp-match object if <var>re</var> successfully matches the entire
string <var>str</var> from <var>start</var> (inclusive) to <var>end</var> (exclusive), or #f if the
match fails. The regexp-match object will contain information needed to
extract any submatches.
<p>
<dt>(<a name="proc-regexp-matches_3f"><code class="proc-def">regexp-matches?</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => boolean
<dd class="proc-def"></dd>
<p>
Returns <code>#t</code> if <var>re</var> matches <var>str</var> as in regexp-matches, or
<code>#f</code> otherwise. May be faster than regexp-matches since it
doesn't need to return submatch data.
<p>
<dt>(<a name="proc-regexp-search"><code class="proc-def">regexp-search</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => regexp-match-or-false
<dd class="proc-def"></dd>
<p>
Returns a regexp-match object if <var>re</var> successfully matches a substring
of <var>str</var> between <var>start</var> (inclusive) and <var>end</var> (exclusive), or
<code>#f</code> if the match fails. The regexp-match object will contain
information needed to extract any submatches.
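<p>
The difference between matching the whole string and searching for a
substring can be sketched as follows (the inputs are illustrative
only):
<pre class="code-example">
;; regexp-matches? requires the whole string to match.
(regexp-matches? '(+ numeric) "2013")
=> #t
(regexp-matches? '(+ numeric) "year 2013")
=> #f

;; regexp-search finds a matching substring.
(regexp-match-submatch (regexp-search '(+ numeric) "year 2013") 0)
=> "2013"

;; start and end restrict the searched region to "year".
(regexp-search '(+ numeric) "year 2013" 0 4)
=> #f
</pre>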
<p>
<dt>(<a name="proc-regexp-fold"><code class="proc-def">regexp-fold</code></a> <var>re</var> <var>kons</var> <var>knil</var> <var>str</var> <var>[finish</var> <var>[start</var> <var>[end]]]</var>) => obj
<dd class="proc-def"></dd>
<p>
The fundamental regexp matching iterator. Repeatedly searches <var>str</var>
for the regexp <var>re</var> so long as a match can be found. On each
successful match, applies
<pre class="code-example">
(<var>kons</var> <i>i</i> <i>regexp-match</i> <i>str</i> <i>acc</i>)
</pre>
where <i>i</i> is the index at which the previous match ended (beginning with <var>start</var>),
<i>regexp-match</i> is the resulting match, and <i>acc</i> is the result of the
previous <var>kons</var> application, beginning with <var>knil</var>. When no more
matches can be found, calls <var>finish</var> with the same arguments, except
that <i>regexp-match</i> is #f.
<p>
By default <var>finish</var> just returns <i>acc</i>.
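<p>
As an illustrative sketch, the following collects every matched
substring with <var>kons</var> and restores the original order in
<var>finish</var>:
<pre class="code-example">
(regexp-fold '(+ alpha)
             ;; kons: push the text of each match onto the accumulator
             (lambda (i m str acc)
               (cons (regexp-match-submatch m 0) acc))
             '()
             "one two three"
             ;; finish: m is #f here; the matches were consed in reverse
             (lambda (i m str acc) (reverse acc)))
=> ("one" "two" "three")
</pre>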
<p>
<dt>(<a name="proc-regexp-extract"><code class="proc-def">regexp-extract</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => list
<dd class="proc-def"></dd>
<p>
Extract all non-empty substrings of <var>str</var> which match <var>re</var> between
<var>start</var> and <var>end</var> as a list of strings.
<p>
<pre class="code-example">
(regexp-extract '(+ numeric) "192.168.0.1")
=> ("192" "168" "0" "1")
</pre>
<dt>(<a name="proc-regexp-split"><code class="proc-def">regexp-split</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => list
<dd class="proc-def"></dd>
<p>
Split <var>str</var> into a list of strings separated by matches of <var>re</var>.
<p>
<pre class="code-example">
(regexp-split '(+ space) " fee fi fo\tfum\n")
=> ("fee" "fi" "fo" "fum")
</pre>
<dt>(<a name="proc-regexp-partition"><code class="proc-def">regexp-partition</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => list
<dd class="proc-def"></dd>
<p>
Partition <var>str</var> into a list of non-empty strings matching <var>re</var>,
interspersed with the unmatched portions of the string. The first
and every odd element is an unmatched substring, which will be the
empty string if <var>re</var> matches at the beginning of the string or end
of the previous match. The second and every even element will be a
substring matching <var>re</var>. If the final match ends at the end of the
string, no trailing empty string will be included. Thus, in the
degenerate case where <var>str</var> is the empty string, the result is
<code>("")</code>.
<p>
<pre class="code-example">
(regexp-partition '(+ (or space punct)) "")
=> ("")
(regexp-partition '(+ (or space punct)) "Hello, world!\n")
=> ("Hello" ", " "world" "!\n")
(regexp-partition '(+ (or space punct)) "¿Dónde Estás?")
=> ("" "¿" "Dónde" " " "Estás" "?")
</pre>
<dt>(<a name="proc-regexp-replace"><code class="proc-def">regexp-replace</code></a> <var>re</var> <var>str</var> <var>subst</var> <var>[start</var> <var>[end]]</var>) => string
<dd class="proc-def"></dd>
<p>
Returns a new string replacing the first match of <var>re</var> in <var>str</var> with
the <var>subst</var>. <var>subst</var> can be a string, an integer or symbol
indicating the contents of a numbered or named submatch of <var>re</var>,
<var>'pre</var> for the substring to the left of the match, or <var>'post</var> for
the substring to the right of the match.
<p>
<pre class="code-example">
(regexp-replace '(+ space) "one two three" "_")
=> "one_two three"
</pre>
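<p>
A further sketch, reading the description above literally: when
<var>subst</var> is an integer, the match is replaced by the contents
of that numbered submatch. The pattern and input are illustrative only.
<pre class="code-example">
;; Keep only the part before the "@", captured as submatch 1.
(regexp-replace '(: ($ (+ alpha)) "@" (+ alpha)) "user@host" 1)
=> "user"
</pre>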
<dt>(<a name="proc-regexp-replace-all"><code class="proc-def">regexp-replace-all</code></a> <var>re</var> <var>str</var> <var>subst</var> <var>[start</var> <var>[end]]</var>) => string
<dd class="proc-def"></dd>
<p>
Equivalent to <var>regexp-replace</var>, but replaces all occurrences of <var>re</var>
in <var>str</var>.
<p>
<pre class="code-example">
(regexp-replace-all '(+ space) "one two three" "_")
=> "one_two_three"
</pre>
<dt>(<a name="proc-regexp-match_3f"><code class="proc-def">regexp-match?</code></a> <var>obj</var>) => boolean
<dd class="proc-def"></dd>
<p>
Returns true iff <var>obj</var> is a successful match from <var>regexp-matches</var> or
<var>regexp-search</var>.
<p>
<dt>(<a name="proc-regexp-match-count"><code class="proc-def">regexp-match-count</code></a> <var>regexp-match</var>) => integer
<dd class="proc-def"></dd>
<p>
Returns the number of submatches of regexp-match, regardless of whether
they matched or not.
<p>
<dt>(<a name="proc-regexp-match-submatch"><code class="proc-def">regexp-match-submatch</code></a> <var>regexp-match</var> <var>field</var>) => string-or-false
<dd class="proc-def"></dd>
<p> Returns the substring matched in <var>regexp-match</var>
corresponding to <var>field</var>, either an integer or a symbol for
a named submatch. Index 0 refers to the entire match, index 1 to
the first lexicographic submatch, and so on. If passed an integer
outside the range of matches, or a symbol which does not correspond
to a named submatch of the pattern, it is an error. If the
corresponding submatch did not match, returns false.
<p> The result of extracting a submatch after the original matched
string has been mutated is unspecified.
<p>
<dt>(<a name="proc-regexp-match-submatch-start"><code class="proc-def">regexp-match-submatch-start</code></a> <var>regexp-match</var> <var>field</var>) => integer-or-false
<dd class="proc-def"></dd>
<p>
Returns the start index in <var>regexp-match</var> corresponding to
<var>field</var>, as in <var>regexp-match-submatch</var>.
<p>
<dt>(<a name="proc-regexp-match-submatch-end"><code class="proc-def">regexp-match-submatch-end</code></a> <var>regexp-match</var> <var>field</var>) => integer-or-false
<dd class="proc-def"></dd>
<p>
Returns the end index in <var>regexp-match</var> corresponding to
<var>field</var>, as in <var>regexp-match-submatch</var>.
<p>
<dt>(<a name="proc-regexp-match-_3elist"><code class="proc-def">regexp-match->list</code></a> <var>regexp-match</var>) => list
<dd class="proc-def"></dd>
<p>
Returns a list of all submatches in <var>regexp-match</var> as strings or false,
beginning with the entire match 0.
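<p>
The accessors above can be sketched together on a single match object.
The variable <code>m</code> and the input string are illustrative only.
<pre class="code-example">
(define m
  (regexp-search '(: ($ (+ numeric)) "-" ($ (+ numeric))) "date: 2013-12"))

(regexp-match-submatch m 0)         ; the entire match
=> "2013-12"
(regexp-match-submatch m 1)         ; first numbered submatch
=> "2013"
(regexp-match-submatch-start m 1)
=> 6
(regexp-match-submatch-end m 1)
=> 10
(regexp-match->list m)
=> ("2013-12" "2013" "12")

;; A named submatch is looked up by symbol.
(regexp-match-submatch
 (regexp-search '(=> year (+ numeric)) "date: 2013-12") 'year)
=> "2013"

;; A submatch that did not take part in the match yields false.
(regexp-match-submatch (regexp-search '(or ($ "a") "b") "b") 1)
=> #f
</pre>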
<p>
<p>
<h1><a name="SRE-Syntax">SRE Syntax</a></h1>
<p> The grammar for SREs is summarized below. Note that an SRE is a
first-class object consisting of nested lists of strings, chars,
char-sets, symbols and numbers. Where the syntax is described as
<code>(foo bar)</code>, this can be constructed equivalently as
<code>'(foo bar)</code> or <code>(list 'foo 'bar)</code>, etc.
The following sections explain the semantics in greater detail.
<p>
<pre class="code-example">
<sre> ::=
| <string> ; A literal string match.
| <cset-sre> ; A character set match.
| (* <sre> ...) ; 0 or more matches.
| (+ <sre> ...) ; 1 or more matches.
| (? <sre> ...) ; 0 or 1 matches.
| (= <n> <sre> ...) ; <n> matches.
| (>= <n> <sre> ...) ; <n> or more matches.
| (** <n> <m> <sre> ...) ; <n> to <m> matches.
| (| <sre> ...) ; Alternation.
| (or <sre> ...)
| (: <sre> ...) ; Sequence.
| (seq <sre> ...)
| ($ <sre> ...) ; Numbered submatch.
| (submatch <sre> ...)
| (=> <name> <sre> ...) ; Named submatch. <name> is
| (submatch-named <name> <sre> ...) ; a symbol.
| (w/case <sre> ...) ; Introduce a case-sensitive context.
| (w/nocase <sre> ...) ; Introduce a case-insensitive context.
| (w/unicode <sre> ...) ; Introduce a unicode context.
| (w/ascii <sre> ...) ; Introduce an ascii context.
| bos ; Beginning of string.
| eos ; End of string.
| bol ; Beginning of line.
| eol ; End of line.
| bog ; Beginning of grapheme cluster.
| eog ; End of grapheme cluster.
| grapheme ; A single grapheme cluster.
| bow ; Beginning of word.
| eow ; End of word.
| nwb ; A non-word boundary.
| (word <sre> ...) ; An SRE wrapped in word boundaries.
| (word+ <cset-sre> ...) ; A single word restricted to a cset.
| word ; A single word.
| (?? <sre> ...) ; Non-greedy 0 or 1 matches.
| (*? <sre> ...) ; Non-greedy 0 or more matches.
| (**? <n> <m> <sre> ...) ; Non-greedy <n> to <m> matches.
| (look-ahead <sre> ...) ; Zero-width look-ahead assertion.
| (look-behind <sre> ...) ; Zero-width look-behind assertion.
| (neg-look-ahead <sre> ...) ; Zero-width negative look-ahead assertion.
| (neg-look-behind <sre> ...) ; Zero-width negative look-behind assertion.
</pre>
The grammar for <code>cset-sre</code> is as follows.
<p>
<pre class="code-example">
<cset-sre> ::=
| <char> ; literal char
| "<char>" ; string of one char
| <char-set> ; embedded SRFI 14 char set
| (<string>) ; literal char set
| (/ <range-spec> ...) ; ranges
| (or <cset-sre> ...) ; union
| (and <cset-sre> ...) ; intersection
| (- <cset-sre> ...) ; difference
| (~ <cset-sre> ...) ; complement of union
| (w/case <cset-sre> ...) ; case and unicode toggling
| (w/nocase <cset-sre> ...)
| (w/ascii <cset-sre> ...)
| (w/unicode <cset-sre> ...)
| any | nonl | ascii | lower-case | lower
| upper-case | upper | alphabetic | alpha
| numeric | num | alphanumeric | alphanum | alnum
| punctuation | punct | symbol | graphic | graph
| whitespace | white | space | printing | print
| control | cntrl | hex-digit | xdigit
</pre>
<pre class="code-example">
<range-spec> ::= <string> | <char>
</pre>
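<p>
Because an SRE is ordinary list data, it can also be assembled at run time.
The following is a minimal sketch, assuming the library is available as
<code>(srfi 115)</code>; <code>keyword-sre</code> is a hypothetical helper
introduced for illustration and is not part of this SRFI.
<p>
<pre class="code-example">
(import (scheme base) (srfi 115))

;; Hypothetical helper: build a case-insensitive alternation
;; from a list of literal strings using quasiquotation.
(define (keyword-sre words)
  `(w/nocase (or ,@words)))

(keyword-sre '("select" "from"))  ; => (w/nocase (or "select" "from"))

(define m (regexp-search (keyword-sre '("select" "from")) "SELECT 1"))
(regexp-match-submatch m 0)       ; => "SELECT"
</pre>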
<h3><a name="SRE_2dSyntax_Basic-Patterns">Basic Patterns</a></h3>
<p>
<dt><a name="proc-_3cstring_3e"><code class="proc-def"><string></code></a>
<dd class="proc-def"></dd>
<p>
A literal string.
<p>
<pre class="code-example">
(regexp-search "needle" "hayneedlehay") => #<regexp-match>
(regexp-search "needle" "haynEEdlehay") => #f
</pre>
<dt>(<a name="proc-seq"><code class="proc-def">seq</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-:"><code class="proc-def">:</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
Sequencing.
<p>
<pre class="code-example">
(regexp-search '(: "one" space "two" space "three") "one two three") => #<regexp-match>
</pre>
<dt>(<a name="proc-or"><code class="proc-def">or</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-_7c_5c_7c_7c"><code class="proc-def">|\||</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
Alternation.
<p>
<pre class="code-example">
(regexp-search '(or "eeney" "meeney" "miney") "meeney") => #<regexp-match>
(regexp-search '(or "eeney" "meeney" "miney") "moe") => #f
</pre>
<dt>(<a name="proc-w_2fnocase"><code class="proc-def">w/nocase</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p> Enclosed <var>sres</var> are case-insensitive. In a Unicode
context, character and string literals match using the default simple
Unicode case-insensitive matching, and a character set matches if any
character in the set matches case-insensitively. Implementations
may, but are not required to, handle variable-length case
conversions, such as #\x00DF "ß" matching the two characters "SS".
In an ASCII context, only the 52 ASCII letters "a-zA-Z" match
case-insensitively to each other.
<p>
<pre class="code-example">
(regexp-search '(w/nocase "needle") "haynEEdlehay") => #<regexp-match>
</pre>
<dt>(<a name="proc-w_2fcase"><code class="proc-def">w/case</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
Enclosed <var>sres</var> are case-sensitive. This is the default.
<p>
<pre class="code-example">
(regexp-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => #<regexp-match>
</pre>
<dt>(<a name="proc-w_2fascii"><code class="proc-def">w/ascii</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
Enclosed <var>sres</var> are interpreted in an ASCII context. In practice,
many regular expressions are used for simple parsing where only ASCII
characters are relevant. Switching to ASCII mode can improve
performance in some implementations.
<p>
<pre class="code-example">
(regexp-search '(w/ascii bos (* alpha) eos) "English") => #<regexp-match>
(regexp-search '(w/ascii bos (* alpha) eos) "Ελληνική") => #f
</pre>
<dt>(<a name="proc-w_2funicode"><code class="proc-def">w/unicode</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
Enclosed <var>sres</var> are interpreted in a Unicode context: character
sets with both an ASCII and a Unicode definition take the latter. This has
no effect if the <code>regexp-unicode</code> feature is not provided. This
is the default.
<p>
<pre class="code-example">
(regexp-search '(w/unicode bos (* alpha) eos) "English") => #<regexp-match>
(regexp-search '(w/unicode bos (* alpha) eos) "Ελληνική") => #<regexp-match>
</pre>
<h3><a name="SRE_2dSyntax_Repeating-patterns">Repeating patterns</a></h3>
<p>
<dt>(<a name="proc-_3f"><code class="proc-def">?</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
An optional pattern, matching 0 or 1 times.
<p>
<pre class="code-example">
(regexp-search '(: "match" (? "es") "!") "matches!") => #<regexp-match>
(regexp-search '(: "match" (? "es") "!") "match!") => #<regexp-match>
(regexp-search '(: "match" (? "es") "!") "matche!") => #f
</pre>
<dt>(<a name="proc-_2a"><code class="proc-def">*</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
Kleene star, matches 0 or more times.
<p>
<pre class="code-example">
(regexp-search '(: "<" (* (~ #\>)) ">") "<html>") => #<regexp-match>
(regexp-search '(: "<" (* (~ #\>)) ">") "<>") => #<regexp-match>
(regexp-search '(: "<" (* (~ #\>)) ">") "<html") => #f
</pre>
<dt>(<a name="proc-_2b"><code class="proc-def">+</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
1 or more matches. Like <code>*</code> but requires at least a single match.
<p>
<pre class="code-example">
(regexp-search '(: "<" (+ (~ #\>)) ">") "<html>") => #<regexp-match>
(regexp-search '(: "<" (+ (~ #\>)) ">") "<a>") => #<regexp-match>
(regexp-search '(: "<" (+ (~ #\>)) ">") "<>") => #f
</pre>
<dt>(<a name="proc-_3e_3d"><code class="proc-def">>=</code></a> <var>n</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
More generally, <var>n</var> or more matches.
<p>
<pre class="code-example">
(regexp-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => #<regexp-match>
(regexp-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => #<regexp-match>
(regexp-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f
</pre>
<dt>(<a name="proc-_3d"><code class="proc-def">=</code></a> <var>n</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
Exactly <var>n</var> matches.
<p>
<pre class="code-example">
(regexp-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => #<regexp-match>
(regexp-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f
</pre>
<dt>(<a name="proc-_2a_2a"><code class="proc-def">**</code></a> <var>from</var> <var>to</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
The most general form, from <var>from</var> to <var>to</var> matches, inclusive.
<p>
<pre class="code-example">
(regexp-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => #<regexp-match>
(regexp-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f
</pre>
<h3><a name="SRE_2dSyntax_Submatch-Patterns">Submatch Patterns</a></h3>
<p>
<dt>(<a name="proc-submatch"><code class="proc-def">submatch</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-_24"><code class="proc-def">$</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>
A numbered submatch. The contents matching the pattern
will be available in the resulting regexp-match.
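<p>
As a minimal sketch of retrieving numbered submatches, assuming the library is
available as <code>(srfi 115)</code> (the input string and the name
<code>m</code> are illustrative only):
<p>
<pre class="code-example">
(import (scheme base) (srfi 115))

;; Submatch 1 captures the key, submatch 2 the numeric value.
(define m
  (regexp-search '(: ($ (+ alpha)) "=" ($ (+ numeric))) "width=42"))

(regexp-match-submatch m 0) ; => "width=42"
(regexp-match-submatch m 1) ; => "width"
(regexp-match-submatch m 2) ; => "42"
</pre>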
<p>
<dt>(<a name="proc-submatch-named"><code class="proc-def">submatch-named</code></a> <var>name</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-_3d_3e"><code class="proc-def">=></code></a> <var>name</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<p>