<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Language" content="en" />
<meta name="description" content="htmLawed PHP software is a free, open-source, customizable HTML input purifier and filter - htmLawed_README.txt - presented with rTxt2htm, a PHP Labware utility" />
<meta name="keywords" content="htmLawed, HTM, HTML, HTML5, HTML 5, XHTML, XHTML5, HTML Tidy, converter, filter, formatter, purifier, sanitizer, XSS, input, PHP, software, code, script, security, cross-site scripting, hack, sanitize, remove, standards, tags, attributes, elements, Aria, Ruby, data attributes, tidy, indent, auto-indent, prettify, pretty print, htmLawed_README.txt, rTxt2htm, PHP Labware" />
<style type="text/css" media="all">
<!--/*--><![CDATA[/*><!--*/
a {text-decoration:none; color: blue;}
a:hover {color: red;}
a:visited {color: blue;}
body {margin: 0; padding: 0;}
body, div, html, p {font-family: Georgia, 'Times new roman', Times;}
code.code {font-family: 'Bitstream vera sans mono', 'Courier New', 'Courier', monospace;}
div.comment {padding: 5px; color: #999999; font-size: 80%;}
div.comment a {color: #6699cc;}
div#body {width: 70%; margin: 5px; padding: 5px;} /* holds non-toc content */
div#toc {position: fixed; top: 5px; left: 73%; z-index: 2; margin-top: 5px; margin-left: 5px; border: 1px solid gray; padding: 5px; background-color: #ededed; width: 23%; overflow: auto; max-height:94%; font-size: 90%;} /* holds content table (toc) */
div#top {font-size: 14px; margin: 5px; padding: 5px;} /* holds all content */
div.monospace {overflow: auto; font-family: 'Bitstream vera sans mono', 'Courier New', 'Courier', monospace;}
div.sub-section {padding-left: 15px;}
div.sub-sub-section {padding-left: 30px;}
h1 {font-size: 22px; margin-top: 5px; margin-bottom: 5px;}
h2 {font-size: 20px; float: left; margin-top: 15px; margin-bottom: 5px;}
h3 {font-size: 18px; float: left; margin-top: 15px; margin-bottom: 5px;}
h4 {font-size: 16px; float: left; margin-top: 15px; margin-bottom: 5px;}
hr {margin-top: 15px; margin-bottom: 5px;}
input, textarea {font-family: 'Bitstream vera sans mono', 'Courier New', 'Courier', monospace;}
p.subtle {color: gray; padding: 0; padding-top: 10px; margin: 0;}
p.subtle a, p.subtle a:visited {color: #6699cc;}
span.item-no {color: black;}
span.subtle {color: gray; margin: 0; padding:0;}
span.subtle a, span.subtle a:visited {color: #6699cc;}
span.term {font-family: 'Bitstream vera sans mono', 'Courier New', 'Courier', monospace;}
span.toc-item {color: black;}
span.totop {float: right; margin-top: 15px; margin-bottom: 5px;}
span.totop a, span.totop a:visited {color: #6699cc;}
@media screen { /* fixes for old IE */
* html, * html body {overflow-y: auto!important; height: 100%; margin: 0; padding: 0;}
* html div#body {height: 100%; overflow-y: auto; position: relative;}
* html div#toc {position: absolute;}
}
/*]]>*/-->
</style>
<title>htmLawed documentation | htmLawed PHP software is a free, open-source, customizable HTML input purifier and filter</title>
</head>
<body>
<div id="top">
<h1><a id="peak" name="peak"></a>htmLawed documentation</h1>
<div id="toc"><span class="toc-item"><a href="#s1"><span class="item-no">1</span>  About htmLawed</a></span><br />
  <span class="toc-item"><a href="#s1.1"><span class="item-no">1.1</span>  Example uses</a></span><br />
  <span class="toc-item"><a href="#s1.2"><span class="item-no">1.2</span>  Features</a></span><br />
  <span class="toc-item"><a href="#s1.3"><span class="item-no">1.3</span>  History</a></span><br />
  <span class="toc-item"><a href="#s1.4"><span class="item-no">1.4</span>  License & copyright</a></span><br />
  <span class="toc-item"><a href="#s1.5"><span class="item-no">1.5</span>  Terms used here</a></span><br />
  <span class="toc-item"><a href="#s1.6"><span class="item-no">1.6</span>  Availability</a></span><br />
<span class="toc-item"><a href="#s2"><span class="item-no">2</span>  Usage</a></span><br />
  <span class="toc-item"><a href="#s2.1"><span class="item-no">2.1</span>  Simple</a></span><br />
  <span class="toc-item"><a href="#s2.2"><span class="item-no">2.2</span>  Configuring htmLawed using the <span class="term">$config</span> argument</a></span><br />
  <span class="toc-item"><a href="#s2.3"><span class="item-no">2.3</span>  Extra HTML specifications using the <span class="term">$spec</span> argument</a></span><br />
  <span class="toc-item"><a href="#s2.4"><span class="item-no">2.4</span>  Performance time & memory usage</a></span><br />
  <span class="toc-item"><a href="#s2.5"><span class="item-no">2.5</span>  Some security risks to keep in mind</a></span><br />
  <span class="toc-item"><a href="#s2.6"><span class="item-no">2.6</span>  Use with <span class="term">kses()</span> code</a></span><br />
  <span class="toc-item"><a href="#s2.7"><span class="item-no">2.7</span>  Tolerance for ill-written HTML</a></span><br />
  <span class="toc-item"><a href="#s2.8"><span class="item-no">2.8</span>  Limitations & work-arounds</a></span><br />
  <span class="toc-item"><a href="#s2.9"><span class="item-no">2.9</span>  Examples of usage</a></span><br />
<span class="toc-item"><a href="#s3"><span class="item-no">3</span>  Details</a></span><br />
  <span class="toc-item"><a href="#s3.1"><span class="item-no">3.1</span>  Invalid/dangerous characters</a></span><br />
  <span class="toc-item"><a href="#s3.2"><span class="item-no">3.2</span>  Character references/entities</a></span><br />
  <span class="toc-item"><a href="#s3.3"><span class="item-no">3.3</span>  HTML elements</a></span><br />
    <span class="toc-item"><a href="#s3.3.1"><span class="item-no">3.3.1</span>  HTML comments & <span class="term">CDATA</span> sections</a></span><br />
    <span class="toc-item"><a href="#s3.3.2"><span class="item-no">3.3.2</span>  Tag-transformation for better compliance with standards</a></span><br />
    <span class="toc-item"><a href="#s3.3.3"><span class="item-no">3.3.3</span>  Tag balancing & proper nesting</a></span><br />
    <span class="toc-item"><a href="#s3.3.4"><span class="item-no">3.3.4</span>  Elements requiring child elements</a></span><br />
    <span class="toc-item"><a href="#s3.3.5"><span class="item-no">3.3.5</span>  Beautify or compact HTML</a></span><br />
    <span class="toc-item"><a href="#s3.3.6"><span class="item-no">3.3.6</span>  Custom elements</a></span><br />
  <span class="toc-item"><a href="#s3.4"><span class="item-no">3.4</span>  Attributes</a></span><br />
    <span class="toc-item"><a href="#s3.4.1"><span class="item-no">3.4.1</span>  Auto-addition of XHTML-required attributes</a></span><br />
    <span class="toc-item"><a href="#s3.4.2"><span class="item-no">3.4.2</span>  Duplicate/invalid <span class="term">id</span> values</a></span><br />
    <span class="toc-item"><a href="#s3.4.3"><span class="item-no">3.4.3</span>  URL schemes & scripts in attribute values</a></span><br />
    <span class="toc-item"><a href="#s3.4.4"><span class="item-no">3.4.4</span>  Absolute & relative URLs</a></span><br />
    <span class="toc-item"><a href="#s3.4.5"><span class="item-no">3.4.5</span>  Lower-cased, standard attribute values</a></span><br />
    <span class="toc-item"><a href="#s3.4.6"><span class="item-no">3.4.6</span>  Transformation of deprecated attributes</a></span><br />
    <span class="toc-item"><a href="#s3.4.7"><span class="item-no">3.4.7</span>  Anti-spam & <span class="term">href</span></a></span><br />
    <span class="toc-item"><a href="#s3.4.8"><span class="item-no">3.4.8</span>  Inline style properties</a></span><br />
    <span class="toc-item"><a href="#s3.4.9"><span class="item-no">3.4.9</span>  Hook function for tag content</a></span><br />
  <span class="toc-item"><a href="#s3.5"><span class="item-no">3.5</span>  Simple configuration directive for most valid XHTML</a></span><br />
  <span class="toc-item"><a href="#s3.6"><span class="item-no">3.6</span>  Simple configuration directive for most <em>safe</em> HTML</a></span><br />
  <span class="toc-item"><a href="#s3.7"><span class="item-no">3.7</span>  Using a hook function</a></span><br />
  <span class="toc-item"><a href="#s3.8"><span class="item-no">3.8</span>  Obtaining <em>finalized</em> parameter values</a></span><br />
  <span class="toc-item"><a href="#s3.9"><span class="item-no">3.9</span>  Retaining non-HTML tags in input with mixed markup</a></span><br />
<span class="toc-item"><a href="#s4"><span class="item-no">4</span>  Other</a></span><br />
  <span class="toc-item"><a href="#s4.1"><span class="item-no">4.1</span>  Support</a></span><br />
  <span class="toc-item"><a href="#s4.2"><span class="item-no">4.2</span>  Known issues</a></span><br />
  <span class="toc-item"><a href="#s4.3"><span class="item-no">4.3</span>  Change-log</a></span><br />
  <span class="toc-item"><a href="#s4.4"><span class="item-no">4.4</span>  Testing</a></span><br />
  <span class="toc-item"><a href="#s4.5"><span class="item-no">4.5</span>  Upgrade, & old versions</a></span><br />
  <span class="toc-item"><a href="#s4.6"><span class="item-no">4.6</span>  Comparison with <span class="term">HTMLPurifier</span></a></span><br />
  <span class="toc-item"><a href="#s4.7"><span class="item-no">4.7</span>  Use through application plug-ins/modules</a></span><br />
  <span class="toc-item"><a href="#s4.8"><span class="item-no">4.8</span>  Use in non-PHP applications</a></span><br />
  <span class="toc-item"><a href="#s4.9"><span class="item-no">4.9</span>  Donate</a></span><br />
  <span class="toc-item"><a href="#s4.10"><span class="item-no">4.10</span>  Acknowledgements</a></span><br />
<span class="toc-item"><a href="#s5"><span class="item-no">5</span>  Appendices</a></span><br />
  <span class="toc-item"><a href="#s5.1"><span class="item-no">5.1</span>  Characters discouraged in HTML</a></span><br />
  <span class="toc-item"><a href="#s5.2"><span class="item-no">5.2</span>  Valid attribute-element combinations</a></span><br />
  <span class="toc-item"><a href="#s5.3"><span class="item-no">5.3</span>  CSS 2.1 properties accepting URLs</a></span><br />
  <span class="toc-item"><a href="#s5.4"><span class="item-no">5.4</span>  Microsoft Windows 1252 character replacements</a></span><br />
  <span class="toc-item"><a href="#s5.5"><span class="item-no">5.5</span>  URL format</a></span><br />
  <span class="toc-item"><a href="#s5.6"><span class="item-no">5.6</span>  Brief on htmLawed code</a></span></div><!-- ended div toc -->
<div id="body">
<br />
<div class="comment">htmLawed_README.txt, 4 August 2023<br />
htmLawed 1.2.15<br />
Copyright Santosh Patnaik<br />
Dual licensed with LGPL 3 and GPL 2+<br />
A PHP Labware internal utility - <a href="https://bioinformatics.org/phplabware/internal_utilities/htmLawed">https://bioinformatics.org/phplabware/internal_utilities/htmLawed</a> </div>
<br />
<div class="section"><h2>
<a name="s1" id="s1"></a><span class="item-no">1</span>  About htmLawed
</h2><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  htmLawed is a PHP script to process text with HTML markup to make it more compliant with HTML standards and with administrative policies. It works by making HTML well-formed with balanced and properly nested tags, neutralizing code that introduces a security vulnerability or is used for cross-site scripting (XSS) attacks, allowing only specified HTML tags and attributes, and so on. Such <em>lawing in</em> of HTML code ensures that it is in accordance with the aesthetics, safety and usability requirements set by administrators.<br />
<br />
  htmLawed is highly customizable, and fast with low memory usage. Its free and open-source code is in one small file. It does not require extensions or libraries, and works in older versions of PHP as well. It is a good alternative to the HTML <a href="http://tidy.sourceforge.net">Tidy</a> application.<br />
<div class="sub-section"><h3>
<a name="s1.1" id="s1.1"></a><span class="item-no">1.1</span>  Example uses
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  *  Filtering of text submitted as comments on blogs to allow only certain HTML elements<br />
<br />
  *  Making RSS newsfeed items standard-compliant: often one uses an excerpt from an HTML document for the content, and with unbalanced tags, non-numerical entities, etc., such excerpts may not be XML-compliant<br />
<br />
  *  Beautifying or pretty-printing HTML code<br />
<br />
  *  Text processing for stricter XML standard-compliance: e.g., the <span class="term">x</span> in hexadecimal numeric entities must be lowercase if an HTML document with MathML content is to be served as <span class="term">application/xml</span><br />
<br />
  *  Scraping text from web-pages<br />
<br />
  *  Transforming an HTML element to another<br />
</div>
<div class="sub-section"><h3>
<a name="s1.2" id="s1.2"></a><span class="item-no">1.2</span>  Features
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  Key: <span class="term">*</span> security feature, <span class="term">^</span> standard compliance, <span class="term">~</span> requires setting the right options<br />
<br />
  htmLawed:<br />
<br />
  *  makes input more <strong>secure</strong> and <strong>standard-compliant</strong> for HTML as well as generic <strong>XML</strong> documents  ^<br />
  *  supports markup for <strong>HTML 5</strong>, <strong>custom elements</strong>, and <strong>microdata, ARIA, Ruby, custom attributes</strong>, etc.  ^<br />
  *  can <strong>beautify</strong> or <strong>compact</strong> HTML  ~<br />
  *  works with input of almost any <strong>character encoding</strong> and does not affect it<br />
  *  has good <strong>tolerance for ill-written HTML</strong><br />
<br />
  *  can enforce <strong>restricted use of elements</strong>  *~<br />
  *  ensures proper closure of empty elements like <span class="term">img</span>  ^<br />
  *  <strong>transforms deprecated elements</strong> like <span class="term">font</span>  ^~<br />
  *  can permit HTML <strong>comments</strong> and <strong>CDATA</strong> sections  ^~<br />
  *  can permit all elements, including <span class="term">script</span>, <span class="term">object</span> and <span class="term">form</span>  ~<br />
<br />
  *  can <strong>restrict attributes by element</strong>  ^~<br />
  *  removes <strong>invalid attributes</strong>  ^<br />
  *  lower-cases element and attribute names  ^<br />
  *  provides <strong>required attributes</strong>, like <span class="term">alt</span> for <span class="term">img</span>  ^<br />
  *  <strong>transforms deprecated attributes</strong>  ^~<br />
  *  ensures attributes are <strong>declared only once</strong>  ^<br />
  *  permits <strong>custom</strong>, non-standard attributes as well as custom rules for standard attributes  ~<br />
<br />
  *  declares value for <em>empty</em> (<em>minimized</em> or <em>boolean</em>) attributes like <span class="term">checked</span>  ^<br />
  *  checks for potentially dangerous attribute values  *~<br />
  *  ensures <strong>unique</strong> <span class="term">id</span> attribute values  ^~<br />
  *  <strong>double-quotes</strong> attribute values  ^<br />
  *  lower-cases <strong>standard attribute values</strong> like <span class="term">password</span>  ^<br />
<br />
  *  can restrict <strong>URL protocol/scheme by attribute</strong>  *~<br />
  *  can disable <strong>dynamic expressions</strong> in <span class="term">style</span> values  *~<br />
<br />
  *  neutralizes invalid named <strong>character entities</strong>  ^<br />
  *  converts hexadecimal numeric entities to decimal ones, or vice versa  ^~<br />
  *  converts named entities to numeric ones for generic XML use  ^~<br />
<br />
  *  removes <strong>null</strong> characters  *<br />
  *  neutralizes potentially dangerous proprietary Netscape <strong>Javascript entities</strong>  *<br />
  *  replaces potentially dangerous <strong>soft-hyphen</strong> character in URL-accepting attribute values with spaces  *<br />
<br />
  *  removes common <strong>invalid characters</strong> not allowed in HTML or XML  ^<br />
  *  replaces <strong>characters from Microsoft applications</strong> like <span class="term">Word</span> that are discouraged in HTML or XML  ^~<br />
  *  neutralizes entities for characters invalid or discouraged in HTML or XML  ^<br />
  *  appropriately neutralizes <span class="term">&lt;</span>, <span class="term">&amp;</span>, <span class="term">&quot;</span>, and <span class="term">&gt;</span> characters  ^*<br />
<br />
  *  understands improperly spaced tag content (e.g., spread over more than one line) and spaces it properly<br />
  *  attempts to <strong>balance tags</strong> for well-formedness  ^~<br />
  *  understands when <strong>omissible closing tags</strong> like <span class="term">&lt;/p&gt;</span> are missing  ^~<br />
  *  attempts to permit only <strong>validly nested tags</strong>  ^~<br />
  *  can <strong>either remove or neutralize bad content</strong> ^~<br />
  *  attempts to <strong>rectify common errors of plain-text misplacement</strong> (e.g., directly inside <span class="term">blockquote</span>) ^~<br />
<br />
  *  has optional <strong>anti-spam</strong> measures such as addition of <span class="term">rel="nofollow"</span> and link-disabling  ~<br />
  *  optionally makes <strong>relative URLs absolute</strong>, and vice versa  ~<br />
<br />
  *  optionally marks <span class="term">&amp;</span> characters in the original input so that the entities for <span class="term">&amp;</span>, <span class="term">&lt;</span> and <span class="term">&gt;</span> introduced by htmLawed can be identified  ~<br />
<br />
  *  allows deployment of powerful <strong>hook functions</strong> to <strong>inject</strong> HTML, <strong>consolidate</strong> <span class="term">style</span> attributes to <span class="term">class</span>, finely check attribute values, etc.  ~<br />
</div>
<div class="sub-section"><h3>
<a name="s1.3" id="s1.3"></a><span class="item-no">1.3</span>  History
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  htmLawed was created in 2007 for use with <span class="term">LabWiki</span>, a wiki software developed at PHP Labware, because no suitable software could be found. Existing PHP software like <span class="term">Kses</span> and <span class="term">HTMLPurifier</span> was deemed inadequate, slow, resource-intensive, or dependent on an extension or external application like <span class="term">HTML Tidy</span>. The core logic of htmLawed, that of identifying HTML elements and attributes, was based on the <span class="term">Kses</span> (version 0.2.2) HTML filter software of Ulf Harnhammar (it can still be used with code that uses <span class="term">Kses</span>; see <a href="#s2.6">section 2.6</a>). Support for HTML version 5 was added in May 2013 in a beta and in February 2017 in a production release.<br />
<br />
  See <a href="#s4.3">section 4.3</a> for a detailed log of changes in htmLawed over the years, and <a href="#s4.10">section 4.10</a> for acknowledgements.<br />
</div>
<div class="sub-section"><h3>
<a name="s1.4" id="s1.4"></a><span class="item-no">1.4</span>  License & copyright
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  htmLawed is free and open-source software, copyrighted by Santosh Patnaik, MD, PhD, and dual-licensed with LGPL version <a href="http://www.gnu.org/licenses/lgpl-3.0.txt">3</a>, and GPL version <a href="http://www.gnu.org/licenses/gpl-2.0.txt">2</a> (or later) licenses.<br />
</div>
<div class="sub-section"><h3>
<a name="s1.5" id="s1.5"></a><span class="item-no">1.5</span>  Terms used here
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  In this document, only HTML body-level elements are considered. htmLawed does not have support for head-level elements, <span class="term">body</span>, and the frame-level elements, <span class="term">frameset</span>, <span class="term">frame</span> and <span class="term">noframes</span>, and these elements are ignored here.<br />
<br />
  *  <em>administrator</em> - or admin; person setting up the code that utilizes htmLawed; also, <em>user</em><br />
  *  <em>attributes</em> - name-value pairs like <span class="term">href="http://x.com"</span> in opening tags<br />
  *  <em>author</em> - see <em>writer</em><br />
  *  <em>character</em> - atomic unit of text; internally represented by a numeric <em>code-point</em> as specified by the <em>encoding</em> or <em>charset</em> in use<br />
  *  <em>entity</em> - markup like <span class="term">&amp;gt;</span> and <span class="term">&amp;#160;</span> used to refer to a character<br />
  *  <em>element</em> - HTML element like <span class="term">a</span> and <span class="term">img</span><br />
  *  <em>element content</em> - content between the opening and closing tags of an element, like <span class="term">click</span> of the <span class="term">&lt;a href="x"&gt;click&lt;/a&gt;</span> element<br />
  *  <em>HTML</em> - implies XHTML unless specified otherwise<br />
  *  <em>HTML body</em> - content in the <em>body</em> container of an HTML document<br />
  *  <em>input</em> - text given to htmLawed to process<br />
  *  <em>legal</em> - standard-compliant; also, <em>valid</em><br />
  *  <em>processing</em> - involves filtering, correction, etc., of input<br />
  *  <em>safe</em> - absence or reduction, in the HTML of a text, of certain characters, elements and attributes that could otherwise, depending on circumstances, expose readers of the text to security vulnerabilities like cross-site scripting (XSS) attacks<br />
  *  <em>scheme</em> - a URL protocol like <span class="term">http</span> and <span class="term">ftp</span><br />
  *  <em>specification</em> - detailed description including rules that define HTML<br />
  *  <em>standard</em> - widely accepted specification<br />
  *  <em>style property</em> - terms like <span class="term">border</span> and <span class="term">height</span> for which declarations are made in values for the <span class="term">style</span> attribute of elements<br />
  *  <em>tag</em> - markers like <span class="term">&lt;a href="x"&gt;</span> and <span class="term">&lt;/a&gt;</span> delineating element content; the opening tag can contain attributes<br />
  *  <em>tag content</em> - consists of tag markers <span class="term">&lt;</span> and <span class="term">&gt;</span>, element names like <span class="term">div</span>, and possibly attributes<br />
  *  <em>user</em> - administrator<br />
  *  <em>valid</em> - see <em>legal</em><br />
  *  <em>writer</em> - end-user like a blog commenter providing the input that is to be processed; also, <em>author</em><br />
  *  <em>XHTML</em> - XML-compliant HTML; parsing rules for XHTML are more strict than for regular HTML<br />
</div>
<div class="sub-section"><h3>
<a name="s1.6" id="s1.6"></a><span class="item-no">1.6</span>  Availability
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  htmLawed can be downloaded for free at its <a href="https://bioinformatics.org/phplabware/internal_utilities/htmLawed">website</a>. Besides the <span class="term">htmLawed.php</span> file, the download has the htmLawed documentation (this document) in plain <a href="htmLawed_README.txt">text</a> and <a href="htmLawed_README.htm">HTML</a> formats, a script for <a href="htmLawedTest.php">testing</a>, and a text file for <a href="htmLawed_TESTCASE.txt">test-cases</a>. htmLawed can be installed with Composer, and is also available as a PHP class (OOP code) – see the <a href="https://bioinformatics.org/phplabware/internal_utilities/htmLawed">website</a>. Official htmLawed releases are also put up on <a href="https://sourceforge.net/projects/htmlawed/">Sourceforge</a>.<br />
</div>
</div>
<div class="section"><h2>
<a name="s2" id="s2"></a><span class="item-no">2</span>  Usage
</h2><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  htmLawed works in PHP version 4.4 or higher. Either <span class="term">include()</span> the <span class="term">htmLawed.php</span> file, or copy-paste the entire code.<br />
<br />
  To use with PHP 4.3, have the following code included:<br />
<br />
<code class="code">    if(!function_exists('ctype_digit')){</code>
<br />
<code class="code">     function ctype_digit($var){</code>
<br />
<code class="code">      return ((int) $var == $var);</code>
<br />
<code class="code">     }</code>
<br />
<code class="code">    }</code>
<br />
<div class="sub-section"><h3>
<a name="s2.1" id="s2.1"></a><span class="item-no">2.1</span>  Simple
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  The input text to be processed, <span class="term">$text</span>, is passed as an argument of type string; <span class="term">htmLawed()</span> returns the processed string:<br />
<br />
<code class="code">    $processed = htmLawed($text);</code>
<br />
<br />
  With the <span class="term">htmLawed class</span> (<a href="#s1.6">section 1.6</a>), usage is:<br />
<br />
<code class="code">    $processed = htmLawed::hl($text);</code>
<br />
<br />
  <strong>Notes</strong>: (1) If input is from a <span class="term">$_GET</span> or <span class="term">$_POST</span> value, and <span class="term">magic quotes</span> are enabled on the PHP setup, run <span class="term">stripslashes()</span> on the input before passing to htmLawed. (2) htmLawed does not have support for head-level elements, <span class="term">body</span>, and the frame-level elements, <span class="term">frameset</span>, <span class="term">frame</span> and <span class="term">noframes</span>.<br />
<br />
  By default, htmLawed will process the text allowing all valid HTML elements/tags and commonly used URL schemes and CSS style properties. It will allow Javascript code, <span class="term">CDATA</span> sections and HTML comments, balance tags, and ensure proper nesting of elements. Such actions can be configured using two other optional arguments -- <span class="term">$config</span> and <span class="term">$spec</span>:<br />
<br />
<code class="code">    $processed = htmLawed($text, $config, $spec);</code>
<br />
<br />
  The <span class="term">$config</span> and <span class="term">$spec</span> arguments are detailed below. Some examples are shown in <a href="#s2.9">section 2.9</a>. For maximum protection against <span class="term">XSS</span> and other security vulnerabilities, consider using the <span class="term">safe</span> parameter; see <a href="#s3.6">section 3.6</a>.<br />
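<br />
  The call described above can be wrapped so that untrusted input is always filtered with the <span class="term">safe</span> parameter. The following is a minimal sketch and not part of htmLawed: the function name <span class="term">filter_comment()</span>, its <span class="term">$filter</span> argument, and the fallback of escaping all markup when <span class="term">htmLawed()</span> has not been loaded are assumptions made for illustration.<br />
<br />

```php
<?php
// Hypothetical wrapper around htmLawed(); the name and the fallback are
// illustrative assumptions, not part of the htmLawed distribution.
function filter_comment($text, $filter = 'htmLawed') {
    // 'safe' => 1 is the parameter the documentation suggests for
    // protection against XSS (section 3.6).
    $config = array('safe' => 1);

    if (is_callable($filter)) {
        // Normal path: htmLawed.php has been included, so delegate to it.
        return call_user_func($filter, $text, $config);
    }

    // htmLawed is not available: escape everything instead of letting
    // unfiltered markup through.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
```

<br />
  With <span class="term">htmLawed.php</span> included, <span class="term">filter_comment($text)</span> would be equivalent to <span class="term">htmLawed($text, array('safe'=>1))</span>; passing the filter by name keeps the wrapper usable, and testable, when the library file is missing.<br />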
</div>
<div class="sub-section"><h3>
<a name="s2.2" id="s2.2"></a><span class="item-no">2.2</span>  Configuring htmLawed using the <span class="term">$config</span> argument
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  <span class="term">$config</span> instructs htmLawed on how to tackle certain tasks. When <span class="term">$config</span> is not specified, or not set as an array (e.g., <span class="term">$config = 1</span>), htmLawed will take default actions. One or many of the task-action or parameter-value pairs can be specified in <span class="term">$config</span> as array key-value pairs. If a parameter is not specified, htmLawed will use the default value for it, indicated further below. In PHP code, parameter values that are integers should not be quoted and should be used as numeric types (unless meant as string/text). Thus, for instance:<br />
<br />
<code class="code">    $config = array('comment'=>0, 'cdata'=>1, 'elements'=>'a, b, strong');</code>
<br />
<code class="code">    $processed = htmLawed($text, $config);</code>
<br />
<br />
  Below are the various parameters that can be specified in <span class="term">$config</span>.<br />
<br />
  Key: <span class="term">*</span> default, <span class="term">^</span> different from htmLawed versions below 1.2, <span class="term">~</span> different default when <span class="term">valid_xhtml</span> is set to <span class="term">1</span> (see <a href="#s3.5">section 3.5</a>), <span class="term">"</span> different default when <span class="term">safe</span> is set to <span class="term">1</span> (see <a href="#s3.6">section 3.6</a>)<br />
<br />
  <strong>abs_url</strong><br />
  Make URLs absolute or relative; <span class="term">$config["base_url"]</span> needs to be set; see <a href="#s3.4.4">section 3.4.4</a><br />
<br />
  <span class="term">-1</span> - make relative<br />
  <span class="term">0</span> - no action  *<br />
  <span class="term">1</span> - make absolute<br />
<br />
  <strong>and_mark</strong><br />
  Mark <span class="term">&</span> characters in the original input; see <a href="#s3.2">section 3.2</a><br />
<br />
  <strong>anti_link_spam</strong><br />
  Anti-link-spam measure; see <a href="#s3.4.7">section 3.4.7</a><br />
<br />
  <span class="term">0</span> - no measure taken  *<br />
  <em>array("regex1", "regex2")</em> - will ensure a <span class="term">rel</span> attribute with <span class="term">nofollow</span> in its value in case the <span class="term">href</span> attribute value matches the regular expression pattern <span class="term">regex1</span>, and/or will remove <span class="term">href</span> if its value matches the regular expression pattern <span class="term">regex2</span>. E.g., <span class="term">array("/./", "/://\W*(?!(abc\.com|xyz\.org))/")</span>; see <a href="#s3.4.7">section 3.4.7</a> for more.<br />
<br />
  <strong>anti_mail_spam</strong><br />
  Anti-mail-spam measure; see <a href="#s3.4.7">section 3.4.7</a><br />
<br />
  <span class="term">0</span> - no measure taken  *<br />
  <em>word</em> - <span class="term">@</span> in mail address in <span class="term">href</span> attribute value is replaced with specified <em>word</em><br />
<br />
  <strong>any_custom_element</strong><br />
  Permit any custom element; regardless of this setting, specific custom elements can be denied or permitted through <span class="term">$config["elements"]</span>; see <a href="#s3.3.6">section 3.3.6</a><br />
<br />
  <span class="term">0</span> - no<br />
  <span class="term">1</span> - yes  *<br />
<br />
  <strong>balance</strong><br />
  Balance tags for well-formedness and proper nesting; see <a href="#s3.3.3">section 3.3.3</a><br />
<br />
  <span class="term">0</span> - no<br />
  <span class="term">1</span> - yes  *<br />
<br />
  <strong>base_url</strong><br />
  Base URL value that needs to be set if <span class="term">$config["abs_url"]</span> is not <span class="term">0</span>; see <a href="#s3.4.4">section 3.4.4</a><br />
<br />
  <strong>cdata</strong><br />
  Handling of <span class="term">CDATA</span> sections; see <a href="#s3.3.1">section 3.3.1</a><br />
<br />
  <span class="term">0</span> - don't consider <span class="term">CDATA</span> sections as markup and proceed as if plain text  "<br />
  <span class="term">1</span> - remove<br />
  <span class="term">2</span> - allow, but neutralize any <span class="term"><</span>, <span class="term">></span>, and <span class="term">&</span> inside by converting them to named entities<br />
  <span class="term">3</span> - allow  *<br />
<br />
  <strong>clean_ms_char</strong><br />
  Replace <em>discouraged</em> characters introduced by Microsoft Word, etc.; see <a href="#s3.1">section 3.1</a><br />
<br />
  <span class="term">0</span> - no  *<br />
  <span class="term">1</span> - yes<br />
  <span class="term">2</span> - yes, but replace special single & double quotes with ordinary ones<br />
<br />
  <strong>comment</strong><br />
  Handling of HTML comments; see <a href="#s3.3.1">section 3.3.1</a><br />
<br />
  <span class="term">0</span> - don't consider comments as markup and proceed as if plain text  "<br />
  <span class="term">1</span> - remove<br />
  <span class="term">2</span> - allow, but neutralize any <span class="term"><</span>, <span class="term">></span>, and <span class="term">&</span> inside by converting to named entities<br />
  <span class="term">3</span> - allow  *<br />
<br />
  <strong>css_expression</strong><br />
  Allow dynamic CSS expression by not removing the expression from CSS property values in <span class="term">style</span> attributes; see <a href="#s3.4.8">section 3.4.8</a><br />
<br />
  <span class="term">0</span> - remove  *<br />
  <span class="term">1</span> - allow<br />
<br />
  <strong>deny_attribute</strong><br />
  Denied HTML attributes; see <a href="#s3.4">section 3.4</a><br />
<br />
  <span class="term">0</span> - none  *<br />
  <em>string</em> - dictated by values in <em>string</em><br />
  <span class="term">on*</span> - on* event attributes like <span class="term">onfocus</span> not allowed  "<br />
<br />
  <strong>direct_nest_list</strong><br />
  Allow direct nesting of a list within another without requiring it to be a list item; see <a href="#s3.3.4">section 3.3.4</a><br />
<br />
  <span class="term">0</span> - no  *<br />
  <span class="term">1</span> - yes<br />
<br />
  <strong>elements</strong><br />
  Allowed HTML elements; see <a href="#s3.3">section 3.3</a><br />
<br />
  <em>all</em> - *^<br />
  <span class="term">* -acronym -big -center -dir -font -isindex -s -strike -tt</span> -  ~^<br />
  <em>applet, audio, canvas, dialog, embed, iframe, object, script, and video elements not allowed</em> -  "^<br />
<br />
  <strong>hexdec_entity</strong><br />
  Allow hexadecimal numeric entities and do not convert to the more widely accepted decimal ones, or convert decimal to hexadecimal ones; see <a href="#s3.2">section 3.2</a><br />
<br />
  <span class="term">0</span> - no<br />
  <span class="term">1</span> - yes  *<br />
  <span class="term">2</span> - convert decimal to hexadecimal ones<br />
<br />
  <strong>hook</strong><br />
  Name of an optional hook function to alter the input string, <span class="term">$config</span> or <span class="term">$spec</span> before htmLawed enters the main phase of its work; see <a href="#s3.7">section 3.7</a><br />
<br />
  <span class="term">0</span> - no hook function  *<br />
  <em>name</em> - <em>name</em> is name of the hook function<br />
<br />
  <strong>hook_tag</strong><br />
  Name of an optional hook function to alter tag content finalized by htmLawed; see <a href="#s3.4.9">section 3.4.9</a><br />
<br />
  <span class="term">0</span> - no hook function  *<br />
  <em>name</em> - <em>name</em> is name of the hook function<br />
<br />
  <strong>keep_bad</strong><br />
  Neutralize <em>bad</em> tags by converting their <span class="term"><</span> and <span class="term">></span> characters to entities, or remove them; see <a href="#s3.3.3">section 3.3.3</a><br />
<br />
  <span class="term">0</span> - remove<br />
  <span class="term">1</span> - neutralize both tags and element content<br />
  <span class="term">2</span> - remove tags but neutralize element content<br />
  <span class="term">3</span> and <span class="term">4</span> - like <span class="term">1</span> and <span class="term">2</span> but remove if text (<span class="term">pcdata</span>) is invalid in parent element<br />
  <span class="term">5</span> and <span class="term">6</span> * -  like <span class="term">3</span> and <span class="term">4</span> but line-breaks, tabs and spaces are left<br />
<br />
  <strong>lc_std_val</strong><br />
  For XHTML compliance, predefined, standard attribute values, like <span class="term">get</span> for the <span class="term">method</span> attribute of <span class="term">form</span>, must be lowercased; see <a href="#s3.4.5">section 3.4.5</a><br />
<br />
  <span class="term">0</span> - no<br />
  <span class="term">1</span> - yes  *<br />
<br />
  <strong>make_tag_strict</strong><br />
  Transform or remove these deprecated HTML elements, even if they are allowed by the admin: acronym, applet, big, center, dir, font, isindex, s, strike, tt; see <a href="#s3.3.2">section 3.3.2</a><br />
<br />
  <span class="term">0</span> - no<br />
  <span class="term">1</span> - yes, but leave <span class="term">applet</span> and <span class="term">isindex</span> that currently cannot be transformed  *^<br />
  <span class="term">2</span> - yes, removing <span class="term">applet</span> and <span class="term">isindex</span> elements and their contents (nested elements remain)  ~^<br />
<br />
  <strong>named_entity</strong><br />
  Allow non-universal named HTML entities, or convert to numeric ones; see <a href="#s3.2">section 3.2</a><br />
<br />
  <span class="term">0</span> - convert<br />
  <span class="term">1</span> - allow  *<br />
<br />
  <strong>no_deprecated_attr</strong><br />
  Allow deprecated attributes or transform them; see <a href="#s3.4.6">section 3.4.6</a><br />
<br />
  <span class="term">0</span> - allow<br />
  <span class="term">1</span> - transform, but <span class="term">name</span> attributes for <span class="term">a</span> and <span class="term">map</span> are retained  *<br />
  <span class="term">2</span> - transform<br />
<br />
  <strong>parent</strong><br />
  Name of the parent element, possibly imagined, that will hold the input; see <a href="#s3.3">section 3.3</a><br />
<br />
  <strong>safe</strong><br />
  Magic parameter to make input the most secure against vulnerabilities like XSS without needing to specify other relevant <span class="term">$config</span> parameters; see <a href="#s3.6">section 3.6</a><br />
<br />
  <span class="term">0</span> - no  *<br />
  <span class="term">1</span> - will auto-adjust other relevant <span class="term">$config</span> parameters (indicated by <span class="term">"</span> in this list)  ^<br />
<br />
  <strong>schemes</strong><br />
  Array of attribute-specific, comma-separated, lower-cased list of schemes (protocols) allowed in attributes accepting URLs (or <span class="term">!</span> to <em>deny</em> any URL); <span class="term">*</span> covers all unspecified attributes; see <a href="#s3.4.3">section 3.4.3</a><br />
<br />
  <span class="term">href: aim, app, feed, file, ftp, gopher, http, https, javascript, irc, mailto, news, nntp, sftp, ssh, tel, telnet, ws, wss; *:data, file, http, https, javascript, ws, wss</span>  *^<br />
  <span class="term">href: aim, feed, file, ftp, gopher, http, https, irc, mailto, news, nntp, sftp, ssh, tel, telnet, ws, wss; style: !; *:file, http, https, ws, wss</span>  "<br />
<br />
  <strong>show_setting</strong><br />
  Name of a PHP variable to assign the <em>finalized</em> <span class="term">$config</span> and <span class="term">$spec</span> values; see <a href="#s3.8">section 3.8</a><br />
<br />
  <strong>style_pass</strong><br />
  Ignore <span class="term">style</span> attribute values, letting them through without any alteration<br />
<br />
  <span class="term">0</span> - no *<br />
  <span class="term">1</span> - htmLawed will let through any <span class="term">style</span> value; see <a href="#s3.4.8">section 3.4.8</a><br />
<br />
  <strong>tidy</strong><br />
  Beautify or compact HTML code; see <a href="#s3.3.5">section 3.3.5</a><br />
<br />
  <span class="term">-1</span> - compact<br />
  <span class="term">0</span> - no  *<br />
  <span class="term">1</span> or <em>string</em> - beautify (custom format specified by <span class="term">string</span>)<br />
<br />
  <strong>unique_ids</strong><br />
  <span class="term">id</span> attribute value checks; see <a href="#s3.4.2">section 3.4.2</a><br />
<br />
  <span class="term">0</span> - no<br />
  <span class="term">1</span> - remove duplicate and/or invalid ones  *<br />
  <em>word</em> - remove invalid ones and replace duplicate ones with new and unique ones based on the <em>word</em>; the admin-specified <em>word</em> cannot contain a space character<br />
<br />
  <strong>valid_xhtml</strong><br />
  Magic parameter to make input the most valid XHTML without needing to specify other relevant <span class="term">$config</span> parameters; see <a href="#s3.5">section 3.5</a><br />
<br />
  <span class="term">0</span> - no  *<br />
  <span class="term">1</span> - will auto-adjust other relevant <span class="term">$config</span> parameters (indicated by <span class="term">~</span> in this list)<br />
<br />
  <strong>xml:lang</strong><br />
  Auto-add <span class="term">xml:lang</span> attribute; see <a href="#s3.4.1">section 3.4.1</a><br />
<br />
  <span class="term">0</span> - no  *<br />
  <span class="term">1</span> - add if <span class="term">lang</span> attribute is present<br />
  <span class="term">2</span> - add if <span class="term">lang</span> attribute is present, and remove <span class="term">lang</span>  ~<br />
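<br />
  The following sketch combines several of the parameters listed above in one <span class="term">$config</span> array. The base URL, the sample input and the path to <span class="term">htmLawed.php</span> are hypothetical; parameters that are left out keep their default values.<br />
<br />
<code class="code">    require_once 'htmLawed.php'; // assumed path; adjust as needed</code>
<br />
<code class="code">    // sample input with relative URLs and a duplicated id value</code>
<br />
<code class="code">    $text = '<a href="page.htm" id="x">A page</a> <a href="other.htm" id="x">Another</a>';</code>
<br />
<code class="code">    $config = array(</code>
<br />
<code class="code">     'abs_url'=>1, // make URLs absolute; needs base_url</code>
<br />
<code class="code">     'base_url'=>'https://example.com/articles/', // hypothetical base URL</code>
<br />
<code class="code">     'cdata'=>1, // remove CDATA sections</code>
<br />
<code class="code">     'comment'=>1, // remove HTML comments</code>
<br />
<code class="code">     'tidy'=>-1, // compact the output</code>
<br />
<code class="code">     'unique_ids'=>1 // the default, listed here only for clarity</code>
<br />
<code class="code">    );</code>
<br />
<code class="code">    $processed = htmLawed($text, $config);</code>
<br />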
</div>
<div class="sub-section"><h3>
<a name="s2.3" id="s2.3"></a><span class="item-no">2.3</span>  Extra HTML specifications using the <span class="term">$spec</span> parameter
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  The <span class="term">$spec</span> argument of htmLawed can be used to disallow an otherwise legal attribute for an element, or to restrict the attribute's values. This can also be helpful as a security measure (e.g., in certain versions of browsers, certain values can cause buffer overflows and denial of service attacks), or in enforcing admin policies. <span class="term">$spec</span> is specified as a string of text containing one or more <em>rules</em>, with multiple rules separated from each other by a semi-colon (<span class="term">;</span>). E.g.,<br />
<br />
<code class="code">    $spec = 'i=-*; td, tr=style, id, -*; a=id(match="/[a-z][a-z\d.:\-`"]*/i"/minval=2), href(maxlen=100/minlen=34); img=-width,-alt';</code>
<br />
<code class="code">    $processed = htmLawed($text, $config, $spec);</code>
<br />
<br />
  Or,<br />
<br />
<code class="code">    $processed = htmLawed($text, $config, 'i=-*; td, tr=style, id, -*; a=id(match="/[a-z][a-z\d.:\-`"]*/i"/minval=2), href(maxlen=100/minlen=34); img=-width,-alt');</code>
<br />
<br />
  A rule begins with an HTML <strong>element</strong> name(s) (<em>rule-element</em>), for which the rule applies, followed by an equal-to (=) sign. A rule-element may represent multiple elements if comma (,)-separated element names are used. E.g., <span class="term">th,td,tr=</span>.<br />
<br />
  The rest of the rule consists of comma-separated HTML <strong>attribute names</strong>, which can be the wildcard references <span class="term">*</span>, <span class="term">aria*</span>, <span class="term">data*</span>, and <span class="term">on*</span> for the sets of all standard, Aria, data-*, and event (on*) attributes, respectively. A minus (-) character before an attribute means that the attribute is not permitted inside the rule-element. E.g., <span class="term">-width</span>. To deny all attributes, <span class="term">-*</span> can be used. All Aria, data-*, and event (on*) attributes can similarly be denied using <span class="term">-aria*</span>, <span class="term">-data*</span>, and <span class="term">-on*</span>, respectively.<br />
<br />
  Following shows examples of rule excerpts with rule-element <span class="term">a</span> and the attributes that are being permitted:<br />
<br />
  *  <span class="term">a=</span> - all<br />
  *  <span class="term">a=id</span> - all<br />
  *  <span class="term">a=href, title, -id, -onclick</span> - all except <span class="term">id</span> and <span class="term">onclick</span><br />
  *  <span class="term">a=*, id, -id</span> - all except <span class="term">id</span><br />
  *  <span class="term">a=-*</span> - none<br />
  *  <span class="term">a=-*, href, title</span> - none except <span class="term">href</span> and <span class="term">title</span><br />
  *  <span class="term">a=-*, -id, href, title</span> - none except <span class="term">href</span> and <span class="term">title</span><br />
  *  <span class="term">a=-on*, -id, href, onclick, title</span> - all except <span class="term">id</span> and on* other than <span class="term">onclick</span><br />
<br />
  Rules regarding <strong>attribute values</strong> are optionally specified inside round brackets after attribute names – which cannot be wildcard references like <span class="term">*</span> or <span class="term">data*</span> – in solidus (/)-separated <em>parameter = value</em> pairs. E.g., <span class="term">title(maxlen=30/minlen=5)</span>. None or one or more of the following parameters may be specified:<br />
<br />
  *  <span class="term">oneof</span> - one or more choices separated by <span class="term">|</span> that the value should match; if only one choice is provided, then the value must match that choice; matching is case-sensitive<br />
<br />
  *  <span class="term">noneof</span> - one or more choices separated by <span class="term">|</span> that the value should not match; matching is case-sensitive<br />
<br />
  *  <span class="term">maxlen</span> and <span class="term">minlen</span> - upper and lower limits for the number of characters in the attribute value; specified in numbers<br />
<br />
  *  <span class="term">maxval</span> and <span class="term">minval</span> - upper and lower limits for the numerical value specified in the attribute value; specified in numbers<br />
<br />
  *  <span class="term">match</span> and <span class="term">nomatch</span> - pattern that the attribute value should or should not match; specified as PHP/PCRE-compatible regular expressions with delimiters and possibly modifiers (e.g., to specify case-sensitivity for matching)<br />
<br />
  *  <span class="term">default</span> - a value to force on the attribute if the value provided by the writer does not fit any of the specified parameters<br />
<br />
  If <span class="term">default</span> is not set and the attribute value does not satisfy any of the specified parameters, then the attribute is removed. The <span class="term">default</span> value can also be used to force all attribute declarations to take the same value (by getting the values declared illegal by setting, e.g., <span class="term">maxlen</span> to <span class="term">-1</span>).<br />
<br />
  Examples with <em>input</em> <span class="term"><input title="WIDTH" value="10em" /><input title="length" value="5" class="ic1 ic2" /></span> are shown below.<br />
<br />
  <em>Rule</em>: <span class="term">input=title(maxlen=60/minlen=6), value</span><br />
  <em>Output</em>: <span class="term"><input value="10em" /><input title="length" value="5" class="ic1 ic2" /></span><br />
<br />
  <em>Rule</em>: <span class="term">input=title(), value(maxval=8/default=6)</span><br />
  <em>Output</em>: <span class="term"><input title="WIDTH" value="6" /><input title="length" value="5" class="ic1 ic2" /></span><br />
<br />
  <em>Rule</em>: <span class="term">input=title(nomatch=%w.d%i), value(match=%em%/default=6em)</span><br />
  <em>Output</em>: <span class="term"><input value="10em" /><input title="length" value="6em" class="ic1 ic2" /></span><br />
<br />
  <em>Rule</em>: <span class="term">input=class(noneof=ic2|ic3/oneof=ic1|ic4), title(oneof=height|depth/default=depth), value(noneof=5|6)</span><br />
  <em>Output</em>: <span class="term"><input title="depth" value="10em" /><input title="depth" class="ic1" /></span><br />
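<br />
  The rules above can be tried out with a short script like the following sketch, which runs three of them against the same input. The path to <span class="term">htmLawed.php</span> is an assumption, <span class="term">$config</span> is passed as <span class="term">1</span> so that default actions apply, and <span class="term">htmlspecialchars()</span> is used only to make the result readable in a browser.<br />
<br />
<code class="code">    require_once 'htmLawed.php'; // assumed path; adjust as needed</code>
<br />
<code class="code">    $text = '<input title="WIDTH" value="10em" /><input title="length" value="5" class="ic1 ic2" />';</code>
<br />
<code class="code">    $rules = array(</code>
<br />
<code class="code">     'input=title(maxlen=60/minlen=6), value',</code>
<br />
<code class="code">     'input=title(), value(maxval=8/default=6)',</code>
<br />
<code class="code">     'input=title(nomatch=%w.d%i), value(match=%em%/default=6em)'</code>
<br />
<code class="code">    );</code>
<br />
<code class="code">    foreach($rules as $spec){</code>
<br />
<code class="code">     // a non-array $config makes htmLawed take its default actions</code>
<br />
<code class="code">     echo htmlspecialchars(htmLawed($text, 1, $spec)), "\n";</code>
<br />
<code class="code">    }</code>
<br />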
<br />
  <strong>Special characters</strong>: The characters <span class="term">;</span>, <span class="term">,</span>, <span class="term">/</span>, <span class="term">(</span>, <span class="term">)</span>, <span class="term">|</span>, <span class="term">~</span> and space have special meanings in the rules. Words in the rules that use such characters, or the characters themselves, should be <em>escaped</em> by enclosing in pairs of double-quotes (<span class="term">"</span>). A back-tick (<span class="term">`</span>) can be used to escape a literal <span class="term">"</span>. An example rule illustrating this is <span class="term">input=value(maxlen=30/match="/^\w/"/default="your `"ID`"")</span>.<br />
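<br />
  When such a rule is written in PHP code, a single-quoted string is a convenient choice, as sketched below: inside single quotes, PHP leaves the backslash of <span class="term">\w</span>, the double-quotes and the back-ticks untouched, so the rule reaches htmLawed exactly as typed. The variable <span class="term">$text</span> is assumed to hold the input.<br />
<br />
<code class="code">    // single-quoted: no PHP-level escaping is needed for \w, " or `</code>
<br />
<code class="code">    $spec = 'input=value(maxlen=30/match="/^\w/"/default="your `"ID`"")';</code>
<br />
<code class="code">    $processed = htmLawed($text, 1, $spec);</code>
<br />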
<br />
  <strong>Attributes that accept multiple values</strong>: If an attribute is <span class="term">accesskey</span>, <span class="term">class</span>, <span class="term">itemtype</span> or <span class="term">rel</span>, or <span class="term">archive</span> in case of <span class="term">object</span> element, which can have multiple, space-separated values, or <span class="term">archive</span> in case of <span class="term">object</span> element and <span class="term">srcset</span>, which can have multiple, comma-separated values, htmLawed will parse the attribute value for such multiple values and will individually test each of them. The parsing is performed after any URL assessment of the attribute values (<a href="#s3.4.3">section 3.4.3</a>).<br />
<br />
  <strong>Note</strong>: To deny an attribute for all elements for which it is legal, <span class="term">$config["deny_attribute"]</span> (see <a href="#s3.4">section 3.4</a>) can be used instead of <span class="term">$spec</span>. Also, attributes can be allowed element-specifically through <span class="term">$spec</span> while being denied globally through <span class="term">$config["deny_attribute"]</span>. The <span class="term">hook_tag</span> parameter (<a href="#s3.4.9">section 3.4.9</a>) can also be possibly used to implement a functionality like that achieved using <span class="term">$spec</span> functionality.<br />
<br />
  <strong>Note</strong>: Attributes permitted through <span class="term">$spec</span> are permitted regardless of any denial through <span class="term">$config</span>. An attribute for which <span class="term">$spec</span> indicates both permission and denial will be permitted. E.g., <span class="term">onclick</span> with <span class="term">$spec</span> value of <span class="term">a = *, -onclick, onclick</span>, <span class="term">a = -on*, onclick</span> or <span class="term">a = on*, -onclick</span> will be permitted inside <span class="term">a</span>.<br />
<br />
  <strong>Note</strong>: Attributes' specifications for an element may be (inadvertently) set through multiple rules. In case of conflict, the attribute specification in the first rule will get precedence.<br />
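<br />
  The interplay between <span class="term">$config["deny_attribute"]</span> and <span class="term">$spec</span> described in the notes above can be sketched as follows. The sample input is hypothetical. Going by the notes, <span class="term">title</span> should survive inside <span class="term">a</span> and be dropped from <span class="term">em</span>, but this is best confirmed with the htmLawed version in use.<br />
<br />
<code class="code">    require_once 'htmLawed.php'; // assumed path; adjust as needed</code>
<br />
<code class="code">    $text = '<a href="a.htm" title="Link">x</a> <em title="Note">y</em>';</code>
<br />
<code class="code">    $config = array('deny_attribute'=>'title'); // deny title for all elements</code>
<br />
<code class="code">    $spec = 'a=title'; // but permit it inside a</code>
<br />
<code class="code">    $processed = htmLawed($text, $config, $spec);</code>
<br />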
<br />
  <span class="term">$spec</span> can also be used to permit <strong>custom or non-standard attributes</strong>. Thus, the following value of <span class="term">$spec</span> will permit the custom uses of the standard <span class="term">rel</span> attribute in <span class="term">input</span> (not permitted as per standards) and of a non-standard attribute, <span class="term">vFlag</span>, in <span class="term">img</span>.<br />
<br />
<code class="code">    $spec = 'img=vFlag; input=rel'</code>
<br />
<br />
  Attribute names must begin with a letter and cannot contain space, equal-to (=) or solidus (/) characters.<br />
</div>
<div class="sub-section"><h3>
<a name="s2.4" id="s2.4"></a><span class="item-no">2.4</span>  Performance time & memory usage
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  The time and memory consumed during text processing by htmLawed depend on its configuration, the size of the input, and the amount, nestedness and well-formedness of the HTML markup within the input. In particular, tag balancing and beautification can each increase the processing time by about a quarter.<br />
<br />
  The htmLawed <a href="htmLawedTest.php">demo</a> can be used to evaluate the performance and effects of different types of input and <span class="term">$config</span>.<br />
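<br />
  Besides the demo, a rough measurement can be scripted directly. The sketch below times htmLawed with tag balancing on and off, using a synthetic input built with <span class="term">str_repeat()</span>; the input, the labels and the path to <span class="term">htmLawed.php</span> are assumptions, and real figures will vary with the server and the content.<br />
<br />
<code class="code">    require_once 'htmLawed.php'; // assumed path; adjust as needed</code>
<br />
<code class="code">    $text = str_repeat('<p>Some <b>bold</b> text</p>', 1000); // synthetic input</code>
<br />
<code class="code">    $runs = array('balance on'=>array('balance'=>1), 'balance off'=>array('balance'=>0));</code>
<br />
<code class="code">    foreach($runs as $label=>$config){</code>
<br />
<code class="code">     $start = microtime(true);</code>
<br />
<code class="code">     htmLawed($text, $config);</code>
<br />
<code class="code">     // the peak-memory figure covers the whole script so far, not just this run</code>
<br />
<code class="code">     printf("%s: %.4f s, peak memory %d bytes\n", $label, microtime(true) - $start, memory_get_peak_usage());</code>
<br />
<code class="code">    }</code>
<br />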
</div>
<div class="sub-section"><h3>
<a name="s2.5" id="s2.5"></a><span class="item-no">2.5</span>  Some security risks to keep in mind
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  When setting the parameters/arguments (like those to allow certain HTML elements) for use with htmLawed, one should bear in mind that the setting may let through potentially <em>dangerous</em> HTML code which is meant to steal user-data, deface a website, render a page non-functional, etc. Unless end-users, either people or software, supplying the content are completely trusted, security issues arising from the degree of HTML usage permitted through htmLawed's setting should be considered. For example, the following increase security risks:<br />
<br />
  *  Allowing <span class="term">script</span>, <span class="term">applet</span>, <span class="term">embed</span>, <span class="term">iframe</span>, <span class="term">canvas</span>, <span class="term">audio</span>, <span class="term">video</span>, <span class="term">dialog</span> or <span class="term">object</span> elements, or certain of their attributes like <span class="term">allowscriptaccess</span><br />
<br />
  *  Allowing HTML comments (some Internet Explorer versions are vulnerable with, e.g., <span class="term"><!--[if gte IE 4]><script>alert("xss");</script><![endif]--></span>)<br />
<br />
  *  Allowing dynamic CSS expressions (some Internet Explorer versions are vulnerable)<br />
<br />
  *  Allowing the <span class="term">style</span> attribute<br />
<br />
  To remove <em>insecure</em> HTML, code-developers using htmLawed must set <span class="term">$config</span> appropriately. E.g., <span class="term">$config["elements"] = "* -script"</span> to deny the <span class="term">script</span> element (<a href="#s3.3">section 3.3</a>), <span class="term">$config["safe"] = 1</span> to auto-configure certain htmLawed parameters for maximizing security (<a href="#s3.6">section 3.6</a>), etc.<br />
<br />
  Permitting the <span class="term">style</span> attribute brings in risks of <em>click-jacking</em>, <em>phishing</em>, web-page overlays, etc., <em>even</em> when the <span class="term">safe</span> parameter is enabled (see <a href="#s3.6">section 3.6</a>). Except for URLs and a few other things like CSS dynamic expressions, htmLawed currently does not check every CSS style property. It does provide ways for the code-developer implementing htmLawed to do such checks through htmLawed's <span class="term">$spec</span> argument, and through the <span class="term">hook_tag</span> parameter (see <a href="#s3.4.8">section 3.4.8</a> for more). Disallowing <span class="term">style</span> completely and relying on CSS classes and stylesheet files is recommended.<br />
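<br />
  One way to follow this recommendation is sketched below: <span class="term">safe</span> is enabled and <span class="term">style</span> is denied for all elements through <span class="term">deny_attribute</span>. The <span class="term">on*</span> entry is repeated explicitly so that the snippet does not rely on how <span class="term">safe</span> merges with a custom <span class="term">deny_attribute</span> value. The sample input is hypothetical.<br />
<br />
<code class="code">    require_once 'htmLawed.php'; // assumed path; adjust as needed</code>
<br />
<code class="code">    $text = '<div style="position:fixed; top:0" onclick="x()">Text</div>';</code>
<br />
<code class="code">    $config = array(</code>
<br />
<code class="code">     'safe'=>1, // auto-adjust security-related parameters</code>
<br />
<code class="code">     'deny_attribute'=>'on*,style' // deny event attributes and style everywhere</code>
<br />
<code class="code">    );</code>
<br />
<code class="code">    $processed = htmLawed($text, $config);</code>
<br />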
<br />
  htmLawed does not check or correct the character <strong>encoding</strong> of the input it receives. In conjunction with permissive circumstances, such as when the character encoding is left undefined through HTTP headers or HTML <span class="term">meta</span> tags, this can allow for an exploit (like Google's <em>UTF-7/XSS</em> vulnerability of the past).<br />
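<br />
  Since htmLawed leaves encoding checks to the calling code, a pre-check along the following lines may help. This is only a sketch: it assumes that the site works in UTF-8, that PHP's <span class="term">mbstring</span> extension is available, and that the input arrives in a hypothetical form field named <span class="term">comment</span>.<br />
<br />
<code class="code">    require_once 'htmLawed.php'; // assumed path; adjust as needed</code>
<br />
<code class="code">    $ok = isset($_POST['comment']) && is_string($_POST['comment']);</code>
<br />
<code class="code">    $text = $ok ? $_POST['comment'] : '';</code>
<br />
<code class="code">    // discard input that is not valid UTF-8 (converting it is another option)</code>
<br />
<code class="code">    if(!mb_check_encoding($text, 'UTF-8')){</code>
<br />
<code class="code">     $text = '';</code>
<br />
<code class="code">    }</code>
<br />
<code class="code">    $processed = htmLawed($text, array('safe'=>1));</code>
<br />
<code class="code">    // declare the encoding so that the browser does not have to guess it</code>
<br />
<code class="code">    header('Content-Type: text/html; charset=utf-8');</code>
<br />
<code class="code">    echo $processed;</code>
<br />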
<br />
  Occasionally, though very rarely, the default settings with which htmLawed runs may change between different versions of htmLawed. Admins should keep this in mind when upgrading htmLawed. Important changes in htmLawed's default behavior in new releases of the software are noted in <a href="#s4.5">section 4.5</a> on upgrades.<br />
</div>
<div class="sub-section"><h3>
<a name="s2.6" id="s2.6"></a><span class="item-no">2.6</span>  Use with <span class="term">kses()</span> code
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  The <span class="term">Kses</span> PHP script for HTML filtering is used by many applications (like <span class="term">WordPress</span>, as of 2012). It is possible to have such applications use htmLawed instead, since it is compatible with code that calls the <span class="term">kses()</span> function declared in the <span class="term">Kses</span> file (usually named <span class="term">kses.php</span>). E.g., application code like this will continue to work after replacing <span class="term">Kses</span> with htmLawed:<br />
<br />
<code class="code">    $comment_filtered = kses($comment_input, array('a'=>array(), 'b'=>array(), 'i'=>array()));</code>
<br />
<br />
  If the application uses a <span class="term">Kses</span> file that has the <span class="term">kses()</span> function declared, then, to have the application use htmLawed instead of <span class="term">Kses</span>, rename <span class="term">htmLawed.php</span> (to <span class="term">kses.php</span>, e.g.) and replace the <span class="term">Kses</span> file (or just replace the code in the <span class="term">Kses</span> file with the htmLawed code). If the <span class="term">kses()</span> function in the <span class="term">Kses</span> file had been renamed by the application developer (e.g., in <span class="term">WordPress</span>, it is named <span class="term">wp_kses()</span>), then appropriately rename the <span class="term">kses()</span> function in the htmLawed code. Then, add the following code (which was a part of htmLawed prior to version 1.2):<br />
<br />
<code class="code">    // kses compatibility</code>
<br />
<code class="code">    function kses($t, $h, $p=array('http', 'https', 'ftp', 'news', 'nntp', 'telnet', 'gopher', 'mailto')){</code>
<br />
<code class="code">     foreach($h as $k=>$v){</code>
<br />
<code class="code">      $h[$k]['n']['*'] = 1;</code>
<br />
<code class="code">     }</code>
<br />
<code class="code">     $C['cdata'] = $C['comment'] = $C['make_tag_strict'] = $C['no_deprecated_attr'] = $C['unique_ids'] = 0;</code>
<br />
<code class="code">     $C['keep_bad'] = 1;</code>
<br />
<code class="code">     $C['elements'] = count($h) ? strtolower(implode(',', array_keys($h))) : '-*';</code>
<br />
<code class="code">     $C['hook'] = 'kses_hook';</code>
<br />
<code class="code">     $C['schemes'] = '*:'. implode(',', $p);</code>
<br />
<code class="code">     return htmLawed($t, $C, $h);</code>
<br />
<code class="code">    }</code>
<br />
<br />
<code class="code">    function kses_hook($t, &$C, &$S){</code>
<br />
<code class="code">     return $t;</code>
<br />
<code class="code">    }</code>
<br />
<br />
  If the <span class="term">Kses</span> file used by the application has been significantly altered by the application developers, then one may need a different approach. E.g., with <span class="term">WordPress</span> (as of 2012), it is best to copy the htmLawed code, along with the above-mentioned additions, to <span class="term">wp-includes/kses.php</span>, rename the newly added function <span class="term">kses()</span> to <span class="term">wp_kses()</span>, and delete the code for the original <span class="term">wp_kses()</span> function.<br />
<br />
  If the <span class="term">Kses</span> code has a non-empty hook function (e.g., <span class="term">wp_kses_hook()</span> in case of <span class="term">WordPress</span>), then the code for htmLawed's <span class="term">kses_hook()</span> function should be appropriately edited. However, the requirement of the hook function should be re-evaluated considering that htmLawed has extra capabilities. With <span class="term">WordPress</span>, the hook function is an essential one. The following code is suggested for the htmLawed <span class="term">kses_hook()</span> in case of <span class="term">WordPress</span>:<br />
<br />
<code class="code">    // kses compatibility</code>
<br />
<code class="code">    function kses_hook($string, &$cf, &$spec){</code>
<br />
<code class="code">     $allowed_html = $spec;</code>
<br />
<code class="code">     $allowed_protocols = array();</code>
<br />
<code class="code">     foreach($cf['schemes'] as $v){</code>
<br />
<code class="code">      foreach($v as $k2=>$v2){</code>
<br />
<code class="code">       if(!in_array($k2, $allowed_protocols)){</code>
<br />
<code class="code">        $allowed_protocols[] = $k2;</code>
<br />
<code class="code">       }</code>
<br />
<code class="code">      }</code>
<br />
<code class="code">     }</code>
<br />
<code class="code">     return wp_kses_hook($string, $allowed_html, $allowed_protocols);</code>
<br />
<code class="code">    }</code>
<br />
</div>
<div class="sub-section"><h3>
<a name="s2.7" id="s2.7"></a><span class="item-no">2.7</span>  Tolerance for ill-written HTML
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  htmLawed can work with ill-written HTML code in the input. However, HTML that is too ill-written may not be <em>read</em> as HTML, and may therefore get identified as mere plain text. The following statements indicate the degree of <em>looseness</em> that htmLawed can work with, and can be provided in instructions to writers:<br />
<br />
  *  Tags must be flanked by <span class="term"><</span> and <span class="term">></span> with no <span class="term">></span> inside -- any needed <span class="term">></span> should be put in as <span class="term">&gt;</span>. It is possible for tag content (element name and attributes) to be spread over many lines instead of being on one. A space may be present between the tag content and <span class="term">></span>, like <span class="term"><div ></span> and <span class="term"><img / ></span>, but not after the <span class="term"><</span>.<br />
<br />
  *  Element and attribute names need not be lower-cased.<br />
<br />
  *  Attribute string of elements may be liberally spaced with tabs, line-breaks, etc.<br />
<br />
  *  Attribute values may be single-quoted instead of double-quoted.<br />
<br />
  *  Left-padding of numeric entities (like, <span class="term">&#0160;</span>, <span class="term">&#x07ff;</span>) with <span class="term">0</span> is okay as long as the number of characters between the <span class="term">&</span> and the <span class="term">;</span> does not exceed 8. All entities must end with <span class="term">;</span> though.<br />
<br />
  *  Named character entities must be properly cased. Thus, <span class="term">&Lt;</span> or <span class="term">&TILDE;</span> will not be recognized as entities and will be <em>neutralized</em>.<br />
<br />
  *  HTML comments should not be inside element tags (they can be between tags), and should begin with <span class="term"><!--</span> and end with <span class="term">--></span>. Characters like <span class="term"><</span>, <span class="term">></span>, and <span class="term">&</span> may be allowed inside depending on <span class="term">$config</span>, but any <span class="term">--></span> inside should be put in as <span class="term">--&gt;</span>. Any <span class="term">--</span> inside will be automatically converted to <span class="term">-</span>, and a space will be added before the <span class="term">--></span> comment-closing marker  unless <span class="term">$config["comments"]</span> is set to <span class="term">4</span> (<a href="#s3.3.1">section 3.3.1</a>).<br />
<br />
  *  <span class="term">CDATA</span> sections should not be inside element tags, and can be in element content only if plain text is allowed for that element. They should begin with <span class="term"><![CDATA[</span> and end with <span class="term">]]></span>. Characters like <span class="term"><</span>, <span class="term">></span>, and <span class="term">&</span> may be allowed inside depending on <span class="term">$config</span>, but any <span class="term">]]></span> inside should be put in as <span class="term">]]&gt;</span>.<br />
<br />
  *  For attribute values, character entities <span class="term">&lt;</span>, <span class="term">&gt;</span> and <span class="term">&amp;</span> should be used instead of characters <span class="term"><</span> and <span class="term">></span>, and <span class="term">&</span> (when <span class="term">&</span> is not part of a character entity). This applies even for Javascript code in values of attributes like <span class="term">onclick</span>.<br />
<br />
  *  Characters <span class="term"><</span>, <span class="term">></span>, <span class="term">&</span> and <span class="term">"</span> that are part of actual Javascript, etc., code in <span class="term">script</span> elements should be used as such and not be put in as entities like <span class="term">&gt;</span>. Otherwise, though the HTML will be valid, the code may fail to work. Further, if such characters have to be used, then they should be put inside <span class="term">CDATA</span> sections.<br />
<br />
  *  Simple instructions like "an opening tag cannot be present between two closing tags" and "nested elements should be closed in the reverse order of how they were opened" can help authors write balanced HTML. If tags are imbalanced, htmLawed will try to balance them, but in the process, depending on <span class="term">$config["keep_bad"]</span>, some code/text may be lost.<br />
<br />
  *  Input authors should be notified of admin-specified allowed elements, attributes, configuration values (like conversion of named entities to numeric ones), etc.<br />
<br />
  *  With <span class="term">$config["unique_ids"]</span> not <span class="term">0</span> and the <span class="term">id</span> attribute being permitted, writers should carefully avoid using duplicate or invalid <span class="term">id</span> values as even though htmLawed will correct/remove the values, the final output may not be the one desired. E.g., <span class="term"><a id="home"></a><input id="home" /><label for="home"></label></span> is processed into<br />
<span class="term"><a id="home"></a><input id="prefix_home" /><label for="home"></label></span>, which leaves the <span class="term">label</span> pointing at the <span class="term">a</span> element rather than the <span class="term">input</span>.<br />
<br />
  *  Even if intended HTML is lost from an ill-written input, the processed output will be more secure and standard-compliant.<br />
<br />
  *  For URLs, unless <span class="term">$config["schemes"]</span> is appropriately set, writers should avoid using escape characters or entities in schemes. E.g., <span class="term">htt&#112;</span> (which many browsers will read as the harmless <span class="term">http</span>) may be considered bad by htmLawed.<br />
<br />
  *  htmLawed will attempt to put plain text present directly inside <span class="term">blockquote</span>, <span class="term">form</span>, <span class="term">map</span> and <span class="term">noscript</span> elements (illegal as per the specifications) inside auto-generated <span class="term">div</span> elements during tag balancing (<a href="#s3.3.3">section 3.3.3</a>).<br />
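<br />
  As a rough sketch of the attribute-value rule above, an application that assembles markup on behalf of writers could escape values with PHP's <span class="term">htmlspecialchars()</span> before the markup reaches htmLawed. The helper name <span class="term">escape_attr_value()</span> is hypothetical and not part of htmLawed; passing <span class="term">false</span> as the fourth argument leaves existing character entities untouched, so that only a bare <span class="term">&</span> is converted:<br />
<br />

```php
<?php
// Hypothetical helper, not part of htmLawed: escape text for use in a
// single- or double-quoted attribute value.
function escape_attr_value(string $value): string
{
    // ENT_QUOTES converts both quote types; double_encode = false keeps
    // entities such as &amp; or &#160; that are already in the value.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8', false);
}

// Javascript for an onclick attribute, escaped as the rule above asks.
$js  = 'if (a < b && c > d) go()';
$tag = '<a href="#" onclick="' . escape_attr_value($js) . '">x</a>';
```

<br />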
</div>
<div class="sub-section"><h3>
<a name="s2.8" id="s2.8"></a><span class="item-no">2.8</span>  Limitations & work-arounds
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  htmLawed's main objective is to make the input text <em>more</em> standard-compliant, secure for readers, and free of HTML elements and attributes considered undesirable by the administrator. Some of its current limitations, regardless of this objective, are noted below along with possible work-arounds.<br />
<br />
  It should be borne in mind that no browser application is 100% standard-compliant, standard specifications continue to evolve, and many browsers accept commonly used non-standard HTML. Regarding security, note that <em>unsafe</em> HTML code is not legally invalid per se.<br />
<br />
  *  htmLawed might not strictly adhere to <em>current</em> HTML standards as standard specification for HTML by <a href="http://www.whatwg.org">WHATWG</a> is continuously evolving, and there is laxity among HTML interpreters (browsers) regarding standards. Admins can configure htmLawed to be more strict about standard compliance.<br />
<br />
  *  In general, htmLawed processes input to generate output that is most likely to be standard-compatible in most users' browsers. Thus, for example, it does not enforce the required value of <span class="term">0</span> on <span class="term">border</span> attribute of <span class="term">img</span> (an HTML version 5 specification).<br />
<br />
  *  htmLawed is meant for input that goes into the <span class="term">body</span> of HTML documents. HTML's head-level elements are not supported, nor are the frame-specific elements <span class="term">frameset</span>, <span class="term">frame</span> and <span class="term">noframes</span>. However, content of the latter elements can be individually filtered through htmLawed.<br />
<br />
  *  It cannot handle input that has non-HTML code like <span class="term">SVG</span> and <span class="term">MathML</span>. One work-around is to break the input into pieces and pass only those without non-HTML code to htmLawed. Another is described in <a href="#s3.9">section 3.9</a>. A third way may be to somehow take advantage of the <span class="term">$config["and_mark"]</span> parameter (see <a href="#s3.2">section 3.2</a>).<br />
<br />
  *  By default, htmLawed won't check many attribute values for standard compliance. E.g., <span class="term">width="20m"</span> with the dimension in non-standard <span class="term">m</span> is let through. Implementing universal and strict attribute value checks can make htmLawed slow and resource-intensive. Admins should look at the <span class="term">hook_tag</span> parameter (<a href="#s3.4.9">section 3.4.9</a>) or <span class="term">$spec</span> to enforce finer checks on attribute values.<br />
<br />
  *  By default, htmLawed considers all ARIA, data-*, event, and microdata attributes as global attributes and permits them in all elements. This is not strictly standard-compliant. E.g., the <span class="term">itemtype</span> microdata attribute is permitted only in elements that also have the <span class="term">itemscope</span> attribute. Admins can configure htmLawed to be more strict about this (<a href="#s2.3">section 2.3</a>).<br />
<br />
  *  The attributes, whether deprecated (which can be transformed by htmLawed) or not, that it supports are largely those that are in the specifications. Only a few of the proprietary attributes are supported. However, <span class="term">$spec</span> can be used to allow custom attributes (<a href="#s2.3">section 2.3</a>).<br />
<br />
  *  Except for contained URLs and dynamic expressions (also optional), htmLawed does not check CSS style property values. Admins should look at using the <span class="term">hook_tag</span> parameter (<a href="#s3.4.9">section 3.4.9</a>) or <span class="term">$spec</span> for finer checks. Perhaps the best option is to disallow <span class="term">style</span> but allow <span class="term">class</span> attributes with the right <span class="term">oneof</span> or <span class="term">match</span> values for <span class="term">class</span>, and have the various class style properties in <span class="term">.css</span> CSS stylesheet files.<br />
<br />
  *  htmLawed does not parse emoticons, decode <em>BBcode</em>, or <em>wikify</em>, auto-converting text to proper HTML. Similarly, it won't convert line-breaks to <span class="term">br</span> elements. Such functions are beyond its purview. Admins should use other code to pre- or post-process the input for such purposes.<br />
<br />
  *  htmLawed cannot be used to have links force-opened in new windows (by auto-adding appropriate <span class="term">target</span> and <span class="term">onclick</span> attributes to <span class="term">a</span>). Admins should look at Javascript-based DOM-modifying solutions for this. Admins may also be able to use a custom hook function to add such attributes (<span class="term">hook_tag</span> parameter; see <a href="#s3.4.9">section 3.4.9</a>).<br />
<br />
  *  Nesting-based checks are not possible. E.g., one cannot disallow <span class="term">p</span> elements specifically inside <span class="term">td</span> while permitting it elsewhere. Admins may be able to use a custom hook function to enforce such checks (<span class="term">hook_tag</span> parameter; see <a href="#s3.4.9">section 3.4.9</a>).<br />
<br />
  *  Except for optionally converting absolute or relative URLs to the other type, htmLawed will not alter URLs (e.g., to change the value of query strings or to convert <span class="term">http</span> to <span class="term">https</span>). Having absolute URLs may be a standard-requirement, e.g., when HTML is embedded in email messages, whereas altering URLs for other purposes is beyond htmLawed's goals. Admins may be able to use a custom hook function to make such alterations (<span class="term">hook_tag</span> parameter; see <a href="#s3.4.9">section 3.4.9</a>).<br />
<br />
  *  Pairs of opening and closing tags that do not enclose any content (like <span class="term"><em></em></span>) are not removed. This may be against the standard specification for certain elements (e.g., <span class="term">table</span>). However, the presence of such non-compliant code will not break the display or layout of content. Admins can also use simple regex-based code to filter out such code.<br />
<br />
  *  htmLawed does not check for certain element orderings described in the standard specifications (e.g., in a <span class="term">table</span>, <span class="term">tbody</span> is allowed before <span class="term">tfoot</span>). Admins may be able to use a custom hook function to enforce such checks (<span class="term">hook_tag</span> parameter; see <a href="#s3.4.9">section 3.4.9</a>).<br />
<br />
  *  htmLawed does not check the number of nested elements. E.g., it will allow two <span class="term">caption</span> elements in a <span class="term">table</span> element, illegal as per standard specifications. Admins may be able to use a custom hook function to enforce such checks (<span class="term">hook_tag</span> parameter; see <a href="#s3.4.9">section 3.4.9</a>).<br />
<br />
  *  There are multiple ways to interpret ill-written HTML. E.g., in <span class="term"><small><small>text</small></span>, is it that the second closing tag for <span class="term">small</span> is missing or is it that the second opening tag for <span class="term">small</span> was put in by mistake? htmLawed corrects the HTML in the string assuming the former, while the user may have intended the string for the latter. This is an issue that is impossible to address perfectly.<br />
<br />
  *  htmLawed might convert certain entities to actual characters and remove backslashes and CSS comment-markers (<span class="term">/*</span>) in <span class="term">style</span> attribute values in order to detect malicious HTML like crafted, Internet Explorer browser-specific dynamic expressions like <span class="term">&#101;xpression...</span>. If this is too harsh, admins can allow CSS expressions through htmLawed core but then use a custom function through the <span class="term">hook_tag</span> parameter (<a href="#s3.4.9">section 3.4.9</a>) to more specifically identify CSS expressions in the <span class="term">style</span> attribute values. Also, using <span class="term">$config["style_pass"]</span>, it is possible to have htmLawed pass <span class="term">style</span> attribute values without even looking at them (<a href="#s3.4.8">section 3.4.8</a>).<br />
<br />
  *  htmLawed does not correct certain possible attribute-based security vulnerabilities (e.g., <span class="term"><a href="http://x%22+style=%22background-image:xss">x</a></span>). These arise when browsers mis-identify markup in <em>escaped</em> text, defeating the very purpose of escaping text (a bad browser will read the given example as <span class="term"><a href="http://x" style="background-image:xss">x</a></span>).<br />
<br />
  *  Because of inadequate Unicode support in PHP, htmLawed does not remove the <em>high value</em> HTML-invalid characters with multi-byte code-points. Such characters, however, are extremely unlikely to be in the input (see <a href="#s3.1">section 3.1</a>).<br />
<br />
  *  htmLawed does not check or correct the character encoding of the input it receives. In conjunction with permitting circumstances such as when the character encoding is left undefined through HTTP headers or HTML <span class="term">meta</span> tags, this can permit an exploit (like Google's <em>UTF-7/XSS</em> vulnerability of the past). Also, htmLawed can mangle input text if it is not well-formed in terms of character encoding. Administrators can consider using code available elsewhere to check well-formedness of input text characters to correct any defect.<br />
<br />
  *  htmLawed is expected to work with input texts in ASCII standard-compatible single-byte encodings such as national variants of ASCII (like ISO-646-DE/German of the ISO 646 standard), extended ASCII variants (like ISO 8859-9/Turkish of the ISO 8859/ISO Latin standard), ISO 8859-based Windows variants (like Windows 1252), EBCDIC, Shift JIS (Japanese), GB-Roman (Chinese), and KS-Roman (Korean). It should also properly handle texts with variable-byte encodings like UTF-7 (Unicode) and UTF-8 (Unicode). However, htmLawed may mangle input texts with double-byte encodings like UTF-16 (Unicode), JIS X 0208:1997 (Japanese) and KS X 1001:1992 (Korean), or the UTF-32 (Unicode) quadruple-byte encoding. If an input text has such an encoding, administrators can use PHP's <a href="http://php.net/manual/en/book.iconv.php">iconv</a> functions, or some other means, to convert text to UTF-8 before passing it to htmLawed.<br />
<br />
  *  As with any script using PHP's PCRE regex functions, low PCRE limit values in a particular PHP setup can cause htmLawed to at least partially fail with very long input texts.<br />
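<br />
  The encoding-related work-arounds above might be sketched as follows. The function name <span class="term">to_utf8_for_filtering()</span> and its <span class="term">$declaredEncoding</span> parameter are hypothetical and not part of htmLawed; the sketch assumes the application already knows the encoding of the input, converts it with <span class="term">iconv()</span>, and uses PCRE's <span class="term">u</span> modifier as a well-formedness check:<br />
<br />

```php
<?php
// Hypothetical pre-processing helper, not part of htmLawed.
// Returns the text as well-formed UTF-8, or null if that is not possible.
function to_utf8_for_filtering(string $text, string $declaredEncoding = 'UTF-8'): ?string
{
    if (strcasecmp($declaredEncoding, 'UTF-8') !== 0) {
        // iconv() returns false (with a notice) when conversion fails.
        $converted = @iconv($declaredEncoding, 'UTF-8', $text);
        if ($converted === false) {
            return null;
        }
        $text = $converted;
    }
    // With the "u" modifier, PCRE refuses strings that are not valid UTF-8.
    return preg_match('//u', $text) === 1 ? $text : null;
}
```

<br />
  A non-null result can then be passed to <span class="term">htmLawed()</span> as usual; how a null result is handled (rejecting the input, for instance) is left to the application.<br />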
</div>
<div class="sub-section"><h3>
<a name="s2.9" id="s2.9"></a><span class="item-no">2.9</span>  Examples of usage
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  Safest, allowing only <em>safe</em> HTML markup --<br />
<br />
<code class="code">    $config = array('safe'=>1);</code>
<br />
<code class="code">    $out = htmLawed($in, $config);</code>
<br />
<br />
  Simplest, allowing all valid HTML markup including Javascript --<br />
<br />
<code class="code">    $out = htmLawed($in);</code>
<br />
<br />
  Allowing all valid HTML markup but restricting URL schemes in <span class="term">src</span> attribute values to <span class="term">http</span> and <span class="term">https</span> --<br />
<br />
<code class="code">    $config = array('schemes'=>'*:*; src:http, https');</code>
<br />
<code class="code">    $out = htmLawed($in, $config);</code>
<br />
<br />
  Allowing only <span class="term">safe</span> HTML and the elements <span class="term">a</span>, <span class="term">em</span>, and <span class="term">strong</span> --<br />
<br />
<code class="code">    $config = array('safe'=>1, 'elements'=>'a, em, strong');</code>
<br />
<code class="code">    $out = htmLawed($in, $config);</code>
<br />
<br />
  Not allowing elements <span class="term">script</span> and <span class="term">object</span> --<br />
<br />
<code class="code">    $config = array('elements'=>'* -script -object');</code>
<br />
<code class="code">    $out = htmLawed($in, $config);</code>
<br />
<br />
  Not allowing attributes <span class="term">id</span> and <span class="term">style</span> --<br />
<br />
<code class="code">    $config = array('deny_attribute'=>'id, style');</code>
<br />
<code class="code">    $out = htmLawed($in, $config);</code>
<br />
<br />
  Permitting only attributes <span class="term">title</span> and <span class="term">href</span> --<br />
<br />
<code class="code">    $config = array('deny_attribute'=>'* -title -href');</code>
<br />
<code class="code">    $out = htmLawed($in, $config);</code>
<br />
<br />
  Remove bad/disallowed tags altogether instead of converting them to entities --<br />
<br />
<code class="code">    $config = array('keep_bad'=>0);</code>
<br />
<code class="code">    $out = htmLawed($in, $config);</code>
<br />
<br />
  Allowing attribute <span class="term">title</span> only in <span class="term">a</span> and not allowing attributes <span class="term">id</span>, <span class="term">style</span>, or scriptable <em>on*</em> attributes like <span class="term">onclick</span> --<br />
<br />
<code class="code">    $config = array('deny_attribute'=>'title, id, style, on*');</code>
<br />
<code class="code">    $spec = 'a=title';</code>
<br />
<code class="code">    $out = htmLawed($in, $config, $spec);</code>
<br />
<br />
  Allowing a custom attribute, <span class="term">vFlag</span>, in <span class="term">img</span> and permitting custom use of the standard attribute, <span class="term">rel</span>, in <span class="term">input</span> --<br />
<br />
<code class="code">    $spec = 'img=vFlag; input=rel';</code>
<br />
<code class="code">    $out = htmLawed($in, $config, $spec);</code>
<br />
<br />
  Some case-studies are presented below.<br />
<br />
  <strong>1.</strong> A blog administrator wants to allow only <span class="term">a</span>, <span class="term">em</span>, <span class="term">strike</span>, <span class="term">strong</span> and <span class="term">u</span> in comments, but needs <span class="term">strike</span> and <span class="term">u</span> transformed to <span class="term">span</span> for better XHTML 1-strict compliance, and wants the <span class="term">a</span> links to point only to <span class="term">http</span> or <span class="term">https</span> resources:<br />
<br />
<code class="code">    $processed = htmLawed($in, array('elements'=>'a, em, strike, strong, u', 'make_tag_strict'=>1, 'safe'=>1, 'schemes'=>'*:http, https'), 'a=href');</code>
<br />
<br />
  <strong>2.</strong> An author uses a custom-made web application to load content on his website. He is the only one using that application and the content he generates has all types of HTML, including scripts. The web application uses htmLawed primarily as a tool to correct errors that creep in while writing HTML and to take care of the occasional <em>bad</em> characters in copy-paste text introduced by Microsoft Office. The web application provides a preview before submitted input is added to the content. For the previewing process, htmLawed is set up as follows:<br />
<br />
<code class="code">    $processed = htmLawed($in, array('css_expression'=>1, 'keep_bad'=>1, 'make_tag_strict'=>1, 'schemes'=>'*:*', 'valid_xhtml'=>1));</code>
<br />
<br />
  For the final submission process, <span class="term">keep_bad</span> is set to <span class="term">6</span>. A value of <span class="term">1</span> for the preview process allows the author to note and correct any HTML mistake without losing any of the typed text.<br />
<br />
  <strong>3.</strong> A data-miner is scraping information in a specific table of similar web-pages and is collating the data rows, and uses htmLawed to reduce unnecessary markup and white-spaces:<br />
<br />
<code class="code">    $processed = htmLawed($in, array('elements'=>'tr, td', 'tidy'=>-1), 'tr, td =');</code>
<br />
</div>
</div>
<div class="section"><h2>
<a name="s3" id="s3"></a><span class="item-no">3</span>  Details
</h2><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<div class="sub-section"><h3>
<a name="s3.1" id="s3.1"></a><span class="item-no">3.1</span>  Invalid/dangerous characters
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  Valid characters (more correctly, their code-points) in HTML or XML are, hexadecimally, <span class="term">9</span>, <span class="term">a</span>, <span class="term">d</span>, <span class="term">20</span> to <span class="term">d7ff</span>, and <span class="term">e000</span> to <span class="term">10ffff</span>, except <span class="term">fffe</span> and <span class="term">ffff</span> (decimally, <span class="term">9</span>, <span class="term">10</span>, <span class="term">13</span>, <span class="term">32</span> to <span class="term">55295</span>, and <span class="term">57344</span> to <span class="term">1114111</span>, except <span class="term">65534</span> and <span class="term">65535</span>). htmLawed removes the invalid characters <span class="term">0</span> to <span class="term">8</span>, <span class="term">b</span>, <span class="term">c</span>, and <span class="term">e</span> to <span class="term">1f</span>.<br />
<br />
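  The ranges above can be expressed in a few lines of PHP. This is only an illustration of the stated ranges, not htmLawed's own code, and both function names are hypothetical:<br />
<br />

```php
<?php
// Illustration only; not htmLawed's own code.

// True if $cp is a valid HTML/XML code-point per the ranges listed above.
function is_valid_codepoint(int $cp): bool
{
    if ($cp === 0xfffe || $cp === 0xffff) {
        return false;
    }
    return $cp === 0x9 || $cp === 0xa || $cp === 0xd
        || ($cp >= 0x20 && $cp <= 0xd7ff)
        || ($cp >= 0xe000 && $cp <= 0x10ffff);
}

// Remove the invalid single-byte characters 0 to 8, b, c, and e to 1f.
function strip_invalid_low_chars(string $text): string
{
    return preg_replace('/[\x00-\x08\x0b\x0c\x0e-\x1f]/', '', $text);
}
```

<br />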
  Because of PHP's poor native support for multi-byte characters, htmLawed cannot check for the remaining invalid code-points. However, for various reasons, it is very unlikely for any of those characters to be in the input.<br />
<br />
  Characters that are discouraged (see <a href="#s5.1">section 5.1</a>) but not invalid are not removed by htmLawed.<br />
<br />
  It (function <span class="term">hl_tag()</span>) also replaces the potentially dangerous (in some Mozilla [Firefox] and Opera browsers) soft-hyphen character (code-point, hexadecimally, <span class="term">ad</span>, or decimally, <span class="term">173</span>) in attribute values with spaces. Where required, the characters <span class="term"><</span>, <span class="term">></span>, <span class="term">&</span>, and <span class="term">"</span> are converted to entities.<br />
<br />
  With <span class="term">$config["clean_ms_char"]</span> set as <span class="term">1</span> or <span class="term">2</span>, many of the discouraged characters (decimal code-points <span class="term">127</span> to <span class="term">159</span> except <span class="term">133</span>) that many Microsoft applications incorrectly use (as per the <span class="term">Windows 1252</span> [<span class="term">Cp-1252</span>] or a similar encoding system), and the character for decimal code-point <span class="term">133</span>, are converted to appropriate decimal numerical entities (or removed for a few cases)-- see appendix in <a href="#s5.4">section 5.4</a>. This can help avoid some display issues arising from copying-pasting of content.<br />
<br />
  With <span class="term">$config["clean_ms_char"]</span> set as <span class="term">2</span>, characters for the hexadecimal code-points <span class="term">82</span>, <span class="term">91</span>, and <span class="term">92</span> (for special single-quotes), and <span class="term">84</span>, <span class="term">93</span>, and <span class="term">94</span> (for special double-quotes) are converted to ordinary single and double quotes respectively and not to entities.<br />
<br />
  The character values are replaced with entities/characters and not character values referred to by the entities/characters to keep this task independent of the character-encoding of input text.<br />
<br />
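  To make the byte-level nature of this replacement concrete, here is a minimal sketch covering only the six quote code-points named above for a <span class="term">clean_ms_char</span> value of <span class="term">2</span>. It is not htmLawed's implementation, the function name <span class="term">ms_quotes_to_plain()</span> is hypothetical, and the input is assumed to be in a single-byte <span class="term">Windows 1252</span>-like encoding:<br />
<br />

```php
<?php
// Illustration only; handles just the six quote bytes named above and
// assumes a single-byte Windows 1252-like input encoding.
function ms_quotes_to_plain(string $text): string
{
    return strtr($text, [
        "\x82" => "'", "\x91" => "'", "\x92" => "'", // special single-quotes
        "\x84" => '"', "\x93" => '"', "\x94" => '"', // special double-quotes
    ]);
}
```

<br />
  Because the replacement works on individual bytes, it also alters any multi-byte UTF-8 sequence that happens to contain one of these bytes (for instance, the sequence <span class="term">\xE3\x81\x93</span>).<br />
<br />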
  The <span class="term">$config["clean_ms_char"]</span> parameter should not be used if authors do not copy-paste Microsoft-created text, or if the input text is not believed to use the <span class="term">Windows 1252</span> (<span class="term">Cp-1252</span>) or a similar encoding like <span class="term">Cp-1251</span> (otherwise, for example when UTF-8 encoding is in use, Japanese or Korean characters can get mangled). Further, the input form and the web-pages displaying it or its content should have the character encoding appropriately marked-up.<br />
</div>
<div class="sub-section"><h3>
<a name="s3.2" id="s3.2"></a><span class="item-no">3.2</span>  Character references/entities
</h3><span class="totop"><a href="#peak">(to top)</a></span><br style="clear: both;" />
<br />
  Valid character entities take the form <span class="term">&*;</span> where <span class="term">*</span> is <span class="term">#x</span> followed by a hexadecimal number (hexadecimal numeric entity; like <span class="term">&#xA0;</span> for non-breaking space), or alphanumeric like <span class="term">gt</span> (external or named entity; like <span class="term">&nbsp;</span> for non-breaking space), or <span class="term">#</span> followed by a number (decimal numeric entity; like <span class="term">&#160;</span> for non-breaking space). Character entities referring to the soft-hyphen character (the <span class="term">&shy;</span> or <span class="term">\xad</span> character; hexadecimal code-point <span class="term">ad</span> [decimal <span class="term">173</span>]) in URL-accepting attribute values are always replaced with spaces; soft-hyphens in attribute values introduce vulnerabilities in some older versions of the Opera and Mozilla [Firefox] browsers.<br />
<br />
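  The three entity forms can be told apart with simple patterns. The sketch below is not htmLawed's <span class="term">hl_entity()</span>; the function name <span class="term">classify_entity()</span> is hypothetical, and the check for named entities looks only at the form, not at whether the name is in the HTML specification:<br />
<br />

```php
<?php
// Illustration only; not htmLawed's hl_entity(). A 'named' result means the
// string has the form of a named entity, not that the name is a known one.
function classify_entity(string $entity): ?string
{
    if (preg_match('/^&#x[0-9a-f]+;\z/i', $entity)) {
        return 'hexadecimal';
    }
    if (preg_match('/^&#[0-9]+;\z/', $entity)) {
        return 'decimal';
    }
    if (preg_match('/^&[a-z][a-z0-9]*;\z/i', $entity)) {
        return 'named';
    }
    return null; // e.g., a missing semi-colon
}
```

<br />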
  htmLawed (function <span class="term">hl_entity()</span>):<br />
<br />
  *  Neutralizes entities with multiple leading zeroes or missing semi-colons (potentially dangerous)<br />
<br />
  *  Lowercases the <span class="term">X</span> (for XML-compliance) and <span class="term">A-F</span> of hexadecimal numeric entities<br />
<br />
  *  Neutralizes entities referring to characters that are HTML-invalid (see <a href="#s3.1">section 3.1</a>)<br />
<br />
  *  Neutralizes entities referring to characters that are HTML-discouraged (code-points, hexadecimally, <span class="term">7f</span> to <span class="term">84</span>, <span class="term">86</span> to <span class="term">9f</span>, and <span class="term">fdd0</span> to <span class="term">fddf</span>, or decimally, <span class="term">127</span> to <span class="term">132</span>, <span class="term">134</span> to <span class="term">159</span>, and <span class="term">64976</span> to <span class="term">64991</span>). Entities referring to the remaining discouraged characters (see <a href="#s5.1">section 5.1</a> for a full list) are let through.<br />
<br />
  *  Neutralizes named entities that are not in the HTML5 specification<br />
<br />
  *  Optionally converts valid HTML-specific named entities except <span class="term">&gt;</span>, <span class="term">&lt;</span>, <span class="term">&quot;</span>, and <span class="term">&amp;</span> to decimal numeric ones (hexadecimal if <span class="term">$config["hexdec_entity"]</span> is <span class="term">2</span>) for generic XML-compliance. For this, <span class="term">$config["named_entity"]</span> should be <span class="term">1</span>.<br />
<br />