-
-
Notifications
You must be signed in to change notification settings - Fork 186
/
README.html
1569 lines (1521 loc) · 108 KB
/
README.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>README</title>
<style type="text/css">
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
</style>
<style type="text/css">
a.sourceLine { display: inline-block; line-height: 1.25; }
a.sourceLine { pointer-events: none; color: inherit; text-decoration: inherit; }
a.sourceLine:empty { height: 1.2em; position: absolute; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
code.sourceCode { white-space: pre-wrap; }
a.sourceLine { text-indent: -1em; padding-left: 1em; }
}
pre.numberSource a.sourceLine
{ position: relative; }
pre.numberSource a.sourceLine:empty
{ position: absolute; }
pre.numberSource a.sourceLine::before
{ content: attr(data-line-number);
position: absolute; left: -5em; text-align: right; vertical-align: baseline;
border: none; pointer-events: all;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
a.sourceLine::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
</style>
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<!-- pandoc -s -f gfm -t html README.md -o README.html -->
<h1 id="hash-collisions-and-exploitations">Hash collisions and exploitations</h1>
<p>By Ange Albertini and Marc Stevens.</p>
<h2 id="faq-tldr">FAQ (TL;DR)</h2>
<p>Q: Is it possible to make a file get an arbitrary MD2/MD4/MD5/MD6/SHA1/SHA2/SHA3, or the same hash as another file?<br/> A: No.</p>
<p>Q: Can one create 2 different files with the same hash?<br/> A: With MD5, in <a href="#fastcoll-md5">a few seconds on a standard computer</a>. With SHA1, it's <a href="#shattered-sha1">possible</a> but not practical for end-users (Complexity: 2^61.2 Price: $11k).</p>
<p>Q: Can one make 2 different files get the same hash by appending stuff?<br/> A: With MD5, in <a href="#hashclash-md5">a few hours on a standard computer</a>. With SHA1, it's <a href="#shambles-sha-1">possible</a> but not practical for end-users (Complexity: 2^63.4 Price: $45K)</p>
<p>Q: Will the 2 files remain valid?<br/> A: In general, yes, as most file formats tolerate appended data. OTOH files signatures will be likely broken.</p>
<p>Q: Can one make 2 different files with arbitrary contents and the same hash?<br/> A: Yes, it can be instant by relying on special file structures:<br/></p>
<ol>
<li>a special format header (or pair) with tricks, acting as a switch between 2 contents (some formats won't allow such tricks).<br/></li>
<li>pre-computed collisions, based on the specific header(s).<br/></li>
<li>two contents of specific formats, both presents after the collision (added after the computation).</li>
</ol>
<p>Q: Which formats can I get instant MD5-colliding files pair for?<br/> A: <a href="scripts/jpg.py">JPG</a>, <a href="scripts/png.py">PNG</a>, <a href="scripts/gif.py">GIF</a>, <a href="scripts/gz.py">GZIP</a>, <a href="scripts/pe.py">Portable Executable</a>, <a href="scripts/mp4.py">MP4</a>, <a href="scripts/jp2.py">JPEG2000</a>, <a href="scripts/pdf.py">PDF</a>, <a href="scripts/zinsider.py">DOCX/PPTX/XSLX</a>, <a href="scripts/zinsider.py">EPUB</a>, <a href="scripts/zinsider.py">3MF</a>, <a href="scripts/zinsider.py">XPS</a>. Just run the specific script.</p>
<p>Q: What about for SHA1?<br/> A: For SHA1, <a href="https://github.com/nneonneo/sha1collider">JPG in a PDF</a> is computed and implemented.</p>
<p>Q: What about formats already supported for MD5 (JPG, PNG...), but for SHA1 instead?<br/> A: They're most likely supported with SHA1 too, but their collisions hasn't been computed.</p>
<p>Q: Are computations faster for similar (but different) contents?<br/> A: No. Any tiny difference requires a full computation.</p>
<p>Q: Which formats don't have such shortcut?<br/> A: ELF, Mach-O, Java Class, TAR, ZIP (among others...)</p>
<p>Q: Are classic collisions (in a few hours) still possible with these formats?<br/> A: Yes, as long as any amount of appended data is tolerated (ie likely not ZIP or Class).</p>
<p>Q: Do you provide examples of collisions?<br/> A: <a href="examples/free/README.md">Yes</a>.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#status">Status</a></li>
<li><a href="#attacks">Attacks</a>
<ul>
<li><a href="#identical-prefix">Identical prefix</a>
<ul>
<li><a href="#fastcoll-md5">FastColl (MD5)</a></li>
<li><a href="#unicoll-md5">UniColl (MD5)</a></li>
<li><a href="#shattered-sha1">Shattered (SHA1)</a></li>
</ul></li>
<li><a href="#chosen-prefix-collisions">Chosen-prefix collisions</a>
<ul>
<li><a href="#hashclash-md5">HashClash (MD5)</a></li>
<li><a href="#shambles-sha-1">Shambles (SHA1)</a></li>
</ul></li>
<li><a href="#attacks-summary">Attacks summary</a></li>
</ul></li>
<li><a href="#exploitations">Exploitations</a>
<ul>
<li><a href="#standard-strategy">Standard strategy</a>
<ul>
<li><a href="#jpg">JPG</a>
<ul>
<li><a href="#custom-scans">custom scans</a></li>
</ul></li>
<li><a href="#png">PNG</a>
<ul>
<li><a href="#incompatibility">incompatibility</a></li>
</ul></li>
<li><a href="#gif">GIF</a></li>
<li><a href="#gzip">GZIP</a></li>
<li><a href="#lz4--zstandard">LZ4 / Zstandard</a></li>
<li><a href="#portable-executable">Portable Executable</a></li>
<li><a href="#mp4-and-others">MP4 and others</a>
<ul>
<li><a href="#jpeg2000">JPEG2000</a></li>
</ul></li>
<li><a href="#pdf">PDF</a>
<ul>
<li><a href="#jpg-in-pdf">JPG in PDF</a></li>
</ul></li>
<li><a href="#zip">ZIP</a>
<ul>
<li><a href="#zip-based-formats">Zip-based formats</a></li>
</ul></li>
<li><a href="#others">Others</a></li>
</ul></li>
<li><a href="#uncommon-strategies">Uncommon strategies</a>
<ul>
<li><a href="#multicolls-multiple-collisions-chain">MultiColls: multiple collisions chain</a>
<ul>
<li><a href="#hashquines">Hashquines</a></li>
</ul></li>
<li><a href="#validity">Validity</a></li>
<li><a href="#polycolls-collisions-of-different-file-types">PolyColls: collisions of different file types</a>
<ul>
<li><a href="#pe---jpg">PE - JPG</a></li>
<li><a href="#pdf---pe">PDF - PE</a></li>
<li><a href="#pdf---png">PDF - PNG</a></li>
</ul></li>
<li><a href="#pileups-multi-collision">PileUps (multi-collision)</a>
<ul>
<li><a href="#pe---png---mp4---pdf">PE - PNG - MP4 - PDF</a></li>
</ul></li>
</ul></li>
<li><a href="#use-cases">Use cases</a>
<ul>
<li><a href="#gotta-collide-em-all">Gotta collide 'em all!</a></li>
<li><a href="#incriminating-files">Incriminating files</a></li>
</ul></li>
<li><a href="#failures">Failures</a>
<ul>
<li><a href="#elf">ELF</a></li>
<li><a href="#mach-o">Mach-O</a></li>
<li><a href="#java-class">Java Class</a></li>
<li><a href="#tar">TAR</a></li>
</ul></li>
<li><a href="#exploitations-summary">Exploitations summary</a></li>
<li><a href="#test-files">Test files</a></li>
</ul></li>
<li><a href="#detection">Detection</a>
<ul>
<li><a href="#safe-hashes">Safe hashes</a></li>
</ul></li>
<li><a href="#references">References</a></li>
<li><a href="#credits">Credits</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>The goal is to explore extensively existing attacks - and show on the way how weak MD5 is (instant collisions of any JPG, PNG, PDF, MP4, PE...) - and also explore in detail common file formats to determine how they can be exploited with present or with future attacks.</p>
<p>Indeed, the same file format trick can be used on several hashes (the same JPG tricks were used for <a href="https://archive.org/stream/pocorgtfo14#page/n49/mode/1up">MD5</a>, <a href="https://malicioussha1.github.io/">malicious SHA-1</a> and <a href="http://shattered.io">SHA1</a>), as long as the collisions follow the same byte patterns.</p>
<p>This document is <strong>not</strong> about new attacks (the most recent one was documented in 2012), but about new forms of exploitations of existing attacks.</p>
<h1 id="status">Status</h1>
<p>Current status of known attacks:</p>
<ul>
<li><p>get a file to get another file's hash or a given hash: <strong>impossible</strong></p>
<ul>
<li>it's still even not practical with <a href="https://eprint.iacr.org/2008/089.pdf">MD2</a> or <a href="https://who.paris.inria.fr/Gaetan.Leurent/files/MD4_FSE08.pdf">MD4</a>.</li>
<li>works for simpler hashes(*) <!-- Thanks Sven! --></li>
</ul></li>
<li><p>get two different files with the same MD5: <strong>instant</strong></p>
<ul>
<li>examples: <a href="examples/single-ipc1.bin">1</a> ⟷ <a href="examples/single-ipc2.bin">2</a></li>
</ul></li>
<li><p>make two arbitrary files get the same MD5: <strong>a few hours</strong> (72 hours.core)</p>
<ul>
<li>examples: <a href="examples/single-cpc1.bin">1</a> ⟷ <a href="examples/single-cpc2.bin">2</a></li>
</ul></li>
<li><p>make two arbitrary files of specific file formats (PNG, JPG, PE...) get the same MD5: <strong>instant</strong></p>
<ul>
<li>read below</li>
</ul></li>
<li><p>get two different files with the same SHA1: 6500 years.core</p>
<ul>
<li>get two different PDFs with the same SHA-1 to show a different picture: <a href="https://github.com/nneonneo/sha1collider">instant</a> (the prefixes are already computed)</li>
</ul></li>
</ul>
<p>(*) example with <a href="https://docs.python.org/3/library/crypt.html">crypt</a> - thanks <a href="https://twitter.com/svblxyz">Sven</a>!</p>
<pre><code>>>> import crypt
>>> crypt.crypt("5dUD&66", "br")
'brokenOz4KxMc'
>>> crypt.crypt("O!>',%$", "br")
'brokenOz4KxMc'
</code></pre>
<h1 id="attacks">Attacks</h1>
<p>MD5 and SHA1 work with blocks of 64 bytes.</p>
<p>If two contents A & B have the same hash, then appending the same contents C to both will keep the same hash.</p>
<pre class="text"><code>hash(A) = hash(B) -> hash(A + C) = hash(B + C)
</code></pre>
<p>Collisions work by inserting at a block boundary a number of computed collision blocks that depends on what came before in the file. These collision blocks are very random-looking with some minor differences (that follow a specific pattern for each attack) and they will introduce tiny differences while eventually getting hashes the same value after these blocks.</p>
<p>These differences are abused to craft valid files with specific properties.</p>
<p>File formats also work top-down, and most of them work by byte-level chunks.</p>
<p>Some 'comment' chunks can be inserted to align file chunks to block boundaries, to align specific structures to collision blocks differences, to hide the rest of the collision blocks randomness from the file parsers, and to hide otherwise valid content from the parser (so that it will see another content).</p>
<p>These 'comment' chunks are often not officially real comments: they are just used as data containers that are ignored by the parser (for example, PNG chunks with a lowercase-starting ID are ancillary, not critical).</p>
<p>Most of the time, a difference in the collision blocks is used to modify the length of a comment chunk, which is typically declared just before the data of this chunk: in the gap between the smaller and the longer version of this chunk, another comment chunk is declared to jump over one file's content <code>A</code>. After this file content <code>A</code>, just append another file content <code>B</code>.</p>
<p><img src="pics/layout.png" /></p>
<p>Since file formats usually define a terminator that will make parsers stop after it, <code>A</code> will terminate parsing, which will make the appended content <code>B</code> ignored.</p>
<p>So typically at least two comments are needed - often three:</p>
<ol>
<li>alignment</li>
<li>hide collision blocks</li>
<li>hide one file content (for re-usable collisions)</li>
</ol>
<p>These common properties of file formats make it possible - they are not typically seen as weaknesses, but they can be detected or normalized out:</p>
<ul>
<li>dummy chunks - used as comments</li>
<li>more than one comment</li>
<li>huge comments (lengths: 64b for MP4, 32b for PNG -> trivial collisions. 16b for JPG, 8b for GIF -> no generic collision for GIF, limited for JPG)</li>
<li>store any data in a comment (ASCII or UTF8 could be enforced)</li>
<li>store anything after the terminator (usually used only for malicious purposes) - can be avoided by using two comments finishing at the same offsets.</li>
<li>no integrity check. CRC32 in PNG are usually ignored. However they can be all correct since the collision blocks declare chunks of different lengths - so even if the chunk's data starts differently, the chunk lengths are different</li>
<li>flat structure: <a href="https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One">ASN.1</a> defines parent structure with the length of all the enclosed substructures, which prevents these constructs: you'd need to abuse a length, but also the length of the parent.</li>
<li>put a comment before the header - this makes generic re-usable collisions possible.</li>
</ul>
<h2 id="identical-prefix">Identical prefix</h2>
<ol>
<li>Define an arbitrary prefix - its content and length don't matter.</li>
<li>The prefix is padded to the next 64-byte block.</li>
<li>Collision block(s) are computed depending on the prefix and appended. Both sides are very random. The differences are predetermined by the attack.</li>
<li>After this[these] block[s], the hash value is the same despite the file differences.</li>
<li>Any arbitrary identical suffix can be added.</li>
</ol>
<table>
<thead>
<tr class="header">
<th style="text-align: center;">Prefix</th>
<th style="text-align: center;">=</th>
<th style="text-align: center;">Prefix</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">Collision <em>A</em></td>
<td style="text-align: center;">≠</td>
<td style="text-align: center;">Collision <em>B</em></td>
</tr>
<tr class="even">
<td style="text-align: center;">Suffix</td>
<td style="text-align: center;">=</td>
<td style="text-align: center;">Suffix</td>
</tr>
</tbody>
</table>
<p>Both files are almost identical (their content have only a few bits of differences)</p>
<p><strong>Exploitation</strong>:</p>
<p>Bundle two contents, then either:</p>
<ul>
<li>Data exploit: run code that checks for differences and displays one or the other (typically trivial since differences are known in advance).</li>
<li>Structure exploit: exploit file structure (typically, the length of a comment) to hide one content or show the other (depends on the file format and its parsers).</li>
</ul>
<p>Two files with this structure:</p>
<table>
<thead>
<tr class="header">
<th style="text-align: center;">Prefix</th>
<th style="text-align: center;">=</th>
<th style="text-align: center;">Prefix</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">Collision <em>A</em></td>
<td style="text-align: center;">≠</td>
<td style="text-align: center;">Collision <em>B</em></td>
</tr>
<tr class="even">
<td style="text-align: center;"><strong>A</strong></td>
<td style="text-align: center;">=</td>
<td style="text-align: center;"><del>A</del></td>
</tr>
<tr class="odd">
<td style="text-align: center;"><del>B</del></td>
<td style="text-align: center;">=</td>
<td style="text-align: center;"><strong>B</strong></td>
</tr>
</tbody>
</table>
<p>will show either A or B.</p>
<img alt='identical prefix collisions' src=pics/identical.png width=350/>
<h3 id="fastcoll-md5"><a href="https://www.win.tue.nl/hashclash/">FastColl</a> (MD5)</h3>
<p>Final version in 2009.</p>
<ul>
<li>time: a few seconds of computation</li>
<li>space: two blocks</li>
<li>differences: no control before, no control after. FastColl difference mask:
<pre><code>.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. X. .. .. .. .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. .. .. .. X. .X ..
.. .. .. .. .. .. .. .. .. .. .. X. .. .. .. ..
</code></pre></li>
<li>exploitation: hard</li>
</ul>
<p>The differences aren't near the start/end of the blocks, so it's very hard to exploit since you don't control any nearby byte. A potential solution is to brute-force the surrounding bytes - cf <a href="https://github.com/angea/pocorgtfo#0x14">PoCGTFO 14:10</a>.</p>
<p><strong>Examples</strong>:</p>
<p>With an empty prefix:</p>
<pre><code>MD5: fe6c446ee3a831ee010f33ac9c1b602c
SHA256: c5dd2ef7c74cd2e80a0fd16f1dd6955c626b59def888be734219d48da6b9dbdd
00: 37 75 C1 F1-C4 A7 5A E7-9C E0 DE 7A-5B 10 80 26 7u┴±─ºZτ£α▐z[►Ç&
10: 02 AB D9 39-C9 6C 5F 02-12 C2 7F DA-CD 0D A3 B0 ☻½┘9╔l_☻↕┬⌂┌═♪ú░
20: 8C ED FA F3-E1 A3 FD B4-EF 09 E7 FB-B1 C3 99 1D îφ·≤ßú²┤∩○τ√▒├Ö↔
30: CD 91 C8 45-E6 6E FD 3D-C7 BB 61 52-3E F4 E0 38 ═æ╚Eµn²=╟╗aR>⌠α8 \
40: 49 11 85 69-EB CC 17 9C-93 4F 40 EB-33 02 AD 20 I◄àiδ╠↨£ôO@δ3☻¡
50: A4 09 2D FB-15 FA 20 1D-D1 DB 17 CD-DD 29 59 1E ñ○-√§· ↔╤█↨═▌)Y▲ ................
60: 39 89 9E F6-79 46 9F E6-8B 85 C5 EF-DE 42 4F 46 9ë₧÷yFƒµïà┼∩▐BOF ...X............
70: C2 78 75 9D-8B 65 F4 50-EA 21 C5 59-18 62 FF 7B ┬xu¥ïe⌠PΩ!┼Y↑b { .............XX.
...........X....
................
00: 37 75 C1 F1-C4 A7 5A E7-9C E0 DE 7A-5B 10 80 26 7u┴±─ºZτ£α▐z[►Ç& ...X............
10: 02 AB D9 B9-C9 6C 5F 02-12 C2 7F DA-CD 0D A3 B0 ☻½┘╣╔l_☻↕┬⌂┌═♪ú░ .............XX.
20: 8C ED FA F3-E1 A3 FD B4-EF 09 E7 FB-B1 43 9A 1D îφ·≤ßú²┤∩○τ√▒CÜ↔ ...........X....
30: CD 91 C8 45-E6 6E FD 3D-C7 BB 61 D2-3E F4 E0 38 ═æ╚Eµn²=╟╗a╥>⌠α8
40: 49 11 85 69-EB CC 17 9C-93 4F 40 EB-33 02 AD 20 I◄àiδ╠↨£ôO@δ3☻¡ /
50: A4 09 2D 7B-15 FA 20 1D-D1 DB 17 CD-DD 29 59 1E ñ○-{§· ↔╤█↨═▌)Y▲
60: 39 89 9E F6-79 46 9F E6-8B 85 C5 EF-DE C2 4E 46 9ë₧÷yFƒµïà┼∩▐┬NF
70: C2 78 75 9D-8B 65 F4 50-EA 21 C5 D9-18 62 FF 7B ┬xu¥ïe⌠PΩ!┼┘↑b {
MD5: fe6c446ee3a831ee010f33ac9c1b602c
SHA256: e27cf3073c704d0665da42d597d4d20131013204eecb6372a5bd60aeddd5d670
</code></pre>
<p>Other examples, with an identical prefix: <a href="examples/fastcoll1.bin">1</a> ⟷ <a href="examples/fastcoll2.bin">2</a></p>
<p><strong>Variant</strong>: there is a <a href="https://marc-stevens.nl/research/md5-1block-collision/">single-block MD5 collision</a> but it takes five weeks of computation.</p>
<p>Here is a <a href="examples/fastcoll.svg">recording</a> of a FastColl computation without any prefix and <a href="examples/fastcoll-prefix.svg">another one</a> with a prefix.</p>
<h3 id="unicoll-md5"><a href="unicoll.md">UniColl</a> (MD5)</h3>
<p>Documented in <a href="https://www.cwi.nl/system/files/PhD-Thesis-Marc-Stevens-Attacks-on-Hash-Functions-and-Applications.pdf#page=199">2012</a>, implemented in <a href="https://github.com/cr-marcstevens/hashclash/blob/95c2619a8078990056beb7aaa59104021714ee3c/scripts/poc_no.sh">2017</a></p>
<p><a href="https://github.com/cr-marcstevens/hashclash#create-you-own-identical-prefix-collision">UniColl</a> lets you control a few bytes in the collision blocks, before and after the first difference, which makes it an identical-prefix collision with some controllable differences, almost like a chosen-prefix collision. This is very handy, and even better the difference can be very predictable: in the case of <code>m2+= 2^8</code> (a.k.a. <code>N=1</code> / <code>m2 9</code> in HashClash <a href="https://github.com/cr-marcstevens/hashclash/blob/master/scripts/poc_no.sh#L30">poc_no.sh</a> script), the difference is +1 on the 9th byte, which makes it very exploitable, as you can even think about the collision in your head: the 9th character of that sentence will be replaced with the next one: <code>0</code> replaced by <code>1</code>, <code>a</code> replaced by <code>b</code>..</p>
<ul>
<li>time: a few minutes (depends on the amount of byte you want to control )</li>
<li>space: two blocks</li>
<li>differences:
<pre><code>.. .. .. .. DD .. .. .. ..
.. .. .. .. +1 .. .. .. ..
</code></pre></li>
<li>exploitation: very easy - controlled bytes before and after the difference, and the difference is predictable. The only restrictions are alignment and that you 'only' control 10 bytes after the difference.</li>
</ul>
<p>Examples with <code>N=1</code> and 20 bytes of set text in the collision blocks:</p>
<pre><code>00: 55 6E 69 43-6F 6C 6C 20-31 20 70 72-65 66 69 78 UniColl 1 prefix
10: 20 32 30 62-F5 48 34 B9-3B 1C 01 9F-C8 6B E6 44 20b⌡H4╣;∟☺ƒ╚kµD
20: FE F6 31 3A-63 DB 99 3E-77 4D C7 5A-6E B0 A6 88 ■÷1:c█Ö>wM╟Zn░ªê
30: 04 05 FB 39-33 21 64 BF-0D A4 FE E2-A6 9D 83 36 ♦♣√93!d┐♪ñ■Γª¥â6 \
40: 4B 14 D7 F2-47 53 84 BA-12 2D 4F BB-83 78 6C 70 K¶╫≥GSä║↕-O╗âxlp
50: C6 EB 21 F2-F6 59 9A 85-14 73 04 DD-57 5F 40 3C ╞δ!≥÷YÜà¶s♦▌W_@< .........X......
60: E1 3F B0 DB-E8 B4 AA B0-D5 56 22 AF-B9 04 26 FC ß?░█Φ┤¬░╒V"»╣♦&ⁿ ................
70: 9F D2 0C 00-86 C8 ED DE-85 7F 03 7B-05 28 D7 0F ƒ╥♀ å╚φ▐à⌂♥{♣(╫☼ ................
................
.........X......
00: 55 6E 69 43-6F 6C 6C 20-31 21 70 72-65 66 69 78 UniColl 1!prefix ................
10: 20 32 30 62-F5 48 34 B9-3B 1C 01 9F-C8 6B E6 44 20b⌡H4╣;∟☺ƒ╚kµD ................
20: FE F6 31 3A-63 DB 99 3E-77 4D C7 5A-6E B0 A6 88 ■÷1:c█Ö>wM╟Zn░ªê ................
30: 04 05 FB 39-33 21 64 BF-0D A4 FE E2-A6 9D 83 36 ♦♣√93!d┐♪ñ■Γª¥â6
40: 4B 14 D7 F2-47 53 84 BA-12 2C 4F BB-83 78 6C 70 K¶╫≥GSä║↕,O╗âxlp /
50: C6 EB 21 F2-F6 59 9A 85-14 73 04 DD-57 5F 40 3C ╞δ!≥÷YÜà¶s♦▌W_@<
60: E1 3F B0 DB-E8 B4 AA B0-D5 56 22 AF-B9 04 26 FC ß?░█Φ┤¬░╒V"»╣♦&ⁿ
70: 9F D2 0C 00-86 C8 ED DE-85 7F 03 7B-05 28 D7 0F ƒ╥♀ å╚φ▐à⌂♥{♣(╫☼
</code></pre>
<p>UniColl has less control than a true chosen-prefix collision, but it's much faster especially since it takes only two blocks.</p>
<p>Here is a <a href="examples/unicoll.svg">recording</a> of a UniColl computation.</p>
<h3 id="shattered-sha1"><a href="http://shattered.io">Shattered</a> (SHA1)</h3>
<p>Documented in <a href="https://marc-stevens.nl/research/papers/EC13-S.pdf">2013</a>, computed in <a href="http://shattered.io">2017</a>.</p>
<ul>
<li>time: 6500 years.CPU and 110 year.GPU</li>
<li>space: two blocks</li>
<li>differences:
<pre><code>.. .. .. DD ?? ?? ?? ??
or
?? ?? ?? DD .. .. .. ..
</code></pre></li>
<li>exploitation: medium. The differences are right at the start and at the end of the collision blocks. So no control before <strong>and</strong> after a length in the prefix/in the suffix: PNG stores its length before the chunk type, so it won't work. However it will work with JP2 files when they use the JFIF form (the same as JPG), and likely MP4 and other atom/box formats if you use long lengths on 64bits (in this case, they're placed <em>after</em> the atom type).</li>
</ul>
<p>The difference between collision blocks of each side is this Xor mask:</p>
<pre><code>0C 00 00 02 C0 00 00 10 B4 00 00 1C 3C 00 00 04
BC 00 00 1A 20 00 00 10 24 00 00 1C EC 00 00 14
0C 00 00 02 C0 00 00 10 B4 00 00 1C 2C 00 00 04
BC 00 00 18 B0 00 00 10 00 00 00 0C B8 00 00 10
</code></pre>
<img alt='Shattered PoCs side by side' src=pics/shattered.png width=1000 />
<p>Examples: <a href="https://github.com/angea/pocorgtfo#0x18">PoC||GTFO 0x18</a> is using the computed SHA1 prefixes, re-using the image directly from PDFLaTeX source (see <a href="https://archive.org/stream/pocorgtfo18#page/n62/mode/1up">article 18:10</a>), but also checking the value of the prefixes via JavaScript in the HTML page (the file is polyglot, ZIP HTML and PDF).</p>
<h2 id="chosen-prefix-collisions">Chosen-prefix collisions</h2>
<p>They allow to collide any content.</p>
<table>
<thead>
<tr class="header">
<th style="text-align: center;">𝓐</th>
<th style="text-align: center;">≠</th>
<th style="text-align: center;">𝔅</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">Collision <em>A</em></td>
<td style="text-align: center;">≠</td>
<td style="text-align: center;">Collision <em>B</em></td>
</tr>
</tbody>
</table>
<ol>
<li>take two arbitrary prefixes</li>
<li>pad the shortest to be as long as the longest. both are padded to the next block - minus 12 bytes</li>
</ol>
<ul>
<li>these 12 bytes of random data will be added on both sides to randomize the birthday search</li>
</ul>
<ol start="3">
<li><p>X near-collision blocks will be computed and appended.</p>
<p>The fewer blocks, the longer the computation.</p>
<p>Ex: <a href="https://www.win.tue.nl/hashclash/SingleBlock/">400 kHours for one block</a>. 72 hours.cores for nine blocks with <a href="https://github.com/cr-marcstevens/hashclash">HashClash</a>.</p></li>
</ol>
<img alt='chosen-prefix collisions' src=pics/chosen.png width=400/>
<p>Chosen-prefix collisions are almighty, but they can take a long time just for a pair of files.</p>
<h3 id="hashclash-md5"><a href="https://github.com/cr-marcstevens/hashclash">HashClash</a> (MD5)</h3>
<p>Final version in <a href="https://www.win.tue.nl/hashclash/ChosenPrefixCollisions/">2009</a>.</p>
<p>Examples: let's collide <code>yes</code> and <code>no</code>. It took three hours on 24 cores.</p>
<pre><code>'yes' prefix:
000: 79 65 73 0A-3D 62 84 11-01 75 D3 4D-EB 80 93 DE yes◙=bä◄☺u╙MδÇô▐ - Prefix, padding
010: 31 C1 D9 30-45 FB BE 1E-71 F0 0A 63-75 A8 30 AA 1┴┘0E√╛▲q≡◙cu¿0¬
020: 98 17 CA E3-A2 6B 8E 3D-44 A9 8F F2-0E 67 96 48 ÿ↨╩πókÄ=D⌐Å≥♫gûH
030: 97 25 A6 FB-00 00 00 00-49 08 09 33-F0 62 C4 E8 ù%ª√ I◘○3≡b─Φ
040: D5 F1 54 CD-CA A1 42 90-7F 9D 3D 9A-67 C4 1B 0F ╒±T═╩íBÉ⌂¥=Üg─←☼ - Collision blocks start
050: 04 9F 19 E8-92 C3 AA 19-43 31 1A DB-DA 96 01 54 ♦ƒ↓ΦÆ├¬↓C1→█┌û☺T
060: 85 B5 9A 88-D8 A5 0E FB-CD 66 9A DA-4F 20 8A AA à╡Üê╪Ñ♫√═fÜ┌O è¬
070: BA E3 9C F0-78 31 8F D1-14 5F 3E B9-0F 9F 3E 19 ║π£≡x1Å╤¶_>╣☼ƒ>↓
080: 09 9C BB A9-45 89 BA A8-03 E6 C0 31-A0 54 D6 26 ○£╗⌐Eë║¿♥µ└1áT╓&
090: 3F 80 4C 06-0F C7 D9 19-09 D3 DA 14-FD CB 39 84 ?ÇL♠☼╟┘↓○╙┌¶²╦9ä
0A0: 1F 0D 77 5F-55 AA 7A 07-4C 24 8B 13-0A 54 A2 BC ▼♪w_U¬z•L$ï‼◙Tó╝
0B0: C5 12 7D 4F-E0 5E F2 23-C5 07 61 E4-80 91 B2 13 ┼↕}Oα^≥#┼•aΣÇæ▓‼
0C0: E7 79 07 2A-CF 1B 66 39-8C F0 8E 7E-75 25 22 1D τy•*╧←f9î≡Ä~u%"↔
0D0: A7 3B 49 4A-32 A4 3A 07-61 26 64 EA-6B 83 A2 8D º;IJ2ñ:•a&dΩkâóì
0E0: BE A3 FF BE-4E 71 AE 18-E2 D0 86 4F-20 00 30 26 ╛ú ╛Nq«↑Γ╨åO 0&
0F0: 0A 71 DE 1F-40 B4 F4 8F-9C 50 5C 78-DD CD 72 89 ◙q▐▼@┤⌠Å£P\x▌═rë
100: BA D1 BF F9-96 80 E3 06-96 F3 B9 7C-77 2D EB 25 ║╤┐∙ûÇπ♠û≤╣|w-δ%
110: 1E 56 70 D7-14 1F 55 4D-EC 11 58 59-92 45 E1 33 ▲Vp╫¶▼UM∞◄XYÆEß3
120: 3E 0E A1 6E-FF D9 90 AD-F6 A0 AD 0E-C6 D6 88 12 >♫ín ┘É¡÷á¡♫╞╓ê↕
130: B8 74 F2 9E-DD 53 F7 88-19 73 85 39-AA 9B E0 8D ╕t≥₧▌S≈ê↓sà9¬¢αì
\
140: 82 BF 9C 5E-58 42 1E 3B-94 CF 5B 54-73 5F A8 4A é┐£^XB▲;ö╧[Ts_¿J
150: FD 5B 64 CF-59 D1 96 74-14 B3 0C AF-11 1C F9 47 ²[d╧Y╤ût¶│♀»◄∟∙G ................
160: C5 7A 2C F7-D5 24 F5 EB-BE 54 3E 12-B0 24 67 3F ┼z,≈╒$⌡δ╛T>↕░$g? ................
170: 01 DD 95 76-8D 0D 58 FB-50 23 70 3A-BD ED BE AC ☺▌òvì♪X√P#p:╜φ╛¼ ...............X
................
180: B8 32 DB AE-E8 DC 3A 83-7A C8 D5 0F-08 90 1D 99 ╕2█«Φ▄:âz╚╒☼◘É↔Ö
190: 2D 7D 17 34-4E A8 21 98-61 1A 65 DA-FC 9B A4 BA -}↨4N¿!ÿa→e┌ⁿ¢ñ║ ................
1A0: E1 42 2B 86-0C 94 2A F6-D6 A4 81 B5-2B 0B E9 37 ßB+å♀ö*÷╓ñü╡+♂Θ7 ................
1B0: 44 D2 E4 23-14 7C 16 B8-84 90 8B E0-A1 A7 BD 27 D╥Σ#¶|▬╕äÉïαíº╜' ..............X.
................
1C0: C7 7E E6 17-1A 93 C5 EE-59 70 91 26-4E 9D C7 7C ╟~µ↨→ô┼εYpæ&N¥╟|
1D0: 1D 3D AB F1-B4 F4 F1 D9-86 48 75 77-6E FE 98 84 ↔=½±┤⌠±┘åHuwn■ÿä ................
1E0: EF 3C 1C C7-16 5A 1F 83-60 EC 5C FE-CA 17 0C 74 ∩<∟╟▬Z▼â`∞\■╩↨♀t ................
1F0: EB 8E 9D F6-90 A3 CD 08-65 D5 5A 4C-2E C6 BE 54 δÄ¥÷Éú═◘e╒ZL.╞╛T ...............X
................
'no' prefix: ................
000: 6E 6F 0A E5-5F D0 83 01-9B 4D 55 06-61 AB 88 11 no◙σ_╨â☺¢MU♠a½ê◄ ................
010: 8A FA 4D 34-B3 75 59 46-56 97 EF 6C-4A 07 90 CC è·M4│uYFVù∩lJ•É╠ ............X...
020: FE 19 D7 CF-6F 92 03 9C-91 AA A5 DA-56 92 C1 04 ■↓╫╧oÆ♥£æ¬Ñ┌VÆ┴♦ ................
030: E6 4C 08 A3-00 00 00 00-8D B6 4E 47-FF AF 7A 3C µL◘ú ì╢NG »z<
................
040: D5 F1 54 CD-CA A1 42 90-7F 9D 3D 9A-67 C4 1B 0F ╒±T═╩íBÉ⌂¥=Üg─←☼ ................
050: 04 9F 19 E8-92 C3 AA 19-43 31 1A DB-DA 96 01 54 ♦ƒ↓ΦÆ├¬↓C1→█┌û☺T ............X...
060: 85 B5 9A 88-D8 A5 0E FB-CD 66 9A DA-4F 20 8A A9 à╡Üê╪Ñ♫√═fÜ┌O è⌐ ................
070: BA E3 9C F0-78 31 8F D1-14 5F 3E B9-0F 9F 3E 19 ║π£≡x1Å╤¶_>╣☼ƒ>↓
................
080: 09 9C BB A9-45 89 BA A8-03 E6 C0 31-A0 54 D6 26 ○£╗⌐Eë║¿♥µ└1áT╓& ................
090: 3F 80 4C 06-0F C7 D9 19-09 D3 DA 14-FD CB 39 84 ?ÇL♠☼╟┘↓○╙┌¶²╦9ä .............X..
0A0: 1F 0D 77 5F-55 AA 7A 07-4C 24 8B 13-0A 54 B2 BC ▼♪w_U¬z•L$ï‼◙T▓╝ ................
0B0: C5 12 7D 4F-E0 5E F2 23-C5 07 61 E4-80 91 B2 13 ┼↕}Oα^≥#┼•aΣÇæ▓‼
................
0C0: E7 79 07 2A-CF 1B 66 39-8C F0 8E 7E-75 25 22 1D τy•*╧←f9î≡Ä~u%"↔ ................
0D0: A7 3B 49 4A-32 A4 3A 07-61 26 64 EA-6B 83 A2 8D º;IJ2ñ:•a&dΩkâóì ...............X
0E0: BE A3 FF BE-4E 71 AE 18-E2 D0 86 4F-20 00 30 22 ╛ú ╛Nq«↑Γ╨åO 0" ................
0F0: 0A 71 DE 1F-40 B4 F4 8F-9C 50 5C 78-DD CD 72 89 ◙q▐▼@┤⌠Å£P\x▌═rë
/
100: BA D1 BF F9-96 80 E3 06-96 F3 B9 7C-77 2D EB 25 ║╤┐∙ûÇπ♠û≤╣|w-δ%
110: 1E 56 70 D7-14 1F 55 4D-EC 11 58 59-92 45 E1 33 ▲Vp╫¶▼UM∞◄XYÆEß3
120: 3E 0E A1 6E-FF D9 90 AD-F6 A0 AD 0E-CA D6 88 12 >♫ín ┘É¡÷á¡♫╩╓ê↕
130: B8 74 F2 9E-DD 53 F7 88-19 73 85 39-AA 9B E0 8D ╕t≥₧▌S≈ê↓sà9¬¢αì
140: 82 BF 9C 5E-58 42 1E 3B-94 CF 5B 54-73 5F A8 4A é┐£^XB▲;ö╧[Ts_¿J
150: FD 5B 64 CF-59 D1 96 74-14 B3 0C AF-11 1C F9 47 ²[d╧Y╤ût¶│♀»◄∟∙G
160: C5 7A 2C F7-D5 24 F5 EB-BE 54 3E 12-70 24 67 3F ┼z,≈╒$⌡δ╛T>↕p$g?
170: 01 DD 95 76-8D 0D 58 FB-50 23 70 3A-BD ED BE AC ☺▌òvì♪X√P#p:╜φ╛¼
180: B8 32 DB AE-E8 DC 3A 83-7A C8 D5 0F-08 90 1D 99 ╕2█«Φ▄:âz╚╒☼◘É↔Ö
190: 2D 7D 17 34-4E A8 21 98-61 1A 65 DA-FC 9B A4 BA -}↨4N¿!ÿa→e┌ⁿ¢ñ║
1A0: E1 42 2B 86-0C 94 2A F6-D6 A4 81 B5-2B 2B E9 37 ßB+å♀ö*÷╓ñü╡++Θ7
1B0: 44 D2 E4 23-14 7C 16 B8-84 90 8B E0-A1 A7 BD 27 D╥Σ#¶|▬╕äÉïαíº╜'
1C0: C7 7E E6 17-1A 93 C5 EE-59 70 91 26-4E 9D C7 7C ╟~µ↨→ô┼εYpæ&N¥╟|
1D0: 1D 3D AB F1-B4 F4 F1 D9-86 48 75 77-6E FE 98 84 ↔=½±┤⌠±┘åHuwn■ÿä
1E0: EF 3C 1C C7-16 5A 1F 83-60 EC 5C FE-CA 17 0C 54 ∩<∟╟▬Z▼â`∞\■╩↨♀T
1F0: EB 8E 9D F6-90 A3 CD 08-65 D5 5A 4C-2E C6 BE 54 δÄ¥÷Éú═◘e╒ZL.╞╛T
</code></pre>
<p>Here is a <a href="examples/cpc.html">log</a> of the whole operation.</p>
<h3 id="shambles-sha-1"><a href="https://sha-mbles.github.io/">Shambles</a> (SHA-1)</h3>
<p>Shambles is a very expensive chosen-prefix collision that uses 9 blocks.</p>
<p>Each block has the same xor pattern as Shattered:</p>
<pre><code>0C 00 00 02 C0 00 00 10 B4 00 00 1C 3C 00 00 04
BC 00 00 1A 20 00 00 10 24 00 00 1C EC 00 00 14
0C 00 00 02 C0 00 00 10 B4 00 00 1C 2C 00 00 04
BC 00 00 18 B0 00 00 10 00 00 00 0C B8 00 00 10
</code></pre>
<p>But even if Shattered is much easier to exploit than FastColl, the constraints of the differences in the collision blocks are irrelevant since Shambles is a Chosen Prefix Collision.</p>
<h2 id="attacks-summary">Attacks summary</h2>
<table>
<thead>
<tr class="header">
<th>Hash</th>
<th>Name</th>
<th>Date</th>
<th>Duration</th>
<th>Prefix type</th>
<th>Control near diff</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MD5</td>
<td>FastColl</td>
<td>2009</td>
<td>2s</td>
<td>Identical</td>
<td>none</td>
</tr>
<tr class="even">
<td> </td>
<td>UniColl</td>
<td>2012</td>
<td>7-40min</td>
<td>Identical</td>
<td>4-10 bytes</td>
</tr>
<tr class="odd">
<td> </td>
<td>HashClash</td>
<td>2009</td>
<td>72h</td>
<td>Chosen</td>
<td>n/a</td>
</tr>
<tr class="even">
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td></td>
</tr>
<tr class="odd">
<td>SHA1</td>
<td>Shattered</td>
<td>2013</td>
<td>6500yr</td>
<td>Identical</td>
<td>prefix & suffix</td>
</tr>
<tr class="even">
<td> </td>
<td>Shambles</td>
<td>2020</td>
<td>?</td>
<td>Chosen</td>
<td>n/a</td>
</tr>
</tbody>
</table>
<h1 id="exploitations">Exploitations</h1>
<p>Identical prefix collisions is usually seen as (very) limited, but chosen-prefix is time consuming.</p>
<p>Another approach is to craft re-usable prefixes via either identical-prefix attack such as UniColl - or chosen-prefix to overcome some limitations - but re-use that prefix pair in combinations with two payloads like a classic identical prefix attack.</p>
<p>Once the prefix pair has been computed, it makes colliding two contents instant: it's just a matter of massaging file data (according to specific file formats) so that it fits the file formats specifications and the precomputed prefix requirements.</p>
<h2 id="standard-strategy">Standard strategy</h2>
<p>Classic collisions of two valid files with the same file type.</p>
<h3 id="jpg">JPG</h3>
<img alt='a JPG file' src=https://raw.githubusercontent.com/corkami/pics/master/binary/JPG.png width=500/>
<p>Theoretical limitations and workarounds:</p>
<ul>
<li><p>the <em>Application</em> segment should in theory right after the <em>Start of Image</em> marker. In practice, this is not necessary, so our collision can be generic: the only limitation is the size of the smallest image.</p></li>
<li><p>a comment's length is stored on two bytes, so the amount it can store is limited to it's limited to 65536 bytes (roughly the size of a 400x400 photo)</p></li>
<li><p>rather than jumping over a complete JPG file, one can split that file in its segments, and add jump trampolines between segments</p>
<img alt='comments over each image segments' src=pics/jpgcom1.png width=500/>
<p><em>comments over each image segments</em></p>
<img alt='how comments trampoline work' src=pics/jpgcom2.png width=500/>
<p><em>how comments trampoline work</em></p></li>
<li><p>while most of a JPG structure is made of segments that are all limited to 65536 bytes in size, the actual compressed data is stored in the <em>Entropy Coded Segment</em> which doesn't respect its limitations: its size is unknown in advance and grows beyond that limit. It grows with the size of the image, making most of the file size in a baseline (non progressive) image. To make the whole image fit into 64kb chunks, the easy way is to first try to save the image as progressive (which any software can do, and splits the ECS in typically up to six scans). The more advanced way is to use <em>JPEGTran</em> with its 'wizard' <code>--scans</code> command line parameter and define custom scans.</p></li>
</ul>
<p>There's no other restriction besides the scans segments, so an MD5 collision of two arbitrary JPGs is <em>instant</em>, and needs no chosen-prefix collision, just UniColl.</p>
<p>With the <a href="scripts/jpg.py">script</a>:</p>
<pre><code>21:07:35.65>jpg.py Ange.jpg Marc.jpg
21:07:35.75>
</code></pre>
<p>Examples:</p>
<p><img alt='identical prefix collisions' src=examples/collision1.jpg height=200/> ⟷ <img alt='identical prefix collisions' src=examples/collision2.jpg height=200/></p>
<h4 id="custom-scans">custom scans</h4>
<p><em>2 MD5-colliding JPGs</em></p>
<p>Here's an example of <em>JPEGTran</em> scans definition to turn <a href="pics/pocorgtfo14.png">a 1944x2508 RGB image</a> into a 100% JPG with 20 scans in which they all fit in 64kb.</p>
<pre><code>// <component>: <minbyte>-<maxbyte>, <minbit>, <maxbit>;
// 0=luma
0: 0-0, 0, 0;
0: 1-1, 0, 0;
0: 2-6, 0, 0;
0: 7-10, 0, 0;
0: 11-13, 0, 0;
0: 14-20, 0, 0;
0: 21-26, 0, 0;
0: 27-32, 0, 0;
0: 33-40, 0, 0;
0: 41-48, 0, 0;
0: 49-54, 0, 0;
0: 55-63, 0, 0;
// 1=blueness
1: 0-0, 0, 0;
1: 1-16, 0, 0;
1: 17-32, 0, 0;
1: 33-63, 0, 0;
// 2=redness
2: 0-0, 0, 0;
2: 1-16, 0, 0;
2: 17-32, 0, 0;
2: 33-63, 0, 0;
</code></pre>
<p>Result:</p>
<img alt='a 1944x2508 RGB image as a 100% JPG with 20 scans' src=examples/20scans.jpg width=300/>
<p><em>a 1944x2508 RGB image as a 100% JPG with 20 scans</em></p>
<h3 id="png">PNG</h3>
<img alt='a PNG file' src=https://raw.githubusercontent.com/corkami/pics/master/binary/PNG.png width=500/>
<p>Theoretical limitations and workarounds:</p>
<ul>
<li>PNG uses CRC32 at the end of its chunks, but in practice they're ignored. They can be correct but it's not required.</li>
<li>the image meta data (dimensions, color space...) are stored in the <code>IHDR</code> chunk, which should in theory be right after the signature (ie, before any potential comment), so it would mean that we can only precompute collisions of images with the same meta data. However, that chunk can actually be after a comment block (in the vast majority of readers, except Apple ones), so we can put the collision data before the header, which enables to collide any pair of PNG with a single precomputation.</li>
</ul>
<p>Since a PNG chunk has a length on four bytes, there's no need to modify the structure of either file: we can jump over a whole image in one go.</p>
<p>We can insert as many discarded chunks as we want, so we can add one for alignment, then one which length will be altered by a UniColl. so the length will be <code>00</code> <code>75</code> and <code>01</code> <code>75</code>.</p>
<p>So an MD5 collision of two arbitrary PNG images is <em>instant</em>, with no prerequisite (no computation, just some minor file changes), and needs no chosen-prefix collision, just UniColl.</p>
<p>With the <a href="scripts/png.py">script</a>:</p>
<pre><code>19:27:04.79>png.py nintendo.png sega.png
19:27:04.87>
</code></pre>
<p>Examples:</p>
<p><img alt='identical prefix collisions' src=examples/collision1.png width=40% /> ⟷ <img alt='identical prefix collisions' src=examples/collision2.png width=40% /></p>
<p><em>2 MD5-colliding PNGs with different properties</em></p>
<p>Here is a <a href="examples/pngGen.svg">recording</a> of the whole operation.</p>
<p><img src="examples/pngGen.svg" alt="a recording of a universal (abusive) PNG collision" /></p>
<h4 id="incompatibility">incompatibility</h4>
<p>Most readers accept flawlessly PNG files that start with a chunk that is not <code>IHDR</code>.</p>
<p>However, some (such as Safari and Preview - any other?) don't tolerate it. In this case, the image header and its properties (dimensions, color space) must be first, before any collision blocks.</p>
<p>In this case, both colliding files must have the same properties. Again, UniColl is enough, and of course the computed prefix pair can be reused for any other pair of files with the same properties</p>
<p>Here is a <a href="scripts/pngStd.py">script</a> to collide any pair of such files that launches UniColl if needed to compute the prefix pair.</p>
<p>Examples:</p>
<p><img alt='identical prefix collisions' src=examples/0a959025-1.png width=350/> ⟷ <img alt='identical prefix collisions' src=examples/0a959025-2.png width=350/></p>
<p><img alt='identical prefix collisions' src=examples/aac2423a-1.png width=350/> ⟷ <img alt='identical prefix collisions' src=examples/aac2423a-2.png width=350/></p>
<p><em>2 pairs of MD5-colliding PNGs with identical properties for maximum compatibility</em></p>
<p>Here is a <a href="examples/pngUniColl.svg">recording</a> of the whole operation when UniColl is invoked,</p>
<p><img src="examples/pngUniColl.svg" alt="a recording of PNG UniColl collision" /></p>
<p>and <a href="examples/pngSpec.svg">another one</a> when the prefix has been already computed.</p>
<p><img src="examples/pngSpec.svg" alt="a recording of precomputed PNG collision" /></p>
<h3 id="gif">GIF</h3>
<img alt='a GIF file' src=https://raw.githubusercontent.com/corkami/pics/master/binary/GIF.png width=500/>
<p>GIF is tricky:</p>
<ul>
<li>it stores its meta data in the header before any comment is possible, so there can't be a generic prefix for all GIF files.</li>
<li>if the file has a global palette, it is also stored before a comment is possible too.</li>
<li>its comment chunks are limited to a single byte in length, so a maximum of 256 bytes!</li>
</ul>
<p>However, the comment chunks follow a peculiar structure: it's a chain of <code><length:1></code> <code><data:length></code> until a null length is defined. So it makes any non-null byte a valid 'jump forward'. Which makes it suitable to be used with FastColl, as shown in <a href="https://github.com/angea/pocorgtfo#0x14">PoC||GTFO 14:11</a>.</p>
<p>So at least, even if we can't have a generic prefix, we can collide any pair of GIF of same metadata (dimensions, palette) and we only need a second of FastColl to compute its prefix.</p>
<p>Now the problem is that we can't jump over a whole image like PNG or over a big structure like JPG.</p>
<p>A possible workaround is to massage the compressed data or to chunk the image in tiny areas like in the case of the GIF hashquine, but this is not optimal.</p>
<p>Another idea that works generically is that the image data is also stored using this <code>length data</code> sequence structure: so if we take two GIFs with no animation, we only have to:</p>
<ul>
<li>normalize the palette</li>
<li>set the first frame duration to the maximum</li>
<li>craft a comment that will jump to the start of the first frame data, so that the comment will sled over the image data as a comment, and end the same way: until a null length is encountered. Then the parser will meet the next frame, and display it.</li>
</ul>
<p>With a minor setup (only a few hundred bytes of overhead), we can sled over any GIF image and work around the 256 bytes limitation. This idea was suggested by Marc, and it's brilliant!</p>
<p>So in the end, the current GIF limitations for <em>instant</em> MD5 collisions are:</p>
<ul>
<li>no animation</li>
<li>the images have to be normalized to the same palette - see <a href="https://www.lcdf.org/gifsicle/"><code>gifsicle --use-colormap web</code></a></li>
<li>the images have to be the same dimensions</li>
<li>after 11 minutes, both files will show the same image</li>
</ul>
<p>An easy shortcut to normalize still GIF images is to make them animation frames of the same image, then we can use a <a href="scripts/gif.py">script</a> to re-use or compute FastColl blocks to make a file pair that shows each of them.</p>
<p>Examples:</p>
<p><img alt='identical prefix collisions' src=examples/collision1.gif width=350/> ⟷ <img alt='identical prefix collisions' src=examples/collision2.gif width=350/></p>
<p><em>2 MD5-colliding GIFs - pics by <a href="https://www.kidmograph.com/">KidMoGraph</a></em></p>
<p>Here is a <a href="examples/gifFastColl.svg">recording</a> of the whole operation.</p>
<p><img src="examples/gifFastColl.svg" alt="a recording of a GIF FastColl collision" /></p>
<h3 id="gzip">GZIP</h3>
<img alt='a GZIP file' src=https://raw.githubusercontent.com/corkami/pics/master/binary/GZip.png width=500/>
<p>GZIP specs v4.3: <a href="https://datatracker.ietf.org/doc/html/rfc1952">RFC 1952</a> (1996).</p>
<ul>
<li>a Gzip file is made of one or more 'members' (gzip streams) concatenated. They will be all decompressed and their uncompressed content appended to each other - even if the member's uncompressed content is empty.</li>
<li>these members can be separated with zeroes. Zeroes will be just skipped, except at file start. Any non-null byte will be checked for the signature <code>1F 8B</code>. If not matching the signature, the parsing will stop, which can be used to forcibly stop parsing between two payloads, but will trigger some warnings that might cause problems. Another strategy is to add one extra empty member at the end of the file, and make parsing of both payloads finish there - on the member or on its body.</li>
<li>The optional <code>filename</code> and <code>file comment</code> are null-terminated whereas the <code>Extra field</code> is size16-defined, therefore abusable. It's made of one or more subfield(s), with an ID and its own sublength, but subfields are not enforced - very few are officially defined.</li>
</ul>
<p>Therefore an empty gzip member with an extra field is a perfect parasite host.</p>
<p>If the top file is too big to fit in an extra field, then its uncompressed stream can be split in smaller files until they all fit in extra fields.</p>
<p>After the header of a member come its compressed body, its CRC32 and its uncompressed size (not enforced). Therefore an empty data body with its null CRC32 and size make a generic postwrap, which can even be shared by different member headers.</p>
<p>Various implementations rely on the uncompressed size of the last member instead of the sum of all members. So our collided files will show that they are null-sized, because these files finish with an empty member used as trampoline.</p>
<p>Here is a <a href="scripts/gz.py">script</a> to generate instant MD5 collisions of two GZip files. It's taking most of its time to decompress and recompress data if the input files are big - the collisions prefix are pre-computed. Splitting members without decompressing is not possible as the uncompressed CRC32 needs to be calculated.</p>
<p>A <code>.tar.gz</code> is just the <code>gzip</code> archive of a <code>tar</code> archive. It will work fine with gzipped tar, unlike <code>tar</code> itself.</p>
<p>Examples: <a href="examples/collision1.tar.gz">collision1.tar.gz</a> (Pacome) ⟷ <a href="examples/collision2.tar.gz">collision2.tar.gz</a> (Reg)</p>
<h3 id="lz4--zstandard">LZ4 / Zstandard</h3>
<img alt='an Zstandard file' src=https://raw.githubusercontent.com/corkami/pics/master/binary/zstd_skip.png width=500/>
<img alt='an LZ4 file' src=https://raw.githubusercontent.com/corkami/pics/master/binary/lz4.png width=500/>
<p>LZ4 and Zstandard are 2 different compression formats, with a similar overall structure: they're made of frames, each starting with a specific magic: <code>0xFD2FB528</code> for Zstandard frames, <code>0x184D2204</code> for Lz4 frames.</p>
<p>They also <strong>share</strong> the same 'skippable' TLV frames, starting with 4 bytes <em>magics</em> in the range <code>0x184D2A50</code> - <code>0x184D2A5F</code>, then the <em>Length</em> of the user data (4 bytes, little-endian), then the <em>User Data</em> itself. These frames are entirely optional, of any length, and repeatable. The files can start with these frames. So these frames can be chained to make a perfect generic collision prefix, across 2 formats.</p>
<p>Here is a <a href="scripts/zstd-lz4.py">script</a> to generate instant MD5 collisions of two Zstd/Lz4 files. Like Gzip, 2 different archives will be visible from the outside no matter the content: for example, a <code>.cpio.zst</code>.</p>
<p>Examples:</p>
<ul>
<li><a href="examples/free/md5-1.lz4">md5-1.lz4</a> ⟷ <a href="examples/free/md5-2.lz4">md5-2.lz4</a></li>
<li><a href="examples/free/md5-1.zstd">md5-1.zstd</a> ⟷ <a href="examples/free/md5-2.zstd">md5-2.zstd</a></li>
<li><a href="examples/free/md5-c6a611ce.zstd">md5-c6a611ce.zstd</a> ⟷ <a href="examples/free/md5-c6a611ce.lz4">md5-c6a611ce.lz4</a></li>
</ul>
<h3 id="portable-executable">Portable Executable</h3>
<img alt='a PE file' src=https://raw.githubusercontent.com/corkami/pics/master/binary/PE.png width=600/>
<p>The Portable Executable has a peculiar structure:</p>
<ul>
<li>the old DOS header is almost useless, and points to the next structure, the PE header. The DOS headers has no other role. DOS headers can be exchanged between executables.</li>
<li>the DOS header has to be at offset 0, and has a fixed length of a full block, and the pointer is at the end of the structure, beyond UniColl's reach: so only chosen-prefix collision is useful to collide PE files this way.</li>
<li>The PE header and what follows defines the whole file.</li>
</ul>
<p>So the strategy is:</p>
<ol>
<li>the PE header can be moved down to leave room for collision blocks after the DOS header.</li>
<li>The DOS header can be exploited (via chosen-prefix collisions) to point to two different offsets, where two different PE headers will be moved.</li>
<li>The sections can be put next to each other, after the <code>DOS/Collisions/Header1/Header2</code> structure. You just need to apply a delta to the offsets of the two section tables.</li>
</ol>
<p>This means that it's possible to instantly collide any pair of PE executables. Even if they use different subsystems or architecture.</p>
<p>While executables collisions is usually trivial via any loader, this kind of exploitation here is transparent: the code is identical and loaded at the same address.</p>
<p>Examples: <a href="examples/collision1.exe">tweakPNG.exe</a> (GUI) ⟷ <a href="examples/collision2.exe">fastcoll.exe</a> (CLI)</p>
<p>Here is a <a href="scripts/pe.py">script</a> to generate instant MD5 collisions of Windows Executables.</p>
<img alt='collision of fastcoll.exe (CLI) and tweakPNG(GUI)' src=pics/pe.png width=500/>
<h3 id="mp4-and-others">MP4 and others</h3>
<p>This format's container is a sequence of <code>Length Type Value</code> chunks called Atoms. The length is a 32 bit big-endian and covers itself, the type and the value, so the minimum normal length is 8 (the type is a 4 ASCII characters string).</p>
<p>If the length is null, then the atom takes the rest of the file - such as <code>jp2c</code> atoms in JP2 files. If it's 1, then the Type is followed by a 64bit length, changing the atom to <code>Type Length Value</code>, making it compatible with other collisions like Shattered.</p>
<p>Some atoms contain other atoms: in this cases, they're called boxes. That's why this otherwise unnamed structure is called "atom/box".</p>
<p>This "atom/box" format used in MP4 is actually a derivate of Apple Quicktime, and is used by <a href="http://www.ftyps.com/">many other formats</a> (JP2, HEIF, F4V).</p>
<p>The first atom type is <em>usually</em> <code>ftyp</code>, which enables to differentiate the actual file format.</p>
<p>The format is quite permissive: just chain <code>free</code> atoms, abuse one's length with UniColl, then jump over the first payload.</p>
<p>For MP4 files, the only thing to add is to adjust the <code>stco</code> (Sample Table - Chunk Offsets) or <code>co64</code> (the 64 bit equivalent) tables, since they are absolute(!) offsets pointing to the <code>mdat</code> movie data - and they are actually enforced!</p>
<p>This gives a <a href="scripts/mp4.py">script</a> that instantly collides any arbitrary video - and as mentioned, it may work on other format than MP4.</p>
<p><img src="pics/mp4.png" alt="Nirvana - Smells like Teen Spirit / Weird Al Yankovik - Smells like Nirvana" /></p>
<p>Examples (videos by <a href="https://www.kidmograph.com/">KidMoGraph</a>):</p>
<ul>
<li><p>32b lengths (standard) <a href="examples/collision1.mp4">collision1.mp4</a> ⟷ <a href="examples/collision2.mp4">collision2.mp4</a></p>
<p><video width=300 controls> <source src="examples/collision1.mp4" type="video/mp4">🏭</video> ⟷ <video width=300 controls> <source src="examples/collision2.mp4" type="video/mp4">🛣️</video></p></li>
<li><p>64b lengths <a href="examples/collisionl1.mp4">collisionl1.mp4</a> ⟷ <a href="examples/collisionl2.mp4">collisionl2.mp4</a></p>
<p><video width=300 controls> <source src="examples/collisionl1.mp4" type="video/mp4">☀️</video> ⟷ <video width=300 controls> <source src="examples/collisionl2.mp4" type="video/mp4">🌙</video></p></li>
</ul>
<p><video><img src="pics/mp4-pocs.png" alt="how it should look (but your markdown doesn't render video tags)" /></video></p>
<p>Note that some viewers (OS X, Safari, FireFox) don't allow a file that starts with an Atom that is not <code>ftyp</code>. In this case, the prefix have to cover this, and it's not so generic, but besides it's the same strategy - only limited to a single file type.</p>
<h4 id="jpeg2000">JPEG2000</h4>
<p>JPEG2000 files usually start with the Atom/Box structure like MP4, then the last atom <code>jp2c</code> is typically until the end of the file (null length), then from this point on it follows the JFIF structure, like JPEG (starting with <code>FF 4F</code> as a segment marker).</p>
<p>The pure-JFIF form is also tolerated, in which case collision is like JPEG: Shattered-compatible, but with comments limited to 64Kb.</p>
<p>On the other hand, if you manipulate JPEG2000 files with the Atom/Box, you don't have this limitation.</p>
<p>As mentioned before, if you're trying to collide this structure and if there are more restriction - for example starting with a <code>free</code> atom is not tolerated by some format - then you can compute another UniColl prefix pairs specific to this format: JPEG2000 seems to <a href="https://github.com/uclouvain/openjpeg/blob/d2205ba2ee78faeea659263383446c4472b1f9df/src/bin/wx/OPJViewer/source/imagjpeg2000.cpp#L100-L111">enforce</a> a <code>'jP '</code> atom first before the usual <code>ftyp</code>, but besides, that's the only restriction: there's no need to relocate anything.</p>
<p>So the resulting <a href="scripts/jp2.py">script</a> is even simpler!</p>
<p><img src="pics/jp2.png" alt="Oded Goldreich / Neal Koblitz" /></p>
<p>Examples: <a href="examples/collision1.jp2">collision1.jp2</a> ⟷ <a href="examples/collision2.jp2">collision2.jp2</a></p>
<h3 id="pdf">PDF</h3>
<img alt='a PDF file' src=https://raw.githubusercontent.com/corkami/pics/master/binary/PDF.png width=300/>
<p><strong>about Shattered</strong></p>
<p>Shattered exploitation was not a PDF trick, but a JPG trick in a PDF.</p>
<p>It only enabled a PDF to contain a JPG-compressed object that could have two different contents. Both PDFs needed to be totally identical beside.</p>
<p>Note that the documents can be totally normal, and can just clip the collision JPG and display it in difference places, such as multi-page documents.</p>
<p>Examples: <a href="examples/shattered1.pdf">the Shattered paper, modified</a> ⟷ <a href="examples/shattered2.pdf">the Shattered paper, original</a></p>
<img alt='the Shattered paper using a colliding JPG in the authors' src=pics/shattereddoc1.png width=350/>
<img alt='the Shattered paper using a colliding JPG in a figure' src=pics/shattereddoc2.png width=350/>
<p><em>the Shattered paper using a colliding JPG in two places</em></p>
<p><strong>PDF collisions with MD5</strong></p>
<p>With MD5 (and other collision patterns), we can do PDF collisions at document level, with no restrictions at all on either file!</p>
<p>PDF has a very different structure from other file formats. It uses object numbers and references to define a tree. The whole document depends on the Root element.</p>
<!--
digraph {
rankdir=LR;
root -> "catalog#1"
"catalog#1" -> "pages#2"
"pages#2" -> "page#3"
"page#3" -> "pages#2"
"page#3" -> "content#4"
"content#4" -> "Hello World!"
}
-->
<p><img src="pics/pdf.svg" /></p>
<p>This (valid) PDF</p>
<pre class="text"><code>%PDF-1.
1 0 obj<</Pages 2 0 R>>endobj
2 0 obj<</Kids[3 0 R]/Count 1>>endobj
3 0 obj<</Parent 2 0 R>>endobj
trailer <</Root 1 0 R>>
</code></pre>
<p>is equivalent to:</p>
<pre class="text"><code>%PDF-1.
11 0 obj<</Pages 12 0 R>>endobj
12 0 obj<</Kids[13 0 R]/Count 1>>endobj
13 0 obj<</Parent 12 0 R>>endobj
trailer <</Root 11 0 R>>
</code></pre>
<p>Tricks:</p>
<ul>
<li>Storing unused objects in a PDF is tolerated.</li>
<li>Skipping any object numbers is also OK. There's even an official way to skip numbers in the <code>XREF</code> table.</li>
</ul>
<p>So storing two document trees in the same file is OK. We just need to make the root object refer to either root object of both documents.</p>
<p>So we just need to take two documents, renumber objects and references so that there is no overlap, craft a collision so that the element number referenced as Root object can be changed while keeping the same hash value, which is a perfect fit for UniColl with <code>N=1</code>, and adjust the <code>XREF</code> table accordingly.</p>
<!--
digraph {
rankdir=LR;
"trailer" -> "catalog#1" [color=green]
"catalog#1" -> "pages#2"
"pages#2" -> "page#3"
"page#3" -> "pages#2"
"page#3" -> "content#4"
"content#4" -> "Hello World!"
trailer -> "catalog#11" [<col>or=red, style=dashed]
"catalog#11" -> "pages#12"
"pages#12" -> "page#13"
"page#13" -> "pages#12"
"page#13" -> "content#14"
"content#14" -> "Bye World!";
}
-->
<p><img src="pics/pdfcollision.svg" /></p>
<p>This way, we can safely collide any pair of PDFs, no matter the page numbers, dimensions, images...</p>
<p><strong>comments</strong></p>
<p>PDF can store foreign data in two ways:</p>
<ul>
<li>as a line comment, in which the only forbidden characters are newline (<code>\r</code> and <code>\n</code>). This can be used inside a dictionary object, to modify for example an object reference, via UniColl. So this is a valid PDF object even if it contains binary collision blocks - just retry until you have no newline characters:
<pre><code>1 0 obj
<< /Type /Catalog /MD5_is /REALLY_dead_now__ /Pages 2 0 R
%¥┬•σe╕█╙X₧_~π▌╒εX∟■φe♦%τ8╞■[...]p╛╬ûFZ»‼v◘Åp↑╝%▓% ▼σφj╔◄dZ▀c²aU≤╨╩[├└─yNΓ5╔+▀╪yδ☻ß⌐░¼à(☺z₧
>>
endobj
</code></pre></li>
<li>as a stream object, in which case any data is possible, but since we're inside an object, we can't alter the whole PDF structure, so it requires a chosen-prefix collision to modify the structure outside the containing stream object.</li>
</ul>
<p><strong>colliding text</strong></p>
<p>The first case makes it possible to highlight the beauty of UniColl, a collision where differences are predictable, so you can write poetry over colliding data - thanks <a href="https://github.com/Jurph/word-decrementer">Jurph</a>!</p>
<p>Rather than modifying the structure of the document and fool parsers, we'll just use collision blocks directly to produce directly text, with alternate reading!</p>
<pre><code> V V
Now he hash MD5, Now he hath MD5,
No enemy cares! No enemy dares!
Only he gave Only he have
the shards. the shares.
Can’t be owned & Can’t be pwned &
his true gold, his true hold,
like One Frail, like One Grail,
sound as fold. sound as gold.
^ ^
</code></pre>
<p>Examples: <a href="examples/poeMD5_A.pdf">poeMD5 A</a> ⟷ <a href="examples/poeMD5_B.pdf">poeMD5 B</a></p>
<img alt='2 Poems colliding via UniColl' src=pics/poeMD5.png width=500/>
<p><em>A true cryptographic artistic creation :)</em></p>
<p>(Note I screwed up with Adobe compatibility, but that's my fault, not UniColl's)</p>
<p><strong>colliding document structure</strong></p>
<p>Whether you use UniColl as inline comment or chosen-prefix in a dummy stream object, the strategy is similar: shuffle objects numbers around, then make Root object point to different objects, so unlike Shattered, this means instant collision of any arbitrary pair of PDF, at document level.</p>
<p>A useful trick is that <a href="https://mupdf.com/docs/manual-mutool-clean.html"><code>mutool clean</code></a> output is reliably predictable, so it can be used to normalize PDFs as input, and fix your merged PDF while keeping the important parts of the file unmodified. MuTool doesn't discard bogus key/values - unless asked, and keep them in the same order, so using fake dictionary entries such as <code>/MD5_is /REALLY_dead_now__</code> is perfect to align things predictably without needing another kind of comments. However it won't keep comments in dictionaries (so no inline-comment trick)</p>
<p>An easy way to do the object-shuffling operation without hassle is just to merge both PDF files via <code>mutool merge</code> then split the <code>/Pages</code> object in two.</p>
<p>To make room for this object, just merge in front of the two documents a dummy PDF.</p>
<p>Optionally, create a fake reference to the dangling array to prevent garbage collection from deleting the second set of pages.</p>
<p><strong>Example</strong>: with this <a href="scripts/pdf.py">script</a>, it takes <a href="examples/pdf.log">less than a second</a> to collide the two public PDF papers like Spectre and Meltdown:</p>
<p>Examples: <a href="examples/collision1.pdf">spectre.pdf</a> ⟷ <a href="examples/collision2.pdf">meltdown.pdf</a></p>
<img alt='identical prefix PDF collisions' src=pics/pdf.png width=600/>
<p>Possible extension: chain UniColl blocks to also keep pairs of the various <a href="https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=81">non-critical objects</a> that can be referenced in the Root object - such as <code>Outlines</code>, <code>Names</code>, <code>AcroForm</code> and Additional Actions (<code>AA</code>) - in the original source files.</p>
<p><strong>in PDFLaTeX</strong></p>
<p>The previous techniques work with just a pair of PDF files, but it's also possible to do it directly from TeX sources via <a href="http://texdoc.net/texmf-dist/doc/pdftex/manual/pdftex-a.pdf">specific PDFTeX operators</a>.</p>
<p>You can define objects directly - including dummy key and values for alignments - and define empty objects to reserve some object slots by including this at the very start of your TeX sources:</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode latex"><code class="sourceCode latex"><a class="sourceLine" id="cb18-1" data-line-number="1"><span class="co">% set PDF version low to prevent stream XREF</span></a>
<a class="sourceLine" id="cb18-2" data-line-number="2"><span class="fu">\pdfminorversion</span>=3</a>
<a class="sourceLine" id="cb18-3" data-line-number="3"></a>
<a class="sourceLine" id="cb18-4" data-line-number="4"><span class="fu">\begingroup</span></a>
<a class="sourceLine" id="cb18-5" data-line-number="5"></a>
<a class="sourceLine" id="cb18-6" data-line-number="6"> <span class="co">% disable compression to keep alignments</span></a>
<a class="sourceLine" id="cb18-7" data-line-number="7"> <span class="fu">\pdfcompresslevel</span>=0<span class="fu">\relax</span></a>
<a class="sourceLine" id="cb18-8" data-line-number="8"></a>
<a class="sourceLine" id="cb18-9" data-line-number="9"> <span class="fu">\immediate</span></a>
<a class="sourceLine" id="cb18-10" data-line-number="10"> <span class="fu">\pdfobj</span>{<<</a>
<a class="sourceLine" id="cb18-11" data-line-number="11"> /Type /Catalog</a>
<a class="sourceLine" id="cb18-12" data-line-number="12"></a>
<a class="sourceLine" id="cb18-13" data-line-number="13"> <span class="co">% cool alignment padding</span></a>
<a class="sourceLine" id="cb18-14" data-line-number="14"> /MD5_is /REALLY_dead_now__</a>
<a class="sourceLine" id="cb18-15" data-line-number="15"></a>
<a class="sourceLine" id="cb18-16" data-line-number="16"> <span class="co">% the first reference number should be on offset 0x49,</span></a>
<a class="sourceLine" id="cb18-17" data-line-number="17"> <span class="co">% so the '2' object number will be changed to '3' by UniColl</span></a>
<a class="sourceLine" id="cb18-18" data-line-number="18"> /Pages 2 0 R</a>
<a class="sourceLine" id="cb18-19" data-line-number="19"></a>
<a class="sourceLine" id="cb18-20" data-line-number="20"> <span class="co">% now padding so that the collision blocks (ends at 0xC0) are covered</span></a>
<a class="sourceLine" id="cb18-21" data-line-number="21"> /0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF</a>
<a class="sourceLine" id="cb18-22" data-line-number="22"> <span class="co">% with an extra character to be replaced by a return char</span></a>
<a class="sourceLine" id="cb18-23" data-line-number="23"> /0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0</a>
<a class="sourceLine" id="cb18-24" data-line-number="24"> >>}</a>
<a class="sourceLine" id="cb18-25" data-line-number="25"></a>
<a class="sourceLine" id="cb18-26" data-line-number="26"> <span class="co">% the original catalog of the shifted doc</span></a>
<a class="sourceLine" id="cb18-27" data-line-number="27"> <span class="fu">\immediate\pdfobj</span>{<</Type/Pages/Count 1/Kids[8 0 R]>>}</a>
<a class="sourceLine" id="cb18-28" data-line-number="28"></a>
<a class="sourceLine" id="cb18-29" data-line-number="29"> <span class="co">% the original catalog of the host doc</span></a>
<a class="sourceLine" id="cb18-30" data-line-number="30"> <span class="fu">\immediate\pdfobj</span>{<</Type/Pages/Count 1/Kids[33 0 R]>>}</a>
<a class="sourceLine" id="cb18-31" data-line-number="31"></a>
<a class="sourceLine" id="cb18-32" data-line-number="32"> <span class="co">% now we need to reserve PDF Objects so that there is no overlap</span></a>
<a class="sourceLine" id="cb18-33" data-line-number="33"> <span class="fu">\newcount\objcount</span></a>
<a class="sourceLine" id="cb18-34" data-line-number="34"></a>
<a class="sourceLine" id="cb18-35" data-line-number="35"> <span class="co">% the host size (+3 for spare object slots) - 1</span></a>
<a class="sourceLine" id="cb18-36" data-line-number="36"> <span class="co">% putting a higher margin will just work, and XREF can have huge gaps</span></a>
<a class="sourceLine" id="cb18-37" data-line-number="37"> <span class="fu">\objcount</span>=25</a>
<a class="sourceLine" id="cb18-38" data-line-number="38"> <span class="fu">\loop</span></a>
<a class="sourceLine" id="cb18-39" data-line-number="39"> <span class="fu">\message</span>{<span class="fu">\the\objcount</span>}</a>
<a class="sourceLine" id="cb18-40" data-line-number="40"> <span class="fu">\advance</span> <span class="fu">\objcount</span> -1</a>
<a class="sourceLine" id="cb18-41" data-line-number="41"></a>
<a class="sourceLine" id="cb18-42" data-line-number="42"> <span class="fu">\immediate\pdfobj</span>{<<>>} <span class="co">% just an empty object</span></a>
<a class="sourceLine" id="cb18-43" data-line-number="43"></a>
<a class="sourceLine" id="cb18-44" data-line-number="44"> <span class="fu">\ifnum</span> <span class="fu">\objcount</span>>0</a>
<a class="sourceLine" id="cb18-45" data-line-number="45"> <span class="fu">\repeat</span></a>
<a class="sourceLine" id="cb18-46" data-line-number="46"></a>
<a class="sourceLine" id="cb18-47" data-line-number="47"><span class="fu">\endgroup</span></a></code></pre></div>
<p>Don't forget to normalize PDFLaTeX output - with <code>mutool</code> for example - if needed: PDFLaTeX is hard to get reproducible builds across distributions - you may even want to hook the time on execution to get the exact hash if required.</p>
<h4 id="jpg-in-pdf">JPG in PDF</h4>
<p>You could expect JPG to be only images, but in a PDF and some PDF readers (non browsers, such as Evince and Adobe Reader), it can be used as page content just like any other embedded object, that is embedded in a JPEG image.</p>
<p>To store the JPEG data losslessly, store it as grayscale 100%, then either use a picture of single row/column, or repeat the data line 8 times (since JPEG blocks are 8x8), and your data is stored losslessly and referenced by the PDF pages.</p>
<p>Examples of SHA-1 colliding two PDFs via JPEG page data (a grayscale picture rendering colors) as vector page content:</p>
<p><a href="examples/jpgpage1.pdf">If</a> ⟷ <a href="examples/jpgpage2.pdf">Shattered - the movie</a></p>
<img alt='2 SHA-1 colliding PDFs with image data stored as JPG' src=pics/jpgpage.png width=700/>
<p><em>2 SHA-1 colliding PDFs with image data stored as JPG</em></p>
<p>It's possible to reference the colliding JPG twice: as a page content, losslessly, which also refers to itself as a lossy image to be displayed. Again, the image to be displayed is grayscale, but the page content can render some colors via PDF operators.</p>
<p>The top of the image shows the page content repeated 8 times.</p>
<p>Examples of SHA-1 colliding two PDFs via JPEG used as page data and picture to be displayed:</p>