RGBA 32-bit and initial aarch64 SIMD
I corrected some errors in the 16 different permutations of subsampling and scaling options. I also added experimental code that optimizes the color conversion on aarch64 (Arm NEON) for 4:2:0 subsampling with full-size output. On my MacBook Air M1, it roughly doubles the decode speed: a 126K 938x698 file now decodes in just 8 milliseconds (previously 15 milliseconds). I can optimize this code for x86 and Arm desktop usage, but I need to evaluate the cost/benefit of investing the time. I believe my code can beat libjpeg-turbo in certain situations (if I fully deploy SIMD optimizations). Please let me know if you need this code optimized for your desktop application.
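For readers unfamiliar with the color conversion step mentioned above, here is a minimal scalar sketch of 4:2:0 YCbCr to 32-bit RGBA conversion at full-size output. It is not the code in this release. The function names (`fix_to_u8`, `ycc_to_rgba`, `ycc420_to_rgba`), the planar buffer layout, and the 16.16 fixed-point JFIF coefficients are all assumptions made for illustration. A NEON version would perform the same per-pixel arithmetic on many pixels per instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a 16.16 fixed-point value to an 8-bit channel, clamping to 0..255.
 * Negative values are handled before shifting to avoid shifting a negative int. */
static uint8_t fix_to_u8(int32_t v)
{
    if (v < 0) return 0;
    v >>= 16;
    return (v > 255) ? 255 : (uint8_t)v;
}

/* Convert one YCbCr sample to RGBA (hypothetical helper).
 * Coefficients are the usual JFIF values scaled by 65536:
 * 1.402, 0.344136, 0.714136 and 1.772. */
static void ycc_to_rgba(uint8_t y, uint8_t cb, uint8_t cr, uint8_t *out)
{
    int32_t d = (int32_t)cb - 128;              /* centered Cb */
    int32_t e = (int32_t)cr - 128;              /* centered Cr */
    int32_t y16 = ((int32_t)y << 16) + 32768;   /* luma plus 0.5 for rounding */

    out[0] = fix_to_u8(y16 + 91881 * e);               /* R */
    out[1] = fix_to_u8(y16 - 22554 * d - 46802 * e);   /* G */
    out[2] = fix_to_u8(y16 + 116130 * d);              /* B */
    out[3] = 255;                                      /* opaque alpha */
}

/* Convert planar 4:2:0 data to packed RGBA at full size (hypothetical layout:
 * Y plane is width x height, Cb/Cr planes are ceil(width/2) x ceil(height/2)).
 * Each chroma sample is shared by a 2x2 block of luma samples. */
void ycc420_to_rgba(const uint8_t *y, const uint8_t *cb, const uint8_t *cr,
                    int width, int height, uint8_t *rgba)
{
    assert(y && cb && cr && rgba && width > 0 && height > 0);

    int cw = (width + 1) / 2;  /* chroma plane width */
    for (int row = 0; row < height; row++) {
        const uint8_t *yrow  = y  + (size_t)row * (size_t)width;
        const uint8_t *cbrow = cb + (size_t)(row / 2) * (size_t)cw;
        const uint8_t *crrow = cr + (size_t)(row / 2) * (size_t)cw;
        uint8_t *dst = rgba + (size_t)row * (size_t)width * 4;

        for (int col = 0; col < width; col++)
            ycc_to_rgba(yrow[col], cbrow[col / 2], crrow[col / 2], dst + col * 4);
    }
}
```

The inner loop is the natural target for SIMD: every pixel runs the same multiply, add and clamp sequence with no dependencies between pixels, so a vector unit can process a group of them together.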