Opt. ARM64 SIMD decompr. for in-order pipelines Decompression speedup relative to libjpeg-turbo 1.4.2 (ISLOW IDCT): 48-core ThunderX (RunAbove ARM Cloud), Linux, 64-bit: 60-113% (avg. 86%) Cortex-A53 (Nexus 5X), Android, 64-bit: 6.8-27% (avg. 14%) Cortex-A57 (Nexus 5X), Android, 64-bit: 2.0-14% (avg. 6.8%) Decompression speedup relative to libjpeg-turbo 1.4.2 (IFAST IDCT): 48-core ThunderX (RunAbove ARM Cloud), Linux, 64-bit: 51-98% (avg. 75%) Minimal speedup (1-5%) observed on iPhone 5S (Cortex-A7) NOTE: This commit avoids the st3 instruction for non-Android and non-Apple builds, which may cause a performance regression against libjpeg-turbo 1.4.x on ARM64 systems that are running plain Linux. Since ThunderX is the only platform known to suffer from slow ld3 and st3 instructions, it is probably better to check for the CPU type at run time and disable ld3/st3 only if ThunderX is detected. This commit also enables the use of ld3 on Android platforms, which should be a safe bet, at least for now. This speeds up compression on the afore-mentioned Nexus Cortex-A53 by 5.5-19% (avg. 12%) and on the Nexus Cortex-A57 by 1.2-14% (avg. 6.3%), relative to the previous commits. This commit also removes unnecessary macros. Refer to #52 for discussion. Closes #52. Based on: https://github.com/mayeut/libjpeg-turbo/commit/6bad905034e6e73b33ebf07a74a6b72f58319f62 https://github.com/mayeut/libjpeg-turbo/commit/488dd7bf1726e2f6af6e9294ccf77b729fec1f20 https://github.com/mayeut/libjpeg-turbo/commit/4f4d057c1fb31d643536e6effb46a5946e15c465 https://github.com/mayeut/libjpeg-turbo/commit/d3198afc43450989a4fc63d2dcbe3272c8a0a3c1