simd/arm/aarch64/jchuff-neon.c


Log

Author  Commit  Date  Message

DRC  98bc3eeb  2022-02-24T23:09:58
Neon/AArch64: Fix/suppress UBSan warnings

- Suppress a UBSan warning regarding storing a 64-bit value to a non-64-bit-aligned address. That behavior is technically undefined per the C spec but is supported in the context of the AArch64 architecture and compilers.
- Explicitly promote block_diff[i] to unsigned int prior to left shifting it, in order to avoid a UBSan warning. This warning also described behavior that is technically undefined per the C spec but is supported in the context of the AArch64 architecture and compilers. Changing the type cast order eliminated the warning without changing the generated assembly code.

Closes #582
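
For illustration, a minimal sketch of the cast-order change described above; the identifiers and types are assumed, not taken verbatim from jchuff-neon.c:

    #include <stdint.h>

    /* Promote-before-shift pattern: casting to unsigned int before the
     * shift performs the shift on an unsigned operand, which avoids the
     * UBSan shift diagnostic that the shift-then-cast ordering triggered.
     * The names below are illustrative only. */
    static inline unsigned int shift_coefficient(int16_t block_diff_i,
                                                 unsigned int nbits)
    {
      /* Previously (flagged by UBSan): (unsigned int)(block_diff_i << nbits)
       * shifts the value while it is still a signed int.  Casting first
       * shifts an unsigned int instead: */
      return (unsigned int)block_diff_i << nbits;
    }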

Jonathan Wright  147548c0  2021-09-06T11:31:37
Neon/AArch64: Accelerate Huffman encoding

- Make better use of 128-bit vector registers, thus reducing the number of Neon instructions required to construct the AC coefficient bitmap.
- Refactor the Neon computations of 'nbits' and 'diff' to use shorter and higher-throughput instruction sequences.

DRC's notes:
This commit partially integrates #570. Arm reported a 1-4% speedup on Cortex-A55 and Neoverse-N1 cores when using recent compilers but little or no speedup with Clang 10. I observed no speedup with Clang 10 on my Cortex-A53 and Cortex-A72 cores. Thus, referring to #582, the primary purpose of this commit is to fix UBSan warnings regarding the shift operations previously located at Line 253:
https://github.com/libjpeg-turbo/libjpeg-turbo/blob/d640a457305164417a60f30c6457d316f0b44a9d/simd/arm/aarch64/jchuff-neon.c#L253
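
As a point of reference, a scalar sketch of what 'nbits' and 'diff' denote in JPEG Huffman coding; this is not the vectorized sequence from the commit, and the helper name is illustrative:

    #include <stdint.h>

    /* Scalar reference for the per-coefficient quantities that the Neon
     * code computes several coefficients at a time. */
    static void huff_magnitude(int16_t coef, unsigned int *nbits,
                               unsigned int *diff)
    {
      int temp = coef, temp2 = coef;

      if (temp < 0) {
        temp = -temp;   /* magnitude */
        temp2--;        /* negative values are emitted as value - 1 */
      }
      /* bit length of the magnitude (0 for a zero coefficient) */
      *nbits = temp ? 32 - (unsigned int)__builtin_clz((unsigned int)temp) : 0;
      /* only the low nbits of temp2 are written to the bitstream */
      *diff = (unsigned int)temp2 & ((1U << *nbits) - 1U);
    }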

Richard Townsend  74e6ea45  2021-01-05T20:23:11
Neon: Fix Huffman enc. error w/Visual Studio+Clang

The GNU builtin function __builtin_clzl() accepts an unsigned long argument, which is 8 bytes wide on LP64 systems (most Un*x systems, including Mac) but 4 bytes wide on LLP64 systems (Windows.) This caused the Neon intrinsics implementation of Huffman encoding to produce mathematically incorrect results when compiled using Visual Studio with Clang. This commit changes all invocations of __builtin_clzl() in the Neon SIMD extensions to __builtin_clzll(), which accepts an unsigned long long argument that is guaranteed to be 8 bytes wide on all systems.

Fixes #480
Closes #490
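
A minimal sketch of the width mismatch described above; the helper name is illustrative only:

    #include <stdint.h>

    /* unsigned long is 64 bits on LP64 (Linux/macOS) but 32 bits on LLP64
     * (Windows), so __builtin_clzl() would operate on a truncated argument
     * there, while __builtin_clzll() always takes a 64-bit unsigned long
     * long.  Caller must ensure bits != 0. */
    static inline int leading_zeros64(uint64_t bits)
    {
      /* return __builtin_clzl(bits);   -- wrong width under LLP64 */
      return __builtin_clzll(bits);     /* correct on both data models */
    }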

Jonathan Wright  eb14189c  2020-11-17T12:48:49
Fix Neon SIMD build issues with Visual Studio

- Use the _M_ARM and _M_ARM64 macros provided by Visual Studio for compile-time detection of Arm builds, since __arm__ and __aarch64__ are only present in GNU-compatible compilers.
- Neon/intrinsics: Use the _CountLeadingZeros() and _CountLeadingZeros64() intrinsics provided by Visual Studio, since __builtin_clz() and __builtin_clzl() are only present in GNU-compatible compilers.
- Neon/intrinsics: Since Visual Studio does not support static vector initialization, replace static initialization of Neon vectors with the appropriate intrinsics. Compared to the static initialization approach, this produces identical assembly code with both GCC and Clang.
- Neon/intrinsics: Since Visual Studio does not support inline assembly code, provide alternative code paths for Visual Studio whenever inline assembly is used.
- Build: Set FLOATTEST appropriately for AArch64 Visual Studio builds (Visual Studio does not emit fused multiply-add [FMA] instructions by default for such builds.)
- Neon/intrinsics: Move temporary buffer allocation outside of nested loops. Since Visual Studio configures Arm builds with a relatively small amount of stack memory, attempting to allocate those buffers within the inner loops caused a stack overflow.

Closes #461
Closes #475
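
A sketch of the kind of compiler-detection shim the first two bullets imply; the NEON_CLZ64 macro name is assumed for illustration and is not necessarily the identifier used in the source:

    /* Detect an AArch64 build under both compiler families, then map a
     * common leading-zero-count macro onto whichever intrinsic/builtin
     * the compiler provides. */
    #if defined(__aarch64__) || defined(_M_ARM64)

    #if defined(_MSC_VER) && !defined(__clang__)
    #include <intrin.h>
    #define NEON_CLZ64(x)  _CountLeadingZeros64(x)   /* Visual Studio intrinsic */
    #else
    #define NEON_CLZ64(x)  __builtin_clzll(x)        /* GNU-compatible builtin */
    #endif

    #endif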

DRC  33859880  2020-11-13T12:12:47
Neon: Auto-detect compiler intrinsics completeness

This allows the Neon intrinsics code to be built successfully (albeit likely with reduced run-time performance) with Xcode 5.0-6.2 (iOS/AArch64) and Android NDK < r19 (AArch32). Note that Xcode 5.0-6.2 will not build the Armv8 GAS code without gas-preprocessor.pl, and no version of Xcode will build the Armv7 GAS code without gas-preprocessor.pl, so we always use the full Neon intrinsics implementation by default with macOS and iOS builds.

Auto-detecting the completeness of the compiler's set of Neon intrinsics also allows us to more intelligently set the default value of NEON_INTRINSICS, based on the values of HAVE_VLD1*. This is a reasonable, albeit imperfect, proxy for whether a compiler has a full and optimal set of Neon intrinsics.

Specific notes:
- 64-bit RGB-to-YCbCr color conversion does not use any of the intrinsics in question, regresses with GCC
- 64-bit accurate integer forward DCT uses vld1_s16_x3(), regresses with GCC
- 64-bit Huffman encoding uses vld1q_u8_x4(), regresses with GCC
- 64-bit YCbCr-to-RGB color conversion does not use any of the intrinsics in question, regresses with GCC
- 64-bit accurate integer inverse DCT uses vld1_s16_x3(), regresses with GCC
- 64-bit 4x4 inverse DCT uses vld1_s16_x3(). I did not test this algorithm in isolation, so it may in fact regress with GCC, but the regression may be hidden by the speedup from the new SIMD-accelerated upsampling algorithms.
- 32-bit RGB-to-YCbCr color conversion uses vld1_u16_x2(), regresses with GCC
- 32-bit accurate integer forward DCT uses vld1_s16_x3(), regression irrelevant because there was no previous implementation
- 32-bit accurate integer inverse DCT uses vld1_s16_x3(), regresses with GCC
- 32-bit fast integer inverse DCT does not use any of the intrinsics in question, regresses with GCC
- 32-bit 4x4 inverse DCT uses vld1_s16_x3(). I did not test this algorithm in isolation, so it may in fact regress with GCC, but the regression may be hidden by the speedup from the new SIMD-accelerated upsampling algorithms.

Presumably when GCC includes a full and optimal set of Neon intrinsics, the HAVE_VLD1* tests will pass, and the full Neon intrinsics implementation will be enabled automatically.
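
The HAVE_VLD1* checks presumably compile small probes along these lines; this is a sketch only, and the macro names in the comments follow the HAVE_VLD1* pattern rather than being copied from the build system:

    #include <arm_neon.h>

    /* A probe like this fails to compile with toolchains that lack the
     * multi-register load intrinsics used by the 64-bit Huffman encoder
     * and the accurate integer DCTs. */
    int main(void)
    {
      int16_t s16_buf[12] = { 0 };
      uint8_t u8_buf[64] = { 0 };

      int16x4x3_t a = vld1_s16_x3(s16_buf);   /* e.g. HAVE_VLD1_S16_X3 */
      uint8x16x4_t b = vld1q_u8_x4(u8_buf);   /* e.g. HAVE_VLD1Q_U8_X4 */

      return vget_lane_s16(a.val[0], 0) + vgetq_lane_u8(b.val[0], 0);
    }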

Jonathan Wright  f3c3f01d  2018-09-24T04:35:20
Neon: Intrinsics impl. of Huffman encoding

The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.