kmx git

Commit	Date	Message
eb14189c	2020-11-17T12:48:49	Fix Neon SIMD build issues with Visual Studio - Use the _M_ARM and _M_ARM64 macros provided by Visual Studio for compile-time detection of Arm builds, since __arm__ and __aarch64__ are only present in GNU-compatible compilers. - Neon/intrinsics: Use the _CountLeadingZeros() and _CountLeadingZeros64() intrinsics provided by Visual Studio, since __builtin_clz() and __builtin_clzl() are only present in GNU-compatible compilers. - Neon/intrinsics: Since Visual Studio does not support static vector initialization, replace static initialization of Neon vectors with the appropriate intrinsics. Compared to the static initialization approach, this produces identical assembly code with both GCC and Clang. - Neon/intrinsics: Since Visual Studio does not support inline assembly code, provide alternative code paths for Visual Studio whenever inline assembly is used. - Build: Set FLOATTEST appropriately for AArch64 Visual Studio builds (Visual Studio does not emit fused multiply-add [FMA] instructions by default for such builds.) - Neon/intrinsics: Move temporary buffer allocation outside of nested loops. Since Visual Studio configures Arm builds with a relatively small amount of stack memory, attempting to allocate those buffers within the inner loops caused a stack overflow. Closes #461 Closes #475
33859880	2020-11-13T12:12:47	Neon: Auto-detect compiler intrinsics completeness This allows the Neon intrinsics code to be built successfully (albeit likely with reduced run-time performance) with Xcode 5.0-6.2 (iOS/AArch64) and Android NDK < r19 (AArch32). Note that Xcode 5.0-6.2 will not build the Armv8 GAS code without gas-preprocessor.pl, and no version of Xcode will build the Armv7 GAS code without gas-preprocessor.pl, so we always use the full Neon intrinsics implementation by default with macOS and iOS builds. Auto-detecting the completeness of the compiler's set of Neon intrinsics also allows us to more intelligently set the default value of NEON_INTRINSICS, based on the values of HAVE_VLD1. This is a reasonable, albeit imperfect, proxy for whether a compiler has a full and optimal set of Neon intrinsics. Specific notes: - 64-bit RGB-to-YCbCr color conversion does not use any of the intrinsics in question, regresses with GCC - 64-bit accurate integer forward DCT uses vld1_s16_x3(), regresses with GCC - 64-bit Huffman encoding uses vld1q_u8_x4(), regresses with GCC - 64-bit YCbCr-to-RGB color conversion does not use any of the intrinsics in question, regresses with GCC - 64-bit accurate integer inverse DCT uses vld1_s16_x3(), regresses with GCC - 64-bit 4x4 inverse DCT uses vld1_s16_x3(). I did not test this algorithm in isolation, so it may in fact regress with GCC, but the regression may be hidden by the speedup from the new SIMD-accelerated upsampling algorithms. - 32-bit RGB-to-YCbCr color conversion: uses vld1_u16_x2(), regresses with GCC - 32-bit accurate integer forward DCT uses vld1_s16_x3(), regression irrelevant because there was no previous implementation - 32-bit accurate integer inverse DCT uses vld1_s16_x3(), regresses with GCC - 32-bit fast integer inverse DCT does not use any of the intrinsics in question, regresses with GCC - 32-bit 4x4 inverse DCT uses vld1_s16_x3(). I did not test this algorithm in isolation, so it may in fact regress with GCC, but the regression may be hidden by the speedup from the new SIMD-accelerated upsampling algorithms. Presumably when GCC includes a full and optimal set of Neon intrinsics, the HAVE_VLD1 tests will pass, and the full Neon intrinsics implementation will be enabled automatically.
f3c3f01d	2018-09-24T04:35:20	Neon: Intrinsics impl. of Huffman encoding The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.
d0004de5	2018-08-22T13:38:37	Neon: Intrinsics impl. of accurate int forward DCT The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. There was no previous AArch32 GAS implementation.
3d84668d	2018-08-23T14:22:23	Neon: Intrinsics impl. of fast integer forward DCT The previous AArch32 and AArch64 GAS implementations have been removed, since the intrinsics implementation provides the same or better performance.
951d3677	2018-08-24T18:04:21	Neon: Intrinsics impl. of int sample conv./quant. The previous AArch32 and AArch64 GAS implementations have been removed, since the intrinsics implementation provides the same or better performance.
366168aa	2018-08-06T15:14:34	Neon: Intrinsics impl. of h2v1 & h2v2 downsampling The previous AArch64 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance. There was no previous AArch32 GAS implementation.
f73b1dbc	2018-08-09T15:08:21	Neon: Intrinsics implementation of RGB->Grayscale There was no previous GAS implementation.
141f26ff	2018-09-18T18:28:31	Neon: Intrinsics impl. of 2x2 and 4x4 scaled IDCTs The previous AArch32 and AArch64 GAS implementations have been removed, since the intrinsics implementations provide the same or better performance.
4574f01f	2018-06-28T16:17:36	Neon: Intrinsics impl. of h2v1 & h2v2 plain upsamp There was no previous GAS implementation. NOTE: This doesn't produce much of a speedup when using -O3, because -O3 already enables Neon autovectorization, which works well for the scalar C implementation of plain upsampling. However, the Neon SIMD implementation will benefit other optimization levels.
ba52a3de	2018-07-19T18:46:24	Neon: Intrinsics impl of h2v1 & h2v2 merged upsamp There was no previous GAS implementation. This commit also reverts 40557b23015d2f8b576420231b8dd1f39f2ceed8 and 7723d7f7d0aa40349d5bdd1fbe4f8631fd5a2b57. 7723d7f7d0aa40349d5bdd1fbe4f8631fd5a2b57 was only necessary because there was no Neon implementation of merged upsampling/color conversion, and 40557b23015d2f8b576420231b8dd1f39f2ceed8 was only necessary because of 7723d7f7d0aa40349d5bdd1fbe4f8631fd5a2b57.
240ba417	2020-01-07T16:40:32	Neon: Intrinsics impl. of prog. Huffman encoding The previous AArch64 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance. There was no previous AArch32 GAS implementation.
ed581cd9	2019-06-12T18:16:53	Neon: Intrinsics impl. of accurate int inverse DCT The previous AArch32 and AArch64 GAS implementations are retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable.
2c6b68e2	2018-09-25T18:20:25	Neon: Intrinsics impl. of fast integer Inverse DCT The previous AArch32 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch64 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.
2acfb93c	2019-05-08T15:43:26	Neon: Intrinsics impl. of h1v2 fancy upsamling There was no previous GAS implementation.
97530777	2018-06-15T11:13:52	Neon: Intrinsics impl. of h2v1 & h2v2 fancy upsamp The previous AArch32 GAS implementation of h2v1 fancy upsampling has been removed, since the intrinsics implementation provides the same or better performance. There was no previous GAS implementation of h2v2 fancy upsampling, and there was no previous AArch64 GAS implementation of h2v1 fancy upsampling.
5dbd3932	2018-08-01T16:52:31	Neon: Intrinsics implementation of YCbCr->RGB565 The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.
0f35cd68	2018-07-16T10:25:14	Neon: Intrinsics implementation of YCbCr->RGB The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.
4f2216b4	2019-11-26T18:14:33	Neon: Intrinsics implementation of RGB->YCbCr The previous AArch32 and AArch64 GAS implementations are retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using a new NEON_INTRINSICS CMake variable.

eb14189c

2020-11-17T12:48:49

Fix Neon SIMD build issues with Visual Studio - Use the _M_ARM and _M_ARM64 macros provided by Visual Studio for compile-time detection of Arm builds, since __arm__ and __aarch64__ are only present in GNU-compatible compilers. - Neon/intrinsics: Use the _CountLeadingZeros() and _CountLeadingZeros64() intrinsics provided by Visual Studio, since __builtin_clz() and __builtin_clzl() are only present in GNU-compatible compilers. - Neon/intrinsics: Since Visual Studio does not support static vector initialization, replace static initialization of Neon vectors with the appropriate intrinsics. Compared to the static initialization approach, this produces identical assembly code with both GCC and Clang. - Neon/intrinsics: Since Visual Studio does not support inline assembly code, provide alternative code paths for Visual Studio whenever inline assembly is used. - Build: Set FLOATTEST appropriately for AArch64 Visual Studio builds (Visual Studio does not emit fused multiply-add [FMA] instructions by default for such builds.) - Neon/intrinsics: Move temporary buffer allocation outside of nested loops. Since Visual Studio configures Arm builds with a relatively small amount of stack memory, attempting to allocate those buffers within the inner loops caused a stack overflow. Closes #461 Closes #475

33859880

2020-11-13T12:12:47

Neon: Auto-detect compiler intrinsics completeness This allows the Neon intrinsics code to be built successfully (albeit likely with reduced run-time performance) with Xcode 5.0-6.2 (iOS/AArch64) and Android NDK < r19 (AArch32). Note that Xcode 5.0-6.2 will not build the Armv8 GAS code without gas-preprocessor.pl, and no version of Xcode will build the Armv7 GAS code without gas-preprocessor.pl, so we always use the full Neon intrinsics implementation by default with macOS and iOS builds. Auto-detecting the completeness of the compiler's set of Neon intrinsics also allows us to more intelligently set the default value of NEON_INTRINSICS, based on the values of HAVE_VLD1*. This is a reasonable, albeit imperfect, proxy for whether a compiler has a full and optimal set of Neon intrinsics. Specific notes: - 64-bit RGB-to-YCbCr color conversion does not use any of the intrinsics in question, regresses with GCC - 64-bit accurate integer forward DCT uses vld1_s16_x3(), regresses with GCC - 64-bit Huffman encoding uses vld1q_u8_x4(), regresses with GCC - 64-bit YCbCr-to-RGB color conversion does not use any of the intrinsics in question, regresses with GCC - 64-bit accurate integer inverse DCT uses vld1_s16_x3(), regresses with GCC - 64-bit 4x4 inverse DCT uses vld1_s16_x3(). I did not test this algorithm in isolation, so it may in fact regress with GCC, but the regression may be hidden by the speedup from the new SIMD-accelerated upsampling algorithms. - 32-bit RGB-to-YCbCr color conversion: uses vld1_u16_x2(), regresses with GCC - 32-bit accurate integer forward DCT uses vld1_s16_x3(), regression irrelevant because there was no previous implementation - 32-bit accurate integer inverse DCT uses vld1_s16_x3(), regresses with GCC - 32-bit fast integer inverse DCT does not use any of the intrinsics in question, regresses with GCC - 32-bit 4x4 inverse DCT uses vld1_s16_x3(). I did not test this algorithm in isolation, so it may in fact regress with GCC, but the regression may be hidden by the speedup from the new SIMD-accelerated upsampling algorithms. Presumably when GCC includes a full and optimal set of Neon intrinsics, the HAVE_VLD1* tests will pass, and the full Neon intrinsics implementation will be enabled automatically.

f3c3f01d

2018-09-24T04:35:20

Neon: Intrinsics impl. of Huffman encoding The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.

d0004de5

2018-08-22T13:38:37

Neon: Intrinsics impl. of accurate int forward DCT The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. There was no previous AArch32 GAS implementation.

3d84668d

2018-08-23T14:22:23

Neon: Intrinsics impl. of fast integer forward DCT The previous AArch32 and AArch64 GAS implementations have been removed, since the intrinsics implementation provides the same or better performance.

951d3677

2018-08-24T18:04:21

Neon: Intrinsics impl. of int sample conv./quant. The previous AArch32 and AArch64 GAS implementations have been removed, since the intrinsics implementation provides the same or better performance.

366168aa

2018-08-06T15:14:34

Neon: Intrinsics impl. of h2v1 & h2v2 downsampling The previous AArch64 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance. There was no previous AArch32 GAS implementation.

f73b1dbc

2018-08-09T15:08:21

Neon: Intrinsics implementation of RGB->Grayscale There was no previous GAS implementation.

141f26ff

2018-09-18T18:28:31

Neon: Intrinsics impl. of 2x2 and 4x4 scaled IDCTs The previous AArch32 and AArch64 GAS implementations have been removed, since the intrinsics implementations provide the same or better performance.

4574f01f

2018-06-28T16:17:36

Neon: Intrinsics impl. of h2v1 & h2v2 plain upsamp There was no previous GAS implementation. NOTE: This doesn't produce much of a speedup when using -O3, because -O3 already enables Neon autovectorization, which works well for the scalar C implementation of plain upsampling. However, the Neon SIMD implementation will benefit other optimization levels.

ba52a3de

2018-07-19T18:46:24

Neon: Intrinsics impl of h2v1 & h2v2 merged upsamp There was no previous GAS implementation. This commit also reverts 40557b23015d2f8b576420231b8dd1f39f2ceed8 and 7723d7f7d0aa40349d5bdd1fbe4f8631fd5a2b57. 7723d7f7d0aa40349d5bdd1fbe4f8631fd5a2b57 was only necessary because there was no Neon implementation of merged upsampling/color conversion, and 40557b23015d2f8b576420231b8dd1f39f2ceed8 was only necessary because of 7723d7f7d0aa40349d5bdd1fbe4f8631fd5a2b57.

240ba417

2020-01-07T16:40:32

Neon: Intrinsics impl. of prog. Huffman encoding The previous AArch64 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance. There was no previous AArch32 GAS implementation.

ed581cd9

2019-06-12T18:16:53

Neon: Intrinsics impl. of accurate int inverse DCT The previous AArch32 and AArch64 GAS implementations are retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable.

2c6b68e2

2018-09-25T18:20:25

Neon: Intrinsics impl. of fast integer Inverse DCT The previous AArch32 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch64 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.

2acfb93c

2019-05-08T15:43:26

Neon: Intrinsics impl. of h1v2 fancy upsamling There was no previous GAS implementation.

97530777

2018-06-15T11:13:52

Neon: Intrinsics impl. of h2v1 & h2v2 fancy upsamp The previous AArch32 GAS implementation of h2v1 fancy upsampling has been removed, since the intrinsics implementation provides the same or better performance. There was no previous GAS implementation of h2v2 fancy upsampling, and there was no previous AArch64 GAS implementation of h2v1 fancy upsampling.

5dbd3932

2018-08-01T16:52:31

Neon: Intrinsics implementation of YCbCr->RGB565 The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.

0f35cd68

2018-07-16T10:25:14

Neon: Intrinsics implementation of YCbCr->RGB The previous AArch64 GAS implementation is retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using the new NEON_INTRINSICS CMake variable. The previous AArch32 GAS implementation has been removed, since the intrinsics implementation provides the same or better performance.

4f2216b4

2019-11-26T18:14:33

Neon: Intrinsics implementation of RGB->YCbCr The previous AArch32 and AArch64 GAS implementations are retained by default when using GCC, in order to avoid a performance regression. The intrinsics implementation can be forced on or off using a new NEON_INTRINSICS CMake variable.

kc3-lang/libjpeg-turbo/simd/arm/aarch32

simd/arm/aarch32

Log