|
eb753630
|
2024-08-31T16:50:08
|
|
Update URLs
- Eliminate unnecessary "www."
- Use HTTPS.
- Update Java, MSYS, tdm-gcc, and NSIS URLs.
- Update URL and title of Agner Fog's assembly language optimization
manual.
- Remove extraneous information about MASM and Borland Turbo Assembler
and outdated NASM URLs from the x86 assembly headers, and mention
Yasm.
|
|
3202feb0
|
2024-02-29T16:10:20
|
|
x86-64 SIMD: Support CET if C compiler enables it
- Detect at configure time, via the __CET__ C preprocessor macro,
whether the C compiler will include either indirect branch tracking
(IBT) or shadow stack support, and define a NASM macro (__CET__) if
so.
- Modify the x86-64 SIMD code so that it includes appropriate endbr64
instructions (to support IBT) and an appropriate .note.gnu.property
section (to support both IBT and shadow stack) when __CET__ is
defined.
Closes #350
|
|
13355475
|
2024-02-29T12:18:49
|
|
x86 SIMD: Capitalize all instruction-like macros
(to improve code readability)
|
|
7b844bfd
|
2023-07-28T11:46:10
|
|
x86-64 SIMD: Use std stack frame/prologue/epilogue
This allows debuggers and profilers to reliably capture backtraces from
within the x86-64 SIMD functions.
In places where rbp was previously used to access temporary variables
(after stack alignment), we now use r15 and save/restore it accordingly.
The total amount of work is approximately the same, because the previous
code pushed the pre-alignment stack pointer to the aligned stack. The
new prologue and epilogue actually have fewer instructions.
Also note that the {un}collect_args macros now use rbp instead of rax to
access arguments passed on the stack, so we save a few instructions
there as well.
Based on:
https://github.com/alk/libjpeg-turbo/commit/debcc7c3b436467aea8d02c66a514c5099d0ad37
Closes #707
Closes #708
|
|
e795afc3
|
2021-03-25T22:36:15
|
|
SSE2: Fix prog Huff enc err if Sl%32==0 && Al!=0
(regression introduced by 16bd984557fa2c490be0b9665e2ea0d4274528a8)
This implements the same fix for
jsimd_encode_mcu_AC_refine_prepare_sse2() that
a81a8c137b3f1c65082aa61f236aa88af61b3ad4 implemented for
jsimd_encode_mcu_AC_first_prepare_sse2().
Based on:
https://github.com/MegaByte/libjpeg-turbo/commit/1a59587397150c9ef9dffc5813cb3891db4bc0c8
https://github.com/MegaByte/libjpeg-turbo/commit/eb176a91d87a470bf8c987be786668aa944dd1dd
Fixes #509
Closes #510
|
|
9a51a87a
|
2019-10-17T11:21:32
|
|
x86 SIMD: Remove obsolete [TAB8] comments
With apologies to Richard Hendricks, our assembly code no longer uses
tabs.
|
|
a81a8c13
|
2019-08-14T13:17:11
|
|
SSE2 SIMD: Fix prog Huffman enc. error if Sl%16==0
(regression introduced by 5b177b3cab5cfb661256c1e74df160158ec6c34e)
The SSE2 implementation of progressive Huffman encoding performed
extraneous iterations when the scan length was a multiple of 16.
Based on:
https://github.com/rouault/libjpeg-turbo/commit/bb7f1ef98305da915e581b59bd0ec2ef7bdb8468
Fixes #335
Closes #367
|
|
5b177b3c
|
2018-03-22T11:36:43
|
|
C/SSE2 optimization of encode_mcu_AC_first()
This commit adds C and SSE2 optimizations for the encode_mcu_AC_first()
function used in progressive Huffman encoding.
The image used for testing can be retrieved from this page:
https://blog.cloudflare.com/doubling-the-speed-of-jpegtran
All timings done on `Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz`
clang version is `Apple LLVM version 9.0.0 (clang-900.0.39.2)`
gcc-5 version is `gcc-5 (Homebrew GCC 5.5.0) 5.5.0`
gcc-7 version is `gcc-7 (Homebrew GCC 7.2.0) 7.2.0`
Here are the results in comparison to libjpeg-turbo@293263c using
`time ./jpegtran -outfile /dev/null -progressive -optimise -copy none print_poster_0025.jpg`
C
clang x86_64: +19%
gcc-5 x86_64: +80%
gcc-7 x86_64: +57%
clang i386: +5%
gcc-5 i386: +59%
gcc-7 i386: +51%
SSE2
clang x86_64: +79%
gcc-5 x86_64: +158%
gcc-7 x86_64: +122%
clang i386: +71%
gcc-5 i386: +134%
gcc-7 i386: +135%
Discussion in libjpeg-turbo/libjpeg-turbo#46
|
|
16bd9845
|
2018-03-02T22:33:19
|
|
C/SSE2 optimization of encode_mcu_AC_refine()
This commit adds C and SSE2 optimizations for the encode_mcu_AC_refine()
function used in progressive Huffman encoding.
The image used for testing can be retrieved from this page:
https://blog.cloudflare.com/doubling-the-speed-of-jpegtran
All timings done on `Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz`
clang version is `Apple LLVM version 9.0.0 (clang-900.0.39.2)`
gcc-5 version is `gcc-5 (Homebrew GCC 5.5.0) 5.5.0`
gcc-7 version is `gcc-7 (Homebrew GCC 7.2.0) 7.2.0`
Here are the results in comparison to libjpeg-turbo@3c54642 using
`time ./jpegtran -outfile /dev/null -progressive -optimise -copy none print_poster_0025.jpg`
C
clang x86_64: +7%
gcc-5 x86_64: +30%
gcc-7 x86_64: +33%
clang i386: +0%
gcc-5 i386: +24%
gcc-7 i386: +23%
SSE2
clang x86_64: +42%
gcc-5 x86_64: +53%
gcc-7 x86_64: +64%
clang i386: +35%
gcc-5 i386: +46%
gcc-7 i386: +49%
Discussion in libjpeg-turbo/libjpeg-turbo#46
|