|
d38b4f21
|
2016-01-16T01:53:32
|
|
Optimize ARM64 SIMD code for Cavium ThunderX
Per @ssvb:
ThunderX is an ARM64 chip that dedicates most of its transistor real
estate to providing 48 cores, so each core is not as fast as a result.
Each core is dual-issue & in-order for scalar instructions and has only
a single-issue half-width NEON unit, so the peak throughput is one
128-bit instruction per 2 cycles. So careful instruction scheduling is
important. Furthermore, ThunderX has an extremely slow implementation
of ld2 and ld3, so this commit implements the equivalent of those
instructions using ld1.
Compression speedup relative to libjpeg-turbo 1.4.2:
48-core ThunderX (RunAbove ARM Cloud), Linux, 64-bit: 58-85% (avg. 74%)
relative to jpeg-6b: 1.75-2.14x (avg. 1.95x)
Refer to #49 and #51 for discussion.
Closes #51.
This commit also wordsmiths the ChangeLog entry (the ARMv8 SIMD
implementation is "complete" only for compression-- it still lacks some
decompression algorithms, as does the ARMv7 implementation.)
Based on:
https://github.com/mayeut/libjpeg-turbo/commit/9405b5fd031558113bdfeae193a2b14baa589a75
which is based on:
https://github.com/libjpeg-turbo/libjpeg-turbo/commit/f561944ff70adef65bb36212913bd28e6a2926d6
https://github.com/libjpeg-turbo/libjpeg-turbo/commit/962c8ab21feb3d7fc2a7a1ec8d26f6b985bbb86f
|
|
ec6941f7
|
2016-01-15T09:29:11
|
|
Complete the ARM64 NEON SIMD implementation
This adds 64-bit NEON coverage for all of the algorithms that are
covered by the 32-bit NEON implementation, except for h2v1 (4:2:2) fancy
upsampling (used when decompressing 4:2:2 JPEG images.) It also adds
64-bit NEON SIMD coverage for:
* slow integer forward DCT (compressor)
* h2v2 (4:2:0) downsampling (compressor)
* h2v1 (4:2:2) downsampling (compressor)
which are not covered in the 32-bit implementation.
Compression speedups relative to libjpeg-turbo 1.4.2:
Apple A7 (iPhone 5S), iOS, 64-bit: 113-150% (reported)
48-core ThunderX (RunAbove ARM Cloud), Linux, 64-bit: 2.1-33% (avg. 15%)
Refer to #44 and #49 for discussion
This commit also removes the unnecessary
if (simd_support & JSIMD_ARM_NEON)
statements from the jsimd* algorithm functions. Since the jsimd_can*()
functions check for the existence of NEON, the corresponding algorithm
functions will never be called if NEON isn't available.
Based on:
https://github.com/mayeut/libjpeg-turbo/commit/dcd9d84f10fae192c0e3935818dc289bca9c3e29
https://github.com/mayeut/libjpeg-turbo/commit/b0d87b811f37bd560083deea8c6e7d704e5cd944
https://github.com/mayeut/libjpeg-turbo/commit/70cd5c8a493a67f4d54dd2067ae6dedb65d95389
https://github.com/mayeut/libjpeg-turbo/commit/3e58d9a064648503c57ec2650ee79880f749a52b
https://github.com/mayeut/libjpeg-turbo/commit/837b19542f53fa81af83e6ba002d559877aaf597
https://github.com/mayeut/libjpeg-turbo/commit/73dc43ccc870c2e10ba893e9764b8e48d6836585
https://github.com/mayeut/libjpeg-turbo/commit/a82b71a261b4c0213f558baf4bc745f1c27356d8
https://github.com/mayeut/libjpeg-turbo/commit/c1b1188c2106d6ea7b76644b6023b57edeb602e1
https://github.com/mayeut/libjpeg-turbo/commit/305c89284e1bb222b34fbc7261f697a0cc452a41
https://github.com/mayeut/libjpeg-turbo/commit/7f443f99950b4d7d442b9b879648eca5273209bd
https://github.com/mayeut/libjpeg-turbo/commit/4c2b53b77da5a20e30e2aadaeddb0efbfe24e06d
Unified version with fixes:
https://github.com/mayeut/libjpeg-turbo/commit/1004a3cd05870612a194b410efeaa1b4da76d246
|
|
d70a5c12
|
2015-12-14T16:59:43
|
|
Remove unnecessary .arch directive in ARM64 code
This directive was preventing the code from assembling using the
integrated assembler in clang.
Fixes #33
|
|
1e32fe31
|
2015-10-14T17:32:39
|
|
Replace INT32 with a new internal datatype (JLONG)
These days, INT32 is a commonly-defined datatype in system headers. We
cannot eliminate the definition of that datatype from jmorecfg.h, since
the INT32 typedef has technically been part of the libjpeg API since
version 5 (1994.) However, using INT32 internally is risky, because the
inclusion of a particular header (Xmd.h, for instance) could change the
definition of INT32 from long to int on 64-bit platforms and thus change
the internal behavior of libjpeg-turbo in unexpected ways (for instance,
failing to correctly set __INT32_IS_ACTUALLY_LONG to match the INT32
typedef-- perhaps as a result of including the wrong version of
jpeglib.h-- could cause libjpeg-turbo to produce incorrect results.)
The library has always been built in environments in which INT32 is
effectively long (on Windows, long is always 32-bit, so effectively it's
the same as int), so it makes sense to turn INT32 into an explicitly
long datatype. This ensures that libjpeg-turbo will always behave
consistently, regardless of the headers included at compile time.
Addresses a concern expressed in #26.
|
|
62999d77
|
2014-12-19T15:36:39
|
|
Modify the ARM64 assembly file so that it uses only syntax that the clang assembler in XCode 5.x can understand. These changes should all be cosmetic in nature-- they do not change the meaning or readability of the code nor the ability to build it for Linux. Actually, the code is now more in compliance with the ARM64 programming manual. In addition to these changes, there were a couple of instructions that clang simply doesn't support, so gas-preprocessor.pl was modified so that it now converts those into equivalent instructions that clang can handle.
git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/branches/1.4.x@1450 632fc199-4ca6-4c93-a231-07263d6284db
|
|
0a9a2526
|
2014-08-29T01:53:17
|
|
Rename the ARM64 assembly file to match the C file
git-svn-id: svn+ssh://svn.code.sf.net/p/libjpeg-turbo/code/trunk@1390 632fc199-4ca6-4c93-a231-07263d6284db
|