Age | Commit message | Author | Files | Lines |
|
* cipher/Makefile.am: Add 'crc-intel-pclmul.c'.
* cipher/crc-intel-pclmul.c: New.
* cipher/crc.c (USE_INTEL_PCLMUL): New macro.
(CRC_CONTEXT) [USE_INTEL_PCLMUL]: Add 'use_pclmul'.
[USE_INTEL_PCLMUL] (_gcry_crc32_intel_pclmul)
(_gcry_crc24rfc2440_intel_pclmul): New.
(crc32_init, crc32rfc1510_init, crc24rfc2440_init)
[USE_INTEL_PCLMUL]: Select PCLMUL implementation if SSE4.1 and PCLMUL
HW features detected.
(crc32_write, crc24rfc2440_write) [USE_INTEL_PCLMUL]: Use PCLMUL
implementation if enabled.
(crc24_init): Document storage format of 24-bit CRC.
(crc24_next4): Use only 'data' for last table look-up.
* configure.ac: Add 'crc-intel-pclmul.lo'.
* src/g10lib.h (HWF_*, HWF_INTEL_SSE4_1): Update HWF flags to include
Intel SSE4.1.
* src/hwf-x86.c (detect_x86_gnuc): Add SSE4.1 detection.
* src/hwfeatures.c (hwflist): Add 'intel-sse4.1'.
* tests/basic.c (fillbuf_count): New.
(check_one_md): Add "?" check (million byte data-set with byte pattern
0x00,0x01,0x02,...); Test all buffer sizes 1 to 1000, for "!" and "?"
checks.
(check_one_md_multi): Skip "?".
(check_digests): Add "?" test-vectors for MD5, SHA1, SHA224, SHA256,
SHA384, SHA512, SHA3_224, SHA3_256, SHA3_384, SHA3_512, RIPEMD160,
CRC32, CRC32_RFC1510, CRC24_RFC2440, TIGER1 and WHIRLPOOL; Add "!"
test-vectors for CRC32_RFC1510 and CRC24_RFC2440.
--
Add Intel PCLMUL accelerated implementations of CRC algorithms.
CRC performance is improved ~11x on x86_64 and i386 on Intel
Haswell, and ~2.7x on Intel Sandy-bridge.
Benchmark on Intel Core i5-4570 (x86_64, 3.2 GHz):
Before:
               |  nanosecs/byte   mebibytes/sec  cycles/byte
CRC32          |     0.865 ns/B    1103.0 MiB/s     2.77 c/B
CRC32RFC1510   |     0.865 ns/B    1102.7 MiB/s     2.77 c/B
CRC24RFC2440   |     0.865 ns/B    1103.0 MiB/s     2.77 c/B
After:
               |  nanosecs/byte   mebibytes/sec  cycles/byte
CRC32          |     0.079 ns/B   12051.7 MiB/s    0.253 c/B
CRC32RFC1510   |     0.079 ns/B   12050.6 MiB/s    0.253 c/B
CRC24RFC2440   |     0.079 ns/B   12100.0 MiB/s    0.252 c/B
Benchmark on Intel Core i5-4570 (i386, 3.2 GHz):
Before:
               |  nanosecs/byte   mebibytes/sec  cycles/byte
CRC32          |     0.860 ns/B    1109.0 MiB/s     2.75 c/B
CRC32RFC1510   |     0.861 ns/B    1108.3 MiB/s     2.75 c/B
CRC24RFC2440   |     0.860 ns/B    1108.6 MiB/s     2.75 c/B
After:
               |  nanosecs/byte   mebibytes/sec  cycles/byte
CRC32          |     0.078 ns/B   12207.0 MiB/s    0.250 c/B
CRC32RFC1510   |     0.078 ns/B   12207.0 MiB/s    0.250 c/B
CRC24RFC2440   |     0.080 ns/B   11925.6 MiB/s    0.256 c/B
Benchmark on Intel Core i5-2450M (x86_64, 2.5 GHz):
Before:
               |  nanosecs/byte   mebibytes/sec  cycles/byte
CRC32          |      1.25 ns/B     762.3 MiB/s     3.13 c/B
CRC32RFC1510   |      1.26 ns/B     759.1 MiB/s     3.14 c/B
CRC24RFC2440   |      1.25 ns/B     764.9 MiB/s     3.12 c/B
After:
               |  nanosecs/byte   mebibytes/sec  cycles/byte
CRC32          |     0.451 ns/B    2114.3 MiB/s     1.13 c/B
CRC32RFC1510   |     0.451 ns/B    2114.6 MiB/s     1.13 c/B
CRC24RFC2440   |     0.457 ns/B    2085.0 MiB/s     1.14 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
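The buffer-size checks described above rest on one property: a CRC fed in pieces of any size must match the one-shot result. The following is a minimal sketch of that property using a plain bitwise CRC-32, not the PCLMUL code from this commit. The function names are hypothetical, and the 4096-byte buffer is a scaled-down stand-in for the million-byte "?" data set.

```c
#include <stddef.h>
#include <stdint.h>

/* Update a running CRC-32 state bit by bit (reflected polynomial
   0xEDB88320).  'crc' is the internal, pre-inverted state.  */
static uint32_t
crc32_update (uint32_t crc, const unsigned char *buf, size_t len)
{
  size_t i;
  int bit;

  for (i = 0; i < len; i++)
    {
      crc ^= buf[i];
      for (bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  return crc;
}

/* CRC-32 of BUF, fed to the update function in pieces of at most
   CHUNK bytes.  CHUNK must be non-zero.  */
static uint32_t
crc32_chunked (const unsigned char *buf, size_t len, size_t chunk)
{
  uint32_t crc = 0xFFFFFFFFu;

  while (len > 0)
    {
      size_t n = len < chunk ? len : chunk;

      crc = crc32_update (crc, buf, n);
      buf += n;
      len -= n;
    }
  return crc ^ 0xFFFFFFFFu;
}

/* Return 1 if every chunk size from 1 to 1000 yields the same CRC as
   a one-shot computation over a buffer filled with the byte pattern
   0x00,0x01,0x02,...  (hypothetical helper, scaled-down data set).  */
static int
crc32_chunk_sizes_agree (void)
{
  static unsigned char data[4096];
  uint32_t expect;
  size_t i, chunk;

  for (i = 0; i < sizeof data; i++)
    data[i] = (unsigned char) (i & 0xff);

  expect = crc32_chunked (data, sizeof data, sizeof data);
  for (chunk = 1; chunk <= 1000; chunk++)
    if (crc32_chunked (data, sizeof data, chunk) != expect)
      return 0;
  return 1;
}
```

An accelerated implementation that handles its 16-byte-aligned bulk and its tail bytes separately is the kind of code where such a sweep over buffer sizes is likely to catch mistakes.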
|
|
* cipher/sha1.c (sha1_init): Use HWF_INTEL_FAST_SHLD instead of
HWF_INTEL_CPU.
* cipher/sha256.c (sha256_init, sha224_init): Ditto.
* cipher/sha512.c (sha512_init, sha384_init): Ditto.
* src/g10lib.h (HWF_INTEL_FAST_SHLD): New.
(HWF_INTEL_BMI2, HWF_INTEL_SSSE3, HWF_INTEL_PCLMUL, HWF_INTEL_AESNI)
(HWF_INTEL_RDRAND, HWF_INTEL_AVX, HWF_INTEL_AVX2)
(HWF_ARM_NEON): Update.
* src/hwf-x86.c (detect_x86_gnuc): Add detection of Intel Core
CPUs with fast SHLD/SHRD instruction.
* src/hwfeatures.c (hwflist): Add "intel-fast-shld".
--
Intel Core CPUs since codename Sandy Bridge have been able to
execute SHLD/SHRD instructions faster than the rotate instructions
ROL/ROR. Since SHLD/SHRD can be used to do rotation, some
optimized implementations (SHA1/SHA256/SHA512) use SHLD/SHRD
instructions in place of ROL/ROR.
This patch provides more accurate detection of CPUs with
fast SHLD implementation.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
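Why a double shift can stand in for a rotate is easiest to see in a small C model. This is a sketch of the instruction semantics only, not code from the patch. The helper names are hypothetical and the shift count is assumed to be in the range 1..63.

```c
#include <stdint.h>

/* Model of 64-bit SHLD: shift DST left by N bits and fill the vacated
   low bits with the top N bits of SRC.  Assumes 0 < N < 64.  */
static uint64_t
shld64_model (uint64_t dst, uint64_t src, unsigned int n)
{
  return (dst << n) | (src >> (64 - n));
}

/* Reference rotate right.  Assumes 0 < N < 64.  */
static uint64_t
ror64_ref (uint64_t x, unsigned int n)
{
  return (x >> n) | (x << (64 - n));
}

/* Rotate right by N expressed as SHLD with both operands equal:
   a left double shift by 64-N of X:X is a right rotate of X by N.  */
static uint64_t
ror64_via_shld (uint64_t x, unsigned int n)
{
  return shld64_model (x, x, 64 - n);
}

/* Return 1 if both forms agree for every count 1..63 on X.  */
static int
shld_matches_ror (uint64_t x)
{
  unsigned int n;

  for (n = 1; n < 64; n++)
    if (ror64_via_shld (x, n) != ror64_ref (x, n))
      return 0;
  return 1;
}
```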
|
|
* src/hwf-x86.c [__i386__] (get_cpuid): Use '=D' for regs[1] instead
of '=r'.
--
On Win32, %ebx can be assigned for '=r' (regs[1]). This results in invalid
assembly:
pushl %ebx
movl %ebx, %ebx
cpuid
movl %ebx, %ebx
popl %ebx
So use '=D' (%edi) for regs[1] instead.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
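The constraint choice can be sketched as follows. This is an illustrative wrapper under stated assumptions, not the function from src/hwf-x86.c: the name cpuid_sketch and the HAVE_CPUID_SKETCH macro are hypothetical, and a GCC-compatible compiler is assumed. On i386 the 'D' constraint pins regs[1] to %edi, so the compiler cannot pick %ebx for it while %ebx is being saved and restored by hand.

```c
#include <stdint.h>

#if defined(__GNUC__) && defined(__i386__)
# define HAVE_CPUID_SKETCH 1
/* i386: %ebx may be reserved (e.g. as the PIC register), so save and
   restore it manually and return its CPUID value through %edi.  */
static void
cpuid_sketch (uint32_t leaf, uint32_t regs[4])
{
  __asm__ volatile
    ("pushl %%ebx\n\t"
     "movl %1, %%ebx\n\t"   /* start with a zeroed %ebx */
     "cpuid\n\t"
     "movl %%ebx, %1\n\t"   /* copy result out before restoring */
     "popl %%ebx\n\t"
     : "=a" (regs[0]), "=D" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
     : "0" (leaf), "1" (0u), "2" (0u), "3" (0u)
     : "cc");
}
#elif defined(__GNUC__) && defined(__x86_64__)
# define HAVE_CPUID_SKETCH 1
/* x86-64: let the compiler handle %rbx through the 'b' constraint.  */
static void
cpuid_sketch (uint32_t leaf, uint32_t regs[4])
{
  __asm__ volatile
    ("cpuid"
     : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
     : "0" (leaf), "2" (0u));
}
#else
# define HAVE_CPUID_SKETCH 0
/* Other architectures: report nothing.  */
static void
cpuid_sketch (uint32_t leaf, uint32_t regs[4])
{
  (void) leaf;
  regs[0] = regs[1] = regs[2] = regs[3] = 0;
}
#endif
```

Calling cpuid_sketch (0, regs) on an x86 machine leaves the highest supported basic leaf in regs[0].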
|
|
* src/hwf-x86.c (get_xgetbv): Add EDX as output.
--
XGETBV instruction modifies EAX:EDX register pair, so we need to mark
EDX as output to let compiler know that contents in this register are
lost.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
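A minimal sketch of an XGETBV wrapper with both halves of the result declared as outputs is shown below. It is an illustration rather than the libgcrypt function: the names are hypothetical, a GCC-compatible compiler with <cpuid.h> and an assembler that knows the xgetbv mnemonic are assumed, and the OSXSAVE guard is there only so the sketch cannot fault where XGETBV is unavailable.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>

/* CPUID.1:ECX bit 27 (OSXSAVE) tells whether XGETBV may be executed.  */
static int
osxsave_enabled_sketch (void)
{
  unsigned int a, b, c, d;

  if (!__get_cpuid (1, &a, &b, &c, &d))
    return 0;
  return (int) ((c >> 27) & 1u);
}

/* Return the low 32 bits of XCR0, or 0 if XGETBV is not usable.
   XGETBV writes EDX:EAX, so EDX is listed as an output even though
   its value is discarded; otherwise the compiler could keep a live
   value in EDX across the instruction.  */
static uint32_t
read_xcr0_low_sketch (void)
{
  uint32_t eax, edx;

  if (!osxsave_enabled_sketch ())
    return 0;
  __asm__ volatile ("xgetbv" : "=a" (eax), "=d" (edx) : "c" (0));
  (void) edx;
  return eax;
}
#else
/* Non-x86 build: nothing to read.  */
static uint32_t
read_xcr0_low_sketch (void)
{
  return 0;
}
#endif
```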
|
|
* src/hwf-x86.c (get_xgetbv): Build only if AVX support is enabled.
--
Old as(1) versions do not support the xgetbv instruction. Thus build
this function only if asm support has been requested.
GnuPG-bug-id: 1708
|
|
* configure.ac: Also check for 'xgetbv' instruction in AVX and AVX2
inline assembly checks.
* src/hwf-x86.c [__i386__] (get_xgetbv): New function.
[__x86_64__] (get_xgetbv): New function.
[HAS_X86_CPUID] (detect_x86_gnuc): Check for OSXSAVE and OS support for
XMM&YMM registers and enable AVX/AVX2 only if XMM&YMM registers are
supported by OS.
--
This patch is based on original patch and bug report by Panagiotis Christopoulos:
Adding better detection of AVX/AVX2 support
After upgrading libgcrypt from 1.5.3 to 1.6.0 on a remote XEN system (linode) my
gpg2 stopped working properly, throwing SIGILL signals when doing sha512
operations etc. I managed to debug this with the help of Douglas Freed
(dwfreed at mtu.edu) and it seems that the current AVX detection just checks for
bit 28 on cpuid but the check still passes on systems that have disabled the avx/avx2
instructions for some reason (eg. performance/instability) resulting in SIGILLs
(eg. when trying _gcry_sha512_transform_amd64_avx() ).
From Intel resources[1][2], I found additional checks for better AVX
detection and applied them in the following patch. Please review/change
accordingly and commit some better AVX detection mechanism. The AVX part is
tested but could not test the AVX2 one, because I lack proper hardware. I can
provide additional information upon request. Use the patch only as a guideline,
as it's not thoroughly tested.
[1] http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled
[2] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf (sections 14.3 and 14.7.1)
Reported-by: Panagiotis Christopoulos (pchrist) <pchrist@gentoo.org>
Cc: Douglas Freed <dwfreed@mtu.edu>
Cc: Tim Harder <radhermit@gentoo.org>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
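The decision logic described in this patch can be sketched as a pure function of two inputs: the ECX value from CPUID leaf 1 and the low word of XCR0 as read by XGETBV. The function and macro names are hypothetical. The bit positions used are the architectural ones: CPUID.1:ECX bit 27 is OSXSAVE and bit 28 is AVX, XCR0 bit 1 covers XMM state and bit 2 covers YMM state.

```c
#include <stdint.h>

#define SK_CPUID1_ECX_OSXSAVE  (1u << 27)  /* OS has enabled XSAVE/XGETBV */
#define SK_CPUID1_ECX_AVX      (1u << 28)  /* CPU implements AVX */
#define SK_XCR0_XMM            (1u << 1)   /* OS saves XMM registers */
#define SK_XCR0_YMM            (1u << 2)   /* OS saves YMM registers */

/* Return 1 only if the CPU advertises AVX and the OS has enabled
   saving of both XMM and YMM state.  XCR0 is meaningful only when
   OSXSAVE is set, so that bit is checked first; callers should pass
   0 for XCR0 when XGETBV could not be executed.  */
static int
avx_usable_sketch (uint32_t cpuid1_ecx, uint32_t xcr0)
{
  const uint32_t need = SK_XCR0_XMM | SK_XCR0_YMM;

  if (!(cpuid1_ecx & SK_CPUID1_ECX_OSXSAVE))
    return 0;
  if (!(cpuid1_ecx & SK_CPUID1_ECX_AVX))
    return 0;
  return (xcr0 & need) == need;
}
```

The failure reported above corresponds to the first case in which the AVX bit is set but the OS has not enabled YMM state: checking bit 28 alone would report AVX as usable there.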
|
|
* cipher/Makefile.am: Add 'sha512-avx-amd64.S' and
'sha512-avx2-bmi2-amd64.S'.
* cipher/sha512-avx-amd64.S: New.
* cipher/sha512-avx2-bmi2-amd64.S: New.
* cipher/sha512.c (USE_AVX, USE_AVX2): New.
(SHA512_CONTEXT) [USE_AVX]: Add 'use_avx'.
(SHA512_CONTEXT) [USE_AVX2]: Add 'use_avx2'.
(sha512_init, sha384_init) [USE_AVX]: Initialize 'use_avx'.
(sha512_init, sha384_init) [USE_AVX2]: Initialize 'use_avx2'.
[USE_AVX] (_gcry_sha512_transform_amd64_avx): New.
[USE_AVX2] (_gcry_sha512_transform_amd64_avx2): New.
(transform) [USE_AVX2]: Add call for AVX2 implementation.
(transform) [USE_AVX]: Add call for AVX implementation.
* configure.ac (HAVE_GCC_INLINE_ASM_BMI2): New check.
(sha512): Add 'sha512-avx-amd64.lo' and 'sha512-avx2-bmi2-amd64.lo'.
* doc/gcrypt.texi: Document 'intel-cpu' and 'intel-bmi2'.
* src/g10lib.h (HWF_INTEL_CPU, HWF_INTEL_BMI2): New.
* src/hwfeatures.c (hwflist): Add "intel-cpu" and "intel-bmi2".
* src/hwf-x86.c (detect_x86_gnuc): Check for HWF_INTEL_CPU and
HWF_INTEL_BMI2.
--
Patch adds fast AVX and AVX2 implementation of SHA-512 by Intel Corporation.
The assembly source is licensed under 3-clause BSD license, thus compatible
with LGPL2.1+. Original source can be accessed at:
http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs
Implementation is described in white paper
"Fast SHA512 Implementations on Intel® Architecture Processors"
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/fast-sha512-implementat$
Note: AVX implementation uses SHLD instruction to emulate RORQ, since it's
faster on Intel Sandy-Bridge. However, on non-Intel CPUs SHLD is much
slower than RORQ, so the AVX implementation is (for now) limited
to Intel CPUs.
Note: AVX2 implementation also uses BMI2 instruction rorx, thus additional
HWF flag.
Benchmarks:
cpu              Old          SSSE3        AVX/AVX2     Old vs      SSSE3 vs
                                                        AVX/AVX2    AVX/AVX2
Intel i5-4570    10.11 c/B     7.56 c/B     6.72 c/B    1.50x       1.12x
Intel i5-2450M   14.11 c/B    10.53 c/B     8.88 c/B    1.58x       1.18x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
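The two notes above imply a selection rule, sketched here under assumptions: the AVX2 path needs both AVX2 and BMI2, the AVX path needs AVX on an Intel CPU, and anything else falls back to the older code. The flag values and all names below are hypothetical stand-ins for the HWF_* flags, and paths such as SSSE3 are left out for brevity.

```c
/* Hypothetical feature bits; the real HWF_* values live in
   src/g10lib.h and differ from these.  */
#define SK_HWF_INTEL_CPU   (1u << 0)
#define SK_HWF_INTEL_BMI2  (1u << 1)
#define SK_HWF_INTEL_AVX   (1u << 2)
#define SK_HWF_INTEL_AVX2  (1u << 3)

enum sk_sha512_impl
  {
    SK_SHA512_GENERIC,
    SK_SHA512_AVX,
    SK_SHA512_AVX2
  };

/* Pick the fastest usable SHA-512 transform for a feature vector.  */
static enum sk_sha512_impl
sk_select_sha512 (unsigned int features)
{
  /* AVX2 code also uses the BMI2 instruction rorx.  */
  if ((features & SK_HWF_INTEL_AVX2) && (features & SK_HWF_INTEL_BMI2))
    return SK_SHA512_AVX2;

  /* AVX code relies on fast SHLD, hence the Intel-only restriction.  */
  if ((features & SK_HWF_INTEL_AVX) && (features & SK_HWF_INTEL_CPU))
    return SK_SHA512_AVX;

  return SK_SHA512_GENERIC;
}
```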
|
|
* cipher/Makefile.am: Add 'sha256-ssse3-amd64.S'.
* cipher/sha256-ssse3-amd64.S: New.
* cipher/sha256.c (USE_SSSE3): New.
(SHA256_CONTEXT) [USE_SSSE3]: Add 'use_ssse3'.
(sha256_init, sha224_init) [USE_SSSE3]: Initialize 'use_ssse3'.
(transform): Rename to...
(_transform): This.
[USE_SSSE3] (_gcry_sha256_transform_amd64_ssse3): New.
(transform): New.
* configure.ac (HAVE_INTEL_SYNTAX_PLATFORM_AS): New check.
(sha256): Add 'sha256-ssse3-amd64.lo'.
* doc/gcrypt.texi: Document 'intel-ssse3'.
* src/g10lib.h (HWF_INTEL_SSSE3): New.
* src/hwfeatures.c (hwflist): Add "intel-ssse3".
* src/hwf-x86.c (detect_x86_gnuc): Test for SSSE3.
--
Patch adds fast SSSE3 implementation of SHA-256 by Intel Corporation. The
assembly source is licensed under 3-clause BSD license, thus compatible
with LGPL2.1+. Original source can be accessed at:
http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs
Implementation is described in white paper
"Fast SHA-256 Implementations on Intel® Architecture Processors"
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html
Benchmarks:
cpu                 Old          New          Diff
Intel i5-4570       13.99 c/B    10.66 c/B    1.31x
Intel i5-2450M      21.53 c/B    15.79 c/B    1.36x
Intel Core2 T8100   20.84 c/B    15.07 c/B    1.38x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/cipher-gcm.c (fillM): Rename...
(do_fillM): ...to this.
(ghash): Remove.
(fillM): New macro.
(GHASH): Use 'do_ghash' instead of 'ghash'.
[GCM_USE_INTEL_PCLMUL] (do_ghash_pclmul): New.
(ghash): New.
(setupM): New.
(_gcry_cipher_gcm_encrypt, _gcry_cipher_gcm_decrypt)
(_gcry_cipher_gcm_authenticate, _gcry_cipher_gcm_setiv)
(_gcry_cipher_gcm_tag): Use 'ghash' instead of 'GHASH' and
'c->u_mode.gcm.u_tag.tag' instead of 'c->u_tag.tag'.
* cipher/cipher-internal.h (GCM_USE_INTEL_PCLMUL): New.
(gcry_cipher_handle): Move 'u_tag' and 'gcm_table' under
'u_mode.gcm'.
* configure.ac (pclmulsupport, gcry_cv_gcc_inline_asm_pclmul): New.
* src/g10lib.h (HWF_INTEL_PCLMUL): New.
* src/global.c: Add "intel-pclmul".
* src/hwf-x86.c (detect_x86_gnuc): Add check for Intel PCLMUL.
--
Speed-up GCM for Intel CPUs.
Intel Haswell (x86-64):
Old:
 AES  GCM enc  |   5.17 ns/B    184.4 MiB/s   16.55 c/B
      GCM dec  |   4.38 ns/B    218.0 MiB/s   14.00 c/B
      GCM auth |   3.17 ns/B    300.4 MiB/s   10.16 c/B
New:
 AES  GCM enc  |   3.01 ns/B    317.2 MiB/s    9.62 c/B
      GCM dec  |   1.96 ns/B    486.9 MiB/s    6.27 c/B
      GCM auth |  0.848 ns/B   1124.8 MiB/s    2.71 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
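For readers unfamiliar with PCLMUL: the instruction computes a carry-less (polynomial over GF(2)) product of two 64-bit values, which is the core operation of GHASH. The following is a portable bit-by-bit model of that product, offered as a sketch of the arithmetic only. It is not the do_ghash_pclmul code, it omits the GF(2^128) reduction step, and the function name is hypothetical.

```c
#include <stdint.h>

/* Carry-less multiply of A and B: like schoolbook multiplication, but
   partial products are combined with XOR instead of addition.  The
   128-bit result is returned in *HI (upper half) and *LO (lower half).  */
static void
clmul64_model (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
  uint64_t h = 0, l = 0;
  unsigned int i;

  for (i = 0; i < 64; i++)
    if ((b >> i) & 1u)
      {
        l ^= a << i;
        if (i)                  /* avoid an undefined shift by 64 */
          h ^= a >> (64 - i);
      }
  *hi = h;
  *lo = l;
}
```

As a small check of the arithmetic, (x + 1) * (x + 1) = x^2 + 1 over GF(2), so multiplying 3 by 3 gives 5 rather than 9.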
|
|
* configure.ac: Add option --disable-avx2-support.
(HAVE_GCC_INLINE_ASM_AVX2): New.
(ENABLE_AVX2_SUPPORT): New.
* src/g10lib.h (HWF_INTEL_AVX2): New.
* src/global.c (hwflist): Add HWF_INTEL_AVX2.
* src/hwf-x86.c [__i386__] (get_cpuid): Initialize registers to zero
before cpuid.
[__x86_64__] (get_cpuid): Initialize registers to zero before cpuid.
(detect_x86_gnuc): Store maximum cpuid level.
(detect_x86_gnuc) [ENABLE_AVX2_SUPPORT]: Add detection for AVX2.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac: Add option --disable-avx-support.
(HAVE_GCC_INLINE_ASM_AVX): New.
(ENABLE_AVX_SUPPORT): New.
(camellia) [ENABLE_AVX_SUPPORT, ENABLE_AESNI_SUPPORT]: Add
camellia_aesni_avx_x86-64.lo.
* cipher/Makefile.am (AM_CCASFLAGS): Add.
(EXTRA_libcipher_la_SOURCES): Add camellia_aesni_avx_x86-64.S
* cipher/camellia-glue.c [ENABLE_AESNI_SUPPORT, ENABLE_AVX_SUPPORT]
[__x86_64__] (USE_AESNI_AVX): Add macro.
(struct Camellia_context) [USE_AESNI_AVX]: Add use_aesni_avx.
[USE_AESNI_AVX] (_gcry_camellia_aesni_avx_ctr_enc)
(_gcry_camellia_aesni_avx_cbc_dec): New prototypes to assembly
functions.
(camellia_setkey) [USE_AESNI_AVX]: Enable AES-NI/AVX if hardware
supports both.
(_gcry_camellia_ctr_enc) [USE_AESNI_AVX]: Add AES-NI/AVX code.
(_gcry_camellia_cbc_dec) [USE_AESNI_AVX]: Add AES-NI/AVX code.
* cipher/camellia_aesni_avx_x86-64.S: New.
* src/g10lib.h (HWF_INTEL_AVX): New.
* src/global.c (hwflist): Add HWF_INTEL_AVX.
* src/hwf-x86.c (detect_x86_gnuc) [ENABLE_AVX_SUPPORT]: Add detection
for AVX.
--
Before:
Running each test 250 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
CAMELLIA128   2210ms  2200ms  2300ms  2050ms  2240ms  2250ms  2290ms  2270ms  2070ms  2070ms
CAMELLIA256   2810ms  2800ms  2920ms  2670ms  2840ms  2850ms  2910ms  2890ms  2660ms  2640ms
After:
Running each test 250 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
CAMELLIA128   2200ms  2220ms  2290ms   470ms  2240ms  2270ms  2270ms  2290ms   480ms   480ms
CAMELLIA256   2820ms  2820ms  2900ms   600ms  2860ms  2860ms  2900ms  2920ms   620ms   620ms
The AES-NI/AVX implementation processes 16 blocks in parallel (256 bytes).
It is a byte-sliced implementation that uses AES-NI (SubBytes) for the Camellia
s-boxes, with the help of prefiltering/postfiltering. For smaller data sets the
generic C implementation is used.
Speed-up for CBC-decryption and CTR-mode (large data): 4.3x
Tests were run on: Intel Core i5-2450M
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
(license boiler plate update by wk)
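The split between the 16-block assembly path and the generic C fallback can be sketched as a simple loop. This is an illustration of the batching scheme only: it counts how the work would be divided rather than encrypting anything, and the struct and function names are hypothetical.

```c
#include <stddef.h>

#define SK_BLOCK_SIZE       16   /* Camellia block size in bytes */
#define SK_PARALLEL_BLOCKS  16   /* blocks handled per wide call */

struct sk_split
{
  size_t wide_calls;      /* calls that would go to the AES-NI/AVX code */
  size_t single_blocks;   /* leftover blocks for the generic C code */
};

/* Divide NBLOCKS into full 16-block batches plus a remainder.  */
static struct sk_split
sk_split_blocks (size_t nblocks)
{
  struct sk_split s = { 0, 0 };

  while (nblocks >= SK_PARALLEL_BLOCKS)
    {
      s.wide_calls++;     /* one 256-byte batch */
      nblocks -= SK_PARALLEL_BLOCKS;
    }
  s.single_blocks = nblocks;
  return s;
}
```

With this shape, inputs shorter than 256 bytes never reach the wide path, which matches the remark that smaller data sets use the generic implementation.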
|
|
* configure.ac (GCRYPT_HWF_MODULES): New.
(HAVE_CPU_ARCH_X86, HAVE_CPU_ARCH_ALPHA, HAVE_CPU_ARCH_SPARC)
(HAVE_CPU_ARCH_MIPS, HAVE_CPU_ARCH_M68K, HAVE_CPU_ARCH_PPC)
(HAVE_CPU_ARCH_ARM): New AC_DEFINEs.
* mpi/config.links (mpi_cpu_arch): New.
* src/global.c (print_config): Print new tag "cpu-arch".
* src/Makefile.am (libgcrypt_la_SOURCES): Add hwf-common.h
(EXTRA_libgcrypt_la_SOURCES): New.
(gcrypt_hwf_modules): New.
(libgcrypt_la_DEPENDENCIES, libgcrypt_la_LIBADD): Add that one.
* src/hwfeatures.c: Factor most code out to ...
* src/hwf-x86.c: New file.
(detect_x86_gnuc): Return the feature vector.
(_gcry_hwf_detect_x86): New.
* src/hwf-common.h: New.
* src/hwfeatures.c (_gcry_detect_hw_features): Dispatch using
HAVE_CPU_ARCH_ macros.
Signed-off-by: Werner Koch <wk@gnupg.org>
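The resulting layout, a generic front end that dispatches to a per-architecture module and a table that maps feature names to bits, might look roughly like the sketch below. Everything here is an assumption for illustration: the bit values are made up, the compiler's predefined macros stand in for the HAVE_CPU_ARCH_ defines, the detection stub returns nothing, and the names are taken from entries mentioned elsewhere in this log.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name table in the spirit of hwflist.  */
struct sk_hwf_name
{
  unsigned int flag;
  const char *name;
};

static const struct sk_hwf_name sk_hwf_names[] =
  {
    { 1u << 0, "intel-cpu" },
    { 1u << 1, "intel-bmi2" },
    { 1u << 2, "intel-ssse3" },
    { 1u << 3, "intel-pclmul" }
  };

/* Return the bit for NAME, or 0 if the name is unknown.  */
static unsigned int
sk_hwf_from_name (const char *name)
{
  size_t i;

  for (i = 0; i < sizeof sk_hwf_names / sizeof sk_hwf_names[0]; i++)
    if (!strcmp (sk_hwf_names[i].name, name))
      return sk_hwf_names[i].flag;
  return 0;
}

/* Stand-in for a per-architecture module such as hwf-x86.c; a real
   one would run CPUID here and return the feature vector.  */
static unsigned int
sk_detect_x86_stub (void)
{
  return 0;
}

/* Front end: choose the module for the build architecture.  */
static unsigned int
sk_detect_hw_features (void)
{
#if defined(__i386__) || defined(__x86_64__)
  return sk_detect_x86_stub ();
#else
  return 0;   /* no module for this architecture in the sketch */
#endif
}
```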
|