summaryrefslogtreecommitdiff
path: root/src/hwf-x86.c
AgeCommit message (Collapse)AuthorFilesLines
2016-03-12Add Intel PCLMUL implementations of CRC algorithmsJussi Kivilinna1-0/+3
* cipher/Makefile.am: Add 'crc-intel-pclmul.c'. * cipher/crc-intel-pclmul.c: New. * cipher/crc.c (USE_INTEL_PCLMUL): New macro. (CRC_CONTEXT) [USE_INTEL_PCLMUL]: Add 'use_pclmul'. [USE_INTEL_PCLMUL] (_gcry_crc32_intel_pclmul) (gcry_crc24rfc2440_intel_pclmul): New. (crc32_init, crc32rfc1510_init, crc24rfc2440_init) [USE_INTEL_PCLMUL]: Select PCLMUL implementation if SSE4.1 and PCLMUL HW features detected. (crc32_write, crc24rfc2440_write) [USE_INTEL_PCLMUL]: Use PCLMUL implementation if enabled. (crc24_init): Document storage format of 24-bit CRC. (crc24_next4): Use only 'data' for last table look-up. * configure.ac: Add 'crc-intel-pclmul.lo'. * src/g10lib.h (HWF_*, HWF_INTEL_SSE4_1): Update HWF flags to include Intel SSE4.1. * src/hwf-x86.c (detect_x86_gnuc): Add SSE4.1 detection. * src/hwfeatures.c (hwflist): Add 'intel-sse4.1'. * tests/basic.c (fillbuf_count): New. (check_one_md): Add "?" check (million byte data-set with byte pattern 0x00,0x01,0x02,...); Test all buffer sizes 1 to 1000, for "!" and "?" checks. (check_one_md_multi): Skip "?". (check_digests): Add "?" test-vectors for MD5, SHA1, SHA224, SHA256, SHA384, SHA512, SHA3_224, SHA3_256, SHA3_384, SHA3_512, RIPEMD160, CRC32, CRC32_RFC1510, CRC24_RFC2440, TIGER1 and WHIRLPOOL; Add "!" test-vectors for CRC32_RFC1510 and CRC24_RFC2440. -- Add Intel PCLMUL accelerated implmentations of CRC algorithms. CRC performance is improved ~11x on x86_64 and i386 on Intel Haswell, and ~2.7x on Intel Sandy-bridge. Benchmark on Intel Core i5-4570 (x86_64, 3.2 Ghz): Before: | nanosecs/byte mebibytes/sec cycles/byte CRC32 | 0.865 ns/B 1103.0 MiB/s 2.77 c/B CRC32RFC1510 | 0.865 ns/B 1102.7 MiB/s 2.77 c/B CRC24RFC2440 | 0.865 ns/B 1103.0 MiB/s 2.77 c/B After: | nanosecs/byte mebibytes/sec cycles/byte CRC32 | 0.079 ns/B 12051.7 MiB/s 0.253 c/B CRC32RFC1510 | 0.079 ns/B 12050.6 MiB/s 0.253 c/B CRC24RFC2440 | 0.079 ns/B 12100.0 MiB/s 0.252 c/B Benchmark on Intel Core i5-4570 (i386, 3.2 Ghz): Before: | nanosecs/byte mebibytes/sec cycles/byte CRC32 | 0.860 ns/B 1109.0 MiB/s 2.75 c/B CRC32RFC1510 | 0.861 ns/B 1108.3 MiB/s 2.75 c/B CRC24RFC2440 | 0.860 ns/B 1108.6 MiB/s 2.75 c/B After: | nanosecs/byte mebibytes/sec cycles/byte CRC32 | 0.078 ns/B 12207.0 MiB/s 0.250 c/B CRC32RFC1510 | 0.078 ns/B 12207.0 MiB/s 0.250 c/B CRC24RFC2440 | 0.080 ns/B 11925.6 MiB/s 0.256 c/B Benchmark on Intel Core i5-2450M (x86_64, 2.5 Ghz): Before: | nanosecs/byte mebibytes/sec cycles/byte CRC32 | 1.25 ns/B 762.3 MiB/s 3.13 c/B CRC32RFC1510 | 1.26 ns/B 759.1 MiB/s 3.14 c/B CRC24RFC2440 | 1.25 ns/B 764.9 MiB/s 3.12 c/B After: | nanosecs/byte mebibytes/sec cycles/byte CRC32 | 0.451 ns/B 2114.3 MiB/s 1.13 c/B CRC32RFC1510 | 0.451 ns/B 2114.6 MiB/s 1.13 c/B CRC24RFC2440 | 0.457 ns/B 2085.0 MiB/s 1.14 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-10-28hwf-x86: add detection for Intel CPUs with fast SHLD instructionJussi Kivilinna1-2/+32
* cipher/sha1.c (sha1_init): Use HWF_INTEL_FAST_SHLD instead of HWF_INTEL_CPU. * cipher/sha256.c (sha256_init, sha224_init): Ditto. * cipher/sha512.c (sha512_init, sha384_init): Ditto. * src/g10lib.h (HWF_INTEL_FAST_SHLD): New. (HWF_INTEL_BMI2, HWF_INTEL_SSSE3, HWF_INTEL_PCLMUL, HWF_INTEL_AESNI) (HWF_INTEL_RDRAND, HWF_INTEL_AVX, HWF_INTEL_AVX2) (HWF_ARM_NEON): Update. * src/hwf-x86.c (detect_x86_gnuc): Add detection of Intel Core CPUs with fast SHLD/SHRD instruction. * src/hwfeatures.c (hwflist): Add "intel-fast-shld". -- Intel Core CPUs since codename sandy-bridge have been able to execute SHLD/SHRD instructions faster than rotate instructions ROL/ROR. Since SHLD/SHRD can be used to do rotation, some optimized implementations (SHA1/SHA256/SHA512) use SHLD/SHRD instructions in-place of ROL/ROR. This patch provides more accurate detection of CPUs with fast SHLD implementation. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-05-14hwf-x86: use edi for passing value to ebx for i386 cpuidJussi Kivilinna1-1/+1
* src/hwf-x86.c [__i386__] (get_cpuid): Use '=D' for regs[1] instead of '=r'. -- On Win32, %ebx can be assigned for '=r' (regs[1]). This results invalid assembly: pushl %ebx movl %ebx, %ebx cpuid movl %ebx, %ebx popl %ebx So use '=D' (%esi) for regs[1] instead. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-05-14hwf-x86: add EDX as output register for xgetbv asm blockJussi Kivilinna1-4/+4
* src/hwf-x86.c (get_xgetbv): Add EDX as output. -- XGETBV instruction modifies EAX:EDX register pair, so we need to mark EDX as output to let compiler know that contents in this register are lost. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2014-09-02asm: Allow building x86 and amd64 using old compilers.Werner Koch1-0/+4
* src/hwf-x86.c (get_xgetbv): Build only if AVX support is enabled. -- Old as(1) versions do not support the xgetvb instruction. Thus build this function only if asm support has been requested. GnuPG-bug-id: 1708
2013-12-30Fix buggy/incomplete detection of AVX/AVX2 supportJussi Kivilinna1-1/+45
* configure.ac: Also check for 'xgetbv' instruction in AVX and AVX2 inline assembly checks. * src/hwf-x86.c [__i386__] (get_xgetbv): New function. [__x86_64__] (get_xgetbv): New function. [HAS_X86_CPUID] (detect_x86_gnuc): Check for OSXSAVE and OS support for XMM&YMM registers and enable AVX/AVX2 only if XMM&YMM registers are supported by OS. -- This patch is based on original patch and bug report by Panagiotis Christopoulos: Adding better detection of AVX/AVX2 support After upgrading libgcrypt from 1.5.3 to 1.6.0 on a remote XEN system (linode) my gpg2 stopped working properly, throwing SIGILL signals when doing sha512 operations etc. I managed to debug this with the help of Doublas Freed (dwfreed at mtu.edu) and it seems that the current AVX detection just checks for bit 28 on cpuid but the check still works on systems that have disabled the avx/avx2 instructions for some reason (eg. performance/unstability) resulting in SIGILLs (eg. when trying _gcry_sha512_transform_amd64_avx() ). From Intel resources[1][2], I found additional checks for better AVX detection and applied them in the following patch. Please review/change accordingly and commit some better AVX detection mechanism. The AVX part is tested but could not test the AVX2 one, because I lack proper hardware. I can provide additional information upon request. Use the patch only as a guideline, as it's not thoroughly tested. [1] http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled [2] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf (sections 14.3 and 14.7.1) Reported-by: Panagiotis Christopoulos (pchrist) <pchrist@gentoo.org> Cc: Doublas Freed <dwfreed@mtu.edu> Cc: Tim Harder <radhermit@gentoo.org> Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-12-13SHA-512: Add AVX and AVX2 implementations for x86-64Jussi Kivilinna1-1/+6
* cipher/Makefile.am: Add 'sha512-avx-amd64.S' and 'sha512-avx2-bmi2-amd64.S'. * cipher/sha512-avx-amd64.S: New. * cipher/sha512-avx2-bmi2-amd64.S: New. * cipher/sha512.c (USE_AVX, USE_AVX2): New. (SHA512_CONTEXT) [USE_AVX]: Add 'use_avx'. (SHA512_CONTEXT) [USE_AVX2]: Add 'use_avx2'. (sha512_init, sha384_init) [USE_AVX]: Initialize 'use_avx'. (sha512_init, sha384_init) [USE_AVX2]: Initialize 'use_avx2'. [USE_AVX] (_gcry_sha512_transform_amd64_avx): New. [USE_AVX2] (_gcry_sha512_transform_amd64_avx2): New. (transform) [USE_AVX2]: Add call for AVX2 implementation. (transform) [USE_AVX]: Add call for AVX implementation. * configure.ac (HAVE_GCC_INLINE_ASM_BMI2): New check. (sha512): Add 'sha512-avx-amd64.lo' and 'sha512-avx2-bmi2-amd64.lo'. * doc/gcrypt.texi: Document 'intel-cpu' and 'intel-bmi2'. * src/g10lib.h (HWF_INTEL_CPU, HWF_INTEL_BMI2): New. * src/hwfeatures.c (hwflist): Add "intel-cpu" and "intel-bmi2". * src/hwf-x86.c (detect_x86_gnuc): Check for HWF_INTEL_CPU and HWF_INTEL_BMI2. -- Patch adds fast AVX and AVX2 implementation of SHA-512 by Intel Corporation. The assembly source is licensed under 3-clause BSD license, thus compatible with LGPL2.1+. Original source can be accessed at: http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs Implementation is described in white paper "Fast SHA512 Implementations on IntelĀ® Architecture Processors" http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/fast-sha512-implementat$ Note: AVX implementation uses SHLD instruction to emulate RORQ, since it's faster on Intel Sandy-Bridge. However, on non-Intel CPUs SHLD is much slower than RORQ, so therefore AVX implementation is (for now) limited to Intel CPUs. Note: AVX2 implementation also uses BMI2 instruction rorx, thus additional HWF flag. Benchmarks: cpu Old SSSE3 AVX/AVX2 Old vs AVX/AVX2 vs SSSE3 Intel i5-4570 10.11 c/B 7.56 c/B 6.72 c/B 1.50x 1.12x Intel i5-2450M 14.11 c/B 10.53 c/B 8.88 c/B 1.58x 1.18x Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-12-12SHA-256: Add SSSE3 implementation for x86-64Jussi Kivilinna1-0/+3
* cipher/Makefile.am: Add 'sha256-ssse3-amd64.S'. * cipher/sha256-ssse3-amd64.S: New. * cipher/sha256.c (USE_SSSE3): New. (SHA256_CONTEXT) [USE_SSSE3]: Add 'use_ssse3'. (sha256_init, sha224_init) [USE_SSSE3]: Initialize 'use_ssse3'. (transform): Rename to... (_transform): This. [USE_SSSE3] (_gcry_sha256_transform_amd64_ssse3): New. (transform): New. * configure.ac (HAVE_INTEL_SYNTAX_PLATFORM_AS): New check. (sha256): Add 'sha256-ssse3-amd64.lo'. * doc/gcrypt.texi: Document 'intel-ssse3'. * src/g10lib.h (HWF_INTEL_SSSE3): New. * src/hwfeatures.c (hwflist): Add "intel-ssse3". * src/hwf-x86.c (detect_x86_gnuc): Test for SSSE3. -- Patch adds fast SSSE3 implementation of SHA-256 by Intel Corporation. The assembly source is licensed under 3-clause BSD license, thus compatible with LGPL2.1+. Original source can be accessed at: http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs Implementation is described in white paper "Fast SHA - 256 Implementations on IntelĀ® Architecture Processors" http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html Benchmarks: cpu Old New Diff Intel i5-4570 13.99 c/B 10.66 c/B 1.31x Intel i5-2450M 21.53 c/B 15.79 c/B 1.36x Intel Core2 T8100 20.84 c/B 15.07 c/B 1.38x Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-11-20Add Intel PCLMUL acceleration for GCMJussi Kivilinna1-0/+5
* cipher/cipher-gcm.c (fillM): Rename... (do_fillM): ...to this. (ghash): Remove. (fillM): New macro. (GHASH): Use 'do_ghash' instead of 'ghash'. [GCM_USE_INTEL_PCLMUL] (do_ghash_pclmul): New. (ghash): New. (setupM): New. (_gcry_cipher_gcm_encrypt, _gcry_cipher_gcm_decrypt) (_gcry_cipher_gcm_authenticate, _gcry_cipher_gcm_setiv) (_gcry_cipher_gcm_tag): Use 'ghash' instead of 'GHASH' and 'c->u_mode.gcm.u_tag.tag' instead of 'c->u_tag.tag'. * cipher/cipher-internal.h (GCM_USE_INTEL_PCLMUL): New. (gcry_cipher_handle): Move 'u_tag' and 'gcm_table' under 'u_mode.gcm'. * configure.ac (pclmulsupport, gcry_cv_gcc_inline_asm_pclmul): New. * src/g10lib.h (HWF_INTEL_PCLMUL): New. * src/global.c: Add "intel-pclmul". * src/hwf-x86.c (detect_x86_gnuc): Add check for Intel PCLMUL. -- Speed-up GCM for Intel CPUs. Intel Haswell (x86-64): Old: AES GCM enc | 5.17 ns/B 184.4 MiB/s 16.55 c/B GCM dec | 4.38 ns/B 218.0 MiB/s 14.00 c/B GCM auth | 3.17 ns/B 300.4 MiB/s 10.16 c/B New: AES GCM enc | 3.01 ns/B 317.2 MiB/s 9.62 c/B GCM dec | 1.96 ns/B 486.9 MiB/s 6.27 c/B GCM auth | 0.848 ns/B 1124.8 MiB/s 2.71 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-06-09Add detection for Intel AVX2 instruction setJussi Kivilinna1-3/+21
* configure.ac: Add option --disable-avx2-support. (HAVE_GCC_INLINE_ASM_AVX2): New. (ENABLE_AVX2_SUPPORT): New. * src/g10lib.h (HWF_INTEL_AVX2): New. * src/global.c (hwflist): Add HWF_INTEL_AVX2. * src/hwf-x86.c [__i386__] (get_cpuid): Initialize registers to zero before cpuid. [__x86_64__] (get_cpuid): Initialize registers to zero before cpuid. (detect_x86_gnuc): Store maximum cpuid level. (detect_x86_gnuc) [ENABLE_AVX2_SUPPORT]: Add detection for AVX2. -- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-02-19Add AES-NI/AVX accelerated Camellia implementationJussi Kivilinna1-0/+5
* configure.ac: Add option --disable-avx-support. (HAVE_GCC_INLINE_ASM_AVX): New. (ENABLE_AVX_SUPPORT): New. (camellia) [ENABLE_AVX_SUPPORT, ENABLE_AESNI_SUPPORT]: Add camellia_aesni_avx_x86-64.lo. * cipher/Makefile.am (AM_CCASFLAGS): Add. (EXTRA_libcipher_la_SOURCES): Add camellia_aesni_avx_x86-64.S * cipher/camellia-glue.c [ENABLE_AESNI_SUPPORT, ENABLE_AVX_SUPPORT] [__x86_64__] (USE_AESNI_AVX): Add macro. (struct Camellia_context) [USE_AESNI_AVX]: Add use_aesni_avx. [USE_AESNI_AVX] (_gcry_camellia_aesni_avx_ctr_enc) (_gcry_camellia_aesni_avx_cbc_dec): New prototypes to assembly functions. (camellia_setkey) [USE_AESNI_AVX]: Enable AES-NI/AVX if hardware support both. (_gcry_camellia_ctr_enc) [USE_AESNI_AVX]: Add AES-NI/AVX code. (_gcry_camellia_cbc_dec) [USE_AESNI_AVX]: Add AES-NI/AVX code. * cipher/camellia_aesni_avx_x86-64.S: New. * src/g10lib.h (HWF_INTEL_AVX): New. * src/global.c (hwflist): Add HWF_INTEL_AVX. * src/hwf-x86.c (detect_x86_gnuc) [ENABLE_AVX_SUPPORT]: Add detection for AVX. -- Before: Running each test 250 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- CAMELLIA128 2210ms 2200ms 2300ms 2050ms 2240ms 2250ms 2290ms 2270ms 2070ms 2070ms CAMELLIA256 2810ms 2800ms 2920ms 2670ms 2840ms 2850ms 2910ms 2890ms 2660ms 2640ms After: Running each test 250 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- CAMELLIA128 2200ms 2220ms 2290ms 470ms 2240ms 2270ms 2270ms 2290ms 480ms 480ms CAMELLIA256 2820ms 2820ms 2900ms 600ms 2860ms 2860ms 2900ms 2920ms 620ms 620ms AES-NI/AVX implementation works by processing 16 parallel blocks (256 bytes). It's bytesliced implementation that uses AES-NI (Subbyte) for Camellia sboxes, with help of prefiltering/postfiltering. For smaller data sets generic C implementation is used. Speed-up for CBC-decryption and CTR-mode (large data): 4.3x Tests were run on: Intel Core i5-2450M Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> (license boiler plate update by wk)
2012-12-21Prepare for hardware feature detection on other platforms.Werner Koch1-0/+226
* configure.ac (GCRYPT_HWF_MODULES): New. (HAVE_CPU_ARCH_X86, HAVE_CPU_ARCH_ALPHA, HAVE_CPU_ARCH_SPARC) (HAVE_CPU_ARCH_MIPS, HAVE_CPU_ARCH_M68K, HAVE_CPU_ARCH_PPC) (HAVE_CPU_ARCH_ARM): New AC_DEFINEs. * mpi/config.links (mpi_cpu_arch): New. * src/global.c (print_config): Print new tag "cpu-arch". * src/Makefile.am (libgcrypt_la_SOURCES): Add hwf-common.h (EXTRA_libgcrypt_la_SOURCES): New. (gcrypt_hwf_modules): New. (libgcrypt_la_DEPENDENCIES, libgcrypt_la_LIBADD): Add that one. * src/hwfeatures.c: Factor most code out to ... * src/hwf-x86.c: New file. (detect_x86_gnuc): Return the feature vector. (_gcry_hwf_detect_x86): New. * src/hwf-common.h: New. * src/hwfeatures.c (_gcry_detect_hw_features): Dispatch using HAVE_CPU_ARCH_ macros. Signed-off-by: Werner Koch <wk@gnupg.org>