path: root/configure.ac
Age | Commit message | Author | Files | Lines
2013-12-13 | SHA-1: Add SSSE3 implementation | Jussi Kivilinna | 1 | -0/+7
* cipher/Makefile.am: Add 'sha1-ssse3-amd64.c'.
* cipher/sha1-ssse3-amd64.c: New.
* cipher/sha1.c (USE_SSSE3): New. (SHA1_CONTEXT) [USE_SSSE3]: Add 'use_ssse3'. (sha1_init) [USE_SSSE3]: Initialize 'use_ssse3'. (transform): Rename to... (_transform): this. (transform): New.
* configure.ac [host=x86_64]: Add 'sha1-ssse3-amd64.lo'.
--
Patch adds SSSE3 implementation based on white paper "Improving the Performance of the Secure Hash Algorithm (SHA-1)" at http://software.intel.com/en-us/articles/improving-the-performance-of-the-secure-hash-algorithm-1

Benchmarks:
cpu                  Old          New         Diff
Intel i5-4570        9.02 c/B     5.22 c/B    1.72x
Intel i5-2450M       12.27 c/B    7.24 c/B    1.69x
Intel Core2 T8100    7.94 c/B     6.76 c/B    1.17x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
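The 'use_ssse3' flag and the transform/_transform split described in this entry follow a common runtime-dispatch pattern. A minimal sketch of that pattern is given here; every name prefixed with sketch_ or SKETCH_ is hypothetical, the two transform bodies are placeholders that only count calls, and none of this is the actual libgcrypt code.

```c
/* Hypothetical feature bit; the real library has its own HWF_* flags. */
#define SKETCH_HWF_SSSE3 (1u << 0)

typedef struct {
    unsigned int h0, h1, h2, h3, h4; /* SHA-1 chaining state */
    int use_ssse3;                   /* decided once, at init time */
} sketch_sha1_ctx;

/* Call counters so the dispatch decision can be observed. */
static int sketch_generic_calls;
static int sketch_ssse3_calls;

/* Placeholder for the portable C transform (the renamed _transform). */
static void sketch_transform_generic(sketch_sha1_ctx *ctx,
                                     const unsigned char *block)
{
    (void)ctx; (void)block;
    sketch_generic_calls++;
}

/* Placeholder standing in for the SSSE3 assembly routine. */
static void sketch_transform_ssse3(sketch_sha1_ctx *ctx,
                                   const unsigned char *block)
{
    (void)ctx; (void)block;
    sketch_ssse3_calls++;
}

/* Init sets the standard SHA-1 initial values and records the CPU feature. */
static void sketch_sha1_init(sketch_sha1_ctx *ctx, unsigned int hw_features)
{
    ctx->h0 = 0x67452301u;
    ctx->h1 = 0xefcdab89u;
    ctx->h2 = 0x98badcfeu;
    ctx->h3 = 0x10325476u;
    ctx->h4 = 0xc3d2e1f0u;
    ctx->use_ssse3 = (hw_features & SKETCH_HWF_SSSE3) != 0;
}

/* The new wrapper: pick the implementation per context. */
static void sketch_transform(sketch_sha1_ctx *ctx, const unsigned char *block)
{
    if (ctx->use_ssse3)
        sketch_transform_ssse3(ctx, block);
    else
        sketch_transform_generic(ctx, block);
}
```

Keeping the decision in the context means the feature check runs once per init rather than once per block.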
2013-12-13 | Fix empty clobber in AVX2 assembly check | Jussi Kivilinna | 1 | -1/+1
* configure.ac (gcry_cv_gcc_inline_asm_avx2): Add "cc" as assembly clobber.
--
Apparently empty clobbers only work in some cases on Linux, and fail on mingw32.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-12-13 | SHA-512: Add AVX and AVX2 implementations for x86-64 | Jussi Kivilinna | 1 | -0/+19
* cipher/Makefile.am: Add 'sha512-avx-amd64.S' and 'sha512-avx2-bmi2-amd64.S'. * cipher/sha512-avx-amd64.S: New. * cipher/sha512-avx2-bmi2-amd64.S: New. * cipher/sha512.c (USE_AVX, USE_AVX2): New. (SHA512_CONTEXT) [USE_AVX]: Add 'use_avx'. (SHA512_CONTEXT) [USE_AVX2]: Add 'use_avx2'. (sha512_init, sha384_init) [USE_AVX]: Initialize 'use_avx'. (sha512_init, sha384_init) [USE_AVX2]: Initialize 'use_avx2'. [USE_AVX] (_gcry_sha512_transform_amd64_avx): New. [USE_AVX2] (_gcry_sha512_transform_amd64_avx2): New. (transform) [USE_AVX2]: Add call for AVX2 implementation. (transform) [USE_AVX]: Add call for AVX implementation. * configure.ac (HAVE_GCC_INLINE_ASM_BMI2): New check. (sha512): Add 'sha512-avx-amd64.lo' and 'sha512-avx2-bmi2-amd64.lo'. * doc/gcrypt.texi: Document 'intel-cpu' and 'intel-bmi2'. * src/g10lib.h (HWF_INTEL_CPU, HWF_INTEL_BMI2): New. * src/hwfeatures.c (hwflist): Add "intel-cpu" and "intel-bmi2". * src/hwf-x86.c (detect_x86_gnuc): Check for HWF_INTEL_CPU and HWF_INTEL_BMI2. -- Patch adds fast AVX and AVX2 implementation of SHA-512 by Intel Corporation. The assembly source is licensed under 3-clause BSD license, thus compatible with LGPL2.1+. Original source can be accessed at: http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs Implementation is described in white paper "Fast SHA512 Implementations on Intel® Architecture Processors" http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/fast-sha512-implementat$ Note: AVX implementation uses SHLD instruction to emulate RORQ, since it's faster on Intel Sandy-Bridge. However, on non-Intel CPUs SHLD is much slower than RORQ, so therefore AVX implementation is (for now) limited to Intel CPUs. Note: AVX2 implementation also uses BMI2 instruction rorx, thus additional HWF flag. 
Benchmarks:
cpu               Old          SSSE3        AVX/AVX2    Old vs AVX/AVX2    vs SSSE3
Intel i5-4570     10.11 c/B    7.56 c/B     6.72 c/B    1.50x              1.12x
Intel i5-2450M    14.11 c/B    10.53 c/B    8.88 c/B    1.58x              1.18x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
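The note about using SHLD to emulate RORQ relies on a simple identity: a double-precision left shift whose two operands are the same register is a rotate left, and rotating left by 64-n equals rotating right by n. A minimal C model of that identity follows; the function names are hypothetical and this models the arithmetic only, not the assembly in the patch.

```c
#include <stdint.h>

/* Reference rotate right; valid for n in 1..63. */
static uint64_t sketch_ror64(uint64_t x, unsigned int n)
{
    return (x >> n) | (x << (64 - n));
}

/* Model of SHLD dst, src, n: shift dst left by n and fill the vacated
 * low bits from the high bits of src; valid for n in 1..63. */
static uint64_t sketch_shld64(uint64_t dst, uint64_t src, unsigned int n)
{
    return (dst << n) | (src >> (64 - n));
}

/* Rotate right by n expressed as SHLD with both operands equal. */
static uint64_t sketch_ror64_via_shld(uint64_t x, unsigned int n)
{
    return sketch_shld64(x, x, 64 - n);
}
```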
2013-12-13 | SHA-512: Add SSSE3 implementation for x86-64 | Jussi Kivilinna | 1 | -0/+7
* cipher/Makefile.am: Add 'sha512-ssse3-amd64.S'.
* cipher/sha512-ssse3-amd64.S: New.
* cipher/sha512.c (USE_SSSE3): New. (SHA512_CONTEXT) [USE_SSSE3]: Add 'use_ssse3'. (sha512_init, sha384_init) [USE_SSSE3]: Initialize 'use_ssse3'. [USE_SSSE3] (_gcry_sha512_transform_amd64_ssse3): New. (transform) [USE_SSSE3]: Call SSSE3 implementation.
* configure.ac (sha512): Add 'sha512-ssse3-amd64.lo'.
--
Patch adds fast SSSE3 implementation of SHA-512 by Intel Corporation. The assembly source is licensed under 3-clause BSD license, thus compatible with LGPL2.1+. Original source can be accessed at: http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs

Implementation is described in white paper "Fast SHA512 Implementations on Intel® Architecture Processors" http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/fast-sha512-implementations-ia-processors-paper.html

Benchmarks:
cpu                  Old          New          Diff
Intel i5-4570        10.11 c/B    7.56 c/B     1.33x
Intel i5-2450M       14.11 c/B    10.53 c/B    1.33x
Intel Core2 T8100    11.92 c/B    10.22 c/B    1.16x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-12-12 | SHA-256: Add SSSE3 implementation for x86-64 | Jussi Kivilinna | 1 | -0/+41
* cipher/Makefile.am: Add 'sha256-ssse3-amd64.S'.
* cipher/sha256-ssse3-amd64.S: New.
* cipher/sha256.c (USE_SSSE3): New. (SHA256_CONTEXT) [USE_SSSE3]: Add 'use_ssse3'. (sha256_init, sha224_init) [USE_SSSE3]: Initialize 'use_ssse3'. (transform): Rename to... (_transform): This. [USE_SSSE3] (_gcry_sha256_transform_amd64_ssse3): New. (transform): New.
* configure.ac (HAVE_INTEL_SYNTAX_PLATFORM_AS): New check. (sha256): Add 'sha256-ssse3-amd64.lo'.
* doc/gcrypt.texi: Document 'intel-ssse3'.
* src/g10lib.h (HWF_INTEL_SSSE3): New.
* src/hwfeatures.c (hwflist): Add "intel-ssse3".
* src/hwf-x86.c (detect_x86_gnuc): Test for SSSE3.
--
Patch adds fast SSSE3 implementation of SHA-256 by Intel Corporation. The assembly source is licensed under 3-clause BSD license, thus compatible with LGPL2.1+. Original source can be accessed at: http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs

Implementation is described in white paper "Fast SHA-256 Implementations on Intel® Architecture Processors" http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html

Benchmarks:
cpu                  Old          New          Diff
Intel i5-4570        13.99 c/B    10.66 c/B    1.31x
Intel i5-2450M       21.53 c/B    15.79 c/B    1.36x
Intel Core2 T8100    20.84 c/B    15.07 c/B    1.38x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-11-20 | Add Intel PCLMUL acceleration for GCM | Jussi Kivilinna | 1 | -0/+35
* cipher/cipher-gcm.c (fillM): Rename... (do_fillM): ...to this. (ghash): Remove. (fillM): New macro. (GHASH): Use 'do_ghash' instead of 'ghash'. [GCM_USE_INTEL_PCLMUL] (do_ghash_pclmul): New. (ghash): New. (setupM): New. (_gcry_cipher_gcm_encrypt, _gcry_cipher_gcm_decrypt) (_gcry_cipher_gcm_authenticate, _gcry_cipher_gcm_setiv) (_gcry_cipher_gcm_tag): Use 'ghash' instead of 'GHASH' and 'c->u_mode.gcm.u_tag.tag' instead of 'c->u_tag.tag'.
* cipher/cipher-internal.h (GCM_USE_INTEL_PCLMUL): New. (gcry_cipher_handle): Move 'u_tag' and 'gcm_table' under 'u_mode.gcm'.
* configure.ac (pclmulsupport, gcry_cv_gcc_inline_asm_pclmul): New.
* src/g10lib.h (HWF_INTEL_PCLMUL): New.
* src/global.c: Add "intel-pclmul".
* src/hwf-x86.c (detect_x86_gnuc): Add check for Intel PCLMUL.
--
Speed-up GCM for Intel CPUs.

Intel Haswell (x86-64):

Old:
AES
 GCM enc  |  5.17 ns/B     184.4 MiB/s    16.55 c/B
 GCM dec  |  4.38 ns/B     218.0 MiB/s    14.00 c/B
 GCM auth |  3.17 ns/B     300.4 MiB/s    10.16 c/B

New:
AES
 GCM enc  |  3.01 ns/B     317.2 MiB/s     9.62 c/B
 GCM dec  |  1.96 ns/B     486.9 MiB/s     6.27 c/B
 GCM auth |  0.848 ns/B   1124.8 MiB/s     2.71 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-11-06 | Fix __builtin_bswap32/64 checks | Jussi Kivilinna | 1 | -4/+4
* configure.ac (gcry_cv_have_builtin_bswap32) (gcry_cv_have_builtin_bswap64): Change compile checks to link checks. -- Patch changes compile checks to link checks for __builtin_bswap(32|64). Compiling obviously works with missing functions, linking not so much. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-10-28 | Add ARM NEON assembly implementation of Serpent | Jussi Kivilinna | 1 | -0/+5
* cipher/Makefile.am: Add 'serpent-armv7-neon.S'.
* cipher/serpent-armv7-neon.S: New.
* cipher/serpent.c (USE_NEON): New macro. (serpent_context_t) [USE_NEON]: Add 'use_neon'. [USE_NEON] (_gcry_serpent_neon_ctr_enc, _gcry_serpent_neon_cfb_dec) (_gcry_serpent_neon_cbc_dec): New prototypes. (serpent_setkey_internal) [USE_NEON]: Detect NEON support. (_gcry_serpent_neon_ctr_enc, _gcry_serpent_neon_cfb_dec) (_gcry_serpent_neon_cbc_dec) [USE_NEON]: Use NEON implementations to process eight blocks in parallel.
* configure.ac [neonsupport]: Add 'serpent-armv7-neon.lo'.
--
Patch adds ARM NEON optimized implementation of Serpent cipher to speed up parallelizable bulk operations.

Benchmarks on ARM Cortex-A8 (armhf, 1008 Mhz):

Old:
SERPENT128 |  nanosecs/byte   mebibytes/sec   cycles/byte
   CBC dec |     43.53 ns/B     21.91 MiB/s     43.88 c/B
   CFB dec |     44.77 ns/B     21.30 MiB/s     45.13 c/B
   CTR enc |     45.21 ns/B     21.10 MiB/s     45.57 c/B
   CTR dec |     45.21 ns/B     21.09 MiB/s     45.57 c/B

New:
SERPENT128 |  nanosecs/byte   mebibytes/sec   cycles/byte
   CBC dec |     26.26 ns/B     36.32 MiB/s     26.47 c/B
   CFB dec |     26.21 ns/B     36.38 MiB/s     26.42 c/B
   CTR enc |     26.20 ns/B     36.40 MiB/s     26.41 c/B
   CTR dec |     26.20 ns/B     36.40 MiB/s     26.41 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-10-28 | Add ARM NEON assembly implementation of Salsa20 | Jussi Kivilinna | 1 | -0/+5
* cipher/Makefile.am: Add 'salsa20-armv7-neon.S'.
* cipher/salsa20-armv7-neon.S: New.
* cipher/salsa20.c [USE_ARM_NEON_ASM]: New macro. (struct SALSA20_context_s, salsa20_core_t, salsa20_keysetup_t) (salsa20_ivsetup_t): New. (SALSA20_context_t) [USE_ARM_NEON_ASM]: Add 'use_neon'. (SALSA20_context_t): Add 'keysetup', 'ivsetup' and 'core'. (salsa20_core): Change 'src' argument to 'ctx'. [USE_ARM_NEON_ASM] (_gcry_arm_neon_salsa20_encrypt): New prototype. [USE_ARM_NEON_ASM] (salsa20_core_neon, salsa20_keysetup_neon) (salsa20_ivsetup_neon): New. (salsa20_do_setkey): Setup keysetup, ivsetup and core with default functions. (salsa20_do_setkey) [USE_ARM_NEON_ASM]: When NEON support is detected, set keysetup, ivsetup and core with ARM NEON functions. (salsa20_do_setkey): Call 'ctx->keysetup'. (salsa20_setiv): Call 'ctx->ivsetup'. (salsa20_do_encrypt_stream) [USE_ARM_NEON_ASM]: Process large buffers in ARM NEON implementation. (salsa20_do_encrypt_stream): Call 'ctx->core' instead of directly calling 'salsa20_core'. (selftest): Add test to check large buffer processing and block counter updating.
* configure.ac [neonsupport]: Add 'salsa20-armv7-neon.lo'.
--
Patch adds a fast ARM NEON assembly implementation of Salsa20. The implementation gains extra speed by processing three blocks in parallel with the help of the ARM NEON vector processing unit.

This implementation is based on public domain code by Peter Schwabe and D. J. Bernstein, and it is available in the SUPERCOP benchmarking framework. For more details on this work, see the paper "NEON crypto" by Daniel J. Bernstein and Peter Schwabe: http://cryptojedi.org/papers/#neoncrypto

Benchmark results on Cortex-A8 (1008 Mhz):

Before:
SALSA20    |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |     18.88 ns/B     50.51 MiB/s     19.03 c/B
STREAM dec |     18.89 ns/B     50.49 MiB/s     19.04 c/B
=
SALSA20R12 |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |     13.60 ns/B     70.14 MiB/s     13.71 c/B
STREAM dec |     13.60 ns/B     70.13 MiB/s     13.71 c/B

After:
SALSA20    |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |      5.48 ns/B     174.1 MiB/s      5.52 c/B
STREAM dec |      5.47 ns/B     174.2 MiB/s      5.52 c/B
=
SALSA20R12 |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |      3.65 ns/B     260.9 MiB/s      3.68 c/B
STREAM dec |      3.65 ns/B     261.6 MiB/s      3.67 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-10-28 | Add AMD64 assembly implementation of Salsa20 | Jussi Kivilinna | 1 | -0/+7
* cipher/Makefile.am: Add 'salsa20-amd64.S'.
* cipher/salsa20-amd64.S: New.
* cipher/salsa20.c (USE_AMD64): New macro. [USE_AMD64] (_gcry_salsa20_amd64_keysetup, _gcry_salsa20_amd64_ivsetup) (_gcry_salsa20_amd64_encrypt_blocks): New prototypes. [USE_AMD64] (salsa20_keysetup, salsa20_ivsetup, salsa20_core): New. [!USE_AMD64] (salsa20_core): Change 'src' to non-constant, update block counter in 'salsa20_core' and return burn stack depth. [!USE_AMD64] (salsa20_keysetup, salsa20_ivsetup): New. (salsa20_do_setkey): Move generic key setup to 'salsa20_keysetup'. (salsa20_setkey): Fix burn stack depth. (salsa20_setiv): Move generic IV setup to 'salsa20_ivsetup'. (salsa20_do_encrypt_stream) [USE_AMD64]: Process large buffers in AMD64 implementation. (salsa20_do_encrypt_stream): Move stack burning to this function... (salsa20_encrypt_stream, salsa20r12_encrypt_stream): ...from these functions.
* configure.ac [x86-64]: Add 'salsa20-amd64.lo'.
--
Patch adds a fast AMD64 assembly implementation of Salsa20. This implementation is based on public domain code by D. J. Bernstein, and it is available at http://cr.yp.to/snuffle.html (amd64-xmm6). The implementation gains extra speed by processing four blocks in parallel with the help of SSE2 instructions.

Benchmark results on Intel Core i5-4570 (3.2 Ghz):

Before:
SALSA20    |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |      3.88 ns/B     246.0 MiB/s     12.41 c/B
STREAM dec |      3.88 ns/B     246.0 MiB/s     12.41 c/B
=
SALSA20R12 |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |      2.46 ns/B     387.9 MiB/s      7.87 c/B
STREAM dec |      2.46 ns/B     387.7 MiB/s      7.87 c/B

After:
SALSA20    |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |     0.985 ns/B     967.8 MiB/s      3.15 c/B
STREAM dec |     0.987 ns/B     966.5 MiB/s      3.16 c/B
=
SALSA20R12 |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |     0.636 ns/B    1500.5 MiB/s      2.03 c/B
STREAM dec |     0.636 ns/B    1499.2 MiB/s      2.04 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-10-23 | Enable assembler optimizations on earlier ARM cores | Dmitry Eremin-Solenikov | 1 | -5/+5
* cipher/blowfish-armv6.S => cipher/blowfish-arm.S: adapt to pre-armv6 CPUs.
* cipher/blowfish.c: enable assembly on armv4/armv5 little-endian CPUs.
* cipher/camellia-armv6.S => cipher/camellia-arm.S: adapt to pre-armv6 CPUs.
* cipher/camellia.c, cipher-camellia-glue.c: enable assembly on armv4/armv5 little-endian CPUs.
* cipher/cast5-armv6.S => cipher/cast5-arm.S: adapt to pre-armv6 CPUs.
* cipher/cast5.c: enable assembly on armv4/armv5 little-endian CPUs.
* cipher/rijndael-armv6.S => cipher/rijndael-arm.S: adapt to pre-armv6 CPUs.
* cipher/rijndael.c: enable assembly on armv4/armv5 little-endian CPUs.
* cipher/twofish-armv6.S => cipher/twofish-arm.S: adapt to pre-armv6 CPUs.
* cipher/twofish.c: enable assembly on armv4/armv5 little-endian CPUs.
--
Our ARMv6 assembly optimized code can be easily adapted to earlier CPUs. The only incompatible place is rev instruction used to do byte swapping. Replace it on <= ARMv6 with a series of 4 instructions.

Compare:
                ECB/Stream       CBC              CFB              OFB              CTR
                ---------------  ---------------  ---------------  ---------------  ---------------
AES             620ms   610ms    650ms   680ms    620ms   630ms    660ms   660ms    630ms   630ms
CAMELLIA128     720ms   720ms    780ms   790ms    770ms   760ms    780ms   780ms    770ms   760ms
CAMELLIA256     910ms   910ms    970ms   970ms    960ms   950ms    970ms   970ms    960ms   950ms
CAST5           820ms   820ms    930ms   920ms    890ms   860ms    930ms   920ms    880ms   890ms
BLOWFISH        550ms   560ms    650ms   660ms    630ms   600ms    660ms   650ms    610ms   620ms

                ECB/Stream       CBC              CFB              OFB              CTR
                ---------------  ---------------  ---------------  ---------------  ---------------
AES             130ms   140ms    180ms   200ms    160ms   170ms    190ms   200ms    170ms   170ms
CAMELLIA128     150ms   160ms    210ms   220ms    200ms   190ms    210ms   220ms    190ms   190ms
CAMELLIA256     180ms   180ms    260ms   240ms    240ms   230ms    250ms   250ms    230ms   230ms
CAST5           170ms   160ms    270ms   120ms    240ms   130ms    260ms   270ms    130ms   120ms
BLOWFISH        160ms   150ms    260ms   110ms    230ms   120ms    250ms   260ms    110ms   120ms

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
[ jk: in camellia.h and twofish.c, USE_ARMV6_ASM => USE_ARM_ASM ]
[ jk: fix blowfish-arm.S when __ARM_FEATURE_UNALIGNED defined ]
[ jk: in twofish.S remove defined(HAVE_ARM_ARCH_V6) ]
[ jk: ARMv6 => ARM in comments ]
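For readers unfamiliar with byte swapping without the rev instruction, the following C model mirrors one well-known four-operation ARM idiom (eor with ror #16, bic, ror #8, eor with lsr #8). Whether the patch uses exactly this sequence is an assumption; the function names are hypothetical.

```c
#include <stdint.h>

/* Rotate right; valid for n in 1..31. */
static uint32_t sketch_ror32(uint32_t x, unsigned int n)
{
    return (x >> n) | (x << (32 - n));
}

/* Byte swap in four data-processing steps, each of which maps to a single
 * ARM instruction with a shifted operand on pre-ARMv6 cores. */
static uint32_t sketch_bswap32_no_rev(uint32_t x)
{
    uint32_t t;

    t = x ^ sketch_ror32(x, 16);  /* eor t, x, x, ror #16     */
    t &= ~0x00ff0000u;            /* bic t, t, #0x00ff0000    */
    x = sketch_ror32(x, 8);       /* mov x, x, ror #8         */
    x ^= t >> 8;                  /* eor x, x, t, lsr #8      */
    return x;
}
```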
2013-10-23 | Correct ASM assembly test in configure.ac | Dmitry Eremin-Solenikov | 1 | -3/+2
* configure.ac: correct HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS test to require neither ARMv6, nor thumb mode. Our assembly code works perfectly even on ARMv4 now. Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
2013-10-23 | ecc: Refactor ecc.c | Werner Koch | 1 | -1/+2
* cipher/ecc-ecdsa.c, cipher/ecc-eddsa.c, cipher/ecc-gost.c: New.
* cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add new files.
* configure.ac (GCRYPT_PUBKEY_CIPHERS): Add new files.
* cipher/ecc.c (point_init, point_free): Move to ecc-common.h. (sign_ecdsa): Move to ecc-ecdsa.c as _gcry_ecc_ecdsa_sign. (verify_ecdsa): Move to ecc-ecdsa.c as _gcry_ecc_ecdsa_verify. (sign_gost): Move to ecc-gost.c as _gcry_ecc_gost_sign. (verify_gost): Move to ecc-gost.c as _gcry_ecc_gost_verify. (sign_eddsa): Move to ecc-eddsa.c as _gcry_ecc_eddsa_sign. (verify_eddsa): Move to ecc-eddsa.c as _gcry_ecc_eddsa_verify. (eddsa_generate_key): Move to ecc-eddsa.c as _gcry_ecc_eddsa_genkey. (reverse_buffer): Move to ecc-eddsa.c. (eddsa_encodempi, eddsa_encode_x_y): Ditto. (_gcry_ecc_eddsa_encodepoint, _gcry_ecc_eddsa_decodepoint): Ditto.
--
This change should make it easier to add new ECC algorithms.

Signed-off-by: Werner Koch <wk@gnupg.org>
2013-10-22 | twofish: add ARMv6 assembly implementation | Jussi Kivilinna | 1 | -0/+4
* cipher/Makefile.am: Add 'twofish-armv6.S'.
* cipher/twofish-armv6.S: New.
* cipher/twofish.c (USE_ARMV6_ASM): New macro. [USE_ARMV6_ASM] (_gcry_twofish_armv6_encrypt_block) (_gcry_twofish_armv6_decrypt_block): New prototypes. [USE_ARMV6_ASM] (twofish_encrypt, twofish_decrypt): Add. [USE_AMD64_ASM] (do_twofish_encrypt, do_twofish_decrypt): Remove. (_gcry_twofish_ctr_enc, _gcry_twofish_cfb_dec): Use 'twofish_encrypt' instead of 'do_twofish_encrypt'. (_gcry_twofish_cbc_dec): Use 'twofish_decrypt' instead of 'do_twofish_decrypt'.
* configure.ac [arm]: Add 'twofish-armv6.lo'.
--
Add optimized ARMv6 assembly implementation for Twofish. The implementation is tuned for Cortex-A8. Unaligned access handling is done in the assembly part.

For now, only enable this on little-endian systems, as big-endian correctness has not been tested yet.

Old (gcc-4.8) vs new (twofish-asm), Cortex-A8 (on armhf):
            ECB/Stream       CBC              CFB              OFB              CTR              CCM
            ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
TWOFISH     1.23x   1.25x    1.16x   1.26x    1.16x   1.30x    1.18x   1.17x    1.23x   1.23x    1.22x   1.22x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-10-22 | serpent-amd64: do not use GAS macros | Jussi Kivilinna | 1 | -6/+1
* cipher/serpent-avx2-amd64.S: Remove use of GAS macros. * cipher/serpent-sse2-amd64.S: Ditto. * configure.ac [HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS]: Do not check for GAS macros. -- This way we have better portability; for example, when compiling with clang on x86-64, the assembly implementations are now enabled and working. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-10-10 | Prevent tail call optimization with _gcry_burn_stack | Jussi Kivilinna | 1 | -1/+27
* configure.ac: New check, HAVE_GCC_ASM_VOLATILE_MEMORY.
* src/g10lib.h (_gcry_burn_stack): Rename to __gcry_burn_stack. (__gcry_burn_stack_dummy): New. (_gcry_burn_stack): New macro.
* src/misc.c (_gcry_burn_stack): Rename to __gcry_burn_stack. (__gcry_burn_stack_dummy): New.
--
Tail call optimization can turn a _gcry_burn_stack call into a tail jump. When this happens, the stack pointer is restored to the initial state of the current function. This causes a problem for _gcry_burn_stack because its callers do not count in the current function's stack depth.

One solution is to prevent _gcry_burn_stack from being tail-call optimized by inserting a dummy function call behind it. Another would be to add a memory barrier 'asm volatile("":::"memory")' behind every _gcry_burn_stack call. This, however, requires GCC asm support from the compiler.

The patch adds detection for memory barrier support and, when it is available, uses a memory barrier to prevent tail call optimization. If it is not available, a dummy function call is used instead.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
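A minimal sketch of the two approaches described in this entry follows. It assumes a GCC-compatible compiler for the barrier branch; the SKETCH_/sketch_ names are hypothetical and the burn function is a simplified stand-in, not the code in src/misc.c.

```c
#include <stddef.h>

static unsigned int sketch_burn_calls;

/* Simplified stand-in for the real stack-burning function. */
static void sketch_burn_stack(unsigned int bytes)
{
    volatile char buf[64];
    size_t i;

    (void)bytes;
    for (i = 0; i < sizeof buf; i++)
        buf[i] = 0;
    sketch_burn_calls++;
}

#if defined(__GNUC__)
/* With GCC-style asm: the empty asm with a "memory" clobber follows the
 * call, so the call is no longer the last action of the caller and cannot
 * be turned into a tail jump. */
#define SKETCH_BURN_STACK(bytes)                     \
    do {                                             \
        sketch_burn_stack(bytes);                    \
        __asm__ __volatile__("" : : : "memory");     \
    } while (0)
#else
/* Fallback: a dummy call placed after the burn call. In practice the dummy
 * would need to live where the compiler cannot inline it away. */
static void sketch_burn_dummy(void) { }
#define SKETCH_BURN_STACK(bytes)                     \
    do {                                             \
        sketch_burn_stack(bytes);                    \
        sketch_burn_dummy();                         \
    } while (0)
#endif

/* A caller that ends with a stack burn, as cipher routines typically do. */
static void sketch_finalize(void)
{
    /* ... work that leaves sensitive data on the stack ... */
    SKETCH_BURN_STACK(128);
}
```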
2013-09-21 | Optimize and cleanup 32-bit and 64-bit endianess transforms | Jussi Kivilinna | 1 | -0/+30
* cipher/bithelp.h (bswap32, bswap64, le_bswap32, be_bswap32) (le_bswap64, be_bswap64): New.
* cipher/bufhelp.h (buf_get_be32, buf_get_le32, buf_put_le32) (buf_put_be32, buf_get_be64, buf_get_le64, buf_put_be64) (buf_put_le64): New.
* cipher/blowfish.c (do_encrypt_block, do_decrypt_block): Use new endian conversion helpers. (do_bf_setkey): Turn endian specific code to generic.
* cipher/camellia.c (GETU32, PUTU32): Use new endian conversion helpers.
* cipher/cast5.c (rol): Remove, use rol from bithelp. (F1, F2, F3): Fix to use rol from bithelp. (do_encrypt_block, do_decrypt_block, do_cast_setkey): Use new endian conversion helpers.
* cipher/des.c (READ_64BIT_DATA, WRITE_64BIT_DATA): Ditto.
* cipher/md4.c (transform, md4_final): Ditto.
* cipher/md5.c (transform, md5_final): Ditto.
* cipher/rmd160.c (transform, rmd160_final): Ditto.
* cipher/salsa20.c (LE_SWAP32, LE_READ_UINT32): Ditto.
* cipher/scrypt.c (READ_UINT64, LE_READ_UINT64, LE_SWAP32): Ditto.
* cipher/seed.c (GETU32, PUTU32): Ditto.
* cipher/serpent.c (byte_swap_32): Remove. (serpent_key_prepare, serpent_encrypt_internal) (serpent_decrypt_internal): Use new endian conversion helpers.
* cipher/sha1.c (transform, sha1_final): Ditto.
* cipher/sha256.c (transform, sha256_final): Ditto.
* cipher/sha512.c (__transform, sha512_final): Ditto.
* cipher/stribog.c (transform, stribog_final): Ditto.
* cipher/tiger.c (transform, tiger_final): Ditto.
* cipher/twofish.c (INPACK, OUTUNPACK): Ditto.
* cipher/whirlpool.c (buffer_to_block, block_to_buffer): Ditto.
* configure.ac (gcry_cv_have_builtin_bswap32): Check for compiler provided __builtin_bswap32. (gcry_cv_have_builtin_bswap64): Check for compiler provided __builtin_bswap64.
--
Patch adds helper functions that provide conversions between integers and buffers of different endianness. The benefits are code cleanup and optimization for architectures that have byte-swapping instructions and/or can do fast unaligned memory accesses.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
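To illustrate the kind of helper this entry introduces, here is a portable sketch of a 32-bit byte swap plus big- and little-endian buffer accessors. The sketch_ names are hypothetical, the HAVE_BUILTIN_BSWAP32 macro name is an assumption about what configure defines, and the byte-wise accessors are a generic formulation, not the bufhelp.h code (which may also take fast paths for unaligned loads).

```c
#include <stdint.h>

/* Swap the four bytes of a 32-bit word. */
static uint32_t sketch_bswap32(uint32_t x)
{
#ifdef HAVE_BUILTIN_BSWAP32 /* assumed name of the configure result */
    return __builtin_bswap32(x);
#else
    return (x << 24) | ((x & 0x0000ff00u) << 8)
         | ((x >> 8) & 0x0000ff00u) | (x >> 24);
#endif
}

/* Read a big-endian 32-bit value from a possibly unaligned buffer. */
static uint32_t sketch_get_be32(const void *buf)
{
    const unsigned char *p = (const unsigned char *)buf;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a little-endian 32-bit value from a possibly unaligned buffer. */
static uint32_t sketch_get_le32(const void *buf)
{
    const unsigned char *p = (const unsigned char *)buf;
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}

/* Store a 32-bit value in little-endian byte order. */
static void sketch_put_le32(void *buf, uint32_t v)
{
    unsigned char *p = (unsigned char *)buf;
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}
```

Byte-wise access keeps the sketch independent of host endianness and alignment rules.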
2013-09-18 | Add GOST R 34.11-2012 implementation (Stribog) | Dmitry Eremin-Solenikov | 1 | -2/+8
* src/gcrypt.h.in (GCRY_MD_GOSTR3411_12_256) (GCRY_MD_GOSTR3411_12_512): New. * cipher/stribog.c: New. * configure.ac (available_digests_64): Add stribog. * src/cipher.h: Declare Stribog declarations. * cipher/md.c: Register Stribog digest. * tests/basic.c (check_digests) Add 4 testcases for Stribog from standard. * doc/gcrypt.texi: Document new constants. Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
2013-09-18 | Add basic implementation of GOST R 34.11-94 message digest | Dmitry Eremin-Solenikov | 1 | -1/+11
* src/gcrypt.h.in (GCRY_MD_GOSTR3411_94): New.
* cipher/gostr3411-94.c: New.
* configure.ac (available_digests): Add gostr3411-94.
* src/cipher.h: Add gostr3411-94 definitions.
* cipher/md.c: Register GOST R 34.11-94.
* tests/basic.c (check_digests): Add 4 tests for GOST R 34.11-94 hash algo. Two are defined in the standard itself; the other two are more or less common tests: an empty string and an exclamation mark.
* doc/gcrypt.texi: Add an entry describing GOST R 34.11-94 to the MD algorithms table.
--
Add simple implementation of GOST R 34.11-94 hash function. Currently there is no way to specify hash parameters (it always uses GOST R 34.11-94 test parameters).

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Stack burn value in gost3411_init added by wk.
2013-09-18 | Add limited implementation of GOST 28147-89 cipher | Dmitry Eremin-Solenikov | 1 | -1/+7
* src/gcrypt.h.in (GCRY_CIPHER_GOST28147): New. * cipher/gost.h, cipher/gost28147.c: New. * configure.ac (available_ciphers): Add gost28147. * src/cipher.h: Add gost28147 definitions. * cipher/cipher.c: Register gost28147. * tests/basic.c (check_ciphers): Enable simple test for gost28147. * doc/gcrypt.texi: document GCRY_CIPHER_GOST28147. -- Add a very basic implementation of GOST 28147-89 cipher: from modes defined in standard only ECB and CFB are supported, sbox is limited to the "test variant" as provided in GOST 34.11-94. Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
2013-09-07 | Add configure option --disable-amd64-as-feature-detection. | Werner Koch | 1 | -4/+14
* configure.ac: Implement new disable flag.
--
Doing a static build of Libgcrypt currently throws an 'as' error on my box. Adding this configure option as a workaround.

Signed-off-by: Werner Koch <wk@gnupg.org>
2013-09-04 | Make _gcry_burn_stack use variable length array | Jussi Kivilinna | 1 | -0/+18
* configure.ac (HAVE_VLA): Add check.
* src/misc.c (_gcry_burn_stack) [HAVE_VLA]: Add VLA code.
--
Some gcc versions convert _gcry_burn_stack into a loop that overwrites the same 64-byte stack buffer instead of burning the stack deeper. It is argued in the GCC bugzilla that _gcry_burn_stack is doing the wrong thing here [1] and that this kind of optimization is allowed.

So let's fix _gcry_burn_stack by using a variable length array when VLAs are supported by the compiler. This should ensure proper stack burning to the requested depth and avoid GCC loop optimizations.

[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52285

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
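The idea of burning the stack with a variable length array can be sketched as follows. This assumes a C99 compiler with VLA support; the function name, the rounding to 64-byte multiples, and the return value (added only so the size can be checked) are illustrative choices, not the actual src/misc.c code.

```c
/* Wipe roughly `bytes` bytes of stack below the caller by allocating a
 * VLA of that size and overwriting it. Returns the size actually wiped. */
static unsigned int sketch_burn_stack_vla(unsigned int bytes)
{
    /* Round up to a multiple of 64, using at least one 64-byte chunk. */
    unsigned int buflen = ((bytes ? bytes : 1u) + 63u) & ~63u;
    volatile char buf[buflen]; /* volatile: the stores must not be elided */
    unsigned int i;

    for (i = 0; i < buflen; i++)
        buf[i] = 0;

    return buflen;
}
```

Because the whole requested depth is covered by one array, there is no recursion or repeated small buffer for the optimizer to collapse.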
2013-08-31 | sha512: add ARM/NEON assembly version of transform function | Jussi Kivilinna | 1 | -0/+5
* cipher/Makefile.am: Add 'sha512-armv7-neon.S'.
* cipher/sha512-armv7-neon.S: New file.
* cipher/sha512.c (USE_ARM_NEON_ASM): New macro. (SHA512_CONTEXT) [USE_ARM_NEON_ASM]: Add 'use_neon'. (sha512_init, sha384_init) [USE_ARM_NEON_ASM]: Enable 'use_neon' if CPU supports NEON instructions. (k): Round constant array moved outside of 'transform' function. (__transform): Renamed from 'transform' function. [USE_ARM_NEON_ASM] (_gcry_sha512_transform_armv7_neon): New prototype. (transform): New wrapper function for different transform versions. (sha512_write, sha512_final): Burn stack by the amount returned by transform function.
* configure.ac (sha512) [neonsupport]: Add 'sha512-armv7-neon.lo'.
--
Add NEON assembly for transform function for faster SHA512 on ARM. Major speed up thanks to 64-bit integer registers and large register file that can hold full input buffer.

Benchmark results on Cortex-A8, 1Ghz:

Old:
$ tests/benchmark --hash-repetitions 100 md sha512 sha384
SHA512     17050ms  18780ms  29120ms  18040ms  17190ms
SHA384     17130ms  18720ms  29160ms  18090ms  17280ms

New:
$ tests/benchmark --hash-repetitions 100 md sha512 sha384
SHA512      3600ms   5070ms  15330ms   4510ms   3480ms
SHA384      3590ms   5060ms  15350ms   4510ms   3520ms

New vs old:
SHA512       4.74x    3.70x    1.90x    4.00x    4.94x
SHA384       4.77x    3.70x    1.90x    4.01x    4.91x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-08-31 | Add ARM HW feature detection module and add NEON detection | Jussi Kivilinna | 1 | -0/+43
* configure.ac: Add option --disable-neon-support. (HAVE_GCC_INLINE_ASM_NEON): New. (ENABLE_NEON_SUPPORT): New. [arm]: Add 'hwf-arm.lo' as HW feature module.
* src/Makefile.am: Add 'hwf-arm.c'.
* src/g10lib.h (HWF_ARM_NEON): New macro.
* src/global.c (hwflist): Add HWF_ARM_NEON entry.
* src/hwf-arm.c: New file.
* src/hwf-common.h (_gcry_hwf_detect_arm): New prototype.
* src/hwfeatures.c (_gcry_detect_hw_features) [HAVE_CPU_ARCH_ARM]: Add call to _gcry_hwf_detect_arm.
--
Add HW detection module for detecting the ARM NEON instruction set. ARM does not have a cpuid instruction, so we have to rely on the OS to pass feature set information to user-space. On Linux, NEON support can be detected by parsing '/proc/self/auxv' for hardware capabilities information. On other OSes, NEON can be detected by checking whether the platform/compiler only supports NEON-capable CPUs (by checking whether the __ARM_NEON__ macro is defined).

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
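The '/proc/self/auxv' approach can be sketched as below. The auxiliary vector is a sequence of (type, value) pairs of native words; AT_HWCAP is type 16, and on 32-bit ARM Linux the NEON capability is bit 12 of its value. The sketch_/SKETCH_ names are hypothetical, the 64-entry read limit is an arbitrary choice, and the result of sketch_detect_neon is only meaningful on 32-bit ARM Linux; this is not the hwf-arm.c code.

```c
#include <stddef.h>
#include <stdio.h>

#define SKETCH_AT_NULL     0UL          /* end-of-vector marker */
#define SKETCH_AT_HWCAP    16UL         /* hardware capability bitmask */
#define SKETCH_HWCAP_NEON  (1UL << 12)  /* NEON bit on 32-bit ARM Linux */

/* Scan (type, value) pairs and return the AT_HWCAP value, or 0 if absent. */
static unsigned long sketch_find_hwcap(const unsigned long *auxv,
                                       size_t n_words)
{
    size_t i;

    for (i = 0; i + 1 < n_words; i += 2) {
        if (auxv[i] == SKETCH_AT_NULL)
            break;
        if (auxv[i] == SKETCH_AT_HWCAP)
            return auxv[i + 1];
    }
    return 0;
}

/* Read the process's auxiliary vector and test the NEON bit.
 * Returns 0 when the file cannot be read or the bit is not set. */
static int sketch_detect_neon(void)
{
    unsigned long words[128]; /* room for 64 entries; arbitrary limit */
    size_t n;
    FILE *f = fopen("/proc/self/auxv", "rb");

    if (!f)
        return 0;
    n = fread(words, sizeof words[0], sizeof words / sizeof words[0], f);
    fclose(f);

    return (sketch_find_hwcap(words, n) & SKETCH_HWCAP_NEON) != 0;
}
```

Separating the parser from the file read lets the parsing logic be exercised with synthetic data on any host.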
2013-08-30 | Refactor the ECC code into 3 files. | Werner Koch | 1 | -1/+2
* cipher/ecc-common.h, cipher/ecc-curves.c, cipher/ecc-misc.c: New.
* cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add new files.
* configure.ac (GCRYPT_PUBKEY_CIPHERS): Add new .c files.
* cipher/ecc.c (curve_aliases, ecc_domain_parms_t, domain_parms) (scanval): Move to ecc-curves.c. (fill_in_curve): Move to ecc-curves.c as _gcry_ecc_fill_in_curve. (ecc_get_curve): Move to ecc-curves.c as _gcry_ecc_get_curve. (_gcry_mpi_ec_ec2os): Move to ecc-misc.c. (ec2os): Move to ecc-misc.c as _gcry_ecc_ec2os. (os2ec): Move to ecc-misc.c as _gcry_ecc_os2ec. (point_set): Move as inline function to ecc-common.h. (_gcry_ecc_curve_free): Move to ecc-misc.c as _gcry_ecc_curve_free. (_gcry_ecc_curve_copy): Move to ecc-misc.c as _gcry_ecc_curve_copy. (mpi_from_keyparam, point_from_keyparam): Move to ecc-curves.c. (_gcry_mpi_ec_new): Move to ecc-curves.c. (ecc_get_param): Move to ecc-curves.c as _gcry_ecc_get_param. (ecc_get_param_sexp): Move to ecc-curves.c as _gcry_ecc_get_param_sexp.

Signed-off-by: Werner Koch <wk@gnupg.org>
2013-08-20 | Move ARMv6 detection to configure.ac | Jussi Kivilinna | 1 | -0/+23
* cipher/blowfish-armv6.S: Replace __ARM_ARCH >= 6 checks with HAVE_ARM_ARCH_V6. * cipher/blowfish.c: Ditto. * cipher/camellia-armv6.S: Ditto. * cipher/camellia.h: Ditto. * cipher/cast5-armv6.S: Ditto. * cipher/cast5.c: Ditto. * cipher/rijndael-armv6.S: Ditto. * cipher/rijndael.c: Ditto. * configure.ac: Add HAVE_ARM_ARCH_V6 check. -- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-08-16 | camellia: add ARMv6 assembly implementation | Jussi Kivilinna | 1 | -0/+7
* cipher/Makefile.am: Add 'camellia-armv6.S'.
* cipher/camellia-armv6.S: New file.
* cipher/camellia-glue.c [USE_ARMV6_ASM] (_gcry_camellia_armv6_encrypt_block) (_gcry_camellia_armv6_decrypt_block): New prototypes. [USE_ARMV6_ASM] (Camellia_EncryptBlock, Camellia_DecryptBlock) (camellia_encrypt, camellia_decrypt): New functions.
* cipher/camellia.c [!USE_ARMV6_ASM]: Compile encryption and decryption routines if USE_ARMV6_ASM macro is _not_ defined.
* cipher/camellia.h (USE_ARMV6_ASM): New macro. [!USE_ARMV6_ASM] (Camellia_EncryptBlock, Camellia_DecryptBlock): If USE_ARMV6_ASM is defined, disable these function prototypes. (camellia) [arm]: Add 'camellia-armv6.lo'.
--
Add optimized ARMv6 assembly implementation for Camellia. The implementation is tuned for Cortex-A8. Unaligned access handling is done in the assembly part.

For now, only enable this on little-endian systems, as big-endian correctness has not been tested yet.

Old vs new. Cortex-A8 (on Debian Wheezy/armhf):
                ECB/Stream       CBC              CFB              OFB              CTR
                ---------------  ---------------  ---------------  ---------------  ---------------
CAMELLIA128     1.44x   1.47x    1.35x   1.34x    1.43x   1.39x    1.38x   1.36x    1.38x   1.39x
CAMELLIA192     1.60x   1.62x    1.52x   1.47x    1.56x   1.54x    1.52x   1.53x    1.52x   1.53x
CAMELLIA256     1.59x   1.60x    1.49x   1.47x    1.53x   1.54x    1.51x   1.50x    1.52x   1.53x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-08-16 | blowfish: add ARMv6 assembly implementation | Jussi Kivilinna | 1 | -0/+4
* cipher/Makefile.am: Add 'blowfish-armv6.S'.
* cipher/blowfish-armv6.S: New file.
* cipher/blowfish.c (USE_ARMV6_ASM): New macro. [USE_ARMV6_ASM] (_gcry_blowfish_armv6_do_encrypt) (_gcry_blowfish_armv6_encrypt_block) (_gcry_blowfish_armv6_decrypt_block, _gcry_blowfish_armv6_ctr_enc) (_gcry_blowfish_armv6_cbc_dec, _gcry_blowfish_armv6_cfb_dec): New prototypes. [USE_ARMV6_ASM] (do_encrypt, do_encrypt_block, do_decrypt_block) (encrypt_block, decrypt_block): New functions. (_gcry_blowfish_ctr_enc) [USE_ARMV6_ASM]: Use ARMv6 assembly function. (_gcry_blowfish_cbc_dec) [USE_ARMV6_ASM]: Use ARMv6 assembly function. (_gcry_blowfish_cfb_dec) [USE_ARMV6_ASM]: Use ARMv6 assembly function.
* configure.ac (blowfish) [arm]: Add 'blowfish-armv6.lo'.
--
Patch provides non-parallel implementations for a small speed-up and 2-way parallel implementations that get accelerated on multi-issue CPUs (hand-tuned for in-order dual-issue Cortex-A8). Unaligned access handling is done in assembly.

For now, only enable this on little-endian systems, as big-endian correctness has not been tested yet.

Old vs new (Cortex-A8, Debian Wheezy/armhf):
            ECB/Stream       CBC              CFB              OFB              CTR
            ---------------  ---------------  ---------------  ---------------  ---------------
BLOWFISH    1.28x   1.16x    1.21x   2.16x    1.26x   1.86x    1.21x   1.25x    1.89x   1.96x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-08-16 | cast5: add ARMv6 assembly implementation | Jussi Kivilinna | 1 | -0/+4
* cipher/Makefile.am: Add 'cast5-armv6.S'.
* cipher/cast5-armv6.S: New file.
* cipher/cast5.c (USE_ARMV6_ASM): New macro. (CAST5_context) [USE_ARMV6_ASM]: New members 'Kr_arm_enc' and 'Kr_arm_dec'. [USE_ARMV6_ASM] (_gcry_cast5_armv6_encrypt_block) (_gcry_cast5_armv6_decrypt_block, _gcry_cast5_armv6_ctr_enc) (_gcry_cast5_armv6_cbc_dec, _gcry_cast5_armv6_cfb_dec): New prototypes. [USE_ARMV6_ASM] (do_encrypt_block, do_decrypt_block, encrypt_block) (decrypt_block): New functions. (_gcry_cast5_ctr_enc) [USE_ARMV6_ASM]: Use ARMv6 assembly function. (_gcry_cast5_cbc_dec) [USE_ARMV6_ASM]: Use ARMv6 assembly function. (_gcry_cast5_cfb_dec) [USE_ARMV6_ASM]: Use ARMv6 assembly function. (do_cast_setkey) [USE_ARMV6_ASM]: Initialize 'Kr_arm_enc' and 'Kr_arm_dec'.
* configure.ac (cast5) [arm]: Add 'cast5-armv6.lo'.
--
Provides non-parallel implementations for a small speed-up and 2-way parallel implementations that get accelerated on multi-issue CPUs (hand-tuned for in-order dual-issue Cortex-A8). Unaligned access handling is done in assembly.

For now, only enable this on little-endian systems, as big-endian correctness has not been tested yet.

Old vs new (Cortex-A8, Debian Wheezy/armhf):
            ECB/Stream       CBC              CFB              OFB              CTR
            ---------------  ---------------  ---------------  ---------------  ---------------
CAST5       1.15x   1.12x    1.12x   2.07x    1.14x   1.60x    1.12x   1.13x    1.62x   1.63x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-08-14  rijndael: add ARMv6 assembly implementation  Jussi Kivilinna  1  -0/+32
* cipher/Makefile.am: Add 'rijndael-armv6.S'.
* cipher/rijndael-armv6.S: New file.
* cipher/rijndael.c (USE_ARMV6_ASM): New macro.
[USE_ARMV6_ASM] (_gcry_aes_armv6_encrypt_block) (_gcry_aes_armv6_decrypt_block): New prototypes.
(do_encrypt_aligned) [USE_ARMV6_ASM]: Use ARMv6 assembly function.
(do_encrypt): Disable input/output alignment when USE_ARMV6_ASM.
(do_decrypt_aligned) [USE_ARMV6_ASM]: Use ARMv6 assembly function.
(do_decrypt): Disable input/output alignment when USE_ARMV6_ASM.
* configure.ac (HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS): New check for gcc/as compatibility with ARM assembly implementations.
(aes) [arm]: Add 'rijndael-armv6.lo'.
--
Add optimized ARMv6 assembly implementation for AES. The implementation is tuned for Cortex-A8. Unaligned access handling is done in the assembly part.

For now, only enable this on little-endian systems, as big-endian correctness has not been tested yet.

Old vs new (Cortex-A8, Debian Wheezy/armhf):

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            2.61x   3.12x   2.16x   2.59x   2.26x   2.25x   2.08x   2.08x   2.23x   2.23x
AES192         2.60x   3.06x   2.18x   2.65x   2.29x   2.29x   2.12x   2.12x   2.25x   2.27x
AES256         2.62x   3.09x   2.24x   2.72x   2.30x   2.34x   2.17x   2.19x   2.32x   2.32x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-07-18  Add support for Salsa20.  Werner Koch  1  -1/+7
* src/gcrypt.h.in (GCRY_CIPHER_SALSA20): New.
* cipher/salsa20.c: New.
* configure.ac (available_ciphers): Add Salsa20.
* cipher/cipher.c: Register Salsa20.
(cipher_setiv): Allow to divert an IV to a cipher module.
* src/cipher-proto.h (cipher_setiv_func_t): New.
(cipher_extra_spec): Add field setiv.
* src/cipher.h: Declare Salsa20 definitions.
* tests/basic.c (check_stream_cipher): New.
(check_stream_cipher_large_block): New.
(check_cipher_modes): Run new test functions.
(check_ciphers): Add simple test for Salsa20.

Signed-off-by: Werner Koch <wk@gnupg.org>
2013-06-26  Make gpg-error replacement defines more robust.  Werner Koch  1  -9/+2
* configure.ac (AH_BOTTOM): Move GPG_ERR_ replacement defines to ...
* src/gcrypt-int.h: new file.
* src/visibility.h, src/cipher.h: Replace gcrypt.h by gcrypt-int.h.
* tests/: Ditto for all test files.
--
Defining newer gpg-error codes in config.h was not a good idea, because config.h is usually included before gpg-error.h. The replacement defines would thus also be applied inside gpg-error.h, leading to faulty code there such as

typedef enum { [...] 191 = 191, [...] };
2013-06-20  Check if assembler is compatible with AMD64 assembly implementations  Jussi Kivilinna  1  -0/+30
* cipher/blowfish-amd64.S: Enable only if HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS is defined.
* cipher/camellia-aesni-avx-amd64.S: Ditto.
* cipher/camellia-aesni-avx2-amd64.S: Ditto.
* cipher/cast5-amd64.S: Ditto.
* cipher/rijndael-amd64.S: Ditto.
* cipher/serpent-avx2-amd64.S: Ditto.
* cipher/serpent-sse2-amd64.S: Ditto.
* cipher/twofish-amd64.S: Ditto.
* cipher/blowfish.c: Use AMD64 assembly implementation only if HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS is defined.
* cipher/camellia-glue.c: Ditto.
* cipher/cast5.c: Ditto.
* cipher/rijndael.c: Ditto.
* cipher/serpent.c: Ditto.
* cipher/twofish.c: Ditto.
* configure.ac: Check gcc/as compatibility with AMD64 assembly implementations.
--
Later these checks can be split and the assembly implementations adapted to handle different platforms, but for now disable the AMD64 assembly implementations if the assembler does not appear to be able to handle them.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-06-09  Add Camellia AES-NI/AVX2 implementation  Jussi Kivilinna  1  -0/+7
* cipher/Makefile.am: Add 'camellia-aesni-avx2-amd64.S'.
* cipher/camellia-aesni-avx2-amd64.S: New file.
* cipher/camellia-glue.c (USE_AESNI_AVX2): New macro.
(CAMELLIA_context) [USE_AESNI_AVX2]: Add 'use_aesni_avx2'.
[USE_AESNI_AVX2] (_gcry_camellia_aesni_avx2_ctr_enc) (_gcry_camellia_aesni_avx2_cbc_dec) (_gcry_camellia_aesni_avx2_cfb_dec): New prototypes.
(camellia_setkey) [USE_AESNI_AVX2]: Check AVX2+AES-NI capable hardware and set 'ctx->use_aesni_avx2'.
(_gcry_camellia_ctr_enc) [USE_AESNI_AVX2]: Add AVX2 accelerated code.
(_gcry_camellia_cbc_dec) [USE_AESNI_AVX2]: Add AVX2 accelerated code.
(_gcry_camellia_cfb_dec) [USE_AESNI_AVX2]: Add AVX2 accelerated code.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Grow 'nblocks' so that AVX2 codepaths get tested.
* configure.ac (camellia) [avx2support, aesnisupport]: Add 'camellia-aesni-avx2-amd64.lo'.
--
Add new AVX2/AES-NI implementation of Camellia that processes 32 blocks in parallel.

Speed old (AVX/AES-NI) vs. new (AVX2/AES-NI) on Intel Core i5-4570:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
CAMELLIA128    1.00x   0.99x   1.00x   1.53x   1.00x   1.49x   1.00x   1.00x   1.54x   1.54x
CAMELLIA256    0.99x   1.00x   1.00x   1.50x   1.00x   1.50x   1.00x   1.00x   1.54x   1.52x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-06-09  Add Serpent AVX2 implementation  Jussi Kivilinna  1  -0/+5
* cipher/Makefile.am: Add 'serpent-avx2-amd64.S'.
* cipher/serpent-avx2-amd64.S: New file.
* cipher/serpent.c (USE_AVX2): New macro.
(serpent_context_t) [USE_AVX2]: Add 'use_avx2'.
[USE_AVX2] (_gcry_serpent_avx2_ctr_enc, _gcry_serpent_avx2_cbc_dec) (_gcry_serpent_avx2_cfb_dec): New prototypes.
(serpent_setkey_internal) [USE_AVX2]: Check for AVX2 capable hardware and set 'use_avx2'.
(_gcry_serpent_ctr_enc) [USE_AVX2]: Use AVX2 accelerated functions.
(_gcry_serpent_cbc_dec) [USE_AVX2]: Use AVX2 accelerated functions.
(_gcry_serpent_cfb_dec) [USE_AVX2]: Use AVX2 accelerated functions.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Grow 'nblocks' so that AVX2 codepaths are tested.
* configure.ac (serpent) [avx2support]: Add 'serpent-avx2-amd64.lo'.
--
Add new AVX2 implementation of Serpent that processes 16 blocks in parallel.

Speed old (SSE2) vs. new (AVX2) on Intel Core i5-4570:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
SERPENT128     1.00x   1.00x   1.00x   2.10x   1.00x   2.16x   1.01x   1.00x   2.16x   2.18x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-06-09  Add detection for Intel AVX2 instruction set  Jussi Kivilinna  1  -0/+35
* configure.ac: Add option --disable-avx2-support.
(HAVE_GCC_INLINE_ASM_AVX2): New.
(ENABLE_AVX2_SUPPORT): New.
* src/g10lib.h (HWF_INTEL_AVX2): New.
* src/global.c (hwflist): Add HWF_INTEL_AVX2.
* src/hwf-x86.c [__i386__] (get_cpuid): Initialize registers to zero before cpuid.
[__x86_64__] (get_cpuid): Initialize registers to zero before cpuid.
(detect_x86_gnuc): Store maximum cpuid level.
(detect_x86_gnuc) [ENABLE_AVX2_SUPPORT]: Add detection for AVX2.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-06-09  twofish: add amd64 assembly implementation  Jussi Kivilinna  1  -0/+7
* cipher/Makefile.am: Add 'twofish-amd64.S'.
* cipher/twofish-amd64.S: New file.
* cipher/twofish.c (USE_AMD64_ASM): New macro.
[USE_AMD64_ASM] (_gcry_twofish_amd64_encrypt_block) (_gcry_twofish_amd64_decrypt_block, _gcry_twofish_amd64_ctr_enc) (_gcry_twofish_amd64_cbc_dec, _gcry_twofish_amd64_cfb_dec): New prototypes.
[USE_AMD64_ASM] (do_twofish_encrypt, do_twofish_decrypt) (twofish_encrypt, twofish_decrypt): New functions.
(_gcry_twofish_ctr_enc, _gcry_twofish_cbc_dec, _gcry_twofish_cfb_dec) (selftest_ctr, selftest_cbc, selftest_cfb): New functions.
(selftest): Call new bulk selftests.
* cipher/cipher.c (gcry_cipher_open) [USE_TWOFISH]: Register Twofish bulk functions for ctr-enc, cbc-dec and cfb-dec.
* configure.ac (twofish) [x86_64]: Add 'twofish-amd64.lo'.
* src/cipher.h (_gcry_twofish_ctr_enc, _gcry_twofish_cbc_dec) (gcry_twofish_cfb_dec): New prototypes.
--
Provides non-parallel implementations for a small speed-up and 3-way parallel implementations that get accelerated on `out-of-order' CPUs.

Speed old vs. new on Intel Core i5-4570:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
TWOFISH128     1.08x   1.07x   1.10x   1.80x   1.09x   1.70x   1.08x   1.08x   1.70x   1.69x

Speed old vs. new on Intel Core2 T8100:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
TWOFISH128     1.11x   1.10x   1.13x   1.65x   1.13x   1.62x   1.12x   1.11x   1.63x   1.59x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-05-29  rinjdael: add amd64 assembly implementation  Jussi Kivilinna  1  -0/+7
* cipher/Makefile.am: Add 'rijndael-amd64.S'.
* cipher/rijndael-amd64.S: New file.
* cipher/rijndael.c (USE_AMD64_ASM): New macro.
[USE_AMD64_ASM] (_gcry_aes_amd64_encrypt_block) (_gcry_aes_amd64_decrypt_block): New prototypes.
(do_encrypt_aligned) [USE_AMD64_ASM]: Use amd64 assembly function.
(do_encrypt): Disable input/output alignment when USE_AMD64_ASM is set.
(do_decrypt_aligned) [USE_AMD64_ASM]: Use amd64 assembly function.
(do_decrypt): Disable input/output alignment when USE_AMD64_ASM is set.
* configure.ac (aes) [x86-64]: Add 'rijndael-amd64.lo'.
--
Add optimized amd64 assembly implementation for AES.

Old vs new, on AMD Phenom II:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            1.74x   1.72x   1.81x   1.85x   1.82x   1.76x   1.67x   1.64x   1.79x   1.81x
AES192         1.77x   1.77x   1.79x   1.88x   1.90x   1.80x   1.69x   1.69x   1.85x   1.81x
AES256         1.79x   1.81x   1.83x   1.89x   1.88x   1.82x   1.72x   1.70x   1.87x   1.89x

Old vs new, on Intel Core2:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            1.77x   1.75x   1.78x   1.76x   1.76x   1.77x   1.75x   1.76x   1.76x   1.82x
AES192         1.80x   1.73x   1.81x   1.76x   1.79x   1.85x   1.77x   1.76x   1.80x   1.85x
AES256         1.81x   1.77x   1.81x   1.77x   1.80x   1.79x   1.78x   1.77x   1.81x   1.85x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-05-29  blowfish: add amd64 assembly implementation  Jussi Kivilinna  1  -0/+7
* cipher/Makefile.am: Add 'blowfish-amd64.S'.
* cipher/blowfish-amd64.S: New file.
* cipher/blowfish.c (USE_AMD64_ASM): New macro.
[USE_AMD64_ASM] (_gcry_blowfish_amd64_do_encrypt) (_gcry_blowfish_amd64_encrypt_block) (_gcry_blowfish_amd64_decrypt_block, _gcry_blowfish_amd64_ctr_enc) (_gcry_blowfish_amd64_cbc_dec, _gcry_blowfish_amd64_cfb_dec): New prototypes.
[USE_AMD64_ASM] (do_encrypt, do_encrypt_block, do_decrypt_block) (encrypt_block, decrypt_block): New functions.
(_gcry_blowfish_ctr_enc, _gcry_blowfish_cbc_dec) (_gcry_blowfish_cfb_dec, selftest_ctr, selftest_cbc, selftest_cfb): New functions.
(selftest): Call new bulk selftests.
* cipher/cipher.c (gcry_cipher_open) [USE_BLOWFISH]: Register Blowfish bulk functions for ctr-enc, cbc-dec and cfb-dec.
* configure.ac (blowfish) [x86_64]: Add 'blowfish-amd64.lo'.
* src/cipher.h (_gcry_blowfish_ctr_enc, _gcry_blowfish_cbc_dec) (gcry_blowfish_cfb_dec): New prototypes.
--
Add non-parallel functions for a small speed-up and 4-way parallel functions for modes of operation that support parallel processing.

Speed old vs. new on AMD Phenom II X6 1055T:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
BLOWFISH       1.21x   1.12x   1.17x   3.52x   1.18x   3.34x   1.16x   1.15x   3.38x   3.47x

Speed old vs. new on Intel Core i5-2450M (Sandy-Bridge):

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
BLOWFISH       1.16x   1.10x   1.17x   2.98x   1.18x   2.88x   1.16x   1.15x   3.00x   3.02x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-05-24  cast5: add amd64 assembly implementation  Jussi Kivilinna  1  -0/+7
* cipher/Makefile.am: Add 'cast5-amd64.S'.
* cipher/cast5-amd64.S: New file.
* cipher/cast5.c (USE_AMD64_ASM): New macro.
(_gcry_cast5_s1tos4): Merge arrays s1, s2, s3, s4 to single array to simplify access from assembly implementation.
(s1, s2, s3, s4): New macros pointing to subarrays in _gcry_cast5_s1tos4.
[USE_AMD64_ASM] (_gcry_cast5_amd64_encrypt_block) (_gcry_cast5_amd64_decrypt_block, _gcry_cast5_amd64_ctr_enc) (_gcry_cast5_amd64_cbc_dec, _gcry_cast5_amd64_cfb_dec): New prototypes.
[USE_AMD64_ASM] (do_encrypt_block, do_decrypt_block, encrypt_block) (decrypt_block): New functions.
(_gcry_cast5_ctr_enc, _gcry_cast5_cbc_dec, _gcry_cast5_cfb_dec) (selftest_ctr, selftest_cbc, selftest_cfb): New functions.
(selftest): Call new bulk selftests.
* cipher/cipher.c (gcry_cipher_open) [USE_CAST5]: Register CAST5 bulk functions for ctr-enc, cbc-dec and cfb-dec.
* configure.ac (cast5) [x86_64]: Add 'cast5-amd64.lo'.
* src/cipher.h (_gcry_cast5_ctr_enc, _gcry_cast5_cbc_dec) (gcry_cast5_cfb_dec): New prototypes.
--
Provides non-parallel implementations for a small speed-up and 4-way parallel implementations that get accelerated on `out-of-order' CPUs.

Speed old vs. new on AMD Phenom II X6 1055T:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
CAST5          1.23x   1.22x   1.21x   2.86x   1.21x   2.83x   1.22x   1.17x   2.73x   2.73x

Speed old vs. new on Intel Core i5-2450M (Sandy-Bridge):

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
CAST5          1.00x   1.04x   1.06x   2.56x   1.06x   2.37x   1.03x   1.01x   2.43x   2.41x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-05-23  serpent: add SSE2 accelerated amd64 implementation  Jussi Kivilinna  1  -0/+7
* configure.ac (serpent): Add 'serpent-sse2-amd64.lo'.
* cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add 'serpent-sse2-amd64.S'.
* cipher/cipher.c (gcry_cipher_open) [USE_SERPENT]: Register bulk functions for CBC-decryption and CTR-mode.
* cipher/serpent.c (USE_SSE2): New macro.
[USE_SSE2] (_gcry_serpent_sse2_ctr_enc, _gcry_serpent_sse2_cbc_dec): New prototypes to assembler functions.
(serpent_setkey): Set 'serpent_init_done' before calling serpent_test.
(_gcry_serpent_ctr_enc): New function.
(_gcry_serpent_cbc_dec): New function.
(selftest_ctr_128): New function.
(selftest_cbc_128): New function.
(selftest): Call selftest_ctr_128 and selftest_cbc_128.
* cipher/serpent-sse2-amd64.S: New file.
* src/cipher.h (_gcry_serpent_ctr_enc): New prototype.
(_gcry_serpent_cbc_dec): New prototype.
--
[v2]: Converted to SSE2, to support all amd64 processors (SSE2 is a feature required by the AMD64 SysV ABI).

Patch adds a word-sliced SSE2 implementation of Serpent for amd64 to speed up parallelizable workloads (CTR mode, CBC mode decryption). The implementation processes eight blocks in parallel, with two four-block sets interleaved for out-of-order scheduling.

Speed old vs. new on Intel Core i5-2450M (Sandy-Bridge):

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
SERPENT128     1.00x   0.99x   1.00x   3.98x   1.00x   1.01x   1.00x   1.01x   4.04x   4.04x

Speed old vs. new on AMD Phenom II X6 1055T:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
SERPENT128     1.02x   1.01x   1.00x   2.83x   1.00x   1.00x   1.00x   1.00x   2.72x   2.72x

Speed old vs. new on Intel Core2 Duo T8100:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
SERPENT128     1.00x   1.02x   0.97x   4.02x   0.98x   1.01x   0.98x   1.00x   3.82x   3.91x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-05-22  camellia: Rename camellia_aesni_avx_x86-64.S to camellia-aesni-avx-amd64.S  Jussi Kivilinna  1  -1/+1
* cipher/camellia_aesni_avx_x86-64.S: Remove.
* cipher/camellia-aesni-avx-amd64.S: New.
* cipher/Makefile.am: Use the new filename.
* configure.ac: Use the new filename.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-04-11  Add gcry_pubkey_get_sexp.  Werner Koch  1  -6/+10
* src/gcrypt.h.in (GCRY_PK_GET_PUBKEY): New.
(GCRY_PK_GET_SECKEY): New.
(gcry_pubkey_get_sexp): New.
* src/visibility.c (gcry_pubkey_get_sexp): New.
* src/visibility.h (gcry_pubkey_get_sexp): Mark visible.
* src/libgcrypt.def, src/libgcrypt.vers: Add new function.
* cipher/pubkey-internal.h: New.
* cipher/Makefile.am (libcipher_la_SOURCES): Add new file.
* cipher/ecc.c: Include pubkey-internal.h.
(_gcry_pk_ecc_get_sexp): New.
* cipher/pubkey.c: Include pubkey-internal.h and context.h.
(_gcry_pubkey_get_sexp): New.
* src/context.c (_gcry_ctx_find_pointer): New.
* src/cipher-proto.h: Add _gcry_pubkey_get_sexp.
* tests/t-mpi-point.c (print_sexp): New.
(context_param, basic_ec_math_simplified): Add tests for the new function.
* configure.ac (NEED_GPG_ERROR_VERSION): Set to 1.11.
(AH_BOTTOM): Add error codes from gpg-error 1.12.
* src/g10lib.h (fips_not_operational): Use GPG_ERR_NOT_OPERATIONAL.
* mpi/ec.c (_gcry_mpi_ec_get_mpi): Fix computation of Q.
(_gcry_mpi_ec_get_point): Ditto.
--
While checking the new code I figured that the auto-computation of Q must have led to a segv. It seems we had no test case for that.

Signed-off-by: Werner Koch <wk@gnupg.org>
2013-04-05  Add test case for SCRYPT and rework the code.  Werner Koch  1  -2/+41
* tests/t-kdf.c (check_scrypt): New.
(main): Call new test.
* configure.ac: Support disabling of the scrypt algorithm. Make KDF enabling similar to the other algorithm classes. Disable scrypt if we don't have a 64 bit type.
* cipher/memxor.c, cipher/memxor.h: Remove.
* cipher/scrypt.h: Remove.
* cipher/kdf-internal.h: New.
* cipher/Makefile.am: Remove files. Add new file. Move scrypt.c to EXTRA_libcipher_la_SOURCES.
(GCRYPT_MODULES): Add GCRYPT_KDFS.
* src/gcrypt.h.in (GCRY_KDF_SCRYPT): Change value.
* cipher/kdf.c (pkdf2): Rename to _gcry_kdf_pkdf2.
(_gcry_kdf_pkdf2): Don't bail out for SALTLEN==0.
(gcry_kdf_derive): Allow for a passwordlen of zero for scrypt. Check for SALTLEN > 0 for GCRY_KDF_PBKDF2. Pass algo to _gcry_kdf_scrypt.
(gcry_kdf_derive) [!USE_SCRYPT]: Return an error.
* cipher/scrypt.c: Replace memxor.h by bufhelp.h. Replace scrypt.h by kdf-internal.h. Enable code only if HAVE_U64_TYPEDEF is defined. Replace C99 types uint64_t, uint32_t, and uint8_t by libgcrypt types.
(_SALSA20_INPUT_LENGTH): Remove underscore from identifier.
(_scryptBlockMix): Replace memxor by buf_xor.
(_gcry_kdf_scrypt): Use gcry_malloc and gcry_free. Check for integer overflow. Add hack to support blocksize of 1 for tests. Return errors from calls to _gcry_kdf_pkdf2.
* cipher/kdf.c (openpgp_s2k): Make static.
--
This patch prepares the addition of more KDF functions, brings the code into Libgcrypt shape, adds a test case and makes the code more robust. For example, scrypt would have failed silently if Libgcrypt was not built with SHA256 support. Also fixed symbol naming for systems without visibility support.

Signed-off-by: Werner Koch <wk@gnupg.org>
2013-03-22  Replace deprecated AM_CONFIG_HEADER macro.  Werner Koch  1  -2/+2
* configure.ac: s/AM_CONFIG_HEADER/AC_CONFIG_HEADER/
2013-03-22  Disable AES-NI support if as does not support SSSE3.  Werner Koch  1  -13/+44
* configure.ac (HAVE_GCC_INLINE_ASM_SSSE3): New test.
(ENABLE_AESNI_SUPPORT): Do not define without SSSE3 support.
(HAVE_GCC_INLINE_ASM_SSSE3, ENABLE_AVX_SUPPORT): Split up detection and definition.
--
For example, the assembler of FreeBSD 7.3 does not know about pshufb, and thus rijndael.c can't be compiled without using --disable-aesni-support. This change checks that the toolchain can use SSSE3 instructions before trying to build with AES-NI support.
2013-03-20  Provide GCRYPT_VERSION_NUMBER macro, add build info to the binary.  Werner Koch  1  -11/+21
* src/gcrypt.h.in (GCRYPT_VERSION_NUMBER): New.
* configure.ac (VERSION_NUMBER): New ac_subst.
* src/global.c (_gcry_vcontrol): Move call to above function ...
(gcry_check_version): .. here.
* configure.ac (BUILD_REVISION, BUILD_FILEVERSION) (BUILD_TIMESTAMP): Define on all platforms.
* compat/compat.c (_gcry_compat_identification): Include revision and timestamp.
2013-03-07  Pretty print the configure feedback.  Werner Koch  1  -15/+15
* acinclude.m4 (GNUPG_MSG_PRINT): Remove.
(GCRY_MSG_SHOW, GCRY_MSG_WRAP): New.
* configure.ac: Use new macros for the feedback.
2013-02-20  Remove build hacks for FreeBSD.  Werner Koch  1  -6/+0
* configure.ac [freebsd]: Do not add /usr/local to CPPFLAGS and LDFLAGS.
--
Back in ~2000 we introduced a quick hack to make building Libgcrypt on FreeBSD easier by always adding -I/usr/local/include and -L/usr/local/lib. It turned out that this is a bad idea if one wants to build with a library version which is not installed in /usr/local.
2013-02-19  Add AES-NI/AVX accelerated Camellia implementation  Jussi Kivilinna  1  -0/+42
* configure.ac: Add option --disable-avx-support.
(HAVE_GCC_INLINE_ASM_AVX): New.
(ENABLE_AVX_SUPPORT): New.
(camellia) [ENABLE_AVX_SUPPORT, ENABLE_AESNI_SUPPORT]: Add camellia_aesni_avx_x86-64.lo.
* cipher/Makefile.am (AM_CCASFLAGS): Add.
(EXTRA_libcipher_la_SOURCES): Add camellia_aesni_avx_x86-64.S
* cipher/camellia-glue.c [ENABLE_AESNI_SUPPORT, ENABLE_AVX_SUPPORT] [__x86_64__] (USE_AESNI_AVX): Add macro.
(struct Camellia_context) [USE_AESNI_AVX]: Add use_aesni_avx.
[USE_AESNI_AVX] (_gcry_camellia_aesni_avx_ctr_enc) (_gcry_camellia_aesni_avx_cbc_dec): New prototypes to assembly functions.
(camellia_setkey) [USE_AESNI_AVX]: Enable AES-NI/AVX if hardware supports both.
(_gcry_camellia_ctr_enc) [USE_AESNI_AVX]: Add AES-NI/AVX code.
(_gcry_camellia_cbc_dec) [USE_AESNI_AVX]: Add AES-NI/AVX code.
* cipher/camellia_aesni_avx_x86-64.S: New.
* src/g10lib.h (HWF_INTEL_AVX): New.
* src/global.c (hwflist): Add HWF_INTEL_AVX.
* src/hwf-x86.c (detect_x86_gnuc) [ENABLE_AVX_SUPPORT]: Add detection for AVX.
--
Before:

Running each test 250 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
CAMELLIA128   2210ms  2200ms  2300ms  2050ms  2240ms  2250ms  2290ms  2270ms  2070ms  2070ms
CAMELLIA256   2810ms  2800ms  2920ms  2670ms  2840ms  2850ms  2910ms  2890ms  2660ms  2640ms

After:

Running each test 250 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
CAMELLIA128   2200ms  2220ms  2290ms   470ms  2240ms  2270ms  2270ms  2290ms   480ms   480ms
CAMELLIA256   2820ms  2820ms  2900ms   600ms  2860ms  2860ms  2900ms  2920ms   620ms   620ms

The AES-NI/AVX implementation works by processing 16 blocks in parallel (256 bytes). It is a bytesliced implementation that uses AES-NI (SubBytes) for the Camellia sboxes, with the help of prefiltering/postfiltering. For smaller data sets the generic C implementation is used.

Speed-up for CBC-decryption and CTR-mode (large data): 4.3x

Tests were run on: Intel Core i5-2450M

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

(license boiler plate update by wk)