Age | Commit message | Author | Files | Lines |
|
* cipher/Makefile.am: Add 'sha512-arm.S'.
* cipher/sha512-arm.S: New.
* cipher/sha512.c (USE_ARM_ASM): New.
(_gcry_sha512_transform_arm): New.
(transform) [USE_ARM_ASM]: Use ARM assembly implementation instead of
generic.
* configure.ac: Add 'sha512-arm.lo'.
--
Benchmark on Cortex-A8 (armv6, 1008 Mhz):
Before:
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 112.0 ns/B 8.52 MiB/s 112.9 c/B
After (3.3x faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 34.01 ns/B 28.04 MiB/s 34.28 c/B
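The three benchmark columns are related by simple arithmetic. A minimal
sketch, assuming the 1008 MHz clock quoted above; the helper names are
illustrative only and are not taken from libgcrypt's benchmark tool:

```c
/* Convert nanoseconds per byte to mebibytes per second. */
static double ns_per_byte_to_mib_per_sec(double ns_per_byte)
{
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Convert nanoseconds per byte to cycles per byte for a given clock. */
static double ns_per_byte_to_cycles(double ns_per_byte, double cpu_mhz)
{
  return ns_per_byte * cpu_mhz / 1000.0;
}

/* Speed-up factor of the new code relative to the old code. */
static double speedup(double old_ns_per_byte, double new_ns_per_byte)
{
  return old_ns_per_byte / new_ns_per_byte;
}
```

Feeding in 112.0 ns/B and 34.01 ns/B at 1008 MHz reproduces the MiB/s and
c/B columns above to within rounding, and the ratio 112.0 / 34.01 is the
quoted 3.3x.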
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'keccak-armv7-neon.S'.
* cipher/keccak-armv7-neon.S: New.
* cipher/keccak.c (USE_64BIT_ARM_NEON): New.
(NEED_COMMON64): Select if USE_64BIT_ARM_NEON.
[NEED_COMMON64] (round_consts_64bit): Rename to...
[NEED_COMMON64] (_gcry_keccak_round_consts_64bit): ...this; Add
terminator at end.
[USE_64BIT_ARM_NEON] (_gcry_keccak_permute_armv7_neon)
(_gcry_keccak_absorb_lanes64_armv7_neon, keccak_permute64_armv7_neon)
(keccak_absorb_lanes64_armv7_neon, keccak_armv7_neon_64_ops): New.
(keccak_init) [USE_64BIT_ARM_NEON]: Select ARM/NEON implementation
if supported by HW.
* cipher/keccak_permute_64.h (KECCAK_F1600_PERMUTE_FUNC_NAME): Update
to use new round constant table.
* configure.ac: Add 'keccak-armv7-neon.lo'.
--
Patch adds ARMv7/NEON implementation of Keccak (SHAKE/SHA3). Patch
is based on public-domain implementation by Ronny Van Keer from
SUPERCOP package:
https://github.com/floodyberry/supercop/blob/master/crypto_hash/\
keccakc1024/inplace-armv7a-neon/keccak2.s
Benchmark results on Cortex-A8 @ 1008 Mhz:
Before (generic 32-bit bit-interleaved impl.):
| nanosecs/byte mebibytes/sec cycles/byte
SHAKE128 | 83.00 ns/B 11.49 MiB/s 83.67 c/B
SHAKE256 | 101.7 ns/B 9.38 MiB/s 102.5 c/B
SHA3-224 | 96.13 ns/B 9.92 MiB/s 96.90 c/B
SHA3-256 | 101.5 ns/B 9.40 MiB/s 102.3 c/B
SHA3-384 | 131.4 ns/B 7.26 MiB/s 132.5 c/B
SHA3-512 | 189.1 ns/B 5.04 MiB/s 190.6 c/B
After (ARM/NEON, ~3.2x faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHAKE128 | 25.09 ns/B 38.01 MiB/s 25.29 c/B
SHAKE256 | 30.95 ns/B 30.82 MiB/s 31.19 c/B
SHA3-224 | 29.24 ns/B 32.61 MiB/s 29.48 c/B
SHA3-256 | 30.95 ns/B 30.82 MiB/s 31.19 c/B
SHA3-384 | 40.42 ns/B 23.59 MiB/s 40.74 c/B
SHA3-512 | 58.37 ns/B 16.34 MiB/s 58.84 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac (BUILD_TIMESTAMP): Set to "<none>" by default.
--
This is based on
libgpg-error commit d620005fd1a655d591fccb44639e22ea445e4554
but changed to be disabled by default. Check there for some
background.
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* src/gcrypt.h.in (GCRY_MD_SHA3_224, GCRY_MD_SHA3_256)
(GCRY_MD_SHA3_384, GCRY_MD_SHA3_512): New.
(GCRY_MAC_HMAC_SHA3_224, GCRY_MAC_HMAC_SHA3_256)
(GCRY_MAC_HMAC_SHA3_384, GCRY_MAC_HMAC_SHA3_512): New.
* cipher/keccak.c: New with stub functions.
* cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add keccak.c.
* configure.ac (available_digests): Add sha3.
(USE_SHA3): New.
* src/fips.c (run_hmac_selftests): Add SHA3 to the required selftests.
* cipher/md.c (digest_list) [USE_SHA3]: Add standard SHA3 algos.
(md_open): Ditto for hmac processing.
* cipher/mac-hmac.c (map_mac_algo_to_md): Add mapping.
* cipher/hmac-tests.c (run_selftests): Prepare for tests.
* cipher/pubkey-util.c (get_hash_algo): Add "sha3-xxx".
--
Note that the GCRY_MD_SHA3_xxx algo ids are preliminary. We should try to
sync them with OpenPGP.
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* cipher/cipher-gcm-intel-pclmul.c (_gcry_ghash_intel_pclmul)
(_gcry_ghash_intel_pclmul) [__WIN64__]: Store non-volatile vector
registers before use and restore after.
* cipher/cipher-internal.h (GCM_USE_INTEL_PCLMUL): Remove dependency
on !defined(__WIN64__).
* cipher/rijndael-aesni.c [__WIN64__] (aesni_prepare_2_6_variable,
aesni_prepare, aesni_prepare_2_6, aesni_cleanup)
(aesni_cleanup_2_6): New.
[!__WIN64__] (aesni_prepare_2_6_variable, aesni_prepare_2_6): New.
(_gcry_aes_aesni_do_setkey, _gcry_aes_aesni_cbc_enc)
(_gcry_aesni_ctr_enc, _gcry_aesni_cfb_dec, _gcry_aesni_cbc_dec)
(_gcry_aesni_ocb_crypt, _gcry_aesni_ocb_auth): Use
'aesni_prepare_2_6'.
* cipher/rijndael-internal.h (USE_SSSE3): Enable if
HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS or
HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS.
(USE_AESNI): Remove dependency on !defined(__WIN64__)
* cipher/rijndael-ssse3-amd64.c [HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS]
(vpaes_ssse3_prepare, vpaes_ssse3_cleanup): New.
[!HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS] (vpaes_ssse3_prepare): New.
(vpaes_ssse3_prepare_enc, vpaes_ssse3_prepare_dec): Use
'vpaes_ssse3_prepare'.
(_gcry_aes_ssse3_do_setkey, _gcry_aes_ssse3_prepare_decryption): Use
'vpaes_ssse3_prepare' and 'vpaes_ssse3_cleanup'.
[HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS] (X): Add masking macro to
exclude '.type' and '.size' markers from assembly code, as they are
not supported on WIN64/COFF objects.
* configure.ac (gcry_cv_gcc_attribute_ms_abi)
(gcry_cv_gcc_attribute_sysv_abi, gcry_cv_gcc_default_abi_is_ms_abi)
(gcry_cv_gcc_default_abi_is_sysv_abi)
(gcry_cv_gcc_win64_platform_as_ok): New checks.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac: Add sizeof check for 'void *'.
* random/rndhw.c (poll_padlock): Check for SIZEOF_VOID_P == 8
instead of defined(__LP64__).
(RDRAND_LONG): Check for SIZEOF_UNSIGNED_LONG == 8 instead of
defined(__LP64__).
--
__LP64__ is not predefined for 64-bit mingw64-gcc, which caused wrong
assembly code selections. Do selection based on type sizes instead,
to support x86_64, x32 and win64 properly.
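As an illustration of why type sizes are the safer selector: Win64 uses
the LLP64 data model, where pointers are 64-bit but 'unsigned long' stays
32-bit. A minimal sketch, assuming plain sizeof checks instead of the
configure-generated SIZEOF_VOID_P and SIZEOF_UNSIGNED_LONG macros that
the patch uses; the function names are hypothetical:

```c
#include <stddef.h>

/* Nonzero when pointers are 64-bit (x86_64 SysV and Win64, but not x32). */
static int pointers_are_64bit(void)
{
  return sizeof(void *) == 8;
}

/* Nonzero when 'unsigned long' is 64-bit
   (true on LP64, false on LLP64/Win64 and on 32-bit targets). */
static int ulong_is_64bit(void)
{
  return sizeof(unsigned long) == 8;
}

/* Name the data model, only to show that the two questions are independent. */
static const char *data_model(void)
{
  if (pointers_are_64bit() && ulong_is_64bit())
    return "LP64";
  if (pointers_are_64bit())
    return "LLP64";
  return "32-bit pointers (ILP32 or x32)";
}
```

Code that needs 64-bit pointers asks the first question and code that
needs a 64-bit 'long' asks the second; a single __LP64__ test cannot
tell these cases apart.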
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac (gcry_cv_gcc_attribute_packed): Move 'long b' to its
own packed structure.
--
Change packed attribute test so that it works with both MS ABI and SYSV ABI.
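A sketch of what such a compile-time test can look like, assuming
GCC-style attributes. The struct and enum names are illustrative and not
necessarily those used in configure.ac:

```c
/* 'long' is wrapped in its own packed structure, following the change
   described in the ChangeLog entry above. */
struct foolong_s
{
  long b;
} __attribute__ ((packed));

struct foo_s
{
  char a;
  struct foolong_s b;
} __attribute__ ((packed));

/* If packing is not honoured, the comparison is 0 and the division by
   zero in a constant expression makes the compilation fail. */
enum packed_check
{
  PACKED_OK = 1 / (sizeof (struct foo_s) == (sizeof (char) + sizeof (long)))
};

/* Helper so the size can also be inspected at run time. */
static unsigned long packed_foo_size(void)
{
  return (unsigned long) sizeof (struct foo_s);
}
```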
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/bufhelp.h (BUFHELP_FAST_UNALIGNED_ACCESS): Enable only when
HAVE_GCC_ATTRIBUTE_PACKED and HAVE_GCC_ATTRIBUTE_ALIGNED are defined.
(bufhelp_int_t): New type.
(buf_cpy, buf_xor, buf_xor_1, buf_xor_2dst, buf_xor_n_copy_2): Use
'bufhelp_int_t'.
[BUFHELP_FAST_UNALIGNED_ACCESS] (bufhelp_u32_t, bufhelp_u64_t): New.
[BUFHELP_FAST_UNALIGNED_ACCESS] (buf_get_be32, buf_get_le32)
(buf_put_be32, buf_put_le32, buf_get_be64, buf_get_le64)
(buf_put_be64, buf_put_le64): Use 'bufhelp_uXX_t'.
* configure.ac (gcry_cv_gcc_attribute_packed): New.
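The idea behind packed and aligned helper types can be sketched as
follows. This assumes GCC-style attributes; the type and function names
are hypothetical, and the 'may_alias' attribute is an addition made here
to keep the sketch safe under strict aliasing, not something stated in
the entry above:

```c
#include <stdint.h>
#include <string.h>

/* packed + aligned(1) tells the compiler the member may sit at any
   address, so it must emit an access that is safe when unaligned. */
typedef struct unaligned_u32_s
{
  uint32_t a;
} __attribute__ ((packed, aligned (1), may_alias)) unaligned_u32_t;

/* Read a host-endian 32-bit word from an arbitrarily aligned buffer. */
static uint32_t load_u32_unaligned(const void *buf)
{
  return ((const unaligned_u32_t *) buf)->a;
}

/* Portable reference implementation using memcpy. */
static uint32_t load_u32_memcpy(const void *buf)
{
  uint32_t v;
  memcpy(&v, buf, sizeof v);
  return v;
}
```

Both loads return the same value for any address; the typed variant only
makes sense when the compiler supports the attributes, which is what the
new configure check establishes.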
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/bithelp.h (_gcry_ctz, _gcry_ctz64): New.
* configure.ac (HAVE_BUILTIN_CTZ): Add new test.
--
Note that these functions return the number of bits in the word when
passing 0.
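A minimal sketch of that convention, assuming a 'unsigned int' argument.
The name ctz_sketch is hypothetical and this is not the libgcrypt code.
The explicit zero case is needed because GCC documents the result of
__builtin_ctz(0) as undefined:

```c
#include <limits.h>

/* Count trailing zero bits; returns the word size in bits for x == 0. */
static unsigned int ctz_sketch(unsigned int x)
{
  if (x == 0)
    return (unsigned int) (sizeof x * CHAR_BIT);
#if defined(__GNUC__)
  return (unsigned int) __builtin_ctz(x);
#else
  {
    /* Generic fallback: shift until the lowest set bit reaches bit 0. */
    unsigned int n = 0;
    while ((x & 1) == 0)
      {
        x >>= 1;
        n++;
      }
    return n;
  }
#endif
}
```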
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* random/rndunix.c (STDERR_FILENO): Define if needed.
(start_gatherer): Re-open standard descriptors. Fix an
unsigned/signed pointer warning.
--
GnuPG-bug-id: 1702
|
|
* configure.ac (AM_INIT_AUTOMAKE): Add serial-tests.
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* cipher/Makefile.am: Add 'rijndael-ssse3-amd64.c'.
* cipher/rijndael-internal.h (USE_SSSE3): New.
(RIJNDAEL_context_s) [USE_SSSE3]: Add 'use_ssse3'.
* cipher/rijndael-ssse3-amd64.c: New.
* cipher/rijndael.c [USE_SSSE3] (_gcry_aes_ssse3_do_setkey)
(_gcry_aes_ssse3_prepare_decryption, _gcry_aes_ssse3_encrypt)
(_gcry_aes_ssse3_decrypt, _gcry_aes_ssse3_cfb_enc)
(_gcry_aes_ssse3_cbc_enc, _gcry_aes_ssse3_ctr_enc)
(_gcry_aes_ssse3_cfb_dec, _gcry_aes_ssse3_cbc_dec): New.
(do_setkey): Add HWF check for SSSE3 and setup for SSSE3
implementation.
(prepare_decryption, _gcry_aes_cfb_enc, _gcry_aes_cbc_enc)
(_gcry_aes_ctr_enc, _gcry_aes_cfb_dec, _gcry_aes_cbc_dec): Add
selection for SSSE3 implementation.
* configure.ac [host=x86_64]: Add 'rijndael-ssse3-amd64.lo'.
--
This patch adds "AES with vector permutations" implementation by
Mike Hamburg. Public-domain source-code is available at:
http://crypto.stanford.edu/vpaes/
Benchmark on Intel Core2 T8100 (2.1Ghz, no turbo):
Old (AMD64 asm):
AES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 8.79 ns/B 108.5 MiB/s 18.46 c/B
ECB dec | 9.07 ns/B 105.1 MiB/s 19.05 c/B
CBC enc | 7.77 ns/B 122.7 MiB/s 16.33 c/B
CBC dec | 7.74 ns/B 123.2 MiB/s 16.26 c/B
CFB enc | 7.88 ns/B 121.0 MiB/s 16.54 c/B
CFB dec | 7.56 ns/B 126.1 MiB/s 15.88 c/B
OFB enc | 9.02 ns/B 105.8 MiB/s 18.94 c/B
OFB dec | 9.07 ns/B 105.1 MiB/s 19.05 c/B
CTR enc | 7.80 ns/B 122.2 MiB/s 16.38 c/B
CTR dec | 7.81 ns/B 122.2 MiB/s 16.39 c/B
New (ssse3):
AES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 5.77 ns/B 165.2 MiB/s 12.13 c/B
ECB dec | 7.13 ns/B 133.7 MiB/s 14.98 c/B
CBC enc | 5.27 ns/B 181.0 MiB/s 11.06 c/B
CBC dec | 6.39 ns/B 149.3 MiB/s 13.42 c/B
CFB enc | 5.27 ns/B 180.9 MiB/s 11.07 c/B
CFB dec | 5.28 ns/B 180.7 MiB/s 11.08 c/B
OFB enc | 6.11 ns/B 156.1 MiB/s 12.83 c/B
OFB dec | 6.13 ns/B 155.5 MiB/s 12.88 c/B
CTR enc | 5.26 ns/B 181.5 MiB/s 11.04 c/B
CTR dec | 5.24 ns/B 182.0 MiB/s 11.00 c/B
Benchmark on Intel i5-2450M (2.5Ghz, no turbo, aes-ni disabled):
Old (AMD64 asm):
AES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 8.06 ns/B 118.3 MiB/s 20.15 c/B
ECB dec | 8.21 ns/B 116.1 MiB/s 20.53 c/B
CBC enc | 7.88 ns/B 121.1 MiB/s 19.69 c/B
CBC dec | 7.57 ns/B 126.0 MiB/s 18.92 c/B
CFB enc | 7.87 ns/B 121.2 MiB/s 19.67 c/B
CFB dec | 7.56 ns/B 126.2 MiB/s 18.89 c/B
OFB enc | 8.27 ns/B 115.3 MiB/s 20.67 c/B
OFB dec | 8.28 ns/B 115.1 MiB/s 20.71 c/B
CTR enc | 8.02 ns/B 119.0 MiB/s 20.04 c/B
CTR dec | 8.02 ns/B 118.9 MiB/s 20.05 c/B
New (ssse3):
AES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 4.03 ns/B 236.6 MiB/s 10.07 c/B
ECB dec | 5.28 ns/B 180.8 MiB/s 13.19 c/B
CBC enc | 3.77 ns/B 252.7 MiB/s 9.43 c/B
CBC dec | 4.69 ns/B 203.3 MiB/s 11.73 c/B
CFB enc | 3.75 ns/B 254.3 MiB/s 9.37 c/B
CFB dec | 3.69 ns/B 258.6 MiB/s 9.22 c/B
OFB enc | 4.17 ns/B 228.7 MiB/s 10.43 c/B
OFB dec | 4.17 ns/B 228.7 MiB/s 10.42 c/B
CTR enc | 3.72 ns/B 256.5 MiB/s 9.30 c/B
CTR dec | 3.72 ns/B 256.1 MiB/s 9.31 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* Makefile.am (AUTOMAKE_OPTIONS): Remove.
(doc) [!BUILD_DOC]: Do not recurse into the dir.
* configure.ac (AM_INIT_AUTOMAKE): Add option formerly in Makefile.am.
(BUILD_DOC): Add new am_conditional.
|
|
* cipher/Makefile.am: Add 'rijndael-padlock.c'.
* cipher/rijndael-padlock.c: New.
* cipher/rijndael.c (do_padlock, do_padlock_encrypt)
(do_padlock_decrypt): Move to 'rijndael-padlock.c'.
* configure.ac [mpi_cpu_arch=x86]: Add 'rijndael-padlock.lo'.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.in: Add 'rijndael-aesni.c'.
* cipher/rijndael-aesni.c: New.
* cipher/rijndael-internal.h: New.
* cipher/rijndael.c (MAXKC, MAXROUNDS, BLOCKSIZE, ATTR_ALIGNED_16)
(USE_AMD64_ASM, USE_ARM_ASM, USE_PADLOCK, USE_AESNI, RIJNDAEL_context)
(keyschenc, keyschdec, padlockkey): Move to 'rijndael-internal.h'.
(u128_s, aesni_prepare, aesni_cleanup, aesni_cleanup_2_6)
(aesni_do_setkey, do_aesni_enc, do_aesni_dec, do_aesni_enc_vec4)
(do_aesni_dec_vec4, do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Move
to 'rijndael-aesni.c'.
(prepare_decryption, rijndael_encrypt, _gcry_aes_cfb_enc)
(_gcry_aes_cbc_enc, _gcry_aes_ctr_enc, rijndael_decrypt)
(_gcry_aes_cfb_dec, _gcry_aes_cbc_dec) [USE_AESNI]: Move to functions
in 'rijndael-aesni.c'.
* configure.ac [mpi_cpu_arch=x86]: Add 'rijndael-aesni.lo'.
--
Clean up rijndael.c before new hardware acceleration support gets added.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'poly1305-armv7-neon.S'.
* cipher/poly1305-armv7-neon.S: New.
* cipher/poly1305-internal.h (POLY1305_USE_NEON)
(POLY1305_NEON_BLOCKSIZE, POLY1305_NEON_STATESIZE)
(POLY1305_NEON_ALIGNMENT): New.
* cipher/poly1305.c [POLY1305_USE_NEON]
(_gcry_poly1305_armv7_neon_init_ext)
(_gcry_poly1305_armv7_neon_finish_ext)
(_gcry_poly1305_armv7_neon_blocks, poly1305_armv7_neon_ops): New.
(_gcry_poly1305_init) [POLY1305_USE_NEON]: Select NEON implementation
if HWF_ARM_NEON set.
* configure.ac [neonsupport=yes]: Add 'poly1305-armv7-neon.lo'.
--
Add Andrew Moon's public domain NEON implementation of Poly1305. Original
source is available at: https://github.com/floodyberry/poly1305-opt
Benchmark on Cortex-A8 (--cpu-mhz 1008):
Old:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 12.34 ns/B 77.27 MiB/s 12.44 c/B
New:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 2.12 ns/B 450.7 MiB/s 2.13 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'chacha20-armv7-neon.S'.
* cipher/chacha20-armv7-neon.S: New.
* cipher/chacha20.c (USE_NEON): New.
[USE_NEON] (_gcry_chacha20_armv7_neon_blocks): New.
(chacha20_do_setkey) [USE_NEON]: Use Neon implementation if
HWF_ARM_NEON flag set.
(selftest): Self-test encrypting buffer byte by byte.
* configure.ac [neonsupport=yes]: Add 'chacha20-armv7-neon.lo'.
--
Add Andrew Moon's public domain ARMv7/NEON implementation of ChaCha20. Original
source is available at: https://github.com/floodyberry/chacha-opt
Benchmark on Cortex-A8 (--cpu-mhz 1008):
Old:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 13.45 ns/B 70.92 MiB/s 13.56 c/B
STREAM dec | 13.45 ns/B 70.90 MiB/s 13.56 c/B
New:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 6.20 ns/B 153.9 MiB/s 6.25 c/B
STREAM dec | 6.20 ns/B 153.9 MiB/s 6.25 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'whirlpool-sse2-amd64.S'.
* cipher/whirlpool-sse2-amd64.S: New.
* cipher/whirlpool.c (USE_AMD64_ASM): New.
(whirlpool_tables_s): New.
(rc, C0, C1, C2, C3, C4, C5, C6, C7): Combine these tables into single
structure and replace old tables with macros of same name.
(tab): New structure containing above tables.
[USE_AMD64_ASM] (_gcry_whirlpool_transform_amd64)
(whirlpool_transform): New.
* configure.ac [host=x86_64]: Add 'whirlpool-sse2-amd64.lo'.
--
Benchmark results:
On Intel Core i5-4570 (3.2 Ghz):
After:
WHIRLPOOL | 4.82 ns/B 197.8 MiB/s 15.43 c/B
Before:
WHIRLPOOL | 9.10 ns/B 104.8 MiB/s 29.13 c/B
On Intel Core i5-2450M (2.5 Ghz):
After:
WHIRLPOOL | 8.43 ns/B 113.1 MiB/s 21.09 c/B
Before:
WHIRLPOOL | 13.45 ns/B 70.92 MiB/s 33.62 c/B
On Intel Core2 T8100 (2.1 Ghz):
After:
WHIRLPOOL | 10.22 ns/B 93.30 MiB/s 21.47 c/B
Before:
WHIRLPOOL | 19.87 ns/B 48.00 MiB/s 41.72 c/B
Summary, old vs new ratio:
Intel Core i5-4570: 1.88x
Intel Core i5-2450M: 1.59x
Intel Core2 T8100: 1.94x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac: Mark SYSROOT as arg var.
|
|
* src/libgcrypt.m4: Add support for SYSROOT and set
gpg_config_script_warn. Use AC_PATH_PROG instead of AC_PATH_TOOL
because the config script is not expected to be installed with a
prefix for its name
* configure.ac: Print a library mismatch warning.
* m4/gpg-error.m4: Update from git master.
--
Also fixed the false copyright notice in libgcrypt.m4.
|
|
* cipher/Makefile.am: Add 'chacha20-sse2-amd64.S'.
* cipher/chacha20-sse2-amd64.S: New.
* cipher/chacha20.c (USE_SSE2): New.
[USE_SSE2] (_gcry_chacha20_amd64_sse2_blocks): New.
(chacha20_do_setkey) [USE_SSE2]: Use SSE2 implementation for blocks
function.
* configure.ac [host=x86-64]: Add 'chacha20-sse2-amd64.lo'.
--
Add Andrew Moon's public domain SSE2 implementation of ChaCha20. Original
source is available at: https://github.com/floodyberry/chacha-opt
Benchmark on Intel i5-4570 (haswell),
with "--disable-hwf intel-avx2 --disable-hwf intel-ssse3":
Old:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 1.97 ns/B 483.8 MiB/s 6.31 c/B
STREAM dec | 1.97 ns/B 483.6 MiB/s 6.31 c/B
New:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.931 ns/B 1024.7 MiB/s 2.98 c/B
STREAM dec | 0.930 ns/B 1025.0 MiB/s 2.98 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'poly1305-avx2-amd64.S'.
* cipher/poly1305-avx2-amd64.S: New.
* cipher/poly1305-internal.h (POLY1305_USE_AVX2)
(POLY1305_AVX2_BLOCKSIZE, POLY1305_AVX2_STATESIZE)
(POLY1305_AVX2_ALIGNMENT): New.
(POLY1305_LARGEST_BLOCKSIZE, POLY1305_LARGEST_STATESIZE)
(POLY1305_STATE_ALIGNMENT): Use AVX2 versions when needed.
* cipher/poly1305.c [POLY1305_USE_AVX2]
(_gcry_poly1305_amd64_avx2_init_ext)
(_gcry_poly1305_amd64_avx2_finish_ext)
(_gcry_poly1305_amd64_avx2_blocks, poly1305_amd64_avx2_ops): New.
(_gcry_poly1305_init) [POLY1305_USE_AVX2]: Use AVX2 implementation if
AVX2 supported by CPU.
* configure.ac [host=x86_64]: Add 'poly1305-avx2-amd64.lo'.
--
Add Andrew Moon's public domain AVX2 implementation of Poly1305. Original
source is available at: https://github.com/floodyberry/poly1305-opt
Benchmarks on Intel i5-4570 (haswell):
Old:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.448 ns/B 2129.5 MiB/s 1.43 c/B
New:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.205 ns/B 4643.5 MiB/s 0.657 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'poly1305-sse2-amd64.S'.
* cipher/poly1305-internal.h (POLY1305_USE_SSE2)
(POLY1305_SSE2_BLOCKSIZE, POLY1305_SSE2_STATESIZE)
(POLY1305_SSE2_ALIGNMENT): New.
(POLY1305_LARGEST_BLOCKSIZE, POLY1305_LARGEST_STATESIZE)
(POLY1305_STATE_ALIGNMENT): Use SSE2 versions when needed.
* cipher/poly1305-sse2-amd64.S: New.
* cipher/poly1305.c [POLY1305_USE_SSE2]
(_gcry_poly1305_amd64_sse2_init_ext)
(_gcry_poly1305_amd64_sse2_finish_ext)
(_gcry_poly1305_amd64_sse2_blocks, poly1305_amd64_sse2_ops): New.
(_gcry_poly1305_init) [POLY1305_USE_SSE2]: Use SSE2 version.
* configure.ac [host=x86_64]: Add 'poly1305-sse2-amd64.lo'.
--
Add Andrew Moon's public domain SSE2 implementation of Poly1305. Original
source is available at: https://github.com/floodyberry/poly1305-opt
Benchmarks on Intel i5-4570 (haswell):
Old:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.844 ns/B 1130.2 MiB/s 2.70 c/B
New:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.448 ns/B 2129.5 MiB/s 1.43 c/B
Benchmarks on Intel i5-2450M (sandy-bridge):
Old:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 1.25 ns/B 763.0 MiB/s 3.12 c/B
New:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.605 ns/B 1575.9 MiB/s 1.51 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'chacha20-avx2-amd64.S'.
* cipher/chacha20-avx2-amd64.S: New.
* cipher/chacha20.c (USE_AVX2): New macro.
[USE_AVX2] (_gcry_chacha20_amd64_avx2_blocks): New.
(chacha20_do_setkey): Select AVX2 implementation if there is HW
support.
(selftest): Increase size of buf by 256.
* configure.ac [host=x86-64]: Add 'chacha20-avx2-amd64.lo'.
--
Add AVX2 optimized implementation for ChaCha20. Based on implementation by
Andrew Moon.
SSSE3 (Intel Haswell):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.742 ns/B 1284.8 MiB/s 2.38 c/B
STREAM dec | 0.741 ns/B 1286.5 MiB/s 2.37 c/B
AVX2:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.393 ns/B 2428.0 MiB/s 1.26 c/B
STREAM dec | 0.392 ns/B 2433.6 MiB/s 1.25 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'chacha20-ssse3-amd64.S'.
* cipher/chacha20-ssse3-amd64.S: New.
* cipher/chacha20.c (USE_SSSE3): New macro.
[USE_SSSE3] (_gcry_chacha20_amd64_ssse3_blocks): New.
(chacha20_do_setkey): Select SSSE3 implementation if there is HW
support.
* configure.ac [host=x86-64]: Add 'chacha20-ssse3-amd64.lo'.
--
Add SSSE3 optimized implementation for ChaCha20. Based on implementation
by Andrew Moon.
Before (Intel Haswell):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 1.97 ns/B 483.6 MiB/s 6.31 c/B
STREAM dec | 1.97 ns/B 484.0 MiB/s 6.31 c/B
After:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.742 ns/B 1284.8 MiB/s 2.38 c/B
STREAM dec | 0.741 ns/B 1286.5 MiB/s 2.37 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'chacha20.c'.
* cipher/chacha20.c: New.
* cipher/cipher.c (cipher_list): Add ChaCha20.
* configure.ac: Add ChaCha20.
* doc/gcrypt.texi: Add ChaCha20.
* src/cipher.h (_gcry_cipher_spec_chacha20): New.
* src/gcrypt.h.in (GCRY_CIPHER_CHACHA20): Add new algo.
* tests/basic.c (MAX_DATA_LEN): Increase to 128 from 100.
(check_stream_cipher): Add ChaCha20 test-vectors.
(check_ciphers): Add ChaCha20.
--
Patch adds Bernstein's ChaCha20 cipher to libgcrypt. Implementation is based
on public domain implementations.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac: Bump LT version to C21/A1/R0.
--
This is to avoid conflicts with the 1.6 series. Note that if we add a
new interface to 1.6 we would need to bump age again.
|
|
* cipher/Makefile.am: Add 'des-amd64.S'.
* cipher/cipher-selftests.c (_gcry_selftest_helper_cbc)
(_gcry_selftest_helper_cfb, _gcry_selftest_helper_ctr): Handle failures
from 'setkey' function.
* cipher/cipher.c (_gcry_cipher_open_internal) [USE_DES]: Setup bulk
functions for 3DES.
* cipher/des-amd64.S: New file.
* cipher/des.c (USE_AMD64_ASM, ATTR_ALIGNED_16): New macros.
[USE_AMD64_ASM] (_gcry_3des_amd64_crypt_block)
(_gcry_3des_amd64_ctr_enc, _gcry_3des_amd64_cbc_dec)
(_gcry_3des_amd64_cfb_dec): New prototypes.
[USE_AMD64_ASM] (tripledes_ecb_crypt): New function.
(TRIPLEDES_ECB_BURN_STACK): New macro.
(_gcry_3des_ctr_enc, _gcry_3des_cbc_dec, _gcry_3des_cfb_dec)
(bulk_selftest_setkey, selftest_ctr, selftest_cbc, selftest_cfb): New
functions.
(selftest): Add call to CTR, CBC and CFB selftest functions.
(do_tripledes_encrypt, do_tripledes_decrypt): Use
TRIPLEDES_ECB_BURN_STACK.
* configure.ac [host=x86-64]: Add 'des-amd64.lo'.
* src/cipher.h (_gcry_3des_ctr_enc, _gcry_3des_cbc_dec)
(_gcry_3des_cfb_dec): New prototypes.
--
Add non-parallel functions for small speed-up and 3-way parallel functions for
modes of operation that support parallel processing.
Old vs new (Intel Core i5-4570):
================================
enc dec
ECB 1.17x 1.17x
CBC 1.17x 2.51x
CFB 1.16x 2.49x
OFB 1.17x 1.17x
CTR 2.56x 2.56x
Old vs new (Intel Core i5-2450M):
=================================
enc dec
ECB 1.28x 1.28x
CBC 1.27x 2.33x
CFB 1.27x 2.34x
OFB 1.27x 1.27x
CTR 2.36x 2.35x
New (Intel Core i5-4570):
=========================
3DES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 28.39 ns/B 33.60 MiB/s 90.84 c/B
ECB dec | 28.27 ns/B 33.74 MiB/s 90.45 c/B
CBC enc | 29.50 ns/B 32.33 MiB/s 94.40 c/B
CBC dec | 13.35 ns/B 71.45 MiB/s 42.71 c/B
CFB enc | 29.59 ns/B 32.23 MiB/s 94.68 c/B
CFB dec | 13.41 ns/B 71.12 MiB/s 42.91 c/B
OFB enc | 28.90 ns/B 33.00 MiB/s 92.47 c/B
OFB dec | 28.90 ns/B 33.00 MiB/s 92.48 c/B
CTR enc | 13.39 ns/B 71.20 MiB/s 42.86 c/B
CTR dec | 13.39 ns/B 71.21 MiB/s 42.86 c/B
Old (Intel Core i5-4570):
=========================
3DES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 33.24 ns/B 28.69 MiB/s 106.4 c/B
ECB dec | 33.26 ns/B 28.67 MiB/s 106.4 c/B
CBC enc | 34.45 ns/B 27.69 MiB/s 110.2 c/B
CBC dec | 33.45 ns/B 28.51 MiB/s 107.1 c/B
CFB enc | 34.43 ns/B 27.70 MiB/s 110.2 c/B
CFB dec | 33.41 ns/B 28.55 MiB/s 106.9 c/B
OFB enc | 33.79 ns/B 28.22 MiB/s 108.1 c/B
OFB dec | 33.79 ns/B 28.22 MiB/s 108.1 c/B
CTR enc | 34.27 ns/B 27.83 MiB/s 109.7 c/B
CTR dec | 34.27 ns/B 27.83 MiB/s 109.7 c/B
New (Intel Core i5-2450M):
==========================
3DES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 42.21 ns/B 22.59 MiB/s 105.5 c/B
ECB dec | 42.23 ns/B 22.58 MiB/s 105.6 c/B
CBC enc | 43.70 ns/B 21.82 MiB/s 109.2 c/B
CBC dec | 23.25 ns/B 41.02 MiB/s 58.12 c/B
CFB enc | 43.71 ns/B 21.82 MiB/s 109.3 c/B
CFB dec | 23.23 ns/B 41.05 MiB/s 58.08 c/B
OFB enc | 42.73 ns/B 22.32 MiB/s 106.8 c/B
OFB dec | 42.73 ns/B 22.32 MiB/s 106.8 c/B
CTR enc | 23.31 ns/B 40.92 MiB/s 58.27 c/B
CTR dec | 23.35 ns/B 40.84 MiB/s 58.38 c/B
Old (Intel Core i5-2450M):
==========================
3DES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 53.98 ns/B 17.67 MiB/s 134.9 c/B
ECB dec | 54.00 ns/B 17.66 MiB/s 135.0 c/B
CBC enc | 55.43 ns/B 17.20 MiB/s 138.6 c/B
CBC dec | 54.27 ns/B 17.57 MiB/s 135.7 c/B
CFB enc | 55.42 ns/B 17.21 MiB/s 138.6 c/B
CFB dec | 54.35 ns/B 17.55 MiB/s 135.9 c/B
OFB enc | 54.49 ns/B 17.50 MiB/s 136.2 c/B
OFB dec | 54.49 ns/B 17.50 MiB/s 136.2 c/B
CTR enc | 55.02 ns/B 17.33 MiB/s 137.5 c/B
CTR dec | 55.01 ns/B 17.34 MiB/s 137.5 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/md2.c: New.
* cipher/md.c (digest_list): add _gcry_digest_spec_md2.
* tests/basic.c (check_digests): add MD2 test vectors.
* configure.ac (default_digests): disable md2 by default.
--
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Some minor indentation fixes by wk.
|
|
* configure.ac (gcry_cv_cc_arm_arch_is_v6): Use compiler test instead
of preprocessor test.
--
Old test was using C preprocessor to check ARM version macros and missed fact
that using different CFLAGS affect those macros (CFLAGS are not passed to
preprocessor checks).
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac (HAVE_PTHREAD): Do test when building for Windows.
* tests/basic.c: Replace "%zi" by "%z" and a cast to make it work
under Windows.
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* src/global.c (external_lock_test): New.
(_gcry_vcontrol): Call new function with formerly reserved code 61.
* tests/t-common.h: New. Taken from current libgpg-error.
* tests/t-lock.c: New. Based on t-lock.c from libgpg-error.
* configure.ac (HAVE_PTHREAD): Set macro to 1 if defined.
(AC_CHECK_FUNCS): Check for flockfile.
* tests/Makefile.am (tests_bin): Add t-lock.
(noinst_HEADERS): Add t-common.h
(LDADD): Move value to ...
(default_ldadd): new.
(t_lock_LDADD): New.
--
Signed-off-by: Werner Koch <wk@gnupg.org>
(cherry picked from commit fa42c61a84996b6a7574c32233dfd8d9f254d93a)
Resolved conflicts:
* src/ath.c: Remove as it is no longer used in 1.7.
* tests/Makefile.am: Merge.
Changes:
* src/global.c (external_lock_test): Use the gpgrt function
for locking.
Changed subject because here we are only adding the test case.
|
|
* mpi/config.links (mpi_cpu_arch): Always set for ARM. Set for HPPA.
Set to "undefined" for unknown platforms.
(try_asm_modules): Act upon only after having detected the CPU.
* configure.ac: Move the call to config.links before the platform
specific compiler checks. Check platform specific features only if
the platform is targeted.
--
There is no need to check x86 options if we are targeting ARM and vice
versa. This may only introduce build problems. With this patch the
summary output at the end of the configure run also shows more reasonable
messages.
Signed-off-by: Werner Koch <wk@gnupg.org>
(cherry picked from commit 04d478d9b0f92d80105ddaf2c011f40ae8260cfb)
|
|
* configure.ac: Check size of uint64_t and the UINT64_C macro.
--
configure.ac used $ac_cv_sizeof_uint64_t but never set this variable.
Due to the availability of long long on all platforms supporting
uint64_t this was not a real problem. Found while removing the
corresponding test from gnupg.
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* configure.ac (NEED_GPG_ERROR_VERSION): Require 1.13.
(gl_LOCK): Remove.
* src/ath.c, src/ath.h: Remove. Remove from all files. Replace all
mutexes by gpgrt based statically initialized locks.
* src/global.c (global_init): Remove ath_init.
(_gcry_vcontrol): Make ath install a dummy function.
(print_config): Remove threads info line.
* doc/gcrypt.texi: Simplify the multi-thread related documentation.
--
The current code works only on ELF systems with weak symbol
support. In particular no locks were used under Windows. With the
new gpgrt_lock functions from the soon to be released libgpg-error
1.13 we have a more portable scheme which also allows for statically
initialized mutexes.
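The benefit of a statically initialized lock can be illustrated with the
POSIX analogue. Note that this sketch uses pthread, not the gpgrt_lock
API, and that the names counter_lock and next_id are hypothetical:

```c
#include <pthread.h>

/* Initialized at load time: there is no init call that could race with
   the first user or be forgotten. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long counter;

/* Safe to call from any thread without prior setup. */
static unsigned long next_id(void)
{
  unsigned long id;

  pthread_mutex_lock(&counter_lock);
  id = ++counter;
  pthread_mutex_unlock(&counter_lock);
  return id;
}
```

A library can protect its internal state this way without requiring the
application to register thread callbacks before first use.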
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* configure.ac (gcry_cv_gcc_as_const_division_ok): Correct variable
name mismatch at '-Wa,--divide' workaround check.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac (gcry_cv_gcc_as_const_division_ok): Add new check for
constant division in assembly and test for "-Wa,--divide" workaround.
(gcry_cv_gcc_amd64_platform_as_ok): Also check for constant division.
--
Apparently on Solaris/x86 the '/' character is treated as the beginning of a
line comment by GNU as. This causes problems when compiling the SHA-1 SSSE3
implementation:
On 02.01.2014 16:26, Richard PALO wrote:
>> COLLECT_GCC_OPTIONS='-D' 'HAVE_CONFIG_H' '-I' '.' '-I' '..' '-I' '../src' '-I' '/var/tmp/pkgsrc/security/libgcrypt/work/.buildlink/include' '-I' '/var/tmp/pkgsrc/security/libgcrypt/work/.buildlink/include/gettext' '-D' '_REENTRANT' '-O2' '-MT' 'sha1-ssse3-amd64.lo' '-MD' '-MP' '-MF' '.deps/sha1-ssse3-amd64.Tpo' '-c' '-fPIC' '-D' 'PIC' '-o' '.libs/sha1-ssse3-amd64.o' '-v' '-mtune=generic' '-march=x86-64'
>> /usr/gnu/bin/as -v -I . -I .. -I ../src -I /var/tmp/pkgsrc/security/libgcrypt/work/.buildlink/include -I /var/tmp/pkgsrc/security/libgcrypt/work/.buildlink/include/gettext -V -Qy -s --64 -o .libs/sha1-ssse3-amd64.o /var/tmp//ccAxWPXX.s
>> GNU assembler version 2.23.1 (i386-pc-solaris2.11) using BFD version (GNU Binutils) 2.23.1
>> /var/tmp//ccAxWPXX.s: Assembler messages:
>> /var/tmp//ccAxWPXX.s:34: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:38: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:42: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:46: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:54: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:58: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:62: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:66: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:70: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:74: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:78: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:82: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:86: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:90: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:94: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:98: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:102: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:106: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:110: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:114: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:119: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:123: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:127: Error: unbalanced parenthesis in operand 1.
>> /var/tmp//ccAxWPXX.s:132: Error: unbalanced parenthesis in operand 1.
>
>
> apparently the paddd code, such as
> `paddd (.LK_XMM + ((i)/20)*16) RIP, tmp0;`
> isn't digested well, appended is the generated assembler code.
On 02.01.2014 17:41, Richard PALO wrote:
> Hi again, after finding the following:
> https://sourceware.org/bugzilla/show_bug.cgi?id=4572
>
> I tried using '-Wa,--divide' and that seemed to workaround the problem...
>
> perhaps the code, or at least the Makefile could be adapted accordingly?
Patch adds detection of this feature and attempts to work around the issue by
adding "-Wa,--divide" to CPPFLAGS. If the workaround does not work (old GAS on
Solaris/x86), AMD64 assembly is disabled.
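A probe for this behaviour can be sketched as a tiny piece of inline
assembly whose immediate operand contains a division. This is an
illustration under the assumption of a GCC-compatible compiler, not the
exact test in configure.ac, and the function name is hypothetical:

```c
/* Returns 100/20 as evaluated by the assembler on x86, or by the
   C compiler on other targets. */
static unsigned int asm_const_division(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  unsigned int v;

  /* If the assembler treats '/' as a comment start, it sees only
     "movl $(100" and reports an unbalanced parenthesis. */
  __asm__ ("movl $(100/20), %0" : "=r" (v));
  return v;
#else
  return 100 / 20;
#endif
}
```

When the assembler rejects the operand, the same source can be retried
with "-Wa,--divide", which tells GNU as to treat '/' as the division
operator.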
[v3]:
- Update CPPFLAGS after testing instead of CFLAGS.
Reported-and-tested-by: Richard PALO <richard.palo@free.fr>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* scripts/: Rename to build-aux/.
* compile, config.guess, config.rpath, config.sub
* depcomp, doc/mdate-sh, doc/texinfo.tex
* install-sh, ltmain.sh, missing: Move to build-aux/.
* Makefile.am (EXTRA_DIST): Adjust.
* configure.ac (AC_CONFIG_AUX_DIR): New.
(AM_SILENT_RULES): New.
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* cipher/Makefile.am: Add 'arcfour-amd64.S'.
* cipher/arcfour-amd64.S: New.
* cipher/arcfour.c (USE_AMD64_ASM): New.
[USE_AMD64_ASM] (ARCFOUR_context, _gcry_arcfour_amd64)
(encrypt_stream): New.
* configure.ac [host=x86_64]: Add 'arcfour-amd64.lo'.
--
Patch adds Marc Bevand's public-domain AMD64 assembly implementation of RC4 to
libgcrypt. Original implementation is at:
http://www.zorinaq.com/papers/rc4-amd64.html
Benchmarks on Intel i5-4570 (3200 Mhz):
New:
ARCFOUR | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 1.29 ns/B 737.7 MiB/s 4.14 c/B
STREAM dec | 1.31 ns/B 730.6 MiB/s 4.18 c/B
Old (C-language):
ARCFOUR | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 2.09 ns/B 457.4 MiB/s 6.67 c/B
STREAM dec | 2.09 ns/B 457.2 MiB/s 6.68 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac: Also check for 'xgetbv' instruction in AVX and AVX2
inline assembly checks.
* src/hwf-x86.c [__i386__] (get_xgetbv): New function.
[__x86_64__] (get_xgetbv): New function.
[HAS_X86_CPUID] (detect_x86_gnuc): Check for OSXSAVE and OS support for
XMM&YMM registers and enable AVX/AVX2 only if XMM&YMM registers are
supported by OS.
--
This patch is based on original patch and bug report by Panagiotis Christopoulos:
Adding better detection of AVX/AVX2 support
After upgrading libgcrypt from 1.5.3 to 1.6.0 on a remote XEN system (linode) my
gpg2 stopped working properly, throwing SIGILL signals when doing sha512
operations etc. I managed to debug this with the help of Doublas Freed
(dwfreed at mtu.edu) and it seems that the current AVX detection just checks
bit 28 of the cpuid result, and that check still passes on systems that have
disabled the avx/avx2 instructions for some reason (e.g. performance/instability),
resulting in SIGILLs (e.g. when trying _gcry_sha512_transform_amd64_avx()).
From Intel resources[1][2], I found additional checks for better AVX
detection and applied them in the following patch. Please review/change
accordingly and commit some better AVX detection mechanism. The AVX part is
tested but could not test the AVX2 one, because I lack proper hardware. I can
provide additional information upon request. Use the patch only as a guideline,
as it's not thoroughly tested.
[1] http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled
[2] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf (sections 14.3 and 14.7.1)
Reported-by: Panagiotis Christopoulos (pchrist) <pchrist@gentoo.org>
Cc: Doublas Freed <dwfreed@mtu.edu>
Cc: Tim Harder <radhermit@gentoo.org>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
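The detection order described above (CPUID feature bits first, then XGETBV to confirm that the OS saves XMM and YMM state) can be sketched roughly as follows. This is a minimal illustration and not the code from src/hwf-x86.c: the names `read_xcr0_low` and `os_supports_avx` are hypothetical, and the sketch assumes GCC or Clang with `<cpuid.h>` on x86. On other targets it simply reports no AVX.

```c
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>

/* Hypothetical helper: read the low 32 bits of XCR0 using XGETBV with
   ECX = 0.  It must only be called after CPUID has reported OSXSAVE,
   otherwise XGETBV itself can raise an invalid-opcode fault.  The
   instruction is emitted as raw bytes (0f 01 d0) so that assemblers
   without the mnemonic still accept it. */
static unsigned int read_xcr0_low(void)
{
  unsigned int eax, edx;
  __asm__ volatile (".byte 0x0f, 0x01, 0xd0"
                    : "=a" (eax), "=d" (edx)
                    : "c" (0));
  (void)edx;
  return eax;
}

/* Hypothetical helper: returns 1 only if the CPU advertises AVX and the
   OS has enabled saving of both XMM (XCR0 bit 1) and YMM (XCR0 bit 2)
   register state; returns 0 otherwise. */
static int os_supports_avx(void)
{
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    return 0;
  if (!(ecx & (1u << 27)))      /* OSXSAVE: XGETBV is usable */
    return 0;
  if (!(ecx & (1u << 28)))      /* AVX feature bit */
    return 0;
  return (read_xcr0_low() & 0x6u) == 0x6u;
}
#else
/* Non-x86 or non-GNU build: report no AVX support. */
static int os_supports_avx(void)
{
  return 0;
}
#endif
```

Checking bit 28 alone corresponds to the behaviour in the bug report. The two extra checks are what keep AVX code paths disabled when the hypervisor or OS has turned off YMM state handling.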
|
|
* cipher/Makefile.am: Add 'sha1-armv7-neon.S'.
* cipher/sha1-armv7-neon.S: New.
* cipher/sha1.c (USE_NEON): New.
(SHA1_CONTEXT, sha1_init) [USE_NEON]: Add and initialize 'use_neon'.
[USE_NEON] (_gcry_sha1_transform_armv7_neon): New.
(transform) [USE_NEON]: Use ARM/NEON assembly if enabled.
* configure.ac: Add 'sha1-armv7-neon.lo'.
--
Patch adds ARM/NEON implementation for SHA-1.
Benchmarks show 1.72x improvement on ARM Cortex-A8, 1008 Mhz:
jussi@cubie:~/libgcrypt$ tests/bench-slope --cpu-mhz 1008 hash sha1
Hash:
| nanosecs/byte mebibytes/sec cycles/byte
SHA1 | 7.80 ns/B 122.3 MiB/s 7.86 c/B
=
jussi@cubie:~/libgcrypt$ tests/bench-slope --disable-hwf arm-neon --cpu-mhz 1008 hash sha1
Hash:
| nanosecs/byte mebibytes/sec cycles/byte
SHA1 | 13.41 ns/B 71.10 MiB/s 13.52 c/B
=
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
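The three columns in the output above are related by simple unit conversions. The sketch below shows those conversions under the assumption that cycles/byte is derived from nanosecs/byte and the `--cpu-mhz` value; the function names are hypothetical and are not taken from tests/bench-slope.

```c
/* Convert nanoseconds per byte into cycles per byte for a CPU running
   at cpu_mhz megahertz (1 MHz = 1000 cycles per microsecond). */
static double cycles_per_byte(double ns_per_byte, double cpu_mhz)
{
  return ns_per_byte * cpu_mhz / 1000.0;
}

/* Convert nanoseconds per byte into mebibytes per second. */
static double mib_per_sec(double ns_per_byte)
{
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Speedup of the new implementation relative to the old one, both given
   in the same unit (for example cycles per byte). */
static double speedup(double old_cost, double new_cost)
{
  return old_cost / new_cost;
}
```

With the figures from this commit, 7.80 ns/B at 1008 MHz comes out at about 7.86 c/B and about 122.3 MiB/s, and 13.52 c/B against 7.86 c/B gives the quoted 1.72x.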
|
|
* LICENSES: Add 'cipher/sha256-avx-amd64.S' and
'cipher/sha256-avx2-bmi2-amd64.S'.
* cipher/Makefile.am: Add 'sha256-avx-amd64.S' and
'sha256-avx2-bmi2-amd64.S'.
* cipher/sha256-avx-amd64.S: New.
* cipher/sha256-avx2-bmi2-amd64.S: New.
* cipher/sha256-ssse3-amd64.S: Use 'lea' instead of 'add' in few
places for tiny speed improvement.
* cipher/sha256.c (USE_AVX, USE_AVX2): New.
(SHA256_CONTEXT) [USE_AVX, USE_AVX2]: Add 'use_avx' and 'use_avx2'.
(sha256_init, sha224_init) [USE_AVX, USE_AVX2]: Initialize above
new context members.
[USE_AVX] (_gcry_sha256_transform_amd64_avx): New.
[USE_AVX2] (_gcry_sha256_transform_amd64_avx2): New.
(transform) [USE_AVX2]: Use AVX2 assembly if enabled.
(transform) [USE_AVX]: Use AVX assembly if enabled.
* configure.ac: Add 'sha256-avx-amd64.lo' and
'sha256-avx2-bmi2-amd64.lo'.
--
Patch adds fast AVX and AVX2/BMI2 implementations of SHA-256 by Intel
Corporation. The assembly source is licensed under 3-clause BSD license,
thus compatible with LGPL2.1+. Original source can be accessed at:
http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs
Implementation is described in white paper
"Fast SHA-256 Implementations on Intel® Architecture Processors"
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html
Note: AVX implementation uses SHLD instruction to emulate RORQ, since it's
faster on Intel Sandy-Bridge. However, on non-Intel CPUs SHLD is much
slower than RORQ, so the AVX implementation is (for now) limited
to Intel CPUs.
Note: AVX2 implementation also uses BMI2 instruction rorx, thus additional
HWF flag.
Benchmarks:
cpu               C-lang      SSSE3       AVX/AVX2    AVX/AVX2 vs C   AVX/AVX2 vs SSSE3
Intel i5-4570     13.86 c/B   10.27 c/B   8.70 c/B    1.59x           1.18x
Intel i5-2450M    17.25 c/B   12.36 c/B   10.31 c/B   1.67x           1.19x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'sha1-avx-amd64.S' and
'sha1-avx-bmi2-amd64.S'.
* cipher/sha1-avx-amd64.S: New.
* cipher/sha1-avx-bmi2-amd64.S: New.
* cipher/sha1.c (USE_AVX, USE_BMI2): New.
(SHA1_CONTEXT) [USE_AVX]: Add 'use_avx'.
(SHA1_CONTEXT) [USE_BMI2]: Add 'use_bmi2'.
(sha1_init): Initialize 'use_avx' and 'use_bmi2'.
[USE_AVX] (_gcry_sha1_transform_amd64_avx): New.
[USE_BMI2] (_gcry_sha1_transform_amd64_bmi2): New.
(transform) [USE_BMI2]: Use BMI2 assembly if enabled.
(transform) [USE_AVX]: Use AVX assembly if enabled.
* configure.ac: Add 'sha1-avx-amd64.lo' and 'sha1-avx-bmi2-amd64.lo'.
--
Patch adds AVX (for Sandybridge and Ivybridge) and AVX/BMI2 (for Haswell)
optimized implementations of SHA-1.
Note: AVX implementation is currently limited to Intel CPUs due to use
of SHLD instruction for faster rotations on Sandybridge.
Benchmarks:
cpu C-version SSSE3 AVX/(SHLD|BMI2) New vs C New vs SSSE3
Intel i5-4570 8.84 c/B 4.61 c/B 3.86 c/B 2.29x 1.19x
Intel i5-2450M 9.45 c/B 5.30 c/B 4.39 c/B 2.15x 1.20x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* configure.ac: Add option --enable-large-data-tests.
* tests/hashtest-256g.in: New.
* tests/Makefile.am (EXTRA_DIST): Add hashtest-256g.in.
(TESTS): Split up into tests_bin, tests_bin_last, tests_sh, and
tests_sh_last.
(tests_sh_last): Add hashtest-256g
(noinst_PROGRAMS): Add only tests_bin and tests_bin_last.
(bench-slope.log, hashtest-256g.log): New rules to enforce serial run.
Signed-off-by: Werner Koch <wk@gnupg.org>
|
|
* cipher/Makefile.am: Add 'sha1-ssse3-amd64.c'.
* cipher/sha1-ssse3-amd64.c: New.
* cipher/sha1.c (USE_SSSE3): New.
(SHA1_CONTEXT) [USE_SSSE3]: Add 'use_ssse3'.
(sha1_init) [USE_SSSE3]: Initialize 'use_ssse3'.
(transform): Rename to...
(_transform): this.
(transform): New.
* configure.ac [host=x86_64]: Add 'sha1-ssse3-amd64.lo'.
--
Patch adds SSSE3 implementation based on white paper "Improving the Performance
of the Secure Hash Algorithm (SHA-1)" at
http://software.intel.com/en-us/articles/improving-the-performance-of-the-secure-hash-algorithm-1
Benchmarks:
cpu Old New Diff
Intel i5-4570 9.02 c/B 5.22 c/B 1.72x
Intel i5-2450M 12.27 c/B 7.24 c/B 1.69x
Intel Core2 T8100 7.94 c/B 6.76 c/B 1.17x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
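The rename of transform to _transform plus a new transform wrapper follows a common runtime-dispatch pattern: a flag is set once at init time from the detected hardware features, and the wrapper picks the implementation per call. The sketch below illustrates that pattern only. All names (`demo_ctx`, `demo_init`, `demo_transform` and the two stub implementations) are hypothetical stand-ins, not the functions in cipher/sha1.c.

```c
/* Hypothetical hash context: one word of stand-in state plus the flag
   that records which implementation to use. */
typedef struct
{
  unsigned int h0;
  int use_ssse3;
} demo_ctx;

/* Stand-in for the generic C implementation; returns 1 so callers can
   see which path ran. */
static int demo_transform_generic(demo_ctx *ctx, const unsigned char *data)
{
  ctx->h0 += data[0];
  return 1;
}

/* Stand-in for the SSSE3 implementation; returns 2.  It updates the
   state the same way, since both paths must produce identical results. */
static int demo_transform_ssse3(demo_ctx *ctx, const unsigned char *data)
{
  ctx->h0 += data[0];
  return 2;
}

/* Set the flag once, from a feature bit detected elsewhere. */
static void demo_init(demo_ctx *ctx, int hw_has_ssse3)
{
  ctx->h0 = 0;
  ctx->use_ssse3 = hw_has_ssse3 ? 1 : 0;
}

/* Wrapper that dispatches to the fastest enabled implementation. */
static int demo_transform(demo_ctx *ctx, const unsigned char *data)
{
  if (ctx->use_ssse3)
    return demo_transform_ssse3(ctx, data);
  return demo_transform_generic(ctx, data);
}
```

Keeping the decision in the context, rather than re-detecting features on every block, keeps the per-call overhead to a single branch.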
|
|
* configure.ac (gcry_cv_gcc_inline_asm_avx2): Add "cc" as assembly
clobber.
--
Apparently empty clobber lists only work in some cases on Linux, and fail on
mingw32.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
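For readers unfamiliar with the syntax, the clobber list is the part of a GCC extended asm statement after the third colon. The sketch below shows a statement with "cc" listed there. It is an illustration only and is not the test program used in configure.ac; the function name `add_with_asm` is hypothetical, and a plain C fallback is used on non-x86 targets.

```c
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Add b to a with an x86 "addl" instruction.  The instruction modifies
   the flags register, so "cc" is named in the clobber list, which also
   keeps that list non-empty. */
static unsigned int add_with_asm(unsigned int a, unsigned int b)
{
  __asm__ ("addl %1, %0"
           : "+r" (a)      /* output, also read as input */
           : "r" (b)       /* input */
           : "cc");        /* clobbers: condition codes */
  return a;
}
#else
/* Fallback for other targets or compilers: plain C addition. */
static unsigned int add_with_asm(unsigned int a, unsigned int b)
{
  return a + b;
}
#endif
```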
|
|
* cipher/Makefile.am: Add 'sha512-avx-amd64.S' and
'sha512-avx2-bmi2-amd64.S'.
* cipher/sha512-avx-amd64.S: New.
* cipher/sha512-avx2-bmi2-amd64.S: New.
* cipher/sha512.c (USE_AVX, USE_AVX2): New.
(SHA512_CONTEXT) [USE_AVX]: Add 'use_avx'.
(SHA512_CONTEXT) [USE_AVX2]: Add 'use_avx2'.
(sha512_init, sha384_init) [USE_AVX]: Initialize 'use_avx'.
(sha512_init, sha384_init) [USE_AVX2]: Initialize 'use_avx2'.
[USE_AVX] (_gcry_sha512_transform_amd64_avx): New.
[USE_AVX2] (_gcry_sha512_transform_amd64_avx2): New.
(transform) [USE_AVX2]: Add call for AVX2 implementation.
(transform) [USE_AVX]: Add call for AVX implementation.
* configure.ac (HAVE_GCC_INLINE_ASM_BMI2): New check.
(sha512): Add 'sha512-avx-amd64.lo' and 'sha512-avx2-bmi2-amd64.lo'.
* doc/gcrypt.texi: Document 'intel-cpu' and 'intel-bmi2'.
* src/g10lib.h (HWF_INTEL_CPU, HWF_INTEL_BMI2): New.
* src/hwfeatures.c (hwflist): Add "intel-cpu" and "intel-bmi2".
* src/hwf-x86.c (detect_x86_gnuc): Check for HWF_INTEL_CPU and
HWF_INTEL_BMI2.
--
Patch adds fast AVX and AVX2 implementation of SHA-512 by Intel Corporation.
The assembly source is licensed under 3-clause BSD license, thus compatible
with LGPL2.1+. Original source can be accessed at:
http://www.intel.com/p/en_US/embedded/hwsw/technology/packet-processing#docs
Implementation is described in white paper
"Fast SHA512 Implementations on Intel® Architecture Processors"
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/fast-sha512-implementat$
Note: AVX implementation uses SHLD instruction to emulate RORQ, since it's
faster on Intel Sandy-Bridge. However, on non-Intel CPUs SHLD is much
slower than RORQ, so the AVX implementation is (for now) limited
to Intel CPUs.
Note: AVX2 implementation also uses BMI2 instruction rorx, thus additional
HWF flag.
Benchmarks:
cpu               Old         SSSE3       AVX/AVX2    AVX/AVX2 vs Old   AVX/AVX2 vs SSSE3
Intel i5-4570     10.11 c/B   7.56 c/B    6.72 c/B    1.50x             1.12x
Intel i5-2450M    14.11 c/B   10.53 c/B   8.88 c/B    1.58x             1.18x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
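The SHLD note relies on the fact that a double-precision left shift with the same register as both operands is a left rotation, and a left rotation by 64 - n equals a right rotation by n. The sketch below models this in portable C. The helper names are hypothetical, and `shld64` is only a software model of the instruction's result for shift counts 1 to 63, not the assembly used in the patch.

```c
#include <stdint.h>

/* Reference right rotation of a 64-bit word by n bits, 0 < n < 64. */
static uint64_t ror64(uint64_t x, unsigned int n)
{
  return (x >> n) | (x << (64 - n));
}

/* Software model of "shld cnt, src, dst" for 0 < cnt < 64: dst is
   shifted left by cnt and the vacated low bits are filled with the
   most significant bits of src. */
static uint64_t shld64(uint64_t dst, uint64_t src, unsigned int cnt)
{
  return (dst << cnt) | (src >> (64 - cnt));
}

/* Right rotation by n expressed as SHLD with dst == src and a count
   of 64 - n, which is the emulation the note describes. */
static uint64_t ror64_via_shld(uint64_t x, unsigned int n)
{
  return shld64(x, x, 64 - n);
}
```

Both forms compute the same value, so the choice between them is purely a performance decision, which is why the AVX code path is tied to Intel CPUs.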
|