peter/libgcrypt - libgcrypt source repository for Peter

Age	Commit message (Collapse)	Author	Files	Lines
2015-11-18	Tweak Keccak for small speed-up	Jussi Kivilinna	1	-24/+20
	* cipher/keccak_permute_32.h (KECCAK_F1600_PERMUTE_FUNC_NAME): Track rounds with round constant pointer instead of separate round counter. * cipher/keccak_permute_64.h (KECCAK_F1600_PERMUTE_FUNC_NAME): Ditto. (KECCAK_F1600_ABSORB_FUNC_NAME): Tweak lanes pointer increment for bulk absorb loops. -- Patch makes small tweaks to improve performance. Benchmark on Intel Haswell @ 3.2 Ghz: Before: \| nanosecs/byte mebibytes/sec cycles/byte SHAKE128 \| 2.27 ns/B 420.5 MiB/s 7.26 c/B SHAKE256 \| 2.79 ns/B 341.4 MiB/s 8.94 c/B SHA3-224 \| 2.64 ns/B 361.7 MiB/s 8.44 c/B SHA3-256 \| 2.79 ns/B 341.4 MiB/s 8.94 c/B SHA3-384 \| 3.65 ns/B 261.3 MiB/s 11.68 c/B SHA3-512 \| 5.27 ns/B 181.0 MiB/s 16.86 c/B After: \| nanosecs/byte mebibytes/sec cycles/byte SHAKE128 \| 2.25 ns/B 423.5 MiB/s 7.21 c/B SHAKE256 \| 2.77 ns/B 343.9 MiB/s 8.88 c/B SHA3-224 \| 2.62 ns/B 364.1 MiB/s 8.38 c/B SHA3-256 \| 2.77 ns/B 343.8 MiB/s 8.88 c/B SHA3-384 \| 3.63 ns/B 262.6 MiB/s 11.63 c/B SHA3-512 \| 5.23 ns/B 182.3 MiB/s 16.75 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-11-01	Add ARMv7/NEON implementation of Keccak	Jussi Kivilinna	1	-1/+1
	* cipher/Makefile.am: Add 'keccak-armv7-neon.S'. * cipher/keccak-armv7-neon.S: New. * cipher/keccak.c (USE_64BIT_ARM_NEON): New. (NEED_COMMON64): Select if USE_64BIT_ARM_NEON. [NEED_COMMON64] (round_consts_64bit): Rename to... [NEED_COMMON64] (_gcry_keccak_round_consts_64bit): ...this; Add terminator at end. [USE_64BIT_ARM_NEON] (_gcry_keccak_permute_armv7_neon) (_gcry_keccak_absorb_lanes64_armv7_neon, keccak_permute64_armv7_neon) (keccak_absorb_lanes64_armv7_neon, keccak_armv7_neon_64_ops): New. (keccak_init) [USE_64BIT_ARM_NEON]: Select ARM/NEON implementation if supported by HW. * cipher/keccak_permute_64.h (KECCAK_F1600_PERMUTE_FUNC_NAME): Update to use new round constant table. * configure.ac: Add 'keccak-armv7-neon.lo'. -- Patch adds ARMv7/NEON implementation of Keccak (SHAKE/SHA3). Patch is based on public-domain implementation by Ronny Van Keer from SUPERCOP package: https://github.com/floodyberry/supercop/blob/master/crypto_hash/\ keccakc1024/inplace-armv7a-neon/keccak2.s Benchmark results on Cortex-A8 @ 1008 Mhz: Before (generic 32-bit bit-interleaved impl.): \| nanosecs/byte mebibytes/sec cycles/byte SHAKE128 \| 83.00 ns/B 11.49 MiB/s 83.67 c/B SHAKE256 \| 101.7 ns/B 9.38 MiB/s 102.5 c/B SHA3-224 \| 96.13 ns/B 9.92 MiB/s 96.90 c/B SHA3-256 \| 101.5 ns/B 9.40 MiB/s 102.3 c/B SHA3-384 \| 131.4 ns/B 7.26 MiB/s 132.5 c/B SHA3-512 \| 189.1 ns/B 5.04 MiB/s 190.6 c/B After (ARM/NEON, ~3.2x faster): \| nanosecs/byte mebibytes/sec cycles/byte SHAKE128 \| 25.09 ns/B 38.01 MiB/s 25.29 c/B SHAKE256 \| 30.95 ns/B 30.82 MiB/s 31.19 c/B SHA3-224 \| 29.24 ns/B 32.61 MiB/s 29.48 c/B SHA3-256 \| 30.95 ns/B 30.82 MiB/s 31.19 c/B SHA3-384 \| 40.42 ns/B 23.59 MiB/s 40.74 c/B SHA3-512 \| 58.37 ns/B 16.34 MiB/s 58.84 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-11-01	Optimize Keccak 64-bit absorb functions	Jussi Kivilinna	1	-0/+99
	* cipher/keccak.c [USE_64BIT] [__x86_64__] (absorb_lanes64_8) (absorb_lanes64_4, absorb_lanes64_2, absorb_lanes64_1): New. * cipher/keccak.c [USE_64BIT] [!__x86_64__] (absorb_lanes64_8) (absorb_lanes64_4, absorb_lanes64_2, absorb_lanes64_1): New. [USE_64BIT] (KECCAK_F1600_ABSORB_FUNC_NAME): New. [USE_64BIT] (keccak_absorb_lanes64): Remove. [USE_64BIT_SHLD] (KECCAK_F1600_ABSORB_FUNC_NAME): New. [USE_64BIT_SHLD] (keccak_absorb_lanes64_shld): Remove. [USE_64BIT_BMI2] (KECCAK_F1600_ABSORB_FUNC_NAME): New. [USE_64BIT_BMI2] (keccak_absorb_lanes64_bmi2): Remove. * cipher/keccak_permute_64.h (KECCAK_F1600_ABSORB_FUNC_NAME): New. -- Optimize 64-bit absorb functions for small speed-up. After this change, 64-bit BMI2 implementation matches speed of fastest results from SUPERCOP for Intel Haswell CPUs (long messages). Benchmark on Intel Haswell @ 3.2 Ghz: Before: \| nanosecs/byte mebibytes/sec cycles/byte SHAKE128 \| 2.32 ns/B 411.7 MiB/s 7.41 c/B SHAKE256 \| 2.84 ns/B 336.2 MiB/s 9.08 c/B SHA3-224 \| 2.69 ns/B 354.9 MiB/s 8.60 c/B SHA3-256 \| 2.84 ns/B 336.0 MiB/s 9.08 c/B SHA3-384 \| 3.69 ns/B 258.4 MiB/s 11.81 c/B SHA3-512 \| 5.30 ns/B 179.9 MiB/s 16.97 c/B After: \| nanosecs/byte mebibytes/sec cycles/byte SHAKE128 \| 2.27 ns/B 420.6 MiB/s 7.26 c/B SHAKE256 \| 2.79 ns/B 341.4 MiB/s 8.94 c/B SHA3-224 \| 2.64 ns/B 361.7 MiB/s 8.44 c/B SHA3-256 \| 2.79 ns/B 341.5 MiB/s 8.94 c/B SHA3-384 \| 3.65 ns/B 261.4 MiB/s 11.68 c/B SHA3-512 \| 5.27 ns/B 181.0 MiB/s 16.87 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-10-28	keccak: rewrite for improved performance	Jussi Kivilinna	1	-0/+290
	* cipher/Makefile.am: Add 'keccak_permute_32.h' and 'keccak_permute_64.h'. * cipher/hash-common.h [USE_SHA3] (MD_BLOCK_MAX_BLOCKSIZE): Remove. * cipher/keccak.c (USE_64BIT, USE_32BIT, USE_64BIT_BMI2) (USE_64BIT_SHLD, USE_32BIT_BMI2, NEED_COMMON64, NEED_COMMON32BI) (keccak_ops_t): New. (KECCAK_STATE): Add 'state64' and 'state32bi' members. (KECCAK_CONTEXT): Remove 'bctx'; add 'blocksize', 'count' and 'ops'. (rol64, keccak_f1600_state_permute): Remove. [NEED_COMMON64] (round_consts_64bit, keccak_extract_inplace64): New. [NEED_COMMON32BI] (round_consts_32bit, keccak_extract_inplace32bi) (keccak_absorb_lane32bi): New. [USE_64BIT] (ANDN64, ROL64, keccak_f1600_state_permute64) (keccak_absorb_lanes64, keccak_generic64_ops): New. [USE_64BIT_SHLD] (ANDN64, ROL64, keccak_f1600_state_permute64_shld) (keccak_absorb_lanes64_shld, keccak_shld_64_ops): New. [USE_64BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute64_bmi2) (keccak_absorb_lanes64_bmi2, keccak_bmi2_64_ops): New. [USE_32BIT] (ANDN64, ROL64, keccak_f1600_state_permute32bi) (keccak_absorb_lanes32bi, keccak_generic32bi_ops): New. [USE_32BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute32bi_bmi2) (pext, pdep, keccak_absorb_lane32bi_bmi2, keccak_absorb_lanes32bi_bmi2) (keccak_extract_inplace32bi_bmi2, keccak_bmi2_32bi_ops): New. (keccak_write): New. (keccak_init): Adjust to KECCAK_CONTEXT changes; add implementation selection based on HWF features. (keccak_final): Adjust to KECCAK_CONTEXT changes; use selected 'ops' for state manipulation. (keccak_read): Adjust to KECCAK_CONTEXT changes. (_gcry_digest_spec_sha3_224, _gcry_digest_spec_sha3_256) (_gcry_digest_spec_sha3_348, _gcry_digest_spec_sha3_512): Use 'keccak_write' instead of '_gcry_md_block_write'. * cipher/keccak_permute_32.h: New. * cipher/keccak_permute_64.h: New. -- Patch adds new generic 64-bit and 32-bit implementations and optimized implementations for SHA3: - Generic 64-bit implementation based on 'simple' implementation from SUPERCOP package. - Generic 32-bit bit-inteleaved implementataion based on 'simple32bi' implementation from SUPERCOP package. - Intel BMI2 optimized variants of 64-bit and 32-bit BI implementations. - Intel SHLD optimized variant of 64-bit implementation. Patch also makes proper use of sponge construction to avoid use of addition input buffer. Below are bench-slope benchmarks for new 64-bit implementations made on Intel Core i5-4570 (no turbo, 3.2 Ghz, gcc-4.9.2). Before (amd64): SHA3-224 \| 3.92 ns/B 243.2 MiB/s 12.55 c/B SHA3-256 \| 4.15 ns/B 230.0 MiB/s 13.27 c/B SHA3-384 \| 5.40 ns/B 176.6 MiB/s 17.29 c/B SHA3-512 \| 7.77 ns/B 122.7 MiB/s 24.87 c/B After (generic 64-bit, amd64), 1.10x faster): SHA3-224 \| 3.57 ns/B 267.4 MiB/s 11.42 c/B SHA3-256 \| 3.77 ns/B 252.8 MiB/s 12.07 c/B SHA3-384 \| 4.91 ns/B 194.1 MiB/s 15.72 c/B SHA3-512 \| 7.06 ns/B 135.0 MiB/s 22.61 c/B After (Intel SHLD 64-bit, amd64, 1.13x faster): SHA3-224 \| 3.48 ns/B 273.7 MiB/s 11.15 c/B SHA3-256 \| 3.68 ns/B 258.9 MiB/s 11.79 c/B SHA3-384 \| 4.80 ns/B 198.7 MiB/s 15.36 c/B SHA3-512 \| 6.89 ns/B 138.4 MiB/s 22.05 c/B After (Intel BMI2 64-bit, amd64, 1.45x faster): SHA3-224 \| 2.71 ns/B 352.1 MiB/s 8.67 c/B SHA3-256 \| 2.86 ns/B 333.2 MiB/s 9.16 c/B SHA3-384 \| 3.72 ns/B 256.2 MiB/s 11.91 c/B SHA3-512 \| 5.34 ns/B 178.5 MiB/s 17.10 c/B Benchmarks of new 32-bit implementations on Intel Core i5-4570 (no turbo, 3.2 Ghz, gcc-4.9.2): Before (win32): SHA3-224 \| 12.05 ns/B 79.16 MiB/s 38.56 c/B SHA3-256 \| 12.75 ns/B 74.78 MiB/s 40.82 c/B SHA3-384 \| 16.63 ns/B 57.36 MiB/s 53.22 c/B SHA3-512 \| 23.97 ns/B 39.79 MiB/s 76.72 c/B After (generic 32-bit BI, win32, 1.23x to 1.29x faster): SHA3-224 \| 9.76 ns/B 97.69 MiB/s 31.25 c/B SHA3-256 \| 10.27 ns/B 92.82 MiB/s 32.89 c/B SHA3-384 \| 13.22 ns/B 72.16 MiB/s 42.31 c/B SHA3-512 \| 18.65 ns/B 51.13 MiB/s 59.70 c/B After (Intel BMI2 32-bit BI, win32, 1.66x to 1.70x faster): SHA3-224 \| 7.26 ns/B 131.4 MiB/s 23.23 c/B SHA3-256 \| 7.65 ns/B 124.7 MiB/s 24.47 c/B SHA3-384 \| 9.87 ns/B 96.67 MiB/s 31.58 c/B SHA3-512 \| 14.05 ns/B 67.85 MiB/s 44.99 c/B Benchmarks of new 32-bit implementation on ARM Cortex-A8 (1008 Mhz, gcc-4.9.1): Before: SHA3-224 \| 148.6 ns/B 6.42 MiB/s 149.8 c/B SHA3-256 \| 157.2 ns/B 6.07 MiB/s 158.4 c/B SHA3-384 \| 205.3 ns/B 4.65 MiB/s 206.9 c/B SHA3-512 \| 296.3 ns/B 3.22 MiB/s 298.6 c/B After (1.56x faster): SHA3-224 \| 96.12 ns/B 9.92 MiB/s 96.89 c/B SHA3-256 \| 101.5 ns/B 9.40 MiB/s 102.3 c/B SHA3-384 \| 131.4 ns/B 7.26 MiB/s 132.5 c/B SHA3-512 \| 188.2 ns/B 5.07 MiB/s 189.7 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* cipher/keccak_permute_32.h (KECCAK_F1600_PERMUTE_FUNC_NAME): Track rounds with round constant pointer instead of separate round counter. * cipher/keccak_permute_64.h (KECCAK_F1600_PERMUTE_FUNC_NAME): Ditto. (KECCAK_F1600_ABSORB_FUNC_NAME): Tweak lanes pointer increment for bulk absorb loops. -- Patch makes small tweaks to improve performance. Benchmark on Intel Haswell @ 3.2 Ghz: Before: | nanosecs/byte mebibytes/sec cycles/byte SHAKE128 | 2.27 ns/B 420.5 MiB/s 7.26 c/B SHAKE256 | 2.79 ns/B 341.4 MiB/s 8.94 c/B SHA3-224 | 2.64 ns/B 361.7 MiB/s 8.44 c/B SHA3-256 | 2.79 ns/B 341.4 MiB/s 8.94 c/B SHA3-384 | 3.65 ns/B 261.3 MiB/s 11.68 c/B SHA3-512 | 5.27 ns/B 181.0 MiB/s 16.86 c/B After: | nanosecs/byte mebibytes/sec cycles/byte SHAKE128 | 2.25 ns/B 423.5 MiB/s 7.21 c/B SHAKE256 | 2.77 ns/B 343.9 MiB/s 8.88 c/B SHA3-224 | 2.62 ns/B 364.1 MiB/s 8.38 c/B SHA3-256 | 2.77 ns/B 343.8 MiB/s 8.88 c/B SHA3-384 | 3.63 ns/B 262.6 MiB/s 11.63 c/B SHA3-512 | 5.23 ns/B 182.3 MiB/s 16.75 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* cipher/Makefile.am: Add 'keccak-armv7-neon.S'. * cipher/keccak-armv7-neon.S: New. * cipher/keccak.c (USE_64BIT_ARM_NEON): New. (NEED_COMMON64): Select if USE_64BIT_ARM_NEON. [NEED_COMMON64] (round_consts_64bit): Rename to... [NEED_COMMON64] (_gcry_keccak_round_consts_64bit): ...this; Add terminator at end. [USE_64BIT_ARM_NEON] (_gcry_keccak_permute_armv7_neon) (_gcry_keccak_absorb_lanes64_armv7_neon, keccak_permute64_armv7_neon) (keccak_absorb_lanes64_armv7_neon, keccak_armv7_neon_64_ops): New. (keccak_init) [USE_64BIT_ARM_NEON]: Select ARM/NEON implementation if supported by HW. * cipher/keccak_permute_64.h (KECCAK_F1600_PERMUTE_FUNC_NAME): Update to use new round constant table. * configure.ac: Add 'keccak-armv7-neon.lo'. -- Patch adds ARMv7/NEON implementation of Keccak (SHAKE/SHA3). Patch is based on public-domain implementation by Ronny Van Keer from SUPERCOP package: https://github.com/floodyberry/supercop/blob/master/crypto_hash/\ keccakc1024/inplace-armv7a-neon/keccak2.s Benchmark results on Cortex-A8 @ 1008 Mhz: Before (generic 32-bit bit-interleaved impl.): | nanosecs/byte mebibytes/sec cycles/byte SHAKE128 | 83.00 ns/B 11.49 MiB/s 83.67 c/B SHAKE256 | 101.7 ns/B 9.38 MiB/s 102.5 c/B SHA3-224 | 96.13 ns/B 9.92 MiB/s 96.90 c/B SHA3-256 | 101.5 ns/B 9.40 MiB/s 102.3 c/B SHA3-384 | 131.4 ns/B 7.26 MiB/s 132.5 c/B SHA3-512 | 189.1 ns/B 5.04 MiB/s 190.6 c/B After (ARM/NEON, ~3.2x faster): | nanosecs/byte mebibytes/sec cycles/byte SHAKE128 | 25.09 ns/B 38.01 MiB/s 25.29 c/B SHAKE256 | 30.95 ns/B 30.82 MiB/s 31.19 c/B SHA3-224 | 29.24 ns/B 32.61 MiB/s 29.48 c/B SHA3-256 | 30.95 ns/B 30.82 MiB/s 31.19 c/B SHA3-384 | 40.42 ns/B 23.59 MiB/s 40.74 c/B SHA3-512 | 58.37 ns/B 16.34 MiB/s 58.84 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* cipher/keccak.c [USE_64BIT] [__x86_64__] (absorb_lanes64_8) (absorb_lanes64_4, absorb_lanes64_2, absorb_lanes64_1): New. * cipher/keccak.c [USE_64BIT] [!__x86_64__] (absorb_lanes64_8) (absorb_lanes64_4, absorb_lanes64_2, absorb_lanes64_1): New. [USE_64BIT] (KECCAK_F1600_ABSORB_FUNC_NAME): New. [USE_64BIT] (keccak_absorb_lanes64): Remove. [USE_64BIT_SHLD] (KECCAK_F1600_ABSORB_FUNC_NAME): New. [USE_64BIT_SHLD] (keccak_absorb_lanes64_shld): Remove. [USE_64BIT_BMI2] (KECCAK_F1600_ABSORB_FUNC_NAME): New. [USE_64BIT_BMI2] (keccak_absorb_lanes64_bmi2): Remove. * cipher/keccak_permute_64.h (KECCAK_F1600_ABSORB_FUNC_NAME): New. -- Optimize 64-bit absorb functions for small speed-up. After this change, 64-bit BMI2 implementation matches speed of fastest results from SUPERCOP for Intel Haswell CPUs (long messages). Benchmark on Intel Haswell @ 3.2 Ghz: Before: | nanosecs/byte mebibytes/sec cycles/byte SHAKE128 | 2.32 ns/B 411.7 MiB/s 7.41 c/B SHAKE256 | 2.84 ns/B 336.2 MiB/s 9.08 c/B SHA3-224 | 2.69 ns/B 354.9 MiB/s 8.60 c/B SHA3-256 | 2.84 ns/B 336.0 MiB/s 9.08 c/B SHA3-384 | 3.69 ns/B 258.4 MiB/s 11.81 c/B SHA3-512 | 5.30 ns/B 179.9 MiB/s 16.97 c/B After: | nanosecs/byte mebibytes/sec cycles/byte SHAKE128 | 2.27 ns/B 420.6 MiB/s 7.26 c/B SHAKE256 | 2.79 ns/B 341.4 MiB/s 8.94 c/B SHA3-224 | 2.64 ns/B 361.7 MiB/s 8.44 c/B SHA3-256 | 2.79 ns/B 341.5 MiB/s 8.94 c/B SHA3-384 | 3.65 ns/B 261.4 MiB/s 11.68 c/B SHA3-512 | 5.27 ns/B 181.0 MiB/s 16.87 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* cipher/Makefile.am: Add 'keccak_permute_32.h' and 'keccak_permute_64.h'. * cipher/hash-common.h [USE_SHA3] (MD_BLOCK_MAX_BLOCKSIZE): Remove. * cipher/keccak.c (USE_64BIT, USE_32BIT, USE_64BIT_BMI2) (USE_64BIT_SHLD, USE_32BIT_BMI2, NEED_COMMON64, NEED_COMMON32BI) (keccak_ops_t): New. (KECCAK_STATE): Add 'state64' and 'state32bi' members. (KECCAK_CONTEXT): Remove 'bctx'; add 'blocksize', 'count' and 'ops'. (rol64, keccak_f1600_state_permute): Remove. [NEED_COMMON64] (round_consts_64bit, keccak_extract_inplace64): New. [NEED_COMMON32BI] (round_consts_32bit, keccak_extract_inplace32bi) (keccak_absorb_lane32bi): New. [USE_64BIT] (ANDN64, ROL64, keccak_f1600_state_permute64) (keccak_absorb_lanes64, keccak_generic64_ops): New. [USE_64BIT_SHLD] (ANDN64, ROL64, keccak_f1600_state_permute64_shld) (keccak_absorb_lanes64_shld, keccak_shld_64_ops): New. [USE_64BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute64_bmi2) (keccak_absorb_lanes64_bmi2, keccak_bmi2_64_ops): New. [USE_32BIT] (ANDN64, ROL64, keccak_f1600_state_permute32bi) (keccak_absorb_lanes32bi, keccak_generic32bi_ops): New. [USE_32BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute32bi_bmi2) (pext, pdep, keccak_absorb_lane32bi_bmi2, keccak_absorb_lanes32bi_bmi2) (keccak_extract_inplace32bi_bmi2, keccak_bmi2_32bi_ops): New. (keccak_write): New. (keccak_init): Adjust to KECCAK_CONTEXT changes; add implementation selection based on HWF features. (keccak_final): Adjust to KECCAK_CONTEXT changes; use selected 'ops' for state manipulation. (keccak_read): Adjust to KECCAK_CONTEXT changes. (_gcry_digest_spec_sha3_224, _gcry_digest_spec_sha3_256) (_gcry_digest_spec_sha3_348, _gcry_digest_spec_sha3_512): Use 'keccak_write' instead of '_gcry_md_block_write'. * cipher/keccak_permute_32.h: New. * cipher/keccak_permute_64.h: New. -- Patch adds new generic 64-bit and 32-bit implementations and optimized implementations for SHA3: - Generic 64-bit implementation based on 'simple' implementation from SUPERCOP package. - Generic 32-bit bit-inteleaved implementataion based on 'simple32bi' implementation from SUPERCOP package. - Intel BMI2 optimized variants of 64-bit and 32-bit BI implementations. - Intel SHLD optimized variant of 64-bit implementation. Patch also makes proper use of sponge construction to avoid use of addition input buffer. Below are bench-slope benchmarks for new 64-bit implementations made on Intel Core i5-4570 (no turbo, 3.2 Ghz, gcc-4.9.2). Before (amd64): SHA3-224 | 3.92 ns/B 243.2 MiB/s 12.55 c/B SHA3-256 | 4.15 ns/B 230.0 MiB/s 13.27 c/B SHA3-384 | 5.40 ns/B 176.6 MiB/s 17.29 c/B SHA3-512 | 7.77 ns/B 122.7 MiB/s 24.87 c/B After (generic 64-bit, amd64), 1.10x faster): SHA3-224 | 3.57 ns/B 267.4 MiB/s 11.42 c/B SHA3-256 | 3.77 ns/B 252.8 MiB/s 12.07 c/B SHA3-384 | 4.91 ns/B 194.1 MiB/s 15.72 c/B SHA3-512 | 7.06 ns/B 135.0 MiB/s 22.61 c/B After (Intel SHLD 64-bit, amd64, 1.13x faster): SHA3-224 | 3.48 ns/B 273.7 MiB/s 11.15 c/B SHA3-256 | 3.68 ns/B 258.9 MiB/s 11.79 c/B SHA3-384 | 4.80 ns/B 198.7 MiB/s 15.36 c/B SHA3-512 | 6.89 ns/B 138.4 MiB/s 22.05 c/B After (Intel BMI2 64-bit, amd64, 1.45x faster): SHA3-224 | 2.71 ns/B 352.1 MiB/s 8.67 c/B SHA3-256 | 2.86 ns/B 333.2 MiB/s 9.16 c/B SHA3-384 | 3.72 ns/B 256.2 MiB/s 11.91 c/B SHA3-512 | 5.34 ns/B 178.5 MiB/s 17.10 c/B Benchmarks of new 32-bit implementations on Intel Core i5-4570 (no turbo, 3.2 Ghz, gcc-4.9.2): Before (win32): SHA3-224 | 12.05 ns/B 79.16 MiB/s 38.56 c/B SHA3-256 | 12.75 ns/B 74.78 MiB/s 40.82 c/B SHA3-384 | 16.63 ns/B 57.36 MiB/s 53.22 c/B SHA3-512 | 23.97 ns/B 39.79 MiB/s 76.72 c/B After (generic 32-bit BI, win32, 1.23x to 1.29x faster): SHA3-224 | 9.76 ns/B 97.69 MiB/s 31.25 c/B SHA3-256 | 10.27 ns/B 92.82 MiB/s 32.89 c/B SHA3-384 | 13.22 ns/B 72.16 MiB/s 42.31 c/B SHA3-512 | 18.65 ns/B 51.13 MiB/s 59.70 c/B After (Intel BMI2 32-bit BI, win32, 1.66x to 1.70x faster): SHA3-224 | 7.26 ns/B 131.4 MiB/s 23.23 c/B SHA3-256 | 7.65 ns/B 124.7 MiB/s 24.47 c/B SHA3-384 | 9.87 ns/B 96.67 MiB/s 31.58 c/B SHA3-512 | 14.05 ns/B 67.85 MiB/s 44.99 c/B Benchmarks of new 32-bit implementation on ARM Cortex-A8 (1008 Mhz, gcc-4.9.1): Before: SHA3-224 | 148.6 ns/B 6.42 MiB/s 149.8 c/B SHA3-256 | 157.2 ns/B 6.07 MiB/s 158.4 c/B SHA3-384 | 205.3 ns/B 4.65 MiB/s 206.9 c/B SHA3-512 | 296.3 ns/B 3.22 MiB/s 298.6 c/B After (1.56x faster): SHA3-224 | 96.12 ns/B 9.92 MiB/s 96.89 c/B SHA3-256 | 101.5 ns/B 9.40 MiB/s 102.3 c/B SHA3-384 | 131.4 ns/B 7.26 MiB/s 132.5 c/B SHA3-512 | 188.2 ns/B 5.07 MiB/s 189.7 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>