path: root/cipher/rijndael.c
Age | Commit message | Author | Files | Lines
2015-09-04 | w32: Fix alignment problem with AESNI on Windows >= 8 | Werner Koch | 1 | -15/+42
* cipher/cipher-selftest.c (_gcry_cipher_selftest_alloc_ctx): New.
* cipher/rijndael.c (selftest_basic_128, selftest_basic_192)
(selftest_basic_256): Allocate context on the heap.
--
The stack alignment on Windows changed, and because ld seems to limit stack variables to an 8-byte alignment (we request 16), we get bus errors from the selftests if AESNI is in use.

GnuPG-bug-id: 2085
Signed-off-by: Werner Koch <wk@gnupg.org>
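The idea of the fix, taking the context from the heap instead of relying on stack alignment, can be sketched as follows. This is only an illustration: `toy_aes_ctx`, `alloc_aligned_ctx` and `ctx_is_aligned` are hypothetical names, and the code is not the actual `_gcry_cipher_selftest_alloc_ctx`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cipher context that needs 16-byte
   alignment.  */
typedef struct
{
  unsigned char keyschedule[240];
} toy_aes_ctx;

/* Allocate SIZE bytes on the heap and return a pointer into that memory
   which is aligned to 16 bytes.  The pointer to hand to free() is stored
   at R_MEM.  Returns NULL if the allocation fails.  */
static void *
alloc_aligned_ctx (size_t size, void **r_mem)
{
  unsigned char *raw;
  size_t pad;

  assert (r_mem != NULL);
  raw = malloc (size + 15);     /* Room for up to 15 bytes of padding.  */
  *r_mem = raw;
  if (!raw)
    return NULL;
  pad = (16 - ((uintptr_t) raw & 15)) & 15;
  return raw + pad;
}

/* Return 1 if a freshly allocated context is 16-byte aligned, else 0.  */
static int
ctx_is_aligned (void)
{
  void *mem;
  toy_aes_ctx *ctx = alloc_aligned_ctx (sizeof *ctx, &mem);
  int ok = ctx && ((uintptr_t) ctx & 15) == 0;

  free (mem);
  return ok;
}
```

Because the alignment is computed from the address itself, the result does not depend on what alignment the toolchain grants to stack variables.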
2015-08-10 | Optimize OCB offset calculation | Jussi Kivilinna | 1 | -21/+3
* cipher/cipher-internal.h (ocb_get_l): New.
* cipher/cipher-ocb.c (_gcry_cipher_ocb_authenticate)
(ocb_crypt): Use 'ocb_get_l' instead of '_gcry_cipher_ocb_get_l'.
* cipher/camellia-glue.c (get_l): Remove.
(_gcry_camellia_ocb_crypt, _gcry_camellia_ocb_auth): Precalculate offset array when block count matches parallel operation size; Use 'ocb_get_l' instead of 'get_l'.
* cipher/rijndael-aesni.c (get_l): Add fast path for 75% most common offsets.
(aesni_ocb_enc, aesni_ocb_dec, _gcry_aes_aesni_ocb_auth): Precalculate offset array when block count matches parallel operation size.
* cipher/rijndael-ssse3-amd64.c (get_l): Add fast path for 75% most common offsets.
* cipher/rijndael.c (_gcry_aes_ocb_crypt, _gcry_aes_ocb_auth): Use 'ocb_get_l' instead of '_gcry_cipher_ocb_get_l'.
* cipher/serpent.c (get_l): Remove.
(_gcry_serpent_ocb_crypt, _gcry_serpent_ocb_auth): Precalculate offset array when block count matches parallel operation size; Use 'ocb_get_l' instead of 'get_l'.
* cipher/twofish.c (get_l): Remove.
(_gcry_twofish_ocb_crypt, _gcry_twofish_ocb_auth): Use 'ocb_get_l' instead of 'get_l'.
--
Patch optimizes OCB offset calculation for generic code and assembly implementations with parallel block processing.

Benchmark of OCB AES-NI on Intel Haswell:

$ tests/bench-slope --cpu-mhz 3201 cipher aes

Before:
 AES      |  nanosecs/byte   mebibytes/sec   cycles/byte
 CTR enc  |     0.274 ns/B    3483.9 MiB/s     0.876 c/B
 CTR dec  |     0.273 ns/B    3490.0 MiB/s     0.875 c/B
 OCB enc  |     0.289 ns/B    3296.1 MiB/s     0.926 c/B
 OCB dec  |     0.299 ns/B    3189.9 MiB/s     0.957 c/B
 OCB auth |     0.260 ns/B    3670.0 MiB/s     0.832 c/B

After:
 AES      |  nanosecs/byte   mebibytes/sec   cycles/byte
 CTR enc  |     0.273 ns/B    3489.4 MiB/s     0.875 c/B
 CTR dec  |     0.273 ns/B    3487.5 MiB/s     0.875 c/B
 OCB enc  |     0.248 ns/B    3852.8 MiB/s     0.792 c/B
 OCB dec  |     0.261 ns/B    3659.5 MiB/s     0.834 c/B
 OCB auth |     0.227 ns/B    4205.5 MiB/s     0.726 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
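One plausible reading of the "75% most common offsets" is sketched below: in OCB the offset for block number n is selected by the number of trailing zero bits of n, which is 0 for half of all block numbers and 1 for another quarter. The names `toy_ntz`, `toy_l_index` and `toy_count_fast_path` are hypothetical and the code is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of trailing zero bits of N.  N must be non-zero.  */
static unsigned int
toy_ntz (uint64_t n)
{
  unsigned int count = 0;

  assert (n != 0);
  while (!(n & 1))
    {
      n >>= 1;
      count++;
    }
  return count;
}

/* Index of the L value to use for block number N (counted from 1).
   Block numbers with ntz 0 or 1 are answered without the loop.  */
static unsigned int
toy_l_index (uint64_t n)
{
  if (n & 1)
    return 0;           /* Odd block numbers: 50% of all blocks.  */
  if (n & 2)
    return 1;           /* n == 2 (mod 4): another 25%.  */
  return toy_ntz (n);   /* Remaining 25% take the slow path.  */
}

/* Count how many of the block numbers 1..TOTAL take the fast path.  */
static uint64_t
toy_count_fast_path (uint64_t total)
{
  uint64_t i, hits = 0;

  for (i = 1; i <= total; i++)
    if (i & 3)
      hits++;
  return hits;
}
```

For the first 1024 block numbers, 768 take one of the two early returns, which is where the 75% figure comes from under this reading.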
2015-07-27 | Reduce amount of duplicated code in OCB bulk implementations | Jussi Kivilinna | 1 | -2/+6
* cipher/cipher-ocb.c (_gcry_cipher_ocb_authenticate) (ocb_crypt): Change bulk function to return number of unprocessed blocks. * src/cipher.h (_gcry_aes_ocb_crypt, _gcry_aes_ocb_auth) (_gcry_camellia_ocb_crypt, _gcry_camellia_ocb_auth) (_gcry_serpent_ocb_crypt, _gcry_serpent_ocb_auth) (_gcry_twofish_ocb_crypt, _gcry_twofish_ocb_auth): Change return type to 'size_t'. * cipher/camellia-glue.c (get_l): Only if USE_AESNI_AVX or USE_AESNI_AVX2 defined. (_gcry_camellia_ocb_crypt, _gcry_camellia_ocb_auth): Change return type to 'size_t' and return remaining blocks; Remove unaccelerated common code path. Enable remaining common code only if USE_AESNI_AVX or USE_AESNI_AVX2 defined; Remove unaccelerated common code. * cipher/rijndael.c (_gcry_aes_ocb_crypt, _gcry_aes_ocb_auth): Change return type to 'size_t' and return zero. * cipher/serpent.c (get_l): Only if USE_SSE2, USE_AVX2 or USE_NEON defined. (_gcry_serpent_ocb_crypt, _gcry_serpent_ocb_auth): Change return type to 'size_t' and return remaining blocks; Remove unaccelerated common code path. Enable remaining common code only if USE_SSE2, USE_AVX2 or USE_NEON defined; Remove unaccelerated common code. * cipher/twofish.c (get_l): Only if USE_AMD64_ASM defined. (_gcry_twofish_ocb_crypt, _gcry_twofish_ocb_auth): Change return type to 'size_t' and return remaining blocks; Remove unaccelerated common code path. Enable remaining common code only if USE_AMD64_ASM defined; Remove unaccelerated common code. -- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
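The calling convention introduced here, where a bulk function reports how many blocks it left unprocessed and the common code finishes them, might look roughly like this. It is a toy sketch: the XOR "cipher", `TOY_PARALLEL_BLOCKS` and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCKSIZE 16
#define TOY_PARALLEL_BLOCKS 4   /* Hypothetical parallel width.  */

/* Toy per-block transform standing in for the real cipher.  */
static void
toy_xor_block (unsigned char *blk, unsigned char mask)
{
  size_t i;

  for (i = 0; i < TOY_BLOCKSIZE; i++)
    blk[i] ^= mask;
}

/* Bulk function: handles only whole groups of TOY_PARALLEL_BLOCKS and
   returns the number of blocks it did not process.  */
static size_t
toy_bulk (unsigned char *buf, size_t nblocks, unsigned char mask)
{
  size_t i;

  while (nblocks >= TOY_PARALLEL_BLOCKS)
    {
      for (i = 0; i < TOY_PARALLEL_BLOCKS; i++)
        toy_xor_block (buf + i * TOY_BLOCKSIZE, mask);
      buf += TOY_PARALLEL_BLOCKS * TOY_BLOCKSIZE;
      nblocks -= TOY_PARALLEL_BLOCKS;
    }
  return nblocks;
}

/* Common code: call the bulk function, then finish the remaining
   blocks one at a time.  */
static void
toy_crypt (unsigned char *buf, size_t nblocks, unsigned char mask)
{
  size_t left = toy_bulk (buf, nblocks, mask);

  buf += (nblocks - left) * TOY_BLOCKSIZE;
  for (; left; left--, buf += TOY_BLOCKSIZE)
    toy_xor_block (buf, mask);
}

/* Return 1 if toy_crypt transforms exactly the first NBLOCKS blocks of
   a zeroed 8-block buffer.  NBLOCKS must not exceed 8.  */
static int
toy_crypt_selfcheck (size_t nblocks)
{
  unsigned char buf[8 * TOY_BLOCKSIZE];
  size_t i;

  assert (nblocks <= 8);
  memset (buf, 0, sizeof buf);
  toy_crypt (buf, nblocks, 0x5a);
  for (i = 0; i < sizeof buf; i++)
    if (buf[i] != (i < nblocks * TOY_BLOCKSIZE ? 0x5a : 0))
      return 0;
  return 1;
}
```

With this split, the per-block fallback lives in one place, and an implementation that handles everything itself can simply return zero.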
2015-07-26 | Add OCB bulk mode for AES SSSE3 implementation | Jussi Kivilinna | 1 | -0/+19
* cipher/rijndael-ssse3-amd64.c (SSSE3_STATE_SIZE): New. [HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS] (vpaes_ssse3_prepare): Use 'ssse3_state' for storing current SSSE3 state. [HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS] (vpaes_ssse3_cleanup): Restore SSSE3 state from 'ssse3_state'. (_gcry_aes_ssse3_do_setkey, _gcry_aes_ssse3_prepare_decryption) (_gcry_aes_ssse3_encrypt, _gcry_aes_ssse3_cfb_enc) (_gcry_aes_ssse3_cbc_enc, _gcry_aes_ssse3_ctr_enc) (_gcry_aes_ssse3_decrypt, _gcry_aes_ssse3_cfb_dec) (_gcry_aes_ssse3_cbc_dec, _gcry_aes_ssse3_cbc_dec): Add 'ssse3_state' array. (get_l, ssse3_ocb_enc, ssse3_ocb_dec, _gcry_aes_ssse3_ocb_crypt) (_gcry_aes_ssse3_ocb_auth): New. * cipher/rijndael.c (_gcry_aes_ssse3_ocb_crypt) (_gcry_aes_ssse3_ocb_auth): New. (_gcry_aes_ocb_crypt, _gcry_aes_ocb_auth) [USE_SSSE3]: Use SSSE3 implementation for OCB. -- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-05-03 | Fix WIN64 assembly glue for AES | Jussi Kivilinna | 1 | -20/+24
* cipher/rijndael.c (do_encrypt, do_decrypt) [!HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS]: Change input operands to input+output to mark volatile nature of the used registers.
--
Function arguments cannot be passed to the assembly block as input operands, because the target function modifies those input registers.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-05-02 | Enable AMD64 AES implementation for WIN64 | Jussi Kivilinna | 1 | -0/+34
* cipher/rijndael-amd64.S: Enable when HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS defined. (ELF): New macro to mask lines with ELF specific commands. * cipher/rijndael-internal.h (USE_AMD64_ASM): Enable when HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS defined. (do_encrypt, do_decrypt) [USE_AMD64_ASM && !HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS]: Use assembly block to call AMD64 assembly encrypt/decrypt function. -- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-04-18 | Add OCB bulk crypt/auth functions for AES/AES-NI | Jussi Kivilinna | 1 | -0/+161
* cipher/cipher-internal.h (gcry_cipher_handle): Add bulk.ocb_crypt and bulk.ocb_auth.
(_gcry_cipher_ocb_get_l): New prototype.
* cipher/cipher-ocb.c (get_l): Rename to ...
(_gcry_cipher_ocb_get_l): ... this.
(_gcry_cipher_ocb_authenticate, ocb_crypt): Use bulk function when available.
* cipher/cipher.c (_gcry_cipher_open_internal): Setup OCB bulk functions for AES.
* cipher/rijndael-aesni.c (get_l, aesni_ocb_enc, aes_ocb_dec)
(_gcry_aes_aesni_ocb_crypt, _gcry_aes_aesni_ocb_auth): New.
* cipher/rijndael.c [USE_AESNI] (_gcry_aes_aesni_ocb_crypt)
(_gcry_aes_aesni_ocb_auth): New prototypes.
(_gcry_aes_ocb_crypt, _gcry_aes_ocb_auth): New.
* src/cipher.h (_gcry_aes_ocb_crypt, _gcry_aes_ocb_auth): New prototypes.
* tests/basic.c (check_ocb_cipher_largebuf): New.
(check_ocb_cipher): Add large buffer encryption/decryption test.
--
Patch adds bulk encryption/decryption/authentication code for AES-NI accelerated AES.

Benchmark on Intel i5-4570 (3200 MHz, turbo off):

Before:
 AES      |  nanosecs/byte   mebibytes/sec   cycles/byte
 OCB enc  |      2.12 ns/B     449.7 MiB/s      6.79 c/B
 OCB dec  |      2.12 ns/B     449.6 MiB/s      6.79 c/B
 OCB auth |      2.07 ns/B     459.9 MiB/s      6.64 c/B

After:
 AES      |  nanosecs/byte   mebibytes/sec   cycles/byte
 OCB enc  |     0.292 ns/B    3262.5 MiB/s     0.935 c/B
 OCB dec  |     0.297 ns/B    3212.2 MiB/s     0.950 c/B
 OCB auth |     0.260 ns/B    3666.1 MiB/s     0.832 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2015-01-20 | rijndael: fix wrong ifdef for SSSE3 setkey | Jussi Kivilinna | 1 | -1/+1
* cipher/rijndael.c (do_setkey): Use USE_SSSE3 instead of USE_AESNI around SSSE3 setkey selection.
--
Reported-by: Richard H Lee <ricardohenrylee@gmail.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2014-12-27 | Add Intel SSSE3 based vector permutation AES implementation | Jussi Kivilinna | 1 | -2/+94
* cipher/Makefile.am: Add 'rijndael-ssse3-amd64.c'. * cipher/rijndael-internal.h (USE_SSSE3): New. (RIJNDAEL_context_s) [USE_SSSE3]: Add 'use_ssse3'. * cipher/rijndael-ssse3-amd64.c: New. * cipher/rijndael.c [USE_SSSE3] (_gcry_aes_ssse3_do_setkey) (_gcry_aes_ssse3_prepare_decryption, _gcry_aes_ssse3_encrypt) (_gcry_aes_ssse3_decrypt, _gcry_aes_ssse3_cfb_enc) (_gcry_aes_ssse3_cbc_enc, _gcry_aes_ssse3_ctr_enc) (_gcry_aes_ssse3_cfb_dec, _gcry_aes_ssse3_cbc_dec): New. (do_setkey): Add HWF check for SSSE3 and setup for SSSE3 implementation. (prepare_decryption, _gcry_aes_cfb_enc, _gcry_aes_cbc_enc) (_gcry_aes_ctr_enc, _gcry_aes_cfb_dec, _gcry_aes_cbc_dec): Add selection for SSSE3 implementation. * configure.ac [host=x86_64]: Add 'rijndael-ssse3-amd64.lo'. -- This patch adds "AES with vector permutations" implementation by Mike Hamburg. Public-domain source-code is available at: http://crypto.stanford.edu/vpaes/ Benchmark on Intel Core2 T8100 (2.1Ghz, no turbo): Old (AMD64 asm): AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 8.79 ns/B 108.5 MiB/s 18.46 c/B ECB dec | 9.07 ns/B 105.1 MiB/s 19.05 c/B CBC enc | 7.77 ns/B 122.7 MiB/s 16.33 c/B CBC dec | 7.74 ns/B 123.2 MiB/s 16.26 c/B CFB enc | 7.88 ns/B 121.0 MiB/s 16.54 c/B CFB dec | 7.56 ns/B 126.1 MiB/s 15.88 c/B OFB enc | 9.02 ns/B 105.8 MiB/s 18.94 c/B OFB dec | 9.07 ns/B 105.1 MiB/s 19.05 c/B CTR enc | 7.80 ns/B 122.2 MiB/s 16.38 c/B CTR dec | 7.81 ns/B 122.2 MiB/s 16.39 c/B New (ssse3): AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 5.77 ns/B 165.2 MiB/s 12.13 c/B ECB dec | 7.13 ns/B 133.7 MiB/s 14.98 c/B CBC enc | 5.27 ns/B 181.0 MiB/s 11.06 c/B CBC dec | 6.39 ns/B 149.3 MiB/s 13.42 c/B CFB enc | 5.27 ns/B 180.9 MiB/s 11.07 c/B CFB dec | 5.28 ns/B 180.7 MiB/s 11.08 c/B OFB enc | 6.11 ns/B 156.1 MiB/s 12.83 c/B OFB dec | 6.13 ns/B 155.5 MiB/s 12.88 c/B CTR enc | 5.26 ns/B 181.5 MiB/s 11.04 c/B CTR dec | 5.24 ns/B 182.0 MiB/s 11.00 c/B Benchmark on Intel i5-2450M (2.5Ghz, no turbo, aes-ni disabled): 
Old (AMD64 asm): AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 8.06 ns/B 118.3 MiB/s 20.15 c/B ECB dec | 8.21 ns/B 116.1 MiB/s 20.53 c/B CBC enc | 7.88 ns/B 121.1 MiB/s 19.69 c/B CBC dec | 7.57 ns/B 126.0 MiB/s 18.92 c/B CFB enc | 7.87 ns/B 121.2 MiB/s 19.67 c/B CFB dec | 7.56 ns/B 126.2 MiB/s 18.89 c/B OFB enc | 8.27 ns/B 115.3 MiB/s 20.67 c/B OFB dec | 8.28 ns/B 115.1 MiB/s 20.71 c/B CTR enc | 8.02 ns/B 119.0 MiB/s 20.04 c/B CTR dec | 8.02 ns/B 118.9 MiB/s 20.05 c/B New (ssse3): AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 4.03 ns/B 236.6 MiB/s 10.07 c/B ECB dec | 5.28 ns/B 180.8 MiB/s 13.19 c/B CBC enc | 3.77 ns/B 252.7 MiB/s 9.43 c/B CBC dec | 4.69 ns/B 203.3 MiB/s 11.73 c/B CFB enc | 3.75 ns/B 254.3 MiB/s 9.37 c/B CFB dec | 3.69 ns/B 258.6 MiB/s 9.22 c/B OFB enc | 4.17 ns/B 228.7 MiB/s 10.43 c/B OFB dec | 4.17 ns/B 228.7 MiB/s 10.42 c/B CTR enc | 3.72 ns/B 256.5 MiB/s 9.30 c/B CTR dec | 3.72 ns/B 256.1 MiB/s 9.31 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2014-12-25 | rijndael: fix compiler warnings on ARM | Jussi Kivilinna | 1 | -69/+68
* cipher/rijndael-internal.h (RIJNDAEL_context_s): Add u32 variants of keyschedule arrays to unions u1 and u2. (keyschedenc32, keyscheddec32): New. * cipher/rijndael.c (u32_a_t): Remove. (do_setkey): Add and use tkk[].data32, k_u32, tk_u32 and W_u32; Remove casting byte arrays to u32_a_t. (prepare_decryption, do_encrypt_fn, do_decrypt_fn): Use keyschedenc32 and keyscheddec32; Remove casting byte arrays to u32_a_t. -- Patch fixes 'cast increases required alignment' compiler warnings that GCC was showing: rijndael.c: In function 'do_setkey': rijndael.c:310:13: warning: cast increases required alignment of target type [-Wcast-align] *((u32_a_t*)tk[j]) = *((u32_a_t*)k[j]); ^ rijndael.c:310:34: warning: cast increases required alignment of target type [-Wcast-align] *((u32_a_t*)tk[j]) = *((u32_a_t*)k[j]); [removed the rest] Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
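The approach of giving the key schedule a u32 view through a union, instead of casting byte arrays to a wider type, can be illustrated like this. `toy_keysched_t` and the helper names are hypothetical; the real unions u1 and u2 in `RIJNDAEL_context_s` are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical round-key storage with both a byte view and a 32-bit
   word view.  The union is aligned for uint32_t, so using 'words'
   needs no cast and triggers no -Wcast-align warning.  */
typedef union
{
  unsigned char bytes[4][4];
  uint32_t words[4];
} toy_keysched_t;

/* Copy one round key column by column using the word view.  */
static void
toy_copy_round_key (toy_keysched_t *dst, const toy_keysched_t *src)
{
  int j;

  assert (dst != NULL && src != NULL);
  for (j = 0; j < 4; j++)
    dst->words[j] = src->words[j];
}

/* Return 1 if the word-wise copy reproduces all 16 bytes.  */
static int
toy_copy_round_key_selfcheck (void)
{
  toy_keysched_t a, b;
  int i;

  for (i = 0; i < 16; i++)
    a.bytes[i / 4][i % 4] = (unsigned char) (i * 17);
  memset (&b, 0, sizeof b);
  toy_copy_round_key (&b, &a);
  return memcmp (a.bytes, b.bytes, sizeof a.bytes) == 0;
}
```

The warning quoted in the commit message arises because a plain byte array carries no alignment guarantee for 32-bit accesses; the union member states that guarantee in the type.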
2014-12-23 | rijndael: use more compact look-up tables and add table prefetching | Jussi Kivilinna | 1 | -260/+385
* cipher/rijndael-internal.h (rijndael_prefetchfn_t): New. (RIJNDAEL_context): Add 'prefetch_enc_fn' and 'prefetch_dec_fn'. * cipher/rijndael-tables.h (S, T1, T2, T3, T4, T5, T6, T7, T8, S5, U1) (U2, U3, U4): Remove. (encT, dec_tables, decT, inv_sbox): Add. * cipher/rijndael.c (_gcry_aes_amd64_encrypt_block) (_gcry_aes_amd64_decrypt_block, _gcry_aes_arm_encrypt_block) (_gcry_aes_arm_encrypt_block): Add parameter for passing table pointer to assembly implementation. (prefetch_table, prefetch_enc, prefetch_dec): New. (do_setkey): Setup context prefetch functions depending on selected rijndael implementation; Use new tables for key setup. (prepare_decryption): Use new tables for decryption key setup. (do_encrypt_aligned): Rename to... (do_encrypt_fn): ... to this, change to use new compact tables, make handle unaligned input and unroll rounds loop by two. (do_encrypt): Remove handling of unaligned input/output; pass table pointer to assembly implementations. (rijndael_encrypt, _gcry_aes_cfb_enc, _gcry_aes_cbc_enc) (_gcry_aes_ctr_enc, _gcry_aes_cfb_dec): Prefetch encryption tables before encryption. (do_decrypt_aligned): Rename to... (do_decrypt_fn): ... to this, change to use new compact tables, make handle unaligned input and unroll rounds loop by two. (do_decrypt): Remove handling of unaligned input/output; pass table pointer to assembly implementations. (rijndael_decrypt, _gcry_aes_cbc_dec): Prefetch decryption tables before decryption. * cipher/rijndael-amd64.S: Use 1+1.25 KiB tables for encryption+decryption; remove tables from assembly file. * cipher/rijndael-arm.S: Ditto. -- Patch replaces 4+4.25 KiB look-up tables in generic implementation and 8+8 KiB look-up tables in AMD64 implementation and 2+2 KiB look-up tables in ARM implementation with 1+1.25 KiB look-up tables, and adds prefetching of look-up tables. AMD64 assembly is slower than before because of additional rotation instructions. 
The generic C implementation is now better optimized and actually faster than before. Benchmark results on Intel i5-4570 (turbo off) (64-bit, AMD64 assembly): tests/bench-slope --disable-hwf intel-aesni --cpu-mhz 3200 cipher aes Old: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 3.10 ns/B 307.5 MiB/s 9.92 c/B ECB dec | 3.15 ns/B 302.5 MiB/s 10.09 c/B CBC enc | 3.46 ns/B 275.5 MiB/s 11.08 c/B CBC dec | 3.19 ns/B 299.2 MiB/s 10.20 c/B CFB enc | 3.48 ns/B 274.4 MiB/s 11.12 c/B CFB dec | 3.23 ns/B 294.8 MiB/s 10.35 c/B OFB enc | 3.29 ns/B 290.2 MiB/s 10.52 c/B OFB dec | 3.31 ns/B 288.3 MiB/s 10.58 c/B CTR enc | 3.64 ns/B 261.7 MiB/s 11.66 c/B CTR dec | 3.65 ns/B 261.6 MiB/s 11.67 c/B New: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 4.21 ns/B 226.7 MiB/s 13.46 c/B ECB dec | 4.27 ns/B 223.2 MiB/s 13.67 c/B CBC enc | 4.15 ns/B 229.8 MiB/s 13.28 c/B CBC dec | 3.85 ns/B 247.8 MiB/s 12.31 c/B CFB enc | 4.16 ns/B 229.1 MiB/s 13.32 c/B CFB dec | 3.88 ns/B 245.9 MiB/s 12.41 c/B OFB enc | 4.38 ns/B 217.8 MiB/s 14.01 c/B OFB dec | 4.36 ns/B 218.6 MiB/s 13.96 c/B CTR enc | 4.30 ns/B 221.6 MiB/s 13.77 c/B CTR dec | 4.30 ns/B 221.7 MiB/s 13.76 c/B Benchmark on Intel i5-4570 (turbo off) (32-bit mingw, generic C): tests/bench-slope.exe --disable-hwf intel-aesni --cpu-mhz 3200 cipher aes Old: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 6.03 ns/B 158.2 MiB/s 19.29 c/B ECB dec | 5.81 ns/B 164.1 MiB/s 18.60 c/B CBC enc | 6.22 ns/B 153.4 MiB/s 19.90 c/B CBC dec | 5.91 ns/B 161.3 MiB/s 18.92 c/B CFB enc | 6.25 ns/B 152.7 MiB/s 19.99 c/B CFB dec | 6.24 ns/B 152.8 MiB/s 19.97 c/B OFB enc | 6.33 ns/B 150.6 MiB/s 20.27 c/B OFB dec | 6.33 ns/B 150.7 MiB/s 20.25 c/B CTR enc | 6.28 ns/B 152.0 MiB/s 20.08 c/B CTR dec | 6.28 ns/B 151.7 MiB/s 20.11 c/B New: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 5.02 ns/B 190.0 MiB/s 16.06 c/B ECB dec | 5.33 ns/B 178.8 MiB/s 17.07 c/B CBC enc | 4.64 ns/B 205.4 MiB/s 14.86 c/B CBC dec | 4.95 ns/B 192.7 MiB/s 15.84 c/B 
CFB enc | 4.75 ns/B 200.7 MiB/s 15.20 c/B CFB dec | 4.74 ns/B 201.1 MiB/s 15.18 c/B OFB enc | 5.29 ns/B 180.3 MiB/s 16.93 c/B OFB dec | 5.29 ns/B 180.3 MiB/s 16.93 c/B CTR enc | 4.77 ns/B 200.0 MiB/s 15.26 c/B CTR dec | 4.77 ns/B 199.8 MiB/s 15.27 c/B Benchmark on Cortex-A8 (ARM assembly): tests/bench-slope --cpu-mhz 1008 cipher aes Old: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 21.84 ns/B 43.66 MiB/s 22.02 c/B ECB dec | 22.35 ns/B 42.67 MiB/s 22.53 c/B CBC enc | 22.97 ns/B 41.53 MiB/s 23.15 c/B CBC dec | 23.48 ns/B 40.61 MiB/s 23.67 c/B CFB enc | 22.72 ns/B 41.97 MiB/s 22.90 c/B CFB dec | 23.41 ns/B 40.74 MiB/s 23.59 c/B OFB enc | 23.65 ns/B 40.32 MiB/s 23.84 c/B OFB dec | 23.67 ns/B 40.29 MiB/s 23.86 c/B CTR enc | 23.24 ns/B 41.03 MiB/s 23.43 c/B CTR dec | 23.23 ns/B 41.05 MiB/s 23.42 c/B New: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 26.03 ns/B 36.64 MiB/s 26.24 c/B ECB dec | 26.97 ns/B 35.36 MiB/s 27.18 c/B CBC enc | 23.21 ns/B 41.09 MiB/s 23.39 c/B CBC dec | 23.36 ns/B 40.83 MiB/s 23.54 c/B CFB enc | 23.02 ns/B 41.42 MiB/s 23.21 c/B CFB dec | 23.67 ns/B 40.28 MiB/s 23.86 c/B OFB enc | 27.86 ns/B 34.24 MiB/s 28.08 c/B OFB dec | 27.87 ns/B 34.21 MiB/s 28.10 c/B CTR enc | 23.47 ns/B 40.63 MiB/s 23.66 c/B CTR dec | 23.49 ns/B 40.61 MiB/s 23.67 c/B Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
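The commit message does not spell out how the table prefetching works. A common way to do it, shown here purely as an assumption, is to read one byte from every cache line of the table before the cipher rounds start. `toy_prefetch_table`, `TOY_CACHE_LINE` and the 32-byte line size are hypothetical choices, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_LINE 32   /* Assumed cache line size in bytes.  */

/* Touch one byte in every assumed cache line of TAB, plus the last
   byte, so that the whole table is likely to be in the cache.  The
   volatile qualifier keeps the compiler from dropping the reads.
   The OR of the bytes read is returned only to make the effect
   observable in a test.  */
static unsigned int
toy_prefetch_table (const volatile unsigned char *tab, size_t len)
{
  size_t i;
  unsigned int seen = 0;

  assert (tab != NULL);
  for (i = 0; i < len; i += TOY_CACHE_LINE)
    seen |= tab[i];
  if (len)
    seen |= tab[len - 1];
  return seen;
}

/* Return 1 if exactly the expected bytes of a 1 KiB table were read:
   offset 32 (start of a line) and offset 1023 (last byte) are read,
   offset 33 is not.  */
static int
toy_prefetch_selfcheck (void)
{
  static unsigned char table[1024];

  table[32] = 4;
  table[33] = 8;
  table[1023] = 1;
  return toy_prefetch_table (table, sizeof table) == 5;
}
```

Smaller tables make this cheaper: a 1 KiB table spans only 32 such lines under the assumed line size.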
2014-12-06 | rijndael: split Padlock part to separate file | Jussi Kivilinna | 1 | -78/+8
* cipher/Makefile.am: Add 'rijndael-padlock.c'.
* cipher/rijndael-padlock.c: New.
* cipher/rijndael.c (do_padlock, do_padlock_encrypt)
(do_padlock_decrypt): Move to 'rijndael-padlock.c'.
* configure.ac [mpi_cpu_arch=x86]: Add 'rijndael-padlock.lo'.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2014-12-01 | rijndael: refactor to reduce number of #ifdefs and branches | Jussi Kivilinna | 1 | -217/+152
* cipher/rijndael-aesni.c (_gcry_aes_aesni_encrypt)
(_gcry_aes_aesni_decrypt): Make return stack burn depth.
* cipher/rijndael-amd64.S (_gcry_aes_amd64_encrypt_block)
(_gcry_aes_amd64_decrypt_block): Ditto.
* cipher/rijndael-arm.S (_gcry_aes_arm_encrypt_block)
(_gcry_aes_arm_decrypt_block): Ditto.
* cipher/rijndael-internal.h (RIJNDAEL_context_s)
(rijndael_cryptfn_t): New.
(RIJNDAEL_context): New members 'encrypt_fn' and 'decrypt_fn'.
* cipher/rijndael.c (_gcry_aes_amd64_encrypt_block)
(_gcry_aes_amd64_decrypt_block, _gcry_aes_aesni_encrypt)
(_gcry_aes_aesni_decrypt, _gcry_aes_arm_encrypt_block)
(_gcry_aes_arm_decrypt_block): Change prototypes.
(do_padlock_encrypt, do_padlock_decrypt): New.
(do_setkey): Separate key-length to rounds conversion from HW features check; Add selection for ctx->encrypt_fn and ctx->decrypt_fn.
(do_encrypt_aligned, do_decrypt_aligned): Move inside '[!USE_AMD64_ASM && !USE_ARM_ASM]'; Move USE_AMD64_ASM and USE_ARM_ASM to...
(do_encrypt, do_decrypt): ...here; Return stack depth; Remove second temporary buffer from non-aligned input/output case.
(do_padlock): Move decrypt_flag to last argument; Return stack depth.
(rijndael_encrypt): Remove #ifdefs, just call ctx->encrypt_fn.
(_gcry_aes_cfb_enc, _gcry_aes_cbc_enc): Remove USE_PADLOCK; Call ctx->encrypt_fn in place of do_encrypt/do_encrypt_aligned.
(_gcry_aes_ctr_enc): Call ctx->encrypt_fn in place of do_encrypt_aligned; Make tmp buffer 16-byte aligned and wipe buffer after use.
(rijndael_decrypt): Remove #ifdefs, just call ctx->decrypt_fn.
(_gcry_aes_cfb_dec): Remove USE_PADLOCK; Call ctx->decrypt_fn in place of do_decrypt/do_decrypt_aligned.
(_gcry_aes_cbc_dec): Ditto; Make savebuf buffer 16-byte aligned.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
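The refactoring pattern described above, choosing an implementation once at key setup and calling it through a function pointer that returns the stack burn depth, can be sketched with a toy cipher. All names (`toy_ctx`, `toy_cryptfn_t`, `toy_setkey` and so on) and the depth value 64 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCKSIZE 16

struct toy_ctx;

/* Block function type: processes one block and returns how many bytes
   of stack the caller should wipe afterwards.  */
typedef unsigned int (*toy_cryptfn_t) (const struct toy_ctx *ctx,
                                       unsigned char *dst,
                                       const unsigned char *src);

typedef struct toy_ctx
{
  toy_cryptfn_t encrypt_fn;   /* Selected once in toy_setkey.  */
  unsigned char key;
} toy_ctx;

/* Generic variant: pretends to leave key material on the stack.  */
static unsigned int
toy_encrypt_generic (const toy_ctx *ctx, unsigned char *dst,
                     const unsigned char *src)
{
  size_t i;

  for (i = 0; i < TOY_BLOCKSIZE; i++)
    dst[i] = src[i] ^ ctx->key;
  return 64;
}

/* "Hardware" variant: works in registers only, nothing to burn.  */
static unsigned int
toy_encrypt_hw (const toy_ctx *ctx, unsigned char *dst,
                const unsigned char *src)
{
  size_t i;

  for (i = 0; i < TOY_BLOCKSIZE; i++)
    dst[i] = src[i] ^ ctx->key;
  return 0;
}

/* Key setup is the only place that looks at the hardware feature.  */
static void
toy_setkey (toy_ctx *ctx, unsigned char key, int have_hw)
{
  assert (ctx != NULL);
  ctx->key = key;
  ctx->encrypt_fn = have_hw ? toy_encrypt_hw : toy_encrypt_generic;
}

/* Encrypt one zero block and report the burn depth that came back.  */
static unsigned int
toy_burn_depth (int have_hw)
{
  toy_ctx ctx;
  unsigned char in[TOY_BLOCKSIZE] = { 0 };
  unsigned char out[TOY_BLOCKSIZE];

  toy_setkey (&ctx, 0x2a, have_hw);
  return ctx.encrypt_fn (&ctx, out, in);
}
```

The callers then need neither #ifdefs nor per-call feature checks, and they can skip the stack wipe when the returned depth is zero.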
2014-12-01 | rijndael: move AES-NI blocks before Padlock | Jussi Kivilinna | 1 | -43/+45
* cipher/rijndael.c (do_setkey, rijndael_encrypt, _gcry_aes_cfb_enc)
(rijndael_decrypt, _gcry_aes_cfb_dec): Move USE_AESNI before USE_PADLOCK.
(check_decryption_preparation) [USE_PADLOCK]: Move to...
(prepare_decryption) [USE_PADLOCK]: ...here.
--
Make order of AES-NI and Padlock #ifdefs consistent.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2014-12-01 | rijndael: split AES-NI functions to separate file | Jussi Kivilinna | 1 | -1330/+63
* cipher/Makefile.in: Add 'rijndael-aesni.c'.
* cipher/rijndael-aesni.c: New.
* cipher/rijndael-internal.h: New.
* cipher/rijndael.c (MAXKC, MAXROUNDS, BLOCKSIZE, ATTR_ALIGNED_16)
(USE_AMD64_ASM, USE_ARM_ASM, USE_PADLOCK, USE_AESNI, RIJNDAEL_context)
(keyschenc, keyschdec, padlockkey): Move to 'rijndael-internal.h'.
(u128_s, aesni_prepare, aesni_cleanup, aesni_cleanup_2_6)
(aesni_do_setkey, do_aesni_enc, do_aesni_dec, do_aesni_enc_vec4)
(do_aesni_dec_vec4, do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Move to 'rijndael-aesni.c'.
(prepare_decryption, rijndael_encrypt, _gcry_aes_cfb_enc)
(_gcry_aes_cbc_enc, _gcry_aes_ctr_enc, rijndael_decrypt)
(_gcry_aes_cfb_dec, _gcry_aes_cbc_dec) [USE_AESNI]: Move to functions in 'rijndael-aesni.c'.
* configure.ac [mpi_cpu_arch=x86]: Add 'rijndael-aesni.lo'.
--
Clean up rijndael.c before new hardware acceleration support gets added.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-12-03 | rijndael: fix compiler warning on aarch64 | Jussi Kivilinna | 1 | -2/+6
* cipher/rijndael.c (do_setkey): Use braces for empty if statement instead of semicolon.
--
Patch fixes following warning:

rijndael.c: In function 'do_setkey':
rijndael.c:507:9: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
         ;
         ^

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-11-15 | cipher: use size_t for internal buffer lengths | Jussi Kivilinna | 1 | -5/+5
* cipher/arcfour.c (do_encrypt_stream, encrypt_stream): Use 'size_t' for buffer lengths. * cipher/blowfish.c (_gcry_blowfish_ctr_enc, _gcry_blowfish_cbc_dec) (_gcry_blowfish_cfb_dec): Ditto. * cipher/camellia-glue.c (_gcry_camellia_ctr_enc) (_gcry_camellia_cbc_dec, _gcry_blowfish_cfb_dec): Ditto. * cipher/cast5.c (_gcry_cast5_ctr_enc, _gcry_cast5_cbc_dec) (_gcry_cast5_cfb_dec): Ditto. * cipher/cipher-aeswrap.c (_gcry_cipher_aeswrap_encrypt) (_gcry_cipher_aeswrap_decrypt): Ditto. * cipher/cipher-cbc.c (_gcry_cipher_cbc_encrypt) (_gcry_cipher_cbc_decrypt): Ditto. * cipher/cipher-ccm.c (_gcry_cipher_ccm_encrypt) (_gcry_cipher_ccm_decrypt): Ditto. * cipher/cipher-cfb.c (_gcry_cipher_cfb_encrypt) (_gcry_cipher_cfb_decrypt): Ditto. * cipher/cipher-ctr.c (_gcry_cipher_ctr_encrypt): Ditto. * cipher/cipher-internal.h (gcry_cipher_handle->bulk) (_gcry_cipher_cbc_encrypt, _gcry_cipher_cbc_decrypt) (_gcry_cipher_cfb_encrypt, _gcry_cipher_cfb_decrypt) (_gcry_cipher_ofb_encrypt, _gcry_cipher_ctr_encrypt) (_gcry_cipher_aeswrap_encrypt, _gcry_cipher_aeswrap_decrypt) (_gcry_cipher_ccm_encrypt, _gcry_cipher_ccm_decrypt): Ditto. * cipher/cipher-ofb.c (_gcry_cipher_cbc_encrypt): Ditto. * cipher/cipher-selftest.h (gcry_cipher_bulk_cbc_dec_t) (gcry_cipher_bulk_cfb_dec_t, gcry_cipher_bulk_ctr_enc_t): Ditto. * cipher/cipher.c (cipher_setkey, cipher_setiv, do_ecb_crypt) (do_ecb_encrypt, do_ecb_decrypt, cipher_encrypt) (cipher_decrypt): Ditto. * cipher/rijndael.c (_gcry_aes_ctr_enc, _gcry_aes_cbc_dec) (_gcry_aes_cfb_dec, _gcry_aes_cbc_enc, _gcry_aes_cfb_enc): Ditto. * cipher/salsa20.c (salsa20_setiv, salsa20_do_encrypt_stream) (salsa20_encrypt_stream, salsa20r12_encrypt_stream): Ditto. * cipher/serpent.c (_gcry_serpent_ctr_enc, _gcry_serpent_cbc_dec) (_gcry_serpent_cfb_dec): Ditto. * cipher/twofish.c (_gcry_twofish_ctr_enc, _gcry_twofish_cbc_dec) (_gcry_twofish_cfb_dec): Ditto. * src/cipher-proto.h (gcry_cipher_stencrypt_t) (gcry_cipher_stdecrypt_t, cipher_setiv_fuct_t): Ditto. 
* src/cipher.h (_gcry_aes_cfb_enc, _gcry_aes_cfb_dec) (_gcry_aes_cbc_enc, _gcry_aes_cbc_dec, _gcry_aes_ctr_enc) (_gcry_blowfish_cfb_dec, _gcry_blowfish_cbc_dec) (_gcry_blowfish_ctr_enc, _gcry_cast5_cfb_dec, _gcry_cast5_cbc_dec) (_gcry_cast5_ctr_enc, _gcry_camellia_cfb_dec, _gcry_camellia_cbc_dec) (_gcry_camellia_ctr_enc, _gcry_serpent_cfb_dec, _gcry_serpent_cbc_dec) (_gcry_serpent_ctr_enc, _gcry_twofish_cfb_dec, _gcry_twofish_cbc_dec) (_gcry_twofish_ctr_enc): Ditto. -- On 64-bit platforms, cipher module internally converts 64-bit size_t values to 32-bit unsigned integers. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
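The problem being fixed, a 64-bit length silently narrowed to a 32-bit unsigned integer, can be demonstrated with fixed-width types. The helper names below are hypothetical and only illustrate the truncation; they do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow a 64-bit length to 32 bits, as happens when a size_t value
   is passed through an 'unsigned int' parameter on a 64-bit platform.
   The upper 32 bits are discarded.  */
static uint32_t
toy_length_as_u32 (uint64_t len)
{
  return (uint32_t) len;
}

/* Return 1 if LEN survives the round trip through 32 bits.  */
static int
toy_length_survives_u32 (uint64_t len)
{
  return (uint64_t) toy_length_as_u32 (len) == len;
}
```

A buffer of 4 GiB plus 5 bytes would be seen as 5 bytes long after such a conversion, which is why the internal interfaces were switched to 'size_t' throughout.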
2013-11-15 | Avoid unneeded stack burning with AES-NI and reduce number of 'decryption_prepared' checks | Jussi Kivilinna | 1 | -69/+89
* cipher/rijndael.c (RIJNDAEL_context): Make 'decryption_prepared', 'use_padlock' and 'use_aesni' 1-bit members in bitfield.
(do_setkey): Move 'hwfeatures' inside [USE_AESNI || USE_PADLOCK].
(do_aesni_enc_aligned): Rename to...
(do_aesni_enc): ...this, as function does not require aligned input.
(do_aesni_dec_aligned): Rename to...
(do_aesni_dec): ...this, as function does not require aligned input.
(do_aesni): Remove.
(rijndael_encrypt): Call 'do_aesni_enc' instead of 'do_aesni'.
(rijndael_decrypt): Call 'do_aesni_dec' instead of 'do_aesni'.
(check_decryption_preparation): New.
(do_decrypt): Remove 'decryption_prepared' check.
(rijndael_decrypt): Ditto and call 'check_decryption_preparation'.
(_gcry_aes_cbc_dec): Ditto.
(_gcry_aes_cfb_enc): Add 'burn_depth' and burn stack only when needed.
(_gcry_aes_cbc_enc): Ditto.
(_gcry_aes_ctr_enc): Ditto.
(_gcry_aes_cfb_dec): Ditto.
(_gcry_aes_cbc_dec): Ditto and correct clearing of 'savebuf'.
--
Patch is mostly about reducing overhead for short buffers.

Results on Intel i5-4570:

After:
$ tests/benchmark --cipher-repetitions 1000 --cipher-with-keysetup cipher aes
Running each test 1000 times.
               ECB/Stream         CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
AES            480ms   540ms  1750ms   300ms  1630ms   300ms  1640ms  1640ms   350ms   350ms  2130ms  2140ms

Before:
$ tests/benchmark --cipher-repetitions 1000 --cipher-with-keysetup cipher aes
Running each test 1000 times.
               ECB/Stream         CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
AES            520ms   590ms  1760ms   310ms  1640ms   310ms  1610ms  1600ms   360ms   360ms  2150ms  2160ms

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-11-09 | Fix tail handling for AES-NI counter mode | Jussi Kivilinna | 1 | -7/+6
* cipher/rijndael.c (do_aesni_ctr): Fix outputting of updated counter-IV.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-11-06 | Speed-up AES-NI key setup | Jussi Kivilinna | 1 | -99/+300
* cipher/rijndael.c [USE_AESNI] (m128i_t): Remove.
[USE_AESNI] (u128_t): New.
[USE_AESNI] (aesni_do_setkey): New.
(do_setkey) [USE_AESNI]: Move AES-NI accelerated key setup to 'aesni_do_setkey'.
(do_setkey): Call _gcry_get_hw_features only once. Clear stack after use in generic key setup part.
(rijndael_setkey): Remove stack burning.
(prepare_decryption) [USE_AESNI]: Use 'u128_t' instead of 'm128i_t' to avoid compiler generated SSE2 instructions and XMM register usage; unroll 'aesimc' setup loop.
(prepare_decryption): Clear stack after use.
[USE_AESNI] (do_aesni_enc_aligned): Update comment about alignment.
(do_decrypt): Do not burn stack after prepare_decryption.
--
Patch improves the speed of AES key setup with AES-NI instructions. Patch also removes the problematic use of a vector typedef, which might interfere with XMM register usage in AES-NI accelerated code.

New:
$ tests/benchmark --cipher-with-keysetup --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
               ECB/Stream         CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
AES            520ms   590ms  1760ms   310ms  1640ms   300ms  1620ms  1610ms   350ms   360ms  2160ms  2140ms
AES192         640ms   680ms  2030ms   370ms  1920ms   350ms  1890ms  1880ms   400ms   410ms  2490ms  2490ms
AES256         730ms   780ms  2330ms   430ms  2210ms   420ms  2170ms  2180ms   470ms   480ms  2830ms  2840ms

Old:
$ tests/benchmark --cipher-with-keysetup --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
               ECB/Stream         CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
AES            670ms   740ms  1910ms   470ms  1790ms   470ms  1770ms  1760ms   520ms   510ms  2310ms  2310ms
AES192         820ms   860ms  2220ms   550ms  2110ms   540ms  2070ms  2070ms   600ms   590ms  2670ms  2680ms
AES256         920ms   970ms  2510ms   620ms  2390ms   600ms  2360ms  2370ms   650ms   660ms  3020ms  3020ms

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-11-06 | Tweak AES-NI bulk CTR mode slightly | Jussi Kivilinna | 1 | -38/+45
* cipher/rijndael.c [USE_AESNI] (aesni_cleanup_2_5): Rename to...
(aesni_cleanup_2_6): ...this and clear also 'xmm6'.
[USE_AESNI && __i386__] (do_aesni_ctr, do_aesni_ctr_4): Prevent inlining only on i386, allow on AMD64.
[USE_AESNI] (do_aesni_ctr, do_aesni_ctr_4): Use counter block from 'xmm5' and byte-swap mask from 'xmm6'.
(_gcry_aes_ctr_enc) [USE_AESNI]: Preload counter block to 'xmm5' and byte-swap mask to 'xmm6'.
(_gcry_aes_ctr_enc, _gcry_aes_cfb_dec, _gcry_aes_cbc_dec): Use 'aesni_cleanup_2_6'.
--
Small tweak that yields ~5% more speed on Intel Core i5-4570.

After:
 AES      |  nanosecs/byte   mebibytes/sec   cycles/byte
 CTR enc  |     0.274 ns/B    3482.5 MiB/s     0.877 c/B
 CTR dec  |     0.274 ns/B    3486.8 MiB/s     0.876 c/B

Before:
 AES      |  nanosecs/byte   mebibytes/sec   cycles/byte
 CTR enc  |     0.288 ns/B    3312.5 MiB/s     0.922 c/B
 CTR dec  |     0.288 ns/B    3312.6 MiB/s     0.922 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-11-04 | Make test vectors 'static const' | Jussi Kivilinna | 1 | -1/+1
* cipher/arcfour.c (selftest): Change test vectors to 'static const'.
* cipher/blowfish.c (selftest): Ditto.
* cipher/camellia-glue.c (selftest): Ditto.
* cipher/cast5.c (selftest): Ditto.
* cipher/des.c (selftest): Ditto.
* cipher/rijndael.c (selftest): Ditto.
* tests/basic.c (cipher_cbc_mac_cipher, check_aes128_cbc_cts_cipher)
(check_ctr_cipher, check_cfb_cipher, check_ofb_cipher)
(check_ccm_cipher, check_stream_cipher)
(check_stream_cipher_large_block, check_bulk_cipher_modes)
(check_ciphers, check_digests, check_hmac, check_pubkey_sign)
(check_pubkey_sign_ecdsa, check_pubkey_crypt, check_pubkey): Ditto.
--
Some test vectors have been defined without 'static' and thus end up being initialized at run time. Change these to 'static'. Also make test vectors 'const' where possible.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
2013-10-23 | Improve the speed of the cipher mode code | Jussi Kivilinna | 1 | -24/+34
* cipher/bufhelp.h (buf_cpy): New.
(buf_xor, buf_xor_2dst): If buffers unaligned, always jump to per-byte processing.
(buf_xor_n_copy_2): New.
(buf_xor_n_copy): Use 'buf_xor_n_copy_2'.
* cipher/blowfish.c (_gcry_blowfish_cbc_dec): Avoid extra memory copy and use new 'buf_xor_n_copy_2'.
* cipher/camellia-glue.c (_gcry_camellia_cbc_dec): Ditto.
* cipher/cast5.c (_gcry_cast_cbc_dec): Ditto.
* cipher/serpent.c (_gcry_serpent_cbc_dec): Ditto.
* cipher/twofish.c (_gcry_twofish_cbc_dec): Ditto.
* cipher/rijndael.c (_gcry_aes_cbc_dec): Ditto.
(do_encrypt, do_decrypt): Use 'buf_cpy' instead of 'memcpy'.
(_gcry_aes_cbc_enc): Avoid copying IV, use 'last_iv' pointer instead.
* cipher/cipher-cbc.c (_gcry_cipher_cbc_encrypt): Avoid copying IV, update pointer to IV instead.
(_gcry_cipher_cbc_decrypt): Avoid extra memory copy and use new 'buf_xor_n_copy_2'.
(_gcry_cipher_cbc_encrypt, _gcry_cipher_cbc_decrypt): Avoid extra accesses to c->spec, use 'buf_cpy' instead of memcpy.
* cipher/cipher-ccm.c (do_cbc_mac): Ditto.
* cipher/cipher-cfb.c (_gcry_cipher_cfb_encrypt)
(_gcry_cipher_cfb_decrypt): Ditto.
* cipher/cipher-ctr.c (_gcry_cipher_ctr_encrypt): Ditto.
* cipher/cipher-ofb.c (_gcry_cipher_ofb_encrypt)
(_gcry_cipher_ofb_decrypt): Ditto.
* cipher/cipher.c (do_ecb_encrypt, do_ecb_decrypt): Ditto.
--

Patch improves the speed of the generic block cipher mode code. Especially on targets without faster unaligned memory accesses, the generic code was slower than the algorithm specific bulk versions. With this patch, this issue should be solved.

Tests on Cortex-A8; compiled for ARMv4, without unaligned-accesses:

Before:
               ECB/Stream          CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
SEED           490ms   500ms   560ms   580ms   530ms   540ms   560ms   560ms   550ms   540ms  1080ms  1080ms
TWOFISH        230ms   230ms   290ms   300ms   260ms   240ms   290ms   290ms   240ms   240ms   520ms   510ms
DES            720ms   720ms   800ms   860ms   770ms   770ms   810ms   820ms   770ms   780ms       -       -
CAST5          340ms   340ms   440ms   250ms   390ms   250ms   440ms   430ms   260ms   250ms       -       -

After:
               ECB/Stream          CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
SEED           500ms   490ms   520ms   520ms   530ms   520ms   530ms   540ms   500ms   520ms  1060ms  1070ms
TWOFISH        230ms   220ms   250ms   230ms   260ms   230ms   260ms   260ms   230ms   230ms   500ms   490ms
DES            720ms   720ms   750ms   760ms   740ms   750ms   770ms   770ms   760ms   760ms       -       -
CAST5          340ms   340ms   370ms   250ms   370ms   250ms   380ms   390ms   250ms   250ms       -       -

Tests on Cortex-A8; compiled for ARMv7-A, with unaligned-accesses:

Before:
               ECB/Stream          CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
SEED           430ms   440ms   480ms   530ms   470ms   460ms   490ms   480ms   470ms   460ms   930ms   940ms
TWOFISH        220ms   220ms   250ms   230ms   240ms   230ms   270ms   250ms   230ms   240ms   480ms   470ms
DES            550ms   540ms   620ms   690ms   570ms   540ms   630ms   650ms   590ms   580ms       -       -
CAST5          300ms   300ms   380ms   230ms   330ms   230ms   380ms   370ms   230ms   230ms       -       -

After:
               ECB/Stream          CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
SEED           430ms   430ms   460ms   450ms   460ms   450ms   470ms   470ms   460ms   470ms   900ms   930ms
TWOFISH        220ms   210ms   240ms   230ms   230ms   230ms   250ms   250ms   230ms   230ms   470ms   470ms
DES            540ms   540ms   580ms   570ms   570ms   570ms   560ms   620ms   580ms   570ms       -       -
CAST5          300ms   290ms   310ms   230ms   320ms   230ms   350ms   350ms   230ms   230ms       -       -

Tests on Intel Atom N160 (i386):

Before:
               ECB/Stream          CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
SEED           380ms   380ms   410ms   420ms   400ms   400ms   410ms   410ms   390ms   400ms   820ms   800ms
TWOFISH        340ms   340ms   370ms   350ms   360ms   340ms   370ms   370ms   330ms   340ms   710ms   700ms
DES            660ms   650ms   710ms   740ms   680ms   700ms   700ms   710ms   680ms   680ms       -       -
CAST5          340ms   340ms   380ms   330ms   360ms   330ms   390ms   390ms   320ms   330ms       -       -

After:
               ECB/Stream          CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
SEED           380ms   380ms   390ms   410ms   400ms   390ms   410ms   400ms   400ms   390ms   810ms   800ms
TWOFISH        330ms   340ms   350ms   360ms   350ms   340ms   380ms   370ms   340ms   360ms   700ms   710ms
DES            630ms   640ms   660ms   690ms   680ms   680ms   700ms   690ms   680ms   680ms       -       -
CAST5          340ms   330ms   350ms   330ms   370ms   340ms   380ms   390ms   330ms   330ms       -       -

Tests in Intel i5-4570 (x86-64):

Before:
               ECB/Stream          CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
SEED           560ms   560ms   600ms   590ms   600ms   570ms   570ms   570ms   580ms   590ms  1200ms  1180ms
TWOFISH        240ms   240ms   270ms   160ms   260ms   160ms   250ms   250ms   160ms   160ms   430ms   430ms
DES            570ms   570ms   640ms   590ms   630ms   580ms   600ms   600ms   610ms   620ms       -       -
CAST5          410ms   410ms   470ms   150ms   470ms   150ms   450ms   450ms   150ms   160ms       -       -

After:
               ECB/Stream          CBC             CFB             OFB             CTR             CCM
             --------------- --------------- --------------- --------------- --------------- ---------------
SEED           560ms   560ms   590ms   570ms   580ms   570ms   570ms   570ms   590ms   590ms  1200ms  1200ms
TWOFISH        240ms   240ms   260ms   160ms   250ms   170ms   250ms   250ms   160ms   160ms   430ms   430ms
DES            570ms   570ms   620ms   580ms   630ms   570ms   600ms   590ms   620ms   620ms       -       -
CAST5          410ms   410ms   460ms   150ms   460ms   160ms   450ms   450ms   150ms   150ms       -       -

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

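A minimal sketch of what a combined XOR-and-copy helper such as 'buf_xor_n_copy_2' might do for CBC decryption. The function name, argument order and byte-wise loop below are assumptions made for illustration; the real helper in cipher/bufhelp.h may differ and presumably processes wider words where alignment allows.

```c
#include <stddef.h>

/* Hypothetical stand-in for 'buf_xor_n_copy_2' (name and argument order
   are assumed): dst_xor = src_xor ^ srcdst_cpy, then srcdst_cpy = src_cpy. */
static void
xor_n_copy_2_sketch (unsigned char *dst_xor, const unsigned char *src_xor,
                     unsigned char *srcdst_cpy, const unsigned char *src_cpy,
                     size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    {
      /* Read the copy source first: dst_xor may alias src_cpy when
         the caller decrypts in place. */
      unsigned char c = src_cpy[i];

      dst_xor[i] = src_xor[i] ^ srcdst_cpy[i];
      srcdst_cpy[i] = c;
    }
}
```

In a CBC decryption loop this could be called as xor_n_copy_2_sketch (outbuf, decrypted_block, iv, inbuf, blocksize): the plaintext is produced and the IV advances to the current ciphertext block in one pass, without a temporary copy of the ciphertext.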
2013-10-23 | Enable assembler optimizations on earlier ARM cores | Dmitry Eremin-Solenikov | 1 | -19/+19

* cipher/blowfish-armv6.S => cipher/blowfish-arm.S: adapt to pre-armv6 CPUs.
* cipher/blowfish.c: enable assembly on armv4/armv5 little-endian CPUs.
* cipher/camellia-armv6.S => cipher/camellia-arm.S: adapt to pre-armv6 CPUs.
* cipher/camellia.c, cipher-camellia-glue.c: enable assembly on armv4/armv5 little-endian CPUs.
* cipher/cast5-armv6.S => cipher/cast5-arm.S: adapt to pre-armv6 CPUs.
* cipher/cast5.c: enable assembly on armv4/armv5 little-endian CPUs.
* cipher/rijndael-armv6.S => cipher/rijndael-arm.S: adapt to pre-armv6 CPUs.
* cipher/rijndael.c: enable assembly on armv4/armv5 little-endian CPUs.
* cipher/twofish-armv6.S => cipher/twofish-arm.S: adapt to pre-armv6 CPUs.
* cipher/twofish.c: enable assembly on armv4/armv5 little-endian CPUs.
--

Our ARMv6 assembly optimized code can be easily adapted to earlier CPUs. The only incompatible place is rev instruction used to do byte swapping. Replace it on <= ARMv6 with a series of 4 instructions.

Compare:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            620ms   610ms   650ms   680ms   620ms   630ms   660ms   660ms   630ms   630ms
CAMELLIA128    720ms   720ms   780ms   790ms   770ms   760ms   780ms   780ms   770ms   760ms
CAMELLIA256    910ms   910ms   970ms   970ms   960ms   950ms   970ms   970ms   960ms   950ms
CAST5          820ms   820ms   930ms   920ms   890ms   860ms   930ms   920ms   880ms   890ms
BLOWFISH       550ms   560ms   650ms   660ms   630ms   600ms   660ms   650ms   610ms   620ms

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            130ms   140ms   180ms   200ms   160ms   170ms   190ms   200ms   170ms   170ms
CAMELLIA128    150ms   160ms   210ms   220ms   200ms   190ms   210ms   220ms   190ms   190ms
CAMELLIA256    180ms   180ms   260ms   240ms   240ms   230ms   250ms   250ms   230ms   230ms
CAST5          170ms   160ms   270ms   120ms   240ms   130ms   260ms   270ms   130ms   120ms
BLOWFISH       160ms   150ms   260ms   110ms   230ms   120ms   250ms   260ms   110ms   120ms

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>

[ jk: in camellia.h and twofish.c, USE_ARMV6_ASM => USE_ARM_ASM ]
[ jk: fix blowfish-arm.S when __ARM_FEATURE_UNALIGNED defined ]
[ jk: in twofish.S remove defined(HAVE_ARM_ARCH_V6) ]
[ jk: ARMv6 => ARM in comments ]

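The 'rev' replacement described in the entry above can be illustrated in C. The sketch below is the classic four-operation byte swap commonly used on ARM cores that lack 'rev' (XOR with rotate, bit clear, rotate, XOR with shift). That the patch uses exactly this sequence is an assumption, and the function names are hypothetical.

```c
#include <stdint.h>

/* Rotate right by n bits; n must be in 1..31. */
static uint32_t
ror32_sketch (uint32_t x, unsigned int n)
{
  return (x >> n) | (x << (32 - n));
}

/* Byte swap in four operations, mirroring an ARM sequence along the
   lines of: eor / bic / mov ..., ror #8 / eor ..., lsr #8.  */
static uint32_t
bswap32_sketch (uint32_t x)
{
  uint32_t t;

  t = x ^ ror32_sketch (x, 16);   /* eor  t, x, x, ror #16  */
  t &= ~(uint32_t) 0x00ff0000;    /* bic  t, t, #0x00ff0000 */
  x = ror32_sketch (x, 8);        /* mov  x, x, ror #8      */
  return x ^ (t >> 8);            /* eor  x, x, t, lsr #8   */
}
```

On ARMv6 and later the same result comes from a single 'rev', which is why the substitution is only needed on older cores.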
2013-10-01 | cipher: Simplify the cipher dispatcher cipher.c. | Werner Koch | 1 | -15/+15

* src/gcrypt-module.h (gcry_cipher_spec_t): Move to ...
* src/cipher-proto.h (gcry_cipher_spec_t): here. Merge with cipher_extra_spec_t. Add fields ALGO and FLAGS. Set these fields in all cipher modules.
* cipher/cipher.c: Change most code to replace the former module system by a simpler system to gain information about the algorithms.
(disable_pubkey_algo): Simplified. Not anymore thread-safe, though.
* cipher/md.c (_gcry_md_selftest): Use correct structure. Not a real problem because both define the same function as their first field.
* cipher/pubkey.c (_gcry_pk_selftest): Take care of the disabled flag.

Signed-off-by: Werner Koch <wk@gnupg.org>

2013-09-04 | Move stack burning from block ciphers to cipher modes | Jussi Kivilinna | 1 | -6/+14

* src/gcrypt-module.h (gcry_cipher_encrypt_t)
(gcry_cipher_decrypt_t): Return 'unsigned int'.
* cipher/cipher.c (dummy_encrypt_block, dummy_decrypt_block): Return zero.
(do_ecb_encrypt, do_ecb_decrypt): Get largest stack burn depth from block cipher crypt function and burn stack at end.
* cipher/cipher-aeswrap.c (_gcry_cipher_aeswrap_encrypt)
(_gcry_cipher_aeswrap_decrypt): Ditto.
* cipher/cipher-cbc.c (_gcry_cipher_cbc_encrypt)
(_gcry_cipher_cbc_decrypt): Ditto.
* cipher/cipher-cfb.c (_gcry_cipher_cfb_encrypt)
(_gcry_cipher_cfb_decrypt): Ditto.
* cipher/cipher-ctr.c (_gcry_cipher_cbc_encrypt): Ditto.
* cipher/cipher-ofb.c (_gcry_cipher_ofb_encrypt)
(_gcry_cipher_ofb_decrypt): Ditto.
* cipher/blowfish.c (encrypt_block, decrypt_block): Return burn stack depth.
* cipher/camellia-glue.c (camellia_encrypt, camellia_decrypt): Ditto.
* cipher/cast5.c (encrypt_block, decrypt_block): Ditto.
* cipher/des.c (do_tripledes_encrypt, do_tripledes_decrypt)
(do_des_encrypt, do_des_decrypt): Ditto.
* cipher/idea.c (idea_encrypt, idea_decrypt): Ditto.
* cipher/rijndael.c (rijndael_encrypt, rijndael_decrypt): Ditto.
* cipher/seed.c (seed_encrypt, seed_decrypt): Ditto.
* cipher/serpent.c (serpent_encrypt, serpent_decrypt): Ditto.
* cipher/twofish.c (twofish_encrypt, twofish_decrypt): Ditto.
* cipher/rfc2268.c (encrypt_block, decrypt_block): New.
(_gcry_cipher_spec_rfc2268_40): Use encrypt_block and decrypt_block.
--

Patch moves stack burning from block ciphers and cipher mode loop to end of cipher mode functions. This greatly reduces the overall CPU usage of the problematic _gcry_burn_stack. Internal cipher module API is changed so that encrypt/decrypt functions now return the stack burn depth as unsigned int to cipher mode function.

(Note, patch also adds missing burn_stack for RFC2268_40 cipher).

_gcry_burn_stack CPU time (looping tests/benchmark cipher blowfish):

arch     CPU             Old    New
i386     Intel-Haswell   4.1%   0.16%
x86_64   Intel-Haswell   3.4%   0.07%
armhf    Cortex-A8       8.7%   0.14%

New vs. old (armhf/Cortex-A8):

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
IDEA           1.05x   1.05x   1.04x   1.04x   1.04x   1.04x   1.07x   1.05x   1.04x   1.04x
3DES           1.04x   1.03x   1.04x   1.03x   1.04x   1.04x   1.04x   1.04x   1.04x   1.04x
CAST5          1.19x   1.20x   1.15x   1.00x   1.17x   1.00x   1.15x   1.05x   1.00x   1.00x
BLOWFISH       1.21x   1.22x   1.16x   1.00x   1.18x   1.00x   1.16x   1.16x   1.00x   1.00x
AES            1.09x   1.09x   1.00x   1.00x   1.00x   1.00x   1.07x   1.07x   1.00x   1.00x
AES192         1.11x   1.11x   1.00x   1.00x   1.00x   1.00x   1.08x   1.09x   1.01x   1.00x
AES256         1.07x   1.08x   1.01x    .99x   1.00x   1.00x   1.07x   1.06x   1.00x   1.00x
TWOFISH        1.10x   1.09x   1.09x   1.00x   1.09x   1.00x   1.08x   1.09x   1.00x   1.00x
ARCFOUR        1.00x   1.00x
DES            1.07x   1.11x   1.06x   1.08x   1.07x   1.07x   1.06x   1.06x   1.06x   1.06x
TWOFISH128     1.10x   1.10x   1.09x   1.00x   1.09x   1.00x   1.08x   1.08x   1.00x   1.00x
SERPENT128     1.06x   1.07x   1.02x   1.00x   1.06x   1.00x   1.06x   1.05x   1.00x   1.00x
SERPENT192     1.07x   1.06x   1.03x   1.00x   1.06x   1.00x   1.06x   1.05x   1.00x   1.00x
SERPENT256     1.06x   1.07x   1.02x   1.00x   1.06x   1.00x   1.05x   1.06x   1.00x   1.00x
RFC2268_40     0.97x   1.01x   0.99x   0.98x   1.00x   0.97x   0.96x   0.96x   0.97x   0.97x
SEED           1.45x   1.54x   1.53x   1.56x   1.50x   1.51x   1.50x   1.50x   1.42x   1.42x
CAMELLIA128    1.08x   1.07x   1.06x   1.00x   1.07x   1.00x   1.06x   1.06x   1.00x   1.00x
CAMELLIA192    1.08x   1.08x   1.08x   1.00x   1.07x   1.00x   1.07x   1.07x   1.00x   1.00x
CAMELLIA256    1.08x   1.09x   1.07x   1.01x   1.08x   1.00x   1.07x   1.07x   1.00x   1.00x
SALSA20         .99x   1.00x

Raw data:

New (armhf/Cortex-A8):
Running each test 100 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
IDEA          8620ms  8680ms  9640ms 10010ms  9140ms  8960ms  9630ms  9660ms  9180ms  9180ms
3DES         13990ms 14000ms 14780ms 15300ms 14320ms 14370ms 14780ms 14780ms 14480ms 14480ms
CAST5         2980ms  2980ms  3780ms  2300ms  3290ms  2320ms  3770ms  4100ms  2320ms  2320ms
BLOWFISH      2740ms  2660ms  3530ms  2060ms  3050ms  2080ms  3530ms  3530ms  2070ms  2070ms
AES           2200ms  2330ms  2330ms  2450ms  2270ms  2270ms  2700ms  2690ms  2330ms  2320ms
AES192        2550ms  2670ms  2700ms  2910ms  2630ms  2640ms  3060ms  3060ms  2680ms  2690ms
AES256        2920ms  3010ms  3040ms  3190ms  3010ms  3000ms  3380ms  3420ms  3050ms  3050ms
TWOFISH       2790ms  2840ms  3300ms  2950ms  3010ms  2870ms  3310ms  3280ms  2940ms  2940ms
ARCFOUR       2050ms  2050ms
DES           5640ms  5630ms  6440ms  6970ms  5960ms  6000ms  6440ms  6440ms  6120ms  6120ms
TWOFISH128    2790ms  2840ms  3300ms  2950ms  3010ms  2890ms  3310ms  3290ms  2930ms  2930ms
SERPENT128    4530ms  4340ms  5210ms  4470ms  4740ms  4620ms  5020ms  5030ms  4680ms  4680ms
SERPENT192    4510ms  4340ms  5190ms  4460ms  4750ms  4620ms  5020ms  5030ms  4680ms  4680ms
SERPENT256    4540ms  4330ms  5220ms  4460ms  4730ms  4600ms  5030ms  5020ms  4680ms  4680ms
RFC2268_40   10530ms  7790ms 11140ms  9490ms 10650ms 10710ms 11710ms 11690ms 11000ms 11000ms
SEED          4530ms  4540ms  5050ms  5380ms  4760ms  4810ms  5060ms  5060ms  4850ms  4860ms
CAMELLIA128   2660ms  2630ms  3170ms  2750ms  2880ms  2740ms  3170ms  3170ms  2780ms  2780ms
CAMELLIA192   3430ms  3400ms  3930ms  3530ms  3650ms  3500ms  3940ms  3940ms  3570ms  3560ms
CAMELLIA256   3430ms  3390ms  3940ms  3500ms  3650ms  3510ms  3930ms  3940ms  3550ms  3550ms
SALSA20       1910ms  1900ms

Old (armhf/Cortex-A8):
Running each test 100 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
IDEA          9030ms  9100ms 10050ms 10410ms  9540ms  9360ms 10350ms 10190ms  9560ms  9570ms
3DES         14580ms 14460ms 15300ms 15720ms 14880ms 14900ms 15350ms 15330ms 15030ms 15020ms
CAST5         3560ms  3570ms  4350ms  2300ms  3860ms  2330ms  4340ms  4320ms  2330ms  2320ms
BLOWFISH      3320ms  3250ms  4110ms  2060ms  3610ms  2080ms  4100ms  4090ms  2070ms  2070ms
AES           2390ms  2530ms  2320ms  2460ms  2280ms  2270ms  2890ms  2880ms  2330ms  2330ms
AES192        2830ms  2970ms  2690ms  2900ms  2630ms  2650ms  3320ms  3330ms  2700ms  2690ms
AES256        3110ms  3250ms  3060ms  3170ms  3000ms  3000ms  3610ms  3610ms  3050ms  3060ms
TWOFISH       3080ms  3100ms  3600ms  2940ms  3290ms  2880ms  3560ms  3570ms  2940ms  2930ms
ARCFOUR       2060ms  2050ms
DES           6060ms  6230ms  6850ms  7540ms  6380ms  6400ms  6830ms  6840ms  6500ms  6510ms
TWOFISH128    3060ms  3110ms  3600ms  2940ms  3290ms  2890ms  3560ms  3560ms  2940ms  2930ms
SERPENT128    4820ms  4630ms  5330ms  4460ms  5030ms  4620ms  5300ms  5300ms  4680ms  4680ms
SERPENT192    4830ms  4620ms  5320ms  4460ms  5040ms  4620ms  5300ms  5300ms  4680ms  4680ms
SERPENT256    4820ms  4640ms  5330ms  4460ms  5030ms  4620ms  5300ms  5300ms  4680ms  4660ms
RFC2268_40   10260ms  7850ms 11080ms  9270ms 10620ms 10380ms 11250ms 11230ms 10690ms 10710ms
SEED          6580ms  6990ms  7710ms  8370ms  7140ms  7240ms  7600ms  7610ms  6870ms  6900ms
CAMELLIA128   2860ms  2820ms  3360ms  2750ms  3080ms  2740ms  3350ms  3360ms  2790ms  2790ms
CAMELLIA192   3710ms  3680ms  4240ms  3520ms  3910ms  3510ms  4200ms  4210ms  3560ms  3560ms
CAMELLIA256   3700ms  3680ms  4230ms  3520ms  3930ms  3510ms  4200ms  4210ms  3550ms  3560ms
SALSA20       1900ms  1900ms

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

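The API change described in the entry above (block functions return a burn depth, the mode function burns once at the end) might look roughly like the following sketch. All names are hypothetical, the "cipher" is a toy XOR, and the burn helper only records its argument instead of wiping the stack as _gcry_burn_stack would.

```c
#include <stddef.h>

/* Block function type: returns the stack depth in bytes that the
   caller should burn afterwards. */
typedef unsigned int (*block_fn_sketch_t) (void *ctx, unsigned char *out,
                                           const unsigned char *in);

static unsigned int burn_calls;      /* how often the burn helper ran */
static unsigned int last_burn_depth; /* depth passed to the last call */

/* Hypothetical stand-in for _gcry_burn_stack; it only records the call. */
static void
burn_stack_sketch (unsigned int depth)
{
  burn_calls++;
  last_burn_depth = depth;
}

/* Toy 8-byte "cipher" (not a real algorithm) reporting a made-up depth. */
static unsigned int
toy_encrypt_block (void *ctx, unsigned char *out, const unsigned char *in)
{
  int i;

  (void) ctx;
  for (i = 0; i < 8; i++)
    out[i] = in[i] ^ 0x5a;
  return 64;
}

/* ECB-style mode loop: track the largest depth, burn once at the end. */
static void
ecb_encrypt_sketch (block_fn_sketch_t fn, void *ctx, unsigned char *out,
                    const unsigned char *in, size_t nblocks)
{
  unsigned int burn = 0, nburn;
  size_t n;

  for (n = 0; n < nblocks; n++)
    {
      nburn = fn (ctx, out + 8 * n, in + 8 * n);
      if (nburn > burn)
        burn = nburn;
    }

  if (burn > 0)
    burn_stack_sketch (burn);
}
```

The saving reported in the CPU-time table follows from the call count: one burn per mode call instead of one per block.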
2013-08-20 | Move ARMv6 detection to configure.ac | Jussi Kivilinna | 1 | -8/+1

* cipher/blowfish-armv6.S: Replace __ARM_ARCH >= 6 checks with HAVE_ARM_ARCH_V6.
* cipher/blowfish.c: Ditto.
* cipher/camellia-armv6.S: Ditto.
* cipher/camellia.h: Ditto.
* cipher/cast5-armv6.S: Ditto.
* cipher/cast5.c: Ditto.
* cipher/rijndael-armv6.S: Ditto.
* cipher/rijndael.c: Ditto.
* configure.ac: Add HAVE_ARM_ARCH_V6 check.
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

2013-08-14 | rijndael: add ARMv6 assembly implementation | Jussi Kivilinna | 1 | -8/+40

* cipher/Makefile.am: Add 'rijndael-armv6.S'.
* cipher/rijndael-armv6.S: New file.
* cipher/rijndael.c (USE_ARMV6_ASM): New macro.
[USE_ARMV6_ASM] (_gcry_aes_armv6_encrypt_block)
(_gcry_aes_armv6_decrypt_block): New prototypes.
(do_encrypt_aligned) [USE_ARMV6_ASM]: Use ARMv6 assembly function.
(do_encrypt): Disable input/output alignment when USE_ARMV6_ASM.
(do_decrypt_aligned) [USE_ARMV6_ASM]: Use ARMv6 assembly function.
(do_decrypt): Disable input/output alignment when USE_ARMV6_ASM.
* configure.ac (HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS): New check for gcc/as compatibility with ARM assembly implementations.
(aes) [arm]: Add 'rijndael-armv6.lo'.
--

Add optimized ARMv6 assembly implementation for AES. Implementation is tuned for Cortex-A8. Unaligned access handling is done in the assembly part.

For now, only enable this on little-endian systems, as big-endian correctness has not been tested yet.

Old vs new. Cortex-A8 (on Debian Wheezy/armhf):

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            2.61x   3.12x   2.16x   2.59x   2.26x   2.25x   2.08x   2.08x   2.23x   2.23x
AES192         2.60x   3.06x   2.18x   2.65x   2.29x   2.29x   2.12x   2.12x   2.25x   2.27x
AES256         2.62x   3.09x   2.24x   2.72x   2.30x   2.34x   2.17x   2.19x   2.32x   2.32x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

2013-06-20 | Check if assembler is compatible with AMD64 assembly implementations | Jussi Kivilinna | 1 | -1/+1

* cipher/blowfish-amd64.S: Enable only if HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS is defined.
* cipher/camellia-aesni-avx-amd64.S: Ditto.
* cipher/camellia-aesni-avx2-amd64.S: Ditto.
* cipher/cast5-amd64.S: Ditto.
* cipher/rijndael-amd64.S: Ditto.
* cipher/serpent-avx2-amd64.S: Ditto.
* cipher/serpent-sse2-amd64.S: Ditto.
* cipher/twofish-amd64.S: Ditto.
* cipher/blowfish.c: Use AMD64 assembly implementation only if HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS is defined.
* cipher/camellia-glue.c: Ditto.
* cipher/cast5.c: Ditto.
* cipher/rijndael.c: Ditto.
* cipher/serpent.c: Ditto.
* cipher/twofish.c: Ditto.
* configure.ac: Check gcc/as compatibility with AMD64 assembly implementations.
--

Later these checks can be split and the assembly implementations adapted to handle different platforms, but for now disable the AMD64 assembly implementations if the assembler does not appear able to handle them.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

2013-05-29 | rinjdael: add amd64 assembly implementation | Jussi Kivilinna | 1 | -0/+32

* cipher/Makefile.am: Add 'rijndael-amd64.S'.
* cipher/rijndael-amd64.S: New file.
* cipher/rijndael.c (USE_AMD64_ASM): New macro.
[USE_AMD64_ASM] (_gcry_aes_amd64_encrypt_block)
(_gcry_aes_amd64_decrypt_block): New prototypes.
(do_encrypt_aligned) [USE_AMD64_ASM]: Use amd64 assembly function.
(do_encrypt): Disable input/output alignment when USE_AMD64_ASM is set.
(do_decrypt_aligned) [USE_AMD64_ASM]: Use amd64 assembly function.
(do_decrypt): Disable input/output alignment when USE_AMD64_ASM is set.
* configure.ac (aes) [x86-64]: Add 'rijndael-amd64.lo'.
--

Add optimized amd64 assembly implementation for AES.

Old vs new, on AMD Phenom II:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            1.74x   1.72x   1.81x   1.85x   1.82x   1.76x   1.67x   1.64x   1.79x   1.81x
AES192         1.77x   1.77x   1.79x   1.88x   1.90x   1.80x   1.69x   1.69x   1.85x   1.81x
AES256         1.79x   1.81x   1.83x   1.89x   1.88x   1.82x   1.72x   1.70x   1.87x   1.89x

Old vs new, on Intel Core2:

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            1.77x   1.75x   1.78x   1.76x   1.76x   1.77x   1.75x   1.76x   1.76x   1.82x
AES192         1.80x   1.73x   1.81x   1.76x   1.79x   1.85x   1.77x   1.76x   1.80x   1.85x
AES256         1.81x   1.77x   1.81x   1.77x   1.80x   1.79x   1.78x   1.77x   1.81x   1.85x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

2013-05-24 | cipher-selftest: make selftest work with any block-size | Jussi Kivilinna | 1 | -3/+3

* cipher/cipher-selftest.c (_gcry_selftest_helper_cbc_128)
(_gcry_selftest_helper_cfb_128, _gcry_selftest_helper_ctr_128): Renamed functions from '<name>_128' to '<name>'.
(_gcry_selftest_helper_cbc, _gcry_selftest_helper_cfb)
(_gcry_selftest_helper_ctr): Make work with different block sizes.
* cipher/cipher-selftest.h (_gcry_selftest_helper_cbc_128)
(_gcry_selftest_helper_cfb_128, _gcry_selftest_helper_ctr_128): Renamed prototypes from '<name>_128' to '<name>'.
* cipher/camellia-glue.c (selftest_ctr_128, selftest_cfb_128)
(selftest_ctr_128): Change to use new function names.
* cipher/rijndael.c (selftest_ctr_128, selftest_cfb_128)
(selftest_ctr_128): Change to use new function names.
* cipher/serpent.c (selftest_ctr_128, selftest_cfb_128)
(selftest_ctr_128): Change to use new function names.
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

2013-05-23 | rinjdael: add parallel processing for CFB decryption with AES-NI | Jussi Kivilinna | 1 | -1/+173

* cipher/cipher-selftest.c (_gcry_selftest_helper_cfb_128): New function for CFB selftests.
* cipher/cipher-selftest.h (_gcry_selftest_helper_cfb_128): New prototype.
* cipher/rijndael.c [USE_AESNI] (do_aesni_enc_vec4): New function.
(_gcry_aes_cfb_dec) [USE_AESNI]: Add parallelized CFB decryption.
(selftest_cfb_128): New function.
(selftest): Call selftest_cfb_128.
--

CFB decryption can be parallelized for additional performance. On Intel Sandy-Bridge processor, this change makes CFB decryption 4.6 times faster.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

2013-05-22 | Add AES bulk CBC decryption selftest | Jussi Kivilinna | 1 | -0/+18

* cipher/rijndael.c (selftest_cbc_128): New.
(selftest): Call selftest_cbc_128.
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

2013-05-22 | Change AES bulk CTR encryption selftest use new selftest helper function | Jussi Kivilinna | 1 | -86/+7

* cipher/rijndael.c (selftest_ctr_128): Change to use new selftest helper function.
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

2013-04-18 | cipher: Fix regression in Padlock support. | Werner Koch | 1 | -7/+2

* cipher/rijndael.c (do_setkey): Remove dummy padlock key generation case and use the standard one.
--

This is really a brown paper bag bug. I should have been able to fix it by a bit of code staring or bi-secting it myself. Instead Rafaël Carré did this and with the donation of a VIA nano board from Stefan Krüger. Thanks to both of you.

(regression since commit b825c5db17292988d261fefdc83cbc43d97d4b02)

Signed-off-by: Werner Koch <wk@gnupg.org>
(cherry picked from commit f1f016855418aae561ede4472590d45a24ab4476)

2013-02-19 | Rinjdael: Fix use of SSE2 outside USE_AESNI/ctx->use_aesni | Jussi Kivilinna | 1 | -2/+10

* cipher/rijndael.c (_gcry_aes_cbc_enc): Check if AES-NI is enabled before calling aesni_prepare() and aesni_cleanup().
--

aesni_cleanup() contains SSE2 instructions that are interpreted as MMX on CPUs without SSE2 support (Pentium-III, etc). This causes x87 register state to be poisoned, causing crashes later on when program tries to use floating point registers.

Add '#ifdef USE_AESNI' and 'if (ctx->use_aesni)' for aesni_cleanup() and, while at it, for aesni_prepare() too.

Reported-by: Mitsutoshi NAKANO <bkbin005@rinku.zaq.ne.jp>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

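The double guard described in the entry above (a compile-time '#ifdef' plus a run-time flag check) can be sketched as follows. Everything here is a hypothetical stand-in: the macro, the context type and the counting helpers only mimic the shape of USE_AESNI, ctx->use_aesni, aesni_prepare() and aesni_cleanup().

```c
/* Assumed to be defined only when AES-NI support is compiled in. */
#define USE_AESNI_SKETCH 1

typedef struct
{
  int use_aesni;   /* run-time result of CPU feature detection */
} ctx_sketch_t;

static int prepare_calls, cleanup_calls;

/* Stand-ins for the real macros, which would touch SSE registers. */
static void aesni_prepare_sketch (void) { prepare_calls++; }
static void aesni_cleanup_sketch (void) { cleanup_calls++; }

static void
cbc_enc_guarded_sketch (ctx_sketch_t *ctx)
{
#ifdef USE_AESNI_SKETCH
  if (ctx->use_aesni)
    aesni_prepare_sketch ();
#endif

  /* ... per-block encryption would go here ... */

#ifdef USE_AESNI_SKETCH
  if (ctx->use_aesni)
    aesni_cleanup_sketch ();
#endif
}
```

With the run-time check in place, a CPU without AES-NI never executes the SSE2 clean-up code even when the binary was built with AES-NI support.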
2012-12-03 | Optimize buffer xoring. | Jussi Kivilinna | 1 | -32/+18

* cipher/Makefile.am (libcipher_la_SOURCES): Add 'bufhelp.h'.
* cipher/bufhelp.h: New.
* cipher/cipher-aeswrap.c (_gcry_cipher_aeswrap_encrypt)
(_gcry_cipher_aeswrap_decrypt): Use 'buf_xor' for buffer xoring.
* cipher/cipher-cbc.c (_gcry_cipher_cbc_encrypt)
(_gcry_cipher_cbc_decrypt): Use 'buf_xor' for buffer xoring and remove resulting unused variables.
* cipher/cipher-cfb.c (_gcry_cipher_cfb_encrypt): Use 'buf_xor_2dst' for buffer xoring and remove resulting unused variables.
(_gcry_cipher_cfb_decrypt): Use 'buf_xor_n_copy' for buffer xoring and remove resulting unused variables.
* cipher/cipher-ctr.c (_gcry_cipher_ctr_encrypt): Use 'buf_xor' for buffer xoring and remove resulting unused variables.
* cipher/cipher-ofb.c (_gcry_cipher_ofb_encrypt)
(_gcry_cipher_ofb_decrypt): Use 'buf_xor' for buffer xoring and remove resulting unused variables.
* cipher/rijndael.c (_gcry_aes_cfb_enc): Use 'buf_xor_2dst' for buffer xoring and remove resulting unused variables.
(_gcry_aes_cfb_dec): Use 'buf_xor_n_copy' for buffer xoring and remove resulting unused variables.
(_gcry_aes_cbc_enc, _gcry_aes_ctr_enc, _gcry_aes_cbc_dec): Use 'buf_xor' for buffer xoring and remove resulting unused variables.
--

Add faster helper functions for buffer xoring and replace byte buffer xor loops. This gives the following speedup. Note that the CTR speedup comes from refactoring the code to use buf_xor() and from removing the integer division/modulo operations issued for each processed byte. Removing div/mod most likely gives an even greater speed increase on CPU architectures that do not have a hardware division unit.

Benchmark ratios (old-vs-new, AMD Phenom II, x86-64):

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
IDEA           0.99x   1.01x   1.06x   1.02x   1.03x   1.06x   1.04x   1.02x   1.58x   1.58x
3DES           1.00x   1.00x   1.01x   1.01x   1.02x   1.02x   1.02x   1.01x   1.22x   1.23x
CAST5          0.98x   1.00x   1.09x   1.03x   1.09x   1.09x   1.07x   1.07x   1.98x   1.95x
BLOWFISH       1.00x   1.00x   1.18x   1.05x   1.07x   1.07x   1.05x   1.05x   1.93x   1.91x
AES            1.00x   0.98x   1.18x   1.14x   1.13x   1.13x   1.14x   1.14x   1.18x   1.18x
AES192         0.98x   1.00x   1.13x   1.14x   1.13x   1.10x   1.14x   1.16x   1.15x   1.15x
AES256         0.97x   1.02x   1.09x   1.13x   1.13x   1.09x   1.10x   1.14x   1.11x   1.13x
TWOFISH        1.00x   1.00x   1.15x   1.17x   1.18x   1.16x   1.18x   1.13x   2.37x   2.31x
ARCFOUR        1.03x   0.97x
DES            1.01x   1.00x   1.04x   1.04x   1.04x   1.05x   1.05x   1.02x   1.56x   1.55x
TWOFISH128     0.97x   1.03x   1.18x   1.17x   1.18x   1.15x   1.15x   1.15x   2.37x   2.31x
SERPENT128     1.00x   1.00x   1.10x   1.11x   1.08x   1.09x   1.08x   1.06x   1.66x   1.67x
SERPENT192     1.00x   1.00x   1.07x   1.08x   1.08x   1.09x   1.08x   1.08x   1.65x   1.66x
SERPENT256     1.00x   1.00x   1.09x   1.09x   1.08x   1.09x   1.08x   1.06x   1.66x   1.67x
RFC2268_40     1.03x   0.99x   1.05x   1.02x   1.03x   1.03x   1.04x   1.03x   1.46x   1.46x
SEED           1.00x   1.00x   1.10x   1.10x   1.09x   1.09x   1.10x   1.07x   1.80x   1.76x
CAMELLIA128    1.00x   1.00x   1.23x   1.12x   1.15x   1.17x   1.15x   1.12x   2.15x   2.13x
CAMELLIA192    1.05x   1.03x   1.23x   1.21x   1.21x   1.16x   1.12x   1.25x   1.90x   1.90x
CAMELLIA256    1.03x   1.07x   1.10x   1.19x   1.08x   1.14x   1.12x   1.10x   1.90x   1.92x

Benchmark ratios (old-vs-new, AMD Phenom II, i386):

               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
IDEA           1.00x   1.00x   1.04x   1.05x   1.04x   1.02x   1.02x   1.02x   1.38x   1.40x
3DES           1.01x   1.00x   1.02x   1.04x   1.03x   1.01x   1.00x   1.02x   1.20x   1.20x
CAST5          1.00x   1.00x   1.03x   1.09x   1.07x   1.04x   1.13x   1.00x   1.74x   1.74x
BLOWFISH       1.04x   1.08x   1.03x   1.13x   1.07x   1.12x   1.03x   1.00x   1.78x   1.74x
AES            0.96x   1.00x   1.09x   1.08x   1.14x   1.13x   1.07x   1.03x   1.14x   1.09x
AES192         1.00x   1.03x   1.07x   1.03x   1.07x   1.07x   1.06x   1.03x   1.08x   1.11x
AES256         1.00x   1.00x   1.06x   1.06x   1.10x   1.06x   1.05x   1.03x   1.10x   1.10x
TWOFISH        0.95x   1.10x   1.13x   1.23x   1.05x   1.14x   1.09x   1.13x   1.95x   1.86x
ARCFOUR        1.00x   1.00x
DES            1.02x   0.98x   1.04x   1.04x   1.05x   1.02x   1.04x   1.00x   1.45x   1.48x
TWOFISH128     0.95x   1.10x   1.26x   1.19x   1.09x   1.14x   1.17x   1.00x   2.00x   1.91x
SERPENT128     1.02x   1.00x   1.08x   1.04x   1.10x   1.06x   1.08x   1.04x   1.42x   1.42x
SERPENT192     1.02x   1.02x   1.06x   1.06x   1.10x   1.08x   1.04x   1.06x   1.42x   1.42x
SERPENT256     1.02x   0.98x   1.06x   1.06x   1.10x   1.06x   1.04x   1.06x   1.42x   1.40x
RFC2268_40     1.00x   1.00x   1.02x   1.06x   1.04x   1.02x   1.02x   1.02x   1.35x   1.35x
SEED           1.00x   0.97x   1.11x   1.05x   1.06x   1.08x   1.08x   1.05x   1.56x   1.57x
CAMELLIA128    1.03x   0.97x   1.12x   1.14x   1.06x   1.10x   1.06x   1.06x   1.73x   1.59x
CAMELLIA192    1.06x   1.00x   1.13x   1.10x   1.11x   1.11x   1.15x   1.08x   1.57x   1.58x
CAMELLIA256    1.06x   1.03x   1.10x   1.10x   1.11x   1.11x   1.13x   1.08x   1.57x   1.62x

[v2]:
 - include stdint.h only when it's available
 - use uintptr_t instead of long and intptr_t

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

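A rough sketch of the idea behind a word-wise buffer XOR such as 'buf_xor': process one machine word per iteration and fall back to bytes for the tail. This version swaps in memcpy-based word loads and stores for portability; the real bufhelp.h is not shown here and may instead cast pointers and check alignment, so treat the name and body below as assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-wise XOR: dst = a ^ b over len bytes. */
static void
buf_xor_sketch (void *dst_, const void *a_, const void *b_, size_t len)
{
  unsigned char *dst = dst_;
  const unsigned char *a = a_;
  const unsigned char *b = b_;

  /* Word-sized chunks; memcpy keeps this valid for unaligned buffers
     and compilers typically turn it into plain loads and stores. */
  while (len >= sizeof (uintptr_t))
    {
      uintptr_t wa, wb;

      memcpy (&wa, a, sizeof wa);
      memcpy (&wb, b, sizeof wb);
      wa ^= wb;
      memcpy (dst, &wa, sizeof wa);

      dst += sizeof wa;
      a += sizeof wa;
      b += sizeof wa;
      len -= sizeof wa;
    }

  /* Remaining tail, one byte at a time. */
  while (len--)
    *dst++ = *a++ ^ *b++;
}
```

Compared with a per-byte loop that also computes an index with division or modulo, this does one XOR per word and no division at all, which matches the direction of the CTR gains in the tables above.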
2012-11-29 | Optimize AES-NI CTR mode. | Jussi Kivilinna | 1 | -51/+39

* cipher/rijndael.c [USE_AESNI] (do_aesni_ctr, do_aesni_ctr_4): Make handling of 64-bit overflow and carry conditional. Avoid generic to vector register passing of value '1'. Generate and use '-1' instead.
--

We only need to handle the 64-bit carry in a few special cases that occur very rarely. So move carry handling to the slow path and only detect the need for carry handling on the fast path. Also avoid moving '1' from a generic register to a vector register, as that might be slow on some CPUs. Instead generate '-1' with SSE2 instructions and use subtraction instead of addition to increment the IV.

Overall this gives ~8% improvement in speed for AES CTR mode on Intel Sandy-Bridge.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

2012-11-28 | Make a cpp conditional in rijndael.c better readable. | Werner Koch | 1 | -3/+5

* cipher/rijndael.c (USE_AESNI): Modify cpp conditionals for better readability.

2012-11-28 | Fix building with Clang on x86-64 and i386 | Jussi Kivilinna | 1 | -13/+13

* cipher/rijndael.c [USE_AESNI] (do_aesni_enc_aligned)
(do_aesni_dec_vec4, do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Add explicit suffix to 'cmp' instructions.
--

Clang throws errors on missing instruction suffixes, such as:

  rijndael.c:1091:39: error: ambiguous instructions require an explicit suffix (could be 'cmpb', 'cmpw', 'cmpl', or 'cmpq')
  <inline asm>:39:2: note: instantiated into assembly here
          cmp $1, -44(%rbp)

With this patch building on x86-64 works fine. Other issues still exist on i386, namely with MPI, which can be overcome with 'clang -fheinous-gnu-extensions'.

[v2]:
 - remove do_aesni_enc_vec4 modification as that function didn't make it to upstream.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

NB: I still believe it is a bad idea of clang to define __GNUC__ and not being 100% compatible to gcc. [wk]

2012-11-26 | Fix missing 64bit carry handling in AES-NI CTR mode | Jussi Kivilinna | 1 | -18/+171

* cipher/rijndael.c [USE_AESNI] (do_aesni_ctr, do_aesni_ctr_4): Add carry handling to 64-bit addition.
(selftest_ctr_128): New function for testing IV handling in bulk CTR function.
(selftest): Add call to selftest_ctr_128.
--

Carry handling checks whether the lower 64-bit part of the SSE register overflowed and, if it did, increments the upper part from that point on. Also add selftests to verify correct operation.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

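The carry problem fixed above can be shown in portable C. The sketch assumes a 128-bit big-endian counter block handled as two 64-bit halves, which is how a 64-bit SSE addition would see it; the function name is hypothetical and no SSE instructions are used.

```c
#include <stdint.h>

/* Increment a 128-bit big-endian counter using two 64-bit halves. */
static void
ctr128_inc_sketch (unsigned char ctr[16])
{
  uint64_t hi = 0, lo = 0;
  int i;

  /* Load both halves from big-endian bytes. */
  for (i = 0; i < 8; i++)
    {
      hi = (hi << 8) | ctr[i];
      lo = (lo << 8) | ctr[8 + i];
    }

  lo++;
  if (lo == 0)   /* low half wrapped around: propagate the carry */
    hi++;

  /* Store both halves back as big-endian bytes. */
  for (i = 7; i >= 0; i--)
    {
      ctr[i] = (unsigned char) (hi & 0xff);
      hi >>= 8;
      ctr[8 + i] = (unsigned char) (lo & 0xff);
      lo >>= 8;
    }
}
```

Without the 'if (lo == 0)' step, a counter whose low eight bytes are all 0xff would wrap to zero while the high half stayed unchanged, which is the kind of IV handling a bulk CTR selftest is meant to catch.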
2012-11-26 | Add parallelized AES-NI CBC decryption | Jussi Kivilinna | 1 | -9/+152

* cipher/rijndael.c [USE_AESNI] (aesni_cleanup_5): New macro.
[USE_AESNI] (do_aesni_dec_vec4): New function.
(_gcry_aes_cbc_dec) [USE_AESNI]: Add parallelized CBC loop.
(_gcry_aes_cbc_dec) [USE_AESNI]: Change IV storage register from xmm3 to xmm5.
--

This gives ~60% improvement in CBC decryption speed on sandy-bridge (x86-64). Overall speed improvement with this and previous CBC patches is over 400%.

Before:

$ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            670ms   770ms  2920ms   720ms  1900ms   660ms  2260ms  2250ms   480ms   500ms
AES192         860ms   930ms  3250ms   870ms  2210ms   830ms  2580ms  2580ms   570ms   570ms
AES256        1020ms  1080ms  3580ms  1030ms  2550ms   970ms  2880ms  2870ms   660ms   660ms

After:

$ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            670ms   770ms  2130ms   450ms  1880ms   670ms  2250ms  2280ms   490ms   490ms
AES192         880ms   920ms  2460ms   540ms  2210ms   830ms  2580ms  2570ms   580ms   570ms
AES256        1020ms  1070ms  2800ms   620ms  2560ms   970ms  2880ms  2880ms   660ms   650ms

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

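Why CBC decryption parallelizes while CBC encryption does not can be sketched in plain C. Each plaintext block is D(C[i]) XOR C[i-1], so the block decryptions are independent and can be issued in groups of four, as a routine like do_aesni_dec_vec4 presumably does with AES-NI. The toy cipher below is not AES and not secure, all names are hypothetical, and the batch loop runs sequentially here; it only marks where the independent work sits.

```c
#include <stddef.h>
#include <string.h>

#define BLK 16

/* Toy invertible "block cipher" for illustration only. */
static void
toy_enc (unsigned char *out, const unsigned char *in)
{
  int i;
  for (i = 0; i < BLK; i++)
    out[i] = (unsigned char) ((in[i] ^ 0xa5) + i);
}

static void
toy_dec (unsigned char *out, const unsigned char *in)
{
  int i;
  for (i = 0; i < BLK; i++)
    out[i] = (unsigned char) ((in[i] - i) ^ 0xa5);
}

/* CBC encryption is serial: each block needs the previous ciphertext. */
static void
cbc_enc_sketch (unsigned char *out, const unsigned char *in,
                size_t nblocks, unsigned char iv[BLK])
{
  unsigned char tmp[BLK];
  size_t n;
  int i;

  for (n = 0; n < nblocks; n++)
    {
      for (i = 0; i < BLK; i++)
        tmp[i] = in[n * BLK + i] ^ iv[i];
      toy_enc (out + n * BLK, tmp);
      memcpy (iv, out + n * BLK, BLK);
    }
}

/* CBC decryption in batches of up to four blocks.  out must not
   overlap in, because earlier ciphertext blocks are read again. */
static void
cbc_dec_sketch (unsigned char *out, const unsigned char *in,
                size_t nblocks, unsigned char iv[BLK])
{
  unsigned char dec[4][BLK];
  size_t n = 0, k, j;

  while (n < nblocks)
    {
      size_t batch = (nblocks - n < 4) ? nblocks - n : 4;

      /* Step 1: independent block decryptions (the parallel part). */
      for (k = 0; k < batch; k++)
        toy_dec (dec[k], in + (n + k) * BLK);

      /* Step 2: XOR each result with the preceding ciphertext block. */
      for (k = 0; k < batch; k++)
        {
          const unsigned char *prev = (k == 0) ? iv : in + (n + k - 1) * BLK;
          for (j = 0; j < BLK; j++)
            out[(n + k) * BLK + j] = dec[k][j] ^ prev[j];
        }

      /* The last ciphertext block of the batch becomes the next IV. */
      memcpy (iv, in + (n + batch - 1) * BLK, BLK);
      n += batch;
    }
}
```

This asymmetry is consistent with the tables above, where the CBC decryption column improves sharply while CBC encryption changes far less.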
2012-11-26 | Clear xmm5 after use in AES-NI CTR mode | Jussi Kivilinna | 1 | -4/+5

* cipher/rijndael.c [USE_AESNI]: Rename aesni_cleanup_2_4 to aesni_cleanup_2_5.
[USE_AESNI] (aesni_cleanup_2_5): Clear xmm5 register.
(_gcry_aes_ctr_enc, _gcry_aes_cbc_dec) [USE_AESNI]: Use aesni_cleanup_2_5 instead of aesni_cleanup_2_4.
--

xmm5 register is used by parallelized AES-NI CTR mode, so it should be cleaned up after use too.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

2012-11-26 | Optimize AES-NI CBC encryption | Jussi Kivilinna | 1 | -10/+37

* cipher/rijndael.c (_gcry_aes_cbc_enc) [USE_AESNI]: Add AES-NI specific loop and use SSE2 assembler for xoring and copying of blocks.
--

This gives ~35% improvement in 'tests/benchmark cipher aes' on Sandy-Bridge CPU (x86-64).

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

2012-11-26 | Improve parallelizability of CBC decryption for AES-NI | Jussi Kivilinna | 1 | -22/+75

* cipher/rijndael.c (_gcry_aes_cbc_dec) [USE_AESNI]: Add AES-NI specific CBC mode loop with temporary block and IV stored in free SSE registers.
--

Benchmark results on Intel Core i5-2450M (x86-64) show ~2.5x improvement:

Before:

$ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            690ms   780ms  2940ms  2110ms  1880ms   670ms  2250ms  2250ms   490ms   500ms
AES192         890ms   930ms  3260ms  2390ms  2220ms   820ms  2580ms  2590ms   560ms   570ms
AES256        1040ms  1070ms  3590ms  2640ms  2540ms   970ms  2880ms  2890ms   650ms   650ms

After:

$ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
               ECB/Stream          CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            670ms   770ms  2920ms   720ms  1900ms   660ms  2260ms  2250ms   480ms   500ms
AES192         860ms   930ms  3250ms   870ms  2210ms   830ms  2580ms  2580ms   570ms   570ms
AES256        1020ms  1080ms  3580ms  1030ms  2550ms   970ms  2880ms  2870ms   660ms   660ms

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

2012-11-21 | Fix for strict aliasing rules. | Werner Koch | 1 | -18/+18

* cipher/rijndael.c (do_setkey, prepare_decryption): Use u32_a_t for casting.
--

gcc 4.7.1 now shows warnings for more functions. Like:

  rijndael.c:412:19: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]

This fixes them using the may_alias attribute.

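A small sketch of the may_alias approach named above. The typedef imitates what a type like 'u32_a_t' could look like; its exact definition in rijndael.c is not shown in this log, so the one below is an assumption, and it relies on a GCC-compatible compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit word type that may alias objects of other types
   (GCC/Clang extension).  Definition assumed for illustration. */
typedef uint32_t __attribute__ ((__may_alias__)) u32_a_sketch_t;

/* XOR nwords 32-bit words of src into dst through byte pointers.
   Both buffers must be 4-byte aligned: the attribute addresses
   aliasing, not alignment. */
static void
xor_words_sketch (unsigned char *dst, const unsigned char *src, size_t nwords)
{
  u32_a_sketch_t *d = (u32_a_sketch_t *) dst;
  const u32_a_sketch_t *s = (const u32_a_sketch_t *) src;
  size_t i;

  for (i = 0; i < nwords; i++)
    d[i] ^= s[i];
}
```

Casting to a plain 'uint32_t *' instead would be the type-punned access that -Wstrict-aliasing warns about; the attribute tells the compiler that such accesses may overlap objects of any type.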
2012-11-21 | Add x86_64 support for AES-NI | Jussi Kivilinna | 1 | -103/+96

* cipher/rijndael.c [ENABLE_AESNI_SUPPORT]: Enable USE_AESNI on x86-64.
(do_setkey) [USE_AESNI_is_disabled_here]: Use %[key] and %[ksch] directly as registers instead of using temporary register %%esi.
[USE_AESNI] (do_aesni_enc_aligned, do_aesni_dec_aligned, do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Use %[key] directly as register instead of using temporary register %%esi.
[USE_AESNI] (do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Change %[key] from generic "g" type to register "r".
* src/hwfeatures.c (_gcry_detect_hw_features) [__x86_64__]: Do not clear AES-NI feature flag.
--

AES-NI assembler uses %%esi for key-material pointer register. However %[key] can be marked as "r" (register) and automatically be 64bit on x86-64 and be 32bit on i386.

So use %[key] for pointer register instead of %esi and that way make same AES-NI code work on both x86-64 and i386.

[v2]
 - Add GNU style changelog
 - Fixed do_setkey changes, use %[ksch] for output instead of %[key]
 - Changed [key] assembler arguments from "g" to "r" to force use of registers in all cases (when tested v1, "g" did work as intended and %[key] mapped to register on i386 and x86-64, but that might not happen always).

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

2012-11-21 | Use configure test for aligned attribute. | Werner Koch | 1 | -2/+2

* configure.ac (HAVE_GCC_ATTRIBUTE_ALIGNED): New test and ac_define.
* cipher/cipher-internal.h, cipher/rijndael.c, random/rndhw.c: Use new macro instead of a fixed test for __GNUC__.
--

We assume that compilers that grok "__attribute__ ((aligned (16)))" implement that in the same way as gcc does. In case it turns out that this is not the case we will need to do two more things: Detect such different behaviour and come up with a construct that allows the use of that other style of alignment forcing.

2012-11-21 | Fix segv with AES-NI on some platforms. | Werner Koch | 1 | -1/+1

* cipher/rijndael.c (RIJNDAEL_context): Align on 16 bytes.
--

The trigger for this problem is the allocation of the context in the selftest functions. The other code paths use a 16 byte alignment anyway by means of the allocation of the context in cipher.c

Thanks to Gentoo hacker Joakim Tjernlund for figuring out the reason of this problem.

GnuPG-bug-id: 1452

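One common way to obtain a 16-byte aligned context on the heap, regardless of what the compiler does for stack variables, is to over-allocate and round the pointer up. The helper below is a hypothetical sketch of that technique and is not taken from cipher.c or the selftest code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate SIZE zeroed bytes aligned to 16.  The raw allocation is
   returned in *r_mem; free() that pointer, not the aligned one.
   Returns NULL if the allocation fails. */
static void *
alloc_aligned16_sketch (size_t size, void **r_mem)
{
  unsigned char *mem = calloc (1, size + 15);

  if (!mem)
    return NULL;
  *r_mem = mem;

  /* Round up to the next multiple of 16. */
  return (void *) (((uintptr_t) mem + 15) & ~(uintptr_t) 15);
}
```

A selftest would keep both pointers, use the aligned one as the cipher context and call free() on the raw one when done.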
2012-06-21 | Beautify last change. | Werner Koch | 1 | -3/+6

* cipher/rijndael.c: Replace C99 feature from last patch. Keep cpp lines short.
* random/rndhw.c: Keep cpp lines short.
* src/hwfeatures.c (_gcry_detect_hw_features): Make cpp def chain better readable.