tcg: allocate TB structs before the corresponding translated code

Allocating an arbitrarily-sized array of TBs results in either
(a) a lot of memory wasted or (b) unnecessary flushes of the code
cache when we run out of TB structs in the array.

An obvious solution would be to just malloc a TB struct when needed,
and keep the TB array as an array of pointers (recall that tb_find_pc()
needs the TB array to run in O(log n)).

Perhaps a better solution, which is implemented in this patch, is to
allocate TBs right before the translated code they describe. This
results in some memory waste due to the padding needed to keep code and
TBs in separate cache lines. For instance, I measured 4.7% of padding in
the used portion of code_gen_buffer when booting aarch64 Linux on a
host with 64-byte cache lines. However, it can allow for optimizations
in some host architectures, since TCG backends could safely assume that
the TB and the corresponding translated code are very close to each
other in memory. See this message by rth for a detailed explanation:

https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
 Subject: Re: GSoC 2017 Proposal: TCG performance enhancements
 Message-ID: <1e67644b-4b30-887e-d329-1848e94c9484@twiddle.net>

Suggested-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1496790745-314-3-git-send-email-cota@braap.org>
[rth: Simplify the arithmetic in tcg_tb_alloc]
Signed-off-by: Richard Henderson <rth@twiddle.net>
author: Emilio G. Cota <cota@braap.org> 2017-06-06 19:12:25 -0400
committer: Richard Henderson <rth@twiddle.net> 2017-06-19 11:10:59 -0700
commit: 6e3b2bfd6af488a896f7936e99ef160f8f37e6f2 (patch)
tree: 12a0e88db3595401c00c3532d811b48f364e37c1 /tcg
parent: b255b2c8a5484742606e8760870ba3e14d0c9605 (diff)
download: qemu-6e3b2bfd6af488a896f7936e99ef160f8f37e6f2.tar.gz
2 files changed, 21 insertions, 1 deletions
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 564292f54d..35598296c5 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -383,6 +383,26 @@ void tcg_context_init(TCGContext *s)
     }
 }
 
+/*
+ * Allocate TBs right before their corresponding translated code, making
+ * sure that TBs and code are on different cache lines.
+ */
+TranslationBlock *tcg_tb_alloc(TCGContext *s)
+{
+    uintptr_t align = qemu_icache_linesize;
+    TranslationBlock *tb;
+    void *next;
+
+    tb = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, align);
+    next = (void *)ROUND_UP((uintptr_t)(tb + 1), align);
+
+    if (unlikely(next > s->code_gen_highwater)) {
+        return NULL;
+    }
+    s->code_gen_ptr = next;
+    return tb;
+}
+
 void tcg_prologue_init(TCGContext *s)
 {
     size_t prologue_size, total_size;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 5ec48d1787..9e37722799 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -697,7 +697,6 @@ struct TCGContext {
        here, because there's too much arithmetic throughout that relies
        on addition and subtraction working on bytes.  Rely on the GCC
        extension that allows arithmetic on void*.  */
-    int code_gen_max_blocks;
     void *code_gen_prologue;
     void *code_gen_epilogue;
     void *code_gen_buffer;
@@ -756,6 +755,7 @@ static inline bool tcg_op_buf_full(void)
 /* tb_lock must be held for tcg_malloc_internal. */
 void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
+TranslationBlock *tcg_tb_alloc(TCGContext *s);
 
 void tb_lock(void);
 void tb_unlock(void);
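
The allocation scheme in tcg_tb_alloc() can be tried outside of QEMU with a small standalone sketch. Everything below is illustrative and not part of the patch: DemoTB, DemoCtx, demo_ctx_init() and demo_tb_alloc() are hypothetical stand-ins for TranslationBlock, TCGContext and tcg_tb_alloc(), the cache line size is passed in as a parameter instead of being read from qemu_icache_linesize, and the high-water mark is simply assumed to be the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for TranslationBlock; the real struct is larger. */
typedef struct DemoTB {
    uint64_t pc;
    uint32_t size;
} DemoTB;

/* Hypothetical stand-in for the two TCGContext fields the allocator uses. */
typedef struct DemoCtx {
    uintptr_t ptr;        /* next free byte, plays the role of code_gen_ptr */
    uintptr_t highwater;  /* limit, plays the role of code_gen_highwater */
} DemoCtx;

/* Round n up to a multiple of align; align must be a power of two. */
static uintptr_t demo_round_up(uintptr_t n, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Assumption for this sketch: the whole buffer may be used. */
static void demo_ctx_init(DemoCtx *s, void *buf, size_t len)
{
    s->ptr = (uintptr_t)buf;
    s->highwater = (uintptr_t)buf + len;
}

/*
 * Place a DemoTB at the next cache-line boundary and move the allocation
 * pointer to the first cache-line boundary after it. Translated code
 * would then be emitted starting at s->ptr, so the struct and its code
 * never share a cache line.
 */
static DemoTB *demo_tb_alloc(DemoCtx *s, uintptr_t line)
{
    uintptr_t tb = demo_round_up(s->ptr, line);
    uintptr_t next = demo_round_up(tb + sizeof(DemoTB), line);

    if (next > s->highwater) {
        return NULL;  /* buffer exhausted; QEMU would flush the code cache */
    }
    s->ptr = next;
    return (DemoTB *)tb;
}
```

With a 64-byte line and this small DemoTB, each allocation consumes exactly one cache line for the struct, plus whatever padding was needed to reach the boundary. That padding is the kind of overhead the commit message measures.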