peter/qemu - QEMU hacking for Peter

Age	Commit message (Collapse)	Author	Files	Lines
2017-03-14	qemu-timer: do not include sysemu/cpus.h from util/qemu-timer.h	Paolo Bonzini	1	-0/+1
	This dependency is the wrong way, and we will need util/qemu-timer.h from sysemu/cpus.h in the next patch. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-03-09	translate-all: exit cpu_restore_state early if translating	Alex Bennée	1	-0/+13
	The translation code uses cpu_ld*_code which can trigger a tlb_fill which if it fails will erroneously attempts a fault resolution. This never works during translation as the TB being generated hasn't been added yet. The target should have checked retaddr before calling cpu_restore_state but for those that have yet to be fixed we do it here to avoid a recursive tb_lock() under MTTCG's new locking regime. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net>
2017-03-03	Merge branch 'icount-update' into HEAD	Paolo Bonzini	1	-1/+1
	Merge the original development branch due to breakage caused by the MTTCG merge. Conflicts: cpu-exec.c translate-common.c Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-24	tcg: enable tb_lock() for SoftMMU	Alex Bennée	1	-14/+1
	tb_lock() has long been used for linux-user mode to protect code generation. By enabling it now we prepare for MTTCG and ensure all code generation is serialised by this lock. The other major structure that needs protecting is the l1_map and its PageDesc structures. For the SoftMMU case we also use tb_lock() to protect these structures instead of linux-user mmap_lock() which as the name suggests serialises updates to the structure as a result of guest mmap operations. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net>
2017-02-24	tcg: drop global lock during TCG code execution	Jan Kiszka	1	-2/+7
	This finally allows TCG to benefit from the iothread introduction: Drop the global mutex while running pure TCG CPU code. Reacquire the lock when entering MMIO or PIO emulation, or when leaving the TCG loop. We have to revert a few optimization for the current TCG threading model, namely kicking the TCG thread in qemu_mutex_lock_iothread and not kicking it in qemu_cpu_kick. We also need to disable RAM block reordering until we have a more efficient locking mechanism at hand. Still, a Linux x86 UP guest and my Musicpal ARM model boot fine here. These numbers demonstrate where we gain something: 20338 jan 20 0 331m 75m 6904 R 99 0.9 0:50.95 qemu-system-arm 20337 jan 20 0 331m 75m 6904 S 20 0.9 0:26.50 qemu-system-arm The guest CPU was fully loaded, but the iothread could still run mostly independent on a second core. Without the patch we don't get beyond 32206 jan 20 0 330m 73m 7036 R 82 0.9 1:06.00 qemu-system-arm 32204 jan 20 0 330m 73m 7036 S 21 0.9 0:17.03 qemu-system-arm We don't benefit significantly, though, when the guest is not fully loading a host CPU. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Message-Id: <1439220437-23957-10-git-send-email-fred.konrad@greensocs.com> [FK: Rebase, fix qemu_devices_reset deadlock, rm address_space_* mutex] Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com> [EGC: fixed iothread lock for cpu-exec IRQ handling] Signed-off-by: Emilio G. Cota <cota@braap.org> [AJB: -smp single-threaded fix, clean commit msg, BQL fixes] Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com> [PM: target-arm changes] Acked-by: Peter Maydell <peter.maydell@linaro.org>
2017-02-24	mttcg: translate-all: Enable locking debug in a debug build	Pranith Kumar	1	-36/+16
	Enable tcg lock debug asserts in a debug build by default instead of relying on DEBUG_LOCKING. None of the other DEBUG_* macros have asserts, so this patch removes DEBUG_LOCKING and enable these asserts in a debug build. CC: Richard Henderson <rth@twiddle.net> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> [AJB: tweak ifdefs so can be early in series] Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net>
2017-02-22	cpu-exec: unify icount_decr and tcg_exit_req	Paolo Bonzini	1	-1/+1
	The icount interrupt flag and tcg_exit_req serve almost the same purpose, let's make them completely the same. The former TB_EXIT_REQUESTED and TB_EXIT_ICOUNT_EXPIRED cases are unified, since we can distinguish them from the value of the interrupt flag. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-31	trace: switch to modular code generation for sub-directories	Daniel P. Berrange	1	-1/+1
	Introduce rules in the top level Makefile that are able to generate trace.[ch] files in every subdirectory which has a trace-events file. The top level directory is handled specially, so instead of creating trace.h, it creates trace-root.h. This allows sub-directories to include the top level trace-root.h file, without ambiguity wrt to the trace.g file in the current sub-dir. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Message-id: 20170125161417.31949-7-berrange@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-27	replay: exception replay fix	Pavel Dovgalyuk	1	-0/+2
	This patch fixes replaying the exception when TB cache is full. It breaks cpu loop execution through setting exception_index to process such queued work as TB flush. v8: moved setting of exeption_index to tb_gen_code Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> Message-Id: <20170126123418.5412.33815.stgit@PASHA-ISP> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-08	translate-all: Avoid -Werror=switch-bool	Richard Henderson	1	-1/+1
	gcc 5.3.0 diagnoses translate-all.c: In function ‘alloc_code_gen_buffer’: translate-all.c:756:17: error: switch condition has boolean value switch (buf2 != MAP_FAILED) { ^ Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-11-01	log: Add locking to large logging blocks	Richard Henderson	1	-0/+2
	Reuse the existing locking provided by stdio to keep in_asm, cpu, op, op_opt, op_ind, and out_asm as contiguous blocks. While it isn't possible to interleave e.g. in_asm or op_opt logs because of the TB lock protecting all code generation, it is possible to interleave cpu logs, or to interleave a cpu dump with an out_asm dump. For mingw32, we appear to have no viable solution for this. The locking functions are not properly exported from the system runtime library. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-10-31	tcg: move locking for tb_invalidate_phys_page_range up	Alex Bennée	1	-8/+31
	In the linux-user case all things that involve ''l1_map' and PageDesc tweaks are protected by the memory lock (mmpa_lock). For SoftMMU mode we previously relied on single threaded behaviour, with MTTCG we now use the tb_lock(). As a result we need to do a little re-factoring and push the taking of this lock up the call tree. This requires a slightly different entry for the SoftMMU and user-mode cases from tb_invalidate_phys_range. This also means user-mode breakpoint insertion needs to take two locks but it hadn't taken any previously so this is an improvement. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Message-Id: <20161027151030.20863-20-alex.bennee@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-31	*_run_on_cpu: introduce run_on_cpu_data type	Paolo Bonzini	1	-7/+6
	This changes the _run_on_cpu APIs (and helpers) to pass data in a run_on_cpu_data type instead of a plain void . This is because we sometimes want to pass a target address (target_ulong) and this fails on 32 bit hosts emulating 64 bit guests. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Message-Id: <20161027151030.20863-24-alex.bennee@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-31	tcg: protect translation related stuff with tb_lock.	KONRAD Frederic	1	-6/+28
	This protects all translation related work with tb_lock() too ensure thread safety. This effectively serialises all code generation. In addition to the code generation we also take the lock for TB invalidation. This has a knock on effect of meaning tb_lock() is held for modification of the SoftMMU TLB by non-self threads which will be used in later patches. Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com> Message-Id: <1439220437-23957-8-git-send-email-fred.konrad@greensocs.com> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [AJB: moved into tree, clean-up history] Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Message-Id: <20161027151030.20863-10-alex.bennee@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-31	translate-all: Add assert_(memory\|tb)_lock annotations	Alex Bennée	1	-1/+21
	This adds calls to the assert_(memory\|tb)_lock for all public APIs which are documented as needing them held for linux-user mode. The asserts are NOPs for system-mode although these will be converted when MTTCG is enabled. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Message-Id: <20161027151030.20863-9-alex.bennee@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-31	tcg: comment on which functions have to be called with tb_lock held	Paolo Bonzini	1	-5/+23
	softmmu requires more functions to be thread-safe, because translation blocks can be invalidated from e.g. notdirty callbacks. Probably the same holds for user-mode emulation, it's just that no one has ever tried to produce a coherent locking there. This patch will guide the introduction of more tb_lock and tb_unlock calls for system emulation. Note that after this patch some (most) of the mentioned functions are still called outside tb_lock/tb_unlock. The next one will rectify this. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Message-Id: <20161027151030.20863-7-alex.bennee@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-31	translate-all: add DEBUG_LOCKING asserts	Alex Bennée	1	-0/+41
	This adds asserts to check the locking on the various translation engines structures. There are two sets of structures that are protected by locks. The first the l1map and PageDesc structures used to track which translation blocks are associated with which physical addresses. In user-mode this is covered by the mmap_lock. The second case are TB context related structures which are protected by tb_lock which is also user-mode only. Currently the asserts do nothing in SoftMMU mode but this will change for MTTCG. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Message-Id: <20161027151030.20863-4-alex.bennee@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-31	translate_all: DEBUG_FLUSH -> DEBUG_TB_FLUSH	Alex Bennée	1	-4/+4
	Make the debug define consistent with the others. The flush operation is all about invalidating TranslationBlocks on flush events. Also fix up the commenting on the other DEBUG for the benefit of checkpatch. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Message-Id: <20161027151030.20863-3-alex.bennee@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-26	tcg: Add EXCP_ATOMIC	Richard Henderson	1	-0/+1
	When we cannot emulate an atomic operation within a parallel context, this exception allows us to stop the world and try again in a serial context. Reviewed-by: Emilio G. Cota <cota@braap.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-10-24	translate-all.c: Compute L1 page table properties at runtime	Vijaya Kumar K	1	-25/+46
	Remove L1 page mapping table properties computing statically using macros which is dependent on TARGET_PAGE_BITS. Drop macros V_L1_SIZE, V_L1_SHIFT, V_L1_BITS macros and replace with variables which are computed at early stage of VM boot. Removing dependency can help to make TARGET_PAGE_BITS dynamic. Signed-off-by: Vijaya Kumar K <vijayak@cavium.com> Message-id: 1465808915-4887-4-git-send-email-vijayak@caviumnetworks.com [PMM: assert(v_l1_shift % V_L2_BITS == 0) cache v_l2_levels initialize from page_init() rather than vl.c minor code style fixes put v_l1_size into a local where used as a loop limit] Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2016-09-27	tcg: Make tb_flush() thread safe	Sergey Fedorov	1	-10/+28
	Use async_safe_run_on_cpu() to make tb_flush() thread safe. This is possible now that code generation does not happen in the middle of execution. It can happen that multiple threads schedule a safe work to flush the translation buffer. To keep statistics and debugging output sane, always check if the translation buffer has already been flushed. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> [AJB: minor re-base fixes] Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Message-Id: <1470158864-17651-13-git-send-email-alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16	tcg: Merge GETPC and GETRA	Richard Henderson	1	-0/+2
	The return address argument to the softmmu template helpers was confused. In the legacy case, we wanted to indicate that there is no return address, and so passed in NULL. However, we then immediately subtracted GETPC_ADJ from NULL, resulting in a non-zero value, indicating the presence of an (invalid) return address. Push the GETPC_ADJ subtraction down to the only point it's required: immediately before use within cpu_restore_state_from_tb, after all NULL pointer checks have been completed. This makes GETPC and GETRA identical. Remove GETRA as the lesser used macro, replacing all uses with GETPC. Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-09-13	tcg: set up tb->page_addr before insertion	Alex Bennée	1	-4/+4
	This ensures that if we find the TB on the slow path that tb->page_addr is correctly set before being tested. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Message-Id: <20160715175852.30749-9-sergey.fedorov@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13	tcg: Prepare TB invalidation for lockless TB lookup	Paolo Bonzini	1	-0/+3
	When invalidating a translation block, set an invalid flag into the TranslationBlock structure first. It is also necessary to check whether the target TB is still valid after acquiring 'tb_lock' but before calling tb_add_jump() since TB lookup is to be performed out of 'tb_lock' in future. Note that we don't have to check 'last_tb'; an already invalidated TB will not be executed anyway and it is thus safe to patch it. Suggested-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13	tcg: Prepare safe access to tb_flushed out of tb_lock	Sergey Fedorov	1	-2/+2
	Ensure atomicity and ordering of CPU's 'tb_flushed' access for future translation block lookup out of 'tb_lock'. This field can only be touched from another thread by tb_flush() in user mode emulation. So the only access to be sequential atomic is: * a single write in tb_flush(); * reads/writes out of 'tb_lock'. In future, before enabling MTTCG in system mode, tb_flush() must be safe and this field becomes unnecessary. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Message-Id: <20160715175852.30749-5-sergey.fedorov@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13	tcg: Prepare safe tb_jmp_cache lookup out of tb_lock	Sergey Fedorov	1	-3/+7
	Ensure atomicity of CPU's 'tb_jmp_cache' access for future translation block lookup out of 'tb_lock'. Note that this patch does not make CPU's TLB invalidation safe if it is done from some other thread while the CPU is in its execution loop. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Message-Id: <20160715175852.30749-4-sergey.fedorov@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-30	translate: early exit in tb_flush if there is no tcg	Christian Borntraeger	1	-0/+3
	tb_flush does all kind of things, which are very tcg specific. As it is called from some places even for KVM (e.g. gdb server) it is better to detect these cases and do an early exit. This also fixes a crash in the gdb server that was triggered by commit 909eaac9bbc2 ("tb hash: track translated blocks with qht"). Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Reported-by: Richard Henderson <rth@twiddle.net> Reported-by: Brent Baccala <cosine@freesoft.org> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Message-id: 1472148686-39841-1-git-send-email-borntraeger@de.ibm.com> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2016-08-02	qht: do not segfault when gathering stats from an uninitialized qht	Emilio G. Cota	1	-31/+39
	So far, QHT functions assume that the passed qht has previously been initialized--otherwise they segfault. This patch makes an exception for qht_statistics_init, with the goal of simplifying calling code. For instance, qht_statistics_init is called from the 'info jit' dump, and given that under KVM the TB qht is never initialized, we get a segfault. Thus, instead of complicating the 'info jit' code with additional checks, let's allow passing an uninitialized qht to qht_statistics_init. While at it, add a test for this to test-qht. Before the patch (for $ qemu -enable-kvm [...]): (qemu) info jit [...] direct jump count 0 (0%) (2 jumps=0 0%) Program received signal SIGSEGV, Segmentation fault. After the patch the "TB hash buckets", "TB hash occupancy" and "TB hash avg chain" lines are omitted. (qemu) info jit [...] direct jump count 0 (0%) (2 jumps=0 0%) TB hash buckets 0/0 (-nan% head buckets used) TB hash occupancy nan% avg chain occ. Histogram: (null) TB hash avg chain nan buckets. Histogram: (null) [...] Reported by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-Id: <1469205390-14369-1-git-send-email-cota@braap.org> [Extract printing statistics to an entirely separate function. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-08	translate-all: Fix user-mode self-modifying code in 2 page long TB	Stanislav Shmarov	1	-5/+5
	In user-mode emulation Translation Block can consist of 2 guest pages. In that case QEMU also mprotects 2 host pages that are dedicated for guest memory, containing instructions. QEMU detects self-modifying code with SEGFAULT signal processing. In case if instruction in 1st page is modifying memory of 2nd page (or vice versa) QEMU will mark 2nd page with PAGE_WRITE, invalidate TB, generate new TB contatining 1 guest instruction and exit to CPU loop. QEMU won't call mprotect, and new TB will cause same SEGFAULT. Page will have both PAGE_WRITE_ORG and PAGE_WRITE flags, so QEMU will handle the signal as guest binary problem, and exit with guest SEGFAULT. Solution is to do following: In case if current TB was invalidated continue to invalidate TBs from remaining guest pages and mark pages as PAGE_WRITE. After that disable host page protection with mprotect. If current tb was invalidated longjmp to main loop. That is more efficient, since we won't get SEGFAULT when executing new TB. Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org> Signed-off-by: Stanislav Shmarov <snarpix@gmail.com> Message-Id: <1467880392-1043630-1-git-send-email-snarpix@gmail.com> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-06-20	exec: [tcg] Track which vCPU is performing translation and execution	Lluís Vilanova	1	-0/+2
	Information is tracked inside the TCGContext structure, and later used by tracing events with the 'tcg' and 'vcpu' properties. The 'cpu' field is used to check tracing of translation-time events ("_trans"). The 'tcg_env' field is used to pass it to execution-time events ("_exec"). Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Message-id: 146549350162.18437.3033661139638458143.stgit@fimbulvetr.bsc.es Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-06-16	os-posix: include sys/mman.h	Paolo Bonzini	1	-2/+0
	qemu/osdep.h checks whether MAP_ANONYMOUS is defined, but this check is bogus without a previous inclusion of sys/mman.h. Include it in sysemu/os-posix.h and remove it from everywhere else. Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-11	translate-all: add tb hash bucket info to 'info jit' dump	Emilio G. Cota	1	-0/+36
	Examples: - Good hashing, i.e. tb_hash_func5(phys_pc, pc, flags): TB count 715135/2684354 [...] TB hash buckets 388775/524288 (74.15% head buckets used) TB hash occupancy 33.04% avg chain occ. Histogram: [0,10)%\|▆ █ ▅▁▃▁▁\|[90,100]% TB hash avg chain 1.017 buckets. Histogram: 1\|█▁▁\|3 - Not-so-good hashing, i.e. tb_hash_func5(phys_pc, pc, 0): TB count 712636/2684354 [...] TB hash buckets 344924/524288 (65.79% head buckets used) TB hash occupancy 31.64% avg chain occ. Histogram: [0,10)%\|█ ▆ ▅▁▃▁▂\|[90,100]% TB hash avg chain 1.047 buckets. Histogram: 1\|█▁▁▁\|4 - Bad hashing, i.e. tb_hash_func5(phys_pc, 0, 0): TB count 702818/2684354 [...] TB hash buckets 112741/524288 (21.50% head buckets used) TB hash occupancy 10.15% avg chain occ. Histogram: [0,10)%\|█ ▁ ▁▁▁▁▁\|[90,100]% TB hash avg chain 2.107 buckets. Histogram: [1.0,10.2)\|█▁▁▁▁▁▁▁▁▁\|[83.8,93.0] - Good hashing, but no auto-resize: TB count 715634/2684354 TB hash buckets 8192/8192 (100.00% head buckets used) TB hash occupancy 98.30% avg chain occ. Histogram: [95.3,95.8)%\|▁▁▃▄▃▄▁▇▁█\|[99.5,100.0]% TB hash avg chain 22.070 buckets. Histogram: [15.0,16.7)\|▁▂▅▄█▅▁▁▁▁\|[30.3,32.0] Acked-by: Sergey Fedorov <sergey.fedorov@linaro.org> Suggested-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-Id: <1465412133-3029-16-git-send-email-cota@braap.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-06-11	tb hash: track translated blocks with qht	Emilio G. Cota	1	-47/+38
	Having a fixed-size hash table for keeping track of all translation blocks is suboptimal: some workloads are just too big or too small to get maximum performance from the hash table. The MRU promotion policy helps improve performance when the hash table is a little undersized, but it cannot make up for severely undersized hash tables. Furthermore, frequent MRU promotions result in writes that are a scalability bottleneck. For scalability, lookups should only perform reads, not writes. This is not a big deal for now, but it will become one once MTTCG matures. The appended fixes these issues by using qht as the implementation of the TB hash table. This solution is superior to other alternatives considered, namely: - master: implementation in QEMU before this patchset - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU. - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU. MRU is implemented here by adding an intermediate struct that contains the u32 hash and a pointer to the TB; this allows us, on an MRU promotion, to copy said struct (that is not at the head), and put this new copy at the head. After a grace period, the original non-head struct can be eliminated, and after another grace period, freed. - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize + no MRU for lookups; MRU for inserts. The appended solution is the following: - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize + no MRU for lookups; MRU for inserts. The plots below compare the considered solutions. The Y axis shows the boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis sweeps the number of buckets (or initial number of buckets for qht-autoresize). The plots in PNG format (and with errorbars) can be seen here: http://imgur.com/a/Awgnq Each test runs 5 times, and the entire QEMU process is pinned to a single core for repeatability of results. Host: Intel Xeon E5-2690 28 ++------------+-------------+-------------+-------------+------------++ A*** + + + master A*** + 27 ++ * xxhash ##B###++ \| A****A** xxhash-rcu $$C$$$ \| 26 C$$ A**A**** qht-fixed-nomru%%D%%%++ D%%$$ A***A***Aqht-dyn-mru AE*A 25 ++ %%$$ qht-dyn-nomru &&F&&&++ B#####% \| 24 ++ #C$$$$$ ++ \| B### $ \| \| ## C$$$$$$ \| 23 ++ # C$$$$$$ ++ \| B###### C$$$$$$ %%%D 22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C \| D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E 21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B + E@@@ F&&& + E@ + F&&& + + 20 ++------------+-------------+-------------+-------------+------------++ 14 16 18 20 22 24 log2 number of buckets Host: Intel i7-4790K 14.5 ++------------+------------+-------------+------------+------------++ A + + + master A* + 14 ++ xxhash ##B###++ 13.5 ++ xxhash-rcu $$C$$$++ \| qht-fixed-nomru %%D%%% \| 13 ++ A**** qht-dyn-mru @@E@@@++ \| A*A**A** qht-dyn-nomru &&F&&& \| 12.5 C$$ A**A**A*A** A 12 ++ $$ A ++ D%%% $$ \| 11.5 ++ %% ++ B### %C$$$$$$ \| 11 ++ ## D%%%%% C$$$$$ ++ \| # % C$$$$$$ \| 10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C 10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B + F&& D%%%%%%B######B######B#####B###@@@D%%% + 9.5 ++------------+------------+-------------+------------+------------++ 14 16 18 20 22 24 log2 number of buckets Note that the original point before this patch series is X=15 for "master"; the little sensitivity to the increased number of buckets is due to the poor hashing function in master. xxhash-rcu has significant overhead due to the constant churn of allocating and deallocating intermediate structs for implementing MRU. An alternative would be do consider failed lookups as "maybe not there", and then acquire the external lock (tb_lock in this case) to really confirm that there was indeed a failed lookup. This, however, would not be enough to implement dynamic resizing--this is more complex: see "Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming" by Triplett, McKenney and Walpole. This solution was discarded due to the very coarse RCU read critical sections that we have in MTTCG; resizing requires waiting for readers after every pointer update, and resizes require many pointer updates, so this would quickly become prohibitive. qht-fixed-nomru shows that MRU promotion is advisable for undersized hash tables. However, qht-dyn-mru shows that MRU promotion is not important if the hash table is properly sized: there is virtually no difference in performance between qht-dyn-nomru and qht-dyn-mru. Before this patch, we're at X=15 on "xxhash"; after this patch, we're at X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we can achieve with optimum sizing of the hash table, while keeping the hash table scalable for readers. The improvement we get before and after this patch for booting debian jessie with arm-softmmu is: - Intel Xeon E5-2690: 10.5% less time - Intel i7-4790K: 5.2% less time We could get this same improvement _for this particular workload_ by statically increasing the size of the hash table. But this would hurt workloads that do not need a large hash table. The dynamic (upward) resizing allows us to start small and enlarge the hash table as needed. A quick note on downsizing: the table is resized back to 215 buckets on every tb_flush; this makes sense because it is not guaranteed that the table will reach the same number of TBs later on (e.g. most bootup code is thrown away after boot); it makes sense to grow the hash table as more code blocks are translated. This also avoids the complication of having to build downsizing hysteresis logic into qht. Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-06-11	tb hash: hash phys_pc, pc, and flags with xxhash	Emilio G. Cota	1	-5/+5
	For some workloads such as arm bootup, tb_phys_hash is performance-critical. The is due to the high frequency of accesses to the hash table, originated by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's. More info: https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html To dig further into this I modified an arm image booting debian jessie to immediately shut down after boot. Analysis revealed that quite a bit of time is unnecessarily spent in tb_phys_hash: the cause is poor hashing that results in very uneven loading of chains in the hash table's buckets; the longest observed chain had ~550 elements. The appended addresses this with two changes: 1) Use xxhash as the hash table's hash function. xxhash is a fast, high-quality hashing function. 2) Feed the hashing function with not just tb_phys, but also pc and flags. This improves performance over using just tb_phys for hashing, since that resulted in some hash buckets having many TB's, while others getting very few; with these changes, the longest observed chain on a single hash bucket is brought down from ~550 to ~40. Tests show that the other element checked for in tb_find_physical, cs_base, is always a match when tb_phys+pc+flags are a match, so hashing cs_base is wasteful. It could be that this is an ARM-only thing, though. UPDATE: On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote: > The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB > consisting of only a delay slot). > It may well still turn out to be reasonable to ignore cs_base for hashing. BTW, after this change the hash table should not be called "tb_hash_phys" anymore; this is addressed later in this series. This change gives consistent bootup time improvements. I tested two host machines: - Intel Xeon E5-2690: 11.6% less time - Intel i7-4790K: 19.2% less time Increasing the number of hash buckets yields further improvements. However, using a larger, fixed number of buckets can degrade performance for other workloads that do not translate as many blocks (600K+ for debian-jessie arm bootup). This is dealt with later in this series. Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-06-09	cpu-exec: Rename cpu_resume_from_signal() to cpu_loop_exit_noexc()	Peter Maydell	1	-2/+2
	The function cpu_resume_from_signal() is now always called with a NULL puc argument, and is rather misnamed since it is never called from a signal handler. It is essentially forcing an exit to the top level cpu loop but without raising any exception, so rename it to cpu_loop_exit_noexc() and drop the useless unused argument. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org> Acked-by: Eduardo Habkost <ehabkost@redhat.com> Acked-by: Riku Voipio <riku.voipio@linaro.org> Message-id: 1463494687-25947-4-git-send-email-peter.maydell@linaro.org
2016-06-09	user-exec: Push resume-from-signal code out to handle_cpu_signal()	Peter Maydell	1	-4/+8
	Since the only caller of page_unprotect() which might cause it to need to call cpu_resume_from_signal() is handle_cpu_signal() in the user-mode code, push the longjump handling out to that function. Since this is the only caller of cpu_resume_from_signal() which passes a non-NULL puc argument, split the non-NULL handling into a new cpu_exit_tb_from_sighandler() function. This allows us to merge the softmmu and usermode implementations of the cpu_resume_from_signal() function, which are now identical. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org> Acked-by: Eduardo Habkost <ehabkost@redhat.com> Acked-by: Riku Voipio <riku.voipio@linaro.org> Message-id: 1463494687-25947-3-git-send-email-peter.maydell@linaro.org
2016-06-09	translate-all.c: Don't pass puc, locked to tb_invalidate_phys_page()	Peter Maydell	1	-11/+15
	The user-mode-only function tb_invalidate_phys_page() is only called from two places: * page_unprotect(), which passes in a non-zero pc, a puc pointer and the value 'true' for the locked argument * page_set_flags(), which passes in a zero pc, a NULL puc pointer and a 'false' locked argument If the pc is non-zero then we may call cpu_resume_from_signal(), which does a longjmp out of the calling code (and out of the signal handler); this is to cover the case of a target CPU with "precise self-modifying code" (currently only x86) executing a store instruction which modifies code in the same TB as the store itself. Rather than doing the longjump directly here, return a flag to the caller which indicates whether the current TB was modified, and move the longjump to page_unprotect. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org> Acked-by: Eduardo Habkost <ehabkost@redhat.com> Acked-by: Riku Voipio <riku.voipio@linaro.org> Message-id: 1463494687-25947-2-git-send-email-peter.maydell@linaro.org
2016-05-23	memory: remove unnecessary masking of MemoryRegion ram_addr	Paolo Bonzini	1	-2/+1
	mr->ram_block->offset is already aligned to both host and target size (see qemu_ram_alloc_internal). Remove further masking as it is unnecessary. Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-19	cpu: move exec-all.h inclusion out of cpu.h	Paolo Bonzini	1	-0/+1
	exec-all.h contains TCG-specific definitions. It is not needed outside TCG-specific files such as translate.c, exec.c or *helper.c. One generic function had snuck into include/exec/exec-all.h; move it to include/qom/cpu.h. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-12	tcg: Remove needless CPUState::current_tb	Sergey Fedorov	1	-18/+2
	This field was used for telling cpu_interrupt() to unlink a chain of TBs being executed when it worked that way. Now, cpu_interrupt() don't do this anymore. So we don't need this field anymore. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Message-Id: <1462273462-14036-1-git-send-email-sergey.fedorov@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: Rework tb_invalidated_flag	Sergey Fedorov	1	-4/+1
	'tb_invalidated_flag' was meant to catch two events: * some TB has been invalidated by tb_phys_invalidate(); * the whole translation buffer has been flushed by tb_flush(). Then it was checked: * in cpu_exec() to ensure that the last executed TB can be safely linked to directly call the next one; * in cpu_exec_nocache() to decide if the original TB should be provided for further possible invalidation along with the temporarily generated TB. It is always safe to patch an invalidated TB since it is not going to be used anyway. It is also safe to call tb_phys_invalidate() for an already invalidated TB. Thus, setting this flag in tb_phys_invalidate() is simply unnecessary. Moreover, it can prevent from pretty proper linking of TBs, if any arbitrary TB has been invalidated. So just don't touch it in tb_phys_invalidate(). If this flag is only used to catch whether tb_flush() has been called then rename it to 'tb_flushed'. Declare it as 'bool' and stick to using only 'true' and 'false' to set its value. Also, instead of setting it in tb_gen_code(), just after tb_flush() has been called, do it right inside of tb_flush(). In cpu_exec(), this flag is used to track if tb_flush() has been called and have made 'next_tb' (a reference to the last executed TB) invalid for linking it to directly call the next TB. tb_flush() can be called during the CPU execution loop from tb_gen_code(), during TB execution or by another thread while 'tb_lock' is released. Catch for translation buffer flush reliably by resetting this flag once before first TB lookup and each time we find it set before trying to add a direct jump. Don't touch in in tb_find_physical(). Each vCPU has its own execution loop in multithreaded mode and thus should have its own copy of the flag to be able to reset it with its own 'next_tb' and don't affect any other vCPU execution thread. So make this flag per-vCPU and move it to CPUState. In cpu_exec_nocache(), we only need to check if tb_flush() has been called from tb_gen_code() called by cpu_exec_nocache() itself. To do this reliably, preserve the old value of the flag, reset it before calling tb_gen_code(), check afterwards, and combine the saved value back to the flag. This patch is based on the patch "tcg: move tb_invalidated_flag to CPUState" from Paolo Bonzini <pbonzini@redhat.com>. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: code_bitmap and code_write_count are not used by user-mode emulation	Paolo Bonzini	1	-3/+8
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [Sergey Fedorov: eliminate the field entirely in user-mode] Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> [rth: merged followup fixup] Message-Id: <1462982777-4513-1-git-send-email-sergey.fedorov@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: Clean up tb_jmp_unlink()	Sergey Fedorov	1	-12/+9
	Unify the code of this function with tb_jmp_remove_from_list(). Making these functions similar improves their readability. Also this could be a step towards making this function thread-safe. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: Extract removing of jumps to TB from tb_phys_invalidate()	Sergey Fedorov	1	-18/+26
	Move the code for removing jumps to a TB out of tb_phys_invalidate() to a separate static inline function tb_jmp_unlink(). This simplifies tb_phys_invalidate() and improves code structure. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: Rename tb_jmp_remove() to tb_remove_from_jmp_list()	Sergey Fedorov	1	-3/+4
	tb_jmp_remove() was only used to remove the TB from a list of all TBs jumping to the same TB which is n-th jump destination of the given TB. Put a comment briefly describing the function behavior and rename it to better reflect its purpose. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: Init TB's direct jumps before making it visible	Sergey Fedorov	1	-13/+19
	Initialize TB's direct jump list data fields and reset the jumps before tb_link_page() puts it into the physical hash table and the physical page list. So TB is completely initialized before it becomes visible. This is pure rearrangement of code to a more suitable place, though it could be a preparation for relaxing the locking scheme in future. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: Rearrange tb_link_page() to avoid forward declaration	Sergey Fedorov	1	-103/+101
	Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: Use uintptr_t type for jmp_list_{next\|first} fields of TB	Sergey Fedorov	1	-18/+20
	These fields do not contain pure pointers to a TranslationBlock structure. So uintptr_t is the most appropriate type for them. Also put some asserts to assure that the two least significant bits of the pointer are always zero before assigning it to jmp_list_first. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	tcg: Clean up direct block chaining data fields	Sergey Fedorov	1	-24/+27
	Briefly describe in a comment how direct block chaining is done. It should help in understanding of the following data fields. Rename some fields in TranslationBlock and TCGContext structures to better reflect their purpose (dropping excessive 'tb_' prefix in TranslationBlock but keeping it in TCGContext): tb_next_offset => jmp_reset_offset tb_jmp_offset => jmp_insn_offset tb_next => jmp_target_addr jmp_next => jmp_list_next jmp_first => jmp_list_first Avoid using a magic constant as an invalid offset which is used to indicate that there's no n-th jump generated. Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com> Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12	translate-all: Adjust 256mb testing for mips64	Richard Henderson	1	-2/+2
	Make sure we preserve the high 32-bits when masking for mips64. Signed-off-by: Richard Henderson <rth@twiddle.net>