In my previous article, I discussed types of simulators and levels of simulation. This article covers several topics, specifically full-system simulation, cycle-accurate microarchitectural simulation, and approaches to collecting traces and their further use.
Any device (for example, a network card) can be modeled as a standalone device to investigate its operation and to develop firmware or drivers. However, using such a device separately from the rest of the platform infrastructure makes little sense because of the additional effort required to prepare input data and convert the output. You can't simply run the corresponding device driver: it requires a CPU, memory, buses, and many other devices on the hardware side, as well as an operating system, a network stack, and applications on the software side. A separate packet generator and a server for receiving replies may also be required.
A full-system simulator provides an environment for running a full, unmodified software stack: everything from the BIOS and bootloader to the OS and its subsystems (for example, the network stack, drivers, and user-level applications). To do so, it models such computer devices as the CPU, memory, a disk drive, input-output devices, a display, and a network device.
As an example, a block diagram of the Intel x58 chipset is shown below. To develop a full-platform computer simulator of this chipset, most of the listed devices should be implemented, including those that are inside of the IOH (Input/Output Hub) and ICH (Input/Output Controller Hub) and not shown in detail on the block diagram.
Full-system simulators are most often implemented at the level of CPU instructions (the ISA, or Instruction Set Architecture; see my previous article). This allows a simulator to be created relatively quickly and inexpensively, avoiding the overly detailed models of every device that an implementation at the microarchitectural or logic-gate level would require.
The ISA level is also attractive because it remains more or less stable, unlike, for example, the API/ABI level, which changes more often. In addition, implementation at the instruction level makes it possible to run unmodified binary software. In other words, you can dump a real hard disk, specify the dump as the disk image for the model in a full-system simulator, and voila—the OS and other programs boot and run in the simulator without any additional effort.
As I mentioned earlier, simulating the entire system with all its devices is a rather time-consuming process, so an implementation at a detailed level would be very slow. The instruction level, by contrast, allows the OS and programs to run at speeds sufficient for a comfortable and smooth user experience.
A simulator's performance is measured in IPS (instructions per second), or more practically in MIPS (millions of instructions per second): the number of target processor instructions the simulator executes per second. The simulation speed also depends on the performance of the host where the simulation runs. Therefore, the term "slowdown" is often used: the ratio of the real system's speed to the simulator's speed on that host.
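To make the two metrics concrete, here is a small sketch of how MIPS and slowdown relate. The function name and the sample numbers are illustrative, not taken from any particular tool:

```python
def simulator_metrics(instructions_executed, sim_seconds, host_mips):
    """Compute a simulator's speed in MIPS and its slowdown vs. the host.

    All names and numbers here are illustrative; real tools report
    these figures directly.
    """
    sim_mips = instructions_executed / sim_seconds / 1e6
    slowdown = host_mips / sim_mips
    return sim_mips, slowdown

# A simulator that retires 200 million target instructions in 4 seconds
# on a host capable of roughly 500 MIPS:
mips, slowdown = simulator_metrics(200_000_000, 4.0, 500)
print(mips)      # 50.0 simulated MIPS
print(slowdown)  # 10.0x slower than the real host
```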
The full-system simulators available on the market today are extremely fast, so fast that users may not even notice that they are working in a simulator. Virtualization support available in modern CPUs, as well as so-called binary translation algorithms, dramatically increases simulation performance. As a result, the simulation is only 5–10 times slower than the real system, and it often works at the same speed. However, many factors impact simulation speed. For example, if we want to simulate a system with several dozen CPUs, the overall speed immediately drops by several dozen times. Situations like this are addressed in Simics, the latest versions of which support multiprocessor host hardware and effectively parallelize simulated cores across the cores of real processors.
Microarchitectural simulation is usually 1,000–10,000 times slower than the real system, and implementations at the level of logic elements are even slower. That is why an FPGA is often used as an emulator at this level, which can significantly increase performance.
Despite their low execution speed, microarchitectural simulators are quite common: modeling the CPU's internal blocks is necessary to accurately simulate the execution time of each instruction.
You may wonder why one can't simply put the execution time of each instruction right into the model's code (or "hardcode" it, as programmers say). The reason is that such a simulator would be inaccurate: the execution time of the same instruction can differ from one execution to another.
The simplest example is a memory read instruction. If the requested memory cell (its value) is available in the cache, the execution time is minimal. If it is not (a so-called cache miss), the execution time of the instruction increases dramatically. Thus, a cache model is required for accurate simulation. However, the cache is not the only factor that affects timing. If the data are not in the cache, the processor does not stall; it continues executing subsequent instructions, choosing ones that do not depend on the result of the memory read. This is so-called out-of-order execution (OOO), and it minimizes processor idle time. Correctly calculating instruction execution times requires taking all these factors into account and modeling all these CPU blocks in detail.
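The cache effect described above can be sketched with a toy direct-mapped cache model, where the latency of the very same load instruction depends on whether its address hits in the cache. The sizes and latencies below are made-up illustrative numbers, not those of any real CPU:

```python
HIT_LATENCY = 4      # cycles for a cache hit (illustrative)
MISS_LATENCY = 200   # cycles for a miss going to DRAM (illustrative)
LINE_SIZE = 64       # bytes per cache line
NUM_LINES = 8        # a deliberately tiny cache

class DirectMappedCache:
    """Toy direct-mapped cache: one tag per line, no associativity."""
    def __init__(self):
        self.tags = [None] * NUM_LINES

    def load(self, address):
        line = address // LINE_SIZE
        index = line % NUM_LINES
        tag = line // NUM_LINES
        if self.tags[index] == tag:
            return HIT_LATENCY      # data already cached
        self.tags[index] = tag      # fill the line on a miss
        return MISS_LATENCY

cache = DirectMappedCache()
print(cache.load(0x1000))  # first access: miss, 200 cycles
print(cache.load(0x1000))  # same line again: hit, 4 cycles
```

Even this minimal model shows why a hardcoded per-instruction latency cannot be accurate: the same load costs 200 cycles or 4 cycles depending on history.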
On a related note, a conditional jump may occur during OOO execution. If the result of the condition is not yet known, the CPU does not stall; it makes an "assumption" and speculatively executes instructions from the predicted jump target as if the jump had already happened. The block responsible for this, called a branch predictor, must also be implemented in a microarchitectural simulator.
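As a rough illustration of what such a block does, here is the classic textbook two-bit saturating-counter predictor (real branch predictors are far more elaborate; the table size and initial state here are arbitrary):

```python
# Two-bit saturating counters: states 0-1 predict "not taken",
# states 2-3 predict "taken". A few consistent outcomes are enough
# to train the counter, and one mispredict does not flip it back.
class TwoBitPredictor:
    def __init__(self, table_size=16):
        self.counters = [1] * table_size  # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
pc = 0x400
for _ in range(3):          # a loop branch taken three times in a row
    bp.update(pc, taken=True)
print(bp.predict(pc))       # True: the branch is now predicted taken
```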
The diagram below shows the main CPU blocks and demonstrates the complexity of microarchitectural implementation.
In a real CPU, all these blocks are synchronized by special clock signals, and the same approach is used in the model. That is why such microarchitectural simulators are called cycle accurate. Their main purpose is to accurately predict the performance of future CPUs that are still being designed and to correctly measure the execution time of a program (for example, a CPU benchmark) running on such a CPU. If a benchmark score for the new CPU falls short of the target, its algorithms and blocks need to be redesigned.
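Structurally, a cycle-accurate simulator's main loop can be pictured as every model block advancing by one clock cycle in lockstep. The sketch below is a bare skeleton under that assumption; the block and method names are placeholders, not an actual simulator API:

```python
# Minimal cycle-driven loop: each block exposes tick(), which advances
# it by exactly one clock cycle, and the simulator steps all blocks
# together, mimicking a shared clock signal.
class Block:
    """Stand-in for a CPU block; just counts the cycles it has seen."""
    def __init__(self, name):
        self.name = name
        self.cycles = 0

    def tick(self):
        self.cycles += 1        # a real block would update its state here

blocks = [Block("fetch"), Block("decode"), Block("execute")]
for cycle in range(100):        # run 100 clock cycles
    for block in blocks:
        block.tick()            # all blocks advance in lockstep

print(blocks[0].cycles)         # 100: every block saw every cycle
```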
Cycle-accurate simulation is very slow, and it is used only for investigating specific parts of program execution where it is necessary to measure the real execution speed and evaluate the future performance of a device for which a prototype is being modeled. A functional simulator is used to simulate the rest of the program execution. How does this joint functional and cycle-accurate simulation work?
First, the OS and all the prerequisites for the target program are booted on the functional simulator. Neither the OS itself nor the initial stages of launching and configuring the program are relevant for evaluating performance; however, we can't technically skip these steps and jump right to the interesting point, so they must be run on the functional simulator. After that, there are two options. The first is to switch to the cycle-accurate model and continue execution. Simulation that consumes executable code (i.e., unmodified compiled binary program files) is called execution-driven simulation. This is the most common kind, and it is used in both functional and cycle-accurate simulators. The second option is trace-driven simulation.
This type of simulation consists of two steps. First, using a functional simulator or a real system (other methods, for example using a compiler, are also possible), a log of the program's actions is collected and written to a file. This log is called a trace. Depending on what is being investigated, the trace may include executed instructions, memory addresses, port numbers, or interrupt information.
The next step is the so-called trace playback, when the cycle-accurate simulator reads the trace and executes all the operations and instructions from it one by one. As a result, it is possible to calculate the execution time and obtain other interesting statistics, such as the cache hit rate.
It is worth mentioning that trace execution is deterministic, i.e. the same sequence of actions can be reproduced as many times as needed. By changing the parameters of the model (the size of the cache, buffers, and queues) and using various internal algorithms or fine-tuning them, one can investigate how a particular parameter affects the overall system performance and which parameter set gives the best results. All of this can be done with a virtual model prototype of the device before creating a hardware prototype.
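Putting the two ideas together — deterministic playback and parameter sweeps — here is a sketch that replays the same recorded memory trace against cache models of different sizes and compares hit rates. Representing a trace as a plain list of load addresses is a deliberate simplification; real trace formats are much richer:

```python
# Trace playback: feed a recorded sequence of memory addresses through
# a direct-mapped cache model and report the hit rate. Because the
# trace is fixed, each configuration sees exactly the same workload.
def replay(trace, num_lines, line_size=64):
    tags = [None] * num_lines
    hits = 0
    for address in trace:
        line = address // line_size
        index = line % num_lines
        tag = line // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag   # fill on miss
    return hits / len(trace)

# A synthetic trace: 16 distinct cache lines, walked over 4 times.
trace = [a * 64 for a in range(16)] * 4

# Sweep the cache size and see how the hit rate responds:
for num_lines in (8, 16, 32):
    print(num_lines, replay(trace, num_lines))
```

With 8 lines every access conflicts and the hit rate is zero; once the cache holds all 16 lines, only the first pass misses. This is exactly the kind of what-if question trace playback answers before any hardware prototype exists.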
This approach is rather cumbersome: it requires running the application in advance to collect the trace, and the trace file can be huge. Nevertheless, it is still very common, particularly because it is enough to simulate only part of the device or platform, whereas execution-driven simulation usually requires a full model.
To summarize, in this article we discussed the features of full-system simulation, took a look at simulation performance at different levels, and delved deeper into cycle-accurate simulation and traces. In the next article, I will describe the most common scenarios for simulator usage, both for personal use and for commercial development in large companies.