New system designs that become market-leading products result from innovation that is not only evolutionary or revolutionary, but also elegant, easy to use, and high in quality. Market surveys, statistics, and customer panels have a track record of pinpointing evolutionary design concepts; however, revolutionary concepts are more elusive, as customers tend to think in a linear fashion based on what exists. Revolutionary concepts often come from entrepreneurs who can see further ahead or from a small group of customers who can envision a better way forward.
Based on practical experience working on hundreds of designs worldwide, I’ve come to the conclusion that both revolutionary and evolutionary products need a strong systems engineering effort. Unlike product development and manufacturing, which have a well-defined and rigid design workflow, systems design of electronics and embedded real-time software is still in its infancy.
A number of unanswered questions have given rise to a multitude of methodologies, each with its own tools. Should system designers use a top-down or bottom-up design style? Is a centralized or distributed approach to processing the best method? Is a symmetrical or asymmetrical topology warranted? Is power or speed the driving criterion? The answers to these questions, and more, can lead to a conceptual block diagram that starts the design process, leading to a design specification.
Many computer scientists contend that memory bandwidth is one of the main issues limiting today’s processor performance, especially with the evolution of multicore processor chips and CPUs with multiple execution units. Processor cores and instruction pipelines are often stalled waiting for instruction- or data-cache accesses. Programmers contend that minimizing program variables will reduce memory accesses and increase performance, while chip designers keep improving memory bandwidth by adding more memory channels to processor cores using I1, D1, L2, L3, SDRAM, and disk memory structures. In many ways, this is linear thinking, based on the original von Neumann computer architecture.
One might consider running single-threaded program code on both processor cores and using the registers on each core. This reduces the number of variables that must be read from or written to cache down to 16, a 66.6 percent reduction in cache accesses, which consume more power and take more cycles than register accesses. Each core would need access to the other core’s set of registers, for example. In addition, many programs have tight loops that process application-critical information. If a single-thread program is run on both cores, could each core process the even or odd iterations of such a loop simultaneously? And could sequential single-cycle instructions outside of loops be executed on separate cores simultaneously? While there might be many dual-core related issues with this approach, what might its theoretical performance/power improvement be?
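The even/odd loop split raised above can be sketched as a toy cycle count. This is a minimal illustration, not the system-level model discussed in this article: the function names and the fixed cost per iteration are assumptions, and it assumes loop iterations are fully independent. It ignores register sharing, cache behavior, and synchronization between cores.

```python
def single_core_cycles(iterations: int, cycles_per_iteration: int) -> int:
    """Cycles for one core to run every loop iteration in sequence."""
    return iterations * cycles_per_iteration


def dual_core_cycles(iterations: int, cycles_per_iteration: int) -> int:
    """Cycles when one core takes even iterations and the other takes odd ones.

    Assumes independent iterations and no synchronization cost (hypothetical).
    The loop finishes when the busier core finishes.
    """
    even_count = (iterations + 1) // 2  # iterations 0, 2, 4, ...
    odd_count = iterations // 2         # iterations 1, 3, 5, ...
    return max(even_count, odd_count) * cycles_per_iteration


if __name__ == "__main__":
    n, cost = 1000, 4  # illustrative values only
    print(single_core_cycles(n, cost), dual_core_cycles(n, cost))
```

Under these idealized assumptions the split alone can at best halve the cycle count, so any gain beyond that would have to come from other effects, such as fewer cache accesses.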
One way to answer this question is a system-level dual- versus single-core model that assumes 80 percent integer instructions and 20 percent floating-point instructions per 1,010 instructions, including ten loops with 1,000 instructions each. To simplify the analysis, assume that there are no prior instruction dependencies; these could be added with an additional day of effort.
This model was used to determine the effectiveness of having the compiler issue instructions to a dual-core configuration and utilize the extra registers on each core in terms of performance and power consumption. First, two blocks were added to generate executable binaries in the form of mnemonic instruction arrays, according to the order of execution. Next, two standard library blocks with four-stage pipelines were added, including the common Instruction_Set block, which sets the cycles per instruction. The Power_Manager was added with estimated power consumption in milliwatts, based on the standby, active, wait, and idle power states.
The dual- versus single-core model answers many questions, including the one posed earlier: what might be the theoretical performance/power improvement of this approach? Looking at the results, the performance is better than expected: the dual-core configuration requires 6,370 cycles to complete a thread, whereas the single-core configuration requires 17,160 cycles to complete the same thread. The dual-core configuration completed the thread in about 63 percent fewer cycles, whereas common sense would suggest about 50 percent fewer. In terms of power consumption, it’s about the same for both configurations. Thus, system-level modeling was able to generate results showing that dual-core, instruction-synchronized execution of a single thread took about 63 percent fewer cycles than a single core at the same power level.
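The 63 percent figure can be checked directly from the two cycle counts reported above. The short sketch below uses only those two numbers; the helper names are illustrative and not part of the model.

```python
SINGLE_CORE_CYCLES = 17_160  # reported single-core cycle count
DUAL_CORE_CYCLES = 6_370     # reported dual-core cycle count


def cycle_reduction(baseline: int, improved: int) -> float:
    """Fraction of cycles saved relative to the baseline."""
    return 1 - improved / baseline


def speedup(baseline: int, improved: int) -> float:
    """How many times faster the improved configuration finishes."""
    return baseline / improved


if __name__ == "__main__":
    reduction = cycle_reduction(SINGLE_CORE_CYCLES, DUAL_CORE_CYCLES)
    factor = speedup(SINGLE_CORE_CYCLES, DUAL_CORE_CYCLES)
    print(f"cycle reduction: {reduction:.1%}")  # about 62.9%
    print(f"speedup: {factor:.2f}x")            # about 2.69x
```

Expressed as a ratio, the dual-core run finishes roughly 2.7 times sooner, compared with the 2 times that an ideal split of the work across two cores would suggest.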