From the simple embedded processor in your washing machine to powerful processors in data center servers, most computing today takes place on general-purpose programmable processors, or CPUs. CPUs are attractive because they are easy to program and because large code bases exist for them. The programmability of CPUs stems from their execution of sequences of simple instructions, such as ADD or BRANCH; however, the energy required to fetch and interpret an instruction is 10x to 4000x more than that required to perform a simple operation such as ADD. This high overhead was acceptable when processor performance and efficiency were scaling according to Moore’s Law.32 One could simply wait and an existing application would run faster and more efficiently. Our economy has become dependent on these increases in computing performance and efficiency to enable new features and new applications. Today, Moore’s Law has largely ended,12 and we must look to alternative architectures with lower overhead, such as domain-specific accelerators, to continue scaling performance and efficiency. There are several ways to realize domain-specific accelerators, as discussed in the sidebar on accelerator options.
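To make the overhead figure concrete, the following sketch (illustrative Python; only the 10x-to-4000x range comes from the text above, and everything else is an assumption for illustration) computes what fraction of per-instruction energy goes to the operation itself rather than to fetch and decode.

```python
# Illustrative sketch: fraction of per-instruction energy spent on the
# operation itself, given the 10x-4000x fetch/decode overhead cited above.
# No absolute energy numbers are assumed, only the overhead ratio.

def useful_energy_fraction(overhead_ratio: float) -> float:
    """If fetch/decode costs `overhead_ratio` times the operation (e.g., an
    ADD), this is the share of energy that performs useful work."""
    return 1.0 / (1.0 + overhead_ratio)

for ratio in (10, 100, 4000):
    print(f"overhead {ratio:5d}x -> {useful_energy_fraction(ratio):.3%} useful")
    # 10x -> ~9.1%, 100x -> ~0.99%, 4000x -> ~0.025%
```

At the high end of the cited range, well under a tenth of a percent of the energy performs useful arithmetic; that overhead is what a domain-specific accelerator removes.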

A domain-specific accelerator is a hardware computing engine that is specialized for a particular domain of applications. Accelerators have been designed for graphics,26 deep learning,16 simulation,2 bioinformatics,49 image processing,38 and many other tasks. Accelerators can offer orders-of-magnitude improvements in performance/cost and performance/W compared to general-purpose computers. For example, our bioinformatics accelerator, Darwin,49 is up to 15,000x faster than a CPU at reference-based, long-read assembly. The performance and efficiency of accelerators come from a combination of specialized operations, parallelism, efficient memory systems, and reduction of overhead. Domain-specific accelerators7 are becoming more pervasive and more visible because they are one of the few remaining ways to continue improving performance and efficiency now that Moore’s Law has ended.22
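As a rough illustration of how those sources of gain combine (a sketch with hypothetical factors; the numbers are not measurements of Darwin or any cited accelerator), the overall speedup can be modeled as roughly multiplicative across specialization, parallelism, and memory-system improvements, limited by whatever fraction of the workload the accelerator does not cover.

```python
# Sketch of an Amdahl-style model: per-source gains multiply on the
# accelerated fraction of the work; the rest runs at CPU speed.
# All factor values are hypothetical placeholders.

def overall_speedup(accel_fraction: float, *gains: float) -> float:
    """Speedup when `accel_fraction` of the work enjoys the combined
    (multiplicative) gain and the remainder is unaccelerated."""
    combined = 1.0
    for g in gains:
        combined *= g
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / combined)

# Hypothetical example: specialized ops 10x, parallelism 50x, memory
# system 4x, applied to 99.9% of the workload.
print(overall_speedup(0.999, 10, 50, 4))  # ~667x, well below the 2000x product
```

In this toy model the unaccelerated remainder caps the gain, which is one reason applications must be reworked for accelerators, as discussed next.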

Most applications require modification to achieve high speedup on domain-specific accelerators. These applications are highly tuned to balance the performance of conventional processors with that of their memory systems. When specialization reduces the cost of processing to near zero, such applications become memory limited. The application must be reworked, codesigned with the accelerator, to reduce both memory bandwidth and memory footprint. Even after this rework, many domain-specific accelerators remain memory dominated.
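A roofline-style bound shows why this happens. The sketch below uses hypothetical throughput and traffic numbers (none of them come from the text) to illustrate that once compute is nearly free, runtime is set by the bytes moved.

```python
# Roofline-style sketch: execution time is limited by the slower of the
# compute side and the memory side. All numbers are hypothetical.

def kernel_time(ops, bytes_moved, ops_per_sec, bytes_per_sec):
    """Time for a kernel doing `ops` operations over `bytes_moved` bytes."""
    return max(ops / ops_per_sec, bytes_moved / bytes_per_sec)

# Hypothetical kernel: 1e12 operations touching 1e10 bytes.
cpu   = kernel_time(1e12, 1e10, ops_per_sec=1e11, bytes_per_sec=1e11)  # 10.0 s (compute bound)
accel = kernel_time(1e12, 1e10, ops_per_sec=1e14, bytes_per_sec=1e11)  #  0.1 s (memory bound)
print(cpu / accel)  # 100x: a 1000x faster compute engine is capped by memory traffic
```

In this toy model, further speedup comes only from reducing `bytes_moved` (a smaller footprint and better reuse), which is exactly the codesign the text describes.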

A well-designed accelerator covers the broadest possible space of applications—accelerating a domain rather than a single application. Adding domain-specific instructions to a programmable processor provides the efficiency of the specialized instruction while retaining flexibility. Complex instructions give better efficiency because they amortize the high overhead of programmability. Building a parallel computer from domain-specific processing elements can also accelerate a large domain of applications with only a small loss of efficiency.
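To see the amortization argument in numbers (an illustrative sketch; the overhead ratio reuses the range quoted in the opening paragraph, and the work-per-instruction values are assumptions):

```python
# Sketch: efficiency of an instruction that performs `work_ops` simple
# operations' worth of computation while paying one fetch/decode overhead.
# The 1000x overhead is taken from the 10x-4000x range cited earlier;
# the work-per-instruction values are illustrative assumptions.

def useful_fraction(work_ops: int, overhead_ratio: float) -> float:
    """Share of energy doing useful work for one complex instruction."""
    return work_ops / (work_ops + overhead_ratio)

for work in (1, 16, 256, 4096):
    print(f"{work:5d} ops/instruction -> {useful_fraction(work, 1000):.1%} useful")
    # 1 -> ~0.1%, 16 -> ~1.6%, 256 -> ~20%, 4096 -> ~80%
```

The more work a domain-specific instruction performs, the smaller the share of energy lost to fetching and interpreting it, which is why complex instructions amortize the overhead of programmability.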
