
Upending the general-purpose CPU: "the world's most power-efficient processor" is officially released.

半导体行业观察 (Semiconductor Industry Observation), 2025-07-25 11:17
The company says its founding motivation was disappointment with traditional CPUs.

"For decades, we've been building general-purpose CPUs the wrong way." That is the bold declaration of the Efficient Computer team. To back it up, the company today officially launched its first product, the E1 processor, hoping to usher in a new era of efficient general-purpose computing.

According to Efficient Computer, this is a general-purpose processor that breaks with the industry's long-standing reliance on the von Neumann architecture. Beyond being the company's first standalone hardware product, the chip is notable for another claim: Efficient calls it "the world's most energy-efficient general-purpose processor."

According to the announcement, unlike traditional von Neumann processors, which burn much of their energy shuttling data between memory and compute cores, the Electron E1 is built on Efficient's Fabric architecture, a spatial dataflow architecture that executes general-purpose code without costly step-by-step instruction processing. Compared with traditional low-power CPUs, the approach can reportedly improve energy efficiency by up to 100 times, allowing edge-intelligence applications to run for years in environments with limited power and maintenance.

The Company's Original Intention: Disappointment with Traditional CPUs

Efficient Computer says bluntly that its engineers and developers were deeply frustrated by the limitations of traditional von Neumann processors, which consume excessive energy moving data between memory and compute cores. "We are well aware that to achieve true efficiency, we need to fundamentally rethink how processors are designed."

As the team puts it, processors have long been constrained by the control-flow model, constantly shuttling data among caches, memory, and compute units, and every step of that journey costs a significant amount of energy. In fact, today's CPUs can spend more energy moving data than actually processing it. Traditional architectures focus on performance, power, or energy per operation, and often overlook the overhead of data movement. That overhead is precisely the bottleneck for power-constrained embedded systems, including those running on small custom batteries or other battery-powered designs.

In their view, modern computing systems face a fundamental trade-off: extreme energy efficiency usually comes at the cost of general programmability. The reason is that in modern architectures, data movement, rather than computation, is the main bottleneck for energy, performance, and scalability. High efficiency therefore means placing as much of the computation as possible close to memory.

As shown in the figure, today's systems, including CPUs, GPUs, and other programmable accelerators, address this challenge by distributing data memories among processing elements (PEs) (see the NUMA diagram). The idea is to decide in advance which program data maps to which memory, and to assign tasks to PEs near the corresponding memory. This approach is widely known as non-uniform memory access (NUMA), because different memories have different (i.e., non-uniform) access times from different processors.

NUMA alone cannot balance efficiency and generality. For anything beyond the simplest programs, deciding how to co-schedule data and tasks is close to intractable for compilers and runtime systems. Workloads with irregular computation patterns, such as sparse machine-learning models, are hard to analyze, and even with complete information, remapping data may not significantly reduce data movement. Another common approach shifts the burden onto programmers via domain-specific languages or low-level APIs, but that sacrifices generality and limits usability.

As is well known, traditional processors execute sequential operations through branch prediction, and between operations the processor must consult memory adjacent to the processing pipeline and reconfigure itself. In the words of Efficient Computer's CEO and co-founder: "In each cycle, you perform billions of such operations per second; this is very wasteful."

This is why the Electron E1 takes a different approach. "Based on a decade of research at Carnegie Mellon University, we built the Fabric architecture from scratch, aiming to bring significant energy-efficiency gains to general-purpose computing," the company says.

The E1 is built on a proprietary spatial dataflow architecture called Fabric, which eliminates the overhead of instruction fetch, decode, and register-file traffic. According to Efficient, the design achieves unprecedented energy efficiency in a chip that retains full software programmability and general-purpose utility, delivering up to 1 trillion 8-bit integer operations per second per watt (1 TOPS/W).

"If you ask a computer to perform an operation of X plus Y, then 95% to 99% of the energy consumption is spent on instruction provision, decoding, pipeline reconfiguration, and operand provision. Only 1% to 5% of the energy consumption is actually used to perform the addition operation," said Lucia, the company's founder.
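Lucia's 95% to 99% figure lines up with the "up to 100 times" efficiency claim earlier in the article. A quick, purely illustrative calculation (the function name and numbers below are my own, not Efficient's) shows why: if nearly all per-operation energy is overhead, eliminating it shrinks energy per operation by a large factor.

```python
# Illustrative arithmetic only: if a fraction of each operation's energy
# budget is overhead (instruction supply, decode, operand delivery), then
# eliminating that overhead divides energy per op by 1 / (1 - overhead).

def efficiency_headroom(overhead_fraction: float) -> float:
    """Factor by which energy per op shrinks if all overhead is removed."""
    return 1.0 / (1.0 - overhead_fraction)

print(round(efficiency_headroom(0.95)))  # 20
print(round(efficiency_headroom(0.99)))  # 100
```

The 95% to 99% overhead range quoted above thus corresponds to roughly 20x to 100x of theoretical headroom, consistent with the "up to 100 times" claim.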

The Dataflow Architecture: The Essence of the Breakthrough

The Electron E1 processor is built on this spatial dataflow architecture, which executes general-purpose code without costly step-by-step instruction processing. Efficient's goal is to solve the problem through static scheduling and dataflow control: no buffering, just execution. There is no cache and no out-of-order machinery, and it is neither a VLIW nor a DSP design. It is a general-purpose processor.

When most people hear "low-power chip" or "embedded CPU," they picture an in-order ARM Cortex-M, or something slightly above it with a bit of out-of-order execution, plus enough on-chip memory or some off-chip DRAM. The model is simple: a small processor fetches, decodes, schedules, executes, and retires instructions step by step, moving data in and out of memory as needed.

Efficient's architecture, simply called "Fabric," is based on the spatial dataflow model. Instead of pushing instructions through a centralized pipeline, the E1 binds instructions to specific compute nodes called "tiles" and lets data flow between them. A node (say, a multiplier) processes its operands only when all of its operand registers are filled; the result is then sent to the next tile that needs it. There is no program counter and no global scheduler. This native dataflow execution model is said to dramatically reduce the energy traditional CPUs waste on data movement.
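The firing rule described above can be sketched in a few lines. This is a conceptual model of the spatial-dataflow idea, not Efficient's actual hardware or ISA: each tile fires only once all its operand slots are filled, then forwards its result to its consumers.

```python
# Minimal conceptual model of a dataflow tile: it executes only when all
# operand slots are filled, then pushes its result to downstream tiles.
# No program counter, no global scheduler; data arrival drives execution.

class Tile:
    def __init__(self, op, n_inputs, consumers):
        self.op = op                        # function applied when the tile fires
        self.slots = [None] * n_inputs      # operand registers
        self.consumers = consumers          # list of (tile, slot_index) pairs

    def receive(self, slot, value):
        self.slots[slot] = value
        if all(s is not None for s in self.slots):   # firing condition
            out = self.op(*self.slots)
            self.slots = [None] * len(self.slots)    # ready for the next wave
            for tile, idx in self.consumers:
                tile.receive(idx, out)
            return out

# (x + y) * z as a two-tile graph: results flow tile to tile.
sink = Tile(lambda v: v, 1, [])                 # captures the final value
mul  = Tile(lambda a, b: a * b, 2, [(sink, 0)])
add  = Tile(lambda a, b: a + b, 2, [(mul, 0)])

add.receive(0, 2)            # x arrives
add.receive(1, 3)            # y arrives -> add fires, sends 5 to mul
result = mul.receive(1, 4)   # z arrives -> mul fires: (2 + 3) * 4
print(result)  # 20
```

The point of the model is that execution order emerges from data availability, not from a sequential instruction stream.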

The Electron E1 is essentially a grid of small compute blocks, each capable of basic operations: arithmetic, logic, and memory access. The compiler statically schedules each tile and routes the data between them. The Efficient compiler converts C++ or Rust code into a dataflow graph, and this is the key point: because it runs regular C++ or Rust, Efficient claims the E1 is a general-purpose CPU.
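To make the compile step concrete, here is a toy sketch of lowering an expression tree into per-tile "programs" pinned to grid positions at compile time. The placement policy (simple row-major assignment) and all names are invented for illustration; the real effcc compiler's mapping is far more sophisticated.

```python
# Toy lowering pass: turn an expression tree into a static tile program.
# Each operation gets a tile id and a fixed grid coordinate; operands are
# either literals or references to producer tiles ("t0", "t1", ...).

def lower(expr, grid, program):
    """expr is a constant or (op, left, right); returns the literal or the
    tile id that produces the expression's value."""
    if not isinstance(expr, tuple):
        return expr                       # literal operand, fed in directly
    op, lhs, rhs = expr
    a = lower(lhs, grid, program)
    b = lower(rhs, grid, program)
    tile_id = len(program)
    program.append({"tile": grid[tile_id], "op": op, "inputs": (a, b)})
    return f"t{tile_id}"

grid = [(x, y) for y in range(2) for x in range(2)]   # a tiny 2x2 fabric
program = []
root = lower(("mul", ("add", 2, 3), 4), grid, program)  # (2 + 3) * 4
for step in program:
    print(step)
```

All placement and routing decisions are fixed before the program runs, which is exactly the property that lets the hardware skip fetch and decode.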

Of course, this brings its own challenges. What happens if the program graph is too large for the chip? Efficient solves this through reconfiguration: the compiler splits the graph into chunks, and the chip dynamically loads new configurations during execution. There is even a small cache of recent configurations, so loops and repeated patterns do not force a complete reload every time.

The tile-to-tile interconnect is also statically routed and bufferless, fixed at compile time. With no flow control or retry logic, if two data paths would conflict, the compiler must resolve the conflict at compile time. This keeps Fabric extremely energy-efficient but shifts a great deal of responsibility onto the toolchain. Relying on a "sufficiently smart" compiler has historically been a hard problem in computing, so it will be very interesting to see how well this approach works.
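One way to picture compile-time conflict resolution: treat each route as a sequence of links occupied on consecutive cycles, and delay a route's start until no link is claimed twice in the same cycle. The greedy delay-insertion policy below is invented for illustration; it is not Efficient's actual routing algorithm.

```python
# Sketch of compile-time route arbitration on a bufferless interconnect:
# each route occupies one link per cycle, and the scheduler staggers start
# times so no (link, cycle) pair is ever claimed by two routes.

def schedule_routes(routes):
    """routes: list of link sequences. Returns one start cycle per route."""
    occupied = set()                       # (link, cycle) pairs already claimed
    starts = []
    for links in routes:
        start = 0
        while any((link, start + t) in occupied
                  for t, link in enumerate(links)):
            start += 1                     # delay the whole route one cycle
        for t, link in enumerate(links):
            occupied.add((link, start + t))
        starts.append(start)
    return starts

# Both routes cross link "B"; the second is pushed back one cycle.
print(schedule_routes([("A", "B", "C"), ("D", "B", "E")]))  # [0, 1]
```

Because all of this happens at compile time, the hardware needs no flow control, which is where the energy saving comes from, and why the toolchain carries so much responsibility.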

A notable milestone for the first release candidate of the Electron E1 is support for 32-bit floating-point operations. Many low-power architectures support only integer operations and rely on fixed-point math. Professor Brandon Lucia, the company's CEO, emphasized that 32-bit floating point is crucial to the architecture's scalability.

Importantly, this is not dataflow simulated in software; the hardware itself is designed as a dataflow engine. Whether it is flexible enough for real-world embedded software, or will surface too many edge cases, remains to be seen. Architecturally it is far from traditional CPU designs while still claiming to be "general-purpose," and that, reportedly, is where its power advantage lies.

Physically, the Electron E1 uses a standard BGA package and integrates on-chip memory and peripheral interfaces to minimize external dependencies. It contains 4 MB of MRAM for non-volatile code and data storage, along with 3 MB of SRAM and 128 KB of cache. The chip provides six instances each of QSPI, UART, SPI slave, and I2C master interfaces, plus 72 GPIO lines and a real-time clock.

In terms of performance, the E1 offers two operating modes: a low-voltage mode delivering 6 GOPS at a 25 MHz Fabric clock, and a high-voltage mode delivering up to 24 GOPS at a 100 MHz clock. A programmable wake-up controller and integrated buck and LDO regulators support dynamic power modes, including sleep and deep sleep.
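A quick back-of-the-envelope check on those figures: dividing throughput by the Fabric clock gives the implied number of operations completed per cycle, i.e. how much spatial parallelism the grid must sustain. (The calculation is mine, not the company's.)

```python
# Implied operations per Fabric cycle from the quoted GOPS and clock rates.

def ops_per_cycle(gops: float, clock_mhz: float) -> float:
    return (gops * 1e9) / (clock_mhz * 1e6)

print(ops_per_cycle(6, 25))     # low-voltage mode:  240.0 ops/cycle
print(ops_per_cycle(24, 100))   # high-voltage mode: 240.0 ops/cycle
```

Both modes imply the same roughly 240 operations per cycle, which is consistent with a fixed spatial grid whose throughput scales linearly with clock frequency.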

The chip's scalar RISC-V core can be powered off while the Fabric continues to execute.

The E1 is powered by a 1.8 V supply, with an internal logic voltage range of 0.55 V to 0.8 V, and is rated for an industrial temperature range of -40°C to 125°C.

The Electron E1 also supports full-stack programmability, compiling ordinary code into a dataflow graph and placing it across the Fabric. Throughout, the system maintains deterministic, static scheduling, and a compiled program can reportedly run persistently for up to 100 million executions.

The Software Stack: The Other Core of the Story

Alongside the E1, Efficient also released the first public version of its compiler toolchain, effcc, which abstracts the unusual hardware behind standard development interfaces. Built on LLVM and MLIR, the compiler accepts standard C code and integrates into existing developer workflows with minimal changes: developers simply point common build tools such as Make and CMake, or editors such as Visual Studio Code, at the new compiler binary.

This maintains compatibility with the debugging tools and workflows that engineers are already familiar with. As Lucia said: "You just use effcc, and it will run on the front end like Clang... If you use VS Code, Make, or CMake, you just need to let it know where our compiler is, and then everything will work fine."

The compiler front end uses Clang; in the middle end, Efficient lowers the input into an intermediate representation customized for Fabric. Advanced optimization passes, including an AI-based scheduling framework called the Modular Optimization Framework (MOF), analyze the code structure and map it efficiently onto the spatial grid. This includes automatically routing instruction outputs to downstream tiles and optimizing dataflow paths to minimize latency and power consumption.

With these tools, developers can simulate execution in a web-based Playground that includes an interactive visualization of how instructions propagate through the Fabric. The company touts a "two-minute Hello World" development experience, all but eliminating the learning curve of a new platform.

As noted, the Electron E1 uses standard tooling; the Clang-based compiler front end supports the aforementioned C++ and Rust. The company also claims support for machine-learning frameworks such as PyTorch, TensorFlow, and JAX, though it is unclear how much manual intervention those paths require.

Previously, the effcc toolchain existed only as a sandboxed compiler playground. With the E1's release it is now fully downloadable, meaning developers can integrate it directly into their workflows and target real silicon. effcc takes ordinary code and lowers it onto Fabric's spatial dataflow model, handling graph decomposition, mapping of operations to tiles, configuration generation, and pipeline management. In traditional processors these decisions are made dynamically at runtime; here they are resolved statically at compile time. That compile-time resolution is the source of the efficiency, but it also means the compiler must be very smart.

Efficient promises that developers need no new mental model to get started: just write C++, and the compiler handles the mapping. My biggest question is what happens when some code cannot be mapped cleanly, or the compiler hits a corner case. Will developers get insight into the problem, the way high-performance chip programmers do with performance tools? Whether a new paradigm like the E1 gets adopted will likely depend on how well the toolchain holds up once developers stress-test it in real-world environments.

Some Thoughts

Efficient says the E1 is best suited to embedded and edge-AI workloads, which are usually limited by current CPUs and narrow accelerators. By offering accelerator-class energy efficiency with CPU-class programmability, the company positions the E1 between general-purpose computing and dedicated AI-inference chips.

Standalone accelerators handle only the dense matrix-multiplication core of a machine-learning pipeline, whereas the E1 can run the upstream signal processing, sensor fusion, and downstream analytics or control logic on the same chip.

Efficient has begun sampling the E1 to early trial customers and is collaborating with partners in the industrial and aerospace verticals. The upcoming Photon P1, a higher-end successor to the E1, will extend the architecture's applicability to larger edge-computing scenarios, possibly reaching into the low end of the data center. As Lucia put it: "Our architecture combines scalability and efficiency, and this is the real essence."

"My vision for the company is that we start from the embedded field with the E1 and then expand all the way to the edge, the cloud, and finally to the data center," emphasized Lucia.

Todd Austin, a professor of computer science and engineering at the University of Michigan, says chips like the E1 are a good example of efficient architectures because they minimize the silicon devoted to anything other than pure computation, such as fetching instructions, buffering data, and checking whether network routes are in use.

Rakesh Kumar, a computer architect at the University of Illinois at Urbana - Champaign, also said that Lucia's team "is conducting a lot of ingenious research to provide extremely low power consumption for general computing." He predicted that the challenge for this startup will be economic viability. "Ultra - low - power companies have always struggled because the market for low - power, inexpensive microcontrollers is highly competitive. The key challenge is to discover new features" and get customers to pay for them.

Let's return to performance, and especially energy efficiency. Efficient claims the Electron E1 is 10 to 100 times more energy-efficient than market-leading embedded ARM CPUs (specifically Cortex-M33-, M85-, and A5-class cores). Efficient's core metric is "operations per joule," which makes sense if your design goal is battery life. The question it answers is: how much useful work can be done per unit of energy?
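Note that "operations per joule" and the 1 TOPS/W figure quoted earlier are the same quantity in different units: TOPS per watt is operations per second divided by joules per second, which is simply operations per joule. A quick conversion (my arithmetic, not the company's):

```python
# Unit conversion: 1 TOPS/W = 1e12 ops/s per J/s = 1e12 ops per joule,
# i.e. 1 picojoule per (8-bit) operation.

def tops_per_watt_to_pj_per_op(tops_per_watt: float) -> float:
    ops_per_joule = tops_per_watt * 1e12
    return 1e12 / ops_per_joule            # picojoules per operation

print(tops_per_watt_to_pj_per_op(1.0))  # 1.0 pJ per op
```

So the headline 1 TOPS/W claim amounts to about one picojoule per 8-bit operation, which is the number to compare against embedded CPU datasheets.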

However, the CEO has also emphasized "TOPS per watt" as a key metric, and frankly, that worries me a bit. TOPS per watt is usually an AI-accelerator metric, not a general-purpose CPU metric, and it depends on precision. Although the E1 supports FP32, comparing a general-purpose CPU on TOPS drifts into the performance-marketing territory we usually see with machine-learning chips rather than embedded parts. In addition, traditional CPUs may have large vector engines that mask true serial performance.