A chip startup challenges Nvidia and Intel.
When it exited stealth mode in October 2024, Israeli chip startup NextSilicon stated that its upcoming Maverick-2 was the world's first Intelligent Compute Accelerator (ICA). Designed to meet the needs of high-performance computing and artificial intelligence (HPC-AI) applications, it is a "novel and original computing architecture" that can improve performance while reducing power consumption and cost.
Now, after eight years and $303 million across seed funding and three rounds of venture capital, NextSilicon has finally launched multiple versions of its 64-bit dataflow engine. The company is also introducing an independently developed RISC-V processor called Arbel, which may be paired with Maverick-2 to create products similar to NVIDIA's "Superchip".
From left to right: NextSilicon Arbel RISC-V CPU, Maverick-1 DFP, Maverick-2 DFP, and the dual-chip Maverick-2 for the OAM socket.
NextSilicon was founded in 2017, well before the GenAI boom. Even then, however, people had realized that the architectures of HPC and AI computing engines were about to diverge, and not in favor of the HPC simulation and modeling fields that depend on 64-bit and 32-bit floating-point math. Although it had no initial plan to enter the AI market directly, as companies such as Cerebras Systems, Graphcore, Groq, Habana Labs, Nervana Systems, and SambaNova Systems did, NextSilicon had raised $202.6 million in three rounds of financing, capped by a $120 million Series C completed in June 2021.
At that time, NextSilicon's valuation was approximately $1.5 billion. That funding, together with the completion of prototype design work, meant the US Department of Energy could keep an eye on NextSilicon's progress. Sandia National Laboratories collaborated with NextSilicon to design and test the Maverick-1 dataflow engine, and Sandia is now building a new-architecture supercomputer called "Spectra" as part of its Vanguard-II program. It is speculated that this machine will be built with the Maverick-2 dataflow engine released today.
A Brand-New Path
With NVIDIA firmly in control of the market, why build a new chip? "Mainly because there is no dedicated accelerator for high-performance computing," said Elad Raz, founder and CEO of NextSilicon, in a media interview last year. He pointed out that hundreds of companies are building accelerators for artificial intelligence and machine learning, and that most large suppliers are shifting toward AI and ML. "You can see what large supercomputers mean to them: they just build a new GPU cluster, but the cost is twice as much, the power consumption is twice as high, and they get the same FP64 floating-point throughput. NextSilicon is a company that puts high-performance computing (HPC) first."
They intend to embark on a brand-new path.
Although GPUs and CPUs have driven major scientific and social breakthroughs in high-performance computing (HPC) and artificial intelligence (AI), they face a future of diminishing returns. Instead of following the same old path and pouring huge sums into ever-larger AI factories stocked with increasingly powerful GPUs (and the more advanced power and cooling systems they demand), NextSilicon's founders decided to try a different approach.
Elad Raz pointed out that although the 80-year-old von Neumann architecture gives us a general-purpose programmable computing foundation, it also carries enormous overhead. By his account, 98% of a chip is devoted to control tasks such as branch prediction, out-of-order logic, and instruction processing, while only 2% performs the actual calculations at the core of the application.
So Raz and his team conceived a new architecture they call the "Intelligent Computing Architecture" (ICA). It enables the chip to reconfigure itself to match changing workloads, minimizing overhead and maximizing the compute available for the math behind demanding AI and HPC applications. This is the basis of NextSilicon's "Runtime Optimization of Reconfigurable Hardware" patent and the guiding principle of the non-von Neumann dataflow architecture in its Maverick-2 processor.
"NextSilicon's mission is to use software to accelerate your applications," Raz explained. "At its core is a complex software algorithm that can understand the important parts of the code and accelerate them. In contrast, most CPUs and GPUs are some form of processor core groups. They receive instructions and try to build complex pipelines and vector instruction sets and use out - of - order execution to reduce latency. We think this is the wrong approach. A better way is to apply the Pareto principle and see which 20% of the code takes up 80% of the running time. Why don't we apply the 80/20 rule to computing and memory? Why can't we automatically identify important computing cores and try to focus only on them?"
Raz then described the secret: "The application starts running on the host, and then we automatically identify the computationally intensive parts of the code. We retain the intermediate representation of the computational graph; we don't convert the graph into instructions. You should think of it as a just-in-time compiler for hardware. We keep the program's computational graph and place it on the dataflow hardware. We get telemetry back from the hardware and do this recursively, so we are always optimizing compute and memory while the program is running."
"The advanced software analyzer is like a precise positioning system that continuously monitors your application. It precisely locates the critical code segments that consume performance and then reconfigures the hardware itself with nanosecond - level granularity to build a custom data pipeline optimized for that specific code. This asymmetric execution model can precisely direct excellent efficiency to where it can be most effective while keeping most of your code running normally," Raz summarized.
Raz also pointed out that NVIDIA's CUDA ecosystem ties everyone to its GPUs, costing them initiative and bargaining power. NextSilicon has therefore formulated a vision that is revolutionary rather than iterative. Instead of playing by the existing rules, the company wants to establish a new game, one where the computing infrastructure:
1. Runs everything without compromise: your existing CPU code, complex GPU kernels, demanding HPC tasks, and cutting-edge AI/ML models all run without code modification.
2. Provides ultimate speed: experience up to 10x acceleration at only a quarter of the power consumption. How? By dynamically reconfiguring the chip in real time around the hottest, most resource-intensive code paths of the application.
3. Eliminates vendor lock-in: say goodbye to proprietary domain-specific languages (DSLs). Say goodbye to cumbersome porting processes. Say goodbye to the nightmare of framework maintenance. Your code, your language, and accelerated development.
4. Keeps your innovation relevant: ICA continuously adapts as workloads evolve, so you never hit a "rewrite bottleneck".
In summary, NextSilicon's dataflow architecture is built on a graph structure. Instead of processing instructions one by one as a von Neumann machine does, the dataflow processor consists of a series of computational units (ALUs) interconnected in a graph. Each ALU handles a specific type of function, such as multiplication or logical operations. When input data arrives, the computation is triggered automatically and the result flows to the next unit in the graph. Compared with serial instruction processing, this approach has a significant advantage: the chip no longer spends cycles on fetching, decoding, or scheduling overhead.
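To make that firing semantics concrete, here is a minimal, hypothetical Python model of a dataflow graph computing (a * b) + (c * d): each node fires as soon as all of its operands have arrived and pushes its result downstream, with no fetch, decode, or scheduler in the loop. The Node class is our own illustration, not NextSilicon hardware.

```python
import operator

class Node:
    """One ALU-like node in the graph: fires when all operands have arrived."""

    def __init__(self, name, fn, consumers=()):
        self.name, self.fn = name, fn
        self.consumers = list(consumers)  # nodes fed by this node's result
        self.inbox = []                   # operands received so far

    def receive(self, value, arity=2):
        self.inbox.append(value)
        if len(self.inbox) == arity:      # all inputs present -> fire
            result = self.fn(*self.inbox)
            print(f"{self.name} fired -> {result}")
            for consumer in self.consumers:
                consumer.receive(result)  # result flows to the next unit

# Graph for (a * b) + (c * d): two multipliers feeding one adder.
add = Node("add", operator.add)
mul1 = Node("mul1", operator.mul, consumers=[add])
mul2 = Node("mul2", operator.mul, consumers=[add])

for node, x, y in [(mul1, 2, 3), (mul2, 4, 5)]:
    node.receive(x)
    node.receive(y)   # the second operand's arrival triggers the computation
```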
One year after previewing Maverick-2, NextSilicon has finally revealed the detailed specifications of this revolutionary chip.
A Distinctive Chip
As shown in the figure below, the Maverick-2 die has four computing regions, with 32 RISC-V E-cores along the left and right edges of the chip. Each region's grid of compute blocks is seven columns of eight blocks, for a total of 224 compute blocks on the chip. Each compute block holds hundreds of ALUs, so the total easily reaches tens of thousands, approaching a hundred thousand ALUs. For a chip manufactured on TSMC's 5-nanometer process with 54 billion transistors, such figures may seem implausible.
However, if each block contains the 14 x 14 grid shown in NextSilicon's diagram, then each compute block has 196 ALUs. We don't know how many floating-point units a compute block contains; it would make sense for each ALU to have an FPU.
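A quick back-of-envelope calculation shows the two readings are consistent; these figures are derived from the diagram as described above, not from official specifications.

```python
# Back-of-envelope check of the figures above; derived, not official specs.
regions = 4                    # compute regions on the die
blocks_per_region = 7 * 8      # seven columns of eight compute blocks each
alus_per_block = 14 * 14       # the 14 x 14 ALU grid per block
blocks = regions * blocks_per_region
print(blocks, blocks * alus_per_block)   # -> 224 blocks, 43,904 ALUs
```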
For comparison, NVIDIA's "Ampere" A100 GPU is manufactured on TSMC's 7-nanometer process, with 54.2 billion transistors and 6,912 FP32 CUDA cores; the "Hopper" H100 and H200 GPUs are built on a 4-nanometer process, with 80 billion transistors and 18,432 FP32 cores. The Blackwell B200 socket holds two chiplets, each with 104 billion transistors, but each chiplet contains only 16,896 CUDA cores and is made on a 4-nanometer process. We speculate that ALUs are smaller than CUDA cores, and that the number of ALUs on the Maverick-2 chip exceeds the number of CUDA cores on NVIDIA's GPUs.
Ultimately, the number of ALUs matters less than the number of threads a group of mill cores can support. Ilan Tayari, co-founder and vice president of architecture at NextSilicon and former software director at Mellanox (now NVIDIA's networking division), said that a typical CPU core runs two threads and a GPU runs 32 to 64, but a mill core can support hundreds of threads simultaneously. The size and shape of mill cores will vary, but each compute block may hold dozens of them, and each Maverick-2 has 224 compute blocks, so the chip can easily support thousands of threads, all running at 1.5 GHz, roughly the speed of a slow CPU or an ordinary GPU, and all connected to HBM3E memory for high bandwidth.
As shown on the right side of the figure above, this main logic unit connects to a memory bus with a reservation station that holds data until an ALU calls for it. (NextSilicon has patented this combination of reservation station, scheduler, and dataflow compute block.) Like a regular CPU, the Maverick ICA also has a memory management unit and a translation lookaside buffer, but these are used only rarely, when an ALU requests specific data. It performs no speculation or prediction, only data fetches.
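As a rough illustration of how a reservation station decouples the memory bus from the ALUs, here is a toy Python model based only on the description above (the patented design is not public at this level of detail): operands are parked until everything an ALU needs has arrived, with no speculation, only demand-driven issue.

```python
class ReservationStation:
    """Toy model: park operands off the memory bus until an ALU wants them."""

    def __init__(self):
        self.slots = {}                    # tag -> operand waiting to be used

    def park(self, tag, operand):
        self.slots[tag] = operand          # buffer data before the ALU calls it

    def ready(self, tags):
        return all(t in self.slots for t in tags)

    def issue(self, tags):
        return [self.slots.pop(t) for t in tags]   # hand operands to the ALU

rs = ReservationStation()
rs.park("a", 6)
rs.park("b", 7)
if rs.ready(["a", "b"]):                   # fire only once every operand arrived:
    a, b = rs.issue(["a", "b"])            # no prediction, just demand-driven fetch
    print("ALU result:", a * b)
```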
Tayari said proudly: "NextSilicon's dataflow architecture enables us to significantly reduce overhead compared with traditional CPUs and GPUs. We have shifted the silicon budget so that most of the area goes to actual computing rather than control overhead. Our approach eliminates instruction-processing overhead and minimizes unnecessary data movement, keeping the computing units fully utilized. We are not trying to hide latency; we tolerate it and minimize it by design."
When an application is compiled for the dataflow engine, it is actually mapped onto the engine as something called a mill core, which looks like a graph: essentially the program's pre-compilation intermediate representation, laid out across the ALUs. Raz said that multiple mill cores can be packed onto the same compute block like Tetris pieces, and that mill cores can be loaded and removed within a few nanoseconds as the workload requires.
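The "Tetris" packing can be pictured as a simple capacity allocator over each block's ALU budget. The sketch below is hypothetical: the mill-core names and ALU sizes are invented, and the real placement logic is certainly richer, but it shows the load/unload behavior Raz describes.

```python
class ComputeBlock:
    """Toy allocator for one compute block's ALU budget (figures invented)."""

    def __init__(self, alus=196):          # per-block budget from the 14 x 14 grid
        self.alus, self.resident = alus, {}

    def free(self):
        return self.alus - sum(self.resident.values())

    def load(self, mill_core, alus_needed):
        if alus_needed > self.free():
            return False                   # doesn't fit; place on another block
        self.resident[mill_core] = alus_needed
        return True

    def unload(self, mill_core):
        self.resident.pop(mill_core, None) # removal is cheap in the real design

block = ComputeBlock()
for core, size in [("fft_core", 120), ("dot_core", 60), ("big_core", 100)]:
    print(core, "loaded" if block.load(core, size) else "deferred")
block.unload("fft_core")                   # make room as the workload evolves
print("free ALUs:", block.free())
```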
Maverick-2 is available in single-chip and dual-chip configurations. The single-chip Maverick-2 has 32 RISC-V cores, is built on TSMC's 5 nm process, and is clocked at 1.5 GHz. The card supports PCIe Gen5 x16, carries 96 GB of HBM3E memory with up to 3.2 TB/s of bandwidth, has 128 MB of level-1 cache, includes a 100 GbE network interface, and is air-cooled with a 400 W thermal design power (TDP). The dual-chip Maverick-2 effectively doubles all of these figures; it plugs into an OAM (OCP Accelerator Module) socket, carries two 100 GbE NICs, supports air or liquid cooling, and has a 750 W TDP.
NextSilicon also shared some internal benchmark data for Maverick-2. On GUPS (giga-updates per second), Maverick-2 delivers 32.6 GUPS at 460 watts, which the company says is 22 times faster than a CPU and nearly 6 times faster than a GPU. On HPCG (High-Performance Conjugate Gradient), Maverick-2 reaches 600 GFLOPS at 750 watts, said to be comparable to leading GPUs at half the power.
Eyal Nagar, vice president of R&D at NextSilicon, said: "What we are detailing today is not just a chip but a foundation, a new way of thinking about computing. It opens up a whole new world of possibilities and optimizations."