How to bridge the multi-trillion AI storage gap?
AI storage is entering a window of explosive growth. Driven by breakthroughs such as software-defined technology and the decoupling of the Universal Storage architecture from underlying hardware, new-generation distributed storage can now deliver sub-millisecond latency and TB-level throughput as a more efficient data infrastructure, supporting core scenarios such as training, inference, and Multi-Agent collaboration. It is steadily becoming the mainstream choice for the storage layer in the AI era.
This article provides an in-depth analysis of the technical path and future development trends of storage software in the AI era. We welcome you to exchange ideas with us on industry opportunities and investment prospects.
Silicon Valley AI storage company Vast Data is currently in in-depth negotiations on a new financing round with CapitalG, the growth fund under Alphabet (Google's parent company), and with its existing strategic investor NVIDIA, at a valuation as high as $30 billion. Founded only nine years ago, the company has drawn investment from well-known institutions and industrial players such as Tiger Global Management, Goldman Sachs, and Dell. In December 2023 it closed a round at a $9.1 billion valuation, meaning its valuation has grown 3.3x in just a year and a half. Vast Data has won over both customers and the US capital market with its architectural innovation in storage, and it has prompted the Chinese capital market to ask: what opportunities and challenges do large models bring to storage?
Why does storage present new opportunities in the AI era?
In AI infrastructure, computing, storage, and networking are the most fundamental components: computing power is the engine, storage is the fuel, and the network is the circulatory system. All three are crucial to the smooth operation of AI applications. Yet in the public eye the performance gains of computing power are highly visible, while the supporting role of storage is easy to overlook. In fact, with global data volume expanding at roughly 36% per year and expected to reach the YB level by 2030, storing these massive amounts of data efficiently and securely has become the precondition for computing power to let large models deliver their full value: "Computing power determines the lower limit of artificial intelligence, and data determines the upper limit."
Since the Transformer architecture was proposed in 2017, the focus of large-model development has kept shifting. In the early stage, training dominated, and the core goal was to improve model capability by scaling up parameters and data. Later, as large models moved into practical applications, they hit cost and efficiency bottlenecks, and inference-oriented technologies such as inference-specific chips and MoE gained traction. The emergence of Agents has pushed AI from single-task execution toward complex decision-making and interaction, making them the most imaginative segment of AI applications.
During this migration, the core requirements placed on storage have changed in many ways, which can be roughly summarized in five points:
1. Extreme throughput, low latency, and high concurrency on top of reliability
Throughput: Traditional Internet applications only need MB/s-level throughput, but in large-model training, multiple GPU nodes must sustain reads and writes at dozens of GB/s (e.g., gradient synchronization). In the inference phase, bursty throughput at the level of hundreds of GB/s is required (e.g., KV Cache loading). In Multi-Agent collaboration, cluster-level throughput is needed, with aggregate bandwidth of 500 GB/s-1 TB/s (tens of thousands of QPS, each request carrying MB-level context data; a sizing sketch follows below).
Latency: Traditional Internet applications can tolerate latency on the order of 10 ms, even during peaks such as the Double Eleven e-commerce shopping festival. In large-model training, however, AllReduce synchronization requires sub-millisecond latency; in inference, latency above 1 ms triggers service degradation; and Multi-Agent collaboration must keep storage response below 1 ms, otherwise task hand-off between Agents is blocked.
Concurrency: Traditional Internet applications have simple concurrency requirements, relying on horizontal scaling and caching, with loose latency and throughput demands. Large-model training requires strongly consistent synchronization at GB/s scale, where storage bandwidth and latency directly affect training efficiency. Large-model inference requires high QPS and low latency so that KV Cache loading does not become a bottleneck. Multi-Agent systems need real-time collaboration at TB/s scale, with extremely high demands on concurrency control and consistency: for example, when multiple Agents modify the same memory segment simultaneously, distributed concurrency control is required.
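To make the aggregate figures above concrete, here is a back-of-envelope sizing sketch in Python; the QPS and per-request payload values are illustrative assumptions, not measurements from any particular system.

```python
# Rough sizing for the Multi-Agent scenario: aggregate bandwidth = QPS * payload.
def aggregate_bandwidth_gbps(qps: int, payload_mb: float) -> float:
    """Aggregate bandwidth in GB/s given requests per second and MB per request."""
    return qps * payload_mb / 1024  # MB/s -> GB/s

# Tens of thousands of requests per second, each carrying MB-level context:
for qps, payload in [(50_000, 10), (100_000, 8)]:
    print(f"{qps} QPS x {payload} MB -> {aggregate_bandwidth_gbps(qps, payload):.0f} GB/s")
# ~488 GB/s and ~781 GB/s -- roughly the 500 GB/s - 1 TB/s aggregate band cited above.
```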
2. Unified management of multi-modal data and version traceability
Data types have expanded from single structured data (such as text) to multi-modal data (images, videos, audio, 3D point clouds, etc.). The storage solution therefore needs to manage object storage, file systems, block storage, and KV databases (such as Redis) at the same time, avoiding the redundancy and latency caused by copying data across formats.
In model fine-tuning and A/B testing, the storage system must support data snapshots and version chains so that every experiment can be reproduced. For example, the RLHF stage of GPT-4 needs to track version differences across tens of thousands of human-feedback data items.
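As a rough illustration of what version traceability requires, the following is a minimal sketch of a hash-chained version record; the `VersionChain` class is a toy assumption, not modeled on any particular storage product.

```python
# A minimal sketch of a dataset version chain for experiment reproducibility.
import hashlib, json, time

class VersionChain:
    def __init__(self):
        self.versions = []  # append-only list of version records

    def commit(self, data: bytes, note: str) -> str:
        parent = self.versions[-1]["id"] if self.versions else None
        record = {"parent": parent, "data_hash": hashlib.sha256(data).hexdigest(),
                  "note": note, "ts": time.time()}
        record["id"] = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()[:12]
        self.versions.append(record)
        return record["id"]

    def lineage(self, version_id: str):
        """Walk parents back to the root so any experiment can be replayed."""
        by_id = {v["id"]: v for v in self.versions}
        v = by_id[version_id]
        while v:
            yield v
            v = by_id.get(v["parent"])

chain = VersionChain()
v1 = chain.commit(b"raw feedback batch 1", "initial human feedback")
v2 = chain.commit(b"raw feedback batch 1 + corrections", "after annotator review")
print([v["note"] for v in chain.lineage(v2)])  # newest back to root
```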
To keep Agents' access to external data efficient, metadata management also needs to become more intelligent. For example, semantic tags can enable fast retrieval of multi-modal data, replacing cumbersome traditional file-path lookups.
Note: Metadata is structured information that describes data attributes, such as the creation time, format, author, and storage location of data. It does not contain the actual content of the data but explains the background and characteristics of the data, similar to a "data instruction manual".
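As a rough illustration of tag-based metadata retrieval replacing path lookups, the sketch below builds a tiny inverted index over hypothetical semantic tags; the object IDs, tags, and attributes are made up for the example.

```python
# A minimal sketch of semantic-tag metadata retrieval.
from collections import defaultdict

class MetadataIndex:
    def __init__(self):
        self.by_tag = defaultdict(set)   # inverted index: tag -> object ids
        self.records = {}                # object id -> metadata record

    def put(self, obj_id: str, tags: set, **attrs):
        self.records[obj_id] = {"tags": tags, **attrs}
        for t in tags:
            self.by_tag[t].add(obj_id)

    def search(self, *tags: str) -> set:
        """Return object ids carrying all requested semantic tags."""
        sets = [self.by_tag[t] for t in tags]
        return set.intersection(*sets) if sets else set()

idx = MetadataIndex()
idx.put("s3://bucket/vid_001.mp4", {"video", "driving", "night"}, created="2025-01-03")
idx.put("s3://bucket/img_114.png", {"image", "driving", "day"}, created="2025-02-11")
print(idx.search("driving", "night"))   # -> {'s3://bucket/vid_001.mp4'}
```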
3. Using storage to replace computation
The efficiency optimization of AI inference mainly targets the contradiction between the allocation of compute and storage resources. Today's AI computing is still brute force: the core operation of the Attention mechanism is the Query-Key matrix product (QKᵀ), with computational complexity O(n²) (n is the sequence length). In multi-round conversations in particular, QKᵀ has to be recomputed over the same context again and again, wasting compute. For example, the DeepSeek 70B model generates 25 TB of KV Cache every 10 minutes, yet GPU memory is only dozens of GB; once the cache is discarded, it must be recomputed. Compute that should go to inference optimization is instead tied up in repeated matrix operations.
Using storage to replace computation means storing intermediate results (mainly the KV Cache) instead of recomputing them. The KV Cache has dimension n×d (d is the feature dimension), far smaller than the n×n attention matrix, so memory pressure drops from quadratic to linear in sequence length.
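A minimal single-head decode loop illustrates the point: with a cached K/V of shape n×d, each new token costs one pass over the cache instead of recomputing the full n×n attention matrix over the whole context. The projection matrices, dimensions, and inputs below are toy assumptions.

```python
# A minimal single-head attention decode loop with a KV cache (numpy only).
import numpy as np

d = 64                                   # feature dimension per head (toy value)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
K_cache = np.empty((0, d))               # grows to n x d, as described in the text
V_cache = np.empty((0, d))

def decode_step(x):
    """x: the new token's hidden state, shape (d,). Returns the attention output."""
    global K_cache, V_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # project only the new token
    K_cache = np.vstack([K_cache, k])             # append instead of recomputing history
    V_cache = np.vstack([V_cache, v])
    scores = K_cache @ q / np.sqrt(d)             # (n,) -- linear in n, not n^2
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                      # (d,)

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(K_cache.shape)                              # (5, 64): the n x d cache
```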
Through cooperation between the persistent storage layer (the capacity layer) and the compute layer, the isolated and very limited HBM space is extended into an effectively unlimited external high-speed storage space. Replacing repeated GPU computation with storage IO can significantly cut the compute consumed during inference and greatly improve inference efficiency; this is now a broad consensus in AI Infra. The new-generation storage software architecture will play a major role here, creating a pattern in which compute and storage advance hand in hand in AI Agent inference scenarios.
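One way to picture "extending HBM with external storage" is a tiered cache that keeps hot KV blocks in a small in-memory tier and spills cold ones to slower media. The sketch below assumes a hypothetical `TieredKVCache` class with an LRU policy and pickle files standing in for a real high-speed external tier; it is not how any production AI-infra stack is implemented.

```python
# A minimal sketch of tiering KV-cache blocks between a small "HBM-like"
# in-memory tier and a larger external tier on disk.
import os, pickle
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hot_capacity: int, spill_dir: str = "/tmp/kv_spill"):
        self.hot = OrderedDict()             # fast tier, strictly bounded
        self.hot_capacity = hot_capacity
        self.spill_dir = spill_dir
        os.makedirs(spill_dir, exist_ok=True)

    def _spill_path(self, key: str) -> str:
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key: str, kv_block):
        self.hot[key] = kv_block
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:      # evict least-recently-used to disk
            old_key, old_val = self.hot.popitem(last=False)
            with open(self._spill_path(old_key), "wb") as f:
                pickle.dump(old_val, f)

    def get(self, key: str):
        if key in self.hot:                           # fast-tier hit: no IO
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._spill_path(key)
        if os.path.exists(path):                      # storage IO replaces a full recompute
            with open(path, "rb") as f:
                return pickle.load(f)
        return None                                   # miss: must recompute

cache = TieredKVCache(hot_capacity=2)
for session in ["s1", "s2", "s3"]:
    cache.put(session, {"keys": [0.0] * 4, "values": [0.0] * 4})
print("s1" in cache.hot, cache.get("s1") is not None)  # False True: served from the spill tier
```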
4. Support for persistent Agent memory
Native large models are "amnesiac", but Agents must have memory to execute tasks coherently and provide personalized service. Memory data is highly fragmented: under current storage solutions it is scattered across graph, file, vector, object, and relational modules. Cross-modal retrieval therefore requires multiple queries and result stitching, which drives up latency; scattered updates can produce memory conflicts, requiring extra transaction logic and sharply increasing complexity; and each module needs its own hardware-plus-software deployment, making the system complex to deploy and operate while preventing storage space from being shared globally. Building a more general, more convenient, unified underlying data store for large numbers of Agents running in parallel and cooperating with one another is therefore imperative.
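A rough sketch of what "unified underlying storage for Agent memory" could look like at the data-model level: one record type and one query path covering vector, graph-edge, and document entries. The classes and field names are illustrative assumptions, not an existing product's schema.

```python
# A minimal sketch of a unified Agent memory record and a single query path.
from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    agent_id: str
    kind: str                      # "vector" | "edge" | "document" | ...
    payload: dict = field(default_factory=dict)
    tags: set = field(default_factory=set)

class UnifiedMemory:
    def __init__(self):
        self.records = []

    def write(self, record: MemoryRecord):
        self.records.append(record)            # single write path, one log

    def recall(self, agent_id: str, *tags: str):
        """One query across all modalities instead of per-module round trips."""
        return [r for r in self.records
                if r.agent_id == agent_id and set(tags) <= r.tags]

mem = UnifiedMemory()
mem.write(MemoryRecord("agent_a", "document", {"text": "user prefers metric units"},
                       {"preference", "units"}))
mem.write(MemoryRecord("agent_a", "vector", {"embedding": [0.1, 0.2]},
                       {"preference"}))
print(len(mem.recall("agent_a", "preference")))   # 2 -- both modalities in one pass
```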
5. Autonomy and security
Amid the geopolitical technology contest, self-reliance and controllability have become "important and urgent", and storage is a key line of defense. It must be compatible with the domestic AI stack ecosystem while also preventing leakage of KV Caches and vector stores, desensitizing training data, and isolating inference so that core data never leaves the country.
From another perspective, software is now developing far more slowly than hardware, and has become a key bottleneck constraining the release of AI performance.
Looking at the development of key hardware modules:
① Storage capacity: The growth rate far exceeds Moore's Law. For example, the capacity of NVMe SSDs increases by more than 50% annually, while Moore's Law only predicts a 20% annual increase.
② Computing power leap: From CPU to heterogeneous computing with GPU/TPU, AI computing power has increased explosively.
③ Network bandwidth: RDMA (Remote Direct Memory Access) has a latency as low as tens of microseconds, more than 10 times faster than the traditional TCP/IP protocol stack.
It is not hard to see that hardware has entered the "post-Moore era" and that an obvious "scissors gap" has opened up between it and traditional storage system software:
Storage-device access latency and network access latency have both dropped to tens of microseconds, but the overhead of the traditional system software stack is still measured in hundreds of microseconds. The hardware gains are consumed by the inefficiency of the traditional storage software architecture, producing contradictions such as data congestion (e.g., redundant paths when GPUs access storage directly) and idle compute (GPUs spend far longer waiting for data than computing, and the RDMA network's advantages cannot be realized through the software protocol stack).
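The scale of the gap is easy to see with a rough latency budget, using only the order-of-magnitude figures from the text (hardware access in the tens of microseconds, software stack in the hundreds); the specific numbers below are illustrative, not measurements.

```python
# Rough latency-budget arithmetic for the "scissors gap" described above.
hardware_us = 20        # NVMe / RDMA access: tens of microseconds
software_us = 300       # traditional storage software stack: hundreds of microseconds

total = hardware_us + software_us
print(f"software share of end-to-end latency: {software_us / total:.0%}")      # ~94%

# Even if the hardware path became instantaneous, a request would still take ~300 us:
print(f"best case with free hardware: {software_us} us ({total / software_us:.2f}x speedup)")
```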
Why is the traditional storage software architecture no longer sustainable?
The problems of the traditional storage architecture boil down to three things: the efficiency bottleneck caused by dependence on the OS kernel, the scalability defect of storing metadata and data together, and the data silos and migration overhead caused by the separation of storage protocols.
For example, suppose three Agents collaborate to process a large-scale dataset: Agent A handles data collection, Agent B data cleaning, and Agent C model training. Agent A ingests and writes the raw data in a high-speed stream; because the volume is large and nothing needs to be modified, it uses the object interface. Agent B reads the raw data, cleans and transforms it, and writes intermediate results; because it must organize versioned data in a directory structure, it uses the file interface. Agent C must read the cleaned data blocks efficiently and randomly to train the model, so it uses the block interface for the lowest latency and highest IOPS.
1. Efficiency bottleneck caused by OS kernel dependence
System-call and context-switch overhead: the three Agent processes concurrently issue huge numbers of read/write system calls, and each call forces the CPU to switch context between the Agent process and the OS. At a million requests per second, a large share of CPU time is spent switching rather than processing data.
Data-copy overhead: data is moved from the hardware device into a kernel buffer via DMA and then copied again into the user-space memory of the Agent process. This redundant copy burns CPU cycles and memory bandwidth.
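The per-call cost is visible even from user space. The toy measurement below reads the same 64 MiB once as many 4 KiB pread() calls and once as a few 8 MiB calls; absolute numbers depend on the machine, and the ratio only roughly reflects per-system-call plus copy overhead, not the behavior of a kernel-bypass design.

```python
# A minimal timing sketch of per-system-call overhead (Unix only).
import os, time

SIZE = 64 * 1024 * 1024
path = "/tmp/syscall_demo.bin"
with open(path, "wb") as f:
    f.write(os.urandom(SIZE))                  # 64 MiB test file

fd = os.open(path, os.O_RDONLY)

def read_all(chunk: int) -> float:
    start = time.perf_counter()
    offset = 0
    while offset < SIZE:
        os.pread(fd, chunk, offset)            # one system call per chunk
        offset += chunk
    return time.perf_counter() - start

small = read_all(4 * 1024)         # 16,384 calls
large = read_all(8 * 1024 * 1024)  # 8 calls
os.close(fd)
print(f"4 KiB calls: {small:.3f}s, 8 MiB calls: {large:.3f}s "
      f"(~{small / large:.0f}x slower from call + copy overhead)")
```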
Traditional storage system software relies entirely on the OS kernel to allocate and schedule hardware resources such as CPU and memory and to read and write external devices such as the network and disks. In today's era of high-speed hardware and high-speed RDMA networks, this basic paradigm has become a serious efficiency bottleneck.
Breaking free of the OS kernel's constraints, so that the storage software itself handles memory allocation and management, network access and interaction, reads and writes of external devices such as disks, and CPU/thread scheduling, is one of the core technologies of new-generation storage system software. It can raise IO processing efficiency by 10-30x and cut latency by 90%, comparable to the gain GPUs deliver over CPUs in specific computing scenarios such as matrix operations.
2. Scalability defect of mixed storage of metadata and data
Metadata hotspots: Agent B frequently accesses millions of small files in the file store, and every access must first look up the file's metadata (inode). The flood of metadata requests turns the disk region holding the metadata into a performance hotspot; the actual data reads and writes then get blocked and latency soars. Even if the underlying storage is a high-performance SAN, its speed cannot be fully exploited.
Limitations of the global namespace: as the number of files explodes, the directory structure holding the metadata becomes enormous. Traditional file systems manage metadata centrally, which easily becomes a read/write hotspot under high-capacity, high-concurrency access and is hard to scale out; operations such as listing directories or searching for files become extremely slow.
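One common remedy, sketched roughly below, is to partition metadata by hash across several shards so that small-file lookups no longer converge on a single table; the shard count and record fields are illustrative, and real distributed metadata services are far more involved.

```python
# A minimal sketch of hash-partitioned metadata across several shards.
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]    # each dict stands in for one metadata server

def shard_of(path: str) -> int:
    return int(hashlib.md5(path.encode()).hexdigest(), 16) % NUM_SHARDS

def create(path: str, size: int):
    shards[shard_of(path)][path] = {"size": size}

def stat(path: str):
    return shards[shard_of(path)].get(path)      # single-shard lookup, no global hotspot

for i in range(100_000):                          # millions of small files in practice
    create(f"/clean/part-{i:06d}.parquet", size=4096)

print([len(s) for s in shards])                   # lookups now spread roughly evenly
print(stat("/clean/part-000042.parquet"))
```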
3. Data silos and migration overhead caused by the separation of storage protocols (corresponding to the isolation of the "block interface", "file interface", and "object interface" in the figure)
Data silos and migration overhead: Agent A writes its data into object storage, but Agent B cannot read the object store directly and efficiently; the data must first be migrated into file storage. Likewise, for best performance Agent C has to export the data from file storage into a block-storage volume. The same data therefore ends up stored three times, consuming extra space and complex engineering effort for migration, while the migrations themselves generate heavy network overhead and latency that slow the whole pipeline. Worse, this pattern of shuttling data back and forth between isolated systems cannot guarantee consistency or real-time freshness, which keeps AI out of real-time business processes.
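The alternative the text argues for is a single copy of the data exposed through several access styles. The sketch below is only a conceptual illustration: a hypothetical `UnifiedStore` offers object-, file-, and block-style reads over one shared byte extent, and is not how any real multi-protocol system is implemented.

```python
# A minimal sketch of one dataset exposed through object, file, and block views
# over a single underlying copy, instead of exporting it three times.

class UnifiedStore:
    def __init__(self, block_size: int = 4096):
        self.data = bytearray()
        self.objects = {}              # object key -> (offset, length)
        self.files = {}                # path       -> (offset, length)
        self.block_size = block_size

    def _append(self, payload: bytes):
        offset = len(self.data)
        self.data.extend(payload)
        return offset, len(payload)

    # -- object view (Agent A writes raw data once) --
    def put_object(self, key: str, payload: bytes):
        self.objects[key] = self._append(payload)

    # -- file view (Agent B reads the same bytes, no copy-out) --
    def link_file(self, path: str, key: str):
        self.files[path] = self.objects[key]

    def read_file(self, path: str) -> bytes:
        off, length = self.files[path]
        return bytes(self.data[off:off + length])

    # -- block view (Agent C reads fixed-size blocks of the same extent) --
    def read_block(self, path: str, block_no: int) -> bytes:
        off, length = self.files[path]
        start = off + block_no * self.block_size
        return bytes(self.data[start:min(start + self.block_size, off + length)])

store = UnifiedStore()
store.put_object("raw/batch-001", b"x" * 10_000)              # written once by Agent A
store.link_file("/clean/batch-001.bin", "raw/batch-001")      # same bytes, file view
print(len(store.read_file("/clean/batch-001.bin")),
      len(store.read_block("/clean/batch-001.bin", 1)))       # 10000 4096 -- one physical copy
```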
Management complexity: each storage interface has its own independent "policy, management, and security" configuration, so administrators must set up backup, snapshot, and access-control policies for the same data in each of the three systems separately.