
This startup has created a game-changing model that rivals GPT-4.

Xiaoxi · 2024-10-11 19:29
Competition on the large-model track has entered deep water, and only true value creators will reach the final stage. The rising star Boundless Ark has launched a blockbuster model that rivals GPT-4 and is digging deep into the vertical field of AI companionship.

"Hello, I'm Archie."

In September this year, the "Future Living Room" exhibition hall at the Bund Conference was at times so crowded it was impassable. In front of an AI companion robot called "Archie", children lingered at the edge of the exhibition stand, calling Archie's name again and again. What made them so reluctant to leave was how smooth the interactive experience with Archie is.

Although it is an AI robot, Archie has "high emotional intelligence" and can "see" users. It responds quickly, and interacting with it is as simple and fluid as talking to a real person, a clear step up in experience over previous domestic AI applications. Behind this is ArkModel 2.0, the Boundless Ark large model with audio-video multimodal capabilities.

Before this year's National Day holiday, GPT-4o's long-anticipated advanced voice feature was officially launched. However, it is currently available only to Plus and Team users; free users cannot try it. The GPT-4o Realtime API was released during the National Day holiday, but its limitations are also obvious: no video conversation capability, high cost (about 7 RMB per minute), no custom voices, and more frequent voice hallucinations.

In terms of experience, the Boundless Ark large model can already match GPT-4o's extremely low-latency audio-video interaction. Beyond seeing users and replying quickly with emotion, we also found capabilities in the Boundless Ark large model that GPT-4o does not yet have: the model can drive 3D virtual avatars and even the movements of hardware robots, bringing more innovation to interaction.

What is the background of its R&D team? What other surprises does the Boundless Ark large model hold?

01 One year into the business, a big splash with the first move

36Kr has learned that the R&D team behind the Boundless Ark large model is a fledgling company founded just one year ago: Boundless Ark Intelligent Technology Co., Ltd. (hereinafter "Boundless Ark").

The founder and CEO, Dr. Zeng Xiaodong, is a senior expert in natural language processing (NLP) with more than 15 years of algorithm research and application experience in the field, and serves as a reviewer and area chair for multiple Class A machine learning, NLP, and AI conferences and journals. He was a core algorithm scientist on Alibaba's first-generation machine translation system and a co-founder of the Ant Technology Laboratory. As early as 2017, while working at Ant Group, Dr. Zeng was named to the MIT Technology Review's TR35 list, "35 Innovators Under 35". Notably, Yang Zhilin, founder and CEO of Moonshot AI, was also selected for this list this year.

Boundless Ark's founding team members all come from the first-echelon AI businesses of well-known companies at home and abroad. 80% of the technical team hold doctorates in NLP, with many years of experience across NLP, machine translation (MT), and IoT hardware. The product and design director is a veteran expert in internet experience strategy and has won top international awards including the Red Dot Award, the iF Design Award, and the Global Golden Trend Award.

Among the many AI startups, Boundless Ark has been in business for only a little over a year, yet it has already shown its strength in multiple areas and earned recognition from top competitions and rankings.

At this year's WAIC, out of more than 200 top global AI companies, Boundless Ark reached the finals of the Global Innovation Competition and ultimately placed fifth worldwide. It was subsequently selected for the top 200 of the "2024 Hurun Future Star Potential Enterprise List".

So what kind of product and technical strength has earned this level of market recognition and attention?

As seen at several public events, the application results of the Boundless Ark large model are already striking.

With the Boundless Ark large model updated to version 2.0, it has gained even more powerful, comprehensive capabilities: extremely low latency + audio-video multimodality + emotional expression + multilingual support + driving software and hardware.

As the desktop robot Archie demonstrates, it can see users in real time, explain medication instructions to elderly users who cannot read the drug label clearly, and chat freely with children in the critical period of oral language development.

In many respects, the Boundless Ark large model is making AI agents feel more like real people.

02 The Boundless Ark Large Model, Making AI Interaction More Like a Real Person

GPT-4o has set off a wave of end-to-end real-time multimodality, and major large-model vendors at home and abroad are following suit.

But at this stage, the major vendors have yet to deliver a genuinely comprehensive innovation in interaction. Industry-wide technical problems, such as extremely low-latency responses, the ability to interrupt a conversation at any time, video interaction that can "see" the user, and emotional expression, remain unsolved. This means that GPT-4o-style multimodal large models are still, for now, semi-finished products that cannot yet offer API or SDK integration services.

An industry insider told 36Kr: "The major vendors are more committed to picking the low-hanging fruit of general model capabilities, such as ASR speech recognition, LLM language models, and TTS speech synthesis. If startups want a place at the table, they must have independent R&D capability and achieve technological breakthroughs beyond the general models in vertical fields and vertical scenarios."

If the major vendors are building a barrel with no obvious short stave, then what Boundless Ark is doing is becoming the long stave, a "brick" that others need.

After trying products equipped with the Boundless Ark large model, we found that its greatest advantage is genuinely delivering audio-video multimodal interaction, ultra-low-latency feedback, and emotional, personalized expression all at once. This makes the hands-on experience excellent, with no learning cost and no barriers: as long as users can talk, they can hold a smooth conversation, as if facing a real person.

To show the Boundless Ark large model's advantages more clearly, we have compiled a chart:

The Boundless Ark large model (ArkModel 2.0) is a multimodal end-to-end model that can process text, audio, and image data simultaneously and convert between cross-modal tasks. Specifically, the model receives different forms of input: audio is encoded through the Audio Encoder, and images through the Image Encoder. This encoded information is processed uniformly inside ArkModel, and the model generates output by predicting the next token, so it can stream text or audio output in real time.
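The flow described above (modality-specific encoders feeding a shared token space, with autoregressive next-token prediction enabling streaming output) can be sketched as a toy program. Every name here (`Token`, `audio_encoder`, `image_encoder`, `generate`) is hypothetical, invented for illustration; this is not ArkModel's actual API or architecture.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Token:
    modality: str   # "text", "audio", or "image"
    value: int      # token id in a shared vocabulary

def audio_encoder(waveform: List[float]) -> List[Token]:
    """Stand-in for an Audio Encoder: chunk audio into discrete tokens."""
    return [Token("audio", int(abs(x) * 100) % 1024) for x in waveform]

def image_encoder(pixels: List[int]) -> List[Token]:
    """Stand-in for an Image Encoder: map image patches into the same token space."""
    return [Token("image", p % 1024) for p in pixels]

def generate(prompt_tokens: List[Token], max_new: int = 5) -> Iterator[Token]:
    """Toy autoregressive next-token prediction. Each token is yielded as
    soon as it is predicted, which is what makes streaming output possible."""
    state = sum(t.value for t in prompt_tokens)  # toy stand-in for model state
    for _ in range(max_new):
        nxt = Token("text", state % 1024)        # toy "prediction"
        yield nxt                                # stream immediately
        state = (state * 31 + nxt.value) % 100003

# Mixed audio + image input is encoded into one token sequence,
# then the reply streams out token by token.
tokens = audio_encoder([0.1, -0.5]) + image_encoder([7, 19])
reply = list(generate(tokens))
```

The point of the sketch is the shape of the pipeline, not the arithmetic: because decoding is token-by-token, the first piece of the reply can be emitted before the rest is computed, which is what ultra-low-latency interaction depends on.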

A notable feature of the model is its end-to-end optimization design, which emphasizes learning the full pipeline directly from input to output. Synthetic data is key to this optimization and is mainly used to generate large-scale training data, including augmentations of various kinds such as generating text and speech from images or speech, and generating text from speech. This approach effectively improves the model's generalization ability and task adaptability.
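The cross-modal augmentation idea can be illustrated with a toy pipeline: from each sample in one modality, derive paired training examples in the others. The functions below (`caption_image`, `synthesize_speech`, `transcribe`) are hypothetical stand-ins for a captioning model, a TTS system, and an ASR system, not anything from ArkModel.

```python
def caption_image(image_id: str) -> str:
    """Stand-in for an image-captioning model."""
    return f"caption for {image_id}"

def synthesize_speech(text: str) -> bytes:
    """Stand-in for a TTS system producing audio."""
    return text.encode("utf-8")  # placeholder for real audio bytes

def transcribe(audio: bytes) -> str:
    """Stand-in for an ASR system."""
    return audio.decode("utf-8")

def make_synthetic_pairs(image_ids):
    """From each image, derive an (image, text) pair and an (audio, text)
    pair, multiplying the training data across modalities."""
    pairs = []
    for img in image_ids:
        text = caption_image(img)
        audio = synthesize_speech(text)
        pairs.append(("image-text", img, text))
        pairs.append(("audio-text", transcribe(audio), text))
    return pairs

dataset = make_synthetic_pairs(["img_001", "img_002"])
```

Each seed image yields two aligned training pairs, which is how a modest corpus in one modality can be expanded into large-scale multimodal training data.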

In multiple multimodal evaluations, the Boundless Ark large model has surpassed well-known industry models such as GPT-4o:

As the chart shows, the Boundless Ark large model has five significant advantages: (the following videos are all real recordings without any post-editing)

• Advantage 1:

Ultra-low-latency feedback of about 300 milliseconds, achieved not only in pure voice but also during audio-video interaction; in a head-to-head comparison of the domestic market, it has almost no rival;

• Advantage 2:

Supports audio-video multimodal interaction: it can "see" users, can be interrupted at any time, and has reasoning ability;

• Advantage 3:

Has a rich emotional system; interaction feels natural, shedding the "AI feel", which makes it well suited to companionship scenarios;
