OpenAI goes open source again, releasing two reasoning models late at night: they perform at the level of o4-mini and can run on laptops and phones.
Finally, OpenAI's new release has arrived.
Although it's not the long-awaited GPT-5, it is still "something big-but-small today":
a new set of open-source language models.
Notably, this is the first time since GPT-2 that OpenAI has open-sourced a model.
According to Jiayi Weng, a Tsinghua University alumnus and research scientist at OpenAI, discussions about open-sourcing a model began inside OpenAI in 2022 and came close to happening several times, but only became reality today.
This time, two reasoning models have been open-sourced at once.
GitHub: https://github.com/openai/gpt-oss
Hugging Face (gpt-oss-20b): https://huggingface.co/openai/gpt-oss-20b
Hugging Face (gpt-oss-120b): https://huggingface.co/openai/gpt-oss-120b
Blog: https://openai.com/index/introducing-gpt-oss/
Sam Altman says gpt-oss performs on the level of o4-mini and runs on a high-end laptop (WTF!!), and there is also a smaller one that runs on phones.
Benchmark comparisons between the two open-source models and o3 / o4-mini appear below.
To sum up, the highlights of these two open-source models include:
Permissive Apache 2.0 license: build freely without copyleft restrictions or patent risk, ideal for experimentation, customization, and commercial deployment.
Adjustable reasoning effort: easily set the reasoning effort (low, medium, high) to match the use case and latency requirements.
Full chain of thought (CoT): complete access to the model's reasoning process, which makes debugging easier and increases trust in outputs; it is not intended to be shown to end users.
Fine-tunable: the models can be fully customized to specific use cases through parameter fine-tuning.
Agentic capabilities: built-in support for function calling, web browsing, Python code execution, and structured outputs.
Native MXFP4 quantization: the models are trained with native MXFP4 precision for the MoE layers, allowing gpt-oss-120b to run on a single H100 GPU and gpt-oss-20b to run within 16 GB of memory (see the loading sketch below).
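To make the memory claims concrete, here is a minimal sketch of loading gpt-oss-20b through the standard Hugging Face transformers interface. The model ID comes from the links above; the dtype and device settings are assumptions and may need adjusting for your hardware and installed versions.

```python
# A minimal sketch, assuming the standard Hugging Face transformers API.
# The model ID is from the release links above; the dtype/device settings
# below are assumptions and may need adjusting to your hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the weights in the precision they ship with
    device_map="auto",    # spread layers across available GPU/CPU memory
)

messages = [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```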
OpenAI has also set up a web playground where developers can quickly try both open-source models; interested readers can give it a spin.
Try it: https://www.gpt-oss.com/
Within hours, the overseas AI community was abuzz and everyone rushed to download and try the new models, so much so that Hugging Face's CTO publicly asked people not to all download at once, lest the servers go down!
Next, let's look at the technical details of these two newly open-sourced models.
A new high-water mark for open-source models
As state-of-the-art open-source language models, gpt-oss-120b and gpt-oss-20b deliver strong real-world performance at low cost.
The two models outperform similarly sized open-source models on reasoning tasks, show strong tool-use capabilities, and are optimized for efficient deployment on consumer-grade hardware. Their training combined reinforcement learning with techniques informed by OpenAI's most advanced internal models, including o3 and other frontier systems.
gpt-oss-120b roughly matches o4-mini on core reasoning benchmarks and runs efficiently on a single 80 GB GPU. gpt-oss-20b performs similarly to o3-mini on common benchmarks while requiring only 16 GB of memory, making it suitable for edge devices and ideal for local inference, on-device use, or rapid iteration without expensive infrastructure.
Both models do very well on tool use, few-shot function calling, CoT reasoning, and HealthBench, even surpassing proprietary models such as o1 and GPT-4o.
The two models also fit well into existing systems. They suit agentic workflows that demand strong instruction following, tool use (such as web search or Python code execution), and reasoning, and their reasoning effort can be dialed down for tasks that do not require complex reasoning or that target very low-latency output. Both models are fully customizable, expose a complete CoT, and support structured outputs.
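As an illustration of the tool-use path, the hypothetical sketch below exposes a made-up get_weather function to the model through the transformers chat template. The function and its schema are invented for this example, and the actual tool-call format is defined by the model's own chat template, so treat this as the shape of the workflow rather than OpenAI's reference implementation.

```python
# Hypothetical sketch: exposing a tool to the model via the chat template.
# get_weather is invented for illustration; how gpt-oss formats and emits
# tool calls is governed by its own chat template.
from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """Return the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 24 degrees C"  # placeholder implementation

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# transformers can serialize Python functions into the tool schema the
# template expects; the model then emits a structured tool call that an
# agent loop parses, executes, and feeds back as a tool message.
prompt = tokenizer.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True, tokenize=False
)
print(prompt)
```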
Of course, safety is foundational to every model OpenAI releases, and it matters all the more for open models. So beyond comprehensive safety training and evaluations, OpenAI also tested an adversarially fine-tuned version of gpt-oss-120b under its Preparedness Framework and added an extra layer of evaluation. The results show the gpt-oss models perform comparably to OpenAI's frontier models on internal safety benchmarks and meet the same safety bar as its recent proprietary models.
OpenAI worked with early partners such as AI Sweden, Orange, and Snowflake to understand how the two open-source models behave in real-world applications, including hosting them on-premises for data security and fine-tuning them on specialized datasets.
Pretraining and model architecture
The gpt-oss models use OpenAI's most advanced pretraining and post-training techniques, with particular attention to reasoning, efficiency, and real-world usability across a range of deployment environments.
Both models use a Transformer architecture and rely on a mixture-of-experts (MoE) design to reduce the number of parameters activated when processing input. gpt-oss-120b activates 5.1B parameters per token and gpt-oss-20b activates 3.6B, out of 117B and 21B total parameters respectively.
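The gap between total and active parameters comes from routing: only a few experts run for each token. The toy PyTorch module below shows top-k expert routing in its simplest form; the expert count, sizes, and routing details are illustrative, not the actual gpt-oss configuration.

```python
# Illustrative top-k mixture-of-experts routing (not the actual gpt-oss code).
# Only the k experts the router selects run for each token, which is why the
# active parameter count is far smaller than the total parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # run only the selected experts
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = ToyMoE()
print(moe(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```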
In addition, the two models alternate between dense and locally banded sparse attention layers, similar to GPT-3. For inference and memory efficiency they use grouped multi-query attention with a group size of 8, use rotary position embedding (RoPE) for positional encoding, and natively support context lengths of up to 128k.
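For readers unfamiliar with RoPE, here is a minimal sketch of the rotation it applies to query and key vectors; the pairing convention and any long-context scaling used in the actual gpt-oss implementation may differ.

```python
# Minimal rotary position embedding (RoPE) sketch, for illustration only;
# the real gpt-oss implementation and its 128k-context handling may differ.
import torch

def rope(x, base=10000.0):
    # x: (seq_len, n_heads, head_dim) with an even head_dim
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 8, 64)   # (positions, heads, head_dim)
print(rope(q).shape)         # torch.Size([16, 8, 64])
```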
For training data, OpenAI trained the two models on a mostly English text dataset focused on STEM, programming, and general knowledge. The data was tokenized with o200k_harmony, a superset of the tokenizer used by o4-mini and GPT-4o, which has also been open-sourced.
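Since the checkpoints ship with this tokenizer, the simplest way to inspect it is through the Hugging Face checkpoint; whether it is also exposed under the name o200k_harmony in your tokenizer library depends on the installed version.

```python
# Peek at the open-sourced tokenizer via the Hugging Face checkpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
ids = tok.encode("Mixture-of-experts models activate only a few experts per token.")
print(len(ids))                            # number of tokens for this sentence
print(tok.convert_ids_to_tokens(ids[:8]))  # first few token strings
```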
Post-training
OpenAI says the open models went through a post-training process similar to o4-mini's, including supervised fine-tuning and a high-compute reinforcement learning stage. The models were also trained to reason through a chain of thought and to call tools before producing an answer. Because they use the same techniques as OpenAI's proprietary reasoning models, they come out of post-training with strong capabilities.
Like the OpenAI o-series reasoning models in the API, the two open models support three reasoning effort levels: low, medium, and high. Developers can set the level with a single line in the system message, trading latency against performance.
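As a concrete example, that line is conventionally written as "Reasoning: high" (or low/medium). The sketch below assumes that phrasing and the transformers chat template; check the model card if your template expects something different.

```python
# Sketch: selecting the reasoning effort via one line in the system message.
# The exact phrasing "Reasoning: high" is an assumption; verify it against
# the model card / chat template of the checkpoint you are using.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)  # inspect how the effort instruction lands in the final prompt
```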
Performance evaluation
OpenAI compared gpt-oss-120b/20b against its own reasoning models, including o3, o3-mini, and o4-mini, on standard academic benchmarks covering programming, competition math, health, and agentic tool use:
Across these tests, gpt-oss-120b surpasses o3-mini on competition programming (Codeforces), general problem solving (MMLU and HLE), and tool calling (TauBench), reaching or even exceeding o4-mini's level.
It does even better than o4-mini on health-related queries (HealthBench) and competition math (AIME 2024 & 2025). Despite its small size, gpt-oss-20b matches or exceeds o3-mini on these same evaluations, and stands out in competition math and health.
Codeforces competition programming benchmark
Humanity's Last Exam: interdisciplinary expert-level questions
HealthBench benchmark
AIME 2024 and AIME 2025 benchmarks (with tools)
GPQA Diamond (without tools) and MMLU benchmarks
AIME mathematics competition
GPQA Diamond (with tools): PhD-level science questions
The complete evaluation results are shown in the following table: