본문 바로가기
자료수집/IT 기술분석

DeepSeek's AI breakthrough bypasses industry-standard CUDA

by 3604 2025. 1. 31.
728x90

출처: https://www.tomshardware.com/tech-industry/artificial-intelligence/deepseeks-ai-breakthrough-bypasses-industry-standard-cuda-uses-assembly-like-ptx-programming-instead

DeepSeek's AI breakthrough bypasses industry-standard CUDA for some functions, uses Nvidia's assembly-like PTX programming instead

News
 published 28 January 2025

Dramatic optimizations do not come easy.

  •  
  •  
  •  
  •  
  •  
  •  
 Comments (41)

When you purchase through links on our site, we may earn an affiliate commission. Here’s how it works.

(Image credit: Nvidia)

DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions, according to an analysis from Mirae Asset Securities Korea cited by @Jukanlosreve.

What You Need to Know

Today: OpenAI boss Sam Altman calls DeepSeek 'impressive.' In 2023 he called competing nearly impossible.

Jan. 28, 2025: Investors panic: Nvidia stock loses $589B in value.

Dec. 27, 2024: DeepSeek is unveiled to the world.

Nvidia's PTX (Parallel Thread Execution) is an intermediate instruction set architecture designed by Nvidia for its GPUs. PTX sits between higher-level GPU programming languages (like CUDA C/C++ or other language frontends) and the low-level machine code (streaming assembly, or SASS). PTX is a close-to-metal ISA that exposes the GPU as a data-parallel computing device and, therefore, allows fine-grained optimizations, such as register allocation and thread/warp-level adjustments, something that CUDA C/C++ and other languages cannot enable. Once PTX is into SASS, it is optimized for a specific generation of Nvidia GPUs. 

For example, when training its V3 model, DeepSeek reconfigured Nvidia's H800 GPUs: out of 132 streaming multiprocessors, it allocated 20 for server-to-server communication, possibly for compressing and decompressing data to overcome connectivity limitations of the processor and speed up transactions. To maximize performance, DeepSeek also implemented advanced pipeline algorithms, possibly by making extra fine thread/warp-level adjustments. 

 
 

These modifications go far beyond standard CUDA-level development, but they are notoriously difficult to maintain. Therefore, this level of optimization reflects the exceptional skill of DeepSeek's engineers. The global GPU shortage, amplified by U.S. restrictions, has compelled companies like DeepSeek to adopt innovative solutions, and DeepSeek has made a breakthrough. However, it is unclear how much money DeepSeek had to invest in development to achieve its results. 

 

The breakthrough disrupted the market as some investors believed that the need for high-performance hardware for new AI models would get lower, hurting the sales of companies like Nvidia. Industry veterans, such as Intel Pat Gelsinger, ex-chief executive of Intel, believe that applications like AI can take advantage of all computing power they can access. As for DeepSeek's breakthrough, Gelsinger sees it as a way to add AI to a broad set of inexpensive devices in the mass market. 

Stay On the Cutting Edge: Get the Tom's Hardware Newsletter

Get Tom's Hardware's best news and in-depth reviews, straight to your inbox.

Contact me with news and offers from other Future brandsReceive email from us on behalf of our trusted partners or sponsorsBy submitting your information you agree to the Terms & Conditions and Privacy Policy and are aged 16 or over.
 
 
 
TOPICS
CUDA
 
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

728x90
반응형