DeepSeek's AI breakthrough bypasses industry-standard CUDA for some functions, uses Nvidia's assembly-like PTX programming instead

By Anton Shilov

published 28 January 2025

Dramatic optimizations do not come easy.

DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters on a cluster of 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved through numerous fine-grained optimizations and by using Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions, according to an analysis from Mirae Asset Securities Korea cited by @Jukanlosreve.

PTX (Parallel Thread Execution) is an intermediate instruction set architecture designed by Nvidia for its GPUs. PTX sits between higher-level GPU programming languages (like CUDA C/C++ or other language frontends) and the low-level machine code (streaming assembly, or SASS). PTX is a close-to-metal ISA that exposes the GPU as a data-parallel computing device and, therefore, allows fine-grained optimizations, such as register allocation and thread/warp-level adjustments, something that CUDA C/C++ and other languages cannot enable. Once PTX is translated into SASS, it is optimized for a specific generation of Nvidia GPUs.
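To make the relationship between CUDA and PTX more concrete, here is a minimal sketch of a CUDA C++ program that embeds two PTX instructions through inline assembly. The names `lane_id`, `ptx_add`, `demo_kernel` and `run_demo` are illustrative assumptions and have nothing to do with DeepSeek's code; the example only shows how hand-written PTX can be mixed into otherwise ordinary CUDA code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read this thread's lane index within its warp from the PTX special
// register %laneid (written %%laneid inside an inline-asm string).
__device__ unsigned int lane_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}

// Add two 32-bit integers with an explicit PTX instruction instead of '+'.
__device__ int ptx_add(int a, int b) {
    int sum;
    asm("add.s32 %0, %1, %2;" : "=r"(sum) : "r"(a), "r"(b));
    return sum;
}

// Each thread stores (global thread index + lane index).
__global__ void demo_kernel(int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = ptx_add(tid, static_cast<int>(lane_id()));
}

// Hypothetical host helper: run one block of 64 threads and return the
// value written by the given thread, or -1 on a CUDA error or bad index.
int run_demo(int thread) {
    const int n = 64;
    if (thread < 0 || thread >= n) return -1;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;
    demo_kernel<<<1, n>>>(d_out);
    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out[thread] : -1;
}

int main() {
    std::printf("thread 33 wrote %d\n", run_demo(33));
    return 0;
}
```

Compiling a file like this with `nvcc -ptx` emits the PTX that the compiler generates around the inline instructions, and running `cuobjdump -sass` on the resulting binary shows the GPU-specific SASS that the PTX is finally translated into.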

For example, when training its V3 model, DeepSeek reconfigured Nvidia's H800 GPUs: out of 132 streaming multiprocessors, it allocated 20 for server-to-server communication, possibly for compressing and decompressing data to overcome connectivity limitations of the processor and speed up transactions. To maximize performance, DeepSeek also implemented advanced pipeline algorithms, possibly by making even finer thread/warp-level adjustments.
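The article does not say how DeepSeek implemented this split. As a rough illustration only, one common pattern for dedicating part of a GPU to a different job is to launch a single kernel and let each thread block choose a role from its block index. The sketch below assumes the 132/20 figures quoted above; `role_split_kernel` and `count_compute_blocks` are hypothetical names, and the kernel merely records each block's role instead of doing real communication or compute work.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Figures taken from the article; the mapping to thread blocks is an assumption.
constexpr int kTotalBlocks = 132;  // one block per streaming multiprocessor
constexpr int kCommBlocks  = 20;   // blocks reserved for communication

enum Role : int { kCommunication = 0, kCompute = 1 };

// Each block picks a role from its index. A real kernel would branch into a
// communication loop or a compute loop here instead of recording the role.
__global__ void role_split_kernel(int *roles) {
    if (threadIdx.x == 0) {
        roles[blockIdx.x] = (blockIdx.x < kCommBlocks) ? kCommunication : kCompute;
    }
}

// Launch the kernel and count the blocks that took the compute role.
// Returns -1 on a CUDA error.
int count_compute_blocks() {
    int *d_roles = nullptr;
    if (cudaMalloc(&d_roles, kTotalBlocks * sizeof(int)) != cudaSuccess) return -1;
    role_split_kernel<<<kTotalBlocks, 32>>>(d_roles);
    std::vector<int> h_roles(kTotalBlocks);
    cudaError_t err = cudaMemcpy(h_roles.data(), d_roles,
                                 kTotalBlocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_roles);
    if (err != cudaSuccess) return -1;
    int compute = 0;
    for (int r : h_roles) compute += (r == kCompute);
    return compute;
}

int main() {
    std::printf("compute blocks: %d of %d\n", count_compute_blocks(), kTotalBlocks);
    return 0;
}
```

Note that a block index is not an SM index: the hardware scheduler decides where each block runs, so this sketch only approximates a "20 SMs for communication" arrangement when the launch is sized so that exactly one block is resident per SM.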

These modifications go far beyond standard CUDA-level development and are notoriously difficult to maintain, so this level of optimization reflects the exceptional skill of DeepSeek's engineers. The global GPU shortage, amplified by U.S. restrictions, has compelled companies like DeepSeek to adopt innovative solutions, and DeepSeek has made a breakthrough. However, it is unclear how much money DeepSeek had to invest in development to achieve its results.

The breakthrough disrupted the market as some investors believed that demand for high-performance hardware for new AI models would decline, hurting the sales of companies like Nvidia. Industry veterans, such as Pat Gelsinger, former chief executive of Intel, believe that applications like AI can take advantage of all the computing power they can access. As for DeepSeek's breakthrough, Gelsinger sees it as a way to add AI to a broad set of inexpensive devices in the mass market.

Anton Shilov

Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
