What is the difference between pre-training and post-training in large language models?

Pre-training builds a model's foundational knowledge by processing massive text corpora, while post-training refines behavior through instruction following, safety alignment, and task-specific data. In full-parameter post-training, the approach claimed here, every model weight is updated rather than a small set of adapter layers.

How does the Ascend 910C compare to Nvidia H100 processors?

Earlier testing by DeepSeek indicated that the Ascend 910C delivered approximately sixty percent of the inference performance of Nvidia H100 processors. The chip is designed to handle complex computational workloads while operating within domestic supply chain constraints.

Why is the CANN software stack important for domestic AI hardware?

CANN serves as Huawei's alternative to the widely adopted CUDA ecosystem. It provides the necessary libraries and optimization tools for training frameworks to communicate efficiently with domestic accelerators, though developers often face compatibility challenges.

What are the main technical obstacles to domestic AI training?

Key challenges include slow chip-to-chip interconnects, unstable performance during gradient synchronization, and gaps in software driver optimization. Overcoming these bottlenecks requires sustained investment in both hardware architecture and system-level integration.

Huawei Team Claims Full Parameter Post Training On Ascend 910C Chips

Christopher Holloway

Jun 06, 2026 - 13:00

Updated: 1 month ago

Huawei Ascend 910C chips reportedly used to post-train the DeepSeek V4-Pro model

A Huawei-affiliated research team asserts that it completed full-parameter post-training for DeepSeek V4-Pro using a cluster of at least one thousand Ascend 910C processors. The announcement underscores progress in domestic AI infrastructure but lacks independent verification or performance benchmarks. Industry observers note that while the claim highlights technical advancement, it does not yet demonstrate the ability to pre-train frontier models from scratch. The broader implications for supply chain independence remain significant.

A recent announcement from a research consortium backed by Huawei Technologies has drawn attention to a specific milestone in artificial intelligence infrastructure. The group reports that it successfully executed full-parameter post-training for DeepSeek V4-Pro, a massive model containing 1.6 trillion parameters. This effort reportedly relied on a distributed cluster comprising at least one thousand Ascend 910C processors. The disclosure highlights ongoing efforts to transition critical computational workloads onto domestically produced silicon.

What is the significance of this post-training milestone?

The reported achievement centers on a specific phase of model development that sits between initial learning and final deployment. Post-training functions as a critical refinement stage where foundational capabilities are adjusted to align with human expectations. Unlike the initial pre-training phase, which consumes vast textual corpora to establish baseline knowledge, post-training focuses on instruction following and safety alignment. The consortium claims that every weight within the 1.6 trillion parameter architecture was updated during this process. This distinction matters because updating all weights requires substantially more computational coordination than applying lightweight adapter layers.
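The gap between full-parameter updates and lightweight adapters can be made concrete with a little arithmetic. The sketch below compares the number of trainable weights in a single linear layer under the two approaches. The layer shape and adapter rank are hypothetical values chosen for illustration, and the low-rank adapter formula is a common design, not something the consortium has described.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def adapter_update_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a low-rank adapter: two thin matrices
    of shape (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape and rank, not taken from the article.
d_in, d_out, rank = 8192, 8192, 16

full = full_update_params(d_in, d_out)
adapter = adapter_update_params(d_in, d_out, rank)
ratio = full // adapter

print(f"full: {full:,}  adapter: {adapter:,}  ratio: {ratio}x")
```

Under these assumed sizes the full update touches 256 times as many weights as the adapter, and every one of those weights needs gradients and optimizer state that must be stored and synchronized across the cluster.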

Moving from theoretical capability to practical implementation on domestic hardware is a notable engineering hurdle. Researchers must keep gradient calculations stable across a thousand or more interconnected processors. A synchronization delay or memory bottleneck can stall the run or force it to restart. The announcement suggests that the Ascend 910C cluster maintained coherence throughout this demanding procedure. If verified, the milestone would indicate that domestic silicon can handle the memory capacity and bandwidth requirements associated with full-parameter updates.
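A rough estimate shows why memory is a central concern for full-parameter updates. The sketch below uses the parameter count and minimum cluster size reported in the announcement. The bytes-per-parameter breakdown is an assumption: it follows the mixed-precision Adam accounting often used in the literature, and the consortium has not disclosed its actual optimizer or precision. Activations and communication buffers are ignored.

```python
PARAMS = 1_600_000_000_000   # 1.6 trillion parameters, as reported
CHIPS = 1000                 # reported lower bound on cluster size

# Assumed mixed-precision Adam accounting (not from the article).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}


def training_state_bytes(params: int, bytes_per_param: dict) -> int:
    """Total bytes of weights, gradients, and optimizer state."""
    return params * sum(bytes_per_param.values())


total_bytes = training_state_bytes(PARAMS, BYTES_PER_PARAM)
total_tb = total_bytes / 1e12
# Assumes the state is sharded perfectly evenly across all chips.
per_chip_gb = total_bytes / CHIPS / 1e9

print(f"total training state: {total_tb:.1f} TB")
print(f"per chip if evenly sharded: {per_chip_gb:.1f} GB")
```

Under these assumptions the training state alone comes to about 25.6 TB, or roughly 25.6 GB per chip before any activations are counted. A real run would differ depending on the sharding scheme and precision used.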

Understanding the Ascend 910C Architecture

The Ascend 910C serves as Huawei's current flagship artificial intelligence accelerator. The processor utilizes a dual-die design intended to maximize computational density while managing thermal constraints within standard server racks. Previous testing conducted by DeepSeek indicated that the chip delivered approximately sixty percent of the inference performance found in Nvidia H100 processors. Inference workloads involve running a finished model to generate responses, which relies heavily on memory bandwidth and parallel processing efficiency. Chinese manufacturers have historically found inference easier to optimize than training workloads.

Training requires continuous weight updates, frequent gradient synchronization, and massive temporary storage for intermediate calculations. The transition from inference dominance to training capability demands a fundamentally different architectural approach. The hardware must support high-speed data movement between nodes without introducing latency that stalls computation. The reported post-training run, if accurate, suggests that the physical design has reached a level of maturity capable of sustaining these demanding iterative processes.
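Gradient synchronization is commonly implemented with an all-reduce collective. The sketch below simulates a ring all-reduce in plain Python to show the data movement involved. It is an illustration of the general technique only; the article does not say which collective algorithm or library the Ascend cluster used, and the function name is hypothetical.

```python
def ring_all_reduce(grads):
    """Sum equal-length gradient lists across workers over a simulated ring.

    grads: one list per worker; the length must be divisible by the
    number of workers. Returns the per-worker buffers after the reduce.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "buffer must split evenly into n chunks"
    chunk = size // n
    data = [list(g) for g in grads]  # work on copies

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so the exchange is simultaneous.
        sends = [(i, (i - step) % n) for i in range(n)]
        payloads = [data[i][span(c)] for i, c in sends]
        for (src, c), payload in zip(sends, payloads):
            dst = (src + 1) % n
            current = data[dst][span(c)]
            data[dst][span(c)] = [a + b for a, b in zip(current, payload)]

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        payloads = [data[i][span(c)] for i, c in sends]
        for (src, c), payload in zip(sends, payloads):
            dst = (src + 1) % n
            data[dst][span(c)] = payload

    return data


worker_grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_all_reduce(worker_grads)
print(reduced[0])
```

Every worker ends with the same summed gradient, and each one only ever talks to its ring neighbor. That is why a single slow link can hold up the whole step.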

How does post-training differ from pre-training in large models?

Pre-training establishes the core knowledge base of a language model by processing trillions of tokens from diverse sources. DeepSeek documentation states that the V4-Pro architecture processed more than thirty-two trillion tokens during this initial stage. The computational intensity of pre-training makes it the most expensive phase of model development. Post-training then shapes the model's behavior through targeted datasets that teach it to follow instructions and adhere to safety guidelines. Completing this phase on domestic silicon, if the claim holds, would demonstrate that the hardware can sustain the iterative weight updates required for alignment.
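The difference in scale between the two phases can be approximated with the widely used rule of thumb that training a dense transformer costs about six floating-point operations per parameter per token. The sketch below applies it to the figures quoted above. The post-training token count is a hypothetical placeholder, since the announcement gives no figure. A sparse architecture would use its active parameter count instead of the total, which the article does not state, so the totals here are best read as an upper bound.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


PARAMS = 1.6e12            # total parameters, as reported in the article
PRETRAIN_TOKENS = 32e12    # "more than thirty-two trillion tokens"
POSTTRAIN_TOKENS = 1e11    # hypothetical placeholder, not from the article

pretrain = training_flops(PARAMS, PRETRAIN_TOKENS)
posttrain = training_flops(PARAMS, POSTTRAIN_TOKENS)
ratio = pretrain / posttrain

print(f"pre-training:  {pretrain:.3e} FLOPs")
print(f"post-training: {posttrain:.3e} FLOPs")
print(f"ratio: {ratio:.0f}x")
```

Because the parameter count cancels out, the ratio depends only on the token counts. Whatever the true post-training budget was, a run that sees far fewer tokens than pre-training exercises the cluster for a correspondingly shorter time.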

The result does not prove that the chips can handle the heavier burden of pre-training a frontier model from scratch. The distinction between these two phases remains crucial for evaluating the true readiness of domestic AI infrastructure. Researchers must carefully separate marketing claims from technical reality when assessing hardware capabilities. The reported post-training run provides a valuable data point for future architectural planning.

The Technical Hurdles of Domestic Silicon

The path to domestic training capability has been fraught with technical obstacles. Earlier reports indicated that DeepSeek struggled to complete a single successful training run for its R2 model using Ascend processors. Engineers attributed those failures to unstable performance, slow chip-to-chip interconnects, and gaps in the CANN software stack. CANN serves as Huawei's substitute for the widely adopted CUDA ecosystem. Training workloads demand flawless communication between accelerators to synchronize gradients and update weights simultaneously.

When interconnect speeds lag or software drivers fail to optimize memory allocation, the entire cluster can stall. DeepSeek reportedly fell back on Nvidia GPUs for training while reserving the Ascend chips for inference tasks. Overcoming these interconnect and software limitations requires sustained investment in both hardware design and system-level integration. The recent milestone suggests that these foundational issues are being addressed, though the claim remains unverified.
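A simple cost model illustrates how interconnect bandwidth feeds into step time. The sketch below uses the standard bandwidth-only estimate for a ring all-reduce, in which each worker transfers 2(N-1)/N times the buffer size. It ignores latency and network topology, and every number in it is hypothetical: none of them are measured Ascend or Nvidia figures.

```python
def ring_all_reduce_seconds(grad_bytes: float, workers: int,
                            link_bytes_per_s: float) -> float:
    """Bandwidth-only ring all-reduce time (latency ignored)."""
    traffic = 2 * (workers - 1) / workers * grad_bytes
    return traffic / link_bytes_per_s


def step_seconds(compute_s: float, comm_s: float,
                 overlap: float = 0.0) -> float:
    """Step time when a fraction `overlap` of communication is hidden
    behind computation; the rest is exposed and adds to the step."""
    exposed = comm_s * (1.0 - overlap)
    return compute_s + exposed


# Hypothetical values chosen only to show the trend.
GRAD_BYTES = 4e9      # 4 GB of gradients per worker
WORKERS = 8
COMPUTE_S = 1.0       # pure compute time per step

slow_comm = ring_all_reduce_seconds(GRAD_BYTES, WORKERS, 50e9)
fast_comm = ring_all_reduce_seconds(GRAD_BYTES, WORKERS, 200e9)

slow_step = step_seconds(COMPUTE_S, slow_comm)
fast_step = step_seconds(COMPUTE_S, fast_comm)

print(f"slow link: {slow_step:.3f} s per step")
print(f"fast link: {fast_step:.3f} s per step")
```

In this toy setting the slower link adds about a tenth of a second to every step. Repeated over many thousands of steps, exposed communication of that kind becomes a significant share of total training time, which is why software that overlaps communication with compute matters as much as raw link speed.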

Why does the software ecosystem matter for training workloads?

Hardware performance alone cannot guarantee successful model training. The surrounding software ecosystem dictates how efficiently processors utilize their theoretical capabilities. CUDA has established itself as the industry standard because it provides mature libraries, optimized kernels, and extensive developer support. Huawei's CANN framework attempts to replicate this functionality for domestic silicon. The gap between the two ecosystems often manifests as reduced throughput and increased debugging complexity. Researchers must frequently rewrite code or adapt frameworks to bypass missing features.

The successful post-training run suggests that the software stack has reached a level of stability sufficient for weight updates. However, scaling this stability to pre-training workloads requires continuous refinement. The ecosystem must handle dynamic memory allocation, fault tolerance, and distributed optimization without introducing bottlenecks. Software maturity will ultimately determine whether domestic hardware can compete with established foreign alternatives.
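Fault tolerance in long training runs usually rests on periodic checkpoints. The sketch below shows the basic save-and-resume pattern with a toy training loop and a simulated failure. It is a minimal illustration under stated assumptions: the state is a small dictionary rather than real model weights, the function names are hypothetical, and nothing here reflects how CANN or the consortium's framework actually handles recovery.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash mid-write
    never leaves a half-written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, every, fail_at=None):
    """Toy loop: one 'update' per step, checkpoint every `every` steps."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["weight"] += 1.0   # stand-in for an optimizer update
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(ckpt, total_steps=10, every=2, fail_at=5)
    except RuntimeError:
        pass
    resumed_from = load_checkpoint(ckpt)   # last state written to disk
    final = train(ckpt, total_steps=10, every=2)

print(resumed_from, final)
```

The run fails after five updates, resumes from the checkpoint written at step four, and repeats only one step of work. On a large cluster the trade-off is between checkpoint frequency, which costs time and storage bandwidth, and the amount of work lost to each failure.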

Bridging the Gap Between Hardware and Frameworks

The collaboration between Huawei, the Shenzhen Loop Area Institute, and academic partners highlights a coordinated approach to overcoming these barriers. Academic institutions bring theoretical research capabilities, while industrial partners provide manufacturing scale and deployment infrastructure. The Shenzhen Research Institute of Big Data likely contributed expertise in distributed computing architectures. This multi-institutional effort reflects a broader strategy to build an end-to-end domestic pipeline. By combining hardware development with academic research, the consortium aims to identify and resolve bottlenecks before they impact commercial deployment.

The integration of these resources allows for rapid iteration on both physical components and software drivers. Such partnerships are essential for maintaining momentum in a highly competitive technological landscape. The convergence of academic theory and industrial engineering creates a feedback loop that accelerates innovation. Continued collaboration will be necessary to sustain progress in this rapidly evolving field.

What are the broader implications for the AI hardware market?

Geopolitical restrictions on advanced semiconductor exports have accelerated domestic substitution efforts. Chinese technology firms face strict limitations on acquiring cutting-edge foreign processors. This regulatory environment has forced local manufacturers to prioritize self-reliance over convenience. The shift toward domestic silicon requires substantial capital investment and long-term strategic planning. Companies must balance immediate performance requirements with future scalability goals. The recent announcement reflects this broader industrial strategy.

The announcement suggests that local infrastructure can support complex computational tasks. The broader technology sector continues to monitor these developments closely. Domestic silicon adoption will depend on long-term reliability, software maturity, and total cost of ownership. While the recent claim lacks independent benchmarks or efficiency metrics, it contributes to a growing body of evidence regarding local capabilities.

Evaluating Long-Term Viability

The long-term viability of domestic AI infrastructure depends on sustained investment and continuous improvement. Short-term successes must be weighed against the broader challenges of ecosystem development. Software compatibility, developer tooling, and global supply chain dependencies all play crucial roles. The recent claim represents one step in a much longer journey toward full training capability. Industry observers will track subsequent announcements to assess whether these milestones translate into scalable, reliable production environments.

The convergence of hardware innovation and academic research will ultimately determine the pace of adoption. This transition could reshape global hardware procurement strategies. The industry must balance innovation with realistic expectations regarding current capabilities.

How does this development influence future research directions?

The reported milestone invites further investigation into the practical limits of domestic training infrastructure. Researchers will likely focus on scaling cluster sizes, improving interconnect bandwidth, and refining software optimization techniques. The absence of performance benchmarks in the original report leaves many questions unanswered. Independent verification would clarify how the Ascend cluster compares to established foreign alternatives in terms of speed and energy efficiency. Future studies may examine how different parameter counts affect training stability on domestic silicon.

The findings will guide hardware engineers in designing the next generation of accelerators. Understanding these limitations will help prioritize engineering efforts and guide policy decisions. The evaluation of domestic training infrastructure requires careful scrutiny of both hardware specifications and software optimization. Researchers must examine how different cluster configurations affect gradient synchronization and memory allocation.

Market Dynamics and Strategic Shifts

The absence of independent benchmarks leaves many practical details unverified. Future studies will likely focus on scaling these architectures to larger parameter counts.

The trajectory of this technology will likely influence global supply chain strategies and future research funding priorities. Continued progress will depend on bridging the gap between theoretical capability and practical deployment. Researchers and engineers must address software limitations and interconnect bottlenecks to achieve broader adoption. The reported post-training achievement highlights the ongoing evolution of domestic artificial intelligence infrastructure. While the announcement lacks independent verification and detailed performance metrics, it underscores a clear direction in hardware development. The focus remains on building resilient, self-sufficient computing ecosystems capable of handling complex computational demands. Continued investment in research and development will determine whether these capabilities can scale to meet global standards.

Christopher Holloway

Christopher Holloway is the founder and director of Progressive Robot, a UK-based technology company. A full-stack engineer with more than two decades of experience, he works across PHP development, ecommerce, Linux infrastructure, technical SEO and AI automation, and writes here on technology, AI, hardware and software.
