At the Computing Industry Development Consortium meeting in Hefei, Huawei introduced its Star River AI Computing Data Center Network. With the DF+ StarMesh architecture, NSLB load balancing, and FlashStart technology, the network improves AI training efficiency by over 10% and reliability more than tenfold, supporting high-quality growth in computing.
The plenary meeting of the Computing Industry Development Consortium (hereafter the "Consortium") was recently held in Hefei. The Consortium was established under the guidance of the Ministry of Industry and Information Technology, with the China Academy of Information and Communications Technology taking the lead. Lin Yihong, Senior Architect of Huawei Data Center Network Solutions, was invited to the Consortium’s Computing Network Working Group meeting, where he delivered a keynote speech titled "Huawei Star River AI Computing Data Center Network: Unlocking High Computing Power in the AI Era." Together with leaders from academia, industry, and research, Lin discussed the new challenges that networks face in the era of artificial intelligence and the directions for technological innovation.
In his keynote, Lin, who works in Huawei’s Data Communication Product Line, noted that cluster sizes are expanding as the number of parameters in large-scale model training grows rapidly. Data center networks for ultra-large-scale clusters face three major challenges: the scale of a single-POD network becomes a bottleneck, computing power is wasted while compute cards sit idle waiting, and insufficient network reliability causes interruptions during training.
The Huawei Star River AI Computing Data Center Network is designed for the intelligence era. It provides a new network infrastructure that enables ultra-large cluster scale, high computing efficiency, and high computing availability, supporting high-quality growth in the computing industry. Based on the new DF+ StarMesh network architecture, the solution breaks through the scalability limits of AI clusters, and by supporting long-distance training across remote clusters, it allows large clusters to be scaled out rapidly. Huawei's exclusive network-scale load balancing (NSLB) algorithm achieves a network throughput rate of 95% and boosts AI training efficiency by more than 10%. In addition, with resilience to optical module channel loss, optical module contamination detection, and Huawei's exclusive FlashStart technology, network reliability increases more than tenfold, and training continues uninterrupted while devices reboot or are upgraded.
Notably, this solution recently won the “Innovation Pioneer” award at the 2024 China Computing Power Conference, standing out from nearly a hundred entries for its innovation and customer value.
Computing power is becoming a crucial driver of high-quality social and economic development. By upgrading transport capabilities so that computing power can flow more freely, the Huawei Star River AI Computing Data Center Network uses a stronger network to unlock high computing power in the AI era and accelerate the adoption of intelligence across industries.