
# **Heterogeneous Distributed Training Framework**
## Project Facts
Project Creation Date: *Mar. 20th, 2025*
Primary Contacts: Lei Huang, huangleiyjy@chinamobile.com; Zhengwei Chen, chenzhengwei@sd.chinamobile.com
Project Lead: Lei Huang, huangleiyjy@chinamobile.com
Committers:
* Zhengwei Chen, chenzhengwei@sd.chinamobile.com
* Yutong Tian, tianyutongcxy@sd.chinamobile.com
Mailing List: [computing-force-network@lists.opendev.org](mailto:computing-force-network@lists.opendev.org)
Meetings: No dedicated sub-group meeting time; topics are covered in the bi-weekly CFN WG meeting.
Repository: [https://opendev.org/cfn/heterogeneous-distributed-training-framework](https://opendev.org/cfn/heterogeneous-distributed-training-framework)
StoryBoard: N/A
Open Bugs: N/A
## Introduction
Currently, the “resource wall” between different GPU architectures makes it difficult to build a single heterogeneous resource pool for large-scale model training, so heterogeneous distributed training has become a pressing challenge for the industry. To address this, we propose Heterogeneous Distributed Training (HDT) technology. Designed with generality as its goal, HDT realizes the industry's first cross-architecture unified heterogeneous training framework.
The training framework enables multiple LLMs to be deployed and trained across multiple types of GPUs. It introduces the Inhomogeneous Task Distribution (ITD) algorithm for splitting heterogeneous training tasks, which supports heterogeneous data parallelism and heterogeneous pipeline parallelism and adaptively adjusts parameters such as micro-batch size, micro-batch count, and the data-parallel degree across heterogeneous GPUs.
To date, we have verified this capability with the LLaMA2 7B and 13B models trained on clusters composed of NVIDIA GPUs and four other GPU types. The acceleration ratio reached 95%, the loss converged to 1.8, and the PPL curve converged normally.
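
As a rough illustration of the inhomogeneous splitting idea described above, the sketch below divides a global batch across heterogeneous data-parallel ranks in proportion to each GPU's measured throughput, so that faster devices receive more samples per step. It is a minimal, hypothetical example: the function name `split_global_batch`, the throughput figures, and the rounding policy are assumptions for illustration and do not reflect the actual HDT/ITD implementation.

```python
# Minimal sketch: throughput-proportional batch splitting for
# heterogeneous data parallelism. Illustrative only; not the HDT code.
from math import floor


def split_global_batch(global_batch_size, throughputs):
    """Assign per-rank sample counts proportional to measured GPU
    throughput (samples/s), so heterogeneous data-parallel ranks
    finish each training step at roughly the same time."""
    total = sum(throughputs)
    # Proportional share per rank, rounded down; remainder handled below.
    shares = [floor(global_batch_size * t / total) for t in throughputs]
    # Give the leftover samples to the fastest ranks first.
    leftover = global_batch_size - sum(shares)
    for idx in sorted(range(len(throughputs)),
                      key=lambda i: throughputs[i], reverse=True)[:leftover]:
        shares[idx] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical cluster: one fast GPU type and two slower types.
    measured_throughput = [300.0, 180.0, 120.0]  # samples/s per DP rank
    print(split_global_batch(1024, measured_throughput))
    # -> [513, 307, 204]; per-rank batch sizes sum to 1024.
```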
## Documentation & Training
N/A
## Release Planning & Release Notes
For the 2025 release:
1. Heterogeneous Distributed Training Technology Solution: introduction of the HDT technology solution, including user guide, architecture description, software, etc.
2. Others: TBD