[yongggg's] DDP

GPUTraining & View 2023. 12. 14. 11:17

Hello. In this post I explain how to implement Distributed Data Parallel (DDP) in a one-node / multi-GPU environment.

This post will be easiest to follow if you are already comfortable customizing a model yourself.

First, before running DDP, you need to confirm that all of your GPUs are recognized correctly.

nvidia-smi
nvcc -V

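Use commands like the ones above to confirm first that the machine recognizes its GPUs: nvidia-smi lists the GPUs the driver can see, and nvcc -V reports the installed CUDA toolkit version.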
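The same check can also be done from inside Python, which is where DDP will actually run. The sketch below is only an illustration: `gpu_report` is a hypothetical helper name, not something from PyTorch, and it assumes you simply want a count and the device names. It falls back to an empty report if PyTorch is not installed.

```python
import importlib.util


def gpu_report():
    """Summarize what PyTorch can see; hypothetical helper for illustration."""
    report = {"torch_installed": False, "cuda_available": False,
              "gpu_count": 0, "names": []}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch missing: nothing more to check

    import torch

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu_count"] = torch.cuda.device_count()
        report["names"] = [torch.cuda.get_device_name(i)
                           for i in range(report["gpu_count"])]
    return report


if __name__ == "__main__":
    info = gpu_report()
    print(f"CUDA available: {info['cuda_available']}, GPUs: {info['gpu_count']}")
    for index, name in enumerate(info["names"]):
        print(f"  [{index}] {name}")
```

If the count printed here is lower than what nvidia-smi shows, CUDA_VISIBLE_DEVICES is a likely place to look next.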

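In my case, on a MacBook, I copied CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' into a notes app, and the quotation marks were sometimes changed when the text was saved. The straight quotes were replaced with the Mac's curly quotes, so the value was not parsed correctly and the run failed. The error output, however, showed only a process signal error.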

If the GPUs are recognized fine but all you get is a signal error, check small details like this first!
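A quick way to catch this kind of problem is to scan the command text for curly quotes before pasting it into a shell. The following is a minimal sketch under that assumption; `find_smart_quotes` and `normalize_quotes` are hypothetical helper names, and only the four most common curly quote characters are handled.

```python
# Curly ("smart") quotes mapped to their straight ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def find_smart_quotes(text):
    """Return (index, character) pairs for every curly quote in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in SMART_QUOTES]


def normalize_quotes(text):
    """Replace curly quotes with straight quotes, leaving the rest untouched."""
    return text.translate(str.maketrans(SMART_QUOTES))


if __name__ == "__main__":
    pasted = "CUDA_VISIBLE_DEVICES=\u20180,1,2,3,4,5,6,7\u2019"
    for index, char in find_smart_quotes(pasted):
        print(f"curly quote U+{ord(char):04X} at position {index}")
    print(normalize_quotes(pasted))
```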

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.2.0+cu121 documentation


import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    demo_basic()

torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py
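torchrun starts one worker process per GPU and passes each one its identity through environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE. With --nnodes=1 and --nproc_per_node=8 as in the command above, that means eight processes with ranks 0 through 7. The sketch below shows one way to read those values and reproduce the rank-to-GPU mapping used in demo_basic(); `read_torchrun_env` and `pick_device_id` are hypothetical helper names, and the defaults for a non-torchrun launch are an assumption made for illustration.

```python
import os


def read_torchrun_env(env=None):
    """Read the rank variables that torchrun exports to each worker.

    Falls back to single-process defaults (assumed here) when the
    script is started without torchrun.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def pick_device_id(rank, gpu_count):
    """Same mapping as demo_basic(): wrap the global rank onto local GPUs."""
    if gpu_count <= 0:
        raise ValueError("no GPUs available")
    return rank % gpu_count


if __name__ == "__main__":
    info = read_torchrun_env()
    print(f"rank {info['rank']} of {info['world_size']} "
          f"(local rank {info['local_rank']})")
```

On a single node the global rank and the local rank are the same, so rank % torch.cuda.device_count() picks the same GPU index as LOCAL_RANK would.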
