Kubernetes 1.33 GPU Scheduling in Practice: Cutting AI Workload Costs by 60% | 기술노트

Introduction

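As AI model serving becomes a mainstream Kubernetes workload, managing GPU resources has become as important as managing CPU and memory. The problem is that GPUs are expensive: a single A100 80GB costs roughly $2,000 per month in the cloud. Unless these GPUs are used efficiently, costs escalate quickly.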

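DRA (Dynamic Resource Allocation) is the answer to this problem. It is beta in Kubernetes 1.33 (resource.k8s.io/v1beta2) and graduated to GA (General Availability) in 1.34, which introduced the resource.k8s.io/v1 API used in the manifests below.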

1. Problems with Traditional GPU Scheduling

Limits of the device plugin

# Traditional approach: GPUs are allocated as whole units
resources:
  limits:
    nvidia.com/gpu: 1  # one entire GPU, held exclusively

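The traditional device plugin approach allocates GPUs as whole units. If you assign an A100 80GB to 7B-model inference, the FP16 weights occupy only about 14 GB, so well over 60 GB of GPU memory can go to waste.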
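To put a number on that waste, here is a rough back-of-the-envelope sketch. It counts model weights only and ignores KV cache and activation memory, which vary with batch size and context length. The function name and the precision choices are illustrative assumptions, not measurements.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


GPU_MEMORY_GB = 80.0  # A100 80GB

# Idle memory when a 7B model has the whole card to itself.
idle_by_precision = {}
for label, bytes_per_param in [("FP16", 2), ("INT8", 1)]:
    weights_gb = weight_memory_gb(7, bytes_per_param)
    idle_gb = GPU_MEMORY_GB - weights_gb
    idle_by_precision[label] = idle_gb
    print(f"{label}: weights ~{weights_gb:.0f} GB, idle ~{idle_gb:.0f} GB "
          f"({idle_gb / GPU_MEMORY_GB:.1%} of the card)")
```

Real deployments will sit somewhere below these idle figures once the KV cache is sized, but the order of magnitude is what matters for the scheduling argument.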

2. DRA (Dynamic Resource Allocation)

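DRA allocates GPUs dynamically and at a finer granularity. Much as you request CPU in millicores (m), a claim can describe the memory and compute a workload needs, and the scheduler matches it to a device (or a partition such as a MIG instance) advertised by the vendor's DRA driver.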

Defining a ResourceClaim

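apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: llm-inference-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:  # in resource.k8s.io/v1 a single-device request is wrapped in "exactly"
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Attribute and capacity names are published by the DRA driver.
            # The names below are an assumption; verify them against your ResourceSlices.
            expression: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0 &&
              device.attributes["gpu.nvidia.com"].cudaComputeCapability.compareTo(semver("8.0.0")) >= 0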
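The selector logic is easier to see outside of CEL. Here is a minimal Python sketch of the same filter: keep only devices with at least 20 GiB of memory and compute capability 8.0 or higher. The `Device` class and the inventory are hypothetical stand-ins for what a DRA driver would publish; this is not how the scheduler is implemented.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for a device advertised by a DRA driver."""
    name: str
    memory_bytes: int
    compute_capability: tuple  # (major, minor)


def matches(device: Device, min_memory_bytes: int = 20 * GIB,
            min_cc: tuple = (8, 0)) -> bool:
    """Mirror of the claim's selector: enough memory AND new enough architecture."""
    return (device.memory_bytes >= min_memory_bytes
            and device.compute_capability >= min_cc)


# Illustrative inventory (names are made up; specs match the public datasheets).
inventory = [
    Device("t4-0", 16 * GIB, (7, 5)),
    Device("v100-0", 32 * GIB, (7, 0)),
    Device("a100-0", 80 * GIB, (8, 0)),
    Device("h100-0", 80 * GIB, (9, 0)),
]

eligible = [d.name for d in inventory if matches(d)]
print(eligible)  # ['a100-0', 'h100-0']
```

Note that the V100 has enough memory but fails the compute-capability check, which is why both conditions belong in the selector.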

Using it from a Pod

apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args: ["--model", "meta-llama/Meta-Llama-3-8B", "--gpu-memory-utilization", "0.9"]
    resources:
      claims:
      - name: gpu  # refers to the entry in spec.resourceClaims below
  resourceClaims:
  - name: gpu
    resourceClaimName: llm-inference-gpu  # the ResourceClaim defined above

3. GPU Sharing (MIG + Time-Slicing)

There are two ways to let multiple Pods share a single GPU:

MIG (Multi-Instance GPU)

Supported on A100 and H100. MIG partitions the GPU at the hardware level:

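# DeviceClass that selects NVIDIA MIG instances by profile
# Attribute names come from the DRA driver; the ones below are an assumption, so check your ResourceSlices.
# Note: 3g.20gb is an A100 40GB profile; the A100 80GB counterpart is 3g.40gb.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-mig-3g20gb
spec:
  selectors:
  - cel:
      expression: |
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].profile == "3g.20gb"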

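An A100 80GB can be split into up to 7 independent GPU instances
Each instance has its own dedicated memory and compute
Instances are fully isolated from one another (memory and faults)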
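A quick way to sanity-check a planned MIG layout is to count slices. The sketch below assumes the A100 80GB exposes 7 compute slices and 8 memory slices and uses the standard profile sizes. The check is necessary but not sufficient, because the hardware's real placement rules are stricter than a simple sum.

```python
# (compute slices, memory slices) per MIG profile on an A100 80GB.
A100_80GB_PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits_on_one_gpu(profiles: list) -> bool:
    """Slice-budget check only; it does not model MIG placement constraints."""
    compute = sum(A100_80GB_PROFILES[p][0] for p in profiles)
    memory = sum(A100_80GB_PROFILES[p][1] for p in profiles)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


print(fits_on_one_gpu(["1g.10gb"] * 7))                # True: the 7-way split
print(fits_on_one_gpu(["2g.20gb"] * 3 + ["1g.10gb"]))  # True: a 4-instance layout
print(fits_on_one_gpu(["2g.20gb"] * 4))                # False: needs 8 compute slices
print(fits_on_one_gpu(["3g.40gb"] * 3))                # False: over both budgets
```

The second case is worth noticing: four instances fit on one card, but not as four equal 2g.20gb slices, so one of the four workloads has to live with a smaller instance.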

Time-Slicing

Use this on GPUs that do not support MIG (T4, V100, and so on):

# Time-slicing configuration for the NVIDIA device plugin (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # four Pods time-share each GPU; memory is not isolated between them
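The effect of `replicas` on capacity planning is simple arithmetic, sketched below. Each physical GPU is advertised as `replicas` schedulable `nvidia.com/gpu` resources. The per-Pod memory budget is only a convention the workloads must respect themselves, since time-slicing does not enforce memory limits. The function names are illustrative.

```python
import math


def physical_gpus_needed(pods: int, replicas: int) -> int:
    """GPUs required when each Pod requests one time-sliced nvidia.com/gpu."""
    if pods < 0 or replicas < 1:
        raise ValueError("pods must be >= 0 and replicas >= 1")
    return math.ceil(pods / replicas)


def memory_budget_per_pod_gb(gpu_memory_gb: float, replicas: int) -> float:
    """Even split of GPU memory; NOT enforced by time-slicing."""
    return gpu_memory_gb / replicas


print(physical_gpus_needed(pods=8, replicas=4))   # 2
print(memory_budget_per_pod_gb(16, replicas=4))   # 4.0 (a T4 has 16 GB)
```

With eight embedding servers and `replicas: 4`, two T4s are enough, provided each server stays within roughly 4 GB of GPU memory.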

4. Cost Savings in Practice

Scenario	Before (exclusive allocation)	With sharing (DRA + MIG or time-slicing)	Savings
7B model inference x 4	A100 x 4	A100 x 1 (4 MIG instances)	75%
Embedding server x 8	T4 x 8	T4 x 2 (time-slicing)	75%
70B model + 7B x 3	A100 x 4	A100 x 2 (mixed)	50%
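The savings column is just the fractional reduction in GPU count, which the short sketch below reproduces. The aggregate figure counts A100s and T4s equally, an assumption that ignores their very different prices, so treat it as a rough indicator rather than a dollar saving.

```python
# (scenario, GPUs before, GPUs after), taken from the table above.
scenarios = [
    ("7B model inference x 4", 4, 1),
    ("Embedding server x 8", 8, 2),
    ("70B model + 7B x 3", 4, 2),
]


def savings(before: int, after: int) -> float:
    """Fraction of GPUs no longer needed."""
    return (before - after) / before


for name, before, after in scenarios:
    print(f"{name}: {savings(before, after):.0%}")

total_before = sum(before for _, before, _ in scenarios)
total_after = sum(after for _, _, after in scenarios)
overall = savings(total_before, total_after)
print(f"overall by GPU count: {total_before} -> {total_after} ({overall:.2%})")
```

Across all three scenarios the fleet shrinks from 16 GPUs to 5, a reduction of about 69% by count.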

5. Monitoring

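# GPU utilization monitoring (Prometheus + dcgm-exporter)
# Assumes a Helm repo aliased "nvidia" that hosts the dcgm-exporter chart has already been added
helm install dcgm-exporter nvidia/dcgm-exporter

# Example Grafana dashboard queries
# GPU memory utilization (%)
# If DCGM_FI_DEV_FB_TOTAL is not enabled in the exporter's counter list,
# use (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) as the denominator instead
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100

# GPU compute utilization (%)
DCGM_FI_DEV_GPU_UTIL

# Power draw (W)
DCGM_FI_DEV_POWER_USAGE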
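Once these metrics are flowing, the same memory-utilization formula can be used to spot GPUs that are candidates for sharing. The sketch below applies it to hypothetical samples; the GPU names, the readings, and the 30% threshold are all made-up assumptions for illustration. DCGM reports framebuffer values in MiB.

```python
def memory_utilization_pct(fb_used_mib: float, fb_total_mib: float) -> float:
    """Same formula as the memory query: used / total * 100."""
    if fb_total_mib <= 0:
        raise ValueError("fb_total_mib must be positive")
    return fb_used_mib / fb_total_mib * 100


# Hypothetical (used MiB, total MiB) samples, e.g. copied from a dashboard.
samples = {
    "gpu-0": (14_000, 81_920),
    "gpu-1": (70_000, 81_920),
}

UNDERUSED_THRESHOLD_PCT = 30  # arbitrary cut-off for this example

underused = [
    gpu for gpu, (used, total) in samples.items()
    if memory_utilization_pct(used, total) < UNDERUSED_THRESHOLD_PCT
]
print(underused)  # ['gpu-0']
```

A GPU that stays under the threshold for days is a good first candidate for MIG partitioning or time-slicing.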

Conclusion

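With DRA reaching GA, GPU management in Kubernetes is approaching the maturity of CPU and memory management. By combining MIG and time-slicing appropriately, the same GPU infrastructure can serve 2 to 4 times as many AI workloads. Don't let expensive GPUs sit idle.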