8. Operational Risks and Debugging Guide

개요 / Overview

한국어

ENN-PyTorch의 운영 리스크는 모델 자체보다 실행 경로의 복잡성에서 주로 발생한다.

이 프로젝트는 모델 구조, 데이터 staging, 커널 선택, 정밀도 제어, 학습·예측 워커, 분산 처리, 체크포인트, 내보내기를 하나의 런타임 안에서 다룬다. 이 구조는 대용량 데이터와 다양한 실행 환경에 대응하기 위한 장점이 있지만, 동시에 디버깅 지점이 많아진다는 뜻이기도 하다.

따라서 문제가 생겼을 때는 “모델이 틀렸다” 또는 “GPU가 느리다”처럼 단순하게 접근하기보다, 실행 경로를 계층별로 나눠서 확인해야 한다.

데이터 준비
  → 모델 구조
  → 커널 선택
  → 정밀도/autocast
  → 학습·예측 워커
  → 분산 처리
  → 체크포인트
  → 결과 저장 또는 내보내기

English

Operational risks in ENN-PyTorch arise mainly from the complexity of the execution path rather than from the model itself.

This project handles model structure, data staging, kernel selection, precision control, training and prediction workers, distributed execution, checkpoints, and export inside a single runtime. This design helps with large datasets and varied execution environments, but it also means there are many places where a failure can originate.

When a problem occurs, inspect the execution path layer by layer instead of assuming that “the model is wrong” or “the GPU is slow.”

Data preparation
  → model structure
  → kernel selection
  → precision/autocast
  → training and prediction workers
  → distributed execution
  → checkpoints
  → result saving or export
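
The layer-by-layer approach can be sketched as a small triage helper that walks the execution path in order and stops at the first failing layer. This is a minimal sketch, not part of ENN-PyTorch: `LAYERS`, `first_failing_layer`, and the check callables are hypothetical names introduced only for illustration, and each check is assumed to return an `(ok, detail)` pair.

```python
from typing import Callable, Dict, Optional, Tuple

# Layer names follow the execution path listed above. Everything else here
# is a hypothetical illustration, not an ENN-PyTorch API.
LAYERS = [
    "data preparation",
    "model structure",
    "kernel selection",
    "precision/autocast",
    "training and prediction workers",
    "distributed execution",
    "checkpoints",
    "result saving or export",
]

Check = Callable[[], Tuple[bool, str]]


def first_failing_layer(checks: Dict[str, Check]) -> Optional[Tuple[str, str]]:
    """Run the registered checks in execution-path order.

    Returns (layer, detail) for the first failing layer, or None if every
    registered check passes. A check that raises counts as a failure.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is None:
            continue  # no check registered for this layer
        try:
            ok, detail = check()
        except Exception as exc:
            ok, detail = False, f"{type(exc).__name__}: {exc}"
        if not ok:
            return layer, detail
    return None


# Usage: two layers fail, but the earlier one in the path is reported.
checks = {
    "data preparation": lambda: (True, "ok"),
    "precision/autocast": lambda: (False, "nonfinite values in output"),
    "checkpoints": lambda: (False, "missing shard"),
}
result = first_failing_layer(checks)
print(result)
```

Reporting only the earliest failing layer is a deliberate choice in this sketch: a failure upstream (for example in precision handling) often explains failures further down the path, so it is usually the better place to start.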

운영 리스크 전체 지도 / Operational Risk Map

한국어

운영 리스크는 대부분 단일 원인으로 끝나지 않는다. 예측 collapse는 모델 구조 문제일 수도 있지만, BF16 tail quantization, calibration, denormalization, cudagraph static buffer, batch-level broadcast 문제일 수도 있다. 성능 저하 역시 모델 연산 자체가 아니라 data prefetch, filesystem, backend fallback에서 발생할 수 있다.

English

Most operational risks cannot be traced to a single cause. Prediction collapse may come from the model structure, but it can equally come from BF16 tail quantization, calibration, denormalization, cudagraph static buffers, or batch-level broadcasting. Likewise, performance degradation may originate in data prefetch, filesystem behavior, or backend fallback rather than in the model computation itself.
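
To make the BF16 case concrete, the sketch below emulates bfloat16 rounding in pure Python and shows how distinct large-magnitude values can land on the same representable number. It rests on two assumptions: that "tail quantization" refers to this loss of resolution at large magnitudes, and that the conversion uses round-to-nearest-even. `to_bf16` is a hypothetical helper, not an ENN-PyTorch function, and it handles finite inputs only.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the sign, the 8 exponent bits, and the top 7 mantissa
    bits of a float32, so it is emulated here by rounding away the low
    16 bits of the float32 bit pattern.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias for round-to-nearest-even on the 16 discarded bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Between 512 and 1024 the spacing of bfloat16 values is 4, so these five
# distinct predictions all collapse onto a single value.
values = [1000.0, 1000.5, 1001.0, 1001.5, 1002.0]
quantized = [to_bf16(v) for v in values]
print(quantized)            # [1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
print(len(set(quantized)))  # 1
```

If predictions look identical only after a BF16 stage, comparing them against the same values kept in FP32 is a quick way to separate this effect from a genuine model collapse.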

flowchart TD
    A["문제 발생<br/>Problem Occurs"] --> B{"어떤 증상인가?<br/>What symptom?"}

    B -->|예측이 거의 동일함<br/>Predictions are almost identical| C["Prediction collapse<br/>low diversity<br/>scaler mismatch"]
    B -->|속도가 느림<br/>Slow performance| D["Kernel fallback<br/>math path<br/>compile/cudagraph disabled"]
    B -->|NaN / Inf| E["Precision issue<br/>autocast<br/>optimizer state<br/>nonfinite tensor"]
    B -->|OOM| F["Batch / microbatch<br/>prefetch<br/>checkpoint<br/>device memory"]
    B -->|I/O 지연<br/>I/O delay| G["memmap<br/>DCP checkpoint<br/>prediction chunks<br/>export files"]
    B -->|분산 실행 멈춤<br/>Distributed hang| H["process group<br/>control lane<br/>accelerator lane<br/>barrier / DCP"]
    B -->|export 실패<br/>Export failure| I["export-safe path<br/>operator support<br/>dynamic shape<br/>dependency"]

    C --> J["raw / calibrated / denorm 후보 비교<br/>Compare raw / calibrated / denorm candidates"]
    D --> K["실제 backend 경로 확인<br/>Check actual backend path"]
    E --> L["dtype / scale / nonfinite dump 확인<br/>Check dtype / scale / nonfinite dumps"]
    F --> M["Governor / OOMHandler / microbatch 확인<br/>Check Governor / OOMHandler / microbatch"]
    G --> N["filesystem / tmp / cache 위치 확인<br/>Check filesystem / tmp / cache location"]
    H --> O["rank / world size / backend / lane 확인<br/>Check rank / world size / backend / lane"]
    I --> P["fast path 비활성화와 backend별 제약 확인<br/>Check fast path disabling and backend constraints"]
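
The "Compare raw / calibrated / denorm candidates" step in the flowchart can be sketched as a diversity check applied to each stage in order. This is only an illustration: the stage names come from the flowchart, while `diversity`, `first_collapsed_stage`, the rounding precision, and the `min_unique_ratio` threshold are hypothetical choices that would need tuning for real data.

```python
from statistics import pstdev
from typing import Dict, List, Optional


def diversity(values: List[float], decimals: int = 6) -> Dict[str, float]:
    """Summarize how varied a list of predictions is."""
    if not values:
        raise ValueError("values must not be empty")
    rounded = {round(v, decimals) for v in values}
    return {
        "std": pstdev(values),
        "unique_ratio": len(rounded) / len(values),
    }


def first_collapsed_stage(
    stages: Dict[str, List[float]], min_unique_ratio: float = 0.1
) -> Optional[str]:
    """Return the first stage (in insertion order) whose predictions collapse."""
    for name, values in stages.items():
        if diversity(values)["unique_ratio"] < min_unique_ratio:
            return name
    return None


# Usage: raw and calibrated outputs vary, but the denormalized ones do not,
# which points at denormalization rather than at the model.
raw = [i * 0.01 for i in range(20)]
stages = {
    "raw": raw,
    "calibrated": [0.5 * v + 0.1 for v in raw],
    "denorm": [1000.0] * 20,
}
collapsed = first_collapsed_stage(stages)
print(collapsed)  # denorm
```

Checking the stages in pipeline order matters: if the raw outputs are already collapsed, the later stages are not the cause, whereas a collapse that first appears at `denorm` suggests a scaler or precision problem downstream of the model.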

주요 리스크 요약 / Risk Summary