C. Operational Configuration Reference

개요 / Overview

한국어

이 페이지는 ENN-PyTorch를 실행할 때 자주 확인하게 되는 운영 설정과 환경변수 범주를 요약한 참조 자료다.

본문에서는 모델 구조와 런타임 흐름을 설명한다. 이 페이지는 문제가 생겼을 때 어떤 설정 축을 먼저 확인해야 하는지 빠르게 찾기 위한 용도다. 전체 환경변수 사전이 아니라, 운영과 디버깅에서 자주 보는 범주와 대표 설정만 정리한다.

환경변수는 실행 경로와 재현성에 영향을 줄 수 있다. 값을 바꿨다면 실행 로그나 실험 기록에 함께 남기는 것이 좋다.

English

This page summarizes operational settings and environment variable categories that are often checked when running ENN-PyTorch.

The main chapters explain model structure and runtime flow. This page helps you quickly find which configuration axis to check first when a problem occurs. It is not a complete environment variable dictionary; it covers only the categories and representative settings that come up most often in operations and debugging.

Environment variables can affect execution paths and reproducibility. If you change a value, it is good practice to record it alongside the execution logs or experiment notes.
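One way to keep such a record is to write a snapshot of the relevant environment variables next to the run's logs. The sketch below uses only the Python standard library. The prefix list is an assumption: `ENN_` is a hypothetical naming convention that this page does not confirm, and the other prefixes simply match variables commonly set for PyTorch, CUDA, and NCCL.

```python
import json
import os
import time

# Assumed prefixes. "ENN_" is hypothetical; adjust to the names your setup actually uses.
TRACKED_PREFIXES = ("ENN_", "TORCH", "CUDA", "NCCL")


def snapshot_env(environ=None, prefixes=TRACKED_PREFIXES):
    """Return the tracked environment variables as a dict with sorted keys."""
    environ = os.environ if environ is None else environ
    return {k: environ[k] for k in sorted(environ) if k.startswith(prefixes)}


def write_env_record(path, environ=None):
    """Write a timestamped snapshot as JSON and return the record."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "env": snapshot_env(environ),
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)
    return record
```

Calling `write_env_record("env_snapshot.json")` at the start of a run leaves a file that can be diffed against the snapshot from an earlier run.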

사용 원칙 / Usage Principles

한국어

운영 설정은 문제를 좁히기 위한 도구다. 여러 설정을 한 번에 바꾸면 원인을 파악하기 어려워진다.

English

Operational settings are tools for narrowing down a problem. Changing several settings at once makes it hard to identify the cause.

1. 먼저 기본 설정으로 문제를 재현한다.
2. 증상이 어느 계층에서 발생하는지 좁힌다.
3. 한 번에 하나의 설정 축만 바꿔 비교한다.
4. 변경한 환경변수와 실행 결과를 기록한다.
5. 설정 변경 후 raw output, calibrated output, checkpoint, manifest 같은 artifact를 함께 확인한다.

1. Reproduce the problem with the default settings first.
2. Narrow down which layer the symptom comes from.
3. Change only one configuration axis at a time.
4. Record the changed environment variables and execution results.
5. After changing settings, inspect artifacts such as raw output, calibrated output, checkpoints, and manifests together.
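The "one axis at a time" rule in step 3 can be enforced mechanically before a comparison run is launched. The following is a minimal sketch, assuming each run's settings are available as a plain dict. The keys in the usage example (`precision`, `attention_backend`) are hypothetical placeholders, not actual ENN-PyTorch setting names.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def changed_axes(baseline, candidate):
    """Return the sorted keys whose values differ between two setting dicts."""
    keys = set(baseline) | set(candidate)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def check_single_axis(baseline, candidate):
    """Return the one changed key, or raise if zero or several keys changed."""
    diff = changed_axes(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed setting, got {diff}")
    return diff[0]


# Usage with hypothetical setting names:
baseline = {"precision": "bf16", "attention_backend": "auto"}
candidate = {"precision": "fp32", "attention_backend": "auto"}
print(check_single_axis(baseline, candidate))  # prints "precision"
```

Storing the returned key with the run's results also covers step 4, since the record then states which axis the comparison was about.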

빠른 진단표 / Quick Diagnosis Table

증상 / Symptom	먼저 확인할 설정 축 / First Configuration Axis	같이 확인할 것 / Check Together
예측값이 거의 동일함 / Predictions are almost identical	prediction fallback, precision tail	raw/calibrated/denorm output, fp32 retry
학습 중 NaN/Inf / NaN/Inf during training	precision/autocast, nonfinite diagnostics	scale statistics, loss, gradient, optimizer state
attention이 느림 / Attention is slow	attention backend, KernelManager	실제 backend path, math fallback 여부 / actual backend path, whether math fallback is used
CUDA OOM	OOM/autobatch, prefetch, checkpoint staging	effective batch size, microbatch, pinned memory
GPU 사용률이 낮음 / Low GPU utilization	data pipeline, prefetch, H2D transfer	loader throughput, filesystem, CPU quota
checkpoint가 느림 / Checkpointing is slow	DCP writer, filesystem, lane	`.done/.failed`, writer threads, storage path
distributed hang	rank/world/local rank, process group lane	barrier, DCP participation, backend timeout
export 실패 / Export failure	export-safe path, dynamic shape, backend dependency	sample input, sidecar metadata, ONNX graph
tmpfs/RAM 압박 / tmpfs/RAM pressure	temp/cache directory, memmap path	`/tmp`, Inductor/Triton cache, checkpoint path
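
For the tmpfs/RAM pressure row, a quick first check is how full the temp and cache locations are. The sketch below is standard-library Python and rests on a few assumptions: it reads `TORCHINDUCTOR_CACHE_DIR` and `TRITON_CACHE_DIR` only if they are set, it uses `tempfile.gettempdir()` (which honors `TMPDIR`) for the temp directory, and any checkpoint or memmap path has to be passed in by the caller because this page does not say where ENN-PyTorch places them.

```python
import os
import shutil
import tempfile

# Cache directory variables read by PyTorch Inductor and Triton when set.
CACHE_ENV_VARS = ("TORCHINDUCTOR_CACHE_DIR", "TRITON_CACHE_DIR")


def storage_report(extra_paths=(), environ=None):
    """Report disk usage for the temp dir, configured cache dirs, and extra paths."""
    environ = os.environ if environ is None else environ
    candidates = {"tempdir": tempfile.gettempdir()}
    for name in CACHE_ENV_VARS:
        if environ.get(name):
            candidates[name] = environ[name]
    for path in extra_paths:  # e.g. a checkpoint or memmap directory
        candidates[str(path)] = str(path)

    report = {}
    for label, path in candidates.items():
        if not os.path.isdir(path):
            report[label] = None  # directory does not exist on this machine
            continue
        usage = shutil.disk_usage(path)
        report[label] = {
            "path": path,
            "used_fraction": usage.used / usage.total if usage.total else 0.0,
            "free_gib": usage.free / 2**30,
        }
    return report


if __name__ == "__main__":
    for label, info in storage_report().items():
        print(label, info)
```

A high `used_fraction` on a tmpfs-backed path suggests moving the cache or temp directory to disk-backed storage before changing any other setting.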