6. Training and Prediction Runtime

개요 / Overview

한국어

ENN-PyTorch의 학습 및 예측 런타임은 데이터를 먼저 staging하고, 모델 상태를 checkpoint로 준비한 뒤, 워커 프로세스에서 학습 또는 예측을 실행하고 결과 artifact를 회수하는 구조다.

train()과 predict()는 현재 Python process 안에서 단순히 model.forward()를 반복 호출하는 방식이 아니다. 사용자 입력은 memmap source로 변환되고, 모델 상태는 checkpoint로 준비되며, 실제 실행은 elastic worker 안에서 이루어진다.

사용자 API 호출
  → 데이터 memmap staging
  → model checkpoint 준비
  → RuntimeConfig 구성
  → elastic worker 실행
  → ProcessBroker bootstrap
  → distributed process group 초기화
  → Session / model / optimizer / loss 구성
  → train 또는 infer 실행
  → checkpoint 또는 prediction artifact 회수

이 구조 때문에 학습, 예측, 분산 처리, 체크포인트는 서로 분리된 기능이 아니라 하나의 런타임 흐름 안에서 함께 봐야 한다.

English

The ENN-PyTorch training and prediction runtime first stages data, prepares model state as checkpoints, runs training or prediction inside worker processes, and then collects result artifacts.

train() and predict() do not simply call model.forward() repeatedly inside the current Python process. User input is converted into a memmap source, model state is prepared as checkpoints, and actual execution happens inside elastic workers.

User API call
  → data memmap staging
  → model checkpoint preparation
  → RuntimeConfig construction
  → elastic worker execution
  → ProcessBroker bootstrap
  → distributed process group initialization
  → Session / model / optimizer / loss construction
  → train or infer execution
  → checkpoint or prediction artifact collection

Because of this structure, training, prediction, distributed execution, and checkpoints should be read together as one runtime flow rather than as separate features.
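
The flow above can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not the ENN-PyTorch implementation: the file names (`inputs.bin`, `checkpoint.json`, `runtime_config.json`, `predictions.json`), the `run()` helper, and the single-weight "model" are all hypothetical, and a plain subprocess stands in for an elastic worker. It only shows the shape of the hand-off: stage data to a memory-mappable file, write the model state, launch a separate process, and read back the artifact.

```python
import json
import struct
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical worker script. It runs in a separate Python process, maps the
# staged data read-only, loads the checkpoint, and writes a result artifact.
WORKER_SOURCE = r"""
import json, mmap, struct, sys
from pathlib import Path

run_dir = Path(sys.argv[1])
config = json.loads((run_dir / "runtime_config.json").read_text())
checkpoint = json.loads((run_dir / "checkpoint.json").read_text())

with open(run_dir / "inputs.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        values = struct.unpack(f"<{config['num_values']}d", buf[:])

if config["mode"] == "train":
    # Toy update: move the single weight 10% of the way towards the data mean.
    mean = sum(values) / len(values)
    checkpoint["weight"] += 0.1 * (mean - checkpoint["weight"])
    (run_dir / "checkpoint.json").write_text(json.dumps(checkpoint))
else:
    predictions = [checkpoint["weight"] * v for v in values]
    (run_dir / "predictions.json").write_text(json.dumps(predictions))
"""


def run(mode, values, weight, run_dir):
    """Stage inputs, launch one worker process, and collect its artifact."""
    run_dir = Path(run_dir)
    # 1. Data staging: little-endian float64 values in a flat binary file.
    (run_dir / "inputs.bin").write_bytes(struct.pack(f"<{len(values)}d", *values))
    # 2. Model state prepared as a checkpoint before the worker starts.
    (run_dir / "checkpoint.json").write_text(json.dumps({"weight": weight}))
    # 3. Runtime configuration handed to the worker (hypothetical fields).
    (run_dir / "runtime_config.json").write_text(
        json.dumps({"mode": mode, "num_values": len(values)})
    )
    # 4. Execution happens in another process, not in the caller.
    subprocess.run([sys.executable, "-c", WORKER_SOURCE, str(run_dir)], check=True)
    # 5. Collect the artifact: an updated checkpoint or the predictions.
    artifact = "checkpoint.json" if mode == "train" else "predictions.json"
    return json.loads((run_dir / artifact).read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run("train", [1.0, 2.0, 3.0], 0.0, tmp))
        print(run("infer", [1.0, 2.0, 3.0], 2.0, tmp))
```

The caller never touches the model directly in this sketch. Everything it exchanges with the worker goes through files in the run directory, which is why staging, checkpoints, and worker execution are hard to describe in isolation.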

이 장의 구성 / Chapter Map

섹션 / Section	한국어	English
공통 런타임 구조 / Common runtime structure	`train()`과 `predict()`가 공유하는 staging, worker, checkpoint 흐름을 설명한다.	Explains the staging, worker, and checkpoint flow shared by `train()` and `predict()`.
ProcessBroker와 worker bootstrap / ProcessBroker and worker bootstrap	워커 실행 전 환경 정리와 process group 초기화를 설명한다.	Explains environment cleanup and process group initialization before worker execution.
분산 실행 구조 / Distributed execution structure	control lane, accelerator lane, HSDP/FSDP wrapping을 설명한다.	Explains control lanes, accelerator lanes, and HSDP/FSDP wrapping.
학습 실행 흐름 / Training execution flow	epoch loop, optimizer, loss, validation, checkpoint를 설명한다.	Explains epoch loops, optimizer, loss, validation, and checkpoints.
예측 실행 흐름 / Prediction execution flow	infer loop, prediction chunk, manifest, output assembly를 설명한다.	Explains infer loops, prediction chunks, manifests, and output assembly.
prediction collapse fallback	raw, calibrated, denorm, fp32, per-sample 후보 비교를 설명한다.	Explains comparison among raw, calibrated, denorm, fp32, and per-sample candidates.

공통 런타임 구조 / Common Runtime Structure

한국어

학습과 예측은 목적은 다르지만 기본 실행 구조를 공유한다. 사용자-facing workflow는 데이터를 staging하고 checkpoint를 준비한 뒤 worker를 실행한다. worker 내부에서는 device, backend, process group, Session, Loader, model이 구성되고, mode에 따라 학습 또는 추론이 실행된다.

English

Training and prediction serve different purposes, but they share the same basic execution structure. The user-facing workflow stages data, prepares checkpoints, and launches workers. Inside each worker, the device, backend, process group, Session, Loader, and model are set up, and the worker then runs either training or inference depending on the mode.
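
The worker-side ordering described above can be sketched as follows. This is a minimal illustration, assuming the worker reads its rank from the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables that `torchrun` sets. The `RuntimeConfig` fields, the `WorkerContext` class, the `bootstrap()` and `worker_main()` helpers, and the backend choice are hypothetical, and real construction of the process group, Session, Loader, and model is replaced by placeholders that only record the order of steps.

```python
import os
from dataclasses import dataclass, field


@dataclass
class RuntimeConfig:
    """Hypothetical subset of fields; the real RuntimeConfig is not shown here."""
    mode: str                     # "train" or "infer"
    use_accelerator: bool = False


@dataclass
class WorkerContext:
    """Hypothetical record of what a worker resolved during setup."""
    rank: int
    world_size: int
    local_rank: int
    device: str
    backend: str
    steps: list = field(default_factory=list)


def bootstrap(config, env=None):
    """Resolve rank, device, and backend from the launcher's environment."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    # Assumed policy: NCCL on accelerators, Gloo on CPU.
    device = f"cuda:{local_rank}" if config.use_accelerator else "cpu"
    backend = "nccl" if config.use_accelerator else "gloo"
    return WorkerContext(rank, world_size, local_rank, device, backend)


def worker_main(config, env=None):
    """Run the shared setup, then branch on the mode."""
    ctx = bootstrap(config, env)
    # Shared setup. In a real worker the first step would initialize the
    # distributed process group with ctx.backend; here it is only recorded.
    for step in ("process_group", "session", "loader", "model"):
        ctx.steps.append(step)
    if config.mode == "train":
        ctx.steps += ["optimizer", "loss", "train"]
    elif config.mode == "infer":
        ctx.steps.append("infer")
    else:
        raise ValueError(f"unknown mode: {config.mode!r}")
    return ctx


if __name__ == "__main__":
    print(worker_main(RuntimeConfig(mode="train")).steps)
    print(worker_main(RuntimeConfig(mode="infer")).steps)
```

Both modes pass through the same setup and diverge only at the last branch, which is the sense in which training and prediction share one runtime structure.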