5. Data Pipeline

개요 / Overview

한국어

ENN-PyTorch의 데이터 파이프라인은 사용자 입력 데이터를 memmap 기반 중간 표현으로 준비하고, Sampler와 Stream prefetch를 통해 학습·예측 런타임에 device batch를 공급하는 구조다.

핵심은 memmap-first다. 입력 데이터를 메모리에 한 번에 올려서 학습 loop에 넣는 방식이 아니라, features.mmt, labels.mmt, meta.json 같은 파일 기반 표현으로 staging한 뒤 필요한 batch를 읽어 device로 전달한다.

이 구조는 대용량 데이터와 분산 실행을 고려한 설계다. 대신 파일시스템 성능, 임시 디렉터리, pinned memory, CPU quota, accelerator stream 상태에 민감하다.

English

The ENN-PyTorch data pipeline prepares user input data as memmap-based intermediate representations and supplies device batches to the training and prediction runtime through Sampler and Stream prefetch.

The core idea is memmap-first. Instead of loading all input data into memory and feeding it directly into the training loop, ENN-PyTorch stages it into file-based representations such as features.mmt, labels.mmt, and meta.json, then reads the required batches and transfers them to the device.

This design targets large-scale data and distributed execution. In return, it becomes sensitive to filesystem performance, temporary directories, pinned memory, CPU quota, and accelerator stream state.

전체 데이터 흐름 / Overall Data Flow

한국어

train()과 predict()는 사용자 입력을 바로 모델에 넘기지 않는다. 먼저 데이터를 memmap dataset으로 staging하고, worker runtime은 그 결과물을 다시 열어 batch 단위로 읽는다. 이때 meta.json과 scale statistics는 Sampler, Scaler, precision policy가 같은 데이터 구조를 바라보게 만드는 기준점이 된다.

English

train() and predict() do not pass user input directly to the model. They first stage the data as a memmap dataset, and the worker runtime reopens the staged result and reads it in batches. At this point, meta.json and scale statistics become the shared reference that allows Sampler, Scaler, and precision policy to operate on the same data structure.

flowchart TD
    A["사용자 입력 데이터<br/>User Input Data"] --> B["canonical key 정리<br/>Canonical Key Cleanup<br/>X / Y / row_ids"]
    B --> C["stream_memmap<br/>2-pass staging"]

    C --> D["features.mmt"]
    C --> E["labels.mmt"]
    C --> F["meta.json"]
    C --> G["scale statistics"]

    D --> H["Sampler<br/>slice / gather<br/>train-val split<br/>distributed shard"]
    E --> H
    F --> H

    H --> I["Governor<br/>batch scale adjustment"]
    I --> H

    H --> J["Multiplexer<br/>multi-source sampling"]
    J --> K["Mapper<br/>parallel map / collate"]
    K --> L["Loader"]
    L --> M["Stream prefetch<br/>pinned memory<br/>accelerator stream"]
    M --> N["device batch"]
    N --> O["Model.forward"]

핵심 설계 / Core Design

설계 / Design	한국어	English
`memmap-first`	대용량 데이터를 한 번에 메모리에 올리지 않고 파일 기반으로 staging한다.	Stages large data on disk instead of loading everything into memory at once.
2-pass 변환
2-pass conversion	첫 번째 pass에서 shape와 scale statistics를 파악하고, 두 번째 pass에서 실제 memmap을 쓴다.	Uses the first pass to inspect shape and scale statistics, then writes the actual memmap in the second pass.
canonical key	다양한 입력 형태를 `X`, `Y`, `row_ids` 중심으로 정리한다.	Normalizes different input forms around `X`, `Y`, and `row_ids`.
scale-aware dtype	데이터 scale 통계를 보고 저장 dtype과 runtime precision에 필요한 정보를 남긴다.	Records information needed for storage dtype and runtime precision based on data scale statistics.
adaptive batch	`Governor`가 batch scale을 조정할 수 있다.	`Governor` can adjust batch scale.
distributed shard	worker/rank별로 dataset range를 나눠 읽을 수 있다.	Dataset ranges can be divided by worker or rank.
prefetch pipeline	Loader와 Stream이 host-to-device 전송을 겹쳐 실행한다.	Loader and Stream overlap host-to-device transfers.
pinned memory reuse	`TensorPagePool`을 통해 pinned CPU memory page를 재사용할 수 있다.	`TensorPagePool` can reuse pinned CPU memory pages.

한국어