Triton Model Repository
Why This Matters
The Triton model repository is the deployment contract. It describes what is served, which versions exist, which backend executes them, what tensor shapes are valid, how batching works, and how instances are placed.
If this contract is sloppy, operations become guesswork.
flowchart LR
  Train[Training pipeline] --> Export[Export model artifact]
  Export --> Validate[Validate config and shapes]
  Validate --> Package[Package artifact bundle]
  Package --> Registry[Registry or object store]
  Registry --> Promote[Promotion workflow]
  Promote --> Repo[Model repository]
  Repo --> Triton[Triton load]
  Triton --> Synthetic[Synthetic inference gate]
Repository Layout
Canonical shape:
model_repository/
  resnet50/
    config.pbtxt
    1/
      model.plan
    2/
      model.plan
The version directories are part of the contract. Do not use mutable “latest” semantics inside production paths.
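A layout check along these lines can run before promotion. This is a minimal sketch, not Triton's own validation: `check_layout` is a hypothetical helper that takes a listing of repository-relative file paths (for example from an object-store listing). It assumes a house policy that every model ships an explicit `config.pbtxt`, which is stricter than Triton itself requires for some backends, and it rejects any version directory that is not a positive integer without a leading zero.

```python
import re

# Version directories must be numeric and must not start with zero.
VERSION_DIR = re.compile(r"^[1-9][0-9]*$")


def check_layout(paths):
    """Return a list of problems found in repository-relative file paths."""
    problems = []
    models = {}
    for path in paths:
        parts = path.strip("/").split("/")
        models.setdefault(parts[0], []).append(parts[1:])
    for model, entries in sorted(models.items()):
        # House policy (assumption): config.pbtxt is always checked in.
        if ["config.pbtxt"] not in entries:
            problems.append(f"{model}: missing config.pbtxt")
        # Any first-level directory under the model is treated as a version.
        versions = {e[0] for e in entries if len(e) >= 2}
        if not versions:
            problems.append(f"{model}: no version directory")
        for version in sorted(versions):
            if not VERSION_DIR.match(version):
                problems.append(f"{model}: invalid version directory '{version}'")
    return problems
```

A path such as `resnet50/latest/model.plan` is flagged, which is the point: mutable names never reach the production repository.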
Minimal Config
A minimal model config must identify:
- Backend/platform.
- `max_batch_size`.
- Input tensors: name, data type, dimensions.
- Output tensors: name, data type, dimensions.
Conceptual example:
name: "image_classifier"
backend: "tensorrt"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]
output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [1000]
  }
]
Interview trap:
`max_batch_size` changes the expected tensor contract. If the model was not exported with a batch dimension, enabling batching can break requests or produce shape errors.
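The shape rule behind this trap can be sketched in a few lines. When `max_batch_size` is greater than zero, the `dims` in the config exclude the batch dimension and the full tensor shape is `[-1] + dims`; when it is zero, `dims` is the full shape. The helpers `full_shape` and `request_ok` are hypothetical names for illustration, not part of any Triton client library.

```python
def full_shape(max_batch_size, dims):
    """Full tensor shape implied by the config; -1 marks a variable dimension."""
    if max_batch_size > 0:
        return [-1] + list(dims)  # implicit leading batch dimension
    return list(dims)


def request_ok(max_batch_size, dims, request_shape):
    """Check a concrete request shape against the configured contract."""
    expected = full_shape(max_batch_size, dims)
    if len(request_shape) != len(expected):
        return False
    # With batching enabled, the first dimension is the batch size.
    if max_batch_size > 0 and not 1 <= request_shape[0] <= max_batch_size:
        return False
    return all(e == -1 or e == r for e, r in zip(expected, request_shape))
```

With the conceptual config above, a request shaped `[8, 3, 224, 224]` passes, while `[3, 224, 224]` fails because the batch dimension is missing.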
Config Fields To Know
| Field | Senior meaning |
|---|---|
| `backend` / `platform` | Runtime path and dependency surface. |
| `max_batch_size` | Whether Triton can batch requests and what the first dimension means. |
| `input` / `output` | API contract and validation boundary. |
| `dynamic_batching` | Queueing and throughput/p99 tradeoff. |
| `instance_group` | How many model instances and where they run. |
| `version_policy` | Which model versions can be served. |
| `optimization` | Backend/runtime optimization knobs. |
| `parameters` | Backend-specific behavior. |
Version Policy
Use version policy intentionally:
- Serve only the current production version.
- Serve latest N versions for rollback.
- Serve specific versions during migration/canary.
Pitfall:
Serving multiple versions is not a rollout strategy by itself. You still need traffic routing, validation, and rollback control.
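To reason about which versions a policy exposes, a small model of the selection rules helps. The sketch below mirrors the three policy kinds (`latest` with a version count, `all`, and `specific`) and assumes the default of serving only the single highest-numbered version. `served_versions` and its tuple encoding of the policy are hypothetical, chosen for illustration. It says nothing about traffic routing, which stays outside the repository.

```python
def served_versions(available, policy=("latest", 1)):
    """Return the versions a policy would make available, in ascending order.

    policy is a (kind, argument) tuple:
      ("latest", n)       -> the n highest-numbered versions
      ("all", None)       -> every version present
      ("specific", [...]) -> only the listed versions that exist
    """
    versions = sorted(available)
    kind, arg = policy
    if kind == "all":
        return versions
    if kind == "latest":
        return versions[-arg:] if arg > 0 else []
    if kind == "specific":
        wanted = set(arg)
        return [v for v in versions if v in wanted]
    raise ValueError(f"unknown policy kind: {kind}")
```

For versions 1, 2, and 3, `("latest", 2)` keeps 2 and 3 loaded, which gives a rollback target without serving everything.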
Instance Groups
Instance groups control model execution instances:
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]
Why it matters:
- More instances can improve concurrency.
- More instances consume memory.
- Too many instances can hurt p99 through contention.
- Placement interacts with MIG, whole GPU scheduling, and Kubernetes pod layout.
Senior answer:
I would choose instance count empirically with perf_analyzer/Model Analyzer against representative traffic, then validate under production-like co-tenancy.
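Before any measurement, a memory bound narrows the search space. The sketch below is a back-of-envelope upper limit only, under the assumption that each instance has a roughly fixed footprint; `max_instances_by_memory` and all of its parameters are hypothetical, and the real count still comes from measurement under load.

```python
def max_instances_by_memory(gpu_memory_mib, per_instance_mib,
                            reserved_mib=0, headroom_pct=10):
    """Upper bound on instances per GPU from memory alone.

    headroom_pct is held back for fragmentation and spikes;
    reserved_mib covers co-tenant models or runtime overhead.
    """
    if per_instance_mib <= 0:
        raise ValueError("per_instance_mib must be positive")
    usable = gpu_memory_mib * (100 - headroom_pct) // 100 - reserved_mib
    return max(0, usable // per_instance_mib)
```

The result caps the range worth benchmarking. It does not predict p99, since contention can degrade latency well below the memory limit.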
Dynamic Shapes
Dynamic dimensions are common for NLP, images, and variable request shapes. Operationally:
- Wider shape flexibility often means harder batching.
- Large shapes can dominate p99 and GPU memory.
- Input validation and admission control matter.
- Metrics should break down by request shape or cost class.
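One way to make admission control and per-shape metrics concrete is to bucket requests by element count. This is a minimal sketch: the class names and thresholds in `COST_CLASSES` are hypothetical and would be derived from measured latency and memory per shape, and element count is only a rough proxy for cost.

```python
from math import prod

# Hypothetical limits on total element count per request, smallest first.
COST_CLASSES = [
    ("small", 64 * 1024),
    ("medium", 1024 * 1024),
    ("large", 8 * 1024 * 1024),
]


def cost_class(shape):
    """Return the cost class for a request shape, or None to reject it."""
    elements = prod(shape)
    for name, limit in COST_CLASSES:
        if elements <= limit:
            return name
    return None  # over the largest class: reject at admission
```

The returned class can double as a metrics label, so p99 is reported per cost class rather than averaged across very different request sizes.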
Model Repository Storage
Options:
- Baked into image: reproducible but huge images and slower build/promotion.
- Mounted volume: simple for shared clusters but can be bottlenecked.
- Object store sync/init container: flexible but needs caching, checksums, and failure handling.
- Sidecar/prefetch agent: good for large artifacts and warm pools.
Production stance:
I would avoid every pod independently downloading a massive model from object storage during scale-out. That creates a storage-induced outage exactly when capacity is needed.
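A node-local cache keyed by checksum is one way to avoid repeated downloads. The sketch below is an assumption-laden illustration: `fetch_cached` is a hypothetical helper, `download` stands in for whatever client pulls from the object store, and there is no locking between concurrent fetchers, which a real prefetch agent would need.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large engines do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def fetch_cached(cache_dir, expected_sha256, download):
    """Return a verified local copy, calling download(dest_path) only on a miss."""
    cache = Path(cache_dir)
    target = cache / expected_sha256
    if target.exists() and sha256_of(target) == expected_sha256:
        return target  # cache hit, already verified
    cache.mkdir(parents=True, exist_ok=True)
    partial = target.with_suffix(".partial")
    download(partial)
    if sha256_of(partial) != expected_sha256:
        partial.unlink()
        raise ValueError("checksum mismatch for downloaded artifact")
    partial.replace(target)  # publish only after verification
    return target
```

Writing to a `.partial` file and renaming after verification means a crashed download never looks like a valid artifact.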
Config As Code
Treat config.pbtxt like production code:
- Reviewed.
- Versioned.
- Diffed.
- Tested.
- Validated against target GPU SKU.
- Linked to model artifact checksum.
Pre-promotion checks:
- Repository layout valid.
- Model loads.
- Metadata endpoint matches expected IO.
- Synthetic inference passes.
- Resource envelope matches measured memory/latency.
- Metrics labels are sane.
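The metadata check can be sketched as a comparison of two dictionaries. This assumes the metadata has already been fetched from the server's model metadata endpoint and parsed into a dict with `inputs` and `outputs` lists, each entry carrying `name`, `datatype`, and `shape`. `io_mismatches` is a hypothetical helper, and the expected contract would come from the artifact bundle.

```python
def io_mismatches(metadata, expected):
    """Compare reported tensor metadata with the expected IO contract."""
    problems = []
    for section in ("inputs", "outputs"):
        actual = {t["name"]: (t["datatype"], list(t["shape"]))
                  for t in metadata.get(section, [])}
        wanted = {t["name"]: (t["datatype"], list(t["shape"]))
                  for t in expected.get(section, [])}
        # Union of names catches both missing and unexpected tensors.
        for name in sorted(wanted.keys() | actual.keys()):
            if actual.get(name) != wanted.get(name):
                problems.append(
                    f"{section}/{name}: expected {wanted.get(name)}, "
                    f"got {actual.get(name)}"
                )
    return problems
```

An empty list means the served model matches the contract; anything else should block promotion.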
Artifact Contract
Version together:
- Model file/engine.
- `config.pbtxt`.
- Tokenizer or pre/postprocessing config.
- Runtime image.
- Backend version.
- Resource envelope.
- Validation report.
Do not version together unless necessary:
- GPU driver rollout.
- Kubernetes upgrade.
- CNI change.
- Unrelated model config changes.
flowchart TD
  Artifact[Artifact bundle] --> Model[Model file or engine]
  Artifact --> Config[config.pbtxt]
  Artifact --> Runtime[Runtime image digest]
  Artifact --> Aux[Tokenizer and preprocess assets]
  Artifact --> Envelope[Resource envelope]
  Artifact --> Tests[Golden input tests]
  Artifact --> Owner[Owner and rollback metadata]
  Config --> Admission[Admission validation]
  Envelope --> Admission
  Tests --> Admission
  Admission --> Deployable[Deployable model version]
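One way to bind the bundle together is a manifest whose identifier is derived from every component, so changing any one of them yields a new deployable version. The `ArtifactBundle` class and its field names are hypothetical, and a real manifest would also carry the resource envelope, golden tests, and owner metadata.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArtifactBundle:
    model_sha256: str          # model file or engine
    config_sha256: str         # config.pbtxt
    aux_sha256: str            # tokenizer and pre/postprocessing assets
    runtime_image_digest: str  # container image digest, not a mutable tag
    backend_version: str

    def bundle_id(self):
        """Deterministic ID over all fields; any change produces a new ID."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Promotion and rollback then reference `bundle_id()` rather than a path, which keeps the model, config, and runtime from drifting apart.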
Common Config Bugs
| Bug | Symptom |
|---|---|
| Wrong tensor name | Client gets validation error. |
| Wrong dims | Shape mismatch or bad inference. |
| Bad `max_batch_size` | Unexpected batch dimension errors. |
| Too many instances | GPU memory OOM or p99 regression. |
| Missing version policy | Wrong model version served. |
| Backend mismatch | Model load failure. |
| Mutable artifact path | Reproducibility and rollback failure. |
Staff-Level Close
The model repository is the production API between ML and platform teams. I would enforce it with schema checks, load tests, synthetic inference, artifact checksums, and rollout gates before a model is allowed to receive traffic.