Triton Model Repository

Why This Matters

The Triton model repository is the deployment contract. It describes what is served, which versions exist, which backend executes them, what tensor shapes are valid, how batching works, and how instances are placed.

If this contract is sloppy, operations become guesswork.

flowchart LR
  Train[Training pipeline] --> Export[Export model artifact]
  Export --> Validate[Validate config and shapes]
  Validate --> Package[Package artifact bundle]
  Package --> Registry[Registry or object store]
  Registry --> Promote[Promotion workflow]
  Promote --> Repo[Model repository]
  Repo --> Triton[Triton load]
  Triton --> Synthetic[Synthetic inference gate]

Repository Layout

Canonical shape:

model_repository/
  resnet50/
    config.pbtxt
    1/
      model.plan
    2/
      model.plan

The version directories are part of the contract. Do not use mutable “latest” semantics inside production paths.
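The layout rules above can be checked mechanically before promotion. The following is a minimal sketch, not a Triton tool: `check_model_dir` and `is_version_dir` are hypothetical helper names, and the decision to reject any non-numeric subdirectory (for example a `latest` alias) is an assumed policy that is stricter than what Triton itself enforces at load time.

```python
from pathlib import Path


def is_version_dir(name: str) -> bool:
    """True for names such as '1' or '12'; false for 'latest', '01', or '0'."""
    return name.isascii() and name.isdigit() and not name.startswith("0")


def check_model_dir(model_dir: Path) -> list:
    """Return layout problems for one model directory (empty list means OK)."""
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append(f"{model_dir.name}: missing config.pbtxt")

    subdirs = sorted(p for p in model_dir.iterdir() if p.is_dir())
    versions = [p for p in subdirs if is_version_dir(p.name)]
    for p in subdirs:
        if p not in versions:
            # Assumed policy: mutable aliases such as "latest" are rejected.
            problems.append(f"{model_dir.name}/{p.name}: not a numeric version directory")
    if not versions:
        problems.append(f"{model_dir.name}: no version directories")
    for v in versions:
        if not any(v.iterdir()):
            problems.append(f"{model_dir.name}/{v.name}: empty version directory")
    return problems
```

Running `check_model_dir(Path("model_repository/resnet50"))` in CI and failing on a non-empty result keeps layout drift out of the promotion path.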

Minimal Config

A minimal model config must identify:

Backend/platform.
max_batch_size.
Input tensors: name, data type, dimensions.
Output tensors: name, data type, dimensions.

Conceptual example:

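# If set, name must match the model's directory name in the repository.
name: "image_classifier"
backend: "tensorrt"
# Greater than 0 enables batching; dims below then exclude the batch dimension.
max_batch_size: 16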

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

Interview trap:

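max_batch_size changes the expected tensor contract. When it is greater than 0, Triton treats the batch dimension as implicit and prepends it to the configured dims, so clients must send tensors with a leading batch dimension. If the model was not exported with a batch dimension, enabling batching can break requests or produce shape errors. Set max_batch_size to 0 for such models.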
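The effect of max_batch_size on the request shape can be made concrete with a small sketch. The helper names `full_shape` and `request_shape_ok` are hypothetical; the rule they encode (an implicit leading batch dimension when max_batch_size is greater than 0, and -1 for a variable-size dimension) follows Triton's model configuration conventions.

```python
def full_shape(max_batch_size: int, dims: list) -> list:
    """Shape expected on the wire; -1 marks a variable-size dimension."""
    # With max_batch_size > 0 the batch dimension is implicit in config.pbtxt
    # and is prepended to the configured dims.
    return ([-1] + list(dims)) if max_batch_size > 0 else list(dims)


def request_shape_ok(max_batch_size: int, dims: list, shape: list) -> bool:
    """Check a concrete request shape against the configured contract."""
    expected = full_shape(max_batch_size, dims)
    if len(shape) != len(expected):
        return False
    if max_batch_size > 0 and not (1 <= shape[0] <= max_batch_size):
        return False
    return all(e == -1 or e == s for e, s in zip(expected, shape))
```

With the config above (`max_batch_size: 16`, `dims: [3, 224, 224]`), a request shaped `[3, 224, 224]` is rejected because the batch dimension is missing, while `[8, 3, 224, 224]` is accepted.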

Config Fields To Know

Field	Senior meaning
`backend` / `platform`	Runtime path and dependency surface.
`max_batch_size`	Whether Triton can batch requests and what the first dimension means.
`input` / `output`	API contract and validation boundary.
`dynamic_batching`	Queueing and throughput/p99 tradeoff.
`instance_group`	How many model instances and where they run.
`version_policy`	Which model versions can be served.
`optimization`	Backend/runtime optimization knobs.
`parameters`	Backend-specific behavior.

Version Policy

Use version policy intentionally:

Serve only the current production version.
Serve latest N versions for rollback.
Serve specific versions during migration/canary.

Pitfall:

Serving multiple versions is not a rollout strategy by itself. You still need traffic routing, validation, and rollback control.
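To reason about which versions a given policy exposes, the three policy kinds can be modeled in a few lines. This is a sketch of the selection logic only, with a hypothetical function name; raising on a missing specific version is a choice made here for pre-promotion checks, not a description of Triton's runtime behavior. The comments show the corresponding `config.pbtxt` stanzas.

```python
def served_versions(available, policy="latest", num_versions=1, specific=()):
    """Return the version numbers a policy would expose, in ascending order."""
    ordered = sorted(available)
    if policy == "all":
        # version_policy: { all { } }
        return ordered
    if policy == "latest":
        # version_policy: { latest { num_versions: 2 } }
        # When no version_policy is given, Triton serves the latest 1 version.
        return ordered[-num_versions:] if num_versions > 0 else []
    if policy == "specific":
        # version_policy: { specific { versions: [1, 3] } }
        missing = sorted(set(specific) - set(ordered))
        if missing:
            raise ValueError(f"requested versions not in repository: {missing}")
        return sorted(specific)
    raise ValueError(f"unknown policy: {policy}")
```

For a repository holding versions 1, 2, and 3, the default exposes only 3, while `latest` with `num_versions=2` exposes 2 and 3, which keeps one rollback target loaded.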

Instance Groups

Instance groups control model execution instances:

instance_group [
  {
    count: 2  # instances created on each GPU listed in gpus
    kind: KIND_GPU
    gpus: [0]
  }
]

Why it matters:

More instances can improve concurrency.
More instances consume memory.
Too many instances can hurt p99 through contention.
Placement interacts with MIG, whole-GPU scheduling, and Kubernetes pod layout.

Senior answer:

I would choose instance count empirically with perf_analyzer/Model Analyzer against representative traffic, then validate under production-like co-tenancy.
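Once measurements exist, the selection step itself is simple to encode. The sketch below assumes each benchmark run has already been reduced to one record per instance count; the `Trial` fields and the selection rule (highest throughput within a p99 SLO and a GPU memory budget) are assumptions for illustration, not output formats of perf_analyzer or Model Analyzer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Trial:
    instances: int         # instance_group count used for the run
    throughput_rps: float  # measured requests per second
    p99_ms: float          # measured p99 latency
    gpu_mem_mb: float      # measured peak GPU memory


def pick_instance_count(trials: List[Trial], p99_slo_ms: float,
                        gpu_mem_budget_mb: float) -> Optional[Trial]:
    """Return the best trial within SLO and memory budget, or None."""
    ok = [t for t in trials
          if t.p99_ms <= p99_slo_ms and t.gpu_mem_mb <= gpu_mem_budget_mb]
    if not ok:
        return None
    # Prefer throughput; break ties with fewer instances to leave headroom.
    return max(ok, key=lambda t: (t.throughput_rps, -t.instances))
```

A `None` result is a useful signal in its own right: no tested count meets the envelope, so the model, the SLO, or the GPU SKU needs to change.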

Dynamic Shapes

Dynamic dimensions (written as -1 in dims) are common for NLP sequences, variable-size images, and other variable request shapes. Operationally:

Wider shape flexibility often means harder batching.
Large shapes can dominate p99 and GPU memory.
Input validation and admission control matter.
Metrics should break down by request shape or cost class.
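One way to combine admission control with shape-aware metrics is to map each request shape to a small, bounded set of cost classes. This is a minimal sketch: the thresholds in `COST_CLASSES` are placeholder values to be replaced with limits derived from measured latency and memory, and element count is only a rough proxy for cost.

```python
from math import prod

# Hypothetical thresholds in elements per request; tune from measurements.
COST_CLASSES = [(50_000, "small"), (500_000, "medium"), (2_000_000, "large")]


def cost_class(shape: list) -> str:
    """Map a concrete request shape to a bounded label, or raise if too large."""
    if any(d <= 0 for d in shape):
        raise ValueError("shape must be fully resolved and positive")
    elements = prod(shape)
    for limit, label in COST_CLASSES:
        if elements <= limit:
            return label
    # Admission control: reject instead of letting one request dominate p99.
    raise ValueError(f"request of {elements} elements exceeds admission limit")
```

Using the returned label as a metrics dimension keeps cardinality fixed, which labeling by raw shape would not.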

Model Repository Storage

Options:

Baked into image: reproducible, but produces huge images and slows build and promotion.
Mounted volume: simple for shared clusters, but the shared volume can become a throughput bottleneck.
Object store sync/init container: flexible but needs caching, checksums, and failure handling.
Sidecar/prefetch agent: good for large artifacts and warm pools.

Production stance:

I would avoid having every pod independently download a massive model from object storage during scale-out. That pattern can turn a scale-out into a storage-induced outage exactly when capacity is needed.
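A node-local, checksum-keyed cache is one way to keep scale-out from hammering object storage. The sketch below assumes the caller supplies a `fetch(dest)` callable that downloads the artifact to a given path; `ensure_artifact` and `sha256_of` are hypothetical helpers, and concurrent callers on the same node would additionally need a lock, which is omitted here.

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Stream the file so large engines do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def ensure_artifact(cache_dir: Path, expected_sha256: str,
                    fetch: Callable[[Path], None]) -> Path:
    """Return a verified local copy, calling fetch(dest) only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    final = cache_dir / expected_sha256
    if final.is_file() and sha256_of(final) == expected_sha256:
        return final  # cache hit: no object-store traffic
    tmp = final.with_suffix(".partial")
    fetch(tmp)
    if sha256_of(tmp) != expected_sha256:
        tmp.unlink()
        raise ValueError("checksum mismatch for downloaded artifact")
    tmp.replace(final)  # rename within the same directory
    return final
```

Keying the cache by checksum means a corrupted or partially written file is never served, and a second pod on the same node reuses the first pod's download.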

Config As Code

Treat config.pbtxt like production code:

Reviewed.
Versioned.
Diffed.
Tested.
Validated against target GPU SKU.
Linked to model artifact checksum.

Pre-promotion checks:

Repository layout valid.
Model loads.
Metadata endpoint matches expected IO.
Synthetic inference passes.
Resource envelope matches measured memory/latency.
Metrics labels are sane.
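The metadata check in the list above can be automated by comparing the server's model metadata with the expected IO contract. The sketch assumes the metadata has already been fetched and parsed into a dict with `inputs` and `outputs` lists of `name`, `datatype`, and `shape` entries, which is the layout used by the v2 inference protocol's model metadata response; `io_mismatches` is a hypothetical helper.

```python
def io_mismatches(metadata: dict, expected: dict) -> list:
    """Return human-readable differences between served and expected IO."""
    problems = []
    for section in ("inputs", "outputs"):
        actual = {t["name"]: (t["datatype"], list(t["shape"]))
                  for t in metadata.get(section, [])}
        wanted = {t["name"]: (t["datatype"], list(t["shape"]))
                  for t in expected.get(section, [])}
        for name in sorted(wanted.keys() - actual.keys()):
            problems.append(f"{section}: missing tensor {name!r}")
        for name in sorted(actual.keys() - wanted.keys()):
            problems.append(f"{section}: unexpected tensor {name!r}")
        for name in sorted(wanted.keys() & actual.keys()):
            if wanted[name] != actual[name]:
                problems.append(
                    f"{section}: {name!r} is {actual[name]}, expected {wanted[name]}")
    return problems
```

Note that metadata reports datatypes without the `TYPE_` prefix (for example `FP32`) and includes the batch dimension as -1 when batching is enabled, so the expected contract should be written in that form.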

Artifact Contract

Version together:

Model file/engine.
config.pbtxt.
Tokenizer or pre/postprocessing config.
Runtime image.
Backend version.
Resource envelope.
Validation report.

Do not version together unless necessary:

GPU driver rollout.
Kubernetes upgrade.
CNI change.
Unrelated model config changes.

flowchart TD
  Artifact[Artifact bundle] --> Model[Model file or engine]
  Artifact --> Config[config.pbtxt]
  Artifact --> Runtime[Runtime image digest]
  Artifact --> Aux[Tokenizer and preprocess assets]
  Artifact --> Envelope[Resource envelope]
  Artifact --> Tests[Golden input tests]
  Artifact --> Owner[Owner and rollback metadata]

  Config --> Admission[Admission validation]
  Envelope --> Admission
  Tests --> Admission
  Admission --> Deployable[Deployable model version]
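The bundle described in this section can be represented as a manifest with a deterministic identifier, so that a change to any pinned component yields a new deployable version. This is a sketch under assumptions: the field names in `REQUIRED_FIELDS` are illustrative rather than a standard schema, and the manifest is taken to be a JSON-serializable dict.

```python
import hashlib
import json

# Hypothetical field names mirroring the "version together" list.
REQUIRED_FIELDS = (
    "model_sha256", "config_sha256", "runtime_image_digest",
    "backend_version", "resource_envelope", "validation_report",
)


def missing_fields(manifest: dict) -> list:
    """Return required fields that are absent or empty."""
    return [k for k in REQUIRED_FIELDS if not manifest.get(k)]


def bundle_id(manifest: dict) -> str:
    """Deterministic ID: changing any pinned component changes the ID."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Admission can then refuse any manifest for which `missing_fields` is non-empty, and rollback metadata can refer to a previous `bundle_id` rather than to a mutable path.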

Common Config Bugs

Bug	Symptom
Wrong tensor name	Client gets validation error.
Wrong dims	Shape mismatch or bad inference.
Bad `max_batch_size`	Unexpected batch dimension errors.
Too many instances	GPU memory OOM or p99 regression.
Missing version policy	Wrong model version served (the default serves only the latest version).
Backend mismatch	Model load failure.
Mutable artifact path	Reproducibility and rollback failure.

Staff-Level Close

The model repository is the production API between ML and platform teams. I would enforce it with schema checks, load tests, synthetic inference, artifact checksums, and rollout gates before a model is allowed to receive traffic.