Triton Model Repository

The Triton model repository is the deployment contract. It describes what is served, which versions exist, which backend executes them, what tensor shapes are valid, how batching works, and how instances are placed.

If this contract is sloppy, operations become guesswork.

flowchart LR
  Train[Training pipeline] --> Export[Export model artifact]
  Export --> Validate[Validate config and shapes]
  Validate --> Package[Package artifact bundle]
  Package --> Registry[Registry or object store]
  Registry --> Promote[Promotion workflow]
  Promote --> Repo[Model repository]
  Repo --> Triton[Triton load]
  Triton --> Synthetic[Synthetic inference gate]

Canonical layout:

model_repository/
  resnet50/
    config.pbtxt
    1/
      model.plan
    2/
      model.plan

The version directories are part of the contract. Do not use mutable “latest” semantics inside production paths.
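
The layout rule above can be checked mechanically before promotion. The following is a minimal sketch, not Triton's own validation: the function name `check_repository_layout` is hypothetical, and rejecting every non-numeric subdirectory is a policy assumed here for illustration.

```python
import re
from pathlib import Path

# Policy assumed here: version directories are positive integers without leading zeros.
VERSION_DIR = re.compile(r"[1-9][0-9]*")


def check_repository_layout(repo_root):
    """Return a list of layout problems; an empty list means no problems were found."""
    problems = []
    for model_dir in sorted(p for p in Path(repo_root).iterdir() if p.is_dir()):
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        subdirs = sorted(p for p in model_dir.iterdir() if p.is_dir())
        versions = [p for p in subdirs if VERSION_DIR.fullmatch(p.name)]
        for p in subdirs:
            if p not in versions:
                # Rejects mutable-looking names such as "latest" or "prod".
                problems.append(f"{model_dir.name}/{p.name}: not a numeric version directory")
        if not versions:
            problems.append(f"{model_dir.name}: no version directory")
        for v in versions:
            if not any(v.iterdir()):
                problems.append(f"{model_dir.name}/{v.name}: version directory is empty")
    return problems
```

Run against the repository root in CI, a non-empty result would block promotion.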

A minimal model config must identify:

  • Backend/platform.
  • max_batch_size.
  • Input tensors: name, data type, dimensions.
  • Output tensors: name, data type, dimensions.

Conceptual example:

name: "image_classifier"
backend: "tensorrt"
max_batch_size: 16

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

Interview trap:

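max_batch_size changes the expected tensor contract. When max_batch_size is greater than 0, Triton treats the configured dims as the per-item shape and expects a leading batch dimension on every input and output tensor. If the model was not exported with a batch dimension, enabling batching can break requests or produce shape errors.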
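
To make the contract concrete, here is a small sketch of the shape a client would have to send for a given config. The helper name `expected_request_shape` is hypothetical; it only models the convention that a non-zero max_batch_size adds a leading batch dimension in front of the configured dims.

```python
def expected_request_shape(dims, max_batch_size, batch_size=1):
    """Return the full tensor shape a request must carry for one input."""
    if max_batch_size == 0:
        # Batching disabled: dims in config.pbtxt already describe the complete shape.
        return list(dims)
    if not 1 <= batch_size <= max_batch_size:
        raise ValueError(f"batch_size {batch_size} outside 1..{max_batch_size}")
    # Batching enabled: dims are per-item, so the batch dimension goes first.
    return [batch_size] + list(dims)


# With the conceptual config above (max_batch_size: 16, dims: [3, 224, 224]):
print(expected_request_shape([3, 224, 224], max_batch_size=16, batch_size=4))
```

With max_batch_size set to 0, the same dims would describe the whole tensor, which is why flipping the value changes what clients must send.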

| Field | Senior meaning |
| --- | --- |
| backend / platform | Runtime path and dependency surface. |
| max_batch_size | Whether Triton can batch requests and what the first dimension means. |
| input / output | API contract and validation boundary. |
| dynamic_batching | Queueing and throughput/p99 tradeoff. |
| instance_group | How many model instances and where they run. |
| version_policy | Which model versions can be served. |
| optimization | Backend/runtime optimization knobs. |
| parameters | Backend-specific behavior. |

Use version policy intentionally:

  • Serve only the current production version.
  • Serve latest N versions for rollback.
  • Serve specific versions during migration/canary.

Pitfall:

Serving multiple versions is not a rollout strategy by itself. You still need traffic routing, validation, and rollback control.
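
As an illustration of what each policy exposes, the sketch below resolves the set of served versions from the version directories that exist. The function `served_versions` is hypothetical; its policy names mirror the all, latest, and specific choices of version_policy, and it does not model traffic routing at all.

```python
def served_versions(available, policy="latest", num_versions=1, specific=()):
    """Return the sorted versions that a given policy would make servable."""
    versions = sorted(set(available))
    if policy == "all":
        return versions
    if policy == "latest":
        if num_versions < 1:
            raise ValueError("num_versions must be at least 1")
        # Highest-numbered versions win; older ones are not served.
        return versions[-num_versions:]
    if policy == "specific":
        wanted = set(specific)
        return [v for v in versions if v in wanted]
    raise ValueError(f"unknown policy: {policy}")


print(served_versions([1, 2, 3]))                   # production only
print(served_versions([1, 2, 3], num_versions=2))   # keep one rollback target
print(served_versions([1, 2, 3], "specific", specific=(1, 3)))  # migration/canary
```

Which of the served versions actually receives traffic is still decided by the client or router, which is the point of the pitfall above.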

Instance groups control model execution instances:

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]

Why it matters:

  • More instances can improve concurrency.
  • More instances consume memory.
  • Too many instances can hurt p99 through contention.
  • Placement interacts with MIG, whole GPU scheduling, and Kubernetes pod layout.

Senior answer:

I would choose instance count empirically with perf_analyzer/Model Analyzer against representative traffic, then validate under production-like co-tenancy.
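
One way to turn that answer into a repeatable decision is to record one measurement per candidate instance count and select against explicit budgets. This is a sketch under assumptions: the `Measurement` record and `pick_instance_count` are hypothetical names, and the numbers in the usage lines are made up for illustration, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    instance_count: int
    throughput_rps: float
    p99_ms: float
    gpu_memory_mb: float


def pick_instance_count(measurements, p99_budget_ms, memory_budget_mb):
    """Pick the highest-throughput count within both budgets; ties go to fewer instances."""
    eligible = [
        m for m in measurements
        if m.p99_ms <= p99_budget_ms and m.gpu_memory_mb <= memory_budget_mb
    ]
    if not eligible:
        return None  # No candidate fits; revisit the budgets or the model.
    best = max(eligible, key=lambda m: (m.throughput_rps, -m.instance_count))
    return best.instance_count


# Illustrative numbers only.
runs = [
    Measurement(1, 400.0, 12.0, 3000.0),
    Measurement(2, 700.0, 18.0, 6000.0),
    Measurement(4, 760.0, 45.0, 12000.0),
]
print(pick_instance_count(runs, p99_budget_ms=25.0, memory_budget_mb=10000.0))
```

The selection is only as good as the traffic used to produce the measurements, so the co-tenancy validation step still applies.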

Dynamic dimensions (declared as -1 in dims) are common for NLP sequence lengths, variable-size images, and other inputs whose shape changes per request. Operationally:

  • Wider shape flexibility often means harder batching.
  • Large shapes can dominate p99 and GPU memory.
  • Input validation and admission control matter.
  • Metrics should break down by request shape or cost class.
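
A simple way to combine admission control with per-class metrics is to bucket each request by element count. The sketch below assumes hypothetical thresholds in `COST_CLASSES`; real limits would come from measured latency and memory per shape.

```python
from math import prod

# Hypothetical upper bounds in elements per request, smallest class first.
COST_CLASSES = [
    ("small", 64 * 1024),
    ("medium", 1024 * 1024),
    ("large", 8 * 1024 * 1024),
]


def classify_request(shape):
    """Return a cost class label for a concrete shape, or None to reject it."""
    if any(d <= 0 for d in shape):
        raise ValueError("request shapes must be concrete and positive")
    elements = prod(shape)
    for name, limit in COST_CLASSES:
        if elements <= limit:
            return name
    return None  # Above the largest class: reject before it reaches the GPU.


print(classify_request([1, 128]))            # short token sequence
print(classify_request([1, 3, 224, 224]))    # single image
print(classify_request([16, 3, 1024, 1024])) # oversized batch of large images
```

The returned label can double as a metrics label, so latency and memory can be broken down by cost class rather than averaged across all shapes.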

Options for delivering the model repository to serving pods:

  • Baked into image: reproducible, but images become huge and build/promotion slows down.
  • Mounted volume: simple for shared clusters but can be bottlenecked.
  • Object store sync/init container: flexible but needs caching, checksums, and failure handling.
  • Sidecar/prefetch agent: good for large artifacts and warm pools.

Production stance:

I would avoid every pod independently downloading a massive model from object storage during scale-out. That creates a storage-induced outage exactly when capacity is needed.
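
The caching and checksum handling mentioned for the object-store option might look like the sketch below. It is a single-process illustration under assumptions: `ensure_artifact` and the `fetch` callback are hypothetical, and `fetch` stands in for whatever download client is actually used.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large engines do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_artifact(cache_dir, expected_sha256, fetch):
    """Return a verified local path, calling fetch(dest_path) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    final = cache_dir / expected_sha256
    if final.is_file() and sha256_of(final) == expected_sha256:
        return final  # Cache hit: no traffic to the object store.
    partial = cache_dir / (expected_sha256 + ".partial")
    fetch(partial)
    if sha256_of(partial) != expected_sha256:
        partial.unlink()
        raise ValueError("downloaded artifact does not match expected checksum")
    # Rename only after verification so readers never see a partial file.
    os.replace(partial, final)
    return final
```

Keying the cache by checksum means a node-local or shared cache can serve scale-out without every pod pulling the same bytes again.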

Treat config.pbtxt like production code:

  • Reviewed.
  • Versioned.
  • Diffed.
  • Tested.
  • Validated against target GPU SKU.
  • Linked to model artifact checksum.

Pre-promotion checks:

  • Repository layout valid.
  • Model loads.
  • Metadata endpoint matches expected IO.
  • Synthetic inference passes.
  • Resource envelope matches measured memory/latency.
  • Metrics labels are sane.
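
The metadata check in particular is easy to automate. The sketch below compares an already-fetched metadata document against the expected IO; it assumes the document is a dict with `inputs` and `outputs` lists whose entries carry `name`, `datatype`, and `shape`, and the function name `io_mismatches` is hypothetical.

```python
def io_mismatches(metadata, expected_inputs, expected_outputs):
    """Return human-readable differences between served and expected IO."""
    problems = []
    for kind, expected in (("inputs", expected_inputs), ("outputs", expected_outputs)):
        actual = {
            t["name"]: (t["datatype"], list(t["shape"]))
            for t in metadata.get(kind, [])
        }
        for name, (datatype, shape) in expected.items():
            if name not in actual:
                problems.append(f"{kind}: missing tensor {name!r}")
            elif actual[name] != (datatype, list(shape)):
                problems.append(f"{kind}: {name!r} is {actual[name]}, expected {(datatype, list(shape))}")
        for name in sorted(actual.keys() - expected.keys()):
            problems.append(f"{kind}: unexpected tensor {name!r}")
    return problems


# Assumed metadata for the conceptual classifier, with -1 as the batch dimension.
metadata = {
    "inputs": [{"name": "input", "datatype": "FP32", "shape": [-1, 3, 224, 224]}],
    "outputs": [{"name": "probabilities", "datatype": "FP32", "shape": [-1, 1000]}],
}
print(io_mismatches(
    metadata,
    {"input": ("FP32", [-1, 3, 224, 224])},
    {"probabilities": ("FP32", [-1, 1000])},
))
```

An empty list would let the gate proceed to synthetic inference; anything else fails the promotion with a readable reason.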

Version together:

  • Model file/engine.
  • config.pbtxt.
  • Tokenizer or pre/postprocessing config.
  • Runtime image.
  • Backend version.
  • Resource envelope.
  • Validation report.

Do not version together unless necessary:

  • GPU driver rollout.
  • Kubernetes upgrade.
  • CNI change.
  • Unrelated model config changes.

flowchart TD
  Artifact[Artifact bundle] --> Model[Model file or engine]
  Artifact --> Config[config.pbtxt]
  Artifact --> Runtime[Runtime image digest]
  Artifact --> Aux[Tokenizer and preprocess assets]
  Artifact --> Envelope[Resource envelope]
  Artifact --> Tests[Golden input tests]
  Artifact --> Owner[Owner and rollback metadata]

  Config --> Admission[Admission validation]
  Envelope --> Admission
  Tests --> Admission
  Admission --> Deployable[Deployable model version]
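
The bundle in this diagram can be pinned down as a manifest whose hash identifies one deployable version. This is a sketch under assumptions: the `BundleManifest` fields and the `bundle_id` helper are hypothetical names chosen to mirror the items that are versioned together, not an existing schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BundleManifest:
    model_sha256: str          # model file or engine
    config_sha256: str         # config.pbtxt
    aux_sha256: str            # tokenizer and preprocess assets
    runtime_image_digest: str  # container image digest, not a mutable tag
    backend_version: str
    resource_envelope: dict    # e.g. measured memory and latency limits
    owner: str


def bundle_id(manifest):
    """Derive a stable identifier from the canonical JSON form of the manifest."""
    canonical = json.dumps(asdict(manifest), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because every component feeds the identifier, changing config.pbtxt without re-validating yields a different bundle, which makes silent drift between model and config visible at admission time.
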

| Bug | Symptom |
| --- | --- |
| Wrong tensor name | Client gets validation error. |
| Wrong dims | Shape mismatch or bad inference. |
| Bad max_batch_size | Unexpected batch dimension errors. |
| Too many instances | GPU memory OOM or p99 regression. |
| Missing version policy | Wrong model version served. |
| Backend mismatch | Model load failure. |
| Mutable artifact path | Reproducibility and rollback failure. |

The model repository is the production API between ML and platform teams. I would enforce it with schema checks, load tests, synthetic inference, artifact checksums, and rollout gates before a model is allowed to receive traffic.