Integration reference

Version 2026-04-01 · Example base URL https://infer.your-tenant.backend-dev.example (your deployment will differ).

Overview

After we install your model stack, your application talks to an OpenAI-compatible or plain JSON inference layer (which of the two we agree on during the project). This reference describes the control plane we often leave in place: keys, usage headers, batch evaluation jobs, and artefact pins (adapters, indexes).

All timestamps are RFC 3339 in UTC. Request bodies are UTF-8 JSON unless noted.
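
Timestamps such as 2026-04-05T14:02:11Z can be parsed with the Python standard library alone; a minimal sketch (parse_api_timestamp is an illustrative name, not part of the API):

```python
from datetime import datetime, timezone

def parse_api_timestamp(ts: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp as returned by the API.

    datetime.fromisoformat() accepts a trailing 'Z' directly on Python 3.11+;
    replacing it with '+00:00' keeps this working on older versions too.
    """
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

dt = parse_api_timestamp("2026-04-05T14:02:11Z")
print(dt, dt.tzinfo)
```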

Authentication

Send the deployment secret in the Authorization header. Rotate keys from the operator console we hand over.

Authorization: Bearer bd_infer_************************

Read-only analytics tokens use the bd_read_ prefix and cannot enqueue jobs or change artefacts.
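
A small client-side sketch of the two points above, assuming only the documented bd_infer_ and bd_read_ prefixes (the helper names are ours, not part of the API):

```python
def auth_headers(token: str) -> dict:
    """Build the headers every request to the inference layer needs."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }

def can_mutate(token: str) -> bool:
    """Client-side guard: bd_read_ analytics tokens cannot enqueue jobs
    or change artefacts, so skip the request rather than collect a 403."""
    return token.startswith("bd_infer_")
```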

Errors

Errors return JSON with a stable code, a human-readable message, and an optional details object.

{
  "error": {
    "code": "context_length_exceeded",
    "message": "prompt + max_tokens exceeds model window",
    "request_id": "req_8f2a9c1d4e",
    "details": { "model": "bd-custom-7b", "max_input_tokens": 8192 }
  }
}

Status  Meaning
400     Validation failure: bad JSON, missing field, or policy block
401     Missing or invalid token
403     Token cannot perform this action
404     Model or artefact ID unknown
429     Quota or concurrency limit; Retry-After is set
503     GPU pool warming or overloaded; retry with backoff
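
Only 429 and 503 are worth retrying. A sketch of the backoff decision, assuming Retry-After carries delta-seconds (the function name and jitter policy are our illustration, not prescribed by the API):

```python
import random

RETRYABLE = {429, 503}

def backoff_seconds(status, retry_after, attempt, base=0.5, cap=30.0):
    """Return how long to sleep before retrying, or None if not retryable.

    429 responses set Retry-After, which we honor directly; for 503
    (GPU pool warming) fall back to capped exponential backoff with jitter.
    """
    if status not in RETRYABLE:
        return None
    if retry_after is not None:
        return float(retry_after)
    # Full backoff for this attempt, scaled by jitter in [0.5, 1.0).
    return min(cap, base * 2 ** attempt) * (0.5 + random.random() / 2)
```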

Inference

POST /v1/chat/completions

OpenAI-style chat payload. model is the logical name we registered for your tuned weights.

curl -s https://infer.your-tenant…/v1/chat/completions \
  -H "Authorization: Bearer bd_infer_***" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bd-support-7b-2026-03",
    "messages": [
      {"role":"system","content":"You answer using only the provided context."},
      {"role":"user","content":"Where is my shipment?"}
    ],
    "temperature": 0.2,
    "max_tokens": 256
  }'
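
The same request body can be built programmatically; a minimal sketch mirroring the curl payload above (build_chat_payload is our illustrative helper, and the JSON shape is taken from the example):

```python
import json

def build_chat_payload(model, system, user, temperature=0.2, max_tokens=256):
    """Serialize the same chat request body the curl example sends."""
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    })
```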

Streaming (stream: true) uses SSE chunks with the same schema as upstream OpenAI clients expect.
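
Assuming the SSE chunks follow the upstream OpenAI shape (data: lines carrying JSON deltas, terminated by a data: [DONE] sentinel), a minimal parser sketch:

```python
import json

def parse_sse_lines(lines):
    """Yield parsed JSON chunks from an SSE stream, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)

def collect_text(lines):
    """Concatenate delta content fields into the full completion text."""
    out = []
    for chunk in parse_sse_lines(lines):
        delta = chunk["choices"][0].get("delta", {})
        out.append(delta.get("content", ""))
    return "".join(out)
```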

Batch & tuning jobs

POST /v1/jobs

Enqueue offline evaluation, re-indexing, or a scheduled adapter refresh. Returns 202 with a job id.

Field            Type    Required  Description
type             string  yes       e.g. eval_regression, rebuild_index, merge_lora
payload          object  yes       Type-specific; validated server-side
idempotency_key  string  no        Deduplicates repeat submissions within 24 h

{
  "id": "job_7Qk2Lp9",
  "status": "queued",
  "created_at": "2026-04-05T14:02:11Z"
}
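
One convention for the idempotency_key field is to hash the job definition itself, so an identical retry within the 24 h window maps to the same key. This is an illustrative client-side sketch, not a server requirement; any unique string works:

```python
import hashlib
import json

def idempotency_key(job_type: str, payload: dict) -> str:
    """Derive a deterministic idempotency key from the job definition.

    sort_keys makes the serialization canonical, so logically identical
    payloads always hash to the same key.
    """
    canonical = json.dumps({"type": job_type, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]
```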

GET /v1/jobs/{job_id}

Poll for completion; result may include metric summaries and artefact ids to pin in production.

{
  "id": "job_7Qk2Lp9",
  "status": "succeeded",
  "finished_at": "2026-04-05T14:18:02Z",
  "result": {
    "exact_match": 0.41,
    "human_preferred_rate": 0.78,
    "artefact_id": "art_lora_20260405a"
  }
}
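
A polling sketch with an injected fetch callable, so it can be exercised without the network; the terminal status names beyond succeeded are our assumptions:

```python
import time

TERMINAL = {"succeeded", "failed", "cancelled"}  # assumed terminal states

def wait_for_job(fetch, job_id, interval=2.0, max_polls=100):
    """Poll GET /v1/jobs/{job_id} until the job reaches a terminal status.

    `fetch(job_id)` should perform the request and return the decoded
    JSON body; injecting it keeps this loop testable offline.
    """
    for _ in range(max_polls):
        job = fetch(job_id)
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```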

Artefacts

GET /v1/artefacts/{artefact_id}

Metadata for adapters, merged weights, or snapshotted vector indexes: hashes, size, and a compatibility matrix. Binary blobs are fetched from signed URLs included in the response.

{
  "id": "art_lora_20260405a",
  "kind": "lora_adapter",
  "sha256": "b3c1…9f",
  "compatible_models": ["bd-support-7b-2026-03"],
  "created_at": "2026-04-05T14:17:40Z"
}
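
Since the metadata carries a sha256, a downloaded blob can be checked before it is pinned in production; a minimal sketch (verify_artefact is our name, not part of the API):

```python
import hashlib

def verify_artefact(blob: bytes, expected_sha256: str) -> bool:
    """Check a downloaded artefact blob against the sha256 from its
    metadata; refuse to pin anything that does not match."""
    return hashlib.sha256(blob).hexdigest() == expected_sha256
```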

Questions about your deployment? Contact us.