Now in early access: the Johi Inference Gateway

The inference control plane for Kubernetes.

The Johi Inference Gateway is the AI Gateway for Kubernetes: one control plane that brings production-grade inference reliability to the gateway you already run. It is fast, cost-predictable at scale, and recovers from failures automatically, with no new tooling to learn. Your gateway, your cluster, your data.

Live requests Streaming
  • POST llama-3.1-70b /v1/chat Recovered 842ms
  • POST mixtral-8x7b /v1/chat Cached 12ms
  • POST qwen2-72b /v1/completions Guarded 38ms
  • POST gemma-2-27b /v1/chat Metered 91ms
Routing, caching, guardrails, and metering — live on every request.
One layer. Both directions.

Inspect, route, and recover — without rebuilding your stack.

Johi works on the way out and the way back — adding multi-provider routing, token-aware rate limiting, automatic fallback, and full observability your gateway was never designed to carry.
Connection failed: what re-runs?
  • Semantic cache: Skip
  • Guardrails: Skip
  • Semantic routing: Skip
  • Endpoint picker: Re-run
  • Credential injection: If expired

When a call fails, re-run only what changed.

The pipeline already ran. Re-running the cache lookup, guardrails, and routing is wasted work — the prompt is unchanged, so the cache key, the model, and the verdict are all the same. Johi re-runs only the state-dependent steps: a fresh endpoint pick where the last one just failed, plus credentials if the token has expired.
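A minimal sketch of that retry rule, assuming each step declares whether a retry should repeat it. The `Step` type, the step functions, and the endpoint names `pod-a` and `pod-b` are hypothetical illustrations, not Johi's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], None]
    rerun_on_retry: Callable[[dict], bool]  # decides whether a retry repeats this step

def never(ctx):
    return False

def always(ctx):
    return True

def if_token_expired(ctx):
    return ctx["token_expired"]

def pick_endpoint(ctx):
    # State-dependent: skip endpoints that have already failed for this request
    healthy = [e for e in ctx["endpoints"] if e not in ctx["failed"]]
    ctx["endpoint"] = healthy[0]

def inject_credentials(ctx):
    # Stand-in for fetching a fresh token
    ctx["token_expired"] = False

PIPELINE = [
    # Prompt-dependent steps: same prompt, same result, so a retry skips them
    Step("semantic_cache", lambda ctx: ctx.setdefault("cache_key", hash(ctx["prompt"])), never),
    Step("guardrails", lambda ctx: ctx.setdefault("verdict", "allow"), never),
    Step("semantic_routing", lambda ctx: ctx.setdefault("model", "llama-3.1-70b"), never),
    # State-dependent steps
    Step("endpoint_picker", pick_endpoint, always),
    Step("credential_injection", inject_credentials, if_token_expired),
]

def execute(ctx, retry=False):
    ran = []
    for step in PIPELINE:
        if retry and not step.rerun_on_retry(ctx):
            continue
        step.run(ctx)
        ran.append(step.name)
    return ran

ctx = {"prompt": "What is 2 + 2?", "endpoints": ["pod-a", "pod-b"],
       "failed": set(), "token_expired": False}
first = execute(ctx)
ctx["failed"].add(ctx["endpoint"])  # simulate a connection failure to the picked endpoint
second = execute(ctx, retry=True)
print(first)
print(second, ctx["endpoint"])
```

In this sketch the first pass runs all five steps, while the retry runs only the endpoint picker and lands on the second endpoint, because the token has not expired.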

0% budget drift: counted at 500, sent at 420, reconciled to the token
Order-aware counting · Per-workload budgets · Anomaly detection

Token budgets that don't drift.

Count tokens before a step rewrites the request — PII redaction, say — and the budget drifts: you commit 500 tokens when only 420 were sent, and over thousands of requests that drift compounds. Johi counts after every mutation, meters real usage per workload, and keeps each budget accurate.
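With the figures above, counting early over-commits 500 - 420 = 80 tokens on that one request, 16% of what was charged. The sketch below shows the same effect on a small scale. It assumes a whitespace tokenizer and a card-number redaction step, both hypothetical stand-ins for a real tokenizer and a real PII filter.

```python
import re

CARD = re.compile(r"\b\d{4}(?: \d{4}){3}\b")

def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word
    return len(text.split())

def redact_pii(text):
    # Mutation: a 16-digit card number written in four groups becomes one placeholder
    return CARD.sub("[CARD]", text)

class Budget:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def commit(self, tokens):
        if self.used + tokens > self.limit:
            raise RuntimeError("token budget exceeded")
        self.used += tokens

def meter(prompt, mutations, budget):
    for mutate in mutations:
        prompt = mutate(prompt)
    sent = count_tokens(prompt)  # count after the last mutation, not before the first
    budget.commit(sent)
    return prompt, sent

budget = Budget(limit=100)
original = "charge card 4111 1111 1111 1111 for the order"
counted_early = count_tokens(original)
outgoing, sent = meter(original, [redact_pii], budget)
print(counted_early, sent, counted_early - sent)
```

Counting before redaction would have committed 9 tokens for a request that went out with 6. Committing the post-mutation count keeps `budget.used` equal to what was actually sent.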

What an AI Gateway actually does

Powered by Payload Processors.

As an AI Gateway, Johi inspects the full payload, not just the headers. Payload Processors are features that process the full payload of requests, responses, or both (headers and body), composed into one ordered pipeline. They do four categories of work.

Observe & meter

Token-level cost attribution, usage tracking, and anomaly detection across workloads.

Allow or deny

Pass the request through, or reject it. No mutation.

guardrails · AuthN / AuthZ · token rate limits

Respond

Short-circuit the pipeline and return a response directly.

semantic cache

Mutate

Rewrite the request before it reaches the upstream.

semantic routing · credential injection
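The categories above can be sketched as one ordered pipeline in which each processor returns a small result telling the runner what to do next. Everything here is an illustrative assumption: the `Action` and `Result` types, the exact-match dictionary standing in for a semantic cache, and the digit check standing in for semantic routing.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Action(Enum):
    CONTINUE = 1
    DENY = 2
    RESPOND = 3

@dataclass
class Result:
    action: Action = Action.CONTINUE
    request: Optional[dict] = None   # set by a mutating processor
    response: Optional[dict] = None  # set by a responding processor

USAGE = []
CACHE = {"what is 2 + 2?": {"body": "4"}}

def observe(req):
    # Observe and meter: record usage, never change the request
    USAGE.append(len(req["prompt"].split()))
    return Result()

def guardrail(req):
    # Allow or deny: no mutation
    if "ignore previous instructions" in req["prompt"].lower():
        return Result(Action.DENY)
    return Result()

def cache(req):
    # Respond: short-circuit the pipeline on a hit
    hit = CACHE.get(req["prompt"].lower())
    return Result(Action.RESPOND, response=hit) if hit else Result()

def route(req):
    # Mutate: rewrite the request before it reaches the upstream
    model = "math-model" if any(c.isdigit() for c in req["prompt"]) else "general-model"
    return Result(request={**req, "model": model})

PIPELINE = [observe, guardrail, cache, route]

def run(req, processors=PIPELINE):
    for process in processors:
        result = process(req)
        if result.action is Action.DENY:
            return {"status": 403}
        if result.action is Action.RESPOND:
            return {"status": 200, **result.response}
        if result.request is not None:
            req = result.request
    return {"status": 200, "forwarded": req}

print(run({"prompt": "What is 2 + 2?"}))
print(run({"prompt": "Tell me a story"}))
```

Keeping the result type explicit means the runner, not each processor, decides when the pipeline stops.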

Built for your whole team

One control plane, every role.

Real needs from the people who run inference in production — from the developers writing applications to the operators, security, and compliance teams who keep them safe.
Application developer

Semantic routing

Analyzes the meaning of the prompt in every request so the backend target adapts dynamically: identifying a "math request", for instance, and sending it to the most appropriate model.

Application developer

Declarative failure modes

Configure failure modes for every processing step — fail-open, fail-closed, fallback, and more — to ensure safe and efficient runtime behavior of your application.
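As a rough illustration of per-step failure modes, the sketch below wraps a step and applies one of three policies when it raises. The `FailureMode` enum and `run_step` helper are hypothetical names, not Johi configuration.

```python
from enum import Enum

class FailureMode(Enum):
    FAIL_OPEN = "fail-open"
    FAIL_CLOSED = "fail-closed"
    FALLBACK = "fallback"

def run_step(step, request, mode, fallback=None):
    try:
        return step(request)
    except Exception as exc:
        if mode is FailureMode.FAIL_OPEN:
            return request  # skip the failed step and pass the request through unchanged
        if mode is FailureMode.FALLBACK and fallback is not None:
            return fallback(request)  # use the configured alternative
        # Fail-closed, or fallback without an alternative: reject the request
        raise RuntimeError(f"request rejected: {exc}") from exc

def unreachable(request):
    # Simulates a step whose backend is down
    raise TimeoutError("backend unreachable")

def default_route(request):
    return {**request, "model": "general-model"}

req = {"prompt": "hello"}
print(run_step(unreachable, req, FailureMode.FAIL_OPEN))
print(run_step(unreachable, req, FailureMode.FALLBACK, fallback=default_route))
try:
    run_step(unreachable, req, FailureMode.FAIL_CLOSED)
except RuntimeError as exc:
    print(exc)
```

A cache lookup is a natural candidate for fail-open, while a guardrail usually wants fail-closed so that an outage never lets unchecked traffic through.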

Application developer

Predictable ordering

Predictable ordering of all payload processing steps, so your pipeline behaves safely and consistently at runtime — every time.

Agentic AI platform developer

MCP payload processing

Processes the payload of Model Context Protocol (MCP) requests to make routing and security decisions for your agents.

Agentic AI platform developer

Payload-driven headers

Sets or modifies request headers from payload attributes: for example, looking up a session header from a store by tool name and routing the request to the correct backend MCP server.
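A sketch of that lookup, assuming the request body is an MCP JSON-RPC `tools/call` message whose `params.name` field carries the tool name. The session store, the `x-session-id` header name, and the backend hostnames are hypothetical.

```python
import json

# Hypothetical store: tool name -> (session id, backend MCP server)
SESSION_STORE = {
    "search_docs": ("sess-123", "mcp-docs.internal"),
    "run_query": ("sess-456", "mcp-data.internal"),
}

def process_mcp(body, headers):
    """Return (headers, backend); backend is None when no decision was made."""
    message = json.loads(body)
    if message.get("method") != "tools/call":
        return headers, None
    tool = message.get("params", {}).get("name")
    entry = SESSION_STORE.get(tool)
    if entry is None:
        return headers, None
    session_id, backend = entry
    # Copy the headers instead of mutating the caller's dict
    return {**headers, "x-session-id": session_id}, backend

body = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "search_docs", "arguments": {"query": "retries"}},
}).encode()

headers, backend = process_mcp(body, {"content-type": "application/json"})
print(headers, backend)
```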

Security engineer

Threat detection engine

Adds a detection engine that scans requests to identify malicious or anomalous payloads and blocks, sanitizes, and/or reports them before they reach your backends.

Cluster admin

Semantic caching

Detects repeated requests and returns cached results, reducing overall inference costs and improving latency for common requests.

Compliance officer

PII protection on requests

Examines inference requests for personally identifiable information (PII), so any PII can be blocked, sanitized, or reported before the request reaches the inference backend.

Compliance officer

Response inspection

Examines inference responses for malicious or misaligned results, so they can be dropped, sanitized, or reported before the response is sent back to the requester.

Kubernetes-native, production-ready

The reliability layer your inference traffic has been missing.

Purpose-built for AI inference at scale — high throughput, low latency, and high availability on both sides of your gateway, with every prompt staying inside your Kubernetes cluster. Ship the insights out; keep the payloads in.
<50ms

Added p99 latency for inline response checks: overhead your users never feel.

99.99%

Inference availability with automatic, multi-provider failover and failure isolation.

Questions & Answers

What is the Johi Inference Gateway?
It is an AI Gateway for Kubernetes: a gateway that speaks AI protocols. In practice that means an egress gateway that inspects the request payload, because in inference the body, not the headers, carries the routing, security, and caching decisions.
Why inspect the body and not just the headers?
In a traditional API request the headers carry the decision — host, auth, content type. In an inference request the body does: which model to route to, the prompt to guard and cache, the parameters that set policy. Johi reads the payload so it can route, secure, cache, and meter on what actually matters.
How does Johi keep token budgets accurate?
Token limits are subtler than they look: budgets differ by model, streamed responses can overshoot before you cut them off, and output tokens cost several times more than input. Count before a step rewrites the request — PII redaction, say — and the budget drifts. Johi counts after every mutation and meters input and output separately.
Does the order of policies matter?
Yes — ordering is about correctness, not just efficiency. Guardrails run first, so a bad request never reaches the tokenizer or the model. A cache lookup comes after guardrails — you never serve a cached answer to a request that should be rejected — but before counting, since a hit costs zero inference. Credential injection runs last, once the model and endpoint are resolved.
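The ordering rules in this answer can be written down as "must run before" pairs and checked mechanically. This is a sketch under that assumption; the step names and the `order_violations` helper are illustrative, and the valid order shown is only one of several that satisfy the constraints.

```python
# Each pair (a, b) means step a must run before step b.
CONSTRAINTS = [
    ("guardrails", "semantic_cache"),              # never serve a cached answer to a rejected request
    ("semantic_cache", "token_counting"),          # a cache hit costs zero inference
    ("semantic_routing", "credential_injection"),  # credentials need the resolved model
    ("endpoint_picker", "credential_injection"),   # credentials need the resolved endpoint
]

def order_violations(steps):
    position = {name: index for index, name in enumerate(steps)}
    return [
        (before, after)
        for before, after in CONSTRAINTS
        if before in position and after in position and position[before] > position[after]
    ]

valid = ["guardrails", "semantic_cache", "semantic_routing",
         "token_counting", "endpoint_picker", "credential_injection"]
cache_first = ["semantic_cache", "guardrails", "semantic_routing",
               "token_counting", "endpoint_picker", "credential_injection"]

print(order_violations(valid))
print(order_violations(cache_first))
```

The first call reports no violations. The second flags the guardrails and cache pair, which is the case where a cached answer could be served to a request that should have been rejected.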
When a request fails, does the whole pipeline re-run?
No. The cache lookup, guardrails, and routing depend only on the prompt: the prompt is unchanged, so the verdict, the model, and the cache key are all the same, and re-running them is wasted work. Johi re-runs only the state-dependent steps: a fresh endpoint pick where the last one just failed, and credentials if the token has expired.
How does Johi handle traffic leaving the cluster?
Egress flips the usual TLS assumptions: your workload dials a cluster-local service name while the upstream expects its external hostname, so the Host header and the TLS SNI disagree and the handshake breaks. Johi reconciles the two identities — and scopes credentials per destination, so a provider token never leaks onto a health check or a non-AI route.
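A minimal sketch of the two ideas in this answer: use one external identity for both the Host header and the SNI, and attach the provider token only to AI routes. It assumes a static table of upstreams. The hostnames, the token value, and the `/v1/` path prefix are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Upstream:
    external_host: str
    token: str
    ai_path_prefix: str = "/v1/"

# Hypothetical mapping: cluster-local service name -> external provider
UPSTREAMS = {
    "llm.inference.svc.cluster.local": Upstream("api.llm-provider.example", "demo-token"),
}

def egress_plan(dialed_host, path):
    upstream = UPSTREAMS.get(dialed_host)
    if upstream is None:
        # Unknown destination: pass through unchanged, with no credentials
        return {"host": dialed_host, "sni": dialed_host, "headers": {}}
    headers = {}
    if path.startswith(upstream.ai_path_prefix):
        # The token is scoped to this destination's AI routes only
        headers["Authorization"] = f"Bearer {upstream.token}"
    # Host header and SNI both carry the external name, so they cannot disagree
    return {"host": upstream.external_host, "sni": upstream.external_host, "headers": headers}

print(egress_plan("llm.inference.svc.cluster.local", "/v1/chat"))
print(egress_plan("llm.inference.svc.cluster.local", "/healthz"))
```

In a Python client, the `sni` value is what would be passed as `server_hostname` to `ssl.SSLContext.wrap_socket`.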

Put Johi in front of your models.

Tell us which gateway and models you run, and we'll show you exactly how Johi fits — automatic failover, token budgets, guardrails, and metering, all inside your own Kubernetes cluster.