Observe & meter
Token-level cost attribution, usage tracking, and anomaly detection across workloads.
The Johi Inference Gateway is the AI Gateway for Kubernetes: one control plane that brings production-grade inference reliability to the gateway you already run. It is fast, cost-predictable at scale, and recovers from failures automatically, with no new tooling to learn. Your gateway, your cluster, your data.
On a retry, the pipeline has already run. Re-running the cache lookup, guardrails, and routing is wasted work: the prompt is unchanged, so the cache key, the model, and the verdict are all the same. Johi re-runs only the state-dependent steps: a fresh endpoint pick, since the last one just failed, plus credentials if the token has expired.
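The selection of steps after a failure can be sketched roughly as follows. This is a minimal illustration, assuming each step carries a flag saying whether its result can change between attempts; the step names and the `steps_to_run` helper are hypothetical and not Johi's actual API.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    state_dependent: bool  # True if the result can change between attempts


# Hypothetical pipeline; step names mirror the prose, not a real Johi config.
PIPELINE = [
    Step("cache_lookup", state_dependent=False),
    Step("guardrails", state_dependent=False),
    Step("routing", state_dependent=False),
    Step("endpoint_pick", state_dependent=True),
    Step("credentials", state_dependent=True),
]


def steps_to_run(is_retry: bool, token_expired: bool = False) -> list:
    """Return the names of the steps to execute for this attempt."""
    if not is_retry:
        return [s.name for s in PIPELINE]
    names = []
    for s in PIPELINE:
        if not s.state_dependent:
            continue  # prompt unchanged, so the earlier result still holds
        if s.name == "credentials" and not token_expired:
            continue  # token is still valid, so reuse it
        names.append(s.name)
    return names
```

On a first attempt every step runs; on a retry only the endpoint pick runs, with credentials added when the token has expired.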
Count tokens before a step rewrites the request — PII redaction, say — and the budget drifts: you commit 500 tokens when only 420 were sent, and over thousands of requests that drift compounds. Johi counts after every mutation, meters real usage per workload, and keeps each budget accurate.
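The drift is easy to reproduce in a few lines. The sketch below assumes a whitespace tokenizer as a stand-in for the model's real tokenizer and a single phone-number pattern as a stand-in for PII redaction; `count_tokens`, `redact_pii`, and `meter` are illustrative names, not part of Johi.

```python
import re

# Illustrative PII pattern: a phone number written as three space-separated groups.
PHONE = re.compile(r"\b\d{3} \d{3} \d{4}\b")


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (whitespace split); a real gateway would use the model's tokenizer."""
    return len(text.split())


def redact_pii(text: str) -> str:
    """Replace each matched phone number with a single placeholder."""
    return PHONE.sub("[PHONE]", text)


def meter(prompt: str, budget: dict, workload: str) -> str:
    """Apply the mutation first, then charge the workload for what is actually sent."""
    mutated = redact_pii(prompt)
    budget[workload] = budget.get(workload, 0) + count_tokens(mutated)
    return mutated


budget = {}
prompt = "call me at 555 123 4567 tomorrow"
sent = meter(prompt, budget, "support-bot")
```

Counting before redaction would have charged 7 tokens for this prompt, while only 5 are sent upstream after the mutation.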
Pass the request through, or reject it. No mutation.
guardrails · AuthN / AuthZ · token rate limits
Short-circuit the pipeline and return a response directly.
semantic cache
Rewrite the request before it reaches the upstream.
semantic routing · credential injection
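The three kinds of step above can be modeled as three possible outcomes of a single step function. The following is a rough sketch under that assumption; the `Reject`, `Respond`, and `Forward` types are hypothetical, and the cache here is an exact-match dictionary standing in for a semantic cache.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    headers: dict


@dataclass
class Reject:       # pass/reject steps: stop with an error, no mutation
    reason: str


@dataclass
class Respond:      # short-circuit steps: answer directly, skip the upstream
    body: str


@dataclass
class Forward:      # continue, possibly with a rewritten request
    request: Request


def guardrail(req: Request):
    if "ignore previous instructions" in req.prompt.lower():
        return Reject("possible prompt injection")
    return Forward(req)


CACHE = {"what is 2+2?": "4"}  # exact-match stand-in for a semantic cache


def cache_lookup(req: Request):
    hit = CACHE.get(req.prompt.lower())
    return Respond(hit) if hit is not None else Forward(req)


def inject_credentials(req: Request):
    # Placeholder token; a real step would fetch it from a secret store.
    headers = {**req.headers, "Authorization": "Bearer <token>"}
    return Forward(Request(req.prompt, headers))


STEPS = [guardrail, cache_lookup, inject_credentials]


def run(req: Request, steps=STEPS):
    """Run steps in order; stop at the first Reject or Respond."""
    for step in steps:
        outcome = step(req)
        if not isinstance(outcome, Forward):
            return outcome
        req = outcome.request
    return Forward(req)
```

A cached prompt returns a `Respond` without reaching credential injection, a flagged prompt returns a `Reject`, and everything else is forwarded with the rewritten headers.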
Analyzes the prompt of every request so the backend target adapts dynamically: identifying a "math request", for instance, and sending it to the most appropriate model.
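As a toy illustration of routing on prompt content, the sketch below classifies a prompt by keyword overlap and maps the category to a backend. Keyword matching is a deliberately simple stand-in for real semantic classification (which would more likely use embeddings), and the model names are made up.

```python
import re

# Hypothetical categories and backends, for illustration only.
KEYWORDS = {
    "math": {"integral", "derivative", "equation", "solve"},
    "code": {"function", "compile", "bug", "refactor"},
}
ROUTES = {"math": "math-model", "code": "code-model"}
DEFAULT_MODEL = "general-model"


def classify(prompt: str) -> str:
    """Pick the category whose keywords overlap the prompt most; 'general' if none match."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    best, best_hits = "general", 0
    for category, keywords in KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def route(prompt: str) -> str:
    return ROUTES.get(classify(prompt), DEFAULT_MODEL)
```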
Configure failure modes for every processing step — fail-open, fail-closed, fallback, and more — to ensure safe and efficient runtime behavior of your application.
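One way to picture per-step failure modes is a small wrapper that decides what happens when a step raises. This is a sketch under assumed semantics (fail-open passes the request through unchanged, fail-closed propagates the error, fallback calls an alternative step); the `FailureMode` enum and `run_step` helper are hypothetical names.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_OPEN = "fail-open"      # on error, let the request through unchanged
    FAIL_CLOSED = "fail-closed"  # on error, reject by propagating the exception
    FALLBACK = "fallback"        # on error, run an alternative step instead


def run_step(step, request, mode: FailureMode, fallback=None):
    """Run one processing step and apply its configured failure mode."""
    try:
        return step(request)
    except Exception:
        if mode is FailureMode.FAIL_OPEN:
            return request
        if mode is FailureMode.FALLBACK and fallback is not None:
            return fallback(request)
        raise  # FAIL_CLOSED, or FALLBACK with nothing to fall back to


def broken_step(request):
    raise RuntimeError("detection engine unavailable")
```

A security-critical step such as a guardrail would typically be configured fail-closed, while an optimization such as a cache lookup can safely fail open.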
Predictable ordering of all payload processing steps, so your pipeline behaves safely and consistently at runtime — every time.
Processes the payload of Model Context Protocol (MCP) requests to make routing and security decisions for your agents.
Sets or modifies request headers from payload attributes: for example, looking up a session header from a store by tool, then routing the request to the correct backend MCP server.
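A rough sketch of that lookup: MCP messages are JSON-RPC, and a tool invocation uses the `tools/call` method with the tool name in `params.name`. The in-memory `SESSIONS` store, the backend hostname, and the `route_mcp` helper are assumptions made for illustration, as is the choice of the `Mcp-Session-Id` header to carry the session.

```python
import json

# Hypothetical session store keyed by tool name.
SESSIONS = {
    "search_docs": {"session_id": "sess-123", "backend": "mcp-docs.internal"},
}


def route_mcp(body: bytes, headers: dict):
    """Return (headers, backend); backend is None when no routing decision applies."""
    msg = json.loads(body)
    if msg.get("method") != "tools/call":
        return headers, None
    tool = msg.get("params", {}).get("name")
    entry = SESSIONS.get(tool)
    if entry is None:
        return headers, None
    # Copy the headers rather than mutating the caller's dict.
    new_headers = {**headers, "Mcp-Session-Id": entry["session_id"]}
    return new_headers, entry["backend"]


body = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "search_docs", "arguments": {"query": "failover"}},
}).encode()
routed_headers, backend = route_mcp(body, {"content-type": "application/json"})
```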
Adds a detection engine that scans requests to identify malicious or anomalous payloads and blocks, sanitizes, and/or reports them before they reach your backends.
Detects repeated requests and returns cached results, reducing overall inference costs and improving latency for common requests.
Examines inference requests for personally identifiable information (PII), so any PII can be blocked, sanitized, or reported before the request reaches the inference backend.
Examines inference responses for malicious or misaligned results, so they can be dropped, sanitized, or reported before the response is sent back to the requester.
Added p99 latency for inline response checks: overhead your users never feel.
Inference availability with automatic, multi-provider failover and failure isolation.
Tell us which gateway and models you run, and we'll show you exactly how Johi fits — automatic failover, token budgets, guardrails, and metering, all inside your own Kubernetes cluster.