Release date: 2026/04/07
Features

- Added native passthrough support for Vertex AI multimodal embeddings. Introduced the new `llm/v1/multimodal-embeddings` route type.
- Added support for the Databricks provider.
- Added support for the DeepSeek provider.
- Added the Ollama provider, with native support for all Ollama-hosted models.
- Added ACL support that blocks requests matching deny rules, requires allow rules to pass when present, and resolves consumer/model values once per request.
- Updated to use `/v2/chat` when proxying text generation requests to the Cohere provider.
- Added the ability to select target models by request model within the same AI Proxy Advanced plugin.
- Added support for proxying Cohere embeddings models on Bedrock.
- Added support for the vLLM provider.
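As a rough illustration only (not the plugin's actual implementation), the deny/allow semantics described in the ACL item above can be sketched as follows; the function name and rule representation are assumptions:

```python
def acl_allows(value: str, deny: list[str], allow: list[str]) -> bool:
    """Illustrative deny-then-allow check: a matching deny rule always
    blocks, and when allow rules are present the value must match one."""
    if value in deny:
        return False          # matching deny rule blocks the request
    if allow:                 # allow rules, when present, must pass
        return value in allow
    return True               # nothing denied and no allow list: pass

# The consumer/model value is resolved once per request, then checked
# against both rule sets.
print(acl_allows("model-a", deny=["model-b"], allow=[]))           # True
print(acl_allows("model-b", deny=["model-b"], allow=["model-b"]))  # False
print(acl_allows("model-c", deny=[], allow=["model-a"]))           # False
```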
Bugfixes

- Fixed structured output and tool calls for the Ollama (also Llama2 and Mistral) providers.
- Updated mappings for the Ollama (also Llama2 and Mistral) providers.
- Fixed an issue where attributes for GenAI spans did not conform to https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans.
- Fixed excessive DNS warning logs for template variable endpoints like `$(headers.x_gcp_endpoint)`. These warnings appeared on every configuration change because template variables cannot be resolved until request time.
- Fixed `usage.cost` for OpenAI `llm/v1/files/{file_id}/content` batch output analytics.
- Fixed an issue where metrics were missing and the v2 API was unavailable in the Cohere native format.
- Fixed an issue where the `id`/`created` fields were missing in multiple drivers.
- Fixed an issue where gzipped streaming responses in the OpenAI format were broken.
- Fixed an issue where the `Content-Type` header of a non-200 response to a streaming request was overridden.
- Fixed an issue where GCP tokens were not proactively refreshed before expiration.
- Fixed zero usage statistics for OpenAI `llm/v1/responses` in streaming mode.
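To illustrate the DNS-warning fix above: an endpoint containing a template variable cannot be resolved at configuration time, so resolution (and the warning it would produce) has to be deferred to request time. A minimal sketch of that check, with the helper name and placeholder pattern as assumptions:

```python
import re

# Matches template placeholders like $(headers.x_gcp_endpoint)
TEMPLATE_RE = re.compile(r"\$\([\w.]+\)")

def resolvable_at_config_time(endpoint: str) -> bool:
    """Endpoints containing template variables can only be resolved per
    request, so config-time DNS checks (and their warnings) are skipped."""
    return not TEMPLATE_RE.search(endpoint)

print(resolvable_at_config_time("https://api.example.com"))    # True
print(resolvable_at_config_time("$(headers.x_gcp_endpoint)"))  # False
```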
Performance

- Decoded the OpenAI response only when necessary.