
Documentation Index

Fetch the complete documentation index at: https://docs.orq.ai/llms.txt

Use this file to discover all available pages before exploring further.

Quick Start

Distribute requests across multiple providers using weighted routing.
import OpenAI from "openai";

// Point the OpenAI SDK at the orq.ai router endpoint.
const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v3/router",
});

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini", // Primary model (ignored when load balancing)
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  load_balancer: {
    type: "weight_based",
    models: [
      { model: "openai/gpt-3.5-turbo", weight: 0.7 }, // 70% of requests
      { model: "anthropic/claude-3-haiku", weight: 0.3 }, // 30% of requests
    ],
  },
});

Configuration

Parameter            | Type   | Required | Description
load_balancer        | Object | Yes      | Load balancer configuration (top-level)
load_balancer.type   | string | Yes      | Strategy type (weight_based)
load_balancer.models | Array  | Yes      | List of models with weights
models[].model       | string | Yes      | Model identifier
models[].weight      | number | No       | Relative weight (0.001 - 1.0, default 0.5)
Weight Calculation:
  • Weights are normalized: [0.4, 0.8] → [33%, 67%]
  • Higher weight = more traffic
  • Minimum weight: 0.001
  • Default weight: 0.5
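
The normalization above can be sketched in a few lines (an illustration of the rule, assuming the router simply divides each weight by the sum of all weights):

```javascript
// Divide each weight by the total so the shares sum to 1.
function normalizeWeights(weights) {
  const total = weights.reduce((sum, w) => sum + w, 0);
  return weights.map((w) => w / total);
}

normalizeWeights([0.4, 0.8]); // ≈ [0.333, 0.667], i.e. [33%, 67%]
```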

Common Patterns

// Equal distribution
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-4o", weight: 1.0 },
    { model: "anthropic/claude-3", weight: 1.0 },
  ],
}

// Cost optimization (cheap model primary)
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-3.5-turbo", weight: 0.8 },
    { model: "openai/gpt-4o", weight: 0.2 },
  ],
}

// A/B testing
load_balancer: {
  type: "weight_based",
  models: [
    { model: "current-model", weight: 0.9 },
    { model: "experimental-model", weight: 0.1 },
  ],
}

// Multi-provider redundancy
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-4o", weight: 0.5 },
    { model: "anthropic/claude-3", weight: 0.3 },
    { model: "azure/gpt-4o", weight: 0.2 },
  ],
}
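
Conceptually, weight_based routing behaves like a weighted random draw over the configured models. A minimal sketch of that idea (an illustration only, not the router's actual implementation):

```javascript
// Pick a model name at random, proportionally to its weight.
function pickModel(models) {
  const total = models.reduce((sum, m) => sum + m.weight, 0);
  let r = Math.random() * total;
  for (const m of models) {
    r -= m.weight;
    if (r <= 0) return m.model;
  }
  return models[models.length - 1].model; // guard against float rounding
}

pickModel([
  { model: "openai/gpt-4o", weight: 0.5 },
  { model: "anthropic/claude-3", weight: 0.3 },
  { model: "azure/gpt-4o", weight: 0.2 },
]);
```

Because each request is an independent draw, short runs can deviate noticeably from the configured split; the percentages hold over volume.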

Use Cases

Scenario            | Weight Strategy             | Example
Cost optimization   | Heavy on cheaper models     | 80% GPT-3.5, 20% GPT-4
Performance testing | Small traffic to new model  | 95% current, 5% experimental
Provider redundancy | Split across providers      | 60% OpenAI, 40% Anthropic
Capacity management | Distribute during peaks     | Even split across models

See also: Organization-level load balancing

To apply load balancing across your organization without changing request code, use Routing Rules to configure Fallback, Weighted, and Round Robin strategies at the workspace level.

Code examples

curl -X POST https://api.orq.ai/v3/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Write a creative marketing slogan for an eco-friendly coffee brand"
      }
    ],
    "load_balancer": {
      "type": "weight_based",
      "models": [
        {
          "model": "openai/gpt-3.5-turbo",
          "weight": 0.4
        },
        {
          "model": "anthropic/claude-3-haiku-20240307",
          "weight": 0.6
        }
      ]
    }
  }'

Monitoring

Track these metrics for optimal load balancing:
// Example monitoring setup
const metrics = {
  requestsByModel: {}, // Count per model
  costsByModel: {}, // Cost per model
  latencyByModel: {}, // Response time per model
  errorsByModel: {}, // Error rate per model
};
Key Metrics:
  • Traffic distribution: Actual vs expected percentages
  • Cost per model: Monitor spending across providers
  • Response times: Compare latency by model
  • Error rates: Track failures by provider
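
To surface the actual-vs-expected gap, you can compare observed request counts against the configured weights. A hypothetical helper (the input shape follows the requestsByModel counter above):

```javascript
// Returns actual-minus-expected traffic share per model; a positive value
// means the model is receiving more traffic than its configured weight.
function distributionDrift(requestsByModel, expectedShares) {
  const total = Object.values(requestsByModel).reduce((a, b) => a + b, 0);
  const drift = {};
  for (const [model, expected] of Object.entries(expectedShares)) {
    const actual = total > 0 ? (requestsByModel[model] || 0) / total : 0;
    drift[model] = actual - expected;
  }
  return drift;
}
```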

Troubleshooting

Uneven distribution
  • Check if weights are normalized correctly
  • Verify sufficient request volume (min 100 requests for accuracy)
  • Monitor over longer time periods
Unexpected costs
  • Track actual vs expected cost distribution
  • Monitor for expensive model overuse
  • Set up cost alerts per provider
Performance issues
  • Check latency differences between models
  • Monitor for provider-specific slowdowns
  • Adjust weights based on performance data

Limitations

  • Probabilistic routing: Short-term traffic may not match exact weights
  • Minimum volume needed: Requires sufficient requests for statistical accuracy
  • Response variations: Different models may return varying output quality
  • Cost complexity: Managing billing across multiple providers
  • Provider dependencies: Requires API access to all models

Advanced Usage

Environment-specific weights:
const weights = {
  development: {
    type: "weight_based",
    models: [
      { model: "openai/gpt-3.5-turbo", weight: 1.0 }, // Cheap for dev
    ],
  },
  production: {
    type: "weight_based",
    models: [
      { model: "openai/gpt-4o", weight: 0.7 }, // Quality primary
      { model: "anthropic/claude-3", weight: 0.3 }, // Backup
    ],
  },
};
Dynamic weight adjustment:
// Adjust weights based on observed performance; calculateWeight is a
// placeholder for your own scoring function.
const adjustWeights = (modelMetrics) => ({
  type: "weight_based",
  models: modelMetrics.map((m) => ({
    model: m.name,
    weight: calculateWeight(m.latency, m.cost, m.quality),
  })),
});
With other features:
{
  "model": "openai/gpt-4o",
  "load_balancer": {
    "type": "weight_based",
    "models": [
      { "model": "openai/gpt-4o", "weight": 0.6 },
      { "model": "anthropic/claude-3", "weight": 0.4 }
    ]
  },
  "orq": {
    "retry": { "count": 2, "on_codes": [429] },
    "timeout": { "call_timeout": 15000 }
  }
}