
Documentation Index

Fetch the complete documentation index at: https://docs.orq.ai/llms.txt

Use this file to discover all available pages before exploring further.

Quick Start

Distribute requests across multiple providers using weighted routing.
import OpenAI from "openai";

// Point the OpenAI SDK at the orq.ai router endpoint.
const openai = new OpenAI({
  apiKey: process.env.ORQ_API_KEY,
  baseURL: "https://api.orq.ai/v3/router",
});

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini", // Primary model (ignored when load balancing)
  messages: [{ role: "user", content: "Write a marketing slogan" }],
  load_balancer: {
    type: "weight_based",
    models: [
      { model: "openai/gpt-3.5-turbo", weight: 0.7 }, // 70% of requests
      { model: "anthropic/claude-3-haiku", weight: 0.3 }, // 30% of requests
    ],
  },
});

Configuration

Parameter            | Type   | Required | Description
load_balancer        | Object | Yes      | Load balancer configuration (top-level)
load_balancer.type   | string | Yes      | Strategy type (weight_based)
load_balancer.models | Array  | Yes      | List of models with weights
models[].model       | string | Yes      | Model identifier
models[].weight      | number | No       | Relative weight (0.001 - 1.0, default 0.5)
Weight Calculation:
  • Weights are normalized: [0.4, 0.8] → [33%, 67%]
  • Higher weight = more traffic
  • Minimum weight: 0.001
  • Default weight: 0.5
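
The normalization above can be sketched in a few lines (an illustration of the rule, assuming the router simply divides each weight by the sum of all weights):

```javascript
// Divide each weight by the total so the shares sum to 1.
function normalizeWeights(weights) {
  const total = weights.reduce((sum, w) => sum + w, 0);
  return weights.map((w) => w / total);
}

normalizeWeights([0.4, 0.8]); // ≈ [0.333, 0.667], i.e. [33%, 67%]
```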

Common Patterns

// Equal distribution
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-4o", weight: 1.0 },
    { model: "anthropic/claude-3", weight: 1.0 },
  ],
}

// Cost optimization (cheap model primary)
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-3.5-turbo", weight: 0.8 },
    { model: "openai/gpt-4o", weight: 0.2 },
  ],
}

// A/B testing
load_balancer: {
  type: "weight_based",
  models: [
    { model: "current-model", weight: 0.9 },
    { model: "experimental-model", weight: 0.1 },
  ],
}

// Multi-provider redundancy
load_balancer: {
  type: "weight_based",
  models: [
    { model: "openai/gpt-4o", weight: 0.5 },
    { model: "anthropic/claude-3", weight: 0.3 },
    { model: "azure/gpt-4o", weight: 0.2 },
  ],
}
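
Conceptually, weight_based routing behaves like a weighted random draw over the configured models. A minimal sketch of that idea (an illustration only, not the router's actual implementation):

```javascript
// Pick a model name at random, proportionally to its weight.
function pickModel(models) {
  const total = models.reduce((sum, m) => sum + m.weight, 0);
  let r = Math.random() * total;
  for (const m of models) {
    r -= m.weight;
    if (r <= 0) return m.model;
  }
  return models[models.length - 1].model; // guard against float rounding
}

pickModel([
  { model: "openai/gpt-4o", weight: 0.5 },
  { model: "anthropic/claude-3", weight: 0.3 },
  { model: "azure/gpt-4o", weight: 0.2 },
]);
```

Because each request is an independent draw, short runs can deviate noticeably from the configured split; the percentages hold over volume.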

Use Cases

Scenario            | Weight Strategy             | Example
Cost optimization   | Heavy on cheaper models     | 80% GPT-3.5, 20% GPT-4
Performance testing | Small traffic to new model  | 95% current, 5% experimental
Provider redundancy | Split across providers      | 60% OpenAI, 40% Anthropic
Capacity management | Distribute during peaks     | Even split across models

See also: Organization-level load balancing

To apply load balancing across your organization without changing request code, use Routing Rules to configure Fallback, Weighted, and Round Robin strategies at the workspace level.

Code examples

curl -X POST https://api.orq.ai/v3/router/chat/completions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Write a creative marketing slogan for an eco-friendly coffee brand"
      }
    ],
    "load_balancer": {
      "type": "weight_based",
      "models": [
        {
          "model": "openai/gpt-3.5-turbo",
          "weight": 0.4
        },
        {
          "model": "anthropic/claude-3-haiku-20240307",
          "weight": 0.6
        }
      ]
    }
  }'

Monitoring

Track these metrics for optimal load balancing:
// Example monitoring setup
const metrics = {
  requestsByModel: {}, // Count per model
  costsByModel: {}, // Cost per model
  latencyByModel: {}, // Response time per model
  errorsByModel: {}, // Error rate per model
};
Key Metrics:
  • Traffic distribution: Actual vs expected percentages
  • Cost per model: Monitor spending across providers
  • Response times: Compare latency by model
  • Error rates: Track failures by provider
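
To surface the actual-vs-expected gap, you can compare observed request counts against the configured weights. A hypothetical helper (the input shape follows the requestsByModel counter above):

```javascript
// Returns actual-minus-expected traffic share per model; a positive value
// means the model is receiving more traffic than its configured weight.
function distributionDrift(requestsByModel, expectedShares) {
  const total = Object.values(requestsByModel).reduce((a, b) => a + b, 0);
  const drift = {};
  for (const [model, expected] of Object.entries(expectedShares)) {
    const actual = total > 0 ? (requestsByModel[model] || 0) / total : 0;
    drift[model] = actual - expected;
  }
  return drift;
}
```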

Troubleshooting

Uneven distribution
  • Check if weights are normalized correctly
  • Verify sufficient request volume (min 100 requests for accuracy)
  • Monitor over longer time periods
Unexpected costs
  • Track actual vs expected cost distribution
  • Monitor for expensive model overuse
  • Set up cost alerts per provider
Performance issues
  • Check latency differences between models
  • Monitor for provider-specific slowdowns
  • Adjust weights based on performance data

Limitations

  • Probabilistic routing: Short-term traffic may not match exact weights
  • Minimum volume needed: Requires sufficient requests for statistical accuracy
  • Response variations: Different models may return varying output quality
  • Cost complexity: Managing billing across multiple providers
  • Provider dependencies: Requires API access to all models

Advanced Usage

Environment-specific weights:
const weights = {
  development: {
    type: "weight_based",
    models: [
      { model: "openai/gpt-3.5-turbo", weight: 1.0 }, // Cheap for dev
    ],
  },
  production: {
    type: "weight_based",
    models: [
      { model: "openai/gpt-4o", weight: 0.7 }, // Quality primary
      { model: "anthropic/claude-3", weight: 0.3 }, // Backup
    ],
  },
};
Dynamic weight adjustment:
// Adjust weights based on observed performance; calculateWeight is a
// placeholder for your own scoring function.
const adjustWeights = (modelMetrics) => ({
  type: "weight_based",
  models: modelMetrics.map((m) => ({
    model: m.name,
    weight: calculateWeight(m.latency, m.cost, m.quality),
  })),
});
With other features:
{
  "model": "openai/gpt-4o",
  "load_balancer": {
    "type": "weight_based",
    "models": [
      { "model": "openai/gpt-4o", "weight": 0.6 },
      { "model": "anthropic/claude-3", "weight": 0.4 }
    ]
  },
  "orq": {
    "retry": { "count": 2, "on_codes": [429] },
    "timeout": { "call_timeout": 15000 }
  }
}