Managed Services for Serverless and Dedicated Model Endpoints - Simplifying AI Model Deployment & Operations.
Exceptional Price-Performance Inference at Scale - Seamless OpenAI API Compatibility.
Seamlessly Use OpenAI API-Compatible APIs Alongside Top-Tier Open Source Foundation Models: Llama, Qwen, and DeepSeek.
Utilize Private Endpoints to Boost Reliability and Protect Privacy - Ideal for Both Open-Weight and Custom Private Fine-Tuned Models.
Collaboration with Intel to Deliver Enterprise-Ready Inference, Powered by Cost-Effective Intel Xeon and Gaudi AI Accelerators Tailored for Enterprise Use.
Readily Accessible Open-Weight Foundation Models, Paired with API Endpoints to Enable Seamless Rapid Integration.
| MODEL NAME | PARAMS | CONTEXT | PRECISION |
|---|---|---|---|
| Llama 3.3 | 70B | 32k | BF16 |
| Llama 3.2 | 1B, 3B | 32k | BF16 |
| Llama 3.1 | 8B, 70B | 32k | BF16 |
| Llama 3.1 (soon) | 405B | 32k | FP8 |
| DeepSeek R1 (soon) | 671B | 32k | FP8 |
| Mistral v0.1 | 7B, 8×7B | 32k | BF16 |
| Qwen 2.5 | 7B, 14B, 32B, 72B | 32k | BF16 |
| Falcon 3 | 7B, 10B | 32k | BF16 |
| ALLam-AI Preview | 7B | 32k | BF |
| BGE M3 Embedder | 108M | 8k | BF16 |
| BGE M3 Reranker | 568M | 1k | BF16 |
Seamless Native OpenAI API Compatibility accelerates rapid model migration and streamlined inference deployment.
Cost-Effective Managed Services reduce your hosting and operational expenses.
Model serving optimized to prioritize first-token latency or batch throughput.
Serverless endpoints are limited to published models and up to 60 requests per second.
Valuable Inference with zero unnecessary overhead.
Facilitates easy access and deployment of popular ready-to-use AI models.
Model deployment removes the need for hardware management, maintenance, or operational burdens.
Facilitates hosting and deployment of tailored models.