AIT
Claigrid’s AI Inference Termination (AIT) service offers scalable, efficient infrastructure for deploying and running AI models with high performance and flexibility.
GPU Splitting with CIMS: The Cloud Instance Management Service (CIMS) can partition a GPU attached to a worker node into 4–126 isolated instances (pods), based on model size, optimization level, and compute requirements (e.g., GPU model, available memory). This allows for efficient resource allocation and scaling, ensuring optimal performance even with large numbers of concurrent requests.
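As a rough illustration of how a deployment might request a partition sized to its model, here is a hypothetical sketch; the endpoint URL, payload fields, and response shape are assumptions for this example, not a published Claigrid API.

```python
# Hypothetical CIMS partition request (endpoint and fields are illustrative).
import requests

CIMS_ENDPOINT = "https://cims.example.invalid/v1/partitions"  # placeholder URL

def request_gpu_partition(model_size_gb: float, api_token: str) -> dict:
    """Ask CIMS for an isolated GPU slice sized to the model's footprint."""
    payload = {
        # Assumed field names; leave headroom for activations and batching.
        "requested_memory_gb": model_size_gb * 1.5,
        "isolation": "pod",
    }
    resp = requests.post(
        CIMS_ENDPOINT,
        json=payload,
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g., {"partition_id": "...", "gpu_memory_gb": 12}
```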
Model Optimization Recommendations: For best performance, users are encouraged to optimize their models. Common optimization methods include:
Quantization: Reduces model precision (e.g., from FP16 to INT8) to save memory and improve inference speed; see the sketch after this list.
Knowledge Distillation: Trains a smaller model to mimic a larger one, retaining most of its performance with lower resource usage.
Pruning: Eliminates redundant parameters in large models, reducing their size without significantly affecting accuracy.
Weight Sharing: Used in some Large Language Models (LLMs) to reuse weights across different layers, reducing memory footprint.
Using optimized or proven open-source models can significantly improve efficiency in AIT, especially for LLMs and diffusion models.
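As a concrete example of the quantization technique above, the following minimal sketch uses PyTorch's dynamic quantization API to convert the Linear layers of a toy stand-in model to INT8; the model and shapes are illustrative, and any framework with equivalent tooling works just as well.

```python
# Minimal dynamic-quantization sketch (toy model, illustrative only).
import torch
import torch.nn as nn

# Stand-in for a real model: Linear layers are the module type that
# dynamic quantization targets.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))
model.eval()

# Convert Linear weights to INT8; activations are quantized dynamically
# at inference time, cutting memory and speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # module types to quantize
    dtype=torch.qint8,  # target weight precision
)

# The quantized model is a drop-in replacement for inference.
with torch.no_grad():
    out = quantized(torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 768])
```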
Recommended Model Sizes: Based on our tests, model sizes between 6–12GB tend to provide the best balance of performance and efficiency. This range supports robust model functionality while maintaining manageable resource consumption.
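As a rough rule of thumb (ignoring activation and KV-cache overhead), a model's weight footprint is parameter count times bytes per parameter, which makes it easy to check whether a model, or a quantized variant of it, lands inside the 6–12GB window:

```python
# Back-of-the-envelope sizing: weights ≈ parameter count × bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GB for a parameter count and precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 7B-parameter model: ~28GB at FP32, ~14GB at FP16, ~7GB at INT8,
# so INT8 quantization brings it inside the recommended range.
for precision in ("fp32", "fp16", "int8"):
    print(precision, round(weight_footprint_gb(7e9, precision), 1), "GB")
```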
Sample Resource Configurations:
LLMs (e.g., GPT-2, BERT): For large language models that require significant computational power, we recommend:
p3.8xlarge: 4 NVIDIA V100 GPUs (16GB each), 32 vCPUs, and 244GB RAM.
g5g.16xlarge: Powered by NVIDIA T4G Tensor Core GPUs, with 2 GPUs (16GB each), 64 vCPUs, and 128GB RAM, suitable for optimized LLM workloads.
p4d.24xlarge (if available): 8 NVIDIA A100 GPUs (40GB each), 96 vCPUs, and 1,152GB RAM, providing even higher memory and efficiency for advanced inference tasks.
Diffusion Models (e.g., Stable Diffusion): For image generation models requiring strong GPU and memory support, ideal instances include:
p3.2xlarge: 1 NVIDIA V100 GPU (16GB), 8 vCPUs, and 61GB RAM.
g4dn.12xlarge: 4 NVIDIA T4 GPUs (16GB each), 48 vCPUs, and 192GB RAM.
g5g.xlarge: 1 T4G GPU (16GB), 4 vCPUs, and 8GB RAM, providing a balance of GPU power and scalability for image generation.
Smaller Models (e.g., BERT-base): For lightweight inference models, the following instances offer cost-effective options:
g4dn.xlarge: 1 T4 GPU (16GB), 4 vCPUs, and 16GB RAM.
g5g.large: 1 T4G GPU (16GB), 2 vCPUs, and 4GB RAM, an efficient choice for smaller models.
g5.xlarge: A next-generation choice with 1 NVIDIA A10G GPU (24GB), 4 vCPUs, and 16GB RAM, providing efficient resource use for lightweight tasks.
These are our recommended configurations, but users are highly encouraged to consult the AWS documentation and assess their own budgeting and scaling requirements to select the most suitable instance types.
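For teams that script their deployments, the recommendations above can be captured in a small lookup table; the sketch below is illustrative only, and the category names and helper function are not part of any AIT API.

```python
# Illustrative lookup mirroring the recommendations above (not an AIT API).
RECOMMENDED_INSTANCES = {
    "llm": ["p3.8xlarge", "g5g.16xlarge", "p4d.24xlarge"],
    "diffusion": ["p3.2xlarge", "g4dn.12xlarge", "g5g.xlarge"],
    "small": ["g4dn.xlarge", "g5g.large", "g5.xlarge"],
}

def recommend_instances(workload: str) -> list[str]:
    """Return candidate instance types for a workload category."""
    try:
        return RECOMMENDED_INSTANCES[workload]
    except KeyError:
        raise ValueError(
            f"unknown workload {workload!r}; expected one of {sorted(RECOMMENDED_INSTANCES)}"
        )

print(recommend_instances("diffusion"))  # ['p3.2xlarge', 'g4dn.12xlarge', 'g5g.xlarge']
```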
Cold and Warm Start Times: Current cold start times range from 1–3 minutes; the next update is expected to reduce this to just a few seconds. Warm start times are already optimized, at 600ms–2 seconds depending on model and code size.
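One simple way to observe the difference is to time a first (likely cold) request against an immediately following warm one; the endpoint and payload in this sketch are placeholders for your own deployment.

```python
# Timing sketch for cold vs. warm starts (endpoint and payload are placeholders).
import time
import requests

ENDPOINT = "https://ait.example.invalid/v1/models/my-model/infer"  # placeholder

def timed_call(payload: dict) -> float:
    """Return wall-clock seconds for one inference request."""
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=payload, timeout=300)
    resp.raise_for_status()
    return time.perf_counter() - start

payload = {"inputs": "hello"}
print(f"first (likely cold) call: {timed_call(payload):.2f}s")
print(f"second (warm) call:       {timed_call(payload):.2f}s")
```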
Supported Libraries and Frameworks: AIT supports a wide range of Python libraries.
TensorFlow compatibility is coming soon, including support for popular TensorFlow-based models.