Model Serving

Definition

Model serving is the practice of making trained AI models available to applications through APIs or services, so that applications can request predictions on new inputs, often to support real-time decision-making.

Use Cases

Provider Equivalents

Frequently Asked Questions

What's the difference between Model Serving and Model Training?
Model training is the process of learning model parameters from data (building the model). Model serving is putting that trained model behind an API or service so applications can send new inputs and get predictions back in real time or asynchronously.
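As a rough illustration of "putting a trained model behind an API", the sketch below wraps a stand-in model in a small HTTP endpoint using only the Python standard library. The weights, the /predict path, the port, and the JSON shape are all hypothetical choices made for this example. A production deployment would more likely use a dedicated serving framework or a managed service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained" parameters. A real service would load these
# from a model artifact produced by the training step.
WEIGHTS = [0.5, -0.25]
BIAS = 0.1


def predict(features):
    """Score one input with a fixed linear model (stand-in for a real model)."""
    if len(features) != len(WEIGHTS):
        raise ValueError("expected %d features" % len(WEIGHTS))
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def handle_predict(raw_body):
    """Parse a JSON request body and return (status_code, response_dict)."""
    try:
        features = json.loads(raw_body)["features"]
        return 200, {"prediction": predict(features)}
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected JSON like {\"features\": [1.0, 2.0]}"}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":  # assumed endpoint name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=8000):
    """Build the server; call .serve_forever() on the result to start serving."""
    return HTTPServer((host, port), PredictHandler)
```

Calling make_server().serve_forever() starts the endpoint. A client can then POST a body such as {"features": [2.0, 4.0]} to /predict and read the prediction from the JSON response. Training never appears in this code: the service only applies parameters that were learned earlier.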
When should I use Model Serving?
Use model serving when you need predictions inside an application workflow (such as recommendations, fraud checks, translation, or document classification), especially when you need low latency, consistent scaling, version control, and secure access. If you only need periodic predictions on large datasets, batch inference may be simpler and cheaper than always-on serving.
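For contrast with an always-on endpoint, here is a minimal sketch of the batch alternative: a job that reads a set of records, scores them in one pass, writes the results, and exits. The linear model, the column layout (an id followed by two numeric features), and the in-memory "files" are assumptions made for illustration.

```python
import csv
import io

# Hypothetical parameters standing in for a trained model.
WEIGHTS = [0.5, -0.25]
BIAS = 0.1


def predict(features):
    """Score one record with a fixed linear model."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def score_batch(reader, writer):
    """Read rows of (id, f1, f2), write rows of (id, prediction).

    Returns the number of rows scored.
    """
    count = 0
    for row in reader:
        record_id, *raw = row
        writer.writerow([record_id, predict([float(v) for v in raw])])
        count += 1
    return count


# A real job would read from and write to files or object storage.
# In-memory text buffers keep this sketch self-contained.
source = io.StringIO("a,2.0,4.0\nb,1.0,0.0\n")
sink = io.StringIO()
rows_scored = score_batch(csv.reader(source), csv.writer(sink))
```

No server stays running between jobs, so compute is only paid for while the batch executes. The trade-off is that predictions are only as fresh as the last run.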
How much does Model Serving cost?
Cost depends mainly on (1) compute type (CPU vs GPU/accelerators), (2) how many instances or replicas you run, (3) autoscaling behavior and idle time, (4) memory/VRAM requirements, (5) request volume and payload sizes, and (6) networking and logging/monitoring. Managed services typically charge for provisioned compute time (or per-request for serverless options) plus any attached storage and data transfer.
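To make the two pricing models concrete, the following back-of-the-envelope sketch compares provisioned compute with per-request billing. Every rate in it is a made-up placeholder, not any provider's actual price, and it leaves out storage, data transfer, and logging costs.

```python
HOURS_PER_MONTH = 730.0  # common approximation: 365 * 24 / 12


def provisioned_monthly_cost(replicas, hourly_rate, hours=HOURS_PER_MONTH):
    """Cost of always-on replicas, billed for compute time even when idle."""
    return replicas * hourly_rate * hours


def per_request_monthly_cost(requests, price_per_request):
    """Cost of a serverless-style option, billed per request."""
    return requests * price_per_request


def break_even_requests(replicas, hourly_rate, price_per_request,
                        hours=HOURS_PER_MONTH):
    """Monthly request volume at which both options cost the same."""
    return provisioned_monthly_cost(replicas, hourly_rate, hours) / price_per_request


# Hypothetical numbers: 2 replicas at $1.20/hour vs. $0.0002 per request.
always_on = provisioned_monthly_cost(replicas=2, hourly_rate=1.20)
threshold = break_even_requests(replicas=2, hourly_rate=1.20,
                                price_per_request=0.0002)
```

With these placeholder rates, the always-on option comes to about $1,752 per month, and per-request billing stays cheaper until traffic reaches roughly 8.76 million requests per month. Autoscaling moves this threshold because it reduces the replica-hours billed during idle periods.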

Category: ai-ml

Difficulty: advanced

Related Terms

See Also