Model Serving
Definition
Deploying trained AI models behind APIs or services so that applications can send inputs and receive predictions, typically to support real-time decision-making.
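As a rough illustration of this definition, the sketch below puts a stand-in model behind a small HTTP endpoint using only the Python standard library. The weights, the /predict path, the port, and the JSON request shape are all assumptions made for the example, not any provider's API. A production deployment would normally rely on a managed service or a dedicated serving framework instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained" model: fixed weights standing in for parameters
# that would normally be learned offline and loaded from storage.
WEIGHTS = [0.5, -0.25]
BIAS = 1.0


def predict(features):
    """Return a linear score for a single feature vector."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


class PredictHandler(BaseHTTPRequestHandler):
    """Handles POST /predict with a JSON body like {"features": [2.0, 4.0]}."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            score = predict(payload["features"])
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing key, or wrong input shape.
            self.send_error(400, "invalid request body")
            return
        body = json.dumps({"prediction": score}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the example quiet; a real service would log requests.
        pass


def make_server(port=8080):
    """Build the server bound to localhost (the port is an arbitrary choice)."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

With the server running, a client that POSTs {"features": [2.0, 4.0]} to /predict would receive {"prediction": 1.0}. Keeping predict separate from the HTTP handler means the same function could also be reused for offline batch scoring.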
Use Cases
- Netflix: Personalized recommendations and ranking to decide which titles to show each user — Deploys machine learning models into production services so applications can request predictions during browsing and playback; models are integrated into online serving systems to return ranked results with low latency (Improves personalization and user engagement by showing more relevant content, supporting large-scale, real-time experiences)
- Uber: Estimated time of arrival (ETA) predictions used in rider and driver apps — Serves trained prediction models behind internal APIs so multiple products can request ETA predictions in real time; the serving layer supports frequent model updates and high request volume (More accurate ETAs improve marketplace efficiency and customer experience by setting better expectations and aiding operational decisions)
- Spotify: Music recommendations and personalization (e.g., ranking tracks for playlists and home feed) — Uses online model inference within backend services so client applications can retrieve personalized rankings and recommendations via APIs (Better personalization increases listening time and satisfaction by helping users discover relevant music faster)
Provider Equivalents
- AWS: Amazon SageMaker (Real-Time Inference, Serverless Inference, Asynchronous Inference)
- Azure: Azure Machine Learning (Managed Online Endpoints, Kubernetes Online Endpoints)
- GCP: Vertex AI (Online Prediction Endpoints)
- OCI: OCI Data Science (Model Deployment)
Frequently Asked Questions
- What's the difference between Model Serving and Model Training?
- Model training is the process of learning model parameters from data (building the model). Model serving is putting that trained model behind an API or service so applications can send new inputs and get predictions back in real time or asynchronously.
- When should I use Model Serving?
- Use model serving when you need predictions inside an application workflow (such as recommendations, fraud checks, translation, or document classification), especially when you need low latency, predictable scaling, model versioning, and secure access. If you only need periodic predictions on large datasets, batch inference may be simpler and cheaper than always-on serving.
- How much does Model Serving cost?
- Cost depends mainly on (1) compute type (CPU vs GPU/accelerators), (2) how many instances or replicas you run, (3) autoscaling behavior and idle time, (4) memory/VRAM requirements, (5) request volume and payload sizes, and (6) networking and logging/monitoring. Managed services typically charge for provisioned compute time (or per-request for serverless options) plus any attached storage and data transfer.
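The two billing models mentioned above (provisioned compute time versus per-request pricing) can be compared with some simple arithmetic. The sketch below is only a back-of-the-envelope estimate: every rate in it is a made-up placeholder rather than a real provider price, the function names are hypothetical, and it ignores storage, data transfer, and monitoring charges.

```python
HOURS_PER_MONTH = 730  # common approximation for an average month


def provisioned_monthly_cost(hourly_rate, replicas, hours=HOURS_PER_MONTH):
    """Cost of always-on instances: billed per replica for every hour, even when idle."""
    return hourly_rate * replicas * hours


def serverless_monthly_cost(requests, cost_per_request):
    """Cost of a per-request option: scales with traffic, nothing billed when idle."""
    return requests * cost_per_request


def break_even_requests(provisioned_cost, cost_per_request):
    """Monthly request volume at which both options cost the same."""
    return provisioned_cost / cost_per_request


if __name__ == "__main__":
    # Placeholder figures, chosen only to show the calculation.
    always_on = provisioned_monthly_cost(hourly_rate=0.25, replicas=2)
    per_request = serverless_monthly_cost(requests=200_000, cost_per_request=0.0004)
    threshold = break_even_requests(always_on, cost_per_request=0.0004)
    print(f"provisioned: ${always_on:,.2f} per month")
    print(f"serverless:  ${per_request:,.2f} per month")
    print(f"break-even:  {threshold:,.0f} requests per month")
```

Under these assumptions, traffic below the break-even volume favors the per-request option, while sustained traffic above it favors provisioned instances. Real estimates should use the provider's published pricing and measured request durations.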
Category: ai-ml
Difficulty: advanced