Question 1

What's the difference between Model Serving and Model Training?

Accepted Answer

Model training is the process of creating a model by learning from data (it’s compute-heavy and usually done offline). Model serving is what happens after training: you deploy the trained model so applications can send requests (inputs) and get predictions (outputs) through an API, typically with low latency and high availability.
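
As a rough illustration of that split, the sketch below trains a tiny one-feature linear model and then answers a JSON request with it. The names `train` and `handle_request` and the request shape `{"x": ...}` are assumptions made for this example, not any particular platform's API; a real deployment would put the handler behind an HTTP endpoint.

```python
import json


def train(examples):
    """Training (offline): fit y = w*x + b to (x, y) pairs by least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in examples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return {"w": w, "b": b}  # the "trained model" is just its learned parameters


def handle_request(model, request_body):
    """Serving (online): parse one JSON request and return one JSON prediction."""
    payload = json.loads(request_body)
    prediction = model["w"] * payload["x"] + model["b"]
    return json.dumps({"prediction": prediction})


# Train once, ahead of time.
model = train([(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])

# Serve many times, one request per call.
response = handle_request(model, '{"x": 4.0}')
print(response)
```

Training runs once and produces the parameters; the handler only reads them, which is why serving can stay fast even when training is slow.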

Question 2

When should I use Model Serving instead of batch predictions?

Accepted Answer

Use model serving when you need predictions immediately (milliseconds to seconds) as part of an application workflow, such as fraud checks during checkout, real-time translation, or chat responses. Use batch predictions when timing is flexible and you can score many records at once, such as nightly churn scoring or weekly demand forecasts.
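
The two access patterns can be sketched side by side. Everything here is hypothetical: `score` is a stand-in rule rather than a trained model, and the record fields `id` and `amount` are invented for the example.

```python
def score(record):
    """Hypothetical stand-in for a trained model: flag large transactions."""
    return 1.0 if record["amount"] > 1000 else 0.0


def serve_one(record):
    """Online path: one request in, one prediction out, while the caller waits."""
    return {"id": record["id"], "fraud_score": score(record)}


def score_batch(records):
    """Batch path: score a whole dataset in one scheduled job."""
    return [{"id": r["id"], "fraud_score": score(r)} for r in records]


# Online: called during checkout for a single transaction.
live_result = serve_one({"id": "t-1", "amount": 2500})

# Batch: called nightly over all accumulated records.
nightly_results = score_batch([
    {"id": "t-2", "amount": 40},
    {"id": "t-3", "amount": 1800},
])
print(live_result, nightly_results)
```

The model logic is the same in both paths; what differs is who is waiting for the answer and how many records are scored per call.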

Question 3

How much does Model Serving cost?

Accepted Answer

Cost depends on (1) compute type and size (CPU vs. GPU, memory), (2) the number of replicas and whether autoscaling is enabled, (3) request volume and payload size, (4) uptime requirements (a 24/7 endpoint costs more than one that runs on demand), (5) networking and data transfer, and (6) operational features such as monitoring and logging. Managed serverless inference can reduce idle costs for spiky traffic, while always-on GPU endpoints are expensive but may be necessary for serving large models at low latency.
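
A back-of-the-envelope estimate for the compute portion (factors 1, 2, and 4) might look like the sketch below. The hourly rates are made-up placeholders, not real provider prices, and the function ignores request volume, data transfer, and monitoring costs.

```python
def monthly_endpoint_cost(hourly_rate, replicas, hours_per_day, days=30):
    """Estimate compute cost only: rate x replicas x hours the endpoint is up."""
    return hourly_rate * replicas * hours_per_day * days


# Hypothetical rates, chosen only to show the shape of the comparison.
always_on_gpu = monthly_endpoint_cost(hourly_rate=2.50, replicas=2, hours_per_day=24)
on_demand_cpu = monthly_endpoint_cost(hourly_rate=0.20, replicas=2, hours_per_day=6)

print(f"Always-on GPU endpoint: ${always_on_gpu:,.2f} per month")
print(f"On-demand CPU endpoint: ${on_demand_cpu:,.2f} per month")
```

Substituting your provider's actual rates and your expected uptime gives a first-order figure to compare against a serverless, pay-per-request option.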
