Model Inference
Definition
Model inference is the use of a trained AI model to make predictions or decisions on new data, much like applying learned knowledge to solve new problems.
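As a minimal sketch of the definition, the toy example below fits a one-feature linear model once and then applies the fixed parameters to an input it has not seen. The function names `train` and `infer`, the least-squares model, and the distance and duration figures are all assumptions made for illustration, not part of any particular product or library.

```python
# Toy example: the data and the one-feature linear model are hypothetical.

def train(xs, ys):
    """Training: learn slope and intercept from historical (x, y) pairs
    using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model, x_new):
    """Inference: apply the already-learned parameters to a new input.
    No learning happens here; the parameters stay fixed."""
    slope, intercept = model
    return slope * x_new + intercept


# Made-up historical data: trip distance in km -> trip duration in minutes.
distances_km = [1.0, 2.0, 3.0, 4.0]
durations_min = [5.0, 8.0, 11.0, 14.0]

model = train(distances_km, durations_min)  # done once, ahead of time
prediction = infer(model, 6.0)              # done per request, on new data
print(f"Predicted duration for 6 km: {prediction:.1f} min")  # 20.0 min
```

In a production system the `train` step would typically run offline and only the learned parameters would be shipped to the serving path, where `infer` is called for each new input.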
Use Cases
- Netflix: Personalized content recommendations (ranking and selection of titles shown to each member). Netflix uses machine learning models in production to score and rank content for users; inference runs at serving time to generate personalized rows and recommendations from user behavior and context signals. (Improves personalization and member engagement by showing more relevant titles, which supports retention and viewing satisfaction.)
- Uber: Estimated Time of Arrival (ETA) predictions for rides and deliveries. Uber runs ML models that take real-time signals (traffic, route, historical trip data, pickup conditions) and perform inference to predict the ETAs surfaced in the rider and driver apps. (More accurate ETAs improve user trust and operational efficiency by helping riders plan and helping the platform coordinate supply and demand.)
- Google: Spam and phishing detection in Gmail. Gmail applies trained classification models to incoming messages; inference scores each email for spam/phishing likelihood and routes it to the inbox or the spam folder accordingly. (Reduces exposure to unwanted and malicious email, improving user safety and inbox quality at large scale.)
Provider Equivalents
- AWS: Amazon SageMaker (Real-Time Inference, Asynchronous Inference, Batch Transform)
- Azure: Azure Machine Learning (Online Endpoints, Batch Endpoints)
- GCP: Vertex AI (Online Prediction, Batch Prediction)
- OCI: OCI Data Science (Model Deployment)
Frequently Asked Questions
- What's the difference between model inference and model training?
- Training is when you feed lots of labeled or historical data into an algorithm to learn model parameters (it’s compute-heavy and happens periodically). Inference is when you use the already-trained model to make a prediction on new data (it’s usually latency-sensitive and happens continuously in production).
- When should I use real-time inference and when should I use batch inference?
- Use real-time (online) inference when you need an immediate response, such as fraud checks during checkout, chatbot replies, or image recognition in an app. Use batch inference when you can process many records at once on a schedule, such as scoring all customers nightly for churn risk or generating weekly demand forecasts.
- How much does model inference cost?
- Cost depends on (1) compute type and size (CPU vs GPU/accelerators), (2) how long endpoints run (always-on vs scale-to-zero options where available), (3) request volume and payload size, (4) latency/throughput targets that drive overprovisioning, and (5) extras like load balancing, monitoring, and data transfer. Batch inference is often cheaper for non-urgent workloads because you pay for job runtime rather than keeping an endpoint running.
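The cost difference between an always-on endpoint and a scheduled batch job can be sketched with simple arithmetic. Everything in the example below is an assumption for illustration: the hourly rate, instance count, job duration, and the function names are hypothetical and do not reflect any provider's actual pricing or API.

```python
# All prices and workload numbers below are hypothetical placeholders,
# not real provider rates.

HOURS_PER_MONTH = 730  # common approximation: 365 * 24 / 12


def always_on_endpoint_cost(hourly_rate, instance_count, hours=HOURS_PER_MONTH):
    """Real-time endpoint: billed for every hour the instances are up,
    whether or not requests arrive."""
    return hourly_rate * instance_count * hours


def batch_job_cost(hourly_rate, instance_count, job_hours, runs_per_month):
    """Batch inference: billed only while each scheduled job is running."""
    return hourly_rate * instance_count * job_hours * runs_per_month


# Same instance type and count in both cases; only the running time differs.
endpoint = always_on_endpoint_cost(hourly_rate=0.50, instance_count=2)
nightly = batch_job_cost(hourly_rate=0.50, instance_count=2,
                         job_hours=1.5, runs_per_month=30)

print(f"Always-on endpoint: ${endpoint:,.2f} per month")  # $730.00
print(f"Nightly batch job:  ${nightly:,.2f} per month")   # $45.00
```

This sketch covers compute time only. It leaves out per-request charges, data transfer, load balancing, and monitoring, and the comparison is meaningful only when the workload can tolerate waiting for the next scheduled run.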
Category: ai-ml
Difficulty: intermediate