Speech-to-Text

Definition

Speech-to-Text is Google Cloud's service for converting audio into text. Applications send it recorded or streamed audio and receive a written transcript of the spoken language.

Use Cases

Provider Equivalents

Frequently Asked Questions

What's the difference between Speech-to-Text and voice recognition?
Speech-to-Text converts spoken words into written text (a transcript). Voice recognition, in the sense of speaker recognition, tries to identify who is speaking. The two can be used together: for example, transcribe a call and also label which parts were spoken by the customer and which by the agent.
When should I use Speech-to-Text?
Use it when you need searchable, editable text from audio, such as meeting captions, call center transcripts, voice notes, podcast indexes, or subtitles. It is especially useful when you want real-time captions (streaming) or need to process many recordings automatically (batch).
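As a rough illustration of the streaming versus batch choice, the sketch below maps a job to one of the three recognition methods in the google-cloud-speech Python client (`streaming_recognize`, `recognize`, `long_running_recognize`). The helper name `choose_method` and the 60-second cutoff are assumptions made for this example; check the current quota documentation for the actual limit on synchronous requests.

```python
from typing import Optional

# Assumed cutoff for synchronous requests; verify against current quotas.
SYNC_LIMIT_SECONDS = 60.0


def choose_method(live: bool, duration_seconds: Optional[float] = None) -> str:
    """Return the name of the client method suited to a transcription job.

    live: True when audio arrives in real time (e.g. live captions).
    duration_seconds: length of a recorded file; ignored for live audio.
    """
    if live:
        # Real-time audio has no known length, so it is streamed.
        return "streaming_recognize"
    if duration_seconds is None or duration_seconds < 0:
        raise ValueError("recorded audio needs a non-negative duration")
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        # Short clips can be transcribed in a single blocking request.
        return "recognize"
    # Longer recordings are submitted as asynchronous batch jobs.
    return "long_running_recognize"
```

Under these assumptions, a live captioning feed maps to `streaming_recognize`, a 30-second voice note to `recognize`, and an hour-long podcast to `long_running_recognize`.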
How much does Speech-to-Text cost?
Pricing is usually based on audio duration (per second or per minute). Costs vary by features such as streaming vs. batch, model type (e.g., phone call vs. video), language, and optional add-ons like diarization (speaker separation) or custom models. Also consider indirect costs like storing audio/transcripts and any network egress charges if audio moves between clouds or regions.
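Duration-based pricing comes down to simple arithmetic. The sketch below is a minimal estimator under stated assumptions: the rate passed in is a placeholder rather than a real price, the function and parameter names are hypothetical, and it assumes each recording is rounded up to a fixed billing increment, which varies by provider and should be checked against the current pricing page. It covers transcription only, not storage or egress.

```python
import math
from typing import Iterable


def estimate_cost(
    durations_seconds: Iterable[float],
    rate_per_minute: float,
    increment_seconds: int = 1,
) -> float:
    """Estimate transcription cost for a batch of recordings.

    rate_per_minute: placeholder price per billed minute (not a real rate).
    increment_seconds: assumed billing granularity; each recording is
        rounded up to the next multiple of this value.
    """
    if increment_seconds <= 0:
        raise ValueError("increment_seconds must be positive")
    billed_seconds = 0
    for duration in durations_seconds:
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Round each recording up to the next billing increment.
        billed_seconds += math.ceil(duration / increment_seconds) * increment_seconds
    return billed_seconds / 60 * rate_per_minute
```

For example, `estimate_cost([90, 30.2], rate_per_minute=0.02, increment_seconds=15)` bills 135 seconds (2.25 minutes), because the 30.2-second clip is rounded up to 45 seconds.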

Category: ai-ml

Difficulty: basic

See Also