Transcription

Configure Whisper models, language, Voice Activity Detection, and streaming modes.

All voice transcription in Vocoding is processed 100% locally using Whisper -- no audio data ever leaves your device.

Whisper Model Selection

Choose which Whisper model to use for voice transcription.

Model	Size	Speed	Accuracy	Languages
tiny	75 MB	Fastest	Basic	Multi
tiny.en	75 MB	Fastest	Basic	English only
base	142 MB	Fast	Good	Multi
base.en	142 MB	Fast	Good	English only
small	466 MB	Medium	Better	Multi
small.en	466 MB	Medium	Better	English only
medium	1.5 GB	Slow	High	Multi
large-v3	2.9 GB	Slowest	Best	Multi
large-v3-turbo	1.5 GB	Fast	High	Multi
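
As a rough illustration of how the table above can guide a choice, the sketch below filters the models by a disk-space budget and by whether multilingual support is needed. The sizes are copied from the table (with 1.5 GB and 2.9 GB written as 1500 and 2900 MB); the `WhisperModel` class and the `models_within` helper are hypothetical names for this example and are not part of Vocoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WhisperModel:
    name: str
    size_mb: int
    english_only: bool


# Rows copied from the model table, in the same order.
MODELS = [
    WhisperModel("tiny", 75, False),
    WhisperModel("tiny.en", 75, True),
    WhisperModel("base", 142, False),
    WhisperModel("base.en", 142, True),
    WhisperModel("small", 466, False),
    WhisperModel("small.en", 466, True),
    WhisperModel("medium", 1500, False),
    WhisperModel("large-v3", 2900, False),
    WhisperModel("large-v3-turbo", 1500, False),
]


def models_within(budget_mb: int, need_multilingual: bool) -> List[str]:
    """Return the names of models that fit the budget (hypothetical helper).

    English-only models are excluded when multilingual support is needed.
    """
    return [
        m.name
        for m in MODELS
        if m.size_mb <= budget_mb and not (need_multilingual and m.english_only)
    ]


if __name__ == "__main__":
    print(models_within(150, need_multilingual=True))
```

With a 150 MB budget and multilingual support required, only `tiny` and `base` remain as candidates.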

Each model shows a status badge.

All models are stored locally at ~/.vocoding/models/.

Select the transcription language:

Option	Description
Auto Detect	Whisper detects the language automatically
English	Force English recognition
Español	Force Spanish recognition
Français	Force French recognition
Deutsch	Force German recognition
Italiano	Force Italian recognition
Português	Force Portuguese recognition
Chinese	Force Chinese recognition
Japanese	Force Japanese recognition
Korean	Force Korean recognition
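
For readers scripting against Whisper directly, the options above correspond to the two-letter language codes that Whisper accepts. The sketch below is a minimal illustration keyed by the language each option forces; the `LANGUAGE_CODES` table and the `whisper_language` helper are hypothetical names, and how Vocoding stores this setting internally is not documented here.

```python
from typing import Optional

# Language forced by each menu option -> two-letter code used by Whisper.
# "auto detect" maps to None, meaning no language is forced.
LANGUAGE_CODES = {
    "auto detect": None,
    "english": "en",
    "spanish": "es",
    "french": "fr",
    "german": "de",
    "italian": "it",
    "portuguese": "pt",
    "chinese": "zh",
    "japanese": "ja",
    "korean": "ko",
}


def whisper_language(option: str) -> Optional[str]:
    """Return the language code for an option, or None for auto detection."""
    key = option.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"Unsupported language option: {option!r}")
    return LANGUAGE_CODES[key]


if __name__ == "__main__":
    print(whisper_language("Spanish"))
    print(whisper_language("Auto Detect"))
```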

Voice Activity Detection (VAD) detects when you start and stop speaking, making the recording experience more natural.

Setting	Description	Default
Enable VAD	Turn VAD on or off	ON
Sensitivity	How sensitive to voice (0-100%)	50%
Auto-Stop	Automatically stop recording when silence detected	OFF
Auto-Stop Delay	How long to wait after silence (0.5s - 5s)	1.5s

The Auto-Stop Delay slider appears only when Auto-Stop is enabled.
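
The settings, defaults, and ranges above can be modeled as a small validated record. This is only a sketch: the `VadSettings` class and its field names are hypothetical and do not reflect Vocoding's actual configuration format. The `visible_controls` method encodes only the one visibility rule stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VadSettings:
    # Defaults taken from the settings table.
    enabled: bool = True
    sensitivity_percent: int = 50
    auto_stop: bool = False
    auto_stop_delay_s: float = 1.5

    def __post_init__(self) -> None:
        # Ranges taken from the settings table.
        if not 0 <= self.sensitivity_percent <= 100:
            raise ValueError("sensitivity must be between 0 and 100 percent")
        if not 0.5 <= self.auto_stop_delay_s <= 5.0:
            raise ValueError("auto-stop delay must be between 0.5 and 5 seconds")

    def visible_controls(self) -> List[str]:
        """List the controls shown; the delay slider needs Auto-Stop enabled."""
        controls = ["Enable VAD", "Sensitivity", "Auto-Stop"]
        if self.auto_stop:
            controls.append("Auto-Stop Delay")
        return controls


if __name__ == "__main__":
    print(VadSettings().visible_controls())
    print(VadSettings(auto_stop=True).visible_controls())
```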

Level	Best For
Low (0-30%)	Noisy environments -- only picks up clear, loud speech
Medium (30-70%)	Most environments -- good balance
High (70-100%)	Quiet environments -- picks up soft speech

Short delay (0.5-1s): For quick commands and short phrases
Medium delay (1.5-2.5s): For normal speaking with natural pauses
Long delay (3-5s): For thoughtful speaking with extended pauses between sentences
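
To see why the delay length matters, the sketch below simulates auto-stop over a sequence of per-frame speech flags. It assumes, purely for illustration, that silence is counted only after speech has been heard and that any new speech resets the count; Vocoding's actual logic may differ, and the `auto_stop_time_ms` function is a hypothetical name.

```python
from typing import Iterable, Optional


def auto_stop_time_ms(
    frames: Iterable[bool], frame_ms: int, delay_ms: int
) -> Optional[int]:
    """Return the time in ms at which recording would stop, or None.

    Each frame is True when speech was detected in it. Recording stops once
    the silence following speech has lasted at least delay_ms.
    """
    # Number of consecutive silent frames needed (ceiling division).
    needed = -(-delay_ms // frame_ms)
    silent_frames = 0
    heard_speech = False
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_frames = 0  # a pause shorter than the delay is forgiven
        elif heard_speech:
            silent_frames += 1
            if silent_frames >= needed:
                return (index + 1) * frame_ms
    return None


if __name__ == "__main__":
    # 500 ms frames: a silent lead-in, speech, a 1.5 s pause, then more speech.
    frames = [False, True, True, False, False, False, True]
    print(auto_stop_time_ms(frames, frame_ms=500, delay_ms=1500))
    print(auto_stop_time_ms(frames, frame_ms=500, delay_ms=2000))
```

In this example a 1.5 s delay ends the recording during the pause, while a 2 s delay lets the speaker continue, which is the trade-off the three delay ranges describe.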

The streaming mode controls how transcription results are delivered in real time during recording.

Mode	Description	CPU Usage	Best For
Quiet	Minimal streaming -- shows final result only	Lower	Battery saving, background use
Balanced	VAD-gated streaming -- shows partial results while speaking	Medium	Daily use (recommended)
Max Performance	Continuous streaming -- real-time partial transcription	Higher	Live demos, immediate visual feedback

Quiet: Choose this if you don't need partial results and want to minimize CPU usage.
Balanced: The default. Shows partial transcription only while you are actively speaking.
Max Performance: Shows every word as it is recognized, at the cost of more CPU. Suited to presentations or whenever you want immediate visual feedback.
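
The difference between the three modes comes down to when partial results are shown. The sketch below expresses that rule as a single decision function; the `StreamingMode` enum and the `should_emit_partial` helper are hypothetical names, written from the descriptions above rather than from Vocoding's implementation.

```python
from enum import Enum


class StreamingMode(Enum):
    QUIET = "Quiet"
    BALANCED = "Balanced"
    MAX_PERFORMANCE = "Max Performance"


def should_emit_partial(mode: StreamingMode, speech_active: bool) -> bool:
    """Decide whether a partial transcription should be shown right now."""
    if mode is StreamingMode.QUIET:
        return False  # final result only
    if mode is StreamingMode.BALANCED:
        return speech_active  # VAD-gated: only while the user is speaking
    return True  # Max Performance streams continuously


if __name__ == "__main__":
    for mode in StreamingMode:
        print(mode.value, should_emit_partial(mode, speech_active=False))
```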