Transcription

Configure Whisper models, language, Voice Activity Detection, and streaming modes.

All voice transcription in Vocoding is processed 100% locally using Whisper -- no audio data ever leaves your device.


Whisper Model Selection

Choose which Whisper model to use for voice transcription.

Model           Size    Speed    Accuracy  Languages
tiny            75 MB   Fastest  Basic     Multi
tiny.en         75 MB   Fastest  Basic     English only
base            142 MB  Fast     Good      Multi
base.en         142 MB  Fast     Good      English only
small           466 MB  Medium   Better    Multi
small.en        466 MB  Medium   Better    English only
medium          1.5 GB  Slow     High      Multi
large-v3        2.9 GB  Slowest  Best      Multi
large-v3-turbo  1.5 GB  Fast     High      Multi
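
To illustrate how the trade-offs in the table combine, here is a minimal sketch that picks the most accurate model that fits a size budget. It is not part of Vocoding: the `WhisperModel` class, the numeric ranks, and `pick_model` are hypothetical names, and the GB sizes are converted at 1 GB = 1000 MB for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WhisperModel:
    name: str
    size_mb: int        # approximate download size from the table
    speed: int          # hypothetical rank: 0 = Slowest ... 4 = Fastest
    accuracy: int       # hypothetical rank: 1 = Basic ... 5 = Best
    english_only: bool


# Values transcribed from the model table; the ranks are illustrative only.
MODELS = [
    WhisperModel("tiny", 75, 4, 1, False),
    WhisperModel("tiny.en", 75, 4, 1, True),
    WhisperModel("base", 142, 3, 2, False),
    WhisperModel("base.en", 142, 3, 2, True),
    WhisperModel("small", 466, 2, 3, False),
    WhisperModel("small.en", 466, 2, 3, True),
    WhisperModel("medium", 1500, 1, 4, False),
    WhisperModel("large-v3", 2900, 0, 5, False),
    WhisperModel("large-v3-turbo", 1500, 3, 4, False),
]


def pick_model(max_size_mb: int, multilingual: bool) -> Optional[WhisperModel]:
    """Return the best model within the size budget, or None if none fits."""
    candidates = [
        m for m in MODELS
        if m.size_mb <= max_size_mb and not (multilingual and m.english_only)
    ]
    # Prefer accuracy, then speed, then an English-only variant when allowed.
    return max(
        candidates,
        key=lambda m: (m.accuracy, m.speed, m.english_only),
        default=None,
    )
```

With a 1500 MB budget and multilingual use, this sketch prefers large-v3-turbo over medium because both are rated High but the turbo model is faster.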

Status Badges

Each model shows a status badge:

  • Active -- Currently selected and ready
  • Activate -- Downloaded but not selected (click to activate)
  • Download -- Not yet downloaded (click to download)

Model Storage Location

All models are stored locally at ~/.vocoding/models/.
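
If you want to check how much disk space downloaded models occupy, a small script along these lines can list the contents of that directory. This is a sketch only: the file names and formats inside ~/.vocoding/models/ are not documented here, so it simply reports whatever files it finds. `list_model_files` and `human_size` are hypothetical helper names.

```python
from pathlib import Path
from typing import List, Optional, Tuple

# Location stated in the documentation above.
DEFAULT_MODELS_DIR = Path.home() / ".vocoding" / "models"


def human_size(num_bytes: float) -> str:
    """Format a byte count using binary units (1 KB = 1024 B)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"


def list_model_files(models_dir: Optional[Path] = None) -> List[Tuple[str, int]]:
    """Return (file name, size in bytes) pairs, sorted by name."""
    directory = models_dir if models_dir is not None else DEFAULT_MODELS_DIR
    if not directory.is_dir():
        return []  # nothing downloaded yet, or the directory does not exist
    return sorted(
        (p.name, p.stat().st_size) for p in directory.iterdir() if p.is_file()
    )


if __name__ == "__main__":
    for name, size in list_model_files():
        print(f"{name}: {human_size(size)}")
```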


Language

Select the transcription language:

Option       Description
Auto Detect  Whisper detects the language automatically
English      Force English recognition
Espanol      Force Spanish recognition
Francais     Force French recognition
Deutsch      Force German recognition
Italiano     Force Italian recognition
Portugues    Force Portuguese recognition
Chinese      Force Chinese recognition
Japanese     Force Japanese recognition
Korean       Force Korean recognition
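
Whisper identifies languages by short codes such as "en" or "es". How Vocoding passes the selected option to Whisper is not documented here, so the mapping below is only a sketch of what such a lookup could look like. `LANGUAGE_CODES` and `whisper_language` are hypothetical names, and `None` stands for automatic detection.

```python
from typing import Optional

# Option labels as shown in the table above, mapped to Whisper language codes.
LANGUAGE_CODES = {
    "Auto Detect": None,  # let Whisper detect the language
    "English": "en",
    "Espanol": "es",
    "Francais": "fr",
    "Deutsch": "de",
    "Italiano": "it",
    "Portugues": "pt",
    "Chinese": "zh",
    "Japanese": "ja",
    "Korean": "ko",
}


def whisper_language(option: str) -> Optional[str]:
    """Return the language code for a settings option, or None for auto-detect."""
    if option not in LANGUAGE_CODES:
        raise ValueError(f"Unknown language option: {option!r}")
    return LANGUAGE_CODES[option]
```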

Voice Activity Detection (VAD)

VAD detects when you start and stop speaking, making the recording experience more natural.

Setting          Description                                                      Default
Enable VAD       Turn VAD on or off                                               ON
Sensitivity      How sensitive VAD is to voice (0-100%)                           50%
Auto-Stop        Automatically stop recording when silence is detected            OFF
Auto-Stop Delay  How long to wait after silence begins before stopping (0.5-5 s)  1.5s

The Auto-Stop Delay slider appears only when Auto-Stop is enabled.

Sensitivity Tuning

Level             Best For
Low (0-30%)       Noisy environments -- only picks up clear, loud speech
Medium (30-70%)   Most environments -- good balance
High (70-100%)    Quiet environments -- picks up soft speech
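
One way to picture the sensitivity setting is as a loudness threshold: higher sensitivity lowers the level at which audio counts as speech. The sketch below uses a simple energy-based check, which is an assumption for illustration and not necessarily how Vocoding's VAD works. The -20 to -60 dBFS range and the function names are hypothetical.

```python
import math
from typing import Sequence


def threshold_dbfs(sensitivity_pct: float) -> float:
    """Map sensitivity (0-100%) to a level threshold: 0% -> -20 dBFS, 100% -> -60 dBFS."""
    if not 0 <= sensitivity_pct <= 100:
        raise ValueError("sensitivity must be between 0 and 100")
    return -20.0 - 40.0 * sensitivity_pct / 100.0


def frame_level_dbfs(samples: Sequence[float]) -> float:
    """RMS level of a frame of samples in [-1.0, 1.0], expressed in dBFS."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def is_speech(samples: Sequence[float], sensitivity_pct: float = 50.0) -> bool:
    """Treat a frame as speech when its level reaches the threshold."""
    return frame_level_dbfs(samples) >= threshold_dbfs(sensitivity_pct)
```

Under these assumptions, a soft frame at about -46 dBFS is ignored at the default 50% but is picked up at 80%, which matches the guidance in the table: raise sensitivity in quiet rooms and lower it in noisy ones.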

Auto-Stop Delay Tips

  • Short delay (0.5-1s): For quick commands and short phrases
  • Medium delay (1.5-2.5s): For normal speaking with natural pauses
  • Long delay (3-5s): For thoughtful speaking with extended pauses between sentences
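
The effect of the delay can be sketched as a counter over per-frame speech decisions: recording stops once silence has lasted for the full delay. This is an illustrative model, not Vocoding's implementation. The 30 ms frame length, the choice to ignore silence before any speech is heard, and the name `auto_stop_index` are all assumptions.

```python
import math
from typing import Iterable, Optional


def auto_stop_index(
    speech_flags: Iterable[bool],
    delay_s: float = 1.5,
    frame_ms: int = 30,
) -> Optional[int]:
    """Return the index of the frame at which recording would stop, or None.

    speech_flags holds one VAD decision per audio frame (True = speech).
    """
    if not 0.5 <= delay_s <= 5.0:
        raise ValueError("delay must be between 0.5 and 5 seconds")
    frames_needed = math.ceil(delay_s * 1000 / frame_ms)

    heard_speech = False
    silent_run = 0
    for i, speaking in enumerate(speech_flags):
        if speaking:
            heard_speech = True
            silent_run = 0          # any speech resets the silence timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None
```

A pause shorter than the delay resets the counter when speech resumes, which is why longer delays suit speakers who pause between sentences.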

Streaming Modes

The streaming mode controls how transcription results are delivered in real time during recording.

Mode             Description                                                  CPU Usage  Best For
Quiet            Minimal streaming -- shows final result only                 Lower      Battery saving, background use
Balanced         VAD-gated streaming -- shows partial results while speaking  Medium     Daily use (recommended)
Max Performance  Continuous streaming -- real-time partial transcription      Higher     Live demos, immediate visual feedback

Choosing the Right Mode

  • Quiet: Choose this if you don't need to see partial results and want to minimize CPU usage
  • Balanced: The default -- shows partial transcription only when you are actively speaking
  • Max Performance: Shows every word as it's recognized, but uses more CPU. Great for presentations or when you want immediate visual feedback
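
The difference between the three modes can be summarized as a rule for when partial results are shown. The sketch below is a simplified model of the descriptions above, not Vocoding's actual code. `StreamingMode` and `should_emit_partial` are hypothetical names.

```python
from enum import Enum


class StreamingMode(Enum):
    QUIET = "quiet"
    BALANCED = "balanced"
    MAX_PERFORMANCE = "max_performance"


def should_emit_partial(mode: StreamingMode, speech_active: bool) -> bool:
    """Decide whether to show a partial transcription for the current audio.

    speech_active is the VAD decision for the moment in question.
    """
    if mode is StreamingMode.QUIET:
        return False              # only the final result is shown
    if mode is StreamingMode.BALANCED:
        return speech_active      # VAD-gated: partials only while speaking
    return True                   # Max Performance: continuous partials
```

Skipping partial transcription during silence is what keeps Balanced cheaper than Max Performance in this model, while Quiet avoids partial work entirely.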