onnx Initial commit 2026-01-18 16:51:29 -08:00
.gitattributes initial commit 2026-01-19 00:47:05 +00:00
CODE-LICENSE Initial commit 2026-01-18 16:51:29 -08:00
EventEmitter.js Initial commit 2026-01-18 16:51:29 -08:00
index.html Add: increased max generation length 2026-02-02 08:37:21 -08:00
inference-worker.js Add: increased max generation length 2026-02-02 08:37:21 -08:00
onnx-streaming.js Add: increased max generation length 2026-02-02 08:37:21 -08:00
PCMPlayerWorklet.js Initial commit 2026-01-18 16:51:29 -08:00
README.md Add COOP/COEP headers for multi-threading 2026-01-18 18:26:56 -08:00
sentencepiece-browser.js Initial commit 2026-01-18 16:51:29 -08:00
sentencepiece.js Initial commit 2026-01-18 16:51:29 -08:00
server.py Initial commit 2026-01-18 16:51:29 -08:00
style.css Initial commit 2026-01-18 16:51:29 -08:00
tokenizer.model Initial commit 2026-01-18 16:51:29 -08:00
voices.bin Initial commit 2026-01-18 16:51:29 -08:00

title: Pocket TTS ONNX Web Demo
emoji: 🌖
colorFrom: yellow
colorTo: pink
sdk: static
app_file: index.html
pinned: false
license: cc-by-4.0
short_description: Real-time voice cloning entirely in your browser! (CPU)
models:
  - KevinAHM/pocket-tts-onnx
custom_headers:
  cross-origin-embedder-policy: require-corp
  cross-origin-opener-policy: same-origin
  cross-origin-resource-policy: cross-origin
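
When the demo is served from somewhere other than the hosted Space, the same three headers presumably need to be sent by whatever serves the files. The following is a minimal sketch, not the repository's server.py: `applyIsolationHeaders` is a hypothetical helper, and it assumes a response object with a `setHeader(name, value)` method, as Node's `http.ServerResponse` has.

```javascript
// Header values copied from the custom_headers entry above.
const ISOLATION_HEADERS = {
  "Cross-Origin-Embedder-Policy": "require-corp",
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Resource-Policy": "cross-origin",
};

// Hypothetical helper: copy the headers onto any response object that
// exposes setHeader(name, value), then return that object.
function applyIsolationHeaders(res) {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}
```

In a Node static file handler this would be called once per request, before the response body is written.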

Pocket TTS Web Demo

Real-time neural text-to-speech with voice cloning, running entirely in your browser.

Features

  • Voice Cloning: Clone any voice from a short audio sample
  • Predefined Voices: 3 bundled voices (Cosette, Jean, Fantine)
  • Streaming Audio: Real-time audio generation with low latency
  • Pure Browser: No inference server required; all inference runs in WebAssembly in the browser
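
To illustrate the streaming feature: generated audio arrives in chunks of varying length, while an audio worklet consumes fixed-size blocks. The sketch below shows one way to bridge the two with a FIFO queue. It is an assumption for illustration only and is not taken from PCMPlayerWorklet.js; the class name `PcmQueue` and its methods are hypothetical.

```javascript
// Hypothetical FIFO of Float32Array chunks. A producer pushes chunks of any
// size; a consumer pulls fixed-size blocks (for example, one render quantum).
class PcmQueue {
  constructor() {
    this.chunks = []; // pending chunks, oldest first
    this.offset = 0;  // read position inside the oldest chunk
  }

  push(chunk) {
    if (chunk.length > 0) this.chunks.push(chunk);
  }

  // Fill `out` with queued samples, padding with silence on underrun.
  // Returns the number of real samples written.
  pull(out) {
    let written = 0;
    while (written < out.length && this.chunks.length > 0) {
      const head = this.chunks[0];
      const n = Math.min(head.length - this.offset, out.length - written);
      out.set(head.subarray(this.offset, this.offset + n), written);
      written += n;
      this.offset += n;
      if (this.offset === head.length) {
        this.chunks.shift();
        this.offset = 0;
      }
    }
    out.fill(0, written); // zero-fill whatever could not be supplied
    return written;
  }
}
```

Padding with zeros on underrun keeps playback running as silence rather than stalling when generation falls behind.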

Model Files

The demo requires the following ONNX models in the onnx/ directory:

File                      Size     Purpose
mimi_encoder.onnx         ~70 MB   Voice audio → embeddings
text_conditioner.onnx     ~16 MB   Text tokens → embeddings
flow_lm_main_int8.onnx    ~73 MB   AR transformer (INT8)
flow_lm_flow_int8.onnx    ~10 MB   Flow matching network (INT8)
mimi_decoder_int8.onnx    ~22 MB   Latents → audio decoder (INT8)

Additional files:

  • tokenizer.model - SentencePiece tokenizer (~60 KB)
  • voices.bin - Predefined voice embeddings (~1.5 MB)
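
As a rough illustration of what the browser has to fetch, the sketch below lists the model files from the table above and sums their approximate sizes. The names `MODEL_FILES`, `modelUrl`, and `totalApproxMB` are hypothetical, and the `onnx/` base path is assumed from the directory named above.

```javascript
// Approximate sizes in MB, as listed in the table above.
const MODEL_FILES = [
  { file: "mimi_encoder.onnx", approxMB: 70 },
  { file: "text_conditioner.onnx", approxMB: 16 },
  { file: "flow_lm_main_int8.onnx", approxMB: 73 },
  { file: "flow_lm_flow_int8.onnx", approxMB: 10 },
  { file: "mimi_decoder_int8.onnx", approxMB: 22 },
];

// Build the relative URL for a model file (base path is an assumption).
function modelUrl(file, baseUrl = "onnx/") {
  return baseUrl + file;
}

// Sum the approximate sizes of the listed models.
function totalApproxMB(models) {
  return models.reduce((sum, m) => sum + m.approxMB, 0);
}

console.log(totalApproxMB(MODEL_FILES)); // 191
```

By these figures the five models come to roughly 191 MB, with the tokenizer and voice embeddings adding under 2 MB more.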

Browser Requirements

  • Modern browser with WebAssembly support
  • Chrome, Edge, Firefox, or Safari (latest versions)
  • ~200 MB RAM for model loading
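
A page can check these requirements before loading any models. The following is a sketch under stated assumptions: `checkBrowserSupport` is a hypothetical helper, and the threads check reflects the general rule that WebAssembly threads need `SharedArrayBuffer`, which browsers expose only on cross-origin isolated pages.

```javascript
// Hypothetical feature check. `env` defaults to the global object, so in a
// browser it can be called with no arguments.
function checkBrowserSupport(env = globalThis) {
  const wasm =
    typeof env.WebAssembly === "object" &&
    env.WebAssembly !== null &&
    typeof env.WebAssembly.instantiate === "function";
  const sharedMemory = typeof env.SharedArrayBuffer === "function";
  const isolated = env.crossOriginIsolated === true;
  return {
    wasm,                                      // required to run at all
    threads: wasm && sharedMemory && isolated, // needed for multi-threading
  };
}
```

If `threads` is false, a likely cause is that the COOP/COEP headers are missing from the response that served the page.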

Voice Cloning

  1. Click "Upload Voice" or select "Custom (Upload)" from the dropdown
  2. Upload an audio file (WAV, MP3, etc.) with clear speech
  3. For best results, use 3-10 seconds of clean audio
  4. The uploaded voice is encoded and used for all subsequent generations
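
The 3-10 second guideline from the steps above can be checked once the upload has been decoded. This is a minimal sketch, not the demo's own validation: `describeSample` is a hypothetical helper, and it assumes an object with `length` (frames per channel) and `sampleRate` properties, which a Web Audio `AudioBuffer` provides.

```javascript
// Hypothetical check of a decoded sample against the recommended range.
// `buffer` needs `length` (frames per channel) and `sampleRate` (Hz).
function describeSample(buffer, minSeconds = 3, maxSeconds = 10) {
  const seconds = buffer.length / buffer.sampleRate;
  return {
    seconds,
    recommended: seconds >= minSeconds && seconds <= maxSeconds,
  };
}
```

A result with `recommended: false` could be surfaced as a warning rather than an error, since the range is guidance for best results and not a stated hard limit.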

File Structure

pocket-tts-web/
├── index.html              # Main HTML page
├── onnx-streaming.js       # Main thread controller
├── inference-worker.js     # Web Worker for ONNX inference
├── PCMPlayerWorklet.js     # Audio playback worklet
├── EventEmitter.js         # Event utilities
├── sentencepiece.js        # SentencePiece tokenizer library
├── style.css               # Styles
├── tokenizer.model         # SentencePiece model
├── voices.bin              # Predefined voice embeddings
└── onnx/
    ├── mimi_encoder.onnx
    ├── text_conditioner.onnx
    ├── flow_lm_main_int8.onnx
    ├── flow_lm_flow_int8.onnx
    └── mimi_decoder_int8.onnx

License

  • Models & Voice Embeddings: CC BY 4.0 (inherited from kyutai/pocket-tts)
  • Code: Apache 2.0