Mirror of https://huggingface.co/spaces/RobinsAIWorld/pocket-tts-web (synced 2026-05-20 20:11:42 -04:00)
| title | emoji | colorFrom | colorTo | sdk | app_file | pinned | license | short_description | models | custom_headers |
|---|---|---|---|---|---|---|---|---|---|---|
| Pocket TTS ONNX Web Demo | 🌖 | yellow | pink | static | index.html | false | cc-by-4.0 | Real-time voice cloning entirely in your browser! (CPU) | | |
Pocket TTS Web Demo
Real-time neural text-to-speech with voice cloning, running entirely in your browser.
Features
- Voice Cloning: Clone any voice from a short audio sample
- Predefined Voices: 3 bundled voices (Cosette, Jean, Fantine)
- Streaming Audio: Real-time audio generation with low latency
- Pure Browser: No inference server required; synthesis runs entirely in WebAssembly
Model Files
The demo requires the following ONNX models in the onnx/ directory:
| File | Size | Purpose |
|---|---|---|
| mimi_encoder.onnx | ~70 MB | Voice audio → embeddings |
| text_conditioner.onnx | ~16 MB | Text tokens → embeddings |
| flow_lm_main_int8.onnx | ~73 MB | AR transformer (INT8) |
| flow_lm_flow_int8.onnx | ~10 MB | Flow matching network (INT8) |
| mimi_decoder_int8.onnx | ~22 MB | Latents → audio decoder (INT8) |
Additional files:
- tokenizer.model - SentencePiece tokenizer (~60 KB)
- voices.bin - Predefined voice embeddings (~1.5 MB)
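To get a feel for the total download, the approximate sizes from the table can be tallied in a few lines. This is only a sketch: `MODEL_FILES`, `totalModelSizeMB`, and `modelUrls` are hypothetical names that do not come from the demo's source, and the sizes are the rounded figures listed above.

```javascript
// Approximate sizes in MB, copied from the model table above.
// MODEL_FILES and both helpers are illustrative, not part of the demo code.
const MODEL_FILES = [
  { file: "mimi_encoder.onnx", approxMB: 70 },
  { file: "text_conditioner.onnx", approxMB: 16 },
  { file: "flow_lm_main_int8.onnx", approxMB: 73 },
  { file: "flow_lm_flow_int8.onnx", approxMB: 10 },
  { file: "mimi_decoder_int8.onnx", approxMB: 22 },
];

// Sum the approximate sizes of all listed model files.
function totalModelSizeMB(files) {
  return files.reduce((sum, f) => sum + f.approxMB, 0);
}

// Build the expected URL of each model under the onnx/ directory,
// relative to a base path chosen by the caller (an assumption here).
function modelUrls(files, baseUrl) {
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return files.map((f) => base + "onnx/" + f.file);
}

console.log(totalModelSizeMB(MODEL_FILES)); // 191
console.log(modelUrls(MODEL_FILES, ".")[0]); // ./onnx/mimi_encoder.onnx
```

The models alone come to roughly 191 MB; the tokenizer and voice embeddings add under 2 MB on top of that.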
Browser Requirements
- Modern browser with WebAssembly support
- Chrome, Edge, Firefox, or Safari (latest versions)
- ~200 MB RAM for model loading
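A page could check for the needed browser features before loading any models. The sketch below is an assumption, not the demo's actual startup code: `missingFeatures` is a hypothetical helper, and the feature list (`WebAssembly`, `Worker`, `AudioWorkletNode`) is inferred from the stated WebAssembly requirement and from the worker and worklet files in this repository.

```javascript
// Hypothetical helper: return the names of required globals that are
// absent from `env`. In a real page, pass `globalThis`.
// The feature list is an assumption inferred from the repository layout.
function missingFeatures(env) {
  const required = ["WebAssembly", "Worker", "AudioWorkletNode"];
  return required.filter((name) => typeof env[name] === "undefined");
}

// Example with a mock environment that only provides WebAssembly.
const missing = missingFeatures({ WebAssembly: {} });
console.log(missing); // [ 'Worker', 'AudioWorkletNode' ]
```

Taking the environment as a parameter keeps the check easy to exercise outside a browser; an empty result means every listed feature is present.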
Voice Cloning
- Click "Upload Voice" or select "Custom (Upload)" from the dropdown
- Upload an audio file (WAV, MP3, etc.) with clear speech
- Best results with 3-10 seconds of clean audio
- The voice will be encoded and used for all subsequent generations
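The 3-10 second recommendation can be checked before encoding a clip. The following is a minimal sketch under stated assumptions: `clipDurationSeconds` and `classifyClip` are hypothetical helpers, not functions from the demo, and the bounds are the recommended range above rather than a hard limit enforced by the model.

```javascript
// Recommended clip length from the guidance above (not a hard limit).
const MIN_SECONDS = 3;
const MAX_SECONDS = 10;

// Duration of a decoded clip from its sample count and sample rate.
function clipDurationSeconds(numSamples, sampleRate) {
  if (!(sampleRate > 0)) {
    throw new RangeError("sampleRate must be positive");
  }
  return numSamples / sampleRate;
}

// Hypothetical classifier: compare the clip against the recommended range.
function classifyClip(numSamples, sampleRate) {
  const seconds = clipDurationSeconds(numSamples, sampleRate);
  if (seconds < MIN_SECONDS) return "too-short";
  if (seconds > MAX_SECONDS) return "too-long";
  return "ok";
}

console.log(classifyClip(5 * 24000, 24000)); // ok
```

In a browser, the `AudioBuffer` returned by `AudioContext.decodeAudioData` exposes `length` and `sampleRate`, which could be passed straight to `classifyClip`.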
File Structure
pocket-tts-web/
├── index.html              # Main HTML page
├── onnx-streaming.js       # Main thread controller
├── inference-worker.js     # Web Worker for ONNX inference
├── PCMPlayerWorklet.js     # Audio playback worklet
├── EventEmitter.js         # Event utilities
├── sentencepiece.js        # SentencePiece tokenizer library
├── style.css               # Styles
├── tokenizer.model         # SentencePiece model
├── voices.bin              # Predefined voice embeddings
└── onnx/
    ├── mimi_encoder.onnx
    ├── text_conditioner.onnx
    ├── flow_lm_main_int8.onnx
    ├── flow_lm_flow_int8.onnx
    └── mimi_decoder_int8.onnx
License
- Models & Voice Embeddings: CC BY 4.0 (inherited from kyutai/pocket-tts)
- Code: Apache 2.0