mirror of https://huggingface.co/spaces/RobinsAIWorld/pocket-tts-web synced 2026-05-20 20:11:42 -04:00

Ubuntu 294ea8d966 Add: increased max generation length		2026-02-02 08:37:21 -08:00
onnx	Initial commit	2026-01-18 16:51:29 -08:00
.gitattributes	initial commit	2026-01-19 00:47:05 +00:00
CODE-LICENSE	Initial commit	2026-01-18 16:51:29 -08:00
EventEmitter.js	Initial commit	2026-01-18 16:51:29 -08:00
index.html	Add: increased max generation length	2026-02-02 08:37:21 -08:00
inference-worker.js	Add: increased max generation length	2026-02-02 08:37:21 -08:00
onnx-streaming.js	Add: increased max generation length	2026-02-02 08:37:21 -08:00
PCMPlayerWorklet.js	Initial commit	2026-01-18 16:51:29 -08:00
README.md	Add COOP/COEP headers for multi-threading	2026-01-18 18:26:56 -08:00
sentencepiece-browser.js	Initial commit	2026-01-18 16:51:29 -08:00
sentencepiece.js	Initial commit	2026-01-18 16:51:29 -08:00
server.py	Initial commit	2026-01-18 16:51:29 -08:00
style.css	Initial commit	2026-01-18 16:51:29 -08:00
tokenizer.model	Initial commit	2026-01-18 16:51:29 -08:00
voices.bin	Initial commit	2026-01-18 16:51:29 -08:00

README.md

Key	Value
title	Pocket TTS ONNX Web Demo
emoji	🌖
colorFrom	yellow
colorTo	pink
sdk	static
app_file	index.html
pinned	false
license	cc-by-4.0
short_description	Real-time voice cloning entirely in your browser! (CPU)
models	KevinAHM/pocket-tts-onnx
custom_headers	cross-origin-embedder-policy: require-corp; cross-origin-opener-policy: same-origin; cross-origin-resource-policy: cross-origin
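
The cross-origin headers listed above make the page cross-origin isolated, which browsers require before they expose `SharedArrayBuffer`, and multi-threaded WebAssembly depends on that. The following is a minimal sketch of how a page might pick a thread count from that state. `pickThreadCount` and the cap of 4 threads are hypothetical and are not taken from this repository.

```javascript
// Hypothetical helper: decide how many WASM threads to request.
// `isolated` would normally be `globalThis.crossOriginIsolated`,
// `cores` would normally be `navigator.hardwareConcurrency`.
function pickThreadCount(isolated, cores, maxThreads = 4) {
  // Without cross-origin isolation, SharedArrayBuffer is unavailable,
  // so fall back to a single thread.
  if (!isolated) return 1;
  const available = Number.isInteger(cores) && cores > 0 ? cores : 1;
  return Math.min(available, maxThreads);
}

// Assumed usage in a browser; outside a browser this falls back to 1.
const threads = pickThreadCount(
  globalThis.crossOriginIsolated === true,
  globalThis.navigator ? globalThis.navigator.hardwareConcurrency : 1
);
console.log(`Requesting ${threads} WASM thread(s)`);
```

If the demo uses onnxruntime-web (an assumption), the result could be assigned to `ort.env.wasm.numThreads` before any session is created.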

Pocket TTS Web Demo

Real-time neural text-to-speech with voice cloning, running entirely in your browser.

Features

Voice Cloning: Clone any voice from a short audio sample
Predefined Voices: 3 bundled voices (Cosette, Jean, Fantine)
Streaming Audio: Real-time audio generation with low latency
Pure Browser: No inference server required; all models run locally in WebAssembly

Model Files

The demo requires the following ONNX models in the onnx/ directory:

File	Size	Purpose
`mimi_encoder.onnx`	~70 MB	Voice audio → embeddings
`text_conditioner.onnx`	~16 MB	Text tokens → embeddings
`flow_lm_main_int8.onnx`	~73 MB	Autoregressive transformer (INT8)
`flow_lm_flow_int8.onnx`	~10 MB	Flow matching network (INT8)
`mimi_decoder_int8.onnx`	~22 MB	Latents → audio decoder (INT8)

Additional files:

tokenizer.model - SentencePiece tokenizer (~60 KB)
voices.bin - Predefined voice embeddings (~1.5 MB)
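
Summing the sizes listed above gives a rough idea of how much data the browser fetches on first load. The sketch below encodes the files as a manifest and adds them up. `ASSETS`, `totalSizeMB`, and `assetUrls` are illustrative names, and the sizes are the approximate figures from this README rather than measured values.

```javascript
// Approximate sizes in MB, copied from the table and list above.
const ASSETS = [
  { path: "onnx/mimi_encoder.onnx", sizeMB: 70 },
  { path: "onnx/text_conditioner.onnx", sizeMB: 16 },
  { path: "onnx/flow_lm_main_int8.onnx", sizeMB: 73 },
  { path: "onnx/flow_lm_flow_int8.onnx", sizeMB: 10 },
  { path: "onnx/mimi_decoder_int8.onnx", sizeMB: 22 },
  { path: "tokenizer.model", sizeMB: 0.06 },
  { path: "voices.bin", sizeMB: 1.5 },
];

// Sum the approximate sizes of all assets.
function totalSizeMB(assets) {
  return assets.reduce((sum, asset) => sum + asset.sizeMB, 0);
}

// Hypothetical helper: resolve each asset path against a base URL.
function assetUrls(assets, baseUrl) {
  return assets.map((asset) => new URL(asset.path, baseUrl).href);
}

console.log(`Total download: ~${totalSizeMB(ASSETS).toFixed(1)} MB`);
```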

Browser Requirements

Modern browser with WebAssembly support
Chrome, Edge, Firefox, or Safari (latest versions)
~200 MB RAM for model loading
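
A page could check these requirements up front before downloading any models. The sketch below is one possible feature check, assuming the demo needs WebAssembly, Web Workers, and AudioWorklet (the latter two inferred from the file names in this repository). `missingFeatures` is a hypothetical helper, not part of the demo.

```javascript
// Returns the names of required features that are missing from `env`.
// Pass `globalThis` when running in a browser.
function missingFeatures(env) {
  const checks = {
    WebAssembly: typeof env.WebAssembly === "object",
    Worker: typeof env.Worker === "function",
    AudioWorklet: typeof env.AudioWorkletNode === "function",
  };
  return Object.keys(checks).filter((name) => !checks[name]);
}

const missing = missingFeatures(globalThis);
if (missing.length > 0) {
  console.warn(`Unsupported environment, missing: ${missing.join(", ")}`);
}
```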

Voice Cloning

Click "Upload Voice" or select "Custom (Upload)" from the dropdown
Upload an audio file (WAV, MP3, etc.) with clear speech
Best results with 3-10 seconds of clean audio
The voice will be encoded and used for all subsequent generations
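
The steps above imply some preprocessing between the upload and the encoder. The following is a sketch of what that could look like, assuming the clip is downmixed to mono and capped at 10 seconds. The function names are hypothetical, and resampling to whatever rate the encoder expects is not shown.

```javascript
// Average all channels into a single mono Float32Array.
function toMono(channels) {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i] / channels.length;
  }
  return mono;
}

// Keep at most `maxSeconds` of audio from the start of the clip.
function clampDuration(samples, sampleRate, maxSeconds = 10) {
  const maxSamples = Math.floor(sampleRate * maxSeconds);
  return samples.length > maxSamples ? samples.subarray(0, maxSamples) : samples;
}

// Browser-only part (assumed usage): decode an uploaded File with Web Audio.
async function prepareVoiceSample(file, audioContext) {
  const decoded = await audioContext.decodeAudioData(await file.arrayBuffer());
  const channels = [];
  for (let c = 0; c < decoded.numberOfChannels; c++) {
    channels.push(decoded.getChannelData(c));
  }
  return clampDuration(toMono(channels), decoded.sampleRate);
}
```

`decodeAudioData` handles the container formats the browser supports, which is why WAV and MP3 can both be accepted without extra decoding code.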

File Structure

pocket-tts-web/
├── index.html              # Main HTML page
├── onnx-streaming.js       # Main thread controller
├── inference-worker.js     # Web Worker for ONNX inference
├── PCMPlayerWorklet.js     # Audio playback worklet
├── EventEmitter.js         # Event utilities
├── sentencepiece.js        # SentencePiece tokenizer library
├── style.css               # Styles
├── tokenizer.model         # SentencePiece model
├── voices.bin              # Predefined voice embeddings
└── onnx/
    ├── mimi_encoder.onnx
    ├── text_conditioner.onnx
    ├── flow_lm_main_int8.onnx
    ├── flow_lm_flow_int8.onnx
    └── mimi_decoder_int8.onnx
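
The layout above suggests a pipeline in which the Web Worker produces audio chunks and the playback worklet consumes them. As an illustration only, the sketch below shows one way a worklet could buffer incoming PCM chunks and fill fixed-size output blocks. `PcmQueue` is hypothetical and does not reflect the actual contents of `PCMPlayerWorklet.js`.

```javascript
// Hypothetical FIFO of PCM chunks, as a playback worklet might keep.
class PcmQueue {
  constructor() {
    this.chunks = []; // pending Float32Array chunks
    this.offset = 0;  // read position inside chunks[0]
  }

  // Append a chunk of samples received from the inference worker.
  push(chunk) {
    if (chunk.length > 0) this.chunks.push(chunk);
  }

  // Fill `output` with queued samples and pad with silence on underrun.
  // Returns the number of queued samples that were written.
  pull(output) {
    let written = 0;
    while (written < output.length && this.chunks.length > 0) {
      const head = this.chunks[0];
      const n = Math.min(head.length - this.offset, output.length - written);
      output.set(head.subarray(this.offset, this.offset + n), written);
      written += n;
      this.offset += n;
      if (this.offset === head.length) {
        this.chunks.shift();
        this.offset = 0;
      }
    }
    output.fill(0, written);
    return written;
  }
}
```

Inside an `AudioWorkletProcessor`, `process(inputs, outputs)` could call `queue.pull(outputs[0][0])`, with chunks arriving through `port.onmessage`. Padding with zeros keeps playback silent instead of glitching when generation falls behind.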

License

Models & Voice Embeddings: CC BY 4.0 (inherited from kyutai/pocket-tts)
Code: Apache 2.0