GPT-SoVITS: Powerful Few-Shot Voice Cloning and TTS Tool

GPT-SoVITS is an advanced few-shot voice cloning and text-to-speech (TTS) tool that achieves high-quality voice cloning with as little as 1 minute of reference audio. Developed by RVC-Boss, the project combines GPT-based semantic modeling with a SoVITS acoustic model (an improved variant of VITS) and has gained widespread attention in the open-source community. Its core strengths are extremely low training-data requirements, cross-lingual voice generation, and a user-friendly WebUI, making professional-grade speech synthesis accessible.

Core Features

GPT-SoVITS’s key advantage is its few-shot learning capability. Traditional voice cloning systems typically require hours of audio data, whereas GPT-SoVITS can clone a voice from roughly 1 minute of reference audio. This stems from its two-stage generation architecture.

Key features of the project include:

  • Zero-shot and few-shot TTS: zero-shot synthesis needs only a short (about 5-second) reference clip with no training, while few-shot fine-tuning needs roughly 1 minute of audio
  • Cross-lingual voice generation: Generate English, Japanese, and other languages using Chinese training data
  • High-quality audio output: Maintains natural prosody and emotional expression
  • Complete training and inference pipeline: Integrates data preprocessing, ASR alignment, model training, and inference

What makes it unique is its hybrid architecture. GPT-SoVITS separates content generation from timbre: a GPT model handles semantic understanding and prosody prediction, and SoVITS (an improved variant of VITS) then generates the acoustic features. This design not only improves sound quality but also yields better controllability and cross-lingual capability. The project supports multiple languages, including Chinese, English, and Japanese, and can carry a speaker's timbre across languages.

Installation and Deployment Methods

System Requirements

Software Environment

  • Windows, Linux, or macOS operating system
  • Python 3.9 or 3.10 (3.10 recommended)
  • pip, conda, or uv package manager
  • FFmpeg

Hardware Configuration

  • NVIDIA GPU (for training and GPU-accelerated inference)

Manual Installation Process using uv

The uv package manager offers a faster installation experience than traditional pip, making it one of the recommended installation methods.

Complete installation steps are as follows:

# 1. Clone the repository
git clone https://github.com/RVC-Boss/GPT-SoVITS.git
cd GPT-SoVITS

# 2. Create a virtual environment using uv
uv venv --python 3.10

# 3. Activate the virtual environment
# Windows: .venv\Scripts\activate
# Linux/Mac: source .venv/bin/activate

# 4. Install dependencies (note order and parameters)
uv pip install -r extra-req.txt --no-deps
uv pip install -r requirements.txt
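
After installation, it can help to verify that PyTorch can see the GPU before launching the WebUI. A minimal sanity check, assuming the virtual environment created above is active:

# Optional: confirm the PyTorch version and that CUDA is available
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"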

Docker Installation Method

Prerequisites

  • Docker and NVIDIA Container Toolkit (for GPU support) must be installed
  • Ensure NVIDIA drivers are working correctly

The Docker method provides a ready-to-use, pre-configured environment, suitable for users who want to deploy quickly.

Deployment

Use the official docker compose file included in the repository, as sketched below.
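
A minimal sketch of the Docker workflow, assuming the compose file shipped in the repository root; service names, image tags, and port mappings vary between releases, so check the repository's Docker documentation for the exact commands:

# 1. Clone the repository (it contains the official compose file)
git clone https://github.com/RVC-Boss/GPT-SoVITS.git
cd GPT-SoVITS

# 2. Start the pre-configured service in the background
docker compose up -d

# 3. Once the container is running, open the WebUI in a browser (default port 9874)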

WebUI Usage Guide

Launching the WebUI

After installation, choose the corresponding launch command based on your installation method:

# Standard Python environment
python webui.py

# Using uv environment
uv run python webui.py

# Windows batch file
go-webui.bat

After launching, access http://localhost:9874/ to see the WebUI interface.

The WebUI includes three main tabs:

  • 0-Preprocessing Dataset Acquisition Tool: Data preparation
  • 1-GPT-SoVITS-TTS: Model training and inference
  • 2-GPT-SoVITS-Voice Changer: Real-time voice changing (under development)

0-Preprocessing Dataset Acquisition Tool

This tool group provides a complete data preprocessing pipeline, including four steps:

0a-UVR5 Vocal Separation

Separates vocals from background music in audio, ensuring clean training data. Suitable for processing videos or audio materials containing background music.

0b-Audio Splitting

Automatically splits long audio into short segments suitable for training (typically 2-10 seconds).

Key Parameters:

  • threshold (volume threshold): Default -34 dB, controls silence detection sensitivity
  • min_length: Minimum segment length, default 4000 ms
  • max_sil_kept: Length of silence to keep, default 500 ms

Adjust parameters according to audio quality; if noise is high, the threshold can be increased appropriately.

0c-Speech Recognition

Automatically generates text annotations using an ASR model. Output format is a .list file, with each line containing the audio path and corresponding text.
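
For reference, each line of the generated .list file uses a pipe-separated layout roughly like the illustrative example below; the path and values here are made up, and the exact fields may vary between versions, so inspect the generated file itself:

# Illustrative .list line: audio_path|speaker_name|language|text
output/slicer_opt/example_0001.wav|speaker_name|EN|This is an example sentence.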

Model Selection Suggestions:

  • Chinese: Damo ASR (Chinese)
  • English/Japanese: Select the ASR model for the corresponding language
  • Model size: larger models are more accurate but slower; smaller models are faster but may be less accurate

0d-Speech Text Proofreading

Provides a visual interface for manual proofreading of ASR recognition results.

Proofreading Focus:

  • Correcting specialized terms, names of people, and places
  • Adjusting punctuation to match prosody
  • Deleting low-quality audio samples
  • Maintaining consistent annotation style

Data quality directly impacts training results, so careful proofreading of each annotation is recommended.

1-GPT-SoVITS-TTS

Core training and inference interface, containing three sub-tabs.

Top Configuration

  • Experiment/Model Name: Set a unique identifier name for the training experiment
  • Training Model Version: Recommended to choose v2Pro (best compatibility and stability)

1A-Training Set Formatting Tool

Converts preprocessed data into model training format, including three steps:

1Aa-Text Tokenization and Feature Extraction

Uses the BERT model to extract text semantic features. The right text box allows viewing and editing the training sample list; it is recommended to delete low-quality samples.

1Ab-Audio Self-Supervised Feature Extraction

Extracts deep acoustic features (timbre, prosody, pitch, etc.) from audio. This step is computationally intensive, and processing time depends on the total audio duration and GPU performance.

1Ac-Semantic Token Extraction

Generates semantic tokens using a SoVITS pre-trained model. Ensure that the pre-trained model version matches the training version selected at the top.

One-Click Triple Action

For small datasets (5-10 minutes of audio), the “One-Click Triple Action” button can be used to automatically complete the above three steps.

1B-Fine-tuning Training

1Ba-SoVITS Training

Trains the acoustic model, learning target timbre features.

Core Parameters:

  • batch_size: Adjust according to GPU VRAM
  • total_epoch: Default 8, range 6-15
  • save_every_epoch: Default 4, recommended to set to 2-3 for small datasets

Storage Options:

  • For first training, it is recommended not to check “Only save latest weights” to compare effects at different stages
  • Check “Save small model to weights folder” for quick testing

Training Time Reference:

  • 3 minutes audio + RTX 2070: approximately 20-40 minutes
  • 5 minutes audio + RTX 3090: approximately 15-30 minutes

1Bb-GPT Training

Trains the semantic model, learning text-to-speech mapping and prosody patterns.

Parameter Settings:

  • total_epoch: Default 15 (requires more epochs than SoVITS)
    • Small dataset: 12-18 epochs
    • Medium dataset: 10-15 epochs
  • batch_size: GPT has higher VRAM requirements; if VRAM is insufficient, prioritize reducing this parameter
  • DPO training mode: Experimental feature, not recommended for first training

Training Order:

  1. Complete SoVITS training first
  2. Test SoVITS effect to ensure satisfactory timbre quality
  3. Then proceed with GPT training
  4. Test jointly using both models

1C-Inference

Speech synthesis interface, supporting both zero-shot and fine-tuned model modes.

Model Selection

Zero-shot inference: Select the “Infer directly without training” base model for both dropdowns

Fine-tuned model inference: Select the trained custom model (click “Refresh model path” button after training to update the list)

Inference WebUI

Enable the “Open TTS Inference WebUI” option to launch a separate inference interface (usually at http://localhost:9872/).

Left: Reference Audio and Text

  • Reference Audio: Upload 3-10 seconds of clear target timbre audio
    • Supports drag and drop, click to upload, or direct recording
    • Audio quality directly affects cloning effect
  • Synthesized Text: Enter the text to be converted to speech
    • Pay attention to punctuation (affects prosody and pauses)
    • Long text will be automatically segmented

Right: Synthesis Parameters

  • Language Settings:
    • Reference audio language: Set the language of the reference audio
    • Synthesis language: Set the language of the output speech
    • Supports cross-lingual cloning (e.g., generating English speech with a Chinese timbre)
  • Splitting Method (how to split):
    • Default “Split every four sentences” suitable for most scenarios
    • For short text, “Do not split” can be selected
    • For long text, it is recommended to split by punctuation
  • Speech Rate Adjustment: Default 1.0, range 1.0-1.5
    • Keeping default is most natural
    • Not recommended to exceed 1.3
  • Pause between sentences: Default 0.3 seconds
    • Normal conversation: 0.2-0.4 seconds
    • Reading style: 0.4-0.6 seconds

GPT Sampling Parameters (Advanced):

  • top_k (default 15): Controls sampling conservatism
    • Lower (5-10): More stable and consistent
    • Higher (20-50): More natural and diverse
  • top_p (default 1): Nucleus sampling threshold, recommended to keep default or 0.8-0.95
  • temperature (default 1): Controls randomness
    • 0.6-0.9: More stable
    • 1.1-1.3: More variation
    • 1.0 is most balanced

After configuration, click the “Synthesize Speech” button. The generated audio will be displayed at the bottom of the interface and can be played and downloaded directly.
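
The same sampling parameters can also be used outside the WebUI through the repository's optional API server (api_v2.py). The request below is a hypothetical sketch that assumes this server is running on its default port (9880 in recent versions); parameter names and defaults may differ between versions, so check the script's built-in documentation before relying on them:

# Hypothetical request to the optional api_v2.py service; the port and
# parameter names are assumptions and may differ in your version
curl --output output.wav \
  "http://127.0.0.1:9880/tts?text=Hello%20world&text_lang=en&ref_audio_path=ref.wav&prompt_text=reference%20transcript&prompt_lang=en&top_k=15&top_p=1&temperature=1"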

Complete Usage Workflow

Quick Experience (Zero-shot)

  1. Launch WebUI
  2. Go to “1C-Inference” tab
  3. Select default base model
  4. Upload 3-10 seconds of reference audio
  5. Enter synthesis text
  6. Click “Synthesize Speech”

Fine-tuning Training (High Quality)

Phase One: Data Preparation

  1. Prepare 3-10 minutes of high-quality audio in the target timbre
  2. Use “0a-Vocal Separation” to clean audio (if there is background music)
  3. Use “0b-Audio Splitting” to split audio
  4. Use “0c-Speech Recognition” to generate text annotations
  5. Use “0d-Proofread Annotations” to manually correct errors

Phase Two: Data Formatting

In “1A-Training Set Formatting Tool”:

  1. Execute “1Aa-Text Tokenization and Feature Extraction”
  2. Execute “1Ab-Audio Self-Supervised Feature Extraction”
  3. Execute “1Ac-Semantic Token Extraction”

Or directly use “One-Click Triple Action”

Phase Three: Model Training

In “1B-Fine-tuning Training”:

  1. Configure SoVITS training parameters, click “Start SoVITS Training”
  2. Wait for training to complete (monitor loss value changes)
  3. Configure GPT training parameters, click “Start GPT Training”
  4. Wait for training to complete

Phase Four: Inference Testing

  1. Go to “1C-Inference”
  2. Refresh and select the trained model
  3. Upload reference audio (can be shorter after fine-tuning)
  4. Enter test text
  5. Adjust parameters and generate speech
  6. Evaluate results, adjust parameters and regenerate if necessary

Optimization Suggestions

Inaccurate timbre:

  • Increase training data volume (5-10 minutes)
  • Ensure training data is clean and timbre consistent
  • Adjust SoVITS training epochs

Unnatural prosody:

  • Check accuracy of punctuation in text annotations
  • Try different splitting methods
  • Adjust GPT training epochs

Unclear audio quality:

  • Improve original audio quality
  • Check splitting parameters (avoid overly short segments)
  • Reduce speech rate during synthesis

Practical Application Scenarios

Content Creation:

  • Audiobook and podcast production
  • Video dubbing and narration
  • Game character voices
  • Multilingual content localization

Assistive Technology:

  • Personalized voice assistants
  • Language learning tools
  • Accessibility aids

Enterprise Applications:

  • Virtual customer service systems
  • Corporate training materials
  • Brand voice uniformity

Summary

GPT-SoVITS brings professional-grade voice cloning technology to a wider user base, with its innovative two-stage architecture achieving a balance of high quality and low data requirements.

Getting Started Paths:

  • Quick Test: Docker + Zero-shot Inference
  • Professional Production: Manual Installation + Fine-tuning Training

The project is continuously updated. Visit the GitHub repository for the latest documentation and pre-trained models.
