SO-VITS-SVC 4.0 and 4.1 local inference on Intel/Apple Silicon Mac

Introduction

SO-VITS-SVC is an open-source project for singing voice conversion. It builds on VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) models and lets users convert spoken or sung audio into the voice of a different character or person.

Primarily targeted at enthusiasts in deep learning and voice synthesis, as well as researchers and hobbyists interested in voice manipulation and anime character voice generation, SO-VITS-SVC serves as a practical tool for applying theoretical knowledge in deep learning to real-world scenarios. The project enables users to experiment with various aspects of voice conversion, including timbre, pitch, and rhythm alterations.

This tutorial explains how to run the project on the CPU on a Mac.

This tutorial is based on videos and practice from https://www.bilibili.com.
Below are the reference videos and documents:

So-VITS-SVC 4.1 Integration Package Complete Guide, created by bilibili@羽毛布団. It’s very good.
https://www.yuque.com/umoubuton/ueupp5

Detailed usage record of so-vits-svc 4.1, source: CSDN
https://blog.csdn.net/qq_17766199/article/details/132436306

Don’t plan on training on a Mac yet; it’s good enough that a Mac can preprocess and infer. Running an LLM might be possible, but if anyone has successfully trained on a Mac (with MPS), please let me know.

This tutorial mainly covers inference after the model has been trained and downloaded to the local machine. I have tested it, and it all works.

Training is covered in detail in the references above. The dataset is the key, and training takes patience.

Project link: https://github.com/svc-develop-team/so-vits-svc

This tutorial is for communication and learning purposes only. Please do not use it for illegal, immoral, or unethical purposes.

Please ensure that you address any authorization issues related to the dataset on your own. You bear full responsibility for any problems arising from the usage of non-authorized datasets for training, as well as any resulting consequences. The repository and its maintainer, svc develop team, disclaim any association with or liability for the consequences.

It is strictly forbidden to use it for any political-related purposes.

Software requirements:

Homebrew https://brew.sh/
VS Code (optional)
Python 3

For SO-VITS-SVC 4.0, which this tutorial runs through So-Vits-SVC-Fork, install Tk support for Python first to prevent errors due to missing packages:

brew install python-tk@3.11

For SO-VITS-SVC 4.1, install Python 3.10 to prevent incompatibility issues with Python 3.11 when using the WebUI:

brew install python@3.10

Choose the model you need; you don’t need to install both versions.
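As a quick sanity check on the interpreter choice above, the following Python sketch reports whether the running interpreter is one of the versions this tutorial mentions for the 4.1 WebUI (Python 3.10, with 3.9 also noted as working). The function name `webui_python_ok` and the accepted-version set are illustrative assumptions taken from this tutorial's notes, not something the project itself defines.

```python
import sys

# Versions this tutorial reports as working with the 4.1 WebUI.
# This set is an assumption drawn from the tutorial, not an official list.
WEBUI_TESTED_VERSIONS = {(3, 9), (3, 10)}


def webui_python_ok(version_info=None):
    """Return True if (major, minor) is one of the versions tested here."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in WEBUI_TESTED_VERSIONS


if __name__ == "__main__":
    current = "%d.%d" % tuple(sys.version_info[:2])
    if webui_python_ok():
        print("Python " + current + " matches a version tested for the 4.1 WebUI.")
    else:
        print("Python " + current + " is untested here; consider a python3.10 venv.")
```

Run it with the interpreter you plan to use for the virtual environment, for example `python3.10 check_python.py` (the file name is arbitrary).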

SO-VITS-SVC 4.0 Inference

1. Create venv

Create a virtual environment

python3 -m venv myenv  # replace 'myenv' with any name you like

2. Enter venv

cd myenv

source bin/activate

3. Install packages

python -m pip install -U pip setuptools wheel 

pip3 install torch torchvision torchaudio 

pip install -U so-vits-svc-fork
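The pip commands above install into whichever Python environment is currently active, so it can help to confirm that the virtual environment from steps 1 and 2 is really in effect. The sketch below relies on the standard behaviour that `sys.prefix` and `sys.base_prefix` differ inside a venv; the helper name `in_virtualenv` is only illustrative.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the interpreter the venv was created from.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a venv:", in_virtualenv())
```

If it prints `inside a venv: False`, run `source bin/activate` again in the environment directory before installing anything.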

4. Start the service

svcg

Turn off “Use GPU”.
Click “infer” to start inference.
Try “F0 predict” as well.
I could not get a 4.1 model to work in this GUI, so use the official WebUI for 4.1 models.


SO-VITS-SVC 4.1 Inference

Use the official repository’s WebUI: https://github.com/svc-develop-team/so-vits-svc

1. Create venv

Create a virtual environment

python3.10 -m venv myenv  # replace 'myenv' with any name you like; python3.9 also works

2. Enter venv

cd myenv

source bin/activate

3. Clone the repository

git clone https://github.com/svc-develop-team/so-vits-svc.git

4. Enter the directory

cd so-vits-svc

5. Install packages

pip install -r requirements.txt

6. Start WebUI

(it’s normal if you can’t load models after entering WebUI)

python webUI.py

If the WebUI fails with dependency-related errors, pin these versions: fastapi==0.84.0, gradio==3.41.2, pydantic==1.10.12. Install the pinned versions with:

pip install --upgrade fastapi==0.84.0
pip install --upgrade gradio==3.41.2
pip install --upgrade pydantic==1.10.12
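To see whether the active environment already matches these pins, a small check with the standard library's `importlib.metadata` can be used. This is a minimal sketch: the `PINS` table simply repeats the three versions above, and the helper names `installed_version` and `find_mismatches` are made up for illustration.

```python
from importlib import metadata

# Pins copied from this tutorial.
PINS = {"fastapi": "0.84.0", "gradio": "3.41.2", "pydantic": "1.10.12"}


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a (wanted, found) pair; found is None for a missing package."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINS).items():
        print(f"{name}: pinned {wanted}, found {found}")
```

No output means all three packages are already at the pinned versions. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.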

7. Download some missing files

(different files may be missing depending on the model)

Most of the missing files belong in the pretrain folder. Add them one at a time according to the errors printed on the command line. You can copy the necessary files from the cloud training server or download them from the provided links.

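Training configuration currently in use (notebook cells from the training server; the speech encoder, F0 predictor, and diffusion options chosen here determine which pretrained files inference needs):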

# By default, use 768 with volume embedding
!python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug  
!rsync -v pre_trained_model/768l12/vol_emb/* logs/44k  
# By default, use rmvpe with shallow diffusion
!python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff  
%cp pre_trained_model/diffusion/768l12/* logs/44k/diffusion  
##### Train shallow diffusion model
!python train_diff.py -c configs/diffusion.yaml   
##### Train main model
!python train.py -c configs/config.json -m 44k

Source: kiss丿冷鸟鸟


Example of the pretrain folder structure:

.  
├── checkpoint_best_legacy_500.pt  
├── chinese-hubert-large-fairseq-ckpt.pt  
├── fcpe.pt  
├── hubert-soft-0d54a1f4.pt  
├── __init__.py  
├── medium.pt  
├── meta.py  
├── nsf_hifigan  
│   ├── config.json  
│   ├── model  
│   ├── NOTICE.txt  
│   ├── NOTICE.zh-CN.txt  
│   └── put_nsf_hifigan_ckpt_here  
├── put_hubert_ckpt_here  
└── rmvpe.pt
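Rather than waiting for each error in turn, the folder can be compared against a list of expected files up front. The sketch below is only an illustration: `EXAMPLE_FILES` is a subset of the listing above picked for demonstration, the `missing_files` helper is hypothetical, and the files a particular model really needs depend on how it was trained.

```python
from pathlib import Path

# File names taken from the example listing above; which ones a given model
# actually needs depends on its encoder, F0 predictor, and diffusion settings.
EXAMPLE_FILES = [
    "checkpoint_best_legacy_500.pt",
    "rmvpe.pt",
    "nsf_hifigan/config.json",
    "nsf_hifigan/model",
]


def missing_files(pretrain_dir, expected=EXAMPLE_FILES):
    """Return the expected relative paths that do not exist under pretrain_dir."""
    root = Path(pretrain_dir)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "pretrain" is assumed to be relative to the so-vits-svc checkout.
    for rel in missing_files("pretrain"):
        print("missing:", rel)
```

Run it from inside the cloned so-vits-svc directory and edit the list to match the files your own model reports as missing.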


Summary

On a Mac there are still some issues with MPS, so inference currently runs on the CPU. It does work, though. This is as far as my own skills go; those with Windows and an Nvidia card will have a more comfortable experience.
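For readers who want to see what PyTorch reports on their own machine, here is a minimal sketch of a device check that defaults to the CPU, as this tutorial does. It is not how so-vits-svc itself picks a device; `select_device` and its `prefer_cpu` flag are hypothetical names, and the function falls back to "cpu" when PyTorch is not installed.

```python
def select_device(prefer_cpu=True):
    """Return "cpu", "mps", or "cuda" as a device string.

    prefer_cpu=True mirrors this tutorial's setup, where MPS is skipped
    because of the issues mentioned above.
    """
    if prefer_cpu:
        return "cpu"
    try:
        import torch  # imported lazily so the check works without PyTorch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("default choice:", select_device())
    print("best available:", select_device(prefer_cpu=False))
```

Seeing "mps" reported as available does not mean the models in this tutorial run correctly on it; it only shows that the backend is present.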
Model: The training results are pretty good. You don’t need to train for long; a few hours on the server is enough, depending on the dataset. All datasets used for training were synthetic data generated with ElevenLabs, of decent quality for audio inferred from text. For TTS, Fish-Speech and Bert-VITS2 (good for Chinese) are recommended.

Thank you for reading. Please point out any issues or better methods in this tutorial.

Version: 1.0

Banner: OPPO Reno 11 Pro Wallpaper