Stars
Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach
[ICASSP 2025] FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion
VoiceBench: Benchmarking LLM-Based Voice Assistants
MiMo-Audio: Audio Language Models are Few-Shot Learners
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable…
Long-form streaming TTS system for multi-speaker dialogue generation
Official code for EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion [EMNLP 2025 Findings]
Readable Implementation of Ministral 8B from Mistral
Customizes SoundStorm for the voice conversion use case
An attempt to replicate the architecture of MiniMaxTTS described in its technical report
DiFlow-TTS delivers low-latency zero-shot TTS via discrete flow matching and factorized speech tokens. A compact, open framework for fast voice synthesis.🐙
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Readable implementation of Qwen3 0.6B model
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Lan…
A Unified Framework for Expressive Speech Synthesis with Voice Cloning
Official PyTorch Implementation for "DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion" (AAAI 2024)
Text-audio foundation model from Boson AI
Self-Supervised Speech Pre-training and Representation Learning Toolkit
PyTorch Implementation of TCSinger 2 (ACL 2025): Customizable Multilingual Zero-shot Singing Voice Synthesis
A repo that builds text-to-music datasets from scratch, used in MuseControlLite [ICML 2025]
The reproduction code for Qwen-Audio fine-tuning
Extends the StableTTS project to update existing recordings when their content has changed.
Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
Repo for StableTTS consistency flow matching