Papers on Neural Codec Models

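Introduction: This article collects a number of notable studies on neural audio codecs and speech language models, covering work from 2020 through 2024. Topics include end-to-end audio codecs, efficient audio generation, high-fidelity audio compression, and multimodal representation learning. Each entry links to the paper and, where available, to code and a demo page, so readers can dig deeper and experiment. For example, SoundStream (2021) proposes an end-to-end neural audio codec, while AudioLM (2022) generates audio with a language modeling approach. Other projects such as InstructTTS, AudioDec, and HiFi-Codec contribute, respectively, expressive TTS, an open-source high-fidelity audio codec, and high-fidelity audio compression.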
  • [2021/07] SoundStream: An End-to-End Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2022/09] AudioLM: a Language Modeling Approach to Audio Generation [paper][demo]
  • [2023/01] InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt [paper][code][demo] :heavy_check_mark:
  • [2023/05] AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2023/05] HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec [paper][code] AcademiCodec & Group-RVQ :heavy_check_mark:
  • [2023/09] SpatialCodec: Neural Spatial Speech Coding [paper][code][demo] :heavy_check_mark:
  • [2023/09] High-Fidelity Audio Compression with Improved RVQGAN [paper][code][demo] DAC :heavy_check_mark:
  • [2023/09] SoundStorm: Efficient Parallel Audio Generation [paper][demo]
  • [2023/09] High Fidelity Neural Audio Compression [paper][code][code-Unofficial] [demo] Encodec :heavy_check_mark:
  • [2023/09] FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [paper][code][demo] :heavy_check_mark:
  • [2023/09] Fewer-token Neural Speech Codec with Time-invariant Codes [paper][code][demo] Ti-Codec :heavy_check_mark:
  • [2023/09] BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech [paper][code][demo] :heavy_check_mark:
  • [2023/10] Acoustic BPE for Speech Generation with Discrete Tokens [paper][code] :heavy_check_mark:
  • [2024/01] SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models [paper][code][demo] :heavy_check_mark:
  • [2024/01] Residual Quantization with Implicit Neural Codebooks [paper][code] Qinco :heavy_check_mark:
  • [2024/04] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [paper][code][demo] :heavy_check_mark:
  • [2024/05] HILCodec: High Fidelity and Lightweight Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2024/06] Coding Speech through Vocal Tract Kinematics [paper][code] :heavy_check_mark:
  • [2024/06] Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder [paper]
  • [2023/06] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding [paper][code][demo] acoustic model CTX-txt2vec and vocoder CTX-vec2wav | speech continuation and editing | similar to Encoder-Decoder :heavy_check_mark:
  • [2024/04] The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [paper]
  • [2024/06] BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation [paper][demo]
  • [2023/09] Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [paper]
  • [2024/06] Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/01] Finite Scalar Quantization: VQ-VAE Made Simple [paper][code] FSQ, no codebook collapse :heavy_check_mark:
  • [2024/06] UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner [paper][code] LLM-Codec :heavy_check_mark:
  • [2024/04] SNAC: Multi-Scale Neural Audio Codec [paper][code][demo] :heavy_check_mark:
  • [2023/06] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper][code][demo] :heavy_check_mark:
  • [2024/06] Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [paper][demo]
  • [2024/02] APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding [paper][code][demo] :heavy_check_mark:
  • [2024/07] dMel: Speech Tokenization made Simple [paper] Code Coming Soon
  • [2024/07] SuperCodec: A Neural Speech Codec with Selective Back-Projection Network [paper][code][demo] :heavy_check_mark:
  • [2024/04] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers [paper][code] :heavy_check_mark:
  • [2024/02] Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models [paper][code][demo] :heavy_check_mark:
  • [2024/06] SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models [paper][code][demo] SQ-Codec | Code Coming Soon
  • [2024/08] SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [paper][demo]
  • [2024/08] Music2Latent: Consistency Autoencoders for Latent Audio Compression [paper][code][demo] continuous latent space :heavy_check_mark:
  • [2024/08] WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [paper][code][demo] :heavy_check_mark:
  • [2024/08] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [paper][code][demo] X-Codec :heavy_check_mark:
  • [2024/09] SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis [paper][code][demo] :heavy_check_mark:
  • [2024/09] Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation [paper][demo] CoFi-Speech
  • [2024/09] NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization [paper][code] Code Coming Soon
  • [2024/09] Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis [paper][code][demo] Watermarking :heavy_check_mark:
  • [2024/09] MuCodec: Ultra Low-Bitrate Music Codec [paper][code][demo] Music Codec :heavy_check_mark:
  • [2024/09] ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech [paper][code] Comprehensive Platform :heavy_check_mark:
  • [2024/09] FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [paper] Flow Matching
  • [2024/09] Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice [code] S3Tokenizer :heavy_check_mark:
  • [2024/10] Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models [paper][demo] Inconsistency
  • [2024/09] BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec [paper][code][demo] low-bitrate neural speech codec :heavy_check_mark:
  • [2024/10] Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer [paper][code][demo] fine-tuned version of DAC :heavy_check_mark:
  • [2020/06] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [paper][code] :heavy_check_mark:
  • [2021/06] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [paper][code] semantic information & content generation :heavy_check_mark:
  • [2021/08] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [paper]
  • [2021/10] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [paper][code] semantic information & content generation :heavy_check_mark:
  • [2024/10] Code Drift: Towards Idempotent Neural Audio Codecs [paper][demo] Idempotence – the stability of a codec’s decoded output under multiple rounds of encoding and decoding
  • [2024/10] ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs [paper][demo] addresses codebook collapse through intra- and inter-codebook optimization
  • [2024/10] DM-Codec: Distilling Multimodal Representations for Speech Tokenization [paper][code] acoustic properties, semantic meaning, and contextual clues :heavy_check_mark:
  • [2024/10] LSCodec: Low-Bandwidth and Speaker-Decoupled Discrete Speech Codec [paper][demo] speaker timbre decoupling
  • [2024/10] Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding [paper][demo] MsCodec, Multi-Scale Encoding
  • [2024/10] APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm [paper][demo] two-stage joint-individual training paradigm
  • [2024/10] A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [paper][demo] Is predicting the remaining RVQ codes necessary?
  • [2024/11] DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [paper] Double-Codebook Speaker-invariant Clustering
  • [2024/10] Pushing the frontiers of audio generation [blog] Google DeepMind
  • [2024/11] MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios [paper][demo] modified discrete cosine transform (MDCT) as input
  • [2024/11] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer [paper][code] codebook collapse :heavy_check_mark:
  • [2024/11] hertz-dev [code] WaveCodec :heavy_check_mark:
  • [2024/11] Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [paper] UniCodec | several information-disentangled discrete tokens, similar to ns3_codec
  • [2024/11] Towards Codec-LM Co-design for Neural Codec Language Models [paper] Code Coming Soon | proposes several codec-LM co-design strategies
  • [2024/11] VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication [paper][demo] integrates the Voice Changer model directly into the speech Codec
  • [2024/11] Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation [paper][code][demo] aliasing-free :heavy_check_mark:
  • [2024/11] PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain [paper][demo] Code Coming Soon | Music Tokenizer, similar to MsCodec
  • [2024/11] Scaling Transformer for Low-bitrate High-Quality Speech Coding [paper][code][demo] Code Coming Soon | transformer-based, scaled to the 1B-parameter range
  • [2024/11] TS3-Codec: Transformer-Based Simple Streaming Single Codec [paper] convolution-free
  • [2024/12] FreeCodec: A disentangled neural speech codec with fewer tokens [paper][code][demo] Code Coming Soon | speaker encoder, content encoder and prosody encoder
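
Several entries above build on residual vector quantization (RVQ), for example the Group-RVQ of HiFi-Codec, the RVQGAN-based DAC, and ERVQ. As a rough illustration only, and not the implementation of any listed paper, the sketch below shows the basic idea: each stage quantizes whatever the previous stages left unexplained, and the decoder sums the selected codewords. The function names and the hand-picked `TOY_CODEBOOKS` are hypothetical; real codecs learn their codebooks jointly with an encoder and a decoder.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the residual of the previous one."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out


# Hypothetical 2-D codebooks: a coarse first stage and a finer second stage.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

codes = rvq_encode([1.2, 0.9], TOY_CODEBOOKS)
recon = rvq_decode(codes, TOY_CODEBOOKS)
print(codes, recon)  # [1, 1] [1.25, 1.0]
```

In this toy setup, re-encoding `recon` yields the same codes again, which is the kind of stability under repeated encode/decode rounds that the Code Drift entry calls idempotence. Learned neural codecs do not guarantee this property in general.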

Note: the papers listed above are part of the GitHub repository Neural-Codec-and-Speech-Language-Models; stars are welcome.
