Pretrained Dasheng on 🤗 Hugging Face

Dasheng (Deep Audio-Signal Holistic Embeddings), or “大声” ("great sound"), is a general-purpose audio encoder trained with large-scale self-supervised learning. It is designed to capture rich audio information across domains, including speech, music, and environmental sounds. Scaled to 1.2 billion parameters and trained on 272,356 hours of diverse audio data, Dasheng shows significant performance gains on the HEAR benchmark. It outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and is competitive on music and environmental sound classification tasks.

Model Cards: https://huggingface.co/models?search=mispeech%2Fdasheng
Original Repository: https://github.com/XiaoMi/dasheng

Usage

Install

pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_dasheng.git

Inference

>>> model_name = "mispeech/dasheng-base"  # or "mispeech/dasheng-0.6B", "mispeech/dasheng-1.2B"

>>> from dasheng_model.feature_extraction_dasheng import DashengFeatureExtractor
>>> from dasheng_model.modeling_dasheng import DashengModel

>>> feature_extractor = DashengFeatureExtractor.from_pretrained(model_name)
>>> model = DashengModel.from_pretrained(model_name, outputdim=None)  # no linear output layer if `outputdim` is `None`

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> audio.shape
torch.Size([1, 16000])   # mono audio of 1 second

>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
>>> inputs.input_values.shape
torch.Size([1, 64, 101])   # 64 mel-filterbanks, 101 frames

>>> import torch
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> outputs.hidden_states.shape
torch.Size([1, 25, 768])   # 25 T-F patches (patch size 64x4, no overlap), before mean-pooling

>>> outputs.logits.shape
torch.Size([1, 768])   # mean-pooled embedding (would be logits from a linear layer if `outputdim` was set)
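
The shapes in the session above can be reproduced with simple arithmetic. The following is a minimal sketch, not the library's own code. The helper `expected_shapes` is hypothetical. The hop length of 160 samples (10 ms at 16 kHz) and the floor division that drops leftover frames are assumptions chosen because they match the 101 frames and 25 patches shown for a 1-second clip.

```python
from typing import Tuple


def expected_shapes(num_samples: int, hop_length: int = 160, patch_frames: int = 4) -> Tuple[int, int]:
    """Return (num_frames, num_patches) for a mono clip of `num_samples` samples.

    `hop_length=160` is an assumption inferred from the shapes in the example;
    `patch_frames=4` comes from the stated 64x4 patch size.
    """
    # A centered STFT yields one frame per hop, plus one.
    num_frames = num_samples // hop_length + 1
    # Each patch spans all 64 mel bins and `patch_frames` frames, with no overlap.
    # Assumption: frames that do not fill a whole patch are dropped.
    num_patches = num_frames // patch_frames
    return num_frames, num_patches


print(expected_shapes(16000))  # (101, 25)
```

Under the same assumptions, a 2-second clip (32000 samples) would give 201 frames and 50 patches.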

Fine-tuning

example_finetune_esc50.ipynb demonstrates how to train a linear head on the ESC-50 dataset with the Dasheng encoder frozen.
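
For orientation, here is a minimal sketch of that linear-probing setup in plain PyTorch. It is not taken from the notebook, which may differ in its details. `FrozenEncoderStub` is a hypothetical stand-in so the snippet runs without downloading weights, and `build_linear_probe` and `train_step` are illustrative helper names. With the real model, the embeddings would presumably come from `model(**inputs).logits` with `outputdim=None`, as in the inference example.

```python
import torch
from torch import nn


class FrozenEncoderStub(nn.Module):
    """Hypothetical stand-in for the Dasheng encoder: (batch, 64, frames) -> (batch, 768)."""

    def __init__(self, n_mels: int = 64, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(n_mels, embed_dim)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # Project every frame, then mean-pool over the time axis.
        return self.proj(input_values.transpose(1, 2)).mean(dim=1)


def build_linear_probe(encoder: nn.Module, embed_dim: int = 768, num_classes: int = 50):
    """Freeze the encoder and create a trainable linear head (ESC-50 has 50 classes)."""
    for param in encoder.parameters():
        param.requires_grad = False
    encoder.eval()
    head = nn.Linear(embed_dim, num_classes)
    # Only the head's parameters are handed to the optimizer.
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    return head, optimizer


def train_step(encoder, head, optimizer, input_values, labels) -> float:
    """Run one optimization step on the head and return the loss value."""
    with torch.no_grad():  # no gradients flow into the frozen encoder
        embeddings = encoder(input_values)
    loss = nn.functional.cross_entropy(head(embeddings), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
encoder = FrozenEncoderStub()
head, optimizer = build_linear_probe(encoder)
features = torch.randn(8, 64, 101)    # random stand-in for a batch of mel features
labels = torch.randint(0, 50, (8,))   # random stand-in for ESC-50 class labels
loss = train_step(encoder, head, optimizer, features, labels)
```

Because the encoder stays frozen, its embeddings could also be computed once and cached, so that later epochs only train the head.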

Citation

If you find Dasheng useful in your research, please consider citing the following paper:

@inproceedings{dinkel2023scaling,
  title={Scaling up masked audio encoder learning for general audio classification},
  author={Dinkel, Heinrich and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun and Wang, Bin},
  booktitle={Interspeech 2024},
  year={2024}
}
