
Frequently Asked Questions (FAQs)

  • Can I run musicnn_keras on a CPU? Yes, the models are already trained, so running them only requires a forward pass, which works fine on a CPU (see the sketch below for pinning TensorFlow to the CPU).
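If you want to be sure TensorFlow stays on the CPU (for instance, when a shared GPU is busy), one common approach is to hide the CUDA devices before TensorFlow is imported. This is a generic TensorFlow idiom, and the import below assumes musicnn_keras mirrors the original musicnn tagger API; check the repo's README for the exact entry point:

```python
import os

# Hide all CUDA devices so TensorFlow falls back to the CPU.
# This must run before TensorFlow (or musicnn_keras) is imported.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

# Assumed to mirror musicnn's tagger API; check the repo's README.
from musicnn_keras.tagger import top_tags

top_tags('./audio/TRWJAZW128F42760DD_test.mp3', model='MTT_musicnn')
```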

  • A functionality I need is missing. How can I get it? musicnn_keras is fairly simple. Feel free to extend it as you wish! And let us know if you think the new functionality could be useful for the rest of us.

  • Why does musicnn_keras contain vgg models? Because they are a nice baseline, and because people like to apply computer vision models to spectrograms. Hence, in this repository you can find both musicnn-based models (musically motivated convolutional neural networks) and vggs (a computer vision architecture applied to audio).

  • What architecture do the musicnn-based models employ? They use a musically motivated CNN front-end, some dense layers in the mid-end, and a temporal-pooling back-end. In this jupyter notebook we provide further details about the model; you can also inspect the layer stack yourself, as sketched below.
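To see the front-end / mid-end / back-end stack for yourself, you can load a checkpoint as a regular Keras model and print its layers. This is a minimal sketch; the checkpoint path below is an assumption, so use whichever path the repository's README points you to:

```python
import tensorflow as tf

# Hypothetical checkpoint path -- adjust to wherever the repo stores its Keras models.
model = tf.keras.models.load_model('./keras_checkpoints/MSD_musicnn/')

# Prints every layer (front-end convolutions, mid-end dense layers,
# temporal-pooling back-end) with its output shape and parameter count.
model.summary()
```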

  • Which musicnn_keras layer output is best for transfer learning? Although we haven't run exhaustive tests, throughout our visualisations and preliminary experiments we found the taggram and the max_pool layer to be the best for this purpose: the taggram because it already provides high-level music information, and the max_pool layer because it provides a relatively sparse acoustic representation of the music audio. See the sketch below for extracting both.
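A minimal sketch of grabbing both representations, assuming musicnn_keras mirrors the original musicnn extractor API (an extractor function with an extract_features flag); check the repo's README for the exact entry point:

```python
# Assumed to mirror musicnn's extractor API.
from musicnn_keras.extractor import extractor

# 'taggram' is the time x tags model output; 'features' is a dict of
# intermediate layer outputs, including 'max_pool'.
taggram, tags, features = extractor('./audio/TRWJAZW128F42760DD_test.mp3',
                                    model='MTT_musicnn',
                                    extract_features=True)

embeddings = features['max_pool']  # relatively sparse acoustic representation
# ...feed 'taggram' or 'embeddings' to your downstream classifier.
```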

  • Which 50 tags do the MTT models predict? These are determined by the MagnaTagATune dataset, which is used for training the MTT models: guitar, classical, slow, techno, strings, drums, electronic, rock, fast, piano, ambient, beat, violin, vocal, synth, female, indian, opera, male, singing, vocals, no vocals, harpsichord, loud, quiet, flute, woman, male vocal, no vocal, pop, soft, sitar, solo, man, classic, choir, voice, new age, dance, male voice, female vocal, beats, harp, cello, no voice, weird, country, metal, female voice, choral.

  • Which 50 tags do the MSD models predict? These are determined by the Million Song Dataset, which is used for training the MSD models: rock, pop, alternative, indie, electronic, female vocalists, dance, 00s, alternative rock, jazz, beautiful, metal, chillout, male vocalists, classic rock, soul, indie rock, Mellow, electronica, 80s, folk, 90s, chill, instrumental, punk, oldies, blues, hard rock, ambient, acoustic, experimental, female vocalist, guitar, Hip-Hop, 70s, party, country, easy listening, sexy, catchy, funk, electro, heavy metal, Progressive rock, 60s, rnb, indie pop, sad, House, happy.

  • What are the typical cases where the model fails? When the input audio contains content that falls outside the 50-tag vocabulary. Although the predictions in these cases are still consistent and reasonable, the model cannot predict bass if that tag is not part of its vocabulary.

  • Why do the MTT models predict both no vocals and no vocal? Because the vocabulary of the model is determined by the MagnaTagATune dataset, and we used it as is.

  • My model is slow, even with a GPU. Can I do something? Yes! In ./musicnn_keras/configuration.py you can set a bigger batch size. The default is BATCH_SIZE = 1, which can be slow but is computationally safe (see the sketch below).
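For example, the change is a single line in ./musicnn_keras/configuration.py; the value below is an illustration, and how high you can go safely depends on your GPU's memory:

```python
# ./musicnn_keras/configuration.py
# Default is BATCH_SIZE = 1 (slow but memory-safe); raise it until you
# approach your GPU's memory limit.
BATCH_SIZE = 16
```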

  • What are these songs you include in the repository?
    ./audio/joram-moments_of_clarity-08-solipsism-59-88.mp3 is an electronic music song from the test set of the MagnaTagATune dataset.
    ./audio/TRWJAZW128F42760DD_test.mp3 is an excerpt of Screamin' And Cryin' (Live In Warsaw 1976), an instrumental Muddy Waters song from the test set of the Million Song Dataset.

  • Which audio formats does the musicnn_keras library support? We rely on librosa to read audio files. librosa uses soundfile and audioread for reading audio. As of v0.7, librosa uses soundfile by default, and falls back on audioread only when dealing with codecs unsupported by soundfile (notably MP3, and some variants of WAV). For a list of codecs supported by soundfile, see the libsndfile documentation. You can check whether a file is readable as sketched below.
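In practice this means most common formats work out of the box. A quick way to check whether librosa can read your file (at the 16 kHz rate the models expect) is to load it yourself:

```python
import librosa

# librosa resamples to the requested rate; MP3 decoding falls back on audioread.
audio, sr = librosa.load('./audio/joram-moments_of_clarity-08-solipsism-59-88.mp3',
                         sr=16000)
print(audio.shape, sr)  # (n_samples,) 16000
```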

  • Which sampling rate, window and hop size were used to compute the log-mel spectrograms? We compute the STFT of a downsampled signal at 16 kHz, with a Hanning window of length 512 and 50% overlap (a hop of 256 samples). We use 96 mel bands, and we apply a logarithmic compression to the result (np.log10(10000·x + 1)). A sketch of this front-end is shown below.
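Here is a minimal librosa sketch of that front-end, matching the parameters above; details such as the spectrogram power are assumptions, so check the repository's preprocessing code for the definitive version:

```python
import numpy as np
import librosa

# Downsample to 16 kHz.
audio, sr = librosa.load('song.mp3', sr=16000)

# STFT with a 512-sample Hann window and 50% overlap (hop of 256 samples),
# projected onto 96 mel bands. The Hann window is librosa's default.
mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                     n_fft=512, hop_length=256,
                                     n_mels=96)

# Logarithmic compression, as described above.
log_mel = np.log10(10000 * mel + 1)
```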

  • I love this library! How can I get in touch? Find me on twitter @elio.elioo and drop me a line! You may also contact Jordi Pons who built the original musicnn.

Are you using musicnn_keras?

If you are using it for academic works, please add a footnote linking to the musicnn_keras repository and cite the musicnn papers:

@inproceedings{pons2018atscale,
  title={End-to-end learning for music audio tagging at scale},
  author={Pons, Jordi and Nieto, Oriol and Prockup, Matthew and Schmidt, Erik M. and Ehmann, Andreas F. and Serra, Xavier},
  booktitle={19th International Society for Music Information Retrieval Conference (ISMIR2018)},
  year={2018},
}

@inproceedings{pons2019musicnn,
  title={musicnn: pre-trained convolutional neural networks for music audio tagging},
  author={Pons, Jordi and Serra, Xavier},
  booktitle={Late-breaking/demo session in 20th International Society for Music Information Retrieval Conference (LBD-ISMIR2019)},
  year={2019},
}

If you use it for other purposes, let us know too!