Saturday, February 4, 2017

SoundNet: Learning Sound Representations from Unlabeled Video


We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two million unlabeled videos. We propose a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual models (e.g., ImageNet and Places CNNs) into the sound modality, using unlabeled video as a bridge.
[Figure: visualization of learned conv1 filters.]
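
As a rough illustration of the student-teacher objective described above, the sketch below trains a small 1-D convolutional student on the raw waveform, pulling its output distribution toward the class posterior that a frozen, pretrained vision network produces on a frame of the same video. The layer sizes, the 1,000-way label space, and the visionPosterior function (standing in for an ImageNet- or Places-style CNN followed by a softmax) are illustrative assumptions, not the paper's exact architecture.

    require 'nn'

    -- Student: 1-D convolutions over the raw waveform.
    -- Input: nSamples x 1 tensor; output: log-probabilities over 1000 classes.
    local student = nn.Sequential()
    student:add(nn.TemporalConvolution(1, 16, 64, 8))   -- 16 filters, width 64, stride 8
    student:add(nn.ReLU())
    student:add(nn.TemporalMaxPooling(8, 8))
    student:add(nn.TemporalConvolution(16, 32, 32, 4))
    student:add(nn.ReLU())
    student:add(nn.Mean(1))                  -- average over time
    student:add(nn.Linear(32, 1000))
    student:add(nn.LogSoftMax())             -- DistKLDivCriterion expects log-probs

    -- KL divergence between student log-probs and teacher probs.
    local criterion = nn.DistKLDivCriterion()

    -- One training step on a (waveform, video frame) pair. Only the
    -- student receives gradients; the vision teacher stays frozen.
    local function step(waveform, videoFrame, lr)
      local target = visionPosterior(videoFrame)  -- hypothetical: frozen CNN + SoftMax, 1000-dim probs
      local logProbs = student:forward(waveform)
      local loss = criterion:forward(logProbs, target)
      student:zeroGradParameters()
      student:backward(waveform, criterion:backward(logProbs, target))
      student:updateParameters(lr)
      return loss
    end

Because the teacher is fixed, only the sound network is updated; the unlabeled video supplies the pairing between each waveform and the frame the teacher sees.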

Requirements

  • torch7
  • torch7 audio (and sox)
  • torch7 hdf5 (only for feature extraction; see the sketch after this list)
  • probably a GPU
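
For the feature-extraction path that hdf5 is listed for, something along the following lines would load a clip, run it through a serialized model, and dump an intermediate activation to disk. The file names, the layer index, and the exact return values of audio.load are placeholders to check against the respective packages' documentation.

    require 'audio'
    require 'hdf5'
    require 'nn'

    -- Load a serialized model and a sound file (both paths are placeholders).
    local net = torch.load('soundnet.t7')
    local sound = audio.load('example.mp3')  -- waveform tensor; check shape/returns in the audio docs

    -- Forward pass, then read an intermediate layer's activations.
    net:evaluate()
    net:forward(sound)
    local feat = net:get(5).output           -- layer index chosen for illustration

    -- Write the features with torch-hdf5.
    local f = hdf5.open('features.h5', 'w')
    f:write('/feat', feat:float())
    f:close()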
