Saturday, February 4, 2017

Generating predictive videos using deep-learning

http://robohub.org/generating-predictive-videos-using-deep-learning/

SoundNet: Learning Sound Representations from Unlabeled Video

SoundNet

We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation from two million unlabeled videos. We propose a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual models (e.g., ImageNet and PlacesCNN) into the sound modality, using unlabeled video as a bridge.
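The student-teacher idea above can be sketched as follows: a pretrained vision network (the teacher) produces a class posterior for a video frame, and the sound network (the student) is trained so that its posterior over the synchronized audio clip matches it, typically by minimizing a KL divergence. This is a minimal NumPy sketch under that assumption, not the authors' Torch code; the logits here are random stand-ins for real network outputs.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how far the student's posterior q is from
    # the teacher's posterior p. Minimized w.r.t. the student.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)

# Hypothetical logits: teacher = vision CNN on a video frame,
# student = sound CNN on the synchronized audio clip.
teacher_logits = rng.normal(size=1000)  # e.g. 1000 ImageNet classes
student_logits = rng.normal(size=1000)

p_teacher = softmax(teacher_logits)
q_student = softmax(student_logits)

# Distillation loss the student would be trained to minimize.
loss = kl_divergence(p_teacher, q_student)
```

In the actual system the teacher is frozen and only the student's parameters receive gradients, so the unlabeled video acts purely as a bridge between the two modalities.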
Visualization of learned conv1 filters: 

Requirements

  • torch7
  • torch7 audio (and sox)
  • torch7 hdf5 (only for feature extraction)
  • probably a GPU