TensorFlow implementation of "SoundNet" that learns rich natural sound representations.
Code for paper "SoundNet: Learning Sound Representations from Unlabeled Video" by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016
- Linux
- NVIDIA GPU + CUDA 8.0 + CuDNNv5.1
- Python 2.7 with numpy
- TensorFlow 0.12.1
- librosa
- Clone this repo:
git clone git@github.com:eborboihuc/SoundNet-tensorflow.git
cd SoundNet-tensorflow
- Pretrained Model
I provide pre-trained models ported from soundnet. You can download the 8 layer model here. The model should be placed at ./models/sound8.npy in your folder.
- NOTE
If an audio file has a nonzero start offset (the start value reported by FFmpeg), torch audio and librosa may load it very differently. In that case, convert it with the following command:
sox {input.mp3} {output.mp3} trim 0
After conversion, the two loaders should agree much more closely.
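For more than a handful of files, the sox command above can be wrapped in a small script. The following is a minimal sketch, not part of this repo: the function names `build_sox_trim_command` and `strip_start_offset` are hypothetical, and it assumes `sox` is installed and on your PATH.

```python
import subprocess


def build_sox_trim_command(src, dst):
    """Return the argument list for `sox {src} {dst} trim 0`."""
    return ["sox", src, dst, "trim", "0"]


def strip_start_offset(src, dst):
    """Rewrite src to dst with sox, as in the command above.

    Raises subprocess.CalledProcessError if sox exits with an error.
    """
    subprocess.check_call(build_sox_trim_command(src, dst))


# Example call (file names are placeholders):
# strip_start_offset("input.mp3", "output.mp3")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.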
To extract features from multiple layers of a pretrained model, using a sound track loaded with torch lua audio (./data/demo.npy is equivalent to the torch version):
python extract_feat.py -m {start layer number} -x {end layer number} -s
Or extract features from the raw waveform listed in demo.txt (the demo audio is at ./data/demo.mp3):
python extract_feat.py -m {start layer number} -x {end layer number} -s -t demo.txt
To extract features from multiple layers of a pretrained model on a downloaded mp3 dataset:
python extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract
e.g. extract layer 4 to layer 17 and save as ./sound_out/tf_fea%02d.npy:
python extract_feat.py -o sound_out -m 4 -x 17 -s -p extract
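Once features have been saved with the tf_fea%02d.npy naming shown above, they can be read back with numpy. This is a minimal sketch rather than code from this repo: the helper name `load_saved_features` is hypothetical, and it assumes each file holds a plain numeric array (the array shapes are not specified here).

```python
import glob
import os
import re

import numpy as np


def load_saved_features(out_dir):
    """Return {layer index: array} for every tf_feaNN.npy file in out_dir."""
    name_pattern = re.compile(r"tf_fea(\d+)\.npy$")
    features = {}
    for path in sorted(glob.glob(os.path.join(out_dir, "tf_fea*.npy"))):
        match = name_pattern.search(os.path.basename(path))
        if match:
            # The two-digit suffix in the file name is the layer number.
            features[int(match.group(1))] = np.load(path)
    return features


# Example call (directory name taken from the command above):
# feats = load_saved_features("sound_out")
# for layer in sorted(feats):
#     print(layer, feats[layer].shape)
```

Globbing for whatever files exist avoids hard-coding the layer range, so the same helper works for any -m and -x values.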
More details are in:
python extract_feat.py -h
To train from an existing model:
python main.py
To train from scratch:
python main.py -p train
To extract features:
python main.py -p extract -m {start layer number} -x {end layer number} -s
More details are in:
python main.py -h
- Change audio loader to soundnet format
- Fix conv8 padding issue in training phase
- Change all config into tf.app.flags
- Change dummy distribution of scene and object to useful placeholder
- Add sound and feature loader from Data section
- Loaded audio length is not consistent between torch7 audio and librosa. Here is the issue
- Training with short audio makes conv8 complain that the output size would be negative
- Why is my loaded sound wave different between torch7 audio and librosa? Here is my Wiki
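To check whether two loaders disagree on a given file, it can help to compare the loaded arrays directly. The sketch below is an illustration only, not this repo's loader: `compare_waveforms` is a hypothetical helper, and it assumes both waveforms are already available as arrays (for example one exported from torch as .npy and one loaded with librosa), which it flattens to 1-D before comparing.

```python
import numpy as np


def compare_waveforms(a, b):
    """Report length and sample differences between two waveforms."""
    a = np.ravel(np.asarray(a, dtype=np.float64))
    b = np.ravel(np.asarray(b, dtype=np.float64))
    overlap = min(a.size, b.size)
    if overlap:
        # Compare only the samples that both waveforms have.
        max_abs_diff = float(np.max(np.abs(a[:overlap] - b[:overlap])))
    else:
        max_abs_diff = 0.0
    return {
        "length_a": int(a.size),
        "length_b": int(b.size),
        "length_diff": int(a.size - b.size),
        "max_abs_diff": max_abs_diff,
    }


# Example call (the second array is a placeholder for a librosa-loaded track):
# report = compare_waveforms(np.load("./data/demo.npy"), librosa_wave)
```

A nonzero length_diff points at the length inconsistency described above, while a large max_abs_diff with equal lengths suggests an offset or scaling difference between the loaders.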
Code is ported from soundnet, and the Torch7-TensorFlow loader is from tf_videogan. Thanks for their excellent work!
Hou-Ning Hu / @eborboihuc