ML model to parse audio input and discern music from background ambience

Follow the full discussion on Reddit.
Hello all :) I'm working on an iOS audio app that turns microphone input into reactive visualizations synced to the music. I have an AVFoundation pipeline set up to record audio, run real-time FFT analysis that produces frames of features such as spectral centroid and chroma vectors, and map those features to parameters driving the visuals. I'd now like to train a model that distinguishes music from ambient noise in the mic data, so I can filter out segments that are just irrelevant background talking or other sounds. The app is built in Swift and needs to run prediction very efficiently, even on older iPhones (ideally lol, not the biggest concern). My questions:

- What type of neural network would you recommend for classifying short sequences of per-frame audio features? I'm considering LSTM or Transformer architectures.
- Which feature sets give the most discriminative signal for music vs. noise?
- Should I measure accuracy with frame-level predictions or aggregate metrics like overall sequence accuracy?
- Any advice on optimizing models for real-time mobile inference?

I have experience gathering and labeling timestamped training data: pairs of raw audio and their feature sets. Let me know what other specifics around data, model configuration, metrics, or optimization would be useful to share. Thanks in advance for any guidance!
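
For concreteness, here is a minimal Swift sketch of how the on-device side could hang together: buffer the per-frame features, feed a short window to a Core ML classifier, and smooth the frame-level scores before gating the visuals. Everything model-specific is an assumption: the class name MusicDetector, the input name "features", the output name "musicProbability", the [1, window, features] input shape, and the window/feature sizes are placeholders for whatever a converted model would actually expect.

```swift
import CoreML

/// Hypothetical wrapper around a Core ML classifier that scores a short window of
/// per-frame features (e.g. spectral centroid + 12 chroma bins) as music vs. noise.
/// Class name, input/output names, shapes, and sizes are all assumptions.
final class MusicDetector {
    private let model: MLModel
    private let windowSize = 32        // frames per prediction window (assumed)
    private let featureCount = 13      // e.g. 1 centroid + 12 chroma bins (assumed)
    private var frameBuffer: [[Float]] = []
    private var smoothedScore: Float = 0

    /// `modelURL` points at the compiled .mlmodelc bundle.
    init(modelURL: URL) throws {
        let config = MLModelConfiguration()
        config.computeUnits = .all     // let Core ML pick CPU/GPU/ANE per device
        model = try MLModel(contentsOf: modelURL, configuration: config)
    }

    /// Push one frame of features; returns the smoothed "music" probability once
    /// a full window has been buffered, otherwise nil.
    func push(frame: [Float]) throws -> Float? {
        precondition(frame.count == featureCount)
        frameBuffer.append(frame)
        guard frameBuffer.count >= windowSize else { return nil }
        if frameBuffer.count > windowSize { frameBuffer.removeFirst() }

        // Pack the window into the shape the model is assumed to expect: [1, window, features].
        let input = try MLMultiArray(
            shape: [1, NSNumber(value: windowSize), NSNumber(value: featureCount)],
            dataType: .float32)
        for (t, frameFeatures) in frameBuffer.enumerated() {
            for (i, value) in frameFeatures.enumerated() {
                input[[0, NSNumber(value: t), NSNumber(value: i)]] = NSNumber(value: value)
            }
        }

        // "features" / "musicProbability" are placeholder input/output names.
        let provider = try MLDictionaryFeatureProvider(
            dictionary: ["features": MLFeatureValue(multiArray: input)])
        let output = try model.prediction(from: provider)
        let p = output.featureValue(for: "musicProbability")?
            .multiArrayValue?[0].floatValue ?? 0

        // Exponential smoothing turns noisy frame-level scores into a stable gate for the visuals.
        smoothedScore = 0.8 * smoothedScore + 0.2 * p
        return smoothedScore
    }
}
```

The smoothing step is also one practical angle on the frame-level vs. aggregate metric question: train and evaluate per frame, but additionally report a windowed or segment-level metric, since the smoothed segment decision is what the visuals ultimately react to.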
