Classifying artist/album combinations by genre


This is a writeup of a simple ML project I did this week. I'll walk through how I did it and the challenges I faced. Don't expect this to be wildly educational - it's written more for my benefit than for anyone else's.

After publication, I'm hoping to deploy this somewhere where you can put in various inputs and get a predicted genre (which, spoiler alert, probably won't be accurate).


My goal was to build a simple tool that guesses an album's genre from its name and artist. I've used this as a way to explore various approaches to deep learning on my own. I use transfer learning - taking fastai's default pretrained language model and fine-tuning it for classification. I've played around with various ways of framing the problem: more or fewer genres, more or fewer samples, more or fewer epochs, and so on.

I knew that the accuracy was likely to stay pretty low since humans wouldn't be that accurate either. But I also guessed that there would be broad patterns that the model could pick out - maybe the words "Soundtrack" or "DJ" or "Greatest Hits" or "Quartet" might push the model in one direction or another.

I used the MusicBrainz database of recording information for this. MusicBrainz is a big database of music metadata. I became familiar with it in high school, when I used Linux music players that grabbed album information from MusicBrainz. Today, it's still going strong, with a massive collection of albums, songs, artists, and cover art.

Also, its whole database is only 14 gigabytes uncompressed, so it's plausible that we could just keep most of it in memory while we're training. The documentation of the database is a little unclear, but with some guessing and debugging I was able to find the fields I needed and load them into pandas.
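To make the wrangling step concrete, here's a rough sketch of pulling the dump into pandas. The MusicBrainz dump ships as tab-separated tables without headers, so the column names and positions below are assumptions for illustration, not the dump's actual schema:

```python
import pandas as pd

def load_table(path, columns):
    """Load one headerless, tab-separated dump table, keeping only the
    leading columns we have names for. '\\N' marks NULLs in the dump."""
    df = pd.read_csv(path, sep="\t", header=None, na_values="\\N")
    df = df.iloc[:, : len(columns)]
    df.columns = columns
    return df

def build_dataset(release_groups, artist_credits):
    """Join release groups to artist credits so each row carries both
    the album name and the artist name."""
    merged = release_groups.merge(
        artist_credits,
        on="artist_credit",
        suffixes=("_release_group", "_artist"),
    )
    return merged[["name_release_group", "name_artist"]]
```

The merge suffixes are what produce the `name_release_group` and `name_artist` fields used to build the description strings.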

Attempt one: Rock vs electronic

I built a simple script to import the data and build a learner based on it. I used fastai, which made transfer learning super simple.

Here's how I structured the problem: The input will be a brief "description" string, and the output will be a set of genres that the learner predicts. I used a pretrained language model, and fastai automatically generated a layer between the pretrained body of the model and the new head of the model.
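Concretely, a multi-label target like this is usually encoded as an n-hot vector over a fixed genre vocabulary. fastai handles the encoding for you (via its MultiCategoryBlock); the function below is just a hand-rolled illustration of what that encoding looks like:

```python
def encode_genres(genres, vocab):
    """n-hot encode one album's genre list against a fixed vocabulary:
    1 where the genre applies, 0 where it doesn't."""
    applied = set(genres)
    return [1 if g in applied else 0 for g in vocab]

vocab = ["electronic", "pop", "rock"]
print(encode_genres(["rock", "pop"], vocab))  # [0, 1, 1]
```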

The description is generated like this:

def describe(r):
    return f"Album: {r['name_release_group']}\nArtist: {r['name_artist']}"
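Applied row-wise, that gives the text column the language model trains on. (The `describe` function is repeated here so the snippet runs standalone, and the albums are just made-up sample rows.)

```python
import pandas as pd

def describe(r):
    return f"Album: {r['name_release_group']}\nArtist: {r['name_artist']}"

df = pd.DataFrame({
    "name_release_group": ["Nevermind", "Homework"],
    "name_artist": ["Nirvana", "Daft Punk"],
})
# One description string per album/artist row.
df["text"] = df.apply(describe, axis=1)
print(df["text"].iloc[0])
# Album: Nevermind
# Artist: Nirvana
```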

In my first attempt, fastai helpfully told me that some genres appeared in the validation set but not in the training set. I realized that some genres are super unpopular (and therefore likely to be predicted very inaccurately) and decided to start with just two. The top two genres are "rock" and "electronic," which suits me fine: they're different enough that there should be some obvious patterns distinguishing them (vs, say, "punk rock" vs "rock" vs "pop rock").
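Restricting the dataset to the two most common genres is short in pandas. This sketch assumes the frame has a single `genre` column per row, which is a simplification of the real multi-label data:

```python
import pandas as pd

def top_genre_subset(df, n=2):
    """Keep only rows whose genre is among the n most common genres."""
    top = df["genre"].value_counts().nlargest(n).index
    return df[df["genre"].isin(top)].copy()
```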

fastai makes it very easy. For example, for each dataset I just used the learning rate finder:

Notice that my M1 MacBook Air doesn't support GPU acceleration for PyTorch yet :-(

And then trained it using fastai's fine_tune function:

This one is GPU-accelerated using Kaggle

After running it a few times, it consistently tops out at around 72-75% accuracy. It also appears that training for more than ~5 epochs doesn't improve accuracy much, and it starts overfitting.

Running with more categories

MusicBrainz has 828 genres, and we're only trying to predict two of them. Can we predict 10 of them?

It starts out with high accuracy because it's using one-hot encoding (or, in this case, n-hot encoding, I guess): it predicts a vector with one confidence value per genre. Since the correct value for most genres is "false," guessing "no" everywhere already gives pretty good element-wise accuracy. It also seems not to learn very much even as training loss declines, which implies that it's overfitting a bit.
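Here's a small illustration of why the metric starts out flattering. The numbers are made up, and fastai's `accuracy_multi` computes roughly this quantity (after thresholding sigmoid outputs):

```python
def nhot_accuracy(preds, targets):
    """Element-wise accuracy over the n-hot label matrix."""
    correct = sum(
        p == t for pr, tr in zip(preds, targets) for p, t in zip(pr, tr)
    )
    total = sum(len(tr) for tr in targets)
    return correct / total

# With 10 genres and only 1-2 true labels per album, predicting all
# zeros is already right for most entries (illustrative data).
targets = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
]
all_zeros = [[0] * 10 for _ in targets]
print(nhot_accuracy(all_zeros, targets))  # 0.8666666666666667 (26 of 30 entries)
```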

I took a look at some examples to see if there was obvious cheating and overfitting, and it looks like the results are reasonable:

Results exported as a CSV

Luckily, there's no obvious cheating. When we test the results on a few albums, they aren't perfect, but they seem plausible given the information in the album and artist names. For example, it accurately labels "Poptopia! Power Pop Classics of the '80s" as pop + rock. Some of the source data also seems a bit inconsistent - why is "Iggy Pop," in particular, considered pop, rock, and electronic?

What if we try the top 100 categories with 100k training samples? What kinds of results do we get then?

It learned a lot in the first epoch and nothing at all in later epochs. The accuracy metric is artificially high because guessing nothing is usually a safe bet, and, indeed, the model is very conservative:

In most cases, it guesses "rock" or nothing, although it seems to have incorrectly learned that electronic, euro house, house, and trance are the same thing. This is not very accurate. Oddly, it has also learned a genre called schlager - though it seems to just guess that anything German-sounding is schlager.

I'm guessing that the learning rate here is too low: the loss is already so low that the loss surface is probably very flat.

Running with a single category

Next I wondered if I could get higher accuracy if I predicted a single category.

I tried with 200 genres. When an album had multiple genres, I picked the least common one, so we'd have a good spread of genres across the dataset.
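The "pick the least common genre" step looks something like this (a sketch; the genre lists here are invented):

```python
from collections import Counter

def pick_rarest_genre(rows):
    """Given each album's genre list, label the album with its least
    common genre, measured by global frequency across the dataset."""
    counts = Counter(g for genres in rows for g in genres)
    return [min(genres, key=lambda g: counts[g]) for genres in rows]

rows = [["rock", "grunge"], ["rock"], ["rock", "grunge", "punk"]]
print(pick_rarest_genre(rows))  # ['grunge', 'rock', 'punk']
```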

While the accuracy peaked at around 18%, the results I saw were pretty reasonable guesses for the most part.

Is it overfitting?

One thing I wanted to learn was how much it was memorizing particular artists and albums vs how much it was learning features of their names. I tried to see the impact of having a well-known artist in a particular genre vs heavily implying the genre in the album name. The keys below are "Artist: Album name" and the value is the predicted genre:

It has strong opinions about Nirvana and Buddy Holly but is flexible on the Sex Pistols and very flexible on the Wu-Tang Clan. Maybe this is because of how I chose the genre labels - Wu-Tang Clan and the Sex Pistols may have more diverse genre labels - or maybe it just hasn't memorized those artist names yet.

I also ran it with just two genres to see how high we could get accuracy in that case.

The results weren't significantly better than the multi-category runs, which was disappointing.

What I learned

Getting the first result - training for 4 epochs on 100k samples - took about a day of work. Most of that was figuring out how to get the data from MusicBrainz's format into fastai's format. The rest took another day or two, most of it spent waiting for models to fail in new, unexpected ways.

I tried to use a standard toolkit for this process. While I still struggle to use pandas fluently, I'm fully convinced that Jupyter notebooks are useful for first passes. It's super useful to be able to stop your code and debug it while it's in progress. I do wish, though, that notebooks encouraged better code modularization so you could convert them to real Python scripts more easily. For example: make each cell a function that has to return an output for the next cell to use. Then you could just dump the notebook out as a bunch of functions in a Python file.

An annoying feature of hosted notebooks like Colab or Kaggle is that they don't appear to support importing from Python files - you have to copy & paste (or, at best, import from GitHub) in a cell at the top of the notebook to include the code you've written.

I was surprised to find that training was CPU-bound, not GPU-bound, on Kaggle. I tried profiling the loader and learned that Python's built-in profiling tools are not very good. A bit of Googling found that loading the data seems to be slow because of this issue. In short, converting the tokens into

Typical GPU utilization in Kaggle

I didn't end up messing around with hyperparameters because I wasn't sure what a theoretical maximum accuracy was, and because running individual experiments was time consuming. Is 75% as good as possible given this training data and this model size? If I'd had a standard dataset, or one that had obviously correct answers, I'd have a better idea of what was possible.

On a related note: having good infrastructure and workflow for running experiments is critical. You often make a tweak and need to rerun an experiment. If you're using a notebook, starting from scratch can be messy. Running multiple experiments (either in parallel or serially) isn't easy either, unless you just make a lot of notebooks. It's easy to accidentally throw away your model if it's not saved. And, unfortunately, saving models isn't trivial either since Pickle, which PyTorch uses for serialization, is picky about serializing anonymous or local functions. I think I could resolve some of this with a better workflow and better tooling - I imagine I'll pick that up as I get more familiar with the field.
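The pickle limitation is easy to demonstrate in isolation: named, top-level functions serialize by reference, but anonymous ones don't, which is roughly what goes wrong when a saved model's transforms include a lambda or a locally defined function:

```python
import pickle

# A top-level, named function: pickle stores a reference to it by
# module and name, so it round-trips fine within the same process.
def describe(r):
    return f"Album: {r['album']}"

restored = pickle.loads(pickle.dumps(describe))
print(restored({"album": "Nevermind"}))  # Album: Nevermind

# A lambda has no importable name, so pickle refuses it -- the same
# failure mode you hit when exporting a model whose pipeline closes
# over an anonymous function.
try:
    pickle.dumps(lambda r: f"Album: {r['album']}")
except Exception as e:
    print("pickling failed:", type(e).__name__)
```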

What's next?

I'd like to expand on the album theme. Can I classify album artwork by genre? (I'm a bit concerned this would start racially profiling people featured on the album cover, like how this classifier stereotypes music genres by language.) Can I generate fake album reviews in various styles, a la Pitchfork? Can I generate a GAN that makes fake album art? (I've Googled around and it looks like people are already taking a stab at this idea. Great! A benchmark to compare myself to.)

I considered including more data - album year, track listings, etc. - to see if that might help it become more accurate. Another approach would be to do more to clean up the data I have - pull out outlier examples, etc. But I think I'm hitting diminishing marginal returns in how much I'm learning with this project, so I don't plan to invest too much more in this.

As an exercise, I may also try to deploy one or more of these models somewhere to practice setting up a model deployment pipeline.

For now, you can check out the code at