The Recommended Music That I Don’t Like Is a So-Called False Positive

Think about the music recommendation algorithm in a customer-centric way.

Jinhang Jiang
5 min read · Apr 30, 2022
Photo by John Tekeridis: https://www.pexels.com/photo/person-using-smartphone-340103/

Introduction

Earlier this afternoon, I was listening to music while reading, and the app started ‘recommending’ music that really got on my nerves. At that exact moment, a seemingly unrelated term came to mind: false positive.

For music platforms like Apple Music, Spotify, and Amazon Music, the recommendation teams work hard to build systems that identify my interests and personalize my experience, so that I, as a customer, will be retained. Before a new piece of music enters my listening queue, it has probably been analyzed and compared with the music I previously marked as LIKE, and the system has concluded that I will like it.

The real system behind the screen must be far more complex than what I described above. It probably includes several unsupervised-learning sub-modules, such as music clustering, user clustering, and NLP analysis of the lyrics. With all of that work in place, the system still has to make one final decision: will I like this piece of music or not? And clearly, the system made a wrong decision this afternoon.

Well, maybe I am too picky. As a customer, I don’t really need to think about the reasons or the algorithms; all I need to do is mark the track as “not interested / dislike,” trust the app, and hope I don’t get recommendations like this again. Yet as a data scientist, I can’t help myself, and I do believe that the recommended music I don’t like is one of the false positives generated by the algorithm. To fix it, we probably need to shift our strategy from algorithm-centric to customer-centric, at least temporarily, while we analyze the issue.

I would rather miss out on music I would probably like than listen to music that gets on my nerves.

Data

I found a labeled Spotify dataset on Kaggle, which should fit our needs in this case. You can find the data here: https://www.kaggle.com/datasets/bricevergnou/spotify-recommendation?select=data.csv.

The original dataset has 100 pieces of music marked as LIKED and 95 marked as DISLIKED. To mimic our situation, I dropped 35 DISLIKED tracks before fitting the model, since an imbalanced class distribution is more likely to lead to misclassifications.

I took only 15% of the data for training and used the rest for testing, as sketched below. The available features are all numeric, and since we are not interested in each feature’s individual contribution, no further feature engineering was conducted.
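A minimal sketch of this setup (the label column name liked comes from the Kaggle page; the random seed, the stratified split, and the choice of which 35 DISLIKED tracks to drop are my assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle dataset; "liked" (1 = LIKED, 0 = DISLIKED) is the label column.
df = pd.read_csv("data.csv")

# Drop 35 DISLIKED tracks to create the imbalance described above.
disliked_idx = df[df["liked"] == 0].sample(n=35, random_state=42).index
df = df.drop(disliked_idx)

X = df.drop(columns="liked")
y = df["liked"]

# Use only 15% of the data for training; the rest is held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.15, random_state=42, stratify=y
)
```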

Modeling

Base Model

Since we are interested in avoiding false-positive recommendations, I picked the precision score and the f1 score as the evaluation metrics. Precision is TP / (TP + FP), so the fewer false positives there are, the higher the precision score.

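A minimal sketch of the base model, reusing the train/test split from above (the hyperparameters are scikit-learn defaults):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_score

# Fit a plain Random Forest as the base model.
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Count false positives and compute the two metrics on the test set.
pred = rf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f"FP: {fp}, precision: {precision_score(y_test, pred):.2f}, "
      f"f1: {f1_score(y_test, pred):.2f}")
```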

I first built a base model with a Random Forest classifier. On the test data, we found 8 false positives; the precision score was 0.90 and the f1 score 0.87.

Manually Forcing the FP Down

If we use predict_proba to make predictions, the classifier returns the probability it assigns to each class instead of 0s and 1s. I manually raised the decision threshold from 0.5 to 0.9, as shown below. It is like telling the classifier to recommend new music to us only if it is at least 90% confident in its decision, and otherwise to mark the track as 0.

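A sketch of the manual threshold, continuing from the fitted model above:

```python
from sklearn.metrics import confusion_matrix, f1_score

# predict_proba returns one probability per class; column 1 is P(LIKED).
proba = rf.predict_proba(X_test)[:, 1]

# Only recommend when the model is at least 90% confident; otherwise predict 0.
pred_90 = (proba >= 0.9).astype(int)

tn, fp, fn, tp = confusion_matrix(y_test, pred_90).ravel()
print(f"FP: {fp}, FN: {fn}, f1: {f1_score(y_test, pred_90):.2f}")
```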

After adjusting the threshold, we no longer see any false-positive recommendations. However, the number of false negatives increased to 77, and the f1 score dropped to 0.19.

Let’s do a grid search to see how the threshold affects the scores. We increase the threshold by 0.025 each iteration and print the results:

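A rough version of that loop might look like this:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score

# Sweep the threshold upward in steps of 0.025 and report the scores.
for threshold in np.arange(0.5, 1.0, 0.025):
    pred_t = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred_t).ravel()
    precision = precision_score(y_test, pred_t, zero_division=0)
    print(f"threshold={threshold:.3f}  FP={fp}  FN={fn}  "
          f"precision={precision:.2f}  f1={f1_score(y_test, pred_t):.2f}")
```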

Simply increasing the threshold will certainly eliminate the false-positive recommendations, but it also removes a large share of the correct ones. So let’s find a more “algorithmic” way to approach this problem.

Class_Weight in Random Forest

One of the parameters of the Random Forest classifier is class_weight: the weights associated with the classes, given in the form {class_label: weight}. If it is not specified, all classes are assumed to have weight one. The parameter also accepts two mode names as strings. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to the class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)). The “balanced_subsample” mode is the same as “balanced,” except that the weights are computed from the bootstrap sample for every tree grown (see the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

To give you a better sense of how this parameter can help in our situation, I conducted another grid search and stored the results in a table:

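A sketch of this grid search; the exact weight grid here (1.0 to 2.0 in steps of 0.05) is my assumption, and only the weight for class 1 varies while class 0 stays at 1:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_score

rows = []
# Refit the model for each candidate weight on class 1 (class 0 stays at 1).
for w in np.arange(1.0, 2.01, 0.05):
    rf_w = RandomForestClassifier(class_weight={0: 1.0, 1: w}, random_state=42)
    rf_w.fit(X_train, y_train)
    pred_w = rf_w.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, pred_w).ravel()
    rows.append({"weight": w,
                 "fp": fp,
                 "precision": precision_score(y_test, pred_w, zero_division=0),
                 "f1": f1_score(y_test, pred_w)})

results = pd.DataFrame(rows)
print(results)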

Let’s visualize the result:
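A minimal matplotlib sketch for the plot below, using the results table from the grid search:

```python
import matplotlib.pyplot as plt

# Precision and f1 as the class-1 weight increases.
plt.plot(results["weight"], results["precision"], label="precision")
plt.plot(results["weight"], results["f1"], label="f1")
plt.xlabel("weight for class 1")
plt.ylabel("score")
plt.legend()
plt.show()
```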

(Figure: precision and f1 scores plotted against the weight for class 1. Image by author.)

According to the iterations, the optimal weight for class 1 is 1.25. By setting the weight to 1.25, we removed 3 of the false-positive recommendations, and the precision score increased to 0.93 with only a minimal decrease in the f1 score (-0.0075).

Conclusion

This project started with the question: what if I would rather miss out on a song I’d probably like than listen to one that gets on my nerves? I tried to penalize the false-positive recommendations with the class_weight parameter in Random Forest, and with a minimal loss in the f1 score we got a good increase in the precision score. XGBoost has a similar parameter called scale_pos_weight, which is meant to serve the same purpose.
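As a sketch (the value 0.8 here is purely illustrative; any value below 1 de-emphasizes the positive class):

```python
from xgboost import XGBClassifier

# A scale_pos_weight below 1 down-weights the positive (LIKED) class,
# nudging the model toward fewer false positives; 0.8 is illustrative only.
xgb = XGBClassifier(scale_pos_weight=0.8, random_state=42)
xgb.fit(X_train, y_train)
```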

Another example involving this issue is spam email identification, where each false-positive classification represents the chance that an important email is deleted without notice.

The idea I really want to bring up by walking through this project is that the best score for our machine learning model will probably not lead to the best user experience in production. We work hard every day to build models that we believe will bring the best UX to our customers. However, no matter how advanced the algorithm is, human beings deserve options. I’d really like to see a button in my app one day that lets me penalize the false positives, so that the likelihood of getting a goofy song is pushed toward zero. The same goes the other way around for people who would rather listen to new music anyway: penalizing the false negatives should be their option.

Please feel free to connect with me on LinkedIn.

Reference

Vergnou, B. (2021). Spotify Recommendation [Data set]. Kaggle. https://www.kaggle.com/datasets/bricevergnou/spotify-recommendation?select=data.csv

Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
