In February 2020, The Economist published an article that compared the mood of different populations by assessing the valence of the music they listened to, i.e. by analysing the top 200 songs on Spotify. Since then, the Covid-19 pandemic has gridlocked the world. Physical illness and repeated lockdowns have likely contributed to depression, loneliness and anxiety. But how can we assess these potential effects on mental health at scale? I hypothesised that they manifest in the music people prefer: specifically, that the valence of the music people listened to during the pandemic deviated from the valence expected based on the years prior to it. To this end, I compared the valence scores predicted for 2020 (based on the three preceding, non-Covid years) with the actual 2020 scores and visualised the observed differences.
I first scraped and queried the data. This was the most time-consuming part of the project. I scraped the charts, i.e. the top 200 songs for 70 countries, from Spotify’s website with a modified script found on GitHub. The modifications included (i) changing the time interval from daily to weekly charts, which substantially reduced the dataset size; (ii) scraping the songs’ URLs, as they contain the song IDs needed to query the audio features; (iii) adding a “memory” feature so that the scraping didn’t have to start from scratch whenever it was disrupted (e.g. due to connection issues). The scraping produced 70 CSV files, which then had to be merged and cleaned (for more details please check the script “2_import_merge_charts.R”). This concluded the first part of the data gathering process.
(Update 2021-05-23: unfortunately, Spotify seems to have blocked requests to the charts webpage, at least via the “requests” library.)
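To give a flavour of the “memory” feature, below is a minimal R sketch of the resume logic. It is only an illustration: the actual scraper was a modified Python script, the country codes and dates are placeholders, and the CSV download URL pattern is an assumption about the historical spotifycharts.com endpoint which, as noted in the update above, no longer responds to such requests.

```r
# Sketch of the resume ("memory") logic; the URL pattern, country codes and dates are assumptions.
dir.create(file.path("data", "raw"), recursive = TRUE, showWarnings = FALSE)

countries = c("ad", "de", "us")  # illustrative spotifycharts.com country codes
week_starts = seq(as.Date("2017-01-06"), as.Date("2021-01-01"), by = "week")

for (country in countries) {
  for (i in seq_along(week_starts)) {
    start = week_starts[i]
    out_file = file.path("data", "raw", paste0(country, "_", start, ".csv"))
    if (file.exists(out_file)) next  # "memory": skip chart weeks that are already on disk
    url = paste0("https://spotifycharts.com/regional/", country, "/weekly/",
                 start, "--", start + 7, "/download")  # assumed historical download URL
    try(download.file(url, out_file, quiet = TRUE), silent = TRUE)  # keep going if a single request fails
  }
}
```

Because every (country, week) pair gets its own file, an interrupted run can simply be restarted and picks up where it left off. The merged chart data look like this: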
Country | Song | Artist | Date | Song.ID | Streams | Rank |
---|---|---|---|---|---|---|
Andorra | Shape of You | Ed Sheeran | 2017-01-06 | 7qiZfU4dY1lWllzX7mPBI3 | 1,588 | 1 |
Andorra | Chantaje (feat. Maluma) | Shakira | 2017-01-06 | 6mICuAdrwEjh6Y6lroV2Kg | 1,489 | 2 |
Andorra | Rockabye (feat. Sean Paul & Anne-Marie) | Clean Bandit | 2017-01-06 | 5knuzwU65gJK7IF5yJsuaW | 1,357 | 3 |
Andorra | Starboy | The Weeknd, Daft Punk | 2017-01-06 | 5aAx2yezTd8zXrkmtKl66Z | 1,099 | 4 |
Andorra | Castle on the Hill | Ed Sheeran | 2017-01-06 | 6PCUP3dWmTjcTtXY02oFdT | 1,040 | 5 |
Andorra | Reggaetón Lento (Bailemos) | CNCO | 2017-01-06 | 3AEZUABDXNtecAOSC1qTfo | 1,022 | 6 |
I next queried the audio features for all songs. Spotify offers an API that is accessible to registered developers, and once again another developer’s template served as the basis for the query. Spotify limits the number of requests, so naively each song’s audio features would have to be fetched one request at a time. There is a workaround, however: by splitting the song IDs into chunks of 100 and sending each chunk as a single request, it is possible to retrieve many sets of audio features at once. The challenge is then to untangle the doubly nested list that is returned (see script “3_audio_features_query.ipynb”). For a detailed description of the audio feature variables, please check the codebook. This concluded the second and final part of the data gathering process.
danceability | energy | key | loudness | speechiness | acousticness | liveness | valence | tempo | Song.ID |
---|---|---|---|---|---|---|---|---|---|
0.825 | 0.652 | 1 | -3.183 | 0.0802 | 0.5810 | 0.0931 | 0.931 | 95.977 | 7qiZfU4dY1lWllzX7mPBI3 |
0.852 | 0.773 | 8 | -2.921 | 0.0776 | 0.1870 | 0.1590 | 0.907 | 102.034 | 6mICuAdrwEjh6Y6lroV2Kg |
0.720 | 0.763 | 9 | -4.068 | 0.0523 | 0.4060 | 0.1800 | 0.742 | 101.965 | 5knuzwU65gJK7IF5yJsuaW |
0.681 | 0.594 | 7 | -7.028 | 0.2820 | 0.1650 | 0.1340 | 0.535 | 186.054 | 5aAx2yezTd8zXrkmtKl66Z |
0.461 | 0.834 | 2 | -4.868 | 0.0989 | 0.0232 | 0.1400 | 0.471 | 135.007 | 6PCUP3dWmTjcTtXY02oFdT |
0.761 | 0.838 | 4 | -3.073 | 0.0502 | 0.4000 | 0.1760 | 0.710 | 93.974 | 3AEZUABDXNtecAOSC1qTfo |
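As a rough illustration of the batched query, here is an R sketch using the “httr” package (the project itself used a Python notebook, “3_audio_features_query.ipynb”). The access token `token` and the character vector `song_ids` are placeholders; the endpoint is the “Get Several Audio Features” endpoint from the API documentation listed in the references.

```r
library("httr")
library("purrr")

# split the song IDs into chunks of 100, the maximum allowed per request
chunks = split(song_ids, ceiling(seq_along(song_ids) / 100))

audio_features = map_dfr(chunks, function(ids) {
  response = GET("https://api.spotify.com/v1/audio-features",
                 add_headers(Authorization = paste("Bearer", token)),
                 query = list(ids = paste(ids, collapse = ",")))
  features = content(response)$audio_features  # a list with one feature set per song
  map_dfr(compact(features), as.data.frame)    # flatten the nested list, dropping NULL entries
})
```

The flattening step is the untangling mentioned above: the response is a list with one element per song, and each element is itself a named list of audio features.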
The data preparation was rather extensive. Its comprehensive and detailed description goes beyond the scope of this high-level report but can be found here (for more detail please check the scripts). As a general rule throughout the project, I inspected the data after each and every command, which allowed me to correct mistakes and repeat erroneous queries.
In addition to the careful data inspection after each command, dedicated scripts handled the major processing steps. These included joining the two data sets, subsetting relevant variables, coherent variable naming, cleaning the country names, adding a week variable (needed for the later visualisation), checking missing data and excluding countries for which Spotify only started publishing the charts recently. The following is a sample of the “4_joining_charts_audio_features.R” and “5_data_preparation.R” scripts.
```r
# packages used in this sample
library("tidyverse")
library("countrycode")
library("lubridate")
library("aweek")
library("here")

# merge charts and audio features
data = left_join(charts, audio_features, by = "Song.ID")

# select relevant columns, change to all lower case names
data = data %>% 
  select(-c("type":"time_signature")) %>% 
  rename("country" = "Country", "song" = "Song", "artist" = "Artist", "date" = "Date", "song_id" = "Song.ID", "streams" = "Streams", "rank" = "Rank")

# clean the country names
data$country = countrycode(data$country, "country.name", "country.name", warn = FALSE, nomatch = NULL)
data$country = ifelse(data$country == "global", "Global", data$country)

# create year and week variables (with leading zeros for the single-digit weeks); these are needed later for the visualisation
data$year = isoyear(ymd(data$date))
data$week = date2week(data$date)
data$week = substr(data$week, 7, 8) # extract the week number by index

# exclude some countries due to a lack of data (Spotify started publishing charts later in some countries than in others)
data = data %>% filter(country != "Andorra", country != "Russia", country != "Ukraine", country != "India", country != "Egypt", country != "Morocco", country != "United Arab Emirates", country != "Saudi Arabia")

# export cleaned data
write.csv(data, here("data", "5_clean_data.csv"), row.names = FALSE)
```
Next, I calculated a baseline against which to compare the observed 2020 valence scores. To account for the order of the charts (the top song should be more influential than the 200th song), I weighted the valence scores by their reversed rank. Compared with weighting by the number of streams, weighting by the reversed rank is not influenced by how popular Spotify is in a given country, i.e. larger countries like the USA can be directly compared with smaller countries like Estonia. It also keeps the very top songs from dominating the weighted score.
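The following toy example (with made-up valence scores and stream counts) illustrates the difference: stream weights are easily dominated by a runaway number-one hit, whereas the reversed ranks always use the same fixed weights of 200 down to 1.

```r
toy = data.frame(rank    = 1:3,
                 valence = c(0.9, 0.4, 0.6),
                 streams = c(5e6, 4e5, 3e5))   # hypothetical stream counts

weighted.mean(toy$valence, w = toy$streams)     # ~0.85, dominated by the #1 song
weighted.mean(toy$valence, w = 201 - toy$rank)  # ~0.63, the top song matters more, but not overwhelmingly
```

In the project itself the weights are not normalised (the weekly score is the mean of valence × reversed rank, hence values around 60-70 in the tables below), but the comparison between the two weighting schemes works out the same.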
Furthermore, I split the data into a training and a test set. The training set included all data points from 2017, 2018 and 2019; the test set included the 2020 data. A linear model was trained and used to predict the 2020 valence scores. I then calculated absolute and relative differences between the observed and predicted scores. Finally, I visualised the relative differences. Below is a sample of the “6_predicting_valence.R” script.
```r
reduced_weighted_data = data %>% 
  select(country, date, year, week, rank, valence) %>% 
  mutate(reversed_rank = 201 - rank, weighted_valence = valence * reversed_rank) %>% 
  group_by(country, date, year, week) %>% 
  summarise(weighted_valence = mean(weighted_valence))

# train a model to predict the 2020 valence scores, which serve as the baseline to compare the actual 2020 data against
lm_valence_formula = as.formula(weighted_valence ~ year + week + country + year:country + week:country)

training_data = reduced_weighted_data %>% filter(year < 2020, week != 53)
test_data = reduced_weighted_data %>% filter(year >= 2020, week != 53)

# basic linear modelling and predicting
lm_valence = lm(lm_valence_formula, data = training_data)
weighted_valence_pred = predict(lm_valence, newdata = test_data)

# merge the actual and predicted data sets
combined_data = cbind(test_data, data.frame(weighted_valence_pred))

# calculate the difference between the baseline (i.e. predicted valence) and the actual valence
combined_data = combined_data %>% 
  mutate(difference = weighted_valence - weighted_valence_pred,
         proportion_of_deviation = difference / weighted_valence_pred) %>% 
  select(country, date, weighted_valence, weighted_valence_pred, difference, proportion_of_deviation)

write.csv(combined_data, here("data", "6_predicted_valence.csv"), row.names = FALSE)
```
The following section includes the full script of the visualisations. First, the working directory was set to the root of the project folder from where one can access the appropriate subfolders. To import the data, the “data.table::fread” function was used as it’s more efficient than R’s base “read.csv” function.
library("data.table")
library("tidyverse")
# set working directory to main root project folder PRIOR to load "here"
= "/Users/sebastian/Documents/Uni/Sheffield (MSc)/2. Semester/Data Analysis and Viz/spotify_audio_feature_analysis/"
file_path setwd(file_path)
# Load "here" to make command OS independent
library("here")
# import data
= data.frame(fread(here("data", "6_predicted_valence.csv")))
data
# converting variables to the right type
$country = as.factor(data$country)
data$date = as.Date(data$date)
data
::kable(head(data)) knitr
country | date | weighted_valence | weighted_valence_pred | difference | proportion_of_deviation |
---|---|---|---|---|---|
Argentina | 2020-01-08 | 65.46527 | 61.67616 | 3.789113 | 0.0614356 |
Argentina | 2020-01-15 | 65.82003 | 61.60820 | 4.211828 | 0.0683647 |
Argentina | 2020-01-22 | 66.27632 | 61.75966 | 4.516661 | 0.0731329 |
Argentina | 2020-01-29 | 66.09297 | 61.04046 | 5.052508 | 0.0827731 |
Argentina | 2020-02-05 | 65.91741 | 60.47282 | 5.444588 | 0.0900336 |
Argentina | 2020-02-12 | 65.26175 | 59.47407 | 5.787687 | 0.0973145 |
Second, I removed noise from the data: prior to visualising, I applied a smoothing technique, locally weighted regression (loess).
```r
# smoothing the proportion of deviation
models = data %>% 
  nest(-country) %>% 
  mutate(m = map(data, loess, formula = proportion_of_deviation ~ as.numeric(date), span = 0.2),
         fitted = map(m, `[[`, "fitted")) # retrieve fitted values from each model

# apply the fitted Y's as a new column
smoothed_data = models %>% 
  select(-m) %>% 
  unnest()
```
Third, to reveal geographic patterns in the data, I loosely sorted the countries according to their geographic location (roughly from east to west). Cyprus was excluded because its extreme values would skew the rest of the heat map. Moreover, as not all countries have values for the last week, that week was removed as well.
At first glance, there is a noticeable difference between Central/South America and the rest of the world, which shows somewhat higher valence scores than predicted. It is encouraging that countries close to each other show similar patterns, as this suggests that the scores capture a real construct.
In Europe, on average, there is higher valence than predicted. A hypothesis could be that people listen to more positive music to uplift their mood. However, concrete interpretation is up for debate. An interesting extension of this project would be to relate lockdown measures, Covid cases or fatalities to valence and check if there is a relationship.
```r
# order countries loosely by geographic proximity
level_ordered = c("Global", "New Zealand", "Australia", "Philippines", "Hong Kong SAR China", "Japan", "Singapore", "Malaysia", "Vietnam", "Thailand", "Taiwan", "Indonesia", "Israel", "Turkey", "Greece", "Cyprus", "Bulgaria", "Hungary", "Czechia", "Romania", "Poland", "Slovakia", "Austria", "Luxembourg", "Switzerland", "Italy", "Germany", "Netherlands", "Belgium", "France", "Spain", "Portugal", "United Kingdom", "Ireland", "Denmark", "Norway", "Sweden", "Finland", "Latvia", "Lithuania", "Estonia", "South Africa", "Iceland", "United States", "Canada", "Mexico", "Dominican Republic", "Guatemala", "El Salvador", "Honduras", "Nicaragua", "Costa Rica", "Panama", "Colombia", "Ecuador", "Peru", "Bolivia", "Chile", "Brazil", "Paraguay", "Argentina", "Uruguay")
smoothed_data$country = factor(smoothed_data$country, levels = level_ordered)

# heat map for the smoothed data
valence_heatmap = smoothed_data %>% 
  filter(country != "Cyprus", date < "2021-02-05") %>% # exclude Cyprus (extreme values) and the incomplete last week
  ggplot(aes(date, country, fill = fitted)) +
  geom_tile(color = "white") +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%b %y",
               expand = c(0, 0), # remove padding around the data
               sec.axis = dup_axis()) + # duplicate the x-axis at the top
  scale_y_discrete(limits = rev) + # reverse the y-axis so the first country in the ordering appears at the top
  scale_fill_gradientn(colours = c("#971D2B", "#FDFDC3", "#2B653C"),
                       values = c(0, 0.5, 1),
                       limits = c(-0.27, 0.27),
                       labels = scales::percent_format(accuracy = 1L),
                       name = "Deviation from\nexpected valence",
                       guide = guide_legend(title.position = "top",
                                            label.position = "bottom")) +
  labs(title = "Valence Up and Down",
       subtitle = "Based on Spotify's weekly top 200 songs per country",
       caption = "Source: https://spotifycharts.com/regional, 12.02.2021",
       x = "", y = "") +
  theme_classic() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.margin = margin(t = 1, r = 1.5, l = 0.5, unit = "cm"),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "top",
        legend.justification = "right",
        legend.direction = "horizontal",
        legend.title.align = 1,
        legend.key.width = unit(1, unit = "cm"),
        legend.key.height = unit(0.2, unit = "cm"),
        legend.spacing.x = unit(0, unit = "cm"),
        legend.margin = margin(t = -1.4, unit = "cm"),
        legend.box.margin = margin(b = -0.5, unit = "cm"))

valence_heatmap
```
The line plot shows the valence over time for the US, UK and Germany. It allows for a more detailed evaluation of the curve than the heat map. Interestingly, the UK and the US have largely similar patterns and have more positive valence scores than predicted. In contrast, Germany has slightly lower scores than predicted.
I found the peaks and turning points particularly interesting. In April, valence scores in the UK and the US reached their maximum at about the same time as the first Covid wave peaked. In contrast, in August for the US and in October for Germany, valence started to increase again, running counter to the Covid situation in those countries.
Finally, Germany was doing relatively well compared with the UK and the US during the first wave. However, the second wave in October struck Germany much harder, and at that time its valence score converged towards those of the other two countries.
The choice of countries displayed here is arbitrary; I invite you to download the data (“6_predicted_valence.csv”) and the script (“7_visualising_valence.R”) and play around with the countries.
```r
# line plot
valence_line_plot = smoothed_data %>% 
  filter(country == "United States" | country == "United Kingdom" | country == "Germany") %>% # pick your countries
  ggplot(aes(date, fitted, colour = country)) +
  geom_line() +
  scale_x_date(date_breaks = "2 month",
               date_labels = "%b %y") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
                     limits = c(-0.25, 0.25)) +
  scale_colour_discrete(name = "") +
  labs(title = "Valence Up and Down",
       subtitle = "Based on Spotify's weekly top 200 songs per country",
       caption = "Source: https://spotifycharts.com/regional, 12.02.2021",
       x = "", y = "Deviation from expected valence") +
  theme_classic() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.margin = margin(t = 1, r = 1, b = 1, l = 1, unit = "cm"),
        panel.grid.major = element_line(size = 0.1, colour = "gray"),
        legend.position = "top",
        legend.justification = "right",
        legend.text = element_text(size = 11),
        legend.direction = "horizontal",
        legend.margin = margin(t = -0.7, b = 0, unit = "cm"),
        legend.box.margin = margin(b = -0.1, unit = "cm"))

valence_line_plot
```
The visualisations revealed highly interesting patterns in the data, and the similarity of the patterns of neighbouring countries strengthened my trust in the data. However, the data are purely correlational; an interesting extension of the project would be to add data on Covid cases and fatalities, lockdown measures, etc.
It was fun to do this project and I learned a lot, especially regarding data wrangling and using Python and an API. Moreover, the project significantly increased my confidence and self-reliance in data management and visualisation.
The Economist - Data from Spotify suggest that listeners are gloomiest in February. https://www.economist.com/graphic-detail/2020/02/08/data-from-spotify-suggest-that-listeners-are-gloomiest-in-february
Spotify web scraper basis. https://gist.github.com/hktosun/d4f98488cb8f005214acd12296506f48
Spotify web scraping song IDs. https://medium.com/the-innovation/how-to-scrape-the-most-popular-songs-on-spotify-using-python-8a8979fa6b06
Spotify audio feature query basis. https://stmorse.github.io/journal/spotify-api.html
Spotify API documentation. https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-several-audio-features
My GitHub repository. https://github.com/sebastianplnr/spotify_audio_feature_analysis