Background and research question

In February 2020, The Economist published an article that compared the mood of different populations by assessing the valence of the music they listened to, i.e. by analysing the top 200 songs streamed on Spotify. Since then, the Covid-19 pandemic has gridlocked the world. Physical illness and repeated lockdowns have likely caused depression, loneliness and anxiety. But how can we assess these potential effects on mental health at scale? I hypothesised that they manifest in the music people prefer: specifically, that the valence of the music people listened to during the pandemic deviated from the valence expected based on the years prior to the pandemic. To this end, I compared the valence scores predicted for 2020 (based on the three preceding, non-Covid years) with the actual 2020 scores and visualised the observed differences.

Data origin

Scraping Spotify charts

I first scraped and queried the data. This was the most time-consuming part of the project. I scraped the charts, i.e. the weekly top 200 songs for 70 countries, from Spotify’s website with a modified script found on GitHub. The modifications included (i) changing the time interval from daily to weekly charts, which substantially reduced the dataset size; (ii) scraping the songs’ URLs, as they contain the song IDs needed to query the audio features; and (iii) adding a “memory” feature so that the scraping did not have to start from scratch whenever it was disrupted (e.g. due to connection issues). The scraping produced 70 CSV files, which had to be merged and cleaned (for more details please check the script “2_import_merge_charts.R”; a minimal sketch of the merging step follows below). This concluded the first part of the data gathering process.
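
The merging itself boils down to listing the scraped files and stacking them into one data frame. The following is a rough sketch of that idea, not an excerpt from “2_import_merge_charts.R”; the folder name “scraped_charts” is a placeholder, and the actual script also cleans the merged data.

library("tidyverse")
library("here")

# list the scraped per-country chart files ("scraped_charts" is a placeholder folder name)
chart_files = list.files(here("data", "scraped_charts"), pattern = "\\.csv$", full.names = TRUE)

# read every file and row-bind them into a single charts data frame
charts = map_dfr(chart_files, read_csv)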

(Update 2021-05-23: unfortunately, Spotify seems to have revoked permission to query the source code of its webpage, at least through the Python “requests” library.)

Head of the charts data frame
| Country | Song | Artist | Date | Song.ID | Streams | Rank |
|---------|------|--------|------|---------|---------|------|
| Andorra | Shape of You | Ed Sheeran | 2017-01-06 | 7qiZfU4dY1lWllzX7mPBI3 | 1,588 | 1 |
| Andorra | Chantaje (feat. Maluma) | Shakira | 2017-01-06 | 6mICuAdrwEjh6Y6lroV2Kg | 1,489 | 2 |
| Andorra | Rockabye (feat. Sean Paul & Anne-Marie) | Clean Bandit | 2017-01-06 | 5knuzwU65gJK7IF5yJsuaW | 1,357 | 3 |
| Andorra | Starboy | The Weeknd, Daft Punk | 2017-01-06 | 5aAx2yezTd8zXrkmtKl66Z | 1,099 | 4 |
| Andorra | Castle on the Hill | Ed Sheeran | 2017-01-06 | 6PCUP3dWmTjcTtXY02oFdT | 1,040 | 5 |
| Andorra | Reggaetón Lento (Bailemos) | CNCO | 2017-01-06 | 3AEZUABDXNtecAOSC1qTfo | 1,022 | 6 |

Querying audio features

I next queried the audio features for all songs. Spotify offers an API which is accessible to registered developers. Again, a developer offered a template to query the audio features. Spotify limits the number of requests, and a single request returns the audio features for only a limited number of songs. There is a workaround, however: by grouping the song IDs into chunks of 100 and sending each chunk as a single request, it is possible to retrieve many audio features at once. The challenge is to untangle the doubly nested list that is returned (see script “3_audio_features_query.ipynb”). For a detailed description of the audio feature variables please check the codebook. This concluded the second and final part of the data gathering process.
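
To illustrate the chunking idea in R (the actual query lives in the Python notebook), the sketch below splits the song IDs into chunks of 100 and row-binds the per-chunk results; query_audio_features() is a hypothetical stand-in for the real API call, not an existing function.

library("tidyverse")

# collect the unique song IDs and split them into chunks of 100
song_ids = unique(charts$Song.ID)
id_chunks = split(song_ids, ceiling(seq_along(song_ids) / 100))

# query_audio_features() is a hypothetical placeholder for the API call made in
# "3_audio_features_query.ipynb"; each call would return a data frame with the
# audio features for up to 100 songs, and map_dfr() stacks the per-chunk results
audio_features = map_dfr(id_chunks, query_audio_features)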

Sample of the audio feature variables
| danceability | energy | key | loudness | speechiness | acousticness | liveness | valence | tempo | Song.ID |
|---|---|---|---|---|---|---|---|---|---|
| 0.825 | 0.652 | 1 | -3.183 | 0.0802 | 0.5810 | 0.0931 | 0.931 | 95.977 | 7qiZfU4dY1lWllzX7mPBI3 |
| 0.852 | 0.773 | 8 | -2.921 | 0.0776 | 0.1870 | 0.1590 | 0.907 | 102.034 | 6mICuAdrwEjh6Y6lroV2Kg |
| 0.720 | 0.763 | 9 | -4.068 | 0.0523 | 0.4060 | 0.1800 | 0.742 | 101.965 | 5knuzwU65gJK7IF5yJsuaW |
| 0.681 | 0.594 | 7 | -7.028 | 0.2820 | 0.1650 | 0.1340 | 0.535 | 186.054 | 5aAx2yezTd8zXrkmtKl66Z |
| 0.461 | 0.834 | 2 | -4.868 | 0.0989 | 0.0232 | 0.1400 | 0.471 | 135.007 | 6PCUP3dWmTjcTtXY02oFdT |
| 0.761 | 0.838 | 4 | -3.073 | 0.0502 | 0.4000 | 0.1760 | 0.710 | 93.974 | 3AEZUABDXNtecAOSC1qTfo |

Data preparation

The data preparation was rather extensive. Its comprehensive and detailed description goes beyond the scope of this high-level report but can be found here (for more detail please check the scripts). As a general rule throughout the project, I inspected the data after every single command. This allowed me to catch mistakes early and to repeat erroneous queries.
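
To make this concrete, the inspection typically consisted of quick sanity checks along the following lines (a generic sketch, not an excerpt from the scripts).

library("tidyverse")

# typical quick checks after a transformation step
glimpse(data)              # variable types and a preview of the values
summary(data$valence)      # value range (raw valence should lie between 0 and 1)
sum(is.na(data$valence))   # count missing values
n_distinct(data$country)   # number of countries present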

In addition to the careful data inspection after each command, a dedicated script handled the major processing steps. These included joining the two data sets, subsetting the relevant variables, naming the variables coherently, cleaning the country names, adding a week variable needed for the later visualisation, checking for missing data and excluding countries for which Spotify only started publishing the charts recently. The following shows a sample of the “4_joining_charts_audio_features.R” and “5_data_preparation.R” scripts.

# merging charts and audio features
data = left_join(charts, audio_features, by = "Song.ID")


# select relevant columns, change to all lower case names
data = data %>%
  select(-c("type":"time_signature")) %>% 
  rename("country" = "Country", "song" = "Song", "artist" = "Artist", "date" = "Date", "song_id" = "Song.ID", "streams" = "Streams", "rank" = "Rank")


# clean the country names 
data$country = countrycode(data$country, "country.name", "country.name", warn = FALSE, nomatch = NULL)
data$country = ifelse(data$country == "global", "Global", data$country)


# create year and week variables (with leading zeros for single-digit weeks); these are needed later for the visualisation
data$year = isoyear(ymd(data$date))

data$week = date2week(data$date) # date2week() from the "aweek" package returns e.g. "2017-W01-5"
data$week = substr(data$week, 7, 8) # extract the two-digit week number by index


# exclude some countries due to a lack of data (Spotify started publishing the charts later in some countries than in others)
data = data %>% filter(country != "Andorra", country != "Russia", country != "Ukraine", country != "India", country != "Egypt", country != "Morocco", country != "United Arab Emirates", country != "Saudi Arabia")


# export cleaned data
write.csv(data, here("data", "5_clean_data.csv"), row.names = FALSE)

Predicting 2020 valence scores

Next, I calculated a baseline against which to compare the observed 2020 valence scores. To account for the order of the charts (the top song should be more influential than the 200th song), I weighted the valence scores by their reversed rank, i.e. the top song gets a weight of 200 and the 200th song a weight of 1. Compared with weighting by the number of streams, weighting by the reversed rank is not influenced by Spotify’s popularity in a given country, i.e. larger markets like the USA can be compared directly with smaller ones like Estonia. It also has the advantage that the top songs are not overly influential.

Furthermore, I split the data into a training and a test set. The training set included all data points of 2017, 2018 and 2019; the test set included the 2020 data. A linear model was trained and used to predict the 2020 valence scores. I calculated the absolute and relative differences between the observed and predicted scores and finally visualised the relative differences. Below is a sample of the “6_predicting_valence.R” script.

reduced_weighted_data = data %>%
  select(country, date, year, week, rank, valence) %>%
  mutate(reversed_rank = 201 - rank, weighted_valence = valence * reversed_rank) %>% 
  group_by(country, year, week) %>% 
  summarise(date = min(date), # keep one date per country and week (needed for the later plots)
            weighted_valence = mean(weighted_valence))


# train model to predict 2020 valence scores which will serve as baseline to compare the actual 2020 data against
lm_valence_formula = as.formula(weighted_valence ~ year + week + country + year:country + week:country)

training_data = reduced_weighted_data %>% filter(year < 2020, week != 53)
test_data = reduced_weighted_data %>% filter(year >= 2020, week != 53)


# basic linear modelling and predicting
lm_valence = lm(lm_valence_formula, data = training_data)
weighted_valence_pred = predict(lm_valence, newdata = test_data)


# merge the actual and predicted data sets
combined_data = cbind(test_data, data.frame(weighted_valence_pred))


# calculate difference between baseline (i.e. predicted valence) and actual valence
combined_data = combined_data %>%
  mutate(difference = weighted_valence - weighted_valence_pred,
         proportion_of_deviation = difference/weighted_valence_pred) %>%
  select(country, date, weighted_valence, weighted_valence_pred, difference, proportion_of_deviation)


write.csv(combined_data, here("data", "6_predicted_valence.csv"), row.names = FALSE)

Visualisation

The following section includes the full script for the visualisations. First, the working directory is set to the root of the project folder, from which the relevant subfolders can be accessed. To import the data, the “data.table::fread” function is used, as it is more efficient than R’s base “read.csv” function.

library("data.table")
library("tidyverse")

# set the working directory to the project root folder PRIOR to loading "here"
file_path = "/Users/sebastian/Documents/Uni/Sheffield (MSc)/2. Semester/Data Analysis and Viz/spotify_audio_feature_analysis/"
setwd(file_path)

# Load "here" to make command OS independent
library("here")

# import data
data = data.frame(fread(here("data", "6_predicted_valence.csv")))

# converting variables to the right type
data$country = as.factor(data$country)
data$date = as.Date(data$date)

knitr::kable(head(data))
| country | date | weighted_valence | weighted_valence_pred | difference | proportion_of_deviation |
|---|---|---|---|---|---|
| Argentina | 2020-01-08 | 65.46527 | 61.67616 | 3.789113 | 0.0614356 |
| Argentina | 2020-01-15 | 65.82003 | 61.60820 | 4.211828 | 0.0683647 |
| Argentina | 2020-01-22 | 66.27632 | 61.75966 | 4.516661 | 0.0731329 |
| Argentina | 2020-01-29 | 66.09297 | 61.04046 | 5.052508 | 0.0827731 |
| Argentina | 2020-02-05 | 65.91741 | 60.47282 | 5.444588 | 0.0900336 |
| Argentina | 2020-02-12 | 65.26175 | 59.47407 | 5.787687 | 0.0973145 |

Second, I removed noise from the data: prior to visualising, I applied a smoothing technique (locally weighted regression, loess).

# smoothing the proportion of deviation
models = data %>%
  nest(-country) %>%
  mutate(m = map(data, loess, formula = proportion_of_deviation ~ as.numeric(date), span = 0.2),
         fitted = map(m, `[[`, "fitted")) # retrieve fitted values from each model


# add the fitted values as a new column
smoothed_data = models %>%
  select(-m) %>%
  unnest()

Heat map

Third, to reveal geographic patterns in the data, I loosely sorted the countries according to their geographic location (from east to west). Cyprus was excluded because its extreme values would skew the rest of the heat map. Moreover, as not all countries have values for the last week, that week was removed as well.

At first glance, there is a noticeable difference between Central/South America and the rest of the world, which shows somewhat higher valence scores than predicted. It is encouraging to see that countries close to each other show similar patterns, as this suggests that the scores capture a real construct.

In Europe, valence is on average higher than predicted. One hypothesis is that people listened to more positive music to lift their mood; however, any concrete interpretation is up for debate. An interesting extension of this project would be to relate lockdown measures, Covid cases or fatalities to valence and check whether there is a relationship.

# ordering countries loosely by proximity
level_ordered = c("Global", "New Zealand", "Australia", "Philippines", "Hong Kong SAR China", "Japan", "Singapore", "Malaysia", "Vietnam", "Thailand", "Taiwan", "Indonesia", "Israel", "Turkey", "Greece", "Cyprus", "Bulgaria", "Hungary", "Czechia", "Romania", "Poland", "Slovakia", "Austria","Luxembourg", "Switzerland", "Italy", "Germany", "Netherlands", "Belgium", "France", "Spain", "Portugal", "United Kingdom", "Ireland", "Denmark", "Norway", "Sweden", "Finland", "Latvia", "Lithuania", "Estonia", "South Africa", "Iceland", "United States", "Canada", "Mexico", "Dominican Republic", "Guatemala", "El Salvador", "Honduras", "Nicaragua", "Costa Rica", "Panama", "Colombia", "Ecuador", "Peru", "Bolivia", "Chile", "Brazil", "Paraguay", "Argentina", "Uruguay")
smoothed_data$country = factor(smoothed_data$country, levels = level_ordered)


# heat map for smoothed data
valence_heatmap = smoothed_data %>%
  filter(country != "Cyprus", date < "2021-02-05") %>% # exclude Cyprus due to skewing the results
  
  ggplot(aes(date, country, fill = fitted)) +
  geom_tile(color = "white") +
  
  scale_x_date(date_breaks = "1 month",
               date_labels = "%b %y",
               expand = c(0, 0), # removing padding around the data 
               sec.axis = dup_axis()) + # duplicate x-axis to the top
  
  scale_y_discrete(limits = rev) + # reverse the order of the countries on the y-axis
  
  scale_fill_gradientn(colours = c("#971D2B", "#FDFDC3", "#2B653C"),
                       values = c(0, 0.5, 1),
                       limits = c(-0.27, 0.27),
                       labels = scales::percent_format(accuracy = 1L),
                       name = "Deviation from\nexpected valence",
                       guide = guide_legend(title.position = "top",
                                            label.position = "bottom")) +
  
  labs(title = "Valence Up and Down",
       subtitle = "Based on Spotify's weekly top 200 songs per country",
       caption = "Source: https://spotifycharts.com/regional, 12.02.2021",
       x = "", y = "") +
  
  theme_classic() +
  
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.margin = margin(t = 1, r = 1.5, l = 0.5, unit = "cm"),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "top",
        legend.justification = "right",
        legend.direction = "horizontal",
        legend.title.align = 1,
        legend.key.width = unit(1, unit = "cm"),
        legend.key.height = unit(0.2, unit = "cm"),
        legend.spacing.x = unit(0, unit = "cm"),
        legend.margin = margin(t = -1.4, unit = "cm"),
        legend.box.margin = margin(b = -0.5, unit = "cm"))

valence_heatmap

Line plot

The line plot shows the deviation from the expected valence over time for the US, the UK and Germany. It allows for a more detailed reading of the curves than the heat map. Interestingly, the UK and the US show largely similar patterns, with more positive valence scores than predicted. In contrast, Germany has slightly lower scores than predicted.

I found the peaks and turning points particularly interesting. In April, valence scores in the UK and the US reached their maximum at about the same time as the first Covid wave peaked. In contrast, valence started to rise again in August in the US and in October in Germany, running counter to the Covid situation in these countries.

Finally, Germany was doing relatively well compared to the UK and the US during the first wave. However, the second wave in October struck Germany much harder, and at that time its valence score converged towards those of the other two countries.

The choice of countries displayed here is arbitrary; I invite you to download the data (“6_predicted_valence.csv”) and the script (“7_visualising_valence.R”) and play around with the countries.

# line plot
valence_line_plot = smoothed_data %>%
  filter(country == "United States" | country == "United Kingdom" | country == "Germany") %>% # pick your country
  
  ggplot(aes(date, fitted, colour = country)) +
  geom_line() +
  
  scale_x_date(date_breaks = "2 month",
               date_labels = "%b %y") +
  
  scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
                     limits = c(-0.25, 0.25)) +
  
  scale_colour_discrete(name = "") +
  
  labs(title = "Valence Up and Down",
       subtitle = "Based on Spotify's weekly top 200 songs per country",
       caption = "Source: https://spotifycharts.com/regional, 12.02.2021",
       x = "", y = "Deviation from expected valence") +
  
  theme_classic() +
  
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.margin = margin(t = 1, r = 1, b = 1, l = 1, unit = "cm"),
        panel.grid.major = element_line(size = 0.1, colour = "gray"),
        legend.position = "top",
        legend.justification = "right",
        legend.text = element_text(size = 11),
        legend.direction = "horizontal",
        legend.margin = margin(t = -0.7, b = 0, unit = "cm"),
        legend.box.margin = margin(b = -0.1, unit = "cm"))

valence_line_plot

Conclusion

The visualisations revealed highly interesting patterns in the data, and the similarity between neighbouring countries strengthened my trust in the data. However, the data are purely correlational. An interesting extension of the project would be to add data on Covid cases and fatalities, lockdown measures, etc.

It was fun to do this project and I learned a lot, especially regarding data wrangling and using Python and an API. Moreover, the project significantly increased my confidence and independence in data management and visualisation.

References