Data Synthesis and Visualization
1. Background
As a passionate rap enthusiast, I wondered if there were any patterns to uncover in the genre's large and diverse library of tracks. Billboard has long been an industry standard in music rankings, so I wanted to start by pulling data on its top rap songs from the chart's inception to the present day. Billboard first released its Hot Rap Songs list on March 11, 1989 and has updated it weekly since then. Each week, about 25 songs make the list.
While this was a good method of getting a list of relevant tracks to analyze, the next challenge was getting actual data on these tracks. I wanted to get both musical features (such as tempo, danceability, loudness, etc.) as well as lyrical features (words per second, overall sentiment, most common words, etc.). Luckily, two other music resources, Spotify and Genius, make this data readily available through APIs. Thus, I was left with the following process for data pulling:
- Scrape Billboard charts for top rap tracks
- Use the Spotify API to get musical features for these tracks
- Use the Genius API to get the lyrics for these tracks
After combining the data from these three sources, I chose to focus on two questions:
- How has rap changed over the years? Can I represent this visually?
- Is it possible to accurately predict which tracks will break into the top 5 spots on Billboard? Do top 5 rap songs across all years share any similarities?
This notebook will focus on pulling, combining, and cleaning data in addition to developing insights for the first question above.
The analysis resulted in the following conclusions:
- Profanity seems to have increased with the passage of time. The Golden Era in particular was relatively free of profanity.
- The word “yo” was common in the Golden Era but has since been less prevalent.
- Several words were popular across all eras: love, money, real, girl, etc.
- Danceability steadily declined since the Golden Era, reached a bottom during the Blog Era, and trended upward again.
- Energy had a double peak during the Bling and Blog Eras.
- Rap got significantly louder post-Golden Era and has stayed elevated since.
- Tracks have clearly gotten shorter over time, with rap songs today more than a minute shorter than in the Golden Era.
- Words per second has trended upward, suggesting that rap today is faster than in any other period.
- Tracks today involve more repetition.
- Tracks today are less feel-good than before.
- Within tracks, sentiment is changing much more than before (tracks alternate between happier and sadder vibes).
- Not surprisingly, more recent tracks are currently more popular on Spotify.
2. Pulling, Combining, and Cleaning Data
There are Python modules that act as wrappers for scraping Billboard and pulling from the Spotify and Genius API. The following packages need to be installed to replicate the analysis below. I also had to set up an account on Spotify/Genius directly to get API permission. Spotify can be set up here while Genius can be set up here. Note that this data retrieval process took quite a long time given the volume of data being pulled, so the reader may want to just browse over the code in this section and load the finished data file directly from my GitHub repository.
pip install billboard.py
pip install spotipy
pip install lyricsgenius
In order to gauge the progress of my pull requests, I used the following module:
pip install tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
2.1. Billboard Data
import billboard
from random import randint
from time import sleep
song_rows = [] #Collect one row per song per chart, then build the dataframe at the end
start_date = "1989-03-11" #Earliest chart available
end_date = "2020-09-05" #Most recent chart at time of project release
chart = billboard.ChartData('rap-song', date = end_date)
stop = False
while not stop:
    print(chart.date)
    for song in chart:
        song_rows.append({"title" : song.title,
                          "artist" : song.artist,
                          "peak_pos" : song.peakPos,
                          "last_pos" : song.lastPos,
                          "weeks_on_chart" : song.weeks,
                          "current_rank" : song.rank,
                          "is_new" : song.isNew,
                          "week" : pd.to_datetime(chart.date)})
    #Pause for 1 to 5 seconds between each chart pull in order to avoid overloading site servers
    sleep(randint(1,5))
    print("done sleep")
    if chart.date == start_date:
        stop = True
    else:
        chart = billboard.ChartData('rap-song', chart.previousDate)
billboard_df = pd.DataFrame(song_rows) #Build the dataframe once after the loop (appending row by row is slow and deprecated in newer pandas)
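The line that writes this result to disk is not shown in the notebook; presumably it was a simple to_csv call along the lines of the following (the filename is taken from the read_csv call below):
#Presumed checkpoint: save the scraped charts so the slow pull never needs to be repeated
billboard_df.to_csv("billboard_history.csv")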
#Re-load the saved result of the above data pull
billboard_df = pd.read_csv("billboard_history.csv", index_col=0)
billboard_df

After the first data pull, I had over 50,000 rows of tracks. Of course, some of these are duplicates as the same track can appear on multiple different charts. I decided to keep all of these rows for now.
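As a quick sanity check (not part of the original notebook), the number of repeat appearances can be counted directly with pandas:
#How many rows are repeat appearances of the same title/artist pair across weekly charts?
billboard_df.duplicated(subset=["title", "artist"]).sum()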
2.2. Spotify Data
I noticed that the artist column above often contains multiple artists (often separated by the word “Featuring”). After exploring the column, I decided it would be best to remove any words not corresponding to actual artist names. Regular expressions would be useful to accomplish this.
import re #import regex functions
def get_artists(artist_string):
    artists = re.sub('(?i)[()\[\],]|featuring\s|feat?\.\s','',artist_string)
    artists = re.sub('(?i)(\s[x&]\s)|(\swith\s)',' ', artists)
    return artists
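To illustrate what the cleanup does, here are two example artist strings (chosen for illustration, not taken from the dataset) run through the function:
get_artists("Cardi B Featuring Megan Thee Stallion") #returns 'Cardi B Megan Thee Stallion'
get_artists("Drake & Future") #returns 'Drake Future'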
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
client_id = 'id_string' #Replace with your Spotify API ID for replication
client_secret = 'secret_string' #Replace with your Spotify API secret for replication
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
#For each row, use spotipy package to pull the spotify track object
def get_spotify_track_object(row):
    artists = get_artists(row["artist"])
    title = row["title"]
    q = "track:" + title + " artist:" + artists
    track_object = sp.search(q, limit = 1, type = 'track')
    return track_object
from tqdm import tqdm
tqdm.pandas() #Allows us to use df.progress_apply()
billboard_spotify_df = billboard_df.copy() #Copy so the original Billboard dataframe is not modified in place
billboard_spotify_df['track_object'] = billboard_spotify_df.progress_apply(get_spotify_track_object, axis = 1)
#For each track object pulled before, access and save its data
def extract_track_object_features(row):
    try:
        track_data = row['track_object']['tracks']['items'][0]
        album_type = track_data['album']['album_type']
        album_name = track_data['album']['name']
        album_release_date = track_data['album']['release_date']
        album_num_tracks = track_data['album']['total_tracks']
        duration_ms = track_data['duration_ms']
        explicit = track_data['explicit']
        spotify_track_id = track_data['id']
        spotify_popularity = track_data['popularity']
        spotify_track_name = track_data['name']
        track_number = track_data['track_number']
        return album_type, album_name, album_release_date, album_num_tracks, duration_ms, explicit, spotify_track_id, spotify_popularity, spotify_track_name, track_number
    except: #if no track object was returned from the API pull, return None
        pass
billboard_spotify_df[['album_type', 'album_name', 'album_release_date', 'album_num_tracks',
'duration_ms', 'explicit', 'spotify_track_id', 'spotify_popularity',
'spotify_track_name', 'track_number']] = billboard_spotify_df.progress_apply(extract_track_object_features, axis = 1, result_type="expand")
Due to inconsistencies between Billboard and Spotify artist/track names, and because Spotify's library does not contain every track (owing to licensing constraints), some tracks that made it onto the Billboard charts did not return a corresponding Spotify track object. These rows will need to be dropped. Instead of dropping them blindly, I first wanted to see how many missing tracks came from each year and whether there was any pattern.
#Extract year from chart week
billboard_spotify_df['year'] = pd.to_datetime(billboard_spotify_df['week']).dt.year
#Sum missing tracks grouped by year
billboard_spotify_df.set_index('year')['album_type'].isna().groupby(level=0).sum()

The majority of missing tracks are from 2001 and earlier. With some manual inspection, I noticed that many of these earlier tracks are more obscure and less popular, which probably lowers their chances of being included in Spotify's library. While there are certainly more statistically sound ways to handle this issue, I decided to simply drop the rows with missing tracks. I was left with about 39,000 rows (including repeats).
billboard_spotify_df_2 = billboard_spotify_df.dropna()
billboard_spotify_df_2.shape
(38997, 16)
Each Spotify track object has a corresponding ID which can be fed into the API to pull its audio features. More detail on how Spotify calculates these values can be found here. Below, I called the API on each track and added the audio features to a dataframe.
def extract_audio_features(row):
    audio_features_object = sp.audio_features(row['spotify_track_id'])[0]
    features_to_return = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
                          'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']
    return [audio_features_object[feature] for feature in features_to_return]
billboard_spotify_df_2[['danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'time_signature']] = billboard_spotify_df_2.progress_apply(extract_audio_features, axis = 1, result_type="expand")
#Drop unnecessary columns to save memory
billboard_spotify_df_2 = billboard_spotify_df_2.drop(columns=["is_new","last_pos","track_object","album_type",
"album_name","album_release_date","album_num_tracks",
"spotify_track_id","spotify_track_name","track_number",
"key","mode","instrumentalness","liveness","tempo",
"time_signature"])
billboard_spotify_df_2 = pd.read_pickle("billboard_spotify_2.pkl")
2.3. Genius Data
The final data pulling step was getting the actual lyrics from Genius for NLP exploration. This API was quite slow, so I needed to maximize efficiency. Since many rows corresponded to the same tracks, it did not make sense to repeatedly call the API for the same lyrics. To address this, I saved each lyric pull into song_object_dic. If a track was already in this dictionary, I copied over the saved lyrics rather than calling the API again.
import lyricsgenius
genius = lyricsgenius.Genius("client_access_token") #Replace with your Genius token for replication
genius.remove_section_headers = True
#Increase timeout limit in case API is slow to pull data
genius.timeout = 60
song_object_dic = {}
def get_genius_song_object(row):
    artists = get_artists(row["artist"])
    title = row["title"]
    #If we already have these lyrics saved then copy them over
    if (artists,title) in song_object_dic:
        return song_object_dic[(artists,title)]
    #Retry until the API call succeeds (the Genius API occasionally times out)
    while True:
        try:
            song = genius.search_song(title, artists)
            break
        except:
            pass
    #If the Genius API search did not find any results, return None
    if song is None:
        song_object_dic[(artists,title)] = None
        return None
    #Otherwise, add the song object to the dictionary and return it
    else:
        song_object_dic[(artists,title)] = song
        return song
complete_df = billboard_spotify_df_2.copy()
complete_df["song_object"] = complete_df.progress_apply(get_genius_song_object, axis = 1)
complete_df = pd.read_pickle('complete.pkl')
Next, for each Genius song object, I accessed its attributes (including lyrics) and added it to a dataframe.
def extract_song_object_features(row):
    try:
        song_object = row['song_object']
        genius_album = song_object.album
        genius_artist = song_object.artist
        genius_song_title = song_object.title
        genius_year = song_object.year
        genius_lyrics = song_object.lyrics
        return genius_album, genius_artist, genius_song_title, genius_year, genius_lyrics
    except:
        pass
complete_df[["genius_album", "genius_artist", "genius_song_title", "genius_year", "genius_lyrics"]] = complete_df.progress_apply(extract_song_object_features, axis = 1, result_type="expand")
#Drop columns no longer needed
complete_df_2 = complete_df.drop(columns=["song_object","genius_album","genius_artist",
"genius_song_title","genius_year"])
As with the Spotify step, some track lyrics were not found by the Genius API search. Below, I once again calculated the number of missing tracks per year. The distribution of missing tracks across years is more uniform than in the Spotify step, so it seemed okay to drop them.
complete_df_2.set_index('year')['genius_lyrics'].isna().groupby(level=0).sum()

complete_df_2 = complete_df_2.dropna(subset=["genius_lyrics"])
complete_df_2 = complete_df_2.reset_index(drop=True)
complete_df_2 = pd.read_pickle("complete_2.pkl")
3. Visualization
3.1. Natural Language Processing
With all of the data pulled and saved, it was time to analyze the lyrics. I used the popular NLTK package:
pip install nltk
I used the following steps to process lyrics:
- Convert all characters to lowercase
- Remove any characters that are not letters, digits, spaces, or linebreaks
- Use a stemmer on each word and save the result into a list of words
As before, since there were repeat tracks, I used a dictionary to keep track of what I had already processed and avoid unnecessary computation.
import nltk
#The below command only needs to be run the first time NLTK is used. A separate window will pop up to guide installation.
#nltk.download()
from nltk.stem import PorterStemmer
ps = PorterStemmer()
processed_lyrics_dic = {}
def lyric_processor(row):
    artist = row["artist"]
    title = row["title"]
    if (artist,title) in processed_lyrics_dic:
        return processed_lyrics_dic[(artist,title)]
    #Step 1
    lyrics = row["genius_lyrics"].lower()
    #Step 2
    lyrics = re.sub("[^\w\s]+", "", lyrics)
    #Step 3
    word_tokens = nltk.word_tokenize(lyrics)
    processed_lyrics = [ps.stem(token) for token in word_tokens]
    processed_lyrics_dic[(artist,title)] = processed_lyrics
    return processed_lyrics
complete_df_2["processed_lyrics"] = complete_df_2.progress_apply(lyric_processor, axis = 1)
Another challenge is that tracks often repeat the same word many times. One method would be to count every occurrence toward the total word count. However, in an extreme case, a track that just repeated one word over and over would give that word an unfairly large boost in popularity. I got around this by only considering the distinct set of words for each track, so a single track can contribute at most +1 to a particular word’s count. However, if the same track appeared in multiple weeks, it would contribute +1 for each appearance. This seemed like a good balance for assessing word popularity, as songs that stayed on the charts longer have their words counted more often.
#Consider only distinct word occurrences per track
complete_df_2["unique_lyrics"] = complete_df_2["processed_lyrics"].progress_apply(lambda x : list(set(x)))
Next, I had to remove any stop words that did not contribute much to the overall meaning of a track. NLTK includes a set of English stop words. I ran the same pre-processing as before on this set (removing non-word/non-space characters and stemming the result). With some additional manual analysis, I came up with a list of custom_stopwords that I also wanted to remove.
from nltk.corpus import stopwords
stopwords = stopwords.words("english")
stop_list = [ps.stem(re.sub("[^\w\s]+", "",stopword)) for stopword in stopwords]
custom_stopwords = ['woah','come','could','gon','ye','want','okay','someth','wan','caus','ive','back','na','huh','came',
'ta','say','said','aint','lot','ho','much',
'imma','lookin','like','bout','got','ever','ima','us',
'gave','ill',
'tryna','til','comin','know','anoth','noth','need','mine','done','let',
'would','cant',
'get','made','im','everyth','around','give','gettin',
'though','em','well','thought','better','hey',
'ooh','go','think','yeah',
'way','goin','put','yall','ya','uh','might','thing','oh','make','see',
'take','one','keep','even']
stop_list+=custom_stopwords
complete_df_2["unique_lyrics_no_stopwords"] = complete_df_2["unique_lyrics"].progress_apply(lambda x : [w for w in x if w not in stop_list])
To deal with profanity, I used the following:
pip install better_profanity
For any profane words, I replaced all characters except the first and last with asterisks.
from better_profanity import profanity
def filter_profanity(word):
    if profanity.contains_profanity(word):
        return word[0] + "*"*(len(word)-2) + word[-1]
    return word
complete_df_2["unique_lyrics_no_stopwords"] = complete_df_2["unique_lyrics_no_stopwords"].progress_apply(lambda x : [filter_profanity(w) for w in x])
To see the evolution of rap, I split the data into four distinct eras. I based the split on a Wikipedia article. Below, I assign each track to an era (based on the year the chart is from) and group the unique lyrics without stop words from each era.
#Ensure the year column is stored as an integer
complete_df_2["year"] = complete_df_2["year"].astype(int)
def get_era(year):
    if year <= 1997:
        return "Golden Era (1989-1997)"
    elif 1997 < year <= 2006:
        return "Bling Era (1998-2006)"
    elif 2006 < year <= 2014:
        return "Blog Era (2007-2014)"
    else:
        return "Trap/Mumble Era (2015-Present)"
complete_df_2["era"] = complete_df_2["year"].apply(get_era)
eras = np.unique(complete_df_2.era.values)
lyrics_by_era = []
for era in eras:
    era_df = complete_df_2.query('era == @era')
    lyrics_list = np.concatenate(era_df.unique_lyrics_no_stopwords.values)
    lyrics_by_era.append((era, lyrics_list.tolist()))
lyrics_by_era_df = pd.DataFrame(lyrics_by_era, columns = ["era", "lyrics"])
#Reorder rows from earliest to most recent era (by era name rather than by position)
era_order = ["Golden Era (1989-1997)", "Bling Era (1998-2006)", "Blog Era (2007-2014)", "Trap/Mumble Era (2015-Present)"]
lyrics_by_era_df = lyrics_by_era_df.set_index("era").loc[era_order].reset_index()
Finally, with the text processed successfully, it was time to build some visualizations. I used another package to build some word clouds.
pip install wordcloud
from collections import Counter
from wordcloud import WordCloud
fig, ax = plt.subplots(2,2,figsize=(20,20), facecolor=None)
plt.tight_layout(pad = 3)
for i, row in lyrics_by_era_df.iterrows():
    x, y = divmod(i, 2) #Map era index to its position in the 2x2 grid of subplots
    word_count = Counter(row["lyrics"])
    wordcloud = WordCloud(width = 800, height = 800, max_words=100, background_color="white",
                          random_state=23, min_font_size = 10,
                          relative_scaling=.5, colormap="plasma").generate_from_frequencies(word_count)
    ax[x][y].imshow(wordcloud)
    ax[x][y].axis("off")
    ax[x][y].set_title(row["era"], pad=15, fontsize=30, fontweight="bold")

From these word clouds, a few interesting patterns are evident:
- Profanity seems to have increased with the passage of time. The Golden Era in particular was relatively free of profanity.
- The word “yo” was common in the Golden Era but has since been less prevalent.
- Several words were popular across all eras: love, money, real, girl, etc.
I moved on to analyzing the overall sentiment of each track using yet another package:
pip install vaderSentiment
Below, I calculate a compound polarity score for each line in a track, then return both the average and the standard deviation of polarity across lines.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
def calculate_sentiment(row):
    lyrics = row["genius_lyrics"]
    sentiment_list = []
    for line in lyrics.splitlines():
        if len(line) > 0:
            polarity_score = analyser.polarity_scores(line)
            sentiment_list.append(polarity_score["compound"])
    return sum(sentiment_list)/len(sentiment_list), np.std(sentiment_list)
complete_df_2[["mean_sentiment","std_sentiment"]] = complete_df_2.progress_apply(calculate_sentiment, axis=1, result_type="expand")
Prior to creating more visualizations, I engineered some more feature columns. Note that the unique_word_proportion column attempts to capture the diversity of words in a track. This is calculated as the number of unique words divided by the total number of words.
complete_df_3 = complete_df_2.copy()
complete_df_3["duration_sec"] = complete_df_3["duration_ms"]/1000
complete_df_3["word_count"] = complete_df_3["genius_lyrics"].progress_apply(lambda x: len(x.split()))
complete_df_3["words_per_second"] = complete_df_3["word_count"]/complete_df_3["duration_sec"]
complete_df_3["unique_word_proportion"] = complete_df_3["unique_lyrics"].apply(lambda x: len(x))/complete_df_3["word_count"]
complete_df_3 = complete_df_3.drop(columns=["genius_lyrics","unique_lyrics", "unique_lyrics_no_stopwords"])
complete_df_3 = pd.read_pickle("complete_3.pkl")
3.2. Line Charts
Below, I plot the historical trends for several of the previously calculated features. Due to massive variation between tracks (some of which is likely due to incorrect data pulls), I decided to only focus on tracks between the 25th and 75th percentile for each feature. This is accomplished with the remove_outliers function. To generate confidence bands for my line charts, I also calculated grouped standard deviations for tracks within this percentile range.
features_to_plot = ["danceability", "energy", "loudness", "speechiness", "acousticness",
"duration_sec", "words_per_second", "unique_word_proportion",
"valence","mean_sentiment", "std_sentiment","spotify_popularity"]
fig, ax = plt.subplots(6,2, figsize=(20,60))
plt.tight_layout(pad = 3)
def remove_outliers(year_df, feature):
    #Keep only tracks between the 25th and 75th percentile for the given feature
    q1 = year_df[feature].quantile(0.25)
    q3 = year_df[feature].quantile(0.75)
    reduced_df = year_df[(year_df[feature]>=q1) & (year_df[feature]<=q3)]
    return reduced_df
for i, feature in enumerate(features_to_plot):
    col = i%2
    row = i//2
    axis = ax[row][col]
    filtered_df = complete_df_3.groupby("year", as_index=False).apply(remove_outliers, feature)
    by_year_df = filtered_df.groupby(by="year")[[feature]].mean()
    errors = filtered_df.groupby(by="year")[[feature]].std()
    axis.plot(by_year_df.index, by_year_df[feature], linewidth=5)
    axis.set_title(feature, fontsize=20, fontweight="bold")
    axis.tick_params("both", labelsize=20)
    axis.fill_between(by_year_df.index, by_year_df[feature]-errors[feature], by_year_df[feature]+errors[feature],
                      facecolor="red", alpha=.2)
    #Dashed vertical lines mark the era boundaries
    axis.axvline(1998, ls='dashed', lw=1, dashes=(7,12))
    axis.axvline(2007, ls='dashed', lw=1, dashes=(7,12))
    axis.axvline(2015, ls='dashed', lw=1, dashes=(7,12))
    bottom, top = axis.get_ylim()
    axis.set_xlim(1989, 2020)
    axis.annotate("Golden", (1994, bottom+.02*(top-bottom)), ha="center", fontsize=17)
    axis.annotate("Bling", (2002.5, bottom+.02*(top-bottom)), ha="center", fontsize=17)
    axis.annotate("Blog", (2011, bottom+.02*(top-bottom)), ha="center", fontsize=17)
    axis.annotate("Trap/Mumble", (2017.5, bottom+.02*(top-bottom)), ha="center", fontsize=17)

Several interesting historical trends are evident from the above plots.
- Danceability: Steadily declined since the Golden Era, reached a bottom during the Blog Era, and trended upward again
- Energy: Had a double peak during the Bling and Blog Eras
- Loudness: Rap got significantly louder post-Golden Era and has stayed elevated since
- Speechiness: Although the mean has trended downward, taking confidence bands into consideration suggests there isn’t much of a trend
- Acousticness: Although the mean has trended upward, taking confidence bands into consideration suggests there isn’t much of a trend
- Track Duration: Tracks have clearly gotten shorter over time, with rap songs today more than a minute shorter than in the Golden Era. This makes sense with the rise of the streaming industry, where shorter tracks increase revenue for artists/labels.
- Words Per Second: This has trended upward, suggesting that rap today is faster than in any other period.
- Proportion of Unique Words: Interestingly, this is actually lower today (even though words per second is higher). This suggests that tracks today involve more repetition.
- Valence: Has steadily trended downward, suggesting tracks today are less feel-good than before
- Mean Sentiment: Exhibits a similar trend to valence (given it is attempting to measure the same thing), although the trend is less statistically significant
- Standard Deviation of Sentiment: Within tracks, sentiment is changing much more than before (tracks alternate between happier and sadder vibes)
- Spotify Popularity: Not surprisingly, more recent tracks are currently more popular on Spotify, although there is a local maximum in the middle of the Bling Era (perhaps captured by Spotify’s throwback playlists)
4. Conclusion/Further Steps
I was able to find several interesting similarities and differences between rap tracks across time. It would be interesting to see whether more creative engineered features could surface hidden patterns. Visualizing the most common bigrams or trigrams could also be useful.
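As a rough sketch of the bigram idea (not part of the original analysis, and assuming the processed_lyrics column and the stop_list from Section 3.1 are still available), the most common adjacent word pairs could be counted like this:
from collections import Counter
import nltk
#Count the most common bigrams across all processed tracks
bigram_counts = Counter()
for tokens in complete_df_3["processed_lyrics"]:
    filtered = [t for t in tokens if t not in stop_list] #reuse the stop word list from Section 3.1
    bigram_counts.update(nltk.bigrams(filtered))
bigram_counts.most_common(20)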
The data cleaning also needs further attention: upon manual inspection, it is evident that some API searches returned incorrect tracks (such as remixes instead of the originals). Given the large number of tracks, manually cleaning this data would be very time consuming (and may not necessarily improve the results). It may make sense to randomly sample a subset of tracks, clean that data, and repeat the analysis above.