Old vs New Drake Lyrics With AI.

Next, I split my dataset into training and testing sets, where 80% is used for training and 20% for testing. I will also use K-fold cross-validation to measure accuracy, as it gives a more reliable estimate for a small dataset such as this one.

# Defining X and y
X = df['lyrics']
y = df['drake']
# Split the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=None)

Now it’s time to test models with the count vectorizer. The count vectorizer counts how often each word occurs in a text and creates an array of those word frequencies.
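To make the idea concrete, here is a minimal sketch of what a count vectorizer produces. The two toy lines are made up for illustration and are not from the real lyrics dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up lines standing in for lyrics (hypothetical data)
toy_lyrics = ["money money love", "love pray"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_lyrics).toarray()

# Recover the vocabulary in column order (columns are sorted alphabetically)
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(vocab)       # ['love', 'money', 'pray']
print(toy_counts)  # row 0: love=1, money=2, pray=0; row 1: love=1, money=0, pray=1
```

Each row is one text and each column is one word, so the first line maps to [1, 2, 0] and the second to [1, 0, 1].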

# Applying CountVectorizer
# Creating the Bag of Words model

# I tried different n-gram ranges and unigram works best
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv.fit(X_train)
trainx_cv = cv.transform(X_train)
testx_cv = cv.transform(X_test)
# I create a new count vectorizer for cross-validation, as here I fit on the whole dataset.
# For the train/test split I fit only on train and just apply transform to test.
# Fitting on the full dataset with the train/test split would cause leakage and give incorrect results.

cv2 = CountVectorizer()

cvs_X = cv2.fit_transform(X)

Next, I create the following models for text classification:

Note: I test models using both a train/test split and cross-validation. Below are the cross-validation accuracies.

I’m not including the code for these to save space. The code for them is available on my GitHub.
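For a rough idea of what such a check looks like, here is a minimal sketch of computing a cross-validation accuracy. It is not the code from the repository: the toy lyrics, the labels, the Naive Bayes model, and the choice of 5 folds are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-ins for df['lyrics'] and df['drake'] (hypothetical data)
toy_X = [
    "money car crew love", "crew faded money", "love car money crew",
    "faded love car", "money crew car faded",
    "pray god six feel", "six lonely pray", "god feel six pray",
    "lonely god feel", "pray six feel lonely",
]
toy_y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Vectorize the full toy dataset once, then score the model with 5-fold CV
toy_features = CountVectorizer().fit_transform(toy_X)
scores = cross_val_score(MultinomialNB(), toy_features, toy_y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

An alternative design is to wrap the vectorizer and the classifier in a Pipeline and pass the raw text to cross_val_score, so the vectorizer is refit on the training part of every fold.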

Next, I repeat the same steps, but this time I use a TF-IDF vectorizer instead of the count vectorizer. TF-IDF starts from the same word counts, but each value increases in proportion to how often the word appears in a given text and decreases the more texts the word appears in across the whole dataset.
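A small sketch of that weighting, using three made-up lines (hypothetical data, not the real lyrics). The word "yeah" appears in every line, so it should receive a lower weight than a word that appears in only one line.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up lines: "yeah" is in all of them, the other words are each in one
toy_lyrics = ["yeah money", "yeah pray", "yeah love"]

toy_tf = TfidfVectorizer()
toy_weights = toy_tf.fit_transform(toy_lyrics).toarray()
col = toy_tf.vocabulary_  # maps each word to its column index

# Both words occur once in the first line, but "yeah" is common across lines
yeah_weight = toy_weights[0, col["yeah"]]
money_weight = toy_weights[0, col["money"]]

print("yeah:", round(yeah_weight, 3), "money:", round(money_weight, 3))
print("common word weighted lower:", yeah_weight < money_weight)
```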

# Let's try with TF-IDF
# Steps are repeated as before
# I tried different n-gram ranges and unigram works best
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
tf.fit(X_train)
trainx_tf = tf.transform(X_train).toarray()
testx_tf = tf.transform(X_test).toarray()
# For cross-validation
# Same logic as before. For cross-validation I use the full dataset to fit TF-IDF, while for the train/test split I use only train to fit TF-IDF
tf2 = TfidfVectorizer()
cvs_X_tf = tf2.fit_transform(X)

The accuracies with TF-IDF vectorizer are:

Alright, so now we know that Naive Bayes performs best with the count vectorizer, and Linear SVC performs best with the TF-IDF vectorizer. Now let us try to further optimize the model parameters:

# Multinomial NB tuning
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('classifier', MultinomialNB()),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__min_df': (1, 2, 3),
    'vect__max_features': (None, 5000, 10000, 15000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'classifier__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
}
Best_NB = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
Best_NB.fit(X_train, y_train)
best_parameters = Best_NB.best_estimator_.get_params()

Comparing accuracy before and after tuning on the same testing dataset.

Keep in mind that the accuracies in earlier sections were cross-validation accuracies. Here I’m comparing train/test split accuracies on the same testing dataset.
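A minimal sketch of such a before/after comparison is shown here. The toy data, the reduced parameter grid, and the helper name build_nb_pipeline are assumptions for illustration; the point is only that both models are scored on the same held-out test set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up stand-ins for the lyrics and labels (hypothetical data)
toy_X = ["money car crew love", "crew faded money", "love car money crew",
         "faded love car", "money crew car faded"] * 2 + \
        ["pray god six feel", "six lonely pray", "god feel six pray",
         "lonely god feel", "pray six feel lonely"] * 2
toy_y = [1] * 10 + [0] * 10

Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

def build_nb_pipeline():
    # Hypothetical helper: a fresh vectorizer + Naive Bayes pipeline
    return Pipeline([("vect", CountVectorizer()), ("classifier", MultinomialNB())])

# Untuned model with default parameters
baseline = build_nb_pipeline().fit(Xtr, ytr)

# Tuned model: small grid over the smoothing parameter only
search = GridSearchCV(build_nb_pipeline(), {"classifier__alpha": (1, 0.1, 0.01)}, cv=3)
search.fit(Xtr, ytr)

# Score both on the same test set
acc_before = accuracy_score(yte, baseline.predict(Xte))
acc_after = accuracy_score(yte, search.best_estimator_.predict(Xte))
print("before tuning:", acc_before, "after tuning:", acc_after)
```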

# LinearSVC tuning
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    # dual=False is set here so that both 'l1' and 'l2' penalties are valid
    # with the default squared hinge loss ('l1' is not supported when dual=True)
    ('clf', LinearSVC(dual=False)),
])
parameters = {
    'tfidf__max_df': (0.90, 1.0),
    'tfidf__min_df': (1, 2, 3),
    'tfidf__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'clf__C': (0.1, 1, 10, 100, 1000),
    'clf__penalty': ('l1', 'l2'),
}
Best_SVC = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
Best_SVC.fit(X_train, y_train)
best_parameters = Best_SVC.best_estimator_.get_params()

Overall, the models trained on the balanced dataset perform slightly better. Later, when I tested against some additional new data, I noticed that models trained on the original dataset were slightly overfitted on old songs: they predicted old songs perfectly, but their new song accuracy was not as good as that of models trained on the balanced dataset. The models trained on the balanced dataset had a slight decrease in old song prediction accuracy, likely because some important old song features were left out when 30% of the dataset was removed. Therefore, in the future when Drake drops another album, I’ll retest with more data.
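One common way to balance a dataset like this is to undersample the larger class. The sketch below shows that approach with pandas; it is an assumption about the technique, not the exact procedure used here, and the toy DataFrame only mimics the lyrics and drake columns.

```python
import pandas as pd

# Made-up imbalanced data: 7 rows of one class, 3 of the other (hypothetical)
toy_df = pd.DataFrame({
    "lyrics": [f"old song {i}" for i in range(7)] + [f"new song {i}" for i in range(3)],
    "drake": [1] * 7 + [0] * 3,
})

# Size of the smaller class
minority_size = int(toy_df["drake"].value_counts().min())

# Randomly keep only minority_size rows from each class
balanced = toy_df.groupby("drake").sample(n=minority_size, random_state=0)

print(balanced["drake"].value_counts())
```

The trade-off is the one described above: the classes end up the same size, but some rows of the larger class are thrown away.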

So now we have models with over 80% accuracy. This shows that there are features that help to successfully distinguish Drake songs as new or old Drake. Now let’s explore the features and understand what exactly makes new Drake different from old Drake.

For this, I will be looking at the model coefficient values. Also, I’m only analyzing the top models.

# Thanks to tobique for the method. Link: https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers
# This method looks at the coefficient values and orders them from most negative to most positive
# The most positive values link to words defining an old Drake song
# The most negative values link to words defining a new Drake song

def show_most_informative_features(vectorizer, clf, n=20):
    # get_feature_names() was removed in scikit-learn 1.2; use it instead on older versions
    feature_names = vectorizer.get_feature_names_out()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
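To illustrate the idea of ranking words by coefficient, here is a self-contained sketch on toy data. The lyrics, the word sets, and the label encoding (1 = old, 0 = new) are assumptions for illustration only; with a different encoding the signs would flip.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up vocabulary for each era (hypothetical data)
old_words = {"money", "car", "crew", "love", "faded"}
new_words = {"pray", "god", "six", "feel", "lonely"}

toy_X = ["money car crew love", "crew faded money", "love car money crew",
         "faded love car", "money crew car faded",
         "pray god six feel", "six lonely pray", "god feel six pray",
         "lonely god feel", "pray six feel lonely"]
toy_y = [1] * 5 + [0] * 5  # assumed encoding: 1 = old, 0 = new

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(toy_X), toy_y)

# Vocabulary in column order, then column indices sorted by coefficient
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
order = np.argsort(clf.coef_[0])

most_negative = [vocab[i] for i in order[:3]]        # pushes toward label 0
most_positive = [vocab[i] for i in order[::-1][:3]]  # pushes toward label 1

print("most negative:", most_negative)
print("most positive:", most_positive)
```

In a binary scikit-learn linear model, positive coefficients push a prediction toward the second class in clf.classes_, so which sign means "old" depends on how the labels were encoded.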

As mentioned in the code, all credit goes to tobique for this method. The method looks at the model coefficients and orders them from most negative to most positive. It also maps the coefficients to the feature names, which makes them easy for us to understand. In our case, the most negative coefficients relate to old Drake song lyrics, and the most positive coefficients relate to new Drake song lyrics. Here are the top 40 features from two models:

Old Drake Key Words: uh, bout, huh, money, dollar, car, ho, hoes, bitch, girl, love, she, her, you, loved, missing, low, brand, faded, dream, ball, crew.

New Drake Key Words: yeah, ayy, woah, still, nah, momma, preach, pray, God, working, shift, dedicate, wifey, feel, lonely, babe, baby, six, side, crib.

The first change to note is that old Drake often uses the words uh, bout, huh, ho, whereas new Drake uses ayy, woah, nah instead.

Old Drake talks more about love/relationships, girls, materialistic things such as money/cars/brands, his crew (friends), and being faded (high).

New Drake has moved past talking about girls, relationships and materialistic things as much, and is now singing/rapping about a variety of different things including God/prayer, his mother, Toronto (a.k.a. The Six), working, and his feelings.

Now, let’s test the best models with some more songs. I gathered some more song lyrics from his singles and his recent mixtape Dark Lane Demo Tapes. The data I gathered includes 14 songs, 7 new and 7 old. Here you can see exactly which songs the models predict correctly.

Wrong predictions are in red.

On this dataset, both models, Naive Bayes and Linear SVC, got 12/14 predictions correct!

Linear SVC thinks of Desires as an old Drake song, while the Naive Bayes model thinks of Trust Issues as a new Drake song. Do you think that Desires resembles old Drake and Trust Issues sounds like new Drake?

Another interesting point to note is that both models predicted Chicago Freestyle as an old song when it is a new Drake song. Do you think that Chicago Freestyle sounds like old Drake?
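A check like this can be sketched as follows: fit a model, predict on held-out songs, and list the titles the model gets wrong. The training lines, the song titles, and the label encoding (1 = old, 0 = new) are placeholders for illustration, not the real 14-song dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data (hypothetical); assumed encoding: 1 = old, 0 = new
train_X = ["money car crew love", "crew faded money", "love car money crew",
           "pray god six feel", "six lonely pray", "god feel six pray"]
train_y = [1, 1, 1, 0, 0, 0]

model = Pipeline([("vect", CountVectorizer()), ("classifier", MultinomialNB())])
model.fit(train_X, train_y)

# Hypothetical held-out songs: (title, lyrics, true label)
new_songs = [
    ("Song A", "money crew car", 1),
    ("Song B", "pray six god", 0),
    ("Song C", "love lonely feel", 0),
]
titles, lyrics, truth = zip(*new_songs)

predicted = model.predict(list(lyrics))

# Collect the titles where the prediction disagrees with the true label
wrong = [t for t, p, y in zip(titles, predicted, truth) if p != y]
correct_count = len(titles) - len(wrong)

print(f"{correct_count}/{len(titles)} correct; wrong predictions: {wrong}")
```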

The project has been successful in identifying changes between Drake’s old song style and new song style.

Let me know if there are any songs you would like me…
