Movie_Genre_Classification¶

Create a machine learning model that can predict the genre of a movie based on its plot summary or other textual information. You can use techniques like TF-IDF or word embeddings with classifiers such as Naive Bayes, Logistic Regression, or Support Vector Machines.

Import necessary files¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import string
import re
%matplotlib inline

from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

Load Train Dataset¶

In [2]:
train_path = ('Downloads/archive/Genre Classification Dataset/train_data.txt')
train_data = pd.read_csv(train_path, sep = ':::', names = ['Title', 'Genre', 'Description'])
train_data.head()
C:\Users\kiran\AppData\Local\Temp\ipykernel_12232\595940393.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  train_data = pd.read_csv(train_path, sep = ':::', names = ['Title', 'Genre', 'Description'])
Out[2]:
Title Genre Description
1 Oscar et la dame rose (2009) drama Listening in to a conversation between his do...
2 Cupid (1997) thriller A brother and sister with a past incestuous r...
3 Young, Wild and Wonderful (1980) adult As the bus empties the students for their fie...
4 The Secret Sin (1915) drama To help their unemployed father make ends mee...
5 The Unrecovered (2007) drama The film's title refers not only to the un-re...

Load Test Dataset¶

In [3]:
test_path = ('Downloads/archive/Genre Classification Dataset/test_data.txt')
test_data = pd.read_csv(test_path, sep = ':::', names = ['id', 'Title', 'Description'])
test_data.head()
C:\Users\kiran\AppData\Local\Temp\ipykernel_12232\1819879311.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  test_data = pd.read_csv(test_path, sep = ':::', names = ['id', 'Title', 'Description'])
Out[3]:
id Title Description
0 1 Edgar's Lunch (1998) L.R. Brane loves his life - his car, his apar...
1 2 La guerra de papá (1977) Spain, March 1964: Quico is a very naughty ch...
2 3 Off the Beaten Track (2010) One year in the life of Albin and his family ...
3 4 Meu Amigo Hindu (2015) His father has died, he hasn't spoken with hi...
4 5 Er nu zhai (1955) Before he was known internationally as a mart...

Load Target Dataset¶

In [4]:
test_soln_path = ('Downloads/archive/Genre Classification Dataset/test_data_solution.txt')
test_soln_data = pd.read_csv(test_soln_path, sep = ':::', names = ['Title', 'Genre', 'Description'])
test_soln_data.drop(test_soln_data.columns[[0,2]], axis = 1, inplace = True)
test_soln_data.rename(columns = {'Genre':'Target_Genre'}, inplace = True)
test_soln_data.head()
C:\Users\kiran\AppData\Local\Temp\ipykernel_12232\320415847.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  test_soln_data = pd.read_csv(test_soln_path, sep = ':::', names = ['Title', 'Genre', 'Description'])
Out[4]:
Target_Genre
1 thriller
2 comedy
3 documentary
4 drama
5 drama
In [5]:
train_data.describe()
Out[5]:
Title Genre Description
count 54214 54214 54214
unique 54214 27 54086
top Oscar et la dame rose (2009) drama Grammy - music award of the American academy ...
freq 1 13613 12
In [6]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 54214 entries, 1 to 54214
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Title        54214 non-null  object
 1   Genre        54214 non-null  object
 2   Description  54214 non-null  object
dtypes: object(3)
memory usage: 1.7+ MB
In [7]:
test_data.describe()
Out[7]:
id
count 54200.000000
mean 27100.500000
std 15646.336632
min 1.000000
25% 13550.750000
50% 27100.500000
75% 40650.250000
max 54200.000000
In [8]:
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54200 entries, 0 to 54199
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           54200 non-null  int64 
 1   Title        54200 non-null  object
 2   Description  54200 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.2+ MB
In [9]:
train_data.isnull().sum()
Out[9]:
Title          0
Genre          0
Description    0
dtype: int64
In [10]:
counts = train_data.Genre.value_counts()
counts
Out[10]:
Genre
 drama           13613
 documentary     13096
 comedy           7447
 short            5073
 horror           2204
 thriller         1591
 action           1315
 western          1032
 reality-tv        884
 family            784
 adventure         775
 music             731
 romance           672
 sci-fi            647
 adult             590
 crime             505
 animation         498
 sport             432
 talk-show         391
 fantasy           323
 mystery           319
 musical           277
 biography         265
 history           243
 game-show         194
 news              181
 war               132
Name: count, dtype: int64

Ploting the counts of Genres in the training dataset¶

In [11]:
plt.figure(figsize = (10,8))
sns.countplot(data=train_data, y='Genre', order=counts.index)
Out[11]:
<Axes: xlabel='count', ylabel='Genre'>
No description has been provided for this image

Ploting the distribution of Genres using a bar plot¶

In [12]:
plt.figure(figsize = (10,8))
sns.barplot(x=counts.index, y=counts)
plt.xticks(rotation=90)
Out[12]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26]),
 [Text(0, 0, ' drama '),
  Text(1, 0, ' documentary '),
  Text(2, 0, ' comedy '),
  Text(3, 0, ' short '),
  Text(4, 0, ' horror '),
  Text(5, 0, ' thriller '),
  Text(6, 0, ' action '),
  Text(7, 0, ' western '),
  Text(8, 0, ' reality-tv '),
  Text(9, 0, ' family '),
  Text(10, 0, ' adventure '),
  Text(11, 0, ' music '),
  Text(12, 0, ' romance '),
  Text(13, 0, ' sci-fi '),
  Text(14, 0, ' adult '),
  Text(15, 0, ' crime '),
  Text(16, 0, ' animation '),
  Text(17, 0, ' sport '),
  Text(18, 0, ' talk-show '),
  Text(19, 0, ' fantasy '),
  Text(20, 0, ' mystery '),
  Text(21, 0, ' musical '),
  Text(22, 0, ' biography '),
  Text(23, 0, ' history '),
  Text(24, 0, ' game-show '),
  Text(25, 0, ' news '),
  Text(26, 0, ' war ')])
No description has been provided for this image
In [13]:
stemmer = LancasterStemmer()
stop_words = set(stopwords.words('english'))

def corpus(text):
    text = text.lower()  # Lowercase all characters
    text = re.sub(r'@\S+', '', text)  # Remove Twitter handles
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'pic.\S+', '', text)
    text = re.sub(r"[^a-zA-Z+']", ' ', text)  # Keep only characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text + ' ')  # Keep words with length > 1 only
    text = "".join([i for i in text if i not in string.punctuation])
    words = nltk.word_tokenize(text)
    stopwords = nltk.corpus.stopwords.words('english')  # Remove stopwords
    text = " ".join([i for i in words if i not in stopwords and len(i) > 2])
    text = re.sub("\s[\s]+", " ", text).strip()  # Remove repeated/leading/trailing spaces
    return text

train_data['Corpus_cleaning'] = train_data['Description'].apply(corpus)
test_data['Corpus_cleaning'] = test_data['Description'].apply(corpus)
In [14]:
train_data
Out[14]:
Title Genre Description Corpus_cleaning
1 Oscar et la dame rose (2009) drama Listening in to a conversation between his do... listening conversation doctor parents year old...
2 Cupid (1997) thriller A brother and sister with a past incestuous r... brother sister past incestuous relationship cu...
3 Young, Wild and Wonderful (1980) adult As the bus empties the students for their fie... bus empties students field trip museum natural...
4 The Secret Sin (1915) drama To help their unemployed father make ends mee... help unemployed father make ends meet edith tw...
5 The Unrecovered (2007) drama The film's title refers not only to the un-re... films title refers recovered bodies ground zer...
... ... ... ... ...
54210 "Bonino" (1953) comedy This short-lived NBC live sitcom centered on ... short lived nbc live sitcom centered bonino wo...
54211 Dead Girls Don't Cry (????) horror The NEXT Generation of EXPLOITATION. The sist... next generation exploitation sisters kapa bay ...
54212 Ronald Goedemondt: Ze bestaan echt (2008) documentary Ze bestaan echt, is a stand-up comedy about g... bestaan echt stand comedy growing facing fears...
54213 Make Your Own Bed (1944) comedy Walter and Vivian live in the country and hav... walter vivian live country difficult time keep...
54214 Nature's Fury: Storm of the Century (2006) history On Labor Day Weekend, 1935, the most intense ... labor day weekend intense hurricane ever make ...

54214 rows × 4 columns

In [15]:
test_data
Out[15]:
id Title Description Corpus_cleaning
0 1 Edgar's Lunch (1998) L.R. Brane loves his life - his car, his apar... brane loves life car apartment job especially ...
1 2 La guerra de papá (1977) Spain, March 1964: Quico is a very naughty ch... spain march quico naughty child three belongin...
2 3 Off the Beaten Track (2010) One year in the life of Albin and his family ... one year life albin family shepherds north tra...
3 4 Meu Amigo Hindu (2015) His father has died, he hasn't spoken with hi... father died hasnt spoken brother years serious...
4 5 Er nu zhai (1955) Before he was known internationally as a mart... known internationally martial arts superstar b...
... ... ... ... ...
54195 54196 "Tales of Light & Dark" (2013) Covering multiple genres, Tales of Light & Da... covering multiple genres tales light dark anth...
54196 54197 Der letzte Mohikaner (1965) As Alice and Cora Munro attempt to find their... alice cora munro attempt find father british o...
54197 54198 Oliver Twink (2007) A movie 169 years in the making. Oliver Twist... movie years making oliver twist artful dodger ...
54198 54199 Slipstream (1973) Popular, but mysterious rock D.J Mike Mallard... popular mysterious rock mike mallard askew bro...
54199 54200 Curitiba Zero Grau (2010) Curitiba is a city in movement, with rhythms ... curitiba city movement rhythms different pulsa...

54200 rows × 4 columns

In [16]:
print("shape before drop nulls",train_data.shape)
train_data = train_data.drop_duplicates()
print("shape after drop nulls",train_data.shape)
shape before drop nulls (54214, 4)
shape after drop nulls (54214, 4)
In [17]:
import warnings
warnings.filterwarnings("ignore", "use_inf_as_na")
train_data['length_Corpus_cleaning'] = train_data['Corpus_cleaning'].apply(len)

Visualizing the text length¶

In [18]:
sns.histplot(data = train_data, x = train_data['length_Corpus_cleaning'], bins = 20, kde = True)
plt.xlabel('Length')
plt.ylabel('Frequency')
Out[18]:
Text(0, 0.5, 'Frequency')
No description has been provided for this image
In [19]:
plt.figure(figsize=(12, 6))
# Subplot 1: Original text length distribution
plt.subplot(1, 2, 1)
original_lengths = train_data['Description'].apply(len)
plt.hist(original_lengths, bins=range(0, max(original_lengths) + 100, 100), color = 'blue')
plt.title('Original Text Length')
plt.xlabel('Text Length')
plt.ylabel('Frequency')

# Subplot 2: Cleaned text length distribution
plt.subplot(1, 2, 2)
cleaned_lengths = train_data['Corpus_cleaning'].apply(len)
plt.hist(cleaned_lengths, bins=range(0, max(cleaned_lengths) + 100, 100), color = 'green')
plt.title('Cleaned Text Length')
plt.xlabel('Text Length')
plt.ylabel('Frequency')
Out[19]:
Text(0, 0.5, 'Frequency')
No description has been provided for this image

TF-IDF Text vectorization¶

In [20]:
%%time
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_data['Corpus_cleaning'])
X_test = tfidf.transform(test_data['Corpus_cleaning'])
CPU times: total: 6.72 s
Wall time: 6.72 s
In [21]:
X_train
Out[21]:
<54214x124210 sparse matrix of type '<class 'numpy.float64'>'
	with 2640592 stored elements in Compressed Sparse Row format>
In [22]:
X_test
Out[22]:
<54200x124210 sparse matrix of type '<class 'numpy.float64'>'
	with 2578617 stored elements in Compressed Sparse Row format>
In [23]:
train_data['Corpus_cleaning']
Out[23]:
1        listening conversation doctor parents year old...
2        brother sister past incestuous relationship cu...
3        bus empties students field trip museum natural...
4        help unemployed father make ends meet edith tw...
5        films title refers recovered bodies ground zer...
                               ...                        
54210    short lived nbc live sitcom centered bonino wo...
54211    next generation exploitation sisters kapa bay ...
54212    bestaan echt stand comedy growing facing fears...
54213    walter vivian live country difficult time keep...
54214    labor day weekend intense hurricane ever make ...
Name: Corpus_cleaning, Length: 54214, dtype: object
In [24]:
X_train.shape
Out[24]:
(54214, 124210)
In [25]:
X_test.shape
Out[25]:
(54200, 124210)
In [26]:
X = X_train
y = train_data['Genre']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
In [27]:
X_train.shape
Out[27]:
(43371, 124210)
In [28]:
y_train.shape
Out[28]:
(43371,)
In [29]:
X_test.shape
Out[29]:
(10843, 124210)
In [30]:
y_test.shape
Out[30]:
(10843,)

Multinomial Naive Bayes¶

In [31]:
%%time
import warnings
warnings.filterwarnings("ignore")

model_nb = MultinomialNB()
model_nb.fit(X_train, y_train)
CPU times: total: 578 ms
Wall time: 598 ms
Out[31]:
MultinomialNB()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
MultinomialNB()
In [32]:
y_nb_pred = model_nb.predict(X_test)
In [33]:
y_nb_pred.shape
Out[33]:
(10843,)
In [34]:
from sklearn.metrics import accuracy_score, classification_report
In [35]:
accuracy_score(y_nb_pred, y_test)
Out[35]:
0.44526422576777647
In [36]:
print(classification_report(y_nb_pred, y_test))
               precision    recall  f1-score   support

      action        0.00      0.00      0.00         0
       adult        0.00      0.00      0.00         0
   adventure        0.00      0.00      0.00         0
   animation        0.00      0.00      0.00         0
   biography        0.00      0.00      0.00         0
      comedy        0.04      0.61      0.07        93
       crime        0.00      0.00      0.00         0
 documentary        0.90      0.54      0.67      4462
       drama        0.88      0.38      0.53      6284
      family        0.00      0.00      0.00         0
     fantasy        0.00      0.00      0.00         0
   game-show        0.00      0.00      0.00         0
     history        0.00      0.00      0.00         0
      horror        0.00      0.00      0.00         0
       music        0.00      0.00      0.00         0
     musical        0.00      0.00      0.00         0
     mystery        0.00      0.00      0.00         0
        news        0.00      0.00      0.00         0
  reality-tv        0.00      0.00      0.00         0
     romance        0.00      0.00      0.00         0
      sci-fi        0.00      0.00      0.00         0
       short        0.00      0.50      0.00         4
       sport        0.00      0.00      0.00         0
   talk-show        0.00      0.00      0.00         0
    thriller        0.00      0.00      0.00         0
         war        0.00      0.00      0.00         0
     western        0.00      0.00      0.00         0

     accuracy                           0.45     10843
    macro avg       0.07      0.08      0.05     10843
 weighted avg       0.88      0.45      0.58     10843

Logistic Regression¶

In [37]:
%%time
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
CPU times: total: 4min 10s
Wall time: 1min 30s
Out[37]:
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
In [38]:
y_lr_pred = model_lr.predict(X_test)
In [39]:
y_lr_pred
Out[39]:
array([' comedy ', ' drama ', ' documentary ', ..., ' drama ', ' short ',
       ' horror '], dtype=object)
In [40]:
accuracy_score(y_lr_pred, y_test)
Out[40]:
0.5808355621138062
In [41]:
print(classification_report(y_lr_pred, y_test))
               precision    recall  f1-score   support

      action        0.22      0.56      0.32       103
       adult        0.21      0.82      0.33        28
   adventure        0.11      0.50      0.18        30
   animation        0.02      0.67      0.04         3
   biography        0.00      0.00      0.00         0
      comedy        0.59      0.53      0.56      1602
       crime        0.01      0.50      0.02         2
 documentary        0.86      0.65      0.74      3520
       drama        0.81      0.53      0.64      4108
      family        0.05      0.50      0.09        14
     fantasy        0.00      0.00      0.00         0
   game-show        0.35      0.93      0.51        15
     history        0.00      0.00      0.00         0
      horror        0.55      0.68      0.61       348
       music        0.38      0.69      0.49        78
     musical        0.00      0.00      0.00         0
     mystery        0.00      0.00      0.00         0
        news        0.00      0.00      0.00         0
  reality-tv        0.13      0.45      0.20        56
     romance        0.00      0.00      0.00         4
      sci-fi        0.15      0.51      0.24        43
       short        0.31      0.51      0.39       636
       sport        0.17      0.70      0.28        23
   talk-show        0.10      0.57      0.17        14
    thriller        0.12      0.49      0.20        78
         war        0.00      0.00      0.00         0
     western        0.67      0.96      0.79       138

     accuracy                           0.58     10843
    macro avg       0.21      0.44      0.25     10843
 weighted avg       0.73      0.58      0.63     10843

Support Vector CLasssifier¶

In [42]:
%%time
model_svc = SVC()
model_svc.fit(X_train, y_train)
CPU times: total: 1h 3min 31s
Wall time: 1h 3min 39s
Out[42]:
SVC()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC()
In [43]:
y_svc_pred = model_svc.predict(X_test)
In [44]:
y_svc_pred
Out[44]:
array([' drama ', ' drama ', ' comedy ', ..., ' drama ', ' drama ',
       ' horror '], dtype=object)
In [45]:
accuracy_score(y_svc_pred, y_test)
Out[45]:
0.5691229364567002
In [46]:
print(classification_report(y_svc_pred, y_test))
               precision    recall  f1-score   support

      action        0.13      0.64      0.22        55
       adult        0.14      0.84      0.24        19
   adventure        0.09      0.52      0.15        23
   animation        0.01      1.00      0.02         1
   biography        0.00      0.00      0.00         0
      comedy        0.54      0.53      0.54      1472
       crime        0.00      0.00      0.00         0
 documentary        0.88      0.64      0.74      3692
       drama        0.84      0.50      0.62      4578
      family        0.05      0.73      0.10        11
     fantasy        0.00      0.00      0.00         0
   game-show        0.33      1.00      0.49        13
     history        0.00      0.00      0.00         0
      horror        0.52      0.72      0.60       310
       music        0.25      0.77      0.38        47
     musical        0.00      0.00      0.00         0
     mystery        0.00      0.00      0.00         0
        news        0.00      0.00      0.00         0
  reality-tv        0.06      0.67      0.11        18
     romance        0.00      0.00      0.00         0
      sci-fi        0.11      0.67      0.19        24
       short        0.22      0.60      0.33       389
       sport        0.10      0.90      0.17        10
   talk-show        0.04      0.75      0.07         4
    thriller        0.06      0.50      0.11        40
         war        0.05      1.00      0.10         1
     western        0.66      0.96      0.78       136

     accuracy                           0.57     10843
    macro avg       0.19      0.52      0.22     10843
 weighted avg       0.76      0.57      0.63     10843

Comparision¶

In [47]:
acs_nb = accuracy_score(y_nb_pred, y_test)
acs_lr = accuracy_score(y_lr_pred, y_test)
acs_svc = accuracy_score(y_svc_pred, y_test)
In [48]:
total = acs_nb + acs_lr + acs_svc
# total = acs_nb + acs_lr
pie_part_1 = acs_nb/total
pie_part_2 = acs_lr/total
pie_part_3 = acs_svc/total
In [49]:
labels = ['Multinomial Naive Bayes', 'Logistic Regression', 'Support Vector Classifier']
sizes = [pie_part_1, pie_part_2, pie_part_3]
# labels = ['Multinomial Naive Bayes', 'Logistic Regression']
# sizes = [pie_part_1, pie_part_2]
In [50]:
plt.figure(figsize=(10,5))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.legend(labels, loc=2)
Out[50]:
<matplotlib.legend.Legend at 0x22092805450>
No description has been provided for this image

Adding the predicted values to a new dataframe with the target genre¶

In [51]:
test_data = test_data[:10843]
test_data['Predicted_Genre_nb'] = y_nb_pred
In [52]:
test_data = test_data[:10843]
test_data['Predicted_Genre_lr'] = y_lr_pred
In [54]:
test_data = test_data[:10843]
test_data['Predicted_Genre_svm'] = y_svc_pred
In [59]:
test_data.to_csv('predicted_genres.csv', index=False)

# Add actual genre column to predicted dataFrame
extracted_col = test_soln_data["Target_Genre"]
test_data.insert(7, "Target_Genre", extracted_col)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_12232\3190150681.py in ?()
      1 test_data.to_csv('predicted_genres.csv', index=False)
      2 
      3 # Add actual genre column to predicted dataFrame
      4 extracted_col = test_soln_data["Target_Genre"]
----> 5 test_data.insert(7, "Target_Genre", extracted_col)

~\anaconda3\Lib\site-packages\pandas\core\frame.py in ?(self, loc, column, value, allow_duplicates)
   4927                 "'self.flags.allows_duplicate_labels' is False."
   4928             )
   4929         if not allow_duplicates and column in self.columns:
   4930             # Should this be a different kind of error??
-> 4931             raise ValueError(f"cannot insert {column}, already exists")
   4932         if not is_integer(loc):
   4933             raise TypeError("loc must be int")
   4934         # convert non stdlib ints to satisfy typing checks

ValueError: cannot insert Target_Genre, already exists
In [60]:
test_data.head()
Out[60]:
id Title Description Corpus_cleaning Predicted_Genre_nb Target_Genre Predicted_Genre_lr Predicted_Genre_svm
0 1 Edgar's Lunch (1998) L.R. Brane loves his life - his car, his apar... brane loves life car apartment job especially ... drama NaN comedy drama
1 2 La guerra de papá (1977) Spain, March 1964: Quico is a very naughty ch... spain march quico naughty child three belongin... drama thriller drama drama
2 3 Off the Beaten Track (2010) One year in the life of Albin and his family ... one year life albin family shepherds north tra... drama comedy documentary comedy
3 4 Meu Amigo Hindu (2015) His father has died, he hasn't spoken with hi... father died hasnt spoken brother years serious... documentary documentary horror horror
4 5 Er nu zhai (1955) Before he was known internationally as a mart... known internationally martial arts superstar b... documentary drama music music
In [62]:
correctly_predicted_values_nb = (test_data['Predicted_Genre_nb'] == test_data['Target_Genre']).sum()
correctly_predicted_values_lr = (test_data['Predicted_Genre_lr'] == test_data['Target_Genre']).sum()
correctly_predicted_values_svm = (test_data['Predicted_Genre_svm'] == test_data['Target_Genre']).sum()

print("Number of samples Correctly predicted by Multinomial Naive Bayes Classifier:", correctly_predicted_values_nb)
print("Number of samples Correctly predicted by Logistic Regression:", correctly_predicted_values_lr)
print("Number of samples Correctly predicted by Support Vector Classifier:", correctly_predicted_values_svm)
Number of samples Correctly predicted by Multinomial Naive Bayes Classifier: 2718
Number of samples Correctly predicted by Logistic Regression: 2143
Number of samples Correctly predicted by Support Vector Classifier: 2298

Creating the pkl file to predict the user input¶

In [63]:
import pickle
with open('tfidf.pkl', 'wb') as file:
    pickle.dump(tfidf, file)
with open('model_lr.pkl', 'wb') as file:
    pickle.dump(model_lr, file)

print("Models pickled successfully.")
Models pickled successfully.

Sample text data¶

In [64]:
# title = "Edgar's Lunch (1998)"
# discription = "L.R. Brane loves his life - his car, his apartment, his job, but especially his girlfriend, Vespa. One day while showering, Vespa runs out of shampoo. L.R. runs across the street to a convenience store to buy some more, a quick trip of no more than a few minutes. When he returns, Vespa is gone and every trace of her existence has been wiped out. L.R.'s life becomes a tortured existence as one strange event after another occurs to confirm in his mind that a conspiracy is working against his finding Vespa."
In [65]:
title = input("Enter movie Title")
discription = input("Enter movie Discription")
In [66]:
new_data = [title, discription]
In [67]:
new_data_transformed = tfidf.transform(new_data)
In [68]:
predictions = model_nb.predict(new_data_transformed)
In [69]:
for text, prediction in zip(new_data, predictions):
    print(f"Text: Predicted Genre: {prediction}")
Text: Predicted Genre:  drama 
Text: Predicted Genre:  drama 
In [ ]: