
Unbalanced Sentiment Analysis Classes


I’m currently working on a sentiment analysis project that uses a movie reviews dataset similar to this one, and I’m having trouble with class imbalance. The dataset contains both positive and negative reviews, but the positive reviews far outnumber the negative ones. Here’s an example of my code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the movie reviews dataset
data = pd.read_csv('movie_reviews.csv')

# Preprocess the data
# ... (code for data preprocessing)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train the Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)

# Evaluate the model
accuracy = model.score(X_test_vectorized, y_test)
print(f"Accuracy: {accuracy}")

The model reaches roughly 90% accuracy, but I suspect it is biased towards the majority class (positive reviews) because of the class imbalance. I want the model to perform well on both positive and negative reviews. How can I handle the class imbalance while also improving the overall effectiveness of my sentiment analysis model?
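
To show why I don’t trust the accuracy figure on its own, here is a minimal sketch using made-up labels. The 90/10 split and the "always predict positive" model are assumptions for illustration only, not my real data or my real classifier. It compares plain accuracy with per-class recall and with balanced accuracy (the mean of the per-class recalls):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall per label: correct predictions for a label / true count of that label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical 90/10 class split (illustrative, not my actual dataset)
y_true = ["positive"] * 90 + ["negative"] * 10
# A degenerate model that always predicts the majority class
y_pred = ["positive"] * 100

acc = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
balanced_acc = sum(recalls.values()) / len(recalls)

print(f"Accuracy: {acc:.2f}")                    # 0.90
print(f"Recall per class: {recalls}")            # positive 1.0, negative 0.0
print(f"Balanced accuracy: {balanced_acc:.2f}")  # 0.50
```

With these made-up labels, accuracy is 0.90 even though not a single negative review is detected. I assume that on my real predictions, classification_report and confusion_matrix from sklearn.metrics would give the same kind of per-class breakdown.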

Thank you for your assistance!

