Unbalanced Sentiment Analysis Classes

Hello,

I’m currently working on a sentiment analysis project that uses a movie reviews dataset similar to this one, and I’m having trouble with class imbalance. The dataset contains both positive and negative reviews, but the positive reviews outnumber the negative ones by a large margin. Here’s an example of my code:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Load the movie reviews dataset
    data = pd.read_csv('movie_reviews.csv')

    # Preprocess the data
    # ... (code for data preprocessing)

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)

    # Vectorize the text data using CountVectorizer
    vectorizer = CountVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)

    # Train the Naive Bayes classifier
    model = MultinomialNB()
    model.fit(X_train_vectorized, y_train)

    # Evaluate the model
    accuracy = model.score(X_test_vectorized, y_test)
    print(f"Accuracy: {accuracy}")

The model’s accuracy is approximately 90%, but I suspect it is biased towards the majority class (positive reviews) because of the class imbalance. I want to make sure the model performs well on both positive and negative reviews. How can I handle the class imbalance while also improving the overall effectiveness of my sentiment analysis model?
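
For context, here is a rough sketch of two approaches I was considering: oversampling the minority class in the training split, and using class weights. The column and label names ('review', 'sentiment', 'negative') are assumptions based on my snippet above, and it reuses the fitted vectorizer from there:

    import pandas as pd
    from sklearn.utils import resample
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression

    # Rebuild a training frame so the raw reviews can be resampled
    train_df = pd.DataFrame({'review': X_train, 'sentiment': y_train})
    majority = train_df[train_df['sentiment'] != 'negative']
    minority = train_df[train_df['sentiment'] == 'negative']

    # Option 1: oversample the minority class in the training split only,
    # so duplicated reviews never leak into the test set
    minority_upsampled = resample(minority, replace=True,
                                  n_samples=len(majority), random_state=42)
    balanced = pd.concat([majority, minority_upsampled])

    X_train_bal = vectorizer.transform(balanced['review'])  # reuse the fitted vectorizer
    model_bal = MultinomialNB()
    model_bal.fit(X_train_bal, balanced['sentiment'])

    # Option 2: keep the data as-is and let the classifier reweight the classes
    weighted_model = LogisticRegression(max_iter=1000, class_weight='balanced')
    weighted_model.fit(X_train_vectorized, y_train)

Would either of these be a reasonable direction, or is there a better way?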

Thank you for your assistance!
