{ "cells": [ { "cell_type": "markdown", "source": [ "# Setup\n", "Import libraries and mount your drive" ], "metadata": { "id": "RQAnOwCpK6L2" } }, { "cell_type": "markdown", "source": [ "Link to the Dataset:\n", "[Click Here](//drive.google.com/drive/folders/1yCejyr9fWoZlmP0aTgEbhkv9CvA1ltWr?usp=sharing)" ], "metadata": { "id": "I98aXUxZat7v" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tq8JNMW9k4Yc" }, "outputs": [], "source": [ "import string\n", "import sklearn\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np\n", "import math\n", "import os\n", "from matplotlib import pyplot as plt\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.model_selection import GridSearchCV\n", "import re\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "import numpy as np\n", "from sklearn.metrics import classification_report" ] }, { "cell_type": "code", "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VnDLLLcFk-7t", "outputId": "e71a4c48-89f4-4a8b-81d4-6862a8359038" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Mounted at /content/drive\n" ] } ] }, { "cell_type": "markdown", "source": [ "# Read Dataset\n", "Replace the paths with your paths to read the files" ], "metadata": { "id": "ZKkavoIhLCql" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iaVvBZ0Lk4Yh" }, "outputs": [], "source": [ "#extract files\n", "tr = pd.read_csv('/content/drive/MyDrive/PA4_dataset/PA4_dataset/train.csv')\n", "ts = 
pd.read_csv('/content/drive/MyDrive/PA4_dataset/PA4_dataset/test.csv')\n", "stop = pd.read_table('/content/drive/MyDrive/PA4_dataset/PA4_dataset/stop_words.txt',header=None)[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4YRCOK-Bk4Yi" }, "outputs": [], "source": [ "#separate tweets from files\n", "tr_tweets = tr[\"Tweet\"]\n", "ts_tweets = ts[\"Tweet\"]\n", "tr_labels = tr[\"Sentiment\"]\n", "ts_labels = ts[\"Sentiment\"]\n", "tr_size = len(tr_tweets)\n", "ts_size = len(ts_tweets)" ] }, { "cell_type": "markdown", "source": [ "# Data Preprocessing\n", "Clean your data to remove unwanted symbols.
\n", "The following methods are helpful:
\n", "\n", "\n", "* [.casefold()](https://www.w3schools.com/python/ref_string_casefold.asp)\n", "* [.lstrip()](https://www.w3schools.com/python/ref_string_lstrip.asp)\n", "* [re.sub()](https://www.w3schools.com/python/python_regex.asp)\n", "* [.rstrip()](https://www.w3schools.com/python/ref_string_rstrip.asp)\n", "* [.replace](https://www.w3schools.com/python/ref_string_replace.asp)\n", "\n", "*Note: You may use other functions for processing but these should be enough*" ], "metadata": { "id": "vN5lOoInLPjq" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RE60JSBUk4Yj" }, "outputs": [], "source": [ "#Part 1 - Data Preprocessing\n", "for indx in range(tr_size): # pre-process training data\n", " #convert strings to lowercase\n", " #remove usernames and hyperlinks\n", " #removing digits and next line symbols\n", " #remove punctuation and symbols\n", " # {Code here}\n", " for word in stop: #remove stop words\n", " # {Code here}\n", "for indx in range(ts_size): # pre-process testing data\n", " #convert strings to lowercase\n", " #remove usernames and hyperlinks\n", " #removing digits and next line symbols\n", " #remove punctuation and symbols\n", " # {Code here}\n", " for word in stop: #remove stop words\n", " # {Code here}" ] }, { "cell_type": "markdown", "source": [ "# Bag of Words" ], "metadata": { "id": "M9ZZJhkeON8F" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gxsI2Nj6k4Yl" }, "outputs": [], "source": [ "#Part 2 - Bag Of Words\n", "# Extract vocab from training data set\n", "vocab_list = []\n", "for indx in range(tr_size):\n", " # read tweet\n", " tr_tweets[indx] = tr_tweets[indx].split()\n", " # append words from each tweet to vocab_list \n", " # make sure words don't get repeated\n", " # {Code Here}\n", "\n", "# Create Bag of Words Matrix of size (number of tweets in training data, size of vocabulary)\n", "# each row is a tweet and each column is a word\n", "matrix = np.zeros(('''Enter size here''')) \n", 
"\n", "# Populate BoW\n", "# Go through the words in each tweet \n", "# for each word append the count of the matrix at relevant position \n", "# {Code Here}" ] }, { "cell_type": "markdown", "source": [ "# Implement Naive Bayes\n", "\n", "\n", "* Step 01: Create a dictionary 'count' to store the number of times a word occurs in each class (positive, negative, neutral). You can do this by using the matrix. For each tweet use the index to access the actual label from tr_labels.
\n", "The dictionary will have the following structure:
\n", "$count[word] = [positive, negative, neutral]$
\n", "Example:
$count['to'] = [5,10,12]$\n", "\n" ], "metadata": { "id": "8wXjQexAQsE8" } }, { "cell_type": "code", "source": [ "count = {}\n", "# {Code Here}" ], "metadata": { "id": "MVms6fiznDtX" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "* Step 02: Find the prior probability of each class
\n", "The prior probability of a class is the ratio
\n", "$N_{c} / N_{t}$
\n", "where $N_{c}$ is the number of tweets belonging to the class and $N_{t}$ is the total number of tweets.
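A minimal sketch of this ratio, assuming a small hypothetical list of labels in place of tr_labels (the names labels and class_order are illustrative and not part of the assignment):

```python
from collections import Counter

# Hypothetical labels standing in for tr_labels.
labels = ['positive', 'negative', 'neutral', 'positive']
# Fixed class order: [positive, negative, neutral].
class_order = ['positive', 'negative', 'neutral']

# N_c for every class, and N_t as the total number of tweets.
label_counts = Counter(labels)
total = len(labels)

# prior[c] = N_c / N_t
prior = [label_counts[c] / total for c in class_order]
print(prior)  # [0.5, 0.25, 0.25]
```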
\n", "Store these in:
\n", "$prior = [positivePrior, negativePrior, neutralPrior]$" ], "metadata": { "id": "vRuXOnlZSPfx" } }, { "cell_type": "code", "source": [ "prior = []\n", "# {Code Here}" ], "metadata": { "id": "lSrK4pamXbOq" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "* Step 03: Find the likelihood of each word for every class
\n", "Each word will have 3 likelihoods, one corresponding to each class.
\n", "Calculate the likelihood of a word specific to a class:
\n", "\n", " * count($w_{i}$,c) : the number of times the word $w_{i}$ occurs in the specific class\n", " * $|V|$ : the size of the vocabulary\n", " * $\\sum_{w \\in V}$ count(w,c) : the sum of the counts of every vocabulary word in the specific class, i.e. the word count of this class\n", "\n", "> $P(w_{i}|c) = \\frac{count(w_{i},c) + 1}{\\sum_{w \\in V} count(w,c) + |V|}$ (add-one smoothing)
\n", "Store inside a dictionary 'likelihoods' in the form:\n", "$likelihoods[word] = [positiveLikelihood, negativeLikelihood, neutralLikelihood]$
\n", "> For example :
$likelihoods['to']=[0.3,0.57,0.23]$" ], "metadata": { "id": "fITy2ozcXYCx" } }, { "cell_type": "code", "source": [ "likelihoods = {}\n", "# {Code Here}" ], "metadata": { "id": "I_1lINavXesD" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "labelsMap = {\"positive\":0, \"negative\":1, \"neutral\":2}" ], "metadata": { "id": "xpFHw53iXoO4" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Predictions and Evaluation\n", "Given a test data point, and the set of prior probabilities and likelihoods, we need to return the 'best' class c.
\n", "1. We will create a vector 'prob' of length equal to the number of classes, 3.
\n", "$prob = [0,0,0]$
\n", "2. For each class c, we first add its prior probability to the corresponding entry of prob.
\n", "3. Then, for each word of the test tweet that is in our vocabulary, we multiply each entry of prob by that word's likelihood for the corresponding class.
\n", "4. Finally, the index of the maximum value in 'prob' is the predicted class for the test data point.
\n", "5. Compare these to the actual labels to calculate F1_score, accuracy and confusion matrix. You can use scikit libraries for this.
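
The steps above can be sketched as follows. This is only an illustration with hypothetical toy values for prior and likelihoods, and it sums log-probabilities instead of multiplying raw probabilities, an assumption made here to avoid numerical underflow on long tweets. The predicted class is unchanged because the logarithm is monotonic.

```python
import math

# Hypothetical toy values; in the notebook these come from the training steps.
prior = [0.5, 0.3, 0.2]  # [positive, negative, neutral]
likelihoods = {
    'good': [0.6, 0.1, 0.3],
    'bad': [0.1, 0.7, 0.2],
    'day': [0.3, 0.2, 0.5],
}

def predict(tweet_words, prior, likelihoods):
    # Start each class score at the log of its prior.
    log_prob = [math.log(p) for p in prior]
    for word in tweet_words:
        # Words outside the training vocabulary are skipped.
        if word not in likelihoods:
            continue
        for c in range(len(prior)):
            # Adding logs is equivalent to multiplying probabilities.
            log_prob[c] += math.log(likelihoods[word][c])
    # The index of the largest score is the predicted class.
    return log_prob.index(max(log_prob))

print(predict(['good', 'day', 'unseen'], prior, likelihoods))  # 0 (positive)
print(predict(['bad', 'bad'], prior, likelihoods))             # 1 (negative)
```

The returned index follows the same class order as labelsMap, so the predictions can be collected in a list and compared with the mapped test labels.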
" ], "metadata": { "id": "gDJkWdWSXtAH" } }, { "cell_type": "code", "source": [ "# Predictions\n", "for i,tweet in enumerate(ts_tweets):\n", " tweet = tweet.split(); \n", " actual_label = labelsMap[ts_labels[i]]; \n", " # {Code Here}" ], "metadata": { "id": "yPX3dbLmw6DV" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Evaluation\n", "# {Code Here}" ], "metadata": { "id": "D4a0Tc_BaCK7" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Scikit Implementation of Naive Bayes\n", "Use Scikit-Learn’s implementation of the na¨ıve Bayes classifier on the bag of words. Remember to implement one vs rest model with the in-built classifier in binary classification\n", "mode. Report the accuracy, F1 score, and confusion matrix of test using the library’s\n", "implementation." ], "metadata": { "id": "tgQgUubFaM0u" } }, { "cell_type": "code", "source": [ "# {Code Here}" ], "metadata": { "id": "Q1BzeUJfaYoL" }, "execution_count": null, "outputs": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "colab": { "name": "PA4_Naive_Bayes", "provenance": [], "collapsed_sections": [] } }, "nbformat": 4, "nbformat_minor": 0 }