{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "KRE9dI3nlA6C" }, "source": [ "# Programming Assignment 3 : Linear Regression\n", "\n", "## Instructions:\n", "\n", "## Marks: 100\n", "## Due Date: March, 13, 2022 \n", "\n", "## Instructions \n", "\n", "* Submit your code both as notebook file (.ipynb) and python script (.py) on LMS. The name of both files should be 'RollNo_PA3'. Failing to submit any one of them will result in the reduction of marks.\n", "\n", "* The datasets required for this assignment have been uploaded to LMS. \n", "\n", "* The code MUST be implemented independently. Any plagiarism or cheating of work from others or the internet will be immediately referred to the DC.\n", "\n", "* 10% penalty per day for 3 days after due date. No submissions will be accepted\n", "\n", "after that.\n", "\n", "\n", "* Use procedural programming style and comment your code properly.\n", "\n", "* **Deadline to submit this assignment is 13/03/2022.**\n", "* Make sure to run all blocks before submission.\n", "\n", "### Goal: \n", "\n", "The goal of this assignment is to get you familiar with Linear Regression and to give hands on experience of basic python tools and libraries which will be used in implementing the algorithm.\n", "\n", "### Note:\n", "\n", "You are not allowed to use scikit-learn or any other machine learning toolkit for part 1 and 2. You have to implement your own Linear Regression model from scratch. You may use Pandas, NumPy, Matplotlib and other standard python libraries\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "osHEjCuclA6I" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib as plt" ] }, { "cell_type": "markdown", "metadata": { "id": "3_mlmXGflA6J" }, "source": [ "# Part 1: Simple Linear Regression (10 Marks)" ] }, { "cell_type": "markdown", "metadata": { "id": "vacnDDnplA6K" }, "source": [ "## Dataset: \n", "The Dataset for this part is provided in the included zip folder within the folder labelled \"DataPart1\". In case you are doing this assignment on colab, please upload the datafile to colab before starting.\n", "\n", "## Pre-Processing:\n", "The dataset you have been provided contains the marketing impact of a company via 3 advertisement mediums (Youtube, Facebook and Newspaper) on their sales. Data is the advertisement budget (in thousands of dollars) along with sales. Before you begin with Simple Linear Regression:\n", " \n", "
  1. Plot Budget against Sales for each advertisement medium.\n", "
  2. Identify which advertisement medium has a linear relationship with Sales.\n", "
  3. Create a new Pandas data frame containing this column along with the corresponding sales data.\n", "
  4. Split the data into train/test sets (80/20 split), as sketched below.\n", "
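\n", "
A minimal sketch of these pre-processing steps is shown below purely for reference; it is not the required solution. The file name 'DataPart1/marketing.csv' and the column names 'youtube', 'facebook', 'newspaper' and 'sales' are assumptions and should be adjusted to the actual dataset.\n", "
\n", "
```python\n", "
import pandas as pd\n", "
import matplotlib.pyplot as plt\n", "
\n", "
# Hypothetical file and column names -- adjust to the provided dataset.\n", "
df = pd.read_csv('DataPart1/marketing.csv')\n", "
\n", "
# One scatter plot of advertisement budget against sales per medium.\n", "
for medium in ['youtube', 'facebook', 'newspaper']:\n", "
    plt.figure()\n", "
    plt.scatter(df[medium], df['sales'])\n", "
    plt.xlabel(medium + ' budget (thousands of dollars)')\n", "
    plt.ylabel('sales')\n", "
    plt.show()\n", "
\n", "
# Keep the medium that looks linear in the plots, together with sales.\n", "
chosen = 'youtube'  # placeholder; choose based on the scatter plots\n", "
data = df[[chosen, 'sales']].copy()\n", "
\n", "
# 80/20 train/test split without scikit-learn.\n", "
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)\n", "
split = int(0.8 * len(shuffled))\n", "
train, test = shuffled.iloc[:split], shuffled.iloc[split:]\n", "
```\n", "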
\n", "\n", "\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "srNc2EJPlA6K" }, "outputs": [], "source": [ "# Plot 3 different scatterplots and do all the required pre-processing.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "nhgQPoCSlA6L" }, "source": [ "## Tasks:\n", "\n", "Implement Linear Regression from scratch to predict the sales of the company based on their advertisement budget for the media you selected. You will implement the following functions:" ] }, { "cell_type": "markdown", "metadata": { "id": "_sfnZXM8lA6L" }, "source": [ "* Predict function:\n", " This function calculates the hypothesis for the input sample given the values of weights. \n", " \\begin{equation*}\n", " h(x,{\\theta}) = \\theta_0 + \\theta_1 x,\n", " \\end{equation*}\n", "\n", " where $${\\theta} \\in \\mathbb{R}^{2} $$ is the weight vector given by $${\\theta} = [ \\theta_0, \\theta_1]^T $$\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kQGEWGxulA6M" }, "outputs": [], "source": [ "def predict(X,theta0,theta1):\n", " # X --> Data point\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "id": "gC-rE2vflA6N" }, "source": [ "
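A minimal sketch of the predict function described above, assuming theta0 and theta1 are plain scalars and X is a single value or a NumPy array, is shown here for reference only (it is not the required implementation):\n", "
\n", "
```python\n", "
def predict(X, theta0, theta1):\n", "
    # hypothesis h(x, theta) = theta_0 + theta_1 * x\n", "
    return theta0 + theta1 * X\n", "
```\n", "
\n", "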
* Mean Square Error Function: This function calculates the cost of using the given weights as parameters for linear regression. The formula for the Mean Square Error is given below:
  • \n", "\n", "\\begin{equation*}\n", " J(\\theta_0,\\theta_1) =\\frac{1}{2n} \\sum_{i=1}^{n} (\\hat{y}^i - y^i)^2,\n", " \\end{equation*}\n", " where $y^i$ and $\\hat{y}^i$ are the actual and predicted labels of the $i$-th training instance respectively and $n$ is the total number of training samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yKJA8PD5lA6O" }, "outputs": [], "source": [ "def mean_square_error(X,Y,theta0,theta1):\n", " # X -> data point\n", " # Y -> True value corresponding to that point X\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "id": "ZSAl3RsAlA6O" }, "source": [ "* Batch Gradient Descent: This function learns the values of weights when given as parameters the learning rate $\\alpha$ and the number of iterations called epoch.\n", "Experiment with different values to determine the best parameters.\n", "\n", "For $j=0$ and $j=1$ repeat until convergence \\{\n", "\n", "$ \\qquad \\theta_j := \\theta_j - \\alpha \\frac{\\partial}{\\partial \\theta_j} J(\\theta_0,\\theta_1)$\n", "\n", "\\}\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j_JaqXmzlA6P" }, "outputs": [], "source": [ "def gradient_descent(X,Y,alpha,epochs):\n", " # X -> train_x\n", " # Y -> train_y \n", " \n", " theta_0 = 0\n", " theta_1 = 0\n", " J = list()\n", " for epoch in epochs:\n", " # Your code\n", " # Call your predict function.\n", " # Modify your Theta_0 and Theta_1 accordingly.\n", " # Append Cost to J\n", "\n", " return J, theta0, theta1\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "P-P6xkZjlA6P" }, "outputs": [], "source": [ "alpha=0.00001\n", "epochs=15000\n", "\n", "## Make your X = train_x and Y= train_y\n", "X=0\n", "Y=0\n", "\n", "J,theta0,theta1 = gradient_descent(X,Y,alpha,epochs)\n", "print(\"Cost after convergence is: \",J[-1])" ] }, { "cell_type": "markdown", "metadata": { "id": "cizlZWANlA6P" }, "source": [ "* Use a value of $\\alpha$ $<$ $0.00001$ and epochs $>$ 15000\n", "* Your Minimum Cost on your test set should be around 11-15\n", " " ] }, { "cell_type": "markdown", "metadata": { "id": "a6fThY_clA6Q" }, "source": [ "### Question: \n", "Given the data, explain why such a large number of epochs and low learning rate is being used? Explain in terms of data and gradient descent function. Please answer by adding a markdown underneath this question.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "owL7flkLlA6Q" }, "source": [ "### Plot Cost against Number of Epochs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "i5MAADkAlA6Q" }, "outputs": [], "source": [ "#Plotting" ] }, { "cell_type": "markdown", "metadata": { "id": "sDPA8pd2lA6R" }, "source": [ "### Plotting Linear Fit\n", "\n", "- Using your learned paramters, plot a linear fit of Sales (Y-Axis) against Advertisement Budget (X-Axis).\n", "\n", "Plot the original Scatterplot on the same graph as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j3nd3FkRlA6R" }, "outputs": [], "source": [ "#Plotting" ] }, { "cell_type": "markdown", "metadata": { "id": "D_muwreqlA6R" }, "source": [ "# Part 2: Multivariate Linear Regression (60 Marks)" ] }, { "cell_type": "markdown", "metadata": { "id": "wiDCMEdulA6S" }, "source": [ "Concrete is the most important material in civil engineering. Concrete compressive strength is an extremely important datapoint that engineers take into consideration while making decisions. 
To physically measure compressive strength is an expensive and costly process and is also dependent on age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate and fine aggregate. \n", "\n", "For this part, we will be prediciting the compressive strength of concrete using the mixture composition and age of a concrete mixture. You will find the data for this part in the folder labelled \"DataPart2\".\n", "\n", "Data Attributes:\n", "\n", "* Cement (Component 1) : Kg in a m3 Mixture\n", "* Blast Furnace Slag (Component 2): Kg in a m3 Mixture\n", "* Fly Ash (Component 3): Kg in a m3 Mixture\n", "* Water (Compontnet 4): Kg in a m3 Mixture\n", "* Superplasticizer (Component 5): Kg in a m3 Mixture\n", "* Coarse Aggregate (Component 6): Kg in a m3 Mixture\n", "* Fine Aggregate (Component 7): Kg in a m3 Mixture\n", "* Age: (Days: 1~365)\n", "* Concrete Compressive Strength (Output): MPa\n", "\n", "\n", "Data Credits: Prof. I-Cheng Yeh\n" ] }, { "cell_type": "markdown", "metadata": { "id": "RNOvUacglA6S" }, "source": [ "## Tasks:\n", "\n", "* You are required to select the best features by drawing Scatter Plots/Heat Maps and using Pearson's correlation coefficent. (You may import a library for this)\n", "\n", "* Please justify your selection (Removing any attribute or keeping all attributes) in a markdown box below this one (Add one) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zpsRnAPWlA6S" }, "outputs": [], "source": [ "## Scatter Plot/Heat Map and Correlation Matrix" ] }, { "cell_type": "markdown", "metadata": { "id": "v3lRmEeylA6T" }, "source": [ "* Data Normalization: Normalize the Dataset by subtracting the mean of each feature from the feature value and then divide by the standard deviation of that feature:\n", "\n", "\\begin{equation*}\n", " x_{\\rm norm} = \\frac{x - {\\text{mean}}(x)}{{\\rm std}(x)}\n", " \\end{equation*}\n", "(For normalization of test set, use mean and standard deviation of training set.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VOCakm06lA6T" }, "outputs": [], "source": [ "#Data Normalization\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "CUK_Lp3clA6T" }, "source": [ "* Implement Predict Function, Mean Square Error and Batch Gradient Descent Function as explained in Part 1 for multivariate linear regression.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j_bEfG2_lA6T" }, "outputs": [], "source": [ "## Implementation\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xHEjPnAglA6U" }, "source": [ "* Plot the No. of Epochs (y-axis) vs Training Loss (x-axis)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2ugqYLDFlA6U" }, "outputs": [], "source": [ "#Plotting\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Cs0C5u_tlA6U" }, "source": [ "* Measure Mean Square Error of your test set using your learned rate. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PuCugf_llA6U" }, "outputs": [], "source": [ "## Measure" ] }, { "cell_type": "markdown", "metadata": { "id": "4icjMhfwlA6U" }, "source": [ "#### Question 1: Mention the best values of Alpha and Numb Of Epochs:\n", "\n", "Answer 1:\n", "
1. Alpha: \n", "
2. Numb of Epochs: \n", "
    \n", "\n", "#### Question 2: What is the Mean_Square_Error of your model? Suggest Possible ways to improve the accuracy with a small description of each avenue. \n", "\n", "Answer 2:\n" ] }, { "cell_type": "markdown", "metadata": { "id": "5ELbrcYolA6U" }, "source": [ "# Part 3: Regularized Linear Regression (30 Marks)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eQ8wSAxwlA6V" }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.linear_model import Ridge\n", "from sklearn.linear_model import Lasso\n", "from sklearn.linear_model import ElasticNet\n", "from sklearn.metrics import mean_squared_error" ] }, { "cell_type": "markdown", "metadata": { "id": "NYLKmea5lA6V" }, "source": [ "Regularization is a technique that assumes smaller weights generate simple models and helps avoid overfitting. In this part, you will be using various regularization techniques on the Cement Dataset (Provided in Part 2).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8mUz9viflA6V" }, "source": [ "## Tasks:\n", "\n", "Implement the least squares [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), [Lasso Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso),[Ridge Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge), and [Elastic Net Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet) using [scikit-learn](https://scikit-learn.org/stable/index.html). You are required to:\n", "\n", "* Try out different values of regularization paramters (alpha in scikit-learn document) and use the validation set to determine the best value of regularization parameter by computing validation loss using [Mean Squared Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).\n", "\n", "* For Ridge Regression and Elastic Net Regression, plot regularization coefficients on the x-axis and learned parameters $\\theta$ on the y-axis. Please read this [blog](https://scienceloft.com/technical/understanding-lasso-and-ridge-regression/) as reference.\n", "\n", "* After evaluating the best value of the regularization parameter, use the [Mean Squared Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) to compute the loss on the test set for each regression." ] }, { "cell_type": "markdown", "source": [ "### Question: What is the difference between Ridge Regression and Lasso Regression\n", "\n", "Ans: " ], "metadata": { "id": "GzktVrajqVCS" } } ], "metadata": { "interpreter": { "hash": "de5ddf0c894aaffd6c628c6466f7e9ce5c0a31a4c07ebe305b7d3a4ad0daff48" }, "kernelspec": { "display_name": "Python 3.9.10 64-bit (windows store)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" }, "orig_nbformat": 4, "colab": { "name": "A03.ipynb", "provenance": [], "collapsed_sections": [] } }, "nbformat": 4, "nbformat_minor": 0 }