{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kXCyEtDav_TR",
    "outputId": "09ba2cbd-cc6e-4569-e8ea-d94e3b81e071"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import time\n",
    "#from google.colab import drive, output\n",
    "#drive.mount('/content/drive')\n",
    "from geopy.distance import geodesic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instructions \n",
    "\n",
    "*   Submit your code both as notebook file (.ipynb) and python script (.py) on LMS. The name of both files should be 'RollNo_PA01', for example: \"23100214_PA01\". Failing to submit any one of them will result in the reduction of marks.\n",
    "*  All the cells must be run once before submission and should be displaying the results(graphs/plots etc). If output of the cells is not being displayed, marks will be dedcuted.\n",
    "*   The code MUST be implemented independently. Any plagiarism or cheating of work from others or the internet will be immediately referred to the DC.\n",
    "* 10% penalty per day for 3 days after due date. No submissions will be accepted\n",
    "after that.  \n",
    "* Use procedural programming style and comment your code properly.\n",
    "* **Deadline to submit this assignment is  27/02/2023 (23:55).**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "I4c2QQ2Q0Hca"
   },
   "source": [
    "# Task 1\n",
    "You are required to implement a simple linear regression (`simple_LR`) class with 2 arguments that fits and plots a self-generated linear distribution,\n",
    "\n",
    "\\begin{equation}\n",
    "                             \\hat y = a + bX.\n",
    "\\end{equation}\n",
    "\n",
    "\n",
    "\n",
    "*   The class consists of two functions; the paramterised constructor and `plot_model` that plots the fitted model on top of a scatter plot of the data.\n",
    "*   The constructor receives 5 arguments:\n",
    "      *   $a$, y-intercept of the fitted line.\n",
    "      *   $b$, gradient of the fitted line.\n",
    "      *   $n$, the number of evenly spaced points to plot.\n",
    "      *   $x_{min}$, the minimum value that x can take in the interval.\n",
    "      *   $x_{max}$, the maximum value that x can take in the interval.\n",
    "*   The `plot_model` function receives no arguments only outputs a plot.\n",
    "  \n",
    "\n",
    "**Steps to follow:**\n",
    "\n",
    "\n",
    "1.   Initialize the arguments inside the constructor.\n",
    "2.   Inside plot equation generate a list of $n$ evenly spaced called $X$ values between $x_{min}$ and $x_{max}$.\n",
    "3.   For the generated list find $y$ using $y = a + bX$.\n",
    "4.   Add random normal noise to $y$ with $\\mu = 0$ and $\\sigma =  0.5$.\n",
    "5.   For these $X$ and $y$ find the optimal weights using analytical solution given by,\n",
    "\n",
    "\\begin{equation}\n",
    "w = (X^TX)^{-1}X^Ty.\n",
    "\\end{equation}\n",
    "\n",
    "6.   On the same figure, plot a scatter of the original $X$ and $y$ and a line plot of the fitted line. The plot should have axis labels and a legend showing the equation of the line.\n",
    "\n",
    "The code should be as **vectorized** as possible; For loops are not allowed. All of the steps 2 to 6 should be done inside `plot_equation` method."
   ]
  },
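  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One point that step 5 leaves implicit: for the analytical solution to recover both an intercept and a gradient, the matrix in that formula needs a column of ones next to the generated values. A sketch of the shapes involved, where $x_1, \\dots, x_n$ are the generated points and the symbols $\\tilde X$, $\\hat a$ and $\\hat b$ are introduced here only for illustration:\n",
    "\n",
    "\\begin{equation}\n",
    "\\tilde X = \\begin{bmatrix} 1 & x_1 \\\\ \\vdots & \\vdots \\\\ 1 & x_n \\end{bmatrix}, \\qquad w = (\\tilde X^T \\tilde X)^{-1} \\tilde X^T y = \\begin{bmatrix} \\hat a \\\\ \\hat b \\end{bmatrix}.\n",
    "\\end{equation}\n",
    "\n",
    "Because of the added noise, $\\hat a$ and $\\hat b$ should be close to, but generally not exactly equal to, the $a$ and $b$ passed to the constructor."
   ]
  },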
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "CbjDhTBv0cn9"
   },
   "outputs": [],
   "source": [
    "class simple_LR():\n",
    "    def __init__(self, a, b, x_min, x_max, n):\n",
    "    #Write your code here\n",
    "        pass\n",
    "  \n",
    "    def plot_equation(self):\n",
    "    #Write your code here\n",
    "        pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 803
    },
    "id": "JhFfOAdJu8EY",
    "outputId": "8517a971-ce1b-4f28-f3b6-aed217fc7231"
   },
   "outputs": [],
   "source": [
    "#Do not modify this cell\n",
    "vals = [(3, -4), (5, -5), (1, 2)]\n",
    "for a, b in vals:\n",
    "    L = simple_LR(a, b, 0, 5, 10)\n",
    "    L.plot_equation()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6dqDzI9RHHch"
   },
   "source": [
    "# Task 2\n",
    "## Part 1\n",
    "You are required to implement a multivariate linear regression (`multivariate_LR`) class with 9 arguments.\n",
    "\n",
    "\\begin{equation}\n",
    "                             \\hat y = \\theta_0 + \\sum_{i = 1}^{d} \\theta_i X_i\n",
    "\\end{equation}\n",
    "\n",
    "\n",
    "\n",
    "*   The class consists of **9** functions: paramterised constructor, `predict`, `mean_square_loss`, `loss_derivative`, `gradient_descent`, `plot_loss`, `animate_gradient_descent` and `adjusted_R_Squared`.\n",
    "*   The constructor is passed 9 arguments which have the following description:\n",
    "      *   $X$, the feature matrix of the data. It should be a numpy matrix of dimension $n \\times m$.\n",
    "      *   $y$, the output vector, It should be a **1-D** numpy vector.\n",
    "      *   `train_size`, between **0** and **1**, corresponds to the fraction of the data allocated to the training set.\n",
    "      *   `epochs`, the maximum number of epochs for gradient descent.\n",
    "      *   `learning_rate`, the learning rate for the gradient descent.\n",
    "      *   `intercept`, a boolean variable: True if intercept is to be fitted else false.\n",
    "      * `normalize`, a boolean variable: True if $X$ is to be normalized.\n",
    "      * `method`, signifies the gradient descent method: `'batch'`, `'sgd'`, `'minibatch'`.\n",
    "      * `batchsize`, number of batches for the minibatch method, set to 10 by default.\n",
    "\n",
    "**Description:**\n",
    "* Before passing $X$ and $y$ to the class, make sure they are **2-D** numpy arrays.\n",
    "* Initialize the arguments passed inside the constructor, normalize $X$ if `normalize = True`. Add an intercept column to $X$ if `intercept = True`. Also initialize a parameter $w$ which is a column vector that stores the weights. You should also randomly split $X$ and $y$ here based on the `train_size` argument. Also intialize the `loss_history` and `weight_history` parameters which store the loss and the weights respectively at each iteration.\n",
    "* The `predict` function receives no arguments returns the prediction vector for the test $X$.\n",
    "* The `mean_square_loss` function recevies 2 vectors $x$ and $y$ and returns the Euclidean distance between them.\n",
    "* The `loss_derivative` function recevies 2 vectors $X$ and $y$ and returns the value of the **GRADIENT** with the current weights.\n",
    "* The `gradient_descent` function recevies no arguments and performs gradient descent using the one of the 3 specified descent methods.\n",
    "* The `plot_loss` function receives no arguments and plots the loss curve using `loss_history`.\n",
    "* The `animate_gradient_descent` function receives no arguments and animates on a plot how the fitted line evolves on **1-D** using weights stored in `weight_history` . Use `time.sleep(1)` inside a for loop that iterates over `weight_history` plot the scatter of the original $X$ and $y$ variables as well as the fitted line at that iteration.\n",
    "* The `adjusted_R_Squared` function receives no arguments and returns the adjusted $R^2$ value.\n",
    "\n",
    "The code should be as vectorized as possible."
   ]
  },
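  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, one common convention for the loss, its gradient and the weight update is sketched below. This is an assumption rather than part of the specification above: it averages over the $n$ rows of $X$, omits the factor $\\frac{1}{2}$ that some texts include, and writes $\\alpha$ for `learning_rate`.\n",
    "\n",
    "\\begin{equation}\n",
    "L(w) = \\frac{1}{n} \\lVert Xw - y \\rVert^2, \\qquad \\nabla_w L(w) = \\frac{2}{n} X^T (Xw - y), \\qquad w \\leftarrow w - \\alpha \\nabla_w L(w).\n",
    "\\end{equation}\n",
    "\n",
    "Under this convention, `'batch'` evaluates the gradient on all training rows, `'sgd'` on a single row at a time, and `'minibatch'` on a subset of rows, with $n$ replaced by the number of rows actually used."
   ]
  },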
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 279
    },
    "id": "p27BJtg8cEXa",
    "outputId": "2a22b283-36e6-467b-dcbe-266e408f7aeb"
   },
   "outputs": [],
   "source": [
    "#Do not modify this cell\n",
    "X = np.linspace(0, 10, 20)\n",
    "Y = 3 + 3 * X + np.random.normal(loc = 0, scale = 4, size = len(X))\n",
    "X = X.reshape((len(X), 1))\n",
    "Y = Y.reshape((len(X), 1))\n",
    "graph = plt.figure()\n",
    "graph = plt.scatter(X, Y)\n",
    "graph = plt.xlabel('x')\n",
    "graph = plt.ylabel('y')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "OfF_JVhfGhYG"
   },
   "outputs": [],
   "source": [
    "class multivariate_LR():\n",
    "    def __init__(self, X, Y, train_size, epochs, learning_rate, intercept, normalize, method, batch_size = 10):\n",
    "    \n",
    "    def predict(self):\n",
    "        pass\n",
    "  \n",
    "    def mean_square_loss(self, x, y):\n",
    "        pass\n",
    "\n",
    "    def loss_derivative(self, x, y):\n",
    "        pass\n",
    "    \n",
    "    def gradient_descent(self):\n",
    "        for i in range(self.epochs):\n",
    "            if self.method == 'batch':\n",
    "                pass\n",
    "                #Write your code here\n",
    "            elif self.method == 'sgd':\n",
    "                pass\n",
    "                #Write your code here\n",
    "            elif self.method == 'minibatch':\n",
    "                pass\n",
    "                #Write your code here\n",
    "\n",
    "    def plot_loss(self):\n",
    "        pass\n",
    "    \n",
    "    def animate_gradient_descent(self):\n",
    "        pass\n",
    "    \n",
    "    def adjusted_R_squared(self):\n",
    "        pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 281
    },
    "id": "0GE7ucszJdcg",
    "outputId": "882480a7-62d7-4b5b-92f5-009f33d11856"
   },
   "outputs": [],
   "source": [
    "#Do not modify this cell as this is for testing\n",
    "multivariatelr = multivariate_LR(X, Y, 0.8, 20, 0.02, False, False, 'sgd')\n",
    "multivariatelr.gradient_descent()\n",
    "multivariatelr.animate_gradient_descent()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2\n",
    "Read the Bykea delivery dataset and split it into a feature matrix and an output vector that contains **delivery charge**. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8qyIDvyVHJSt"
   },
   "outputs": [],
   "source": [
    "#Write your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the already imported `geodesic` function to calcuate distance between the pickup and dropoff points. Make a distance column in the feature matrix and drop the latitude and longitude columns. `geodesic` calculates distances as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "nX_aLoH-HRxD"
   },
   "outputs": [],
   "source": [
    "geodesic((24.8607, 67.0011),(31.5204, 74.3587)).km"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "s69RpbJ-3QgO"
   },
   "source": [
    "Use the `multivariate_LR` class with appropriate arguments to plot the loss, return the adjusted $R^2$ value for the model, and return the model's prediction for the test data."
   ]
  },
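  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The adjusted $R^2$ is commonly defined as below, where $n$ is the number of samples it is evaluated on, $d$ is the number of features (excluding the intercept), $\\hat y_i$ is the model's prediction for sample $i$ and $\\bar y$ is the mean of $y$. Whether to evaluate it on the training or the test split is not specified above, so that choice is left open here.\n",
    "\n",
    "\\begin{equation}\n",
    "R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_i - \\hat y_i)^2}{\\sum_{i=1}^{n} (y_i - \\bar y)^2}, \\qquad R^2_{adj} = 1 - (1 - R^2) \\frac{n - 1}{n - d - 1}.\n",
    "\\end{equation}\n",
    "\n",
    "Unlike $R^2$, the adjusted value penalises additional features, so it only increases when a new feature improves the fit by more than would be expected by chance."
   ]
  },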
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 308
    },
    "id": "a5xh2iLT2GrL",
    "outputId": "933bb1b0-9b3e-400c-900b-f630ceef0523"
   },
   "outputs": [],
   "source": [
    "#Write your code here"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}