In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
#from google.colab import drive, output
#drive.mount('/content/drive')
from geopy.distance import geodesic

## Instructions 

* Submit your code as both a notebook file (.ipynb) and a Python script (.py) on LMS. Both files should be named 'RollNo_PA01', for example: "23100214_PA01". Failing to submit either of them will result in a reduction of marks.
* All cells must be run once before submission and should display their results (graphs, plots, etc.). If the output of a cell is not displayed, marks will be deducted.
* The code MUST be implemented independently. Any plagiarism or cheating of work from others or the internet will be immediately referred to the DC.
* 10% penalty per day for 3 days after the due date. No submissions will be accepted after that.
* Use procedural programming style and comment your code properly.
* **Deadline to submit this assignment is 27/02/2023 (23:55).**

# Task 1
You are required to implement a simple linear regression (`simple_LR`) class that fits and plots a self-generated linear distribution with 2 parameters, $a$ and $b$,

\begin{equation}
 \hat y = a + bX.
\end{equation}



* The class consists of two methods: the parameterised constructor and `plot_equation`, which plots the fitted model on top of a scatter plot of the data.
* The constructor receives 5 arguments:
 * $a$, y-intercept of the fitted line.
 * $b$, gradient of the fitted line.
 * $x_{min}$, the minimum value that x can take in the interval.
 * $x_{max}$, the maximum value that x can take in the interval.
 * $n$, the number of evenly spaced points to plot.
* The `plot_equation` method receives no arguments and only outputs a plot.
 

**Steps to follow:**


1. Initialize the arguments inside the constructor.
2. Inside `plot_equation`, generate a list $X$ of $n$ evenly spaced values between $x_{min}$ and $x_{max}$.
3. For the generated list find $y$ using $y = a + bX$.
4. Add random normal noise to $y$ with $\mu = 0$ and $\sigma = 0.5$.
5. For these $X$ and $y$, find the optimal weights using the analytical solution given by

\begin{equation}
w = (X^TX)^{-1}X^Ty.
\end{equation}

6. On the same figure, plot a scatter of the original $X$ and $y$ and a line plot of the fitted line. The plot should have axis labels and a legend showing the equation of the line.

The code should be as **vectorized** as possible; `for` loops are not allowed. Steps 2 to 6 should all be done inside the `plot_equation` method.
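Steps 2 to 5 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the required class: the helper name `fit_line`, the `sigma` and `seed` parameters, and the column of ones added to $X$ so that the intercept $a$ can be recovered are all choices made for this sketch. Plotting (step 6) is left out.

```python
import numpy as np

def fit_line(a, b, x_min, x_max, n, sigma=0.5, seed=0):
    """Generate noisy points on y = a + b*x and fit them with the normal equation.

    Hypothetical helper for illustration; `seed` is only there to make the
    sketch reproducible and is not part of the assignment.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(x_min, x_max, n)            # step 2: n evenly spaced points
    y = a + b * x + rng.normal(0.0, sigma, n)   # steps 3-4: line plus Gaussian noise
    X = np.column_stack((np.ones(n), x))        # assumed design matrix: ones column for the intercept
    w = np.linalg.inv(X.T @ X) @ X.T @ y        # step 5: w = (X^T X)^{-1} X^T y
    return x, y, w

x, y, w = fit_line(3, -4, 0, 5, 10)
print(w)  # estimated [intercept, slope]
```

With `sigma=0.0` the recovered weights equal the true $a$ and $b$, which is a quick way to check the algebra before adding noise.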

In [1]:
class simple_LR():
    def __init__(self, a, b, x_min, x_max, n):
        #Write your code here
        pass

    def plot_equation(self):
        #Write your code here
        pass

In [2]:
#Do not modify this cell
vals = [(3, -4), (5, -5), (1, 2)]
for a, b in vals:
 L = simple_LR(a, b, 0, 5, 10)
 L.plot_equation()

# Task 2
## Part 1
You are required to implement a multivariate linear regression (`multivariate_LR`) class with 9 arguments.

\begin{equation}
 \hat y = \theta_0 + \sum_{i = 1}^{d} \theta_i X_i
\end{equation}



* The class consists of **8** methods: the parameterised constructor, `predict`, `mean_square_loss`, `loss_derivative`, `gradient_descent`, `plot_loss`, `animate_gradient_descent` and `adjusted_R_squared`.
* The constructor is passed 9 arguments, described as follows:
 * $X$, the feature matrix of the data. It should be a numpy matrix of dimension $n \times d$ ($n$ samples, $d$ features).
 * $y$, the output vector. It should be a numpy column vector of dimension $n \times 1$.
 * `train_size`, between **0** and **1**, corresponds to the fraction of the data allocated to the training set.
 * `epochs`, the maximum number of epochs for gradient descent.
 * `learning_rate`, the learning rate for the gradient descent.
 * `intercept`, a boolean variable: True if an intercept is to be fitted, else False.
 * `normalize`, a boolean variable: True if $X$ is to be normalized.
 * `method`, signifies the gradient descent method: `'batch'`, `'sgd'`, `'minibatch'`.
 * `batch_size`, number of batches for the minibatch method, set to 10 by default.
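The `normalize`, `intercept` and `train_size` arguments suggest a preprocessing step along the following lines. This is a minimal sketch, not the required constructor: the function name `prepare_data`, the `seed` parameter, and the use of z-score standardization as the normalization are assumptions made here.

```python
import numpy as np

def prepare_data(X, y, train_size, intercept=True, normalize=True, seed=0):
    """Optionally standardize X, add an intercept column, then split at random.

    Hypothetical helper; assumes no feature column is constant (zero std).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    if normalize:
        # z-score each feature column (one common choice of normalization)
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    if intercept:
        # prepend a column of ones so the first weight acts as theta_0
        X = np.hstack((np.ones((X.shape[0], 1)), X))
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])        # random row order for the split
    n_train = int(train_size * X.shape[0])
    train, test = idx[:n_train], idx[n_train:]
    return X[train], y[train], X[test], y[test]
```

Normalizing before adding the intercept column keeps the column of ones intact.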

**Description:**
* Before passing $X$ and $y$ to the class, make sure they are **2-D** numpy arrays.
* Initialize the arguments passed inside the constructor, and normalize $X$ if `normalize = True`. Add an intercept column to $X$ if `intercept = True`. Also initialize a parameter $w$, a column vector that stores the weights. You should also randomly split $X$ and $y$ here based on the `train_size` argument. Also initialize the `loss_history` and `weight_history` parameters, which store the loss and the weights respectively at each iteration.
* The `predict` function receives no arguments and returns the prediction vector for the test $X$.
* The `mean_square_loss` function receives 2 vectors $x$ and $y$ and returns the Euclidean distance between them.
* The `loss_derivative` function receives 2 vectors $X$ and $y$ and returns the value of the **GRADIENT** with the current weights.
* The `gradient_descent` function receives no arguments and performs gradient descent using one of the 3 specified descent methods.
* The `plot_loss` function receives no arguments and plots the loss curve using `loss_history`.
* The `animate_gradient_descent` function receives no arguments and animates how the fitted line evolves for **1-D** data using the weights stored in `weight_history`. Use `time.sleep(1)` inside a for loop that iterates over `weight_history`, and at each iteration plot the scatter of the original $X$ and $y$ variables as well as the fitted line.
* The `adjusted_R_squared` function receives no arguments and returns the adjusted $R^2$ value.

The code should be as vectorized as possible.
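As an illustration of a vectorized update, the sketch below computes the gradient for a whole batch with matrix products. It assumes the loss is the mean of squared residuals, $\frac{1}{n}\lVert Xw - y\rVert^2$, and it treats `batch_size` as the number of samples per mini-batch; both are assumptions of this sketch, and the function names are hypothetical rather than the required method signatures.

```python
import numpy as np

def mse_gradient(X, y, w):
    """Gradient of (1/n) * ||Xw - y||^2 with respect to w (assumed loss)."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ w - y)

def minibatch_epoch(X, y, w, learning_rate, batch_size, rng):
    """One pass over the data in shuffled mini-batches; returns updated weights."""
    idx = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = idx[start:start + batch_size]
        w = w - learning_rate * mse_gradient(X[batch], y[batch], w)
    return w

# Noise-free demo data generated from weights [3, 3] (intercept, slope).
rng = np.random.default_rng(0)
X_demo = np.column_stack((np.ones(50), np.linspace(0, 1, 50)))
y_demo = X_demo @ np.array([[3.0], [3.0]])
w_demo = np.zeros((2, 1))
for _ in range(500):
    w_demo = minibatch_epoch(X_demo, y_demo, w_demo, 0.1, 10, rng)
print(w_demo.ravel())  # approaches the generating weights
```

Under these assumptions, setting `batch_size` to the number of rows gives batch gradient descent and setting it to 1 gives stochastic gradient descent, so one routine can cover all three values of `method`.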

In [None]:
#Do not modify this cell
X = np.linspace(0, 10, 20)
Y = 3 + 3 * X + np.random.normal(loc = 0, scale = 4, size = len(X))
X = X.reshape((len(X), 1))
Y = Y.reshape((len(X), 1))
graph = plt.figure()
graph = plt.scatter(X, Y)
graph = plt.xlabel('x')
graph = plt.ylabel('y')

In [None]:
class multivariate_LR():
    def __init__(self, X, Y, train_size, epochs, learning_rate, intercept, normalize, method, batch_size = 10):
        #Write your code here
        pass

    def predict(self):
        pass

    def mean_square_loss(self, x, y):
        pass

    def loss_derivative(self, x, y):
        pass

    def gradient_descent(self):
        for i in range(self.epochs):
            if self.method == 'batch':
                pass
                #Write your code here
            elif self.method == 'sgd':
                pass
                #Write your code here
            elif self.method == 'minibatch':
                pass
                #Write your code here

    def plot_loss(self):
        pass

    def animate_gradient_descent(self):
        pass

    def adjusted_R_squared(self):
        pass

In [None]:
#Do not modify this cell as this is for testing
multivariatelr = multivariate_LR(X, Y, 0.8, 20, 0.02, False, False, 'sgd')
multivariatelr.gradient_descent()
multivariatelr.animate_gradient_descent()

## Part 2
Read the Bykea delivery dataset and split it into a feature matrix and an output vector that contains **delivery charge**. 

In [None]:
#Write your code here

Use the already imported `geodesic` function to calculate the distance between the pickup and dropoff points. Make a distance column in the feature matrix and drop the latitude and longitude columns. `geodesic` calculates distances as follows.

In [None]:
geodesic((24.8607, 67.0011),(31.5204, 74.3587)).km
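Building on the call above, one possible way to turn the coordinates into a distance column is sketched here. The column names in `COORD_COLS` are hypothetical (the actual Bykea dataset may label them differently), and the helper name `add_distance_column` and its `distance_fn` parameter are assumptions of this sketch. Because `geodesic` takes one pair of points at a time, the rows are processed one by one.

```python
import pandas as pd

# Hypothetical column names; adjust to match the real dataset.
COORD_COLS = ["pickup_lat", "pickup_lng", "dropoff_lat", "dropoff_lng"]

def add_distance_column(df, distance_fn=None):
    """Return a copy of df with a distance_km column replacing the coordinate columns."""
    if distance_fn is None:
        # Imported lazily so the helper can also be tried with a stand-in metric.
        from geopy.distance import geodesic
        distance_fn = lambda p, q: geodesic(p, q).km
    # Each point is a (latitude, longitude) tuple, as in the call shown above.
    distances = [
        distance_fn((p_lat, p_lng), (d_lat, d_lng))
        for p_lat, p_lng, d_lat, d_lng in zip(*(df[c] for c in COORD_COLS))
    ]
    return df.drop(columns=COORD_COLS).assign(distance_km=distances)

# Usage, assuming `features` is the feature DataFrame read from the dataset:
# features = add_distance_column(features)
```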

Use the `multivariate_LR` class with appropriate arguments to plot the loss, return the adjusted $R^2$ value for the model, and return the model's prediction for the test data.
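For reference, the adjusted $R^2$ is commonly defined as $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of samples and $p$ the number of predictors (excluding the intercept). A minimal standalone sketch of that formula follows; the function name and its arguments are illustrative and differ from the no-argument method required in the class.

```python
import numpy as np

def adjusted_r_squared(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = n_features."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
```

The formula needs $n > p + 1$ and a non-constant `y_true`; otherwise a denominator is zero.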

In [None]:
#Write your code here