A Mathematical Perspective and Application on the Iris Dataset
In this notebook, I do two things:
- Exploratory data analysis on the Iris dataset: an overview of the data, its statistics, and the relationships between features.
- An implementation of a feedforward neural network from scratch using NumPy, trained on the Iris dataset with gradient descent.
Table of Contents:
- 1. Dataset
1.1 Exploratory Data Analysis
1.2 Dataset Preprocessing
- 2. Feedforward Neural Networks (using NumPy)
2.1 Neural Network Training
2.2 Training Analysis
2.3 Test Analysis
2.4 Overfitting
- 3. Discussion
- References
Reproducibility: This notebook was run on the following configuration:
- Python version: 3.7
- All CPU-intensive processing was done on an Intel Xeon(R) processor.
import math
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
1. Dataset
Here, I load the Iris dataset, which was discussed at the very beginning of the class. The dataset is publicly available at the UCI repository (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/). For this notebook, the data has already been downloaded and can be found at ./Iris.csv.
Image Source here
raw_data = pd.read_csv("./Iris.csv")
raw_data.head()
| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
1.1 Exploratory Data Analysis
I started by understanding what type of data this is. As shown below, the dataset consists of four columns with float64 values and a fifth column, Species, of type object. As this dataset is widely used to test and understand classification, each data entry has 4 features (Sepal Length, Sepal Width, Petal Length and Petal Width) based on which it is categorized into a species.
raw_data.drop("Id", axis=1, inplace = True)
print("Data technical aspects:")
print(raw_data.info())
Data technical aspects:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SepalLengthCm 150 non-null float64
1 SepalWidthCm 150 non-null float64
2 PetalLengthCm 150 non-null float64
3 PetalWidthCm 150 non-null float64
4 Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
Having a well-balanced distribution among the classes is an important aspect of machine learning, especially when training methods like neural networks: unevenly distributed classes tend to introduce unwanted bias and inconsistent performance. Here, we can see that our data consists of three classes with an equal number of samples in each.
print("Class Distribution:")
print(raw_data["Species"].value_counts())
Class Distribution:
Iris-versicolor 50
Iris-virginica 50
Iris-setosa 50
Name: Species, dtype: int64
Since all the features are real-world measurements of sepals and petals in centimeters, they are all roughly on the same scale, as can be seen below. The relatively similar means and standard deviations suggest that feature standardization may not be necessary.
print("Dataset Statistics:")
raw_data.describe()
Dataset Statistics:
| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
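If the scales had differed more, a simple z-score standardization could be applied. A minimal sketch using pandas is shown below; it is illustrative only and not used in the rest of this notebook (the names feature_cols and standardized are introduced just for this example).
# Illustrative only: z-score standardization of the four features.
# Not applied later, since the feature scales are already similar.
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
standardized = (raw_data[feature_cols] - raw_data[feature_cols].mean()) / raw_data[feature_cols].std()
print(standardized.describe().loc[["mean", "std"]])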
Data Visualization:
In the following section, I visualize the dataset statistics to understand them better. For each feature, I generate violin plots for all three classes, which helps in understanding the scale and distribution of each feature across the different classes.
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.violinplot(x='Species', y = 'SepalLengthCm', data=raw_data)
plt.subplot(2,2,2)
sns.violinplot(x='Species', y = 'SepalWidthCm', data=raw_data)
plt.subplot(2,2,3)
sns.violinplot(x='Species', y = 'PetalLengthCm', data=raw_data)
plt.subplot(2,2,4)
sns.violinplot(x='Species', y = 'PetalWidthCm', data=raw_data)
plt.show()
Observations:
- All three species have quite similar Sepal Width values, indicating that this feature alone will not be a good discriminator.
- Sepal Length covers a shorter range for Iris-setosa (4-6 cm) compared to Iris-virginica, whose lengths range from 4 to 8 cm.
- The classes differ widely in Petal Width, essentially making them more separable.
Here I generate multivariate pair plots to better understand the relationships between features:
sns.pairplot(raw_data, hue="Species")
plt.show()
Observations:
- Iris-setosa (blue) is easily identified and can be cleanly separated. On the other hand, Iris-virginica and Iris-versicolor overlap considerably.
- Petal Length and Petal Width are observed to be the best features for identifying the flower types, as their feature space shows less overlap, which means the classes are more separable there.
1.2 Dataset Preprocessing
The next step would be to process the raw data to a format which can be used to train our classification model.
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
np.random.seed(5720)
Since the features are all float values on roughly the same scale, we don't have to change their type. However, the labels (Species) are in text format, which can't be directly used to train our model. Here, I encode the labels using one-hot encoding, which represents each species' class as a vector of float values. The result can be seen below:
data_array = raw_data.values
X_features = data_array[:, 0:4]
Y_labels = data_array[:, 4]
one_hot = OneHotEncoder()
Y_transformed = one_hot.fit_transform(Y_labels.reshape(-1,1))
Y_onehot = Y_transformed.toarray()
print("Converted Labels to One-hot encoding:\n")
for i in range(1,150,50):
print(f"{Y_labels[i]} -> {Y_onehot[i]}")
Converted Labels to One-hot encoding:
Iris-setosa -> [1. 0. 0.]
Iris-versicolor -> [0. 1. 0.]
Iris-virginica -> [0. 0. 1.]
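As a quick optional sanity check (assuming a scikit-learn version that provides OneHotEncoder.inverse_transform, i.e. 0.20 or newer), the fitted encoder can also map one-hot vectors back to the original species names:
# Map a few one-hot rows back to their species names (optional check).
print(one_hot.inverse_transform(Y_onehot[[1, 51, 101]]))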
Generalization, in other words the performance of a model on unseen data, is crucial in machine learning. Therefore, I create train and test sets from the dataset: the train set is used to train the model, and the test set is used to evaluate its performance on unseen data. Here, I split the dataset into these two sets, using 80% (train set) to train our models and holding back 20% (test set) as a test dataset.
test_size = 0.20
def process_data(X, Y):
X, Y = X[:, :, np.newaxis], Y[:, :, np.newaxis]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size)
X_train, X_test = X_train.astype(float), X_test.astype(float)
Y_train, Y_test = Y_train.astype(float), Y_test.astype(float)
train = [(X_train[i], Y_train[i]) for i in range(len(X_train))]
test = [(X_test[i], Y_test[i]) for i in range(len(X_test))]
return train, test
train, test = process_data(X=X_features, Y=Y_onehot)
input_dimensions, output_dimensions = len(train[0][0]), len(train[0][1])
print('Number of Input Features: ', input_dimensions)
print('Number of Output classes: ', output_dimensions)
Number of Input Features: 4
Number of Output classes: 3
2. Feedforward Neural Networks (using NumPy)
Image source here
In this section, I implement a class representing a general feedforward neural network from scratch, using the sigmoid activation function. Additionally, I implement a gradient descent mechanism to train the network. The class also contains several helper functions, such as generating training logs and evaluating predictions.
Below, I give the mathematical intuition behind each of the functions (highlighted in brackets) present in my implementation of the feedforward neural network.
Let’s start with the error:
\(error = \frac{1}{2} (yPred-Y)^2\)
- Here, \(yPred =\) `forward_prop`\((X)\), and \(Y\) is the desired output from the neural network.
- The feedforward operation (`forward_prop`) is a fairly simple process that consists of successive matrix-vector multiplications. For a single neuron with index \(i\) in layer \((l+1)\), this process can be formulated as follows:
\(a_i^{(l+1)} = Sigmoid\left(\sum_{j} w_{ij}^{(l)} a_j^{(l)} + b_i^{(l+1)}\right)\)
where \(w_{ij}^{(l)}\) is the weight of the neuron, \(a_j^{(l)}\) is the input that particular neuron receives, and \(b_i^{(l+1)}\) is the bias added to the output of the \(i^{th}\) neuron.
- Once we have initialized all the weights \(W\), we need to iteratively update them such that the cost \(J\) is minimized.
- Consider layer \(L\) and compute \(\frac{\partial}{\partial W^{(L)}}(Cost\;J)\):
\(\frac{\partial}{\partial W^{(L)}}(Cost\; J) = Cost \;J' \times \frac{\partial}{\partial W^{(L)}}(yPred)\)
- where \(Cost\;J'\) is the derivative of the cost function (`grad_cost`).
- Let’s compute \(\frac{\partial \;yPred}{\partial W^{(L)}}\)
- We use the Sigmoid activation function, so \(yPred\) can be written as:
\(yPred = Sigmoid\left(W^{(L)} X^{(L)} + b^{(L)}\right)\)
- So,
\(\frac{\partial\;yPred}{\partial W^{(L)}} = \frac{\partial}{\partial W^{(L)}}\left(Sigmoid\left(W^{(L)} X^{(L)}+b^{(L)}\right)\right) = Sigmoid'\left(W^{(L)} X^{(L)}+b^{(L)}\right) \times \frac{\partial}{\partial W^{(L)}}\left(W^{(L)} X^{(L)}+b^{(L)}\right)\)
- Here \(Sigmoid'\) is the derivative of the Sigmoid activation function (`g_prime`), computed as \(Sigmoid'(z) = Sigmoid(z)\,(1 - Sigmoid(z))\); a quick numerical check of this identity is sketched right after this derivation.
- Computing \(\frac{\partial}{\partial W^{(L)}}\left(W^{(L)} X^{(L)}+b^{(L)}\right)\):
- It can be seen that \(\frac{\partial }{\partial W^{(L)}}\left(W^{(L)} X^{(L)}+b^{(L)}\right) = X^{(L)}\), where \(X^{(L)}=Sigmoid\left(W^{(L-1)} X^{(L-1)}+b^{(L-1)}\right)\)
- Computing the gradients in backward propagation (`back_prop`):
- Consider a variable \(\Delta W^{(L)}\) such that: \(\Delta W^{(L)} = Cost\;J' \times Sigmoid'\left(W^{(L)} X^{(L)}+b^{(L)}\right)\)
- For each layer \(i\), starting from the second-to-last layer down to the first:
\(\Delta b^{(i)} = \Delta W^{(L)}\;\; \text{(the derivative w.r.t.}\; b^{(i)}\; \text{is just}\; \Delta W^{(L)})\)
\(\Delta W^{(i)} = \Delta W^{(L)} \, X^{(i)\,T}\;\;\; \text{(where}\; X^{(i)}=Sigmoid(W^{(i-1)} X^{(i-1)}+b^{(i-1)}))\)
\(\Delta W^{(L)} \leftarrow \left(W^{(i)\,T} \Delta W^{(L)}\right) \times Sigmoid'\left(W^{(i)} X^{(i)}+b^{(i)}\right)\;\;\; \text{(the error propagated one layer back)}\)
- Updating the parameters (`SGD_step`):
\(W^{(i)} \leftarrow W^{(i)} - \eta\,{\Delta}W^{(i)}\) \(\quad b^{(i)} \leftarrow b^{(i)} - \eta\,{\Delta}b^{(i)}\)
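Before moving to the implementation, here is a quick numerical sanity check of the identity \(Sigmoid'(z) = Sigmoid(z)\,(1 - Sigmoid(z))\) used in the derivation above. This is an illustrative sketch added here; the helper _sigmoid exists only for this check and is independent of the class below.
# Illustrative check: compare the analytic sigmoid derivative with a
# central finite-difference estimate at a few points.
def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eps = 1e-6
for z in [-2.0, 0.0, 1.5]:
    analytic = _sigmoid(z) * (1.0 - _sigmoid(z))
    numeric = (_sigmoid(z + eps) - _sigmoid(z - eps)) / (2.0 * eps)
    print(f"z = {z:+.1f}   analytic = {analytic:.6f}   finite-difference = {numeric:.6f}")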
class Network:
def __init__(self, sizes):
"""
Feedforward Neural Network
sizes: list [input_dimensions, hidden_layer_dimensions, output_dimensions]
        L: number of layers
biases: list containing biases values for each layer
weights: list containing weights for each layer
Parameters:
        sizes: list containing dimensions of the neural network
"""
self.L = len(sizes)
self.sizes = sizes
self.biases = [np.random.randn(n, 1) for n in self.sizes[1:]]
self.weights = [np.random.randn(n, m) for (
m, n) in zip(self.sizes[:-1], self.sizes[1:])]
self.acc_train_array = []
self.acc_test_array = []
def sigmoid(self, z, threshold=20):
"""
Sigmoid activation function
"""
z = np.clip(z, -threshold, threshold)
return 1.0 / (1.0 + np.exp(-z))
def g_prime(self, z):
"""
Derivative of sigmoid activation function
"""
return self.sigmoid(z) * (1.0 - self.sigmoid(z))
def forward_prop(self, a):
"""
Forward propagation:
: Do layerwise dot product between the input and the weights,
        : adding the corresponding biases and taking activations of it
: starting from the first layer then forward and return the final output.
"""
for (W, b) in zip(self.weights, self.biases):
a = self.sigmoid(np.dot(W, a) + b)
return a
    def cost(self, yhat, y):
        """
        Cost function
        : Cost(yhat, y) = 0.5 * (yhat - y)^2
        """
        return 0.5 * np.square(yhat - y)
def grad_cost(self, yhat, y):
"""
Gradient of cost function:
: Derivative of Cost(yhat,y)
"""
return (yhat - y)
def log_train_progress(self, train, test, epoch):
""" Logs training progres.
"""
acc_train = self.evaluate(train)
self.acc_train_array.append(acc_train)
if test is not None:
acc_test = self.evaluate(test)
self.acc_test_array.append(acc_test)
print("Epoch {:4d}: Train acc {:10.5f}, Test acc {:10.5f}".format(epoch+1, acc_train, acc_test))
else:
print("Epoch {:4d}: Train acc {:10.5f}".format(epoch+1, acc_train))
def back_prop(self, x, y):
"""
Back propagation for computing the gradients
        : Once forward prop completes (implemented inside), initiate lists of gradients (dws, dbs),
: where each element of list stores the corresponding gradients of that layer.
: For each layer compute the gradients and update the list (dws, dbs) and return it.
Parameters:
x: Sample features
y: Sample labels
RETURN: (dws, dbs)
dws: list of layerwise derivative of weights
dbs: list of layerwise derivative of biases
"""
a = x
# List initialized for storing layer-wise output before it is fed to activations
pre_activations = [np.zeros(a.shape)]
# List initialized for storing layer-wise activations
activations = [a]
        # Forward propagation to compute layer-wise pre_activations and activations
for W, b in zip(self.weights, self.biases):
z = np.dot(W, a) + b
pre_activations.append(z)
a = self.sigmoid(z)
activations.append(a)
db_list = [np.zeros(b.shape) for b in self.biases]
dW_list = [np.zeros(W.shape) for W in self.weights]
delta = self.grad_cost(activations[self.L-1], y) * \
self.g_prime(pre_activations[self.L-1])
for ell in range(self.L-2, -1, -1):
db_list[ell] = delta
dW_list[ell] = np.dot(delta, activations[ell].T)
delta = np.dot(self.weights[ell].T, delta) * self.g_prime(pre_activations[ell])
return (dW_list, db_list)
def SGD_step(self, x, y, eta):
"""
Update the values of weights (self.weights) & biases (self.biases)
: Get values of gradients (dws, dbs) by calling back_prop
: and update parameters using obtained gradients & learning rate eta
Parameters:
x: single sample features.
y: single sample target.
eta: learning rate.
RETURN: none
"""
dWs, dbs = self.back_prop(x, y)
self.weights = [W - eta * (dW) for (W, dW) in zip(self.weights, dWs)]
self.biases = [b - eta * (db) for (b, db) in zip(self.biases, dbs)]
def train(self, train, epochs, eta, verbose=True, test=None):
"""
Training routine for the neural network. For each epoch the following is done:
: shuffle the training dataset.
: call self.SGD_step which will in turn call backprop & update parameters
: Call self.log_train_progress according to the verbose
        Parameters:
        train: Training set -> list containing tuple (Training Feature, Training label)
        epochs: Number of epochs to run
eta: Learning rate
verbose: True to print accuracy updates, False otherwise
test: Test set -> list containing tuple (Test Feature, Test label)
"""
n_train = len(train)
for epoch in range(epochs):
perm = np.random.permutation(n_train)
for kk in range(n_train):
self.SGD_step(*train[perm[kk]], eta)
            if verbose and (epoch == 0 or (epoch + 1) % 20 == 0):
self.log_train_progress(train, test, epoch)
def predict(self, data):
"""
Generate predictions
: Calls forward propagation to generate predictions
Parameters: data: (X,Y)
RETURN: yhat (predictions)
"""
preds = []
for x, y in data:
yhat = self.forward_prop(x)
preds.append(yhat)
return preds
def evaluate(self, test):
"""
Evaluate current model
: computes the fraction of labels matching
test : (test_x, test_y)
"""
ctr = 0
for x, y in test:
yhat = self.forward_prop(x)
ctr += yhat.argmax() == y.argmax()
return float(ctr) / float(len(test))
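As an optional sanity check of the derivation and of back_prop (an illustrative sketch added here, not part of the training pipeline; the names check_net, x_check and y_check are introduced only for this check), one can compare the analytical gradient of a single weight against a central finite-difference estimate of the summed cost:
# Illustrative gradient check: perturb one weight and compare the numerical
# derivative of the summed cost with the analytical gradient from back_prop.
check_net = Network([input_dimensions, 5, output_dimensions])
x_check, y_check = train[0]                      # any single training sample
dW_list, db_list = check_net.back_prop(x_check, y_check)

layer, i, j = 0, 2, 1                            # an arbitrary weight to test
eps = 1e-6
w_orig = check_net.weights[layer][i, j]

check_net.weights[layer][i, j] = w_orig + eps
cost_plus = np.sum(check_net.cost(check_net.forward_prop(x_check), y_check))
check_net.weights[layer][i, j] = w_orig - eps
cost_minus = np.sum(check_net.cost(check_net.forward_prop(x_check), y_check))
check_net.weights[layer][i, j] = w_orig          # restore the original weight

numerical = (cost_plus - cost_minus) / (2.0 * eps)
analytical = dW_list[layer][i, j]
print(f"finite-difference: {numerical:.8f}, back_prop: {analytical:.8f}")
If the two values agree to several decimal places, the backward pass is consistent with the cost function defined above.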
2.1 Neural Network Training
It’s time to train the neural network implemented above. Generally, the hidden layer dimension (number of neurons) affects the classification capability of a neural network. Here I train my neural network with three configurations, each with a different hidden layer width (5, 20 and 50). I train each of these configurations for 100 epochs with a learning rate of \(0.2\). Every 20 epochs, I report both the train and the test performance.
nns = []
for hidden_layer_dimensions in [5, 20, 50]:
print('\nHidden Layer Dimensions: ', hidden_layer_dimensions)
nn = Network([input_dimensions, hidden_layer_dimensions, output_dimensions])
nn.train(train, epochs=100, eta=0.2, verbose=True, test=test)
nns.append(nn)
Hidden Layer Dimensions: 5
Epoch 1: Train acc 0.56667, Test acc 0.70000
Epoch 20: Train acc 0.94167, Test acc 1.00000
Epoch 40: Train acc 0.92500, Test acc 0.96667
Epoch 60: Train acc 0.98333, Test acc 0.96667
Epoch 80: Train acc 0.98333, Test acc 0.96667
Epoch 100: Train acc 0.95000, Test acc 0.96667
Hidden Layer Dimensions: 20
Epoch 1: Train acc 0.31667, Test acc 0.40000
Epoch 20: Train acc 0.96667, Test acc 1.00000
Epoch 40: Train acc 0.97500, Test acc 1.00000
Epoch 60: Train acc 0.97500, Test acc 1.00000
Epoch 80: Train acc 0.95833, Test acc 0.96667
Epoch 100: Train acc 0.97500, Test acc 1.00000
Hidden Layer Dimensions: 50
Epoch 1: Train acc 0.65833, Test acc 0.70000
Epoch 20: Train acc 0.80833, Test acc 0.83333
Epoch 40: Train acc 0.97500, Test acc 0.96667
Epoch 60: Train acc 0.98333, Test acc 0.93333
Epoch 80: Train acc 0.97500, Test acc 0.93333
Epoch 100: Train acc 0.97500, Test acc 1.00000
2.2 Training Analysis
Here, I plot the evolution of training accuracy over epochs for all three configurations.
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,7))
epochs_array = [1] + list(range(20, 101, 20))  # epochs at which accuracy was logged
ax.plot(epochs_array, nns[0].acc_train_array, color="blue", ls='dashed', label="5")
ax.plot(epochs_array, nns[1].acc_train_array, color="green", ls='dashed', label="20")
ax.plot(epochs_array, nns[2].acc_train_array, color="red", ls='dashed', label="50")
ax.legend(title='Hidden layer\n neurons', loc="lower right", fontsize=16)
plt.rcParams['legend.title_fontsize'] = 'xx-large'
ax.set_xlabel("epochs", fontsize=16)
ax.set_ylabel("Train accuracy", fontsize=16)
plt.title("Train performance over epochs", fontsize=18)
plt.grid(ls='--', color='gray', alpha=0.5)
plt.show()
Observations:
- All three configurations are observed to be converging.
- Networks with hidden layer widths of 5 (blue) and 20 (green) are observed to train and reach their optimum faster than the network with hidden layer width 50 (red). This can be explained by the fact that larger neural networks have more parameters and therefore take more iterations to train.
2.3 Test Analysis
Here, I generate a very similar plot, this time for the test set.
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,7))
epochs_array = [1] + list(range(20, 101, 20))  # epochs at which accuracy was logged
ax.plot(epochs_array, nns[0].acc_test_array, color="blue", ls='--', label="5")
ax.plot(epochs_array, nns[1].acc_test_array, color="green", ls='--', label="20")
ax.plot(epochs_array, nns[2].acc_test_array, color="red", ls='--', label="50")
ax.legend(title='Hidden layer\n neurons', loc="lower right", fontsize=16)
plt.rcParams['legend.title_fontsize'] = 'xx-large'
ax.set_xlabel("epochs", fontsize=16)
ax.set_ylabel("Test accuracy", fontsize=16)
plt.title("Test performance over epochs", fontsize=18)
plt.grid(ls='--', color='gray', alpha=0.8)
plt.show()
Observations:
- Again, all three configurations are observed to have converged and give very good test-set performance.
- One interesting thing to note here is that a bigger neural network doesn’t necessarily imply better performance. As can be observed above, the width-20 network (green) surpasses the much bigger width-50 network (red) almost always. The reason could be that the width-50 model is overparameterized, meaning it has more modelling capacity than the classification task requires. This suggests that simpler models generally tend to generalize better on unseen data than more complex, overparameterized networks.
- Additionally, the width-20 network (green) also beats the width-5 network (blue) on the test set. The fact that the width-5 network’s performance does not improve over iterations suggests it has saturated and simply doesn’t have enough modelling capacity to reach optimal results for our classification task. The width-20 network strikes a balance here.
2.4 Overfitting
Here I train a neural network for many more iterations than required.
hidden_layer_dimensions = 50
nn2 = Network([input_dimensions, hidden_layer_dimensions, output_dimensions])
nn2.train(train, epochs=400, eta=0.2, verbose=True, test=test)
Epoch 1: Train acc 0.38333, Test acc 0.30000
Epoch 20: Train acc 0.97500, Test acc 0.96667
Epoch 40: Train acc 0.97500, Test acc 1.00000
Epoch 60: Train acc 0.91667, Test acc 0.96667
Epoch 80: Train acc 0.91667, Test acc 0.96667
Epoch 100: Train acc 0.95000, Test acc 0.90000
Epoch 120: Train acc 0.95000, Test acc 0.96667
Epoch 140: Train acc 0.94167, Test acc 0.86667
Epoch 160: Train acc 0.95000, Test acc 0.96667
Epoch 180: Train acc 0.97500, Test acc 1.00000
Epoch 200: Train acc 0.97500, Test acc 1.00000
Epoch 220: Train acc 0.98333, Test acc 0.96667
Epoch 240: Train acc 0.97500, Test acc 1.00000
Epoch 260: Train acc 0.98333, Test acc 0.96667
Epoch 280: Train acc 0.97500, Test acc 1.00000
Epoch 300: Train acc 0.98333, Test acc 0.96667
Epoch 320: Train acc 0.98333, Test acc 0.96667
Epoch 340: Train acc 0.97500, Test acc 0.96667
Epoch 360: Train acc 0.95000, Test acc 0.86667
Epoch 380: Train acc 0.98333, Test acc 0.96667
Epoch 400: Train acc 0.97500, Test acc 0.96667
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,7))
epochs_array = [1] + list(range(20, 401, 20))  # epochs at which accuracy was logged
ax.plot(epochs_array, nn2.acc_train_array, color="blue", ls='--', label="Train")
ax.plot(epochs_array, nn2.acc_test_array, color="red", ls='--', label="Test")
ax.legend(title='Data', loc="lower right", fontsize=16)
plt.rcParams['legend.title_fontsize'] = 'xx-large'
ax.set_xlabel("epochs", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
plt.title("Performance over epochs (Hidden layer dim: 50)", fontsize=18)
plt.grid(ls='--', color='gray', alpha=0.8)
plt.show()
- Training for more iterations doesn’t necessarily imply better test performance. In fact, after around 300 epochs a decrease in test performance is observed even though the train performance is still improving. This is overfitting: when trained for too many iterations, a neural network tends to learn the training data itself rather than the underlying distribution, which leads to poor performance on the unseen test set.
3. Discussion
From the exploratory analysis of the Iris dataset, we observed that not all features separate the classes equally well: the petal features distinguish the classes better than the sepal features. The features of the dataset were roughly on the same scale, so we may not need to standardize them. Additionally, Iris has a balanced class distribution with 50 entries per class, which helps make the training of a neural network stable and consistent.
From the analysis of neural networks, the first crucial observation was that a bigger network doesn’t necessarily perform better than a simpler/smaller one. The required complexity of a network depends entirely on the task at hand. A very small network may be underparameterized, meaning it doesn’t have enough capacity to learn the underlying distribution and reach optimal results. On the other hand, very large networks require more training and may also overfit, which essentially means the network has started to learn the training data itself, leading to poor performance on unseen data. Overfitting also becomes more likely when training for a huge number of iterations, as discussed above.
Generally, we want a model with balanced complexity, which in practice is found by trial and error: models with different configurations need to be trained and tested to see which configuration performs best. Additionally, the test-set performance needs to be monitored continuously over the iterations to avoid overfitting; a minimal early-stopping sketch based on this idea is given below. There are other techniques and model designs, such as Dropout (randomly dropping neurons during training), that are widely used these days and let you train larger networks while reducing overfitting. It would certainly be interesting to analyze the various performances to detect underlying patterns across different hyperparameters and design choices in order to find the optimal network.
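The sketch below illustrates that monitoring idea. It reuses the Network class and the train/test lists from Section 2; the function name train_with_early_stopping and the patience parameter are my own additions for illustration. It keeps the weights and biases from the epoch with the best test accuracy and stops once the test accuracy has not improved for patience consecutive epochs. In practice one would monitor a separate validation set rather than the test set itself, keeping the test set untouched for the final evaluation.
import copy

# Illustrative early-stopping sketch (not used in the experiments above):
# keep the best weights/biases seen so far and stop when the monitored
# accuracy has not improved for `patience` consecutive epochs.
def train_with_early_stopping(nn, train, test, epochs=400, eta=0.2, patience=20):
    best_acc, best_params, stale = 0.0, None, 0
    for epoch in range(epochs):
        perm = np.random.permutation(len(train))
        for kk in perm:
            nn.SGD_step(*train[kk], eta)
        acc = nn.evaluate(test)
        if acc > best_acc:
            best_acc, stale = acc, 0
            best_params = (copy.deepcopy(nn.weights), copy.deepcopy(nn.biases))
        else:
            stale += 1
            if stale >= patience:
                break
    if best_params is not None:
        nn.weights, nn.biases = best_params
    return best_acc
For example, train_with_early_stopping(Network([input_dimensions, 50, output_dimensions]), train, test) would train the width-50 network until its test accuracy stops improving for 20 consecutive epochs.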
References
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science
[2] Casper Hansen (2019) [https://mlfromscratch.com/neural-networks-explained/]
[3] Maziar Raissi (Github- Applied Deep Learning) - Lecture Notes [https://github.com/maziarraissi/Applied-Deep-Learning]