#!/usr/bin/env python
# coding: utf-8
# # Machine Learning with Python
#
# Collaboratory workshop
#
# This is a notebook developed for the third day of the Collaboratory Workshop, Machine Learning with Python. For more information, go to the workshop home page:
#
# https://github.com/kose-y/W17.MachineLearning/wiki/Day-3
# - __Day 1__ - Fundamentals and Motivation
# - __Day 2__ - Classification and Cross-Validation
# - __Day 3__ - Regression and Unsupervised Learning
#     - Linear and nonlinear regressions
#     - Unsupervised learning
#     - Conclusions
# In[ ]:
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 15 # increase font size within a plot
# ## Regression
#
# From a set of features, determine a continuous target.
#
# e.g., Create a function that estimates the cell volume
#
#

#
# ## Linear regression
# Say we predict the volume (`V`) of a cell from its length (`L`), width (`d`), and smoothness (`s`).
# $$
# V(L, d, s) = \alpha L + \beta d + \gamma s + V_0
# $$
#
# Training in linear regression means finding the best possible parameters $\alpha$, $\beta$, $\gamma$, and $V_0$.
# A linear regression is any regression with a model that is linear in the fitting parameters.
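# As a quick aside, "linear in the parameters" means the fit reduces to an ordinary least-squares problem that NumPy can solve directly. A minimal sketch with made-up measurements (the numbers and variable names below are purely illustrative):
# In[ ]:
# Hypothetical illustration: fit V = alpha*L + beta*d + gamma*s + V0 by least squares.
rng = np.random.default_rng(0)
length, width, smooth = rng.random(50), rng.random(50), rng.random(50)  # made-up features
V = 1.5*length + 0.8*width + 0.3*smooth + 2.0 + 0.05*rng.standard_normal(50)
A = np.column_stack([length, width, smooth, np.ones(50)])  # design matrix; the last column fits V_0
params, *_ = np.linalg.lstsq(A, V, rcond=None)
print(params)  # approximately [1.5, 0.8, 0.3, 2.0], i.e. (alpha, beta, gamma, V_0)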
# ### Creating a simple dataset
# Let's get started creating a simple linear dataset using NumPy's random function:
# In[ ]:
numSamples = 100 # Defining the number of samples
linearCoef = 0.5 # This is the correct linear coefficient
Intercept = 2.2 # This is the correct intercept parameter
X = np.random.random( numSamples )*10.0 # Randomly sampling X-points.
e = np.random.random( numSamples ) - 0.5 # Noise
print("Min of X: ", X.min())
print("Max of X: ", X.max())
print("Average of the error component: ", e.mean())
# Let's define the dependent variable
# In[ ]:
Y = linearCoef*X + Intercept + e
# Let's plot it:
# In[ ]:
plt.plot(X, Y, 'o', color=(0.2,0.6,1.0)) # color: red-green-blue.
plt.xlabel('Feature')
plt.ylabel('Target')
plt.show()
# We want to use the linear model
# $$ Y = \beta X + \gamma $$
# If everything works out, we should expect $\beta \approx 0.5$ and $\gamma \approx 2.2$.
# We split the dataset into training set and test set.
# In[ ]:
from sklearn.model_selection import train_test_split
# features must have shape (100,1), while X has shape (100,)
X = X.reshape((numSamples,1))
X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size=0.33)
# One may perform linear regression using the class `LinearRegression` in the `linear_model` subpackage.
#
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
# In[ ]:
from sklearn.linear_model import LinearRegression
# In[ ]:
model = LinearRegression()
model.fit( X_train , Y_train )
# In[ ]:
x_array = np.linspace(0,10,100)
y_array = model.predict( x_array.reshape((100,1)) )
# In[ ]:
plt.plot(X_train, Y_train, 'o', color=(0.2,0.6,1.0))
plt.plot(x_array, y_array, 'r-', linewidth=3.)
plt.xlabel('Feature')
plt.ylabel('Target')
plt.show()
# Similar to the classifiers covered yesterday, you can access information about the fitted `LinearRegression` model through its attributes:
# In[ ]:
print("Coefficient: ", model.coef_ )
print("Intercept: ", model.intercept_ )
# ### Performance measure
# In analogy with a classifier's accuracy, the most commonly used metric to evaluate the performance of linear regression is the coefficient of determination ($R^2$).
#
# $R^2$ is defined as $(1 - \frac{u}{v})$, where $u$ is the residual sum of squares `((y_true - y_pred)** 2).sum()` and $v$ is the total sum of squares `((y_true - y_true.mean()) ** 2).sum()`.
#
# - $v$ represents total variation, or "total sum of squares"
# - $u$ represents unexplained variation, or "residual sum of squares"
# In[ ]:
print("R-sq on test set: ", model.score(X_test, Y_test))
# ## Polynomial regression
# A linear regression is any regression with a model that is linear in the fitting parameters, e.g.,
# $$Y = \beta_1 X + \beta_2 Z + \gamma$$
#
# What about this model?
# $$Y = \beta_1 X + \beta_2 X^2 + \gamma$$
#
# Although the model is quadratic, __this is still considered a linear regression__, and can be solved using the same methods.
# We will create a 2D feature space with $X$ and $X^2$. First, let's generate the quadratic target. Note that `X` was reshaped to `(100, 1)` while the noise `e` still has shape `(100,)`, so broadcasting the two would produce a `(100, 100)` array; we check the shapes and reshape `e` before computing `Y`.
# In[ ]:
(linearCoef*X + 0.15*X**2 + Intercept + e).shape
# In[ ]:
X.shape
# In[ ]:
e.shape
# In[ ]:
e = e.reshape((len(e),1))
Y = linearCoef*X + 0.15*X**2 + Intercept + e
# In[ ]:
Y.shape
# Plot the dataset:
# In[ ]:
plt.plot(X, Y, 'o', color=(0.2,0.6,1.0))
plt.xlabel('Feature')
plt.ylabel('Target')
plt.show()
# Splitting the data to training and test sets, and creating a new feature matrix:
# In[ ]:
X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size=0.33)
features = np.zeros( (len(X_train),2) )
features[:,0] = X_train[:,0]
features[:,1] = X_train[:,0]**2
print(features.shape)
# Fitting the model:
# In[ ]:
model = LinearRegression()
model.fit( features , Y_train )
print("Coefs: ",model.coef_)
print("Intercept: ",model.intercept_)
# You can again calculate the prediction along the x-axis and plot it:
# In[ ]:
x_array = np.linspace(0,10,100)
y_array = (x_array * model.coef_[0,0] + x_array**2 * model.coef_[0,1]
           + model.intercept_)
plt.plot(X, Y, 'o', color=(0.2,0.6,1.0))
plt.plot(x_array, y_array, 'r-', linewidth=3.)
plt.xlabel('Feature')
plt.ylabel('Target')
plt.show()
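# Equivalently, instead of typing out the polynomial, we can build the same two-column feature array for `x_array` and let `model.predict` do the arithmetic (a small sketch; `x_array_feats` is just an illustrative name):
# In[ ]:
x_array_feats = np.zeros((len(x_array), 2))
x_array_feats[:, 0] = x_array
x_array_feats[:, 1] = x_array**2
y_array_pred = model.predict(x_array_feats)  # same curve as the manual formula above
plt.plot(X, Y, 'o', color=(0.2,0.6,1.0))
plt.plot(x_array, y_array_pred, 'r-', linewidth=3.)
plt.xlabel('Feature')
plt.ylabel('Target')
plt.show()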
# ## Regularization and Overfitting
#
#
# Let's explore a more interesting dataset.
# In[ ]:
data = np.loadtxt('Regression_Exercise_dataset.dat')
print(data.shape)
# In[ ]:
Y_origin = data[:,0] # all rows, first column
X_origin = data[:,1] # all rows, second column
X, X_test, Y, Y_test = train_test_split(
    X_origin, Y_origin, test_size=0.2,
    shuffle=True)
# In[ ]:
plt.plot(X, Y, 'o')
plt.savefig("regression_data.png", dpi=500)
plt.show()
# Let's store the average magnitude of the fitted coefficients for each polynomial degree in the following lists:
# In[ ]:
coefs = []
degrees = []
# In[ ]:
model = LinearRegression()
model.fit( X , Y )
# According to the error message above, the issue is with the shape of `X`.
# In[ ]:
print( X.shape )
# As we extensively discussed on Day 2, the features should be organized in an array with shape ```(Num Samples, Num Features)```. In this particular case, we have only a single feature, so ```Num Features``` = 1. This is why we update the shape of `X` using NumPy's `reshape` function:
# In[ ]:
X = X.reshape( (X.shape[0], 1 ))
# In[ ]:
print( X.shape )
# In[ ]:
model = LinearRegression()
model.fit( X , Y )
degrees.append(1) #One degree
coefs.append( np.abs(model.coef_).mean() )
x_array = np.linspace(0,1,100)
x_array = x_array.reshape((len(x_array),1))
print(x_array.shape)
y_array = model.predict(x_array)
plt.plot(X, Y, 'bo')
plt.plot(x_array, y_array, 'r-')
plt.savefig('regression_underfit.png', dpi=500)
plt.show()
# This does not fit our data well: __underfitting__.
# To use polynomials with higher degrees, we need an array that serves as our input features, having as many columns as the degrees in our polynomial.
#
# `numpy.c_` simplifies columnwise concatenation.
# In[ ]:
X_poly = np.c_[ X, X**2 ] # concatenating two columns
print( X_poly.shape )
# In[ ]:
Y.shape
# All we need to do is train the model on these features and plot the result:
# In[ ]:
model = LinearRegression()
model.fit( X_poly , Y )
degrees.append(2) #Second degree
coefs.append( np.abs(model.coef_).mean() )
x_array = np.linspace(0,1,100)
x_array_poly = np.c_[ x_array, x_array**2 ]
y_array = model.predict(x_array_poly)
plt.plot(X, Y, 'bo')
plt.plot(x_array, y_array, 'r-')
plt.show()
# Let's create a function that creates the 2D array of features for any degree:
# In[ ]:
def getPoly(myArray, degree):
    # Build a feature matrix whose j-th column is myArray**(j+1)
    result = np.zeros((myArray.shape[0], degree))
    for j in range(degree):
        result[:, j] = myArray.ravel()**(j+1)
    return result
# Let's try a fifth-degree polynomial. Adding terms by hand is not very efficient, but with the function above we can construct the ```X_poly``` array for a polynomial of arbitrary degree using the following snippet of code:
# In[ ]:
X_poly = getPoly(X,degree=5)
print(X_poly.shape)
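# As an aside, scikit-learn provides a transformer that builds this kind of matrix for us: `PolynomialFeatures` in `sklearn.preprocessing`. With `include_bias=False` it produces the columns $X, X^2, \dots, X^d$, the same layout as our `getPoly` (shown here only as an alternative sketch):
# In[ ]:
from sklearn.preprocessing import PolynomialFeatures
X_poly_alt = PolynomialFeatures(degree=5, include_bias=False).fit_transform(X)
print(X_poly_alt.shape)
print(np.allclose(X_poly_alt, X_poly))  # expected: True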
# Let's now evaluate and plot it:
# In[ ]:
d = 5
X_poly = getPoly(X,degree = d)
x_array_poly = getPoly(x_array,degree = d)
model = LinearRegression()
model.fit( X_poly , Y )
degrees.append(d) #d-th degree
coefs.append( np.abs(model.coef_).mean() )
y_array = model.predict(x_array_poly)
plt.plot(X, Y, 'bo')
plt.plot(x_array, y_array, 'r-')
plt.ylim(-1.6, 2.2)
plt.show()
# In[ ]:
d = 20
X_poly = getPoly(X,degree = d)
x_array_poly = getPoly(x_array,degree = d)
model = LinearRegression()
model.fit( X_poly , Y )
degrees.append(d) #d-th degree
coefs.append( np.abs(model.coef_).mean() )
y_array = model.predict(x_array_poly)
plt.plot(X, Y, 'bo')
plt.plot(x_array, y_array, 'r-')
plt.ylim(-1.6, 2.2)
plt.show()
# In[ ]:
print(degrees)
print(coefs)
# The 20-degree fit appears to be __overfitting__: we can easily see that the coefficients grow with the degree.
# In[ ]:
plt.loglog(degrees, coefs,'bo--', markersize=8)
plt.ylabel('Avg. coefficient')
plt.xlabel('Degree')
plt.show()
# ## Exploring regularization and its hyperparameter
# This form of overfitting is common with linear regressions.
#
# - The linear regression we have used so far minimizes the squared error:
# $$
# \min_\beta \|f_\beta(x) - y\|^2
# $$
#
# - To avoid uncontrolled increase in the model parameters, we can penalize large coefficients:
# $$
# \min_\beta \|f_\beta(x) - y\|^2 + \alpha \|\beta\|^2
# $$
#
# Adding the second term is called __regularization__; $\alpha$ is a newly introduced hyperparameter.
# In[ ]:
from sklearn.linear_model import Ridge
# We fit the 20-degree polynomial with ridge regression.
# In[ ]:
model = Ridge( alpha = 1.0 )
model.fit( X_poly , Y )
plt.plot(X, Y, 'o')
plt.plot(x_array, model.predict(x_array_poly), 'r-')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
# This is often the result of $\alpha$ being too large. Let's try a smaller $\alpha$.
# In[ ]:
model = Ridge( alpha = 0.001 )
model.fit( X_poly , Y )
plt.plot(X, Y, 'o')
plt.plot(x_array, model.predict(x_array_poly), 'r-')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
# - Another popular method of regularization is the [Lasso penalty](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html):
# $$
# \min_\beta \|f_\beta(x) - y\|^2 + \alpha \sum_{j}|\beta_j|.
# $$
#
# By using "L1-norm", we encourage sparsity of $\beta$: some of the $\beta_j$'s will be exactly zero. This works as
#
# __In-class exercise__: Let's apply the Lasso penalty to our polynomial regression example.
#
#
# In[ ]:
from sklearn.linear_model import Lasso
X_poly = getPoly(X, degree = 20)
model = Lasso(alpha=0.01)
model.fit(X_poly, Y)
# In[ ]:
plt.plot(X, Y,'o')
plt.plot(x_array, model.predict(x_array_poly), 'r-')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
# In[ ]:
model.coef_
# __Note__: It is often recommended to standardize the features before fitting a penalized model, e.g., with `StandardScaler` in `sklearn.preprocessing`, so that all coefficients are penalized on the same scale.
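# A minimal sketch of that idea, chaining `StandardScaler` and the Lasso with `make_pipeline` (the `alpha` value is only illustrative):
# In[ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
scaled_lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
scaled_lasso.fit(X_poly, Y)  # X_poly: the degree-20 features from above
print(scaled_lasso.named_steps['lasso'].coef_)  # coefficients on the standardized scale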
# __Homework__: One may split the dataset into training, validation, and test sets, or split it into a learning set and a test set and then use cross-validation on the learning set to choose the best $\alpha$, similar to what was done in yesterday's workshop. A sketch with `GridSearchCV` is shown below.
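# A sketch of the cross-validation approach with `GridSearchCV` (the grid of candidate $\alpha$ values is arbitrary here):
# In[ ]:
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': np.logspace(-4, 1, 12)}  # candidate penalty strengths
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_poly, Y)  # cross-validation happens within the learning set
print("Best alpha: ", search.best_params_['alpha'])
print("Test R-sq: ", search.score(getPoly(X_test, degree=20), Y_test))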
# We can also combine the two penalties to get better performance: this is called [Elastic Net](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html).
# # Unsupervised Learning
# So far, for both classification and regression, we have used labels for training to map features to those labels.
#
# In unsupervised learning, no labels are used. The main idea is to discover hidden patterns in the data and use them to generate descriptive statistics.
#
#

#
# ### Clustering
#
# Clustering is the task of grouping together examples with similar features.
#
#
# __$K$-means__ clustering: find $K$ clusters of points in which each observation belongs to the cluster with the nearest cluster center (which serves as a prototype of the cluster).
#
#
#

#
#
#
# __Hierarchical__ clustering: Seeks to build a hierarchy of clusters.
#
#

#
# ### K-means Clustering
#
# This method partitions the feature space into $K$ clusters based on the means of the assigned points (a minimal NumPy sketch of these steps is given after the algorithm):
#
# 1. Randomly initialize the cluster centers
# 2. Assign each point to the nearest cluster center
# 3. Update the positions of the centers
# 4. Repeat steps 2-3 until convergence
#
#
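# A minimal NumPy sketch of these update steps (for illustration only; below we use scikit-learn's `KMeans`, which also handles initialization and stopping criteria):
# In[ ]:
def simple_kmeans(points, K, numIters=10):
    # Step 1: pick K distinct data points as the initial centers
    centers = points[np.random.choice(len(points), K, replace=False)].astype(float)
    for _ in range(numIters):
        # Step 2: assign every point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of the points assigned to it
        for k in range(K):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return centers, labels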
# Let's explore the iris dataset using the K-means clustering algorithm. We first import the data using ```sklearn.datasets```.
# ### Iris dataset
#
# - Among the most famous datasets in machine learning.
# - Measurements from three different species of iris plant: Iris setosa, Iris versicolor, and Iris virginica.
# - Four features: Sepal and petal lengths and widths.
#
#
#
#
# In[ ]:
from sklearn.datasets import load_iris
# In[ ]:
iris_data = load_iris()
print( iris_data.data.shape )
# In[ ]:
iris_data.feature_names
# Before we move on, let's visualize the dataset. You can change the features being visualized below.
# In[ ]:
plt.plot( iris_data.data[:,1], iris_data.data[:,3], 'o' ) # Sepal width and Petal width.
plt.show()
# In[ ]:
plt.plot( iris_data.data[:,1], iris_data.data[:,3], 'o',
          color=(0.2,0.5,1.0), markersize=4 )
plt.xlabel('Sepal width (cm)')
plt.ylabel('Petal width (cm)')
plt.tight_layout()
plt.show()
# Next, let's create a KMeans model and apply to the iris dataset. Because we know there are three different species of plants in this dataset, let's make an educated guess and use $K=3$ (i.e., we will set the parameter ```n_clusters``` to 3).
#
# https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
# In[ ]:
from sklearn.cluster import KMeans
# In[ ]:
kmeans = KMeans( n_clusters=3 )
# In[ ]:
kmeans.fit( iris_data.data )
# After the K-means method is applied to the dataset, we can get the ID of the cluster to which each sample is predicted to belong by using the method ```predict```.
# In[ ]:
clusters = kmeans.predict( iris_data.data )
print( "Shape: ", clusters.shape )
print( "Cluster IDs: ", clusters )
# Let's visualize the resulting clustering by color-coding each cluster ID. This is very similar to how we color-coded different classes in classification datasets.
# In[ ]:
index0 = clusters == 0
index1 = clusters == 1
index2 = clusters == 2
# In[ ]:
plt.plot( iris_data.data[index0,1], iris_data.data[index0,3],
          'o', color='b', markersize=4 )
plt.plot( iris_data.data[index1,1], iris_data.data[index1,3],
          'o', color='k', markersize=4 )
plt.plot( iris_data.data[index2,1], iris_data.data[index2,3],
          'o', color='r', markersize=4 )
plt.xlabel('Sepal width (cm)')
plt.ylabel('Petal width (cm)')
plt.show()
# Let's try $K=4$ and inspect what comes out differently.
# In[ ]:
kmeans = KMeans( n_clusters=4 )
kmeans.fit( iris_data.data )
clusters = kmeans.predict( iris_data.data )
index0 = clusters == 0
index1 = clusters == 1
index2 = clusters == 2
index3 = clusters == 3
plt.plot( iris_data.data[index0,1], iris_data.data[index0,3],
          'o', color='b', markersize=4 )
plt.plot( iris_data.data[index1,1], iris_data.data[index1,3],
          'o', color='k', markersize=4 )
plt.plot( iris_data.data[index2,1], iris_data.data[index2,3],
          'o', color='r', markersize=4 )
plt.plot( iris_data.data[index3,1], iris_data.data[index3,3],
          'o', color='g', markersize=4 )
plt.xlabel('Sepal width (cm)')
plt.ylabel('Petal width (cm)')
plt.show()
# - Because K-means assumes the number of clusters is known, there is some freedom in choosing it: $K$ can be treated as a hyperparameter (see the sketch below).
# - There are methodologies that do not assume the number of clusters is known (e.g., affinity propagation, density-based clustering, etc.)
# - K-means is one of the simplest clustering algorithms and might not be ideal for your application. However, some of its variants (e.g., K-medians, K-means++, etc.) are still widely used.
# - When applying clustering, remember that it is different from classification: if what you need is to predict categories associated with samples, use supervised learning if possible.
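# One common heuristic for choosing $K$ is to plot the within-cluster sum of squared distances (the `inertia_` attribute of a fitted `KMeans`) against $K$ and look for an "elbow". A minimal sketch:
# In[ ]:
inertias = []
K_values = list(range(1, 9))
for k in K_values:
    km = KMeans(n_clusters=k)
    km.fit(iris_data.data)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances
plt.plot(K_values, inertias, 'o--')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.show()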
# ## Principal Component Analysis
# PCA falls under the umbrella of unsupervised learning methods.
#
# - __PCA finds a set of new features that are uncorrelated among themselves__.
# - This is interesting because __correlated features overlap in the information they carry__.
# - The new features are linear combinations of the original features that maximally describe the variation in the data.
# ### A little more detail on PCA
#
# __Covariance__: a measure of joint variability of two random variables:
#
# $$
# Cov(X, Y) = E[XY] - E[X] E[Y]
# $$
#
# - If the two variables are independent, covariance between them is zero.
# - For instance, the number of people riding the New York subway and the probability of rain in New York have a high positive covariance
#
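# As a quick numerical sanity check of this formula (with made-up data; `bias=True` makes `np.cov` use the same $1/N$ normalization as the sample means):
# In[ ]:
rng = np.random.default_rng(0)
x = rng.standard_normal(10000)
y = 2.0*x + rng.standard_normal(10000)  # y is strongly correlated with x
cov_formula = (x*y).mean() - x.mean()*y.mean()  # E[XY] - E[X]E[Y]
print(cov_formula, np.cov(x, y, bias=True)[0, 1])  # the two numbers should agree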
# __Covariance Matrix__: The matrix of covariances between each pair of variables in a set. It is always symmetric and positive semi-definite.
#
# __Change of basis__
#
#
#
#
# In the new basis, the covariance matrix is diagonal: the new basis vectors are orthogonal to each other, i.e., the new features are uncorrelated.
#
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
# Let's see PCA in action by creating three variables with correlation:
# In[ ]:
from sklearn.decomposition import PCA
# In[ ]:
# 200 samples with 3 features
X = np.random.random( (200, 3) )
X[:,2] = X[:,0] # The first and last features are perfectly correlated
# NumPy's `cov` function assumes each row is one variable (feature), so we transpose `X`.
#
# In[ ]:
# \n breaks the line
print("Covariance matrix:\n ", np.cov(X.T))
# In[ ]:
# np.cov expects the input to be a (number of features) x (number of samples) matrix.
plt.matshow(np.cov(X.T))
plt.show()
# ### Applying the PCA transformation
# In[ ]:
pca = PCA()
pca.fit(X)
# Proportions of variance explained by each principal component. Because two of the features were perfectly correlated, we expect the first two PCs to explain essentially all of the variance.
# In[ ]:
pca.explained_variance_ratio_
# In[ ]:
pca.components_ # Each row is a principal component (shape: n_components x n_features)
# The PC with the highest variance combines variables 0 and 2, while the second PC mostly involves variable 1.
# The transformation computes the coordinates of each sample in the new basis:
# In[ ]:
X_transform = pca.transform(X)
print(np.cov(X_transform.T))
plt.matshow(np.cov(X_transform.T))
plt.savefig('CovMatrix_example_PCA.png', dpi=300)
plt.show()
# There are no off-diagonal elements, and the last component has almost zero variance.
# ## Using PCA on the Breast Cancer dataset
# Let's apply PCA to the Breast cancer dataset from Day 1:
# In[ ]:
from sklearn.datasets import load_breast_cancer
bcancer = load_breast_cancer()
# We split the data into training and test datasets.
# In[ ]:
X, X_test, Y, Y_test = train_test_split(
    bcancer.data, bcancer.target, test_size=0.2,
    shuffle=False)
# In[ ]:
pca = PCA()
pca.fit( X )
# Let's visualize the importance of each Principal Component.
# In[ ]:
plt.plot( pca.explained_variance_ratio_, 'o--' )
plt.ylabel( 'Explained Variance' )
plt.xlabel( 'Principal Component #' )
plt.show()
# A better way to visualize the decay is using log-scale:
# In[ ]:
plt.plot( pca.explained_variance_ratio_, 'o--' )
plt.ylabel( 'Explained Variance' )
plt.yscale('log')
plt.xlabel( 'Principal Component #' )
plt.show()
# In[ ]:
pca.explained_variance_ratio_
# For __dimensionality reduction__, we can keep only the first two principal components. We can then apply any of the machine learning methods covered in this workshop to the reduced-dimension data:
# In[ ]:
X_PCAs = pca.transform( X )
index0 = (Y == 0)
index1 = (Y == 1)
plt.plot( X_PCAs[index0,0],
          X_PCAs[index0,1], 's', color='r' ) # malignant
plt.plot( X_PCAs[index1,0],
          X_PCAs[index1,1], 'o', color='b' ) # benign
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()
# We can fit a random forest classifier on the transformed two-dimensional features:
# In[ ]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100, max_depth = 5)
clf.fit(X_PCAs[:, :2],Y)
X_test_PCAs = pca.transform( X_test )
clf.score(X_test_PCAs[:,:2],Y_test)
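# Equivalently, a small sketch that asks `PCA` for only two components up front and chains the steps with `make_pipeline` (same hyperparameters as above):
# In[ ]:
from sklearn.pipeline import make_pipeline
pca_clf = make_pipeline(PCA(n_components=2),
                        RandomForestClassifier(n_estimators=100, max_depth=5))
pca_clf.fit(X, Y)
print(pca_clf.score(X_test, Y_test))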
# We still get a decent result with only two features. On a high-dimensional dataset such as gene expression data, where the number of features is much larger than the number of samples, this kind of dimensionality reduction often improves the results.
# ## Workshop evaluation:
# (shown in the classroom)
#
#
#
# ## Concluding Remarks
#
# - We extensively explored __Jupyter Notebook__
# - We explored useful functionalities from __NumPy__ and __Matplotlib__ libraries
# - We studied __supervised learning__ and its two fundamental sub-fields: __classification__ and __regression__.
# - We investigated __Scikit-learn__'s structure and discussed in detail how to use its documentation pages.
# - At the end, we briefly discussed __unsupervised learning__.
#
#
# One prominent topic we did not cover in this workshop is neural networks. Due to recent advancements in hardware, neural networks with many layers have been massively successful in many fields, including image classification and segmentation. I believe the topic deserves a separate 3-day workshop with higher hardware demands (i.e., access to GPUs).
#
# ### Some related books:
#
# - Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow - Aurélien Géron (2nd ed.): takes a similar approach to this workshop, with additional material on deep learning with Keras and TensorFlow
#
# 
#
# - Pattern Recognition and Machine Learning - Christopher M. Bishop: a more mathematical approach to machine learning
#
# 
#
# - An Introduction to Statistical Learning - Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani: an introductory-level book for readers with some background in math and statistics
#
# 
#
# - Machine Learning: A Probabilistic Perspective - Kevin Murphy: more mathematically advanced material, with accompanying software available online
#
# 
#
# - Deep Learning - Ian Goodfellow, Yoshua Bengio, and Aaron Courville
#
# 
# The goals of this workshop were
# - to provide the basic knowledge needed to become independent, and
# - to help you extend that knowledge using online resources.
#
# I hope this workshop helped in this direction.
#
#

#
#
#
# # Thank you!
#
# In[ ]: