#!/usr/bin/env python
# coding: utf-8

# # Machine Learning with Python
# 
# Seyoon Ko
# 
# #### Collaboratory workshop
# 
# 
# This is a notebook for the first day of the Collaboratory Workshop, _Machine Learning with Python_. For more information, go to the workshop home page:
# 
# https://github.com/kose-y/W17.MachineLearning/wiki/Day-1
# 
# 
# #### Credits
# 
# Materials based on slides created by Thiago S Mosqueiro and Renaud Dessalles.
# 
# 

# 
# ### What is Machine learning?
# 
# <div>
# <img src="https://live.staticflickr.com/1816/30212411048_2a1d7200e2_b.jpg" width="400"/>
# </div>
# 
# 
# > field of study that gives computers the ability to learn without being explicitly programmed
# - Arthur Samuel, 1959
# 
# The computer learns a *model* from the data.  
# - a framework trained from data to determine the output when input data are given.
# 
# 
# The human only defines:
# 
# - _data_ that the computer learns from
# - _algorithm_ - the procedure that is run on the data to create a model 
#     - (penalized) linear regression, decision tree, support vector machine, deep neural networks, etc. 
# - _hyperparameters (optional)_ - external configuration for each algorithm. e.g., how much should the computer try to simplify the model? 
# 

# ### Goal of this Workshop
# 
# - Provide basic information to become independent
# - Increase knowledge with online resources
# 
# <div>
# <img src="../GoalFigure.png" width="600"/>
# </div>
# 

# ### Workshop Overview
# - __Day 1__ - Fundamentals and Motivation
#     - Overview of machine learning
#     - Nano-review of Python
#     - Introduction to Jupyter Notebook/JupyterLab
#     - Introduction to Numpy and Matplotlib
#     - Introducing Scikit-Learn’s structure
# - __Day 2__ - Classification and Cross-Validation
#     - Assessing the performance of your models
#     - Learning to choose and use models from Scikit-learn
#     - K-fold cross-validation
# - __Day 3__ - Regression and Unsupervised Learning
#     - Linear and nonlinear regressions
#     - Unsupervised learning
#     - Conclusions
# 
# 
# - __Homework__ - simple data analysis tasks given throughout Days 2 and 3 -- how it works will be described later
# - __Quiz__ - Multiple choice questions given at the end of the workshop
# 
# __Homework and Quiz are required for students taking this workshop to earn UCLA credits__

# ## Jupyter Notebook/JupyterLab
# 
# The application you are looking at right now. 
# 
# 
# <div>
# <img src="https://jupyter.org/assets/homepage/main-logo.svg" width="300"/>
# </div>
# 
# - Open-source
# - Create and share documents
# - Contains live code and formatted text ("literate programming")
# - Similar applications
#     - R: R Markdown Notebook
#     - Matlab: Matlab Live Script
# - JupyterLab is the next-generation interface for Jupyter notebooks, designed to be more user-friendly.
#     
# | | Python Console (REPL) | Jupyter Notebook/JupyterLab |
# |:---|:---|:---|
# | Pros | Available by default<br>Accessible by SSH<br>Rapidly available | Sharable<br>Can plot and keep figures<br>User friendly |
# | Cons | Not sharable<br>Cannot contain figures<br>Not ergonomic | Needs to be installed<br>Not suitable for large projects |
# 
# 
# - Jupyter Notebook is great for short projects (teaching, data analysis, etc.)
# - The name Jupyter stands for Julia, Python, and R: one may install kernels for languages other than Python, including Julia and R.
# - Jupyter Notebook is included in the Anaconda distribution of Python
# 

# ### Let's get started: 
# #### Option 1: Local installation
# - Open Anaconda Navigator.
#     - Mac: Open Spotlight Search (Command + Space bar) and search "Anaconda Navigator"
#     - Windows: Look for Anaconda in your Start menu. 
# - Launch JupyterLab. 
# 
# #### Option 2: Using Binder 
# If local installation fails, we can run the notebook via a cloud service. The link is provided on the course website. 
# 

# Once launched, the Jupyter Dashboard opens in a web browser. 
# <!--Why web page? It's easier to maintain, compared to creating a separate app for each OS.-->
# - Files and folders under your home directory are displayed. 
# - You can create folders and new notebooks.
# 
# In a notebook, you can write and edit "cells" of code, or cells of Markdown-formatted text. 
# - See https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet for a Markdown syntax cheat sheet.

# When you open a notebook, you will see the notebook interface.
# 
# You can run code in a cell using the "Run" button in the toolbar or by pressing `Shift + Enter` on the keyboard.
# 

# In[ ]:


x = 5*3 + 2 


# If the last line of the cell assigns something (`x = ...`), nothing is displayed. Otherwise, the result of the last line is displayed.

# In[ ]:


y = 7*x**2 + 0.6*x
z = 1/x # even if `x` is an integer, the result is floating-point
y*z


# Another way to display information is the `print()` function:

# In[ ]:


print(y * z)


# In[ ]:


x = range(10) # assign x the range 0, 1, ..., 9 (10 itself is excluded)
for element in x:
    print("Element", element)
print(x)


# 
# What happens if we execute the second code cell again after running the fourth code cell?
# 
#     ---------------------------------------------------------------------------
#     TypeError                                 Traceback (most recent call last)
#     <ipython-input-5-54fa71c9de4e> in <module>
#     ----> 1 y = 7*x**2 + 0.6*x
#           2 z = 1/x
#           3 y*z
# 
#     TypeError: unsupported operand type(s) for ** or pow(): 'range' and 'int'
#     
#     
# __CAUTION: Execution order is important.__

# __Kernel__: There is a "hidden console" behind each running notebook. 
# 
# When you run a code cell, that code is executed within the kernel. Any output is returned back to the cell to be displayed. The kernel’s state persists over time and between cells — it pertains to the document as a whole and not individual cells.

# After running the fourth code cell, `x` is no longer a number; it is a range. So if we go back to the second code cell, we get an error.
# 
# __Whenever you go back, make sure to run all the relevant cells again.__
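# 
# The failure above can be reproduced inside a single cell. The following is a minimal sketch that reuses the expressions from the earlier cells; the `try`/`except` wrapper and the name `error_message` are additions for illustration, so that the cell finishes instead of stopping at the error.

# In[ ]:


x = 5*3 + 2                  # x is an integer, as in the first code cell
y = 7*x**2 + 0.6*x           # works: arithmetic on numbers

x = range(10)                # x is rebound to a range, as in the fourth code cell
try:
    y = 7*x**2 + 0.6*x       # the same line now fails
except TypeError as err:
    error_message = str(err) # keep the message instead of halting the cell

print(error_message)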

# ### Restarting a notebook
# 
# Use `Kernel > Restart` to restart the kernel behind a notebook. This is similar to exiting and reopening the Python console: all variables are cleared.
# 
# - `Restart & Clear Output`: restart and remove printed outputs from the notebook
# - `Restart & Run All`: restart the kernel and rerun all the cells, sequentially from top to bottom. 
# - `Interrupt`: stop executing the cell currently running. Similar to pressing `Ctrl + C` in the console. Useful when the kernel hangs or is trapped in an infinite loop. 

# ### Adding a text comment
# 
# You can insert a cell using the "+" button on the toolbar. 
# 
# You can change the type of the cell by clicking the drop-down menu on the toolbar between `Code`, `Markdown`, etc. 
# 
# This is some text in _a_ markdown cell. We __can__ even `use` formulas (TeX-formatted):
# $$\theta = \frac{[L]^n}{K_d+[L]^n}$$

# markdown cell. 

# 

# ### Some Shortcuts 
# * It is useful to know some keyboard shortcuts. The ones I use most often (press `Esc` first so that single-letter shortcuts are not typed into the cell):
#     * `Shift + Enter`: execute current cell. 
#     * `b`: create a cell below current cell.
#     * `a`: create a cell above current cell.  
#     * `y`: change cell to code.  
#     * `m`: change cell to Markdown.  
#     Check more shortcuts in menu `Help` -> `Keyboard Shortcuts`.
# 
# 
# * Notebooks can be **converted to other formats** such as HTML, LaTeX, Markdown, Python code, and many others, via the menu `File` -> `Export Notebook As...`. For your homework, you will be asked to submit both the notebook (`.ipynb`) and its HTML export (`.html`).
# 

# In[ ]:


# ### Saving your notebook
# 
# Jupyter Notebook saves itself automatically at regular intervals. You can also save manually by pressing `Ctrl+S` (Windows) or `Cmd+S` (Mac).
# 
# _If you are on Binder, your workspace will be destroyed after the virtual machine turns off. If you want to keep the notebook you edited, you will need to download it to your local machine (`File` > `Download as` > `Notebook (.ipynb)`)._

# ### Properly close your notebook
# Just closing the window does not shut down the running kernel, which keeps using memory until it is closed properly. On the Dashboard, a green dot indicates that the kernel is still running. 
# 
# Use `File > Close and Shutdown Notebook` to close the notebook properly. The dot next to the icon disappears.

# ### Re-opening a notebook
# 
# You may click on a closed notebook to open it again. The output will still be there. _However, the kernel starts from scratch, with no cells executed so far._
# 
# Use `Kernel > Restart Kernel and Run All Cells...` to run all the cells again. 

# ### Examples of cool notebooks
# 
# Provided through [course webpage](https://github.com/kose-y/W17.MachineLearning/wiki/Day-1).  

# See also: 
# 
# https://www.dataquest.io/blog/jupyter-notebook-tutorial/

# # Important Libraries in Machine Learning
# 
# ## Libraries in Python
# 
# Libraries are an important part of how Python works: each application area has its own libraries.
# 
# | Package | Description | Logo |
# |:---:|:---|:---:|
# | __Numpy__ | Numerical arrays | <img src="https://numpy.org/doc/stable/_static/numpylogo.svg" width="300"/> |
# | Scipy | Scientific Python<br>User-friendly and efficient numerical routines:<br> numerical integration, interpolation, optimization, linear algebra, and statistics | <img src="https://scipy.org/images/logo.svg" width="200"/> |
# | __Matplotlib__ | Plotting | <img src="https://matplotlib.org/stable/_static/logo_light.svg" width="300"/>|
# | Pandas | Data analytics <br> R-like data frames | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1024px-Pandas_logo.svg.png" width="300"/> |
# | __Scikit-learn__ | Machine learning | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/2880px-Scikit_learn_logo_small.svg.png" width="300"/> | 
# 
# All the packages above are included in the Anaconda distribution. Packages used in this workshop are in bold.
# 
# 
# 
# Next, we will explore two key libraries used not only for machine learning, but also for almost any quantitative project written in Python. NumPy is a library for optimized array and matrix operations, while Matplotlib allows you to create virtually any visualization (2-D plots, scatter plots, bar plots, etc.).
# 
# Let's start with Numpy.

# ## NumPy
# 
# In standard Python, the basic data collection is the `list`. 
# 
# | Advantages | Disadvantages (for scientists) |
# |:---|:---|
# |- Can contain different types of objects<br>- Easy insertion<br>- Easy concatenation |- Sum of two lists is concatenation, not vector addition<br>- Slow for large lists<br>- Few built-in numerical functions (no mean, variance, etc.)|
# 
# Scientists would like an object closer to the notion of a vector or matrix. NumPy implements that.
# As a result, nearly all other scientific libraries in Python are built on NumPy (the so-called "NumPy ecosystem").
# 
# 
# NumPy is a library for large multidimensional arrays and matrices, and for mathematical operations on them. 

# To use numpy in your code, simply use the import statement in the next cell:

# In[ ]:


import numpy


# To create a NumPy array, use `variable = numpy.array(Array_Like)`. 

# In[ ]:


a = [1,2,3,4,5]
print( type(a) )   # displays the type of the variable a

b = numpy.array( [1,2,3,4,5] )
print( type(b) )   # displays the type of the variable b


# We are going to use `numpy` a lot, and typing five letters each time gets tedious. It is extremely common to import it under the shorter name `np` (just a nickname).

# In[ ]:


import numpy as np  # allows user to use np instead of numpy

c = np.array( [5,4,3,2,1] )
print( type(c) )    # displays the type of the variable c


# Using mathematical operations on lists vs. numpy arrays:

# In[ ]:


print("Sum of lists: ", a + a)


# In[ ]:


print("Sum of NumPy arrays: ", b + c )


# In[ ]:


print("Original array: ", b)


# In[ ]:


print("Adding a constant: ", b + 7.1 )


# In[ ]:


print("Multiplying by a constant: ", b*2.5 )


# In[ ]:


print("Exponentiation: ", b ** 3 )


# In[ ]:


print("Element-wise product: ", b * c )
print("Element-wise division: ", b / c )
print("Element-wise modulo: ", b % c )


# In[ ]:


# In[ ]:


print("Elementwise exponential: ", np.exp(b) )
print("Elementwise sine: ", np.sin(b) )
print("Elementwise cosine: ", np.cos(b) )


# Selecting elements and sets of elements from your NumPy array (same as for lists):

# __Python uses zero-based indexing, unlike the one-based indexing of R, Matlab, Fortran, Julia, etc. The first item in a collection (list, NumPy array, etc.) is indexed with the number zero. In this workshop, "first, second, third, ..." refers to one-based counting, and "index n, index number n, number n, element n, position n, ..." all mean zero-based Python indexing.__ 

# In[ ]:


print("First element in b: ", b[0] )


# In[ ]:


print("Second element in b: ", b[1] )


# In[ ]:


print("Last element in b: ", b[-1] )


# In[ ]:


print("All elements in b: ", b[:] ) 


# In[ ]:


print("Elements 1, 2, and 3: ", b[1:4] ) # last index is excluded!


# In[ ]:


print("From 0 to end in steps of 2, except for the last one: ", b[0:-1:2] )


# In[ ]:


print("From 0 to end in steps of 2: ", b[0::2] )
print("Reverse b: ", b[::-1] )


# ### Attributes and Methods
# 
# NumPy arrays are objects: they have __attributes__ and __methods__.
# - __Attributes__: Some properties like shape, length, number of dimensions, etc. __syntax: `array.attribute`__
# - __Methods__: "Actions" (or functions) applied to or performed on the object. __syntax: `array.method(arg1, arg2, etc.)`__

# In[ ]:


b.size # attribute: length, number of elements


# In[ ]:


b.ndim # attribute: number of dimensions 


# In[ ]:


b.shape # attribute: tuple for size of each dimension


# In[ ]:


b.sum() # method: sum of elements


# In[ ]:


b.prod() # method: product of elements


# For a full list of the attributes and methods of a NumPy array (`numpy.ndarray`), see the online documentation: 
# https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
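# 
# As a short illustration of methods that plain lists lack, the sketch below computes a few summary statistics. The array `scores` and its values are made up for this example.

# In[ ]:


import numpy as np

scores = np.array([1, 2, 3, 4, 5])            # hypothetical data

print("Mean:", scores.mean())                 # method: arithmetic mean
print("Variance:", scores.var())              # method: population variance (divides by n)
print("Maximum:", scores.max())               # method: largest element
print("Index of maximum:", scores.argmax())   # method: zero-based position of the maximum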

# ### Other ways to create an array
# 
# - Function `np.arange(start, stop, step)` ("array range"; the arguments can be floats, and `stop` itself is excluded)

# In[ ]:


np.arange(1,5,0.5)


# - Function `np.linspace(start, stop, num)`: `num` evenly spaced values from `start` to `stop`, with both end points included.

# In[ ]:


#np.linspace(0,10,4) 

np.linspace(2, 8, 7).astype('int')
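

# The two functions treat the end point differently, which is easy to overlook. A minimal sketch with illustrative values (the names `stepped` and `spaced` are chosen only for this example):

# In[ ]:


import numpy as np

stepped = np.arange(0, 1, 0.25)   # third argument is the step; the stop value 1 is excluded
spaced = np.linspace(0, 1, 5)     # third argument is the number of points; the stop value 1 is included

print(stepped)
print(spaced)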


# _Hint: you can press tab key for autocompletion_

# - Functions `np.ones(size)` and `np.zeros(size)`: an array filled with ones or zeros.

# In[ ]:


np.ones(5)


# In[ ]:


np.zeros(5)


# ### Two (or higher)-dimensional arrays
# 
# Two dimensional arrays can be created using nested lists (or list of lists). 

# In[ ]:


A = np.array([[1,2,3,4],[10,20,30,40],[100,200,300,400]])
print("The array A:\n",A)
print("Shape of A:",A.shape)
print("Number of dimensions of A:",A.ndim)


# One may also create a higher-dimensional array, for example by reshaping:

# In[ ]:


B = A.reshape(3, 2, 2)
print("The array B:\n",B)
print("Shape:",B.shape)
print("Ndim:",B.ndim)


# In[ ]:


B[0, :, :]


# In[ ]:


B = np.arange(10)
print("The array B:\n",B)
print("Shape:",B.shape)
print("Ndim:",B.ndim)


# In[ ]:


B = B.reshape((10,1))
print("The array B:\n",B)
print("Shape:",B.shape)
print("Ndim:",B.ndim)


# ### Indexing multidimensional arrays
# 
# - To access an element at position (i, j): `array[i, j]`
# - You can select an entire row or column using a colon: `array[i, :]` for a row, `array[:, j]` for a column. 
# - You can also use a slice: e.g., `array[0:2, :]`. 

# In[ ]:


print("Array A:\n",A)


# In[ ]:


print("Element at pos (1,2):",A[1,2],'\n') 


# `A[1][2]` is also possible, just as a nested list is indexed, but it is less efficient. `A[1]` first creates an intermediate array object for row 1 of `A` (a view, not a copy of the data), and index 2 of that array is then accessed in a second step. `A[1, 2]` directly accesses the value at the desired position in a single step.
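# 
# Both forms return the same value, which can be checked directly. The sketch below uses a small made-up array `M` so that `A` is left untouched. With basic integer indexing, `M[1]` is a view on row 1 rather than an independent copy, and `np.shares_memory` reports whether two arrays refer to the same underlying data.

# In[ ]:


import numpy as np

M = np.array([[1, 2, 3], [10, 20, 30]])   # hypothetical 2 x 3 array

row = M[1]                        # intermediate one-dimensional array for row 1
print(row[2])                     # chained indexing: two steps
print(M[1, 2])                    # direct indexing: one step, same value
print(np.shares_memory(row, M))   # whether `row` refers to the data of `M`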

# In[ ]:


print("Row #2 of A:\n",A[2,:],'\n')


# In[ ]:


print("Column #1 of A:\n",A[:,1],'\n')


# __Exercise__: How would you reverse the order of columns in matrix `A`?

# In[ ]:


# The basic syntax for slicing is `start:end:step`. Each part can be omitted; for a positive step the defaults are `start=0`, `end=len(l)`, and `step=1`.

# ### Accessing elements by logical tests

# Let's recall how to perform _logical tests_ in Python.

# In[ ]:


A


# In[ ]:


print( A > 25 )


# We can select elements based on the result of a logical test:

# In[ ]:


print( A[A > 25] )
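

# A boolean mask can also appear on the left-hand side of an assignment, which changes only the selected elements. A minimal sketch, with made-up values in `readings`:

# In[ ]:


import numpy as np

readings = np.array([3.2, -0.5, 4.1, -2.0, 0.7])   # hypothetical measurements

mask = readings < 0        # boolean array: True where the value is negative
readings[mask] = 0.0       # assign to the selected elements only

print(mask)
print(readings)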


# In[ ]:


# __Exercise__:  Replace all odd numbers in `arr` with -1.
# 
# _Hint: you can assign values to a subsetted array._ 
# 
# <!--arr[arr % 2 == 1] = -1-->

# In[ ]:


arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


# In[ ]:


# In[ ]:


arr


# In[ ]:


arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


# Replace all even numbers in `arr` with -2.

# In[ ]:


# In[ ]:


arr


# ## Matplotlib

# Scientists draw graphs all the time... and with the power of Matplotlib, one can even do this.
# 
# Video from:
# https://dx.doi.org/10.1038%2Fs41467-017-01741-8 

# In[ ]:


# This video is not included in the repo
from IPython.display import Video

Video("41467_2017_1741_MOESM3_ESM.mp4", embed=True, width=600)


# We will explore the basics of Matplotlib to create simple plots through `pyplot` subpackage of `matplotlib`. 
# 
# Basic usage: `plt.plot(y)` to plot an array, and `plt.show()` to display the figure.  

# In[ ]:


import matplotlib.pyplot as plt


# In[ ]:


y = np.array([0,10,3,4,2])

plt.plot(y)
plt.show()


# ### Let's plot something complicated

# Matplotlib is a powerful tool
# - In a few lines, you can get really nice results. 
# 
# To plot simple sinusoidal curves:

# In[ ]:


x = np.arange(15)
y1 = np.sin(0.5*x)
y2 = np.cos(0.5*x)
plt.plot(x, y1)
plt.plot(x, y2)
plt.show()


# Let's decorate it further. The `plot()` function has plenty of keyword arguments to change its appearance. 

# In[ ]:


plt.plot(x, y1, 'o--', markersize=10, linewidth=1.2, # circular markers, dashed line 
            color='r', label='Sine')                 # 'r' for red.
plt.plot(x, y2, 's--', markersize=10, linewidth=1.2, # square markers, dashed line
            color='k', label='Cosine')               # 'k' for black.
plt.xlabel('Time (s)')
plt.ylabel('Fluorescence (a.u.)')

plt.yticks([-1.0,-0.5,0.0,0.5,1.0])

plt.legend(frameon=False) # legend appears without rectangular frame
plt.show()


# To export a plot, here is the usual structure:
# 
# ```python
# f = plt.figure( figsize=(5,3) )
# 
# ...
# plotting...
# ...
# 
# plt.tight_layout()
# plt.savefig('Fig1.png', dpi=300)
# ```
# 
# Next, we plot the same functions as in the last cell, but save the figure as a PNG file.

# In[ ]:


f = plt.figure( figsize=(5,3) )

x = np.arange(15)
y1 = np.sin(0.5*x)
y2 = np.cos(0.5*x)

plt.plot(x, y1, 'o--', markersize=10, linewidth=1.2, 
            color='r', label='Sine')
plt.plot(x, y2, 's--', markersize=10, linewidth=1.2, 
            color='k', label='Cosine')

plt.xlabel('Time (s)')
plt.ylabel('Fluorescence (a.u.)')

plt.yticks([-1.0,-0.5,0.0,0.5,1.0])

plt.legend(frameon=False) # legend appears without rectangular frame
plt.tight_layout() # adjust the spacing so that labels are not cut off in the saved figure

# instead of plt.show(). 
plt.savefig('Fig1.png', dpi=300) # dpi: dots per inch 
plt.close()

print("Figure saved.")


# ### Matplotlib gallery
# 
# One can create wonderfully complicated graphs with Matplotlib. 
# 
# - Check the gallery on their website: https://matplotlib.org/stable/gallery/index.html
# - Click on any example to see its code. 

# ## Overview of Machine Learning

# ### Types of learning
# 
# - Supervised learning - with pairs of example inputs and their desired outputs (*labels*)
#     - Classification: e.g., decision tree, $k$-nearest neighbor classifier, naive Bayes
#     - Regression: e.g., linear regression 
# - Unsupervised learning - with unlabeled data
#     - Clustering: e.g., $k$-means, density-based clustering
# - Semi-supervised learning, reinforcement learning, etc.
# 
# In this workshop, we focus heavily on supervised learning. 

# ### Supervised learning
# 
# Requires examples (the __training dataset__) from which the model learns to reproduce the patterns. 
# 
# <img src="supervised_learning.png" width="600"/>
# 
# 
# - Classification: e.g., Create a model that identifies cells by their shape  
# <img src="https://upload.wikimedia.org/wikipedia/commons/2/20/Svm_separating_hyperplanes.png" width="300"/>
# 
# - Regression: e.g., Create a function that estimates the cell volume 
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/2560px-Linear_regression.svg.png" width="300"/>

# ### Supervised learning: Classification
# 
# A task of identifying the category each input belongs to (categories predefined)
# 
# Example: 
# 
# - identify the type of fruit based on its width and height, e.g. lemon vs. orange
# - identify the type of cancer cell based on its shape e.g. benign cell vs. malignant cell
# 
# 
# - *target*: category we want to identify
# - to do so, we look at *features*: individual measurable properties of objects studied
#     - width, height, etc. of a fruit or cell.
#     
#     
# Procedure
# 
# - First, the model learns from the examples how to discriminate between classes
# - Once the model is trained, we use it to make predictions
# 
# ### Supervised learning: Regression
# 
# Estimates the value of a continuous variable
# 
# Example:
# 
# - Estimate cell volume given length, height, and smoothness
# 
# - *target*: desired continuous value 

# ### Model validation
# 
# You are trying to perform a regression using these data... Which model would you use?
# 
# <div>
# <img src="regression_data.png" width="500"/>
# </div>
# 
# 
# 
# #### Ideal solution
# 
# <div>
# <img src="regression_ideal.png" width="500"/>
# </div>
# 
# #### Underfitting
# 
# <div>
# <img src="regression_underfit.png" width="500"/>
# </div>
# $$y = \beta_0 + \beta_1 X$$
# - The model does not take full advantage of data
# - Something is missing
# 
# #### Overfitting
# <div>
# <img src="regression_overfit.png" width="500"/>
# </div>
# $$y = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_{20} X^{20}$$
# - The model is more complex than the data require 
# 
# #### Generalization
# 
# To avoid overfitting and underfitting, try to __generalize__ the results.
# 
# Say, some new data are collected. 
# An overfitted model describes the training data perfectly, but it fails on the new data. 
# 
# <div>
# <img src="regression_overfit_test.png" width="500"/>
# </div>
# 
# An ideal solution maintains reasonable performance on the new data. 
# 
# <div>
# <img src="regression_ideal_test.png" width="500"/>
# </div>
# 
# #### Overfitting and Underfitting
# 
# - __Underfitting__: when a model cannot adequately capture the underlying structure of the data
#     - Solution: Choose another model, something with higher complexity
# - __Overfitting__: when a model reproduces nearly perfectly a particular set of data, and likely fails to explain new data points or predict future observations
#     - Solution: Model selection and cross validation
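# 
# The trade-off can be seen numerically with polynomial fits of increasing degree. The sketch below is only an illustration under stated assumptions: the data are simulated from a quadratic curve plus noise, the degrees 1, 2, and 10 are arbitrary choices, and `np.polyfit`/`np.polyval` stand in for whichever regression model is actually used.

# In[ ]:


import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the run is repeatable

def simulate(n):
    """Draw n noisy points from a quadratic curve (an assumed ground truth)."""
    x_sim = rng.uniform(-1, 1, n)
    y_sim = 1.0 + 2.0*x_sim - 3.0*x_sim**2 + rng.normal(0, 0.3, n)
    return x_sim, y_sim

x_train, y_train = simulate(20)            # data used for fitting
x_new, y_new = simulate(200)               # "new data" collected afterwards

train_mse = {}                             # mean squared error on the training data
new_mse = {}                               # mean squared error on the new data
for degree in [1, 2, 10]:
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse[degree] = np.mean((np.polyval(coefs, x_train) - y_train)**2)
    new_mse[degree] = np.mean((np.polyval(coefs, x_new) - y_new)**2)
    print(degree, train_mse[degree], new_mse[degree])


# Training error cannot increase as the degree grows, because every lower-degree polynomial is a special case of a higher-degree one. The error on new data is the quantity that reveals overfitting.
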
#     
# ### Machine learning pipeline
# 
# <div>
# <img src="../MLpipeline.png" width="500"/>
# </div>

# ## Example of classifier

# ### Decision Tree
# Classification represented by a simple flowchart-style structure. 
# 
# The "learning rule" is very simple: 
# 
# 1. Select a feature
# 2. Select the threshold that *best* separates data points from different classes 
# 
# Steps 1 and 2 are repeated until we have a satisfactory model (stopping criteria)
# 
# <div>
# <img src="decision_tree.png" width="500"/>
# </div>
# image source: https://github.com/Harvard-IACS/2020-CS109A/blob/master/content/lectures/lecture22/slides/Lecture22_decision_trees.pdf
# 
# - _root_: the node on the top of the tree i.e. first overall splitting
# - _leaf node_: the nodes at the bottom of the tree, no more splitting
# - _depth_ of tree: maximum distance between the root and leaf nodes
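# 
# Step 2 of the learning rule can be made concrete for a single feature. The sketch below scores each candidate threshold by the weighted Gini impurity of the two resulting groups and keeps the lowest. The choice of Gini impurity as the meaning of "best", the helper names `gini` and `best_threshold`, and the fruit widths are all assumptions for illustration; this is not scikit-learn's implementation.

# In[ ]:


import numpy as np

def gini(labels):
    """Gini impurity of an array of 0/1 labels (0.0 means a pure group)."""
    if labels.size == 0:
        return 0.0
    proportions = np.bincount(labels, minlength=2) / labels.size
    return 1.0 - np.sum(proportions**2)

def best_threshold(feature, labels):
    """Try midpoints between sorted unique values; return (threshold, impurity)."""
    values = np.unique(feature)                    # sorted unique feature values
    candidates = (values[:-1] + values[1:]) / 2    # midpoints between neighbors
    best_t, best_score = None, np.inf
    for t in candidates:
        left = labels[feature <= t]
        right = labels[feature > t]
        # impurity of the split: average of both sides, weighted by group size
        score = (left.size*gini(left) + right.size*gini(right)) / labels.size
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

widths = np.array([5.8, 6.0, 6.2, 7.1, 7.4, 7.6])   # hypothetical fruit widths
kinds = np.array([0, 0, 0, 1, 1, 1])                # hypothetical labels: 0 = lemon, 1 = orange

threshold, impurity = best_threshold(widths, kinds)
print(threshold, impurity)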

# <!--### Random Forests
# __Random Forests__ are composed of many decision trees. 
# - Each tree only considers a subset of the features
# 
# The prediction of a random forest is determined by a majority vote among the decision trees. -->

# ## Getting started with Scikit-Learn
# 
# We construct a decision tree on the Breast Cancer Wisconsin Dataset (included in scikit-learn).

# In[ ]:


import sklearn
import matplotlib.pyplot as plt

import sklearn.datasets
bcancer = sklearn.datasets.load_breast_cancer()


# In[ ]:


bcancer


# In[ ]:


type(bcancer)


# In[ ]:


bcancer.data


# What is in this dataset?

# In[ ]:


print("Features: ", bcancer.data.shape)
print("Target: ", bcancer.target.shape) 


# Quick reminder:
# 
# * "Features" in machine learning are a set of quantitative characteristics (or measures) of each of your samples. Models use features to make predictions about a target variable. In the Breast Cancer Wisconsin Dataset, there are 30 different features and 569 samples.
# 
# * "Target" is our objective: our model should use the features to predict the target. During the training, the target is known and also called _ground truth_.
# 
# Let's take a look at the targets

# In[ ]:


print( bcancer.target )


# As you can see, they are always 0 or 1, indicating malignant or benign (respectively).
# 
# Let's inspect one of the samples. The very first sample, which has target 0 (i.e. malignant), has the following features:

# In[ ]:


bcancer.data[0] # a single row of a matrix


# In[ ]:


bcancer.target[0]


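# The features were computed from microscopic images of the nuclei of cells extracted from patients. Ten properties were measured, and the mean, standard error, and worst (largest) value of each are recorded, giving the 30 features: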
# - radius
# - texture 
# - perimeter
# - area
# - smoothness 
# - compactness 
# - concavity 
# - concave points 
# - symmetry
# - fractal dimension 

# In[ ]:


bcancer.feature_names


# So our goal is to create a model that receives numbers like these and predicts whether they come from malignant or benign cells.

# We import the `DecisionTreeClassifier` class from the `tree` subpackage of `sklearn`: 
# 
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

# In[ ]:


from sklearn.tree import DecisionTreeClassifier


# Next we create a model and store it in a variable:

# In[ ]:


bcancer_model = DecisionTreeClassifier( max_depth=4 )


# and then we fit it to the data at hand.

# In[ ]:


bcancer_model.fit( bcancer.data, bcancer.target )


# Let's evaluate the prediction of this model on two samples: one from class 0 (malignant) and one from class 1 (benign).

# Sample 0 is a malignant tumor:

# In[ ]:


print( bcancer.target[0] ) # correct answer


# We can predict the label of a sample using the `predict()` method:

# In[ ]:


print( bcancer_model.predict( [bcancer.data[0]] ) )


# Sample 19 is a benign tumor:

# In[ ]:


print( bcancer.target[19] )


# __Exercise__: Predict the label of Sample 19. (Try to type it out rather than copying and pasting from two cells above.)

# In[ ]:


# In[ ]:


# <!-- print( bcancer_model.predict( [bcancer.data[19]] ) ) -->

# Why is the square bracket needed?
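# 
# One way to see the answer is to look at array shapes. scikit-learn's `predict()` expects a two-dimensional array with one row per sample and one column per feature, while a single row taken from a matrix is one-dimensional. The sketch below uses a made-up matrix `features` with three features, so that it runs without the dataset.

# In[ ]:


import numpy as np

features = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])   # hypothetical data: 2 samples, 3 features

sample = features[0]                 # one row: one-dimensional
batch = np.array([sample])           # wrapped in brackets: two-dimensional, one sample
same_batch = sample.reshape(1, -1)   # equivalent reshape; -1 lets NumPy infer the size

print(sample.shape, batch.shape, same_batch.shape)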

# We can also compute the model's predictions on the entire dataset:

# In[ ]:


bcancer_predictions = bcancer_model.predict( bcancer.data )
print( bcancer_predictions )


# __Exercise__: Compare the ground truth with the predictions computed above, and store the result in a variable named `prediction_correctness` (the cells below use it). The result is a one-dimensional boolean array filled with `True`s and `False`s.
# <!-- prediction_correctness = (bcancer_predictions == bcancer.target)# fill in here-->

# In[ ]:


# In[ ]:


# <!--prediction_correctness = bcancer_predictions == bcancer.target-->

# Every ```True``` above means a prediction that matched its corresponding ground truth. ```False``` elements reflect wrong predictions (also known as misclassifications). Let's visualize how many correct classifications and how many misclassifications we had with this first classifier.

# In[ ]:


# transform the boolean (True/False) array into an integer array filled with zeros and ones
compar = prediction_correctness.astype(int)

plt.hist( compar, bins=3 )

plt.ylabel('Count')
plt.xticks( [0.15,0.85], ['Misclassified', 'Correctly classified'] )

plt.show()


# In[ ]:


prediction_correctness.sum()


# __Exercise__: Compute the misclassification rate: the number of misclassifications divided by the number of samples.
# <!--1- sum(prediction_correctness)/len(bcancer.target)# Fill in here-->

# In[ ]:


# <!--(compar==0).sum()/len(compar)-->

# More than 95% of the data were classified correctly. 

# ### An Important Question
# Are we confident in this result?
# 
# A big criticism would be... how can you tell we are not overfitting?

# On Day 2 we will explore more models and learn how to validate the model, making sure that we indeed have prediction power.
# 
#