Commit 27c1567

added machine learning docs
1 parent 111c5d6 commit 27c1567

docs/scenarios/ml.rst

Lines changed: 117 additions & 0 deletions
================
Machine Learning
================

Python has a vast number of libraries for data analysis, statistics, and machine learning, making it a language of choice for many data scientists.

Some widely used packages for machine learning and other data science applications are listed below.

SciPy Stack
-----------

The SciPy stack consists of a set of core helper packages used in data science for statistical analysis and data visualisation. Because of its breadth of functionality and ease of use, the stack is considered a must-have for most data science applications.

The stack consists of the following packages (with links to their documentation):

1. `NumPy <http://www.numpy.org/>`_
2. `SciPy library <https://www.scipy.org/>`_
3. `Matplotlib <http://matplotlib.org/>`_
4. `IPython <https://ipython.org/>`_
5. `pandas <http://pandas.pydata.org/>`_
6. `Sympy <http://www.sympy.org/en/index.html>`_
7. `nose <http://nose.readthedocs.io/en/latest/>`_

The stack also includes Python itself, which has been left out of the list above.

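To give a flavour of what these packages offer, here is a minimal sketch that uses only NumPy. The sample values are made up for illustration and do not come from any dataset mentioned here.

```python
import numpy as np

# hypothetical measurements, made up for this illustration
samples = np.array([4.9, 5.1, 5.8, 6.3, 7.0])

mean = samples.mean()      # arithmetic mean
std = samples.std()        # population standard deviation
centered = samples - mean  # vectorised arithmetic, no explicit loop

print(mean, std)
print(centered)
```

The other packages in the stack build on the same array type, so data prepared this way can typically be passed on to SciPy, pandas, or Matplotlib.
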

Installation
~~~~~~~~~~~~

To install the full stack or individual packages, refer to the `official installation instructions <https://www.scipy.org/install.html>`_.

**NB:** `Anaconda <https://www.continuum.io/anaconda-overview>`_ is highly recommended for installing and maintaining data science packages.

scikit-learn
------------

scikit-learn is a free and open-source machine learning library for Python. It offers off-the-shelf implementations of many algorithms, such as linear regression, classifiers, SVMs, k-means, and neural networks. It also ships with a few sample datasets that can be used directly for training and testing.

Because of its speed, robustness, and ease of use, it is one of the most widely used libraries for machine learning applications.

Installation
~~~~~~~~~~~~

Through PyPI:

.. code-block:: console

    $ pip install -U scikit-learn

Through conda:

.. code-block:: console

    $ conda install scikit-learn

scikit-learn also ships with Anaconda (mentioned above). For more installation instructions, refer to the `scikit-learn installation guide <http://scikit-learn.org/stable/install.html>`_.

Example
~~~~~~~

For this example, we train a simple classifier on the `Iris dataset <http://en.wikipedia.org/wiki/Iris_flower_data_set>`_, which comes bundled with scikit-learn.

The dataset records four features of each flower (sepal length, sepal width, petal length, and petal width) and assigns each sample to one of three species (labels): setosa, versicolor, or virginica. The labels are represented as numbers in the dataset: 0 (setosa), 1 (versicolor), and 2 (virginica).

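As a quick check of the description above, the following sketch loads the bundled dataset and inspects its shape, feature names, and label names. It uses only attributes of the object returned by ``load_iris``; nothing is trained here.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# 150 samples, each with four measurements
print(iris.data.shape)

# names of the four features
print(iris.feature_names)

# species names; the position of each name is its numeric label
print(list(iris.target_names))

# the distinct numeric labels present in the dataset
print(sorted(set(iris.target.tolist())))
```
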

We shuffle the Iris dataset and divide it into separate training and testing sets, keeping the last 10 data points for testing and the rest for training. We then train the classifier on the training set and predict on the testing set.

.. code-block:: python

    from sklearn.datasets import load_iris
    from sklearn import tree
    from sklearn.metrics import accuracy_score
    import numpy as np

    # loading the iris dataset
    iris = load_iris()

    x = iris.data    # array of the data
    y = iris.target  # array of labels (i.e. answers) of each data entry

    # getting label names, i.e. the three flower species
    y_names = iris.target_names

    # taking random indices to split the dataset into train and test
    test_ids = np.random.permutation(len(x))

    # splitting data and labels into train and test
    # keeping last 10 entries for testing, rest for training
    x_train = x[test_ids[:-10]]
    x_test = x[test_ids[-10:]]

    y_train = y[test_ids[:-10]]
    y_test = y[test_ids[-10:]]

    # classifying using decision tree
    clf = tree.DecisionTreeClassifier()

    # training (fitting) the classifier with the training set
    clf.fit(x_train, y_train)

    # predictions on the test dataset
    pred = clf.predict(x_test)

    print(pred)    # predicted labels, i.e. flower species
    print(y_test)  # actual labels
    print(accuracy_score(y_test, pred) * 100)  # prediction accuracy (%)

Since the split is random and the classifier is retrained on every run, the accuracy may vary. Running the above code gives:

.. code-block:: text

    [0 1 1 1 0 2 0 2 2 2]
    [0 1 1 1 0 2 0 2 2 2]
    100.0

The first line contains the labels (i.e. flower species) of the testing data as predicted by our classifier, and the second line contains the actual flower species as given in the dataset. We thus get an accuracy of 100% this time.

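Because the accuracy varies from run to run, one possible way to make the experiment repeatable is to fix the random seeds. The sketch below is not part of the original example: it swaps the manual permutation for scikit-learn's ``train_test_split`` helper and passes an arbitrarily chosen seed (``random_state=0``) to both the split and the classifier.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# hold out 10 samples for testing; the seed value 0 is an arbitrary choice
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=10, random_state=0
)

# seeding the classifier as well keeps the fitted tree the same across runs
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)

pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, pred) * 100
print(accuracy)
```

With the seeds fixed, repeated runs in the same environment should produce the same split and the same score, which makes results easier to compare. A single 10-sample test set is still a small basis for judging a model.
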

More on scikit-learn can be found in the `documentation <http://scikit-learn.org/stable/user_guide.html>`_.
