added additional dataset descriptions

rasbt · rasbt · commit 194e34f245ab · 2016-02-03T13:20:43.000-05:00
diff --git a/code/datasets/README.md b/code/datasets/README.md
@@ -4,12 +4,12 @@ Sebastian Raschka, 2015
 
 ### iris
 
-- used in chapters 1, 2, 3
+- used in chapters 1, 2, and 3
 - source: [https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
 
 ### wine
 
-- used in chapters 4, 5
+- used in chapters 4 and 5
 - source: [https://archive.ics.uci.edu/ml/datasets/Wine](https://archive.ics.uci.edu/ml/datasets/Wine)
 
 ### wdbc
@@ -19,7 +19,7 @@ Sebastian Raschka, 2015
 
 ### movie
 
-- used in chapter 8, 9
+- used in chapters 8 and 9
 - movie dataset converted into a 2-column CSV format: The first column (`review`) contains the text, and the second column (`sentiment`) denotes the polarity, where 0=negative and 1=positive. The first 25,000 are the training samples and the remaining 25,000 rows are the test samples from the "Large Movie Review Dataset v1.0," respectively.
 - source: [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)
 
@@ -30,5 +30,5 @@ Sebastian Raschka, 2015
 
 ### mnist
 
-- used in chapter 12, 13
-- source: [http://yann.lecun.com/exdb/mnist/]
+- used in chapters 12 and 13
+- source: [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)
diff --git a/code/datasets/housing/README.md b/code/datasets/housing/README.md
@@ -0,0 +1,36 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## Boston Housing Data
+
+- Used in chapter 10
+
+The Boston Housing dataset for regression analysis.
+
+**Features**
+
+1. CRIM:      per capita crime rate by town
+2. ZN:        proportion of residential land zoned for lots over 25,000 sq.ft.
+3. INDUS:     proportion of non-retail business acres per town
+4. CHAS:      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
+5. NOX:       nitric oxides concentration (parts per 10 million)
+6. RM:        average number of rooms per dwelling
+7. AGE:       proportion of owner-occupied units built prior to 1940
+8. DIS:       weighted distances to five Boston employment centres
+9. RAD:       index of accessibility to radial highways
+10. TAX:      full-value property-tax rate per $10,000
+11. PTRATIO:  pupil-teacher ratio by town
+12. B:        1000(Bk - 0.63)^2 where Bk is the proportion of b. by town
+13. LSTAT:    % lower status of the population
+
+
+- Number of samples: 506
+
+- Target variable (continuous): MEDV, Median value of owner-occupied homes in $1000's
+
+### References
+
+- Source: [https://archive.ics.uci.edu/ml/datasets/Housing](https://archive.ics.uci.edu/ml/datasets/Housing)
+- Harrison, D. and Rubinfeld, D.L.
+'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
diff --git a/code/datasets/iris/README.md b/code/datasets/iris/README.md
@@ -0,0 +1,25 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## Iris Flower Dataset
+
+- Used in chapters 1, 2, and 3
+
+The Iris dataset for classification.
+
+**Features**
+
+1. Sepal length
+2. Sepal width
+3. Petal length
+4. Petal width
+
+- Number of samples: 150
+
+- Target variable (discrete): {50x Setosa, 50x Versicolor, 50x Virginica}
+
+### References
+
+- Source: [https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
+- Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
diff --git a/code/datasets/mnist/README.md b/code/datasets/mnist/README.md
@@ -0,0 +1,34 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## MNIST Dataset
+
+- Used in chapters 12 and 13
+
+
+The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 percent high school students, and 50 percent employees from the Census Bureau. Note that the test set contains handwritten digits from different people following the same split.
+
+**Features**
+
+Each feature vector (row in the feature matrix) consists of 784 pixels (intensities), unrolled from the original 28x28 pixel images.
+
+- Number of samples: A subset of 5000 images (the first 500 digits of each class)
+
+- Target variable (discrete): {500x 0, ..., 500x 9}
+
+
+### References
+
+- Source: [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)
+- Y. LeCun and C. Cortes. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
+
+
+### Loading MNIST
+
+- The description and code from [chapter 12](http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb#Obtaining-the-MNIST-dataset)
+
+In addition, I added two convenience functions to one of my external machine learning packages:
+
+- [A function that loads the MNIST dataset into NumPy arrays](http://rasbt.github.io/mlxtend/user_guide/data/load_mnist/)
+- [A utility function that loads the MNIST dataset from byte-form into NumPy arrays](http://rasbt.github.io/mlxtend/user_guide/data/mnist_data/)
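Note on the MNIST README above: the "784 pixels unrolled from 28x28" layout can be illustrated with a minimal sketch. It is not the loading code from chapter 12 or from mlxtend. The helper names `unroll` and `reshape` are hypothetical, and row-by-row (row-major) ordering is an assumption, since the README does not state the order.

```python
def unroll(image):
    """Flatten a 2D image (a list of rows) into a 1D list, row by row."""
    return [pixel for row in image for pixel in row]


def reshape(vector, n_rows=28, n_cols=28):
    """Rebuild an n_rows x n_cols image from a flat, row-major vector."""
    if len(vector) != n_rows * n_cols:
        raise ValueError("expected %d values, got %d" % (n_rows * n_cols, len(vector)))
    return [vector[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Synthetic 28x28 "image" with intensities in 0..255 (placeholder data, not a real digit).
image = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]

vector = unroll(image)      # one feature vector, i.e. one row of the feature matrix
restored = reshape(vector)  # back to the 28x28 layout, e.g. for plotting
```

Under this assumed ordering, pixel `(r, c)` of the image sits at index `r * 28 + c` of the feature vector.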
diff --git a/code/datasets/movie/README.md b/code/datasets/movie/README.md
@@ -0,0 +1,11 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## The Large Movie Review Dataset
+
+- Used in chapters 8 and 9
+
+The movie dataset converted into a 2-column CSV format: The first column (`review`) contains the text, and the second column (`sentiment`) denotes the polarity, where 0=negative and 1=positive. The first 25,000 rows are the training samples and the remaining 25,000 rows are the test samples of the "Large Movie Review Dataset v1.0".
+
+- Source: [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)
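Note on the movie README above: because the training and test samples are distinguished only by row position, a reader has to split the CSV by index. A minimal sketch with the standard library follows. It assumes the file has a header row naming the `review` and `sentiment` columns. The function names and the four sample reviews are made up for illustration.

```python
import csv
import io


def read_reviews(csv_file):
    """Read (review, sentiment) pairs from a 2-column CSV with a header row."""
    reader = csv.DictReader(csv_file)
    return [(row["review"], int(row["sentiment"])) for row in reader]


def split_train_test(rows, n_train=25000):
    """Treat the first n_train rows as training data and the rest as test data."""
    return rows[:n_train], rows[n_train:]


# Tiny in-memory stand-in for the real file (invented reviews, illustration only).
sample = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",1\n'
    '"Dull and far too long.",0\n'
    '"Great cast, great script.",1\n'
    '"I want my two hours back.",0\n'
)

rows = read_reviews(sample)
train, test = split_train_test(rows, n_train=2)
```

With the real 50,000-row file, the default `n_train=25000` would reproduce the split described in the README.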
diff --git a/code/datasets/wdbc/README.md b/code/datasets/wdbc/README.md
@@ -0,0 +1,53 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## Wine Dataset
+
+- Used in chapters 4 and 5
+
+The Wine dataset for classification.
+
+|                            |                |
+|----------------------------|----------------|
+| Samples                    | 178            |
+| Features                   | 13             |
+| Classes                    | 3              |
+| Data Set Characteristics   | Multivariate   |
+| Attribute Characteristics  | Integer, Real  |
+| Associated Tasks           | Classification |
+| Missing Values             | None           |
+
+| column | attribute                 |
+|-----|------------------------------|
+| 1)  | Class Label                  |
+| 2)  | Alcohol                      |
+| 3)  | Malic acid                   |
+| 4)  | Ash                          |
+| 5)  | Alcalinity of ash            |
+| 6)  | Magnesium                    |
+| 7)  | Total phenols                |
+| 8)  | Flavanoids                   |
+| 9)  | Nonflavanoid phenols         |
+| 10) | Proanthocyanins              |
+| 11) | Color intensity              |
+| 12) | Hue                          |
+| 13) | OD280/OD315 of diluted wines |
+| 14) | Proline                      |
+
+
+| class | samples |
+|-------|---------|
+| 0     | 59 |
+| 1     | 71 |
+| 2     | 48 |
+
+
+### References
+
+- Forina, M. et al, PARVUS -
+An Extendible Package for Data Exploration, Classification and Correlation.
+Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno,
+16147 Genoa, Italy.
+- Source: [https://archive.ics.uci.edu/ml/datasets/Wine](https://archive.ics.uci.edu/ml/datasets/Wine)
+- Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
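Note on the Wine tables above: the first column holds the class label and the remaining 13 columns hold the features. The sketch below shows how that layout might be used to separate the two. The helper name and the placeholder row are hypothetical. Column 11 is spelled out as "Color intensity", the name used in the UCI description. The class counts are copied from the README table.

```python
WINE_COLUMNS = [
    "Class label", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
    "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols",
    "Proanthocyanins", "Color intensity", "Hue",
    "OD280/OD315 of diluted wines", "Proline",
]

# Samples per class, as listed in the README table.
CLASS_COUNTS = {0: 59, 1: 71, 2: 48}


def split_label_and_features(row):
    """Split one 14-value row into (class label, list of 13 feature values)."""
    if len(row) != len(WINE_COLUMNS):
        raise ValueError("expected %d values, got %d" % (len(WINE_COLUMNS), len(row)))
    label = int(row[0])
    features = [float(value) for value in row[1:]]
    return label, features


# Placeholder row (not real measurements): a label followed by 13 feature values.
example_row = ["0"] + ["1.0"] * 13
label, features = split_label_and_features(example_row)

# Sanity check against the summary table: the class counts should add up to 178.
total_samples = sum(CLASS_COUNTS.values())
```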
diff --git a/code/datasets/wine/README.md b/code/datasets/wine/README.md
@@ -0,0 +1,34 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+### iris
+
+- used in chapters 1, 2, and 3
+- source: [https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
+
+### wine
+
+- used in chapters 4 and 5
+- source: [https://archive.ics.uci.edu/ml/datasets/Wine](https://archive.ics.uci.edu/ml/datasets/Wine)
+
+### wdbc
+
+- used in chapter 6
+- source: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
+
+### movie
+
+- used in chapters 8 and 9
+- movie dataset converted into a 2-column CSV format: The first column (`review`) contains the text, and the second column (`sentiment`) denotes the polarity, where 0=negative and 1=positive. The first 25,000 are the training samples and the remaining 25,000 rows are the test samples from the "Large Movie Review Dataset v1.0," respectively.
+- source: [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)
+
+### housing
+
+- used in chapter 10
+- source: [https://archive.ics.uci.edu/ml/datasets/Housing](https://archive.ics.uci.edu/ml/datasets/Housing)
+
+### mnist
+
+- used in chapters 12 and 13
+- source: [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)