
Commit 194e34f

committed Feb 3, 2016
added additional dataset descriptions
1 parent 37536e7 commit 194e34f


7 files changed: +198 -5 lines changed


‎code/datasets/README.md

+5 -5
@@ -4,12 +4,12 @@ Sebastian Raschka, 2015
 
 ### iris
 
-- used in chapters 1, 2, 3
+- used in chapters 1, 2, and 3
 - source: [https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
 
 ### wine
 
-- used in chapters 4, 5
+- used in chapters 4 and 5
 - source: [https://archive.ics.uci.edu/ml/datasets/Wine](https://archive.ics.uci.edu/ml/datasets/Wine)
 
 ### wdbc
@@ -19,7 +19,7 @@ Sebastian Raschka, 2015
 
 ### movie
 
-- used in chapter 8, 9
+- used in chapters 8 and 9
 - movie dataset converted into a 2-column CSV format: The first column (`review`) contains the text, and the second column (`sentiment`) denotes the polarity, where 0=negative and 1=positive. The first 25,000 are the training samples and the remaining 25,000 rows are the test samples from the "Large Movie Review Dataset v1.0," respectively.
 - source: [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)
 
@@ -30,5 +30,5 @@ Sebastian Raschka, 2015
 
 ### mnist
 
-- used in chapter 12, 13
-- source: [http://yann.lecun.com/exdb/mnist/]
+- used in chapters 12 and 13
+- source: [http://yann.lecun.com/exdb/mnist/]

‎code/datasets/housing/README.md

+36
@@ -0,0 +1,36 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## Boston Housing Data
+
+- Used in chapter 10
+
+The Boston Housing dataset for regression analysis.
+
+**Features**
+
+1. CRIM: per capita crime rate by town
+2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
+3. INDUS: proportion of non-retail business acres per town
+4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
+5. NOX: nitric oxides concentration (parts per 10 million)
+6. RM: average number of rooms per dwelling
+7. AGE: proportion of owner-occupied units built prior to 1940
+8. DIS: weighted distances to five Boston employment centres
+9. RAD: index of accessibility to radial highways
+10. TAX: full-value property-tax rate per $10,000
+11. PTRATIO: pupil-teacher ratio by town
+12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of b. by town
+13. LSTAT: % lower status of the population
+
+
+- Number of samples: 506
+
+- Target variable (continuous): MEDV, Median value of owner-occupied homes in $1000's
+
+### References
+
+- Source: [https://archive.ics.uci.edu/ml/datasets/Wine](https://archive.ics.uci.edu/ml/datasets/Wine)
+- Harrison, D. and Rubinfeld, D.L.
+'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
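
Reading note on `code/datasets/housing/README.md`: the feature list above implies a 14-column layout (13 features followed by the MEDV target). The following is a minimal sketch of how such rows could be split into features and target using only the standard library. The `parse_housing` helper, the whitespace-separated layout, and the two sample rows are assumptions made for illustration; none of them come from this commit.

```python
import io

# Column order taken from the README: 13 features followed by the MEDV target.
HOUSING_COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                   "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def parse_housing(lines):
    """Hypothetical helper: split rows into feature dicts and MEDV targets."""
    features, targets = [], []
    for line in lines:
        values = [float(v) for v in line.split()]
        if not values:
            continue  # skip blank lines
        if len(values) != len(HOUSING_COLUMNS):
            raise ValueError(
                f"expected {len(HOUSING_COLUMNS)} values, got {len(values)}")
        row = dict(zip(HOUSING_COLUMNS, values))
        targets.append(row.pop("MEDV"))  # the continuous target variable
        features.append(row)             # the remaining 13 features
    return features, targets


# Made-up rows for illustration only; these are not values from the data file.
sample = io.StringIO(
    "0.1 18.0 2.3 0 0.5 6.5 65.0 4.0 1 296.0 15.3 396.9 5.0 24.0\n"
    "0.2 0.0 7.0 0 0.4 6.4 78.9 4.9 2 242.0 17.8 396.9 9.1 21.6\n"
)
X, y = parse_housing(sample)
print(len(X), len(X[0]), y)
```

With the real file, the same check on the row count should agree with the 506 samples stated in the README, assuming the file follows this column order.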

‎code/datasets/iris/README.md

+25
@@ -0,0 +1,25 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## Iris Flower Dataset
+
+- Used in chapters 1, 2, and 3
+
+The Iris dataset for classification.
+
+**Features**
+
+1. Sepal length
+2. Sepal width
+3. Petal length
+4. Petal width
+
+- Number of samples: 150
+
+- Target variable (discrete): {50x Setosa, 50x Versicolor, 50x Virginica}
+
+### References
+
+- Source: [https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
+- Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
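
Reading note on `code/datasets/iris/README.md`: the README states 150 samples split evenly across three classes. A small sanity check along these lines could confirm that a loaded label column matches that description. The `class_counts_match` helper and the placeholder labels are hypothetical, and the label spelling in the actual data file may differ from the names used in the README.

```python
from collections import Counter

# Class counts as stated in the README: 50 samples per species, 150 in total.
EXPECTED_COUNTS = {"Setosa": 50, "Versicolor": 50, "Virginica": 50}


def class_counts_match(labels, expected=EXPECTED_COUNTS):
    """Hypothetical helper: True if the labels occur exactly as often as expected."""
    return Counter(labels) == Counter(expected)


# Placeholder labels standing in for the class column of the real data file.
labels = ["Setosa"] * 50 + ["Versicolor"] * 50 + ["Virginica"] * 50
print(len(labels), class_counts_match(labels))
```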

‎code/datasets/mnist/README.md

+34
@@ -0,0 +1,34 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## MNIST Dataset
+
+- Used in chapters 12 and 13
+
+
+The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 percent high school students, and 50 percent employees from the Census Bureau. Note that the test set contains handwritten digits from different people following the same split.
+
+**Features**
+
+Each feature vector (row in the feature matrix) consists of 784 pixels (intensities) -- unrolled from the original 28x28 pixels images.
+
+- Number of samples: A subset of 5000 images (the first 500 digits of each class)
+
+- Target variable (discrete): {500x 0, ..., 500x 9}
+
+
+### References
+
+- Source: [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)
+- Y. LeCun and C. Cortes. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann. lecun. com/exdb/mnist, 2010.
+
+
+### Loading MNIST
+
+- The description and code from [chapter 12](http://nbviewer.jupyter.org/github/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb#Obtaining-the-MNIST-dataset)
+
+In addition, I added to convenience function to one of my external machine learning packages
+
+- [A function that loads the MNIST dataset into NumPy arrays](http://rasbt.github.io/mlxtend/user_guide/data/load_mnist/)
+- [A utility function that loads the MNIST dataset from byte-form into NumPy arrays](http://rasbt.github.io/mlxtend/user_guide/data/mnist_data/)
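
Reading note on `code/datasets/mnist/README.md`: the README describes two operations, unrolling each 28x28 image into a 784-element row and keeping the first 500 digits of each class. The sketch below illustrates both ideas in plain Python. It is not the code used to build the subset in this repository (that code is not part of this commit); `unroll`, `first_n_per_class`, and the dummy data are hypothetical names and values.

```python
def unroll(image):
    """Flatten a 2D nested list (e.g. 28x28) into one row, in row-major order."""
    return [pixel for row in image for pixel in row]


def first_n_per_class(labels, n):
    """Return the indices of the first n samples of each class, in input order."""
    seen = {}
    keep = []
    for index, label in enumerate(labels):
        if seen.get(label, 0) < n:
            seen[label] = seen.get(label, 0) + 1
            keep.append(index)
    return keep


# Dummy 28x28 image whose pixel value encodes its position (not real MNIST data).
image = [[r * 28 + c for c in range(28)] for r in range(28)]
row = unroll(image)

# Toy labels; with the real labels, n=500 would give the 5000-image subset.
labels = [0, 1, 0, 0, 1, 2]
subset = first_n_per_class(labels, n=2)
print(len(row), subset)
```

Row-major order is an assumption here; any fixed order works as long as the same one is used when reshaping rows back into images.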

‎code/datasets/movie/README.md

+11
@@ -0,0 +1,11 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## The Large Movie Review Dataset
+
+- Used in chapters 8 and 9
+
+The movie dataset converted into a 2-column CSV format: The first column (`review`) contains the text, and the second column (`sentiment`) denotes the polarity, where 0=negative and 1=positive. The first 25,000 are the training samples and the remaining 25,000 rows are the test samples from the "Large Movie Review Dataset v1.0," respectively.
+
+- Source: [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)
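
Reading note on `code/datasets/movie/README.md`: because the README fixes the row order (first 25,000 rows for training, the rest for testing), the split can be recovered by position alone. A minimal sketch with the standard `csv` module follows. The `split_reviews` helper and the four-row sample are hypothetical; only the column names `review` and `sentiment` come from the README.

```python
import csv
import io


def split_reviews(csv_file, n_train=25000):
    """Hypothetical helper: the first n_train rows are training data, the rest test."""
    rows = [(record["review"], int(record["sentiment"]))
            for record in csv.DictReader(csv_file)]
    return rows[:n_train], rows[n_train:]


# Tiny made-up CSV with the README's two columns (not rows from the real file).
sample = io.StringIO(
    "review,sentiment\n"
    '"Great film, loved it",1\n'
    '"Dull and far too long",0\n'
    '"A pleasant surprise",1\n'
    '"Not worth the ticket",0\n'
)
train, test = split_reviews(sample, n_train=2)
print(len(train), len(test))
```

Shuffling before splitting would mix the original training and test samples, so a positional split like this one only makes sense on the unshuffled file.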

‎code/datasets/wdbc/README.md

+53
@@ -0,0 +1,53 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+## Wine Dataset
+
+- Used in chapters 4 and 5
+
+The Wine dataset for classification.
+
+| | |
+|----------------------------|----------------|
+| Samples | 178 |
+| Features | 13 |
+| Classes | 3 |
+| Data Set Characteristics: | Multivariate |
+| Attribute Characteristics: | Integer, Real |
+| Associated Tasks: | Classification |
+| Missing Values | None |
+
+| column| attribute |
+|-----|------------------------------|
+| 1) | Class Label |
+| 2) | Alcohol |
+| 3) | Malic acid |
+| 4) | Ash |
+| 5) | Alcalinity of ash |
+| 6) | Magnesium |
+| 7) | Total phenols |
+| 8) | Flavanoids |
+| 9) | Nonflavanoid phenols |
+| 10) | Proanthocyanins |
+| 11) | intensity |
+| 12) | Hue |
+| 13) | OD280/OD315 of diluted wines |
+| 14) | Proline |
+
+
+| class | samples |
+|-------|----|
+| 0 | 59 |
+| 1 | 71 |
+| 2 | 48 |
+
+
+### References
+
+- Forina, M. et al, PARVUS -
+An Extendible Package for Data Exploration, Classification and Correlation.
+Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno,
+16147 Genoa, Italy.
+- Source: [https://archive.ics.uci.edu/ml/datasets/Wine](https://archive.ics.uci.edu/ml/datasets/Wine)
+- Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
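
Reading note on the tables in this file: the column table puts the class label first, followed by 13 features, and the class table lists 59, 71, and 48 samples, which sum to the stated 178. The sketch below shows one way to read rows in that layout and to cross-check the stated totals. The `read_wine` helper, the comma-separated headerless layout, and the two sample rows are assumptions for illustration; the class encoding in the actual data file may differ from the 0, 1, 2 used in the table.

```python
import csv
import io

# Column order from the README table: class label first, then 13 features.
WINE_COLUMNS = ["Class Label", "Alcohol", "Malic acid", "Ash",
                "Alcalinity of ash", "Magnesium", "Total phenols",
                "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
                "intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"]

# Samples per class as listed in the README.
CLASS_SAMPLES = {0: 59, 1: 71, 2: 48}


def read_wine(csv_file):
    """Hypothetical helper: return (labels, feature dicts) from headerless CSV."""
    labels, features = [], []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        if len(row) != len(WINE_COLUMNS):
            raise ValueError(f"expected {len(WINE_COLUMNS)} columns, got {len(row)}")
        record = dict(zip(WINE_COLUMNS, row))
        labels.append(int(record.pop("Class Label")))
        features.append({name: float(value) for name, value in record.items()})
    return labels, features


# Made-up rows for illustration only; not values from the real data file.
sample = io.StringIO(
    "0,14.2,1.7,2.4,15.6,127,2.8,3.1,0.3,2.3,5.6,1.0,3.9,1065\n"
    "2,13.3,3.9,2.5,21.0,96,1.9,0.6,0.5,1.1,9.9,0.6,1.6,620\n"
)
labels, features = read_wine(sample)
print(labels, len(features[0]), sum(CLASS_SAMPLES.values()))
```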

‎code/datasets/wine/README.md

+34
@@ -0,0 +1,34 @@
+Sebastian Raschka, 2015
+
+# Python Machine Learning - Supplementary Datasets
+
+### iris
+
+- used in chapters 1, 2, and 3
+- source: [https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
+
+### wine
+
+- used in chapters 4 and 5
+- source: [https://archive.ics.uci.edu/ml/datasets/Wine](https://archive.ics.uci.edu/ml/datasets/Wine)
+
+### wdbc
+
+- used in chapter 6
+- source: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
+
+### movie
+
+- used in chapters 8 and 9
+- movie dataset converted into a 2-column CSV format: The first column (`review`) contains the text, and the second column (`sentiment`) denotes the polarity, where 0=negative and 1=positive. The first 25,000 are the training samples and the remaining 25,000 rows are the test samples from the "Large Movie Review Dataset v1.0," respectively.
+- source: [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)
+
+### housing
+
+- used in chapter 10
+- source: [https://archive.ics.uci.edu/ml/datasets/Housing](https://archive.ics.uci.edu/ml/datasets/Housing)
+
+### mnist
+
+- used in chapter 12, 13
+- source: [http://yann.lecun.com/exdb/mnist/]
