Introduction to Statistical Learning

Preamble

Consider the wine quality dataset from the UCI Machine Learning Repository [1]. We focus only on the white wines, not the red wines. Dichotomize the quality variable as good, which takes the value 1 if quality \(\geq\) 7 and the value 0 otherwise. We take good as the response and all 11 physicochemical characteristics of the wines as predictors.
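
The dichotomization rule can be sketched in a few lines. The snippet below is a minimal illustration in Python rather than R, and the four sample rows are made up for the example (only three columns are shown, and the values are not taken from the dataset). It assumes a semicolon-delimited layout with a `quality` column; the helper names `dichotomize` and `load_rows` are hypothetical.

```python
import csv
import io

# Made-up rows in a semicolon-delimited layout (assumption); not real data.
SAMPLE = """fixed acidity;alcohol;quality
7.0;8.8;6
6.6;12.8;7
8.1;10.1;8
6.2;9.5;5
"""

def dichotomize(quality, threshold=7):
    """Return 1 (good) if quality >= threshold, else 0."""
    return 1 if quality >= threshold else 0

def load_rows(text):
    """Parse the text into (predictors, good) pairs."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for record in reader:
        quality = int(record.pop("quality"))  # remove the raw score from the predictors
        predictors = {name: float(value) for name, value in record.items()}
        rows.append((predictors, dichotomize(quality)))
    return rows

rows = load_rows(SAMPLE)
labels = [good for _, good in rows]
print(labels)  # [0, 1, 1, 0]
```

Dropping the raw quality score from the predictors matters here: leaving it in would let any classifier recover good exactly.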

Problem Statements

Use 10-fold cross-validation to estimate the test error rates below. Compute the estimates with the caret package, setting the seed to 1234 before each computation.
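
To make the procedure concrete, here is a minimal, library-free sketch of 10-fold cross-validation in Python. It is not caret's implementation: the fold assignment, the helper names (`kfold_indices`, `cv_error`), and the toy majority-class classifier are all assumptions chosen for illustration, and Python's random number generator will not reproduce the folds that R produces for the same seed.

```python
import random

def kfold_indices(n, k=10, seed=1234):
    """Shuffle 0..n-1 with a fixed seed and deal the indices into k near-equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(X, y, fit, predict, k=10, seed=1234):
    """Average the misclassification rate over the k held-out folds."""
    errors = []
    for held_out in kfold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / len(errors)

# Toy classifier (hypothetical): always predict the majority class of the training fold.
def fit_majority(X_train, y_train):
    return 1 if 2 * sum(y_train) > len(y_train) else 0

def predict_majority(model, x):
    return model

# Made-up data: 10 ones and 40 zeros.
X = [[float(i)] for i in range(50)]
y = [1 if i % 5 == 0 else 0 for i in range(50)]
err = cv_error(X, y, fit_majority, predict_majority)
print(round(err, 3))
```

Resetting the seed before each computation means every classifier is evaluated on the same folds, so differences in estimated error reflect the models rather than the split.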

  1. Fit a KNN classifier with K chosen to minimize the estimated test error rate. Report the error rate, sensitivity, specificity, and AUC of the optimal KNN on the training data. Also report its estimated test error rate.
  2. Repeat part 1 using logistic regression.
  3. Repeat part 1 using LDA.
  4. Repeat part 1 using QDA.
  5. Compare the results in parts 1-4. Which classifier would you recommend? Justify your answer.
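
The four reported metrics can be computed directly from true labels and predicted scores. The sketch below is a plain-Python illustration, not the caret workflow; it assumes good = 1 is the positive class and a 0.5 cutoff on the predicted probability, and the labels and scores are made up. The function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def auc(y_true, scores):
    """Chance that a random positive outscores a random negative; ties count as 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def summarize(y_true, scores, cutoff=0.5):
    """Error rate, sensitivity, specificity at the cutoff, plus cutoff-free AUC."""
    y_pred = [1 if s >= cutoff else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "error_rate": (fp + fn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # share of good wines correctly flagged
        "specificity": tn / (tn + fp),  # share of other wines correctly rejected
        "auc": auc(y_true, scores),
    }

# Made-up labels and predicted probabilities of good = 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
result = summarize(y_true, scores)
print(result)
```

Because wines with quality of 7 or more are a minority of the data, the error rate alone can look favorable for a classifier that rarely predicts good; sensitivity and AUC are the more informative numbers when comparing the four methods.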

Methodologies

Data Modeling

  • K-nearest Neighbors Classifier (KNN)
  • Logistic Regression
  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
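
As a reminder of how the first method in this list works, here is a minimal KNN prediction rule in Python: a point is assigned the majority label among its K nearest training points. Euclidean distance, the function name `knn_predict`, and the one-dimensional toy data are assumptions made for illustration; in practice the predictors would usually be centered and scaled first, since KNN is sensitive to the units of each variable.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(X_train, y_train)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up one-dimensional training set.
X_train = [[1.0], [2.0], [3.0], [10.0], [11.0]]
y_train = [0, 0, 0, 1, 1]
x_new = [8.0]

predictions = {k: knn_predict(X_train, y_train, x_new, k) for k in (1, 3, 5)}
print(predictions)  # {1: 1, 3: 1, 5: 0}
```

The prediction flips at K = 5 because the vote then includes every training point and the majority class wins, which is why K has to be tuned rather than fixed in advance.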

Further Modeling

  • Naive Bayes
  • Decision Tree (CART Algorithm)
  • Random Forest (Classification)
  • Bagging (Bootstrap Aggregation)
  • Boosting
  • eXtreme Gradient Boosting (XGBoost)
  • Support Vector Machine (SVM)
  • Neural Networks (NNET)

Footnotes

  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.