Overview

The Human Activity Recognition (HAR) dataset from UCI is a public domain dataset (1, 2). In recent years, the HAR dataset has enabled researchers around the world to develop machine learning tools that classify physical activity from wearable sensor data. Here, I classify bicep curls performed by six test subjects by the manner in which the curls were performed, as recorded in the ‘classe’ variable of the dataset. The five forms were the correct curl (A), curling while bringing the elbows to the front (B), lifting the dumbbell only halfway (C), lowering the dumbbell only halfway (D), and curling while thrusting the hips forward (E). Of the various modeling approaches used, the random forest ensemble method produced the highest accuracy in predicting bicep curl form. When building the models in this project, I used cross-validation to optimize the tuning parameters and to better fit the training data (3). The work presented here is my own, with non-negligible support and guidance from Cheng Juan’s publication on practical machine learning in R, which I found while searching Google for examples of cross-validation in R (4).

Downloading and Pre-processing

I began by downloading the training and testing data from the URLs below. Then, I looked at the structure of the training (WLE) dataset. The first 7 columns were identifiers, columns 8 through 159 were sensor measurements, and column 160 was the outcome, classe. I removed the identifiers and converted the measurement columns to numeric objects in R. Then, I removed the measurement columns that contained missing values.

url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("training.csv")) {
  download.file(url1, "training.csv")
}
if (!file.exists("testing.csv")) {
  download.file(url2, "testing.csv")
}

WLE <- read.csv("training.csv", na.strings=c("", "NA"))
testing <- read.csv("testing.csv", na.strings=c("", "NA"))

str(WLE, list.len=10) ## cols 1:7 are identifiers and will not be useful in prediction
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##   [list output truncated]
WLE[,8:159] <- lapply(WLE[,8:159], as.numeric) ## convert to numeric
small_WLE <- WLE[,-c(1:7)] ## remove first 7 columns
totalNAs <- is.na(apply(small_WLE[,-153], 2, sum)) ## a column's sum is NA if it contains any NAs; column 153 is classe
build <- small_WLE[,!totalNAs] ## exclude columns with missing values
rm(small_WLE, totalNAs) ## clean up the environment
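
As a quick sanity check (a minimal sketch, assuming the download matches the original dataset), I confirmed that no missing values remained and counted the surviving columns:

anyNA(build) ## expect FALSE now that NA-containing columns are dropped
dim(build)   ## expect 19622 rows and 53 columns (52 predictors plus classe)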

Building the Models

Next, I split the build data into training and validation sets (the latter for out-of-sample error estimates) using the caret package. I also loaded the rpart and randomForest packages required by the models that follow.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart); library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
inBuild <- createDataPartition(build$classe, p=0.7, list=FALSE)
validation <- build[-inBuild,]; training <- build[inBuild,]
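
Because createDataPartition performs stratified sampling on classe, the split should preserve the class proportions; a quick check (sketch):

round(prop.table(table(training$classe)), 3)   ## class proportions in the training set
round(prop.table(table(validation$classe)), 3) ## should closely match the line above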

Then, I generated my models. I began with the computationally simplest and easiest model to interpret: a decision tree.

set.seed(1001)
s0 <- system.time({fit0 <- train(classe~., data=training, method="rpart")})[3] ## decision tree
pred0 <- predict(fit0, validation)
cm0 <- confusionMatrix(pred0, validation$classe)[2:3]
print(cm0)
## $table
##           Reference
## Prediction    A    B    C    D    E
##          A 1517  496  473  429  155
##          B   33  383   35  165  143
##          C  119  260  518  370  302
##          D    0    0    0    0    0
##          E    5    0    0    0  482
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   4.927782e-01   3.369250e-01   4.799263e-01   5.056373e-01   2.844520e-01 
## AccuracyPValue  McnemarPValue 
##  1.832826e-248            NaN
print(s0)
## elapsed 
##  15.285

The decision tree in this case was a poor model: it predicted the classe variable with only 49% accuracy, and it never predicted class D at all (4). However, it trained relatively quickly, in about 15 seconds on a single core.
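
To see which splits the tree actually learned, the final rpart model can be drawn. A minimal sketch, assuming the rpart.plot package is installed (it is not used elsewhere in this analysis):

library(rpart.plot)  ## assumed extra dependency; provides prp()
prp(fit0$finalModel) ## draw the fitted classification tree and its splits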

Next, I generated a random forest model, again using the caret package. My machine had 32 cores, so I was able to run the following tasks in parallel using the doParallel package (5).

library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
cl <- makeCluster(detectCores())
print(cl)
## socket cluster with 32 nodes on host 'localhost'
registerDoParallel(cl)
set.seed(1001)
s1 <- system.time({fit1 <- train(classe~., data=training, method="rf", allowParallel=TRUE)})[3] ## random forest
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
stopCluster(cl)
pred1 <- predict(fit1, validation)
cm1 <- confusionMatrix(pred1, validation$classe)[2:3]
print(cm1)
## $table
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    6    0    0    0
##          B    1 1131    9    0    0
##          C    0    2 1016   22    0
##          D    0    0    1  941    4
##          E    0    0    0    1 1078
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9921835      0.9901116      0.9895875      0.9942718      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
print(s1)
## elapsed 
## 327.909

While this method was extremely accurate, correctly predicting the classe variable 99% of the time, it was computationally expensive: even with 32 cores, it took 328 seconds to train. In machine learning, there is a tradeoff between computational expense and model accuracy. To reduce the time spent training, I re-ran the model using 3-fold cross-validation (4, 6).

cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(1001)
s1.1 <- system.time({fit1.1 <- train(classe~., data=training, method="rf", allowParallel=TRUE,
                trControl = trainControl(method="cv", number=3))})[3] ## rf with 3-fold CV
stopCluster(cl)
pred1.1 <- predict(fit1.1, validation)
cm1.1 <- confusionMatrix(pred1.1, validation$classe)[2:3]
print(cm1.1)
## $table
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    5    0    0    1
##          B    2 1130    7    1    2
##          C    0    3 1016   17    2
##          D    0    1    3  945    5
##          E    0    0    0    1 1072
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9915038      0.9892523      0.9888140      0.9936876      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
print(s1.1)
## elapsed 
##  99.045

Switching from caret's default resampling scheme (25 bootstrap resamples) to 3-fold cross-validation meant that each candidate value of mtry was fit 3 times rather than 25, so the training time dropped from 328 to 99 seconds (a 70% reduction). The trade-off was a 0.07 percentage-point decrease in out-of-sample accuracy.
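
This can be verified from the resampling settings stored on each fitted model (a quick sketch):

fit1$control$method; fit1$control$number     ## "boot", 25: caret's default resampling
fit1.1$control$method; fit1.1$control$number ## "cv", 3

I then plotted the final model's error rate against the number of trees to determine the optimum number to train.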

plot(fit1.1$finalModel) 

The overall error for fit1.1 converged at around 100 trees (4), so I set the number of trees to 100 for a third random forest model, fit1.2, to generate a faster algorithm.

cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(1001)
s1.2 <- system.time({fit1.2 <- train(classe~., data=training, method="rf", allowParallel=TRUE,
                ntree = 100, trControl = trainControl(method="cv", number=3))})[3] ## rf, 3-fold CV, 100 trees
stopCluster(cl)
pred1.2 <- predict(fit1.2, validation)
cm1.2 <- confusionMatrix(pred1.2, validation$classe)[2:3]
print(cm1.2)
## $table
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    7    0    0    0
##          B    1 1132    3    1    0
##          C    1    0 1022   17    4
##          D    0    0    1  946    7
##          E    0    0    0    0 1071
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9928632      0.9909719      0.9903653      0.9948517      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
print(s1.2)
## elapsed 
##  26.325

The tuned random forest model with cross-validation achieved an out-of-sample accuracy of 99% and took only 26 seconds to train.
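
As a cross-check, the out-of-bag (OOB) error reported by the underlying randomForest object provides a similar error estimate without touching the validation set (a quick sketch):

fit1.2$finalModel$err.rate[100, "OOB"] ## OOB error after all 100 trees; roughly 1 minus the validation accuracy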

Next, I explored boosting, another ensemble meta-algorithm, by training a gradient boosted model (gbm) with 3-fold cross-validation.

cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(100)
s2 <- system.time({fit2 <- train(classe~., data=training, method="gbm", verbose=FALSE, 
                                 trControl=trainControl(method="cv", number=3))})[3] ## boosting with 3-fold CV
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loaded gbm 2.1.1
## Loading required package: plyr
stopCluster(cl)
pred2 <- predict(fit2, validation)
cm2 <- confusionMatrix(pred2, validation$classe)[2:3]
print(cm2)
## $table
##           Reference
## Prediction    A    B    C    D    E
##          A 1645   33    0    1    3
##          B   25 1080   35    4   14
##          C    2   23  978   35   15
##          D    2    2   12  920   22
##          E    0    1    1    4 1028
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.602379e-01   9.496970e-01   9.549273e-01   9.650858e-01   2.844520e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00   1.440974e-08
print(s2)
## elapsed 
##  54.211

The boosting model achieved 96% out-of-sample accuracy and trained in 54 seconds.
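
caret tuned gbm over its default parameter grid; the combination it selected and the accuracy across the grid can be inspected (sketch):

fit2$bestTune ## tuning parameter combination chosen by cross-validation
plot(fit2)    ## accuracy across the gbm tuning grid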

Finally, I generated a bagging algorithm to predict the classe variable.

cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(1001)
s3 <- system.time({fit3 <- train(classe~., data=training, method="treebag")})[3] ## bagging
## Loading required package: ipred
## Loading required package: e1071
stopCluster(cl)
pred3 <- predict(fit3, validation)
cm3 <- confusionMatrix(pred3, validation$classe)[2:3]
print(cm3)
## $table
##           Reference
## Prediction    A    B    C    D    E
##          A 1669    5    0    0    1
##          B    3 1120    8    1    3
##          C    0    8 1012   21    3
##          D    1    4    6  942    2
##          E    1    2    0    0 1073
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9882753      0.9851696      0.9851848      0.9908663      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
print(s3)
## elapsed 
##  50.488

The bagging model achieved 99% out-of-sample accuracy and trained in 50 seconds in parallel.
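
Note that treebag has no tuning parameters, so resampling here only estimates performance; because no trControl was supplied, caret fell back to its default bootstrap scheme (sketch):

fit3$control$method; fit3$control$number ## "boot", 25: the caret default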

Model Evaluation

To select the best performing model, I generated a data frame of the models’ out-of-sample accuracy and elapsed training time. The tuned random forest (rfcv100) turned out to be both the most accurate model and the ensemble method with the lowest training time. The random forest without cross-validation (rf) was a close second in accuracy but was by far the most costly model to train, and the untuned cross-validated random forest (rfcv) was still more expensive than the boosting (boost) and bagging (bag) models.

eval <- data.frame(tree=cm0$overall[1], rf=cm1$overall[1], rfcv=cm1.1$overall[1], 
                   rfcv100=cm1.2$overall[1], boost=cm2$overall[1], bag=cm3$overall[1])
time <- data.frame(tree=s0, rf=s1, rfcv=s1.1, rfcv100=s1.2, boost=s2, bag=s3)
res <- rbind(eval, time)
print(res)
##                tree          rf       rfcv    rfcv100      boost
## Accuracy  0.4927782   0.9921835  0.9915038  0.9928632  0.9602379
## elapsed  15.2850000 327.9090000 99.0450000 26.3250000 54.2110000
##                 bag
## Accuracy  0.9882753
## elapsed  50.4880000
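
The accuracy/time trade-off is easier to see graphically; a minimal sketch using the res data frame built above:

x <- unlist(res["elapsed",]); y <- unlist(res["Accuracy",]) ## one point per model
plot(x, y, xlab="Training time (s)", ylab="Out-of-sample accuracy", pch=19)
text(x, y, labels=names(res), pos=3) ## label each point with its model name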

I plotted the top ten most important variables from the rfcv100 model. Then, I viewed the final model and plotted its cross-validation accuracy across the tuning grid as well as its error convergence across trees.

plot(varImp(fit1.2), top=10)

fit1.2$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 100, mtry = param$mtry, allowParallel = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.79%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3899    6    0    0    1 0.001792115
## B   25 2623    9    0    1 0.013167795
## C    0   16 2369   11    0 0.011268781
## D    0    2   22 2225    3 0.011989343
## E    0    2    4    7 2512 0.005148515
plot(fit1.2)

plot(fit1.2$finalModel)
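
Finally, the 20-case testing set downloaded at the start can be scored with the selected rfcv100 model. A minimal sketch, assuming its columns line up with the training file and mirroring the same pre-processing:

testing[,8:159] <- lapply(testing[,8:159], as.numeric) ## same conversion applied to WLE
predict(fit1.2, newdata=testing) ## predicted classe for the 20 held-out cases

Because caret's predict() uses only the predictor columns the model was trained on, the identifier columns in testing can stay in place.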