Procedure

Data Aggregation

Raw data was gathered from accelerometers mounted on the belt, forearm, arm, and dumbbell of six subjects. Each subject was asked to perform barbell lifts correctly and incorrectly in five different ways. The original dataset is available at: http://groupware.les.inf.puc-rio.br/har

# Load the raw training data
dat <- read.csv("pml-training.csv")

A quick review of the data reveals a large number of missing values (NA) that would skew our prediction algorithm, so we need to clean the data first. In this case, any column missing more than 90% of its values is removed.

# Drop any column where more than 90% of the values are NA
dat <- dat[, colSums(is.na(dat)) <= 0.9 * nrow(dat)]

Next we remove predictors that have zero or near-zero variance.

library(caret, quietly=TRUE)
# nearZeroVar() returns the indices of low-information columns; drop them
dat <- dat[-nearZeroVar(dat)]

Before proceeding we further refine our feature set to the predictors most strongly related to the outcome. We fit a preliminary random forest on the entire cleaned dataset, compared each variable's MeanDecreaseAccuracy and MeanDecreaseGini rankings, and kept only the predictors that appeared near the top of both lists. Although this almost certainly sacrifices some accuracy, it made the model trainable on our limited hardware.
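The exact importance call was not preserved in this report; below is a minimal sketch of how the two rankings can be reproduced with the randomForest package. It assumes the bookkeeping columns (X, user_name, cvtd_timestamp) are dropped first and uses a reduced ntree to keep the preliminary fit tractable.

# Sketch only: not the exact call from the original analysis
library(randomForest, quietly=TRUE)
# Drop non-sensor bookkeeping columns and ensure the outcome is a factor
rfDat <- dat[, !(names(dat) %in% c("X", "user_name", "cvtd_timestamp"))]
rfDat$classe <- as.factor(rfDat$classe)
rfPrelim <- randomForest(classe ~ ., data=rfDat, importance=TRUE, ntree=100)
imp <- importance(rfPrelim)
# Top ten predictors by each measure
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing=TRUE), ], 10)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing=TRUE), ], 10)
# varImpPlot(rfPrelim) displays both rankings side by side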

# Indices of the selected predictors plus the outcome (classe)
cols <- c(6, 7, 8, 9, 41, 44, 45, 46, 47, 59)
names(dat)[cols]
##  [1] "num_window"        "roll_belt"         "pitch_belt"       
##  [4] "yaw_belt"          "accel_dumbbell_y"  "magnet_dumbbell_y"
##  [7] "magnet_dumbbell_z" "roll_forearm"      "pitch_forearm"    
## [10] "classe"
# Subset to the selected columns
dat <- dat[, cols]

Finally, we split the data into a training set (70%) and a held-out testing set (30%) so we can estimate out-of-sample performance before making predictions:

# caret is already loaded; partition 70% for training, 30% for testing
inTrain <- createDataPartition(dat$classe, p=0.7, list=FALSE)
training <- dat[inTrain,]
testing <- dat[-inTrain,]

We train a random forest on the training set, using 4-fold cross-validation for resampling and centering and scaling the predictors as a preprocessing step. Training is parallelized across all available cores.

set.seed(33433)
library(doParallel, quietly=TRUE)
# Spin up one worker per core to speed up training
cl <- makeCluster(detectCores())
registerDoParallel(cl)

modFit <- train(classe ~ ., method="rf", preProcess=c("center", "scale"),
                trControl=trainControl(method = "cv", number = 4), data=training)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
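
With the model trained, the worker cluster can be shut down, a housekeeping step the original run omits:

# Release the parallel workers and return to sequential execution
stopCluster(cl)
registerDoSEQ()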
# Inspect the final forest (OOB error estimate and training confusion matrix)
print(modFit$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.12%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3905    1    0    0    0    0.000256
## B    3 2654    1    0    0    0.001505
## C    0    4 2391    1    0    0.002087
## D    0    0    1 2251    0    0.000444
## E    0    3    0    2 2520    0.001980
# Evaluate the model on the held-out test set
confusionMatrix(predict(modFit, testing), testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    1    0    0    0
##          B    0 1136    0    0    1
##          C    1    0 1024    3    0
##          D    0    2    2  960    0
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                         
##                Accuracy : 0.998         
##                  95% CI : (0.997, 0.999)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.998         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.999    0.997    0.998    0.996    0.999
## Specificity             1.000    1.000    0.999    0.999    1.000
## Pos Pred Value          0.999    0.999    0.996    0.996    0.999
## Neg Pred Value          1.000    0.999    1.000    0.999    1.000
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.193    0.174    0.163    0.184
## Detection Prevalence    0.284    0.193    0.175    0.164    0.184
## Balanced Accuracy       1.000    0.999    0.999    0.998    0.999

As you can see, even this small subset of nine predictors yields surprisingly accurate results: 99.8% accuracy on the held-out test set, for an expected out-of-sample error of roughly 0.2%.
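
For completeness, the expected out-of-sample error is simply one minus the held-out accuracy; a minimal sketch:

# Estimated out-of-sample error = 1 - held-out accuracy
cm <- confusionMatrix(predict(modFit, testing), testing$classe)
1 - cm$overall["Accuracy"]   # approximately 0.002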

Test Answers

# Predict the classe of the 20 graded test cases
predict(modFit, read.csv("pml-testing.csv"))
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E