library(tidyverse)
library(corrplot)
library(randomForest)
library(knitr)
library(rgl) # for 3D scatterplot
$set(webgl = hook_webgl)
knit_hooks::opts_chunk$set(cache=TRUE)
knitr# setupKnitr()
I share here the full code and output for an exploratory data visualization in R Markdown submitted for a contest in January 2017.
The code includes data exploration, feature generation / simplification, summary statistics, plotting distributions of predictors, assessment of correlations, and initial exploration of predictive modeling using random forests.
In January 2017, I had the pleasure of attending the 2017 ComputeFest, hosted by the Institute for Applied Computational Science at Harvard University. The skill-building workshops were excellent, including introductions to a number of topics, e.g.:
- Data Science in Python
- Introduction to MATLAB
- Data Science in Python
- Introduction to R, R Programming for Data Analysis
- Deep Learning for Image Segmentation with TensorFlow
There was also a contest in exploratory data visualization hosted by Prodigy Finance. From the contest instructions:
The idea here is to do exploratory visualization in preparation for data analysis. The dataset we will use is from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/Covertype. The dataset contains cartographic variables that might be useful in predicting the kind of tree cover. Your visualizations should be useful to an analyst in creating machine learning models to analyze this dataset: you are looking for correlations between the “features” and the kind of tree cover.
My wife and I decided to use this as an opportunity to practice our data exploration skills in R. It was quite a suprise when our submission was chosen as the winner – the prize was two Amazon Echos!
I share below our submission. Because it was produced using RMarkdown, it was a simple copy / paste to have it built with Quarto.
Edited to add in 2022-08: slight changes were made to the tidyverse
code to remove deprecated functions like mutate_each
and guide(<scale> = FALSE)
. Otherwise, the code is as submitted in 2017.
Without further ado:
Introduction
Exploratory visualization in preparation for data analysis, of the Covertype Data Set from the UCI machine learning repository with cartographic features for predicting the classes of tree cover.
Goals:
- look for correlations between the features and the kind of tree cover
- label axes clearly and caption and legend your images
- report which puts together these visualizations with some description in PDF or HTML format
As this is only for practice, a test set is not being set aside, as would be normal practice prior to exploratory data analysis in preparation for machine learning.
Dataset loading
The dataset consists of 581,012 rows with the target (forest cover type class) and 54 predictors consisting of 10 continuous variables and 44 factors. The 44 factors actually only represent one of four wilderness areas and one of forty soil types. The forty soil types, in-turn, specify a climatic zone and a geologic zone.
# d <- read.csv(file = 'covtype.data', header = FALSE)
# save(d, file = 'data.Rda')
load('data.Rda')
# Name Data Type Measurement Description
# Elevation quantitative meters Elevation in meters
# Aspect quantitative azimuth Aspect in degrees azimuth
# Slope quantitative degrees Slope in degrees
# Horizontal_Distance_To_Hydrology quantitative meters Horz Dist to nearest surface water features
# Vertical_Distance_To_Hydrology quantitative meters Vert Dist to nearest surface water features
# Horizontal_Distance_To_Roadways quantitative meters Horz Dist to nearest roadway
# Hillshade_9am quantitative 0 to 255 index Hillshade index at 9am, summer solstice
# Hillshade_Noon quantitative 0 to 255 index Hillshade index at noon, summer soltice
# Hillshade_3pm quantitative 0 to 255 index Hillshade index at 3pm, summer solstice
# Horizontal_Distance_To_Fire_Points quantitative meters Horz Dist to nearest wildfire ignition points
# Wilderness_Area (4 binary columns) qualitative 0 (absence) or 1 (presence) Wilderness area designation
# Soil_Type (40 binary columns) qualitative 0 (absence) or 1 (presence) Soil Type designation
# Cover_Type (7 types) integer 1 to 7 Forest Cover Type designation
names(d) <- c(
'elevation','aspect','slope','hydro_h','hydro_v','road_h','shade_09','shade_12','shade_15','fire_h',
'wilderness_1','wilderness_2','wilderness_3','wilderness_4',
paste0(c('soil_'), c(1:40)),
'cover'
)
Dataset pre-processing
The data preprocessing extracts the wilderness areas, soil types, climatic zones, and geologic zones, and replaces the category codes for with human-readable labels.
<- c('Spruce/Fir','Lodgepole Pine','Ponderosa Pine','Cottonwood/Willow','Aspen','Douglas-fir','Krummholz')
cover_types <- c('Rawah', 'Neota', 'Comanche Peak', 'Cache la Poudre')
wilderness_areas
<- c(2,2,2,2,2,2,3,3,4,4,4,4,4,5,5,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8)
climate_zone <- c('lower montane dry','lower montane','montane dry','montane','montane dry and montane','montane and subalpine','subalpine','alpine')
climatic_zones <- climatic_zones[climate_zone]
climate_zone
<- c(7,7,7,7,7,7,5,5,2,7,7,7,7,1,1,1,1,7,1,1,1,2,2,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7)
geologic_zone <- c('alluvium','glacial','shale','sandstone','mixed sedimentary','unspecified','igneous and metamorphic','volcanic')
geologic_zones <- geologic_zones[geologic_zone]
geologic_zone
<- d %>%
d mutate(
id = row_number(),
cover = cover_types[cover]
)
<- d %>%
d_soil select(id, starts_with('soil_')) %>%
gather(key = 'soil_type', value = 'value', -id) %>%
filter(value != 0) %>%
select(-value) %>%
mutate(
soil_type = sub('soil_', '', soil_type) %>% as.numeric(),
climate = climate_zone[soil_type],
climate = factor(climate,
levels = c('alpine','subalpine', 'montane and subalpine',
'montane dry and montane', 'montane', 'montane dry', 'lower montane')),
geologic = geologic_zone[soil_type]
)
<- d %>%
d_wilderness select(id, starts_with('wilderness_')) %>%
gather(key = 'wilderness_type', value = 'value', -id) %>%
filter(value != 0) %>%
select(-value) %>%
mutate(
wilderness_type = sub('wilderness_', '', wilderness_type) %>% as.numeric(),
wilderness_type = wilderness_areas[wilderness_type]
)
<- d[, c(1:10, 55, 56)] %>%
d inner_join(d_soil, by = 'id') %>%
inner_join(d_wilderness, by = 'id') %>%
mutate(across(11:16, factor)) %>%
select(-id)
Initial univariate data exploration
Summary statistics
summary(d)
elevation aspect slope hydro_h
Min. :1859 Min. : 0.0 Min. : 0.0 Min. : 0.0
1st Qu.:2809 1st Qu.: 58.0 1st Qu.: 9.0 1st Qu.: 108.0
Median :2996 Median :127.0 Median :13.0 Median : 218.0
Mean :2959 Mean :155.7 Mean :14.1 Mean : 269.4
3rd Qu.:3163 3rd Qu.:260.0 3rd Qu.:18.0 3rd Qu.: 384.0
Max. :3858 Max. :360.0 Max. :66.0 Max. :1397.0
hydro_v road_h shade_09 shade_12
Min. :-173.00 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 7.00 1st Qu.:1106 1st Qu.:198.0 1st Qu.:213.0
Median : 30.00 Median :1997 Median :218.0 Median :226.0
Mean : 46.42 Mean :2350 Mean :212.1 Mean :223.3
3rd Qu.: 69.00 3rd Qu.:3328 3rd Qu.:231.0 3rd Qu.:237.0
Max. : 601.00 Max. :7117 Max. :254.0 Max. :254.0
shade_15 fire_h cover soil_type
Min. : 0.0 Min. : 0 Aspen : 9493 29 :115247
1st Qu.:119.0 1st Qu.:1024 Cottonwood/Willow: 2747 23 : 57752
Median :143.0 Median :1710 Douglas-fir : 17367 32 : 52519
Mean :142.5 Mean :1980 Krummholz : 20510 33 : 45154
3rd Qu.:168.0 3rd Qu.:2550 Lodgepole Pine :283301 22 : 33373
Max. :254.0 Max. :7173 Ponderosa Pine : 35754 10 : 32634
Spruce/Fir :211840 (Other):244333
climate geologic
alpine : 40437 alluvium : 20987
subalpine :401983 glacial : 92272
montane and subalpine : 8166 igneous and metamorphic:467469
montane dry and montane: 602 mixed sedimentary : 284
montane : 93593
montane dry : 284
lower montane : 35947
wilderness_type
Cache la Poudre: 36968
Comanche Peak :253364
Neota : 29884
Rawah :260796
Distribution of the 10 continuous predictor variables
A plot of the distributions of all 10 continuous predictors (on different scales) confirms that each predictor has adequate variance.
1:10] %>%
d[, gather(key = measure, value = value) %>%
ggplot(aes(value)) +
geom_histogram(fill = 'steelblue3', bins = 30) +
facet_wrap(~measure, scales = 'free', ncol = 3) +
ylab('') +
theme_bw() +
theme(
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank()
)
Marked class imbalance of prediction target
A histogram shows marked class imbalance of the seven forest cover categories. This is important to be aware of because of its potential to affect classifier training later on.
<- d %>%
g ggplot(aes(cover, color = cover, fill = cover)) +
geom_bar(alpha = 0.5) +
labs(title = 'Frequency of Forest Cover Types in the Dataset', y = 'Frequency') +
theme_bw() +
guides(fill = "none", color = "none") +
theme(
axis.text.x = element_text(angle=45, hjust = 1),
axis.title.x = element_blank()
) g
Multivariate data exploration
Assessment of correlation between numeric predictors
A correlation matrix of the 10 continuous predictor features shows some degree of correlation, both positive and negative, between some pairs of predictors, but overall there does not appear to be an excessive amount.
<- cor(d[, 1:10])
c corrplot(c, order = 'hclust', tl.cex = 0.75)
Pairwise analysis of continuous variables by balanced sampling
The following plot shows pairwise comparisons of each of the 10 continuous predictors with 30 randomly selected points from each of the 7 types of forest cover (color-coded). Selecting 30 at random from each of the forest covers counteracts the severe class imbalance, making it easier to see clusters of forest cover.
The plot works better on a large display, but is quite effective at demonstrating interesting associations between predictors as well as clustering of forest cover type as related to pairs of predictors.
Note the non-linear correlation between aspect or slope and the three shade-related features, which makes sense given the relationship between direction to sun + generation of shade by slope. Some of the U-shaped associations between the predictors are easily appreciated here, while not being apparent in the correlation matrix above.
# library(scales)
# show_col(hue_pal()(7)) # show the default ggplot2 color palette for 7 colors
<- d[, c(1:11)] %>% group_by(cover) %>% sample_n(size = 30) # sample equal number from each class?a
d2 c(1:10)] %>%
d2[,pairs(
main = 'Pairs plot of continuous predictors and forest cover type',
pch = 21, cex = 0.55,
bg = c('#F8766D','#C49A00','#53B400','#00C094','#00B6EB','#A58AFF','#FB61D7')[unclass(d2$cover)],
col = alpha('#FFFFFF', 0.0), upper.panel = NULL)
Correlations of target with predictors
Variation in distribution of forest cover by each predictor
A density plot of each of the 7 classes of forest cover was examined for each of the 10 numeric predictors.
Some interesting features are immediately apparent; for example, in the panel for elevation, one can see that cottonwoods are predominantly found at lower elevations, aspens in the mid-range, and krummholz at higher elevations.
c(1:11)] %>%
d[, gather(key = measure, value = value, -cover) %>%
ggplot(aes(value, color = cover, fill = cover)) +
geom_density(alpha = 0.2) + theme_bw() +
facet_wrap(~measure, scales = 'free', ncol = 3) +
labs(title = 'Distribution of Forest Cover by Continuous Numeric Predictors') +
ylab('') +
theme(
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
legend.title = element_blank()
# legend.position = c(0.9, 0.1), legend.justification = c("right", "bottom"),
# legend.direction = "horizontal"
)
Elevation
%>%
d ggplot(aes(elevation, fill = cover, color = cover)) +
geom_density(alpha = 0.3) +
labs(
title = 'Distribution of Forest Cover by Elevation',
x = 'Elevation', y = 'Density'
+
) theme_bw() +
theme(
legend.title = element_blank(),
axis.text.y = element_blank()
)
Violin plot of distribution of cover types by numeric predictor
To reduce the effect of the overlapping density distributions, violin plots are another way to visualize the relationship between each type of forest cover and the continuous variables.
Elevation shows significantly more distinct profiles for the different forest cover types than any other single predictor.
c(1:11)] %>%
d[, gather(key = measure, value = value, -cover) %>%
ggplot(aes(cover, value, color = cover)) +
geom_violin(aes(fill = cover), alpha = 0.5) +
# geom_boxplot(width = 0.15) +
facet_wrap(~measure, scales = 'free', ncol = 3) +
labs(
title = 'Distribution of forest cover by continuous numeric predictors',
x = '', y = '') +
theme_bw() +
theme(
axis.text.y = element_blank(), axis.text.x = element_blank(),
axis.ticks = element_blank(), legend.title = element_blank())
Density plot of cover by pairs of predictors
Visualizing forest cover types based on the density plot of two predictors shows better separation between forest cover types than with single predictors.
For the sake of space, only the density of forest cover with respect to elevation and fire point distance are shown, although other pairs of predictors were also assessed.
%>% ggplot(aes(elevation, fire_h, color = cover)) +
d geom_density_2d() +
facet_wrap(~cover) +
labs(
title = '2D-density of forest cover type by elevation and fire point distance',
x = 'Elevation', y = 'Horizontal distance to fire points'
+
) theme_bw() +
guides(color = "none")
Interactive 3D scatter plot of forest cover by 3 predictors
Clustering of forest types can be appreciated upon rotation of the 3D scatterplot by 3 predictor features – elevation, aspect, and fire point distance. Again, to deal with the forest cover class imbalance, 60 of each forest class were randomly sampled from the dataset to ensure all classes are represented.
<- d[, c(1:11)] %>% group_by(cover) %>% sample_n(size = 60)
d2
with(d2,
plot3d(
elevation, aspect, fire_h, type="s", size = 1.2,
col = c('#F8766D','#C49A00','#53B400','#00C094','#00B6EB','#A58AFF','#FB61D7')[unclass(cover)]
))rglwidget()
Absolute counts of forest cover types by soil type
The soil types representation is very unbalanced within the data set, as shown below. Because of this, the subsequent plots will show the proportions of each type of forest cover, within particular predictor classes.
%>% ggplot(aes(soil_type, fill = cover, color = cover)) +
d geom_bar(alpha = 0.5) +
theme_bw() +
labs(
title = 'Counts of forest cover type by soil type',
x = 'Soil type', y = 'Counts'
+
) theme(
panel.grid = element_blank(),
axis.text.x = element_text(size = 8),
legend.title = element_blank(),
axis.text.y = element_blank()
)
Proportion of forest cover types by soil type
%>% ggplot(aes(soil_type, fill = cover, color = cover)) +
d geom_bar(alpha = 0.5, position = 'fill') +
theme_bw() +
labs(
title = 'Proportion of forest cover type by soil type',
x = 'Soil type', y = 'Proportion'
+
) theme(
panel.grid = element_blank(),
axis.text.x = element_text(size = 8),
legend.title = element_blank(),
axis.text.y = element_blank()
)
Forest cover by climatic zone
%>% ggplot(aes(climate, fill = cover, color = cover)) +
d geom_bar(position = 'fill', alpha = 0.5) +
labs(title = 'Proportion of Forest Cover by Climatic Zone', y = 'Proportion') +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle=20, hjust = 1),
legend.title = element_blank(), axis.title.x = element_blank())
Forest cover by geologic zone
%>% ggplot(aes(geologic, fill = cover, color = cover)) +
d geom_bar(position = 'fill', alpha = 0.5) +
labs(title = 'Proportion of Forest Cover by Geologic Zone', y = 'Proportion') +
theme_bw() +
theme(
panel.grid = element_blank(),
axis.text.x = element_text(angle=20, hjust = 1),
axis.title.x = element_blank(),
legend.title = element_blank()
)
Forest cover by wilderness area
%>% ggplot(aes(wilderness_type, fill = cover, color = cover)) +
d geom_bar(position = 'fill', alpha = 0.5) +
labs(title = 'Proportion of Forest Cover by Wilderness Area', y = 'Proportion') +
theme_bw() +
theme(
panel.grid = element_blank(),
axis.title.x = element_blank(),
legend.title = element_blank()
)
Proportion of forest cover type by climatic zone, geologic zone, and wilderness area
The following plot demonstrates the interactions between the target category and three different categorical predictors.
%>% ggplot(aes(wilderness_type, fill = cover, color = cover)) +
d geom_bar(position = 'fill', alpha = 0.5) +
facet_grid(climate ~ geologic) +
theme_bw() +
labs(
title = 'Proportion of forest cover type by climatic, geologic zone, and wilderness area',
y = 'Climatic Zone and Forest Cover', x = 'Geologic Zone and Wilderness Area'
+
) theme(
panel.grid = element_blank(),
axis.text.x = element_text(angle=55, hjust = 1),
legend.title = element_blank(),
# axis.title.x = element_blank(),
axis.text.y = element_blank(),
# axis.text.x = element_blank(),
axis.ticks = element_blank()
)
Initial exploration of predictive modeling
Just for fun: Random Forests for Forest Cover Prediction
(It just seems right to use this particular model to predict this particular target, right?)
The data dictionary reports classification performance from ~1998, using the first 15,120 records used for training and validation, and the remaining 565,892 records for model performance testing, with the following classification accuracy:
- 70% Neural Network (backpropagation)
- 58% Linear Discriminant Analysis
Without doing proper cross-validation for determination of optimal hyperparameters, the first 15,120 records were used for training a Random Forest classifier.
A training set out-of-bag accuracy of 83.5% was obtained.
<- d[c(1:15120), ]
train <- d[-c(1:15120), ]
test <- randomForest(cover ~ ., data = train, importance = TRUE, ntree = 300)
rfmodel # save(rfmodel, file = 'rfmodel.Rda')
# load('rfmodel.Rda')
rfmodel
Call:
randomForest(formula = cover ~ ., data = train, importance = TRUE, ntree = 300)
Type of random forest: classification
Number of trees: 300
No. of variables tried at each split: 3
OOB estimate of error rate: 12.95%
Confusion matrix:
Aspen Cottonwood/Willow Douglas-fir Krummholz Lodgepole Pine
Aspen 2062 0 20 0 44
Cottonwood/Willow 0 2097 22 0 0
Douglas-fir 8 49 1907 0 17
Krummholz 0 0 0 2091 5
Lodgepole Pine 149 0 49 17 1538
Ponderosa Pine 19 88 234 0 10
Spruce/Fir 37 0 3 123 337
Ponderosa Pine Spruce/Fir class.error
Aspen 30 4 0.04537037
Cottonwood/Willow 41 0 0.02916667
Douglas-fir 178 1 0.11712963
Krummholz 0 64 0.03194444
Lodgepole Pine 49 358 0.28796296
Ponderosa Pine 1809 0 0.16250000
Spruce/Fir 2 1658 0.23240741
Variable importance in random forest model
importance(rfmodel) %>%
%>%
as.data.frame cbind(list(names = row.names(importance(rfmodel)))) %>%
select(c(10,8,9)) %>%
arrange(desc(MeanDecreaseAccuracy))
names MeanDecreaseAccuracy MeanDecreaseGini
elevation elevation 82.05646 2932.6349
hydro_h hydro_h 76.21345 769.4528
shade_12 shade_12 68.13971 520.2648
road_h road_h 59.76714 1141.5291
fire_h fire_h 57.97839 953.2702
shade_09 shade_09 56.82135 643.0753
soil_type soil_type 55.83001 1486.8193
hydro_v hydro_v 53.17607 625.0303
shade_15 shade_15 52.53195 544.1503
aspect aspect 50.83801 597.2513
slope slope 39.22942 401.2883
wilderness_type wilderness_type 38.58556 759.7641
climate climate 35.37959 1368.8726
geologic geologic 24.12719 158.3731
Accuracy on test set
Using the random forest model trained on the first 15,120 samples yielded a test set (remaining 565,892 samples) accuracy of 75.0%, similar to the performance of the neural network from ~1998. Optimizing model training would likely yield substantial improvements.
The difference in classification accuracy from the OOB training set and the test set seems a bit extreme. One possibility is that the ordering of the samples is NOT random, so choosing the first 15,120 samples may not be a representative sample.
<- predict(rfmodel, test)
pred sum(pred == test$cover) / nrow(test)
[1] 0.7539318