Exploring a 3-D Synthetic Dataset

R-bloggers 2025-04-12

[This article was first published on John Russell, and kindly contributed to R-bloggers.]


Exploring the HistData package

Over on BlueSky, I have been working through a few challenges. For the months of February and March, I participated in the DuBois Challenge, where you take a week to recreate one of the powerful visualizations that W.E.B. Du Bois produced for the Paris Exposition. My work there, complete with code, can be found on my GitHub.

Inspired by this, I’ve also been doing the #30DayChartChallenge, where you make a chart a day on a theme that changes each day. I have taken this as an opportunity to explore Michael Friendly’s HistData package, which draws from his excellent book with Howard Wainer. I have done posts on John Snow, the Trial of the Pyx, Florence Nightingale, and others on my GitHub. However, one dataset that a simple plot doesn’t do justice to is the Pollen dataset. It is a synthetic dataset that was built as a data challenge (mtcars and flights, by contrast, are real-world data that have become the basic datasets for reprexes).

This dataset, however, shows the power of plotly.

Code in R

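library(tidyverse)  # tibble, dplyr and friends
library(HistData)   # provides the Pollen dataset
library(plotly)     # interactive 3-D plotting

data("Pollen")
# Pollen ships as a plain data.frame; convert it so it prints as a tibble
head(as_tibble(Pollen))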

# A tibble: 6 × 5
  ridge   nub crack  weight density
  <dbl> <dbl> <dbl>   <dbl>   <dbl>
1 -2.35  3.63  5.03  10.9     -1.39
2 -1.15  1.48  3.24  -0.594    2.12
3 -2.52 -6.86 -2.80   8.46    -3.41
4  5.75 -6.51 -5.15   4.35   -10.3 
5  8.75 -3.90 -1.38 -14.9     -2.42
6 10.4  -3.16 12.8  -14.9     -6.49

The first three variables (ridge, nub and crack) are meant to be plotted on the x, y and z axes, while the other two (weight and density) describe the grains of pollen. A quick correlation shows at least one strong relationship that can be brought out through the use of color: weight is strongly negatively correlated with ridge, the x-axis variable (r = -0.90).

Code in R

res <- cor(Pollen)
round(res,2)

        ridge   nub crack weight density
ridge    1.00  0.13 -0.13  -0.90   -0.57
nub      0.13  1.00  0.08  -0.17    0.33
crack   -0.13  0.08  1.00   0.27   -0.15
weight  -0.90 -0.17  0.27   1.00    0.24
density -0.57  0.33 -0.15   0.24    1.00
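Reading the strongest relationship off a matrix by eye works for five variables, but it can also be done programmatically. The following is a minimal sketch, not part of the original analysis: `rank_correlations` is a hypothetical helper name, and it simply lists each distinct pair of columns once, sorted by the absolute value of the correlation.

```r
library(HistData)

data("Pollen")

# Hypothetical helper (not from any package): rank every distinct pair of
# numeric columns by the absolute value of their Pearson correlation.
rank_correlations <- function(df) {
  cm  <- cor(df)
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
  # strongest relationships first, regardless of sign
  out[order(-abs(out$r)), ]
}

pairs_ranked <- rank_correlations(Pollen)
head(pairs_ranked, 3)
```

The first row should be the ridge and weight pair, matching the -0.90 entry in the matrix above, which is why weight is a natural choice for the color scale in the plot.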

However, when you plot the dataset, something else shows up. Thankfully, plotly allows you to drag a plot around to explore it.

Code in R

plot_ly(Pollen, x = ~ridge, y = ~nub, z = ~crack) |>
  # color by weight; a fixed marker size belongs in `marker`, not in `size`
  add_markers(color = ~weight, marker = list(size = 2)) |>
  layout(title = "David Coleman's Synthetic Pollen Dataset") |>
  config(displayModeBar = FALSE)
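Dragging is the interactive way to explore, but the starting viewpoint can also be set in code, which helps when you want a reproducible angle for a screenshot. The sketch below is an assumption-laden variation on the plot above: the `eye` coordinates and the axis titles are arbitrary illustrative choices, not values from the original post.

```r
library(HistData)
library(plotly)

data("Pollen")

p <- plot_ly(Pollen, x = ~ridge, y = ~nub, z = ~crack) |>
  add_markers(color = ~weight, marker = list(size = 2)) |>
  layout(
    title = "David Coleman's Synthetic Pollen Dataset",
    scene = list(
      # camera$eye sets where the viewer stands relative to the plot center;
      # these numbers are illustrative, change them to change the opening angle
      camera = list(eye = list(x = 1.5, y = 1.5, z = 0.6)),
      xaxis  = list(title = "ridge (x)"),
      yaxis  = list(title = "nub (y)"),
      zaxis  = list(title = "crack (z)")
    )
  )

p
```

The plot stays fully interactive, so the camera setting only changes where the exploration starts.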
