Using fuzzyjoin to work with NCES data

R-bloggers 2025-01-19

[This article was first published on John Russell, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Tidy data structure, messy data

Like families, tidy datasets are all alike but every messy dataset is messy in its own way. – Hadley Wickham

We often work with datasets where there is a unique identifier that can be used to link or filter the data, typically involving some type of a join.

However, sometimes the identifiers that we wish to use to join are near matches. For instance, maybe you are joining a dataset by names, where one dataset has John, and the dataset you are joining has john. Or maybe there are text entry errors and it is instead Jon.

What you need is a way to join data that is a little bit fuzzy. Enter (David Robinson)[https://github.com/dgrtwo/fuzzyjoin]’s excellent fuzzyjoin package.

Introducing an example – schools in the Virgin Islands

There are 20 schools listed in the public school database for the National Center for Education Statistics in the Virgin Islands. I’ve downloaded the dataset and you can pull it from my github repo through the code below. A scrollable DT table to explore below (disabling a lot of features just to see the table):

Code in R
library(tidyverse)
library(fuzzyjoin)
library(kableExtra)
VI <- read_csv("https://github.com/drjohnrussell/drjohnrussell.github.io/raw/refs/heads/master/posts/2025-01-17-fuzzyjoin-in-action/data/VIschools.csv")

VI |> 
  kbl() |> 
  kable_paper() |> ##nice theme
  kable_styling(bootstrap_options = c("striped", "hover")) |> 
  scroll_box(height = "200px") ##adds scrolling
Public Schools of Virgin Islands schoolname nces district address state low_grade hi_grade total Alfredo Andrews Elementary School 780000200002 Saint Croix School District RFD 1 KINGSHILL VI PK 06 495 Claude O. Markoe Elementary School 780000200006 Saint Croix School District PLOTS 71 75 MARS HILL VI PK 06 403 Eulalie Rivera 780000200011 Saint Croix School District ROUTE 1 GROVE PLACE VI PK 08 645 St. Croix Educational Complex High School 780000200013 Saint Croix School District RR2 KINGSHILL VI 09 12 893 Juanita Gardine 780000200021 Saint Croix School District ESTATE RICHMOND VI PK 08 305 Lew Muckle Elementary School 780000200023 Saint Croix School District 317 SION FARM VI PK 06 339 Pearl B. Larsen 780000200028 Saint Croix School District ESTATE ST PETERS VI PK 08 432 Ricardo Richards Elementary School 780000200029 Saint Croix School District 491 BARREN SPOT VI PK 06 392 St. Croix Central High School 780000200030 Saint Croix School District RSD 2 KINGSHILL VI 09 12 717 John H. Woodson Junior High School 780000200037 Saint Croix School District RURAL ROUTE 1 KINGSHILL VI 07 08 461 Charlotte Amalie High School 780003000005 Saint Thomas – Saint John School District 8 and 9 ESTATE THOMAS VI 09 12 1076 Ivanna Eudora Kean High School 780003000015 Saint Thomas – Saint John School District 1 and 2 ESTate NAZARETH VI 09 12 738 Jane E. Tuitt Elementary School 780003000018 Saint Thomas – Saint John School District 19 LEVOKI STRAEDE VI KG 04 154 Joseph Gomez Elementary School 780003000019 Saint Thomas – Saint John School District 142 ANNAS RETREAT VI PK 05 462 Joseph Sibilly Elementary School 780003000020 Saint Thomas – Saint John School District 14 15 16 ESTATE ELIZABETH VI PK 05 226 Julius E. Sprauve School 780003000022 Saint Thomas – Saint John School District 14 18 ESTATE ENIGHED VI PK 08 225 Lockhart Elementary School 780003000024 Saint Thomas – Saint John School District 41 ESTATE THOMAS VI KG 08 977 Ulla F. Muller Elementary School 780003000026 Saint Thomas – Saint John School District 7B ESTATE CONTANT VI PK 05 401 Yvonne E. Milliner-Bowsky Elementary School 780003000027 Saint Thomas – Saint John School District 15B and 16 ESTATE MANDAHL VI PK 05 433 Bertha C. Boschulte Middle School 780003000034 Saint Thomas – Saint John School District 9 1 and 12A BOVONI VI 06 08 538

Let’s say we have a dataset with the following names of schools, and we want to pull in information from NCES.

Code in R
sample <- read_csv("https://github.com/drjohnrussell/drjohnrussell.github.io/raw/refs/heads/master/posts/2025-01-17-fuzzyjoin-in-action/data/sample.csv")

sample |> 
  kbl() |> 
  kable_paper()
Dataset to join schoolname state eulalie rivera VI Joseph Elementary School VI Alfredo Andrews Elementary School NY

Enter the fuzzyjoin package, which allows the data to be messy in many ways, depending on what you need. Some of the ways to join outlined in the package are as follows:

  • difference_join – joins that are numeric and within a specified distance
  • geo_join – joins that use distances based on latitude and longitude
  • regex_join – joins that look for common regex patterns (text and position)
  • stringdist_join – joins that take into account small differences in the string

Let’s focus on stringdist_join for special case uses.

Joins that ignore case

Joins that ignore case, where you don’t have to mutate using something like the stringr str_to_lower on the data to change it is a big win. We can use it to match eulalie rivera.

This is what would happen as is:

Code in R
sample[1,1] |> 
  inner_join(VI,by=c("schoolname")) |> 
  kbl() |> 
  kable_paper()
An empty inner join schoolname nces district address state low_grade hi_grade total

On the other hand, we can use the stringdist_join and set ignore_case to equal TRUE.

Code in R
sample[1,1] |> 
  stringdist_join(VI,
                  by=c("schoolname"),
                  max_dist=0,
                  mode="inner",
                  ignore_case=TRUE) |> 
  kbl() |> 
  kable_paper()
A successful join schoolname.x schoolname.y nces district address state low_grade hi_grade total eulalie rivera Eulalie Rivera 780000200011 Saint Croix School District ROUTE 1 GROVE PLACE VI PK 08 645

Joins that take advantage of string distance

The way that I’ve taken advantage of this is in dealing with typos or data where someone may have subtly different names for a school (e.g., one has the word school while the other drops it). What is nice about the fuzzyjoin package is that you can do the join. What is less nice is that it requires you to do a line check afterwards, especially when you are loose with the distances.

Let’s look at the second row, Joseph Elementary School, and have a maximum distance of 8.

Code in R
sample[2,1] |> 
  stringdist_join(VI,
                   by="schoolname",
                   max_dist=8,
                   mode="inner",
                   ignore_case=TRUE,
                  distance_col="stringdistance") |> 
  kbl() |> 
  kable_paper()
Multiple rows loosely match schoolname.x schoolname.y nces district address state low_grade hi_grade total stringdistance Joseph Elementary School Joseph Gomez Elementary School 7.80003e+11 Saint Thomas – Saint John School District 142 ANNAS RETREAT VI PK 05 462 6 Joseph Elementary School Joseph Sibilly Elementary School 7.80003e+11 Saint Thomas – Saint John School District 14 15 16 ESTATE ELIZABETH VI PK 05 226 8 Joseph Elementary School Lockhart Elementary School 7.80003e+11 Saint Thomas – Saint John School District 41 ESTATE THOMAS VI KG 08 977 7

You can see the power, and the danger, of the fuzzyjoin package here. It’s amazing that it picks up the two schools that also have Joseph in their name, but it also suggests that you could change a few letters in the beginning and form the name of another school.

When we used fuzzyjoin, we would arrange the data by the school name and then the distance_col variable so that we could easily do a line check and choose the best match (if available).

Matching off of multiple columns

As a side note, it is nice to use the fuzzyjoin package for seeing errors in multiple columns. Here we can see how it works with the third row of the dataset.

Code in R
sample[3,] |> 
  stringdist_join(VI,
                  by=c("schoolname","state"),
                  mode="inner",
                  max_dist = 4,
                  ignore_case=TRUE,
                  distance_col="stringdistance") |> 
  kbl() |> 
  kable_paper()
schoolname.x state.x schoolname.y nces district address state.y low_grade hi_grade total schoolname.stringdistance state.stringdistance stringdistance Alfredo Andrews Elementary School NY Alfredo Andrews Elementary School 780000200002 Saint Croix School District RFD 1 KINGSHILL VI PK 06 495 0 2 NA

What is nice about this is that, in review, you can see easily through the distance_col variable where the changes were found.

Others have used fuzzyjoin to great effect – I am inspired reading this vignette in particular on geo_joins

Citation

BibTeX citation:
@online{russell2025,
  author = {Russell, John},
  title = {Using Fuzzyjoin to Work with {NCES} Data},
  date = {2025-01-17},
  url = {https://drjohnrussell.github.io/posts/2025-01-17-fuzzyjoin-in-action/},
  langid = {en}
}
For attribution, please cite this work as:
Russell, John. 2025. “Using Fuzzyjoin to Work with NCES Data.” January 17, 2025. https://drjohnrussell.github.io/posts/2025-01-17-fuzzyjoin-in-action/.
To leave a comment for the author, please follow the link and comment on their blog: John Russell.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continue reading: Using fuzzyjoin to work with NCES data