Mapping research landscapes and dynamics: Some basic bibliometric analyses with R

R-bloggers 2025-05-06


Understanding how scientific knowledge develops requires more than merely counting papers and citations. It requires a careful evaluation of how research topics and themes interconnect and transform over time. This is where bibliometric analysis becomes essential. As the volume of scientific journals and papers continues to grow exponentially, bibliometric analyses become indispensable for mapping and synthesizing an increasingly complex information landscape.

Through the analysis of publication information and citation patterns, bibliometric analyses allow us not only to assess scholarly productivity and impact, but more importantly, to quantify the scientific communication processes and to analyze and create indicators that reveal the dynamics and evolution of scientific information within specific disciplines and research programs, organizations, research teams or geographical regions. These kinds of tools are especially valuable for gaining a clearer understanding of research dynamics, which is essential when conducting literature reviews or shaping research strategies.

In this post, I’ll share some R code I developed for a recent bibliometric analysis project I have been involved with. While it’s not as comprehensive or user-friendly as established R packages such as Bibliometrix, which offers a rich suite of tools and an easy-to-use interface, this custom approach gave me the flexibility and control I needed for more tailored data handling and visualization. We’ll walk through a handful of simple bibliometric analysis and visualization techniques in R to reveal key patterns in research data, focusing on keywords and publication trends. Of course, this approach can also be extended to other important data fields, such as words in titles or abstracts. More specifically, we will look at:

1. Word Cloud

We’ll start by building a basic word cloud that highlights the most frequently used keywords across the bibliographic dataset. Here, word size will be proportional to its frequency, offering a fast, intuitive snapshot of the field’s dominant terminology.

2. Keyword Co-Occurrence Network

Next, we’ll construct a keyword co-occurrence network, a visual map that shows how often keywords appear together in academic papers. Each keyword is represented as a node in the network, and edges (or links) are drawn between keywords that co-occur in the same study. The size of a node reflects how frequently a keyword appears, while the thickness of an edge indicates how strongly two keywords are associated. We’ll also apply a community detection algorithm, the Louvain method, to identify clusters within the network, that is, groups of keywords that frequently appear together across documents. These clusters represent thematic groupings or potential research subfields, revealing the underlying conceptual structure of the literature. This approach can help reveal the structure of a research field, showing which themes are more central, which are less developed, and how different areas of research are interconnected.
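If the Louvain method is new to you, a tiny standalone example may help build intuition. Below is a minimal sketch (a toy graph for illustration only, not part of the analysis in this post) showing how igraph’s cluster_louvain() splits two loosely connected triangles into two communities:

# Toy example of Louvain community detection (illustration only)
library(igraph)

g <- graph_from_literal(A-B, A-C, B-C,   # first triangle
                        D-E, D-F, E-F,   # second triangle
                        C-D)             # a single bridge between them
comm <- cluster_louvain(g)
membership(comm)  # expected: {A,B,C} in one community, {D,E,F} in the other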

3. Thematic Map

Based on the co-occurrence network, we’ll generate a thematic map using Callon’s centrality and density metrics. In this map, each cluster from the co-occurrence network is represented as a bubble, named after its most frequent keyword, with bubble size determined by the total frequency of the words in the cluster. The X-axis represents the cluster’s centrality in the network, that is, its degree of interaction with other clusters in the graph, measuring the importance of a research topic. The Y-axis represents density, a metric of the internal strength of a cluster’s network and the development of the topic. When mapping the themes in this plot, we can identify:

- Motor themes (top right quadrant): high centrality and high density, indicating themes that are well developed and crucial for structuring the research field.
- Niche themes (top left quadrant): themes that are highly specialized and internally well developed, but have more limited interaction with other themes.
- Peripheral themes (bottom left quadrant): low centrality and low density, suggesting themes that are underdeveloped and marginal, and either emerging or in decline in the literature.
- Basic themes (bottom right quadrant): high centrality but low density. They are often essential for transdisciplinary research, serving as foundational topics that cross the boundaries of multiple themes; despite their central role in the network, their internal density of connections is low.
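For reference, Callon’s two metrics are commonly formulated (e.g., in Cobo et al. 2011) as

\[
c = 10 \sum_{k \in \Theta,\; h \notin \Theta} e_{kh},
\qquad
d = 100 \cdot \frac{\sum_{i,j \in \Theta} e_{ij}}{w},
\]

where \(\Theta\) is the set of keywords in a theme, \(e_{ij}\) is the co-occurrence weight between keywords \(i\) and \(j\), and \(w\) is the number of keywords in the theme. Note that the helper function in section 7 below uses the unscaled sums of external and internal edge weights. The constant factors don’t matter (the map splits quadrants at the medians), but the unscaled density also omits the division by \(w\), so very large clusters can appear denser than the normalized formula would suggest.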

4. Yearly Keyword Trends

To understand the temporal dynamics of research fields, we’ll build a yearly keyword trend diagram using a Sankey plot. This diagram maps the flow of the most frequent keywords (here I use a cutoff of the ten most common keywords, but this could be extended to as many keywords as necessary), revealing how interest in specific topics rises or fades over time.

5. Decade-Based Keyword Evolution

Finally, we’ll look at how the research field changes through time by aggregating keyword data by decade (or whatever time frame you want to examine). This decade-based evolution diagram shows the progression of top keywords (again using a cutoff of the ten most common keywords per decade) from one decade to the next, capturing long-term shifts and the persistence or disappearance of major research themes.

To start off, we’ll simulate a simple bibliographic dataset to work with. This will consist of a data frame containing 500 publications, each tagged with a publication year ranging from 2000 to 2025, along with a set of keywords. For the purpose of this basic tutorial, I’ve created a list of keywords that might typically appear in an evolutionary ecology or eco-evolutionary research paper, so let’s pretend that we are conducting a review on something like “coevolutionary dynamics in ecological communities”. Of course, in your real-world application you’ll be working with your own bibliographic data, which will likely include additional fields such as authors, titles, abstracts, journals, citation counts, etc. That’s perfectly fine: the analyses here use columns in this simulated data frame named "Year" and "Author_Keywords", and you can easily adapt the code by modifying the column headers to match the structure of your dataset. And as I mentioned before, some of these analyses can be applied to other bibliographic information besides keywords. It goes without saying that real-world bibliographic data will be messy and inconsistent, so you’ll likely need to do a fair amount of data cleaning, such as combining similar keywords and handling typos and linguistic variations (see the short cleaning sketch below), so keep that in mind as you build your own workflow.
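To make that concrete, here is a minimal cleaning sketch. The synonym table is hypothetical, for illustration only; in a real project you would build it by inspecting your own keyword list:

# Minimal keyword-cleaning sketch (synonym table is hypothetical; adapt to your data).
# Expects a data frame with a `keyword` column, like keywords_long below.
library(dplyr)
library(stringr)

synonym_map <- c(                       # hypothetical variant -> canonical form lookup
  "co-evolution"   = "coevolution",
  "coevolutionary" = "coevolution",
  "host parasite"  = "host-parasite"
)

clean_keywords <- function(df) {
  df %>%
    mutate(
      keyword = str_squish(tolower(keyword)),                    # lowercase, trim, collapse spaces
      keyword = str_replace_all(keyword, "_", " "),              # normalize separators
      keyword = coalesce(unname(synonym_map[keyword]), keyword)  # map known variants
    ) %>%
    filter(keyword != "") %>%
    distinct()
}

# Usage (after building keywords_long): keywords_long <- clean_keywords(keywords_long)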

So, let’s start by loading the necessary packages, “creating” and organizing our dataset, and defining parameters for the analysis:

#### 1. Load packages
library(dplyr)
library(tidyr)
library(stringr)
library(igraph)
library(ggplot2)
library(ggrepel)
library(RColorBrewer)
library(wordcloud)
library(networkD3)
library(htmlwidgets)

### 2. Parameters
cat("--- 2. Setting Parameters ---\n")

# Simulation Parameters
n_studies = 500             # Number of simulated studies
start_year = 2000           # Start year for publications
end_year = 2025             # End year for publications
keywords_per_study = 5      # Number of keywords per study

# Keyword Pool
keyword_pool <- c(
  "coevolution", "arms race", "mutualism", "antagonism", "host-parasite",
  "plant-pollinator", "Red Queen hypothesis", "selection pressure", "adaptation",
  "phylogeny", "gene flow", "speciation", "ecological interaction",
  "community structure", "community assembly", "community disassembly",
  "predator-prey", "ecophylogenetics", "co-speciation", "evolutionary dynamics",
  "co-phylogenetic analysis", "adaptive dynamics", "local adaptation",
  "trait evolution", "phylogenetic signal", "functional traits",
  "ecological networks", "niche differentiation", "coevolutionary networks"
)

# Analysis Parameters
keyword_column_sim = "Author_Keywords" # Name for the keyword column in simulated data
year_column_sim = "Year"               # Name for the year column in simulated data
min_cooccurrence = 3                   # Min times two keywords must appear together
min_keyword_freq_network = 3           # Min total frequency for a keyword to be in the network plot
max_words_cloud = 75                   # Max words to display in the word cloud

# Sankey Parameters
num_top_keywords_yearly = 10           # Number of top keywords for the YEARLY Sankey diagram
num_top_keywords_per_decade = 10       # Number of top keywords PER DECADE for the DECADE Sankey diagram

### 3. Simulate Bibliographic Data
cat("--- 3. Simulating Study Data ---\n")
simulated_studies <- tibble(
  paper_id = 1:n_studies,
  Year = sample(start_year:end_year, n_studies, replace = TRUE)
)

# Generate keywords for each study
keywords_list <- lapply(1:n_studies, function(i) {
  sample(keyword_pool, keywords_per_study, replace = FALSE)
})

# Combine into a long format data frame (one row per keyword per paper)
keywords_long_sim <- simulated_studies %>%
  mutate(keywords = keywords_list) %>%
  unnest(keywords) %>%
  rename(!!sym(keyword_column_sim) := keywords,
         !!sym(year_column_sim) := Year) %>%
  select(paper_id, !!sym(year_column_sim), !!sym(keyword_column_sim)) %>%
  mutate(keyword = str_trim(tolower(!!sym(keyword_column_sim)))) %>%
  select(paper_id, year = !!sym(year_column_sim), keyword) %>%
  distinct(paper_id, year, keyword) # Ensure unique keyword per paper/year instance

cat("Generated", n_studies, "studies (", start_year, "-", end_year, ") with",
    nrow(keywords_long_sim), "unique keyword instances per paper/year.\n")
cat("Example simulated data:\n")
print(head(keywords_long_sim))

# Use this simulated data for the rest of the analysis
keywords_long <- keywords_long_sim

#### 4. Calculate Overall Keyword Frequencies
cat("\n--- 4. Calculating Overall Keyword Frequencies ---\n")
keyword_total_freq <- keywords_long %>%
  # Count unique keywords per paper first, then sum across papers
  distinct(paper_id, keyword) %>%
  count(keyword, name = "total_freq", sort = TRUE)

cat("Top overall keywords (based on number of papers):\n")
print(head(keyword_total_freq))
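One practical note: the simulation above draws years and keywords with sample(), so every run produces a different “dataset” (and slightly different plots). If you want reproducible output, set a seed before the simulation step; the seed value below is arbitrary:

# Optional: fix the random number generator so the simulated dataset
# (and all downstream plots) are reproducible across runs.
set.seed(123)  # arbitrary value; any fixed seed works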

Now that we have our “dataset”, let’s create a visual representation of the keyword data by building a simple word cloud:

### 5. Word Cloud
cat("\n--- 5. Generating Word Cloud ---\n")
if (exists("keyword_total_freq") && nrow(keyword_total_freq) > 0) {
  cat("   - Creating Word Cloud (Check RStudio Plots Pane)...\n")
  tryCatch({
    wordcloud(words = keyword_total_freq$keyword,
              freq = keyword_total_freq$total_freq,
              min.freq = 2,         # Show words appearing in at least 2 papers
              max.words = max_words_cloud,
              random.order = FALSE, # Plot most frequent words first
              rot.per = 0.30,       # Percentage of words rotated
              colors = brewer.pal(8, "Dark2")) # Color palette
    title(main = "", line = -1) # If you want, add a title near the top
  }, error = function(e) {
    cat("     > Error generating word cloud:", conditionMessage(e), "\n")
  })
} else {
  cat("   - Skipping word cloud (no keyword frequency data available).\n")
}

Now, let’s create the co-occurrence network and identify clusters:

### 6. Keyword Co-occurrence Network Analysis
cat("\n--- 6. Building Keyword Co-occurrence Network ---\n")
# (Uses keywords_long which contains year info, but pairs are per paper regardless of year)

# 6a. Generate keyword pairs within each paper
cat("   - Generating keyword pairs...\n")
keyword_pairs_unnested <- keywords_long %>%
  group_by(paper_id) %>%
  filter(n() >= 2) %>%
  summarise(pairs = list(combn(keyword, 2, simplify = FALSE)), .groups = 'drop') %>%
  unnest(pairs) %>%
  mutate(
    keyword1 = sapply(pairs, `[`, 1),
    keyword2 = sapply(pairs, `[`, 2)
  ) %>%
  select(keyword1, keyword2)

# 6b. Standardize & Count Pairs
cat("   - Counting co-occurrences (min =", min_cooccurrence, ")...\n")
keyword_pair_counts <- keyword_pairs_unnested %>%
  mutate(
    temp_kw1 = pmin(keyword1, keyword2),
    temp_kw2 = pmax(keyword1, keyword2)
  ) %>%
  select(keyword1 = temp_kw1, keyword2 = temp_kw2) %>%
  count(keyword1, keyword2, name = "weight") %>%
  filter(weight >= min_cooccurrence)

# 6c. Create and Filter Graph
graph_plot_obj <- NULL
communities <- NULL
if (nrow(keyword_pair_counts) > 0) {
  cat("   - Creating graph object...\n")
  graph_obj <- graph_from_data_frame(keyword_pair_counts, directed = FALSE)

  cat("   - Filtering graph (min degree =", min_keyword_freq_network, ") & detecting communities...\n")
  node_degrees <- degree(graph_obj, mode = "all")
  nodes_to_keep <- names(node_degrees[node_degrees >= min_keyword_freq_network])

  if (length(nodes_to_keep) > 0) {
    graph_filtered <- induced_subgraph(graph_obj, V(graph_obj)$name %in% nodes_to_keep)
    graph_filtered <- delete.vertices(graph_filtered, degree(graph_filtered) == 0)

    if (vcount(graph_filtered) > 0 && ecount(graph_filtered) > 0) {
      communities <- cluster_louvain(graph_filtered)
      num_communities <- length(unique(membership(communities)))
      cat("     > Detected", num_communities, "communities (Louvain).\n")

      node_data <- tibble(name = V(graph_filtered)$name) %>%
        left_join(keyword_total_freq, by = c("name" = "keyword")) %>%
        mutate(total_freq = ifelse(is.na(total_freq), 1, total_freq))

      V(graph_filtered)$size <- log1p(node_data$total_freq) * 2.5
      V(graph_filtered)$label <- V(graph_filtered)$name
      V(graph_filtered)$community <- membership(communities)
      V(graph_filtered)$total_freq <- node_data$total_freq

      # Assign colors based on community
      if (num_communities > 0) {
        num_colors_needed = length(unique(V(graph_filtered)$community))
        if (num_colors_needed > 8) {
          community_colors <- colorRampPalette(brewer.pal(8, "Set2"))(num_colors_needed)
        } else if (num_colors_needed > 2) {
          community_colors <- brewer.pal(max(3, num_colors_needed), "Set2")[1:num_colors_needed]
        } else if (num_colors_needed == 2) {
          community_colors <- brewer.pal(3, "Set2")[1:2]
        } else { # num_colors_needed == 1
          community_colors <- brewer.pal(3, "Set2")[1]
        }
        community_map <- setNames(community_colors, sort(unique(V(graph_filtered)$community)))
        V(graph_filtered)$color <- community_map[as.character(V(graph_filtered)$community)]
      } else {
        V(graph_filtered)$color <- "grey"
        community_map <- NULL
      }

      graph_plot_obj <- graph_filtered
      cat("     > Filtered graph ready:", vcount(graph_plot_obj), "nodes,", ecount(graph_plot_obj), "edges.\n")

    } else {
      cat("     > Warning: Graph empty after filtering.\n")
      graph_plot_obj <- NULL
      communities <- NULL
    }
  } else {
    cat("     > Warning: No nodes met minimum degree requirement.\n")
    graph_plot_obj <- NULL
    communities <- NULL
  }
} else {
  cat("   - Warning: No keyword pairs met the minimum co-occurrence threshold.\n")
  graph_plot_obj <- NULL
  communities <- NULL
}

# 6d. Visualize Network
if (!is.null(graph_plot_obj)) {
  cat("   - Plotting Co-occurrence Network (Check RStudio Plots Pane)...\n")
  tryCatch({
    par(mar = c(1, 1, 3, 1))
    plot(graph_plot_obj,
         layout = layout_nicely(graph_plot_obj),
         vertex.frame.color = "grey40", vertex.label.color = "black",
         vertex.label.cex = 0.7, vertex.label.dist = 0.4,
         edge.color = rgb(0.5, 0.5, 0.5, alpha = 0.4), edge.curved = 0.1,
         edge.width = scales::rescale(E(graph_plot_obj)$weight, to = c(0.3, 3.0)),
         main = "Keyword Co-occurrence Network (Simulated Data)",
         sub = paste("Nodes sized by log(# Papers), Min Degree >=", min_keyword_freq_network)
    )
    if (!is.null(community_map) && length(community_map) <= 12 && length(community_map) > 1) {
      legend("bottomleft", legend = paste("Cluster", names(community_map)),
             fill = community_map, bty = "n", cex = 0.7, title = "Communities")
    }
    par(mar = c(5.1, 4.1, 4.1, 2.1)) # Reset margins
  }, error = function(e) {
    cat("     > Error plotting network:", conditionMessage(e), "\n")
    par(mar = c(5.1, 4.1, 4.1, 2.1)) # Reset margins on error
  })
} else {
  cat("   - Skipping network plot (no valid graph).\n")
}

In this example, the clustering algorithm identified three distinct clusters: groups of words that frequently co-occur across the papers. Based on these clusters, we will create a thematic map in which each cluster is represented as a bubble, visually illustrating the relationships and centrality of research themes within the broader network of keywords. This map will help us better understand the underlying structure of the field and how different research topics are interconnected.

### 7. Thematic Map Analysis
cat("\n--- 7. Generating Thematic Map (Callon's Metrics) ---\n")

# Helper function
calculate_callon_metrics <- function(graph, communities_object, cluster_id) {
  if (is.null(graph) || !is.igraph(graph) || is.null(communities_object)) {
    return(list(centrality = 0, density = 0, n_keywords = 0))
  }
  cluster_nodes_indices <- which(membership(communities_object) == cluster_id)
  if (length(cluster_nodes_indices) == 0) {
    return(list(centrality = 0, density = 0, n_keywords = 0))
  }
  n_nodes_in_cluster <- length(cluster_nodes_indices)
  subgraph <- induced_subgraph(graph, cluster_nodes_indices)
  internal_weight_sum <- if (ecount(subgraph) > 0) sum(E(subgraph)$weight, na.rm = TRUE) else 0
  density <- internal_weight_sum
  external_weight_sum <- 0
  all_incident_edges_indices <- E(graph)[.inc(cluster_nodes_indices)]
  if (length(all_incident_edges_indices) > 0) {
    all_incident_edges <- E(graph)[all_incident_edges_indices]
    ends_matrix <- ends(graph, all_incident_edges, names = FALSE)
    mem <- membership(communities_object)
    is_external <- (mem[ends_matrix[, 1]] != cluster_id) | (mem[ends_matrix[, 2]] != cluster_id)
    external_edges <- all_incident_edges[is_external]
    if (length(external_edges) > 0) {
      external_weight_sum <- sum(E(graph)$weight[external_edges], na.rm = TRUE)
    }
  }
  centrality <- external_weight_sum
  return(list(centrality = centrality, density = density, n_keywords = n_nodes_in_cluster))
}

thematic_plot_obj <- NULL
if (!is.null(graph_plot_obj) && !is.null(communities) && length(unique(membership(communities))) > 0) {
  cat("   - Calculating Centrality and Density for communities...\n")

  community_ids <- unique(membership(communities))
  thematic_metrics <- lapply(community_ids, function(comm_id) {
    metrics <- calculate_callon_metrics(graph_plot_obj, communities, comm_id)
    nodes_in_comm_indices <- which(membership(communities) == comm_id)
    community_node_names <- V(graph_plot_obj)$name[nodes_in_comm_indices]
    community_node_freqs <- V(graph_plot_obj)$total_freq[nodes_in_comm_indices]

    if (length(community_node_names) > 0 && length(community_node_freqs) > 0 && !all(is.na(community_node_freqs))) {
      most_frequent_keyword <- community_node_names[which.max(community_node_freqs)]
      community_label <- str_trunc(most_frequent_keyword, 30)
    } else {
      community_label <- paste("Cluster", comm_id)
    }

    return(tibble(
      community_id = comm_id, label = community_label,
      Centrality = metrics$centrality, Density = metrics$density,
      n_keywords = metrics$n_keywords
    ))
  })

  thematic_data <- bind_rows(thematic_metrics) %>%
    mutate(Centrality = as.numeric(Centrality), Density = as.numeric(Density)) %>%
    filter(!is.na(community_id), n_keywords > 0, is.finite(Centrality), is.finite(Density))

  if (nrow(thematic_data) > 0) {
    cat("   - Creating Thematic Map plot object...\n")
    median_centrality <- median(thematic_data$Centrality, na.rm = TRUE)
    median_density <- median(thematic_data$Density, na.rm = TRUE)
    median_centrality <- ifelse(is.finite(median_centrality), median_centrality, 0)
    median_density <- ifelse(is.finite(median_density), median_density, 0)
    cat("     > Quadrant thresholds (Medians): Centrality=", round(median_centrality, 2), ", Density=", round(median_density, 2), "\n")

    thematic_plot_obj <- ggplot(thematic_data, aes(x = Centrality, y = Density)) +
      geom_hline(yintercept = median_density, linetype = "dashed", color = "grey50") +
      geom_vline(xintercept = median_centrality, linetype = "dashed", color = "grey50") +
      geom_point(aes(size = n_keywords), alpha = 0.7, color = "steelblue") +
      geom_text_repel(aes(label = label), size = 3.0, max.overlaps = 15,
                      box.padding = 0.4, point.padding = 0.6) +
      scale_size_continuous(range = c(4, 12), name = "# Keywords\nin Theme") +
      ggplot2::annotate("text", x = median_centrality, y = Inf, label = "Motor Themes", hjust = 0.5, vjust = 1.5, size = 3.5, color = "grey40", fontface = "bold") +
      ggplot2::annotate("text", x = -Inf, y = Inf, label = "Niche Themes", hjust = -0.1, vjust = 1.5, size = 3.5, color = "grey40", fontface = "bold") +
      ggplot2::annotate("text", x = -Inf, y = -Inf, label = "Emerging/\nDeclining", hjust = -0.1, vjust = -0.5, size = 3.5, color = "grey40", fontface = "bold") +
      ggplot2::annotate("text", x = median_centrality, y = -Inf, label = "Basic Themes", hjust = 0.5, vjust = -0.5, size = 3.5, color = "grey40", fontface = "bold") +
      labs(
        title = "Thematic Map (Callon's Centrality & Density)",
        subtitle = "Keyword Clusters from Co-occurrence Network (Simulated Data)",
        x = "Centrality (Links to other themes)",
        y = "Density (Internal theme links)"
      ) +
      theme_minimal(base_size = 12) +
      theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5),
        plot.margin = margin(20, 20, 20, 20)
      )

    # Visualize Thematic Map (Print to RStudio Plots Pane)
    cat("   - Plotting Thematic Map (Check RStudio Plots Pane)...\n")
    print(thematic_plot_obj)

  } else {
    cat("   - Warning: No valid thematic data to plot.\n")
  }
} else {
  cat("   - Skipping Thematic Map (network or communities missing).\n")
}

In this simulated example, “ecological networks” was positioned at the center of the plot, indicating its central role within the research landscape, while “ecophylogenetics” was classified as a motor theme, reflecting its importance and well-developed nature in the field. On the other hand, “evolutionary dynamics” appeared as a peripheral theme, suggesting that it is underdeveloped or marginal in the current body of research.

Now, let’s look at the evolution of the research field by first building a Sankey diagram linking years to the most frequently used keywords in our simulated example:

### 8. Yearly Keyword Trend Analysis
cat("\n--- 8. Generating Yearly Keyword Trend Sankey Diagram ---\n")
sankey_plot_obj_yearly <- NULL

# 8a. Count Keywords per Year
cat("   - Counting keyword frequency per year (using unique paper/keyword counts)...\n")
keyword_yearly_counts <- keywords_long %>%
  distinct(paper_id, year, keyword) %>% # Count keyword once per paper per year
  count(year, keyword, name = "yearly_count") %>%
  filter(yearly_count > 0)

# 8b. Identify Top Keywords Overall (Using paper frequency calculated earlier)
cat("   - Identifying top", num_top_keywords_yearly, "keywords overall for yearly Sankey...\n")
if (exists("keyword_total_freq") && inherits(keyword_total_freq, "data.frame") && nrow(keyword_total_freq) > 0) {
  top_keywords_df_yearly <- keyword_total_freq
} else {
  cat("     > Warning: 'keyword_total_freq' not found. Recalculating based on yearly counts (less accurate representation of 'overall').\n")
  top_keywords_df_yearly <- keyword_yearly_counts %>%
    group_by(keyword) %>%
    summarise(total_freq = sum(yearly_count)) %>%
    arrange(desc(total_freq))
}

top_keywords_yearly <- top_keywords_df_yearly %>%
  slice_head(n = num_top_keywords_yearly) %>%
  pull(keyword)

if (length(top_keywords_yearly) > 0) {
  cat("     > Top keywords for Yearly Sankey:", paste(top_keywords_yearly, collapse = ", "), "\n")

  # 8c. Prepare Data for Yearly Sankey
  sankey_data_yearly <- keyword_yearly_counts %>%
    filter(keyword %in% top_keywords_yearly)

  if (nrow(sankey_data_yearly) == 0) {
    cat("   - Warning: No yearly counts found for the top keywords. Skipping Yearly Sankey.\n")
  } else {
    cat("   - Preparing data for Yearly Sankey diagram...\n")
    year_nodes_chr_yr <- as.character(sort(unique(sankey_data_yearly$year)))
    keyword_nodes_sankey_yr <- unique(sankey_data_yearly$keyword)
    # Prefix years to distinguish from keywords if necessary
    all_node_names_yr <- c(paste0("Y:", year_nodes_chr_yr), keyword_nodes_sankey_yr)

    nodes_df_yr <- data.frame(name = all_node_names_yr, stringsAsFactors = FALSE) %>%
      mutate(id = row_number() - 1)

    links_df_yr <- sankey_data_yearly %>%
      mutate(
        source_name = paste0("Y:", as.character(year)),
        target_name = keyword
      ) %>%
      left_join(nodes_df_yr %>% select(name, source_id = id), by = c("source_name" = "name")) %>%
      left_join(nodes_df_yr %>% select(name, target_id = id), by = c("target_name" = "name")) %>%
      filter(!is.na(source_id), !is.na(target_id)) %>%
      transmute(
        source = source_id, target = target_id,
        value = yearly_count, group = target_name # Color links by target keyword
      ) %>%
      filter(value > 0)

    if (nrow(links_df_yr) == 0) {
      cat("   - Warning: Failed to create valid links for Yearly Sankey diagram. Skipping.\n")
    } else {
      # 8d. Generate Yearly Sankey Diagram Object
      cat("   - Creating Yearly Sankey plot object...\n")
      sankey_plot_obj_yearly <- sankeyNetwork(
        Links = links_df_yr, Nodes = nodes_df_yr, Source = "source",
        Target = "target", Value = "value", NodeID = "name",
        NodeGroup = NULL, LinkGroup = "group", units = "Papers",
        fontSize = 11, nodeWidth = 30, nodePadding = 15, sinksRight = TRUE,
        margin = list(top = 5, bottom = 5, left = 5, right = 5)
      )

      # 8e. Visualize Yearly Sankey (Print to RStudio Viewer Pane)
      if (!is.null(sankey_plot_obj_yearly)) {
        cat("   - Plotting Yearly Sankey Diagram (Check RStudio Viewer Pane)...\n")
        sankey_title_yr <- paste0("Flow of Top ", num_top_keywords_yearly, " Keywords Over Time (Yearly, Simulated)")
        sankey_plot_obj_yr_title <- htmlwidgets::prependContent(sankey_plot_obj_yearly,
                                                                htmltools::h3(sankey_title_yr, style = "text-align:center;"))
        print(sankey_plot_obj_yr_title)
      } else {
        cat("   - Warning: Yearly Sankey plot object could not be created.\n")
      }
    }
  }
} else {
  cat("   - Warning: No top keywords identified for yearly Sankey. Skipping.\n")
}
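A quick aside: sankeyNetwork() returns an htmlwidget, so besides viewing the diagram in the RStudio Viewer you can save it as a standalone interactive HTML file to share or embed. A minimal sketch, assuming the yearly Sankey object created above (the output file name is arbitrary):

# Optional: save the interactive yearly Sankey to a standalone HTML file
htmlwidgets::saveWidget(sankey_plot_obj_yr_title,
                        "yearly_keyword_sankey.html",  # arbitrary output file name
                        selfcontained = TRUE)          # embed all dependencies in one file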

Finally, let’s see how the importance of the most frequently used keywords has shifted over the decades, using another Sankey diagram that illustrates their flow and evolution across time. Keep in mind that although I set the number of top keywords per decade to 10, the Sankey diagram may display more than 10 keywords overall: it includes the top 10 from each decade, and if different keywords dominate in different decades, the combined set of unique keywords across the entire timeline grows accordingly.

### 9. Decade-Based Keyword Evolution
cat("\n--- 9. Generating Decade-Based Keyword Evolution Sankey ---\n")
sankey_plot_obj_decades <- NULL

# 9a. Aggregate by Decade and Calculate Frequencies
cat("   - Aggregating by Decade and Calculating Frequencies (using unique paper/keyword counts per decade)...\n")
keywords_decades <- keywords_long %>%
  filter(!is.na(year)) %>%
  mutate(decade = floor(year / 10) * 10) %>% # Calculate decade
  select(paper_id, decade, keyword) %>%
  distinct() # Count each keyword only once per paper within a decade

keyword_decade_counts <- keywords_decades %>%
  count(decade, keyword, name = "count") %>%
  arrange(decade, desc(count))

if (nrow(keyword_decade_counts) == 0) {
  cat("   - Warning: No keyword counts per decade found. Skipping Decade Sankey.\n")
} else {
  cat("   - Decade counts calculated.\n")

  # 9b. Identify Top Keywords for Each Decade
  cat("   - Identifying top", num_top_keywords_per_decade, "keywords per decade...\n")
  top_keywords_per_decade <- keyword_decade_counts %>%
    group_by(decade) %>%
    slice_max(order_by = count, n = num_top_keywords_per_decade, with_ties = FALSE) %>%
    ungroup()

  keywords_to_track <- unique(top_keywords_per_decade$keyword)

  if (length(keywords_to_track) == 0) {
    cat("   - Warning: No top keywords identified across decades to track. Skipping Decade Sankey.\n")
  } else {
    cat("     > Total unique keywords to track (top", num_top_keywords_per_decade, "in any decade):", length(keywords_to_track), "\n")

    # Filter the counts to only include these keywords
    sankey_base_data_dec <- keyword_decade_counts %>%
      filter(keyword %in% keywords_to_track)

    if (nrow(sankey_base_data_dec) == 0) {
      cat("   - Warning: No counts found for selected keywords to track. Skipping Decade Sankey.\n")
    } else {

      # 9c. Prepare Nodes and Links for Decade Sankey
      cat("   - Preparing nodes and links for Decade Sankey...\n")
      nodes_df_dec <- sankey_base_data_dec %>%
        mutate(name = paste0(decade, "s: ", keyword)) %>% # Node label: "2000s: coevolution"
        select(name) %>%
        distinct() %>%
        mutate(id = row_number() - 1)

      decade_list <- sort(unique(sankey_base_data_dec$decade))
      links_list_dec <- list()

      if (length(decade_list) > 1) {
        for (i in 1:(length(decade_list) - 1)) {
          current_decade <- decade_list[i]
          next_decade <- decade_list[i + 1]

          current_decade_data <- sankey_base_data_dec %>% filter(decade == current_decade)
          next_decade_data <- sankey_base_data_dec %>% filter(decade == next_decade)
          common_keywords <- intersect(current_decade_data$keyword, next_decade_data$keyword)

          if (length(common_keywords) > 0) {
            temp_links_dec <- tibble(keyword = common_keywords) %>%
              mutate(source_name = paste0(current_decade, "s: ", keyword)) %>%
              left_join(nodes_df_dec %>% select(name, source_id = id), by = c("source_name" = "name")) %>%
              mutate(target_name = paste0(next_decade, "s: ", keyword)) %>%
              left_join(nodes_df_dec %>% select(name, target_id = id), by = c("target_name" = "name")) %>%
              # Link value is the count in the *target* decade (flow into that decade)
              left_join(next_decade_data %>% select(keyword, value = count), by = "keyword") %>%
              filter(!is.na(source_id), !is.na(target_id), !is.na(value), value > 0) %>%
              select(source = source_id, target = target_id, value = value, group = keyword)

            if (nrow(temp_links_dec) > 0) {
              links_list_dec[[as.character(current_decade)]] <- temp_links_dec
            }
          }
        } # End for loop
      } # End if length(decade_list) > 1

      if (length(links_list_dec) > 0) {
        links_df_dec <- bind_rows(links_list_dec)
      } else {
        links_df_dec <- tibble(source = integer(), target = integer(), value = numeric(), group = character()) # Empty tibble
      }

      if (nrow(nodes_df_dec) == 0 || nrow(links_df_dec) == 0) {
        cat("   - Warning: Could not create valid nodes or links for Decade Sankey. Skipping.\n")
      } else {
        cat("     > Decade Nodes:", nrow(nodes_df_dec), "; Decade Links:", nrow(links_df_dec), "created.\n")

        # 9d. Generate Decade Sankey Diagram
        cat("   - Creating Decade Sankey plot object...\n")
        num_groups_dec <- length(unique(links_df_dec$group))
        if (num_groups_dec <= 12 && num_groups_dec > 0) {
          color_palette_dec <- RColorBrewer::brewer.pal(max(3, num_groups_dec), "Paired")[1:num_groups_dec]
          color_scale_js_dec <- paste0('d3.scaleOrdinal(["', paste(color_palette_dec, collapse = '","'), '"]);')
        } else if (num_groups_dec > 12) {
          color_scale_js_dec <- 'd3.scaleOrdinal(d3.schemeCategory10);'
          cat("     > Warning: >12 keyword groups for Decade Sankey, colors may repeat.\n")
        } else {
          color_scale_js_dec <- 'd3.scaleOrdinal(["#cccccc"]);' # Default grey
        }

        sankey_plot_obj_decades <- sankeyNetwork(
          Links = links_df_dec, Nodes = nodes_df_dec, Source = "source",
          Target = "target", Value = "value", NodeID = "name",
          LinkGroup = "group", NodeGroup = NULL, units = "Papers",
          fontSize = 10, nodeWidth = 35, nodePadding = 10, # Adjusted node width/padding
          sinksRight = FALSE, # Keep temporal flow L->R
          colourScale = JS(color_scale_js_dec),
          margin = list(top = 5, bottom = 5, left = 5, right = 5)
        )

        # 9e. Visualize Decade Sankey (Print to RStudio Viewer Pane)
        if (!is.null(sankey_plot_obj_decades)) {
          cat("   - Plotting Decade Sankey Diagram (Check RStudio Viewer Pane)...\n")
          sankey_title_dec <- paste0("Evolution of Top ", num_top_keywords_per_decade, " Keywords by Decade (Simulated)")
          sankey_plot_obj_dec_title <- htmlwidgets::prependContent(sankey_plot_obj_decades,
                                                                   htmltools::h3(sankey_title_dec, style = "text-align:center;"))
          print(sankey_plot_obj_dec_title)
        } else {
          cat("     > Warning: Decade Sankey plot object is NULL.\n")
        }
      }
    }
  }
}
