tabulapdf: Extract Tables from PDF Documents

R-bloggers 2024-04-30

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers.]

Motivation

I had to extract several tables from PDF files and analyse them in R. Updating tabulizer (now retired from CRAN) to work with a Java version newer than the deprecated Java 8 turned out to be worth the effort.

tabulapdf is a reworked version of tabulizer that works with OpenJDK 11 and newer. I wanted to share it here and show how to use it to extract tables from PDF files.

About

tabulapdf provides R bindings to the Tabula Java library, which can be used to computationally extract tables from PDF documents. The main function, extract_tables(), mimics the command-line behavior of Tabula by extracting all tables from a PDF file and, by default, returning them as a list of tibbles in R.

# set Java memory limit to 600 MB (optional); set this before loading
# the package so that the JVM picks it up at startup
options(java.parameters = "-Xmx600m")

library("tabulapdf")

f <- system.file("examples", "data.pdf", package = "tabulapdf")

# extract table from first page of example PDF
tab <- extract_tables(f, pages = 1)

tab[[1]]
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.21  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

The pages argument allows you to select which pages to attempt to extract tables from. By default, Tabula (and thus tabulapdf) checks every page for tables using a detection algorithm and returns all of them. pages can be an integer vector of any length; pages are indexed from 1.
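To illustrate the paragraph above, a vector passed to pages restricts extraction to those pages only. This is a minimal sketch that assumes tabulapdf and a working Java installation are available; it is skipped otherwise, and the name some_tabs is only for illustration.

```r
# Runs only if tabulapdf (and the Java runtime it needs) can be loaded.
if (requireNamespace("tabulapdf", quietly = TRUE)) {
  f <- system.file("examples", "data.pdf", package = "tabulapdf")

  # Only pages 1 and 3 are searched; page 2 is skipped entirely.
  some_tabs <- tabulapdf::extract_tables(f, pages = c(1, 3))

  # The result is a list of the tables detected on the requested pages.
  print(length(some_tabs))
}
```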

It is possible to specify a remote file, which will be copied to R’s temporary directory before processing:

f2 <- "https://raw.githubusercontent.com/ropensci/tabulapdf/main/inst/examples/data.pdf"
extract_tables(f2, pages = 2)
[[1]]
# A tibble: 1 × 5
  Sepal.Length                      Sepal.Width Petal.Length Petal.Width Species
  <chr>                             <chr>       <chr>        <chr>       <chr>  
1 "5.10\r4.90\r4.70\r4.60\r5.00\r5… "3.50\r3.0… "1.40\r1.40… "0.20\r0.2… "setos…

Changing the Method of Extraction

The default method used by extract_tables() mimics the behaviour of Tabula. For each page, the algorithm decides whether the page contains one consistent table and, if so, extracts it with the spreadsheet-tailored algorithm, method = "lattice". Correct recognition depends on whether the page contains a table grid. If it doesn’t, and the table is a matrix of cells with values but without borders, the algorithm might not recognise it. The same happens when multiple tables with different numbers of columns are present on the same page. In those cases another, more general algorithm, method = "stream", is used, which relies on the distances between text characters on the page.

# Extract tables by deciding for each page individually
extract_tables(f2, method = "decide")
[[1]]
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.21  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

[[2]]
# A tibble: 1 × 5
  Sepal.Length                      Sepal.Width Petal.Length Petal.Width Species
  <chr>                             <chr>       <chr>        <chr>       <chr>  
1 "5.10\r4.90\r4.70\r4.60\r5.00\r5… "3.50\r3.0… "1.40\r1.40… "0.20\r0.2… "setos…

[[3]]
# A tibble: 15 × 3
     len supp   dose
   <dbl> <chr> <dbl>
 1   4.2 VC      0.5
 2  11.5 VC      0.5
 3   7.3 VC      0.5
 4   5.8 VC      0.5
 5   6.4 VC      0.5
 6  10   VC      0.5
 7  11.2 VC      0.5
 8  11.2 VC      0.5
 9   5.2 VC      0.5
10   7   VC      0.5
11  16.5 VC      1  
12  16.5 VC      1  
13  15.2 VC      1  
14  17.3 VC      1  
15  22.5 VC      1  

You can also specify the preferred algorithm explicitly, which may work better in more difficult cases.

# Extract tables by using "lattice" method
extract_tables(f2, pages = 2, method = "lattice")
[[1]]
# A tibble: 1 × 5
  Sepal.Length                      Sepal.Width Petal.Length Petal.Width Species
  <chr>                             <chr>       <chr>        <chr>       <chr>  
1 "5.10\r4.90\r4.70\r4.60\r5.00\r5… "3.50\r3.0… "1.40\r1.40… "0.20\r0.2… "setos…

# Extract tables by using "stream" method
extract_tables(f2, pages = 2, method = "stream")
[[1]]
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 

[[2]]
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
         <dbl>       <dbl>        <dbl>       <dbl> <chr>    
1          6.7         3.3          5.7         2.5 virginica
2          6.7         3            5.2         2.3 virginica
3          6.3         2.5          5           1.9 virginica
4          6.5         3            5.2         2   virginica
5          6.2         3.4          5.4         2.3 virginica
6          5.9         3            5.1         1.8 virginica
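The "lattice" result above packs each column into a single cell, with the values separated by carriage returns ("\r"). If switching to "stream" is not an option, such cells can still be split in base R. This is a minimal sketch, assuming every cell holds the same number of values; collapsed is a hypothetical, shortened stand-in for a one-row result, and split_cells() is an illustrative helper, not a tabulapdf function.

```r
# Hypothetical stand-in for a one-row "lattice" result whose cells hold
# carriage-return-separated values (shortened to three values per cell).
collapsed <- data.frame(
  Sepal.Length = "5.10\r4.90\r4.70",
  Sepal.Width  = "3.50\r3.00\r3.20",
  Species      = "setosa\rsetosa\rsetosa",
  stringsAsFactors = FALSE
)

# split_cells() is an illustrative helper, not part of tabulapdf.
# It splits every cell on "\r" and lets type.convert() guess column types.
split_cells <- function(x) {
  parts <- lapply(x, function(cell) strsplit(cell, "\r", fixed = TRUE)[[1]])
  # All columns must yield the same number of values to form a rectangle.
  stopifnot(length(unique(lengths(parts))) == 1)
  out <- as.data.frame(parts, stringsAsFactors = FALSE)
  type.convert(out, as.is = TRUE)
}

tidy <- split_cells(collapsed)
str(tidy)
```

The length check makes the helper fail loudly on ragged input instead of silently recycling values.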

Modifying the Return Value

By default, extract_tables() returns a list of tibbles. Many tables are malformed or irregular and thus cannot easily be coerced to an R data.frame, so the return type can be changed by specifying the output argument:

# attempt to coerce tables to data.frames (tibbles are also the default)
extract_tables(f, pages = 2, output = "tibble")
[[1]]
# A tibble: 1 × 5
  Sepal.Length                      Sepal.Width Petal.Length Petal.Width Species
  <chr>                             <chr>       <chr>        <chr>       <chr>  
1 "5.10\r4.90\r4.70\r4.60\r5.00\r5… "3.50\r3.0… "1.40\r1.40… "0.20\r0.2… "setos…

Tabula itself implements three “writer” methods that write extracted tables to disk as CSV, TSV, or JSON files. These can be specified by output = "csv", output = "tsv", and output = "json", respectively. For CSV and TSV, one file is written to disk for each table, and the R session’s temporary directory tempdir() is used by default (alternatively, the directory can be specified through the outdir argument). For JSON, one file is written containing information about all tables. For these methods, extract_tables() returns a path to the directory containing the output files.

# extract tables to CSVs
extract_tables(f, output = "csv")
[1] "/tmp/Rtmp9LdMjd"
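Because the CSV writer returns only a directory path, the files still have to be read back into R. The sketch below does this in base R, assuming the directory holds one CSV file per table. Here out_dir and its two files are created by hand to stand in for the result of extract_tables(f, output = "csv"), and the file names are made up.

```r
# Stand-in for the directory returned by extract_tables(f, output = "csv").
# The directory and file names below are made up for this sketch.
out_dir <- file.path(tempdir(), "tabula-csv-demo")
dir.create(out_dir, showWarnings = FALSE)
write.csv(head(mtcars, 3), file.path(out_dir, "demo-1.csv"), row.names = FALSE)
write.csv(head(iris, 3), file.path(out_dir, "demo-2.csv"), row.names = FALSE)

# Collect every CSV in the directory and read each one into a data frame.
csv_files <- sort(list.files(out_dir, pattern = "\\.csv$", full.names = TRUE))
tables <- lapply(csv_files, read.csv)
names(tables) <- basename(csv_files)

# One list element per file, mirroring the default list-of-tables result.
vapply(tables, nrow, integer(1))
```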

If none of the standard methods works well, you can specify output = "asis" to return an rJava “jobjRef” object, which is a pointer to a Java ArrayList of Tabula Table objects. Working with that object might be quite awkward as it requires knowledge of Java and Tabula’s internals, but might be useful to advanced users for debugging purposes.

Extracting Areas

By default, tabulapdf uses Tabula’s table detection algorithm to automatically identify tables within each page of a PDF. This automatic detection can be toggled off by setting guess = FALSE and specifying an “area” within each PDF page to extract the table from. Here is a comparison of the default settings, versus extracting from two alternative areas within a page:

# this does not return the desired tables on page 2
extract_tables(f, pages = 2, guess = TRUE)
[[1]]
# A tibble: 1 × 5
  Sepal.Length                      Sepal.Width Petal.Length Petal.Width Species
  <chr>                             <chr>       <chr>        <chr>       <chr>  
1 "5.10\r4.90\r4.70\r4.60\r5.00\r5… "3.50\r3.0… "1.40\r1.40… "0.20\r0.2… "setos…

The area argument should be a list either of length 1 (to use the same area for each specified page) or of length equal to the number of pages specified. This also means that you can extract multiple areas from one page by specifying the page twice and indicating the two areas separately:

# this returns the desired tables on page 2
extract_tables(
  f,
  pages = c(2, 2),
  area = list(c(58, 125, 182, 488), c(387, 125, 513, 492)),
  guess = FALSE
)
[[1]]
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 

[[2]]
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
         <dbl>       <dbl>        <dbl>       <dbl> <chr>    
1          6.7         3.3          5.7         2.5 virginica
2          6.7         3            5.2         2.3 virginica
3          6.3         2.5          5           1.9 virginica
4          6.5         3            5.2         2   virginica
5          6.2         3.4          5.4         2.3 virginica
6          5.9         3            5.1         1.8 virginica
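The pairing rule used above (one pages entry per area) can be made explicit with a small helper. This is only a sketch: areas_for_page() is a hypothetical function, not part of tabulapdf, and it assumes that each area is given in the order c(top, left, bottom, right), in points.

```r
# areas_for_page() is a hypothetical helper, not a tabulapdf function.
# It repeats the page number once per area and checks each rectangle,
# assuming the order c(top, left, bottom, right) in points.
areas_for_page <- function(page, areas) {
  for (a in areas) {
    stopifnot(is.numeric(a), length(a) == 4)
    stopifnot(a[1] < a[3], a[2] < a[4])  # top < bottom, left < right
  }
  list(pages = rep(page, length(areas)), area = areas)
}

spec <- areas_for_page(
  page = 2,
  areas = list(c(58, 125, 182, 488), c(387, 125, 513, 492))
)
str(spec)

# The result can then be passed on, for example:
# extract_tables(f, pages = spec$pages, area = spec$area, guess = FALSE)
```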

Interactive Table Extraction

In addition to the programmatic extraction offered by extract_tables(), it is also possible to work interactively with PDFs via the extract_areas() function. This function triggers a process by which each (specified) page of a PDF is converted to a PNG image file and then loaded as an R graphic. From there, you can use your mouse to specify upper-left and lower-right bounds of an area on each page. Pages are cycled through automatically and, after selecting areas for each page, those areas are extracted auto-magically (and the return value is the same as for extract_tables()).

locate_areas() handles the area identification process without performing the extraction, which may be useful as a debugger, or simply to define areas to be used in a programmatic extraction.

# same as the previous example
# use locate_areas(f, pages = 2) to select the area in the web app
# don't forget to click "done" when you're finished selecting areas

# first_table <- locate_areas(f, pages = 2)
# second_table <- locate_areas(f, pages = 2)
first_table <- c(58.15032, 125.26869, 182.02355, 488.12966)
second_table <- c(387.7791, 125.2687, 513.7519, 492.3246)

extract_tables(f, pages = 2, area = list(first_table), guess = FALSE)
[[1]]
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 

extract_tables(f, pages = 2, area = list(second_table), guess = FALSE)
[[1]]
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
         <dbl>       <dbl>        <dbl>       <dbl> <chr>    
1          6.7         3.3          5.7         2.5 virginica
2          6.7         3            5.2         2.3 virginica
3          6.3         2.5          5           1.9 virginica
4          6.5         3            5.2         2   virginica
5          6.2         3.4          5.4         2.3 virginica
6          5.9         3            5.1         1.8 virginica

# alternatively, use extract_areas(f, pages = 2) to do the same in fewer steps

Miscellaneous Functionality

Tabula is built on top of the Java PDFBox library, which provides low-level functionality for working with PDFs. A few of these tools are exposed through tabulapdf, as they might be useful for debugging or generally for working with PDFs. These functions include:

  • extract_text() converts the text of an entire file or specified pages into an R character vector.
  • split_pdf() and merge_pdfs() split and merge PDF documents, respectively.
  • extract_metadata() extracts PDF metadata as a list.
  • get_n_pages() determines the number of pages in a document.
  • get_page_dims() determines the width and height of each page in pt (the unit used by area and columns arguments).
  • make_thumbnails() converts specified pages of a PDF file to image files.
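
A few of the helpers listed above can be tried on the bundled example file. This is a minimal sketch that assumes tabulapdf and a working Java installation are available (it is skipped otherwise); the variable names are only for illustration.

```r
# Runs only if tabulapdf (and the Java runtime it needs) can be loaded.
if (requireNamespace("tabulapdf", quietly = TRUE)) {
  f <- system.file("examples", "data.pdf", package = "tabulapdf")

  n_pages <- tabulapdf::get_n_pages(f)          # number of pages in the file
  dims <- tabulapdf::get_page_dims(f)           # page sizes in pt
  txt <- tabulapdf::extract_text(f, pages = 1)  # text of page 1

  print(n_pages)
  str(dims)
  cat(substr(txt, 1, 80), "\n")
}
```

The page dimensions are handy when choosing values for the area argument, since both use the same unit.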