Iterating over multiple database tables with the tidyverse

Extracting data from multiple database tables with dplyr and purrr.

You want to run a query over multiple tables in your database/warehouse and then process and visualise the combined results of those queries. You need to be able to switch between servers, databases, schemas and tables, selecting different columns and applying different conditions in the WHERE clause. You don’t have permissions to write complex / dynamic SQL (for example, you can’t create stored procedures or query system tables).

Are you stumped?

Not if you know R!

Here’s one way to do it. You’ll need these packages:

library(dbplyr)
library(dplyr)
library(odbc)
library(lubridate)
library(DBI)
library(data.table)
library(purrr)
library(tibble)
library(glue)
library(rlang)
library(yaml)

Step 1 Set up a config file. This is a YAML file where you store all the parameters, and the possible values for each, in the order you want to loop through them. There is an RStudio package named config that makes this easy, and it allows you to have default and alternative setups. However, I just created the file manually in VSCode.

Why YAML?
It’s portable and readable, and keeping it outside of R makes it easier to access, edit and share.

You can set up whatever parameters you need - the connection (con) example may not be strictly necessary but it doesn’t do any harm. I need to be able to switch between our dev and live servers as required, so having these in the configuration gives me a reminder of where I’m currently pointing. The schema name is important to me as we do not have dbo access.

con:
  Server: servername
  Database: db
  Trusted_Connection: 'True'
  Port: 1433


tablename:
- table_1
- table_2
- table_3


schema:
- schema_name

date_vars:
- date_1
- date_2
- date_3

col_vars:
  table_1:
  - date_1
  - col1
  - col2
  - col3

  table_2:
  - date_2
  - col4
  - col5
  - col6

  table_3:
  - date_3
  - col7
  - col8
  - col9
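Once read into R, the YAML above becomes a nested list. Here’s a trimmed-down sketch of its shape, using yaml.load() on an inline string so it runs without the file (names here mirror the example config, not real tables):

```r
library(yaml)

cfg_text <- "
tablename:
- table_1
- table_2

col_vars:
  table_1:
  - date_1
  - col1
  table_2:
  - date_2
  - col4
"

configuration <- yaml::yaml.load(cfg_text)

configuration$tablename  # a character vector: "table_1" "table_2"
configuration$col_vars   # a named list of character vectors, one per table
```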

Step 2 Write a SQL query using glue_sql() to allow for interpolating these parameters - you can see how the parameters above are used in the function arguments. You’re on your own with this bit, but see my last post Dynamic SQL queries with R for some pointers.
I only need a simple query here, as I am going to produce a variable / value dataset to explore several tables.


run_query <- function(server_name = "server", 
                      db = "db", 
                      start_date = '2022-01-03',
                      end_date = Sys.Date(),
                      schema_name = "schema",
                      tablename  = "table",
                      date_var = "date_col1", 
                      col_vars = c(
                        "date_col1",
                        "col1",
                        "col2",
                        "col3")) {
  
 con <- make_connection(server_name, db) 
# internal wrapper function that makes a regular SQL Server connection,
# using the 'server_name' and 'db' values from the configuration file
  
  start_date <- as.Date(start_date)
  end_date <- as.Date(end_date)
  
  query <- glue_sql(.con = con, 
           "SELECT {`col_vars`*}, 
           '{`schema_name`}.{`tablename`}' AS 'tablename'
            FROM {`schema_name`}.{`tablename`} with(nolock)
            WHERE CAST({`schema_name`}.{`tablename`}.{`date_var`} AS DATE) BETWEEN {start_date} AND {end_date}")
  
  res <- DBI::dbGetQuery(con, query)
  
  dbDisconnect(con)

  return(res)  
}
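As a quick sanity check, you can preview what glue_sql() generates without a live server by interpolating against DBI’s built-in ANSI() connection (a sketch with made-up names; the quoting will be ANSI double quotes rather than SQL Server brackets):

```r
library(DBI)
library(glue)

con <- DBI::ANSI()  # dummy connection object that applies standard ANSI quoting

col_vars    <- c("date_col1", "col1", "col2")
schema_name <- "schema_name"
tablename   <- "table_1"

query <- glue_sql(.con = con,
                  "SELECT {`col_vars`*} FROM {`schema_name`}.{`tablename`}")
query
# <SQL> SELECT "date_col1", "col1", "col2" FROM "schema_name"."table_1"
```

The backticks quote the values as identifiers, and the * collapses the vector into a comma-separated list.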

Note - I have a custom function to make my connection, but you can copy/paste and edit the code below to make yours, assuming you’re using SQL Server and inheriting credentials from your network login:

con <- DBI::dbConnect(odbc::odbc(), 
                      Driver = "SQL Server", 
                      Server = "SERVER", 
                      Database = "DB_NAME", 
                      Trusted_Connection = "True", 
                      Port = 1433)

Step 3 Write any other functions you need to work with your data. The one below adds a column holding the date of the start of the week (based on a Monday) and the ISO week, then removes the original date column, as it is no longer required. Because the column name is passed in as a string, I need rlang::sym() to turn it into a symbol before unquoting it with !!.

convert_dates_string <- function(.df, datecol) {
  .df %>%
    mutate(!!datecol := lubridate::date(!!rlang::sym(datecol))) %>%
    mutate(week_starting = lubridate::floor_date(!!rlang::sym(datecol), "week", week_start = 1)) %>%
    mutate(isoweek = lubridate::isoweek(week_starting),
           year = lubridate::year(week_starting),
           !!rlang::sym(datecol) := NULL) # drop the original date column
}
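To see what this does, here’s a toy run (made-up data; assumes the convert_dates_string() function above has been defined):

```r
library(dplyr)
library(lubridate)
library(rlang)
library(tibble)

df <- tibble(
  date_col1 = as.Date(c("2022-01-03", "2022-01-09")),
  col1 = c(10, 20)
)

convert_dates_string(df, "date_col1")
# date_col1 is gone; each row now has week_starting, isoweek and year,
# with both example dates falling in the week starting Monday 2022-01-03
```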

Step 4 Read in the config file and create a named list.
Each element refers to the arguments of my original function, which, in turn, refer to the configuration parameters.

configuration <- yaml::read_yaml("config.yml")

datalist <- list(server_name = configuration$con$Server, 
               db = configuration$con$Database, 
               schema_name = configuration$schema,  
               tablename  = configuration$tablename, 
               date_var = configuration$date_vars, 
               col_vars = configuration$col_vars)

col_vars becomes a list of three character vectors - one for each table. That sounds messy, but it’s OK - purrr can handle it.

Step 5. Pass the list to purrr’s pmap() function.
We’re using pmap() because we have more than two varying arguments (map() and map2() only handle one or two). Because our list elements are named to match the function arguments, it’s really easy.

output <- pmap(datalist, run_query) 
# no need to specify all the function elements, because we've done our setup work. 

That is all you need.
When you pass in a single value, e.g. the server name, it gets recycled for each iteration.
For example, I don’t need to list it 3 times in my config file, even though I am iterating over 3 tables.
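A minimal illustration of that recycling, away from the database (names are made up):

```r
library(purrr)

res <- pmap_chr(
  list(server = "LIVE",                               # length 1: recycled
       tbl    = c("table_1", "table_2", "table_3")),  # length 3: drives the loop
  function(server, tbl) paste(server, tbl, sep = "/")
)
res
# "LIVE/table_1" "LIVE/table_2" "LIVE/table_3"
```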

You might also want to wrap your query function using purrr’s safely(), so the iteration keeps going even if an individual query fails (e.g. a column or table has been removed or is unavailable):

safe_run_query <- purrr::safely(run_query)
output <- pmap(datalist, safe_run_query) 
# returns either the data or the error for each table, so one failure won't stop the loop
# however, if you do this, you will need to pluck() the 'result' element back out,
# because safely() returns a list holding both a 'result' and an 'error'
# (hopefully the error will be NULL)

# output <- map_dfr(output, pluck, "result")
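Here is the same pluck-the-result pattern on a toy function, for anyone who hasn’t used safely() before:

```r
library(purrr)

safe_log <- safely(log)
out <- map(list(10, "oops", 100), safe_log)

# each element of 'out' is list(result = ..., error = ...)
results <- map(out, pluck, "result")
compact(results)  # drops the NULL left behind by the failed call
```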

Success! You will now have a list of dataframes, one for each table you specified.
My custom connection function informs me each time it hits up a new table, and it was a great joy to see the list growing.

Alternatively, this also works without creating the named list up front. Note that the arguments must be matched by name here, because run_query() has start_date and end_date sitting between db and schema_name:

pmap(list(configuration$con$Server, configuration$con$Database, 
          configuration$schema, configuration$tablename, 
          configuration$date_vars, configuration$col_vars), 
     ~ run_query(server_name = ..1, db = ..2, schema_name = ..3, 
                 tablename = ..4, date_var = ..5, col_vars = ..6))

You’ve done the hard part now.

At this point, I have some further processing to do.
I pass the output to my second function, in order to create new weekly date columns:

output <- map2(output, configuration$date_vars, convert_dates_string)

Step 6. Convert the data from a list to a data.frame / data.table. You can use data.table’s rbindlist() or purrr’s map_dfr().

DT <- rbindlist(output, use.names = TRUE, fill = TRUE, idcol = FALSE)
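Those two arguments matter: use.names matches columns by name rather than by position, and fill pads any missing columns with NA. A toy example:

```r
library(data.table)

toy <- list(
  data.frame(a = 1, b = "x"),
  data.frame(b = "y", a = 2)  # same columns, different order
)

rbindlist(toy, use.names = TRUE, fill = TRUE)
# combines cleanly into a two-row data.table with columns a and b
```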

You can now stop and relax.
I, however, have to keep on wrangling to pivot my data, ready for my simple Shiny app:

DT <- DT %>% 
  gather('column', 'value', -c(week_starting, isoweek, year, tablename))
  
setDT(DT) # make it a data.table

# remove the extra quote marks that the identifier quoting left in the tablename string
DT[, tablename := gsub('"', "", tablename)][]
That’s it! Now do what you need to with your new dataframe.
You now know how to iterate over multiple tables from the comfort of R!

