An unexpected acquaintance? Data Validation in R

Do Data Scientists do Data Engineering?

Data Engineering is a common task for any Data Scientist. Sure, we rarely do the “real” Data Engineering, e.g. loading data from a business system into a relational database. Rather, we work with data that is already in a database, sometimes (if we’re lucky) put there by a Data Engineer. We usually transform this data into an analytical format from which we can derive aggregations, visualizations and statistical models. At the start of a Data Science project these jobs may be simple and of little concern, but as your Data Science products get used more and more, the importance of the data they rely on increases.

Therefore it might be a good idea for any Data Scientist to invest some time in data validation. There are many data validation frameworks in different programming languages, but Data Scientists usually feel most comfortable with the tools they already know, such as R and Python. R is a popular functional programming language for data analysis. However, even though data analysis is what R is famous for, the language can do a lot more than statistics. In this post we’ll take a look at the R package pointblank, which makes data validation and data quality controls easy to do in R. pointblank also produces pretty tables for your data validation steps, and I love pretty things. But there’s more to the package than being easy and pretty: it’s a serious tool for any Data Scientist who wants to get control over their data.

Using Snowflake in R

Data can come in many formats, but in this blog post I’ll focus on databases, in particular the increasingly popular Snowflake.

Once you have created an ODBC DSN for your database, it’s usually straightforward to connect to it from R using the packages odbc and DBI.

library(DBI)
library(odbc)

con <- dbConnect(odbc(), "example_database")

The database we use here is populated with hundreds of millions of rows of district heating data and is part of a research project that Solita Sweden is assisting. This means that the data cannot be processed in memory in R. Thankfully, dplyr, arguably the most popular data manipulation package in R, has a database backend (dbplyr) that translates R code into SQL. This means that we don’t have to juggle more than one data manipulation framework at a time. With that said, most Data Scientists should know some SQL, because sometimes the translations don’t work as expected. But in 95% of cases you can write R code instead of SQL, which usually saves me a lot of time.

For instance, if I want to count the observations in a table and group the result by a column, I can use dplyr code directly on the database table:

library(tidyverse) ## dplyr is included in the package tidyverse
library(dbplyr)

metering_tbl <- tbl(con, in_schema("PUBLIC", "METERING_READINGS_TBL"))

## Use dplyr code on the database table
property_tbl <- metering_tbl %>% 
  group_by(PROPERTY, UNIT_OF_MEASURE) %>% 
  summarise(
    n = n(), # count all records
    mean_value = mean(VALUE),
    sd_value = sd(VALUE)
  )

property_tbl

## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.

## # Source:   lazy query [?? x 5]
## # Database: Snowflake 4.40.1[filipw@Snowflake/xxx]
## # Groups:   PROPERTY
##   PROPERTY           UNIT_OF_MEASURE         n mean_value sd_value
##   <chr>              <chr>               <dbl>      <dbl>    <dbl>
## 1 return_temperature c               219273351       37.5   3118. 
## 2 flow_temperature   c               223159867       66.2     52.4
## 3 energy             MWh             116629833     6858.  768893. 
## 4 flow               m3              132111599     1837.  961871.
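
The warning is dplyr reminding us that SQL aggregate functions always ignore NULL values. To make that explicit (and silence the warning) we can add na.rm = TRUE ourselves; a minimal sketch of the same aggregation:

## Make the NA removal explicit to silence the warning
property_tbl <- metering_tbl %>% 
  group_by(PROPERTY, UNIT_OF_MEASURE) %>% 
  summarise(
    n = n(),
    mean_value = mean(VALUE, na.rm = TRUE),
    sd_value = sd(VALUE, na.rm = TRUE)
  )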

Behind the scenes, dplyr sends SQL to Snowflake, which executes the query and returns the result. We can inspect the generated SQL:

show_query(property_tbl)

## <SQL>
## SELECT "PROPERTY", "UNIT_OF_MEASURE", COUNT(*) AS "n", AVG("VALUE") AS "mean_value", STDDEV_SAMP("VALUE") AS "sd_value"
## FROM PUBLIC.METERING_READINGS_TBL
## GROUP BY "PROPERTY", "UNIT_OF_MEASURE"

Once all these millions of rows have been aggregated, we can import the result into R and do all the things R is great at: visualization, statistical modeling and reporting.
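
For example, pulling the aggregated result into an R data frame is just a matter of calling collect() on the lazy query; a minimal sketch:

## Run the query in Snowflake and bring the aggregated result into R
property_df <- property_tbl %>% 
  collect()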

But this post is not about that. Instead, we’ll use R for something it is not famous for but is really good at: data validation.

pointblank

The package pointblank is one of many data validation packages in R; check out assertr, validate and dataMaid for other examples.

However, I’ve been extremely impressed by the work of pointblank’s creator, Rich Iannone. The documentation of pointblank is really something special. Furthermore, the time spent on tests gives us an indication of the ambition behind the package: there are more than 3,000 unit tests.

Anyway, let’s get down to it. You can use pointblank on data in memory (perhaps from a csv-file or a json-file), but since pointblank leverages the dplyr SQL backend you can also use it directly on database tables.
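
To give a flavour before we dive into the database table, here is a minimal sketch of what a validation of a small in-memory data frame can look like (the data frame and its column names are made up for illustration; the functions used are explained in the next section):

library(pointblank)

## A small in-memory table, e.g. read from a csv-file
local_df <- data.frame(
  id    = c("a", "b", "c"),
  value = c(1.2, 3.4, 5.6)
)

local_agent <- create_agent(tbl = local_df, tbl_name = "local_df") %>% 
  col_is_character(vars(id)) %>% 
  col_is_numeric(vars(value)) %>% 
  interrogate()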

So let’s turn to the database table we want to validate. This table holds hourly observations of district heating consumption, imported from a district heating network on a daily basis. The data arrives as JSON, and I have an R script that transforms it into a table that can be used for statistical analysis. Since the original format is JSON, there is a risk that the format might change, so data validation of this table seems like a natural step.

tbl(con, in_schema("PUBLIC", "METERING_READINGS_TBL")) %>% 
  select(-SOURCE_INSTANCE_NAME)

## # Source:   lazy query [?? x 6]
## # Database: Snowflake 4.40.1[filipw@Snowflake/xxx]
##    IMPORT_TIMESTAMP    TIMESTAMP           METERING_POINT_… PROPERTY
##    <dttm>              <dttm>              <chr>            <chr>   
##  1 2020-10-31 02:43:48 2020-07-04 12:00:00 cf767121-4f64-4… flow_te…
##  2 2020-10-31 02:43:48 2020-07-04 12:00:00 cf767121-4f64-4… return_…
##  3 2020-10-31 02:43:48 2020-07-04 13:00:00 cf767121-4f64-4… flow_te…
##  4 2020-10-31 02:43:48 2020-07-04 13:00:00 cf767121-4f64-4… return_…
##  5 2020-10-31 02:43:48 2020-07-04 14:00:00 cf767121-4f64-4… flow_te…
##  6 2020-10-31 02:43:48 2020-07-04 14:00:00 cf767121-4f64-4… return_…
##  7 2020-10-31 02:43:48 2020-07-04 15:00:00 cf767121-4f64-4… flow_te…
##  8 2020-10-31 02:43:48 2020-07-04 15:00:00 cf767121-4f64-4… return_…
##  9 2020-10-31 02:43:48 2020-07-04 16:00:00 cf767121-4f64-4… flow_te…
## 10 2020-10-31 02:43:48 2020-07-04 16:00:00 cf767121-4f64-4… return_…
## # … with more rows, and 2 more variables: UNIT_OF_MEASURE <chr>, VALUE <dbl>

pointblank usage

In pointblank we first create an agent that defines which table we want to validate. Then we add validation functions (these also come as test and expectation variants), e.g. col_is_character(). Lastly, we call interrogate() to carry out the actual validation.
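
As an aside, the test variants return a simple TRUE/FALSE, which can be handy in scripts; a minimal sketch using the metering table from earlier (output not shown here):

## test_*() variants return TRUE/FALSE instead of building a report
metering_tbl %>% 
  test_col_is_character(vars(PROPERTY))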

So for this table we might be interested in checking the format of the columns:

library(pointblank)
agent <- 
  create_agent(
      read_fn = ~tbl(con, in_schema("PUBLIC", "METERING_READINGS_TBL")),
      tbl_name = "METERING_READINGS_TBL"
    ) %>%
  col_is_character(vars(METERING_POINT_ID, PROPERTY, UNIT_OF_MEASURE)) %>% 
  col_is_numeric(vars(VALUE)) %>%
  col_is_posix(vars(IMPORT_TIMESTAMP, TIMESTAMP)) %>% 
  interrogate()

When the “interrogation” is complete, we can print the agent and get a pretty table telling us what has happened.

agent
Pointblank Validation [2020-12-04|16:02:03]  METERING_READINGS_TBL

STEP  VALIDATION            COLUMNS            UNITS  PASS      FAIL
1     col_is_character()    METERING_POINT_ID  1      1 (1.00)  0 (0.00)
2     col_is_character()    PROPERTY           1      1 (1.00)  0 (0.00)
3     col_is_character()    UNIT_OF_MEASURE    1      1 (1.00)  0 (0.00)
4     col_is_numeric()      VALUE              1      1 (1.00)  0 (0.00)
5     col_is_posix()        IMPORT_TIMESTAMP   1      1 (1.00)  0 (0.00)
6     col_is_posix()        TIMESTAMP          1      1 (1.00)  0 (0.00)

2020-12-04 16:02:10 CET | 1.9 s | 2020-12-04 16:02:12 CET

Defining actions

Another useful feature is defining actions with action_levels() to decide what should happen if a validation function fails. For instance, we might want to check that all the dates in our table are within a certain range, and if they’re not, either raise a warning or, if the number of failing rows exceeds a certain limit, stop.

Below I’ve added a validation function to make sure that all values in TIMESTAMP are between two dates:

agent_warn <- 
  create_agent(
      read_fn = ~tbl(con, in_schema("PUBLIC", "METERING_READINGS_TBL")),
      tbl_name = "METERING_READINGS_TBL",
      actions = action_levels(warn_at = 1000, stop_at = 200000)
    ) %>%
  col_is_character(vars(METERING_POINT_ID, PROPERTY, UNIT_OF_MEASURE)) %>% 
  col_is_numeric(vars(VALUE)) %>%
  col_is_posix(vars(IMPORT_TIMESTAMP, TIMESTAMP)) %>% 
  col_vals_between(vars(TIMESTAMP), "2016-01-01", "2020-10-07 23:00:00") %>% 
  interrogate(extract_failed = FALSE)

agent_warn
Pointblank Validation [2020-12-04|16:02:14]  METERING_READINGS_TBL
WARN 1,000  STOP 200,000

STEP  VALIDATION            COLUMNS            VALUES                               UNITS  PASS         FAIL
1     col_is_character()    METERING_POINT_ID                                       1      1 (1.00)     0 (0.00)
2     col_is_character()    PROPERTY                                                1      1 (1.00)     0 (0.00)
3     col_is_character()    UNIT_OF_MEASURE                                         1      1 (1.00)     0 (0.00)
4     col_is_numeric()      VALUE                                                   1      1 (1.00)     0 (0.00)
5     col_is_posix()        IMPORT_TIMESTAMP                                        1      1 (1.00)     0 (0.00)
6     col_is_posix()        TIMESTAMP                                               1      1 (1.00)     0 (0.00)
7     col_vals_between()    TIMESTAMP          [2016-01-01, 2020-10-07 23:00:00]    691M   691M (0.99)  183K (0.01)

2020-12-04 16:02:20 CET | 4.2 s | 2020-12-04 16:02:24 CET

As you can see, about 183K observations fall outside the date range. That is above the warning threshold (1,000) but below the stop threshold (200,000), so this step only triggers a warning.
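
Since I ran interrogate() with extract_failed = FALSE, the failing rows were not collected. If we leave that argument at its default, pointblank stores a sample of the offending rows that we can pull out afterwards with get_data_extracts(); a minimal sketch, assuming the agent can simply be interrogated again (step 7 is the col_vals_between() step):

## Re-run the interrogation with row extracts enabled (the default)
agent_extract <- agent_warn %>% 
  interrogate()

## Pull out a sample of the rows that failed validation step 7
get_data_extracts(agent_extract, i = 7)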

Creating notification emails

Many data validation tasks, at least for a Data Scientist, look something like this: you want to check that the data behaves as you expect, and if it doesn’t, you would like some kind of notification. pointblank has great functionality for this and makes it easy to write data validation reports that can be sent out by email through the blastula package, also created by Rich Iannone.

To create a skeleton for an email we just use email_create() on the agent.

Below I create a simple function that sends an email if not all validation steps pass.

library(blastula)

send_validation_email <- function(agent){
  
  if(!isTRUE(all_passed(agent))){
    
    email_agent <- email_create(agent)  
    
    table_name <- agent$tbl_name
    
    subject <- glue::glue("There was at least a warning in the data validation of {table_name}")
    
    email_agent %>%
      smtp_send(
        to = "filip.wastberg@solita.fi", 
        from = "filip.wastberg@solita.fi",
        subject = subject,
        credentials = creds_key(id = "outlook")
      )
  } else {
    message("Everything passed, no need to send email.")
  }
}

send_validation_email(agent_warn)

If there are any warnings, this results in a notification email containing the validation report.

Of course, this is a simple example that can be developed further. It can also be set up to run with Docker or on a Data Science platform such as RStudio Connect, which is what I would recommend if something like this is to run in “production”.

And it doesn’t end here. pointblank can generate YAML files for you, and the validation functions are flexible enough to handle more specific needs.
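
For example, the validation plan can be written to YAML and used to recreate and interrogate the agent later, which makes it easy to keep the checks under version control or run them on a schedule; a minimal sketch (the file name is made up):

## Write the agent's validation plan to a YAML file
yaml_write(agent_warn, filename = "metering-validation.yml")

## Later: recreate the agent from the YAML file and interrogate it
agent_from_yaml <- yaml_agent_interrogate(filename = "metering-validation.yml")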

Summary

R is often regarded as a niche programming language. And there is no reason to sugarcoat it: R is a weird language. But today it is one of the ten most popular programming languages (not bad for a niche language), it has over 15 000 open-source libraries and it is used for everything from building machine learning models to advanced web applications and, as we have shown here, data validation. Furthermore, the development of R over the last few years has focused a lot on syntax and ease of use, an often overlooked part of programming, and this, I think, is one reason it continues to attract new users. pointblank is one of those packages that make R useful and a joy to use: it takes a complex task and makes it look easy through a carefully thought-out API.

And last but not least, it is pretty. And we love pretty things.