---
title: "Data cleaning"
description: >
  To understand how surveygraph handles irregular data...
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data cleaning}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
```

_This page is under development. Stay tuned!_

This vignette gives an overview of how the surveygraph package preprocesses data, and of the optional arguments that control how irregular values are handled.

We'll start by loading surveygraph,

```{r, message = FALSE}
library(surveygraph)
```

and define an example data frame `df` that we will supply to surveygraph.

```{r}
df <- data.frame(
  item1 = c(2, -99, 1, 1, 100, 5, 5, 4, 3),
  item2 = c(1, 3, 1, 2, 4, 3, 4, 5, 4),
  item3 = c(2, 1, 3, -99, 5, 6, 8, 4, 10)
)

df
```

## Data frame input

The first thing we check is that the input data is a data frame. If it is not, execution halts with an error. Future versions may attempt to coerce other formats to data frames.

For instance, if we attempt to run the `make_projection()` routine on a list, we get the following error.

```{r, error = TRUE}
make_projection(list(c(1, 2, 3)))
```

Similarly, an error is raised if an empty data frame is provided.

```{r, error = TRUE}
make_projection(data.frame())
```

## Coercion

Our approach is to coerce all data to floating point values, and to set any entry that cannot be coerced to `NA`.

### Character strings

If a column contains string representations of numeric data, these are coerced to floating point numbers. Any other strings are set to `NA`.

### Logical values

If survey entries contain `TRUE` or `FALSE`, these are coerced to 1 and 0, respectively.

## Dummy coding

This is a flag that, if set to `TRUE`, dummy codes every value that falls outside the range specified by the `likert` argument.

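
As a concrete illustration of the coercion rules in the Coercion section, here is a minimal sketch in base R. It is not surveygraph's internal code. The data frame `raw` and the use of `as.numeric()` are assumptions chosen only to mimic the behaviour described: numeric strings and logical values become floating point numbers, and anything else becomes `NA`.

```{r}
# Illustration in base R only; this is not how surveygraph is implemented.
raw <- data.frame(
  item1 = c("1", "2.5", "abc"),  # strings, one of them non-numeric
  item2 = c(TRUE, FALSE, TRUE),  # logical values
  stringsAsFactors = FALSE
)

# as.numeric() converts numeric strings and logicals to doubles and turns
# everything else into NA. suppressWarnings() hides the
# "NAs introduced by coercion" warning that the non-numeric string triggers.
coerced <- as.data.frame(
  lapply(raw, function(x) suppressWarnings(as.numeric(x)))
)

coerced
```

In this sketch the entry `"abc"` ends up as `NA`, while `TRUE` and `FALSE` end up as 1 and 0.
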
## Likert range

The optional `likert` argument lets us specify the range of values that are to be interpreted as valid input data. Anything that falls outside this range is set to `NA`, or is dummy coded.

A convenient starting point is the observed minimum and maximum of each item.

```{r}
l <- data.frame(
  minval = apply(df, 2, min, na.rm = TRUE),
  maxval = apply(df, 2, max, na.rm = TRUE)
)
```

This creates the following data frame.

```{r}
l
```

By visually inspecting the limiting values for each item, it is easy to see which columns contain flag values, such as `-99` and `100` in our data. As such, we might set

```{r}
# set the minimum value of items one and three to 1
l$minval[1] <- 1
l$minval[3] <- 1

# set the maximum value of item one to 10
l$maxval[1] <- 10
```

Following these changes, we interpret the Likert ranges to be

```{r}
l
```

Now, we provide the Likert specification `l` to `make_projection()` to tell surveygraph how to handle the outliers.

```{r}
# verbose = T
# make_projection(df, likert = l, showdata = T)
```
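
To make the effect of a Likert range concrete, the following base R sketch sets every value outside a per-item range to `NA`. It is a sketch under stated assumptions, not surveygraph's implementation: `mask_out_of_range()` is a hypothetical helper, and it assumes that row `j` of the range data frame describes column `j` of the data. The small `example` and `ranges` data frames are made up for this illustration.

```{r}
# Hypothetical helper for illustration only; not part of surveygraph.
# Assumes row j of `ranges` (columns minval, maxval) describes column j of `data`.
mask_out_of_range <- function(data, ranges) {
  for (j in seq_along(data)) {
    # TRUE for non-missing entries below minval or above maxval
    out <- !is.na(data[[j]]) &
      (data[[j]] < ranges$minval[j] | data[[j]] > ranges$maxval[j])
    data[[j]][out] <- NA
  }
  data
}

# Made-up example data containing the flag values -99 and 100
example <- data.frame(
  item1 = c(2, -99, 1, 100),
  item2 = c(1, 3, 5, 4)
)
ranges <- data.frame(minval = c(1, 1), maxval = c(10, 5))

masked <- mask_out_of_range(example, ranges)
masked
```

Here the flag values `-99` and `100` in `item1` are replaced by `NA`, and `item2` is left unchanged because all of its values lie within the range.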