Data cleaning

This page is under development. Stay tuned!

This vignette gives an overview of how the surveygraph package preprocesses data, and of the optional arguments that control how particular kinds of values are handled.

We’ll start by loading surveygraph,

library(surveygraph)

and define an example data frame, df, that we will supply to surveygraph.

df <- data.frame(
    item1 = c(2, -99, 1,   1, 100, 5, 5, 4,  3),
    item2 = c(1,   3, 1,   2,   4, 3, 4, 5,  4), 
    item3 = c(2,   1, 3, -99,   5, 6, 8, 4, 10)
)

df
#>   item1 item2 item3
#> 1     2     1     2
#> 2   -99     3     1
#> 3     1     1     3
#> 4     1     2   -99
#> 5   100     4     5
#> 6     5     3     6
#> 7     5     4     8
#> 8     4     5     4
#> 9     3     4    10

Data frame input

The first thing we check is that the input data is a data frame. If it’s not, the program halts with an error. Future versions may attempt to coerce other formats to data frames.

For instance, if we attempt to run the make_projection() routine on a list, we get the following error.

make_projection(list(c(1, 2, 3)))
#> Error in make_projection(list(c(1, 2, 3))): Input data must be provided as a data frame.

Similarly, an error is output if an empty data frame is provided.

make_projection(data.frame())
#> Error in make_projection(data.frame()): Data frame cannot be empty.
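
The two checks can be sketched in base R as follows. This is only an illustration: the helper name check_input is hypothetical and not part of surveygraph, the error messages are copied from the output above, and treating "empty" as zero rows or zero columns is an assumption.

```r
# Hypothetical helper, not part of surveygraph: mimics the checks described above.
check_input <- function(data) {
    # Reject anything that is not a data frame (lists, matrices, vectors, ...).
    if (!is.data.frame(data)) {
        stop("Input data must be provided as a data frame.")
    }
    # Assumption: "empty" means no rows or no columns.
    if (nrow(data) == 0 || ncol(data) == 0) {
        stop("Data frame cannot be empty.")
    }
    # Return the input invisibly so the check can be used in a pipeline.
    invisible(data)
}

# Passes silently for a valid data frame.
check_input(data.frame(item1 = c(1, 2, 3)))
```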

Coercion

Our approach is to coerce all data to floating point values, and to set any entry that cannot be coerced to NA.

Character strings

If columns happen to be string literals of numeric data, these are coerced to floating point numbers; otherwise they are set to NA.
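
The behaviour described here matches what base R's as.numeric() does for character vectors, so a minimal sketch looks like this. Whether surveygraph uses as.numeric() internally is an assumption, and the vector x is made-up example data.

```r
# Made-up example column: numeric responses stored as strings, plus one non-numeric entry.
x <- c("1", "2.5", "four", "3")

# as.numeric() parses numeric strings and returns NA (with a warning) for the rest;
# suppressWarnings() silences the "NAs introduced by coercion" warning.
x_num <- suppressWarnings(as.numeric(x))

x_num
#> [1] 1.0 2.5  NA 3.0
```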

Logical values

If survey entries contain TRUE or FALSE, then these are coerced to 1 and 0, respectively.
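
In base R the same mapping is obtained with as.numeric(), as the short sketch below shows. The vector answers is made-up example data, and the use of as.numeric() is an assumption about how the coercion could be done, not a statement about surveygraph's internals.

```r
# Made-up example column of logical responses, including a missing value.
answers <- c(TRUE, FALSE, TRUE, NA)

# TRUE becomes 1, FALSE becomes 0, and NA stays NA.
as.numeric(answers)
#> [1]  1  0  1 NA
```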

Dummy coding

This is a flag that, if set to TRUE, dummy codes everything that falls outside the range specified by the likert argument.

Likert range

The optional likert argument allows us to specify the range of values that we interpret as valid input data. The idea is that anything that falls outside of this range is set to NA, or is dummy coded.

l <- data.frame(
    minval = apply(df, 2, min, na.rm = TRUE),
    maxval = apply(df, 2, max, na.rm = TRUE)
)

This creates the following data frame.

l
#>       minval maxval
#> item1    -99    100
#> item2      1      5
#> item3    -99     10

The idea is that by inspecting the limiting values for each item, we can see which columns contain special codes rather than genuine responses, such as -99 and 100 in our data. As such, we might set

# set the minimum value of items one and three to 1
l$minval[1] <- 1
l$minval[3] <- 1

# set the maximum value of item one to 10
l$maxval[1] <- 10

Following these changes, we interpret the Likert ranges to be

l
#>       minval maxval
#> item1      1     10
#> item2      1      5
#> item3      1     10

Now, we provide the Likert specification l to make_projection() to tell surveygraph how to handle the out-of-range values.

# verbose = T
#make_projection(df, likert=l, showdata=T)
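
To illustrate the intended effect of the Likert specification when out-of-range values are set to NA, here is a base R sketch that applies the ranges to the example data directly. The helper mask_out_of_range is hypothetical and is not how surveygraph is implemented; df and the edited ranges are redefined so that the block runs on its own.

```r
# Example data from the start of this vignette.
df <- data.frame(
    item1 = c(2, -99, 1,   1, 100, 5, 5, 4,  3),
    item2 = c(1,   3, 1,   2,   4, 3, 4, 5,  4),
    item3 = c(2,   1, 3, -99,   5, 6, 8, 4, 10)
)

# Likert ranges after the manual edits above.
l <- data.frame(
    minval = c(1, 1, 1),
    maxval = c(10, 5, 10),
    row.names = names(df)
)

# Hypothetical helper: replace values outside [minval, maxval] with NA, column by column.
mask_out_of_range <- function(data, ranges) {
    for (j in seq_along(data)) {
        # which() drops NA comparisons, so existing NAs are left untouched.
        out <- which(data[[j]] < ranges$minval[j] | data[[j]] > ranges$maxval[j])
        data[[j]][out] <- NA
    }
    data
}

cleaned <- mask_out_of_range(df, l)

# Rows 2 and 5 of item1 and row 4 of item3 are now NA; item2 is unchanged.
cleaned
```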