---
title: "Data cleaning"
description: >
  To understand how surveygraph handles irregular data...
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data cleaning}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(collapse = T, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
```

_This page is under development. Stay tuned!_

This vignette gives an overview of how data is preprocessed by the surveygraph
package, following a number of optional arguments that specify how certain data
is to be handled.

We'll start by loading surveygraph,

```{r, message = FALSE}
library(surveygraph)
```

and assume data `S` that we attempt to supply to surveygraph.

```{r}
df <- data.frame(
	item1 = c(2, -99, 1,   1, 100, 5, 5, 4,  3),
	item2 = c(1,   3, 1,   2,   4, 3, 4, 5,  4), 
	item3 = c(2,   1, 3, -99,   5, 6, 8, 4, 10)
)

df
```

## Data frame input

The first this we check is that the input data `S` is a dataframe. If it's not
the program is halted and an error is output. Future versions may attempt to 
coerce other formats to dataframes.

For instance, if we attempt to run the `make_projection()` routine on a list,
we get the following error.

```{r error = TRUE}
make_projection(list(c(1, 2, 3)))
```

Similarly, an error is output if an empty data frame is provided.

```{r error = TRUE}
make_projection(data.frame())
```

## Coercion

Our approach is to coerce all data to floating point types, and to set them to NA
otherwise.

### Character strings

If columns happen to string literals of numeric data, these are coerced to 
floating point numbers, otherwise they are set to `NA`.

### Logical values

If survey entries contain `TRUE` or `FALSE`, then these are coerced to 1 and 0,
respectively.

## Dummy coding

This is a flag that if set to `TRUE`, dummy codes everything that falls outside
the range specified by the `likert` flag.

## Likert range

The `likert` optional argument allows us to specify the range of the values
that we are to interpret as valid input data. The idea is that anything that
falls outside of this range is set to `NA`, or is dummy coded.

```{r}
l <- data.frame(
	minval = apply(df, 2, min, na.rm = TRUE),
	maxval = apply(df, 2, max, na.rm = TRUE)
)
```
This creates the following data frame.
```{r}
l
```
The idea is that by visually inspecting the limiting values for each item, it
is obvious which columns contain flags, such as `-99` and `100` in our data. 
As such, we might set 
```{r}
# set the minimum value of items one and three to 1
l$minval[1] <- 1
l$minval[3] <- 1

# set the maximum value of item one to 10
l$maxval[1] <- 10
```
Following these changes, we interpret the Likert ranges to be
```{r}
l
```
Now, we provide the Likert specification `l` to `make_projection` to tell surveygraph
how to handle the outliers.
```{r}
# verbose = T
#make_projection(df, likert=l, showdata=T)
```