Rmonize 2.0.0

Data Processing Elements

The Data Processing Elements (DPE) is a table that defines and documents the processing used to generate harmonized datasets. It is typically prepared from an Excel template, where each row specifies whether the input dataset can generate the DataSchema variable and, if so, the input variables and processing algorithm used to generate the harmonized variable as defined in the DataSchema. This page explains the basics of filling out the DPE so that Rmonize functions can use it correctly.

General Structure of the DPE

The DPE has five mandatory columns, which Rmonize functions need in order to process data. Additional columns for documentation may be present, but they must use different names and are ignored by Rmonize functions.
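As an illustration, a small DPE could be assembled directly in R instead of from the Excel template. This is a minimal sketch under stated assumptions: only input_variables, rule_category and algorithm are named on this page, so the column names dataschema_variable and input_dataset, the dataset name study_A, the variable names, and the algorithm value used for id_creation are all hypothetical. Check the template for the exact column names Rmonize expects.

```r
# Hypothetical DPE for a single input dataset; every name here is illustrative.
dpe <- data.frame(
  dataschema_variable = c("adm_id", "sex", "smoker", "bmi"),
  input_dataset       = c("study_A", "study_A", "study_A", "study_A"),
  input_variables     = c("part_id", "gender", "smk_now;smk_ever", "__BLANK__"),
  rule_category       = c("id_creation", "recode", "case_when", "impossible"),
  algorithm           = c(
    "id_creation",                                   # assumed algorithm value
    "recode(1 = 0 ; 2 = 1 ; ELSE = NA)",
    "case_when(smk_now == 1 ~ 1L ; smk_ever == 0 ~ 0L ; ELSE ~ NA_integer_)",
    "impossible"
  ),
  # Extra documentation column: permitted because its name differs
  # from the mandatory columns.
  comment             = c("", "1 = male, 2 = female in the input", "", "not collected"),
  stringsAsFactors    = FALSE
)

# id_creation is the first rule listed for the input dataset.
print(dpe[, c("dataschema_variable", "rule_category")])
```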

Rule Categories

  • id_creation
  • direct_mapping
  • recode
  • case_when
  • paste
  • operation
  • other
  • impossible
  • undetermined

id_creation creates a unique id per row of the input dataset and must be the first rule for each input dataset to initiate data processing. The user provides the input variable to use.


Notes:

  • The values of the harmonized variable are taken directly from the input values.

  • The input dataset should have an identifier created before harmonization in the desired format.
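Because id_creation expects one unique identifier per row, it can help to check the chosen input variable before harmonization. The following is a minimal base R sketch of such a check, not part of Rmonize; the dataset and the variable name part_id are hypothetical.

```r
# Hypothetical input dataset; "part_id" is the variable chosen for id_creation.
input_dataset <- data.frame(
  part_id = c("A001", "A002", "A003"),
  age     = c(34, 51, 47),
  stringsAsFactors = FALSE
)

# The identifier should be present on every row and never repeated.
id_is_complete <- !anyNA(input_dataset$part_id)
id_is_unique   <- anyDuplicated(input_dataset$part_id) == 0

# Stop early if the identifier is not usable as a unique id.
stopifnot(id_is_complete, id_is_unique)
```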

direct_mapping generates the harmonized variable by replicating one input variable.


Note:

  • Uses only one input variable.

recode generates the harmonized variable by recoding values from one input variable.


Notes:

  • Uses only one input variable.

  • The input variable should generally be a categorical variable. To create a categorical harmonized variable from a continuous input variable, use case_when instead.

  • If all values in the input variable are the same as in the DataSchema variable, use direct_mapping instead.

  • Specify each input value to output value recoding with an equal sign = .

  • Separate each recoding with a semi-colon ; .

  • Use ELSE = NA to attribute NA to all remaining input values.

If equal signs = already exist in the data, use _= to escape them. Similarly, if semi-colons ; already exist in the data, use _; to escape them.

Numerical input values can be gathered using R syntax to recode multiple values at the same time.

recode(
0            = "low"   ;
c(1:10)      = "mid"   ;
c(-7, -99)   = NA    )

If more complex coding is required, use case_when or other instead.
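To make the effect of the recode example above concrete, here is a plain base R sketch that produces the same mapping on a hypothetical input vector. It only illustrates what the rule amounts to and is not how Rmonize evaluates the algorithm; the vector input_var is an assumption, and every value in it is covered by one of the three recodings.

```r
# Hypothetical input values, including two missing-value codes (-7 and -99).
input_var <- c(0, 3, 10, -7, -99, 1)

# Start from NA so that the missing-value codes remain NA.
harmonized_var <- rep(NA_character_, length(input_var))

harmonized_var[input_var %in% 0]    <- "low"   # 0       = "low"
harmonized_var[input_var %in% 1:10] <- "mid"   # c(1:10) = "mid"
# c(-7, -99) = NA: nothing to assign, these positions are already NA.

print(harmonized_var)
```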

case_when generates the harmonized variable from one or more if-else conditions, using one or more input variables.


Notes:

  • Multiple input variables can be used. Separate the name of each in the input_variables column with a semi-colon ;

  • If only one input variable is used, consider using recode or direct_mapping instead.

  • Separate the left and right side (input/output) of each statement combination with a tilde ~

  • Separate each case-when statement with a semi-colon ; .

  • Use ELSE ~ NA to attribute NA to all remaining input values.

case_when is sensitive to data type. Each output value generated by statements must have the same data type, including NA values.

case_when(
var_x == 1                 ~ 1L ;
var_x == 0 & !is.na(var_y) ~ 0L ;
ELSE                       ~ NA_integer_  )

case_when(
var_x == 1                 ~ "1" ;
var_x == 0 & !is.na(var_y) ~ "0" ;
ELSE                       ~ NA_character_)

If the statement requires more complex coding, consider using other instead.
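The integer example above can be mirrored in base R to show both the first-match logic and the consistent output type. This is a sketch on hypothetical vectors var_x and var_y, not the way Rmonize evaluates the algorithm; a row whose condition evaluates to NA is treated here as not matching.

```r
# Hypothetical input variables.
var_x <- c(1, 0, 0, NA)
var_y <- c(5, NA, 2, 3)

# ELSE ~ NA_integer_: start with an integer NA for every row.
harmonized_var <- rep(NA_integer_, length(var_x))

# Assign the statements in reverse order so that the first statement
# of the algorithm takes precedence. which() drops NA conditions.
harmonized_var[which(var_x == 0 & !is.na(var_y))] <- 0L
harmonized_var[which(var_x == 1)]                 <- 1L

print(harmonized_var)
print(typeof(harmonized_var))  # every output value, NA included, is an integer
```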

paste generates the harmonized variable by setting the same value for all observations, not using values from any input variables.


Notes:

  • This rule category does not use any input variables. The value __BLANK__ must be entered in the column input_variables.

  • This rule category is often used when the harmonized variable is an identifier for the whole dataset, e.g., a study or population identifier.
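In plain R terms, the result of a paste rule is a constant column with one value per observation. The sketch below illustrates this with a hypothetical input dataset and a hypothetical study identifier "study_A"; it does not show the Rmonize algorithm syntax.

```r
# Hypothetical input dataset with three observations.
input_dataset <- data.frame(
  part_id = c("A001", "A002", "A003"),
  stringsAsFactors = FALSE
)

# The same value is set for every observation; no input variable is read.
harmonized_var <- rep("study_A", nrow(input_dataset))

print(harmonized_var)
```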

operation generates the harmonized variable by applying an operation to one or more input variables.


Notes:

  • Multiple input variables can be used. Separate the name of each in the input_variables column with a semi-colon ;

  • This rule category is intended to document relatively simple transformations. If the operation requires more complex or longer coding, consider using other instead.
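A typical simple transformation combines two input variables arithmetically. The following base R sketch computes a body mass index from hypothetical variables weight_kg and height_m; the variable names and values are assumptions, and the code shows only the kind of calculation an operation rule would document.

```r
# Hypothetical input dataset with two input variables.
input_dataset <- data.frame(
  weight_kg = c(70, 82.5, 58),
  height_m  = c(1.75, 1.80, 1.60)
)

# Simple element-wise operation on two input variables.
bmi <- input_dataset$weight_kg / input_dataset$height_m^2

print(round(bmi, 1))
```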

other generates the harmonized variable from a non-standard or complex processing rule, not covered by other rule categories.


Note:

  • This rule category is equivalent to launching an R script.

The user environment needs to be controlled carefully when using this rule category to avoid unexpected results. Use double assignment "<<-" to place the result in the user environment.

my_harmo_var <- runif(20) + ... # complex lines of code

# Double assignment to modify the environment.
harmonized_dossier$DATASET$variable_F <<- my_harmo_var

This rule category can also be used to source code from a script with more complex data processing.

source("my_file.R")
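The role of the double assignment can be seen in a small standalone sketch. Here a hypothetical function run_other_rule() stands in for code evaluated away from the user environment: a plain <- inside it would only change a local copy, whereas <<- modifies the object where it is already defined. The object and column names follow the example above; the function and the values are assumptions.

```r
# Hypothetical harmonized dossier holding one dataset.
harmonized_dossier <- list(DATASET = data.frame(id = 1:3))

# Stand-in for code that is evaluated inside another environment.
run_other_rule <- function() {
  my_harmo_var <- c(10, 20, 30)  # local object, discarded when the function ends
  # Double assignment updates harmonized_dossier in the enclosing environment.
  harmonized_dossier$DATASET$variable_F <<- my_harmo_var
  invisible(NULL)
}

run_other_rule()
print(harmonized_dossier$DATASET)
```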

The impossible and undetermined rule categories handle cases where the DataSchema variable cannot be generated for an input dataset or where its status is undetermined. An empty column is generated for the DataSchema variable in the harmonized dataset. This ensures that the DPE contains a row for each DataSchema variable and each input dataset, that there are no missing arguments, and that data processing can proceed even if not all algorithms are finalized.


Notes:

  • __BLANK__ : A value must be provided in input_variables. If no input variable is needed to generate the harmonized variable, __BLANK__ is used as a placeholder in input_variables.

  • impossible : If the input dataset does not contain relevant or compatible information to generate the DataSchema variable, the rule_category and algorithm are marked ‘impossible’.

  • undetermined : If any of the data processing elements needs further investigation or information to be determined, the rule_category and algorithm can be marked ‘undetermined’ to allow running harmo_process() without error and documenting the undetermined status.
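Since every combination of DataSchema variable and input dataset needs a row, a quick completeness check can reveal combinations that still have to be filled in, for example as impossible or undetermined. This is a base R sketch and not an Rmonize function; the column names dataschema_variable and input_dataset, and all variable and dataset names, are hypothetical.

```r
# Hypothetical DataSchema variables and input datasets.
dataschema_variables <- c("adm_id", "sex", "bmi")
input_datasets       <- c("study_A", "study_B")

# Hypothetical DPE in which one combination has been forgotten.
dpe <- data.frame(
  dataschema_variable = c("adm_id", "sex", "bmi", "adm_id", "sex"),
  input_dataset       = c("study_A", "study_A", "study_A", "study_B", "study_B"),
  rule_category       = c("id_creation", "recode", "undetermined",
                          "id_creation", "impossible"),
  stringsAsFactors    = FALSE
)

# Every combination that should have a row.
expected <- expand.grid(
  dataschema_variable = dataschema_variables,
  input_dataset       = input_datasets,
  stringsAsFactors    = FALSE
)

# Compare combinations through a simple text key.
make_key <- function(d) paste(d$input_dataset, d$dataschema_variable, sep = " / ")
missing_rows <- setdiff(make_key(expected), make_key(dpe))

print(missing_rows)
```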

Examples of Rule Categories
