Understand the workflow • harmonizR

Generalities

In a typical harmonization process, ‘harmonizR’ functions use four main inputs: study-specific datasets, study-specific data dictionaries, the DataSchema, and data processing elements. These inputs are generally created in other programs, imported into R, and prepared as needed to match the formats required by harmonizR functions.

Study-specific datasets contain variables in their original formats (e.g., as collected by separate studies) and provide the source variables for harmonization processing. Ideally, the user also provides study-specific data dictionaries, which contain metadata about the variables (e.g., labels, units, categories) and are associated with the datasets, but these dictionaries are not required.
The DataSchema defines the list and attributes of harmonized variables to be generated (guidelines Step 2).
The data processing elements contain the processing rules and metadata that will be used to generate harmonized variables from study-specific variables (guidelines Step 3).

These inputs provide the information used to evaluate and document the harmonization process and the harmonized datasets produced (Steps 4 and 5). The DataSchema and data processing elements are first created as Excel files from templates and then imported into R. These elements are mandatory for the technical process of harmonization in this package.
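To make the four inputs concrete, here is a minimal sketch that represents them as plain R data frames. It uses base R only, and every dataset, column name, and rule shown is a simplified, hypothetical stand-in rather than the exact format harmonizR expects (the package templates define the real columns).

```r
# Two study-specific datasets in their original formats (made-up data).
study_A <- data.frame(id = 1:3, height_cm = c(170, 182, 165))
study_B <- data.frame(id = 1:2, height_in = c(68, 70))

# Optional data dictionary describing the variables of study_A.
dd_study_A <- data.frame(
  name  = c("id", "height_cm"),
  label = c("Participant identifier", "Height in centimetres"),
  stringsAsFactors = FALSE
)

# DataSchema: the harmonized variables to be generated.
dataschema <- data.frame(
  name       = c("id", "height_m"),
  label      = c("Participant identifier", "Height in metres"),
  value_type = c("integer", "decimal"),
  stringsAsFactors = FALSE
)

# Data processing elements: one row per harmonized variable and per
# input dataset, naming the source variable and the rule to apply.
# Column names here are illustrative assumptions.
data_proc_elem <- data.frame(
  dataschema_variable = c("id", "height_m", "id", "height_m"),
  input_dataset       = c("study_A", "study_A", "study_B", "study_B"),
  input_variable      = c("id", "height_cm", "id", "height_in"),
  rule                = c("id", "height_cm / 100", "id", "height_in * 0.0254"),
  stringsAsFactors = FALSE
)
```

In this sketch, every DataSchema variable has exactly one processing row per study, which is what lets each study produce a complete harmonized dataset.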

How does the harmonization process work in the package?

To function properly, the harmonization process needs to know which variables to extract from the input dataset(s) in order to generate each harmonized variable in the output harmonized dataset(s).

In other words, the harmonization process requires a clear mapping between the study-specific variables in the input dataset and the harmonized variables to be generated in the output harmonized dataset. This mapping is typically defined using the data processing elements (DPE) and the DataSchema.

The DPE specifies the rules and transformations necessary to derive the harmonized variables from the study-specific variables. It outlines how the data should be processed, transformed, or combined to create the harmonized variables.

The DataSchema, on the other hand, provides a structured description of the harmonized variables to be generated. It includes information such as variable names, data types, and any other relevant metadata needed for the harmonization process.

By using the DPE in conjunction with the DataSchema, the harmonization process can accurately extract, process, and create harmonized variables in the output datasets based on the variables found in the input datasets. This ensures that the harmonization is done correctly and consistently across different datasets from various studies.
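As an illustration of how processing rules can turn study-specific variables into harmonized ones, the following base R sketch evaluates one rule per harmonized variable against an input dataset. The helper `apply_rules()`, the rule syntax, and the column names are hypothetical and are not the package's own implementation.

```r
# Study-specific dataset in its original format (made-up data).
study_A <- data.frame(id = 1:3, height_cm = c(170, 182, 165))

# Simplified processing rules for this dataset: each row names a
# harmonized variable and an R expression that derives it.
rules_A <- data.frame(
  dataschema_variable = c("id", "height_m"),
  rule                = c("id", "height_cm / 100"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate each rule with the dataset's columns
# in scope and collect the results under the harmonized names.
apply_rules <- function(dataset, rules) {
  harmonized <- lapply(rules$rule, function(r) {
    eval(parse(text = r), envir = dataset)
  })
  names(harmonized) <- rules$dataschema_variable
  as.data.frame(harmonized, stringsAsFactors = FALSE)
}

harmonized_A <- apply_rules(study_A, rules_A)
print(harmonized_A)
```

The output columns come from the rule table rather than from the input dataset, which is why datasets with different source variables can end up with identical harmonized structures.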

At the end of the process, the harmonized_dossier contains as many datasets as there are study-specific datasets in the input. Each harmonized dataset has the same number and names of columns, which correspond to the names of the harmonized variables declared in the DataSchema.

In other words, for each study-specific dataset provided as input, the harmonization process generates a corresponding harmonized dataset in the harmonized_dossier. The structure of each harmonized dataset is standardized, ensuring that they all have the same variables, and the names of these variables are consistent with the information specified in the DataSchema.
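The structural guarantee described above can be checked with a few lines of base R. In this sketch a plain named list stands in for the harmonized_dossier object, and the data and variable names are invented for illustration.

```r
# Harmonized variable names as declared in a (hypothetical) DataSchema.
dataschema_names <- c("id", "height_m")

# A plain list standing in for the harmonized dossier: one harmonized
# dataset per study-specific input dataset (made-up data).
harmonized_dossier <- list(
  study_A = data.frame(id = 1:3, height_m = c(1.70, 1.82, 1.65)),
  study_B = data.frame(id = 1:2, height_m = c(1.73, 1.78))
)

# Check that every dataset has exactly the DataSchema columns, in order.
same_structure <- vapply(
  harmonized_dossier,
  function(ds) identical(names(ds), dataschema_names),
  logical(1)
)
stopifnot(all(same_structure))

# Because the structures match, the datasets can be stacked directly.
pooled <- do.call(rbind, harmonized_dossier)
```

Stacking with `rbind()` works here only because the column names agree across datasets, which is the property the DataSchema enforces.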

This consistency and standardization in the structure of the harmonized datasets make it easier for further analyses and comparisons across different studies. Researchers can work with a unified dataset format, facilitating the integration and synthesis of data from diverse sources while ensuring that the harmonized variables are appropriately named and aligned according to the predefined DataSchema.

See how to fill in the data processing elements