a-Glossary-and-templates.Rmd
The objects used by Rmonize to process inputs into harmonized outputs are described below. For each object, the components used by the package are listed. They must be named exactly as shown, except for placeholder names given in square brackets ([…]). An asterisk (*) marks an object or column that must be provided by the user. Columns without an asterisk that are not provided are generally filled with default values by the functions that need them. Additional columns with other names can be present but are not used in data processing.
You can download templates or find additional documentation where available using the links provided.
List of core variables to generate across datasets, and related metadata. To be compatible with Rmonize, the DataSchema is typically prepared from an Excel template including two separate sheets. The first one is used to document variables and the second to provide information related to categorical variables.
Columns to be included in the Excel document.
Name | Description |
---|---|
index | Index to order variables in the table. |
name * | Name of the DataSchema variable. Each entry must be unique. The first entry must be the primary identifier variable (e.g., participant unique ID). |
label | Short description of the DataSchema variable. A language can be specified using a language code, such as ‘label:en’ for English or ‘label:fr’ for French. |
valueType | Value type of the variable (e.g., text, integer, decimal, boolean, date, datetime). See additional details. |
Columns to be included in the Excel document. If there are no categories to define, this table can be blank or the Categories sheet can be excluded.
Name | Description |
---|---|
variable * | Name of the DataSchema variable to which the category belongs. This column is required if the Categories table is present. The value must also be present in the column ‘name’ in the Variables table. |
name * | Category code value. This column is required if the table Categories is present. The combination of ‘variable’ and ‘name’ within the Categories table (i.e., the combination of DataSchema variable and category code value) must be unique. |
label | Short description of the category code value. A language can be specified using a language code, such as ‘label:en’ for English or ‘label:fr’ for French. |
missing | Boolean value (TRUE/FALSE or 1/0) indicating if the value in ‘name’ is interpreted as a missing value (e.g., question skipped by design in a questionnaire or a response option “Prefer not to answer”). |
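
The rules stated in the two tables above can be checked before the DataSchema is passed to the package. The following is a minimal sketch in base R, assuming the two sheets have been read into data frames named `Variables` and `Categories`. The variable names, labels, codes and the helper `check_dataschema()` are hypothetical and are not part of Rmonize.

```r
# DataSchema sketched as a list of two data frames (base R only).
dataschema <- list(
  Variables = data.frame(
    index     = 1:3,
    name      = c("id", "sex", "age"),   # first entry: primary identifier
    label     = c("Participant ID", "Sex", "Age in years"),
    valueType = c("text", "integer", "integer"),
    stringsAsFactors = FALSE
  ),
  Categories = data.frame(
    variable = c("sex", "sex", "sex"),
    name     = c("1", "2", "-9"),
    label    = c("Female", "Male", "Prefer not to answer"),
    missing  = c(FALSE, FALSE, TRUE),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: TRUE when the uniqueness and cross-reference
# rules described in the tables hold, FALSE otherwise.
check_dataschema <- function(x) {
  vars <- x$Variables
  cats <- x$Categories
  # Each variable name must be unique.
  ok <- !anyDuplicated(vars$name)
  # The Categories table is optional; check it only when it has rows.
  if (!is.null(cats) && nrow(cats) > 0) {
    ok <- ok &&
      all(cats$variable %in% vars$name) &&          # category belongs to a known variable
      !anyDuplicated(cats[c("variable", "name")])   # (variable, name) pairs are unique
  }
  ok
}

schema_is_valid <- check_dataschema(dataschema)
```

The check returns a single logical value so it can be used in `stopifnot()` or in a conditional before further processing.
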
Data table containing a collection of variables to process under the DataSchema format.
Name | Description |
---|---|
[col_1] * | First variable in the input dataset, typically the identifier or index. A dataset must have at least one variable. |
[col_2] … | Additional variable(s) in the input dataset. |
List of variables in an input dataset. To be compatible with Rmonize, the input data dictionary can be prepared as an Excel template including two separate sheets. The first one is used to document variables and the second to provide information related to categorical variables.
Columns to be included in the Excel document.
Name | Description |
---|---|
index | Index to order variables in the table. |
name * | Name of the input dataset variable. Each entry must be unique. The first entry is typically the primary identifier variable (e.g., participant unique ID). |
label | Short description of the input dataset variable. A language can be specified using a language code, such as ‘label:en’ for English or ‘label:fr’ for French. |
valueType | Value type of the variable (e.g., text, integer, decimal, boolean, date, datetime). See additional details. |
Metadata table containing the list of categories and related metadata (coding and description of the response options) defined for categorical variables (if any). If there are categorical variables defined, this table is required and uses the following columns.
Name | Description |
---|---|
variable * | Name of the input dataset variable to which the category belongs. This column is required if the Categories table is present. The value must also be present in the column ‘name’ in the Variables table. |
name * | Category code value. This column is required if the table Categories is present. The combination of ‘variable’ and ‘name’ within the Categories table (i.e., the combination of input dataset variable and category code value) must be unique. |
label | Short description of the category code value. A language can be specified using a language code, such as ‘label:en’ for English or ‘label:fr’ for French. |
missing | Boolean value (TRUE/FALSE or 1/0) indicating if the value in ‘name’ is interpreted as a missing value (e.g., question skipped by design in a questionnaire or a response option “Prefer not to answer”). |
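
An input data dictionary is only useful if it agrees with the dataset it describes. As an illustration, the sketch below compares a small dictionary with its dataset in base R. The dataset, the variable names and the comparison itself are assumptions made for this example. Rmonize does not necessarily perform these checks in this way.

```r
# Hypothetical input dataset and its data dictionary (base R only).
input_dataset <- data.frame(
  part_id = c("A01", "A02", "A03"),
  gender  = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

input_data_dict <- list(
  Variables = data.frame(
    index     = 1:2,
    name      = c("part_id", "gender"),
    label     = c("Participant ID", "Gender"),
    valueType = c("text", "integer"),
    stringsAsFactors = FALSE
  ),
  Categories = data.frame(
    variable = c("gender", "gender"),
    name     = c("1", "2"),
    label    = c("Woman", "Man"),
    missing  = c(FALSE, FALSE),
    stringsAsFactors = FALSE
  )
)

# Variables documented in the dictionary but absent from the dataset,
# and dataset columns that the dictionary does not document.
not_in_dataset <- setdiff(input_data_dict$Variables$name, names(input_dataset))
not_in_dict    <- setdiff(names(input_dataset), input_data_dict$Variables$name)

# Observed values of a categorical variable that have no declared category.
cats       <- input_data_dict$Categories
declared   <- cats$name[cats$variable == "gender"]
observed   <- unique(as.character(input_dataset$gender))
undeclared <- setdiff(observed, declared)
```

Empty results for all three vectors indicate that, for this example, the dictionary and the dataset describe the same variables and codes.
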
Metadata table specifying the input elements and instructions to process input data into DataSchema variables, with columns indicating whether or not each DataSchema variable can be generated in each dataset and, where applicable, the algorithms used for data processing.
See additional documentation for Data Processing Elements.
Name | Description |
---|---|
index | Index to order algorithms in the table. |
dataschema_variable * | Name of the DataSchema variable being generated (must match a variable in the DataSchema). The first entry must be the primary identifier variable (e.g., participant unique ID). |
valueType | Value type of the DataSchema variable (as in the DataSchema). |
input_dataset * | Name of the Input Dataset used to generate the DataSchema variable (as named in the Dossier). |
input_variables * | Name of the variable(s) in the ‘input_dataset’ used to generate the DataSchema variable. |
Mlstr_harmo:rule_category * | Type of algorithm used to generate the DataSchema variable from the input variables. The first entry must be the creation of a harmonized primary identifier variable (e.g., participant unique ID). |
Mlstr_harmo:algorithm * | Algorithm used to generate the DataSchema variable from the input variables. |
Mlstr_harmo:status | Whether the DataSchema variable can be generated from the input dataset: “complete” if it can be generated, “impossible” if not. |
Mlstr_harmo:status_detail | Additional information about the status. If ‘Mlstr_harmo:status’ is “complete”, the input information can be considered “identical” to or “compatible” with the DataSchema variable. If ‘Mlstr_harmo:status’ is “impossible”, the input information can be considered “incompatible” or “unavailable” for harmonization. |
Mlstr_harmo:comment | Additional information about the inputs or algorithms to document with the harmonized variable. |
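
The relationship between ‘Mlstr_harmo:status’ and ‘Mlstr_harmo:status_detail’ described above lends itself to a simple consistency check. The sketch below builds a small Data Processing Elements table in base R and verifies the required columns and the status pairing. The cell values used for the rule category, the algorithm and the blank input variable are assumptions chosen for illustration and should be checked against the Data Processing Elements documentation.

```r
# Illustrative Data Processing Elements for one input dataset (base R only).
# The rule category, algorithm and "__BLANK__" values are assumptions made
# for this sketch, not values documented in this section.
dpe <- data.frame(
  index                       = 1:3,
  dataschema_variable         = c("id", "sex", "age"),
  valueType                   = c("text", "integer", "integer"),
  input_dataset               = "study_a",
  input_variables             = c("part_id", "gender", "__BLANK__"),
  `Mlstr_harmo:rule_category` = c("id_creation", "direct_mapping", "impossible"),
  `Mlstr_harmo:algorithm`     = c("id_creation", "direct_mapping", "impossible"),
  `Mlstr_harmo:status`        = c("complete", "complete", "impossible"),
  `Mlstr_harmo:status_detail` = c("identical", "compatible", "unavailable"),
  check.names = FALSE,          # keep the colon in the column names
  stringsAsFactors = FALSE
)

# Columns marked with an asterisk in the table above.
required <- c("dataschema_variable", "input_dataset", "input_variables",
              "Mlstr_harmo:rule_category", "Mlstr_harmo:algorithm")
missing_cols <- setdiff(required, names(dpe))

# Status details allowed for each status, as described in the table.
allowed_detail <- list(
  complete   = c("identical", "compatible"),
  impossible = c("incompatible", "unavailable")
)

# One logical per row: TRUE when the detail is allowed for the row's status.
detail_ok <- unname(mapply(
  function(status, detail) detail %in% allowed_detail[[status]],
  dpe[["Mlstr_harmo:status"]],
  dpe[["Mlstr_harmo:status_detail"]]
))
```

Because the column names contain a colon, they are accessed with `[[ ]]` rather than `$`.
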
Set of one or more input dataset(s) and their associated input data dictionary(ies).
Name | Description |
---|---|
[input_dataset_1] * | Data table containing a collection of variables to process under the DataSchema formats and its associated input data dictionary. At least one input dataset is required. The input dataset name is defined by the user and is indicated in the Data Processing Elements column ‘input_dataset’. This name identifies the source of input variables for data processing. |
[input_dataset_2] … | Additional input dataset and associated data dictionary. |
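
Because the dataset names in the Dossier are what the ‘input_dataset’ column refers to, a mismatch between the two is an easy mistake to make. The sketch below represents a Dossier as a named list of data frames in base R and cross-checks it against an excerpt of Data Processing Elements. The dataset names and contents are hypothetical, and the associated data dictionaries are left out for brevity.

```r
# Hypothetical Dossier: a named list of input datasets (base R only).
dossier <- list(
  study_a = data.frame(part_id = c("A01", "A02"),
                       gender  = c(1L, 2L),
                       stringsAsFactors = FALSE),
  study_b = data.frame(pid = c("B01", "B02", "B03"),
                       sex = c("F", "M", "F"),
                       stringsAsFactors = FALSE)
)

# Excerpt of Data Processing Elements with one input variable per row.
dpe <- data.frame(
  dataschema_variable = c("id", "sex", "id", "sex"),
  input_dataset       = c("study_a", "study_a", "study_b", "study_b"),
  input_variables     = c("part_id", "gender", "pid", "sex"),
  stringsAsFactors = FALSE
)

# Dataset names used in 'input_dataset' that are not in the Dossier.
unknown_datasets <- setdiff(unique(dpe$input_dataset), names(dossier))

# For each row, is the input variable a column of the named dataset?
variable_found <- unname(mapply(
  function(dataset, variable) variable %in% names(dossier[[dataset]]),
  dpe$input_dataset,
  dpe$input_variables
))
```

A row that names a dataset missing from the Dossier yields `FALSE` in `variable_found`, since looking up an unknown name in a list returns `NULL`.
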
The main objects generated by Rmonize and their primary components are described below.
Data table containing a collection of harmonized variables processed under the DataSchema formats.
Name | Description |
---|---|
[harmonized_variable_1] | First harmonized variable. This is the primary identifier variable (e.g., participant unique ID). Variables in the harmonized dataset are generated in the order defined in the DataSchema. |
[harmonized_variable_2] … | Additional harmonized variable. |
List of variables in a harmonized dataset and related metadata (taken from the DataSchema and Data Processing Elements). Two tables are included: the first one documents variables and the second provides information related to categorical variables.
Columns included in the table.
Name | Description |
---|---|
index | Index to order variables in the table (taken from the DataSchema). |
name | Name of the harmonized variable (taken from the DataSchema). |
label | Short description of the harmonized variable (taken from the DataSchema). |
valueType | Value type of the harmonized variable (taken from the DataSchema). |
Mlstr_harmo:rule_category | Type of algorithm used to generate the DataSchema variable from the input variables (taken from the Data Processing Elements). |
Mlstr_harmo:algorithm | Algorithm used to generate the harmonized variable from the input variables (taken from the Data Processing Elements). |
Mlstr_harmo:status | Possibility to generate the DataSchema variable from the input dataset (taken from the Data Processing Elements). |
Mlstr_harmo:status_detail | Additional information about the possibility to generate the DataSchema variable from the input dataset (taken from the Data Processing Elements). |
Mlstr_harmo:comment | Additional information about the inputs or algorithms to document with the harmonized variable (taken from the Data Processing Elements). |
Columns included in the table.
Name | Description |
---|---|
variable | Name of the harmonized variable to which the category belongs (taken from the DataSchema). |
name | Category code value (taken from the DataSchema). |
label | Short description of the category code value (taken from the DataSchema). |
missing | Boolean value (TRUE/FALSE or 1/0) indicating if the value in ‘name’ is interpreted as a missing value (taken from the DataSchema). |
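
To make the origin of each column concrete, the sketch below assembles a variables table of the kind described above from a DataSchema and the Data Processing Elements of one input dataset, using base R. It only illustrates where the columns come from. The package builds this table itself, and the example values (including the rule category and algorithm strings) are assumptions.

```r
# Variables table of a hypothetical DataSchema.
ds_vars <- data.frame(
  index     = 1:2,
  name      = c("id", "sex"),
  label     = c("Participant ID", "Sex"),
  valueType = c("text", "integer"),
  stringsAsFactors = FALSE
)

# Data Processing Elements for a single input dataset (illustrative values).
dpe <- data.frame(
  dataschema_variable         = c("id", "sex"),
  input_dataset               = "study_a",
  `Mlstr_harmo:rule_category` = c("id_creation", "direct_mapping"),
  `Mlstr_harmo:algorithm`     = c("id_creation", "direct_mapping"),
  `Mlstr_harmo:status`        = c("complete", "complete"),
  `Mlstr_harmo:status_detail` = c("identical", "compatible"),
  `Mlstr_harmo:comment`       = c(NA, "Coding assumed to match."),
  check.names = FALSE,          # keep the colon in the column names
  stringsAsFactors = FALSE
)

# Columns contributed by the Data Processing Elements.
harmo_cols <- grep("^Mlstr_harmo:", names(dpe), value = TRUE)

# Align the processing rows with the DataSchema order, then bind columns.
rows      <- match(ds_vars$name, dpe$dataschema_variable)
dict_vars <- cbind(ds_vars, dpe[rows, harmo_cols, drop = FALSE])
rownames(dict_vars) <- NULL
```

Using `match()` keeps the rows in DataSchema order, which is the order in which harmonized variables are generated.
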
Set of one or more harmonized dataset(s) and their associated data dictionary(ies).
Name | Description |
---|---|
[harmonized_dataset_1] | Data table containing a collection of harmonized variables processed under the DataSchema format and its associated data dictionary. There is one harmonized dataset per input dataset. |
[harmonized_dataset_2] … | Additional harmonized dataset and its associated data dictionary. |
Combined data table containing multiple harmonized datasets processed under the same DataSchema formats.
Name | Description |
---|---|
[harmonized_variable_1] | First harmonized variable. This is the primary unique identifier variable. Variables in the harmonized dataset are generated in the order defined in the DataSchema. |
[harmonized_variable_2] … | Additional harmonized variable. |
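
Since every harmonized dataset follows the same DataSchema, the datasets share the same columns in the same order and can be stacked row-wise. The sketch below shows one plausible way to do this in base R. It is not the package's own pooling function, and the dataset names and values are hypothetical.

```r
# Hypothetical harmonized datasets, one per input dataset (base R only).
harmonized_dossier <- list(
  study_a = data.frame(id  = c("A01", "A02"),
                       sex = c(1L, 2L),
                       stringsAsFactors = FALSE),
  study_b = data.frame(id  = c("B01", "B02", "B03"),
                       sex = c(1L, 1L, 2L),
                       stringsAsFactors = FALSE)
)

# All datasets must have the same columns in the same order.
reference_names <- names(harmonized_dossier[[1]])
same_columns <- all(vapply(
  harmonized_dossier,
  function(d) identical(names(d), reference_names),
  logical(1)
))
stopifnot(same_columns)

# Stack the datasets and drop the row names that rbind() derives
# from the list names.
pooled <- do.call(rbind, harmonized_dossier)
rownames(pooled) <- NULL

# The identifier should remain unique after pooling.
ids_unique <- !anyDuplicated(pooled$id)
```

If identifiers can repeat across input datasets, they would need to be made unique before pooling, for example when the harmonized identifier is created.
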
List of variables in a pooled harmonized dataset and related metadata (taken from the DataSchema). Two tables are included: the first one documents variables and the second provides information related to categorical variables.
Columns included in the table.
Name | Description |
---|---|
index | Index to order variables in the table (taken from the DataSchema). |
name | Name of the harmonized variable (taken from the DataSchema). |
label | Short description of the harmonized variable (taken from the DataSchema). |
valueType | Value type of the harmonized variable (taken from the DataSchema). |
Columns included in the table.
Name | Description |
---|---|
variable | Name of the harmonized variable to which the category belongs (taken from the DataSchema). |
name | Category code value (taken from the DataSchema). |
label | Short description of the category code value (taken from the DataSchema). |
missing | Boolean value (TRUE/FALSE or 1/0) indicating if the value in ‘name’ is interpreted as a missing value (taken from the DataSchema). |