Generate a report and summary of a harmonized dossier — harmonized_dossier

Assesses and summarizes the content and structure of a harmonized dossier (a list of harmonized datasets) and reports potential issues to facilitate the assessment of input data. The report can be used to help assess data structure, presence of fields, coherence across elements, and taxonomy or data dictionary formats. The summary provides additional information about variable distributions and descriptive statistics. The report is compatible with Excel and can be exported as an Excel spreadsheet.

harmonized_dossier_summarise(
  harmonized_dossier,
  dataschema = NULL,
  taxonomy = NULL,
  valueType_guess = FALSE
)

Arguments

harmonized_dossier: A list of tibble(s), each being a harmonized dataset.
dataschema: A list of tibble(s) representing the metadata of an associated harmonized dossier.
taxonomy: A tibble identifying the scheme used for variable classification.
valueType_guess: Whether the output should include a more accurate valueType that could be applied to the dataset. FALSE by default.

Value

A list of tibbles making up the report for each harmonized dataset.

Details

A harmonized dossier must be a named list containing at least one data frame or data frame extension (e.g. a tibble), each of them being a harmonized dataset. It is generally the product of applying harmonization processing to a dossier object. The name of each tibble will be used as the reference name of the dataset. A harmonized dossier has four attributes: harmonizR::class, which is "harmonized_dossier"; harmonizR::Dataschema (provided by the user); harmonizR::data processing elements; and harmonizR::harmonized_col_id (provided by the user), which refers to the column in each dataset that identifies each unique observation/dataset combination. This id column name is the same across the dataset(s), the DataSchema and the data processing elements (created by using 'id_creation') and is used to initiate the harmonization process.
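The structural requirements above (a named list of data frames sharing one id column) can be checked with a few lines of base R. This is only an illustrative sketch: the dataset names, the column name `id`, and the uniqueness check on that column are assumptions made for the example, and the harmonizR attributes themselves are not set here.

```r
# Two toy harmonized datasets sharing the same id column.
# "dataset_A", "dataset_B" and "id" are hypothetical names.
harmonized_dossier <- list(
  dataset_A = data.frame(id = c("A1", "A2"), age = c(34L, 51L)),
  dataset_B = data.frame(id = c("B1", "B2", "B3"), age = c(45L, 29L, 62L))
)
col_id <- "id"

# The list must be fully named and contain only data frames.
is_named <- !is.null(names(harmonized_dossier)) &&
  all(names(harmonized_dossier) != "")
all_df <- all(vapply(harmonized_dossier, is.data.frame, logical(1)))

# The id column must be present in every dataset ...
has_id <- all(vapply(
  harmonized_dossier,
  function(d) col_id %in% names(d),
  logical(1)
))

# ... and (assumed here) hold no duplicated values within a dataset.
unique_id <- all(vapply(
  harmonized_dossier,
  function(d) anyDuplicated(d[[col_id]]) == 0,
  logical(1)
))

all(is_named, all_df, has_id, unique_id)
```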

A DataSchema defines the harmonized variables to be generated and represents the metadata of an associated harmonized dossier. It must be a list of data frame-like objects with elements named 'Variables' (required) and 'Categories' (if any). The 'Variables' element must contain at least the 'name' column, and the 'Categories' element must contain at least the 'variable' and 'name' columns to be usable in any function. To be considered a minimum workable DataSchema, the 'name' column in 'Variables' must also have unique and non-null entries, and the combination of the 'variable' and 'name' columns in 'Categories' must also be unique.
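As a rough illustration of the minimum workable DataSchema rules described above, the following base R sketch builds a small DataSchema and tests it. The helper `is_minimal_dataschema()` and the example variables are hypothetical and written for this example only; they are not part of the package.

```r
# A toy DataSchema: 'Variables' is required, 'Categories' is optional.
dataschema <- list(
  Variables = data.frame(
    name  = c("sex", "age"),
    label = c("Sex of participant", "Age in years")
  ),
  Categories = data.frame(
    variable = c("sex", "sex"),
    name     = c("1", "2"),
    label    = c("Male", "Female")
  )
)

# Hypothetical helper implementing the rules stated in the text.
is_minimal_dataschema <- function(ds) {
  vars <- ds[["Variables"]]
  # 'Variables' must exist and contain a 'name' column
  if (!is.data.frame(vars) || !("name" %in% names(vars))) return(FALSE)
  # 'name' must be non-null and unique
  if (anyNA(vars$name) || anyDuplicated(vars$name) > 0) return(FALSE)

  cats <- ds[["Categories"]]
  if (!is.null(cats)) {
    # 'Categories', if present, needs 'variable' and 'name'
    if (!is.data.frame(cats) ||
        !all(c("variable", "name") %in% names(cats))) return(FALSE)
    # the combination variable/name must be unique
    if (anyDuplicated(cats[, c("variable", "name")]) > 0) return(FALSE)
  }
  TRUE
}

is_minimal_dataschema(dataschema)
```

A DataSchema with a repeated entry in `Variables$name` would fail this check.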

A taxonomy is a classification scheme that can be defined for variable attributes. If defined, a taxonomy must be a data frame-like object. It must be compatible with (and is generally extracted from) an Opal environment. To work with certain functions, a valid taxonomy must contain at least the columns 'taxonomy', 'vocabulary', and 'terms'. In addition, the taxonomy may follow the Maelstrom Research taxonomy, and its content can be evaluated accordingly, for example against the naming convention restrictions, tagging elements, or scales that are specific to Maelstrom Research. In this particular case, the tibble must also contain 'vocabulary_short', 'taxonomy_scale', 'vocabulary_scale' and 'term_scale' to work with some specific functions.

The valueType is a property of a variable and is required in certain functions to determine the handling of the variables. The valueType refers to the OBiBa-internal type of a variable. It is specified in a data dictionary in a column 'valueType' and can be associated with variables as attributes. Acceptable valueTypes include 'text', 'integer', 'decimal', 'boolean', 'datetime', and 'date'. The full list of OBiBa valueType possibilities and their correspondence with R data types is available using madshapR::valueType_list.
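To make the valueType constraint concrete, here is a small base R sketch that flags entries in a 'valueType' column that are not among the six types named above. The `variables` table is invented for the example, the comparison is assumed to be case-sensitive, and the vector below is only the subset quoted in the text, not the complete list from madshapR::valueType_list.

```r
# valueTypes named in the text; not the complete OBiBa list.
listed_valueTypes <- c("text", "integer", "decimal",
                       "boolean", "datetime", "date")

# Hypothetical 'Variables' element of a data dictionary.
variables <- data.frame(
  name      = c("sex", "age", "bmi", "visit_day"),
  valueType = c("integer", "integer", "decimal", "Date")
)

# Rows whose valueType is not one of the listed values
# ("Date" differs from "date" under a case-sensitive match).
not_listed <- variables[!(variables$valueType %in% listed_valueTypes), ]
not_listed$name
```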

Examples

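{

# DEMO_files_harmo is assumed to be the demo object shipped with harmonizR
library(harmonizR)
library(haven)
harmonized_dossier <- DEMO_files_harmo$harmonized_dossier

# summarise the harmonized dossier
harmonized_dossier_summarise(harmonized_dossier)

}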
#> Warning: package 'haven' was built under R version 4.2.3
#> - DOSSIER SUMMARY: -----------------------------------------------------
#> - DATA DICTIONARY ASSESSMENT: data_dict --------------
#>     Assess the standard adequacy of naming
#>     Assess the uniqueness of variable names
#>     Assess the presence of possible duplicated columns
#>     Assess the presence of empty rows in the data dictionary
#>     Assess the presence of empty columns in the data dictionary
#>     Assess the completion of `label(:xx)` column in 'Variables'
#>     Assess the `valueType` column in 'Variables'
#>     Generate report
#> 
#>     The data dictionary contains no error/warning.
#> 
#>   - WARNING MESSAGES (if any): --------------------------------------------
#> 
#> - DATASET ASSESSMENT: dataset_MELBOURNE_1 --------------------------
#>     Assess the standard adequacy of naming
#>     Assess the presence of variable names both in dataset and data dictionary
#>     Assess the presence of possible duplicated variable in the dataset
#>     Assess the presence of duplicated participants in the dataset
#> Error in df_append(out, united, after = after): `after` must be a whole number, not an integer `NA`.
#> ℹ This is an internal error that was detected in the tidyr package.
#>   Please report it at <https://github.com/tidyverse/tidyr/issues> with a reprex
#>   (<https://tidyverse.org/help/>) and the full backtrace.