R/03-harmonized_data_evaluate.R
dataschema_evaluate.Rd
Assesses the content and structure of a DataSchema and reports potential issues, to facilitate the assessment of input data. The report can be used to check data structure, presence of fields, coherence across elements, and taxonomy or data dictionary formats. The report is compatible with Excel and can be exported as an Excel spreadsheet.
dataschema_evaluate(dataschema, taxonomy = NULL)
A list of tibble(s) representing metadata of an associated harmonized dossier.
A tibble identifying the scheme used for variable classification.
A list of tibbles making up the assessment report for the DataSchema.
A DataSchema defines the harmonized variables to be generated and represents the metadata of an associated harmonized dossier. It must be a list of data frame-like objects with elements named 'Variables' (required) and 'Categories' (if any). To be usable in any function, the 'Variables' element must contain at least the 'name' column, and the 'Categories' element must contain at least the 'variable' and 'name' columns. To be considered a minimum workable DataSchema, the 'name' column in 'Variables' must also have unique and non-null entries, and the combination of the 'variable' and 'name' columns in 'Categories' must also be unique.
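The minimum requirements described above can be illustrated with a small sketch in base R. The variable names, category codes, and the helper `is_minimal_dataschema()` are hypothetical and are not part of the package; the helper checks only the conditions stated in this section, and it assumes that "non-null" means "not NA".

```r
# Hypothetical minimal DataSchema built from base R data frames.
dataschema <- list(
  Variables = data.frame(
    name = c("adm_unique_id", "sex"),
    stringsAsFactors = FALSE
  ),
  Categories = data.frame(
    variable = c("sex", "sex"),
    name = c("1", "2"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper (NOT a package function): TRUE when the object meets
# the minimum workable DataSchema conditions described in the text.
is_minimal_dataschema <- function(x) {
  vars <- x[["Variables"]]
  # 'Variables' is required and must carry a 'name' column
  if (!is.data.frame(vars) || !"name" %in% names(vars)) return(FALSE)
  # 'name' entries must be unique and non-null (assumed here: not NA)
  var_names <- vars[["name"]]
  if (anyNA(var_names) || anyDuplicated(var_names) > 0) return(FALSE)
  cats <- x[["Categories"]]
  # 'Categories' is optional
  if (is.null(cats)) return(TRUE)
  if (!is.data.frame(cats)) return(FALSE)
  if (!all(c("variable", "name") %in% names(cats))) return(FALSE)
  # The combination of 'variable' and 'name' must be unique
  anyDuplicated(cats[, c("variable", "name")]) == 0
}

is_minimal_dataschema(dataschema)
```

A DataSchema that passes this check only satisfies the structural minimum; `dataschema_evaluate()` reports on further aspects, such as empty columns and the `valueType` column.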
A taxonomy is a classification scheme that can be defined for variable attributes. If defined, a taxonomy must be a data frame-like object. It must be compatible with (and is generally extracted from) an Opal environment. To work with certain functions, a valid taxonomy must contain at least the columns 'taxonomy', 'vocabulary', and 'terms'. In addition, the taxonomy may follow the Maelstrom Research taxonomy, in which case its content can be evaluated against conventions specific to Maelstrom Research, such as naming restrictions, tagging elements, or scales. In this particular case, the tibble must also contain 'vocabulary_short', 'taxonomy_scale', 'vocabulary_scale' and 'term_scale' to work with some specific functions.
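As a rough illustration of the column requirements above, the following base R sketch builds a small taxonomy and checks which set of columns it provides. The taxonomy, vocabulary, and term values are placeholders invented for this example; only the column names come from the description.

```r
# Hypothetical taxonomy with placeholder values; only the column names
# ('taxonomy', 'vocabulary', 'terms') are taken from the description above.
taxonomy <- data.frame(
  taxonomy = c("Example_taxonomy", "Example_taxonomy"),
  vocabulary = c("vocabulary_a", "vocabulary_a"),
  terms = c("term_1", "term_2"),
  stringsAsFactors = FALSE
)

# Columns required for a valid taxonomy
required_cols <- c("taxonomy", "vocabulary", "terms")

# Additional columns expected when following the Maelstrom Research taxonomy
mlstr_cols <- c(
  "vocabulary_short", "taxonomy_scale", "vocabulary_scale", "term_scale"
)

has_required <- all(required_cols %in% names(taxonomy))
has_mlstr <- all(mlstr_cols %in% names(taxonomy))

# Maelstrom-specific columns that this example taxonomy lacks
missing_mlstr <- setdiff(mlstr_cols, names(taxonomy))
```

Here `has_required` is `TRUE` and `has_mlstr` is `FALSE`, so this example object has the three required columns but none of the columns needed by the Maelstrom-specific functions.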
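{
  library(dplyr)
  library(madshapR) # provides data_dict_filter()

  # Keep only the 'adm_unique_id' variable from the demo DataSchema
  dataschema <-
    DEMO_files_harmo$`dataschema - final` %>%
    data_dict_filter("name == 'adm_unique_id'")

  # Evaluate the filtered DataSchema and print the report
  dataschema_evaluate(dataschema)
}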
#> Warning: package 'dplyr' was built under R version 4.2.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> - DATA DICTIONARY ASSESSMENT: dataschema --------------
#> Assess the standard adequacy of naming
#> Assess the uniqueness of variable names
#> Assess the presence of possible duplicated columns
#> Assess the presence of empty rows in the data dictionary
#> Assess the presence of empty columns in the data dictionary
#> Assess the completion of `label(:xx)` column in 'Variables'
#> Assess the `valueType` column in 'Variables'
#> Generate report
#>
#> - WARNING MESSAGES (if any): --------------------------------------------
#>
#> $`Study-specific Dataschema summary`
#> # A tibble: 1 × 25
#> index name `label:en` valueType unit other_format individual_entity
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 adm_unique_id Unique ide… text NA NA NA
#> # ℹ 18 more variables: time_period <chr>, source <chr>, informant <chr>,
#> # time_collection <chr>, wording <chr>, measures <chr>, procedures <chr>,
#> # instructions <chr>, comments <chr>, `Mlstr_additional::Source` <chr>,
#> # `Mlstr_additional::Target` <chr>, `Mlstr_area::1` <chr>,
#> # `Mlstr_area::1.term` <chr>, `Mlstr_area::1.scale` <chr>,
#> # `Mlstr_area::2` <chr>, `Mlstr_area::2.term` <chr>, `Mlstr_area::3` <chr>,
#> # `Mlstr_area::3.term` <chr>
#>
#> $`Study-specific Dataschema assessement`
#> # A tibble: 19 × 5
#> sheet col_name name_var Quality assessment commen…¹ value
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Variables Mlstr_additional::Source NA [INFO] - Empty column(s) NA
#> 2 Variables Mlstr_additional::Target NA [INFO] - Empty column(s) NA
#> 3 Variables Mlstr_area::1.scale NA [INFO] - Empty column(s) NA
#> 4 Variables Mlstr_area::2 NA [INFO] - Empty column(s) NA
#> 5 Variables Mlstr_area::2.term NA [INFO] - Empty column(s) NA
#> 6 Variables Mlstr_area::3 NA [INFO] - Empty column(s) NA
#> 7 Variables Mlstr_area::3.term NA [INFO] - Empty column(s) NA
#> 8 Variables comments NA [INFO] - Empty column(s) NA
#> 9 Variables individual_entity NA [INFO] - Empty column(s) NA
#> 10 Variables informant NA [INFO] - Empty column(s) NA
#> 11 Variables instructions NA [INFO] - Empty column(s) NA
#> 12 Variables measures NA [INFO] - Empty column(s) NA
#> 13 Variables other_format NA [INFO] - Empty column(s) NA
#> 14 Variables procedures NA [INFO] - Empty column(s) NA
#> 15 Variables source NA [INFO] - Empty column(s) NA
#> 16 Variables time_collection NA [INFO] - Empty column(s) NA
#> 17 Variables time_period NA [INFO] - Empty column(s) NA
#> 18 Variables unit NA [INFO] - Empty column(s) NA
#> 19 Variables wording NA [INFO] - Empty column(s) NA
#> # ℹ abbreviated name: ¹`Quality assessment comment`
#>