Assesses the content and structure of a DataSchema and generates a report of potential issues. The report can be used to help assess data structure, the presence of required fields, coherence across elements, and taxonomy or data dictionary formats. The report is compatible with Excel and can be exported as an Excel spreadsheet.

dataschema_evaluate(dataschema, taxonomy = NULL)

Arguments

dataschema

A list of tibble(s) representing metadata of an associated harmonized dossier.

taxonomy

A tibble identifying the scheme used for variable classification.

Value

A list of tibbles containing the assessment report for the DataSchema.

Details

A DataSchema defines the harmonized variables to be generated and represents the metadata of an associated harmonized dossier. It must be a list of data frame-like objects with elements named 'Variables' (required) and 'Categories' (if any). The 'Variables' element must contain at least the 'name' column, and the 'Categories' element must contain at least the 'variable' and 'name' columns to be usable in any function. To be considered a minimum workable DataSchema, the 'name' column in 'Variables' must also have unique and non-null entries, and the combination of the 'variable' and 'name' columns in 'Categories' must also be unique.
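The minimum requirements above can be illustrated with a small sketch in base R. Everything in it is an assumption made for illustration: the variable 'sdc_sex', the 'label' columns, and the helper is_minimal_dataschema() are hypothetical and are not part of the package, which may perform its own, more thorough checks.

```r
# Hypothetical minimal DataSchema: a named list of data frame-like objects.
# 'sdc_sex' and the 'label' columns are illustrative only.
dataschema <- list(
  Variables = data.frame(
    name  = c("adm_unique_id", "sdc_sex"),
    label = c("Unique identifier", "Sex of participant"),
    stringsAsFactors = FALSE
  ),
  Categories = data.frame(
    variable = c("sdc_sex", "sdc_sex"),
    name     = c("1", "2"),
    label    = c("Male", "Female"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper (not a package function): returns TRUE when the
# object meets the minimum requirements described above, FALSE otherwise.
is_minimal_dataschema <- function(x) {
  if (!is.list(x) || !"Variables" %in% names(x)) return(FALSE)

  vars <- x[["Variables"]]
  if (!is.data.frame(vars) || !"name" %in% names(vars)) return(FALSE)
  # 'name' must be non-null and unique
  if (anyNA(vars$name) || anyDuplicated(vars$name) > 0) return(FALSE)

  cats <- x[["Categories"]]
  if (!is.null(cats)) {
    if (!is.data.frame(cats)) return(FALSE)
    if (!all(c("variable", "name") %in% names(cats))) return(FALSE)
    # the combination of 'variable' and 'name' must be unique
    if (anyDuplicated(cats[, c("variable", "name")]) > 0) return(FALSE)
  }
  TRUE
}

is_minimal_dataschema(dataschema)
```

A list that passes this check satisfies only the structural minimum. dataschema_evaluate() reports further issues, such as empty columns, as shown in the Examples output.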

A taxonomy is a classification scheme that can be defined for variable attributes. If defined, a taxonomy must be a data frame-like object. It must be compatible with (and is generally extracted from) an Opal environment. To work with certain functions, a valid taxonomy must contain at least the columns 'taxonomy', 'vocabulary', and 'terms'. In addition, the taxonomy may follow the Maelstrom Research taxonomy, in which case its content can be evaluated against conventions specific to Maelstrom Research, such as naming restrictions, tagging elements, or scales. In this particular case, the tibble must also contain 'vocabulary_short', 'taxonomy_scale', 'vocabulary_scale' and 'term_scale' to work with some specific functions.
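As a rough sketch of the column requirements above, the following base R code builds a placeholder taxonomy and lists any required columns that are missing. The values and the helper missing_taxonomy_columns() are hypothetical and are not part of the package. A real taxonomy would generally be extracted from an Opal environment.

```r
# Placeholder taxonomy with the three columns required by certain functions.
# All values are illustrative and do not come from a real Opal server.
taxonomy <- data.frame(
  taxonomy   = c("Example_taxonomy", "Example_taxonomy"),
  vocabulary = c("Example_vocabulary", "Example_vocabulary"),
  terms      = c("term_a", "term_b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a package function): returns the names of the
# required columns that are absent from the taxonomy.
missing_taxonomy_columns <- function(x, maelstrom = FALSE) {
  required <- c("taxonomy", "vocabulary", "terms")
  if (maelstrom) {
    # extra columns expected when following the Maelstrom Research taxonomy
    required <- c(required, "vocabulary_short", "taxonomy_scale",
                  "vocabulary_scale", "term_scale")
  }
  setdiff(required, names(x))
}

missing_taxonomy_columns(taxonomy)                   # basic requirements
missing_taxonomy_columns(taxonomy, maelstrom = TRUE) # Maelstrom-specific columns
```

An empty result means all expected columns are present. Checking column names says nothing about whether the content follows Maelstrom Research conventions.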

Examples

{

library(dplyr)
library(madshapR) # data_dict_filter

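# Keep only the 'adm_unique_id' variable from the demo DataSchema
dataschema <- 
  DEMO_files_harmo$`dataschema - final` %>%
  data_dict_filter("name == 'adm_unique_id'")

# Evaluate the filtered DataSchema and return the assessment report
dataschema_evaluate(dataschema)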

}
#> - DATA DICTIONARY ASSESSMENT: dataschema --------------
#>     Assess the standard adequacy of naming
#>     Assess the uniqueness of variable names
#>     Assess the presence of possible duplicated columns
#>     Assess the presence of empty rows in the data dictionary
#>     Assess the presence of empty columns in the data dictionary
#>     Assess the completion of `label(:xx)` column in 'Variables'
#>     Assess the `valueType` column in 'Variables'
#>     Generate report
#> 
#>   - WARNING MESSAGES (if any): --------------------------------------------
#> 
#> $`Study-specific Dataschema summary`
#> # A tibble: 1 × 25
#>   index name          `label:en`  valueType unit  other_format individual_entity
#>   <int> <chr>         <chr>       <chr>     <chr> <chr>        <chr>            
#> 1     1 adm_unique_id Unique ide… text      NA    NA           NA               
#> # ℹ 18 more variables: time_period <chr>, source <chr>, informant <chr>,
#> #   time_collection <chr>, wording <chr>, measures <chr>, procedures <chr>,
#> #   instructions <chr>, comments <chr>, `Mlstr_additional::Source` <chr>,
#> #   `Mlstr_additional::Target` <chr>, `Mlstr_area::1` <chr>,
#> #   `Mlstr_area::1.term` <chr>, `Mlstr_area::1.scale` <chr>,
#> #   `Mlstr_area::2` <chr>, `Mlstr_area::2.term` <chr>, `Mlstr_area::3` <chr>,
#> #   `Mlstr_area::3.term` <chr>
#> 
#> $`Study-specific Dataschema assessement`
#> # A tibble: 19 × 5
#>    sheet     col_name                 name_var Quality assessment commen…¹ value
#>    <chr>     <chr>                    <chr>    <chr>                       <chr>
#>  1 Variables Mlstr_additional::Source NA       [INFO] - Empty column(s)    NA   
#>  2 Variables Mlstr_additional::Target NA       [INFO] - Empty column(s)    NA   
#>  3 Variables Mlstr_area::1.scale      NA       [INFO] - Empty column(s)    NA   
#>  4 Variables Mlstr_area::2            NA       [INFO] - Empty column(s)    NA   
#>  5 Variables Mlstr_area::2.term       NA       [INFO] - Empty column(s)    NA   
#>  6 Variables Mlstr_area::3            NA       [INFO] - Empty column(s)    NA   
#>  7 Variables Mlstr_area::3.term       NA       [INFO] - Empty column(s)    NA   
#>  8 Variables comments                 NA       [INFO] - Empty column(s)    NA   
#>  9 Variables individual_entity        NA       [INFO] - Empty column(s)    NA   
#> 10 Variables informant                NA       [INFO] - Empty column(s)    NA   
#> 11 Variables instructions             NA       [INFO] - Empty column(s)    NA   
#> 12 Variables measures                 NA       [INFO] - Empty column(s)    NA   
#> 13 Variables other_format             NA       [INFO] - Empty column(s)    NA   
#> 14 Variables procedures               NA       [INFO] - Empty column(s)    NA   
#> 15 Variables source                   NA       [INFO] - Empty column(s)    NA   
#> 16 Variables time_collection          NA       [INFO] - Empty column(s)    NA   
#> 17 Variables time_period              NA       [INFO] - Empty column(s)    NA   
#> 18 Variables unit                     NA       [INFO] - Empty column(s)    NA   
#> 19 Variables wording                  NA       [INFO] - Empty column(s)    NA   
#> # ℹ abbreviated name: ¹​`Quality assessment comment`
#>