Reads the DataSchema and data processing elements to generate harmonized dataset(s) and an annotated data processing elements object containing harmonization statuses and any processing errors. The function uses the DataSchema and data processing elements specifications to process input variables into harmonized output variables for each study. Each data processing action is documented in the console to help identify errors and correct the data processing elements file and objects as needed. The annotated data processing elements object provides a harmonization status (complete/impossible) for each DataSchema variable in each input dataset, which can be used to summarize the harmonization potential of the DataSchema variables across the input dataset(s).

harmo_process(dossier, dataschema = NULL, data_proc_elem)

Arguments

dossier

A list of tibble(s), each being a dataset to be harmonized.

dataschema

A list of tibble(s) representing the metadata of an associated harmonized dossier.

data_proc_elem

A tibble identifying the input data processing elements.

Value

A list of tibbles, one for each harmonized dataset generated from the input datasets.
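
The harmonization statuses mentioned in the description can be tabulated once the annotated data processing elements are at hand. In the example output further down, they are printed as an attribute named "harmonizR::data processing elements" with a `Mlstr_harmo::status` column; retrieving them with `attr()` is an assumption based on that printout, not a documented accessor. The sketch below uses a small hand-made data frame (values copied from the example output for three variables) so that it runs without the package.

```r
# Stand-in for the annotated data processing elements. With a real result one
# might try (assumption, based on the printed attribute name):
#   dpe <- attr(harmonized_dossier, "harmonizR::data processing elements")
dpe <- data.frame(
  dataschema_variable = rep(c("sdc_age", "phy_height", "phy_bmi"), times = 2),
  ss_table = rep(c("dataset_MELBOURNE_1", "dataset_MELBOURNE_2"), each = 3),
  `Mlstr_harmo::status` = c("complete", "impossible", "complete",
                            "impossible", "impossible", "impossible"),
  check.names = FALSE,       # keep the "::" in the column name
  stringsAsFactors = FALSE
)

# Cross-tabulate DataSchema variables against harmonization status
status_summary <- table(
  variable = dpe$dataschema_variable,
  status   = dpe[["Mlstr_harmo::status"]]
)

print(status_summary)
```

Each row counts, for one DataSchema variable, how many input datasets could (complete) or could not (impossible) be harmonized.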

Details

A dossier must be a named list containing at least one data frame or data frame extension (e.g. a tibble), each of which is a dataset. The name of each element will be used as the reference name of the dataset.
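
As a minimal sketch of the structure described above, a dossier can be assembled by hand as a named list of data frames. The dataset names and values here are hypothetical, and the checks are plain base R written for illustration; they are not the package's own validation.

```r
# Two toy datasets (hypothetical values, for illustration only)
dataset_A <- data.frame(id = c("001", "002"), age = c(52L, 49L))
dataset_B <- data.frame(id = c("101", "102"), smoker = c(0L, 1L))

# A dossier is a named list; the names become the dataset reference names
dossier <- list(dataset_A = dataset_A, dataset_B = dataset_B)

# Structural checks mirroring the requirements stated above
is_valid_dossier <-
  is.list(dossier) &&
  length(dossier) >= 1 &&
  !is.null(names(dossier)) &&
  all(nzchar(names(dossier))) &&
  all(vapply(dossier, is.data.frame, logical(1)))

print(is_valid_dossier)
```

Tibbles pass the same checks, since `is.data.frame()` is also true for data frame extensions.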

A DataSchema defines the harmonized variables to be generated, representing the metadata of an associated harmonized dossier. It must be a list of data frame-like objects with elements named 'Variables' (required) and 'Categories' (if any). The 'Variables' element must contain at least the 'name' column, and the 'Categories' element must contain at least the 'variable' and 'name' columns to be usable in any function. To be considered a minimum workable DataSchema, the 'name' column in 'Variables' must also have unique and non-null entries, and the combination of the 'variable' and 'name' columns in 'Categories' must also be unique.
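
The minimum requirements above can be expressed as a few base R checks. This is only a sketch: `check_dataschema()` is a hypothetical helper, not a package function, and "non-null" is read here as "not NA", which is an assumption.

```r
# A minimal DataSchema: 'Variables' is required, 'Categories' is optional
dataschema <- list(
  Variables = data.frame(
    name = c("adm_unique_id", "sdc_age", "sdc_gender"),
    stringsAsFactors = FALSE
  ),
  Categories = data.frame(
    variable = c("sdc_gender", "sdc_gender"),
    name     = c("1", "2"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper that returns TRUE when the minimum rules are met
check_dataschema <- function(ds) {
  vars <- ds[["Variables"]]
  cats <- ds[["Categories"]]

  # 'Variables' must exist, have a 'name' column, with unique, non-NA entries
  ok <- is.data.frame(vars) &&
    "name" %in% names(vars) &&
    !anyNA(vars$name) &&
    !anyDuplicated(vars$name)

  # 'Categories', if present, needs unique (variable, name) combinations
  if (ok && !is.null(cats)) {
    ok <- is.data.frame(cats) &&
      all(c("variable", "name") %in% names(cats)) &&
      !anyDuplicated(cats[c("variable", "name")])
  }
  ok
}

print(check_dataschema(dataschema))
```

A DataSchema with a repeated variable name, or with the same category code listed twice for one variable, fails the check.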

The data processing elements contain the rules and metadata used to harmonize the input datasets in accordance with the DataSchema. They must be provided as a data frame or data frame extension (e.g. a tibble) containing the columns that take part in the process, including dataschema_variable, ss_table, ss_variables, Mlstr_harmo::rule_category and Mlstr_harmo::algorithm. The mandatory first processing element must have "id_creation" in Mlstr_harmo::rule_category, followed by the name of the column used as the identifier of each dataset, to initiate the harmonization process.
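
The column and ordering requirements above can be sketched with base R. The row values below are hypothetical, the `Mlstr_harmo::algorithm` entry is a placeholder rather than a documented value, and checking the first row per `ss_table` is one reading of "mandatory first processing element", so treat this as an illustration and not as the package's validation logic.

```r
# Columns named in the description above
required_cols <- c(
  "dataschema_variable", "ss_table", "ss_variables",
  "Mlstr_harmo::rule_category", "Mlstr_harmo::algorithm"
)

# One id-creation row per input dataset (hypothetical names and values)
data_proc_elem <- data.frame(
  dataschema_variable          = c("adm_unique_id", "adm_unique_id"),
  ss_table                     = c("dataset_A", "dataset_B"),
  ss_variables                 = c("id", "id"),
  `Mlstr_harmo::rule_category` = c("id_creation", "id_creation"),
  `Mlstr_harmo::algorithm`     = c("id_creation", "id_creation"),  # placeholder
  check.names = FALSE,          # keep the "::" in the column names
  stringsAsFactors = FALSE
)

# 1. All required columns are present
has_cols <- all(required_cols %in% names(data_proc_elem))

# 2. The first element listed for each dataset is the id creation rule
first_rule <- tapply(
  data_proc_elem[["Mlstr_harmo::rule_category"]],
  data_proc_elem$ss_table,
  function(x) x[1]
)
id_first <- all(first_rule == "id_creation")

print(c(has_cols = has_cols, id_first = id_first))
```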

Examples

# \donttest{

# You can use our demonstration files to run examples

dataset_MELBOURNE_1 <- DEMO_files_harmo$dataset_MELBOURNE_1
dataset_MELBOURNE_2 <- DEMO_files_harmo$dataset_MELBOURNE_2
dossier <- dossier_create(list(dataset_MELBOURNE_1, dataset_MELBOURNE_2))

dataschema <- DEMO_files_harmo$`dataschema - final`

data_proc_elem <- DEMO_files_harmo$`data_processing_elements - final`

# perform harmonization
harmo_process(dossier, dataschema, data_proc_elem)
#> - DATA PROCESSING ELEMENTS: ------------------------------------------------
#> 
#> --harmonization of : dataset_MELBOURNE_1 ------------------------------
#>     processing 1/13 : adm_unique_id               id created
#>     processing 2/13 : adm_study                   complete
#>     processing 3/13 : adm_year_dce                complete
#>     processing 4/13 : sdc_age                     complete
#>     processing 5/13 : sdc_gender                  complete
#>     processing 6/13 : phy_height                  impossible
#>     processing 7/13 : phy_weight                  impossible
#>     processing 8/13 : phy_bmi                     complete
#>     processing 9/13 : rep_preg_ever               impossible
#>     processing 10/13 : rep_preg_curr              impossible
#>     processing 11/13 : lsb_smo_ever               impossible
#>     processing 12/13 : lsb_smo_curr               impossible
#>     processing 13/13 : lsb_smo_status             impossible
#> 
#> --harmonization of : dataset_MELBOURNE_2 ------------------------------
#>     processing 1/13 : adm_unique_id               id created
#>     processing 2/13 : adm_study                   complete
#>     processing 3/13 : adm_year_dce                complete
#>     processing 4/13 : sdc_age                     impossible
#>     processing 5/13 : sdc_gender                  impossible
#>     processing 6/13 : phy_height                  impossible
#>     processing 7/13 : phy_weight                  impossible
#>     processing 8/13 : phy_bmi                     impossible
#>     processing 9/13 : rep_preg_ever               impossible
#>     processing 10/13 : rep_preg_curr              complete
#>     processing 11/13 : lsb_smo_ever               complete
#>     processing 12/13 : lsb_smo_curr               complete
#>     processing 13/13 : lsb_smo_status             complete
#> 
#> 
#> - CREATION OF STUDY-SPECIFIC DATASCHEMA: ----------------------------------
#> 
#> dataset_MELBOURNE_1 : done
#> dataset_MELBOURNE_2 : done
#> 
#> 
#> ------------------------------------------------------------------------------
#> 
#> Your harmonization is done. Please check if everything worked correctly.
#> 
#> - WARNING MESSAGES (if any): ----------------------------------------------
#> 
#> $dataset_MELBOURNE_1
#> # A tibble: 19 × 13
#>    adm_unique_id adm_study adm_year_dce sdc_age sdc_gender phy_height phy_weight
#>    <chr>         <chr>     <chr>          <int> <int+lbl>       <dbl>      <dbl>
#>  1 377943        MELBOURNE 2007              52 2 [Female]         NA         NA
#>  2 497013        MELBOURNE 2007              49 1 [Male]           NA         NA
#>  3 927676        MELBOURNE 2007              43 1 [Male]           NA         NA
#>  4 995667        MELBOURNE 2007              59 2 [Female]         NA         NA
#>  5 21829         MELBOURNE 2007              40 2 [Female]         NA         NA
#>  6 209432        MELBOURNE 2007              47 1 [Male]           NA         NA
#>  7 272983        MELBOURNE 2007              NA 2 [Female]         NA         NA
#>  8 580632        MELBOURNE 2007              53 2 [Female]         NA         NA
#>  9 304624        MELBOURNE 2007              35 2 [Female]         NA         NA
#> 10 637551        MELBOURNE 2007              40 1 [Male]           NA         NA
#> 11 279817        MELBOURNE 2007              41 1 [Male]           NA         NA
#> 12 235415        MELBOURNE 2007              34 2 [Female]         NA         NA
#> 13 373673        MELBOURNE 2007              48 2 [Female]         NA         NA
#> 14 485098        MELBOURNE 2007              43 2 [Female]         NA         NA
#> 15 299427        MELBOURNE 2007              NA 1 [Male]           NA         NA
#> 16 854073        MELBOURNE 2007              41 1 [Male]           NA         NA
#> 17 197666        MELBOURNE 2007              33 2 [Female]         NA         NA
#> 18 130327        MELBOURNE 2007              57 2 [Female]         NA         NA
#> 19 220050        MELBOURNE 2007              50 2 [Female]         NA         NA
#> # ℹ 6 more variables: phy_bmi <dbl>, rep_preg_ever <int+lbl>,
#> #   rep_preg_curr <int+lbl>, lsb_smo_ever <int+lbl>, lsb_smo_curr <int+lbl>,
#> #   lsb_smo_status <int+lbl>
#> 
#> $dataset_MELBOURNE_2
#> # A tibble: 19 × 13
#>    adm_unique_id adm_study adm_year_dce sdc_age sdc_gender phy_height phy_weight
#>    <chr>         <chr>     <chr>          <int> <int+lbl>       <dbl>      <dbl>
#>  1 377943        MELBOURNE 2007              NA NA                 NA         NA
#>  2 497013        MELBOURNE 2007              NA NA                 NA         NA
#>  3 927676        MELBOURNE 2007              NA NA                 NA         NA
#>  4 995667        MELBOURNE 2007              NA NA                 NA         NA
#>  5 21829         MELBOURNE 2007              NA NA                 NA         NA
#>  6 209432        MELBOURNE 2007              NA NA                 NA         NA
#>  7 272983        MELBOURNE 2007              NA NA                 NA         NA
#>  8 580632        MELBOURNE 2007              NA NA                 NA         NA
#>  9 304624        MELBOURNE 2007              NA NA                 NA         NA
#> 10 637551        MELBOURNE 2007              NA NA                 NA         NA
#> 11 279817        MELBOURNE 2007              NA NA                 NA         NA
#> 12 235415        MELBOURNE 2007              NA NA                 NA         NA
#> 13 373673        MELBOURNE 2007              NA NA                 NA         NA
#> 14 485098        MELBOURNE 2007              NA NA                 NA         NA
#> 15 299427        MELBOURNE 2007              NA NA                 NA         NA
#> 16 854073        MELBOURNE 2007              NA NA                 NA         NA
#> 17 197666        MELBOURNE 2007              NA NA                 NA         NA
#> 18 130327        MELBOURNE 2007              NA NA                 NA         NA
#> 19 220050        MELBOURNE 2007              NA NA                 NA         NA
#> # ℹ 6 more variables: phy_bmi <dbl>, rep_preg_ever <int+lbl>,
#> #   rep_preg_curr <int+lbl>, lsb_smo_ever <int+lbl>, lsb_smo_curr <int+lbl>,
#> #   lsb_smo_status <int+lbl>
#> 
#> attr(,"harmonizR::class")
#> [1] "harmonized_dossier"
#> attr(,"harmonizR::Dataschema")
#> attr(,"harmonizR::Dataschema")$Variables
#> # A tibble: 13 × 8
#>    name    `label:en` valueType index unit  `Mlstr_area::1` `Mlstr_area::1.term`
#>    <chr>   <chr>      <chr>     <chr> <chr> <chr>           <chr>               
#>  1 adm_un… Unique id… text      1     NA    ADM             Identifiers         
#>  2 adm_st… Indicator… text      2     NA    ADM             Questionnaire_inter…
#>  3 adm_ye… Indicator… text      3     NA    ADM             Questionnaire_inter…
#>  4 sdc_age Participa… integer   4     years SDC             Age                 
#>  5 sdc_ge… Gender of… integer   5     NA    SDC             Sex                 
#>  6 phy_he… participa… decimal   6     cm    PME             Anthropo_measures   
#>  7 phy_we… participa… decimal   7     kg    PME             Anthropo_measures   
#>  8 phy_bmi participa… decimal   8     kg/m  PME             Anthropo_measures   
#>  9 rep_pr… whether t… integer   9     NA    REP             Pregnancy_delivery  
#> 10 rep_pr… whether t… integer   10    NA    REP             Pregnancy_delivery  
#> 11 lsb_sm… whether t… integer   11    NA    LSB             Tobacco             
#> 12 lsb_sm… whether t… integer   12    NA    LSB             Tobacco             
#> 13 lsb_sm… participa… integer   13    NA    LSB             Tobacco             
#> # ℹ 1 more variable: `Mlstr_area::1.scale` <chr>
#> 
#> attr(,"harmonizR::Dataschema")$Categories
#> # A tibble: 13 × 4
#>    variable       name  `label:en`                           missing
#>    <chr>          <chr> <chr>                                <lgl>  
#>  1 sdc_gender     1     Male                                 FALSE  
#>  2 sdc_gender     2     Female                               FALSE  
#>  3 rep_preg_ever  0     never pregnant                       FALSE  
#>  4 rep_preg_ever  1     pregnant once or more                FALSE  
#>  5 rep_preg_curr  0     currently pregnant                   FALSE  
#>  6 rep_preg_curr  1     not currently pregnant               FALSE  
#>  7 lsb_smo_ever   0     never smoked                         FALSE  
#>  8 lsb_smo_ever   1     smoked one pack of cigarette or more FALSE  
#>  9 lsb_smo_curr   0     currently smoker                     FALSE  
#> 10 lsb_smo_curr   1     not currently smoker                 FALSE  
#> 11 lsb_smo_status 0     never smoker                         FALSE  
#> 12 lsb_smo_status 1     former smoker                        FALSE  
#> 13 lsb_smo_status 2     current smoker                       FALSE  
#> 
#> attr(,"harmonizR::Dataschema")attr(,"madshapR::class")
#> [1] "data_dict_mlstr"
#> attr(,"harmonizR::Dataschema")attr(,"harmonizR::class")
#> [1] "Dataschema_mlstr"
#> attr(,"harmonizR::data processing elements")
#> # A tibble: 26 × 11
#>    index dataschema_variable valueType ss_table            ss_variables
#>  * <dbl> <chr>               <chr>     <chr>               <chr>       
#>  1     1 adm_unique_id       text      dataset_MELBOURNE_1 id          
#>  2     2 adm_study           text      dataset_MELBOURNE_1 __BLANK__   
#>  3     3 adm_year_dce        text      dataset_MELBOURNE_1 __BLANK__   
#>  4     4 sdc_age             integer   dataset_MELBOURNE_1 age         
#>  5     5 sdc_gender          integer   dataset_MELBOURNE_1 Gender      
#>  6     6 phy_height          decimal   dataset_MELBOURNE_1 __BLANK__   
#>  7     7 phy_weight          decimal   dataset_MELBOURNE_1 __BLANK__   
#>  8     8 phy_bmi             decimal   dataset_MELBOURNE_1 BMI         
#>  9     9 rep_preg_ever       integer   dataset_MELBOURNE_1 __BLANK__   
#> 10    10 rep_preg_curr       integer   dataset_MELBOURNE_1 __BLANK__   
#> # ℹ 16 more rows
#> # ℹ 6 more variables: `Mlstr_harmo::rule_category` <chr>,
#> #   `Mlstr_harmo::algorithm` <chr>, `Mlstr_harmo::comment` <chr>,
#> #   `Mlstr_harmo::status` <chr>, `harmonizR::r_script` <chr>,
#> #   `Mlstr_harmo::status_detail` <chr>
#> attr(,"harmonizR::harmonized_col_id")
#> [1] "adm_unique_id"

# }