R/02-harmo_process_harmonization.R
harmo_process.Rd
Reads the DataSchema and the data processing elements to generate harmonized dataset(s) and an annotated data processing elements object containing harmonization statuses and any processing errors. The function uses the DataSchema and data processing elements specifications to process input variables into harmonized output variables for each study. Each data processing action is documented in the console to help identify errors and correct the data processing elements, as needed. The annotated data processing elements object provides a harmonization status (complete/impossible) for each DataSchema variable in each input dataset, which can be used to summarize the harmonization potential of the DataSchema variables across the input dataset(s).
harmo_process(dossier, dataschema = NULL, data_proc_elem)
A list of tibble(s), each being a dataset to be harmonized.
A list of tibble(s) representing the metadata of an associated harmonized dossier.
A tibble identifying the input data processing elements.
A list of tibbles, one harmonized dataset for each input dataset.
A dossier must be a named list containing at least one data frame or data frame extension (e.g. a tibble), each of which is a dataset. The name of each tibble is used as the reference name of the dataset.
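The structural requirements above can be sketched as a small base R check. The helper `is_valid_dossier()` and the toy datasets are hypothetical and are not part of the package; the sketch only mirrors the rule stated here (a named list holding at least one data frame).

```r
# Hypothetical helper (not a package function): TRUE when `x` is a named
# list that contains at least one element and only data frames.
is_valid_dossier <- function(x) {
  is.list(x) && !is.data.frame(x) &&
    length(x) >= 1 &&
    !is.null(names(x)) &&
    all(!is.na(names(x)) & names(x) != "") &&
    all(vapply(x, is.data.frame, logical(1)))
}

# Toy datasets, invented for illustration only.
dossier_ok <- list(
  dataset_A = data.frame(id = 1:3, age = c(52L, 49L, 43L)),
  dataset_B = data.frame(id = 4:5, age = c(59L, 40L))
)
dossier_unnamed <- list(data.frame(id = 1:3))

is_valid_dossier(dossier_ok)       # TRUE
is_valid_dossier(dossier_unnamed)  # FALSE: the list has no names
```

Because tibbles inherit from `data.frame`, the same check accepts a list of tibbles.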
A DataSchema defines the harmonized variables to be generated and represents the metadata of an associated harmonized dossier. It must be a list of data frame-like objects with elements named 'Variables' (required) and 'Categories' (if any). The 'Variables' element must contain at least the 'name' column, and the 'Categories' element must contain at least the 'variable' and 'name' columns to be usable in any function. To be considered a minimum workable DataSchema, the 'name' column in 'Variables' must also have unique and non-null entries, and the combination of the 'variable' and 'name' columns in 'Categories' must also be unique.
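As a minimal sketch, the "minimum workable DataSchema" conditions above could be checked as follows in base R. The helper `is_minimal_dataschema()` is hypothetical (not a package function), and it treats "non-null" as "not NA", which is an assumption.

```r
# Hypothetical helper (not a package function): checks the minimum
# requirements listed above for 'Variables' and, if present, 'Categories'.
is_minimal_dataschema <- function(x) {
  if (!is.list(x) || !is.data.frame(x[["Variables"]])) return(FALSE)
  vars <- x[["Variables"]]
  if (!"name" %in% names(vars)) return(FALSE)
  # 'name' must be unique and non-missing (assumption: non-null means not NA)
  if (anyNA(vars$name) || anyDuplicated(vars$name) > 0) return(FALSE)

  cats <- x[["Categories"]]
  if (is.null(cats)) return(TRUE)  # 'Categories' is optional
  if (!is.data.frame(cats)) return(FALSE)
  if (!all(c("variable", "name") %in% names(cats))) return(FALSE)
  # each (variable, name) pair may appear only once
  anyDuplicated(cats[, c("variable", "name")]) == 0
}

# Toy DataSchema, invented for illustration only.
dataschema_ok <- list(
  Variables = data.frame(name = c("sdc_age", "sdc_gender")),
  Categories = data.frame(
    variable = c("sdc_gender", "sdc_gender"),
    name = c("1", "2"),
    label = c("Male", "Female")
  )
)

# Same DataSchema with a duplicated (variable, name) pair.
dataschema_dup <- dataschema_ok
dataschema_dup$Categories$name <- c("1", "1")

is_minimal_dataschema(dataschema_ok)   # TRUE
is_minimal_dataschema(dataschema_dup)  # FALSE
```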
The data processing elements contain the rules and metadata used to harmonize the input datasets in accordance with the DataSchema. It must be a data frame or data frame extension (e.g. a tibble) and must contain the columns that take part in the process, including dataschema_variable, ss_table, ss_variables, Mlstr_harmo::rule_category and Mlstr_harmo::algorithm. The mandatory first processing element must be "id_creation" in Mlstr_harmo::rule_category, followed by the name of the column taken as the identifier of each dataset, to initiate the harmonization process.
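The layout described above can be illustrated with a toy table and a base R check. Everything in this sketch is an assumption made for illustration: the helper `starts_with_id_creation()` is hypothetical, the cell values (including the "direct_mapping" rule and the algorithm entries) are placeholders, and the check assumes that "first processing element" means the first row for each `ss_table`.

```r
# Columns named in the description above.
required_cols <- c(
  "dataschema_variable", "ss_table", "ss_variables",
  "Mlstr_harmo::rule_category", "Mlstr_harmo::algorithm"
)

# Toy data processing elements; all values are illustrative placeholders.
# check.names = FALSE keeps the "::" in the column names.
data_proc_elem_toy <- data.frame(
  dataschema_variable = c("adm_unique_id", "sdc_age"),
  ss_table = c("dataset_A", "dataset_A"),
  ss_variables = c("id", "age"),
  `Mlstr_harmo::rule_category` = c("id_creation", "direct_mapping"),
  `Mlstr_harmo::algorithm` = c("id", "age"),
  check.names = FALSE
)

# Hypothetical helper (not a package function): TRUE when all required
# columns exist and the first rule of every ss_table is "id_creation".
starts_with_id_creation <- function(dpe) {
  if (!is.data.frame(dpe) || !all(required_cols %in% names(dpe))) {
    return(FALSE)
  }
  first_rule <- tapply(
    dpe[["Mlstr_harmo::rule_category"]],
    dpe[["ss_table"]],
    function(rule) rule[1]
  )
  all(first_rule == "id_creation")
}

starts_with_id_creation(data_proc_elem_toy)          # TRUE
starts_with_id_creation(data_proc_elem_toy[2:1, ])   # FALSE: wrong first row
```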
# \donttest{
# You can use our demonstration files to run examples
dataset_MELBOURNE_1 <- DEMO_files_harmo$dataset_MELBOURNE_1
dataset_MELBOURNE_2 <- DEMO_files_harmo$dataset_MELBOURNE_2
dossier <- dossier_create(list(dataset_MELBOURNE_1, dataset_MELBOURNE_2))
dataschema <- DEMO_files_harmo$`dataschema - final`
data_proc_elem <- DEMO_files_harmo$`data_processing_elements - final`
# perform harmonization
harmo_process(dossier, dataschema, data_proc_elem)
#> - DATA PROCESSING ELEMENTS: ------------------------------------------------
#>
#> --harmonization of : dataset_MELBOURNE_1 ------------------------------
#> processing 1/13 : adm_unique_id id created
#> processing 2/13 : adm_study complete
#> processing 3/13 : adm_year_dce complete
#> processing 4/13 : sdc_age complete
#> processing 5/13 : sdc_gender complete
#> processing 6/13 : phy_height impossible
#> processing 7/13 : phy_weight impossible
#> processing 8/13 : phy_bmi complete
#> processing 9/13 : rep_preg_ever impossible
#> processing 10/13 : rep_preg_curr impossible
#> processing 11/13 : lsb_smo_ever impossible
#> processing 12/13 : lsb_smo_curr impossible
#> processing 13/13 : lsb_smo_status impossible
#>
#> --harmonization of : dataset_MELBOURNE_2 ------------------------------
#> processing 1/13 : adm_unique_id id created
#> processing 2/13 : adm_study complete
#> processing 3/13 : adm_year_dce complete
#> processing 4/13 : sdc_age impossible
#> processing 5/13 : sdc_gender impossible
#> processing 6/13 : phy_height impossible
#> processing 7/13 : phy_weight impossible
#> processing 8/13 : phy_bmi impossible
#> processing 9/13 : rep_preg_ever impossible
#> processing 10/13 : rep_preg_curr complete
#> processing 11/13 : lsb_smo_ever complete
#> processing 12/13 : lsb_smo_curr complete
#> processing 13/13 : lsb_smo_status complete
#>
#>
#> - CREATION OF STUDY-SPECIFIC DATASCHEMA: ----------------------------------
#>
#> dataset_MELBOURNE_1 : done
#> dataset_MELBOURNE_2 : done
#>
#>
#> ------------------------------------------------------------------------------
#>
#> Your harmonization is done. Please check if everything worked correctly.
#>
#> - WARNING MESSAGES (if any): ----------------------------------------------
#>
#> $dataset_MELBOURNE_1
#> # A tibble: 19 × 13
#> adm_unique_id adm_study adm_year_dce sdc_age sdc_gender phy_height phy_weight
#> <chr> <chr> <chr> <int> <int+lbl> <dbl> <dbl>
#> 1 377943 MELBOURNE 2007 52 2 [Female] NA NA
#> 2 497013 MELBOURNE 2007 49 1 [Male] NA NA
#> 3 927676 MELBOURNE 2007 43 1 [Male] NA NA
#> 4 995667 MELBOURNE 2007 59 2 [Female] NA NA
#> 5 21829 MELBOURNE 2007 40 2 [Female] NA NA
#> 6 209432 MELBOURNE 2007 47 1 [Male] NA NA
#> 7 272983 MELBOURNE 2007 NA 2 [Female] NA NA
#> 8 580632 MELBOURNE 2007 53 2 [Female] NA NA
#> 9 304624 MELBOURNE 2007 35 2 [Female] NA NA
#> 10 637551 MELBOURNE 2007 40 1 [Male] NA NA
#> 11 279817 MELBOURNE 2007 41 1 [Male] NA NA
#> 12 235415 MELBOURNE 2007 34 2 [Female] NA NA
#> 13 373673 MELBOURNE 2007 48 2 [Female] NA NA
#> 14 485098 MELBOURNE 2007 43 2 [Female] NA NA
#> 15 299427 MELBOURNE 2007 NA 1 [Male] NA NA
#> 16 854073 MELBOURNE 2007 41 1 [Male] NA NA
#> 17 197666 MELBOURNE 2007 33 2 [Female] NA NA
#> 18 130327 MELBOURNE 2007 57 2 [Female] NA NA
#> 19 220050 MELBOURNE 2007 50 2 [Female] NA NA
#> # ℹ 6 more variables: phy_bmi <dbl>, rep_preg_ever <int+lbl>,
#> # rep_preg_curr <int+lbl>, lsb_smo_ever <int+lbl>, lsb_smo_curr <int+lbl>,
#> # lsb_smo_status <int+lbl>
#>
#> $dataset_MELBOURNE_2
#> # A tibble: 19 × 13
#> adm_unique_id adm_study adm_year_dce sdc_age sdc_gender phy_height phy_weight
#> <chr> <chr> <chr> <int> <int+lbl> <dbl> <dbl>
#> 1 377943 MELBOURNE 2007 NA NA NA NA
#> 2 497013 MELBOURNE 2007 NA NA NA NA
#> 3 927676 MELBOURNE 2007 NA NA NA NA
#> 4 995667 MELBOURNE 2007 NA NA NA NA
#> 5 21829 MELBOURNE 2007 NA NA NA NA
#> 6 209432 MELBOURNE 2007 NA NA NA NA
#> 7 272983 MELBOURNE 2007 NA NA NA NA
#> 8 580632 MELBOURNE 2007 NA NA NA NA
#> 9 304624 MELBOURNE 2007 NA NA NA NA
#> 10 637551 MELBOURNE 2007 NA NA NA NA
#> 11 279817 MELBOURNE 2007 NA NA NA NA
#> 12 235415 MELBOURNE 2007 NA NA NA NA
#> 13 373673 MELBOURNE 2007 NA NA NA NA
#> 14 485098 MELBOURNE 2007 NA NA NA NA
#> 15 299427 MELBOURNE 2007 NA NA NA NA
#> 16 854073 MELBOURNE 2007 NA NA NA NA
#> 17 197666 MELBOURNE 2007 NA NA NA NA
#> 18 130327 MELBOURNE 2007 NA NA NA NA
#> 19 220050 MELBOURNE 2007 NA NA NA NA
#> # ℹ 6 more variables: phy_bmi <dbl>, rep_preg_ever <int+lbl>,
#> # rep_preg_curr <int+lbl>, lsb_smo_ever <int+lbl>, lsb_smo_curr <int+lbl>,
#> # lsb_smo_status <int+lbl>
#>
#> attr(,"harmonizR::class")
#> [1] "harmonized_dossier"
#> attr(,"harmonizR::Dataschema")
#> attr(,"harmonizR::Dataschema")$Variables
#> # A tibble: 13 × 8
#> name `label:en` valueType index unit `Mlstr_area::1` `Mlstr_area::1.term`
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 adm_un… Unique id… text 1 NA ADM Identifiers
#> 2 adm_st… Indicator… text 2 NA ADM Questionnaire_inter…
#> 3 adm_ye… Indicator… text 3 NA ADM Questionnaire_inter…
#> 4 sdc_age Participa… integer 4 years SDC Age
#> 5 sdc_ge… Gender of… integer 5 NA SDC Sex
#> 6 phy_he… participa… decimal 6 cm PME Anthropo_measures
#> 7 phy_we… participa… decimal 7 kg PME Anthropo_measures
#> 8 phy_bmi participa… decimal 8 kg/m PME Anthropo_measures
#> 9 rep_pr… whether t… integer 9 NA REP Pregnancy_delivery
#> 10 rep_pr… whether t… integer 10 NA REP Pregnancy_delivery
#> 11 lsb_sm… whether t… integer 11 NA LSB Tobacco
#> 12 lsb_sm… whether t… integer 12 NA LSB Tobacco
#> 13 lsb_sm… participa… integer 13 NA LSB Tobacco
#> # ℹ 1 more variable: `Mlstr_area::1.scale` <chr>
#>
#> attr(,"harmonizR::Dataschema")$Categories
#> # A tibble: 13 × 4
#> variable name `label:en` missing
#> <chr> <chr> <chr> <lgl>
#> 1 sdc_gender 1 Male FALSE
#> 2 sdc_gender 2 Female FALSE
#> 3 rep_preg_ever 0 never pregnant FALSE
#> 4 rep_preg_ever 1 pregnant once or more FALSE
#> 5 rep_preg_curr 0 currently pregnant FALSE
#> 6 rep_preg_curr 1 not currently pregnant FALSE
#> 7 lsb_smo_ever 0 never smoked FALSE
#> 8 lsb_smo_ever 1 smoked one pack of cigarette or more FALSE
#> 9 lsb_smo_curr 0 currently smoker FALSE
#> 10 lsb_smo_curr 1 not currently smoker FALSE
#> 11 lsb_smo_status 0 never smoker FALSE
#> 12 lsb_smo_status 1 former smoker FALSE
#> 13 lsb_smo_status 2 current smoker FALSE
#>
#> attr(,"harmonizR::Dataschema")attr(,"madshapR::class")
#> [1] "data_dict_mlstr"
#> attr(,"harmonizR::Dataschema")attr(,"harmonizR::class")
#> [1] "Dataschema_mlstr"
#> attr(,"harmonizR::data processing elements")
#> # A tibble: 26 × 11
#> index dataschema_variable valueType ss_table ss_variables
#> * <dbl> <chr> <chr> <chr> <chr>
#> 1 1 adm_unique_id text dataset_MELBOURNE_1 id
#> 2 2 adm_study text dataset_MELBOURNE_1 __BLANK__
#> 3 3 adm_year_dce text dataset_MELBOURNE_1 __BLANK__
#> 4 4 sdc_age integer dataset_MELBOURNE_1 age
#> 5 5 sdc_gender integer dataset_MELBOURNE_1 Gender
#> 6 6 phy_height decimal dataset_MELBOURNE_1 __BLANK__
#> 7 7 phy_weight decimal dataset_MELBOURNE_1 __BLANK__
#> 8 8 phy_bmi decimal dataset_MELBOURNE_1 BMI
#> 9 9 rep_preg_ever integer dataset_MELBOURNE_1 __BLANK__
#> 10 10 rep_preg_curr integer dataset_MELBOURNE_1 __BLANK__
#> # ℹ 16 more rows
#> # ℹ 6 more variables: `Mlstr_harmo::rule_category` <chr>,
#> # `Mlstr_harmo::algorithm` <chr>, `Mlstr_harmo::comment` <chr>,
#> # `Mlstr_harmo::status` <chr>, `harmonizR::r_script` <chr>,
#> # `Mlstr_harmo::status_detail` <chr>
#> attr(,"harmonizR::harmonized_col_id")
#> [1] "adm_unique_id"
# }
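The description mentions that the harmonization statuses can be used to summarize harmonization potential across datasets. A minimal sketch of such a summary is given below on a toy table; the printed output above suggests the annotated object is stored in the "harmonizR::data processing elements" attribute of the result, but this sketch does not rely on that and only reuses a few statuses taken from the console log above.

```r
# Toy excerpt of an annotated data processing elements table. The statuses
# are copied from the console log above for three of the 13 variables.
annotated_toy <- data.frame(
  dataschema_variable = rep(c("sdc_age", "phy_bmi", "lsb_smo_ever"), times = 2),
  ss_table = rep(c("dataset_MELBOURNE_1", "dataset_MELBOURNE_2"), each = 3),
  `Mlstr_harmo::status` = c(
    "complete", "complete", "impossible",   # dataset_MELBOURNE_1
    "impossible", "impossible", "complete"  # dataset_MELBOURNE_2
  ),
  check.names = FALSE
)

# Cross-tabulate datasets against statuses.
status_summary <- table(
  dataset = annotated_toy$ss_table,
  status = annotated_toy[["Mlstr_harmo::status"]]
)
print(status_summary)
```

A cross-tabulation is only one possible summary; the same table could also be reshaped so that each DataSchema variable is a row and each dataset a column.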