master_list.txt
chr4:124900550:T:C
chr7:82771887:T:C
chr2:177046732:C:A
chr13:91709687:T:G
chr1:23216340:T:C
chr5:28993821:C:T
chr6:108644910:C:T
chr7:83971303:A:T
chr7:9570668:A:G
chr9:130711613:A:G
chr1:199677414:G:A
chr12:101848399:G:A
chr10:101438297:C:A
I have multiple data files that contain some, but not all, IDs in master_list.txt, with additional data for that ID. The additional data will have a consistent number of columns per data file, but a variable number of columns for different data files.
Based on shared IDs in column 1 (all files), I want to populate the master_list.txt with corresponding data from each data file if present. If there is no ID (+data) in a data file that matched with the master_list.txt file, "NAs" should be entered.
Based on the following (example) data files:
data_file_1
chr1:23216340:T:C, 1, 0, 0, 0, 1, 0
chr1:199677414:G:A, 0, 1, 0, 0.2, 0, 0.8
chr5:28993821:C:T, 1, 0, 0, 0.1, 0.1, 0.8
chr6:108644910:C:T, 0.9, 0.1, 0, 0.75, 0.25, 0
chr7:83971303:A:T, 0.9, 0.1, 0, 0.7, 0.2, 0.1
chr9:130711613:A:G, 0.8, 0.1, 0.1, 0, 1, 0
chr10:101438297:C:A, 1, 0, 0, 0.9, 0.1, 0
chr13:91709687:T:G, 1, 0, 0, 0.25, 0.25, 0.5
data_file_2
chr1:23216340:T:C, 0.9, 0.1, 0, 0.7, 0.2, 0.1
chr3:3141592653:A:T, 0, 1, 0, 0.95, 0.05, 0
chr4:124900550:T:C, 1, 0, 0, 0.9, 0.1, 0
chr6:108644910:C:T, 1, 0, 0, 0, 1, 0
chr7:82771887:T:C, 0.85, 0.1, 0.05, 0, 0, 1
chr10:101438297:C:A, 0.5, 0.5, 0, 1, 0, 0
chr12:101848399:G:A, 0.2, 0.2, 0.6, 0, 0, 1
data_file_3
chr2:177046732:C:A, 0.5, 0.5, 0, 1, 0, 0, 0.5. 0.5, 0
chr6:108644910:C:T, 0.95, 0.05, 0, 1, 0, 0, 1, 0, 0
chr7:83971303:A:T, 0.5, 0.5, 0, 0, 0, 1, 1, 0, 0
chr7:9570668:A:G, 0.9, 0.05, 0.05, 0, 0, 1, 0, 1, 0
chr9:130711613:A:G, 1, 0, 0, 1, 0, 0, 0, 0, 1
chr13:91709687:T:G, 1, 0, 0, 0, 0, 1, 0.3, 0.3, 0.4
Desired output is:
chr1:23216340:T:C, 1, 0, 0, 0, 1, 0, 0.9, 0.1, 0, 0.7, 0.2, 0.1, NA, NA, NA, NA, NA, NA, NA, NA
chr1:199677414:G:A, 0, 1, 0, 0.2, 0, 0.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
chr2:177046732:C:A, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.5, 0.5, 0, 1, 0, 0, 0.5. 0.5, 0
chr4:124900550:T:C, NA, NA, NA, NA, NA, NA, 1, 0, 0, 0.9, 0.1, 0, NA, NA, NA, NA, NA, NA, NA, NA
chr5:28993821:C:T, 1, 0, 0, 0.1, 0.1, 0.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
chr6:108644910:C:T, 0.9, 0.1, 0, 0.75, 0.25, 0 , 1, 0, 0, 0, 1, 0, 0.95, 0.05, 0, 1, 0, 0, 1, 0, 0
chr7:9570668:A:G, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.9, 0.05, 0.05, 0, 0, 1, 0, 1, 0
chr7:82771887:T:C, NA, NA, NA, NA, NA, NA, 0.85, 0.1, 0.05, 0, 0, 1, NA, NA, NA, NA, NA, NA, NA, NA
chr7:83971303:A:T, 0.9, 0.1, 0, 0.7, 0.2, 0.1, NA, NA, NA, NA, NA, NA, 0.9, 0.1, 0, 0.7, 0.2, 0.1
chr9:130711613:A:G, 0.8, 0.1, 0.1, 0, 1, 0, NA, NA, NA, NA, NA, NA, 1, 0, 0, 1, 0, 0, 0, 0, 1
chr10:101438297:C:A, 1, 0, 0, 0.9, 0.1, 0, 0.5, 0.5, 0, 1, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA
chr12:101848399:G:A, NA, NA, NA, NA, NA, NA, 0.2, 0.2, 0.6, 0, 0, 1, NA, NA, NA, NA, NA, NA, NA, NA
chr13:91709687:T:G, 1, 0, 0, 0.25, 0.25, 0.5, NA, NA, NA, NA, NA, NA, 1, 0, 0, 0, 0, 1, 0.3, 0.3, 0.4
Please note that data files can have a variable number of rows and columns and that the master_list.txt file will not always contain an ID found in any particular data file. If a data file has an ID (with data), that is not found in master_list.txt, that data does not end up in the output file.
I can use:
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' master_list.txt data_file_1 > file_1_output
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' master_list.txt data_file_2 > file_2_output
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' master_list.txt data_file_3 > file_3_output
to match data in each data file to the master list, but I get no NAs this way. If I did, I could paste the per-data-file-outputs (which would all have row numbers equal to the master_list.txt file) to get the desired output file. So how do I tweak:
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' master_list.txt data_file
to populate outputs with NAs (equal to the number of columns in the data file) when an ID in the master file is not found in the data file? A script that does not require subsequent pasting of individual outputs would be even better!
question from:
https://stackoverflow.com/questions/65928283/cross-reference-multiple-lists-with-master-list-to-produce-merged-output-with-mi