Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
784 views
in Technique[技术] by (71.8m points)

filter - count on group by on multiple columns and getting the original dataset

2, cornflakes, Regular,General Mills, 12    
3, cornflakes, Mixed Nuts, Post, 14  
4, chocolate syrup, Regular, Hersheys, 5   
5, chocolate syrup, No High Fructose, Hersheys, 8  
6, chocolate syrup, Regular, Ghirardeli, 6  
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

Script

data_grp = GROUP data BY (item, type);
data_cnt = FOREACH data_grp GENERATE FLATTEN (group) AS(item, type), count(data) as total; 
filter_data = FILTER data_cnt BY total < 2;

I now need the original data with the filter applied and my desired output is:

4, chocolate syrup, Regular, Hersheys, 5
6, chocolate syrup, Regular, Ghirardeli, 6
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

filter_data will give you chocolate syrup, Regular.Join the filter_data with original dataset with item,type and get the desired result.

data_grp = GROUP data BY (item, type);
data_cnt = FOREACH data_grp GENERATE FLATTEN (group) AS(item, type), COUNT(data) as total; 
filter_data = FILTER data_cnt BY total < 2;
o_data = JOIN data BY (item,type),filter_data BY ($0,$1);
final_data = FOREACH o_data GENERATE $0..$4;
DUMP final_data;

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...