Currently have 2 data frames, one stored visit record, while the other stored purchase record.
The objective is to count the number of consumer
for each bag brand
. In Visit record, the visit id is unique. But in purchase record, as one purchase could include more than one brand of bags, the same visit id could have multiple bag id.
Please note that in consumer record, there's a Date field.
As the aim is to calculate how many customers brought certain brand of bag regardless of the time. eg if the customer Amy brought the same brand of bag like Prada in both 2019 and 2020. It would still be counted as 1 towards the customer count for Prada.
But for now, the count show 2 instead of 1. Any idea? Thanks
Visit record [df1]
[
{
"Date":"2019",
"Visit id":"12435",
"Consumer":[
{
"Reference id":"R0009",
"Consumer name":"Sally"
}
]
},
{
"Date":"2020",
"Visit id":"23453",
"Consumer":[
{
"Reference id":"R0009",
"Consumer name":"Sally"
}
]
},
{
"Date":"2019",
"Visit id":"56674",
"Consumer":[
{
"Reference id":"R0010",
"Consumer name":"Amy"
}
]
},
{
"Date":"2020",
"Purchase id":"65432",
"Consumer":[
{
"Reference id":"R0010",
"Consumer name":"Amy"
}
]
}
]
Purchase record [df2]
[
{
"Visit id":"12435",
"Product":[
{
"Bag id":"BAG001",
"Bag name":"Prada"
},
{
"Bag id":"BAG004",
"Bag name":"Gucci"
}
]
},
{
"Visit id":"23453",
"Product":[
{
"Bag id":"BAG006",
"Bag name":"Chanel"
},
{
"Bag id":"BAG007",
"Bag name":"Marc"
}
]
},
{
"Visit id":"56674",
"Product":[
{
"Bag id":"BAG001",
"Bag name":"Prada"
},
{
"Bag id":"BAG006",
"Bag name":"Chanel"
}
]
},
{
"Purchase id":"65432",
"Product":[
{
"Bag id":"BAG001",
"Bag name":"Prada"
},
{
"Bag id":"BAG007",
"Bag name":"Marc"
}
]
}
]
Code
df1=pd.json_normalize(data=json.loads(json1), record_path='Consumer', meta=['Date','Visit id'])
df2=pd.json_normalize(data=json.loads(json2), record_path='Purchase', meta=['Visit id'])
df3=df1.merge(df2, on='Visit id', how='left').dropna()
df3=df3.groupby(['Bag id', 'Bag name']).agg({'Consumer name': [list, 'count']})
df3
Output