I am looking for the most elegant way to loop through and read in multiple files organized by date, selecting the most recent value whenever anything changed, based on multiple keys.
Unfortunately, I need to read in all the files rather than just the latest one, because a key combination can disappear from a later file, and I still want to capture its last known value.
Here is an example of what the files look like (I'm posting them comma separated even though they're fixed width):
file_20200101.txt
key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101
file_20200102.txt
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
and an example of the desired output:
desired_df
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
100,foo,15,20200101
In addition, here is some code I know works to read in multiple files and produce my ideal output, but I need the filtering to happen inside the loop: the combined dataframe would be far too big if I imported and bound all the files first.
library(data.table)
library(dplyr)
library(purrr)
library(stringr)

files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)

df <- files %>%
  map(function(f) {
    print(f)
    fread(f) %>%
      mutate(source_file = f)  # tag each row with the file it came from
  }) %>%
  bind_rows()

df <- df %>%
  # pull the eight-digit date out of the filename and filter on it
  mutate(file_date = as.numeric(str_extract(basename(source_file), "\\d{8}"))) %>%
  group_by(key_1, key_2) %>%
  filter(file_date == max(file_date)) %>%
  ungroup()
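To illustrate what I mean by "inside the loop", here's a rough, untested sketch of the kind of thing I'm after: keep only a running table of the latest row per (key_1, key_2), so memory is bounded by the number of distinct keys rather than the total row count. It assumes every filename carries an eight-digit date, as in the examples above.

library(data.table)
library(dplyr)
library(stringr)

files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)

latest <- NULL  # running table: one row per (key_1, key_2), the newest seen so far
for (f in files) {
  df <- fread(f) %>%
    mutate(file_date = as.numeric(str_extract(basename(f), "\\d{8}")))
  # merge the new file in, then keep only the newest row within each key group
  latest <- bind_rows(latest, df) %>%
    group_by(key_1, key_2) %>%
    filter(file_date == max(file_date)) %>%
    ungroup()
}

Keys that vanish from later files (like 100/foo above) stay in latest, because nothing newer ever replaces them. One caveat: if a key appears more than once within a single file, the filter keeps all tied rows; slice_max(file_date, n = 1, with_ties = FALSE) would break ties instead.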
Thanks in advance!