I am using the following grep script to output all the unmatched patterns:
grep -oFf patterns.txt large_strings.txt | grep -vFf - patterns.txt > unmatched_patterns.txt
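To make the intent of that pipeline concrete, here is a minimal sketch on toy data. The temporary directory and the sample contents are made up for illustration; only the two-stage grep itself comes from the command above.

```shell
# Toy demonstration of the two-stage pipeline (sample data is hypothetical).
tmpdir=$(mktemp -d)

# Three patterns, one per line.
printf '%s\n' 6b6c665d4f44 8b715a5d5f5f 26364d605243 > "$tmpdir/patterns.txt"

# One short "large" string that contains only the first pattern.
printf '%s\n' 121b1f6b6c665d4f44a0a2 > "$tmpdir/large_strings.txt"

# Stage 1 (-o -F -f): print every occurrence of any pattern found in the strings.
# Stage 2 (-v -F -f -): read those hits from stdin and drop them from the
# pattern list, leaving only the patterns that never matched.
grep -oFf "$tmpdir/patterns.txt" "$tmpdir/large_strings.txt" \
  | grep -vFf - "$tmpdir/patterns.txt" > "$tmpdir/unmatched_patterns.txt"

cat "$tmpdir/unmatched_patterns.txt"
```

On this toy input the two patterns that do not occur in the string should be the ones left in unmatched_patterns.txt.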
The patterns file contains 12-character substrings (a few examples are shown below):
6b6c665d4f44
8b715a5d5f5f
26364d605243
717c8a919aa2
The large_strings file contains extremely long strings of around 20-100 million characters each (a small piece of one string is shown below):
121b1f212222212123242223252b36434f5655545351504f4e4e5056616d777d80817d7c7b7a7a7b7c7d7f8997a0a2a2a3a5a5a6a6a6a6a6a7a7babbbcbebebdbcbcbdbdbdbdbcbcbcbcc2c2c2c2c2c2c2c2c4c4c4c3c3c3c2c2c3c3c3c3c3c3c3c3c2c2c1c0bfbfbebdbebebebfbfc0c0c0bfbfbfbebebdbdbdbcbbbbbababbbbbcbdbdbdbebebfbfbfbebdbcbbbbbbbbbcbcbcbcbcbcbcbcbcb8b8b8b7b7b6b6b6b8b8b9babbbbbcbcbbbabab9b9bababbbcbcbcbbbbbababab9b8b7b6b6b6b6b7b7b7b7b7b7b7b7b7b7b6b6b5b5b6b6b7b7b7b7b8b8b9b9b9b9b9b8b7b7b6b5b5b5b5b5b4b4b3b3b3b6b5b4b4b5b7b8babdbebfc1c1c0bfbec1c2c2c2c2c1c0bfbfbebebebebfc0c1c0c0c0bfbfbebebebebebebebebebebebebebdbcbbbbbab9babbbbbcbcbdbdbdbcbcbbbbbbbbbbbabab9b7b6b5b4b4b4b4b3b1aeaca9a7a6a9a9a9aaabacaeafafafafafafafafafb1b2b2b2b2b1b0afacaaa8a7a5a19d9995939191929292919292939291908f8e8e8d8c8b8a8a8a8a878787868482807f7d7c7975716d6b6967676665646261615f5f5e5d5b5a595957575554525
How can we speed up the above script (GNU parallel, xargs, fgrep, etc.)? I tried using --pipepart and --block, but that doesn't allow you to pipe two grep commands.
By the way, all of the strings and patterns are hexadecimal.
The working code below is a little faster than the traditional grep:
rg -oFf patterns.txt large_strings.txt | rg -vFf - patterns.txt > unmatched_patterns.txt
grep took an hour to finish the pattern matching, while ripgrep took around 45 minutes.
question from:
https://stackoverflow.com/questions/65878582/boosting-the-grep-search-using-gnu-parallel