Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


awk - Splitting file on nth occurrence of delimiter and adding content from a txt file to every new file

I want to split a >500MB ASCII-based text file after ~5000 occurrences of a delimiter ("00I" in my case). I am using the code from https://stackoverflow.com/a/42302328/14957413:

awk -v n=5000 '
   function ofile() {
      if (op) 
         close(op); 
      op = sprintf("file.GES.%d.", ++p)
   } 
   BEGIN{ofile()} 
   /00I/{++i} i>n{i=1; ofile()} 

   { print $0 > op }' file

The source file starts with roughly 1000 lines of variable declarations, which I also need in every new file created by the snippet above.

Input

//file header
00K
01Filename
02Fieltype
03Date

//00F describes a variable
00F 
0101
02Variable name 1
03text
04length
00F 
0102
02Variable name 2
03number
04length

//content I want split
00I
01Value for first F, e.g. Test
02Value for second F, e.g. 1
//this repeats a couple of 1.000.000 times
00I
01Value for first F, e.g. TestN
02Value for second F, e.g. N

Expected output for each file (first to nth)

//Header
00K
01Filename
02Fieltype
03Date

//Variable declaration
00F 
0101
02Variable name 1
03text
04length
00F 
0102
02Variable name 2
03number
04length

//Content
00I
01Value for first F, e.g. Test
02Value for second F, e.g. 1

Two ideas

  1. Extending the awk statement to store the first ~1000 lines of the source file in a variable and to write it to every newly generated file.
  2. Preparing a separate file with the variable declarations and adding its content to every newly generated file.

Questions: What is the best way to achieve the task? Can it be done by extending the awk expression? Or do I need to run two statements, first the awk and then a sed statement?

Help is very much appreciated.



1 Answer


Using GNU awk, you can do the following:

awk -v n=5000 'BEGIN{RS="\n00I\n"}
               (NR==1){h=$0; next;}
               (i%n==0){close(f); f= "file.GES." (++c); printf "%s",h > f}
               {printf "%s%s", RS, $0 > f; ++i}' file

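This will create the files file.GES.1, file.GES.2, and so on. Each one starts with the header block and contains up to 5000 records (the last file may hold fewer).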
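As a quick sanity check, the command can be tried on a tiny hand-made sample before running it on the real file. The sketch below is only illustrative: sample.txt, its five records and n=2 are made-up stand-ins, and it assumes that the awk on your PATH accepts a multi-character RS (GNU awk does; call gawk explicitly if in doubt).

```shell
# Work in a scratch directory so the generated files are easy to inspect.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Hypothetical miniature input: a header block followed by five 00I records.
cat > sample.txt <<'EOF'
00K
01Filename
00F
0101
02Variable name 1
00I
01Test1
00I
01Test2
00I
01Test3
00I
01Test4
00I
01Test5
EOF

# Same program as above, with n=2 so that the split is visible.
awk -v n=2 'BEGIN{RS="\n00I\n"}
            (NR==1){h=$0; next}
            (i%n==0){close(f); f="file.GES." (++c); printf "%s",h > f}
            {printf "%s%s", RS, $0 > f; ++i}' sample.txt

# Count the 00I records that ended up in each output file.
grep -c '^00I' file.GES.*
```

With five records and n=2 this should give three files holding 2, 2 and 1 records, each beginning with the 00K header block.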

How does it work?

By setting the record separator to RS="\n00I\n", we split the input file file into a set of multi-line records which are separated by RS (treating a multi-character RS this way goes beyond POSIX awk, hence GNU awk). When awk processes a record, the record $0 contains all lines between two 00I lines. When awk reads the first record (NR==1), it stores it in the variable h. This holds the file header and the variable declarations (unless RS occurs inside one of these blocks). From that point on we start counting the records. Whenever the counter i is a multiple of n (5000 here), that is before the first record and after every 5000 records, we close the current file, create a new one named file.GES.c, where c is a number incremented per file, and write the header h to it. This is done in the line

(i%n==0){close(f); f="file.GES." (++c); printf "%s",h > f}

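Each time we process a record, we print RS followed by the record to the current file (awk strips the separator while reading, so it has to be written back) and increment the record counter i, which is used to check whether we need a new file or not.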
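If GNU awk is not available, the same idea (keep everything before the first 00I as a header and repeat it in every file) can also be expressed line by line, without a multi-character RS. This is a sketch under assumptions, not the method from the answer above: it assumes that every record starts with a line beginning with 00I and that no other line starts with 00I. The file sample.txt and n=2 are placeholders for the demo.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Hypothetical miniature input: a header block followed by five 00I records.
cat > sample.txt <<'EOF'
00K
01Filename
00F
0101
02Variable name 1
00I
01Test1
00I
01Test2
00I
01Test3
00I
01Test4
00I
01Test5
EOF

awk -v n=2 '
    /^00I/ {                          # assumed: such a line opens a new record
        if (i % n == 0) {             # every n records, switch to a new file
            if (f != "") close(f)
            f = "file.GES." (++c)
            printf "%s", h > f        # each file starts with the saved header
        }
        ++i
    }
    f == "" { h = h $0 "\n"; next }   # before the first 00I: collect the header
    { print $0 > f }                  # everything else goes to the current file
' sample.txt

# Count the 00I records that ended up in each output file.
grep -c '^00I' file.GES.*
```

For the real data, replace sample.txt with the source file and set n=5000. Because the header is collected line by line, every output file ends with a newline.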

