Perl read a large file for use with multi line regex

Question

asked Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I have a 4GB text file with highly variable-length lines. This is only a sample file; production files will be much larger. I need to read the file and apply a multi-line regex.

What is the best way to read such a large file for the multi line regex?

If I read it line by line, I don't think my multi-line regex will work correctly. When I use the three-argument form of the read function, my regex results vary as I change the length I specify in the read statement. I believe the file is too large to be read into an array or into memory.

Here is my code

package main;
use strict;
use warnings;

our $VERSION = 1.01;
my $buffer;
my $INFILE;
my $OUTFILE;

open $INFILE, '<', ... or die "Bad Input File: $!";
open $OUTFILE, '>',... or die "Bad Output File: $!";

while ( read $INFILE, $buffer, 512  ) {
    if ($buffer =~ /(?m)(^[^\r\n]*\R+){1}^(B|BREAK|C|CLOSE|D|DO(?! NOT)|E|ELSE|F|FOR|G|GOTO|H|HALT|HANG|I|IF|J|JOB|K|KILL|L|LOCK|M|MERGE|N|O|OPEN|Q|QUIT|R|READ|S|SET|TC|TRE|TRO|TS|U|USE|V|VIEW|W|WRITE|X|XECUTE)( |:).*[^\r\n]/) {
        print $OUTFILE $&;
        print $OUTFILE "\n";
    }
}

close( $INFILE ); 
close( $OUTFILE );
1;

Here is some sample data:

^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2"),^%Z("ELONG"):$L(%L)>245 S %N="E1" Q:$L(%L)>255  X ^%ZOSF("EON") S DX=0,DY=%EY,X=%RM+1 X ^%ZOSF("RM"),XY K %EX,%EY,%E1,%E2,DX,DY,%N Q
^%Z("F12")
S %A=$P(^DIC(9.8,0),"^",3)+1,%C=$P(^(0),"^",4)+1 X "F %=0:0 Q:'$D(^DIC(9.8,%A,0))  S %A=%A+1" S $P(^DIC(9.8,0),"^",3,4)=%A_"^"_%C,^DIC(9.8,%A,0)=%X_"^R",^DIC(9.8,"B",%X,%A)=""
^%Z("F2")
S %=$H>21549+$H-.1,%Y=%365.25+141,%=%#365.251,%D=%+306#(%Y#4=0+365)#153#61#31+1,%M=%-%D29+1,%DT=%Y_"00"+%M_"00"+%D,%D=%M_"/"_%D_"/"_$E(%Y,2,3)

The lines above are paired syntactically (lines 1 and 2 go together, 3 and 4, etc.). I need to find specific pairs. In the above data, that's all of the pairs except for:

^%Z("RT")
This is data that I don't want the regex to find

1 Answer

深蓝 · Answer 1 · 2022-01-31T07:05:46+0000

The question is apparently about parsing a DSL (domain-specific language), and in general regex isn't the right tool for that. A quick search did not yield an easy list of accepted approaches, only pages of CPAN modules and assorted articles. Finding the best approach is indeed the first step.

However, below is an answer to the question as stated in the title and in the clear description: how to parse a very large file where units to be processed spread over an unknown number of lines.

Keep assembling a 'buffer' and checking it. Once you find a match, process and clear it.

For instance, append each line to a variable and check it (try to match, if you use regex). Keep going, and once it does match, process the variable and clear it.

my $unit;
while (<$fh>) {
    # chomp;            # if suitable, and then add a space
    # $unit .= ' '.$_;  # as a separator that newline was
    $unit .= $_;

    if ( test_unit($unit) ) {
         # process ...
         $unit = undef;
    }
}

The test_unit() sub is a placeholder for code that decides whether the assembled unit should be processed. If that test is a regex, it can be compiled before the loop with my $re = qr/.../; (see qr in perlop) and then applied in the loop with if ($unit =~ $re).
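
A minimal runnable sketch of that loop, under some assumptions: the pattern in $re is a hypothetical, shortened stand-in for the regex in the question, collect_units is a made-up name, the last sample line is abbreviated, and the input comes from an in-memory filehandle instead of a real file.

```perl
use strict;
use warnings;

# Hypothetical, shortened pattern (not the full keyword list from the
# question): a header line, then a line that starts with a command
# keyword followed by a space or a colon.
my $re = qr/^([^\n]*)\n((?:S|X|Q|K)[ :][^\n]*)\n/m;

# Append lines from $fh to a buffer; whenever the buffer matches $re,
# record the matched pair and clear the buffer.
sub collect_units {
    my ($fh, $re) = @_;
    my @found;
    my $unit = '';
    while (my $line = <$fh>) {
        $unit .= $line;
        if ($unit =~ $re) {
            push @found, "$1\n$2";
            $unit = '';
        }
    }
    return @found;
}

# Sample input modelled on the question's data; the last line is abbreviated.
my $sample = join "\n",
    '^%Z("EUD")',
    'S %L=%LO,%N="E1"',
    '^%Z("RT")',
    q{This is data that I don't want the regex to find},
    '^%Z("EXY")',
    'X ^%ZOSF("EON") Q',
    '';

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
print "$_\n---\n" for collect_units($fh, $re);
close $fh;
```

Note that the buffer is only cleared on a match, so input with long unmatched stretches would need an extra rule for discarding stale lines.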

A note in the question states that lines to be processed come in pairs, but a comment clarifies that subsequent lines don't always pair up. Thus we can't simply process the file two lines at a time.
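
If the units of interest always span exactly two adjacent lines (an assumption, not something the question confirms), one alternative to fixed pairing is to slide a two-line window over the file, so a stray unpaired line cannot shift the alignment. The sketch below is not from the original answer; match_adjacent_lines, the shortened pattern, and the sample lines are all hypothetical.

```perl
use strict;
use warnings;

# Slide a two-line window over $fh and return every window that matches $re.
# Only the previous line is kept, so memory use stays constant.
sub match_adjacent_lines {
    my ($fh, $re) = @_;
    my @found;
    my $prev;
    while (my $line = <$fh>) {
        my $window = defined $prev ? "$prev$line" : undef;
        push @found, $window if defined $window && $window =~ $re;
        $prev = $line;
    }
    return @found;
}

# Hypothetical, shortened pattern anchored to the whole window: any line,
# then a line starting with a command keyword plus a space or a colon.
my $re = qr/\A[^\n]*\n(?:S|X|Q|K)[ :][^\n]*\n?\z/;

# Hypothetical sample with one unpaired line in the middle.
my $sample = join "\n",
    '^%Z("EUD")',
    'S %L=%LO,%N="E1"',
    'an unpaired line',
    '^%Z("F12")',
    'K %EX,%EY',
    '';

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
print "$_---\n" for match_adjacent_lines($fh, $re);
close $fh;
```

The trade-off is that every line is tested twice, once as the second line of a window and once as the first.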
