Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

excel - Conversion of large .csv file to .prn (around 3.5 GB) in Ubuntu using bash

I have a very large .csv file (about 3.5 GB), as I am dealing with big data, and I need to convert it to a .prn file, which separates the columns with a space delimiter.

Here are some sample input rows from the file:

UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,274

UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405176. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,275

UNT,Gujarat,84716050,25669.69,UNITS,"QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405181. FIS-0870-1004G.",INAMD4,M,2015-05-01,Ahmedabad,Import,MALAYSIA,1,276

KGS,Gujarat,29213090,187897.88,KILOGRAMS,MEMANTINE HYDROCHLORIDE. BATCH NO. 134614003,INAMD4,W,2015-05-01,Ahmedabad,Import,ITALY,5,277

If you look closely, each block above is one row of the file, and the cells in each row are separated by commas. But in row 1 the field "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G." itself contains several commas. So if I use the comma (,) as the delimiter, I end up splitting it into "QX-870", "IND BARCODE SCANNER", "SW RSTR", "LD" and "SRL+ETHNT S/N.:3402030. FIS-0870-1004G.", which I don't want. Browsing the internet, I found that Microsoft Excel can change the format of the file by saving it in a different format (I chose .prn, which solved my problem), but Excel cannot convert a file this big (3.5 GB). I want my output to look something like this, i.e. row no. 1 on line 1, row no. 2 on line 2, and so on:

UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3402030. FIS-0870-1004G." INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1 274

UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405176. FIS-0870-1004G." INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1 275

UNT Gujarat 84716050 25669.69 UNITS "QX-870, IND BARCODE SCANNER, SW RSTR,LD,SRL+ETHNT S/N.:3405181. FIS-0870-1004G." INAMD4 M 2015-05-01 Ahmedabad Import MALAYSIA 1 276

KGS Gujarat 29213090 187897.88 KILOGRAMS MEMANTINE HYDROCHLORIDE. BATCH NO. 134614003 INAMD4 W 2015-05-01 Ahmedabad Import ITALY 5 277

1 Answer

It's not clear from your question, since you didn't provide sample input/output we could test against, but it sounds like all you're trying to do is this:

$ cat tst.awk
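BEGIN {
    # Output width of each column (example values; use your own).
    split("7 10 15 12 4",w)
    # GNU awk: a field is a run of non-commas or a double-quoted string.
    FPAT="[^,]*|\"[^\"]*\""
}
{
    # Temporarily replace escaped quotes ("") with RS, which cannot occur within a record.
    gsub(/""/,RS)
    for (i=1;i<=NF;i++) {
        gsub(/"/,"",$i)        # strip the enclosing quotes
        gsub(RS,"\"",$i)       # restore escaped quotes as a literal "
        printf "<%-*s>", w[i], substr($i,1,w[i])   # pad or truncate to w[i]
    }
    print ""
}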

$ cat file
abcde,"ab,c,de","ab ""c"" de","a,""b"",c",ab
abcdefghi,"xyab,c,de","xyzab ""c"" de",abc,abcdefg

$ awk -f tst.awk file
<abcde  ><ab,c,de   ><ab "c" de      ><a,"b",c     ><ab  >
<abcdefg><xyab,c,de ><xyzab "c" de   ><abc         ><abcd>

Obviously I added the < and > around each field just to make it clear where each field starts and ends; you'd remove them for your real application. I'm creating the array w to hold a specific width for each field, as I don't know where else you'd get the widths from.
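
If fixed widths aren't needed and you only want the commas between fields turned into single spaces, as in the desired output in the question, a minimal sketch along the same lines could look like the following. It assumes GNU awk is installed as gawk, that no quoted field contains an escaped "" or an embedded newline, and the function name csv_to_space is made up for illustration:

```shell
# Sketch only: needs GNU awk (gawk) for FPAT; csv_to_space is a hypothetical name.
csv_to_space() {
    gawk '
    BEGIN {
        # A field is either a double-quoted string (which may contain commas)
        # or a run of non-comma characters (possibly empty).
        FPAT = "\"[^\"]*\"|[^,]*"
        OFS = " "
    }
    # Assigning to a field makes awk rebuild the record with OFS between fields.
    { $1 = $1; print }
    ' "$@"
}

# Example call (file names are placeholders):
#   csv_to_space input.csv > output.prn
```

awk reads and writes one record at a time, so this should not need to hold the 3.5 GB file in memory. Note that unquoted fields containing spaces, such as MEMANTINE HYDROCHLORIDE. BATCH NO. 134614003, stay ambiguous in space-separated output, which is one reason to prefer the fixed-width version above.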

The above uses GNU awk for FPAT; with other awks it would be a while(match()) loop.
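
For reference, here is a rough sketch of what such a while(match()) loop might look like in an awk without FPAT. It is not the code from the answer above: it just joins the fields with single spaces instead of padding them, it assumes no field contains an embedded newline, and the names csv_fields, rest and f are made up for illustration:

```shell
# Sketch only: portable awk without FPAT; csv_fields is a hypothetical name.
csv_fields() {
    awk '
    {
        rest = $0
        n = 0
        while (1) {
            # Try a quoted field first ("" inside it is an escaped quote),
            # otherwise take everything up to the next comma.
            if (!match(rest, /^"([^"]|"")*"/))
                match(rest, /^[^,]*/)
            len = (RLENGTH > 0 ? RLENGTH : 0)
            f[++n] = substr(rest, 1, len)
            rest = substr(rest, len + 1)
            if (rest == "")
                break
            rest = substr(rest, 2)      # drop the comma after the field
        }
        # Join the fields with single spaces, keeping the quotes as they were.
        out = f[1]
        for (i = 2; i <= n; i++)
            out = out " " f[i]
        print out
    }
    ' "$@"
}
```

To get the fixed-width output instead, the join at the end could be replaced by the same printf with w[i] used in tst.awk.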

