The Pandas read_csv function accepts a custom delimiter sep, which can be a regular expression. The task, then, is to articulate a regex which matches only the last n-1 spaces on each line, where n is the number of columns in the file.
import pandas as pd

q3 = pd.read_csv(
    "Question2.txt", engine='python', skiprows=2, encoding='unicode_escape',
    sep=r'\s+(?!\S+(?:\s+\S+){10})')
print(q3)
The regular expression matches runs of whitespace (\s+), but only when they are not followed (?!...) by eleven or more whitespace-separated fields (one \S+ plus ten repetitions of \s+\S+). On an eleven-column line, that is true only for the last ten separators, so any spaces inside the first column are left alone.
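You can check the pattern in isolation with the standard re module; the sample line below is invented to mimic the eleven-column layout (a two-word name followed by ten data fields):

```python
import re

# Separator: whitespace NOT followed by eleven whitespace-separated
# fields, i.e. only the last ten space runs on the line match.
pattern = r'\s+(?!\S+(?:\s+\S+){10})'

# Invented sample row: a two-word name plus ten data columns.
line = "Costa Rica 4.7 3.4 2.1 5.0 6.1 7.2 8.3 .. 4.5 49.4"
fields = re.split(pattern, line)
print(fields)
# The space inside "Costa Rica" is not treated as a separator,
# so the split yields eleven fields.
```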
The sample data you provided doesn't seem to exactly match what your code expects, but with a couple of empty lines added at the top, I get
Name 2000a€“12 2012a€“13 ... 2012.4 2012.5 2012.6
0 Costa Rica 4.7 3.4 ... .. 4.5 49.4
1 C?′te da€?Ivoire 1.2 8.7 ... .. 1.3 39.0
2 Croatia 2.1 .. ... .. 3.4 80.7
3 Cuba 5.8 .. ... .. .. ..
4 Cura?§ao .. .. ... .. .. ..
5 Cyprusb 2.6c .. ... 113.3 2.4 ..
6 Czech Republic 3.3 .. ... 38.3 3.3 77.3
7 Denmark 0.6 .. ... 50.6 2.4 74.6
8 Djibouti 3.5 .. ... .. 3.7 ..
9 Dominica 3.2 1.1 ... .. 1.4 97.4
10 Dominican Republic 5.6 2.5 ... .. 3.7 34.3
11 Ecuador 4.4 4.0 ... .. 5.1 31.6
12 Egypt, Arab Rep. 4.9 1.8 ... .. 7.1 74.1
[13 rows x 11 columns]
(Notice the Unicode mojibake, because of the slightly weird encoding
keyword argument.)
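The real fix is to pass the correct encoding to read_csv in the first place, but mojibake of this shape is often reversible after the fact. A minimal sketch, assuming the file is actually UTF-8 and the bytes were mis-decoded with a single-byte codec such as cp1252 (an assumption based on the artifacts above):

```python
# Sketch: undo UTF-8-decoded-as-cp1252 mojibake, assuming that is
# what happened here.
broken = "2000\u00e2\u20ac\u201c12"   # how "2000–12" renders after the mix-up
fixed = broken.encode('cp1252').decode('utf-8')
print(fixed)  # 2000–12
```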
Probably as your very first task, save the result in a saner format: proper CSV (comma-delimited, with quoting around any field which contains a literal comma). Pandas to_csv() takes care of all of this for you.
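For example (the frame and buffer here are just placeholders standing in for q3 and an output file):

```python
import io
import pandas as pd

# Placeholder frame; note the field containing a literal comma.
df = pd.DataFrame({"Name": ["Egypt, Arab Rep."], "2000": [4.9]})

buf = io.StringIO()
df.to_csv(buf, index=False)   # to_csv quotes the comma-containing field
print(buf.getvalue())
```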
As an aside, hardcoding the file name with a full path is probably going to make your script less useful. Maybe take out the path, like I have done above, and run the script in the directory containing the input file; or (less usefully, but sometimes more practically) use a relative path to a subdirectory and run the script from the corresponding parent directory.