Hope you are well, and thank you in advance.
I have a '.xls' file and want to read the data into a dataframe.
For other '.xls' files I have successfully used:
df1a = pd.read_excel(sr_file)
For this '.xls' file it doesn't work due to being an unsupported format. Full error:
Traceback (most recent call last):
File "pyt_AST_Recon.py", line 713, in <module>
main()
File "pyt_AST_Recon.py", line 683, in main
subredFile_loc(sr_file, sr_date)
File "pyt_AST_Recon.py", line 329, in subredFile_loc
df1a = pd.read_excel(sr_file)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\util\_decorators.py", line 208, in wrapper
return func(*args, **kwargs)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\excel\_base.py", line 310, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\excel\_base.py", line 819, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\excel\_xlrd.py", line 21, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\excel\_base.py", line 359, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\excel\_xlrd.py", line 36, in load_workbook
return open_workbook(filepath_or_buffer)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\xlrd\__init__.py", line 162, in open_workbook
ragged_rows=ragged_rows,
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\xlrd\book.py", line 91, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\xlrd\book.py", line 1271, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\xlrd\book.py", line 1265, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'Trade da'
On further inspection, I believe this happens because the file is actually a '.csv' file underneath and not a real '.xls'. However, when I use read_csv I also get an error.
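One way to confirm that suspicion is to look at the first few bytes of the file. The following is only a minimal sketch: the function name `guess_file_kind` and the constants are made up for illustration, and it assumes that anything without a known spreadsheet signature is plain text.

```python
# Signature of the OLE2 container used by genuine legacy .xls files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# .xlsx files are zip archives, which start with these bytes.
ZIP_MAGIC = b"PK\x03\x04"


def guess_file_kind(path):
    """Hypothetical helper: classify a file by its leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Assumption: no spreadsheet signature means a delimited text file.
    return "text"
```

In the traceback above, xlrd reports that it found b'Trade da' where it expected a BOF record. That is readable text rather than a binary signature, which is consistent with the file being plain text under a '.xls' name.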
When I change the extension from '.xls' to '.csv', I get a horrible-looking csv file:
| Header One Header Two Header Three Header Four | | | |
| -------- | -----------| ------- |------- |
| L1 Data 1 L1 Data 2 | L1 Data 3 | L1 Data 4 | |
| L2 Data 1 | L2 Data 2 | L2 Data 3 | L2 Data 4 |
| L3 Data 1 | L3 Data 2 | L3 Data 3 | L3 Data 4 |
When using read_csv I get the following error:
Traceback (most recent call last):
File "pyt_AST_Recon.py", line 713, in <module>
main()
File "pyt_AST_Recon.py", line 683, in main
subredFile_loc(sr_file, sr_date)
File "pyt_AST_Recon.py", line 329, in subredFile_loc
df1a = pd.read_csv(sr_file)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\parsers.py", line 463, in _read
data = parser.read(nrows)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "C:\Users\d.howells\AppData\Local\Programs\Python\Python35\lib\site-packages\pandas\io\parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 6
This does make some sense given the weird-looking data I am seeing.
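The field-count mismatch hints that the comma may not be the real separator, with commas only appearing inside some values. As a rough sketch of how the separator could be guessed, the hypothetical helper below counts a few candidate characters per line and picks the one whose count is most consistent. The name `guess_delimiter` and the candidate list are assumptions, and whether this file is actually tab-separated is not known from the output above.

```python
from collections import Counter


def guess_delimiter(lines, candidates=("\t", ",", ";", "|")):
    """Hypothetical helper: pick the candidate whose per-line count
    repeats (with a non-zero value) on the largest number of lines."""
    best, best_score = None, 0
    for cand in candidates:
        # Tally how often each per-line count occurs, skipping blank lines.
        counts = Counter(line.count(cand) for line in lines if line.strip())
        counts.pop(0, None)  # lines without the candidate say nothing
        if not counts:
            continue
        _, score = counts.most_common(1)[0]
        if score > best_score:
            best, best_score = cand, score
    return best
```

If a separator is found this way, it could then be passed to `pd.read_csv` through its `sep` argument instead of relying on the default comma.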
When I use error_bad_lines=False to skip the bad lines, I get the following dataframe:
| Header One Header Two Header Three Header Four |
| -------- |
| L1 Data 4 |
This obviously drops most of the data.
This is admittedly a complete mess, so if anyone can help it would be very much appreciated.