I have a large array of data which I have read into a dask dataframe. This data frame has two columns that I believe to be redundant (i.e., have identical values). These columns are string-valued -- they give the names of growth media used for incubating colonies of cells.
I would like to check my hypothesis that the two columns are identical before dropping one of them.
The simplest solution I could come up with was the following:
(df['growth_media_1'] == df['growth_media_2']).all().compute()
But this gives me the following error:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+--------+---------+----------+
| Column | Found | Expected |
+--------+---------+----------+
| input | float64 | int64 |
| output | float64 | int64 |
+--------+---------+----------+
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'input': 'float64',
'output': 'float64'}
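For what it's worth, the `dtype=` keyword that the error message suggests is forwarded by dask's `read_csv` on to pandas. Here is a minimal pandas sketch of the same keyword (the `input`/`output` column names come from the error table above; the CSV text itself is made up):

```python
import io
import pandas as pd

# Hypothetical CSV standing in for the real data files; the second
# row has a missing 'input' value, which is what usually makes dask's
# int64 inference fail on later chunks.
csv_text = "input,output\n1,2\n,4\n"

# Forcing float64 up front, as the error message recommends, avoids
# the int64-vs-float64 mismatch caused by missing values.
df = pd.read_csv(io.StringIO(csv_text),
                 dtype={'input': 'float64', 'output': 'float64'})
print(df.dtypes)
```

With `dask.dataframe.read_csv` the call looks the same, just with a glob of file paths instead of the `StringIO` buffer.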
I thought this might be because there were some NaNs in the columns, so I tried calling .dropna() before the comparison. But that did not fix the problem.
After extensive flailing, I ended up with this arcane mess:
(df['growth_media_1'].dropna() == df['growth_media_2'].dropna()).astype('bool').all().compute()
but even that did not solve my problem.
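For context on why NaNs complicate equality checks: NaN never compares equal to NaN, and calling .dropna() on each series separately can misalign their indices if the missing values sit in different rows. A small pandas sketch (the media names here are made up) showing a fill-based comparison that sidesteps both issues:

```python
import pandas as pd

# Made-up media names; None becomes NaN in the resulting columns.
a = pd.Series(['LB', None, 'YPD'])
b = pd.Series(['LB', None, 'YPD'])

# Plain equality treats NaN != NaN, so identical columns look unequal.
print((a == b).all())  # False

# Filling NaNs with a sentinel before comparing counts missing values
# in the same row as a match.
sentinel = '<missing>'
print(a.fillna(sentinel).eq(b.fillna(sentinel)).all())  # True
```

The same `.fillna(...).eq(...)` pattern works on dask series, followed by `.compute()`.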
The error message really isn't helpful, since neither pd.read_csv nor pd.read_table is involved, as far as I can tell. However, pandas_read_text is in the backtrace, so perhaps dask is deferring the parsing and re-reading chunks of the file lazily.
(I'm using dask version 1.2.2, if that helps; this is running on a high-performance cluster, which lags behind the bleeding edge of software.)
He who fights dragons too long becomes a dragon himself; gaze too long into the abyss, and the abyss gazes back into you…