Like a dict, a DataFrame's index is backed by a hash table. Looking up rows
based on index values is like looking up dict values based on a key.
In contrast, the values in a column are like values in a list.
Looking up rows based on index values is faster than looking up rows based on column values.
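To make the analogy concrete, here is a minimal plain-Python sketch (no pandas involved). The names keys, values, and by_key are made up for illustration; the point is only that a dict finds a key by hashing it, while a list has to be scanned.

```python
# Plain-Python analogy: a dict finds a key by hashing it, while a list
# has to be scanned element by element.
keys = list(range(10000))
values = [k * 2 for k in keys]   # illustrative payload, not from the answer

# "Index-like" lookup: hash the key and jump straight to the entry.
by_key = dict(zip(keys, values))
hashed_result = by_key[999]

# "Column-like" lookup: compare every element against the target.
scanned_result = [v for k, v in zip(keys, values) if k == 999]

print(hashed_result)    # 1998
print(scanned_result)   # [1998]
```

Both approaches find the same value; they differ only in how much work it takes to get there.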
For example, consider:
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])
Here is how you could look up any row where the df['index'] column equals 999. Pandas has to loop through every value in the column to find the ones equal to 999:
df[df['index'] == 999]
#           foo  index
# 999  0.375489    999
Here is how you could look up any row where the index equals 999. With an index, Pandas uses the hash value to find the rows:
df_with_index.loc[999]
# foo    0.375489
# Name: 999, dtype: float64
Looking up rows by index is much faster than looking up rows by column value:
In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 μs per loop
In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 μs per loop
Note, however, that it takes time to build the index:
In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 μs per loop
So having the index is only advantageous when you have many lookups of this type
to perform.
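As a rough back-of-the-envelope check, the three timings quoted above can be plugged into a break-even estimate. This is only a sketch: the numbers come from one machine and one 10000-row frame, and the variable names are made up for illustration.

```python
import math

# Timings quoted above, in microseconds (one machine; yours will differ).
build_index_us = 330.0      # df.set_index(['index'])
column_lookup_us = 368.0    # df[df['index'] == 999]
index_lookup_us = 57.7      # df_with_index.loc[999]

# Each indexed lookup saves this much compared with a column scan.
saving_per_lookup_us = column_lookup_us - index_lookup_us

# Smallest whole number of lookups whose total saving covers the build cost.
break_even_lookups = math.ceil(build_index_us / saving_per_lookup_us)

print(round(saving_per_lookup_us, 1))  # 310.3
print(break_even_lookups)              # 2
```

With these particular numbers the index pays for itself after a couple of lookups; the ratio shifts with the size and dtype of the data, so it is worth re-measuring on your own frame.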
Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt, lreshape, and crosstab, all use or manipulate the index.
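Here is a small sketch of the index being used for reshaping. The data (city, year, sales) is hypothetical; it only shows set_index moving columns into the index and unstack pivoting one index level into columns.

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year) pair.
long_df = pd.DataFrame({
    'city': ['Oslo', 'Oslo', 'Rome', 'Rome'],
    'year': [2020, 2021, 2020, 2021],
    'sales': [10, 12, 20, 25],
})

# set_index moves the key columns into a two-level index ...
indexed = long_df.set_index(['city', 'year'])

# ... and unstack pivots the inner index level ('year') into columns,
# leaving one row per city.
wide = indexed['sales'].unstack('year')

print(wide)
```

The result has one row per city and one column per year, which is the kind of shape you might want for presentation.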
Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.
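To illustrate the two styles of joining, here is a minimal sketch with two made-up frames, left and right. It shows only that both routes produce the same matches, not how fast either one is.

```python
import pandas as pd

# Two hypothetical frames sharing a 'key' column.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Column-based: merge matches rows on the 'key' column.
merged = left.merge(right, on='key', how='inner')

# Index-based: move 'key' into the index first, then join aligns on it.
joined = left.set_index('key').join(right.set_index('key'), how='inner')

print(merged)
print(joined)
```

In the merged result 'key' stays an ordinary column, while in the joined result it is the index.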
Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.
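As a sketch of those time series methods, here is a small made-up hourly Series. The values and dates are arbitrary; the point is that resample and asfreq operate on the DatetimeIndex rather than on a column.

```python
import pandas as pd

# Six hourly observations; the DatetimeIndex is what resample and asfreq use.
idx = pd.date_range('2024-01-01', periods=6, freq='h')
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Downsample to 2-hour bins, summing the values in each bin.
two_hourly = ts.resample('2h').sum()

# Upsample to 30-minute steps (new slots are NaN), then fill them linearly.
half_hourly = ts.asfreq('30min').interpolate()

print(two_hourly)
print(half_hourly.head(3))
```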
So in the end, I think the origin of the index's usefulness, and the reason it shows up in so many functions, is its ability to perform fast hash lookups.