A DataFrame
is defined well with a google search for "DataFrame definition":
A data frame is a table, or two-dimensional array-like structure, in
which each column contains measurements on one variable, and each row
contains one case.
So, a DataFrame
has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD
, on the other hand, is merely a Resilient Distributed Dataset that is more of a blackbox of data that cannot be optimized as the operations that can be performed against it, are not as constrained.
However, you can go from a DataFrame to an RDD
via its rdd
method, and you can go from an RDD
to a DataFrame
(if the RDD is in a tabular format) via the toDF
method
In general it is recommended to use a DataFrame
where possible due to the built in query optimization.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…