Difference between DataFrame, Dataset, and RDD in Spark

Question

Difference between DataFrame, Dataset, and RDD in Spark

1 Answer

深蓝 · Answer 1 · 2021-10-16T22:26:35+0000

A Google search for "DataFrame definition" gives a good description:

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

So, because of its tabular format, a DataFrame carries additional metadata (column names and types), which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is simply a Resilient Distributed Dataset. It is more of a black box of data that Spark cannot optimize, because the operations that can be performed against it are not as constrained.

However, you can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.

In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.
