Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
611 views
in Technique by (71.8m points)

apache spark - Different default persist level for RDD and Dataset

I was trying to find a good explanation of why the default persist level for RDD is MEMORY_ONLY while for Dataset it is MEMORY_AND_DISK, but I couldn't find one. Does anyone know a good reason behind this?

Thanks


Fight with dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss will gaze back…

1 Answer

0 votes
by (71.8m points)

Simply because MEMORY_ONLY is rarely useful: in practice it is uncommon to have enough memory to store all the required data, so you often have to evict some of the blocks or cache the data only partially. Any block that is not in the cache has to be recomputed when it is needed again.

In contrast, MEMORY_AND_DISK spills data that does not fit in memory to disk, so no cached block is lost.
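The difference between the two storage levels can be sketched with a toy model. This is not Spark's block manager; the function `run`, the notion of fixed "memory slots", and the numbers used are all assumptions made purely for illustration. It counts how often a block has to be computed when a dataset of 10 blocks is scanned three times but only 6 blocks fit in memory.

```python
# Toy model only: NOT Spark's block manager, just an illustration of the idea.
# "memory_slots" is a hypothetical stand-in for the available storage memory.

def run(storage_level, num_blocks, memory_slots, passes):
    """Return how many block computations are needed over all passes."""
    memory, disk = set(), set()
    computations = 0
    for _ in range(passes):
        for block in range(num_blocks):
            if block in memory or block in disk:
                continue                   # served from the cache
            computations += 1              # cache miss: build the block from lineage
            if len(memory) < memory_slots:
                memory.add(block)          # block fits in memory
            elif storage_level == "MEMORY_AND_DISK":
                disk.add(block)            # spill to disk instead of dropping
            # MEMORY_ONLY: the block is simply not cached
    return computations


memory_only = run("MEMORY_ONLY", num_blocks=10, memory_slots=6, passes=3)
memory_and_disk = run("MEMORY_AND_DISK", num_blocks=10, memory_slots=6, passes=3)
print(memory_only, memory_and_disk)  # prints "18 10"
```

In this sketch the four blocks that do not fit are recomputed on every pass under MEMORY_ONLY, while MEMORY_AND_DISK computes each block once. The more expensive a single recomputation is, the more that gap matters.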

The exact reason for choosing MEMORY_AND_DISK as the default caching mode is explained in SPARK-3824 (Spark SQL should cache in MEMORY_AND_DISK by default):

Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core. Especially since now we are more conservative about caching blocks and sometimes won't cache blocks we think might exceed memory, it seems good to keep persisted blocks on disk by default.

