In Spark, each time we do any action on an RDD, the RDD is re-computed. So if we know that the RDD is going to be reused, we should cache the RDD explicitly.
Let's say, Spark decides to lazily cache all the RDDs and uses LRU to keep the most relevant RDDs in memory automatically (which is how most caching works any way). It will be of great help for the developer as he does not have to think about caching and concentrate on the application. Also I do not see how can it negatively impact performance, as it is difficult to keep track of, how many times a variable (RDD) is used inside the program, most programmers will decide to cache most of the RDDs any way.
Caching usually happens automatically. Take the examples of either an OS/platform or a framework or a tool. But with the complexities of caching in distributed computing, I might be missing why caching cannot be automatic or the performance implications.
So I fail to understand, why I have to explicitly cache as,
- It looks ugly
- It can easily be missed
- It can easily be over/under used
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…