
pyspark - Does Spark guarantee consistency when reading data from S3?

I have a Spark job that reads data from S3, applies some transformations, and writes two datasets back to S3. Each write action is treated as a separate job.

Question: Does Spark guarantee that I read the data in the same order each time? For example, if I apply the function:

.withColumn('id', f.monotonically_increasing_id())

Will the id column have the same values for the same records each time?
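
For context, a minimal sketch of the job shape described above; the paths and the filter are hypothetical placeholders, not taken from the question:

    import pyspark.sql.functions as f
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read.parquet('s3://my-bucket/input/')  # hypothetical path
          .withColumn('id', f.monotonically_increasing_id()))

    # Two separate write actions, each triggering its own job and hence
    # its own read of the source data.
    df.write.mode('overwrite').parquet('s3://my-bucket/out1/')
    df.filter(f.col('id') % 2 == 0).write.mode('overwrite').parquet('s3://my-bucket/out2/')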

Question from: https://stackoverflow.com/questions/66055679/does-spark-guarantee-consistency-when-reading-data-from-s3


1 Answer


You state very little, but the following is easily testable and should serve as a guideline (a sketch of such a test follows the list):

  • If you re-read the same files with the same content, you will get the same blocks / partitions again and the same ids from f.monotonically_increasing_id().

  • If the total number of rows differs on a successive read, with different partitioning applied before this function, then you will typically get different ids.

  • If you have more data the second time around and apply coalesce(1), the prior entries will still have the same ids and the newer rows will get other ids. A less than realistic scenario, of course.
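
As a concrete illustration, here is a minimal sketch of that test. The bucket path and the join key column ('some_key') are hypothetical placeholders, not taken from the question:

    import pyspark.sql.functions as f
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = 's3://my-bucket/input/'  # hypothetical input location

    # Read the same, unchanged files twice and assign ids both times.
    df1 = spark.read.parquet(path).withColumn('id', f.monotonically_increasing_id())
    df2 = spark.read.parquet(path).withColumn('id', f.monotonically_increasing_id())

    # Join on a business key and count records whose ids differ between
    # the two reads; 0 is expected while the underlying files (and hence
    # the partitions) do not change.
    mismatches = (df1.alias('a')
                  .join(df2.alias('b'), on='some_key')
                  .filter(f.col('a.id') != f.col('b.id')))
    print(mismatches.count())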

Blocks for files at rest generally remain static on HDFS, so partitions 0..N will be the same upon reading from rest; otherwise zipWithIndex would not be usable either.
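
For reference, a minimal sketch of the zipWithIndex route, which assigns a stable, gap-free 0..N-1 index following partition order (df is assumed to be an already-loaded DataFrame):

    # Append a deterministic index column via the RDD API.
    indexed = (df.rdd
               .zipWithIndex()                          # (Row, index) pairs
               .map(lambda pair: pair[0] + (pair[1],))  # append the index to the row tuple
               .toDF(df.columns + ['idx']))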

That said, I would never rely on the same data being in the same place when read twice unless there were no updates (you could also cache the data).
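
A minimal sketch of the caching approach, with hypothetical paths; note that caching is best-effort (evicted partitions are recomputed from the source), so it mitigates rather than strictly guarantees consistency:

    df = (spark.read.parquet('s3://my-bucket/input/')  # hypothetical path
          .withColumn('id', f.monotonically_increasing_id())
          .cache())
    df.count()  # materialize the cache before the two writes
    df.write.mode('overwrite').parquet('s3://my-bucket/out1/')
    df.write.mode('overwrite').parquet('s3://my-bucket/out2/')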



...