You don't give many details, but the following is easy to test and should serve as a guideline:
If you re-read the same files with the same content, you will get the same blocks/partitions again and therefore the same ids from f.monotonically_increasing_id().
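As a quick sanity check, the sketch below reads the same static input twice, applies the function, and compares the resulting ids; the path and the join key `event_key` are placeholders, not anything from your setup.

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical path and key column, purely for illustration.
df1 = spark.read.parquet("/data/events").withColumn("id", f.monotonically_increasing_id())
df2 = spark.read.parquet("/data/events").withColumn("id", f.monotonically_increasing_id())

# If the blocks/partitions come back the same, joining on the natural key
# should show no rows where the two generated ids disagree.
mismatches = (df1.alias("a")
              .join(df2.alias("b"), on="event_key")
              .filter(f.col("a.id") != f.col("b.id")))
print(mismatches.count())  # expected to be 0 for identical, unchanged input
```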
If the total number of rows differs on successive reads, or different partitioning is applied before calling the function, then you will typically get different ids.
If you have more data the second time round and apply coalesce(1), the prior entries will still have the same ids and the newer rows will get new ones (not a very realistic scenario, of course).
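The ids are not arbitrary: the partition index goes into the upper bits and the record position within the partition into the lower 33 bits, which is why re-partitioning the same rows produces different values. A small sketch with a toy DataFrame:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.range(6)  # toy DataFrame with 6 rows

# Same rows, different partitioning -> different generated ids.
three_parts = df.repartition(3).withColumn("id", f.monotonically_increasing_id())
one_part = df.coalesce(1).withColumn("id", f.monotonically_increasing_id())

three_parts.show()  # ids jump by 2^33 between partitions
one_part.show()     # ids are simply 0..5 in the single partition
```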
Blocks for files at rest remain static (in general) on HDFS, so partitions 0..N will be the same when read again from rest; otherwise zipWithIndex would not be usable either.
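zipWithIndex is the RDD-level alternative alluded to here: it assigns consecutive 0..N-1 indices based on partition order and position within each partition, so it relies on the same assumption of a stable block/partition layout. A minimal sketch (column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).toDF("value")

# zipWithIndex yields (row, index) pairs with consecutive indices, but it
# triggers an extra job to count rows per partition before assigning them.
indexed = (df.rdd
             .zipWithIndex()
             .map(lambda pair: (*pair[0], pair[1]))
             .toDF(df.columns + ["index"]))
indexed.show()
```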
I would never rely on the same data ending up in the same place when read twice unless there are no updates (you could also cache it).
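If you need the ids to stay stable within a single job, one option is to assign them once and materialize the result, for example via caching; a minimal sketch with a placeholder path (checkpointing or writing the result out is the more robust choice):

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/events")  # placeholder path

# Assign the ids once and pin the result, so later actions reuse the same
# assignment instead of recomputing it from (possibly updated) source files.
with_ids = df.withColumn("id", f.monotonically_increasing_id()).cache()
with_ids.count()  # materializes the cache
```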