Spark Direct Stream Kafka order of events

I have a question about reading data with Spark Direct Streaming (Spark 1.6) from Kafka 0.9 and saving it in HBase.

I am trying to update specific row keys in an HBase table as the data is received from Kafka, and I need to ensure the order of events is kept (data received at t0 must be saved in HBase before data received at t1).

The row key represents a UUID, which is also the key of the message in Kafka, so at the Kafka level I am sure that the events for a specific UUID are ordered within their partition.
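For reference, this is roughly how the messages get keyed on the producer side (a simplified sketch; the broker address and the topic name "events" are placeholders):

```scala
import java.util.{Properties, UUID}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker:9092") // placeholder
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
val uuid = UUID.randomUUID().toString
// same UUID key -> same partition -> Kafka keeps these records in send order
producer.send(new ProducerRecord[String, String]("events", uuid, "payload"))
producer.close()
```

Since the default partitioner hashes the key, all events for one UUID land in the same partition.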

My problem begins when I start reading the data with Spark.

With the direct stream approach, each RDD partition maps one-to-one to a Kafka partition. I am not doing any shuffling (just parse and save), so my events won't get mixed up across the RDD, but I am worried that when the executor reads its partition it won't maintain the order, so I will end up with incorrect data in HBase when I save it.
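To make this concrete, here is a simplified version of what my job does (Spark 1.6's spark-streaming-kafka direct API; the broker, the topic, and the actual HBase write are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("kafka-to-hbase")
val ssc = new StreamingContext(conf, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder

// one RDD partition per Kafka partition, no shuffle
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // `records` iterates over one Kafka partition's slice in offset order;
    // consuming it sequentially on a single thread preserves that order
    records.foreach { case (uuid, payload) =>
      // HBase Put on row key = uuid goes here (placeholder)
    }
  }
}

ssc.start()
ssc.awaitTermination()
```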

How can I ensure that the order is kept at the executor level, especially if I use multiple cores in one executor (which, as I understand it, results in multiple task threads)?

I think I could also live with one core per executor if that fixes the issue, combined with turning off speculative execution, enabling Spark's back-pressure, and keeping the maximum task retries at 1.
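If I go that route, I assume the relevant settings would look roughly like this (note that spark.task.maxFailures is an application-wide setting, not per executor):

```scala
val conf = new SparkConf()
  .set("spark.executor.cores", "1")                    // single task thread per executor
  .set("spark.speculation", "false")                   // no duplicate task attempts
  .set("spark.streaming.backpressure.enabled", "true") // rate-limit ingestion
  .set("spark.task.maxFailures", "1")                  // fail fast instead of retrying out of order
```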

I have also thought about sorting the events at Spark partition level using the Kafka offset, as sketched below.
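That would need per-record offsets, which (as far as I can tell) the direct API exposes via the messageHandler variant of createDirectStream. A rough sketch, reusing ssc and kafkaParams from above (the fromOffsets map is a placeholder for wherever offsets are checkpointed, and toSeq assumes a partition's batch slice fits in memory):

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// placeholder starting offsets; in reality these come from an offset store
val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)

// carry the offset along with each record
val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.offset, mmd.key, mmd.message)

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (Long, String, String)](
  ssc, kafkaParams, fromOffsets, handler)

stream.foreachRDD { rdd =>
  rdd.foreachPartition { it =>
    // defensive sort by offset before writing; assumes the slice fits in memory
    it.toSeq.sortBy(_._1).foreach { case (offset, uuid, payload) =>
      // ordered HBase write goes here (placeholder)
    }
  }
}
```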

Any advice?

Thanks a lot in advance!


1 Answer

Waiting for answers.
