Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
307 views
in Technique[技术] by (71.8m points)

java - Dataflow Template/Pattern in enriching fixed BigQuery data by streaming Pubsub data

I have a BigQuery dimension table (which doesn't change much) and a streaming JSON data from PubSub. What I want to do is to query this dimension table, and enrich the data by joining on the incoming data from PubSub, then write those streams of joined data to another BigQuery table.

As I am new to Dataflow/Beam and the concept is still not that clear to me (or at least I have difficulty starting to write the code), I have a number of questions:

  1. What is best template or pattern I can use to do that? Should I do a PTransform of BigQuery first (followed by PTransform of PubSub) or the PTransform of PubSub first?
  2. How can I do the join? Like ParDo.of(...).withSideInputs(PCollectionView<Map<String, String>> map)?
  3. What is the best window setting for the PubSub? Is it correct that the window setting for the PTransform part of BigQuery is different from the PTransform part of the Pubsub one?
question from:https://stackoverflow.com/questions/65857840/dataflow-template-pattern-in-enriching-fixed-bigquery-data-by-streaming-pubsub-d

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

You need to join two PCollections.

  1. A PCollection that contains data from Pub/Sub. This can be created by using the PubSubIO.Read PTransform.
  2. A PCollection that contains data from BigQuery. If data is static, BigQueryIO.Read transform can be used. If data can change though, the current BigQuery transforms available in Beam probably will not work. One option might be to use transform PeriodicImpulse and your own ParDo to create a periodically changing input. See here for an example (please note that PeriodicImpulse transform was added recently).

You can combine the data in a ParDo where PCollection (1) is the main input and PCollection (2) is a side input (similar to the example above).

Finally you can stream output to BigQuery using the BigQueryIO.Write transform.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...