Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
706 views
in Technique[技术] by (71.8m points)

distributed computing - TensorFlow Extended Kubeflow Multiple Workers

I have an issue with TFX on Kubeflow DAG Runner. Issues is that I managed to start only one pod per run. I don't see any configuration for "workers" except on the Apache Beam arguments, which doesn't help.

Running CSV load on one pod results in OOMKilled error because file has more than 5GB. I tried splitting the file in parts of 100MB but that did not help also.

So my question is: How to run a TFX job/stage on Kubeflow on multiple "worker" pods, or is that even possible?

Here is the code I've been using:

examples = external_input(data_root)
example_gen = CsvExampleGen(input=examples)
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

dsl_pipeline = pipeline.Pipeline(
  pipeline_name=pipeline_name,
  pipeline_root=pipeline_root,
  components=[
      example_gen, statistics_gen
  ],
  enable_cache=True,
  beam_pipeline_args=['--num_workers=%d' % 5]
)


if __name__ == '__main__':
    tfx_image = 'custom-aws-imgage:tfx-0.26.0'
    config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=kubeflow_dag_runner.get_default_kubeflow_metadata_config(),
        tfx_image=tfx_image)
    kfp_runner = kubeflow_dag_runner.KubeflowDagRunner(config=config)
    # KubeflowDagRunner compiles the DSL pipeline object into KFP pipeline package.
    # By default it is named <pipeline_name>.tar.gz
    kfp_runner.run(dsl_pipeline)

Environment:

  • Docker image: tensorflow/tfx:0.26.0 with boto3 installed (aws related issue)
  • Kubernetes: AWS EKS latest
  • Kubeflow: 1.0.4
question from:https://stackoverflow.com/questions/65846689/tensorflow-extended-kubeflow-multiple-workers

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

It seams that this is not possible at the time. See: https://github.com/kubeflow/kubeflow/issues/1583


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...