I am working on reading data from PostgreSQL and processing it in Spark, performing multiple transformations and then creating a delta to separate the records that need to be inserted/updated/deleted.
Initially I was reading the data on a single task, without specifying a partition column or upper/lower bound values, and it was too slow.
Then I specified the partition column with lower and upper bound values so that Spark reads the data from PostgreSQL in parallel. I am reading the data on 6 threads/tasks by setting the numPartitions option. I am also pushing the filter condition down to PostgreSQL, and I can see that in the Spark explain plan.
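For reference, here is a minimal sketch of that kind of parallel JDBC read; the connection details, table name, partition column, bounds, and filter are placeholders, not the actual values I am using.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "public.source_table")   # hypothetical table name
    .option("user", "<user>")
    .option("password", "<password>")
    .option("partitionColumn", "id")            # assumed numeric partition column
    .option("lowerBound", "1")
    .option("upperBound", "130000000")
    .option("numPartitions", "6")               # 6 parallel read tasks, as described
    .load()
    .filter("status = 'ACTIVE'")                # simple predicate Spark can push down to PostgreSQL
)

source_df.explain()  # the pushed-down filter appears in the physical plan
```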
The volume of data is huge: around 120-130 million rows after filtering. Once I have the data frame in Spark I am doing multiple joins with other data frames. I noticed that the task of reading the data from PostgreSQL happens again every time an action is called. I am not caching the data frame because it is huge, but I wanted to make sure that I don't read that data multiple times, as it was taking a lot of time.
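This is roughly the pattern I mean (frame names, join keys, and paths are hypothetical): because nothing is cached, each action rebuilds the full lineage, including the JDBC read behind source_df from the sketch above.

```python
# Placeholder lookup data used only to illustrate the joins
dim_customers = spark.read.parquet("s3://<bucket>/dims/customers")
dim_products = spark.read.parquet("s3://<bucket>/dims/products")

joined_df = (
    source_df
    .join(dim_customers, "customer_id")   # assumed join key
    .join(dim_products, "product_id")     # assumed join key
)

joined_df.count()                                                       # first action: triggers the JDBC read
joined_df.write.mode("overwrite").parquet("s3://<bucket>/tmp/joined")   # second action: reads from PostgreSQL again
```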
So I decided to read the data, save the data frame contents as parquet files in an S3 bucket, and then read that data back into a new data frame from the parquet files in S3. With this, the join tasks that were taking the most time got significantly faster, but I ran into another issue: saving the data as parquet files takes almost 40-50 minutes. I am repartitioning the data to 48 partitions before saving it as parquet files.
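A sketch of that staging step, assuming a placeholder S3 path (the repartition count of 48 is the one I actually use):

```python
staging_path = "s3://<bucket>/staging/source_table"   # hypothetical S3 path

(
    source_df
    .repartition(48)        # 48 partitions before writing, as described
    .write
    .mode("overwrite")
    .parquet(staging_path)
)

# Read the materialized data back so the later joins reuse the parquet files
# instead of re-running the JDBC read against PostgreSQL.
staged_df = spark.read.parquet(staging_path)
```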
I am running this on a Databricks cluster with 6 worker nodes.
Driver node configuration:
Driver Type = i3.xlarge, 30.5 GB memory, 4 cores
Worker node configuration:
Worker Type = i3.2xlarge, 61 GB memory, 8 cores
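Expressed as a Databricks Clusters API payload, the cluster looks roughly like this; the node types and worker count mirror the setup above, while the name and runtime version are assumptions.

```python
cluster_spec = {
    "cluster_name": "etl-postgres-to-delta",   # hypothetical name
    "spark_version": "7.3.x-scala2.12",        # assumed Databricks runtime version
    "driver_node_type_id": "i3.xlarge",        # 30.5 GB memory, 4 cores
    "node_type_id": "i3.2xlarge",              # 61 GB memory, 8 cores per worker
    "num_workers": 6,
}
```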
I just wanted to know whether the approach I am using is right, or whether there are better ways of doing the same task more efficiently.
Is there any way I can reduce the time it is taking to save the parquet files?
question from:
https://stackoverflow.com/questions/65935885/is-there-a-optimize-way-of-reading-data-from-postgresql-and-perform-transformati