Thursday, September 10, 2020

Create a Bucketed Table with a Broadcast Join in PySpark

Create a bucketed table from a join in three steps:

(1) Create two DataFrames, df and df2.

(2) Use a broadcast join on them.

(3) Bucket the joined result and save it as a table.

Note: You can bucket any DataFrame directly; the join is not required. I use one here to also show how a broadcast join works.

>>> df=spark.read.format('csv').option('header','true').load('D:/Documents/data/sample3.csv')

>>> df2=spark.read.format('csv').option('header','true').load('D:/Documents/data/sample2.csv')

>>> from pyspark.sql.functions import broadcast

>>> df2.join(broadcast(df), df2.cat == df.cat).select(df2.cat, df.no).write.bucketBy(2, 'cat').saveAsTable('buckettab')

>>> spark.sql("select * from buckettab").show()

>>> spark.sql("describe formatted buckettab").show()

