Create a Bucketed Table with a Broadcast Join in PySpark
We create a bucketed table from a join in three steps:
(1) Create two DataFrames, df and df2.
(2) Join them with a broadcast join.
(3) Write the join result as a bucketed table.
Note: you can bucket any DataFrame directly when writing it; the join is included here to show how a broadcast join works.
>>> df=spark.read.format('csv').option('header','true').load('D:/Documents/data/sample3.csv')
>>> df2=spark.read.format('csv').option('header','true').load('D:/Documents/data/sample2.csv')
>>> from pyspark.sql.functions import broadcast
>>> df2.join(broadcast(df),df2.cat==df.cat).select(df2.cat,df.no).write.bucketBy(2,'cat').saveAsTable('buckettab')
>>> spark.sql("select * from buckettab").show()
>>> spark.sql("describe formatted buckettab").show()