How to add a new column and update its value based on another column in a DataFrame in Spark

Hey there!

Welcome to ClearUrDoubt.com.

In this post, we will look at updating a column value based on another column value in a DataFrame, using the when() utility function in Spark.

when() function documentation:

static Column when(Column condition, Object value)

Evaluates a list of conditions and returns one of multiple possible result expressions.

Source: Spark documentation

Happy Learning :).

Please leave a reply in case of any queries.

One Reply to “How to add a new column and update its value based on another column in a DataFrame in Spark”

  1. Hi, could you help me with the following logic?
    Consider a dataframe:

    ============================================================

    number|date|label|opendate|closedate|updateddate

    1|1/1/2019|a|||

    2|1/1/2019|a|||

    3|1/1/2019|a|||

    1|2/1/2019|b|||

    ============================================================

    I need to check the following conditions and update the opendate, closedate, and updateddate columns.

    Condition 1:

    => For all rows, opendate and updateddate are set to date by default.

    Condition 2:

    => When the number is the same, we have to check the label. If the label differs, then we have to set the closedate and updateddate of the first occurring number,

    and set only the updateddate for the second occurring number.

    expected output dataframe:

    ==============================================================

    number|date|label|opendate|closedate|updateddate

    1|1/1/2019|a|1/1/2019|2/1/2019|2/1/2019 ----> first occurring number

    2|1/1/2019|a|1/1/2019||1/1/2019

    3|1/1/2019|a|1/1/2019||1/1/2019

    1|2/1/2019|b|2/1/2019||2/1/2019 ----> second occurring number

    ===============================================================

    Condition 3:

    => When the number is the same, we have to check the label.

    If the label is the same, then set the first occurrence's updateddate to the date of the second occurrence, and delete the second row.

    Example dataframe that matches the above condition:

    ========================================================

    number|date|label|opendate|closedate|updateddate

    1|1/1/2019|a||| ----> first occurring number

    2|1/1/2019|b|||

    3|1/1/2019|c|||

    1|2/1/2019|a||| ----> second occurring number with the same label

    ==========================================================

    Expected output dataframe:

    =================================================================================================

    number|date|label|opendate|closedate|updateddate

    1|1/1/2019|a|1/1/2019||2/1/2019 ----> first occurrence's updateddate is set to the second occurrence's date

    2|1/1/2019|b|1/1/2019||1/1/2019

    3|1/1/2019|c|1/1/2019||1/1/2019

    ----> second occurring number's row deleted

    =============================================================================================

    I have done the same logic in plain Java using ArrayList and POJO classes, and I need to know how to implement the above logic using Spark. Kindly help me with it.

    I tried updating it using Spark UDFs, but that works for only one column.

    An example code would be useful.

    Thanks in advance.

Leave a Reply

Your email address will not be published. Required fields are marked *