Hey there!
Welcome to ClearUrDoubt.com.
In this post, we will look at updating a column value based on another column's value in a dataframe using the when() utility function in Spark.
Spark context available as 'sc' (master = local[*], app id = local-1560028311690).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 12.0.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.read.json("/clearurdoubt/output.json")
df: org.apache.spark.sql.DataFrame = [first_name: string, id: bigint ... 2 more fields]

scala> df.show(false)
+----------+---+---------+------------+
|first_name|id |last_name|student_type|
+----------+---+---------+------------+
|Rahul     |101|Kumar    |1           |
|Antony    |102|James    |0           |
|Ashok     |103|Kedar    |0           |
|Ravi      |104|Aswin    |1           |
+----------+---+---------+------------+

scala> val new_df = df.withColumn("student_type_description", when(col("student_type").equalTo("1"), lit("Day-Scholar")).otherwise("Residential"))
new_df: org.apache.spark.sql.DataFrame = [first_name: string, id: bigint ... 3 more fields]

scala> new_df.show(false)
+----------+---+---------+------------+------------------------+
|first_name|id |last_name|student_type|student_type_description|
+----------+---+---------+------------+------------------------+
|Rahul     |101|Kumar    |1           |Day-Scholar             |
|Antony    |102|James    |0           |Residential             |
|Ashok     |103|Kedar    |0           |Residential             |
|Ravi      |104|Aswin    |1           |Day-Scholar             |
+----------+---+---------+------------+------------------------+

scala>
when() function documentation:

static Column when(Column condition, Object value)
Evaluates a list of conditions and returns one of multiple possible result expressions.

Source: Spark documentation
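To illustrate the "list of conditions" behaviour, several when() calls can be chained before a final otherwise(). The extra student type below is only a hypothetical addition to the example above, not part of the sample data:

import org.apache.spark.sql.functions.{col, lit, when}

// Chain multiple when() conditions; rows matching none of them
// fall through to otherwise().
// "2" / "Exchange" is a hypothetical extra student type.
val described_df = df.withColumn("student_type_description",
  when(col("student_type").equalTo("1"), lit("Day-Scholar"))
    .when(col("student_type").equalTo("2"), lit("Exchange"))
    .otherwise(lit("Residential")))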
Happy Learning :)
Please leave a reply in case of any queries.
Hi, could you help me with the following logic?
Consider a dataframe:
============================================================
number|date|label|opendate|closedate|updateddate
1|1/1/2019|a|||
2|1/1/2019|a|||
3|1/1/2019|a|||
1|1/1/2019|b|||
============================================================
I need to check the following conditions and update the open date, close date, and updated date columns.
Condition 1:
=> For all rows, the open date and updated date are set to the row's date by default.
Condition 2:
=> When the number is the same, we have to check the label. If the label differs, then we have to set the close date and updated date of the first occurring number, and set only the updated date for the second occurring number (a sketch follows the expected output below).
Expected output dataframe:
==============================================================
number|date|label|opendate|closedate|updateddate
1|1/1/2019|a|1/1/2019|2/1/2019|2/1/2019   ----> first occurring number
2|1/1/2019|a|1/1/2019||1/1/2019
3|1/1/2019|a|1/1/2019||1/1/2019
1|2/1/2019|b|2/1/2019||2/1/2019   ----> second occurring number
===============================================================
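A rough sketch of how conditions 1 and 2 might be expressed with Spark window functions, assuming the column names from the sample dataframe and a d/M/yyyy date format (both are assumptions, not tested against the real data):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, to_date, when}

// Look at the next occurrence of the same number, ordered by date.
val w = Window.partitionBy("number").orderBy(to_date(col("date"), "d/M/yyyy"))
val nextDate  = lead(col("date"), 1).over(w)
val nextLabel = lead(col("label"), 1).over(w)

val result = df
  // Condition 1: open and updated dates default to the row's own date.
  .withColumn("opendate", col("date"))
  // Condition 2: if the same number re-appears with a different label,
  // close the earlier row and move its updated date to the later date.
  .withColumn("closedate",
    when(nextLabel.isNotNull && nextLabel =!= col("label"), nextDate))
  .withColumn("updateddate",
    when(nextLabel.isNotNull && nextLabel =!= col("label"), nextDate)
      .otherwise(col("date")))

The second occurrence keeps its own date as the updated date because it has no following row in its window.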
Condition 3:
=> When the number is the same, we have to check the label. If the label is the same, then set the first occurring row's updated date to the second occurring row's date and delete the second row (a sketch follows the expected output below).
Example dataframe that matches the above condition:
========================================================
number|date|label|opendate|closedate|updateddate
1|1/1/2019|a|||   ----> first occurring number
2|1/1/2019|b|||
3|1/1/2019|c|||
1|2/1/2019|a|||   ----> second occurring number with same label
==========================================================
Expected output dataframe:
=================================================================================================
number|date|label|opendate|closedate|updateddate
1|1/1/2019|a|1/1/2019||2/1/2019   ----> first occurring number's updated date is set to the second occurring number's date
2|1/1/2019|b|1/1/2019||1/1/2019
3|1/1/2019|c|1/1/2019||1/1/2019
                                  ----> second occurring number's row deleted
=============================================================================================
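For condition 3, a similarly hedged sketch (same assumed column names and date format) could carry the later date backwards and then drop the repeated row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead, to_date, when}

val w = Window.partitionBy("number").orderBy(to_date(col("date"), "d/M/yyyy"))
val nextDate  = lead(col("date"), 1).over(w)
val nextLabel = lead(col("label"), 1).over(w)
val prevLabel = lag(col("label"), 1).over(w)

// Condition 3: when the same number re-appears with the same label,
// move the later date into the first row's updated date ...
val flagged = df
  .withColumn("opendate", col("date"))
  .withColumn("updateddate",
    when(nextLabel === col("label"), nextDate).otherwise(col("date")))
  // ... and mark the repeated (second occurring) row for removal.
  .withColumn("is_repeat", prevLabel.isNotNull && prevLabel === col("label"))

val deduped = flagged.filter(!col("is_repeat")).drop("is_repeat")

The flag column is materialised first because Spark does not allow window expressions directly inside a filter.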
I have implemented the same logic in plain Java using ArrayList and POJO classes, and I need to know how to implement the above logic using Spark. Kindly help me with it.
I tried updating it using Spark UDF functions, but that works for only one column.
An example code would be useful.
Thanks in advance.