Hey there!
Welcome to ClearUrDoubt.com.
In this post, we will look at updating a column value based on another column's value in a dataframe using the when() utility function in Spark.
Spark context available as 'sc' (master = local[*], app id = local-1560028311690).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 12.0.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.read.json("/clearurdoubt/output.json")
df: org.apache.spark.sql.DataFrame = [first_name: string, id: bigint ... 2 more fields]

scala> df.show(false)
+----------+---+---------+------------+
|first_name|id |last_name|student_type|
+----------+---+---------+------------+
|Rahul     |101|Kumar    |1           |
|Antony    |102|James    |0           |
|Ashok     |103|Kedar    |0           |
|Ravi      |104|Aswin    |1           |
+----------+---+---------+------------+

scala> val new_df = df.withColumn("student_type_description", when(col("student_type").equalTo("1"), lit("Day-Scholar")).otherwise("Residential"))
new_df: org.apache.spark.sql.DataFrame = [first_name: string, id: bigint ... 3 more fields]

scala> new_df.show(false)
+----------+---+---------+------------+------------------------+
|first_name|id |last_name|student_type|student_type_description|
+----------+---+---------+------------+------------------------+
|Rahul     |101|Kumar    |1           |Day-Scholar             |
|Antony    |102|James    |0           |Residential             |
|Ashok     |103|Kedar    |0           |Residential             |
|Ravi      |104|Aswin    |1           |Day-Scholar             |
+----------+---+---------+------------+------------------------+

scala>
when() function documentation:

static Column when(Column condition, Object value)
Evaluates a list of conditions and returns one of multiple possible result expressions.

Source: Spark documentation
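To illustrate the "list of conditions" behaviour, several when() calls can be chained before a final otherwise(). The extra student type below is only a hypothetical addition to the example above, not part of the sample data:

import org.apache.spark.sql.functions.{col, lit, when}

// Chain multiple when() conditions; rows matching none of them
// fall through to otherwise().
// "2" / "Exchange" is a hypothetical extra student type.
val described_df = df.withColumn("student_type_description",
  when(col("student_type").equalTo("1"), lit("Day-Scholar"))
    .when(col("student_type").equalTo("2"), lit("Exchange"))
    .otherwise(lit("Residential")))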
Happy Learning :)
Please leave a reply in case of any queries.
Hi, could you help me with the following logic?
Consider a dataframe:
============================================================
number|date|label|opendate|closedate|updateddate
1|1/1/2019|a|||
2|1/1/2019|a|||
3|1/1/2019|a|||
1|1/1/2019|b|||
============================================================
I need to check the following conditions and update the open date, close date, and updated date columns.
Condition 1:
=> For all rows, the open date and updated date are set to the row's date by default.
Condition 2:
=> When the number is the same, we have to check the label. If the label differs, then we have to set the close date and updated date of the first occurring number, and set only the updated date for the second occurring number (a sketch follows the expected output below).
Expected output dataframe:
==============================================================
number|date|label|opendate|closedate|updateddate
1|1/1/2019|a|1/1/2019|2/1/2019|2/1/2019   ----> first occurring number
2|1/1/2019|a|1/1/2019||1/1/2019
3|1/1/2019|a|1/1/2019||1/1/2019
1|2/1/2019|b|2/1/2019||2/1/2019   ----> second occurring number
===============================================================
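A rough sketch of how conditions 1 and 2 might be expressed with Spark window functions, assuming the column names from the sample dataframe and a d/M/yyyy date format (both are assumptions, not tested against the real data):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, to_date, when}

// Look at the next occurrence of the same number, ordered by date.
val w = Window.partitionBy("number").orderBy(to_date(col("date"), "d/M/yyyy"))
val nextDate  = lead(col("date"), 1).over(w)
val nextLabel = lead(col("label"), 1).over(w)

val result = df
  // Condition 1: open and updated dates default to the row's own date.
  .withColumn("opendate", col("date"))
  // Condition 2: if the same number re-appears with a different label,
  // close the earlier row and move its updated date to the later date.
  .withColumn("closedate",
    when(nextLabel.isNotNull && nextLabel =!= col("label"), nextDate))
  .withColumn("updateddate",
    when(nextLabel.isNotNull && nextLabel =!= col("label"), nextDate)
      .otherwise(col("date")))

The second occurrence keeps its own date as the updated date because it has no following row in its window.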
Condition 3:
=> When the number is the same, we have to check the label. If the label is the same, then set the first occurring row's updated date to the second occurring row's date and delete the second row (a sketch follows the expected output below).
Example dataframe that matches the above condition:
========================================================
number|date|label|opendate|closedate|updateddate
1|1/1/2019|a|||   ----> first occurring number
2|1/1/2019|b|||
3|1/1/2019|c|||
1|2/1/2019|a|||   ----> second occurring number with same label
==========================================================
Expected output dataframe:
=================================================================================================
number|date|label|opendate|closedate|updateddate
1|1/1/2019|a|1/1/2019||2/1/2019   ----> first occurring number's updated date is set to the second occurring number's date
2|1/1/2019|b|1/1/2019||1/1/2019
3|1/1/2019|c|1/1/2019||1/1/2019
                                  ----> second occurring number's row deleted
=============================================================================================
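For condition 3, a similarly hedged sketch (same assumed column names and date format) could carry the later date backwards and then drop the repeated row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead, to_date, when}

val w = Window.partitionBy("number").orderBy(to_date(col("date"), "d/M/yyyy"))
val nextDate  = lead(col("date"), 1).over(w)
val nextLabel = lead(col("label"), 1).over(w)
val prevLabel = lag(col("label"), 1).over(w)

// Condition 3: when the same number re-appears with the same label,
// move the later date into the first row's updated date ...
val flagged = df
  .withColumn("opendate", col("date"))
  .withColumn("updateddate",
    when(nextLabel === col("label"), nextDate).otherwise(col("date")))
  // ... and mark the repeated (second occurring) row for removal.
  .withColumn("is_repeat", prevLabel.isNotNull && prevLabel === col("label"))

val deduped = flagged.filter(!col("is_repeat")).drop("is_repeat")

The flag column is materialised first because Spark does not allow window expressions directly inside a filter.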
I have implemented the same logic in plain Java using ArrayList and POJO classes, and I need to know how to implement the above logic using Spark. Kindly help me with it.
I tried updating it using Spark UDF functions, but that works for only one column.
An example code would be useful.
Thanks in advance.