Hey there!
Welcome to ClearUrDoubt.com.
In this post, we will go through the steps to read a CSV file in Spark SQL using spark-shell.
Step 1: Spark 1.6.0 has no built-in CSV reader, so we need a third-party library, the Databricks spark-csv package. We can pull in the dependency when starting spark-shell, as below:
    spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
Step 2: Import the required classes before using them.
    scala> import org.apache.spark.sql.types._
Step 3: Specify the schema of the CSV file records using StructType/StructField classes imported in Step 2.
    scala> val schema = StructType(
         | StructField("name", StringType, true)::
         | StructField("maths", LongType, true)::
         | StructField("physics", LongType, true)::
         | StructField("chemistry", LongType, true)::Nil)
    schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(maths,LongType,true), StructField(physics,LongType,true), StructField(chemistry,LongType,true))
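The `::` ... `::Nil` chain in Step 3 builds a Scala List of fields. As a sketch of an equivalent form, the same four columns can be passed as a `Seq`, which some readers find easier to scan. The name `schemaFromSeq` is chosen only for this example, and the snippet assumes it is run in the same spark-shell session.

```scala
import org.apache.spark.sql.types._

// Same four columns as the schema in Step 3, built from a Seq
// instead of a :: ... :: Nil list. schemaFromSeq is a name chosen for this sketch.
val schemaFromSeq = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("maths", LongType, nullable = true),
  StructField("physics", LongType, nullable = true),
  StructField("chemistry", LongType, nullable = true)
))

// Print the schema as a tree to check the column names and types.
schemaFromSeq.printTreeString()
```

Either form can be passed to `.schema(...)` when loading the file.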
Step 4: Load the CSV file using sqlContext as below:
    scala> val students = sqlContext.read.format("com.databricks.spark.csv").option("header","true").schema(schema).option("mode","PERMISSIVE").load("/user/workspace/clearurdoubt/practice/markssheet.csv")
    students: org.apache.spark.sql.DataFrame = [name: string, maths: bigint, physics: bigint, chemistry: bigint]
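The `mode` option in Step 4 decides what happens to malformed lines. As described in the spark-csv documentation, PERMISSIVE (the default) tries to parse every line and fills missing tokens with nulls, DROPMALFORMED drops lines that do not fit the schema, and FAILFAST aborts with an exception on the first malformed line. The sketch below wraps the load in a small helper so the modes can be compared. `loadStudents` and `csvPath` are names made up for this example, and it assumes the same spark-shell session (`sqlContext` from the shell, `schema` from Step 3).

```scala
import org.apache.spark.sql.DataFrame

// Assumes the spark-shell session from the steps above:
// sqlContext is provided by the shell, schema is the StructType from Step 3.
val csvPath = "/user/workspace/clearurdoubt/practice/markssheet.csv"

// Hypothetical helper: load the same file with a given parsing mode.
def loadStudents(mode: String): DataFrame = {
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("mode", mode)
    .schema(schema)
    .load(csvPath)
}

// Skip lines that do not match the schema.
val cleanStudents = loadStudents("DROPMALFORMED")

// Fail on the first malformed line instead of loading partial data.
val strictStudents = loadStudents("FAILFAST")
```

Because DataFrames are evaluated lazily, a FAILFAST error would typically surface only when an action such as `show` actually reads the file.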
Step 5: Display the records read from the CSV file:
    scala> students.show
    +--------+-----+-------+---------+
    |    name|maths|physics|chemistry|
    +--------+-----+-------+---------+
    |Student1|   95|     82|       85|
    |Student2|   99|     88|       92|
    |Student3|   89|     85|       82|
    |Student4|   28|     53|       32|
    |Student5|   97|     88|       88|
    |Student6|   77|     65|       92|
    +--------+-----+-------+---------+
Step 6: Display the total marks of each student
    scala> students.select(students("name").as("Name"), (students("maths")+students("physics")+students("chemistry")).as("Total Marks")).show
    +--------+-----------+
    |    Name|Total Marks|
    +--------+-----------+
    |Student1|        262|
    |Student2|        279|
    |Student3|        256|
    |Student4|        113|
    |Student5|        273|
    |Student6|        234|
    +--------+-----------+
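The `select` in Step 6 returns only the name and the total. If the individual marks should stay visible next to the total, one possible variation is `withColumn`, which appends a column to the existing ones. This is a sketch, assuming the `students` DataFrame from Step 4 in the same spark-shell session. `withTotals` is a name chosen for the example.

```scala
// Assumes the students DataFrame from Step 4 in the same spark-shell session.
// withTotals is a name chosen for this sketch.
val withTotals = students.withColumn(
  "Total Marks",
  students("maths") + students("physics") + students("chemistry"))

// Original columns plus the new "Total Marks" column.
withTotals.show

// Optionally sort so the highest total comes first.
withTotals.orderBy(withTotals("Total Marks").desc).show
```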
Step 7: Display the Sum of Total Marks of all students
    scala> students.agg(sum((students("maths")+students("physics")+students("chemistry")).as("Total Marks")).as("Sum of Total Marks")).show
    +------------------+
    |Sum of Total Marks|
    +------------------+
    |              1417|
    +------------------+
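As a quick cross-check of the numbers in Steps 6 and 7, the same arithmetic can be done on a plain Scala collection without Spark. The marks below are copied from the table shown in Step 5. `marks`, `totals`, and `grandTotal` are names chosen for this sketch.

```scala
// Marks copied from the table in Step 5: (name, maths, physics, chemistry).
val marks = Seq(
  ("Student1", 95L, 82L, 85L),
  ("Student2", 99L, 88L, 92L),
  ("Student3", 89L, 85L, 82L),
  ("Student4", 28L, 53L, 32L),
  ("Student5", 97L, 88L, 88L),
  ("Student6", 77L, 65L, 92L)
)

// Per-student totals, mirroring Step 6.
val totals = marks.map { case (name, m, p, c) => name -> (m + p + c) }

// Sum of all totals, mirroring Step 7.
val grandTotal = totals.map(_._2).sum

println(grandTotal) // prints 1417
```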
These steps can be performed programmatically using Scala/Java. We will see them in upcoming posts.
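Until then, here is a rough sketch of how the steps above might look as a standalone Scala application on Spark 1.6. It is not the code from the upcoming posts. The object name `MarksSheetApp` and the application name are made up for this example, and it assumes the spark-csv package is on the classpath (for instance by passing the same `--packages` coordinate to spark-submit).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types._

// Hypothetical object and application names, chosen for this sketch.
object MarksSheetApp {

  // Same schema as Step 3.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("maths", LongType, nullable = true),
    StructField("physics", LongType, nullable = true),
    StructField("chemistry", LongType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    // Outside spark-shell we have to create the contexts ourselves.
    val conf = new SparkConf().setAppName("MarksSheetApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Step 4: load the CSV file through the spark-csv data source.
    val students = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .schema(schema)
      .load("/user/workspace/clearurdoubt/practice/markssheet.csv")

    // Reusable column expression for the total marks of one student.
    val total = students("maths") + students("physics") + students("chemistry")

    // Steps 5 to 7: show the records, the per-student totals, and the overall sum.
    students.show()
    students.select(students("name").as("Name"), total.as("Total Marks")).show()
    students.agg(sum(total).as("Sum of Total Marks")).show()

    sc.stop()
  }
}
```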
Happy Learning :).
Please leave a reply in case of any queries.