Question
Please complete everything as described. It is simple, but I need help with it.
Part 1: Spark Setup
In this exercise you will set up an Ubuntu virtual machine and install Spark on it.
Download and install VirtualBox and Ubuntu from the following sites, as we did in class.
https://www.virtualbox.org/wiki/Downloads
https://www.ubuntu.com/download/desktop
Once the installation is complete, you will need to install the latest version of Java. Issue the following commands:
sudo apt-get update
sudo apt-get install default-jre
After the installation is done, check the version using the following command:
java -version
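Because default-jre installs whatever Java version the Ubuntu release ships, it can help to know the major version before installing Spark. Spark 2.2.0 was released for Java 8, and that release predates Spark's support for Java 11, so a much newer JRE may not work with it. The following is a minimal sketch of such a check; the helper name java_major is hypothetical, and the parsing assumes the usual version "x.y.z" line printed by java -version.

```shell
# Reduce a Java version string to its major number.
# "1.8.0_131" -> 8 (legacy 1.x scheme), "11.0.2" -> 11 (current scheme)
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # drop the legacy "1." prefix
  esac
  echo "${v%%[._-]*}"     # keep everything before the first '.', '_' or '-'
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr, so redirect it before parsing.
  ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  echo "Java major version: $(java_major "$ver")"
else
  echo "java not found on PATH"
fi
```

If the reported major version is well above 8 and spark-shell later fails to start, installing a Java 8 package instead is one thing to try.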
Next, download Scala from https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.tgz (it will be saved to the Downloads folder).
Decompress the tgz archive using the following command:
tar -xvzf scala-2.12.3.tgz
The archive will be decompressed into a scala-2.12.3 folder. Move this folder to /usr/local/scala using the following command:
sudo mv scala-2.12.3 /usr/local/scala
Add the Scala binaries to the PATH environment variable using the following command:
export PATH=$PATH:/usr/local/scala/bin
Test that the installation was successful by checking the version:
scala -version
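Note that export only changes PATH for the current terminal session, so scala will not be found in a new terminal unless the line is also saved in a startup file. A minimal sketch of one way to do that is shown here; the helper name add_line_once is hypothetical, and it assumes new terminals read ~/.bashrc, which is the default for bash on Ubuntu.

```shell
# Append a line to a file only if that exact line is not already present.
add_line_once() {
  line="$1"; file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Assumption: new terminals read ~/.bashrc (override PROFILE to use another file).
PROFILE="${PROFILE:-$HOME/.bashrc}"
# Single quotes keep $PATH unexpanded so it is evaluated when the file is read.
add_line_once 'export PATH=$PATH:/usr/local/scala/bin' "$PROFILE"
```

The same helper can be called again with /usr/local/spark/bin once Spark is installed. Running it twice is harmless because the line is only added when missing.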
Now install Spark by downloading it from https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
Decompress it using:
tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz
Then move it to /usr/local/spark using the following command:
sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark
Finally, set the PATH variable:
export PATH=$PATH:/usr/local/spark/bin
Now issue the following command to check that the installation was successful:
spark-shell
It will take some time, but you should see some startup messages and screen art showing Spark version 2.2.0, followed by the scala> prompt.
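If any of the commands above reports "command not found", the usual cause is that its folder is missing from PATH. The following is a small sketch of a check that shows where each tool resolves; the helper name check_tool is hypothetical.

```shell
# Report where a tool resolves on PATH, or flag it as missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: $(command -v "$1")"
  else
    echo "$1: NOT FOUND on PATH"
  fi
}

# The three commands used in Part 1.
for tool in java scala spark-shell; do
  check_tool "$tool"
done
```

With the layout used in this exercise, scala and spark-shell should resolve under /usr/local/scala/bin and /usr/local/spark/bin. Running spark-submit --version prints the Spark version banner without opening the interactive shell.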
Part 2: Using Spark to Work with a Dataset
For this exercise, please read Chapter 2 of the textbook and use the dataset available at
http://bit.ly/1Aoywaq.
Using the dataset, complete the following tasks.
1. Please create a raw RDD for all the CSV files
2. Please remove all headers from the RDD
3. Please convert each record in the RDD to a case class record
4. Please sample 20 records from the RDD.
Explanation / Answer
PART 1:
1. Installed VirtualBox and Ubuntu from the sites given.
2. Ran sudo apt-get update and sudo apt-get install default-jre, then checked the installed Java version with java -version.
3. Downloaded Scala into the Downloads folder and decompressed the tgz archive using the following command:
tar -xvzf scala-2.12.3.tgz
4. The archive decompresses into the scala-2.12.3 folder. Moved this folder to /usr/local/scala:
sudo mv scala-2.12.3 /usr/local/scala
5. Added the Scala binaries to the PATH environment variable and tested the installation by checking the version:
export PATH=$PATH:/usr/local/scala/bin
scala -version
6. Downloaded Spark, decompressed it, and moved it to /usr/local/spark:
tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz
sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark
7. Finally, set the PATH variable with export PATH=$PATH:/usr/local/spark/bin and ran spark-shell to check the installation. It started and gave the scala> prompt.
Part 1 is now complete.
PART 2:
I do not know which book or topic "chapter 2" refers to, so I worked directly with the dataset from the link above:
http://bit.ly/1Aoywaq.
1. Created a raw RDD for all the CSV files and removed all headers from the RDD.
2. Converted each record in the RDD to a case class record.
3. Sampled 20 records from the RDD.
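The steps above do not show the statements that were run, so here is a minimal sketch of one way to carry them out: a shell script that saves the Scala statements to a file and pipes them into spark-shell. Everything about the data layout is an assumption, not something confirmed by the question: the CSV files are taken to be unzipped into a directory named linkage (hypothetical, override with DATA_DIR), every header line is assumed to contain id_1, each record is assumed to have 12 comma-separated fields (two ids, nine scores, one true/false flag), and ? is assumed to mark a missing score. The names part2.scala, MatchData, isHeader, toDouble and parse are illustrative.

```shell
# Assumption: the dataset's CSV files were unzipped into ./linkage.
DATA_DIR="${DATA_DIR:-linkage}"
SCRIPT="${SCRIPT:-part2.scala}"
export DATA_DIR   # the Scala script reads this through sys.env

# Quoted heredoc: the shell copies the Scala text verbatim.
cat > "$SCRIPT" <<'EOF'
// Assumed layout: id_1, id_2, nine comparison scores, is_match.
case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

def isHeader(line: String): Boolean = line.contains("id_1")
def toDouble(s: String): Double = if (s == "?") Double.NaN else s.toDouble
def parse(line: String): MatchData = {
  val p = line.split(',')
  MatchData(p(0).toInt, p(1).toInt, p.slice(2, 11).map(toDouble), p(11).toBoolean)
}

// Task 1: raw RDD over all CSV files in the directory
val rawblocks = sc.textFile(sys.env.getOrElse("DATA_DIR", "linkage") + "/*.csv")
// Task 2: drop the header line of every file
val noheader = rawblocks.filter(line => !isHeader(line))
// Task 3: convert each record to a case class instance
val parsed = noheader.map(line => parse(line))
// Task 4: sample 20 records without replacement
val sample = parsed.takeSample(false, 20)
sample.foreach(m => println(Seq(m.id1, m.id2, m.scores.mkString(" "), m.matched).mkString(",")))
EOF

# Run only when Spark and the data are present; otherwise just keep the script.
if command -v spark-shell >/dev/null 2>&1 && [ -d "$DATA_DIR" ]; then
  spark-shell < "$SCRIPT"
else
  echo "Wrote $SCRIPT; run it with: spark-shell < $SCRIPT"
fi
```

Feeding the file through standard input makes spark-shell exit when the script ends. The same statements can also be typed one by one at the scala> prompt. If the real files use a different header or column count, isHeader and parse need to be adjusted to match.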
Part 2 is now complete as well.
As described above, all parts have now been completed.