
Question

Please do everything as described. It is simple, but I need help with it.

Part 1: Spark Setup
In this exercise you will set up an Ubuntu virtual machine and install Spark on it.

Download and install VirtualBox and Ubuntu from the following sites, as we did in class.

https://www.virtualbox.org/wiki/Downloads

https://www.ubuntu.com/download/desktop

Once the installation is complete, you will need to install the latest version of Java. Issue the following commands:

sudo apt-get update

sudo apt-get install default-jre

After the installation is done, check the version using the following command:

java -version

Next, install Scala by downloading https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.tgz (it will be saved in the Downloads folder).

Decompress the tgz archive using the following command:

tar -xvzf scala-2.12.3.tgz

The archive will be decompressed into the scala-2.12.3 folder. Move this folder to /usr/local/scala using the following command:

sudo mv scala-2.12.3 /usr/local/scala

Add the Scala binaries to the PATH environment variable using the following command:

export PATH=$PATH:/usr/local/scala/bin

Test that the installation was successful by checking the version:

scala -version

Now install Spark by downloading it from https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz

Decompress it using

tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz

Move it to the /usr/local/spark folder using the following command:

sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark

Finally, add the Spark binaries to the PATH variable:

export PATH=$PATH:/usr/local/spark/bin

Now issue the following command to check that the installation was successful:

spark-shell

It will take some time, but you should see some messages and ASCII art showing Spark version 2.2.0, followed by the scala> prompt.

Part 2: Using Spark to Work with a Dataset

For this exercise, please read chapter 2 of the textbook and use the dataset available at

http://bit.ly/1Aoywaq.

Using the dataset, complete the following tasks.
1. Please create a raw RDD for all the CSV files
2. Please remove all headers from the RDD
3. Please convert each record in the RDD to a case class record
4. Please sample 20 records from the RDD.

Explanation / Answer

PART 1:

1. Installed VirtualBox and Ubuntu from the sites given.

2. Ran sudo apt-get update and sudo apt-get install default-jre, then checked the installed Java version with java -version.

3. Downloaded the Scala archive into the Downloads folder and decompressed it using the following command:

tar -xvzf scala-2.12.3.tgz

4. The archive decompresses into the scala-2.12.3 folder. Moved this folder to /usr/local/scala with sudo mv scala-2.12.3 /usr/local/scala.

5. Added the Scala binaries to the PATH environment variable with export PATH=$PATH:/usr/local/scala/bin and tested the installation by checking the version with scala -version.

6. Downloaded Spark, decompressed it with tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz, and moved it to the /usr/local/spark folder.

7. Finally, set the path variable with export PATH=$PATH:/usr/local/spark/bin and ran spark-shell to check the installation; it started and gave the scala> prompt. Note that export only affects the current terminal session, so the line has to be added to ~/.bashrc to persist across sessions.
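
Once the scala> prompt appears, a few expressions typed there can confirm what is actually running. The following is a minimal sketch, assuming it is pasted into spark-shell, where the SparkContext `sc` is predefined. The small summing job is only an illustrative smoke test and is not part of the assignment. The prebuilt Spark 2.2.0 package ships with its own Scala 2.11 libraries, so the Scala version printed here may differ from the 2.12.3 installed in steps 3 to 5.

```scala
// Paste at the scala> prompt of spark-shell, which predefines `sc`.

// Versions of the components that the running shell actually uses.
println(s"Spark version: ${sc.version}")
println(s"Scala version: ${scala.util.Properties.versionNumberString}")
println(s"Java version:  ${System.getProperty("java.version")}")

// Illustrative smoke test: distribute the numbers 1 to 100 and sum them,
// which confirms that Spark can schedule and run a job locally.
val total = sc.parallelize(1 to 100).sum()
println(s"Sum of 1 to 100 computed by Spark: $total")
```

If the Spark version line does not show 2.2.0, the spark-shell found on the PATH is probably not the one under /usr/local/spark/bin.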

Part 1 is now complete.

PART 2 :

The question does not say which textbook chapter 2 refers to, so the steps below rely only on the dataset at the link above:

http://bit.ly/1Aoywaq.

1. Created a raw RDD for all the CSV files and removed all headers from the RDD.

2. Converted each record in the RDD to a case class record.

3. Sampled 20 records from the RDD.
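
The steps above can be sketched in spark-shell as follows. This is a hedged illustration rather than a definitive solution, and it rests on several assumptions that are not stated in the question: the path `linkage/*.csv` is a hypothetical location for the unzipped CSV files, every file is assumed to begin with the same header line, and each record is assumed to have 12 comma-separated columns (two integer ids, nine numeric scores where "?" marks a missing value, and a TRUE/FALSE flag). The names `MatchData`, `toDouble`, and `parse` are illustrative; adjust the case class to the real columns if they differ.

```scala
// Paste into spark-shell (for example with :paste), where `sc` is predefined.

// ASSUMPTION: hypothetical directory where the downloaded CSV files were unzipped.
val csvPath = "linkage/*.csv"

// Raw RDD: one String element per line, across every file matching the glob.
val rawRDD = sc.textFile(csvPath)

// Remove headers. ASSUMPTION: each CSV file starts with an identical header
// line, so every line equal to the first one is filtered out, not just one row.
val header = rawRDD.first()
val noHeaderRDD = rawRDD.filter(line => line != header)

// ASSUMPTION: 12 columns per record: id, id, nine scores, match flag.
case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

// Map the missing-value marker "?" to NaN so the scores stay numeric.
def toDouble(s: String): Double = if (s == "?") Double.NaN else s.toDouble

// Split one CSV line and build the case class record from its fields.
def parse(line: String): MatchData = {
  val pieces = line.split(',')
  MatchData(
    id1 = pieces(0).toInt,
    id2 = pieces(1).toInt,
    scores = pieces.slice(2, 11).map(toDouble),
    matched = pieces(11).toBoolean
  )
}

val parsedRDD = noHeaderRDD.map(line => parse(line))

// Sample 20 records without replacement; the fixed seed only makes reruns repeatable.
val sample = parsedRDD.takeSample(withReplacement = false, num = 20, seed = 42L)
sample.foreach { r =>
  println(s"${r.id1},${r.id2},${r.scores.mkString("[", ",", "]")},${r.matched}")
}
```

`takeSample` returns a local array of exactly the requested size when the RDD holds at least that many records, which fits "sample 20 records" better than `sample`, whose fraction argument only yields an approximate count.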

Part 2 is now complete as well, so all parts of the exercise are done.