In this assignment, you are to write a C++ program (using Visual C++ 2015) that
Question
In this assignment, you are to write a C++ program (using Visual C++ 2015) that reads training data in WEKA ARFF format and generates an ID3 decision tree in a format similar to that of the tree generated by Weka ID3. Please note the following:
Your algorithm will use the entire data set to generate the tree. You may assume that the attributes (a) are of nominal type (i.e., no numeric data), and (b) have no missing values.
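Under those two assumptions (nominal attributes only, no missing values), reading the ARFF file can be kept quite simple. The following is a minimal sketch, not a complete ARFF parser: the names ArffData and readArff are hypothetical, quoted attribute names and values are not handled, and any attribute declared without a {...} value list is skipped.

```cpp
#include <algorithm>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for a nominal-only ARFF data set.
struct ArffData {
    std::vector<std::string> attributeNames;              // in declaration order
    std::vector<std::vector<std::string>> nominalValues;  // allowed values per attribute
    std::vector<std::vector<std::string>> rows;           // one vector of values per instance
};

// Removes leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Lower-cases a copy of the string (ARFF keywords are case-insensitive).
static std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Splits on commas and trims each field (assumes no quoted commas).
static std::vector<std::string> splitCsv(const std::string& s) {
    std::vector<std::string> out;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, ',')) out.push_back(trim(item));
    return out;
}

// Reads the header (@attribute lines) and the rows after @data.
ArffData readArff(std::istream& in) {
    ArffData data;
    bool inData = false;
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '%') continue;  // blank line or comment
        if (inData) {
            data.rows.push_back(splitCsv(line));
            continue;
        }
        const std::string lc = lower(line);
        if (lc.rfind("@attribute", 0) == 0) {
            const auto open = line.find('{');
            const auto close = line.rfind('}');
            if (open == std::string::npos || close == std::string::npos || close < open)
                continue;  // not a nominal declaration; skipped in this sketch
            data.attributeNames.push_back(trim(line.substr(10, open - 10)));
            data.nominalValues.push_back(splitCsv(line.substr(open + 1, close - open - 1)));
        } else if (lc.rfind("@data", 0) == 0) {
            inData = true;
        }
    }
    return data;
}
```

A caller would typically open the file with std::ifstream and pass the stream to readArff; by the usual WEKA convention the last attribute is treated as the class.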
In general, the basic ID3 algorithm uses an entropy measure to select the best attribute on which to split the data set. It continues to select attributes for further branching (based on the subset of data belonging to that branch) until either (a) all attributes have been used, or (b) all instances under a node belong to the same class. This ensures a 0% error rate on the training set, although the tree may not perform as well on future data due to over-fitting.
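The attribute-selection step described above can be sketched as follows. This is an illustrative outline rather than a full ID3 implementation: the names Row, entropy, informationGain, and bestAttribute are hypothetical, each instance is assumed to be a vector of nominal values, and classIdx is assumed to be the column holding the class label.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Row = std::vector<std::string>;  // one instance: nominal values, class included

// Shannon entropy (in bits) of the class column over the given rows.
double entropy(const std::vector<Row>& rows, std::size_t classIdx) {
    if (rows.empty()) return 0.0;
    std::map<std::string, std::size_t> counts;
    for (const Row& r : rows) ++counts[r[classIdx]];
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / rows.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting the rows on attribute attrIdx:
// entropy before the split minus the weighted entropy of the subsets.
double informationGain(const std::vector<Row>& rows, std::size_t attrIdx,
                       std::size_t classIdx) {
    std::map<std::string, std::vector<Row>> parts;
    for (const Row& r : rows) parts[r[attrIdx]].push_back(r);
    double remainder = 0.0;
    for (const auto& kv : parts) {
        const double weight = static_cast<double>(kv.second.size()) / rows.size();
        remainder += weight * entropy(kv.second, classIdx);
    }
    return entropy(rows, classIdx) - remainder;
}

// Index of the unused attribute with the highest gain, or -1 if none remain.
int bestAttribute(const std::vector<Row>& rows, const std::vector<bool>& used,
                  std::size_t classIdx) {
    int best = -1;
    double bestGain = -1.0;
    for (std::size_t a = 0; a < used.size(); ++a) {
        if (a == classIdx || used[a]) continue;
        const double g = informationGain(rows, a, classIdx);
        if (g > bestGain) {
            bestGain = g;
            best = static_cast<int>(a);
        }
    }
    return best;
}
```

A recursive tree builder would call bestAttribute at each node, mark the chosen attribute as used, partition the rows by its values, and stop when bestAttribute returns -1 or the entropy of the node is zero, which correspond to stopping conditions (a) and (b).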
Explanation / Answer
In this assignment, you will use the WEKA system to analyze two artificial data sets and one real data set. You will apply five learning algorithms to each data set and compare their performance. I have included a section at the end that describes how to get WEKA and how to run it from the GUI or from the command line.
You will run the five learning algorithms on each training data file and evaluate the results on the corresponding test data files.
TURN IN:
You should turn in the top 50 lines of your statlog.arff and statlog_test.arff files.
For each classifier and each problem domain, you should learn using each of the training files (e.g., hw_step_10.arff) and test the learned model on the given test file (e.g., hw_step_test.arff). Record the accuracy of the learned model and report it in a table and a graph as specified in (a) and (b). See the end of the homework for how to do these runs and obtain the accuracies. I suggest you use the command line to do these runs in batch.
TURN IN:
A table in the following format:
For gnuplot, you need to create a separate file for each learner. Each file should consist of x,y pairs, where x is the training set size and y is the accuracy. You can then plot these files using the plot command.
For Excel, you can plot the graphs from the table above, using the chart wizard to draw them.
TURN IN:
(i, 10 points) Plot of the data points for hw_gmm_25 with lines showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm.
(ii, 10 points) Plot of the data points for hw_step_50 with a line showing the learned decision boundary for Logistic Regression.
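For both plots, the boundary line can be derived from the learned coefficients. The derivation below is a sketch under assumed notation: w_0 is the intercept and w_1, w_2 are the coefficients of the two input features x and y (these symbols are not from the assignment, and the coefficient values must be read from the Logistic Regression output). The boundary is where the predicted class probability equals 0.5.

```latex
P(\text{class} = 1 \mid x, y) = \frac{1}{1 + e^{-(w_0 + w_1 x + w_2 y)}}

P = \tfrac{1}{2}
\;\Longleftrightarrow\;
w_0 + w_1 x + w_2 y = 0
\;\Longleftrightarrow\;
y = -\frac{w_0 + w_1 x}{w_2} \qquad (w_2 \neq 0)
```

Negating all three coefficients leaves this line unchanged, so the plotted boundary does not depend on which class the output treats as the positive one. If w_2 = 0, the boundary is instead the vertical line x = -w_0 / w_1.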
Now, let us consider the hw_gmm_250 and hw_step_250 training sets and the kind of decision boundaries found by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format: