Question
I would really appreciate if I can get some help on this.
Write a class called SourceModel with the following constructors and methods:
A single constructor with two String parameters, where the first parameter is the name of the source model and the second is the file name of the corpus file for the model. The constructor should create a letter-letter transition matrix using this recommended algorithm sketch:
Initialize a 26x26 matrix for character counts
Print “Training {name} model … ”
Read the corpus file one character at a time, converting all characters to lower case and ignoring any non-alphabetic character.
For each character, increment the corresponding (row, col) in your counts matrix. The row is for the previous character; the col is for the current character. (You could also think of this in terms of bigrams.)
After you read the entire corpus file, you’ll have a matrix of counts.
From the matrix of counts, create a matrix of probabilities – each row of the transition matrix is a probability distribution.
The probabilities in a distribution must sum to 1. To turn counts into probabilities, divide each count by the sum of all the counts in its row.
Print “done.” followed by a newline character.
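The counting and row-normalization steps above can be sketched as follows. This is only an illustration under stated assumptions: it works on an in-memory String instead of a corpus file, and the names TransitionSketch, countBigrams and toProbabilities are made up for the example, not part of the assignment.

```java
public class TransitionSketch {
    // Count letter-to-letter transitions in text, ignoring anything outside a-z.
    static int[][] countBigrams(String text) {
        int[][] counts = new int[26][26];
        int prev = -1; // index of the previous letter; -1 until the first letter is seen
        for (char raw : text.toCharArray()) {
            char ch = Character.toLowerCase(raw);
            if (ch < 'a' || ch > 'z') {
                continue; // skip digits, punctuation, whitespace, non-Latin letters
            }
            int cur = ch - 'a';
            if (prev >= 0) {
                counts[prev][cur]++; // row = previous letter, col = current letter
            }
            prev = cur;
        }
        return counts;
    }

    // Divide each count by its row total so every non-empty row sums to 1.
    static double[][] toProbabilities(int[][] counts) {
        double[][] probs = new double[26][26];
        for (int r = 0; r < 26; r++) {
            int rowTotal = 0;
            for (int c = 0; c < 26; c++) {
                rowTotal += counts[r][c];
            }
            if (rowTotal == 0) {
                continue; // letter never appeared as a "previous" letter; row stays all zeros
            }
            for (int c = 0; c < 26; c++) {
                probs[r][c] = (double) counts[r][c] / rowTotal;
            }
        }
        return probs;
    }

    public static void main(String[] args) {
        // Letters are a b a b a c, so row 'a' has counts b=2, c=1.
        double[][] p = toProbabilities(countBigrams("Abab, ac!"));
        System.out.println("P(a->b) = " + p[0][1]);
        System.out.println("P(a->c) = " + p[0][2]);
    }
}
```

How to treat a row with no counts at all is left open by the sketch above; here it simply stays all zeros.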
A getName method with no parameters which returns the name of the SourceModel.
A toString method which returns a String representation of the model like the one shown below under Running Your Program in jshell.
A probability method which takes a String and returns a double which indicates the probability that the test string was generated by the source model, using the transition probability matrix created in the constructor. Here’s a recommended algorithm:
Initialize the probability to 1.0
For each two-character sequence in the test string test, c_i and c_{i+1} for i = 0 up to (but not including) test.length() - 1, multiply the probability by the entry in the transition probability matrix for the c_i to c_{i+1} transition, which is found in row c_i and column c_{i+1} of the matrix. (You could also think of the indices as c_{i-1}, c_i for i = 1 to test.length() - 1.)
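As a rough illustration of the product described above, here is a small sketch. The class and method names are hypothetical, the matrix is filled in by hand rather than trained, and skipping non-letters in the test string is an assumption the recommended algorithm does not spell out.

```java
public class SequenceProbabilitySketch {
    // Multiply the transition probabilities of each adjacent letter pair in test.
    static double probability(double[][] matrix, String test) {
        double p = 1.0;
        int prev = -1; // index of the previous letter; -1 until the first letter is seen
        for (char raw : test.toCharArray()) {
            char ch = Character.toLowerCase(raw);
            if (ch < 'a' || ch > 'z') {
                continue; // assumption: non-letters in the test string are ignored
            }
            int cur = ch - 'a';
            if (prev >= 0) {
                p *= matrix[prev][cur]; // row = previous letter, col = current letter
            }
            prev = cur;
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] m = new double[26][26];
        m[0][1] = 0.5;  // a -> b
        m[1][0] = 0.25; // b -> a
        // "aba" has the transitions a->b and b->a, so the result is 0.5 * 0.25
        System.out.println(probability(m, "aba"));
    }
}
```

A string with fewer than two letters has no transitions, so this sketch returns 1.0 for it.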
A main method that makes SourceModel runnable from the command line. Your program should take 1 or more corpus file names as command line arguments followed by a quoted string as the last argument. The program should create models for all the corpora and test the string with all the corpora. Here’s an algorithm sketch:
The first n-1 arguments to the program are corpus file names to use to train models. Corpus files are of the form .corpus
The last argument to the program is a quoted string to test.
Create a SourceModel object for each corpus
Use the models to compute the probability that the test text was produced by the model
Probabilities will be very small. Normalize the probabilities of all the model predictions to a probability distribution (so they sum to 1) (closed-world assumption – we only state probabilities relative to models we have).
Print results of analysis
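The normalization step can be sketched like this. The class name NormalizeSketch and the input values are invented for illustration, and returning all zeros when every raw score is 0 is just one possible choice, not something the assignment prescribes.

```java
public class NormalizeSketch {
    // Scale raw model scores so they sum to 1 (closed-world assumption).
    static double[] normalize(double[] raw) {
        double sum = 0.0;
        for (double v : raw) {
            sum += v;
        }
        double[] out = new double[raw.length];
        if (sum == 0.0) {
            return out; // every score was 0 (or underflowed); avoid dividing by zero
        }
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] / sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two tiny raw probabilities in a 3:1 ratio
        double[] n = normalize(new double[] {3e-12, 1e-12});
        System.out.println(n[0] + " " + n[1]);
    }
}
```

Only the ratio between the raw scores matters after normalization, which is why very small products can still give a clear ranking.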
Running Your Program
Sample runs from the command line:
Sample runs from jshell:
Refer to Oracle’s tutorial on reading a file one character at a time: https://docs.oracle.com/javase/tutorial/essential/io/charstreams.html
FileReader’s read method returns int. You’ll probably want to cast these to chars. That’s fine. As the documentation says, the lower 16 bits are the Unicode code point for a character.
If you use String.split to get corpus names from file names, remember that . is a special regex character. Use a character class to match a literal . character. For example "foo.fighters".split("[.]") is ["foo", "fighters"].
char is an integral type, so you can easily find a char’s offset from 'a' with an expression like ch - 'a', where ch is a char variable.
The Character class has many static utility methods you will find useful, like isAlphabetic, toLowerCase.
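The hints above can be combined into a short sketch. ReadCharsSketch, letterOffsets and modelName are hypothetical names, and the temporary file exists only so the example runs on its own. The sketch checks the a-z range directly instead of calling Character.isAlphabetic, because isAlphabetic also accepts letters such as 'é' that would not fit a 26x26 matrix.

```java
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadCharsSketch {
    // Read a file one character at a time and list the a-z offsets of its letters.
    static String letterOffsets(String fileName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (FileReader in = new FileReader(fileName)) {
            int c;
            while ((c = in.read()) != -1) { // read() returns -1 at end of file
                char ch = Character.toLowerCase((char) c);
                if (ch >= 'a' && ch <= 'z') {
                    sb.append(ch - 'a').append(' '); // offset from 'a', 0 to 25
                }
            }
        }
        return sb.toString().trim();
    }

    // Take the part of the file name before the first literal '.'.
    static String modelName(String fileName) {
        return fileName.split("[.]")[0];
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".corpus");
        Files.write(tmp, "Ab, z!".getBytes(StandardCharsets.US_ASCII));
        System.out.println(letterOffsets(tmp.toString())); // offsets of a, b, z
        System.out.println(modelName("english.corpus"));
        Files.delete(tmp);
    }
}
```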
Corpus files:
https://drive.google.com/drive/folders/18Xa784tmnQFqz_yGeRlfjgA-u0BHCz-z?usp=sharing
Explanation / Answer
import java.util.Scanner;
import java.io.File;
import java.io.FileNotFoundException;
public class SourceModel {
    // Fields shared by the constructor and the instance methods
    private String modelName;
    private int[][] characterCount;
    private double[] rowCount;
    private double[][] probability;
    /**
     * Trains a letter-to-letter transition model from a corpus file.
     * @param name the name of the source model
     * @param fileName the file name of the corpus file
     */
    public SourceModel(String name, String fileName) {
        modelName = name;
        characterCount = new int[26][26];
        rowCount = new double[26];
        probability = new double[26][26];
        System.out.print("Training " + name + " model ... ");
        // try-with-resources closes the Scanner even if reading fails
        try (Scanner scanner = new Scanner(new File(fileName))) {
            // append all of the text (Scanner drops the whitespace between tokens)
            StringBuilder text = new StringBuilder();
            while (scanner.hasNext()) {
                text.append(scanner.next());
            }
            // keep only the letters and make them lowercase
            String temp = text.toString().replaceAll("[^A-Za-z]+", "").toLowerCase();
            // count each (previous letter, current letter) pair
            for (int i = 0; i < temp.length() - 1; i++) {
                int row = temp.charAt(i) - 'a';
                int col = temp.charAt(i + 1) - 'a';
                characterCount[row][col]++;
                rowCount[row]++;
            }
            // divide each count by the total count of its row
            for (int i = 0; i < probability.length; i++) {
                for (int j = 0; j < probability[i].length; j++) {
                    if (rowCount[i] > 0) {
                        probability[i][j] = characterCount[i][j] / rowCount[i];
                    }
                    // floor zero entries at 0.01 so one unseen pair does not zero out
                    // a whole product; rows then no longer sum to exactly 1
                    if (probability[i][j] == 0) {
                        probability[i][j] = 0.01;
                    }
                }
            }
            System.out.println("done.");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
    /**
     * @return the name of this source model
     */
    public String getName() {
        return modelName;
    }
    /**
     * @return a string with the transition matrix, one row per letter
     */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        // header row: the column letters a-z in fixed-width cells
        sb.append(" ");
        for (char c = 'a'; c <= 'z'; c++) {
            sb.append(String.format("%5s", c));
        }
        sb.append("\n");
        // one line per row letter, followed by its 26 probabilities
        for (int i = 0; i < probability.length; i++) {
            sb.append((char) (i + 'a'));
            for (int j = 0; j < probability[i].length; j++) {
                sb.append(String.format(" %.2f", probability[i][j]));
            }
            sb.append("\n");
        }
        return sb.toString();
    }
    /**
     * @param test the string to score; non-letters are ignored
     * @return the probability that this model generated the string
     */
    public double probability(String test) {
        test = test.replaceAll("[^A-Za-z]+", "").toLowerCase();
        double stringProbability = 1.0;
        // multiply the transition probability of every adjacent letter pair
        for (int i = 0; i < test.length() - 1; i++) {
            int firstIndex = test.charAt(i) - 'a';
            int secondIndex = test.charAt(i + 1) - 'a';
            stringProbability *= probability[firstIndex][secondIndex];
        }
        return stringProbability;
    }
    /**
     * @param args corpus file names followed by the quoted test string
     */
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: java SourceModel <corpus files...> \"test string\"");
            return;
        }
        String test = args[args.length - 1];
        SourceModel[] models = new SourceModel[args.length - 1];
        for (int i = 0; i < models.length; i++) {
            // the model name is the file name up to the first '.'
            int dot = args[i].indexOf('.');
            String name = dot >= 0 ? args[i].substring(0, dot) : args[i];
            models[i] = new SourceModel(name, args[i]);
        }
        System.out.println("Analyzing: " + test);
        // score the test string once per model and total the scores
        double[] normalizedProbability = new double[models.length];
        double sumProbability = 0;
        for (int i = 0; i < models.length; i++) {
            normalizedProbability[i] = models[i].probability(test);
            sumProbability += normalizedProbability[i];
        }
        // normalize so the scores sum to 1 across the models we have
        for (int i = 0; i < normalizedProbability.length; i++) {
            normalizedProbability[i] /= sumProbability;
        }
        int highestIndex = 0;
        for (int i = 0; i < models.length; i++) {
            System.out.print("Probability that test string is");
            System.out.printf("%9s: ", models[i].getName());
            System.out.printf("%.2f%n", normalizedProbability[i]);
            if (normalizedProbability[i]
                    > normalizedProbability[highestIndex]) {
                highestIndex = i;
            }
        }
        System.out.println("Test string is most likely "
                + models[highestIndex].getName() + ".");
    }
}