
Mini-Project 4: Data Science & Information Theory INSTRUCTOR: DANIEL L. PIMENTEL-ALARCÓN

ID: 3724165 • Letter: M

Question

Mini-Project 4: Data Science & Information Theory
INSTRUCTOR: DANIEL L. PIMENTEL-ALARCÓN
DUE 03/19/2018

One fundamental aspect of data science is quantifying how much information data contains. This is one of the main questions addressed by information theory [1]. In this mini-project you will use information theory to determine whether an image really is worth more than a thousand words. To this end, we will compute the entropy (a measure of information) of an entire book (containing more than 1,000 words) and compare it against the entropy of an image.

(a) Download the text of a book of your choice. For example, I downloaded The Count of Monte Cristo, by Alexandre Dumas. Load your file, and compute the number of times that each character x_i appears. Hint: you can do this in three lines of code, using the Matlab functions fileread and hist.

(b) Estimate the probability P(x_i) of each character.

(c) Compute the entropy of a character as:

    H = -Σ_i P(x_i) log2 P(x_i)

(d) Your result from (c) tells you the information encoded in each character of the book. Now multiply this by the number of characters in the book to obtain the overall entropy of the book.

Now we will compute the entropy of an image.

(e) Download an image of your choice. For example, I downloaded the image in Figure 4.1. Load your image, and compute the number of times that each pixel intensity x_i appears. Hint: you can also do this in three lines of code.

Figure 4.1: Image containing an illustration from The Count of Monte Cristo, depicting Edmond Dantès being thrown into the sea from the Château d'If.

Explanation / Answer

MAKE SURE YOU HAVE A TEXT FILE TITLED "book.txt" AND AN IMAGE TITLED "picture.ppm" in the same folder as the code before running it.
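
Part (c) of the question is the Shannon entropy H = -Σ_i P(x_i) log2 P(x_i), measured in bits per symbol. As a quick sanity check of the formula (not part of the assignment code, and using a made-up toy distribution), it can be evaluated directly:

% Quick numeric check of the entropy formula from part (c) (illustrative only).
p = [0.5 0.25 0.25];               % toy source with three symbols
H = -sum(p .* log2(p));            % = 1.5 bits per symbol
fprintf('toy entropy = %.2f bits per symbol\n', H);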

% CODE FOR FIRST QUESTION

% Read the book as one long character array (assumes plain ASCII/8-bit text).
file = fileread('book.txt');
len  = length(file);

% One bin per possible 8-bit character code (index 1..256 for codes 0..255).
Count = zeros(1, 256);
for i = 1:len
    c = double(file(i)) + 1;        % character code 0..255 -> index 1..256
    Count(c) = Count(c) + 1;
end

% Estimate P(x_i) and the per-character entropy term -P(x_i)*log2(P(x_i)).
prob    = Count / len;
entropy = zeros(1, 256);
char_entropy = 0;                   % entropy of a single character, in bits
for i = 1:256
    if prob(i) ~= 0                 % MATLAB uses ~= (not !=) for inequality
        entropy(i) = -prob(i) * log2(prob(i));
    end
    char_entropy = char_entropy + entropy(i);
    if Count(i) > 0
        fprintf('%s prob=%f ent=%f\n', char(i-1), prob(i), entropy(i));
    end
end

% Part (d): total entropy of the book = (bits per character) * (number of characters).
total_entropy = char_entropy * len;
fprintf('entropy per character = %f bits\n', char_entropy);
fprintf('total entropy of the book = ');
disp(total_entropy);
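
The question's hint says the counting can be done in about three lines with fileread and hist; the loop above is equivalent but more explicit. A compact, vectorized cross-check (a sketch assuming the same "book.txt" file) might look like this:

% Vectorized cross-check of the book entropy (illustrative only).
txt   = fileread('book.txt');
Count = hist(double(txt), 0:255);    % counts per character code (histcounts(double(txt), 0:256) is the newer equivalent)
p = Count / numel(txt);  p = p(p > 0);
H_char = -sum(p .* log2(p));         % bits per character
book_entropy = H_char * numel(txt);  % total bits in the book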

% CODE FOR SECOND QUESTION

% Read the image and treat every 8-bit sample (all colour channels) as a symbol.
img = imread('picture.ppm');        % avoid shadowing the built-in "image" function
[rows, cols, channels] = size(img);
total_pixels = rows * cols * channels;

% One bin per possible 8-bit intensity (index 1..256 for intensities 0..255).
Count = zeros(1, 256);
for i = 1:rows
    for j = 1:cols
        for k = 1:channels
            v = double(img(i, j, k)) + 1;   % intensity 0..255 -> index 1..256
            Count(v) = Count(v) + 1;
        end
    end
end

% Estimate P(x_i) and the per-pixel entropy term -P(x_i)*log2(P(x_i)).
prob    = Count / total_pixels;
entropy = zeros(1, 256);
pixel_entropy = 0;                  % entropy of a single pixel value, in bits
for i = 1:256
    if prob(i) ~= 0
        entropy(i) = -prob(i) * log2(prob(i));
    end
    pixel_entropy = pixel_entropy + entropy(i);
    if Count(i) > 0
        fprintf('pixel_value=%d prob=%f ent=%f\n', i-1, prob(i), entropy(i));
    end
end

% Total entropy of the image = (bits per pixel value) * (number of pixel values).
total_entropy = pixel_entropy * total_pixels;
fprintf('entropy per pixel = %f bits\n', pixel_entropy);
fprintf('total entropy of the image = ');
disp(total_entropy);
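
As with the book, the result can be cross-checked with a short vectorized version (a sketch assuming the same "picture.ppm"); for a grayscale image the Image Processing Toolbox function imhist would also work, but plain hist over all samples avoids that dependency:

% Vectorized cross-check of the image entropy (illustrative only).
img   = imread('picture.ppm');
Count = hist(double(img(:)), 0:255); % counts per intensity over all channels
p = Count / numel(img);  p = p(p > 0);
H_pixel = -sum(p .* log2(p));        % bits per pixel value
image_entropy = H_pixel * numel(img);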

The BOOK ends up with more total information, mainly because it contains far more symbols (characters) than the image has pixel values.

This is not an entirely fair comparison, since the total entropy depends largely on the number of symbols; comparing the entropy per character against the entropy per pixel would be more meaningful.
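
One way to make the comparison fairer (an extra step beyond what the assignment asks for) is to put both sources on the same footing, e.g. bits per symbol. A minimal sketch, assuming the per-symbol entropies char_entropy and pixel_entropy computed by the scripts above:

% Compare the two sources per symbol rather than in total (illustrative only).
fprintf('book : %.3f bits per character\n', char_entropy);
fprintf('image: %.3f bits per pixel value\n', pixel_entropy);
if pixel_entropy > char_entropy
    fprintf('Each pixel value carries more information than each character.\n');
else
    fprintf('Each character carries more information than each pixel value.\n');
end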