Question
Answer in C language. runnable on linux
0. Introduction
In this assignment you will practice using the file system API (as well as pointers in different data structures). In particular you will be creating, opening, reading, writing, and deleting files. Your task is to write an indexing program, called an indexer. Given a set of files, an indexer will parse the files and create an inverted index, which maps each token found in the files to the subset of files that contain that token. In your indexer, you will also maintain the frequency with which each token appears in each file. The indexer should tokenize the files and produce an inverted index of how many times the word occurred in each file, sorted by word. Your output should be in the following format:
count0 count1
count2 count3 count4
The above depiction gives a logical view of the inverted index. In your program, you have to define data structures to hold the mappings (token to list) and the records (file name, count).
An inverted index is a sequence of mappings where each mapping maps a token (e.g., “dog”) to a list of records, with each record containing the name of a file whose content contains the token and the frequency with which the token appears in that file.
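As a rough illustration of the mappings and records described above, here is a minimal sketch in C. The names `Record`, `Mapping`, `copy_string`, and `record_hit` are hypothetical, and a pair of linked lists is only one possible layout; the assignment leaves the choice of data structure open.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One record: a file name and how often the token occurs in that file. */
typedef struct Record {
    char *file;
    int count;
    struct Record *next;
} Record;

/* One mapping: a token and the list of records for it. */
typedef struct Mapping {
    char *token;
    Record *records;
    struct Mapping *next;
} Mapping;

/* Copy a string to the heap; returns NULL if malloc fails. */
static char *copy_string(const char *s)
{
    char *p = malloc(strlen(s) + 1);
    if (p != NULL)
        strcpy(p, s);
    return p;
}

/* Count one occurrence of token in file. Returns the updated count, or -1
 * on allocation failure. New mappings and records are pushed at the front,
 * so the lists are unsorted until a later sorting pass. */
int record_hit(Mapping **index, const char *token, const char *file)
{
    assert(index != NULL && token != NULL && file != NULL);
    Mapping *m = *index;
    while (m != NULL && strcmp(m->token, token) != 0)
        m = m->next;
    if (m == NULL) {
        m = malloc(sizeof *m);
        if (m == NULL)
            return -1;
        m->token = copy_string(token);
        if (m->token == NULL) { free(m); return -1; }
        m->records = NULL;
        m->next = *index;
        *index = m;
    }
    Record *r = m->records;
    while (r != NULL && strcmp(r->file, file) != 0)
        r = r->next;
    if (r == NULL) {
        r = malloc(sizeof *r);
        if (r == NULL)
            return -1;
        r->file = copy_string(file);
        if (r->file == NULL) { free(r); return -1; }
        r->count = 0;
        r->next = m->records;
        m->records = r;
    }
    return ++r->count;
}
```

Calling `record_hit(&index, "named", "boo")` twice leaves a single record for `boo` with a count of 2. The linear searches keep the sketch short; a hash table would make lookups faster on large inputs.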
Here is an example of how the indexer should work. If you are given the following set of files:
File Path          File Content
/adir/boo          A dog named named Boo
/adir/baa          A cat named Baa
/adir/bdir/baa     Cat cat
Your indexer should output:
1
1
1
1
3
1
2
1
The inverted index file that your indexer writes must follow the XML format defined above. Words must be sorted in alphanumeric order. All characters of a word should first be converted to lowercase before the word is counted. Your output should print with the lists arranged in alphanumeric order (a to z, 0 to 9) of the tokens. The filenames in your output should be in descending order by frequency count (highest frequency to lowest frequency). If a word has the same frequency in two or more files, order those files by path name alphanumerically (a to z, 0 to 9).
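The record ordering described above (frequency descending, ties broken by name) can be sketched as a `qsort` comparator. This is only an illustration: the `FileCount` type and the function names are hypothetical, and it assumes the records for one token have been gathered into an array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: a file name plus the token's frequency in it. */
typedef struct {
    const char *file;
    int count;
} FileCount;

/* qsort comparator: higher count first, equal counts ordered by name. */
int compare_records(const void *lhs, const void *rhs)
{
    const FileCount *a = lhs;
    const FileCount *b = rhs;
    if (a->count != b->count)
        return (a->count < b->count) ? 1 : -1;
    return strcmp(a->file, b->file);
}

/* Sort n records in place into the output order. */
void sort_records(FileCount *records, size_t n)
{
    assert(records != NULL || n == 0);
    qsort(records, n, sizeof records[0], compare_records);
}
```

Note that `strcmp` compares byte values, so digits sort before letters. If "a to z, 0 to 9" is meant to put letters ahead of digits, the tie-break would need a custom character comparison instead.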
After constructing the entire inverted index in memory, the indexer will save it to a file.
2. Implementation
Your program must implement the following command-line interface: invertedIndex, followed by two arguments.
The first argument gives the name of a file that you should create to hold your inverted index. The second argument gives the name of the directory or file that your indexer should index. If the second argument is a directory, you need to recursively index all files in the directory (and its sub-directories). If the second argument is a file, you just need to index that single file.
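One way to handle "file or directory" uniformly is to `stat` the path first and recurse only for directories. The sketch below is a hedged example rather than a required design; `walk_path`, `file_visitor`, and `count_file` are hypothetical names, and the visitor callback stands in for whatever per-file indexing you implement.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Callback type: invoked once for every regular file that is found. */
typedef void (*file_visitor)(const char *path, void *ctx);

/* Example visitor: counts files; ctx must point to an int. */
void count_file(const char *path, void *ctx)
{
    (void)path;
    ++*(int *)ctx;
}

/* Visit path if it is a regular file, or every regular file beneath it if
 * it is a directory. Returns 0 on success, -1 if something was unreadable. */
int walk_path(const char *path, file_visitor visit, void *ctx)
{
    assert(path != NULL && visit != NULL);
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;                      /* missing or inaccessible */
    if (S_ISREG(st.st_mode)) {
        visit(path, ctx);
        return 0;
    }
    if (!S_ISDIR(st.st_mode))
        return 0;                       /* skip sockets, devices, ... */

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;
    int status = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;
        size_t len = strlen(path) + strlen(entry->d_name) + 2;
        char *child = malloc(len);
        if (child == NULL) {
            status = -1;
            break;
        }
        snprintf(child, len, "%s/%s", path, entry->d_name);
        if (walk_path(child, visit, ctx) != 0)
            status = -1;
        free(child);
    }
    closedir(dir);
    return status;
}
```

Because `stat` follows symbolic links, a link that points back up the tree would be followed too; using `lstat` and skipping links is one way to avoid that.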
When indexing files in a directory, you may have files that have the same name in separate directories. You should combine all token frequencies for the same filename regardless of which directory it appears in. We define tokens as any sequence of consecutive alphanumeric characters (a-z, A-Z, 0-9) starting with an alphabetic character.
Examples of tokens according to the above definition include:
a, aba, c123
If a file contains
This an$example12 mail@rutgers
it should tokenize to
this an example12 mail rutgers
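The token rule above can be sketched as a small scanner. This is a minimal example under one assumption that the assignment does not spell out: a run that begins with digits (such as `12abc`) is taken to yield the token `abc`, because a token must start with a letter. The function name `next_token` is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copy the next token at or after *cursor into out, lowercased and
 * truncated to out_size - 1 characters, and advance *cursor past it.
 * Returns 1 if a token was found, 0 once the text is exhausted. */
int next_token(const char **cursor, char *out, size_t out_size)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL && out_size > 0);
    const char *p = *cursor;
    /* A token must start with a letter, so skip digits and punctuation. */
    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        return 0;
    }
    size_t n = 0;
    /* After the first letter, letters and digits both extend the token. */
    while (isalnum((unsigned char)*p)) {
        if (n + 1 < out_size)
            out[n++] = (char)tolower((unsigned char)*p);
        p++;
    }
    out[n] = '\0';
    *cursor = p;
    return 1;
}
```

Calling it in a loop on `This an$example12 mail@rutgers` yields `this`, `an`, `example12`, `mail`, and `rutgers`, matching the example above.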
The XML format lets us easily read the inverted index for debugging. You should carefully consider how the program may break and code it robustly. You should outline and implement a strategy to deal with potential problems and inconsistencies. For example, if a file already exists with the same name as the inverted-index file name, you should ask the user if they wish to overwrite it. If the name of the directory or file you want to index does not exist, your indexer should print an error message and exit gracefully rather than crash. There are many other error cases that you ought to consider.
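The two error cases named above (an existing output file, a missing input path) could be checked along these lines. This is a sketch, not a complete error-handling strategy; `may_write` and `can_index` are hypothetical helper names, and the streams are passed in only so the prompt can be exercised without a terminal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if it is fine to write to path: either nothing is visible
 * there yet, or the reply read from `in` starts with 'y' or 'Y'.
 * Returns 0 if the user declines or no reply can be read. */
int may_write(const char *path, FILE *in, FILE *out)
{
    assert(path != NULL && in != NULL && out != NULL);
    if (access(path, F_OK) != 0)
        return 1;
    fprintf(out, "%s already exists. Overwrite? (y/n): ", path);
    fflush(out);
    int c = fgetc(in);
    return c != EOF && tolower(c) == 'y';
}

/* Returns 1 if path exists and is readable; otherwise prints a message
 * to err and returns 0 so the caller can exit cleanly. */
int can_index(const char *path, FILE *err)
{
    assert(path != NULL && err != NULL);
    if (access(path, R_OK) == 0)
        return 1;
    fprintf(err, "Cannot index %s: not found or not readable\n", path);
    return 0;
}
```

In `main`, these would typically be called as `may_write(argv[1], stdin, stdout)` and `can_index(argv[2], stderr)` before any indexing starts, returning a non-zero exit status if either check fails.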
3. Hints
Data structures that might be useful include a list that sorts as you insert and/or a hash table.
A custom record type (e.g., a record {"baa" : 3}) that can be inserted into multiple data structures, such as a sorted list and/or a hash table.
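A "list that sorts as you insert" can be sketched with a pointer-to-pointer walk, which avoids special-casing the head of the list. The `Node` type and `sorted_insert` name below are hypothetical, and the ordering is plain `strcmp` order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    char *token;
    int count;
    struct Node *next;
} Node;

/* Insert token into a list kept in strcmp order, or bump its count if it
 * is already present. Returns the node, or NULL on allocation failure. */
Node *sorted_insert(Node **head, const char *token)
{
    assert(head != NULL && token != NULL);
    Node **link = head;
    /* Advance to the first node that does not sort before token. */
    while (*link != NULL && strcmp((*link)->token, token) < 0)
        link = &(*link)->next;
    if (*link != NULL && strcmp((*link)->token, token) == 0) {
        (*link)->count++;
        return *link;
    }
    Node *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    node->token = malloc(strlen(token) + 1);
    if (node->token == NULL) {
        free(node);
        return NULL;
    }
    strcpy(node->token, token);
    node->count = 1;
    node->next = *link;
    *link = node;
    return node;
}
```

Each insert is linear in the list length, so this suits small inputs; for large inputs, a hash table for counting followed by a single sort is usually faster.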
You should probably approach this in steps:
o First, build a simple tokenizer to parse tokens from a file.
o Next, get your program to walk through a directory.
o Next, implement a data structure that allows you to count the number of occurrences of each unique token in a file.
And so on ...
Explanation / Answer
indexer.c
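#include "headers.h" //not shown here; it needs <stdio.h>, <stdlib.h>, <string.h>, <ctype.h>, <errno.h>, <dirent.h>, <unistd.h> and the project headers below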
int indexFiles ( filePathPtr path, fileList **filePaths, fileList **dirPaths )
{
if ( path == NULL )
{
return 1;
}
if ( strlen( path->string ) > 2047)
{
printf("\nFile Path Length is Too Long.\n");
exit(1);
}
DIR *dir;
struct dirent *entry;
int count = 0;
//file
if ( !( dir = opendir( path->string ) ) )
{
if ( (access ( path->string, F_OK ) == -1) || (access ( path->string, R_OK ) == -1))
{
printf("\nUnable to Open File: %s\n", path->string);
//opendir() failed, so dir is NULL here and must not be passed to closedir()
exit(1);
}
{
if( (*filePaths) == NULL )
{
*filePaths = malloc(sizeof(fileList));
(*filePaths)->path = malloc(sizeof(filePath));
(*filePaths)->path->string = malloc(strlen(path->string) + 1 );
strcpy( (*filePaths)->path->string, path->string );
(*filePaths)->next = NULL;
}
else
{
fileList *newFile;
newFile = (*filePaths);
while(newFile->next)
{
newFile = newFile->next;
}
newFile->next = malloc(sizeof(fileList));
newFile->next->path = malloc(sizeof(filePath));
newFile->next->path->string = malloc(strlen(path->string) + 1 );
strcpy( newFile->next->path->string, path->string );
newFile->next->next = NULL;
}
}
return 0;
}
//error
if ( !(entry = readdir( dir ) ) )
{
closedir( dir );
return 1;
}
//directory recursive retrieval
do
{
if ( (access ( path->string, F_OK ) == -1) || (access ( path->string, R_OK ) == -1))
{
closedir( dir );
exit(1);
return 1;
}
//dir
if ( entry->d_type == DT_DIR )
{
//ignore ".." and "." listings
if( ( strcmp( entry->d_name, "." ) == 0 ) || ( strcmp( entry->d_name, ".." ) ) == 0 )
{
continue;
}
//build the sub-directory path; the buffer covers the 2047-byte path limit plus one entry name
char newString[ 2048 + 256 + 2 ];
int len = snprintf ( newString, sizeof(newString), "%s/%s", path->string, entry->d_name );
if ( len < 0 || (size_t)len >= sizeof(newString) )
{
continue; //path would be truncated, skip this sub-directory
}
/// [MALLOC]
fileList *newDir = malloc(sizeof(fileList));
newDir->path = stringStructCreate( newString );
newDir->next = NULL;
//append to the end of dirPaths, or start the list if it is empty
if((*dirPaths) == NULL)
{
*dirPaths = newDir;
}
else
{
fileList *last = (*dirPaths);
while(last->next)
{
last = last->next;
}
last->next = newDir;
}
//recurse into the sub-directory only; the current directory keeps being scanned by this loop
indexFiles( newDir->path, filePaths, dirPaths );
}
//file
else
{
//ignore temp files
if( ( entry->d_name[ strlen( entry->d_name ) - 1 ] != '~' ) )
{
/// [MALLOC]
char *newString = ( char* )malloc( strlen( path->string ) + strlen( entry->d_name ) + 2 );
strcpy( newString, path->string );
strcat( newString, "/" );
strcat( newString, entry->d_name ); //strcat leaves newString NUL-terminated
if((*filePaths) == NULL)
{
//*filePaths = fileListCreate(newString);
*filePaths = malloc(sizeof(fileList));
(*filePaths)->path = malloc(sizeof(filePath));
(*filePaths)->path->string = malloc(strlen(newString) + 1 );
strcpy( (*filePaths)->path->string, newString );
(*filePaths)->next = NULL;
}
else
{
fileList *newFile;
newFile = (*filePaths);
while(newFile->next)
{
newFile = newFile->next;
}
newFile->next = malloc(sizeof(fileList));
newFile->next->path = malloc(sizeof(filePath));
newFile->next->path->string = malloc(strlen(newString) + 1 );
strcpy( newFile->next->path->string, newString );
newFile->next->next = NULL;
}
//printf(" %s", (*filePaths)->path->string);
free( newString );
}
}
}
while ( ( entry = readdir( dir ) ) );
closedir( dir );
return 0;
}
int readDir(char* file, container **head)
{
FILE *fp;
long fileSize = 0;
fp = fopen(file, "r");
if(!fp)
{
fprintf(stderr, "Could not open file: %s.\n", strerror(errno));
exit(EXIT_FAILURE);
}
fseek(fp, 0, SEEK_END);
fileSize = ftell(fp);
fseek(fp, 0, SEEK_SET);
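if(fileSize <= 0)
{
fclose(fp);
return 0;
}
//one extra zeroed byte keeps the buffer NUL-terminated for the string functions below
char *contentString = calloc(fileSize + 1, 1);
fread(contentString, 1, fileSize, fp);
fclose(fp);
//counts are combined per file name, so drop the directory part of the path
if(strrchr(file, '/'))
{
file = strrchr(file, '/') + 1;
}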
stringNonAlphaToSpace(contentString);
stringLowerCase(contentString);
char *tempToken;
tempToken = strtok(contentString, " ");
while(tempToken != NULL)
{
if ((*head) == NULL)
{
*head = malloc(sizeof(container));
(*head)->token = malloc( strlen(tempToken) + 1 );
strcpy( (*head)->token, tempToken );
(*head)->path = malloc ( strlen( file ) + 1);
strcpy((*head)->path, file);
(*head)->next = NULL;
}
else
{
container *newWordList;
newWordList = (*head);
while(newWordList->next)
{
newWordList = newWordList->next;
}
(newWordList->next) = malloc(sizeof(container)); //each node is a container (path, token, next)
newWordList->next->token = malloc(strlen( tempToken ) + 1);
strcpy(newWordList->next->token, tempToken);
newWordList->next->path = malloc(strlen ( file ) + 1);
strcpy(newWordList->next->path, file);
newWordList->next->next = NULL;
}
printf("\n%s", tempToken);
tempToken = strtok(NULL, " ");
}
free(contentString); //tokens were copied above, so the buffer can be released
return 0;
}
-----------------------------------------------------------------------------------------------------------------------------------------------
format.c
#include "headers.h"
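void stringNonAlphaToSpace(char *input){
unsigned long i = 0;
int inToken = 0; //non-zero while inside a token that began with a letter
unsigned char c;
while ( (c = (unsigned char)input[i]) != '\0')
{
//a token starts with a letter; digits only count once a token has started
if(isalpha(c) || (inToken && isdigit(c)))
{
input[i] = (char)tolower(c);
inToken = 1;
}
else
{
input[i] = ' ';
inToken = 0;
}
i++;
}
}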
void stringLowerCase(char *input)
{
int i = 0;
for(i = 0; input[i]; i++)
{
input[i] = tolower((unsigned char)input[i]);
}
}
--------------------------------------------------------------------------------------------------------------------------------------------------------
inverted-list.c
#include "headers.h"
stringStructPtr stringStructCreate ( char* string )
{
stringStructPtr newStruct = (stringStructPtr)malloc(sizeof(stringStruct));
if(string == NULL)
{
newStruct->string = NULL;
return newStruct;
}
newStruct->string = malloc ( strlen ( string ) + 1);
strcpy(newStruct->string, string);
(newStruct->string)[strlen(string)] = '\0';
return newStruct;
}
void buildList(ILNode **head, container *ptr)
{
if ( (*head) == NULL )
{
(*head) = malloc(sizeof(ILNode));
(*head)->token = malloc(strlen(ptr->token)+1);
strcpy((*head)->token, ptr->token);
(*head)->pathData = malloc(sizeof(fileNode));
(*head)->pathData->path = malloc( strlen( ptr->path ) + 1);
strcpy( ((*head)->pathData->path), ptr->path);
(*head)->pathData->freq = 1;
(*head)->pathData->next = NULL;
(*head)->next = NULL;
}
else
{
fileNode *pathIter;
ILNode *tokenIter;
tokenIter = (*head);
pathIter = tokenIter->pathData;
//token loop
while(tokenIter->next != NULL)
{
pathIter = tokenIter->pathData;
//new token
if ( strcmp ( ptr->token, tokenIter->token) != 0 )
{
tokenIter = tokenIter->next;
pathIter = tokenIter->pathData;
}
//existing token
else
{
//path loop
while ( pathIter->next != NULL)
{
//new path
if ( strcmp ( ptr->path, pathIter->path) != 0 )
{
pathIter = pathIter->next;
}
//existing path
else
{
pathIter->freq += 1;
return;
}
}//end path loop
//made it out of path loop
//same token, no path match, add path to token
if( strcmp ( ptr->path, pathIter->path) != 0 )
{
pathIter->next = malloc(sizeof(fileNode));
pathIter->next->path = malloc(strlen( ptr->path ) + 1);
strcpy(pathIter->next->path, ptr->path);
pathIter->next->freq = 1;
pathIter->next->next = NULL;
return;
}
else
{
pathIter->freq += 1;
return;
}
}
}//end token loop
//made it out of token loop
//different token, add token to list
if ( strcmp ( ptr->token, tokenIter->token) != 0)
{
tokenIter->next = malloc(sizeof(ILNode));
tokenIter->next->token = malloc(strlen(ptr->token) + 1);
strcpy(tokenIter->next->token, ptr->token);
tokenIter->next->next = NULL;
tokenIter->next->pathData = malloc(sizeof(fileNode));
tokenIter->next->pathData->path = malloc(strlen(ptr->path) + 1);
strcpy(tokenIter->next->pathData->path, ptr->path);
tokenIter->next->pathData->freq = 1;
tokenIter->next->pathData->next = NULL;
}
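else
{
//same token as the last node: find the matching path or append a new record
while( pathIter->next != NULL && strcmp ( ptr->path, pathIter->path) != 0 )
{
pathIter = pathIter->next;
}
if ( strcmp ( ptr->path, pathIter->path) == 0 )
{
pathIter->freq += 1;
}
else
{
pathIter->next = malloc(sizeof(fileNode));
pathIter->next->path = malloc(strlen( ptr->path ) + 1);
strcpy(pathIter->next->path, ptr->path);
pathIter->next->freq = 1;
pathIter->next->next = NULL;
}
}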
}
}
void mergeSort(ILNode **ILList)
{
ILNode *head = *ILList;
ILNode *a;
ILNode *b;
if ((head == NULL) || (head->next == NULL))
{
return;
}
mergeSplit(head, &a, &b);
mergeSort(&a);
mergeSort(&b);
*ILList = SortedMerge(a, b);
}
ILNode* SortedMerge(ILNode* a, ILNode *b)
{
ILNode *result = NULL;
if (a == NULL)
return b;
else if(b== NULL)
return a;
if( strcmp(a->token, b->token) < 0)
{
result = a;
result->next = SortedMerge(a->next, b);
}
else
{
result = b;
result->next = SortedMerge(a,b->next);
}
return result;
}
void mergeSplit(ILNode* head, ILNode** back, ILNode** front)
{
ILNode* fast;
ILNode* slow;
if((head == NULL) || (head->next == NULL))
{
*front = head;
*back = NULL;
}
else
{
slow = head;
fast = head->next;
while(fast != NULL)
{
fast = fast->next;
if(fast != NULL)
{
slow = slow->next;
fast = fast->next;
}
}
*front = head;
*back = slow->next;
slow->next = NULL;
}
}
void mergeSortRecords(fileNode **recordList)
{
fileNode *head = *recordList;
fileNode *a;
fileNode *b;
if ((head == NULL) || (head->next == NULL))
{
return;
}
mergeSplitRecords(head, &a, &b);
mergeSortRecords(&a);
mergeSortRecords(&b);
*recordList = SortedMergeRecords(a, b);
}
fileNode* SortedMergeRecords(fileNode* a, fileNode *b)
{
fileNode *result = NULL;
if (a == NULL)
return b;
else if(b== NULL)
return a;
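//higher frequency first; equal frequencies are ordered by path name
if( (a->freq > b->freq) || ( (a->freq == b->freq) && (strcmp(a->path, b->path) <= 0) ) )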
{
result = a;
result->next = SortedMergeRecords(a->next, b);
}
else
{
result = b;
result->next = SortedMergeRecords(a,b->next);
}
return result;
}
void mergeSplitRecords(fileNode* head, fileNode**back, fileNode** front)
{
fileNode* fast;
fileNode* slow;
if((head == NULL) || (head->next == NULL))
{
*front = head;
*back = NULL;
}
else
{
slow = head;
fast = head->next;
while(fast != NULL)
{
fast = fast->next;
if(fast != NULL)
{
slow = slow->next;
fast = fast->next;
}
}
*front = head;
*back = slow->next;
slow->next = NULL;
}
}
int ILPrint(ILNode *head)
{
ILNode *fileIter = head;
fileNode *pathIter;
while(fileIter != NULL)
{
pathIter = fileIter->pathData;
printf("\n%s", fileIter->token);
while(pathIter != NULL)
{
printf("\n\t%s %d", pathIter->path, pathIter->freq);
pathIter = pathIter->next;
}
fileIter = fileIter->next;
}
return 0;
}
int tokenSearch(ILNode *head, char* token, char* path)
{
ILNode *tokenIter = head;
fileNode *pathIter;
while(tokenIter != NULL)
{
if( strcmp( ( tokenIter->token ), token ) == 0 )
{
//search the records of the matching token, not those of the list head
pathIter = tokenIter->pathData;
while(pathIter != NULL)
{
if(strcmp((pathIter->path), path) == 0)
{
printf("\nfound: %s", token);
printf("\nfound: %s", path);
return 0;
}
pathIter = pathIter->next;
}
return 1;
}
tokenIter = tokenIter->next;
}
return 1; //token is not in the index
}
---------------------------------------------------------------------------------------------------------------------------------------------------
output.c
#include "headers.h"
int output ( char* file, ILNode *head )
{
printf("\n%s", file);
if( access (file, F_OK) != -1)
{
printf("\nFile Already Exists. Overwrite? (Y/N)?: ");
if(tolower(getchar()) == 'y')
{
remove(file);
}
else
{
printf("\nProgram Terminated.\n");
return 1;
}
}
FILE *fp;
fp = fopen(file, "w");
if(!fp)
{
fprintf(stderr, "Could not open file: %s.\n", strerror(errno));
return 1;
}
ILNode *tokenIter;
fileNode *pathIter;
tokenIter = head;
while(tokenIter != NULL)
{
fprintf(fp, "<list> ");
fprintf(fp, "%s\n", tokenIter->token);
pathIter = tokenIter->pathData;
while(pathIter != NULL)
{
fprintf(fp, "%s", pathIter->path);
fprintf(fp, " %d ", pathIter->freq);
pathIter = pathIter->next;
}
fprintf(fp, "\n</list>\n");
tokenIter = tokenIter->next;
}
fclose(fp);
return 0;
}
-------------------------------------------------------------------------------------------------------------------------------------------
main.c
#include "headers.h"
int main ( int argc, char *argv[ ] )
{
system("clear");
if(argc != 3)
{
printf( "\nIncorrect Number of Arguments.\nUsage: ./index <outputfile> <dir>\n");
return 1;
}
/// [MALLOC]
filePathPtr inputPath = stringStructCreate( argv[ 2 ] );
//filePath *outputPath = stringStructCreate("1");
/// [MALLOC]
filePathPtr outputPath = stringStructCreate( argv[ 1 ] );
//filePath *inputPath = stringStructCreate("Untitled Folder");
/// [MALLOC]
fileListPtr filePaths = NULL;
/// [MALLOC]
fileListPtr dirPaths = NULL;
//gather all the files in the directory
indexFiles ( inputPath, &filePaths, &dirPaths );
//read the files, and create a list of the tokens
container *mergedList = NULL;
fileList *fileIter = filePaths;
while(fileIter != NULL)
{
readDir(fileIter->path->string, &mergedList);
fileIter = fileIter->next;
}
//add the tokens to a list
container *listIter = mergedList;
ILNode *index = NULL;
while( listIter != NULL)
{
buildList(&index, listIter);
listIter = listIter->next;
}
//ILPrint(index);
mergeSort(&index);
ILNode *nodeIter = index;
while(nodeIter != NULL)
{
mergeSortRecords(&(nodeIter->pathData));
nodeIter = nodeIter->next;
}
ILPrint(index);
output(outputPath->string, index);
printf("\n");
return 0;
}
------------------------------------------------------------------------------------------------------------------------------------
indexer.h
#ifndef INDEXER_H
#define INDEXER_H
#include "inverted-list.h"
int readFiles ( fileListPtr filePaths, tokenListPtr fileTokensList );
int indexFiles ( filePathPtr path, fileList **filePaths, fileList **dirPaths );
#endif
-----------------------------------------------------------------------------------------------------------------------------------
format.h
#ifndef FORMAT_H
#define FORMAT_H
void stringNonAlphaToSpace ( char* input );
void stringLowerCase ( char* input );
#endif
-----------------------------------------------------------------------------------------------------------------------
inverted-list.h
#ifndef INVERTED_LIST_H
#define INVERTED_LIST_H
typedef struct stringStruct { char* string; } stringStruct;
typedef stringStruct* stringStructPtr;
typedef stringStruct tokenStr;
typedef tokenStr* tokenPtr;
typedef stringStruct filePath;
typedef filePath* filePathPtr;
typedef stringStruct fileContents;
typedef fileContents* fileContentsPtr;
typedef struct fileList { filePathPtr path;
struct fileList* next; } fileList;
typedef fileList* fileListPtr;
typedef struct wordList
{
char* token;
struct wordList *next;
} wordList;
typedef wordList* wordListPtr;
typedef struct tokenList { tokenPtr token;
filePathPtr path;
struct tokenList* next; } tokenList;
typedef tokenList* tokenListPtr;
typedef struct referenceList { filePathPtr path;
int* freq;
int unused;
struct referenceList* next; } referenceList;
typedef referenceList* referenceListPtr;
typedef struct InvertedList { tokenPtr token;
referenceListPtr fileDetails;
struct InvertedList* next; } InvertedList;
typedef InvertedList* InvertedListPtr;
typedef struct container
{
char *path;
char *token;
struct container* next;
}container;
typedef struct fileNode
{
char *path;
int freq;
struct fileNode *next;
} fileNode;
typedef fileNode* fileNodePtr;
typedef struct ILNode
{
char *token;
fileNode *pathData;
struct ILNode *next;
} ILNode;
typedef ILNode* ILNodePtr;
stringStructPtr stringStructCreate ( char* string );
int stringStructDestroy ( stringStructPtr ptr );
tokenListPtr tokenListCreate ( );
int tokenListDestroy ( tokenListPtr ptr );
fileListPtr fileListCreate ( char* string );
int fileListDestroy ( fileListPtr ptr );
referenceListPtr referenceListCreate ( );
int referenceListDestroy ( referenceListPtr ptr );
InvertedListPtr ILCreate ( );
int ILDestroy ( InvertedListPtr list );
int tokenSearch(ILNode *head, char* token, char* path);
void printTokens(ILNode *head);
void pathSearch(ILNode *head, char* file);
int ILPrint(ILNode *head);
int readDir(char* file, container **head);
void buildList(ILNode **head, container *ptr);
void mergeSort(ILNode **head);
void mergeSplit(ILNode* head, ILNode** back, ILNode** front);
ILNode* SortedMerge(ILNode* a, ILNode *b);
void mergeSortRecords(fileNode **recordList);
fileNode* SortedMergeRecords(fileNode* a, fileNode *b);
void mergeSplitRecords(fileNode* head, fileNode**back, fileNode** front);
#endif
---------------------------------------------------------------------------------------------------------------------------------------
output.h
#ifndef OUTPUT_H
#define OUTPUT_H
#include "inverted-list.h"
int output ( char* file, ILNode *head );
#endif
----------------------------------------------------------------------------------------------------------------------------------------
Makefile
CC = gcc
CFLAGS = -g
CFILES = indexer.c format.c inverted-list.c output.c main.c
HFILES = indexer.h format.h inverted-list.h output.h
main:
	$(CC) $(CFLAGS) -o index $(CFILES)
archive: main
	ar -r libsl.a $(HFILES)
clean:
	rm -rf *.o index
-------------------------------------------------------------------------------------------------
Invocation
make
./index <outputfile> <dir to read>