
Code in Python You decided to add NLP(Natural Language Processing) capabilities


Question

Code in Python

You decided to add NLP (Natural Language Processing) capabilities to your model. One good way of deriving information from text is to count the words in it.

Part 1 (15 pts). Write a program that takes a filename and n as input, reads the text, and returns a dictionary whose keys are n-grams (groups of n consecutive words; for example, the 3-grams in "to be or not to be" are "to be or", "be or not", "or not to" and "not to be") and whose values are the numbers of occurrences, keeping only the n-grams that appear more than once in the text. Note: for the purposes of this question, assume 1 ≤ n ≤ 10.

Part 2 (15 pts). Using the results of the previous part, find the 10 n-grams with the highest total length (i.e., len(w1) + len(w2) + len(w3)).

You can test your program on this file after downloading it: http://www.gutenberg.org/files/100/100-0.txt

Explanation / Answer
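The two parts of the question can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference solution: the question does not define what counts as a "word", so the tokenizer here (lowercasing, keeping letters and digits, allowing inner apostrophes) is an assumption, and the function names are hypothetical. The file is opened with the "utf-8-sig" codec so that a leading byte-order mark, if the downloaded file has one, is not glued to the first word.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (assumed definition of a 'word')."""
    return re.findall(r"[^\W_]+(?:['’][^\W_]+)*", text.lower())


def repeated_ngrams(words, n):
    """Part 1: map each n-gram to its count, keeping only counts above 1."""
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    # zip over n shifted copies of the word list yields consecutive n-tuples.
    grams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return {gram: count for gram, count in counts.items() if count > 1}


def ngrams_from_file(filename, n):
    """Read a text file and return its repeated n-grams."""
    with open(filename, encoding="utf-8-sig") as handle:
        return repeated_ngrams(tokenize(handle.read()), n)


def longest_ngrams(ngram_counts, k=10):
    """Part 2: the k n-grams whose words have the greatest total length."""
    def total_length(gram):
        return sum(len(word) for word in gram.split())
    return sorted(ngram_counts, key=total_length, reverse=True)[:k]
```

With the sentence from the question, repeated_ngrams(tokenize("to be or not to be"), 2) gives {"to be": 2}, while n = 3 gives an empty dictionary because each 3-gram occurs only once. For the downloaded file, a call such as longest_ngrams(ngrams_from_file("100-0.txt", 3)) would cover Part 2, assuming the file was saved under that name in the current directory.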

To run the pipeline properly, a system with at least 4 GB of RAM is recommended. The following operating systems have been tested:

macOS (10.10 - 10.13)

Linux (Ubuntu 14.04 - 17.10)

Windows 7 - 10

Furthermore, the pipeline needs an internet connection at run time to download the models for the current configuration. It does not work offline.

1.2. Java Installation

The following step installs the base system requirements needed to run DKPro Core pipelines on your machine. This needs to be performed only once. Download and install the latest Java SE Runtime Environment (at least Java 1.8) from the Oracle Java Site, then follow the installation instructions for your operating system. You can check your current Java version by running java -version in your command line.
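The version check can also be scripted. The sketch below is illustrative only and its function names are hypothetical: it parses the banner printed by java -version and checks it against the "at least Java 1.8" requirement stated above. It relies on two facts about the JVM: the banner is normally written to stderr, and Java 8 reports itself as "1.8" while later releases report "9", "11" and so on.

```python
import re
import subprocess


def parse_java_version(banner):
    """Extract (major, minor) from a `java -version` banner, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2) or 0))


def meets_java_8(banner):
    """True if the banner reports Java 1.8 or anything newer."""
    version = parse_java_version(banner)
    # (1, 8) is the legacy spelling of Java 8; (9, 0), (11, 0), ... compare higher.
    return version is not None and version >= (1, 8)


def installed_java_banner():
    """Run `java -version`; raises FileNotFoundError if java is not on the PATH."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr or result.stdout
```

Calling meets_java_8(installed_java_banner()) then reports whether the installed runtime satisfies the requirement.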

1.3. Pipeline Download

When the Java environment is prepared, you can download the latest binary. Select the file named ddw-0.4.6.zip and unpack it somewhere easily accessible. As a next step we need to navigate to this folder using the command line.

2. Running the Pipeline

2.1. Using the Command Line

The DKPro pipeline does not have a graphical user interface (GUI). You therefore have to use the pipeline (both setting up and processing data) from the command prompt. In all versions of the Windows operating system, pressing the Windows key + "R" opens the Run dialog, where typing "cmd" and pressing Enter launches the command prompt. Otherwise, the command prompt can be launched

in Windows 7 by clicking the "Start" button, typing "command" in the search box, and clicking "Command Prompt"

in Windows 8 by right-clicking the "Start" button, choosing "Run", and typing "cmd" in the box. Alternatively, type "cmd" in the search.

in Windows 10 by typing "cmd.exe" into the search box on the taskbar and selecting the first option.

Navigate to the directory that contains the DKPro pipeline. For example, if you are using Windows and keep your pipeline in a folder named "DKPro" on drive "D:", change to that folder with the cd command and press Enter.

Alternatively, the directory can be opened directly by holding Shift, right-clicking the folder, and selecting "Open command window here".

2.2. Processing a Textfile

Now you can process a text file. How can you test when you don't have any data? We've prepared a demonstration text that can be downloaded and processed via the pipeline. You can compare your output with this file. If your output is identical, the DKPro pipeline works correctly on your computer. There are also plenty of free texts available from the TextGrid Repository or the Deutsches Textarchiv. If you do not specify the -language parameter, the pipeline assumes English input. For more details see further below.
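The comparison with the reference output can be done by hand or with a few lines of Python. The sketch below is illustrative only and its function names are hypothetical. It compares two texts line by line and ignores line-ending differences, which matters when one file was written on Windows and the other was not. The UTF-8 default for the files is an assumption, not something the pipeline documentation states here.

```python
from itertools import zip_longest
from pathlib import Path


def first_difference(produced_text, reference_text):
    """Return the 1-based number of the first differing line, or None."""
    produced = produced_text.splitlines()
    reference = reference_text.splitlines()
    # zip_longest pads the shorter side with None, so extra lines count as a difference.
    for number, (ours, theirs) in enumerate(zip_longest(produced, reference), start=1):
        if ours != theirs:
            return number
    return None


def first_difference_in_files(produced_path, reference_path, encoding="utf-8"):
    """Same check for two files on disk (the encoding is an assumption)."""
    return first_difference(
        Path(produced_path).read_text(encoding=encoding),
        Path(reference_path).read_text(encoding=encoding),
    )
```

A result of None means the two outputs match; any other value is the line at which to start looking.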

To process data, type the following command in the command prompt

and press Enter.

For example:

If your input or output files are located in the current directory, you can type "." instead of the full input or output path. For example:

The pipeline will process your data and save the output as a .csv file in the specified folder. If

is shown on your command prompt, everything has worked well. To see the final results, check the output file in your specified output folder.

Important Note: Depending on the configuration of your system and the size of the input file, processing may take some time. Even a test file of 630 words can easily take 1-2 minutes with 4 GB of RAM allocated to the task.