
Consider again the sed code of Figure 14.1. It is tempting to write the first of th


Question

Consider again the sed code of Figure 14.1. It is tempting to write the first of the compound statements as follows
(note the differences in the three substitution commands):

/<[hH][123]>.*<\/[hH][123]>/ {       ;# match whole heading
    h                                ;# save copy of pattern space
    s/^.*\(<[hH][123]>\)/\1/         ;# delete text before opening tag
    s/\(<\/[hH][123]>\).*$/\1/       ;# delete text after closing tag
    p                                ;# print what remains
    g                                ;# retrieve saved pattern space
    s/^.*<\/[hH][123]>//             ;# delete through closing tag
    b top
}

Q: Explain why this doesn’t work. (Hint: Remember the difference between greedy and minimal matches.)

Figure 14.1: Script in sed to extract headers from an HTML file. The script assumes that opening and closing tags are properly matched, and that headers do not nest. Its comments, in order: label (target for branch); match whole heading; save copy of pattern space; delete text after closing tag; delete text before opening tag; print what remains; retrieve saved pattern space; delete opening tag; delete closing tag; and branch to top of script; match opening tag (only); extend search to next line; and branch to top of script; if no match at all, delete.

Explanation / Answer

As a simple text-processing example, consider the problem of extracting all headers from a web page (an HTML file). These are strings delimited by <h1>...</h1>, <h2>...</h2>, and <h3>...</h3> tags. Accomplishing this task in an editor like emacs, vim, or even Microsoft Word is straightforward but tedious: one must search for an opening tag, delete the preceding text, search for the closing tag, mark the current position (as the starting point for the next deletion), and repeat.

A program to perform these tasks in sed, the Unix "stream editor", appears in Figure 14.1. The code consists of a label and three commands, the first two of which are compound. The first compound command prints the first header, if any, found in the portion of the input currently being examined (which sed calls the pattern space). The second compound command appends the next line of input to the pattern space whenever it already contains a header-opening tag. Both compound commands, and several of the subcommands, use regular expression patterns, delimited by slashes. The third command (the lone d) simply deletes the pattern space. Because each compound command ends with a branch back to the top of the script, the second will execute only if the first does not, and the delete will execute only if neither compound does.
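To see the effect the question's hint points at, here is a minimal sketch that runs the two trimming substitutions in both orders on a made-up line containing two headings. The sample line and the variable names are assumptions for illustration, and the `\(...\)` and `\1` spelling is an assumed reading of the substitutions. Because `.*` is greedy, `^.*` in the tempting order runs to the last opening tag in the pattern space, so any earlier heading is discarded. The final `s/^.*<\/[hH][123]>//` is greedy in the same way and deletes through the last closing tag, so the skipped heading is never revisited.

```shell
#!/bin/sh
# Hypothetical sample line with two headings.
line='a <h1>One</h1> b <h2>Two</h2> c'

# Tempting order: trim before the opening tag first.
# Greedy ^.* reaches the LAST opening tag, so <h1>One</h1> is lost.
tempting=$(printf '%s\n' "$line" |
  sed -e 's/^.*\(<[hH][123]>\)/\1/' \
      -e 's/\(<\/[hH][123]>\).*$/\1/')

# Other order: trim after the closing tag first.
# The leftmost match is the FIRST closing tag, which leaves one opening tag.
safe=$(printf '%s\n' "$line" |
  sed -e 's/\(<\/[hH][123]>\).*$/\1/' \
      -e 's/^.*\(<[hH][123]>\)/\1/')

# Tempting cleanup: greedy ^.* deletes through the LAST closing tag.
rest=$(printf '%s\n' "$line" | sed -e 's/^.*<\/[hH][123]>//')

printf 'tempting: %s\n' "$tempting"   # tempting: <h2>Two</h2>
printf 'safe:     %s\n' "$safe"       # safe:     <h1>One</h1>
printf 'rest:     [%s]\n' "$rest"     # rest:     [ c]
```

Trimming after the closing tag first works because a match begins at the leftmost possible position, which is the first closing tag. The greedy `^.*` that follows then has only one opening tag left to find.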

The editor heritage of sed is clear in this example. Commands are generally one character long, and there are no variables--no state of any kind beyond the program counter and the text that is being edited. These limitations make sed best suited to "one-line programs," typically entered verbatim from the keyboard with the -e command-line switch. The following, for example, will read from standard input, delete blank lines, and (implicitly) print the nonblank lines to standard output:
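One command that fits this description, as a minimal sketch: it assumes that "blank" means empty or whitespace-only, and the `strip_blank` wrapper name is hypothetical, added only so the command can be reused.

```shell
#!/bin/sh
# strip_blank is a hypothetical wrapper around a sed one-liner.
# The address /^[[:space:]]*$/ selects empty or whitespace-only lines,
# and d deletes them. Every other line reaches the end of the script
# and is printed by sed's implicit print.
strip_blank() {
  sed -e '/^[[:space:]]*$/d'
}

# Usage: only the two nonblank lines come out.
printf 'first\n\n   \nsecond\n' | strip_blank
```

No explicit `p` is needed here, because sed prints the pattern space at the end of each cycle unless it is run with -n or the line has been deleted.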