| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
As you have already seen, each awk statement consists of
a pattern with an associated action. This chapter describes how
you build patterns and actions, what kinds of things you can do within
actions, and awk's built-in variables.
The pattern-action rules and the statements available for use
within actions form the core of awk programming.
In a sense, everything covered
up to here has been the foundation
that programs are built on top of. Now it's time to start
building something useful.
7.1 Pattern Elements What goes into a pattern. 7.2 Using Shell Variables in Programs How to use shell variables with awk.7.3 Actions What goes into an action. 7.4 Control Statements in Actions Describes the various control statements in detail. 7.5 Built-in Variables Summarizes the built-in variables.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
7.1.1 Regular Expressions as Patterns Using regexps as patterns. 7.1.2 Expressions as Patterns Any expression can be used as a pattern. 7.1.3 Specifying Record Ranges with Patterns Pairs of patterns specify record ranges. 7.1.4 The BEGINandENDSpecial PatternsSpecifying initialization and cleanup rules. 7.1.5 The Empty Pattern The empty pattern, which matches every record.
Patterns in awk control the execution of rules--a rule is
executed when its pattern matches the current input record.
The following is a summary of the types of patterns in awk:
/regular expression/
expression
pat1, pat2
BEGIN
END
awk program.
(See section The BEGIN and END Special Patterns.)
empty
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Regular expressions are one of the first kinds of patterns presented in this book. This kind of pattern is simply a regexp constant in the pattern part of a rule. Its meaning is `$0 ~ /pattern/'. The pattern matches when the input record matches the regexp. For example:
/foo|bar|baz/ { buzzwords++ }
END { print buzzwords, "buzzwords seen" }
|
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Any awk expression is valid as an awk pattern.
The pattern matches if the expression's value is nonzero (if a
number) or non-null (if a string).
The expression is reevaluated each time the rule is tested against a new
input record. If the expression uses fields such as $1, the
value depends directly on the new input record's text; otherwise it
depends on only what has happened so far in the execution of the
awk program.
Comparison expressions, using the comparison operators described in
Variable Typing and Comparison Expressions,
are a very common kind of pattern.
Regexp matching and non-matching are also very common expressions.
The left operand of the `~' and `!~' operators is a string.
The right operand is either a constant regular expression enclosed in
slashes (/regexp/), or any expression whose string value
is used as a dynamic regular expression
(see section Using Dynamic Regexps).
The following example prints the second field of each input record
whose first field is precisely `foo':
$ awk '$1 == "foo" { print $2 }' BBS-list
|
(There is no output, because there is no BBS site with the exact name `foo'.) Contrast this with the following regular expression match, which accepts any record with a first field that contains `foo':
$ awk '$1 ~ /foo/ { print $2 }' BBS-list
-| 555-1234
-| 555-6699
-| 555-6480
-| 555-2127
|
A regexp constant as a pattern is also a special case of an expression
pattern. The expression /foo/ has the value one if `foo'
appears in the current input record. Thus, as a pattern, /foo/
matches any record containing `foo'.
Boolean expressions are also commonly used as patterns. Whether the pattern matches an input record depends on whether its subexpressions match. For example, the following command prints all the records in `BBS-list' that contain both `2400' and `foo':
$ awk '/2400/ && /foo/' BBS-list -| fooey 555-1234 2400/1200/300 B |
The following command prints all records in `BBS-list' that contain either `2400' or `foo' (or both, of course):
$ awk '/2400/ || /foo/' BBS-list -| alpo-net 555-3412 2400/1200/300 A -| bites 555-1675 2400/1200/300 A -| fooey 555-1234 2400/1200/300 B -| foot 555-6699 1200/300 B -| macfoo 555-6480 1200/300 A -| sdace 555-3430 2400/1200/300 A -| sabafoo 555-2127 1200/300 C |
The following command prints all records in `BBS-list' that do not contain the string `foo':
$ awk '! /foo/' BBS-list -| aardvark 555-5553 1200/300 B -| alpo-net 555-3412 2400/1200/300 A -| barfly 555-7685 1200/300 A -| bites 555-1675 2400/1200/300 A -| camelot 555-0542 300 C -| core 555-2912 1200/300 C -| sdace 555-3430 2400/1200/300 A |
The subexpressions of a Boolean operator in a pattern can be constant regular
expressions, comparisons, or any other awk expressions. Range
patterns are not expressions, so they cannot appear inside Boolean
patterns. Likewise, the special patterns BEGIN and END,
which never match any input record, are not expressions and cannot
appear inside Boolean patterns.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
A range pattern is made of two patterns separated by a comma, in the form `begpat, endpat'. It is used to match ranges of consecutive input records. The first pattern, begpat, controls where the range begins, while endpat controls where the pattern ends. For example, the following:
awk '$1 == "on", $1 == "off"' myfile |
prints every record in `myfile' between `on'/`off' pairs, inclusive.
A range pattern starts out by matching begpat against every input record. When a record matches begpat, the range pattern is turned on and the range pattern matches this record as well. As long as the range pattern stays turned on, it automatically matches every input record read. The range pattern also matches endpat against every input record; when this succeeds, the range pattern is turned off again for the following record. Then the range pattern goes back to checking begpat against each record.
The record that turns on the range pattern and the one that turns it
off both match the range pattern. If you don't want to operate on
these records, you can write if statements in the rule's action
to distinguish them from the records you are interested in.
It is possible for a pattern to be turned on and off by the same
record. If the record satisfies both conditions, then the action is
executed for just that record.
For example, suppose there is text between two identical markers (say
the `%' symbol), each on its own line, that should be ignored.
A first attempt would be to
combine a range pattern that describes the delimited text with the
next statement
(not discussed yet, see section The next Statement).
This causes awk to skip any further processing of the current
record and start over again with the next input record. Such a program
looks like this:
/^%$/,/^%$/ { next }
{ print }
|
This program fails because the range pattern is both turned on and turned off by the first line, which just has a `%' on it. To accomplish this task, write the program in the following manner, using a flag:
/^%$/ { skip = ! skip; next }
skip == 1 { next } # skip lines with `skip' set
|
In a range pattern, the comma (`,') has the lowest precedence of all the operators (i.e., it is evaluated last). Thus, the following program attempts to combine a range pattern with another simpler test:
echo Yes | awk '/1/,/2/ || /Yes/' |
The intent of this program is `(/1/,/2/) || /Yes/'.
However, awk interprets this as `/1/, (/2/ || /Yes/)'.
This cannot be changed or worked around; range patterns do not combine
with other patterns:
$ echo yes | gawk '(/1/,/2/) || /Yes/' error--> gawk: cmd. line:1: (/1/,/2/) || /Yes/ error--> gawk: cmd. line:1: ^ parse error error--> gawk: cmd. line:2: (/1/,/2/) || /Yes/ error--> gawk: cmd. line:2: ^ unexpected newline |
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
BEGIN and END Special Patterns
All the patterns described so far are for matching input records.
The BEGIN and END special patterns are different.
They supply startup and cleanup actions for awk programs.
BEGIN and END rules must have actions; there is no default
action for these rules because there is no current record when they run.
BEGIN and END rules are often referred to as
"BEGIN and END blocks" by long-time awk
programmers.
7.1.4.1 Startup and Cleanup Actions How and why to use BEGIN/END rules. 7.1.4.2 Input/Output from BEGINandENDRulesI/O issues in BEGIN/END rules.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
A BEGIN rule is executed once only, before the first input record
is read. Likewise, an END rule is executed once only, after all the
input is read. For example:
$ awk '
> BEGIN { print "Analysis of \"foo\"" }
> /foo/ { ++n }
> END { print "\"foo\" appears", n, "times." }' BBS-list
-| Analysis of "foo"
-| "foo" appears 4 times.
|
This program finds the number of records in the input file `BBS-list'
that contain the string `foo'. The BEGIN rule prints a title
for the report. There is no need to use the BEGIN rule to
initialize the counter n to zero, since awk does this
automatically (see section 6.3 Variables).
The second rule increments the variable n every time a
record containing the pattern `foo' is read. The END rule
prints the value of n at the end of the run.
The special patterns BEGIN and END cannot be used in ranges
or with Boolean operators (indeed, they cannot be used with any operators).
An awk program may have multiple BEGIN and/or END
rules. They are executed in the order in which they appear: all the BEGIN
rules at startup and all the END rules at termination.
BEGIN and END rules may be intermixed with other rules.
This feature was added in the 1987 version of awk and is included
in the POSIX standard.
The original (1978) version of awk
required the BEGIN rule to be placed at the beginning of the
program, the END rule to be placed at the end, and only allowed one of
each.
This is no longer required, but it is a good idea to follow this template
in terms of program organization and readability.
Multiple BEGIN and END rules are useful for writing
library functions, because each library file can have its own BEGIN and/or
END rule to do its own initialization and/or cleanup.
The order in which library functions are named on the command line
controls the order in which their BEGIN and END rules are
executed. Therefore you have to be careful when writing such rules in
library files so that the order in which they are executed doesn't matter.
See section Command-Line Options, for more information on
using library functions.
See section A Library of awk Functions,
for a number of useful library functions.
If an awk program only has a BEGIN rule and no
other rules, then the program exits after the BEGIN rule is
run.(23) However, if an
END rule exists, then the input is read, even if there are
no other rules in the program. This is necessary in case the END
rule checks the FNR and NR variables.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
BEGIN and END Rules
There are several (sometimes subtle) points to remember when doing I/O
from a BEGIN or END rule.
The first has to do with the value of $0 in a BEGIN
rule. Because BEGIN rules are executed before any input is read,
there simply is no input record, and therefore no fields, when
executing BEGIN rules. References to $0 and the fields
yield a null string or zero, depending upon the context. One way
to give $0 a real value is to execute a getline command
without a variable (see section Explicit Input with getline).
Another way is to simply assign a value to $0.
The second point is similar to the first but from the other direction.
Traditionally, due largely to implementation issues, $0 and
NF were undefined inside an END rule.
The POSIX standard specifies that NF is available in an END
rule. It contains the number of fields from the last input record.
Most probably due to an oversight, the standard does not say that $0
is also preserved, although logically one would think that it should be.
In fact, gawk does preserve the value of $0 for use in
END rules. Be aware, however, that Unix awk, and possibly
other implementations, do not.
The third point follows from the first two. The meaning of `print'
inside a BEGIN or END rule is the same as always:
`print $0'. If $0 is the null string, then this prints an
empty line. Many long time awk programmers use an unadorned
`print' in BEGIN and END rules, to mean `print ""',
relying on $0 being null. Although one might generally get away with
this in BEGIN rules, it is a very bad idea in END rules,
at least in gawk. It is also poor style, since if an empty
line is needed in the output, the program should print one explicitly.
Finally, the next and nextfile statements are not allowed
in a BEGIN rule, because the implicit
read-a-record-and-match-against-the-rules loop has not started yet. Similarly, those statements
are not valid in an END rule, since all the input has been read.
(See section The next Statement, and see
Using gawk's nextfile Statement.)
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
An empty (i.e., non-existent) pattern is considered to match every input record. For example, the program:
awk '{ print $1 }' BBS-list
|
prints the first field of every record.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
awk programs are often used as components in larger
programs written in shell.
For example, it is very common to use a shell variable to
hold a pattern that the awk program searches for.
There are two ways to get the value of the shell variable
into the body of the awk program.
The most common method is to use shell quoting to substitute the variable's value into the program inside the script. For example, in the following program:
echo -n "Enter search pattern: "
read pattern
awk "/$pattern/ "'{ nmatches++ }
END { print nmatches, "found" }' /path/to/data
|
the awk program consists of two pieces of quoted text
that are concatenated together to form the program.
The first part is double-quoted, which allows substitution of
the pattern variable inside the quotes.
The second part is single-quoted.
Variable substitution via quoting works, but can be potentially messy. It requires a good understanding of the shell's quoting rules (see section Shell Quoting Issues), and it's often difficult to correctly match up the quotes when reading the program.
A better method is to use awk's variable assignment feature
(see section Assigning Variables on the Command Line)
to assign the shell variable's value to an awk variable's
value. Then use dynamic regexps to match the pattern
(see section Using Dynamic Regexps).
The following shows how to redo the
previous example using this technique:
echo -n "Enter search pattern: "
read pattern
awk -v pat="$pattern" '$0 ~ pat { nmatches++ }
END { print nmatches, "found" }' /path/to/data
|
Now, the awk program is just one single-quoted string.
The assignment `-v pat="$pattern"' still requires double quotes,
in case there is whitespace in the value of $pattern.
The awk variable pat could be named pattern
too, but that would be more confusing. Using a variable also
provides more flexibility, since the variable can be used anywhere inside
the program--for printing, as an array subscript, or for any other
use--without requiring the quoting tricks at every point in the program.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
An awk program or script consists of a series of
rules and function definitions interspersed. (Functions are
described later. See section User-Defined Functions.)
A rule contains a pattern and an action, either of which (but not
both) may be omitted. The purpose of the action is to tell
awk what to do once a match for the pattern is found. Thus,
in outline, an awk program generally looks like this:
[pattern] [{ action }]
[pattern] [{ action }]
...
function name(args) { ... }
...
|
An action consists of one or more awk statements, enclosed
in curly braces (`{' and `}'). Each statement specifies one
thing to do. The statements are separated by newlines or semicolons.
The curly braces around an action must be used even if the action
contains only one statement, or if it contains no statements at
all. However, if you omit the action entirely, omit the curly braces as
well. An omitted action is equivalent to `{ print $0 }':
/foo/ { } match |
The following types of statements are supported in awk:
awk
programs. The awk language gives you C-like constructs
(if, for, while, and do) as well as a few
special ones (see section Control Statements in Actions).
if, while, do,
or for statement.
getline command
(see section Explicit Input with getline), the next
statement (see section The next Statement),
and the nextfile statement
(see section Using gawk's nextfile Statement).
print and printf.
See section Printing Output.
delete Statement.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Control statements, such as if, while, and so on,
control the flow of execution in awk programs. Most of the
control statements in awk are patterned on similar statements in C.
All the control statements start with special keywords, such as if
and while, to distinguish them from simple expressions.
Many control statements contain other statements. For example, the
if statement contains another statement that may or may not be
executed. The contained statement is called the body.
To include more than one statement in the body, group them into a
single compound statement with curly braces, separating them with
newlines or semicolons.
7.4.1 The if-elseStatementConditionally execute some awkstatements.7.4.2 The whileStatementLoop until some condition is satisfied. 7.4.3 The do-whileStatementDo specified action while looping until some condition is satisfied. 7.4.4 The forStatementAnother looping statement, that provides initialization and increment clauses. 7.4.5 The breakStatementImmediately exit the innermost enclosing loop. 7.4.6 The continueStatementSkip to the end of the innermost enclosing loop. 7.4.7 The nextStatementStop processing the current input record. 7.4.8 Using gawk'snextfileStatementStop processing the current file. 7.4.9 The exitStatementStop execution of awk.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
if-else Statement
The if-else statement is awk's decision-making
statement. It looks like this:
if (condition) then-body [else else-body] |
The condition is an expression that controls what the rest of the
statement does. If the condition is true, then-body is
executed; otherwise, else-body is executed.
The else part of the statement is
optional. The condition is considered false if its value is zero or
the null string; otherwise the condition is true.
Refer to the following:
if (x % 2 == 0)
print "x is even"
else
print "x is odd"
|
In this example, if the expression `x % 2 == 0' is true (that is,
if the value of x is evenly divisible by two), then the first
print statement is executed; otherwise the second print
statement is executed.
If the else keyword appears on the same line as then-body and
then-body is not a compound statement (i.e., not surrounded by
curly braces), then a semicolon must separate then-body from
the else.
To illustrate this, the previous example can be rewritten as:
if (x % 2 == 0) print "x is even"; else
print "x is odd"
|
If the `;' is left out, awk can't interpret the statement and
it produces a syntax error. Don't actually write programs this way,
because a human reader might fail to see the else if it is not
the first thing on its line.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
while Statement
In programming, a loop is a part of a program that can
be executed two or more times in succession.
The while statement is the simplest looping statement in
awk. It repeatedly executes a statement as long as a condition is
true. For example:
while (condition) body |
body is a statement called the body of the loop,
and condition is an expression that controls how long the loop
keeps running.
The first thing the while statement does is test the condition.
If the condition is true, it executes the statement body.
(The condition is true when the value
is not zero and not a null string.)
After body has been executed,
condition is tested again, and if it is still true, body is
executed again. This process repeats until the condition is no longer
true. If the condition is initially false, the body of the loop is
never executed and awk continues with the statement following
the loop.
This example prints the first three fields of each record, one per line:
awk '{ i = 1
while (i <= 3) {
print $i
i++
}
}' inventory-shipped
|
The body of this loop is a compound statement enclosed in braces,
containing two statements.
The loop works in the following manner: first, the value of i is set to one.
Then, the while statement tests whether i is less than or equal to
three. This is true when i equals one, so the i-th
field is printed. Then the `i++' increments the value of i
and the loop repeats. The loop terminates when i reaches four.
A newline is not required between the condition and the body; however using one makes the program clearer unless the body is a compound statement or else is very simple. The newline after the open-brace that begins the compound statement is not required either, but the program is harder to read without it.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
do-while Statement
The do loop is a variation of the while looping statement.
The do loop executes the body once and then repeats the
body as long as the condition is true. It looks like this:
do body while (condition) |
Even if the condition is false at the start, the body is
executed at least once (and only once, unless executing body
makes condition true). Contrast this with the corresponding
while statement:
while (condition) body |
This statement does not execute body even once if the condition
is false to begin with.
The following is an example of a do statement:
{ i = 1
do {
print $0
i++
} while (i <= 10)
}
|
This program prints each input record ten times. However, it isn't a very
realistic example, since in this case an ordinary while would do
just as well. This situation reflects actual experience; only
occasionally is there a real use for a do statement.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
for Statement
The for statement makes it more convenient to count iterations of a
loop. The general form of the for statement looks like this:
for (initialization; condition; increment) body |
The initialization, condition, and increment parts are
arbitrary awk expressions, and body stands for any
awk statement.
The for statement starts by executing initialization.
Then, as long
as the condition is true, it repeatedly executes body and then
increment. Typically, initialization sets a variable to
either zero or one, increment adds one to it, and condition
compares it against the desired number of iterations.
For example:
awk '{ for (i = 1; i <= 3; i++)
print $i
}' inventory-shipped
|
This prints the first three fields of each input record, with one field per line.
It isn't possible to
set more than one variable in the
initialization part without using a multiple assignment statement
such as `x = y = 0'. This makes sense only if all the initial values
are equal. (But it is possible to initialize additional variables by writing
their assignments as separate statements preceding the for loop.)
The same is true of the increment part. Incrementing additional
variables requires separate statements at the end of the loop.
The C compound expression, using C's comma operator, is useful in
this context but it is not supported in awk.
Most often, increment is an increment expression, as in the previous example. But this is not required; it can be any expression whatsoever. For example, the following statement prints all the powers of two between 1 and 100:
for (i = 1; i <= 100; i *= 2) print i |
If there is nothing to be done, any of the three expressions in the
parentheses following the for keyword may be omitted. Thus,
`for (; x > 0;)' is equivalent to `while (x > 0)'. If the
condition is omitted, it is treated as true, effectively
yielding an infinite loop (i.e., a loop that never terminates).
In most cases, a for loop is an abbreviation for a while
loop, as shown here:
initialization
while (condition) {
body
increment
}
|
The only exception is when the continue statement
(see section The continue Statement) is used
inside the loop. Changing a for statement to a while
statement in this way can change the effect of the continue
statement inside the loop.
The awk language has a for statement in addition to a
while statement because a for loop is often both less work to
type and more natural to think of. Counting the number of iterations is
very common in loops. It can be easier to think of this counting as part
of looping rather than as something to do inside the loop.
There is an alternate version of the for loop, for iterating over
all the indices of an array:
for (i in array)
do something with array[i]
|
See section Scanning All Elements of an Array,
for more information on this version of the for loop.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
break Statement
The break statement jumps out of the innermost for,
while, or do loop that encloses it. The following example
finds the smallest divisor of any integer, and also identifies prime
numbers:
# find smallest divisor of num
{
num = $1
for (div = 2; div*div <= num; div++)
if (num % div == 0)
break
if (num % div == 0)
printf "Smallest divisor of %d is %d\n", num, div
else
printf "%d is prime\n", num
}
|
When the remainder is zero in the first if statement, awk
immediately breaks out of the containing for loop. This means
that awk proceeds immediately to the statement following the loop
and continues processing. (This is very different from the exit
statement, which stops the entire awk program.
See section The exit Statement.)
Th following program illustrates how the condition of a for
or while statement could be replaced with a break inside
an if:
# find smallest divisor of num
{
num = $1
for (div = 2; ; div++) {
if (num % div == 0) {
printf "Smallest divisor of %d is %d\n", num, div
break
}
if (div*div > num) {
printf "%d is prime\n", num
break
}
}
}
|
The break statement has no meaning when
used outside the body of a loop. However, although it was never documented,
historical implementations of awk treated the break
statement outside of a loop as if it were a next statement
(see section The next Statement).
Recent versions of Unix awk no longer allow this usage.
gawk supports this use of break only
if `--traditional'
has been specified on the command line
(see section Command-Line Options).
Otherwise, it is treated as an error, since the POSIX standard
specifies that break should only be used inside the body of a
loop.
(d.c.)
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
continue Statement
As with break, the continue statement is used only inside
for, while, and do loops. It skips
over the rest of the loop body, causing the next cycle around the loop
to begin immediately. Contrast this with break, which jumps out
of the loop altogether.
The continue statement in a for loop directs awk to
skip the rest of the body of the loop and resume execution with the
increment-expression of the for statement. The following program
illustrates this fact:
BEGIN {
for (x = 0; x <= 20; x++) {
if (x == 5)
continue
printf "%d ", x
}
print ""
}
|
This program prints all the numbers from 0 to 20--except for five, for
which the printf is skipped. Because the increment `x++'
is not skipped, x does not remain stuck at five. Contrast the
for loop from the previous example with the following while loop:
BEGIN {
x = 0
while (x <= 20) {
if (x == 5)
continue
printf "%d ", x
x++
}
print ""
}
|
This program loops forever once x reaches five.
The continue statement has no meaning when used outside the body of
a loop. Historical versions of awk treated a continue
statement outside a loop the same way they treated a break
statement outside a loop: as if it were a next
statement
(see section The next Statement).
Recent versions of Unix awk no longer work this way, and
gawk allows it only if `--traditional' is specified on
the command line (see section Command-Line Options). Just like the
break statement, the POSIX standard specifies that continue
should only be used inside the body of a loop.
(d.c.)
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
next Statement
The next statement forces awk to immediately stop processing
the current record and go on to the next record. This means that no
further rules are executed for the current record, and the rest of the
current rule's action isn't executed.
Contrast this with the effect of the getline function
(see section Explicit Input with getline). That also causes
awk to read the next record immediately, but it does not alter the
flow of control in any way (i.e., the rest of the current action executes
with a new input record).
At the highest level, awk program execution is a loop that reads
an input record and then tests each rule's pattern against it. If you
think of this loop as a for statement whose body contains the
rules, then the next statement is analogous to a continue
statement. It skips to the end of the body of this implicit loop and
executes the increment (which reads another record).
For example, suppose an awk program works only on records
with four fields, and it shouldn't fail when given bad input. To avoid
complicating the rest of the program, write a "weed out" rule near
the beginning, in the following manner:
NF != 4 {
err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR)
print err > "/dev/stderr"
next
}
|
Because of the next statement,
the program's subsequent rules won't see the bad record. The error
message is redirected to the standard error output stream, as error
messages should be.
See section Special File Names in gawk.
According to the POSIX standard, the behavior is undefined if
the next statement is used in a BEGIN or END rule.
gawk treats it as a syntax error.
Although POSIX permits it,
some other awk implementations don't allow the next
statement inside function bodies
(see section User-Defined Functions).
Just as with any other next statement, a next statement inside a
function body reads the next record and starts processing it with the
first rule in the program.
If the next statement causes the end of the input to be reached,
then the code in any END rules is executed.
See section The BEGIN and END Special Patterns.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
gawk's nextfile Statement
gawk provides the nextfile statement,
which is similar to the next statement.
However, instead of abandoning processing of the current record, the
nextfile statement instructs gawk to stop processing the
current data file.
The nextfile statement is a gawk extension.
In most other awk implementations,
or if gawk is in compatibility mode
(see section Command-Line Options),
nextfile is not special.
Upon execution of the nextfile statement, FILENAME is
updated to the name of the next data file listed on the command line,
FNR is reset to one, ARGIND is incremented, and processing
starts over with the first rule in the program.
(ARGIND hasn't been introduced yet. See section 7.5 Built-in Variables.)
If the nextfile statement causes the end of the input to be reached,
then the code in any END rules is executed.
See section The BEGIN and END Special Patterns.
The nextfile statement is useful when there are many data files
to process but it isn't necessary to process every record in every file.
Normally, in order to move on to the next data file, a program
has to continue scanning the unwanted records. The nextfile
statement accomplishes this much more efficiently.
While one might think that `close(FILENAME)' would accomplish
the same as nextfile, this isn't true. close is
reserved for closing files, pipes, and coprocesses that are
opened with redirections. It is not related to the main processing that
awk does with the files listed in ARGV.
If it's necessary to use an awk version that doesn't support
nextfile, see
Implementing nextfile as a Function,
for a user-defined function that simulates the nextfile
statement.
The current version of the Bell Laboratories awk
(see section Other Freely Available awk Implementations)
also supports nextfile. However, it doesn't allow the nextfile
statement inside function bodies
(see section User-Defined Functions).
gawk does; a nextfile inside a
function body reads the next record and starts processing it with the
first rule in the program, just as any other nextfile statement.
Caution: Versions of gawk prior to 3.0 used two
words (`next file') for the nextfile statement.
In version 3.0, this was changed
to one word, because the treatment of `file' was
inconsistent. When it appeared after next, `file' was a keyword;
otherwise, it was a regular identifier. The old usage is no longer
accepted; `next file' generates a syntax error.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
exit Statement
The exit statement causes awk to immediately stop
executing the current rule and to stop processing input; any remaining input
is ignored. The exit statement is written as follows:
exit [return code] |
When an exit statement is executed from a BEGIN rule, the
program stops processing everything immediately. No input records are
read. However, if an END rule is present,
as part of executing the exit statement,
the END rule is executed
(see section The BEGIN and END Special Patterns).
If exit is used as part of an END rule, it causes
the program to stop immediately.
An exit statement that is not part of a BEGIN or END
rule stops the execution of any further automatic rules for the current
record, skips reading any remaining input records, and executes the
END rule if there is one.
In such a case,
if you don't want the END rule to do its job, set a variable
to nonzero before the exit statement and check that variable in
the END rule.
See section Assertions,
for an example that does this.
If an argument is supplied to exit, its value is used as the exit
status code for the awk process. If no argument is supplied,
exit returns status zero (success). In the case where an argument
is supplied to a first exit statement, and then exit is
called a second time from an END rule with no argument,
awk uses the previously supplied exit value.
(d.c.)
For example, suppose an error condition occurs that is difficult or
impossible to handle. Conventionally, programs report this by
exiting with a nonzero status. An awk program can do this
using an exit statement with a nonzero argument, as shown
in the following example:
BEGIN {
if (("date" | getline date_now) <= 0) {
print "Can't get system date" > "/dev/stderr"
exit 1
}
print "current date is", date_now
close("date")
}
|
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Most awk variables are available for you to use for your own
purposes; they never change unless your program assigns values to
them, and they never affect anything unless your program examines them.
However, a few variables in awk have special built-in meanings.
awk examines some of these automatically, so that they enable you
to tell awk how to do certain things. Others are set
automatically by awk, so that they carry information from the
internal workings of awk to your program.
This section documents all the built-in variables of
gawk, most of which are also documented in the chapters
describing their areas of activity.
7.5.1 Built-in Variables That Control awkBuilt-in variables that you change to control awk.
7.5.2 Built-in Variables That Convey Information Built-in variables where awkgives you information.7.5.3 Using ARGCandARGVWays to use ARGCandARGV.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
awk
The following is an alphabetical list of variables that you can change to
control how awk does certain things. The variables that are
specific to gawk are marked with a pound sign (`#').
BINMODE #
"r" or "w" specify that input files and
output files, respectively, should use binary I/O.
A string value of "rw" or "wr" indicates that all
files should use binary I/O.
Any other string value is equivalent to "rw", but gawk
generates a warning message.
BINMODE is described in more detail in
Using gawk on PC Operating Systems.
This variable is a gawk extension.
In other awk implementations
(except mawk,
see section Other Freely Available awk Implementations),
or if gawk is in compatibility mode
(see section Command-Line Options),
it is not special.
CONVFMT
sprintf function
(see section String Manipulation Functions).
Its default value is "%.6g".
CONVFMT was introduced by the POSIX standard.
FIELDWIDTHS #
gawk
how to split input with fixed columnar boundaries.
Assigning a value to FIELDWIDTHS
overrides the use of FS for field splitting.
See section Reading Fixed-Width Data, for more information.
If gawk is in compatibility mode
(see section Command-Line Options), then FIELDWIDTHS
has no special meaning, and field-splitting operations occur based
exclusively on the value of FS.
FS
""), then each
character in the record becomes a separate field.
(This behavior is a gawk extension. POSIX awk does not
specify the behavior when FS is the null string.)
The default value is " ", a string consisting of a single
space. As a special exception, this value means that any
sequence of spaces, tabs, and/or newlines is a single separator.(24) It also causes
spaces, tabs, and newlines at the beginning and end of a record to be ignored.
You can set the value of FS on the command line using the
`-F' option:
awk -F, 'program' input-files |
If gawk is using FIELDWIDTHS for field splitting,
assigning a value to FS causes gawk to return to
the normal, FS-based field splitting. An easy way to do this
is to simply say `FS = FS', perhaps with an explanatory comment.
IGNORECASE #
IGNORECASE is nonzero or non-null, then all string comparisons
and all regular expression matching are case-independent. Thus, regexp
matching with `~' and `!~', as well as the gensub,
gsub, index, match, split, and sub
functions, record termination with RS, and field splitting with
FS, all ignore case when doing their particular regexp operations.
However, the value of IGNORECASE does not affect array subscripting.
See section Case Sensitivity in Matching.
If gawk is in compatibility mode
(see section Command-Line Options),
then IGNORECASE has no special meaning. Thus, string
and regexp operations are always case-sensitive.
LINT #
gawk
behaves as if the `--lint' command-line option is in effect.
(see section Command-Line Options).
With a value of "fatal", lint warnings become fatal errors.
Any other true value prints non-fatal warnings.
Assigning a false value to LINT turns off the lint warnings.
This variable is a gawk extension. It is not special
in other awk implementations. Unlike the other special variables,
changing LINT does affect the production of lint warnings,
even if gawk is in compatibility mode. Much as
the `--lint' and `--traditional' options independently
control different aspects of gawk's behavior, the control
of lint warnings during program execution is independent of the flavor
of awk being executed.
OFMT
print statement. It works by being passed
as the first argument to the sprintf function
(see section String Manipulation Functions).
Its default value is "%.6g". Earlier versions of awk
also used OFMT to specify the format for converting numbers to
strings in general expressions; this is now done by CONVFMT.
OFS
print statement. Its
default value is " ", a string consisting of a single space.
ORS
print statement. Its default value is "\n", the newline
character. (See section 5.3 Output Separators.)
RS
awk's input record separator. Its default value is a string
containing a single newline character, which means that an input record
consists of a single line of text.
It can also be the null string, in which case records are separated by
runs of blank lines.
If it is a regexp, records are separated by
matches of the regexp in the input text.
(See section How Input Is Split into Records.)
The ability for RS to be a regular expression
is a gawk extension.
In most other awk implementations,
or if gawk is in compatibility mode
(see section Command-Line Options),
just the first character of RS's value is used.
SUBSEP
"\034" and is used to separate the parts of the indices of a
multidimensional array. Thus, the expression foo["A", "B"]
really accesses foo["A\034B"]
(see section Multidimensional Arrays).
TEXTDOMAIN #
awk level. It sets the default text domain for specially
marked string constants in the source text, as well as for the
dcgettext and bindtextdomain functions
(see section Internationalization with gawk).
The default value of TEXTDOMAIN is "messages".
This variable is a gawk extension.
In other awk implementations,
or if gawk is in compatibility mode
(see section Command-Line Options),
it is not special.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The following is an alphabetical list of variables that awk
sets automatically on certain occasions in order to provide
information to your program. The variables that are specific to
gawk are marked with an asterisk (`*').
ARGC, ARGV
awk programs are stored in
an array called ARGV. ARGC is the number of command-line
arguments present. See section Other Command-Line Arguments.
Unlike most awk arrays,
ARGV is indexed from 0 to ARGC - 1.
In the following example:
$ awk 'BEGIN {
> for (i = 0; i < ARGC; i++)
> print ARGV[i]
> }' inventory-shipped BBS-list
-| awk
-| inventory-shipped
-| BBS-list
|
ARGV[0] contains "awk", ARGV[1]
contains "inventory-shipped" and ARGV[2] contains
"BBS-list". The value of ARGC is three, one more than the
index of the last element in ARGV, because the elements are numbered
from zero.
The names ARGC and ARGV, as well as the convention of indexing
the array from 0 to ARGC - 1, are derived from the C language's
method of accessing command-line arguments.
The value of ARGV[0] can vary from system to system.
Also, you should note that the program text is not included in
ARGV, nor are any of awk's command-line options.
See section Using ARGC and ARGV, for information
about how awk uses these variables.
ARGIND #
ARGV of the current file being processed.
Every time gawk opens a new data file for processing, it sets
ARGIND to the index in ARGV of the file name.
When gawk is processing the input files,
`FILENAME == ARGV[ARGIND]' is always true.
This variable is useful in file processing; it allows you to tell how far along you are in the list of data files as well as to distinguish between successive instances of the same file name on the command line.
While you can change the value of ARGIND within your awk
program, gawk automatically sets it to a new value when the
next file is opened.
This variable is a gawk extension.
In other awk implementations,
or if gawk is in compatibility mode
(see section Command-Line Options),
it is not special.
ENVIRON
ENVIRON["HOME"] might be `/home/arnold'. Changing this array
does not affect the environment passed on to any programs that
awk may spawn via redirection or the system function.
Some operating systems may not have environment variables.
On such systems, the ENVIRON array is empty (except for
ENVIRON["AWKPATH"],
see section The AWKPATH Environment Variable).
ERRNO #
getline,
during a read for getline, or during a close operation,
then ERRNO contains a string describing the error.
This variable is a gawk extension.
In other awk implementations,
or if gawk is in compatibility mode
(see section Command-Line Options),
it is not special.
FILENAME
awk is currently reading.
When no data files are listed on the command line, awk reads
from the standard input and FILENAME is set to "-".
FILENAME is changed each time a new file is read
(see section Reading Input Files).
Inside a BEGIN rule, the value of FILENAME is
"", since there are no input files being processed
yet.(25)
(d.c.)
Note though, that using getline
(see section Explicit Input with getline)
inside a BEGIN rule can give
FILENAME a value.
FNR
FNR is
incremented each time a new record is read
(see section Explicit Input with getline). It is reinitialized
to zero each time a new input file is started.
NF
NF is set each time a new record is read, when a new field is
created or when $0 changes (see section Examining Fields).
NR
awk has processed since
the beginning of the program's execution
(see section How Input Is Split into Records).
NR is incremented each time a new record is read.
PROCINFO #
awk program.
The following elements (listed alphabetically)
are guaranteed to be available:
PROCINFO["egid"]
getegid system call.
PROCINFO["euid"]
geteuid system call.
PROCINFO["FS"]
"FS" if field splitting with FS is in effect, or it is
"FIELDWIDTHS" if field splitting with FIELDWIDTHS is in effect.
PROCINFO["gid"]
getgid system call.
PROCINFO["pgrpid"]
PROCINFO["pid"]
PROCINFO["ppid"]
PROCINFO["uid"]
getuid system call.
On some systems, there may be elements in the array, "group1"
through "groupN" for some N. N is the number of
supplementary groups that the process has. Use the in operator
to test for these elements
(see section Referring to an Array Element).
This array is a gawk extension.
In other awk implementations,
or if gawk is in compatibility mode
(see section Command-Line Options),
it is not special.
RLENGTH
match function
(see section String Manipulation Functions).
RLENGTH is set by invoking the match function. Its value
is the length of the matched string, or -1 if no match is found.
RSTART
match function
(see section String Manipulation Functions).
RSTART is set by invoking the match function. Its value
is the position of the string where the matched substring starts, or zero
if no match was found.
RT #
RS, the record separator.
This variable is a gawk extension.
In other awk implementations,
or if gawk is in compatibility mode
(see section Command-Line Options),
it is not special.
NR and FNR awk increments NR and FNR
each time it reads a record, instead of setting them to the absolute
value of the number of records read. This means that a program can
change these variables and their new values are incremented for
each record.
(d.c.)
This is demonstrated in the following example:
$ echo '1
> 2
> 3
> 4' | awk 'NR == 2 { NR = 17 }
> { print NR }'
-| 1
-| 17
-| 18
-| 19
|
Before FNR was added to the awk language
(see section Major Changes Between V7 and SVR3.1),
many awk programs used this feature to track the number of
records in a file by resetting NR to zero when FILENAME
changed.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
ARGC and ARGV
Built-in Variables That Convey Information,
presented the following program describing the information contained in ARGC
and ARGV:
$ awk 'BEGIN {
> for (i = 0; i < ARGC; i++)
> print ARGV[i]
> }' inventory-shipped BBS-list
-| awk
-| inventory-shipped
-| BBS-list
|
In this example, ARGV[0] contains `awk', ARGV[1]
contains `inventory-shipped', and ARGV[2] contains
`BBS-list'.
Notice that the awk program is not entered in ARGV. The
other special command-line options, with their arguments, are also not
entered. This includes variable assignments done with the `-v'
option (see section Command-Line Options).
Normal variable assignments on the command line are
treated as arguments and do show up in the ARGV array:
$ cat showargs.awk
-| BEGIN {
-| printf "A=%d, B=%d\n", A, B
-| for (i = 0; i < ARGC; i++)
-| printf "\tARGV[%d] = %s\n", i, ARGV[i]
-| }
-| END { printf "A=%d, B=%d\n", A, B }
$ awk -v A=1 -f showargs.awk B=2 /dev/null
-| A=1, B=0
-| ARGV[0] = awk
-| ARGV[1] = B=2
-| ARGV[2] = /dev/null
-| A=1, B=2
|
A program can alter ARGC and the elements of ARGV.
Each time awk reaches the end of an input file, it uses the next
element of ARGV as the name of the next input file. By storing a
different string there, a program can change which files are read.
Use "-" to represent the standard input. Storing
additional elements and incrementing ARGC causes
additional files to be read.
If the value of ARGC is decreased, that eliminates input files
from the end of the list. By recording the old value of ARGC
elsewhere, a program can treat the eliminated arguments as
something other than file names.
To eliminate a file from the middle of the list, store the null string
("") into ARGV in place of the file's name. As a
special feature, awk ignores file names that have been
replaced with the null string.
Another option is to
use the delete statement to remove elements from
ARGV (see section The delete Statement).
All of these actions are typically done in the BEGIN rule,
before actual processing of the input begins.
See section Splitting a Large File into Pieces, and see
Duplicating Output into Multiple Files, for examples
of each way of removing elements from ARGV.
The following fragment processes ARGV in order to examine, and
then remove, command-line options:
BEGIN {
for (i = 1; i < ARGC; i++) {
if (ARGV[i] == "-v")
verbose = 1
else if (ARGV[i] == "-d")
debug = 1
else if (ARGV[i] ~ /^-?/) {
e = sprintf("%s: unrecognized option -- %c",
ARGV[0], substr(ARGV[i], 1, ,1))
print e > "/dev/stderr"
} else
break
delete ARGV[i]
}
}
|
To actually get the options into the awk program,
end the awk options with `--' and then supply
the awk program's options, in the following manner:
awk -f myprog -- -v -d file1 file2 ... |
This is not necessary in gawk. Unless `--posix' has
been specified, gawk silently puts any unrecognized options
into ARGV for the awk program to deal with. As soon
as it sees an unknown option, gawk stops looking for other
options that it might otherwise recognize. The previous example with
gawk would be:
gawk -f myprog -d -v file1 file2 ... |
Because `-d' is not a valid gawk option,
it and the following `-v'
are passed on to the awk program.
| [ << ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |