Perl is famous for processing text files via regular expressions.
Regular Expressions in Perl
A Regular Expression (or Regex) is a pattern (or filter) that describes a set of strings that matches the pattern. In other words, a regex accepts a certain set of strings and rejects the rest.
I shall assume that you are familiar with Regex syntax. Otherwise, you could read:
- "Regex Syntax Summary" for a summary of regex syntax and examples.
- "Regular Expressions" for full coverage.
Perl makes extensive use of regular expressions with many built-in syntaxes and operators. In  Perl (and JavaScript), a regex is delimited by  a pair of forward slashes (default), in the form of /regex/. You can use built-in operators:
- m/regex/modifier: Match against the regex.
- s/regex/replacement/modifier: Substitute matched substring(s) by the replacement.
Matching Operator m//
You can use matching operator m// to check if a regex pattern exists in a string.  The syntax is:
m/regex/
m/regex/modifiers   # Optional modifiers
/regex/             # Operator m can be omitted if forward-slashes are used as delimiter
/regex/modifiers
Delimiter
Instead of using forward-slashes (/) as delimiter, you could use other non-alphanumeric characters such as !, @ and % in the form of m!regex!modifiers m@regex@modifiers or m%regex%modifiers. However, if forward-slash (/) is used as the delimiter, the operator m can be omitted in the form of /regex/modifiers. Changing the default delimiter is confusing, and not recommended.
m//, by default, operates on the default variable $_.  It returns true if $_ matches regex; and false otherwise.
Example 1: Regex [0-9]+
#!/usr/bin/env perl
# try_m_1.pl
use strict;
use warnings;

while (<>) {   # Read each line of input (STDIN, or files named on the command line) into $_
   print m/[0-9]+/ ? "Accept\n" : "Reject\n";   # one or more digits?
}
$ ./try_m_1.pl
123
Accept
00000
Accept
abc
Reject
abc123
Accept
Example 2: Extracting the Matched Substrings
The built-in array variables @- and @+ hold the start and end offsets of the matched substring: $-[0] and $+[0] refer to the full match, and $-[n] and $+[n] refer to the back-references $1, $2, ..., $n.
#!/usr/bin/env perl
# try_m_2.pl
use strict;
use warnings;

while (<>) {   # Read each line of input into default variable $_
   if (m/[0-9]+/) {
      print 'Accept substring: ' . substr($_, $-[0], $+[0] - $-[0]) . "\n";
   } else {
      print "Reject\n";
   }
}
$ ./try_m_2.pl
123
Accept substring: 123
00000
Accept substring: 00000
abc
Reject
abc123xyz
Accept substring: 123
abc123xyz456
Accept substring: 123
Example 3: Modifier 'g' (global)
By default, m// finds only the first match. To find all matches, include 'g' (global) modifier.
#!/usr/bin/env perl
# try_m_3.pl
use strict;
use warnings;

my $regex = '[0-9]+';   # Define regex pattern in non-interpolating string

while (<>) {   # Read each line of input into default variable $_
   # Do m//g and save matched substrings into an array
   my @matches = /$regex/g;
   print "Matched substrings (in array): @matches\n";   # print array

   # Do m//g in a loop
   print 'Matched substrings (in loop) : ';
   while (/$regex/g) {
      print substr($_, $-[0], $+[0] - $-[0]), ',';
   }
   print "\n";
}
$ ./try_m_3.pl
abc123xyz456_0_789
Matched substrings (in array): 123 456 0 789
Matched substrings (in loop) : 123,456,0,789,
abc
Matched substrings (in array):
Matched substrings (in loop) :
123
Matched substrings (in array): 123
Matched substrings (in loop) : 123,
Operators =~ and !~
By default, the matching operators operate on the default variable $_.  To operate on other variable instead of $_, you could use the =~ and !~ operators as follows:
str =~ m/regex/modifiers   # Return true if str matches regex.
str !~ m/regex/modifiers   # Return true if str does NOT match regex.
When used with m//, =~ behaves like comparison (== or eq).
Example 4: =~ Operator
#!/usr/bin/env perl
# try_m_4.pl
use strict;
use warnings;

print 'yes or no? ';
my $reply;
chomp($reply = <>);   # Remove newline
print $reply =~ /^y/i ? "positive!\n" : "negative!\n";   # Begins with 'y', case-insensitive
Substitution Operator s///
You can substitute a string (or a portion of a string) with another string using s/// substitution operator.  The syntax is:
s/regex/replacement/
s/regex/replacement/modifiers  # Optional modifiers
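Similar to m//, s/// operates on the default variable $_ by default. To operate on another variable, bind it with the =~ operator, as in $str =~ s/regex/replacement/. When used with s///, =~ behaves somewhat like assignment (=): the variable on the left is modified in place, and the expression returns the number of substitutions made.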
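As a small sketch of s/// bound to a named variable (the sub name and sample strings are illustrative only): the substitution edits the variable in place and returns the number of substitutions made. The /r modifier, available from Perl 5.14, returns a modified copy instead and leaves the original untouched.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Replace every run of digits in a copy of the text by '#';
# return the new text and the number of substitutions.
sub mask_digits {
    my ($text) = @_;
    my $count = ($text =~ s/[0-9]+/#/g);   # edits $text in place
    return ($text, $count || 0);           # s/// returns an empty string when nothing matched
}

my ($masked, $n) = mask_digits('abc123xyz456');
print "$masked ($n substitutions)\n";      # abc#xyz# (2 substitutions)

# /r (Perl 5.14+) returns the modified copy and leaves the original alone.
my $original = 'abc123';
my $copy     = $original =~ s/[0-9]+/#/gr;
print "$original -> $copy\n";              # abc123 -> abc#
```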
Example 5: s///
#!/usr/bin/env perl
# try_s_1.pl
use strict;
use warnings;

while (<>) {   # Read each line of input into default variable $_
   s/\w+/***/g;   # Replace each word with ***
   print;
}
$ ./try_s_1.pl
this is an apple.
*** *** *** ***.
Modifiers
Modifiers (such as /g, /i, /e, /o, /s and /x) can be used to control the behavior of m// and s///.
- g (global): By default, only the first occurrence of the matching string of each line is processed. You can use the modifier /g to specify global operation.
- i (case-insensitive): By default, matching is case-sensitive. You can use the modifier /i to enable case-insensitive matching.
- m (multiline): treats the string as multiple lines, so that the position anchors ^ and $ match at the start and end of each line. \A, \Z and \z are not affected; they always anchor to the whole string.
- s: permits metacharacter .(dot) to match the newline.
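The following sketch illustrates the /m, /s, /i and /x modifiers on a two-line string. The sub names and the sample text are made up for illustration; /x (which lets you lay out a pattern with whitespace and comments) is included because it is listed among the modifiers above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# /m: ^ matches at the start of every line, so this returns the first word of each line.
sub first_words {
    my ($text) = @_;
    my @words = $text =~ /^(\w+)/mg;
    return @words;
}

# /s: the dot also matches a newline, so .* can span lines.
# /i: letters match regardless of case.
sub spans_lines {
    my ($text) = @_;
    return $text =~ /first.*second/si ? 1 : 0;
}

# /x: whitespace and comments inside the pattern are ignored.
sub last_word {
    my ($text) = @_;
    my ($word) = $text =~ /
        (\w+)     # capture a word
        \s* \z    # followed only by optional whitespace up to the very end
    /x;
    return $word;
}

my $text = "First line\nsecond LINE\n";
print join(',', first_words($text)), "\n";   # First,second
print spans_lines($text), "\n";              # 1
print last_word($text), "\n";                # LINE
```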
Parenthesized Back-References & Matched Variables $1, ..., $9
Parentheses ( ) serve two purposes in regex:
- Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, /(a|e|i|o|u){3,5}/ matches a run of 3 to 5 vowels in any mix (the same as /[aeiou]{3,5}/), whereas without the parentheses, /a|e|i|o|u{3,5}/ would apply the repetition to u only.
- Secondly, parentheses are used to provide the so-called back-references. A back-reference contains the matched sub-string. For example, the regex /(\S+)/ creates one back-reference (\S+), which contains the first word (consecutive non-spaces) in the input string; the regex /(\S+)\s+(\S+)/ creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.
The back-references are stored in special variables $1, $2, …, $9 (and beyond), where $1 contains the substring matched by the first pair of parentheses, and so on. For example, /(\S+)\s+(\S+)/ creates two back-references which match the first two words. The matched words are stored in $1 and $2, respectively.
For example, the following expression swaps the first and second words:
s/(\S+) (\S+)/$2 $1/;   # Swap the first and second words separated by a single space
Back-references can also be referenced in your program.
For example,
(my $word) = ($str =~ /(\S+)/);
The parentheses create one back-reference, which matches the first word of $str if there is one; the matched word is placed in the scalar variable $word. If there is no match, $word is undef.
Another example,
(my $word1, my $word2) = ($str =~ /(\S+)\s+(\S+)/);
The two pairs of parentheses place the first two words (separated by one or more white-spaces) of $str into variables $word1 and $word2 if there are at least two words; otherwise, both $word1 and $word2 are undef. Note that the whole regex must match; there is no partial matching that would fill $word1 alone.
\1, \2, \3 refer to the same groups as $1, $2, $3, but are meant for use inside the regex pattern itself (of m// or s///); in the replacement part of s/// and elsewhere in the program, use $1, $2, $3. For example, /(\S+)\s\1/ matches a pair of repeated words, separated by a white-space.
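A short sketch of \1 inside a pattern versus $1 outside it, assuming "word" means a run of \w characters (the sub names and sample sentence are hypothetical):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return every word that is immediately repeated, e.g. "the the".
sub repeated_words {
    my ($str) = @_;
    my @found;
    # \b keeps the match on whole words; \1 must equal what group 1 captured.
    while ($str =~ /\b(\w+)\s+\1\b/g) {
        push @found, $1;      # outside the pattern, use $1 rather than \1
    }
    return @found;
}

# Collapse each doubled word into one; $1 is used in the replacement part.
sub collapse_repeats {
    my ($str) = @_;
    $str =~ s/\b(\w+)\s+\1\b/$1/g;
    return $str;
}

my $s = 'this is is a test of the the regex';
print join(',', repeated_words($s)), "\n";   # is,the
print collapse_repeats($s), "\n";            # this is a test of the regex
```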
Character Translation Operator tr///
You can use translator operator to translate a character into another character. The syntax is:
tr/fromchars/tochars/modifiers
replaces or translates fromchars to tochars in $_, and returns the number of characters replaced.
For examples,
tr/a-z/A-Z/           # converts $_ to uppercase.
tr/dog/cat/           # translates d to c, o to a, g to t.
$str =~ tr/0-9/a-j/   # replace 0 by a, etc in $str.
tr/A-CG/KX-Z/         # replace A by K, B by X, C by Y, G by Z.
Instead of forward slash (/), you can use parentheses (), brackets [], curly bracket {} as delimiter, e.g.,
tr[0-9][##########]   # replace numbers by #.
tr{!.}(.!)            # swap ! and ., one pass.
If tochars is shorter than fromchars, the last character of tochars is used repeatedly.
tr/a-z/A-E/       # f to z is replaced by E.
tr/// returns the number of replaced characters.  You can use it to count the occurrence of certain characters.  For examples,
my $numLetters = ($string =~ tr/a-zA-Z/a-zA-Z/);
my $numDigits  = ($string =~ tr/0-9/0-9/);
my $numSpaces  = ($string =~ tr/ / /);
Modifiers /c, /d and /s for tr///
- /c: complements (inverts) fromchars.
- /d: deletes any matched but un-replaced characters.
- /s: squashes duplicate characters into just one.
For examples,
tr/A-Za-z/ /c    # replaces all non-alphabets with space
tr/A-Z//d        # deletes all uppercase (matched with no replacement).
tr/A-Za-z//dc    # deletes all non-alphabets
tr/!//s          # squashes duplicate !
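The modifiers can be tried out with a small sketch such as the following; the sub names and sample strings are illustrative, not part of any library.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Count the vowels without changing the string: with an empty tochars
# (and no /d), tr replaces each character by itself.
sub count_vowels {
    my ($str) = @_;
    return ($str =~ tr/aeiouAEIOU//);
}

# /c and /d together: delete everything that is NOT a letter or digit.
sub keep_alnum {
    my ($str) = @_;
    $str =~ tr/A-Za-z0-9//cd;
    return $str;
}

# /s: squash runs of the same translated character into one.
sub squash_spaces {
    my ($str) = @_;
    $str =~ tr/ //s;
    return $str;
}

print count_vowels('Hello, World'), "\n";            # 3
print keep_alnum('a-b_c 1!2'), "\n";                 # abc12
print squash_spaces('too    many   spaces'), "\n";   # too many spaces
```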
String Functions: split and join
split(regex, str, [numItems]): Splits the given str using the regex, and returns the items in an array. The optional third parameter limits the number of items returned.
join(joinStr, strList): Joins the items in strList with the given joinStr (possibly empty).
For examples,
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';   # say() must be enabled explicitly

my $msg = 'Hello, world again!';
my @words = split(/ /, $msg);    # ('Hello,', 'world', 'again!')
for (@words) { say; }            # Use default scalar variable
say join('--', @words);          # 'Hello,--world--again!'
my $newMsg = join '', @words;    # 'Hello,worldagain!'
say $newMsg;
Functions grep, map
- grep(regex, array): selects those elements of the array that match regex.
- map(expr, array): returns a new array constructed by evaluating expr (for example, a regex match or substitution) for each element of the array, with $_ set to that element.
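A minimal sketch of both functions, using the block form grep { ... } list and map { ... } list (the sub names and sample data are made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# grep with a regex keeps the elements that match it.
sub lines_with_digits {
    return grep { /[0-9]/ } @_;
}

# map evaluates its block for each element ($_) and collects the results;
# here each line containing a number is reduced to that number.
sub first_numbers {
    return map { /([0-9]+)/ ? $1 : () } @_;
}

my @lines = ('apple 3', 'banana', 'cherry 12', 'date');
print join('|', lines_with_digits(@lines)), "\n";   # apple 3|cherry 12
print join('|', first_numbers(@lines)), "\n";       # 3|12
```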
File Input/Output
Filehandle
Filehandles are data structure which your program can use to manipulate files. A filehandle acts as a gate between your program and the files, directories, or other programs. Your program first opens a gate, then sends or receives data through the gate, and finally closes the gate. There are many types of gates: one-way vs. two-way, slow vs. fast, wide vs. narrow.
Naming Convention: use uppercase for the name of the filehandle, e.g., FILE,  DIR, FILEIN, FILEOUT, and etc.
Once a filehandle is created and connected to a file (or a directory, or a program), you can read or write to the underlying file through the filehandle using angle brackets, e.g., <FILEHANDLE>.
Example: Read and print the content of a text file via a filehandle.
#!/usr/bin/env perl
use strict;
use warnings;

# FileRead.pl: Read & print the content of a text file.
my $filename = shift;   # Get the filename from command line.
# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
while (<FILE>) {   # Set $_ to each line of the file
   print;          # Print $_
}
Example: Search and print lines containing a particular search word.
#!/usr/bin/env perl
use strict;
use warnings;

# FileSearch.pl: Search for lines containing a search word.
(my $filename, my $word) = @ARGV;   # Get filename & search word.
# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
while (<FILE>) {              # Set $_ to each line of the file
   print if /\b$word\b/i;     # Match $_ with word, case insensitive
}
Example: Print the content of a directory via a directory handle.
#!/usr/bin/env perl
use strict;
use warnings;

# DirPrint.pl: Print the content of a directory.
my $dirname = shift;   # Get directory name from command-line
opendir(DIR, $dirname) or die "Can't open directory $dirname: $!";
my @files = readdir(DIR);
foreach my $file (@files) {
   # Display files not beginning with dot.
   print "$file\n" if ($file !~ /^\./);
}
You can use C-style printf for formatted output to a file.
File Handling Functions
Function open: open(filehandle, string) opens the filename given by string and associates it with the filehandle. It returns true on success and undef otherwise.
- If string begins with < (or nothing), it is opened for reading.
- If string begins with >, it is opened for writing.
- If string begins with >>, it is opened for appending.
- If string begins with +<, +> or +>>, it is opened for both reading and writing.
- If string is -, STDIN is opened.
- If string is >-, STDOUT is opened.
- If string begins with -| or |-, your process will fork() to execute the pipe command.
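Besides the two-argument form described above, Perl also accepts a three-argument form, open(filehandle, mode, filename), which keeps the mode separate from the filename, and the filehandle may be a lexical variable (my $fh) instead of an uppercase bareword. The sketch below is one way to use it; the sub name is hypothetical, and File::Temp is used only to obtain a scratch file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the given lines to $path, append one more line, then read everything back.
sub write_append_read {
    my ($path, @lines) = @_;

    open(my $out, '>', $path) or die "Can't write to $path: $!";
    print {$out} @lines;
    close($out) or die "Can't close $path: $!";

    open(my $app, '>>', $path) or die "Can't append to $path: $!";
    print {$app} "appended\n";
    close($app) or die "Can't close $path: $!";

    open(my $in, '<', $path) or die "Can't open $path: $!";
    my @content = <$in>;
    close($in);
    return @content;
}

my (undef, $path) = tempfile(UNLINK => 1);   # scratch file, removed at exit
print write_append_read($path, "line 1\n", "line 2\n");
```

Because the mode is a separate argument, a filename that happens to begin with > or | cannot change how the file is opened.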
Function close: close(filehandle) closes the file associated with the filehandle. When the program exits, Perl closes all opened filehandles. Closing of file flushes the output buffer to the file.  You only have to explicitly close the file in case the user aborts the program, to ensure data integrity.
A common procedure for modifying a file is to:
- Read in the entire file with open(FILE, $filename) and @lines = <FILE>.
- Close the filehandle.
- Operate upon @lines (which is in the fast RAM) rather than FILE (which is in the slow disk).
- Write the new file contents using open(FILE, ">$filename") and print FILE @lines.
- Close the file handle.
Example: Read the contents of the entire file into memory; modify and write back to disk.
#!/usr/bin/env perl
use strict;
use warnings;

# FileChange.pl
my $filename = shift;   # Get the filename from command line.
# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
# Read the entire file into an array in memory.
my @lines = <FILE>;
close(FILE);

open(FILE, ">$filename") or die "Can't write to $filename: $!";
foreach my $line (@lines) {
   print FILE uc($line);   # Change to uppercase
}
close(FILE);
Example: Reading from a file
#!/usr/bin/env perl
use strict;
use warnings;
   
open(FILEIN, "test.txt") or die "Can't open file: $!";
while (<FILEIN>) {     # set $_ to each line of the file.
   print;              # print $_
}
Example: Writing to a file
#!/usr/bin/env perl
use strict;
use warnings;

my $filename = shift;   # Get the file from command line.
open(FILE, ">$filename") or die "Can't write to $filename: $!";
print FILE "This is line 1\n";   # no comma after FILE.
print FILE "This is line 2\n";
print FILE "This is line 3\n";
Example: Appending to a file
#!/usr/bin/env perl
use strict;
use warnings;

my $filename = shift;   # Get the file from command line.
open(FILE, ">>$filename") or die "Can't append to $filename: $!";
print FILE "This is line 4\n";   # no comma after FILE.
print FILE "This is line 5\n";
In-Place Editing
Instead of reading in one file and writing to another file, you could do in-place editing by specifying the -i flag or by using the special variable $^I.
- The -ibackupExtension flag tells Perl to edit files in-place. If a backupExtension is provided, a backup file will be created with the backupExtension.
- The special variable $^I = backupExtension does the same thing.
Example: In-place editing using -i flag
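#!/usr/bin/perl -i.old
# In-place edit, backup as '.old'.
# The interpreter path is given directly (adjust it to your system) because
# many systems pass "perl -i.old" to env as a single argument.
use strict;
use warnings;

while (<>) {
   s/line/TEST/g;
   print;   # Print to the file, not STDOUT.
}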
Example:  In-place editing using $^I special variable.
#!/usr/bin/env perl
use strict;
use warnings;

$^I = '.bak';   # Enable in-place editing, backup in '.bak'.
while (<>) {
   s/TEST/line/g;
   print;   # Print to the file, not STDOUT.
}
Functions seek, tell, truncate
seek(filehandle, position, whence): moves the file pointer of the filehandle to position, as  measured from whence.  seek()  returns 1 upon success and  0 otherwise.  File position is measured in bytes.  whence of 0  measured from the beginning of the file; 1  measured from the current position; and 2  measured from the end.  For example:
seek(FILE, 0, 2);     # 0 byte from end-of-file; tell(FILE) then gives the file size.
seek(FILE, -2, 2);    # 2 bytes before end-of-file.
seek(FILE, -10, 1);   # Move file pointer 10 bytes backward.
seek(FILE, 20, 0);    # 20 bytes from the beginning of the file.
tell(filehandle): returns the current file position of filehandle.
truncate(FILE, length): truncates FILE to length bytes.  FILE can be  either a filehandle or a file name.
To find the length of a file, you could:
seek(FILE, 0, 2);   # Move file pointer to end of file.
print tell(FILE);   # Print the file size.
Example: Truncate the last 2 bytes if they begin with \x0D.
#!/usr/bin/env perl
use strict;
use warnings;

my $filename = shift;   # Get the file from command line.
open(FILE, "+<$filename") or die "Can't open $filename: $!";
seek(FILE, -2, 2);      # 2 bytes before end-of-file.
my $pos = tell FILE;
my $data = <FILE>;      # read moves the file pointer.
if ($data =~ /^\x0D/) { # begin with 0D
   truncate FILE, $pos; # truncate last 2 bytes.
}
Function eof
eof(filehandle) returns 1 if  the file pointer is positioned at the end of the file or if the filehandle is not opened.
Reading Bytes Instead of Lines
The function read(filehandle, var, length, offset) reads length bytes from filehandle starting from the current file pointer, and saves them into variable var starting from offset (if omitted, default is 0). The bytes include \x0A, \x0D, etc.
Example
#!/usr/bin/env perl
use strict;
use warnings;

(my $numbytes, my $filename) = @ARGV;
open(FILE, $filename) or die "Can't open $filename: $!";
my $data;
read(FILE, $data, $numbytes);
print $data, "\n----\n";
read(FILE, $data, $numbytes);      # continue from current file ptr
print $data, "\n----\n";
read(FILE, $data, $numbytes, 2);   # save in $data offset 2
print $data, "\n----\n";
Piping Data To and From a Process
If you wish your program to receive data from a process or want your program to send data to a process, you could open a pipe to an external program.
- open(handle, "command|") lets you read from the output of command.
- open(handle, "|command") lets you write to the input of command.
Both of these statements return the Process ID (PID) of the command.
Example: The dir command lists the current directory.  By opening a pipe from dir,  you can access its output.
#!/usr/bin/env perl
use strict;
use warnings;
  
open(PIPEFROM, "dir|") or die "Pipe failed: $!";
while (<PIPEFROM>) {
   print;
}
close PIPEFROM;
Example: This example shows how you can pipe input into the sendmail program.
#!/usr/bin/env perl
use strict;
use warnings;

my $my_login = 'test';
open(MAIL, "| sendmail -t -f$my_login") or die "Pipe failed: $!";
print MAIL "From: test101\@test.com\n";   # no comma after MAIL; @ is escaped inside double quotes
print MAIL "To: test102\@test.com\n";
print MAIL "Subject: test\n";
print MAIL "\n";
print MAIL "Testing line 1\n";
print MAIL "Testing line 2\n";
close MAIL;
You cannot pipe data both to and from a command.  If you want to read the output of a command  that you have opened with the |command,  send the output to a file.  For example,
open (PIPETO, "|command > /output.txt");
Deleting file: Function unlink
unlink(FILES) deletes the FILES,  returning the number of files deleted.   Do not use unlink()  to delete a directory, use rmdir()  instead. For example,
unlink $filename;
unlink "/var/adm/message";
unlink "message";
Inspecting Files
You can inspect a file using the (-test FILE) condition. The condition returns true if FILE satisfies test. FILE can be a filehandle or filename. The available tests are:
- -e: exists.
- -f: plain file.
- -d: directory.
- -T: seems to be a text file (data from 0 to 127).
- -B: seems to be a binary file (data from 0 to 255).
- -r: readable.
- -w: writable.
- -x: executable.
- -s: returns the size of the file in bytes.
- -z: empty (zero byte).
Example
#!/usr/bin/env perl
use strict;
use warnings;
  
my $dir = shift;
opendir(DIR, $dir) or die "Can't open directory: $!";
my @files = readdir(DIR);
closedir(DIR);
   
foreach my $file (@files) {
   if (-f "$dir/$file") {
      print "$file is a file\n";
      print "$file seems to be a text file\n" if (-T "$dir/$file");
      print "$file seems to be a binary file\n" if (-B "$dir/$file");
      my $size = -s "$dir/$file";
      print "$file size is $size\n";
      print "$file is empty\n" if (-z "$dir/$file");
   } elsif (-d "$dir/$file") {
      print "$file is a directory\n";
   }
   print "$file is readable\n" if (-r "$dir/$file");
   print "$file is writable\n" if (-w "$dir/$file");
   print "$file is executable\n" if (-x "$dir/$file");
}
Function stat and lstat
The function stat(FILE) returns a 13-element array giving the vital statistics of FILE. lstat(SYMLINK) returns the same thing for the symbolic link SYMLINK itself, rather than for the file it points to.
The elements are:
| Index | Value | 
|---|---|
| 0 | The device | 
| 1 | The file's inode | 
| 2 | The file's mode | 
| 3 | The number of hard links to the file | 
| 4 | The user ID of the file's owner | 
| 5 | The group ID of the file | 
| 6 | The raw device | 
| 7 | The size of the file | 
| 8 | The last accessed time | 
| 9 | The last modified time | 
| 10 | The last time the file's status changed | 
| 11 | The block size of the system | 
| 12 | The number of blocks used by the file | 
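For example, the command
perl -e "$size = (stat('test.txt'))[7]; print $size"
prints the size of "test.txt" in bytes. The double quotes suit the Windows command prompt; in a Unix shell, swap the quote styles (single quotes outside, double quotes inside) so that the shell does not expand $size.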
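The same list-slice idea can pick several fields at once. The sketch below reads elements 7 (size) and 9 (last modified time) from the table above; the sub name is hypothetical, and File::Temp is used only to create a small scratch file to inspect.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the size and last-modified time of a file using a list slice of stat().
sub size_and_mtime {
    my ($path) = @_;
    my ($size, $mtime) = (stat($path))[7, 9];
    return ($size, $mtime);
}

my ($fh, $path) = tempfile(UNLINK => 1);   # scratch file, removed at exit
print {$fh} "hello";                       # 5 bytes
close($fh);

my ($size, $mtime) = size_and_mtime($path);
print "size: $size bytes, modified: ", scalar localtime($mtime), "\n";
```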
Accessing the Directories
- opendir(DIRHANDLE, dirname) opens the directory dirname.
- closedir(DIRHANDLE) closes the directory handle.
- readdir(DIRHANDLE) returns the next file from DIRHANDLE in a scalar context, or the rest of the files in the array context.
- glob(string) returns an array of filenames matching the wildcard in string, e.g., glob('*.dat') and glob('test??.txt').
- mkdir(dirname, mode) creates the directory dirname with the protection specified by mode.
- rmdir(dirname) deletes the directory dirname, only if it is empty.
- chdir(dirname) changes the working directory to dirname.
- chroot(dirname) makes dirname the root directory "/" for the current process, used by superuser only.
Example: Print the contents of a given directory.
#!/usr/bin/env perl
use strict;
use warnings;
  
my $dirname = shift;      # first command-line argument.
opendir(DIR, $dirname) or die "can't open $dirname: $!\n";
my @files = readdir(DIR);
closedir(DIR);
foreach my $file (@files) {
   print "$file\n";
}
Example: Removing empty files in a given directory
#!/usr/bin/env perl
use strict;
use warnings;
  
my $dirname = shift;
opendir(DIR, $dirname) or die "Can't open directory: $!";
my @files = readdir(DIR);
foreach my $file (@files) {
   if ((-f "$dirname/$file") && (-z "$dirname/$file")) {
      print "deleting $dirname/$file\n";
      unlink "$dirname/$file";
   }
}
closedir(DIR);
Example: Display files matching "*.txt"
my @files = glob('*.txt');
foreach (@files) { print; print "\n" }
Example: Display files matching the command-line pattern.
my $file = shift;
my @files = glob($file);
foreach (@files) {
   print;
   print "\n" 
}
Standard Filehandles
Perl defines the following standard filehandles:
- STDIN – Standard Input, usually refers to the keyboard.
- STDOUT – Standard Output, usually refers to the console.
- STDERR – Standard Error, usually refers to the console.
- ARGV – The files named on the command line, read one after another.
For example:
my $line = <STDIN>;   # Set $line to the next line of user input
my $item = <ARGV>;    # Set $item to the next line of the files named on the command line
my @items = <ARGV>;   # Put all remaining lines of those files into the array
When you use empty angle brackets <> to read input, Perl picks the filehandle for you automatically: if filenames were given on the command line, it reads from them through the ARGV filehandle; otherwise it reads from STDIN. Whenever you use the print() function without a filehandle, it uses the STDOUT filehandle.
In other words, <> behaves like <ARGV> when files are named on the command line, and behaves like <STDIN> when none are.
Text Formatting
Function write
write(filehandle): prints formatted text to filehandle, using the format associated with filehandle. If filehandle is omitted, STDOUT is used by default.
Declaring format
format name =
text1
text2
.
Picture Field @<, @|, @>
- @<: left-flushes the string on the next line of formatting texts.
- @>: right-flushes the string on the next line of formatting texts.
- @|: centers the string on the next line of the formatting texts.
@<, @>, @| can be repeated to control the number of characters to be formatted.
The number of characters to be formatted is same as the length of the picture field.
  @###.## formats numbers by lining up the decimal points under ".".
For examples,
[TODO]
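A minimal sketch of a format declaration and write, assuming three illustrative variables ($name, $qty and $price, which are not part of the original text). The picture line uses a 10-character left-flushed field, a 5-character right-flushed field and a numeric field with two decimals; the line below it lists the variables that fill those fields.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my ($name, $qty, $price);     # the format below reads these variables

# Format associated with the STDOUT filehandle.
format STDOUT =
@<<<<<<<<< @>>>> @##.##
$name,     $qty, $price
.

for my $row (['apple', 3, 1.5], ['banana', 12, 0.25]) {
    ($name, $qty, $price) = @$row;
    write;                    # prints one formatted line to STDOUT
}
```

The variables are declared before the format so that they are in scope where the format is compiled.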
Printing Formatting String printf
printf(filehandle template, list): prints a formatted string to filehandle (similar to C's fprintf()). Note that there is no comma after the filehandle. For example,
printf(FILE "The number is %d", 15);
The available formatting fields are:
| Field | Expected Value | 
|---|---|
| %s | String | 
| %c | Character | 
| %d | Decimal number | 
| %ld | Long decimal Number | 
| %u | Unsigned decimal number | 
| %x | Hexadecimal number | 
| %lx | Long hexadecimal number | 
| %o | Octal number | 
| %lo | Long octal number | 
| %f | Fixed-point floating-point number | 
| %e | Exponential floating-point number | 
| %g | Compact floating-point number | 
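A few of these fields can be tried with sprintf, which takes the same template as printf but returns the string instead of printing it. The sub name and the values below are illustrative only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build one report row: left-justified string, right-justified integer,
# fixed-point number with 3 decimals, and the integer again in hexadecimal.
sub format_row {
    my ($name, $count, $ratio) = @_;
    return sprintf("%-8s|%5d|%8.3f|%x", $name, $count, $ratio, $count);
}

print format_row('apple', 255, 3.14159), "\n";   # apple   |  255|   3.142|ff
printf("%e %g\n", 12345.678, 0.0001);            # exponential and compact forms
printf STDOUT "%s has %d items\n", 'list', 3;    # explicit filehandle, no comma after it
```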