Introduction
Regular expressions (regex) are extremely useful in programming, especially for processing text files.
I assume that you are familiar with regex and Java. Otherwise, read up on the regex syntax at:
- My article on "Regular Expressions".
- The Online Java Tutorial Trail on "Regular Expressions".
- JavaDoc for the java.util.regex package.
- JavaDoc for the java.util.regex.Pattern class, which summarizes the regex patterns.
Package java.util.regex (JDK 1.4)
Regular expression support was introduced in Java 1.4 in the package java.util.regex. The package is built around two classes:
- java.util.regex.Pattern: represents a compiled regular expression. You can get a Pattern object via the static method Pattern.compile(String regexStr).
- java.util.regex.Matcher: an engine that performs matching operations on an input CharSequence (such as String, StringBuffer, StringBuilder, CharBuffer, Segment) by interpreting a Pattern.
The steps are:
String regexStr = "......";  // Regex String
String inputStr = "......";  // Input for matching, any CharSequence such as String, StringBuffer, StringBuilder, CharBuffer

// Step 1: Compile a Regex String into a Pattern object
Pattern pattern = Pattern.compile(regexStr);

// Step 2: Allocate a matching engine for the regex pattern, bound to the input string
Matcher matcher = pattern.matcher(inputStr);

// Step 3: Perform the matching
//   matcher.matches()   : attempts to match the ENTIRE input sequence
//   matcher.find()      : scans the input sequence looking for the next subsequence that matches the pattern
//   matcher.lookingAt() : attempts to match the input sequence, starting at the beginning, against the pattern
//   matcher.replaceAll(replacementStr)  : finds and replaces all matches
//   matcher.replaceFirst(replacementStr): finds and replaces the first match

// Step 4: Process the matching result
//   matcher.group() : returns the input subsequence matched by the previous match
//   matcher.start() : returns the start index of the previous match
//   matcher.end()   : returns the offset after the last character matched
Java Regex by Examples
Example: Check if the Input string Matches a Regex Pattern via matches()
For example, you want to check if the input is a 5-digit string.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Check if given input matches the specified regex */
public class RegexMatchTest {
    public static void main(String[] args) {
        // Method 1: one-liner matches()
        boolean isMatched1 = Pattern.matches("\\d{5}", "12345");  // 5-digit string
        System.out.println(isMatched1);

        // Method 2: compile(), matcher() and matches()
        Pattern p = Pattern.compile("\\d{5}");  // can be reused and more efficient
        Matcher m = p.matcher("1234");
        boolean isMatched2 = m.matches();
        System.out.println(isMatched2);

        // or
        boolean isMatched3 = Pattern.compile("\\d{5}").matcher("99999").matches();
        System.out.println(isMatched3);
    }
}
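The remark that a compiled Pattern "can be reused and more efficient" can be illustrated with a minimal sketch. The class name FiveDigitValidator and the method isFiveDigits() are made up for this illustration; only Pattern.compile(), matcher() and matches() come from the JDK.

```java
import java.util.regex.Pattern;

/** Sketch: compile the regex once, then validate many inputs (names are illustrative only). */
public class FiveDigitValidator {
    // Compiled once and shared; Pattern instances are immutable and safe to reuse
    private static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");

    /** Returns true only if the ENTIRE input consists of exactly five digits. */
    public static boolean isFiveDigits(String input) {
        // A new Matcher is allocated per input; the compiled Pattern is reused
        return FIVE_DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"12345", "1234", "123456", "12a45"};
        for (String input : inputs) {
            System.out.println(input + " -> " + isFiveDigits(input));
        }
    }
}
```

Only "12345" should be reported as a match, because matches() requires the whole input to fit the pattern.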
Example: Find Text
For example, given the input "This is an apple. These are 33 (thirty-three) apples.", you wish to find all occurrences of pattern "Th" (either case-sensitive or case-insensitive).
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class TestRegexFindText {
public static void main(String[] args) {
// Input String for matching the regex pattern
String inputStr = "This is an apple. These are 33 (thirty-three) apples.";
// Regex to be matched
String regexStr = "Th";
// Step 1: Compile a regex via static method Pattern.compile(), default is case-sensitive
Pattern pattern = Pattern.compile(regexStr);
// Pattern.compile(regex, Pattern.CASE_INSENSITIVE); // for case-insensitive matching
// Step 2: Allocate a matching engine from the compiled regex pattern,
// and bind to the input string
Matcher matcher = pattern.matcher(inputStr);
// Step 3: Perform matching and process the matching results
// Try Matcher.find(), which finds the next match
while (matcher.find()) {
System.out.println("find() found substring \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
}
// Try Matcher.matches(), which tries to match the entire input string
if (matcher.matches()) {
System.out.println("matches() found substring \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
} else {
System.out.println("matches() found nothing");
}
// Try Matcher.lookingAt(), which tries to match from the beginning of the input string
if (matcher.lookingAt()) {
System.out.println("lookingAt() found substring \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
} else {
System.out.println("lookingAt() found nothing");
}
}
}
Output
find() found substring "Th" starting at index 0 and ending at index 2
find() found substring "Th" starting at index 18 and ending at index 20
matches() found nothing
lookingAt() found substring "Th" starting at index 0 and ending at index 2
How It Works
- Three steps are required to perform regex matching:
  - Allocate a Pattern object. There is no public constructor for the Pattern class. Instead, you invoke the static method Pattern.compile(regexStr) to compile the regexStr, which returns a Pattern instance.
  - Allocate a Matcher object (a matching engine). Again, there is no public constructor for the Matcher class. Instead, you invoke the matcher(inputStr) method on the Pattern instance (created in Step 1), which binds the input string to this Matcher.
  - Use the Matcher instance (created in Step 2) to perform the matching and process the matching result. The Matcher class provides a few boolean methods for performing the matches:
    - boolean find(): scans the input sequence to look for the next subsequence that matches the pattern. If a match is found, you can use group(), start() and end() to retrieve the matched subsequence and its starting and ending indices, as shown in the above example.
    - boolean matches(): tries to match the entire input sequence against the regex pattern. It returns true only if the entire input sequence matches the pattern, as if the pattern were enclosed by the begin and end position anchors ^ and $.
    - boolean lookingAt(): tries to match the input sequence, starting from the beginning, against the regex pattern. It returns true if a prefix of the input sequence matches the pattern, as if the pattern began with the begin position anchor ^.
- To perform case-insensitive matching, use Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE) to create the Pattern instance (as commented out in the above example).
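The difference between find(), lookingAt() and matches() is easiest to see by running the same regex against inputs where the match sits at different positions. The following is a small sketch; the class MatchModesDemo and its helper modes() are hypothetical names used only here.

```java
import java.util.regex.Pattern;

/** Sketch: compare find(), lookingAt() and matches() on the same regex and input. */
public class MatchModesDemo {
    /** Returns a summary such as "find=true,lookingAt=false,matches=false". */
    public static String modes(String regexStr, String inputStr) {
        Pattern pattern = Pattern.compile(regexStr);
        // A fresh Matcher is used for each call, so no state carries over between methods
        boolean found  = pattern.matcher(inputStr).find();       // match anywhere
        boolean prefix = pattern.matcher(inputStr).lookingAt();  // match at the beginning
        boolean whole  = pattern.matcher(inputStr).matches();    // match the entire input
        return "find=" + found + ",lookingAt=" + prefix + ",matches=" + whole;
    }

    public static void main(String[] args) {
        System.out.println(modes("\\d+", "abc123"));  // find=true,lookingAt=false,matches=false
        System.out.println(modes("\\d+", "123abc"));  // find=true,lookingAt=true,matches=false
        System.out.println(modes("\\d+", "123"));     // find=true,lookingAt=true,matches=true
    }
}
```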
Example: Find Pattern (Expressed in Regular Expression)
The above example to find a particular piece of text from an input sequence is rather trivial. The power of regex is that you can use it to specify a pattern, e.g.,
- (\w)+ matches any word (delimited by non-word characters such as spaces), where \w is a metacharacter matching any word character [a-zA-Z0-9_], and + is an occurrence indicator for one or more occurrences.
- \b[1-9][0-9]*\b matches any number with a non-zero leading digit, separated by spaces from other words, where \b is the position anchor for word boundary, [1-9] is a character class for any character in the range of 1 to 9, and * is an occurrence indicator for zero or more occurrences.
Try changing the regex pattern of the above example to the following and observe the outputs. Take note that you need to use the escape sequence '\\' for '\' inside a Java string.
String regexStr = "\\w+"; // escape sequence \\ for \
String regexStr = "\\b[1-9][0-9]*\\b";
Output for Regex \w+
find() found substring "This" starting at index 0 and ending at index 4
find() found substring "is" starting at index 5 and ending at index 7
find() found substring "an" starting at index 8 and ending at index 10
find() found substring "apple" starting at index 11 and ending at index 16
find() found substring "These" starting at index 18 and ending at index 23
find() found substring "are" starting at index 24 and ending at index 27
find() found substring "33" starting at index 28 and ending at index 30
find() found substring "thirty" starting at index 32 and ending at index 38
find() found substring "three" starting at index 39 and ending at index 44
find() found substring "apples" starting at index 46 and ending at index 52
matches() found nothing
lookingAt() found substring "This" starting at index 0 and ending at index 4
Output for Regex \b[1-9][0-9]*\b
find() found substring "33" starting at index 28 and ending at index 30
matches() found nothing
lookingAt() found nothing
Check out the Javadoc for the Class java.util.regex.Pattern for the list of regular expression constructs supported by Java.
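When only the matched text is needed, the find() loop can be wrapped in a small helper that collects every match into a list. This is a sketch under that assumption; FindAllMatches and findAll() are illustrative names, not part of java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: collect every subsequence matched by find() into a List. */
public class FindAllMatches {
    public static List<String> findAll(String regexStr, String inputStr) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regexStr).matcher(inputStr);
        while (matcher.find()) {          // each call advances to the next match
            results.add(matcher.group()); // text of the current match
        }
        return results;
    }

    public static void main(String[] args) {
        String inputStr = "This is an apple. These are 33 (thirty-three) apples.";
        System.out.println(findAll("\\b[1-9][0-9]*\\b", inputStr));  // [33]
        System.out.println(findAll("\\w+", inputStr).size());         // 10
    }
}
```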
Example: Find and Replace Text
Finding a pattern and replacing it with something else is probably one of the most frequent tasks in text processing. Regex lets you express both the pattern and the replacement text flexibly. This is extremely useful in batch-processing a huge text document or many text files, for example, searching for stock prices in many online HTML files, or renaming many files in a directory according to a certain pattern.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class TestRegexFindReplace {
public static void main(String[] args) {
String inputStr = "This is an apple. These are 33 (Thirty-three) apples.";
String regexStr = "apple"; // pattern to be matched
String replacementStr = "orange"; // replacement pattern
// Step 1: Allocate a Pattern object to compile a regex
Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE);
// Step 2: Allocate a Matcher object from the pattern, and provide the input
Matcher matcher = pattern.matcher(inputStr);
// Step 3: Perform the matching and process the matching result
//String outputStr = matcher.replaceAll(replacementStr); // all matches
String outputStr = matcher.replaceFirst(replacementStr); // first match only
System.out.println(outputStr);
}
}
Output for replaceAll()
This is an orange. These are 33 (Thirty-three) oranges.
Output for replaceFirst()
This is an orange. These are 33 (Thirty-three) apples.
How It Works
- First, create a Pattern object to compile a regex pattern. Next, create a Matcher object from the Pattern and bind it to the input string.
- The Matcher class provides replaceAll(replacementStr) to replace all the matched subsequences with the replacementStr, and replaceFirst(replacementStr) to replace the first match only.
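One detail worth keeping in mind is that the replacement string is not taken fully literally: '$' and '\' have special meaning in it (see the back-reference example). The sketch below, with the hypothetical class ReplaceLiteralDemo, uses Matcher.quoteReplacement() to insert such text literally, and also shows the String.replaceAll() shortcut, which compiles the regex internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: replace matches with text that must be inserted literally. */
public class ReplaceLiteralDemo {
    /** Replaces every match of regexStr with literalText, treating '$' and '\' in it literally. */
    public static String replaceAllLiteral(String inputStr, String regexStr, String literalText) {
        Matcher matcher = Pattern.compile(regexStr).matcher(inputStr);
        // quoteReplacement() escapes '$' and '\' so they are not read as group references
        return matcher.replaceAll(Matcher.quoteReplacement(literalText));
    }

    public static void main(String[] args) {
        System.out.println(replaceAllLiteral("Price: 33 units", "\\d+", "$5"));  // Price: $5 units
        // Shortcut on String; (?i) is the embedded flag for case-insensitive matching
        System.out.println("apple, Apple".replaceAll("(?i)apple", "orange"));   // orange, orange
    }
}
```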
Example: Find and Replace with Back References
Given the input "One:two:three:four", the following program produces "four-three-two-One" by matching the 4 words separated by colons, and uses the so-called parenthesized back-references $1, $2, $3 and $4 in the replacement pattern.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class TestRegexBackReference {
public static void main(String[] args) {
String inputStr = "One:two:three:four";
String regexStr = "(.+):(.+):(.+):(.+)"; // pattern to be matched
String replacementStr = "$4-$3-$2-$1"; // replacement pattern with back references
// Step 1: Allocate a Pattern object to compile a regex
Pattern pattern = Pattern.compile(regexStr);
// Step 2: Allocate a Matcher object from the Pattern, and provide the input
Matcher matcher = pattern.matcher(inputStr);
// Step 3: Perform the matching and process the matching result
String outputStr = matcher.replaceAll(replacementStr); // all matches
//String outputStr = matcher.replaceFirst(replacementStr); // first match only
System.out.println(outputStr); // Output: four-three-two-One
}
}
Parentheses () have two meanings in regex:
- Grouping sub-expressions: For example, xyz+ matches one 'x', one 'y', followed by one or more 'z'. But (xyz)+ matches one or more groups of 'xyz', e.g., 'xyzxyzxyz'.
- Parenthesized Back Reference: Provide back references to the matched subsequences. The matched subsequence of the first pair of parentheses can be referred to as $1, the second pair of parentheses as $2, and so on. In the above example, there are 4 pairs of parentheses, which were referenced in the replacement pattern as $1, $2, $3, and $4. You can use groupCount() (of the Matcher) to get the number of groups captured, and group(groupNumber), start(groupNumber), end(groupNumber) to retrieve the matched subsequences and their indices. In Java, $0 denotes the entire match. Group 0 is not included in groupCount(), so the loop must run up to and including groupCount(). Try the following code and check the output:

while (matcher.find()) {
    System.out.println("find() found substring \"" + matcher.group()
            + "\" starting at index " + matcher.start()
            + " and ending at index " + matcher.end());
    System.out.println("Group count is: " + matcher.groupCount());
    // Groups are numbered 0 (entire match) to groupCount() inclusive
    for (int i = 0; i <= matcher.groupCount(); ++i) {
        System.out.println("Group " + i + ": substring=" + matcher.group(i)
                + ", start=" + matcher.start(i) + ", end=" + matcher.end(i));
    }
}

find() found substring "One:two:three:four" starting at index 0 and ending at index 18
Group count is: 4
Group 0: substring=One:two:three:four, start=0, end=18
Group 1: substring=One, start=0, end=3
Group 2: substring=two, start=4, end=7
Group 3: substring=three, start=8, end=13
Group 4: substring=four, start=14, end=18
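As a further illustration of capturing groups, the following sketch extracts the parts of a date written as yyyy-mm-dd and reorders them with back references. The date format, the class GroupExtractDemo and its methods are assumptions chosen for this example. Group 0 is the entire match, and groups 1 to groupCount() are the parenthesized ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: read capturing groups and reuse them as back references. */
public class GroupExtractDemo {
    private static final Pattern DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    /** Returns group 0 (entire match) followed by groups 1..groupCount(), or an empty list. */
    public static List<String> groups(String inputStr) {
        List<String> result = new ArrayList<>();
        Matcher matcher = DATE.matcher(inputStr);
        if (matcher.matches()) {
            for (int i = 0; i <= matcher.groupCount(); ++i) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    /** Rewrites yyyy-mm-dd as dd/mm/yyyy using the back references $3, $2 and $1. */
    public static String toDayFirst(String inputStr) {
        return DATE.matcher(inputStr).replaceAll("$3/$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(groups("2024-01-31"));      // [2024-01-31, 2024, 01, 31]
        System.out.println(toDayFirst("2024-01-31"));  // 31/01/2024
    }
}
```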
Example: Rename Files of a Given Directory
The following program renames all the files ending with ".class" to ".out" in the specified directory.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.io.File;
public class RegexRenameFiles {
public static void main(String[] args) {
String regexStr = "\\.class$"; // ending with ".class" (the dot is escaped to match a literal '.')
String replacementStr = ".out"; // replace with ".out"
// Allocate a Pattern object to compile a regex
Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE);
Matcher matcher;
File dir = new File("."); // directory to be processed
int count = 0;
File[] files = dir.listFiles(); // list all files and directories
for (File file : files) {
if (file.isFile()) { // file only, not directory
String inFilename = file.getName(); // get filename, exclude path
matcher = pattern.matcher(inFilename); // allocate Matcher with input
if (matcher.find()) {
++count;
String outFilename = matcher.replaceFirst(replacementStr);
System.out.print(inFilename + " -> " + outFilename);
if (file.renameTo(new File(dir, outFilename))) { // execute rename (platform-independent path)
System.out.println(" SUCCESS");
} else {
System.out.println(" FAIL");
}
}
}
}
System.out.println(count + " files processed");
}
}
You can use regex to specify the pattern, and back references in the replacement, as in the previous example.
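A minimal sketch of that idea follows: it captures the base name in group 1 and refers to it as $1 in the replacement. To stay safe to run, it only computes the new names and does not touch the file system. RenamePreview and newName() are hypothetical names.

```java
import java.util.regex.Pattern;

/** Sketch: compute the new filename with a back reference, without renaming anything. */
public class RenamePreview {
    // Group 1 captures everything before a trailing ".class" (case-insensitive)
    private static final Pattern CLASS_FILE =
            Pattern.compile("(.+)\\.class$", Pattern.CASE_INSENSITIVE);

    /** Returns the name with ".class" replaced by ".out", or the name unchanged if it does not match. */
    public static String newName(String filename) {
        return CLASS_FILE.matcher(filename).replaceFirst("$1.out");
    }

    public static void main(String[] args) {
        String[] names = {"Hello.class", "HELLO.CLASS", "Hello.java", "a.class.class"};
        for (String name : names) {
            System.out.println(name + " -> " + newName(name));
        }
    }
}
```

The result of newName() could then be passed to File.renameTo() as in the program above.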
Other Usages of Regex in Java
The String.split() Method
The String class contains a method split(), which takes a regular expression and splits this String object into an array of Strings.
// In String class
public String[] split(String regexStr)
For example,
public class StringSplitTest {
public static void main(String[] args) {
String source = "There are thirty-three big-apple";
String[] tokens = source.split("\\s+|-"); // whitespace(s) or -
for (String token : tokens) {
System.out.println(token);
}
}
}
Output
There
are
thirty
three
big
apple
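Two related variants may be useful: Pattern.split(), which reuses a compiled pattern, and the two-argument String.split(regexStr, limit), which caps the number of tokens. The sketch below uses the same delimiter regex as above. SplitVariantsDemo and its method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Sketch: split with a precompiled Pattern, and split with a limit. */
public class SplitVariantsDemo {
    private static final Pattern DELIM = Pattern.compile("\\s+|-");  // whitespace(s) or -

    /** Splits on every delimiter, reusing the compiled pattern. */
    public static String[] splitAll(String source) {
        return DELIM.split(source);
    }

    /** Splits into at most 'limit' tokens; the last token keeps the unsplit remainder. */
    public static String[] splitAtMost(String source, int limit) {
        return source.split("\\s+|-", limit);
    }

    public static void main(String[] args) {
        String source = "There are thirty-three big-apple";
        System.out.println(Arrays.toString(splitAll(source)));
        // [There, are, thirty, three, big, apple]
        System.out.println(Arrays.toString(splitAtMost(source, 3)));
        // [There, are, thirty-three big-apple]
    }
}
```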
The Scanner & useDelimiter()
The Scanner class, by default, uses whitespace as the delimiter in parsing input tokens. You can set the delimiter to a regex via the useDelimiter() methods:
public Scanner useDelimiter(Pattern pattern)
public Scanner useDelimiter(String pattern)
For example,
import java.util.Scanner;
public class ScannerUseDelimiterTest {
public static void main(String[] args) {
String source = "There are thirty-three big-apple";
Scanner in = new Scanner(source);
in.useDelimiter("\\s+|-"); // whitespace(s) or -
while (in.hasNext()) {
System.out.println(in.next());
}
}
}
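The Pattern overload of useDelimiter() works the same way. As a sketch, the following parses a comma-separated list of integers, allowing optional whitespace around each comma. The input format, the class ScannerCsvInts and the method parse() are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

/** Sketch: read integers separated by commas with optional surrounding whitespace. */
public class ScannerCsvInts {
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    public static List<Integer> parse(String source) {
        List<Integer> values = new ArrayList<>();
        try (Scanner in = new Scanner(source)) {  // Scanner is closed automatically
            in.useDelimiter(COMMA);
            while (in.hasNextInt()) {             // stops at the first non-integer token
                values.add(in.nextInt());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("10, 20 ,30,40"));  // [10, 20, 30, 40]
    }
}
```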
REFERENCES & RESOURCES