Java Regular Expression Syntax

Regex Syntax

The syntax of regular expressions is almost the same across programming languages. Each language adds minor extensions and has its own flavor, but the core is essentially the same. Once we understand regex in one language, we can easily apply that knowledge in any other.

Character Groups

Regular expressions allow a lot more than literal character matches. A character class adds logical options such as "one of these characters", "any character except these", or "any character in this range".
No.  Character Class   Description
1    [abc]             a, b, or c (simple class)
2    [^abc]            Any character except a, b, or c (negation)
3    [a-zA-Z]          a through z or A through Z, inclusive (range)
4    [a-d[m-p]]        a through d, or m through p: [a-dm-p] (union)
5    [a-z&&[def]]      d, e, or f (intersection)
6    [a-z&&[^bc]]      a through z, except for b and c: [ad-z] (subtraction)
7    [a-z&&[^m-p]]     a through z, and not m through p: [a-lq-z] (subtraction)

Example:
package com.solegaonkar.learnjava;

import java.util.regex.*;  
class RegexExample3{  
 public static void main(String args[]){  
  System.out.println(Pattern.matches("[xyz]", "abcd"));    //false (not x or y or z)  
  System.out.println(Pattern.matches("[xyz]", "y"));    //true (among x or y or z)  
  System.out.println(Pattern.matches("[xyz]", "xyzxyzxxx"));    //false (more than one character; the class matches exactly one)  
 }
}
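
The example above covers only the simple class. As a minimal sketch, the union, intersection, and subtraction rows of the table can be checked the same way. The class name CharacterClassSketch and the helper matchesClass are made up for this illustration.

```java
import java.util.regex.Pattern;

class CharacterClassSketch {
    // Hypothetical helper: true when the whole input is exactly one
    // character that belongs to the given character class.
    static boolean matchesClass(String characterClass, String input) {
        return Pattern.matches(characterClass, input);
    }

    public static void main(String[] args) {
        // Union: a through d, or m through p
        System.out.println(matchesClass("[a-d[m-p]]", "n"));    // true
        System.out.println(matchesClass("[a-d[m-p]]", "f"));    // false

        // Intersection: only d, e, or f
        System.out.println(matchesClass("[a-z&&[def]]", "e"));  // true
        System.out.println(matchesClass("[a-z&&[def]]", "g"));  // false

        // Subtraction: a through z, except b and c
        System.out.println(matchesClass("[a-z&&[^bc]]", "b"));  // false
        System.out.println(matchesClass("[a-z&&[^bc]]", "d"));  // true

        // Subtraction: a through z, and not m through p
        System.out.println(matchesClass("[a-z&&[^m-p]]", "n")); // false
        System.out.println(matchesClass("[a-z&&[^m-p]]", "q")); // true
    }
}
```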

Predefined character classes

There are some character groups that we use quite often, for example to check whether a character is a digit or a letter. For convenience, the regex syntax provides some predefined character classes.
Construct   Description
.           Any character (may or may not match line terminators)
\d          A digit: [0-9]
\D          A non-digit: [^0-9]
\s          A whitespace character: [ \t\n\x0B\f\r]
\S          A non-whitespace character: [^\s]
\w          A word character: [a-zA-Z_0-9]
\W          A non-word character: [^\w]

Example:
package com.solegaonkar.learnjava;

import java.util.regex.*;  
class RegexExample5{  
 public static void main(String args[]){  
  System.out.println("metacharacters d....");  // \d means digit  
    
  System.out.println(Pattern.matches("\\d", "abc"));//false (non-digit)  
  System.out.println(Pattern.matches("\\d", "1"));//true (digit and comes once)  
  System.out.println(Pattern.matches("\\d", "4443"));//false (digit but comes more than once)  
  System.out.println(Pattern.matches("\\d", "323abc"));//false (digit and char)  
    
  System.out.println("metacharacters D....");  // \D means non-digit  
    
  System.out.println(Pattern.matches("\\D", "abc"));//false (non-digit but comes more than once)  
  System.out.println(Pattern.matches("\\D", "1"));//false (digit)  
  System.out.println(Pattern.matches("\\D", "4443"));//false (digit)  
  System.out.println(Pattern.matches("\\D", "323abc"));//false (digit and char)  
  System.out.println(Pattern.matches("\\D", "m"));//true (non-digit and comes once)  
    
  System.out.println("metacharacters D with quantifier....");  
  System.out.println(Pattern.matches("\\D*", "mak"));//true (non-digit and may come 0 or more times)  
 }
}
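
The example above exercises only \d and \D. The following sketch tries the remaining rows of the table: \s, \S, \w, and the dot. The class name PredefinedClassSketch and its helpers are assumptions made for this illustration. Pattern.DOTALL is the standard flag that makes the dot match line terminators as well, which is why the table says "may or may not".

```java
import java.util.regex.Pattern;

class PredefinedClassSketch {
    // Hypothetical helper: true when the whole input consists of word characters (\w).
    static boolean isWord(String input) {
        return Pattern.matches("\\w+", input);
    }

    // Hypothetical helper: true when the whole input is a single character
    // matched by '.', optionally with DOTALL enabled.
    static boolean dotMatches(String input, boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile(".", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("user_42"));               // true
        System.out.println(isWord("user-42"));               // false ('-' is not a word character)

        System.out.println(Pattern.matches("\\s", "\t"));    // true (tab is whitespace)
        System.out.println(Pattern.matches("\\S", "\t"));    // false

        // Two digits, one whitespace character, three word characters
        System.out.println(Pattern.matches("\\d\\d\\s\\w\\w\\w", "25 Dec")); // true

        System.out.println(dotMatches("x", false));          // true
        System.out.println(dotMatches("\n", false));         // false (no DOTALL)
        System.out.println(dotMatches("\n", true));          // true (DOTALL)
    }
}
```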

Quantifiers

Another convenience feature of the regex syntax is the quantifier, which specifies how many times an element may occur. We often need to check for more than a single character.
Greedy    Reluctant   Possessive   Meaning
X?        X??         X?+          X, once or not at all
X*        X*?         X*+          X, zero or more times
X+        X+?         X++          X, one or more times
X{n}      X{n}?       X{n}+        X, exactly n times
X{n,}     X{n,}?      X{n,}+       X, at least n times
X{n,m}    X{n,m}?     X{n,m}+      X, at least n but not more than m times

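There are three main types of quantifiers. Greedy quantifiers try to grab as much as possible, and give characters back only when the rest of the pattern needs them. Reluctant quantifiers grab as little as possible. Possessive quantifiers go a step beyond greedy: they grab as much as possible and never give any of it back, even if that makes the overall match fail.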
Example:
package com.solegaonkar.learnjava;

import java.util.regex.*;  
class RegexExample4{  
 public static void main(String args[]){  
  System.out.println("? quantifier ....");  
  System.out.println(Pattern.matches("[amn]?", "a"));//true (a or m or n comes one time)  
  System.out.println(Pattern.matches("[amn]?", "aaa"));//false (a comes more than one time)  
  System.out.println(Pattern.matches("[amn]?", "aammmnn"));//false (a, m and n come more than one time)  
  System.out.println(Pattern.matches("[amn]?", "aazzta"));//false (more than one character, and z and t are not in the class)  
  System.out.println(Pattern.matches("[amn]?", "am"));//false (two characters, but ? allows at most one)  
    
  System.out.println("+ quantifier ....");  
  System.out.println(Pattern.matches("[amn]+", "a"));//true (a or m or n once or more times)  
  System.out.println(Pattern.matches("[amn]+", "aaa"));//true (a comes more than one time)  
  System.out.println(Pattern.matches("[amn]+", "aammmnn"));//true (a or m or n comes more than once)  
  System.out.println(Pattern.matches("[amn]+", "aazzta"));//false (z and t are not matching pattern)  
    
  System.out.println("* quantifier ....");  
  System.out.println(Pattern.matches("[amn]*", "ammmna"));//true (a or m or n may come zero or more times)  
 }
}
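
The example above uses only greedy quantifiers. To see how the greedy, reluctant, and possessive columns differ, the sketch below runs the three variants of .* against the same input. The class name QuantifierSketch, the helper firstMatch, and the sample input are assumptions chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierSketch {
    // Hypothetical helper: returns the first substring of the input that
    // the regex matches, or null when there is no match at all.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: .* takes the whole input, then backs off just enough
        // for "foo" to match at the very end.
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo

        // Reluctant: .*? takes as little as possible, so the match
        // stops at the first "foo".
        System.out.println(firstMatch(".*?foo", input));  // xfoo

        // Possessive: .*+ takes the whole input and never gives anything
        // back, so nothing is left for "foo" and the match fails.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```

Possessive quantifiers are mostly useful as an optimization, when we already know that backtracking cannot produce a match.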

Boundary Matchers

In all the cases we checked above, the expressions match a particular set of characters in the input. Boundary matchers go a step further and match positions rather than characters, such as the beginning or end of a word, a line, or the whole input. They help with requirements like identifying words that end with 'ed', or matching the article 'a' rather than any letter a.
Boundary Construct   Description
^                    The beginning of a line
$                    The end of a line
\b                   A word boundary
\B                   A non-word boundary
\A                   The beginning of the input
\G                   The end of the previous match
\Z                   The end of the input but for the final terminator, if any
\z                   The end of the input

Example:
package com.solegaonkar.learnjava;

import java.util.regex.*;
public class RegexExample6 {
 public static void main(String[] args) {
  String txt = "xyz xyzxyz";

  // Demonstrating ^: matches "xyz" only at the start of the input (prints 0 and 3)
  String regex1 = "^xyz";
  Pattern pattern1 = Pattern.compile(regex1, Pattern.CASE_INSENSITIVE);
  Matcher matcher1 = pattern1.matcher(txt);
  while (matcher1.find()) {
   System.out.println("Start index: " + matcher1.start());
   System.out.println("End index: " + matcher1.end());
  }

  // Demonstrating $: matches "xyz" only at the end of the input (prints 7 and 10)
  String regex2 = "xyz$";
  Pattern pattern2 = Pattern.compile(regex2, Pattern.CASE_INSENSITIVE);
  Matcher matcher2 = pattern2.matcher(txt);
  while (matcher2.find()) {
   System.out.println("\nStart index: " + matcher2.start());
   System.out.println("End index: " + matcher2.end());
  }
 }
}
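
The two requirements mentioned earlier, words that end with 'ed' and the article 'a', both rely on the word boundary \b. The following is a minimal sketch of them. The class name BoundarySketch, the helper findAll, and the sample sentences are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundarySketch {
    // Hypothetical helper: collects every substring of the input that the regex matches.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Whole words that end with "ed": a word boundary, one or more
        // word characters, "ed", and another word boundary.
        System.out.println(findAll("\\b\\w+ed\\b", "We walked, talked and rested."));
        // [walked, talked, rested]

        // The article "a": the letter a with a word boundary on both sides.
        System.out.println(findAll("\\ba\\b", "a cat sat on a mat").size()); // 2

        // Without the boundaries, every letter a is matched.
        System.out.println(findAll("a", "a cat sat on a mat").size());       // 5
    }
}
```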

Groups

Groups are less about deciding whether a pattern matches and more about retrieving data from the Matcher. If the regex contains groups, we can extract individual parts of the matched sequence. For example, if the regex is (\d)\d(\d) and the input string is 123, the pattern matches; group(1) returns 1 because the first group captured the first digit, and group(2) returns 3 because the second group captured the third digit. Groups are numbered by their opening parentheses from left to right, and group(0) is the entire match.
Method             Description
start(int group)   Returns the start index of the subsequence captured by the given group during the previous match operation.
end(int group)     Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation.
group(int group)   Returns the input subsequence captured by the given group during the previous match operation.
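
As a small sketch of these three methods, the code below applies the regex (\d)\d(\d) to the input 123 and reports each group with its start and end index. The class name GroupSketch and the helper describeGroups are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {
    // Hypothetical helper: if the whole input matches, returns one line per
    // group (group 0 is the entire match) with its text, start and end.
    static List<String> describeGroups(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.matches()) {
            for (int g = 0; g <= matcher.groupCount(); g++) {
                lines.add("group(" + g + ")=" + matcher.group(g)
                        + " start=" + matcher.start(g)
                        + " end=" + matcher.end(g));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeGroups("(\\d)\\d(\\d)", "123")) {
            System.out.println(line);
        }
        // group(0)=123 start=0 end=3
        // group(1)=1 start=0 end=1
        // group(2)=3 start=2 end=3
    }
}
```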
Groups are very useful when we want to extract only a part of the matched string. For example, if we want the domain names from all the email addresses in a document, we can create a regex that matches any email address and put a group inside it that captures the domain name.
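
The domain extraction described above could be sketched as follows. The email pattern here is a deliberately simplified assumption and not a complete email validator. The class name EmailDomainSketch, the helper extractDomains, and the sample addresses are also hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailDomainSketch {
    // Simplified email pattern (an assumption for this sketch):
    // a local part, '@', then group 1 capturing the domain name.
    private static final Pattern EMAIL =
            Pattern.compile("[\\w.+-]+@([\\w.-]+\\.[a-zA-Z]{2,})");

    // Hypothetical helper: returns the domain of every email address found in the text.
    static List<String> extractDomains(String text) {
        List<String> domains = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            domains.add(matcher.group(1)); // group 1 is the domain only
        }
        return domains;
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.com or bob@mail.example.org for details.";
        System.out.println(extractDomains(text)); // [example.com, mail.example.org]
    }
}
```

The whole address is still available as group(0), so the same Matcher can report both the full email address and its domain.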