Java: Taking Strings Apart with StringTokenizer
Problem
You need to take a string apart into words or tokens.
Solution
Construct a StringTokenizer around your string and call its methods hasMoreTokens( ) and nextToken( ). The StringTokenizer methods implement the Iterator design pattern (see Recipe 7.4):
// StrTokDemo.java
StringTokenizer st = new StringTokenizer("Hello World of Java");
while (st.hasMoreTokens( ))
    System.out.println("Token: " + st.nextToken( ));
StringTokenizer also implements the Enumeration interface directly (also in Recipe 7.4), but if you use the methods thereof you need to cast the results to String.

A StringTokenizer normally breaks the String into tokens at whitespace (space, tab, newline, carriage return, and form feed), which is what we would think of as “word boundaries” in European languages. Sometimes you want to break at some other character. No problem. When you construct your StringTokenizer, in addition to passing in the string to be tokenized, pass in a second string that lists the “break characters.” For example:
// StrTokDemo2.java
StringTokenizer st = new StringTokenizer("Hello, World|of|Java", ", |");
while (st.hasMoreElements( ))
    System.out.println("Token: " + st.nextElement( ));
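If you would rather keep the tokens than print them, the same loop can fill a list. The following is only a minimal sketch: the class name TokenList and the method tokens are made up for illustration. It uses nextToken( ), which already returns a String, so no cast is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of the original recipe.
class TokenList {
    /** Collects the tokens of line, breaking at any character in delims. */
    static List<String> tokens(String line, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delims);
        while (st.hasMoreTokens()) {
            // nextToken() returns String; nextElement() would return Object
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Hello, World, of, Java]
        System.out.println(tokens("Hello, World|of|Java", ", |"));
    }
}
```

Note that the comma and space after "Hello" do not produce an empty token between them.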
But wait, there’s more! What if you are reading lines like:
FirstName|LastName|Company|PhoneNumber
and your dear old Aunt Begonia hasn’t been employed for the last 38 years? Her “Company” field will in all probability be blank. If you look very closely at the previous code example, you’ll see that it has two delimiters together (the comma and the space), but if you run it, there are no “extra” tokens. That is, the StringTokenizer normally discards consecutive delimiters.

For cases like the phone list, where you need to preserve null fields, there is good news and bad news. The good news is you can do it: you simply add a third argument of true when constructing the StringTokenizer, meaning that you wish to see the delimiters as tokens. The bad news is that you now get to see the delimiters as tokens, so you have to do the arithmetic yourself. Want to see it? Run this program:
// StrTokDemo3.java
StringTokenizer st =
    new StringTokenizer("Hello, World|of|Java", ", |", true);
while (st.hasMoreElements( ))
    System.out.println("Token: " + st.nextElement( ));
and you get this output:
C:\javasrc>java StrTokDemo3
Token: Hello
Token: ,
Token:
Token: World
Token: |
Token: of
Token: |
Token: Java
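The effect of the third argument can also be checked without printing anything, by asking the tokenizer how many tokens it will return. This is a small sketch using the standard countTokens( ) method; the class name CountDemo and the helper count are hypothetical.

```java
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the original recipe.
class CountDemo {
    /** Number of tokens the tokenizer would return for this input. */
    static int count(String line, String delims, boolean returnDelims) {
        return new StringTokenizer(line, delims, returnDelims).countTokens();
    }

    public static void main(String[] args) {
        String line = "Hello, World|of|Java";
        System.out.println(count(line, ", |", false)); // 4: the words only
        System.out.println(count(line, ", |", true));  // 8: words plus each delimiter
    }
}
```

Each delimiter character comes back as its own one-character token, which is why the count doubles here.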
This isn’t how you’d like StringTokenizer to behave, ideally, but it is serviceable enough most of the time. Example 3-1 handles the delimiter tokens itself, so that consecutive delimiters yield null fields, and returns the results as an array of Strings.
Example 3-1. StrTokDemo4.java (StringTokenizer)
import java.util.*;

/** Show using a StringTokenizer including getting the delimiters back */
public class StrTokDemo4 {
    public final static int MAXFIELDS = 5;
    public final static String DELIM = "|";

    /** Processes one String; returns it as an array of Strings */
    public static String[] process(String line) {
        String[] results = new String[MAXFIELDS];

        // Unless you ask StringTokenizer to give you the delimiters,
        // it silently discards multiple null tokens.
        StringTokenizer st = new StringTokenizer(line, DELIM, true);

        int i = 0;
        // stuff each token into the current slot in the array
        while (st.hasMoreTokens( )) {
            String s = st.nextToken( );
            if (s.equals(DELIM)) {
                // Pre-increment, so the check covers the slot the
                // next field would be stored in.
                if (++i >= MAXFIELDS)
                    // This is messy: See StrTokDemo4b which uses
                    // a Vector to allow any number of fields.
                    throw new IllegalArgumentException("Input line " +
                        line + " has too many fields");
                continue;
            }
            results[i] = s;
        }
        return results;
    }

    public static void printResults(String input, String[] outputs) {
        System.out.println("Input: " + input);
        for (int i=0; i<outputs.length; i++)
            System.out.println("Output " + i + " was: " + outputs[i]);
    }

    // Driver inferred from the sample output shown after this listing.
    public static void main(String[] a) {
        printResults("A|B|C|D", process("A|B|C|D"));
        printResults("A||C|D", process("A||C|D"));
        printResults("A|||D|E", process("A|||D|E"));
    }
}
When you run this, you will see that A is always in the first field (Output 0), B (if present) is in the second (Output 1), and so on. In other words, the null fields are being handled properly:
Input: A|B|C|D
Output 0 was: A
Output 1 was: B
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A||C|D
Output 0 was: A
Output 1 was: null
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A|||D|E
Output 0 was: A
Output 1 was: null
Output 2 was: null
Output 3 was: D
Output 4 was: E
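The fixed MAXFIELDS limit is the awkward part of this approach. The listing's comment mentions a StrTokDemo4b that uses a Vector, which is not shown here. The following is not that program, only a sketch of the same idea using an ArrayList; the names FieldSplitter and splitFields are hypothetical, and it assumes the delimiter is a single character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical class; a sketch, not the author's StrTokDemo4b.
class FieldSplitter {
    /**
     * Splits line at each occurrence of delim (assumed to be one character),
     * keeping empty fields as null entries.
     */
    static List<String> splitFields(String line, String delim) {
        List<String> fields = new ArrayList<>();
        // true: hand back the delimiters as tokens so empty fields can be seen
        StringTokenizer st = new StringTokenizer(line, delim, true);
        String current = null; // text of the field being read; null if empty
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals(delim)) {
                fields.add(current); // a delimiter closes the current field
                current = null;
            } else {
                current = tok;
            }
        }
        fields.add(current); // the last field has no closing delimiter
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitFields("A||C|D", "|"));  // [A, null, C, D]
        System.out.println(splitFields("A|||D|E", "|")); // [A, null, null, D, E]
    }
}
```

Because the list grows as needed, there is no "too many fields" error to report; a caller that requires a specific number of fields can check the size of the returned list.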
See Also
Now that Java includes Regular Expressions (as of JDK 1.4), many occurrences of StringTokenizer can be replaced with Regular Expressions with considerably more flexibility. For example, to extract all the numbers from a String, you can use this code:
Matcher toke = Pattern.compile("\\d+").matcher(inputString);
while (toke.find( )) {
    String courseString = toke.group(0);
    int courseNumber = Integer.parseInt(courseString);
    ...
}
This allows user input to be more flexible than you could easily handle with a StringTokenizer. Assuming that the numbers represent course numbers at some educational institution, the inputs “471,472,570” or “Courses 471 and 472, 570” or just “471 472 570” should all give the same results.
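As a runnable illustration of that claim, here is a minimal sketch that wraps the same Pattern and Matcher loop in a method and tries it on the three inputs mentioned. The class name CourseNumbers and the method extract are hypothetical, and the sketch assumes each run of digits fits in an int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only.
class CourseNumbers {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns every run of digits in input, in order, as an Integer. */
    static List<Integer> extract(String input) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            // Assumes the digits fit in an int; parseInt throws otherwise
            numbers.add(Integer.parseInt(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "471,472,570", "Courses 471 and 472, 570", "471 472 570"
        };
        for (String input : inputs) {
            System.out.println(extract(input)); // [471, 472, 570] each time
        }
    }
}
```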