Sunday, February 17, 2013

Guava - Strings - Splitters

There are many utilities for splitting Strings in Java but, as Guava's wiki page for Splitter points out, they "can have some quirky behaviors."  [1]

Consider, for example, what would happen if you used "#$" as the delimiter for a StringTokenizer and the String was "A#B#$C#D#$E#F".  You might think that it would split on the character sequence "#$", resulting in the tokens "A#B", "C#D", and "E#F".  Unfortunately, the delimiter String you pass into StringTokenizer is treated as a set of single-character delimiters.  Therefore the actual tokens returned would be "A", "B", "C", "D", "E", and "F".  [2]
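
To see this behavior for yourself, here is a minimal sketch using only the JDK.  The class name TokenizerDemo and the helper method tokens are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Hypothetical helper: collects every token StringTokenizer produces
    // for the given set of delimiter characters.
    static List<String> tokens(String input, String delimiters) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, delimiters);
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // "#$" is read as two separate delimiters, '#' and '$'.
        System.out.println(tokens("A#B#$C#D#$E#F", "#$")); // prints [A, B, C, D, E, F]
    }
}
```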

Instead of using StringTokenizer, its documentation now recommends that you use the split method from the String class.  But this too has its pitfalls.  Consider what happens when you call split("#") on the String "A#####B###C##D#E".  It may surprise you that the run "###" between "B" and "C" produces two empty Strings (i.e. "").  To understand why, it may help to visualize the delimiters as fence posts.  Whatever gets fenced in between two posts gets put into the returned String array, even when it is empty.  [3]
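
The fence-post effect can be checked with a short sketch like the one below.  The class name SplitFencePosts and the method splitOnHash are assumptions made for this example only.

```java
import java.util.Arrays;

class SplitFencePosts {
    // Hypothetical helper: splits on '#'. The argument is a regular
    // expression, but '#' has no special meaning in one.
    static String[] splitOnHash(String input) {
        return input.split("#");
    }

    public static void main(String[] args) {
        String[] parts = splitOnHash("A#####B###C##D#E");
        // Five letters plus seven empty Strings fenced in between adjacent '#' posts.
        System.out.println(parts.length); // prints 12
        System.out.println(Arrays.toString(parts));
    }
}
```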

Another pitfall to watch out for with the split method is that the delimiter is treated as a regular expression.  That means if you wanted to split a document into individual sentences, you'd have to do split("[.]") rather than split("."), because an unescaped "." matches any character.  "[Otherwise] the splitter would put fence posts everywhere [and] give you a big load of nothing!"  [3]
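
A small sketch of the difference, again using only the JDK.  The class and method names are invented for this example, and Pattern.quote is shown as one alternative way to get a literal delimiter.

```java
import java.util.regex.Pattern;

class SplitRegexPitfall {
    // An unescaped "." matches every character, so every character is a delimiter.
    static String[] splitUnescaped(String text) {
        return text.split(".");
    }

    // Inside a character class, "." is a literal period.
    static String[] splitCharClass(String text) {
        return text.split("[.]");
    }

    // Pattern.quote turns any String into a literal pattern.
    static String[] splitQuoted(String text) {
        return text.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        String text = "First. Second. Third.";
        System.out.println(splitUnescaped(text).length); // prints 0
        System.out.println(splitCharClass(text).length); // prints 3
        System.out.println(splitQuoted(text).length);    // prints 3
    }
}
```

Note that the sentences after the first keep their leading space (" Second", " Third"), since only the period is removed.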

Guava's Splitter class avoids these pitfalls by letting you choose how the delimiter is interpreted.  If you want to use a regular expression to split a String, you can.  If you want to use a sequence of characters as the delimiter, you can do that too.  And when you split a String, the result is returned in an object that implements the Iterable interface.  That means you can use a for-each loop to iterate over the tokens.  [4]

import com.google.common.base.Splitter;

// Split on a single space character and print each token on its own line.
Iterable<String> result = Splitter.on(' ').split("The quick brown fox jumps over the lazy dog");

for (String str : result) {
  System.out.println(str);
}
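
Here is a rough sketch of how the two earlier examples might look with Splitter, assuming Guava is on the classpath.  Splitter.on(String) and Splitter.onPattern(String) come from Guava's API [4]; the class name SplitterDemo, the constant names, and the helper methods are made up for this illustration.

```java
import com.google.common.base.Splitter;
import java.util.ArrayList;
import java.util.List;

class SplitterDemo {
    // Splitters are kept as constants here; the names are hypothetical.
    private static final Splitter ON_SEQUENCE = Splitter.on("#$");       // literal sequence
    private static final Splitter ON_HASH_RUNS = Splitter.onPattern("#+"); // regular expression

    // Copies the Iterable returned by split into a List for easy printing.
    private static List<String> toList(Iterable<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token);
        }
        return result;
    }

    static List<String> splitOnSequence(String input) {
        return toList(ON_SEQUENCE.split(input));
    }

    static List<String> splitOnHashRuns(String input) {
        return toList(ON_HASH_RUNS.split(input));
    }

    public static void main(String[] args) {
        System.out.println(splitOnSequence("A#B#$C#D#$E#F"));    // prints [A#B, C#D, E#F]
        System.out.println(splitOnHashRuns("A#####B###C##D#E")); // prints [A, B, C, D, E]
    }
}
```
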
"Splitter instances are thread-safe immutable, and are therefore safe to store as static final constants." [4]

References

[1] Guava's wiki page for Splitter
[2] API for Java's StringTokenizer class in the package java.util
[3] forum post by cjard on April 23rd, 2004 12:48 PM at forums.codeguru.com
[4] API for Guava's Splitter class in the package com.google.common.base
