StringSplitter - String#split on steroids


gem "string_splitter"


require "string_splitter"

ss = StringSplitter.new

Same as String#split

ss.split("foo bar baz quux")
ss.split("foo bar baz quux", " ")
ss.split("foo bar baz quux", /\s+/)
# => ["foo", "bar", "baz", "quux"]

Split at the first delimiter

ss.split("foo:bar:baz:quux", ":", at: 1)
# => ["foo", "bar:baz:quux"]

Split at the last delimiter

ss.split("foo:bar:baz:quux", ":", at: -1)
# => ["foo:bar:baz", "quux"]

Split at multiple delimiter positions

ss.split("1:2:3:4:5:6:7:8:9", ":", at: [1..3, -2])
# => ["1", "2", "3", "4:5:6:7", "8:9"]

Split from the right

ss.rsplit("1:2:3:4:5:6:7:8:9", ":", at: [1..3, 5])
# => ["1:2:3:4", "5:6", "7", "8", "9"]

Full control via a block

result = ss.split('a:a:a:b:c:c:e:a:a:d:c', ":") do |split|
  split.index > 0 && split.lhs == split.rhs
end
# => ["a:a", "a:b:c", "c:e:a", "a:d:c"]


Many languages have built-in split functions/methods for strings. They behave similarly (notwithstanding the occasional surprise), and handle a few common cases e.g.:

  • limiting the number of splits
  • including the separator(s) in the results
  • removing (some) empty fields

But, because the API is squeezed into two overloaded parameters (the delimiter and the limit), achieving the desired results can be tricky. For instance, while String#split removes empty trailing fields (by default), it provides no way to remove all empty fields. Likewise, the cramped API means there's no way to e.g. combine a limit (positive integer) with the option to preserve empty fields (negative integer), or use backreferences in a delimiter pattern without including its captured subexpressions in the result.
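The limitations described above can be seen with plain String#split. The following is a small illustration using only Ruby's built-in method; the helper names are invented for this example and are not part of any library.

```ruby
# Illustrative helpers (hypothetical names) wrapping the built-in String#split.

# Default: trailing empty fields are dropped; leading and interior ones are kept.
def default_split(string)
  string.split(":")
end

# A negative limit preserves trailing empty fields.
def keep_trailing_empties(string)
  string.split(":", -1)
end

# A positive limit caps the number of fields returned.
def limited_split(string, limit)
  string.split(":", limit)
end

# Removing *all* empty fields needs a second step after the split.
def split_without_empties(string)
  string.split(":").reject(&:empty?)
end

# A delimiter pattern with a capture group (needed here for the backreference)
# leaks the captured text into the result.
def split_on_doubled_char(string)
  string.split(/(.)\1/)
end

default_split(":foo::bar::")          # => ["", "foo", "", "bar"]
keep_trailing_empties(":foo::bar::")  # => ["", "foo", "", "bar", "", ""]
limited_split("foo:bar:baz", 2)       # => ["foo", "bar:baz"]
split_without_empties(":foo::bar::")  # => ["foo", "bar"]
split_on_doubled_char("aXXbYYc")      # => ["a", "X", "b", "Y", "c"]
```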

If split were being written from scratch, without the baggage of its legacy API, some of these options would probably be made explicit rather than squeezed into overloaded parameters. Some implementations already work this way, e.g. Crystal:

":foo:bar:baz:".split(":", remove_empty: false) # => ["", "foo", "bar", "baz", ""]
":foo:bar:baz:".split(":", remove_empty: true)  # => ["foo", "bar", "baz"]

StringSplitter takes this one step further by moving the configuration out of the method altogether and delegating the strategy — i.e. which splits should be accepted or rejected — to a block:

ss = StringSplitter.new

ss.split("foo:bar:baz", ":") { |split| split.index == 0 }
# => ["foo", "bar:baz"]

ss.split("foo:bar:baz", ":") { |split| split.position == split.count }
# => ["foo:bar", "baz"]
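To make the accept/reject idea concrete, here is a minimal sketch in plain Ruby. It is not StringSplitter's implementation: the `Split` struct and `split_with_block` method are hypothetical, only literal string delimiters are handled, and the field names simply mirror the ones used in the examples above (`index` is 0-based, `position` is 1-based, `count` is the total number of delimiters).

```ruby
# Hypothetical value object describing one candidate split.
Split = Struct.new(:index, :position, :count, :lhs, :rhs)

# Split `string` on the literal `delimiter`, keeping only the splits
# for which the block returns a truthy value.
def split_with_block(string, delimiter)
  fields = string.split(delimiter, -1) # -1 keeps trailing empty fields
  return [] if fields.empty?

  count = fields.length - 1            # number of delimiters found
  result = [fields.first]

  fields.each_cons(2).with_index do |(lhs, rhs), index|
    split = Split.new(index, index + 1, count, lhs, rhs)

    if yield(split)
      result << rhs                              # accepted: start a new field
    else
      result[-1] = result[-1] + delimiter + rhs  # rejected: glue the pieces back together
    end
  end

  result
end

split_with_block("foo:bar:baz", ":") { |split| split.index == 0 }
# => ["foo", "bar:baz"]

split_with_block("foo:bar:baz", ":") { |split| split.position == split.count }
# => ["foo:bar", "baz"]
```

Because the strategy lives entirely in the block, the splitting method itself needs no limit or option parameters.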

As a shortcut, the common case of splitting on delimiters at one or more positions is supported by an option:

ss.split('foo:bar:baz:quux', ':', at: [1, -1]) # => ["foo", "bar:baz", "quux"]
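One way to think about the position list is as a set of 1-based delimiter positions, with negative values counting back from the last delimiter. The sketch below shows that normalization in plain Ruby. The `resolve_positions` helper is hypothetical, and the treatment of negative values is an assumption chosen to be consistent with the examples in this document, not a description of the gem's internals; ranges with negative endpoints are not handled.

```ruby
# Hypothetical helper: expand integers and ranges into a sorted list of
# 1-based delimiter positions, given the total number of delimiters.
def resolve_positions(at, count)
  Array(at)
    .flat_map { |spec| spec.is_a?(Range) ? spec.to_a : [spec] }
    .map { |position| position < 0 ? count + 1 + position : position }
    .uniq
    .sort
end

# "foo:bar:baz:quux" contains 3 delimiters
resolve_positions([1, -1], 3)     # => [1, 3]

# "1:2:3:4:5:6:7:8:9" contains 8 delimiters
resolve_positions([1..3, -2], 8)  # => [1, 2, 3, 7]
```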


I wanted to split semi-structured output into fields without having to resort to a regex or a full-blown parser.

As an example, the nominally unstructured output of many Unix commands is often formatted in a way that's tantalizingly close to being machine-readable, apart from a few pesky exceptions e.g.:

$ ls -l

-rw-r--r-- 1 user users   87 Jun 18 18:16
-rw-r--r-- 1 user users  254 Jun 19 21:21 Gemfile
drwxr-xr-x 3 user users 4096 Jun 19 22:56 lib
-rw-r--r-- 1 user users 8952 Jun 18 18:16
-rw-r--r-- 1 user users 3134 Jun 19 22:59

These lines can almost be parsed into an array of fields by splitting them on whitespace. The exception is the date (columns 6-8) i.e.:

line = "-rw-r--r-- 1 user users   87 Jun 18 18:16"
line.split

gives:

["-rw-r--r--", "1", "user", "users", "87", "Jun", "18", "18:16", ""]

instead of:

["-rw-r--r--", "1", "user", "users", "87", "Jun 18 18:16", ""]

One way to work around this is to parse the whole line e.g.:

line.match(/^(\S+) \s+ (\d+) \s+ (\S+) \s+ (\S+) \s+ (\d+) \s+ (\S+ \s+ \d+ \s+ \S+) \s+ (.+)$/x)
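For reference, this is what that pattern captures when applied to the Gemfile line from the listing above. The `LS_LINE` constant and `parse_ls_line` method are names made up for this illustration.

```ruby
# The pattern from above, stored in a constant (hypothetical name).
LS_LINE = /^(\S+) \s+ (\d+) \s+ (\S+) \s+ (\S+) \s+ (\d+) \s+ (\S+ \s+ \d+ \s+ \S+) \s+ (.+)$/x

# Return the captured fields, or nil if the line does not match.
def parse_ls_line(line)
  match = line.match(LS_LINE)
  match && match.captures
end

parse_ls_line("-rw-r--r-- 1 user users  254 Jun 19 21:21 Gemfile")
# => ["-rw-r--r--", "1", "user", "users", "254", "Jun 19 21:21", "Gemfile"]
```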

But that requires us to specify everything. What we really want is a version of split which allows us to veto splitting for the 6th and 7th delimiters i.e. control over which splits are accepted, rather than being restricted to the single, baked-in strategy provided by the limit parameter.

By providing a simple way to accept or reject each split, StringSplitter makes cases like this easy to handle, either via a block:

ss.split(line) do |split|
  case split.position when 1..5, 8 then true end
end
# => ["-rw-r--r--", "1", "user", "users", "87", "Jun 18 18:16", ""]

Or via its option shortcut:

ss.split(line, at: [1..5, 8])
# => ["-rw-r--r--", "1", "user", "users", "87", "Jun 18 18:16", ""]


StringSplitter is tested and supported on all versions of Ruby supported by the ruby-core team, i.e., currently, Ruby 2.3 and above.





