Is there a generator version of `string.split()` in Python?

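Problem description:

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?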

Solution 1:

It is highly probable that re.finditer uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

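I have confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have allocated extra memory). This did not result in a noticeable growth of memory (that is, if memory did grow, it grew by far less than the 1GB string).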
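
The measurement described above can be approximated on a smaller scale with the standard tracemalloc module. This is only a sketch: the helper peak_bytes, the two consumer functions and the 200,000-word test string are assumptions made for illustration, not the setup the answer's author used.

```python
import re
import tracemalloc

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

def peak_bytes(consume, text):
    # Hypothetical helper: peak Python-level allocation while `consume` runs.
    tracemalloc.start()
    consume(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def via_list(text):
    for word in text.split():      # builds the whole list first
        pass

def via_generator(text):
    for word in split_iter(text):  # one match object at a time
        pass

# Built before tracing starts, so the input itself is not counted.
text = "word " * 200_000

list_peak = peak_bytes(via_list, text)
gen_peak = peak_bytes(via_generator, text)
print("str.split peak bytes: ", list_peak)
print("split_iter peak bytes:", gen_peak)
```

On CPython the generator's peak should come out far below the list's, because only one fragment is alive at any moment.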

More general version:

In reply to the comment "I fail to see the connection with str.split", here is a more general version:

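import re

def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep=='':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))

    # alternatively, more verbosely (this would replace the `return` above;
    # kept as comments, since a live `yield` here would turn splitStr into a
    # generator function and the returned generators would never be iterated):
    #     regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
    #     for match in re.finditer(regex, string):
    #         fragment = match.group(1)
    #         yield fragment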

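The idea is that ((?!pat).)* "negates" a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit end of string).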
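
A small illustration of that idiom in isolation may help. The two-character separator "--" and the variable names are picked purely for this example:

```python
import re

text = "abc--def--ghi"

# ((?!--).)* : consume one character at a time, but only while "--" does not start here
tempered = re.match(r"((?:(?!--).)*)", text)

# plain greedy .* has no such guard and runs to the end of the string
greedy = re.match(r"(.*)", text)

print(tempered.group(1))  # abc
print(greedy.group(1))    # abc--def--ghi
```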

Demo:

>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>

>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']

>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']

>>> list(splitStr('.......A...b...c....', r'\.\.\.'))
['', '', '.A', 'b', 'c', '.']

>>> list(splitStr('   A  b  c. '))
['', 'A', 'b', 'c.', '']

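(One should note that str.split has an ugly behavior: it special-cases sep=None by first doing the equivalent of str.strip to remove leading and trailing whitespace. The code above purposely does not do that; see the last example, where sep=r"\s+".)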
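
The special case is easy to see with plain str.split on the same input as the last demo. The variable names are only for this illustration:

```python
text = "   A  b  c. "

# sep=None (the default): runs of whitespace act as one separator and the ends are stripped
default_split = text.split()

# an explicit separator: every single space counts, so empty strings appear
explicit_split = text.split(" ")

print(default_split)   # ['A', 'b', 'c.']
print(explicit_split)  # ['', '', '', 'A', '', 'b', '', 'c.', '']
```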

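(I ran into various bugs (including an internal re.error) when trying to implement this... Negative lookbehind would restrict you to fixed-length delimiters, so we don't use it. Almost anything besides the above regex seemed to produce errors in the beginning-of-string and end-of-string edge cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''], with an extraneous empty string at the end; one can look at the edit history for another seemingly correct regex that actually has subtle bugs.)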

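(If you want to implement this yourself for higher performance (although regexes are heavyweight, most importantly they run in C), you would write some code (with ctypes? not sure how to get generators working with it?) following this pseudocode for fixed-length delimiters: hash your delimiter of length L. Keep a running hash of length L as you scan the string with a running-hash algorithm, with O(1) update time. Whenever the hash might equal your delimiter, manually check whether the past few characters were the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm for O(N) text search. Multiprocessing versions are also possible. They might seem like overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc.)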
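
The pseudocode above could be sketched in pure Python roughly as follows. This only illustrates the rolling-hash idea (in the style of Rabin-Karp); it is not the C-level implementation the paragraph has in mind, and in pure Python it will likely be slower than str.find. The name rk_split and the base and modulus constants are assumptions chosen for the example.

```python
def rk_split(text, sep):
    """Yield the pieces of `text` between occurrences of the fixed string `sep`."""
    L = len(sep)
    if L == 0:
        raise ValueError("empty separator")
    base, mod = 257, (1 << 61) - 1     # arbitrary illustrative constants
    high = pow(base, L - 1, mod)       # weight of the character leaving the window
    target = 0
    for ch in sep:
        target = (target * base + ord(ch)) % mod

    window = 0   # hash of the characters currently in the window
    filled = 0   # how many characters the window holds (at most L)
    start = 0    # index where the current fragment begins
    for i, ch in enumerate(text):
        if filled == L:
            # O(1) update: drop the oldest character, text[i - L]
            window = (window - ord(text[i - L]) * high) % mod
            filled -= 1
        window = (window * base + ord(ch)) % mod
        filled += 1
        # A hash hit is only a candidate; confirm by comparing the characters.
        if filled == L and window == target and text[i - L + 1:i + 1] == sep:
            yield text[start:i - L + 1]
            start = i + 1
            window = 0
            filled = 0   # restart so that matches never overlap
    yield text[start:]   # end-of-string special case

print(list(rk_split(".......A...b...c....", "...")))
# ['', '', '.A', 'b', 'c', '.']
```

Resetting the window after each hit keeps matches non-overlapping and left to right, which is what makes the output agree with str.split for a fixed separator.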

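Solution 2:

The most efficient way I can think of is to write one using the offset parameter (start) of the str.find() method. This avoids lots of memory use, and avoids the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]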

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want...

>>> print(list(isplit("abcb", "b")))
['a', 'c', '']

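While there is a little bit of cost in seeking within the string each time find() or slicing is performed, this should be minimal, since strings are represented as contiguous arrays in memory.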
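
One practical consequence of the find-based approach is that a consumer can stop early without scanning the rest of the string. A minimal sketch, using a compact stand-in for the str.find() branch above (find_split and the sample data are hypothetical names for this illustration):

```python
from itertools import islice

def find_split(source, sep):
    # Compact stand-in for the str.find() branch: non-empty literal separator only.
    start = 0
    while True:
        idx = source.find(sep, start)
        if idx == -1:
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + len(sep)

big = ",".join(str(n) for n in range(1_000_000))

# Only the first three separators are ever searched for.
first_three = list(islice(find_split(big, ","), 3))
print(first_three)  # ['0', '1', '2']
```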

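Solution 3:

I did some performance testing on the various methods proposed (I won't repeat them here). Some results:

str.split (default) = 0.3461570239996945
manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
re.finditer (ninjagecko's answer) = 0.698872097000276
str.find (one of Eli Collins's answers) = 0.7230395330007013
itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
str.split(..., maxsplit=1) recursion = N/A†

†The recursion answers (string.split with maxsplit = 1) fail to complete in a reasonable time. Given string.split's speed they may work better on shorter strings, but then I can't see the use case for short strings, where memory isn't an issue anyway.

Tested using timeit on: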

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

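This raises another question: why is string.split so much faster despite its memory usage?

Solution 4:

Here is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.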

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

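Edit: Corrected the handling of surrounding whitespace when no separator is given.

Solution 5:

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function: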

str_split(s, *delims, empty=None)

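Split the string s by the rest of the arguments, possibly omitting empty parts (the empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
empty is then True by default.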

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

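When multiple delimiters are supplied, the string is split by longest possible sequences of those delimiters by default, or, if empty is set to
True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.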

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\t  bb c \n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    # validate before joining: iterating a joined string always yields single characters
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

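This function works in Python 3, and an easy (though quite ugly) fix can be applied to make it work in both Python 2 and 3. The first lines of the function should be changed to: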

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

Solution 6:

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        # PEP 479: raising StopIteration inside a generator becomes RuntimeError in Python 3.7+
        return

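Solution 7:

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over, so you're not going to save any memory by having a generator.

If you wanted to write one, it would be fairly easy though: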

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

Solution 8:

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. it splits on whitespace by default and you can specify a delimiter).

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both Python 3 and Python 2):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1    2    3") == ["1", "2", "3"]
assert helper("1    2     3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

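Python's regex module says that it does "the right thing" for unicode whitespace, but I haven't actually tested it.

Also available as a gist.

Solution 9:

If you would also like to be able to read an iterator (as well as return one), try this: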

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']

Solution 10:

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

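more_itertools is a third-party package.

Solution 11:

I wanted to show how to use the finditer solution to return a generator for the given delimiters, and then use the pairwise recipe from itertools to build a previous/next iteration that gets the actual words, as in the original split method.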

from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

Notes:

I use prev & curr instead of prev & next because overriding next in Python is a very bad idea
This is quite efficient
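
The same idea can be packaged as a reusable generator without the third-party dependency, using the classic tee-based pairwise recipe from the itertools documentation. The function name split_between_matches is made up for this sketch, and re.escape is added here so that delimiters such as "]" or "^" stay literal inside the character class; neither is part of the original answer.

```python
import re
from itertools import tee

def pairwise(iterable):
    # classic itertools recipe: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def split_between_matches(string, delimiter=" "):
    # Boundaries are: start of string, each run of delimiter characters, end of string.
    pattern = "^|[{0}]+|$".format(re.escape(delimiter))
    for prev, curr in pairwise(re.finditer(pattern, string)):
        yield string[prev.end():curr.start()]

print(list(split_between_matches("a b  c")))  # ['a', 'b', 'c']
```

Because the pattern uses [...]+, a run of consecutive delimiters counts as a single boundary, which differs from str.split with an explicit separator.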

Solution 12:

Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            # skip the whole separator, not just one character
            text = text[end + len(split):]

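Solution 13:

This question is very old, but here is my humble contribution with an efficient algorithm:

from typing import Iterable

def str_split(text: str, separator: str) -> Iterable[str]:
    if not separator:
        raise ValueError("empty separator")  # mirrors str.split
    i = 0
    n = len(text)
    step = len(separator)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        # advance past the whole separator, not just one character
        i = j + step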

Solution 14:

Do:

iter(io.StringIO(my_str))

Usage example:

>>> import io
>>> for x in iter(io.StringIO('hello')):
...   print(x)
...
hello
>>> for x in iter(io.StringIO('hello\nworld\n')):
...   print(x)
...
hello

world

Docs: https://docs.python.org/3/library/io.html#io.StringIO
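
As the demo shows, each item keeps its trailing newline, and this approach only splits on line boundaries. A small wrapper can strip the newline; iter_lines is a hypothetical name used for this sketch:

```python
import io

def iter_lines(text):
    # io.StringIO iterates line by line, keeping the trailing newline on each item
    for line in io.StringIO(text):
        yield line.rstrip("\n")

print(list(iter_lines("hello\nworld\n")))  # ['hello', 'world']
```

Note that 'hello\nworld\n'.split('\n') would additionally produce a final empty string, so the two are not exact equivalents.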

Solution 15:

def split_generator(f,s):
    """
    f is a string, s is the single character we split on.
    This produces a generator rather than a possibly
    memory intensive list. 
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            yield f[j:i]
            j=i+1
            i=i+1

Solution 16:

Here is a simple response

def gen_str(some_string, sep):
    j=0
    guard = len(some_string)-1
    for i,s in enumerate(some_string):
        if s == sep:
           yield some_string[j:i]
           j=i+1
        elif i!=guard:
           continue
        else:
           yield some_string[j:]

Solution 17:

import re

def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')

    if maxsplit == 0 or not text:
        yield text
        return

    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))

Solution 18:

This is an answer based on split and maxsplit. It does not use recursion.

def gsplit(todo):
    chunk= 100
    while todo:
        splits = todo.split(maxsplit=chunk)
        # maxsplit=chunk returns at most chunk + 1 items; the extra one is the unsplit remainder
        if len(splits) == chunk + 1:
            todo = splits.pop()
        else:
            todo=None
        for item in splits:
            yield item

Solution 19:

def splitter(string, delimiter=" "):
    start = end = 0
    while end < len(string):
        while end<len(string) and string[end] != delimiter:
            end += 1
        yield string[start: end]
        start = end = end +1
    return string[end:]

print(list(splitter("abdcabcd", "b")))

#> ['a', 'dca', 'cd']