How to find groups of consecutive elements in a NumPy array

2025-03-04 08:24:00

admin

Problem description:

I need to cluster the consecutive elements of a NumPy array. Consider the following example:

a = [ 0, 47, 48, 49, 50, 97, 98, 99]

The output should be a list of tuples, as follows:

[(0), (47, 48, 49, 50), (97, 98, 99)]

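Here the difference between neighboring elements within a group is exactly 1. It would be great if that difference could also be specified as a limit or a hard-coded number.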

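Solution 1:

import numpy as np

def consecutive(data, stepsize=1):
    # Split right after every position where the step to the next element is not stepsize.
    return np.split(data, np.where(np.diff(data) != stepsize)[0]+1)

a = np.array([0, 47, 48, 49, 50, 97, 98, 99])
consecutive(a)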

yields

[array([0]), array([47, 48, 49, 50]), array([97, 98, 99])]
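The question asks for tuples and for the allowed difference to act as a limit. A minimal sketch of one way to do that, assuming a hypothetical helper `consecutive_within` and parameter `max_gap` (neither is part of the original answer) and assuming the input is sorted:

```python
import numpy as np

def consecutive_within(data, max_gap=1):
    """Split sorted 1-D data wherever neighbors differ by more than max_gap.

    Hypothetical helper for illustration; not part of the original answer.
    """
    data = np.asarray(data)
    # Positions where a new group starts (one past each oversized gap).
    breaks = np.where(np.diff(data) > max_gap)[0] + 1
    return [tuple(chunk.tolist()) for chunk in np.split(data, breaks)]

a = np.array([0, 47, 48, 49, 50, 97, 98, 99])
groups = consecutive_within(a)
print(groups)  # [(0,), (47, 48, 49, 50), (97, 98, 99)]
```

Note that a one-element tuple prints as `(0,)`, not `(0)` as written in the question.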

Solution 2:

Here is a small function that may help:

def group_consecutives(vals, step=1):
    """Return list of consecutive lists of numbers from vals (number list)."""
    run = []
    result = [run]
    expect = None
    for v in vals:
        if (v == expect) or (expect is None):
            run.append(v)
        else:
            run = [v]
            result.append(run)
        expect = v + step
    return result

>>> group_consecutives(a)
[[0], [47, 48, 49, 50], [97, 98, 99]]
>>> group_consecutives(a, step=47)
[[0, 47], [48], [49], [50, 97], [98], [99]]

P.S. This is pure Python. For a NumPy solution, see unutbu's answer (the consecutive function).

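Solution 3:

(a[1:]-a[:-1])==1 produces a boolean array in which False marks a break in the run. The built-in numpy.diff computes the same differences. (NumPy has no numpy.grad; the closest name, numpy.gradient, uses central differences and is not equivalent.)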
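As a rough illustration of how that boolean array can be turned into groups, here is a minimal sketch. The variable names `is_consecutive`, `starts`, and `runs` are my own, and the splitting step is borrowed from the other answers rather than from this one:

```python
import numpy as np

a = np.array([0, 47, 48, 49, 50, 97, 98, 99])

# True where an element continues the run of its left neighbor.
is_consecutive = (a[1:] - a[:-1]) == 1
print(is_consecutive.tolist())
# [False, True, True, True, False, True, True]

# Each False marks a break; the new run starts one position to the right.
starts = np.flatnonzero(~is_consecutive) + 1
runs = np.split(a, starts)
print([r.tolist() for r in runs])
# [[0], [47, 48, 49, 50], [97, 98, 99]]
```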

Solution 4:

Tested on one-dimensional arrays.

Find where the diff is not 1:

diffs = numpy.diff(array) != 1

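Get the indexes of those positions: take the first dimension and add one to every index, because diff compares each element with the previous one: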

indexes = numpy.nonzero(diffs)[0] + 1

Split at the resulting indexes:

groups = numpy.split(array, indexes)
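The three steps above can be collected into one function. This is only a sketch; the wrapper name `split_runs` is hypothetical and the input is assumed to be one-dimensional:

```python
import numpy as np

def split_runs(array):
    """Hypothetical wrapper that combines the three steps above."""
    array = np.asarray(array)
    diffs = np.diff(array) != 1            # True where a run is broken
    indexes = np.nonzero(diffs)[0] + 1     # first index of each new run
    return np.split(array, indexes)

groups = split_runs([1, 2, 3, 7, 8, 10])
print([g.tolist() for g in groups])  # [[1, 2, 3], [7, 8], [10]]
```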

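Solution 5:

This is what I have come up with so far; I am not sure it is 100% correct.

import numpy as np
a = np.array([ 0, 47, 48, 49, 50, 97, 98, 99])
# np.where already returns the break positions, so they are used directly.
# Passing them through np.cumsum only works by accident when there are
# just two breaks and the first one is at index 0.
print(np.split(a, np.where(a[1:] - a[:-1] > 1)[0] + 1))

>>>[array([0]), array([47, 48, 49, 50]), array([97, 98, 99])]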
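A quick check of why the split positions should come straight from np.where, without passing them through np.cumsum. The test array `b` is made up for illustration and has three breaks, none of them at index 0:

```python
import numpy as np

b = np.array([0, 1, 5, 6, 10, 11, 20])

breaks = np.where(b[1:] - b[:-1] > 1)[0]
print(breaks.tolist())             # [1, 3, 5]
print(np.cumsum(breaks).tolist())  # [1, 4, 9]

# Splitting at the break positions themselves gives the expected groups.
direct = np.split(b, breaks + 1)
# Accumulating the positions shifts every split point after the first.
via_cumsum = np.split(b, np.cumsum(breaks) + 1)

print([g.tolist() for g in direct])      # [[0, 1], [5, 6], [10, 11], [20]]
print([g.tolist() for g in via_cumsum])  # [[0, 1], [5, 6, 10], [11, 20], []]
```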

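Solution 6:

It turns out that a list comprehension is more efficient than np.split. The function below is almost identical to @unutbu's consecutive function, except that it uses a list comprehension to split the array, and it is much faster: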

def consecutive_w_list_comprehension(arr, stepsize=1):
    idx = np.r_[0, np.where(np.diff(arr) != stepsize)[0]+1, len(arr)]
    return [arr[i:j] for i,j in zip(idx, idx[1:])]
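To make the role of `idx` concrete, the sketch below prints the boundaries that `np.r_` builds for the example array from the question. The names `a` and `pairs` are used here only for illustration:

```python
import numpy as np

a = np.array([0, 47, 48, 49, 50, 97, 98, 99])

# Group boundaries: 0, the start of each new run, and len(a) as the final stop.
idx = np.r_[0, np.where(np.diff(a) != 1)[0] + 1, len(a)]
print(idx.tolist())  # [0, 1, 5, 8]

# Consecutive boundary pairs become the (start, stop) of each slice.
pairs = list(zip(idx.tolist(), idx[1:].tolist()))
print(pairs)  # [(0, 1), (1, 5), (5, 8)]

print([a[i:j].tolist() for i, j in pairs])
# [[0], [47, 48, 49, 50], [97, 98, 99]]
```

Each slice `arr[i:j]` is a view into the original array, so no data is copied.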

For example, on an array of length 100_000, consecutive_w_list_comprehension is more than 4 times faster:

arr = np.sort(np.random.choice(range(150000), size=100000, replace=False))

%timeit -n 100 consecutive(arr)
96.1 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -n 100 consecutive_w_list_comprehension(arr)
23.2 ms ± 858 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
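The `%timeit` lines above only work inside IPython or Jupyter. For a plain Python script, a comparable measurement could be sketched with the standard `timeit` module. The seed, the repeat count of 10, and the use of `np.random.default_rng` are my own choices, and the numbers will differ from machine to machine:

```python
import timeit
import numpy as np

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

def consecutive_w_list_comprehension(arr, stepsize=1):
    idx = np.r_[0, np.where(np.diff(arr) != stepsize)[0] + 1, len(arr)]
    return [arr[i:j] for i, j in zip(idx, idx[1:])]

# Sorted sample of 100_000 distinct integers below 150_000, as in the answer.
rng = np.random.default_rng(0)
arr = np.sort(rng.choice(150_000, size=100_000, replace=False))

for fn in (consecutive, consecutive_w_list_comprehension):
    # Average wall-clock time over 10 calls.
    seconds = timeit.timeit(lambda: fn(arr), number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```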

In fact, this relationship holds regardless of the array size. The plot below shows the runtime differences among the answers here.

[Figure: runtime comparison of the answers; image not included]

Code used to generate the plot above:

import perfplot
import numpy as np

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0]+1)

def consecutive_w_list_comprehension(arr, stepsize=1):
    idx = np.r_[0, np.where(np.diff(arr) != stepsize)[0]+1, len(arr)]
    return [arr[i:j] for i,j in zip(idx, idx[1:])]

def group_consecutives(vals, step=1):
    run = []
    result = [run]
    expect = None
    for v in vals:
        if (v == expect) or (expect is None):
            run.append(v)
        else:
            run = [v]
            result.append(run)
        expect = v + step
    return result


def JozeWs(array):
    diffs = np.diff(array) != 1
    indexes = np.nonzero(diffs)[0] + 1
    groups = np.split(array, indexes)
    return groups

perfplot.show(
    setup = lambda n: np.sort(np.random.choice(range(2*n), size=n, replace=False)),
    kernels = [consecutive, consecutive_w_list_comprehension, group_consecutives, JozeWs],
    labels = ['consecutive', 'consecutive_w_list_comprehension', 'group_consecutives', 'JozeWs'],
    n_range = [2 ** k for k in range(5, 22)],
    equality_check = lambda *lst: all((x==y).all() for x,y in zip(*lst)),
    xlabel = '~len(arr)'
)

Solution 7:

This sounds a bit like homework, so if you don't mind I will suggest an approach.

You can iterate over the list using

for i in range(len(a)):
    print(a[i])

You can test whether the next element in the list is one more than the current one with

if a[i + 1] == a[i] + 1:
    print("it must be a consecutive run")

You can store the results separately in

results = []

Note: the code above hides an index-out-of-range error that you will need to handle.
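One possible way to assemble those hints into a working loop, offered as a sketch and not as the answerer's intended solution. The variable `run` is my own addition, and the list is assumed to be non-empty:

```python
a = [0, 47, 48, 49, 50, 97, 98, 99]

results = []      # finished runs
run = [a[0]]      # run currently being built (assumes a is non-empty)

# Stop one short of the end so a[i + 1] never goes out of range.
for i in range(len(a) - 1):
    if a[i + 1] == a[i] + 1:
        run.append(a[i + 1])      # next element continues the run
    else:
        results.append(run)       # run is broken: store it and start a new one
        run = [a[i + 1]]
results.append(run)               # store the last run as well

print(results)  # [[0], [47, 48, 49, 50], [97, 98, 99]]
```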