Efficiently replacing values in one column with values from another column in a Pandas DataFrame

2025-04-10 09:46:00

admin

Original


Problem description:

I have a Pandas DataFrame like this:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3     0  0.4  0.4
4     0    0  0.3
5     0    0    0
6   0.1  0.4  0.4

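I want to replace the values in col1 with the values from the second column (col2), but only where the col1 value equals 0. Afterwards, for any zeros that remain, I want to do the same again with the third column (col3). The desired result is: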

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3   0.4  0.4  0.4
4   0.3    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I did this with the pd.replace function, but it seems too slow. I think there must be a faster way to do it.

df.col1.replace(0,df.col2,inplace=True)
df.col1.replace(0,df.col3,inplace=True)

Is there a faster way to do this, using some other function instead of pd.replace?

Solution 1:

Using np.where is faster. Following the same pattern you used with replace:

df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])

However, a nested np.where is slightly faster:

df['col1'] = np.where(df['col1'] == 0, 
                      np.where(df['col2'] == 0, df['col3'], df['col2']),
                      df['col1'])
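
To make the nested form easier to try out, here is a minimal self-contained sketch. It rebuilds the six-row example from the question (the constructor call and the intermediate name fallback are my own additions, not part of the original answer) and applies the same nested np.where logic in two steps.

```python
import numpy as np
import pandas as pd

# Example data reconstructed from the question (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Inner np.where: the per-row fallback (col2, or col3 where col2 is 0).
fallback = np.where(df["col2"] == 0, df["col3"], df["col2"])

# Outer np.where: keep col1 unless it is 0, in which case take the fallback.
df["col1"] = np.where(df["col1"] == 0, fallback, df["col1"])

print(df)
```

Computing the fallback first means col1 is only written once, which is presumably where the small speed advantage over the two-pass version comes from.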

Timings

Using the following setup to produce a larger sample DataFrame and the functions to time:

df = pd.concat([df]*10**4, ignore_index=True)

def root_nested(df):
    df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
    return df

def root_split(df):
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    return df

def pir2(df):
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    return df

def pir2_2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

def andrew(df):
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    return df

def pablo(df):
    df['col1'] = df['col1'].replace(0,df['col2'])
    df['col1'] = df['col1'].replace(0,df['col3'])
    return df

I get the following timings:

%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop

%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop

%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop

%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop

%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop

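I tried timing your replace-based method (pablo above) on the larger DataFrame, but it ran for several minutes without finishing. For comparison, timing your method on just the 6-row example DataFrame (not the much larger one tested above) took 12.8 ms.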
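
The %timeit lines above only work inside IPython or Jupyter. As a rough sketch of how the same comparison could be run as a plain script, the standard timeit module can be used instead. The names small and big and the choice of number=20 are my own assumptions, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Six-row example from the question, then repeated 10**4 times as in the setup above.
small = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([small] * 10**4, ignore_index=True)


def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


# Both variants should agree before their timings are compared.
assert root_nested(big.copy()).equals(root_split(big.copy()))

for func in (root_nested, root_split):
    # Best of 3 repeats; number=20 calls per repeat is an arbitrary choice.
    runs = timeit.repeat(lambda: func(big.copy()), number=20, repeat=3)
    print(f"{func.__name__}: {min(runs) / 20 * 1e3:.2f} ms per loop")
```

As in the original timings, each call works on df.copy(), so the cost of the copy is included in every measurement.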

Solution 2:

I'm not sure whether it's faster, but you're right that you can slice the DataFrame to get the result you want.

df.col1[df.col1 == 0] = df.col2
df.col1[df.col1 == 0] = df.col3
print(df)

Output:

   col1  col2  col3
0   0.2   0.3   0.3
1   0.2   0.3   0.3
2   0.4   0.4   0.4
3   0.3   0.0   0.3
4   0.0   0.0   0.0
5   0.1   0.4   0.4

Or, if you want it to be more terse (though I don't know if it's faster), you can combine what you did with what I did.

df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
print(df)

Output:

   col1  col2  col3
0   0.2   0.3   0.3
1   0.2   0.3   0.3
2   0.4   0.4   0.4
3   0.3   0.0   0.3
4   0.0   0.0   0.0
5   0.1   0.4   0.4
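
One caveat about the chained form df.col1[mask] = ...: depending on the pandas version it can raise a SettingWithCopyWarning, and with copy-on-write enabled it does not update df at all. The sketch below shows one possible way to write the one-line variant with a single .loc assignment instead. The name fallback and the use of Series.mask in place of replace are my own substitutions, not the answerer's code.

```python
import pandas as pd

# Same data as the example, with the default 0..5 index used in this answer's output.
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# Per-row fallback: col2, or col3 where col2 is 0.
fallback = df["col2"].mask(df["col2"] == 0, df["col3"])

# Single .loc assignment; the right-hand Series is aligned on the index,
# so only the rows where col1 is 0 receive a value.
df.loc[df["col1"] == 0, "col1"] = fallback

print(df)
```

Because the assignment goes through df.loc directly, there is no intermediate object for pandas to copy, which is why this form is generally the recommended way to assign to a subset.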

Solution 3:

An approach using pd.DataFrame.where and pd.DataFrame.bfill

df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
df


Another approach using np.argmax

def pir2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

I know there is a better way to do this slicing with numpy, but I can't think of it at the moment.
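
To show what the argmax trick does step by step, here is a small sketch on the six-row example. It assumes the columns are ordered col1, col2, col3, since the first nonzero entry in each row is what gets picked. The variable names are mine, and np.take_along_axis is offered only as one possible alternative for the slicing step, not necessarily the one the answerer had in mind.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

values = df[["col1", "col2", "col3"]].to_numpy()

# Column position of the first nonzero entry in each row.
# argmax returns 0 for an all-zero row, so such a row keeps its (zero) col1 value.
first_nonzero = (values != 0).argmax(axis=1)

# Pick one element per row with integer (fancy) indexing ...
picked = values[np.arange(len(values)), first_nonzero]

# ... or, equivalently, with take_along_axis.
picked_alt = np.take_along_axis(values, first_nonzero[:, None], axis=1).ravel()

df["col1"] = picked
print(df)
```

Note that the function in the answer returns the picked array rather than a DataFrame, so the result still has to be assigned back to col1 as in the last step here.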

Solution 4:

Generally speaking, there are three ways to do a conditional replacement like this. They are:

numpy.where
pandas.Series.mask or pandas.Series.where
pandas.DataFrame.loc

You can try pandas.Series.mask

df['col1'] = df['col1'].mask(df['col1'].eq(0), df['col2'])
df['col1'] = df['col1'].mask(df['col1'].eq(0), df['col3'])

   col1  col2  col3
1   0.2   0.3   0.3
2   0.2   0.3   0.3
3   0.4   0.4   0.4
4   0.3   0.0   0.3
5   0.0   0.0   0.0
6   0.1   0.4   0.4

Or pandas.Series.where

df['col1'] = df['col1'].where(df['col1'].ne(0), df['col2'])
df['col1'] = df['col1'].where(df['col1'].ne(0), df['col3'])

Finally, you can try loc

df.loc[df['col1'].eq(0), 'col1'] = df['col2']
df.loc[df['col1'].eq(0), 'col1'] = df['col3']
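
As a quick sanity check that the mask, where and loc variants are interchangeable, the following sketch runs all three on the example data and compares the results. The wrapper function names and the loop over the two fallback columns are my own illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)


def with_mask(frame):
    out = frame.copy()
    for col in ("col2", "col3"):
        # mask replaces values where the condition is True.
        out["col1"] = out["col1"].mask(out["col1"].eq(0), out[col])
    return out


def with_where(frame):
    out = frame.copy()
    for col in ("col2", "col3"):
        # where keeps values where the condition is True, so the test is inverted.
        out["col1"] = out["col1"].where(out["col1"].ne(0), out[col])
    return out


def with_loc(frame):
    out = frame.copy()
    for col in ("col2", "col3"):
        # The right-hand Series is aligned on the index of the selected rows.
        out.loc[out["col1"].eq(0), "col1"] = out[col]
    return out


results = [with_mask(df), with_where(df), with_loc(df)]
print(results[0])
print(all(r.equals(results[0]) for r in results))
```

The only real difference between mask and where is the direction of the condition, so the choice between them is mostly a matter of which reads more naturally.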

Solution 5:

Or you can also use combine:

replace_zeros = lambda x, y: y if x == 0 else x
df['col1'].combine(df['col2'], func=replace_zeros).combine(df['col3'], func=replace_zeros)

Output:

1    0.2
2    0.2
3    0.4
4    0.3
5    0.0
6    0.1
dtype: float64
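
Note that the combine call above returns a new Series and leaves df unchanged. A minimal sketch of the full round trip, with the result assigned back to col1, might look like this (the DataFrame construction is reconstructed from the question, and the lambda is written as a named function only for readability):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)


def replace_zeros(x, y):
    # Called once per pair of elements: keep x unless it is 0.
    return y if x == 0 else x


result = (
    df["col1"]
    .combine(df["col2"], func=replace_zeros)
    .combine(df["col3"], func=replace_zeros)
)

# combine does not modify df, so write the result back explicitly.
df["col1"] = result
print(df)
```

Since combine calls the Python function once per element, I would expect it to be slower than the vectorized np.where or mask approaches on large frames, though I have not timed it here.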