
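Problem description:

I have several columns with the same name in a df. I need to rename them, but df.rename renames all of them the same way. How can I rename the blah columns below to blah1, blah4, and blah5?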

df = pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns = ['blah','blah2','blah3','blah','blah']
df

#     blah  blah2  blah3  blah  blah
# 0   0     1      2      3     4
# 1   5     6      7      8     9

Here is what happens when using df.rename:

df.rename(columns={'blah':'blah1'})

#     blah1  blah2  blah3  blah1  blah1
# 0   0      1      2      3      4
# 1   5      6      7      8      9
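The result above follows from df.rename matching columns by label, so a single dict key cannot tell the three blah columns apart. A minimal sketch of renaming by position instead is shown here; the `new_names_by_position` mapping is a hypothetical helper introduced only for illustration, not something from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 5).reshape(2, 5))
df.columns = ['blah', 'blah2', 'blah3', 'blah', 'blah']

# Hypothetical mapping: column position -> desired new name.
new_names_by_position = {0: 'blah1', 3: 'blah4', 4: 'blah5'}

# Work on a plain list so that each position can be replaced independently,
# then assign the whole list back to df.columns.
cols = list(df.columns)
for pos, new_name in new_names_by_position.items():
    cols[pos] = new_name
df.columns = cols

print(list(df.columns))
```

This only helps when the positions are known in advance; the answers below deal with generating the new names automatically.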

Solution 1:

We can use an internal (undocumented) method:

In [38]: pd.io.parsers.base_parser.ParserBase({'names':df.columns, 'usecols':None})._maybe_dedup_names(df.columns)
Out[38]: ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']

Here is the "magic" function:

    def _maybe_dedup_names(self, names: Sequence[Hashable]) -> Sequence[Hashable]:
        # see gh-7160 and gh-9424: this helps to provide
        # immediate alleviation of the duplicate names
        # issue and appears to be satisfactory to users,
        # but ultimately, not needing to butcher the names
        # would be nice!
        if self.mangle_dupe_cols:
            names = list(names)  # so we can index
            counts: DefaultDict[Hashable, int] = defaultdict(int)
            is_potential_mi = _is_potential_multi_index(names, self.index_col)

            for i, col in enumerate(names):
                cur_count = counts[col]

                while cur_count > 0:
                    counts[col] = cur_count + 1

                    if is_potential_mi:
                        # for mypy
                        assert isinstance(col, tuple)
                        col = col[:-1] + (f"{col[-1]}.{cur_count}",)
                    else:
                        col = f"{col}.{cur_count}"
                    cur_count = counts[col]

                names[i] = col
                counts[col] = cur_count + 1

        return names
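Because that method is private, it may be safer not to depend on it. The following is a standalone sketch of the same counting logic for flat (non-MultiIndex) names only; the function name `dedup_flat_names` is an assumption made for this example and is not part of pandas.

```python
from collections import defaultdict


def dedup_flat_names(names):
    """Append .1, .2, ... to repeated names, leaving the first occurrence as is."""
    names = list(names)  # copy so the input is not modified
    counts = defaultdict(int)  # how many times each name has been emitted

    for i, col in enumerate(names):
        cur_count = counts[col]

        # Keep adding suffixes until the candidate has not been used yet.
        while cur_count > 0:
            counts[col] = cur_count + 1
            col = f"{col}.{cur_count}"
            cur_count = counts[col]

        names[i] = col
        counts[col] = cur_count + 1

    return names


print(dedup_flat_names(['blah', 'blah2', 'blah3', 'blah', 'blah']))
```

The inner while loop is what keeps the result unique even when a generated name such as blah.1 already exists among the inputs.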

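Solution 2:

I was looking for a solution within pandas rather than a general Python one. When the column Index's get_loc() function finds duplicates, it returns a boolean mask array whose True values mark the positions of the duplicates. I then use the mask to assign new values at those positions. In my case, I know ahead of time how many duplicates I will get and what I will assign to them, but it looks like df.columns.get_duplicates() would return a list of all duplicates, which you could then combine with get_loc() if you need a more generic duplicate-weeding operation.

'''UPDATED AS OF SEPT 2020'''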

cols=pd.Series(df.columns)
for dup in df.columns[df.columns.duplicated(keep=False)]: 
    cols[df.columns.get_loc(dup)] = ([dup + '.' + str(d_idx) 
                                     if d_idx != 0 
                                     else dup 
                                     for d_idx in range(df.columns.get_loc(dup).sum())]
                                    )
df.columns=cols

    blah    blah2   blah3   blah.1  blah.2
 0     0        1       2        3       4
 1     5        6       7        8       9
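The mask behaviour that this answer relies on can be checked directly. The sketch below uses the column names from the question; the second index, `sorted_idx`, is an extra example added here to show one caveat: when the duplicates are adjacent in a monotonic index, get_loc returns a slice rather than a mask, so the `.sum()` call in the loop above would not apply to it.

```python
import pandas as pd

idx = pd.Index(['blah', 'blah2', 'blah3', 'blah', 'blah'])

# Duplicated label in a non-monotonic index: boolean mask, one entry per column.
mask = idx.get_loc('blah')
print(mask)

# Unique label: a plain integer position.
pos = idx.get_loc('blah2')
print(pos)

# Duplicated label in a monotonic (sorted) index: a slice instead of a mask.
sorted_idx = pd.Index(['a', 'a', 'b'])
loc = sorted_idx.get_loc('a')
print(loc)
```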

New, better method (updated 03 Dec 2019)

The code below is better than the code above. It is copied from another answer below (@SatishSK):

#sample df with duplicate blah column
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df

# you just need the following 4 lines to rename duplicates
# df is the dataframe that you want to rename duplicated columns

cols=pd.Series(df.columns)

for dup in cols[cols.duplicated()].unique(): 
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]

# rename the columns with the cols list.
df.columns=cols

df

Output:

    blah    blah2   blah3   blah.1  blah.2
0   0   1   2   3   4
1   5   6   7   8   9

Solution 3:

You can use this:

def df_column_uniquify(df):
    df_columns = df.columns
    new_columns = []
    for item in df_columns:
        counter = 0
        newitem = item
        while newitem in new_columns:
            counter += 1
            newitem = "{}_{}".format(item, counter)
        new_columns.append(newitem)
    df.columns = new_columns
    return df

Then

import numpy as np
import pandas as pd

df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']

so that df is:

   blah  blah2  blah3   blah   blah
0     0      1      2      3      4
1     5      6      7      8      9

Then

df = df_column_uniquify(df)

so that df is:

   blah  blah2  blah3  blah_1  blah_2
0     0      1      2       3       4
1     5      6      7       8       9

Solution 4:

You can assign directly to the columns:

In [12]:

df.columns = ['blah','blah2','blah3','blah4','blah5']
df
Out[12]:
   blah  blah2  blah3  blah4  blah5
0     0      1      2      3      4
1     5      6      7      8      9

[2 rows x 5 columns]

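If you want to dynamically rename only the duplicated columns, you could do something like the following (code taken from answer 2 of "Index of duplicates items in a python list"):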

In [25]:

import collections
dups = collections.defaultdict(list)
dup_indices=[]
col_list=list(df.columns)
for i, e in enumerate(list(df.columns)):
  dups[e].append(i)
for k, v in sorted(dups.items()):
  if len(v) >= 2:
    dup_indices.extend(v)  # collect positions for every duplicated name, not only the last one

for i in dup_indices:
    col_list[i] = col_list[i] + ' ' + str(i)
col_list
Out[25]:
['blah 0', 'blah2', 'blah3', 'blah 3', 'blah 4']

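You could then use this list to assign back to the columns. You could also write a function that generates a unique name not already present in the columns before renaming.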
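One way to sketch such a function is shown here. It is a plain Python illustration under stated assumptions: the name `make_unique` and the `sep` parameter are made up for this example, and column names are assumed to be strings. First occurrences are kept, and later duplicates get the first numeric suffix that collides neither with an original name nor with a name already handed out.

```python
def make_unique(names, sep='_'):
    """Return a list of unique names, keeping first occurrences unchanged."""
    reserved = set(names)  # every original name; never generate one of these
    taken = set()          # names already emitted in the result
    counters = {}          # last suffix tried for each duplicated name
    result = []

    for name in names:
        if name not in taken:
            new = name
        else:
            k = counters.get(name, 0)
            while True:
                k += 1
                new = f"{name}{sep}{k}"
                if new not in taken and new not in reserved:
                    break
            counters[name] = k
        taken.add(new)
        result.append(new)

    return result


print(make_unique(['blah', 'blah2', 'blah3', 'blah', 'blah']))
```

The result can be assigned back with df.columns = make_unique(df.columns).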

Solution 5:

I just wrote this code; it uses a list comprehension to update all duplicated names.

df.columns = [x[1] if x[1] not in df.columns[:x[0]] else f"{x[1]}_{list(df.columns[:x[0]]).count(x[1])}" for x in enumerate(df.columns)]
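The comprehension above re-slices and re-counts the earlier columns for every element, which is roughly quadratic in the number of columns. As a rough sketch, the same naming rule can be written with a running counter; the function name `suffix_by_count` is an assumption for this example.

```python
from collections import Counter


def suffix_by_count(names):
    """Name the n-th repeat of a column '<name>_<n>'; keep the first occurrence."""
    seen = Counter()  # how many times each name has appeared so far
    out = []
    for name in names:
        n = seen[name]
        out.append(name if n == 0 else f"{name}_{n}")
        seen[name] += 1
    return out


print(suffix_by_count(['blah', 'blah2', 'blah3', 'blah', 'blah']))
```

Like the one-liner, this does not check whether a generated name such as blah_1 already exists as a column.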

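Solution 6:

duplicated_idx = dataset.columns.duplicated()
duplicated = dataset.columns[duplicated_idx].unique()

rename_cols = []
counts = {}  # running count for each duplicated name

for col in dataset.columns:
    if col in duplicated:
        # number every occurrence of a duplicated name: col_1, col_2, ...
        counts[col] = counts.get(col, 0) + 1
        rename_cols.append(col + '_' + str(counts[col]))
    else:
        rename_cols.append(col)

dataset.columns = rename_cols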

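Solution 7:

Thanks to @Lamakaha for the solution. Your idea gave me a chance to modify it and make it work in all cases.

I am using Python 3.7.3.

I tried your code on my data set, which had only one duplicate, i.e. two columns with the same name. Unfortunately, the column names stayed as they were, without being renamed. On top of that, I got a warning that "get_duplicates() is deprecated and will be removed in a future version". I used duplicated() coupled with unique() in place of get_duplicates(), which did not yield the expected result.

I have modified your code a little, and it now works for my data set as well as for other general cases.

Here are the code runs, with and without the modification, on the example data set from the question, along with the results: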

df=pd.DataFrame(np.arange(2*5).reshape(2,5))

df.columns=['blah','blah2','blah3','blah','blah']
df

cols=pd.Series(df.columns)

for dup in df.columns.get_duplicates(): 
    cols[df.columns.get_loc(dup)]=[dup+'.'+str(d_idx) if d_idx!=0 else dup for d_idx in range(df.columns.get_loc(dup).sum())]
df.columns=cols

df

f:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: 'get_duplicates' is deprecated and will be removed in a future version. You can use idx[idx.duplicated()].unique() instead

Output:

    blah    blah2   blah3   blah    blah.1
0   0   1   2   3   4
1   5   6   7   8   9

Two of the three "blah" columns are not renamed properly.

Modified code

df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df

cols=pd.Series(df.columns)

for dup in cols[cols.duplicated()].unique(): 
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
df.columns=cols

df

Output:

    blah    blah2   blah3   blah.1  blah.2
0   0   1   2   3   4
1   5   6   7   8   9

Here is a run of the modified code on another example:

cols = pd.Series(['X', 'Y', 'Z', 'A', 'B', 'C', 'A', 'A', 'L', 'M', 'A', 'Y', 'M'])

for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '_' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]

cols

Output:
0       X
1       Y
2       Z
3       A
4       B
5       C
6     A_1
7     A_2
8       L
9       M
10    A_3
11    Y_1
12    M_1
dtype: object
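The per-name loop above can also be sketched in a vectorized form using groupby with cumcount, which numbers the occurrences of each name in order. This is an alternative illustration rather than the answer's own code; the function name `suffix_duplicates` and the `sep` parameter are assumptions, and the names are assumed to be strings.

```python
import pandas as pd


def suffix_duplicates(names, sep='.'):
    """Keep the first occurrence of each name; suffix repeats with 1, 2, ..."""
    cols = pd.Series(list(names))
    # 0 for the first occurrence of a name, 1 for the second, and so on.
    occurrence = cols.groupby(cols).cumcount()
    suffixed = cols + sep + occurrence.astype(str)
    # Keep the original name where occurrence == 0, otherwise use the suffixed one.
    return cols.where(occurrence == 0, suffixed).tolist()


example = ['X', 'Y', 'Z', 'A', 'B', 'C', 'A', 'A', 'L', 'M', 'A', 'Y', 'M']
print(suffix_duplicates(example, sep='_'))
```

Usage on a dataframe would be df.columns = suffix_duplicates(df.columns).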

I hope this helps anyone looking for an answer to the question above.

Solution 8:

In pandas v2.1, you can use the pd.io.common.dedup_names function, for example:

In [137]: pd.io.common.dedup_names(df.columns, is_potential_multiindex=False)
Out[137]: ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']

The earlier approach (pd.io.parsers.base_parser.ParserBase({'names':df.columns, 'usecols':None})._maybe_dedup_names(df.columns)) has been removed, so it no longer works.
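Since pd.io.common.dedup_names lives in an internal module, it may move or disappear between versions, just as the earlier method did. A defensive sketch is to try the import and fall back to a small pure-Python equivalent for flat names. The wrapper name `dedup_columns` is an assumption for this example; the call signature is the one shown in this answer.

```python
from collections import defaultdict

import pandas as pd

try:
    # Internal helper, not documented public API; availability depends on the version.
    from pandas.io.common import dedup_names as _pandas_dedup_names
except ImportError:
    _pandas_dedup_names = None


def dedup_columns(columns):
    """Return de-duplicated flat column names (blah, blah.1, blah.2, ...)."""
    names = list(columns)
    if _pandas_dedup_names is not None:
        return list(_pandas_dedup_names(names, is_potential_multiindex=False))

    # Fallback: same counting scheme, written out by hand.
    counts = defaultdict(int)
    for i, col in enumerate(names):
        cur_count = counts[col]
        while cur_count > 0:
            counts[col] = cur_count + 1
            col = f"{col}.{cur_count}"
            cur_count = counts[col]
        names[i] = col
        counts[col] = cur_count + 1
    return names


df = pd.DataFrame([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
                  columns=['blah', 'blah2', 'blah3', 'blah', 'blah'])
df.columns = dedup_columns(df.columns)
print(list(df.columns))
```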

Solution 9:

Since the accepted answer (by Lamakaha) does not work with recent versions of pandas, and the other suggestions looked a bit clumsy, I worked out my own solution:

def dedupIndex(idx, fmt=None, ignoreFirst=True):
    # fmt:          A string format that receives two arguments: 
    #               name and a counter. By default: fmt='%s.%03d'
    # ignoreFirst:  Disable/enable postfixing of first element.
    idx = pd.Series(idx)
    duplicates = idx[idx.duplicated()].unique()
    fmt = '%s.%03d' if fmt is None else fmt
    for name in duplicates:
        dups = idx==name
        ret = [ fmt%(name,i) if (i!=0 or not ignoreFirst) else name
                      for i in range(dups.sum()) ]
        idx.loc[dups] = ret
    return pd.Index(idx)

Use the function as follows:

df.columns = dedupIndex(df.columns)
# Result: ['blah', 'blah2', 'blah3', 'blah.001', 'blah.002']
df.columns = dedupIndex(df.columns, fmt='%s #%d', ignoreFirst=False)
# Result: ['blah #0', 'blah2', 'blah3', 'blah #1', 'blah #2']

Solution 10:

Here is a solution that also works for a MultiIndex:

# Take a df and rename duplicate columns by appending number suffixes
def rename_duplicates(df):
    import copy
    new_columns = df.columns.values
    suffix = {key: 2 for key in set(new_columns)}
    dup = pd.Series(new_columns).duplicated()

    if type(df.columns) == pd.core.indexes.multi.MultiIndex:
        # Need to be mutable, make it list instead of tuples
        for i in range(len(new_columns)):
            new_columns[i] = list(new_columns[i])
        for ix, item in enumerate(new_columns):
            item_orig = copy.copy(item)
            if dup[ix]:
                for level in range(len(new_columns[ix])):
                    new_columns[ix][level] = new_columns[ix][level] + f"_{suffix[tuple(item_orig)]}"
                suffix[tuple(item_orig)] += 1

        for i in range(len(new_columns)):
            new_columns[i] = tuple(new_columns[i])

        df.columns = pd.MultiIndex.from_tuples(new_columns)
    # Not a MultiIndex
    else:
        for ix, item in enumerate(new_columns):
            if dup[ix]:
                new_columns[ix] = item + f"_{suffix[item]}"
                suffix[item] += 1
        df.columns = new_columns

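Solution 11:

I created a function with some tests, so it should be ready to drop in. It differs slightly from Lamakaha's excellent solution in that it also renames the first occurrence of a duplicated column: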

from collections import defaultdict
from typing import Dict, List, Set

import pandas as pd

def rename_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename column headers to ensure no header names are duplicated.

    Args:
        df (pd.DataFrame): A dataframe with a single index of columns

    Returns:
        pd.DataFrame: The dataframe with headers renamed; inplace
    """
    if not df.columns.has_duplicates:
        return df
    duplicates: Set[str] = set(df.columns[df.columns.duplicated()].tolist())
    indexes: Dict[str, int] = defaultdict(lambda: 0)
    new_cols: List[str] = []
    for col in df.columns:
        if col in duplicates:
            indexes[col] += 1
            new_cols.append(f"{col}.{indexes[col]}")
        else:
            new_cols.append(col)
    df.columns = new_cols
    return df

def test_rename_duplicate_columns():
    df = pd.DataFrame(data=[[1, 2]], columns=["a", "b"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a", "b"]

    df = pd.DataFrame(data=[[1, 2]], columns=["a", "a"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "a.2"]

    df = pd.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "a"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "b", "a.2"]

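Solution 12:

We can simply assign each column a different name.

Suppose the duplicated column names are ['a', 'b', 'c', 'd', 'd', 'c'].

Then just create a list of the names to assign:

C = ['a', 'b', 'c', 'd', 'D1', 'C1']
df.columns = C

This worked for me.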

Solution 13:

Here is my solution:

cols = []  # for tracking whether we have already seen a name
new_cols = []

for col in df.columns:
    cols.append(col)
    count = cols.count(col)
    
    if count > 1:
        new_cols.append(f'{col}_{count}')
    else:
        new_cols.append(col)

df.columns = new_cols

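Solution 14:

Here is an elegant solution:

Isolate a dataframe containing only the duplicated columns (it looks like it would be a Series, but if more than one column has that name, it is a DataFrame):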

df1 = df['blah']

Give each "blah" column a unique number:

df1.columns = ['blah_' + str(int(x)) for x in range(len(df1.columns))]

Isolate a dataframe with everything except the duplicated columns:

df2 = df[[x for x in df.columns if x != 'blah']]

Merge them back together on the index:

df3 = pd.merge(df1, df2, left_index=True, right_index=True)

Et voilà:

   blah_0  blah_1  blah_2  blah2  blah3
0       0       3       4      1      2
1       5       8       9      6      7

Panda's DataFrame - renaming multiple identically named columns