How to remove duplicates from a CSV file

2025-04-10 09:47:00

admin

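Problem description:

I downloaded a CSV file from Hotmail, but it contains many duplicates. The duplicates are exact copies, and I don't know why my phone created them.

I want to remove the duplicates.

Technical specifications:

Windows XP SP3
Python 2.7
A CSV file with 400 contacts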

Solution 1:

Update: 2016

If you are happy to use the helpful more_itertools external library:

from more_itertools import unique_everseen
with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
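
If adding a dependency is not an option, the same idea can be sketched with the standard library alone. This is only a sketch under two assumptions: Python 3.7 or later (where dict preserves insertion order) and a file small enough to hold in memory. The function name dedupe_lines is illustrative and not part of the original answer.

```python
def dedupe_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first occurrence of each line.

    Assumes Python 3.7+ (insertion-ordered dict) and a file that fits in memory.
    """
    with open(src_path, 'r') as f, open(dst_path, 'w') as out_file:
        # dict.fromkeys keeps one key per distinct line, in first-seen order
        out_file.writelines(dict.fromkeys(f))
```

For the files in the question this would be called as dedupe_lines('1.csv', '2.csv').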

A more efficient version of @IcyFlame's solution (Solution 3 below), using a set instead of a list:

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

To edit the same file in place, you can use this (old Python 2 code):

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue # skip duplicate

    seen.add(line)
    print line, # standard output is now redirected to the file
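
On Python 3 the print statement above is a syntax error. A possible port is sketched below, assuming Python 3.2 or later (where fileinput.input can be used as a context manager). The function name dedupe_in_place is made up for illustration.

```python
import fileinput


def dedupe_in_place(path):
    """Rewrite the file at `path`, dropping lines already seen earlier in it."""
    seen = set()  # set for fast O(1) amortized lookup
    # inplace=True redirects print() output back into the file being read
    with fileinput.input(files=[path], inplace=True) as lines:
        for line in lines:
            if line in seen:
                continue  # skip duplicate
            seen.add(line)
            print(line, end='')  # the line already carries its own newline
```

Because the file is overwritten, it is worth trying this on a copy first.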

Solution 2:

You can remove duplicates efficiently with Pandas, which can be installed with pip or comes bundled with the Anaconda distribution of Python.

See pandas.DataFrame.drop_duplicates

pip install pandas

Code

import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"

df = pd.read_csv(file_name, sep=",")  # use sep="\t" for a tab-separated file

# Notes:
# - the `subset=None` means that every column is used 
#    to determine if two rows are different; to change that specify
#    the columns as an array
# - the `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone  
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
df.to_csv(file_name_output, index=False)

For encoding problems, set encoding=... to the appropriate codec from Python's standard encodings.

For more details on pd.read_csv, see "Import CSV file as a pandas DataFrame".

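Solution 3:

You can use the following script:

Prerequisites:

1.csv is the file that contains the duplicates
2.csv is the output file, which will be free of duplicates once this script has run.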

Code

inFile = open('1.csv','r')
outFile = open('2.csv','w')

listLines = []  # lines written so far

for line in inFile:
    if line in listLines:
        continue  # already written, skip the duplicate
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()

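Explanation of the algorithm

Here is what the script does:

It opens the file that contains the duplicates in read mode.
Then, looping until the end of the file, it checks whether the current line has already been encountered.
If it has, the line is not written to the output file.
If it has not, the line is written to the output file and added to the list of lines already encountered.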
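
The steps above compare raw text lines, so two records that differ only in quoting, or a final line that lacks a trailing newline, are treated as different. A possible variant that compares parsed rows instead is sketched below. It assumes Python 3 (for the newline='' argument) and the function name dedupe_rows is hypothetical. Note that csv.writer rewrites each row, so quoting and line endings in the output may be normalized.

```python
import csv


def dedupe_rows(src_path, dst_path):
    """Write each distinct parsed CSV row of src_path to dst_path once, in order."""
    seen = set()
    with open(src_path, 'r', newline='') as in_file, \
         open(dst_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        for row in csv.reader(in_file):
            key = tuple(row)  # lists are not hashable, tuples are
            if key in seen:
                continue  # same field values as an earlier row
            seen.add(key)
            writer.writerow(row)
```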

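Solution 4:

I know this was settled long ago, but I ran into a closely related problem: I needed to remove duplicates based on a single column. The input CSV file was too large to open on my PC with MS Excel, LibreOffice Calc, or Google Sheets: 147 MB, with roughly 2.5 million records. Since I did not want to install an entire external library for something this simple, I wrote the Python script below, which did the job in under 5 minutes. I did not focus on optimization, but I believe it could be optimized to run faster and more efficiently on larger files. The algorithm is similar to @IcyFlame's above, except that I remove duplicates based on one column ("CCC") rather than on the whole row/line.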

import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'a') as outfile:
    # this set holds the CCC numbers seen so far
    ccc_numbers = set()
    # read each input row as a dictionary; there were some null bytes in the infile
    results = csv.DictReader(infile)
    writer = csv.writer(outfile)

    # write column headers to output file
    writer.writerow(
        ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
    )
    for result in results:
        ccc_number = result.get('CCC')
        # if the value is already in the set, skip writing the whole row to the output file
        if ccc_number in ccc_numbers:
            continue
        writer.writerow([
            result.get('ID'),
            ccc_number,
            result.get('MFLCode'),
            result.get('datecollected'),
            result.get('DateTested'),
            result.get('Result'),
            result.get('Justification')
        ])

        # add the value to the set so that later rows with the same value are skipped
        ccc_numbers.add(ccc_number)
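
The same approach can be generalized so that the column names are not hard-coded. The sketch below is one way to do that, not the answer author's code: it assumes Python 3, a header row in the input, and rows with no more fields than the header. The function name dedupe_by_column and its parameters are made up for illustration.

```python
import csv


def dedupe_by_column(src_path, dst_path, column):
    """Copy a CSV file, keeping only the first row for each value of `column`."""
    seen = set()
    with open(src_path, 'r', newline='') as infile, \
         open(dst_path, 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        # reuse the header of the input file instead of listing columns by hand
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            value = row[column]
            if value in seen:
                continue  # a row with this value was already written
            seen.add(value)
            writer.writerow(row)
```

For the script above, the equivalent call would be dedupe_by_column('results.csv', 'unique_ccc.csv', 'CCC'). Unlike the original, this opens the output in 'w' mode, so re-running it overwrites the file rather than appending a second header.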

Solution 5:

A more efficient version of @jamylak's solution (one fewer instruction):

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen: 
            seen.add(line)
            out_file.write(line)

To edit the same file in place, you can use this (Python 2 code):

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line, # standard output is now redirected to the file