比较两个 CSV 文件并搜索相似项目
- 2025-04-10 09:46:00
- admin 原创
- 17
问题描述:
因此,我有两个 CSV 文件,我想比较它们并获取相似项目的结果。第一个文件 hosts.csv 如下所示:
Path Filename Size Signature
C: a.txt 14kb 012345
D: b.txt 99kb 678910
C: c.txt 44kb 111213
第二个文件masterlist.csv如下所示:
Filename Signature
b.txt 678910
x.txt 111213
b.txt 777777
c.txt 999999
如您所见,行不匹配,并且 masterlist.csv 始终大于 hosts.csv 文件。我唯一想搜索的部分是签名部分。我知道这看起来像:
hosts[3] == masterlist[1]
我正在寻找一个可以给我提供类似以下内容的解决方案(基本上是带有新 RESULTS 列的 hosts.csv 文件):
Path Filename Size Signature RESULTS
C: a.txt 14kb 012345 NOT FOUND in masterlist
D: b.txt 99kb 678910 FOUND in masterlist (row 1)
C: c.txt 44kb 111213 FOUND in masterlist (row 2)
我搜索了这些帖子,发现了一些类似的东西,但我不太理解,因为我还在学习 Python。
使用 Python 2.6编辑
解决方案 1:
srgerg 的答案非常低效,因为它以二次时间运行;这里是一个线性时间解决方案,使用与 Python 2.6 兼容的语法:
import csv
with open('masterlist.csv', 'rb') as master:
master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))
with open('hosts.csv', 'rb') as hosts:
with open('results.csv', 'wb') as results:
reader = csv.reader(hosts)
writer = csv.writer(results)
writer.writerow(next(reader, []) + ['RESULTS'])
for row in reader:
index = master_indices.get(row[3])
if index is not None:
message = 'FOUND in master list (row {})'.format(index)
else:
message = 'NOT FOUND in master list'
writer.writerow(row + [message])
这将生成一个字典,masterlist.csv
首先将签名从 映射到行号。字典中的查找需要恒定的时间,使得对行的第二次循环hosts.csv
与 中的行数无关masterlist.csv
。更不用说代码要简单得多了。
对于使用 Python 3 的用户,上述操作只需要open()
调整调用以在文本模式下打开(b
从文件模式中删除),并且您想要添加new line=''
以便 CSV 阅读器可以控制行分隔符。您可能希望明确说明要使用的编码,而不是依赖于系统默认值(使用encoding=...
)。master_indices
可以使用字典理解({r[1]: i for i, r in enumerate(csv.reader(master))}
)构建映射。
解决方案 2:
编辑:虽然我的解决方案正确,但请查看下面 Martijn 的回答以获得更有效的解决方案。
您可以在此处找到有关 python CSV 模块的文档。
你正在寻找的是这样的:
import csv
f1 = file('hosts.csv', 'r')
f2 = file('masterlist.csv', 'r')
f3 = file('results.csv', 'w')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)
masterlist = list(c2)
for hosts_row in c1:
row = 1
found = False
for master_row in masterlist:
results_row = hosts_row
if hosts_row[3] == master_row[1]:
results_row.append('FOUND in master list (row ' + str(row) + ')')
found = True
break
row = row + 1
if not found:
results_row.append('NOT FOUND in master list')
c3.writerow(results_row)
f1.close()
f2.close()
f3.close()
解决方案 3:
Python 的CSV和 collections 模块(特别是OrderedDict)在这里非常有用。您想使用 OrderedDict 来保留键的顺序等。您不必这样做,但它很有用!
import csv
from collections import OrderedDict
signature_row_map = OrderedDict()
with open('hosts.csv') as file_object:
for line in csv.DictReader(file_object, delimiter=' '):
signature_row_map[line['Signature']] = {'line': line, 'found_at': None}
with open('masterlist.csv') as file_object:
for i, line in enumerate(csv.DictReader(file_object, delimiter=' '), 1):
if line['Signature'] in signature_row_map:
signature_row_map[line['Signature']]['found_at'] = i
with open('newhosts.csv', 'w') as file_object:
fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
writer = csv.DictWriter(file_object, fieldnames, delimiter=' ')
writer.writer.writerow(fieldnames)
for signature_info in signature_row_map.itervalues():
result = '{0} FOUND in masterlist {1}'
# explicit check for sentinel
if signature_info['found_at'] is not None:
result = result.format('', '(row %s)' % signature_info['found_at'])
else:
result = result.format('NOT', '')
payload = signature_info['line']
payload['RESULTS'] = result
writer.writerow(payload)
以下是使用测试 CSV 文件的输出:
Path Filename Size Signature RESULTS
C: a.txt 14kb 012345 NOT FOUND in masterlist
D: b.txt 99kb 678910 FOUND in masterlist (row 1)
C: c.txt 44kb 111213 FOUND in masterlist (row 2)
请原谅错位,它们是用制表符分隔的:)
解决方案 4:
该csv
模块在解析 csv 文件时非常有用。但只是为了好玩,我只是根据空格拆分输入以获取数据。
只需解析数据,dict
为 masterlist.csv 中的数据构建一个,以签名为键,以行号为值。现在,对于 hosts.csv 的每一行,我们只需查询dict
并找出 masterlist.csv 中是否存在相应的条目,如果存在,则位于哪一行。
#! /usr/bin/env python
def read_data(filename):
input_source=open(filename,'r')
input_source.readline()
return [line.split() for line in input_source]
if __name__=='__main__':
hosts=read_data('hosts.csv')
masterlist=read_data('masterlist.csv')
master=dict()
for index,data in enumerate(masterlist):
master[data[-1]]=index+1
for row in hosts:
try:
found="FOUND in masterlist (row %s)"%master[row[-1]]
except KeyError:
found="NOT FOUND in masterlist"
line=row+[found]
print "%s %s %s %s %s"%tuple(line)
解决方案 5:
我刚刚修复了 Martijn Pieters 代码中的一小块内容,以使其在 Python 3 中运行,在此代码中,我尝试将 file1 中的第一列元素row[0]
与 file2 中的第一列元素进行匹配row[0]
。
import csv
with open('file1.csv', 'rt', encoding='utf-8') as master:
master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))
with open('file2.csv', 'rt', encoding='utf-8') as hosts:
with open('result.csv', 'w') as results:
reader = csv.reader(hosts)
writer = csv.writer(results)
writer.writerow(next(reader, []) + ['RESULTS'])
for row in reader:
index = master_indices.get(row[0])
if index is not None:
message = 'FOUND in master list (row {})'.format(index)
writer.writerow(row + [message])
else:
message = 'NOT FOUND in master list'
writer.writerow(row + [message])
results.close()
扫码咨询,免费领取项目管理大礼包!