摘要：问题描述：在 Python 中，计算两个列表之间的差异的最佳方法是什么？例子A = [1,2,3,4] B = [2,5] A - B = [1,3,4] B - A = [5] 解决方案 1：如果顺序不重要，您可以简单地计算集合差异：>>> set([1,2,3,4]) - set([2...

问题描述：

在 Python 中，计算两个列表之间的差异的最佳方法是什么？

例子

A = [1,2,3,4]
B = [2,5]

A - B = [1,3,4]
B - A = [5]

解决方案 1：

如果顺序不重要，您可以简单地计算集合差异：

>>> set([1,2,3,4]) - set([2,5])
set([1, 4, 3])
>>> set([2,5]) - set([1,2,3,4])
set([5])

解决方案 2：

set如果您不关心项目顺序或重复，请使用。如果您关心以下情况，请使用列表推导：

>>> def diff(first, second):
        second = set(second)
        return [item for item in first if item not in second]

>>> diff(A, B)
[1, 3, 4]
>>> diff(B, A)
[5]
>>>

解决方案 3：

你可以做一个

list(set(A)-set(B))

和

list(set(B)-set(A))

解决方案 4：

一句话：

diff = lambda l1,l2: [x for x in l1 if x not in l2]
diff(A,B)
diff(B,A)

或者：

diff = lambda l1,l2: filter(lambda x: x not in l2, l1)
diff(A,B)
diff(B,A)

解决方案 5：

上述示例简化了计算差异的问题。假设排序或去重肯定会使计算差异变得更容易，但如果您的比较无法承受这些假设，那么您将需要一个非平凡的 diff 算法实现。请参阅 Python 标准库中的 difflib。

#! /usr/bin/python2
from difflib import SequenceMatcher

A = [1,2,3,4]
B = [2,5]

squeeze=SequenceMatcher( None, A, B )

print "A - B = [%s]"%( reduce( lambda p,q: p+q,
                               map( lambda t: squeeze.a[t[1]:t[2]],
                                    filter(lambda x:x[0]!='equal',
                                           squeeze.get_opcodes() ) ) ) )

或者 Python3......

#! /usr/bin/python3
from difflib import SequenceMatcher
from functools import reduce

A = [1,2,3,4]
B = [2,5]

squeeze=SequenceMatcher( None, A, B )

print( "A - B = [%s]"%( reduce( lambda p,q: p+q,
                               map( lambda t: squeeze.a[t[1]:t[2]],
                                    filter(lambda x:x[0]!='equal',
                                           squeeze.get_opcodes() ) ) ) ) )

输出：

A - B = [[1, 3, 4]]

解决方案 6：

Python 2.7.3（默认，2014 年 2 月 27 日，19:58:35） - IPython 1.1.0 - timeit：（github gist）

def diff(a, b):
  b = set(b)
  return [aa for aa in a if aa not in b]

def set_diff(a, b):
  return list(set(a) - set(b))

diff_lamb_hension = lambda l1,l2: [x for x in l1 if x not in l2]

diff_lamb_filter = lambda l1,l2: filter(lambda x: x not in l2, l1)

from difflib import SequenceMatcher
def squeezer(a, b):
  squeeze = SequenceMatcher(None, a, b)
  return reduce(lambda p,q: p+q, map(
    lambda t: squeeze.a[t[1]:t[2]],
      filter(lambda x:x[0]!='equal',
        squeeze.get_opcodes())))

结果：

# Small
a = range(10)
b = range(10/2)

timeit[diff(a, b)]
100000 loops, best of 3: 1.97 µs per loop

timeit[set_diff(a, b)]
100000 loops, best of 3: 2.71 µs per loop

timeit[diff_lamb_hension(a, b)]
100000 loops, best of 3: 2.1 µs per loop

timeit[diff_lamb_filter(a, b)]
100000 loops, best of 3: 3.58 µs per loop

timeit[squeezer(a, b)]
10000 loops, best of 3: 36 µs per loop

# Medium
a = range(10**4)
b = range(10**4/2)

timeit[diff(a, b)]
1000 loops, best of 3: 1.17 ms per loop

timeit[set_diff(a, b)]
1000 loops, best of 3: 1.27 ms per loop

timeit[diff_lamb_hension(a, b)]
1 loops, best of 3: 736 ms per loop

timeit[diff_lamb_filter(a, b)]
1 loops, best of 3: 732 ms per loop

timeit[squeezer(a, b)]
100 loops, best of 3: 12.8 ms per loop

# Big
a = xrange(10**7)
b = xrange(10**7/2)

timeit[diff(a, b)]
1 loops, best of 3: 1.74 s per loop

timeit[set_diff(a, b)]
1 loops, best of 3: 2.57 s per loop

timeit[diff_lamb_filter(a, b)]
# too long to wait for

timeit[diff_lamb_filter(a, b)]
# too long to wait for

timeit[diff_lamb_filter(a, b)]
# TypeError: sequence index must be integer, not 'slice'

@roman-bodnarchuk 列表理解函数def diff(a, b)似乎更快。

解决方案 7：

A = [1,2,3,4]
B = [2,5]

#A - B
x = list(set(A) - set(B))
#B - A 
y = list(set(B) - set(A))

print x
print y

解决方案 8：

您可能想要使用 aset而不是 a list。

解决方案 9：

如果你想以递归方式深入列表项中的差异，我已经为 python 编写了一个包： https: //github.com/erasmose/deepdiff

安装

从 PyPi 安装：

pip install deepdiff

如果你是 Python3 你还需要安装：

pip install future six

示例用法

>>> from deepdiff import DeepDiff
>>> from pprint import pprint
>>> from __future__ import print_function

同一对象返回空

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = t1
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {}

商品类型已改变

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:"2", 3:3}
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {'type_changes': ["root[2]: 2=<type 'int'> vs. 2=<type 'str'>"]}

物品价值已改变

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:4, 3:3}
>>> ddiff = DeepDiff(t1, t2)
>>> print (ddiff.changes)
    {'values_changed': ['root[2]: 2 ====>> 4']}

添加和/或移除商品

>>> t1 = {1:1, 2:2, 3:3, 4:4}
>>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes)
    {'dic_item_added': ['root[5, 6]'],
     'dic_item_removed': ['root[4]'],
     'values_changed': ['root[2]: 2 ====>> 4']}

字符串差异

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world"}}
>>> t2 = {1:1, 2:4, 3:3, 4:{"a":"hello", "b":"world!"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'values_changed': [ 'root[2]: 2 ====>> 4',
                          "root[4]['b']:
--- 
+++ 
@@ -1 +1 @@
-world
+world!"]}
>>>
>>> print (ddiff.changes['values_changed'][1])
    root[4]['b']:
    --- 
    +++ 
    @@ -1 +1 @@
    -world
    +world!

字符串差异2

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world!
Goodbye!
1
2
End"}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world
1
2
End"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'values_changed': [ "root[4]['b']:
--- 
+++ 
@@ -1,5 +1,4 @@
-world!
-Goodbye!
+world
 1
 2
 End"]}
>>>
>>> print (ddiff.changes['values_changed'][0])
    root[4]['b']:
    --- 
    +++ 
    @@ -1,5 +1,4 @@
    -world!
    -Goodbye!
    +world
     1
     2
     End

类型更改

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world


End"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'type_changes': [ "root[4]['b']: [1, 2, 3]=<type 'list'> vs. world


End=<type 'str'>"]}

列表差异

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'list_removed': ["root[4]['b']: [3]"]}

列表差异 2：请注意，它不考虑顺序

>>> # Note that it DOES NOT take order into account
... t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { }

包含字典的列表：

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff.changes, indent = 2)
    { 'dic_item_removed': ["root[4]['b'][2][2]"],
      'values_changed': ["root[4]['b'][2][1]: 1 ====>> 3"]}

解决方案 10：

最简单的方法，

使用set().difference(set())

list_a = [1,2,3]
list_b = [2,3]
print set(list_a).difference(set(list_b))

答案是set([1])

解决方案 11：

有 3 种方法可以实现此目的，其中两种是可以接受的，一种则不应该这样做。

从高层次来看，这 3 个选项是：

减去两个集合（有时最好）
检查每个列表项是否存在于集合中（大多数情况下最好）
检查列表中每个列表项是否存在（不要这样做）

永远不要选择选项 3)，而要选择选项 2)。根据应用程序的需求，您可能更喜欢选项 1) 或 2)，而在大多数情况下，2) 可能是首选方法。2) 与 1) 的性能非常相似，因为它们都具有O(m + n)时间复杂度。相比之下，2) 在空间复杂度方面比 1) 略有优势，并且既保持了原始列表的顺序，又保持了原始列表中的任何重复项。

如果您想删除重复项而不关心顺序，那么 1) 可能是最适合您的。

import time

def fun1(l1, l2):
    # Order and duplications in l1 are both lost, O(m) + O(n)
    return set(l1) - set(l2)

def fun2(l1, l2):
    # Order and duplications in l1 are both preserved, O(m) + O(n)
    l2_set = set(l2)
    return [item for item in l1 if item not in l2_set]

def fun3(l1, l2):
    # Order and duplications in l1 are both preserved, O(m * n)
    # Don't do
    return [item for item in l1 if item not in l2]

A = list(range(7500))
B = list(range(5000, 10000))

loops = 100

start = time.time()
for _ in range(loops):
    fun1(A, B)
print(f"fun1 time: {time.time() - start}")

start = time.time()
for _ in range(loops):
    fun2(A, B)
print(f"fun2 time: {time.time() - start}")

start = time.time()
for _ in range(loops):
    fun3(A, B)
print(f"fun3 time: {time.time() - start}")

fun1 time: 0.03749704360961914
fun2 time: 0.04109621047973633
fun3 time: 32.55076885223389

解决方案 12：

对于字典列表，完整列表理解解决方案有效，而set解决方案提出

TypeError: unhashable type: 'dict'

测试用例

def diff(a, b):
    return [aa for aa in a if aa not in b]

d1 = {"a":1, "b":1}
d2 = {"a":2, "b":2}
d3 = {"a":3, "b":3}

>>> diff([d1, d2, d3], [d2, d3])
[{'a': 1, 'b': 1}]
>>> diff([d1, d2, d3], [d1])
[{'a': 2, 'b': 2}, {'a': 3, 'b': 3}]

解决方案 13：

如果您的顺序无关紧要并且两个集合都可以被散列，那么您可以在集合之间使用对称差异。

这将返回集合 A 或集合 B 中出现的值，但不会同时返回两者。

例如，问题显示对列表A 和列表B执行差异后的返回值。

如果我们（将两个列表转换为集合并）执行对称差分，我们将在一次操作中获得两者的合并结果。

A = [1,2,3,4]
B = [2,5]
print(set(A) ^ set(B)

# {1, 3, 4, 5}

添加此答案，因为我还没有看到现有答案中提供的对称差异

解决方案 14：

如果您需要，简单的代码可以为您提供多个项目之间的差异：

a=[1,2,3,3,4]
b=[2,4]
tmp = copy.deepcopy(a)
for k in b:
    if k in tmp:
        tmp.remove(k)
print(tmp)

解决方案 15：

添加一个答案来处理我们想要严格区分重复的情况，即，第一个列表中有重复，我们希望将其保留在结果中。例如，

[1, 1, 1, 2] - [1, 1] --> [1, 2]

我们可以使用附加计数器来实现优雅的差异函数。

from collections import Counter

def diff(first, second):
    secondCntr = Counter(second)
    second = set(second)
    res = []
    for i in first:
        if i not in second:
            res.append(i)
        elif i in secondCntr:
            if secondCntr[i] > 0:
                secondCntr[i] -= 1
            else:
                res.append(i)        
    return res

解决方案 16：

我没有看到此线程中保留 A 中重复项的解决方案。当 A 中的元素与 B 中的元素匹配时，必须在 B 中删除该元素，以便当相同的元素再次出现在 A 中时，如果该元素在 B 中仅出现一次，则它必须出现在差异中。

def diff(first, second):
   l2 = list(second)
   l3 = []
   for el in first:
      if el in l2:
         l2.remove(el)
      else:
         l3 += [el]
   return l3

l1 = [1, 2, 1, 3, 4]
l2 = [1, 2, 3, 3]
diff(l1, l2)
>>> [1, 4]

解决方案 17：

当查看In 运算符的时间复杂度时，最坏情况下它的复杂度为 O(n)。即使对于集合也是如此。

因此，当比较两个数组时，最佳情况下的时间复杂度为 O(n)，最坏情况下的时间复杂度为 O(n^2)。

另一种解决方案（但不幸的是更复杂）在最好和最坏的情况下都适用于 O（n），如下所示：

# Compares the difference of list a and b
# uses a callback function to compare items
def diff(a, b, callback):
  a_missing_in_b = []
  ai = 0
  bi = 0

  a = sorted(a, callback)
  b = sorted(b, callback)

  while (ai < len(a)) and (bi < len(b)):

    cmp = callback(a[ai], b[bi])
    if cmp < 0:
      a_missing_in_b.append(a[ai])
      ai += 1
    elif cmp > 0:
      # Item b is missing in a
      bi += 1
    else:
      # a and b intersecting on this item
      ai += 1
      bi += 1

  # if a and b are not of same length, we need to add the remaining items
  for ai in xrange(ai, len(a)):
    a_missing_in_b.append(a[ai])


  return a_missing_in_b

例如

>>> a=[1,2,3]
>>> b=[2,4,6]
>>> diff(a, b, cmp)
[1, 3]

计算列表差异[重复]