Is there a way to "uniq" by column?

Question:

I have a .csv file like this:

stack2@domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.793000000,domain2.example,255.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.646465785,domain2.example,256.255.255.0
...

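I have to remove duplicate e-mails (the whole line) from the file (i.e. one of the lines containing overflow@domain2.example in the example above). How do I use uniq on field 1 only (fields are comma-separated)? According to man, uniq has no option for columns.

I tried something with sort | uniq, but it did not work.

Solution 1: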

sort -u -t, -k1,1 file

-u for unique
-t, so that the comma is the delimiter
-k1,1 for key field 1

Test result:

overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
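For readers who prefer to see the idea spelled out, here is a minimal Python sketch of what "sort by one field, keep one line per key" amounts to. The function name sort_unique_by_field and the sample rows are illustrative assumptions, not part of the answer. The sketch keeps the first line seen for each key and compares keys by code point, whereas sort follows the locale, so ordering and the surviving duplicate may differ from the real command.

```python
from itertools import groupby


def sort_unique_by_field(lines, field=0, sep=","):
    """Sort lines by one separated field and keep one line per distinct key."""
    def key(line):
        return line.split(sep)[field]

    # sorted() is stable, so within each key the first input line stays first.
    ordered = sorted(lines, key=key)
    # groupby() yields one group per run of equal keys; keep the run's first line.
    return [next(group) for _, group in groupby(ordered, key=key)]


rows = [
    "stack2@domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1",
    "overflow@domain2.example,2009-11-27 00:58:29.793000000,domain2.example,255.255.255.0",
    "overflow@domain2.example,2009-11-27 00:58:29.646465785,domain2.example,256.255.255.0",
]
for row in sort_unique_by_field(rows):
    print(row)
```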

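Solution 2:

awk -F"," '!_[$1]++' file

-F sets the field separator.
$1 is the first field.
_[val] looks up val in the hash _ (a regular variable).
++ increments and returns the old value.
! returns logical not.
There is an implicit print at the end.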
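The same "print a line only the first time its key is seen" logic can be sketched in Python with a set. The name first_by_field is a hypothetical helper chosen for illustration. Like the awk one-liner, it preserves input order and holds every distinct key in memory.

```python
def first_by_field(lines, field=0, sep=","):
    """Yield each line whose key field has not been seen before."""
    seen = set()
    for line in lines:
        key = line.split(sep)[field]
        if key not in seen:   # plays the role of !_[$1]
            seen.add(key)     # plays the role of the ++
            yield line


print(list(first_by_field(["a,1", "b,2", "a,3"])))
```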

Solution 3:

To consider multiple columns.

Sort and give a unique list based on column 1 and column 3:

sort -u -t : -k 1,1 -k 3,3 test.txt

-t : the colon is the separator
-k 1,1 -k 3,3 based on column 1 and column 3
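As a rough illustration of a composite key, the following Python sketch treats columns 1 and 3 together as a tuple. The helper name unique_by_fields and the sample data are assumptions made for this example. It keeps the first line per key pair and returns the survivors ordered by that pair.

```python
def unique_by_fields(lines, fields=(0, 2), sep=":"):
    """Keep one line per distinct combination of the given fields, sorted by that key."""
    def key(line):
        parts = line.split(sep)
        return tuple(parts[i] for i in fields)

    kept = {}
    for line in lines:
        kept.setdefault(key(line), line)  # the first line for each key wins
    return [kept[k] for k in sorted(kept)]


sample = ["b:1:x", "a:9:y", "a:7:y", "a:7:z"]
print(unique_by_fields(sample))
```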

Solution 4:

If you want to use uniq:

<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2

gives:

1 01:05:47.893000000 2009-11-27 stack2@domain.example
2 00:58:29.793000000 2009-11-27 overflow@domain2.example
1
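What uniq -c does on a chosen field can be approximated in Python with itertools.groupby, which also groups only consecutive items. This is a sketch under the assumption that the key is a comma-separated field; count_runs is a hypothetical name.

```python
from itertools import groupby


def count_runs(lines, field=0, sep=","):
    """Return (count, first_line) for each run of consecutive lines sharing a field."""
    result = []
    for _, group in groupby(lines, key=lambda line: line.split(sep)[field]):
        run = list(group)
        result.append((len(run), run[0]))
    return result


for count, line in count_runs(["a,1", "b,1", "b,2", "a,3"]):
    print(count, line)
```

As with uniq, the second run of "a" is counted separately because the "b" lines sit between the two.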

Solution 5:

If you want to keep the last of the duplicates, you could use

 tac a.csv | sort -u -t, -r -k1,1 |tac

That was my requirement.

Here, tac reverses the file line by line.
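The "reverse, keep the first per key, reverse again" idea can be sketched in Python as follows. The name last_by_field is hypothetical. Unlike the pipeline, this sketch does not sort: the surviving lines stay in their original relative order.

```python
def last_by_field(lines, field=0, sep=","):
    """Keep only the last line for each key, preserving input order of the survivors."""
    seen = set()
    kept = []
    for line in reversed(list(lines)):   # plays the role of the first tac
        key = line.split(sep)[field]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    kept.reverse()                       # plays the role of the second tac
    return kept


print(last_by_field(["a,1", "b,2", "a,3"]))
```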

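Solution 6:

Here is a very nifty way.

First format the content so that the column to be compared for uniqueness has a fixed width. One way of doing this is to use awk printf with a field/column width specifier ("%15s").

Now the -f and -w options of uniq can be used to skip the preceding fields/columns and to specify the comparison width (the column width).

Here are three examples.

In the first example...

1) Temporarily make the column of interest a fixed width greater than or equal to the field's maximum width.

2) Use the -f uniq option to skip the prior columns, and the -w uniq option to limit the width to tmp_fixed_width.

3) Remove trailing spaces from the column to "restore" its width (assuming there were no trailing spaces beforehand).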

printf "%s" "$str" \
| awk '{ tmp_fixed_width=15; uniq_col=8; w=tmp_fixed_width-length($uniq_col); for (i=0;i<w;i++) { $uniq_col=$uniq_col" "}; printf "%s\n", $0 }' \
| uniq -f 7 -w 15 \
| awk '{ uniq_col=8; gsub(/ */, "", $uniq_col); printf "%s\n", $0 }'

In the second example...

Create a new uniq column 1. Then remove it after the uniq filter has been applied.

printf "%s" "$str" \
| awk '{ uniq_col_1=4; printf "%15s %s\n", $uniq_col_1, $0 }' \
| uniq -f 0 -w 15 \
| awk '{ $1=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'

The third example is the same as the second, but for multiple columns.

printf "%s" "$str" \
| awk '{ uniq_col_1=4; uniq_col_2=8; printf "%5s %15s %s\n", $uniq_col_1, $uniq_col_2, $0 }' \
| uniq -f 0 -w 5 \
| uniq -f 1 -w 15 \
| awk '{ $1=$2=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
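To make the padding trick easier to follow, here is a small Python sketch of the second example's idea: prefix each line with its key right-justified to a fixed width, compare only that prefix between consecutive lines, then discard the prefix. The function name uniq_on_padded_key is hypothetical, and the sketch assumes whitespace-separated columns whose key is no longer than the chosen width.

```python
def uniq_on_padded_key(lines, col, width=15):
    """Drop consecutive lines whose fixed-width key prefix matches the previous one."""
    out = []
    previous_prefix = None
    for line in lines:
        key = line.split()[col]
        prefixed = f"{key:>{width}} {line}"   # like awk's printf "%15s %s\n"
        prefix = prefixed[:width]             # the part uniq -w would compare
        if prefix != previous_prefix:
            out.append(line)                  # dropping the temporary column restores the line
        previous_prefix = prefix
    return out


print(uniq_on_padded_key(["x a", "y a", "z b", "w a"], col=1))
```

In Python the padding is not needed to compare keys, so this mainly illustrates why a fixed width lets uniq -w isolate one column.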

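Solution 7:

An awk CLI that behaves like uniq without sort, but only catches consecutive duplicates

Most other answers so far give methods that remove duplicates even when they are not consecutive.

The problem with this is that it requires either sorting first or storing a potentially huge map in memory, which can be slow or infeasible for large input files.

So for those cases, here is an awk solution that, like uniq, only catches duplicates that appear on consecutive lines. For example, to remove all consecutive duplicates on the first column, we can use $1 as in: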

awk '$1 != last { print $0; last = $1; }' infile.txt

For example, consider the input file:

a 0
a 1
b 0
a 0
a 1

The output would be:

a 0
b 0
a 0

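Here:

the first "a 1" line was removed because the previous "a 0" line has the same first column, a
but we still get the second "a 0" line because the "b 0" line broke the consecutiveness

The awk script works simply by storing the column value of the previous line in the variable last and comparing the current value against it, skipping the line if they are equal.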
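The same single-variable bookkeeping can be sketched as a Python generator. The name uniq_consecutive is an assumption for illustration. It keeps only one key in memory, so it streams over arbitrarily large inputs.

```python
def uniq_consecutive(lines, field=0):
    """Yield lines whose chosen whitespace-separated field differs from the previous line's."""
    last = object()  # sentinel that never equals a real field value
    for line in lines:
        key = line.split()[field]
        if key != last:
            yield line
            last = key


print(list(uniq_consecutive(["a 0", "a 1", "b 0", "a 0", "a 1"])))
```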

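This consecutive-only approach can be useful if you know that your input data has a lot of useless consecutive duplicates and want to clean it up a bit before doing any more expensive sort-like processing.

If you do need to remove non-consecutive duplicates, the more robust solution is generally to use a relational database such as SQLite, e.g.: How can I delete duplicates in SQLite?

Quick Python script to remove duplicates that appear within the last N lines

If you need a bit more flexibility but still don't want to pay for a full sort:

uniqn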

#!/usr/bin/env python

import argparse
from argparse import RawTextHelpFormatter
import fileinput
import sys

parser = argparse.ArgumentParser(
    description='uniq but with a memory of the n previous distinct lines rather than just one',
    epilog="""Useful if you know that duplicate lines in an input file are nearby to one another, but not necessarily immediately one after the other.

This command was about 3x slower than uniq, and becomes highly CPU (?) bound even on rotating disks. We need to make a C++ version one day, or try PyPy/Cython""",
    formatter_class=RawTextHelpFormatter,
)
parser.add_argument("-k", default=None, type=int)
parser.add_argument("-n", default=10, type=int)
parser.add_argument("file", nargs='?', default=[])
args = parser.parse_args()
k = args.k

lastlines = {}
for line in fileinput.input(args.file):
    line = line.rstrip('\n')
    if k is not None:
        orig = line
        line = line.split()[k]
    else:
        orig = line
    if line not in lastlines:
        print(orig)
    lastlines.pop(line, None)
    lastlines[line] = True
    if len(lastlines) == args.n + 1:
        del lastlines[next(iter(lastlines))]

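This script looks for duplicates among the previous -n distinct lines, and can be useful for cleaning data that has some kind of periodic pattern that prevents uniq from doing much to it. -k selects the column. For example, consider the input file:

uniqn-test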

1 a
2 a
3 a
1 a
2 a
2 b
3 a

Then:

./uniqn -k0 -n3 uniqn-test

gives:

1 a
2 a
3 a

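For example, the second "1 a" sees the first "1 a" three lines back and is skipped as a result of -n3.

Some built-in uniq options to consider

Although uniq does not have a nice "consider only the N-th column" option, it does have some flags that solve certain more restricted cases. From man uniq:

-f, --skip-fields=N: avoid comparing the first N fields
-s, --skip-chars=N: avoid comparing the first N characters
-w, --check-chars=N: compare no more than N characters in lines
A field is a run of blanks (usually spaces and/or TABs), then non-blank characters. Fields are skipped before chars.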
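To see how these three flags combine, the following Python sketch computes the part of a line that would be compared. It is an approximation based only on the man page wording quoted above, not a port of the uniq source, and comparison_key with its parameter names is a hypothetical helper.

```python
def comparison_key(line, skip_fields=0, skip_chars=0, check_chars=None):
    """Return the slice of line that takes part in the comparison."""
    i = 0
    n = len(line)
    # A field is a run of blanks followed by non-blank characters.
    for _ in range(skip_fields):
        while i < n and line[i] in " \t":
            i += 1
        while i < n and line[i] not in " \t":
            i += 1
    # Fields are skipped before characters.
    i = min(i + skip_chars, n)
    rest = line[i:]
    # check_chars limits how many of the remaining characters are compared.
    return rest if check_chars is None else rest[:check_chars]


print(repr(comparison_key("1 a", skip_fields=1)))
print(repr(comparison_key("abcdef", skip_chars=2, check_chars=3)))
```

Note that after skipping a field, the blanks in front of the next field are still part of the compared text in this model.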

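If only someone would patch a --check-fields option analogous to --check-chars into it, we would be done with --skip-fields N-1 --check-fields 1. As it stands, --skip-fields alone already covers the special case where the column of interest is the last field.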

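Tested on Ubuntu 23.04.

Solution 8:

Well, simpler than isolating the column with awk: if you need to remove every line that has a certain value in a given file, why not just use grep -v:

For example, to delete every line with the value "col2" in the second position of a line like: col1,col2,col3,col4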

grep -v ',col2,' file > file_minus_offending_lines

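If this is not good enough, because some lines may be stripped improperly when the matching value shows up in a different column, you can do something like this:

Use awk to isolate the offending column, e.g.: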

awk -F, '{print $2 "|" $0}'

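-F sets the field separator to ",", $2 means column 2, followed by some custom delimiter and then the entire line. You can then filter by removing the lines that begin with the offending value: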

 awk -F, '{print $2 "|" $0}' | grep -v ^BAD_VALUE

and then strip out the stuff before the delimiter:

awk -F, '{print $2 "|" $0}' | grep -v ^BAD_VALUE | sed 's/.*|//g'

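(Note - the sed command is sloppy because it does not escape values. Also, the sed pattern should really be something like "[^|]+" (i.e. anything that is not the delimiter). But hopefully this is clear enough.)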
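If exact column matching is the goal, a short Python sketch with the standard csv module avoids both the false positives of a plain grep and the temporary delimiter. The function name drop_rows_with_value is hypothetical, and the input is assumed to be simple comma-separated text.

```python
import csv
import io


def drop_rows_with_value(text, column, bad_value):
    """Return CSV text without the rows whose given column equals bad_value."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Compare only the chosen column, so the value may appear elsewhere in the row.
        if len(row) > column and row[column] == bad_value:
            continue
        writer.writerow(row)
    return out.getvalue()


print(drop_rows_with_value("a,col2,c\ncol2,b,c\n", column=1, bad_value="col2"), end="")
```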

Solution 9:

By sorting the file with sort first, you can then apply uniq.

It seems to sort the file just fine:

$ cat test.csv
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

$ sort test.csv
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

$ sort test.csv | uniq
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

You could also do some AWK magic:

$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
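The awk array above stores the most recent line per key, which is why the later overflow@domain2.example line survives. A Python dict can sketch the same behaviour; last_line_per_key is a hypothetical name. One difference worth noting: awk's for-in order is unspecified, while a Python dict returns keys in the order they were first inserted.

```python
def last_line_per_key(lines, field=0, sep=","):
    """Map each key to the last line that carried it, then return those lines."""
    latest = {}
    for line in lines:
        latest[line.split(sep)[field]] = line  # later lines overwrite earlier ones
    return list(latest.values())


print(last_line_per_key(["a,1", "b,2", "a,3"]))
```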