Question

I'amNotYou

Asked: 2020-01-08 14:07:25 +0800 CST

Removing duplicate lines from a file

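There is a text file containing 1000 e-mail addresses, one per line. Some of them are duplicates. The output after processing must be a file that contains only the unique addresses. How can this be done in Python 3?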

2 Answers

Sergey Gornostaev · Answer 1 · 2020-01-08T14:53:40+08:00

Best Answer

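The fastest and simplest way to remove duplicates from a list is to convert it to a set. The set() constructor accepts any iterable, including file objects. After that, all that remains is to join the set back into a string and write it to another file: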

with open('emails.txt') as in_fh, open('deduplicated.txt', 'w') as out_fh:
    out_fh.write(''.join(set(in_fh)))
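
To see why this works, here is a minimal sketch of the same idea on in-memory data. io.StringIO stands in for the input file, and the sample addresses are made up for illustration:

```python
import io

# Hypothetical sample data standing in for emails.txt.
in_fh = io.StringIO("a@example.com\nb@example.com\na@example.com\n")

# Iterating over a file object yields lines with their trailing "\n",
# so identical lines collapse into a single set element.
unique_lines = set(in_fh)

print(len(unique_lines))
print("".join(sorted(unique_lines)), end="")
```

Because each element keeps its own newline, joining with an empty string is enough to rebuild the file contents. The sorted() call here is only to make the printed order predictable.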

jfs · Answer 2 · 2020-01-09T00:12:57+08:00

To print the unique e-mail addresses from the files named on the command line, or from standard input:

#!/usr/bin/env python
import fileinput

print("\n".join(set(map(str.strip, fileinput.input()))))

Example:

$ dedup emails.txt >uniq-emails.txt

Or:

$ dedup < emails.txt >uniq-emails.txt

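The code works regardless of whether lines carry invisible whitespace. For example, the last line of the file may or may not end with a newline, and the result is still correct.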
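
A small sketch of why stripping matters, using a made-up list of lines in which the last one has no newline and another has a trailing space (the helper name unique_stripped is hypothetical):

```python
def unique_stripped(lines):
    """Return the set of lines with surrounding whitespace removed."""
    return set(map(str.strip, lines))

# Hypothetical input: trailing space on the second line,
# no trailing newline on the last line.
raw = ["a@example.com\n", "b@example.com \n", "a@example.com"]

print(len(set(raw)))              # raw lines differ only in whitespace
print(len(unique_stripped(raw)))  # duplicates are detected after stripping
```

Without str.strip, "a@example.com\n" and "a@example.com" are distinct strings, so the plain set keeps both.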

Using set() means the results are printed in arbitrary order, which may change from run to run. To emulate sort -u emails.txt, you can use groupby(sorted()):

#!/usr/bin/env python
import fileinput
from itertools import groupby

for line, _ in groupby(sorted(map(str.strip, fileinput.input()))):
    print(line)

Usage is the same: input is read from files or standard input, and output is printed to standard output.
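
As an illustration, the same sort-then-group step can be wrapped in a function and checked on a made-up list (sorted_unique is a hypothetical name). groupby only merges adjacent equal items, which is why the input is sorted first:

```python
from itertools import groupby

def sorted_unique(lines):
    """Sort the stripped lines and keep one key per run of equal lines."""
    return [key for key, _group in groupby(sorted(map(str.strip, lines)))]

# Hypothetical sample input with one duplicate.
emails = ["b@example.com\n", "a@example.com\n", "b@example.com\n"]

print(sorted_unique(emails))
```

Python compares strings by code point, so the order may differ from what a locale-aware sort produces for non-ASCII data.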

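This is not necessary for the e-mail case, but in general, to print only the unique lines of a large file that does not fit in memory, as LC_ALL=C sort -u < input does, a Python equivalent is: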

#!/usr/bin/env python3
import contextlib
import heapq
import sys
from itertools import groupby
from tempfile import TemporaryFile
from operator import itemgetter

def uniq(sorted_items):
    return map(itemgetter(0), groupby(sorted_items))

sorted_files = []
with contextlib.ExitStack() as stack:
    # sort lines in batches, write intermediate result to temporary files
    nbytes = 1 << 15 # read ~nbytes at a time
    for lines in iter(lambda f=sys.stdin.detach(): f.readlines(nbytes), []):
        lines.sort()
        file = stack.enter_context(TemporaryFile('w+b')) #NOTE: file is deleted on exit
        file.writelines(uniq(lines)) # write sorted unique lines
        file.seek(0) # rewind, to read later while merging partial results
        sorted_files.append(file) #NOTE: do not close the temporary file, yet

    # merge and write results
    sys.stdout = sys.stdout.detach() # suppress ValueError: underlying buffer has been detached
    sys.stdout.writelines(uniq(heapq.merge(*sorted_files)))

Example:

$ sort-u < emails.txt >uniq-emails.txt

In this case the input is read only from standard input, and lines are compared as byte sequences (assuming every line ends with a newline).
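
The merge step can be sketched in memory, with plain lists playing the role of the temporary files. The batches below are made-up data, as if a large input had been read in two chunks:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def uniq(sorted_items):
    """Yield each distinct item once from an already sorted iterable."""
    return map(itemgetter(0), groupby(sorted_items))

# Hypothetical batches of byte lines, each ending with a newline.
batches = [
    [b"c@example.com\n", b"a@example.com\n", b"c@example.com\n"],
    [b"b@example.com\n", b"a@example.com\n"],
]

# Sort and deduplicate each batch on its own
# (the script does this before writing each temporary file).
sorted_batches = [list(uniq(sorted(batch))) for batch in batches]

# heapq.merge lazily merges the sorted batches; the outer uniq drops
# duplicates that appeared in different batches.
result = list(uniq(heapq.merge(*sorted_batches)))
print(result)
```

If a line lacked its newline, b"a@example.com" and b"a@example.com\n" would be treated as different lines, which is the reason for the newline assumption.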

Related question: Sorting a text file using Python.
