TWOfish · Question · 2020-08-14T01:48:10Z

Delete lines containing three or more digits (or letters), in a row or not in a row

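This question bundles several similar tasks so that I don't have to open a separate question for each one. Characters used in the lines: all digits, plus all lowercase and uppercase English letters. Sample lines: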

a22tbf645
92STbfF4W
92rtRe7Ev
gyue73Pr4
u8t9D03gE
a2t4TA6Kk
Lj3D2Jrs1

Desired output after applying all of the rules below at once: Lj3D2Jrs1

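1.a Delete lines that contain three or more digits (any digits) in a row, at any position.

1.b Delete lines that contain three or more lowercase letters in a row, at any position.

1.c Delete lines that contain three or more uppercase letters in a row, at any position.

2.a Delete lines that contain four or more digits in total, at any positions.

2.b Delete lines that contain four or more lowercase letters in total, at any positions.

2.c Delete lines that contain four or more uppercase letters in total, at any positions.

3. Delete lines that contain, at any position, the same letter in both cases next to each other, e.g. aA or Aa.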

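Can all of these conditions be combined into a single command? Preferably with sed, awk, grep, tr, cut, perl, python, or maybe something else. The emphasis is on processing speed, because the data sets are large. (By the way, which one is faster?) Thanks.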

3 Answers

Ainar-G · Answer 1 · 2020-08-14T04:59:52Z

Here is a (probably) inefficient and frankly dumb version, but it works with classic regular expressions:

grep -v\
 -e '[[:digit:]]\{3,\}'\
 -e '[[:lower:]]\{3,\}'\
 -e '[[:upper:]]\{3,\}'\
 -e '\([[:digit:]].*\)\{4,\}'\
 -e '\([[:lower:]].*\)\{4,\}'\
 -e '\([[:upper:]].*\)\{4,\}'\
 $(for i in {a..z}; do echo "-e $i${i^^} -e ${i^^}$i"; done)
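
For readers who want to check the pattern set without a shell, here is a rough Python sketch of the same idea. It is not part of the answer above: the names `case_pairs`, `reject` and `samples` are made up for illustration, and the patterns are a translation of the grep expressions into Python `re` syntax, assuming the condition 3 pairs must be adjacent.

```python
import re
from string import ascii_lowercase

# Condition 3 alternatives, generated like the shell loop: "aA|Aa|bB|Bb|..."
case_pairs = "|".join(f"{c}{c.upper()}|{c.upper()}{c}" for c in ascii_lowercase)

reject = re.compile(
    r"[0-9]{3}|[a-z]{3}|[A-Z]{3}"                      # conditions 1.a, 1.b, 1.c
    r"|(?:[0-9].*){4}|(?:[a-z].*){4}|(?:[A-Z].*){4}"   # conditions 2.a, 2.b, 2.c
    "|" + case_pairs                                   # condition 3
)

# Sample lines from the question.
samples = [
    "a22tbf645", "92STbfF4W", "92rtRe7Ev", "gyue73Pr4",
    "u8t9D03gE", "a2t4TA6Kk", "Lj3D2Jrs1",
]

# Keep only the lines that match none of the rejection patterns.
kept = [s for s in samples if not reject.search(s)]
print(kept)  # ['Lj3D2Jrs1']
```

On the seven sample lines this leaves only Lj3D2Jrs1, which is the result the question asks for.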

Hellseher · Answer 2 · 2020-08-14T06:59:01Z

A Python version, based on the CO answer.

As an example, we create a file from a 100M random sample whose lines follow the pattern given in the question: [a-zA-Z0-9]{9}
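
The answer does not show how the sample file was produced. One possible way to generate such lines is sketched below; the function name `random_lines`, its parameters and the seed are assumptions for illustration, not the author's actual method.

```python
import random
import string

# The question's alphabet: all digits plus lowercase and uppercase English letters.
ALPHABET = string.ascii_letters + string.digits


def random_lines(count, length=9, seed=None):
    """Yield `count` random strings of `length` characters drawn from ALPHABET."""
    rng = random.Random(seed)  # seeded generator, so a run can be repeated
    for _ in range(count):
        yield "".join(rng.choices(ALPHABET, k=length))


if __name__ == "__main__":
    # Print a handful of lines; redirect stdout to a file for a larger sample.
    for line in random_lines(5, seed=1):
        print(line)
```

Raising `count` and redirecting the output to a file gives a test input of whatever size is needed.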

Machine specs:

~$ uname -r; grep -im1 "model name" /proc/cpuinfo
4.17.3-200.fc28.x86_64
model name      : Intel(R) Core(TM) i7-3770S CPU @ 3.10GHz

Corrected version, regular expressions only, without condition 3:

~$ cat lines_filter_re.py

#!/usr/bin/env python3
import sys
import re
import fileinput

# Compiled once, outside the loop. Covers conditions 1 and 2 only (no condition 3).
pat_re = re.compile(r"""
       ([0-9]){3}            # any 3 sequential digits
       |([a-z]){3}           # any 3 sequential lowercase letters
       |([A-Z]){3}           # any 3 sequential uppercase letters
       |(.*[0-9].*){4}       # any 4 digits at any place
       |(.*[A-Z].*){4}       # any 4 uppercase letters at any place
       |(.*[a-z].*){4}       # any 4 lowercase letters at any place
       """, re.VERBOSE)

for line in fileinput.input():
    # Print only the lines that match none of the alternatives.
    if not pat_re.search(line):
        sys.stdout.write(line)
# end of script

~$ cat test.txt; echo; ./lines_filter_re.py < test.txt
a22tbf645
92STbfF4W
92rtRe7Ev
gyue73Pr4
u8t9D03gE
a2t4TA6Kk
aAb12cAB1
Lj3D2Jrs1
---------
a2t4TA6Kk
aAb12cAB1
Lj3D2Jrs1

~$ wc -l file_9w_100M
6180836 file_9w_100M

~$ time ./lines_filter_re.py >/dev/null < file_9w_100M
real    1m10.668s
user    1m10.424s
sys     0m0.060s

An approach in Rust that runs 15 times faster:

extern crate regex;

use regex::RegexSet;
use std::io::{stdin, BufRead};

fn main() {
    let set = RegexSet::new(&[
        r"([0-9]){3}",
        r"([a-z]){3}",
        r"([A-Z]){3}",
        r"(.*[0-9].*){4}",
        r"(.*[A-Z].*){4}",
        r"(.*[a-z].*){4}",
    ]).unwrap();

    let stdin = stdin();
    let test_string = "Test me once";

    println!("{}", set.is_match(test_string));

    for line in stdin.lock().lines() {
        let mut line_str = line.unwrap();

        if set.is_match(&mut line_str) {
            continue;
        } else {
            println!("{}", line_str);
        }
    }
}
// End of main.rs

Test

~$ time ./lfr  < ../../../../Python/test.txt
true
a2t4TA6Kk
aAb12cAB1
Lj3D2Jrs1
real    0m0.004s
user    0m0.002s
sys     0m0.002s

~$ time ./lfr >/dev/null < ../../../../Python/file_9w_100M
real    0m2.934s
user    0m2.866s
sys     0m0.062s

MBo · Answer 3 · 2020-08-14T21:12:59Z

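A solution using a simple state machine (a deterministic finite automaton). I wrote it head-on in Delphi and then translated it into Python almost literally, as best I could, so the code is cumbersome and clearly not pythonic. Performance may suffer because I don't use some of the built-in facilities and don't know what affects Python performance. For example, the run time dropped from 41 to 33 seconds when I replaced the character access

        for i in range(len(line)):
            ch = line[i]

with

        for ch in line: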
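
A small way to compare the two loop styles in isolation is sketched below. This is only an illustration: the helper names `by_index` and `by_iteration`, the sample line and the repeat count are made up, and the timings will differ from the 41 and 33 seconds quoted above, which were measured on the full file.

```python
import timeit

line = "a2t4TA6Kk\n"


def by_index(line):
    """Count lowercase letters, fetching each character by index."""
    n = 0
    for i in range(len(line)):
        ch = line[i]
        if "a" <= ch <= "z":
            n += 1
    return n


def by_iteration(line):
    """Count lowercase letters, iterating over the string directly."""
    n = 0
    for ch in line:
        if "a" <= ch <= "z":
            n += 1
    return n


# Time both variants on the same input; absolute numbers depend on the machine.
for fn in (by_index, by_iteration):
    seconds = timeit.timeit(lambda: fn(line), number=100_000)
    print(f"{fn.__name__}: {seconds:.3f} s")
```

Both functions do the same work, so any difference in the printed times comes from the loop style alone.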

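For a file of 10 million lines of 9 random characters from the given set (110 megabytes), the native (Delphi) code on Windows on a 2.8 GHz Celeron takes about 4-5 seconds (loading the data from the HDD, plus about a second for the processing itself). This code (Python 3.6) takes 33 seconds, of which 11 seconds go to reading the file (measured by just reading all the lines and doing nothing with them).

The code walks through each line, counting uppercase letters, lowercase letters and digits, and aborts processing of the line as soon as any counter exceeds the limit (3) (condition 2).

state is the current state and records the type of the previously processed character: 0 - undefined, 1 - digit, 2 - uppercase letter, 3 - lowercase letter.

If state does not change, seriescnt tracks the length of the current run of characters of the same type (condition 1).

If state changes from 2 to 3 or vice versa, the code checks whether the previous character is the same letter as the current one in the other case (condition 3).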

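import time


def dellines():
    # Raw strings keep the backslashes in the Windows paths literal.
    with open(r"d:\m1.txt", "r") as infile, open(r"d:\m2.txt", "w") as outfile:
        for line in infile:
            seriescnt = 0  # length of the current run of same-type characters
            bigcnt = 0     # uppercase letters seen in this line
            smallcnt = 0   # lowercase letters seen in this line
            digitcnt = 0   # digits seen in this line
            state = 0      # type of the previous character; reset to 0 on rejection
            lastch = ""    # previous letter, used for the aA / Aa check
            for ch in line:
                if "0" <= ch <= "9":
                    if state == 1:
                        seriescnt += 1
                        if seriescnt > 2:  # condition 1.a
                            state = 0
                            break
                    else:
                        seriescnt = 1

                    digitcnt += 1
                    if digitcnt > 3:  # condition 2.a
                        state = 0
                        break
                    state = 1
                elif "A" <= ch <= "Z":
                    # condition 3: the previous character is this letter in lowercase
                    if state == 3 and ord(lastch) == ord(ch) + 32:
                        state = 0
                        break

                    if state == 2:
                        seriescnt += 1
                        if seriescnt > 2:  # condition 1.c
                            state = 0
                            break
                    else:
                        seriescnt = 1

                    bigcnt += 1
                    if bigcnt > 3:  # condition 2.c
                        state = 0
                        break

                    lastch = ch
                    state = 2
                elif "a" <= ch <= "z":
                    # condition 3: the previous character is this letter in uppercase
                    if state == 2 and ord(lastch) == ord(ch) - 32:
                        state = 0
                        break

                    if state == 3:
                        seriescnt += 1
                        if seriescnt > 2:  # condition 1.b
                            state = 0
                            break
                    else:
                        seriescnt = 1

                    smallcnt += 1
                    if smallcnt > 3:  # condition 2.b
                        state = 0
                        break

                    lastch = ch
                    state = 3

            # A non-zero state means the line passed every check.
            if state:
                outfile.write(line)


start = time.time()
dellines()
end = time.time()
print(end - start)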
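
To try the per-line logic without the d:\ files, the same state machine can be condensed into a function that takes a string. This is a sketch only, not the author's code: `classify`, `keep_line` and the two limit parameters are hypothetical names, and the three per-type branches are folded into one loop body.

```python
def classify(ch):
    """Return 1 for a digit, 2 for an uppercase letter, 3 for lowercase, else 0."""
    if "0" <= ch <= "9":
        return 1
    if "A" <= ch <= "Z":
        return 2
    if "a" <= ch <= "z":
        return 3
    return 0


def keep_line(line, run_limit=2, total_limit=3):
    """Return True if the line passes conditions 1, 2 and 3 of the question."""
    counts = [0, 0, 0, 0]  # per-type totals, indexed by the classify() result
    state = 0              # type of the previous classified character
    run = 0                # length of the current run of one type
    last = ""              # previous classified character
    for ch in line:
        kind = classify(ch)
        if kind == 0:
            continue       # e.g. the trailing newline: ignored, state unchanged
        if kind == state:
            run += 1
            if run > run_limit:            # condition 1
                return False
        else:
            run = 1
        counts[kind] += 1
        if counts[kind] > total_limit:     # condition 2
            return False
        # Condition 3: a letter directly after the same letter in the other case.
        if {kind, state} == {2, 3} and last.lower() == ch.lower():
            return False
        state = kind
        last = ch
    return state != 0      # lines without any classified character are dropped


samples = ["a22tbf645", "92STbfF4W", "92rtRe7Ev", "gyue73Pr4",
           "u8t9D03gE", "a2t4TA6Kk", "aAb12cAB1", "Lj3D2Jrs1"]
print([s for s in samples if keep_line(s)])  # ['Lj3D2Jrs1']
```

Returning early from a function plays the role of the `state = 0` followed by `break` pairs in the file-based version.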
