Asked by MaxU - stop genocide of UA on 2022-04-22 04:07:09 +0000 UTC

How do I properly decompose a complex SQL function?


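This is a follow-up to an earlier question.

At work I have to deal with a variety of data sources, and I often run into poor-quality data (entered into the system by hand) stored in tables that violate 3NF (third normal form), i.e. several entities are usually stored in a single cell.

For example: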

email
-------------------------------------
mail1@mail.com, mail2@domain.com mail3@gmail.com

or

phone
----------------------------------------------------
0172-1234/567, +49123456789 and 089 / 123-4567

My team's task is to clean up and normalize this data without changing the data model, producing the same data in the following form:

email
-------------------------------------
mail1@mail.com; mail2@domain.com; mail3@gmail.com

and

phone
----------------------------------------------------
01721234567; 0049123456789; 0891234567

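Functions that normalize a single email address, phone number, and so on are already implemented and tested. To normalize data like this, the CSV-like strings first have to be parsed into vertical form (i.e. UNPIVOTed), then the normalization function has to be applied to each value, and finally the normalized values have to be packed back into a CSV string with one uniform separator ('; ').

In another answer, the esteemed 0xdb showed how to parse such data efficiently and produce the UNPIVOT. Based on the function from that answer, I wrote the following generic function, which is meant to normalize phone numbers, web addresses, and emails by calling the respective normalization functions.

The result is this monster: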

create or replace function normalize_csv_values (
    str             varchar2,
    typ             varchar2    := 'phone',
    target_sep      char        := '; '
) return varchar2
as
    pattern     varchar2(64) := '([^[:space:]].*?)((\s*[,;]\s*)|($))';
    tokens      varchartab := varchartab ();
    s           varchar2(96);
    c           int := 0;
    l_res       varchar2(1024);
begin
    case
        when lower(trim(typ)) in ('phone', 'fax')
            then pattern := '([^[:space:]].*?)((' || '\s*(,|;|oder|o\.|or|und|and)\s*' || ')|($))';
        when lower(trim(typ)) in ('url', 'web', 'link')
            then pattern := '([^[:space:]].*?)((' || '[,;]' || ')|($))';
        when lower(trim(typ)) in ('email')
            then pattern := '([^[:space:]].*?)((' || '[[:space:],;/]' || ')|($))';
    end case;
    <<split>> loop c := c + 1;
        s := regexp_substr (str, pattern, 1, c, null, 1);
        exit split when s is null;
        if lower(typ) in ('phone', 'fax') then
            s := normalize_phone(s, 6, 17, '\s*\W?(,|;|oder|o.|und|or)\W?\s*');
        elsif lower(typ) in ('url', 'web', 'link') then
            s := normalize_url(s, '^(w{3,4}\.)');
        elsif lower(typ) in ('email') then
            s := normalize_email(s, '[:space:],;/');
        end if;
        tokens.extend;
        tokens(tokens.last) := s;
    end loop;
    -- pack parsed CSV values back to CSV form, using specified [target_sep]
    select listagg(column_value, target_sep) within group (order by rownum)
    into l_res
    from table(tokens);
    return l_res;
end;
/

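Question:

Can you suggest how to decompose this function in line with the SRP (Single Responsibility Principle) and DRY (Don't Repeat Yourself) principles?

Perhaps one could write a generic RegEx and a wrapper function that calls the appropriate normalization functions?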

Tags: regular-expressions

3 Answers

  1. Best Answer
    0xdb
    2022-04-23T06:29:42Z

    Implementation suggestions.

    Move the regular-expression patterns out of all the functions into a separate table:

    create table normalizerconf (tabname, colname, ty, splitpatt, rmpatt) as
        select 'data1', 'col1', 'phone', '(.*?)((,\s+)|($))', '\D+' from dual  
    /
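
    Note that a table created with create table ... as select from string literals gets fixed-width CHAR columns sized to those literals, so rows with longer names or patterns would not fit later. A minimal sketch of an alternative definition with explicit column types follows; the VARCHAR2 sizes and the use of INSERT to load rows are assumptions, not part of the answer:

    ```sql
    -- Alternative to the CTAS above (sketch; column sizes are assumed)
    create table normalizerconf (
        tabname     varchar2(128),  -- table that holds the raw data
        colname     varchar2(128),  -- column that holds the CSV-like string
        ty          varchar2(16),   -- entity type, e.g. 'phone'
        splitpatt   varchar2(256),  -- regex used to split the string into tokens
        rmpatt      varchar2(256)   -- regex of characters to remove from each token
    )
    /
    -- Same configuration row as in the answer
    insert into normalizerconf (tabname, colname, ty, splitpatt, rmpatt)
    values ('data1', 'col1', 'phone', '(.*?)((,\s+)|($))', '\D+')
    /
    ```

    With this layout, configuration for further columns can be added as ordinary rows without touching any function code.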
    

    A working query would then look like this (on db<>fiddle):

    create table data1 (id, col1) as
        select 1, '123/456789, 032-156789' from dual
    /
    
    with 
    function getphonelist (
        col varchar2, splitpatt varchar2, rmpatt varchar2) return varchar2 is
    begin
        return NormalizePhone (col, splitpatt, rmpatt).join();
    end;
    select d.col1, getphonelist (col1, splitpatt, rmpatt) result
    from data1 d
    cross join normalizerconf c
    where c.tabname = 'data1' and colname = 'col1'
    /
    
    COL1                   RESULT                
    ---------------------- ----------------------
    123/456789, 032-156789 123456789; 032156789  
    

    Create a base user-defined type and, for each kind of data to be normalized, a subtype that inherits from it:

    create or replace type Normalizer as object (
        tokens tokenList,
        member procedure split (str varchar2, pattern char), -- return tokenList,
        not instantiable member procedure normalize (rmpatt char),
        member function join (delimiter char := '; ') return varchar2
    ) not instantiable not final
    /
    create or replace type NormalizePhone under Normalizer (
        constructor function NormalizePhone (
            str varchar2, splitpatt char, rmpatt char) return self as result,
        overriding member procedure normalize (rmpatt char) -- return tokenList,
    ) instantiable final
    /
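
    The type specifications above reference a collection type tokenList that the answer does not define. A minimal sketch, assuming it is a nested table of strings; the element size of 96 simply mirrors the s varchar2(96) buffer used in split and is an assumption. It would have to be created before Normalizer:

    ```sql
    -- Assumed definition of the token collection; not shown in the answer
    create or replace type tokenList as table of varchar2(96)
    /
    ```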
    

    Example implementation of the types:

    create or replace type body Normalizer as
        member procedure split (str varchar2, pattern char) is
            s varchar2(96);
            c int := 0;
        begin
            <<split>> loop c := c + 1;
                s := regexp_substr (str, pattern, 1, c, null, 1);
                exit split when s is null;
                self.tokens.extend;
                self.tokens(tokens.last) := s;
            end loop;
        end split;
        member function join (delimiter char := '; ') return varchar2 is
            ret varchar2 (32767);
        begin
            for i in 1..self.tokens.count loop 
                ret := ret||tokens(i)||delimiter; end loop;
            return rtrim (ret, delimiter);
        end join;
    end;
    /
    
    create or replace type body NormalizePhone as
        constructor function NormalizePhone (
            str varchar2, splitpatt char, rmpatt char) return self as result is
        begin
            self.tokens := tokenList(); 
            self.split (str, splitpatt);
            normalize (rmpatt);
            return;
        end;
        overriding member procedure normalize (rmpatt char) is
        begin 
            for i in 1..self.tokens.count loop
                self.tokens(i) := regexp_replace (self.tokens(i), rmpatt);
            end loop;
        end;
    end;
    /
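
    To illustrate how the hierarchy could be extended to another kind of data, here is a sketch of a second subtype. NormalizeEmail is a hypothetical name that does not appear in the answer, and the rule it applies (remove the characters matched by rmpatt, then lower-case the address) is only an assumption about what email normalization might involve:

    ```sql
    -- Hypothetical subtype for e-mail columns; not part of the original answer
    create or replace type NormalizeEmail under Normalizer (
        constructor function NormalizeEmail (
            str varchar2, splitpatt char, rmpatt char) return self as result,
        overriding member procedure normalize (rmpatt char)
    ) instantiable final
    /
    create or replace type body NormalizeEmail as
        constructor function NormalizeEmail (
            str varchar2, splitpatt char, rmpatt char) return self as result is
        begin
            self.tokens := tokenList();
            self.split (str, splitpatt);    -- inherited from Normalizer
            self.normalize (rmpatt);        -- e-mail specific step below
            return;
        end;
        overriding member procedure normalize (rmpatt char) is
        begin
            for i in 1..self.tokens.count loop
                -- drop unwanted characters, then lower-case the address (assumed rule)
                self.tokens(i) := lower (regexp_replace (self.tokens(i), rmpatt));
            end loop;
        end;
    end;
    /
    ```

    Only normalize differs between the subtypes; split and join stay in the base type, which is what keeps the design in line with DRY.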
    
  2. 0xdb
    2022-04-23T23:41:02Z

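    This is what a solution with the functions collected in a package would look like.

    The final query against the data from the answer above would look like this: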

    with function getphonelist (
        col varchar2, splitpatt varchar2, rmpatt varchar2) return varchar2 is
    begin
        return
            packNormalizer.join (
                packNormalizer.normalize (
                    packNormalizer.split (col, splitpatt), rmpatt));
    end;
    select d.col1, getphonelist (col1, splitpatt, rmpatt) result
    from data1 d
    cross join normalizerconf c
    where c.tabname = 'data1' and colname = 'col1'
    /
    
    COL1                   RESULT                          
    ---------------------- --------------------------------
    123/456789, 032-156789 123456789; 032156789            
    

    Package implementation (on db<>fiddle):

    create or replace package packNormalizer as 
        function split (str varchar2, pattern char) return tokenList;
        function normalize (tl tokenList, rmpatt char) return tokenList;
        function join (tokens tokenList, delimiter char := '; ') return varchar2;
    end;
    /
    
    create or replace package body packNormalizer as
        function split (str varchar2, pattern char) return tokenList is
            tokens tokenList := tokenList ();
            s varchar2(96);
            c int := 0;
        begin
            <<split>> loop c := c + 1;
                s := regexp_substr (str, pattern, 1, c, null, 1);
                exit split when s is null;
                tokens.extend;
                tokens(tokens.last) := s;
            end loop;
            return tokens;
        end split;
        function normalize (tl tokenList, rmpatt char) return tokenList is
            tokens tokenList := tl;
        begin 
            for i in 1..tokens.count loop
                tokens(i) := regexp_replace (tokens(i), rmpatt);
            end loop;
            return tokens;
        end; 
        function join (tokens tokenList, delimiter char := '; ') return varchar2 is
            ret varchar2 (32767);
        begin
            for i in 1..tokens.count loop ret := ret||tokens(i)||delimiter; end loop;
            return rtrim (ret, delimiter);
        end join;
    end;
    /
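
    The three package functions can also be called one step at a time, which makes each stage easy to inspect on its own. A small sketch using the sample row and patterns from the answer; it assumes the package above and the tokenList type exist and that server output is enabled in the client:

    ```sql
    -- in SQL*Plus or SQLcl, run "set serveroutput on" first
    declare
        tl tokenList;
    begin
        -- 1. split the raw string into tokens
        tl := packNormalizer.split ('123/456789, 032-156789', '(.*?)((,\s+)|($))');
        -- 2. remove every run of non-digit characters from each token
        tl := packNormalizer.normalize (tl, '\D+');
        -- 3. pack the tokens back into one string using the default '; ' delimiter
        dbms_output.put_line (packNormalizer.join (tl));
    end;
    /
    ```

    For this input the block is expected to print the same value as the RESULT column shown above.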
    
  3. MaxU - stop genocide of UA
    2022-04-22T21:57:30Z

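    So far I have decomposed it by moving the logic that obtains the RegEx and the logic that normalizes the various entities into separate functions.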

    create or replace function normalize_csv_values (
        str             varchar2,
        typ             varchar2    := 'phone',
        target_sep      char        := '; '
    ) return varchar2
    as
        s           varchar2(96);
        n           int := 0;
        pattern     varchar2(64)    := '\s*[,;]\s*';
        tokens      varchartab      := varchartab ();
        l_re_pref   varchar2(64)    := '([^[:space:]].*?)((';
        l_re_suff   varchar2(64)    := ')|($))';
        l_type      varchar2(128)   := lower(trim(typ));
        l_res       varchar2(4096);
    begin
        -- normalizes CSV-like strings in the following way:
        --  1. in a loop split [str] by the separator that is returned from "get_csv_split_re_pattern()"
        --      1.1 get a single split value
        --      1.2 normalize it, using "normalize_value(val, l_type)"
        --      1.3 append normalized value to collection [tokens]
        --  2. pack values from [tokens] collection back to CSV string using [target_sep] separator
        --  3. return resulting CSV string
    
        -- get split RegEx for the [typ] entity type
        pattern := l_re_pref || get_csv_split_re_pattern(l_type) || l_re_suff;
    
        -- loop through values in the CSV string
        <<split>> loop n := n + 1;
            -- get N-th value from the CSV string
            s := regexp_substr (str, pattern, 1, n, null, 1);
            exit split when s is null;
            s := normalize_value(s, l_type);
            tokens.extend;
            tokens(tokens.last) := s;
        end loop;
    
        -- pack parsed CSV values back to CSV form, using specified [target_sep]
        select listagg(column_value, target_sep) within group (order by rownum)
        into l_res
        from table(tokens);
    
        return l_res;
    end;
    /
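
    The two helpers called above, get_csv_split_re_pattern and normalize_value, are named in this answer but their bodies are not shown. The following is only a sketch of what they might look like, assembled from the patterns and calls in the function from the question. The else branches are assumptions, and normalize_phone, normalize_url and normalize_email are the asker's existing functions, called here with the arguments copied verbatim from the question:

    ```sql
    -- Sketch: returns the separator part of the split RegEx for an entity type.
    -- Expects [typ] already lower-cased and trimmed, as l_type is in the caller.
    create or replace function get_csv_split_re_pattern (
        typ     varchar2
    ) return varchar2
    as
    begin
        return case
            when typ in ('phone', 'fax')        then '\s*(,|;|oder|o\.|or|und|and)\s*'
            when typ in ('url', 'web', 'link')  then '[,;]'
            when typ in ('email')               then '[[:space:],;/]'
            else '\s*[,;]\s*'   -- assumed fallback: the default value of [pattern]
        end;
    end;
    /

    -- Sketch: dispatches a single value to the type-specific normalization function.
    create or replace function normalize_value (
        val     varchar2,
        typ     varchar2
    ) return varchar2
    as
    begin
        return case
            when typ in ('phone', 'fax')
                then normalize_phone(val, 6, 17, '\s*\W?(,|;|oder|o.|und|or)\W?\s*')
            when typ in ('url', 'web', 'link')
                then normalize_url(val, '^(w{3,4}\.)')
            when typ in ('email')
                then normalize_email(val, '[:space:],;/')
            else val            -- assumed fallback: unknown types pass through unchanged
        end;
    end;
    /
    ```

    With this split, the list of supported types lives in exactly two places, and supporting a new entity type means adding one branch to each helper while normalize_csv_values stays untouched.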
    