Грузчик

Asked: 2023-02-23 03:55:50 +0800 CST

What is the minimal set of C functions that needs to be implemented to work with UTF-8?

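In your experience, what is the minimal set of functions that needs to be implemented for the UTF-8 encoding, so that other typical string-handling functions can be written on top of them without transcoding the bytes?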

1 Answer

avp · Answer 1 · 2023-02-23T06:56:49+08:00

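Apart from strchr/islower/toupper and other functions that operate on a single character, all string functions (well, perhaps almost all) work fine with UTF-8, provided that the delimiters passed to strtok/strspn and the like are plain ASCII, which is almost always the case in typical tasks.

Although strstr can be used in place of strchr, a few UTF-8-oriented functions are still useful.

For example, these two: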
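As a small illustration of the strstr-for-strchr idea, here is a minimal sketch. The name `utf8_strchr` is hypothetical (it is not part of the answer), and the sketch assumes both arguments are valid, NUL-terminated UTF-8; in that case a byte-wise match cannot start in the middle of another character, because lead bytes and continuation bytes never share values.

```c
#include <string.h>

/* Hypothetical helper: behaves like strchr(), except that the "character"
   to look for is passed as a NUL-terminated UTF-8 sequence (1 to 4 bytes).
   Returns a pointer to the first occurrence in `s`, or NULL if absent. */
const char *utf8_strchr(const char *s, const char *utf8_sym)
{
    return strstr(s, utf8_sym);
}
```

For example, searching the bytes of "при" for the two-byte sequence of "р" returns a pointer two bytes past the start of the string.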

// returns length in bytes of utf-8 symbol
// for invalid symbols returns 1 too
// see https://ru.wikipedia.org/wiki/UTF-8
int
utf8_len (const char *s)
{
  static int step[32] = {[0 ...  15] = 1, // 0xxx xxxx  ascii
                         [16 ... 23] = 1, // 10xx xxxx  invalid 
                         [24 ... 27] = 2, // 110x xxxx  2-bytes utf-8
                         3, 3,            // 1110 xxxx  3 bytes utf-8
                         4,               // 1111 0xxx  4 bytes utf-8
                         1                // 1111 1xxx  invalid
  };

  return step[((unsigned char *)s)[0] >> 3];
}

// for valid utf-8 symbol put unicode to `*puc` and zero terminated utf8 bytes to `utf8_sym[]`
//
// returns length in bytes of valid utf-8 symbol 
// 0   if invalid first byte
// -N  if invalid N-th byte
int
utf8_symbol (const char *str, int *puc, char utf8_sym[])
{
  utf8_sym[0] = *str; utf8_sym[1] = 0;
  
  int s = utf8_len(str);
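  // payload bits of the first byte, indexed by sequence length
  // (mask[1] must be 0x7f, otherwise every ASCII symbol decodes to 0)
  static int mask[5] = {0, 0x7f, 0x1f, 0xf, 0x7};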
  int uc = 0;

  if (s == 1 && (*str & 0x80))
    return 0;

  uc = *str & mask[s];
  for (int i = 1; i < s; i++) {
    if ((str[i] & 0xc0) != 0x80)
      return -i;
    utf8_sym[i] = str[i]; utf8_sym[i + 1] = 0;
    uc = (uc << 6) | (str[i] & 0x3f);
  }

  if (puc)
    *puc = uc;
  return s;
}

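Based on utf8_len(), it is easy to write an analog of strlen() that returns the length of a string in UTF-8 characters (clearly, utf8_len is about an order of magnitude faster than utf8_symbol, which does a lot of extra work).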
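A minimal sketch of such a strlen() analog follows. The name `utf8_strlen` is an assumption, and utf8_len() is restated here with plain comparisons instead of the lookup table, only so the fragment compiles on its own without the GNU range-designator extension; the lengths it returns are the same as in the table version.

```c
#include <stddef.h>

/* Restatement of utf8_len(): byte length of the UTF-8 sequence that starts
   at s[0]; continuation bytes and invalid lead bytes count as 1. */
static int utf8_len(const char *s)
{
    unsigned char c = (unsigned char)s[0];

    if (c < 0x80) return 1;            /* 0xxx xxxx  ascii */
    if ((c & 0xe0) == 0xc0) return 2;  /* 110x xxxx  2 bytes */
    if ((c & 0xf0) == 0xe0) return 3;  /* 1110 xxxx  3 bytes */
    if ((c & 0xf8) == 0xf0) return 4;  /* 1111 0xxx  4 bytes */
    return 1;                          /* 10xx xxxx or 1111 1xxx  invalid */
}

/* Hypothetical name: length of `s` in UTF-8 characters rather than bytes. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    while (*s) {
        int step = utf8_len(s);

        /* Advance byte by byte so that a truncated final sequence
           cannot carry the pointer past the terminating NUL. */
        for (int i = 0; i < step && *s; i++)
            s++;
        n++;
    }
    return n;
}
```

No validation is done here: each lead byte is trusted, so malformed input still yields a count, just not a meaningful one.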

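You can check whether the Unicode code point returned by utf8_symbol falls within the Cyrillic range (from 0x0400 to 0x04ff; the Russian letters are 0x0410 to 0x044f, plus 0x0401 (Ё) and 0x0451 (ё)) and build an is_rus_letter() function or something similar.

If you are focused on working with Russian letters, you can usually apply the following conversion: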
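The range check described above could look roughly like this. Both functions take a code point such as the one utf8_symbol() stores in `*puc`; `is_cyrillic` is a hypothetical companion name, and only the ranges quoted in the text are used.

```c
/* Hypothetical helper: non-zero if `uc` lies in the Cyrillic block
   0x0400 ... 0x04ff. */
int is_cyrillic(int uc)
{
    return uc >= 0x0400 && uc <= 0x04ff;
}

/* Non-zero if `uc` is a Russian letter: А ... я (0x0410 ... 0x044f),
   Ё (0x0401) or ё (0x0451). */
int is_rus_letter(int uc)
{
    return (uc >= 0x0410 && uc <= 0x044f) || uc == 0x0401 || uc == 0x0451;
}
```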

// gets valid 2-bytes utf-8 symbol
// returns remap russian unicode to 0 ... 65 (А = 0 ... я = 63; Ё = 64,  ё = 65)
//         or unicode  for other (not russian) 2-bytes utf-8 
// call this function only if utf8_symbol() (well, may be utf8_len() too) returns 2 
// see https://symbl.cc/en/unicode/blocks/cyrillic/
int
get_rus_code (const char *s)
{
  int u_code = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f); // unicode for 2-bytes utf-8

  // remap russian unicode to 0 ... 63, 
  if (u_code >= 0x410 && u_code <= 0x44f)
    return u_code - 0x410; // А ... я

  if (u_code == 0x401) 
    return 64; // Ё

  if (u_code == 0x451) 
    return 65; // ё

  return u_code; // for other valid 2-bytes utf-8 the result is from 0x80 to 0x7FF
}

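With this function it is easy to classify Russian letters as uppercase/lowercase (the first 32 codes are uppercase) or as vowels/consonants (membership in a uint64_t bitmask).
(Incidentally, here you can see that the letters Ё and ё are clearly redundant in our language :-))

As an addition (and as an illustration of the first paragraph of this answer), here is a function that lets you read a UTF-8 string word by word: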
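One possible shape for that classification is sketched below. The function names are hypothetical, and the argument is assumed to be a value returned by get_rus_code(), so 0 ... 63 means А ... я, 64 and 65 mean Ё and ё, and anything from 0x80 upward is some other two-byte symbol. Codes 64 and 65 do not fit into a 64-bit mask, which is why they are tested separately.

```c
#include <stdint.h>

/* Bits for the uppercase vowels А Е И О У Ы Э Ю Я, numbered from А = 0. */
#define RUS_VOWELS_UPPER \
    ((UINT64_C(1) << 0)  | (UINT64_C(1) << 5)  | (UINT64_C(1) << 8)  | \
     (UINT64_C(1) << 14) | (UINT64_C(1) << 19) | (UINT64_C(1) << 27) | \
     (UINT64_C(1) << 29) | (UINT64_C(1) << 30) | (UINT64_C(1) << 31))

/* Lowercase letters sit exactly 32 positions higher, so the same pattern
   shifted by 32 covers them. */
static const uint64_t rus_vowels =
    RUS_VOWELS_UPPER | (RUS_VOWELS_UPPER << 32);

/* Non-zero for А ... Я (codes 0 ... 31) and Ё (code 64). */
int rus_is_upper(int code)
{
    return (code >= 0 && code < 32) || code == 64;
}

/* Non-zero for Russian vowels in either case, including Ё and ё. */
int rus_is_vowel(int code)
{
    if (code == 64 || code == 65)
        return 1;
    return code >= 0 && code < 64 && ((rus_vowels >> code) & 1);
}
```

A consonant test would need its own mask rather than the negation of this one, since ъ and ь are neither vowels nor consonants.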

// returns position of word in `s[]`
//  or EOF if there are no words
// puts length of the word (in bytes) into `*wlen`
int
get_word (const char *s, int *wlen)
{
#define W_SEPARATORS    " \n\r\t"
  int pfx_l = strspn(s, W_SEPARATORS); // skip initial spaces

  if (!s[pfx_l])
    return EOF;

  *wlen = strcspn(s + pfx_l, W_SEPARATORS); // all non-spaces are taken as a word

  return pfx_l;
}

And an example of its use:

  ....
  char *cur_ptr = str;
  for (int word_start = 0, word_len = 0;
       (word_start = get_word(cur_ptr, &word_len)) != EOF;
       cur_ptr += (word_start + word_len)) {

    char word[word_len + 1];
    memcpy(word, cur_ptr + word_start , word_len); word[word_len] = 0;
    
    ....
  }
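To show the same loop with the elided parts filled in by something concrete, here is a hedged sketch that merely counts words. `count_words` is a hypothetical name, and get_word() is restated (with explicit casts from size_t) only so the fragment compiles on its own. Because the separators are plain ASCII, the multi-byte UTF-8 words are never split.

```c
#include <stdio.h>   /* EOF */
#include <string.h>  /* strspn, strcspn */

#define W_SEPARATORS " \n\r\t"

/* Same logic as get_word() above: offset of the next word in `s`,
   or EOF if only separators remain; the word's byte length goes to `*wlen`. */
static int get_word(const char *s, int *wlen)
{
    int pfx_l = (int)strspn(s, W_SEPARATORS);

    if (!s[pfx_l])
        return EOF;

    *wlen = (int)strcspn(s + pfx_l, W_SEPARATORS);
    return pfx_l;
}

/* Hypothetical helper: number of whitespace-separated words in `str`. */
int count_words(const char *str)
{
    const char *cur_ptr = str;
    int word_start, word_len = 0, n = 0;

    while ((word_start = get_word(cur_ptr, &word_len)) != EOF) {
        cur_ptr += word_start + word_len;  /* step past the word just found */
        n++;
    }
    return n;
}
```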

If you have any questions, feel free to ask in the comments.
