
Question

Антон

Asked: 2020-01-06 23:29:13 +0000 UTC

sregex_iterator finds no matches in a string


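I have a string containing HTML that needs to be parsed with a regular expression. I need to collect every URL found in an href="" attribute on the page into a std::vector. My C++ regex code does not work.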

#include <regex>
#include <iostream>
#include <string>

using std::string;
using std::regex;
using std::cout;
using std::endl;
using std::sregex_iterator;
using std::smatch;

int main()
{
    string subject("<head><title>Search engines</title></head><body><a href=\"https://yandex.ru\">Yandex</a><a href=\"https://google.com\"></a></body>");

    try {
        regex re("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"");
        sregex_iterator next(subject.begin(), subject.end(), re);
        sregex_iterator end;

        if (next == end)
            cout << "Oops" << endl;

        while (next != end) {
            smatch match = *next;
            cout << match.str() << endl;
            next++;
        }
    } catch (std::regex_error& e) {
        ; // Syntax error in the regular expression
    }

    return 0;
}

Only the Python version works.

#!/usr/bin/python3
import re

html = '<head><title>Search engines</title></head><body><a href="https://yandex.ru">Yandex</a><a href="https:/google.com"></a></body>'

title = re.findall(r'<title>(.*?)</title>', html)[0]
links = [ x[1] for x in re.findall(r'<a\s+(?:[^>]*?\s+)?href=(["\'])(.*?)\1', html)]

print (title)
print (links)

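I suppose you could spend a week leafing through Jeffrey Friedl's guide to regular expressions and the regex library documentation and get the result you want, but Stack Overflow does not exist for advice like "read Friedl, don't ask to be spoon-fed." Besides, for such a seemingly useful question, there is no answer on Stack Overflow that makes this work.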

1 Answer


Wiktor Stribiżew · Answer 1 · 2020-01-07T00:36:20Z

Best Answer


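You can fix the code by adding the std::regex_constants::icase flag. You can also use sregex_token_iterator with 1 as the fourth argument to get the value of capture group #1. In Python, re.findall returns only the captured substrings when the pattern contains capture groups; C++ has no equivalent function.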
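The pattern in the question contains an uppercase `A`, while the markup uses a lowercase `<a`, and std::regex matching is case-sensitive by default. A minimal sketch of that difference follows; the helper name `has_match` is hypothetical and exists only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the standard library): returns true if
// `pattern` matches anywhere in `text`. When `ignore_case` is set, the regex
// is compiled with std::regex_constants::icase.
bool has_match(const std::string& text, const std::string& pattern, bool ignore_case) {
    const auto flags = ignore_case
        ? (std::regex_constants::ECMAScript | std::regex_constants::icase)
        : std::regex_constants::ECMAScript;
    const std::regex re(pattern, flags);
    return std::regex_search(text, re);
}
```

Called with the text `<a href="https://yandex.ru">` and the pattern `<\s*A\s+[^>]*href`, this should return false without the flag and true with it.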

A working C++ code example:

#include <iostream>
#include <string>
#include <vector>
#include <regex>
using namespace std;

int main() {
    regex re("<\\s*A\\s+(?:[^>]*?\\s+)?href\\s*=\\s*\"([^\"]*)\"", std::regex_constants::icase);
    string subject("<head><title>Search engines</title></head><body><a href=\"https://yandex.ru\">Yandex</a><a href=\"https://google.com\"></a></body>");
    vector<string> result(sregex_token_iterator(subject.begin(), subject.end(), re, 1),
                               sregex_token_iterator());

    for( auto & s : result ) cout << s << endl;
    return 0;
}
// => https://yandex.ru, https://google.com
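If you would rather keep sregex_iterator, as in the question, the captured URL is available as sub-match 1 of each match; `match.str()` with no argument returns the whole matched text, not the group. The sketch below assumes the same pattern idea as the question (with icase added); the function name `extract_hrefs` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects capture group #1 (the href value) of every
// match of the anchor-tag pattern in `html`.
std::vector<std::string> extract_hrefs(const std::string& html) {
    static const std::regex re(
        "<\\s*a\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"",
        std::regex_constants::icase);

    std::vector<std::string> urls;
    // A default-constructed sregex_iterator acts as the end-of-sequence marker.
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it) {
        urls.push_back((*it)[1].str());  // [0] is the whole match, [1] the first group
    }
    return urls;
}
```

This is only an illustration of reading capture groups from an iterator; like any regex over HTML, it handles just double-quoted href values in simple markup.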
