RError.com

danilshik
Asked: 2020-04-08 01:30:53 +0000 UTC

Python Selenium WebDriver: problem paginating through a website


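Hello everyone. I need to walk through the pagination on a website.

The problem appears somewhere around page 30: the loading animation for the listings hangs indefinitely, so nothing further can be done.

This only happens under Selenium.

If I browse the site myself in a regular browser, everything works fine. What could be causing this?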

from selenium import webdriver
from selenium.common.exceptions import *
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
import time




def get_index_develop():
    global index_develop
    index_develop += 1
    return index_develop

def get_translated_text(text):
    return text

def init_driver():
    ff = "../install/geckodriver.exe"
    # chrome_option = webdriver.ChromeOptions()
    # # chrome_option.add_argument("headless")
    # prefs = {"profile.managed_default_content_settings.images": 2}
    # chrome_option.add_experimental_option("prefs", prefs)


    try:
        driver = webdriver.Firefox(executable_path=ff)
        # driver = webdriver.Chrome(executable_path=ff, options=chrome_option)
        # driver = webdriver.Chrome(executable_path=ff, chrome_options=chrome_option, service_args=service_args)
    except SessionNotCreatedException:
        print("Browser initialization failed. Most likely the browser is not installed. Please contact the parser's developer")
        raise  # without this, `return driver` below fails with UnboundLocalError

    return driver



def close_pop_up_window(driver):
    blocks = driver.find_elements_by_css_selector(
        "div.b-popup.js-popup >div > div.b-popup__header.js-popup__header > div")
    for block in blocks:
        try:
            block.click()
            break
        except:
            continue
    time.sleep(1)


def parse_list_projects(driver):
    urls = []
    driver.get("https://www.hurriyetemlak.com/projeler/projects")




    # Pagination loop
    while True:
        close_pop_up_window(driver)
        refresher = WebDriverWait(driver, 300).until(EC.invisibility_of_element_located((By.CSS_SELECTOR, "div.b-scroll.js-search-left-content.js-preload-parent.b-preload-block.load")))

        items = driver.find_elements_by_css_selector("div.b-snippet__wrapper.js-complex__wrapper")
        for item in items:
            href = item.get_attribute("data-href")
            print("Found project link", href)
            urls.append(href)

        try:
            pagination_block = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, "span.b-pagination__item.b-pagination__item--next.js-pagination-next")))
            pagination_block.click()
            print("Moved to the next page")
        except Exception as e:
            try:
                print("Checking for a pop-up window")
                button_close = driver.find_element_by_css_selector(
                    "button.b-button.b-button--full.b-button--confirm")
                time.sleep(2)
                button_close.click()
                time.sleep(2)
                print("Pop-up closed")
                pagination_block = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
                    (By.CSS_SELECTOR, "span.b-pagination__item.b-pagination__item--next.js-pagination-next")))
                pagination_block.click()
                print("Clicking pagination again")
            except Exception as e:
                try:
                    close_pop_up_window(driver)
                    pagination_block = WebDriverWait(driver, 15).until(EC.visibility_of_element_located(
                        (By.CSS_SELECTOR, "span.b-pagination__item.b-pagination__item--next.js-pagination-next")))
                    pagination_block.click()

                except:
                    print("Pagination not found. Finished moving between pages", e)
                    break

    return urls, driver



if __name__ == '__main__':
    start = time.time()
    driver = init_driver()
    urls, driver = parse_list_projects(driver)
    print("Parsing projects")

    print("Parsing finished. Elapsed time", time.time() - start)

refresher is just the element that is shown while the data is loading.

python

1 Answer

  1. Best Answer
    Sergey Nudnov
    2020-04-08T11:46:42Z

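    Good question, an interesting one! The loading hangs because the site has protection against overly frequent requests. On roughly the thirtieth request it responds with 429 Too Many Requests.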
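One way to check for this kind of throttling outside Selenium is sketched below, using only the standard library. The helper names fetch_status and first_throttled_request are made up for illustration, and the site may treat a non-browser client differently from a real browser (for instance by rejecting it outright), so the request count it reports is indicative at best.

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is on the exception.
        return error.code


def first_throttled_request(url, max_requests=50, fetch=fetch_status):
    """Return the 1-based number of the first request answered with 429,
    or None if none of the first max_requests requests was throttled."""
    for attempt in range(1, max_requests + 1):
        if fetch(url) == 429:
            return attempt
    return None
```

The fetch parameter exists so the counting logic can be exercised with a stub instead of real network traffic.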

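    After looking at what the site serves right after the browser is opened, I guessed that the CloudFlare protection counts each new session separately. That turned out to be the case.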

    The solution is to close the browser every 20 pages and then open it again.
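The restart idea can be sketched independently of Selenium roughly as follows. The names browser_session, scrape_in_batches, make_driver, scrape_page and last_page are assumptions for illustration: make_driver would be something like init_driver, and unlike the script below, this sketch assumes the last page number is known up front and gives every browser exactly pages_per_driver pages.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Open a browser via make_driver() and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if scraping raises, so no browser process is left behind.
        driver.quit()


def scrape_in_batches(make_driver, scrape_page, last_page, pages_per_driver=20):
    """Call scrape_page(driver, page) for pages 1..last_page, using a fresh
    browser for every batch of pages_per_driver pages."""
    results = []
    for first in range(1, last_page + 1, pages_per_driver):
        batch = range(first, min(first + pages_per_driver, last_page + 1))
        with browser_session(make_driver) as driver:
            for page in batch:
                results.extend(scrape_page(driver, page))
    return results
```

Wrapping each batch in a context manager means the browser is closed even when a page fails to load, which a plain loop with driver.quit() at the end does not guarantee.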

    I also tidied up your logic for moving between pages and for closing pop-up dialogs a little.

    Here is the result:

    import sys
    sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='cp1251', buffering=1)
    
    from selenium import webdriver
    from selenium.common.exceptions import *
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver import ActionChains
    import time
    
    
    def get_index_develop():
        global index_develop
        index_develop += 1
        return index_develop
    
    def get_translated_text(text):
        return text
    
    def init_driver():
        ff = "../install/geckodriver.exe"
        chrome_driver = 'C:/Tools/ChromeDriver/chromedriver.exe' 
        chrome_options = webdriver.ChromeOptions()
        # # chrome_option.add_argument("headless")
        # prefs = {"profile.managed_default_content_settings.images": 2}
        # chrome_option.add_experimental_option("prefs", prefs)
    
    
        try:
            # driver = webdriver.Firefox(executable_path=ff)
            driver = webdriver.Chrome(executable_path=chrome_driver, options=chrome_options)
            # driver = webdriver.Chrome(executable_path=ff, chrome_options=chrome_option, service_args=service_args)
            except SessionNotCreatedException:
                print("Browser initialization failed. Most likely the browser is not installed. Please contact the parser's developer")
                raise  # without this, `return driver` below fails with UnboundLocalError
    
        return driver
    
    
    
    def close_pop_up_window(driver):
        blocks = driver.find_elements_by_css_selector(
            "div.b-popup.js-popup >div > div.b-popup__header.js-popup__header > div")
        for block in blocks:
            try:
                block.click()
                print("Pop-up closed")
                time.sleep(1)
                break
            except:
                continue
    
    
    def goto_page(driver, pagenum, url=None, attempts=10):
        # Mode 1: a URL was given (fresh browser), so open the page directly.
        if url:
            driver.get("{}/page{}".format(url, pagenum))
            return True

        # Mode 2: click the page number in the paginator.
        # If there is no link for this page, there are no more pages.
        try:
            driver.find_element_by_xpath("//a[@data-page='{}']".format(pagenum))
        except Exception:
            return False

        count = 0
        while count < attempts:
            try:
                close_pop_up_window(driver)
                WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
                    (By.XPATH, "//a[@data-page='{}']".format(pagenum))
                )).click()
                return True
            except Exception:
                count += 1
                print("Attempt {}".format(count))
                time.sleep(0.1)

        return False
    
    
    def parse_list_projects(url):
        driver = None
        pages_per_driver = 20
        page = 1
        urls = []
    
        # Pagination loop
        while True:
            print("Page {}".format(page))
            if not driver: 
                driver = init_driver()
                goto_page(driver, page, url=url)
    
                cookies_accept = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
                    (By.CSS_SELECTOR, ".b-button.b-button--full.b-button--confirm.b-cookies-notification__accept.js-cookies-notification__accept")))
                driver.execute_script('arguments[0].scrollIntoView(true);', cookies_accept)
                cookies_accept.click()
            else: 
                close_pop_up_window(driver)
                footer = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
                    (By.CSS_SELECTOR, ".b-footer__divider")))
                driver.execute_script('arguments[0].scrollIntoView(true);', footer)
                if not goto_page(driver, page):
                    break
    
            close_pop_up_window(driver)
            refresher = WebDriverWait(driver, 300).until(EC.invisibility_of_element_located((By.CSS_SELECTOR, "div.b-scroll.js-search-left-content.js-preload-parent.b-preload-block.load")))
    
            items = driver.find_elements_by_css_selector("div.b-snippet__wrapper.js-complex__wrapper")
            for item in items:
                href = item.get_attribute("data-href")
                print("Found project link", href)
                urls.append(href)
    
            page += 1
            if not page % pages_per_driver:
                print("Restarting the browser")
                driver.quit()
                driver = None
    
        driver.quit()
        return urls, driver
    
    
    
    if __name__ == '__main__':
        start = time.time()
        print("Parsing projects")
        urls, driver = parse_list_projects("https://www.hurriyetemlak.com/projeler/projects")
        print("Parsing finished. Elapsed time", time.time() - start)
    

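    Instead of next_page, goto_page is now used. It works in two modes:

    • jumping to the required page by URL, right after the browser has been opened
    • jumping to the required page by clicking its number in the paginator; this runs faster, so it is worth keeping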
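To see which of the two modes each page ends up using, the restart rule from the loop above can be replayed without a browser. This is only a sketch: plan_navigation and its total_pages parameter are hypothetical, since the real script does not know the page count in advance and simply stops when the paginator link is missing.

```python
def plan_navigation(total_pages, pages_per_driver=20):
    """Return (page, mode) pairs mirroring the loop in parse_list_projects.

    mode is "url" when the page is opened in a freshly started browser
    and "click" when it is reached through the paginator.
    """
    plan = []
    fresh_browser = True  # no browser is open before the first page
    for page in range(1, total_pages + 1):
        mode = "url" if fresh_browser else "click"
        plan.append((page, mode))
        # Same check as the script: the browser is restarted once the
        # *next* page number is a multiple of pages_per_driver.
        fresh_browser = (page + 1) % pages_per_driver == 0
    return plan
```

With the default of 20, pages 1, 20, 40 and so on are opened by URL in a fresh browser, and every page in between is reached by clicking.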

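    Also, right after opening the browser I accept the cookie prompt, and each time a page loads I scroll it down to the bottom. This may not be necessary; you can debug that further yourself.

    I also prefer Google Chrome, but I believe this will work under Firefox as well.

    While I was writing this answer, the script ran to completion. Total time: 1094 seconds.

    P.S. The first two lines of the code are there for Cyrillic output in my console...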
