RError.com

Vyacheslav
Asked: 2025-04-02 14:14:51 +0000 UTC

Word file cannot be read in Python code

Why can't the contents of the Word file be read and printed in this Python code:

import bs4
import time
import random
import requests
import docx
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import magic
import chardet
import codecs
from io import BytesIO
from docx import Document

from selenium import webdriver  # pip install selenium


# List of user agents
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.4 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
]


# Return a random user agent
def get_random_user_agent():
    return random.choice(user_agents)


# Browser setup
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f"user-agent={get_random_user_agent()}")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option("useAutomationExtension", False)

data = []
# Using webdriver

session = HTMLSession()
response = session.get(
    "https://mos-gorsud.ru/mgs/search?caseDateFrom=16.02.2023&caseDateTo=28.02.2023&courtAlias=mgs&documentStatus=2&processType=6&formType=fullForm&page=2")
time.sleep(3)  # Extra delay in case it is needed; avoid overusing sleep

soup = BeautifulSoup(response.text, 'html.parser')
heads = soup.find('table', class_='custom_table').find_all('tr')
print(len(heads))
for head in heads[1:]:
    link = 'https://mos-gorsud.ru' + head.find('nobr').find('a')['href']
    print(link)
    loom = session.get(link)
    abble = BeautifulSoup(loom.text, 'html.parser')
    
    documents = abble.find('table', {'class': 'custom_table mainTable'}).find('tbody').find_all('tr')
    for document in documents:
        if "Приговор" in document.text:
            score = document.find_all('td')

            print(len(score))
            for soc in score:
                stock = soc.find_all('a')
                for sto in stock:
                    print('Prigovor: ' + 'https://mos-gorsud.ru' + sto['href'])
                    link_doc = 'https://mos-gorsud.ru' + sto['href']
                    response = requests.get(link_doc, get_random_user_agent())

                    # Check that the request succeeded
                    if response.status_code == 200:
                        # Save the file to disk
                        with open('prigovor.docx', 'wb') as file:
                            file.write(response.content)

                        # Open the Word document and extract its text
                        document = Document('prigovor.docx')
                        text = '\n'.join([paragraph.text for paragraph in document.paragraphs])
                        resheniye = ' '.join(text.split())

                        # Print the link and the text
                        print('File link: https://mos-gorsud.ru' + sto['href'])
                        print(resheniye)
                    else:
                        print(f"Error downloading file: {response.status_code}")

        elif "Постановление суда апелляционной инстанции" in document.text:
            score = document.find_all('td')

            print(len(score))
            for soc in score:
                stock = soc.find_all('a')
                for sto in stock:
                    print('Postanovleniye : ' + 'https://mos-gorsud.ru' + sto['href'])
                    link_pod = 'https://mos-gorsud.ru' + sto['href']
                    response = requests.get(link_pod, get_random_user_agent())

                    # Check that the request succeeded
                    if response.status_code == 200:
                        # Save the file to disk
                        with open('resheniye.docx', 'wb') as file:
                            file.write(response.content)

                        # Open the Word document and extract its text
                        document = Document('resheniye.docx')
                        text = '\n'.join([paragraph.text for paragraph in document.paragraphs])
                        postanov = ' '.join(text.split())

                        # Print the link and the text
                        print('File link: https://mos-gorsud.ru' + sto['href'])
                        print(postanov)
                    else:
                        print(f"Error downloading file: {response.status_code}")

            print('\n')

It outputs:

Traceback (most recent call last):
  File "C:\Users\user\PycharmProjects\cases_pars\little.py", line 143, in <module>
    document = Document('resheniye.docx')
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\PycharmProjects\cases_pars\.venv\Lib\site-packages\docx\api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
                                         ^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\PycharmProjects\cases_pars\.venv\Lib\site-packages\docx\opc\package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\PycharmProjects\cases_pars\.venv\Lib\site-packages\docx\opc\pkgreader.py", line 22, in from_file
    phys_reader = PhysPkgReader(pkg_file)
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\PycharmProjects\cases_pars\.venv\Lib\site-packages\docx\opc\phys_pkg.py", line 21, in __new__
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
docx.opc.exceptions.PackageNotFoundError: Package not found at 'resheniye.docx'
python

2 Answers

  1. Best Answer
    Eugene Evstafev
    2025-04-02T18:32:22Z

    Evidently, the error occurs because

    document = Document('resheniye.docx')


    cannot read the file. This is probably because the file was not downloaded and/or saved correctly.
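
One way to check what was actually saved is to look at the first bytes of the download. The sketch below assumes the usual file signatures: a .docx is a ZIP archive, while a legacy .doc is an OLE2 compound file. As far as I can tell, python-docx reports PackageNotFoundError for a path that is neither a ZIP archive nor a directory, even when the file exists on disk. The function name sniff_word_payload and the labels it returns are illustrative, not part of any library.

```python
import zipfile
from io import BytesIO

# Signature of the OLE2 compound file format used by legacy .doc files
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_payload(payload: bytes) -> str:
    """Guess the container format of downloaded bytes (labels are illustrative)."""
    if zipfile.is_zipfile(BytesIO(payload)):
        return "zip"      # a .docx is a ZIP archive, so python-docx can try to open it
    if payload.startswith(OLE2_MAGIC):
        return "ole2"     # legacy .doc container, which python-docx does not read
    head = payload[:512].lstrip().lower()
    if head.startswith((b"<!doctype", b"<html")):
        return "html"     # an error page was returned instead of a document
    return "unknown"
```

In the question's loop this could be called as sniff_word_payload(response.content) before the file is written, to decide which extension to use.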

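    To fix the problem, you can try the following:

    1. Set the request headers correctly. requests.get treats its second positional argument as params (query-string data), so the user agent should be passed via headers rather than positionally. For example, replace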

      requests.get(link_pod, get_random_user_agent())
      

      with

      requests.get(link_pod, headers={"User-Agent": get_random_user_agent()})
      
    2. To pass the file from memory directly to Document, you can try:

      import requests
      from io import BytesIO
      from docx import Document
      
      link_doc = "https://calibre-ebook.com/downloads/demos/demo.docx"
      
      response = requests.get(link_doc) # , headers={"User-Agent": get_random_user_agent()}
      
      if response.status_code == 200:
         document = Document(BytesIO(response.content))
         for para in document.paragraphs:
            print(para.text)
      
    3. Add a try-except block to track errors. This lets you catch the exception and get more details about the problem:

      try:
          document = Document('resheniye.docx')
      except Exception as e:
          print(f"Error opening file: {e}")
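
The three suggestions above can be combined into one helper. This is a minimal sketch, not the only way to do it: the names build_headers and fetch_docx_text and the 30-second timeout are assumptions, and the list of user agents is shortened. It also assumes that a non-.docx payload passed as a stream surfaces as zipfile.BadZipFile (python-docx hands the stream to the ZIP reader), while a bad path surfaces as PackageNotFoundError, so both are caught.

```python
import random
from io import BytesIO
from zipfile import BadZipFile

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
]


def build_headers(agents=USER_AGENTS):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(agents)}


def fetch_docx_text(url, timeout=30):
    """Download a .docx and return its paragraph text, or None if it cannot be parsed."""
    # Third-party imports are kept local so build_headers works without them.
    import requests
    from docx import Document
    from docx.opc.exceptions import PackageNotFoundError

    response = requests.get(url, headers=build_headers(), timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    try:
        # Parse straight from memory, without writing a file to disk
        document = Document(BytesIO(response.content))
    except (BadZipFile, PackageNotFoundError) as error:
        # Assumption: a payload that is not a .docx ends up in one of these two exceptions
        print(f"Not a readable .docx ({type(error).__name__}): {error}")
        return None
    return "\n".join(paragraph.text for paragraph in document.paragraphs)
```

Catching the specific exceptions instead of a bare Exception keeps unrelated bugs, such as a typo in a variable name, from being silently swallowed.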
      

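    PS: I also recommend restructuring the code to avoid duplication. For example, you could move the checks if "Приговор" in document.text and elif "Постановление суда апелляционной инстанции" in document.text inside the for soc in score: loop, so that the download and parsing code is written only once.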
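
A slightly different way to get the same effect is a lookup table that maps each row label to a file name, so the two branches collapse into one code path. The sketch below works on plain data so it can be tried without the site: each row is a (row_text, hrefs) pair, which in the question's code would come from document.text and the sto['href'] values. The names DOCUMENT_KINDS, classify_row, and plan_downloads are hypothetical.

```python
BASE_URL = "https://mos-gorsud.ru"

# Row label -> base name for the saved file (names taken from the question's code)
DOCUMENT_KINDS = {
    "Приговор": "prigovor",
    "Постановление суда апелляционной инстанции": "resheniye",
}


def classify_row(row_text):
    """Return the base file name for a table row, or None if the row is not of interest."""
    for label, base_name in DOCUMENT_KINDS.items():
        if label in row_text:
            return base_name
    return None


def plan_downloads(rows):
    """Turn (row_text, hrefs) pairs into (base_name, absolute_url) pairs."""
    planned = []
    for row_text, hrefs in rows:
        base_name = classify_row(row_text)
        if base_name is None:
            continue  # skip rows that are neither a verdict nor an appellate ruling
        for href in hrefs:
            planned.append((base_name, BASE_URL + href))
    return planned
```

Each planned pair can then go through a single download-and-parse routine, so a fix such as the headers change only has to be made in one place.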

    • 2
  2. Vyacheslav
    2025-04-02T20:21:27Z

    Solution:

    import bs4
    import time
    import random
    import requests
    import docx
    from requests_html import HTMLSession
    from win32com.client import Dispatch
    import codecs
    from io import BytesIO
    from docx import Document
    from selenium import webdriver  # pip install selenium
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager  # pip install webdriver-manager
    
    # List of user agents
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.4 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
    ]
    
    
    # Return a random user agent
    def get_random_user_agent():
        return random.choice(user_agents)
    
    
    # Browser setup
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f"user-agent={get_random_user_agent()}")
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option("useAutomationExtension", False)
    
    data = []
    # Using webdriver
    with webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options) as driver:
        driver.maximize_window()
        driver.get(
            "https://mos-gorsud.ru/mgs/search?caseDateFrom=16.02.2023&caseDateTo=28.02.2023&courtAlias=mgs&documentStatus=2&processType=6&formType=fullForm&page=2")
    
        time.sleep(3)  # Extra delay in case it is needed; avoid overusing sleep
    
        soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
        heads = soup.find('table', class_='custom_table').find_all('tr')
        print(len(heads))
        for head in heads[1:]:
            link = 'https://mos-gorsud.ru' + head.find('nobr').find('a')['href']
            # print(link)
            driver.get(link)
    
            time.sleep(3)  # Extra delay in case it is needed; avoid overusing sleep
    
            soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
            documents = soup.find('table', {'class': 'custom_table mainTable'}).find('tbody').find_all('tr')
            for document in documents:
                if "Приговор" in document.text:
                    score = document.find_all('td')
    
                    print(len(score))
                    for soc in score:
                        stock = soc.find_all('a')
                        for sto in stock:
                            # print('Prigovor : ' + 'https://mos-gorsud.ru' + sto['href'])
                            link_doc = 'https://mos-gorsud.ru' + sto['href']
                            session = HTMLSession()
                            response = session.get(
                                link_doc)
    
                            if response.status_code != 200:
                                print(f"Error downloading file: {response.status_code}")
                                continue

                            # Save the download as .doc and close the file before Word opens it
                            with open("prigovor.doc", "wb") as file:
                                file.write(response.content)


                            def convert_prigovor_to_docx(input_file, output_file):
                                """
                                Convert a .doc file to .docx format.

                                Arguments:
                                input_file -- path to the input .doc file
                                output_file -- path to the output .docx file
                                """
                                word = Dispatch("Word.Application")
                                word.Visible = False

                                try:
                                    doc = word.Documents.Open(input_file)
                                    doc.SaveAs2(output_file, FileFormat=16)  # 16 = .docx format
                                    doc.Close()
                                except Exception as e:
                                    print(f"Error converting file: {e}")
                                finally:
                                    word.Quit()


                            input_file = r"C:\Users\user\PycharmProjects\cases_pars\prigovor.doc"
                            output_file = r"C:\Users\user\PycharmProjects\cases_pars\prigovor.docx"
                            convert_prigovor_to_docx(input_file, output_file)

                            # Open the Word document and extract its text
                            document = Document(output_file)
                            text = '\n'.join([paragraph.text for paragraph in document.paragraphs])
                            prigovor = ' '.join(text.split())

                            # Print the link and the text
                            print('File link: https://mos-gorsud.ru' + sto['href'])
                            print(prigovor)
    
                elif "Постановление суда апелляционной инстанции" in document.text:
                    score = document.find_all('td')
    
                    print(len(score))
                    for soc in score:
                        stock = soc.find_all('a')
                        for sto in stock:
                            # print('Postanovleniye : ' + 'https://mos-gorsud.ru' + sto['href'])
                            link_pod = 'https://mos-gorsud.ru' + sto['href']
                            session = HTMLSession()
                            response = session.get(
                                link_pod)
    
                            if response.status_code != 200:
                                print(f"Error downloading file: {response.status_code}")
                                continue

                            # Save the download as .doc and close the file before Word opens it
                            with open("resheniye.doc", "wb") as file:
                                file.write(response.content)


                            def convert_doc_to_docx(input_file, output_file):
                                """
                                Convert a .doc file to .docx format.

                                Arguments:
                                input_file -- path to the input .doc file
                                output_file -- path to the output .docx file
                                """
                                word = Dispatch("Word.Application")
                                word.Visible = False

                                try:
                                    doc = word.Documents.Open(input_file)
                                    doc.SaveAs2(output_file, FileFormat=16)  # 16 = .docx format
                                    doc.Close()
                                except Exception as e:
                                    print(f"Error converting file: {e}")
                                finally:
                                    word.Quit()


                            input_file = r"C:\Users\user\PycharmProjects\cases_pars\resheniye.doc"
                            output_file = r"C:\Users\user\PycharmProjects\cases_pars\resheniye.docx"
                            convert_doc_to_docx(input_file, output_file)

                            # Open the Word document and extract its text
                            document = Document(output_file)
                            text = '\n'.join([paragraph.text for paragraph in document.paragraphs])
                            resheniye = ' '.join(text.split())

                            # Print the link and the text
                            print('File link: https://mos-gorsud.ru' + sto['href'])
                            print(resheniye)
    
                print('\n')
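
The two nearly identical conversion functions in this solution could be folded into one helper defined once at module level. The sketch below assumes Windows with Microsoft Word and pywin32 installed, as in the code above. The names docx_path_for and convert_doc_to_docx and the constant name are illustrative, and paths are resolved to absolute form on the assumption that Word may not share Python's working directory.

```python
from pathlib import Path

# Value 16 is the save format the solution above passes to SaveAs2 for .docx
WD_FORMAT_DOCUMENT_DEFAULT = 16


def docx_path_for(doc_path):
    """Return the absolute .docx path that sits next to the given .doc file."""
    return Path(doc_path).resolve().with_suffix(".docx")


def convert_doc_to_docx(doc_path):
    """Convert a legacy .doc to .docx through Word automation and return the new path."""
    # Imported here because pywin32 is only available on Windows
    from win32com.client import Dispatch

    source = Path(doc_path).resolve()
    target = docx_path_for(source)
    word = Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(str(source))
        try:
            doc.SaveAs2(str(target), FileFormat=WD_FORMAT_DOCUMENT_DEFAULT)
        finally:
            doc.Close()  # close the document even if saving fails
    finally:
        word.Quit()  # always shut down the hidden Word instance
    return target
```

With this helper, both branches would call convert_doc_to_docx("prigovor.doc") or convert_doc_to_docx("resheniye.doc") and pass the returned path to Document, without hard-coded project directories.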
    
    • 0
