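In two columns of an HTML page, I need to:
- Select only the green text.
- If the text is black rather than green, keep NaN instead.
- Store all of these values in a pandas dataframe.
Example of the HTML table (sorry, it's Excel):
The result I want in the final dataframe:
Code: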
from bs4 import BeautifulSoup
import pandas as pd

cons_df = pd.DataFrame()
data = []
with open("test.html", encoding='utf-8') as html:
    soup = BeautifulSoup(html, "html.parser")  # page with the tables, saved beforehand
    # select by green colour (the text I need to scrape is only under this "font[color" tag)
    table = soup.select('font[color="#00875a"]')
    for i in range(0, len(table)):
        rows = [table[i].get_text()]
        data.append(rows)

df = pd.DataFrame(data, columns=['mix'])  # dataframe with only the green values
df['mix'] = df['mix'].str.strip()

# I needed to separate the strings from the dates somehow, so I filtered them with startswith:
val_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
str_val = df[df.mix.str.startswith(tuple(val_list))]
dates = df[~df.mix.str.startswith(tuple(val_list))]
str_val = str_val.reset_index(drop=True)
dates = dates.reset_index(drop=True)
cons_df = pd.concat([cons_df, str_val, dates], axis=1)
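The code seems to work fine, but I need to add a part that leaves NaN in place of the black values. With this code I get the following result:
This is what I found on Google, but I couldn't rewrite it for my own case: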
A function for BeautifulSoup in Python that returns the text of the first tag if it exists, or an empty string if not. Useful for web scraping where empty string NaNs are desired. This function is one I use a lot for scraping projects, but it is likely something you should modify for your own needs.
Parameters: soup-> the bs4 soup item, tag_class-> the class of the desired tag (optional), return_text-> should the function return the text of the item if possible or the item itself(?).
def get_text_if_exists(soup, tag, tag_class=None, return_text=True):
    if tag_class:
        item = soup.find(tag, {"class": tag_class})
    else:
        item = soup.find(tag)
    if item and return_text:
        return item.text
    elif item:
        return item
    return ""
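One way the helper above might be adapted to this case is sketched below. Instead of looking up a tag by class, it looks inside a single table cell for a `font` tag with the green colour and falls back to NaN when there is none. The names `green_text_or_nan` and `GREEN` are made up for illustration, and the inline HTML is a cut-down stand-in for the real page.

```python
import numpy as np
from bs4 import BeautifulSoup

GREEN = "#00875a"  # colour value taken from the selector in the question


def green_text_or_nan(cell):
    """Return the stripped text of the green <font> tag in `cell`, or NaN."""
    item = cell.find("font", {"color": GREEN})
    if item:
        return item.get_text(strip=True)
    return np.nan  # black text and empty cells both end up here


# Tiny stand-in for two cells of the real table (assumed structure).
html = """
<table><tr>
<td class="confluenceTd"> <font color="#00875a"><b>TEST1</b></font></td>
<td class="confluenceTd"> TEST2</td>
</tr></table>
"""
cells = BeautifulSoup(html, "html.parser").find_all("td")
values = [green_text_or_nan(td) for td in cells]
print(values)  # ['TEST1', nan]
```

Because the check is done per cell, a black value keeps its position in the output instead of being dropped.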
[UPDATE]
test.html:
There are about 50-60 such items on the screen, but they all have the same structure.
HTML table code:
<div class="table-wrap">
<table class="confluenceTable"><tbody>
<tr>
<td class="confluenceTd"><b>1column</b></td>
<td class="confluenceTd"><b>2column</b></td>
<td class="confluenceTd"><b>3column</b></td>
<td class="confluenceTd"><b>4column</b></td>
</tr>
<tr>
<td class="confluenceTd">1A</td>
<td class="confluenceTd"> <font color="#00875a"><b>TEST1</b></font></td>
<td class="confluenceTd"> <font color="#00875a"><b>15-Jul-2022 6 PM CET</b></font></td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">2A</td>
<td class="confluenceTd"> TEST2</td>
<td class="confluenceTd">18 July 2022 1 PM CET</td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">3A</td>
<td class="confluenceTd"> <font color="#00875a"><b>TEST3</b></font></td>
<td class="confluenceTd">18 July 2022 1 PM CET</td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">4A</td>
<td class="confluenceTd"> <font color="#00875a"><b>TEST4</b></font></td>
<td class="confluenceTd"> <font color="#00875a"><b>15-Jul-2022 6 PM CET</b></font></td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">5A</td>
<td class="confluenceTd"> TEST5</td>
<td class="confluenceTd">18 July 2022 1 PM CET</td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">6A</td>
<td class="confluenceTd"> <font color="#00875a"><b>TEST6</b></font></td>
<td class="confluenceTd"> <font color="#00875a"><b>15-Jul-2022 6 PM CET</b></font></td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">7A</td>
<td class="confluenceTd"> TEST7</td>
<td class="confluenceTd">18 July 2022 1 PM CET</td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">8A</td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">9A</td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">10A</td>
<td class="confluenceTd"> <font color="#00875a"><b>TEST8</b></font></td>
<td class="confluenceTd">18 July 2022 1 PM CET</td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">11A</td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">12A</td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">16A</td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
</tr>
<tr>
<td class="confluenceTd">17A</td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
<td class="confluenceTd"> </td>
</tr>
</tbody></table>
</div>
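A minimal sketch of a row-by-row approach for a table like this one. Selecting every green `font` tag at once discards the information about which row a value came from, so this version walks the `tr` elements and checks columns 2 and 3 of each row separately. The function name `green_table_to_df`, its parameters, and the shortened `sample` markup are assumptions made for illustration. It also assumes the first row holds the column titles and every row has the same number of cells.

```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup


def green_table_to_df(html, columns=(1, 2), color="#00875a"):
    """Build a DataFrame with green text kept and everything else as NaN."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table.confluenceTable tr")
    # The first row is assumed to contain the column titles.
    header = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    data = []
    for tr in rows[1:]:
        cells = tr.find_all("td")
        record = []
        for idx in columns:
            font = cells[idx].find("font", color=color)
            record.append(font.get_text(strip=True) if font else np.nan)
        data.append(record)
    return pd.DataFrame(data, columns=[header[i] for i in columns])


# Shortened copy of the first rows of the table (illustrative only).
sample = """
<table class="confluenceTable"><tbody>
<tr><td><b>1column</b></td><td><b>2column</b></td><td><b>3column</b></td></tr>
<tr><td>1A</td>
    <td> <font color="#00875a"><b>TEST1</b></font></td>
    <td> <font color="#00875a"><b>15-Jul-2022 6 PM CET</b></font></td></tr>
<tr><td>2A</td><td> TEST2</td><td>18 July 2022 1 PM CET</td></tr>
<tr><td>3A</td>
    <td> <font color="#00875a"><b>TEST3</b></font></td>
    <td>18 July 2022 1 PM CET</td></tr>
</tbody></table>
"""
df = green_table_to_df(sample)
print(df)
```

For the real page, the same function could be called with the contents of test.html, for example `green_table_to_df(open("test.html", encoding="utf-8").read())`. If the page holds several such tables, the rows of all of them would be combined, so splitting per table may be needed.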