I have a DataFrame like this:
from to incident check_string occurrences
npp
1 00001234567 01011234567 12345 0000123456701011234567 1
2 00001234567 01011234567 45678 0000123456701011234567 2
3 00001234567 01011234567 45678 0000123456701011234567 3
4 00001234567 01011234567 45678 0000123456701011234567 4
5 00001234568 01011234568 81289 0000123456801011234568 1
6 00001234568 01011234568 27811 0000123456801011234568 2
7 00001234568 01011234568 27811 0000123456801011234568 3
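I need the incident value from the first occurrence (occurrences == 1) to be propagated to the rows where the same check_string repeats, i.e. where occurrences > 1. Repeated rows may or may not exist. This is what I do: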
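def distribute(row):
    # Look up the first-occurrence incident for this row's check_string.
    return df2['incident'][df2['check_string'] == row['check_string']].item()

# Rows holding the first occurrence of each check_string.
df2 = df[['check_string', 'incident']][df['occurrences'] == 1]
df['incident_src'] = df.apply(lambda row: distribute(row), axis=1)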
The result is:
from to incident check_string occurrences incident_src
npp
1 00001234567 01011234567 12345 0000123456701011234567 1 12345
2 00001234567 01011234567 45678 0000123456701011234567 2 12345
3 00001234567 01011234567 45678 0000123456701011234567 3 12345
4 00001234567 01011234567 45678 0000123456701011234567 4 12345
5 00001234568 01011234568 81289 0000123456801011234568 1 81289
6 00001234568 01011234568 27811 0000123456801011234568 2 81289
7 00001234568 01011234568 27811 0000123456801011234568 3 81289
Can this be done in a faster way?
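
One possible direction, as a minimal sketch rather than a definitive answer: build a lookup Series from the occurrences == 1 rows and apply it with Series.map, which avoids the row-wise apply. The sample frame below is reconstructed from the table above, and the name first_incident is made up for illustration. The sketch assumes each check_string has exactly one row with occurrences == 1; a check_string with no such row would get NaN here, whereas the original .item() call would raise.

```python
import pandas as pd

# Sample data reconstructed from the question (check_string kept as text
# so the leading zeros survive).
df = pd.DataFrame(
    {
        "check_string": ["0000123456701011234567"] * 4
        + ["0000123456801011234568"] * 3,
        "incident": [12345, 45678, 45678, 45678, 81289, 27811, 27811],
        "occurrences": [1, 2, 3, 4, 1, 2, 3],
    },
    index=pd.Index(range(1, 8), name="npp"),
)

# Hypothetical helper name: check_string -> incident of its first occurrence.
# Assumes exactly one occurrences == 1 row per check_string.
first_incident = (
    df.loc[df["occurrences"] == 1].set_index("check_string")["incident"]
)

# Vectorized lookup that replaces df.apply(..., axis=1).
df["incident_src"] = df["check_string"].map(first_incident)
```

On this sample the new column matches the incident_src column shown above. Row-wise apply calls a Python function and filters df2 once per row, so a single map over the column is typically much cheaper on large frames, though the actual gain would need to be measured on the real data.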