The dataframe consists of the columns "cleanUrl" and "code_url", where "cleanUrl" is a link and "code_url" is that link converted to a number using:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
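For illustration, here is a minimal sketch of how such a column could be produced. The three-row sample and the name df are assumptions made for the example, not the original data or pipeline.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical three-row sample; the real file has about 130,000 rows.
df = pd.DataFrame({"cleanUrl": ["hsi.ru", "ad-k.ru", "hsi.ru"]})

le = preprocessing.LabelEncoder()
# fit_transform sorts the unique strings and maps each one to its index,
# so identical URLs always receive the same integer code.
df["code_url"] = le.fit_transform(df["cleanUrl"])

codes = df["code_url"].tolist()
# inverse_transform maps the integer codes back to the original strings.
restored = list(le.inverse_transform(df["code_url"]))
```

Because the codes follow the sorted order of the unique values, a repeated URL always maps to the same number.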
Sample of the file:
cleanUrl,code_url
amerikan-gruzovik.ru,4590
tinatube.net,74861
sextelevizor.net,66791
ru.anysex.com,62743
www.asiamobil.ru,86865
www.chinamobil.ru,90045
ad-k.ru,2637
www.nik-store.ru,105112
video-seks.net,80108
russkoe-porno.info,63946
www.foxporns.com,94819
www.chrono24.com.ru,90117
www.wibes.ru,118283
german242.com,26297
santdom.ru,65100
treningchess.com,76231
razvedem.web-3.ru,60517
aktis-stroy.ru,3525
www.aktis-stroy.ru,85600
plot.name,56170
www.lichnycabinet.ru,100979
www.worldfishing.narod.ru,118532
sekretka.su,66123
www.a-centre.ru,85011
www.suzukirus.ru,113986
pornogl.com,57123
wmid234ru.ru,83678
hsi.ru,29794
infometer.ru,31244
www.git77.rostrud.ru,95784
www.packagetrackr.com,106632
www.tns-global.ru,115139
www.vipgroup.net,117281
www.toysrus.com,115433
moskva.wisell.ru,46046
www.shopjustice.com,111904
deti75.ru,16625
crimeacity.info,15195
baza.crimea.ua,8838
atelica-oazis.bron.me,6647
gokurort.ru,26990
mitula17.imhonet.ru,44811
foxbrest.imhonet.ru,24645
xavi.imhonet.ru,120090
ural.kp.ru,78539
spb.kp.ru,69996
pinkmarie.com,55650
geneva2015.cars.ru,26188
domodedovo.rujazi.com,18057
xn------5cdjccgu2avckptly3ad8p.xn--e1arcbfn.xn--p1ai,120241
baikalpress.ru,8328
klimovsk.mnogonado.net,35750
svet-modern.ru,72656
www.forex-kf.ru,94627
www.uniq-ip.com,116401
www.terrawoman.ua,114714
www.gorsovet.mk.ua,96192
vmr.gov.ua,81250
helpstu.su,28874
www.helpstu.su,96823
zab-nanny.ru,122892
kursak-diplom.com.ua,37838
kgu-journalist.ucoz.ru,34771
mospf.ru,46093
newdiplom.ucoz.ru,49231
www.autoezda.com,87258
referats.nashisrael.ru,60990
www.hotdiplom.ru,97129
fotorakom.com,24577
redirect.disqus.com,60900
www.sq.com.ua,113207
member.newsnet.in.ua,43580
bankomet.com.ua,8537
po4emu.ru,56252
www.po4emu.ru,107650
tric.info,76258
myotpusk.com,47714
yspehx.narod.ru,122777
vozhatiki.ru,81885
kirent.narod.ru,35483
www.festivalsearcher.com,94080
hotasianz.com.6716069.yupiromo.ru,29549
starblag.ucoz.ua,70955
www.medalbum.ru,102495
ab28ru.narod.ru,2336
diel.ks.ua,16931
aniplay.tv,5091
ugolzreniya.narod.ru,77854
vrn.vestipk.ru,81990
afg-hist.ucoz.ru,3023
www.shanson-plus.ru,111700
www.vsmolenske.ru,117854
vsetutonline.com,82254
stomatologmova.ucoz.ua,71506
xn----8sbgjprccxgonf4d1dya7b.xn--p1ai,120742
yarcube.ru,122335
www.pion.com.ru,107364
76yar.ru,1961
loveplanet-online.ru,40510
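I need to return the "code_url" entries that match the "cleanUrl" entries, in dataframe format.
The full version of the file contains 130,000 records. I tried a nested loop, but it ran for a very long time: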
d = []
for a in range(len(df_label_url)):
    for b in range(len(df_label_url)):
        if df_label_url['code_url'][a] == df_label_url['cleanUrl'][b]:
            d.append(df_label_url['code_url'][a])
The result should look roughly like this, only in dataframe format:
[4590, 4590, 4590, 4590, 4590, 4590, 4590, 4590, 4590, 4590, 74861, 74861, 74861, 74861, 74861, 74861, 74861, 74861, 74861, 74861, 66791, 66791, 66791, 66791, 66791, 66791, 66791, 66791, 66791, 66791, 62743, 62743, 62743, 62743, 62743, 62743, 62743, 62743, 62743, 62743, 86865, 86865, 86865, 86865, 86865, 86865, 86865, 86865, 86865, 86865, 90045, 90045, 90045, 90045, 90045, 90045, 90045, 90045, 90045, 90045, 2637, 2637, 2637, 2637, 2637, 2637, 2637, 2637, 2637, 2637, 105112, 105112, 105112, 105112, 105112, 105112, 105112, 105112, 105112, 105112, 80108, 80108, 80108, 80108, 80108, 80108, 80108, 80108, 80108, 80108, 63946, 63946, 63946, 63946, 63946, 63946, 63946, 63946, 63946, 63946]
To find rows that occur two or more times, you can use DataFrame.duplicated():
Original dataframe:
Check for duplicates:
Solution:
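A minimal sketch of these three steps, assuming a small made-up sample in place of the real df_label_url. Choosing keep=False and selecting only the code_url column are assumptions about the intended result, not necessarily the original answer's exact code.

```python
import pandas as pd

# Original dataframe: hypothetical sample standing in for df_label_url.
df_label_url = pd.DataFrame({
    "cleanUrl": ["hsi.ru", "ad-k.ru", "hsi.ru", "plot.name", "ad-k.ru"],
    "code_url": [29794, 2637, 29794, 56170, 2637],
})

# Check for duplicates: by default duplicated() flags a row only when an
# identical row appeared earlier; keep=False flags every member of a
# duplicated group, including the first occurrence.
mask = df_label_url.duplicated(keep=False)

# Solution: boolean indexing returns the matching rows as a DataFrame
# in one vectorized pass, with no nested Python loop.
result = df_label_url.loc[mask, ["code_url"]]
```

The nested loop makes about 130,000 × 130,000 comparisons on the full file, whereas duplicated() avoids the pairwise scan, so it should finish quickly on data of that size.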
P.S. You can also search for duplicates in specific columns only; to do that, use the subset parameter:
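A short sketch of the difference, again on made-up data: the code 11111 is deliberately wrong, so the two hsi.ru rows are not identical as whole rows.

```python
import pandas as pd

# Hypothetical sample: the same URL appears twice with different codes.
df_label_url = pd.DataFrame({
    "cleanUrl": ["hsi.ru", "ad-k.ru", "hsi.ru", "plot.name"],
    "code_url": [29794, 2637, 11111, 56170],
})

# Comparing whole rows finds no duplicates, because code_url differs.
full_row = df_label_url.duplicated(keep=False)

# subset restricts the comparison to the listed columns only.
by_url = df_label_url.duplicated(subset=["cleanUrl"], keep=False)

repeated_urls = df_label_url[by_url]
```

Here repeated_urls holds both hsi.ru rows, even though their code_url values differ.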