The input is a small txt file with 50 thousand rows:
import os
import pandas as pd

data_dir = 'data'
filename = 'gender_age_dataset.txt'
file_path = os.path.join(data_dir, filename)
df = pd.read_csv(file_path, sep='\t')
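One column of the file, user_json, holds JSON strings whose records have two attributes: url and timestamp.
The task is to process these two JSON attributes before the JSON strings are expanded across the whole dataset (for each JSON string, the dataset row is replicated once for every record in that JSON string).
The result will have 5 million rows, and post-processing it afterwards on a weak machine would take hours.
The code that deserializes the JSON strings and replicates the parent row for each JSON record: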
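# Parse each JSON string into a dict, then turn the rows into a list of records
data = df.drop(columns='user_json').join(df['user_json'].apply(json.loads)).to_dict('records')
# One output row per visit; uid, gender and age are repeated from the parent row
res = pd.json_normalize(data, [['user_json', 'visits']], ['uid', 'gender', 'age'])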
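To make the row replication concrete, here is a minimal sketch on a toy frame. The uid values and URLs are made up for illustration; only the column names and the nesting under "visits" follow the file described above.

```python
import json
import pandas as pd

# Toy frame with the same columns as the file (values are hypothetical)
df = pd.DataFrame({
    "gender": ["F", "M"],
    "age": ["18-24", "25-34"],
    "uid": ["u1", "u2"],
    "user_json": [
        '{"visits": [{"url": "http://a.ru/x", "timestamp": 1},'
        ' {"url": "http://b.ru/y", "timestamp": 2}]}',
        '{"visits": [{"url": "http://c.ru/z", "timestamp": 3}]}',
    ],
})

# Parse the JSON strings, then flatten: one output row per visit
records = df.assign(user_json=df["user_json"].map(json.loads)).to_dict("records")
res = pd.json_normalize(records, [["user_json", "visits"]], ["uid", "gender", "age"])
print(res)
```

The first input row has two visits and the second has one, so the flattened frame has three rows, with uid, gender and age repeated for each visit.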
Two rows from the file:
gender age uid user_json
F 18-24 d50192e5-c44e-4ae8-ae7a-7cfe67c8b777 {"visits": [{"url": "http://zebra-zoya.ru/200028-chehol-organayzer-dlja-macbook-11-grid-it.html?utm_campaign=397720794&utm_content=397729344&utm_medium=cpc&utm_source=begun", "timestamp": 1419688144068}, {"url": "http://news.yandex.ru/yandsearch?cl4url=chezasite.com/htc/htc-one-m9-delay-86327.html&lr=213&rpt=story", "timestamp": 1426666298001}, {"url": "http://www.sotovik.ru/news/240283-htc-one-m9-zaderzhivaetsja.html", "timestamp": 1426666298000}, {"url": "http://news.yandex.ru/yandsearch?cl4url=chezasite.com/htc/htc-one-m9-delay-86327.html&lr=213&rpt=story", "timestamp": 1426661722001}, {"url": "http://www.sotovik.ru/news/240283-htc-one-m9-zaderzhivaetsja.html", "timestamp": 1426661722000}]}
M 25-34 d502331d-621e-4721-ada2-5d30b2c3801f {"visits": [{"url": "http://sweetrading.ru/?p=900", "timestamp": 1419717886224}, {"url": "http://sweetrading.ru/?p=884", "timestamp": 1419717884437}, {"url": "http://sweetrading.ru/?p=1002", "timestamp": 1419717816375}, {"url": "http://101.ru/?an=port_channel_mp3", "timestamp": 1419717804934}, {"url": "http://sweetrading.ru/?cat=62", "timestamp": 1419714194423}, {"url": "http://sweetrading.ru/?p=1046", "timestamp": 1419713998481}, {"url": "http://sweetrading.ru/?p=978", "timestamp": 1419713927085}, {"url": "http://sweetrading.ru/?cat=171", "timestamp": 1419713908863}, {"url": "http://sweetrading.ru/?cat=62", "timestamp": 1419713908679}, {"url": "http://sweetrading.ru/?p=3648", "timestamp": 1419713798879}, {"url": "http://oesex.ru/955457", "timestamp": 1419595564407}, {"url": "http://www.interfax.ru/russia/408800", "timestamp": 1419542965224}, {"url": "http://101.ru/?an=port_channel_mp3&channel=30", "timestamp": 1418818241900}, {"url": "http://www.interfax.ru/russia/413508", "timestamp": 1418802080857}, {"url": "http://www.euroavtoprokat.ru/sitemap/car-rental/france.htm", "timestamp": 1418722961181}, {"url": "http://www.euroavtoprokat.ru/sitemap/car-rental.htm", "timestamp": 1418722945825}, {"url": "http://www.euroavtoprokat.ru/car-rental/germany.htm", "timestamp": 1418722937847}, {"url": "http://www.euroavtoprokat.ru/car-rental/germany.htm", "timestamp": 1418722923196}, {"url": "http://www.euroavtoprokat.ru/sitemap/car-rental.htm", "timestamp": 1418722909804}, {"url": "http://www.eavtoprokat.ru/prokat-avto/france", "timestamp": 1418646101953}, {"url": "http://www.wordparts.ru/numeral/", "timestamp": 1418592793587}, {"url": "http://rsdn.ru/forum/alg/3305190.flat", "timestamp": 1418591162814}, {"url": "http://www.euroavtoprokat.ru/car-rental/turkey/istanbul.htm", "timestamp": 1418571531780}, {"url": "http://citieslist.ru/", "timestamp": 1418488992092}, {"url": "http://www.euroavtoprokat.ru/car-rental/turkey/istanbul.htm", 
"timestamp": 1418480798674}, {"url": "http://rutv.ru/brand/show/episode/453757", "timestamp": 1418253037406}, {"url": "http://www.fodors.com/community/europe/best-car-rental-company-in-italy.cfm", "timestamp": 1418247198586}, {"url": "http://wheelsabroad.com/car-rental/united-kingdom/england/london?gclid=cjwkeaia-5-kbrdylpg5096r8masjabqedm4cmiichc-_-ewkbtsqyci5bu9ucwvjmxp4o0tficaarocljdw_wcb", "timestamp": 1418245144696}, {"url": "http://lestinet.com/site/stopagent.ru", "timestamp": 1418243376170}, {"url": "http://android-help.ru/q2a/16774/\u043a\u0430\u043a-\u043f\u043e\u043b\u0443\u0447\u0438\u0442\u044c-root-\u043f\u0440\u0430\u0432\u0430-\u043d\u0430-philips-w832-android-4-0-4", "timestamp": 1418169606439}, {"url": "http://club.dns-shop.ru/rabinovich/blog/\u044f-\u0432\u0441\u0435-\u0435\u0449\u0435-\u0434\u0435\u0440\u0436\u0443\u0441\u044c-\u043e\u0431\u0437\u043e\u0440-\u0441\u043c\u0430\u0440\u0442\u0444\u043e\u043d\u0430-philips-xenium-w832/", "timestamp": 1418169602505}, {"url": "http://www.supportforum.philips.com/ru/showthread.php?1529-philips-xenium-w832/page6", "timestamp": 1418167859617}, {"url": "http://www.supportforum.philips.com/ru/showthread.php?842-\u043d\u0435-\u0440\u0430\u0431\u043e\u0442\u0430\u0435\u0442-gps-\u0432-\u0441\u043c\u0430\u0440\u0442\u0444\u043e\u043d\u0435-philips-xenium-w832", "timestamp": 1418166430112}, {"url": "http://rabota.ua/info/jobsearcher/post/umora.aspx", "timestamp": 1418114698621}, {"url": "http://www.enter.ru/product/appliances/myasorubka-philips-hr2728-2020103007131", "timestamp": 1418053557067}, {"url": "http://www.ferra.ru/ru/byt/news/2013/12/02/polaris-pmg-1805/", "timestamp": 1417866883735}, {"url": "http://www.ferra.ru/ru/byt/news/2013/10/12/bosch-mfw6-propower/", "timestamp": 1417862586856}, {"url": "http://www.linotype.com/1266/neuehelvetica-family.html", "timestamp": 1417856979616}, {"url": "http://www.linotype.com/1546/tradegothic-family.html?site=webfonts", "timestamp": 1417812010753}, {"url": 
"http://www.vandelaydesign.com/best-ecommerce-website-designs/", "timestamp": 1417807232287}, {"url": "http://www.awwwards.com/20-of-the-very-best-e-commerce-web-sites.html", "timestamp": 1417805189928}, {"url": "http://101.ru/?an=port_channel_mp3&channel=82", "timestamp": 1417711286305}, {"url": "http://www.just.ru/myasorubki/56658_elektromyasorybky_kenwood_mg_450/?from=yandex_msk&utm_source=yandex&utm_medium=cpc&utm_campaign=10817239_model_bytovaya-tehnika-melkaya_msk_p_api&utm_content=612422293_2792852770_\u043c\u044f\u0441\u043e\u0440\u0443\u0431\u043a\u0443 mg 450&position_type=premi", "timestamp": 1417701042306}, {"url": "http://101.ru/?an=port_channel_mp3&channel=5", "timestamp": 1417695760398}, {"url": "http://101.ru/?an=port_channel_mp3&channel=5", "timestamp": 1417689964129}, {"url": "http://101.ru/?an=port_channel_mp3&channel=17", "timestamp": 1417683034834}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001", "timestamp": 1417608945879}, {"url": "http://101.ru/?an=port_channel_mp3&channel=24", "timestamp": 1417605700777}, {"url": "http://101.ru/?an=port_channel_mp3&channel=24", "timestamp": 1417605639264}, {"url": "http://101.ru/?an=port_channel_mp3&channel=82", "timestamp": 1417605624817}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg470-meat-grinder-0wmg470008", "timestamp": 1417604804579}, {"url": "http://livedemo00.template-help.com/magento_48517/blackberry-bold-9000-phone.html", "timestamp": 1417604730951}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg470-meat-grinder-0wmg470008", "timestamp": 1417548651645}, {"url": "http://www.kenwoodworld.com/en-int/products/blenders/meat-grinders/mg474-meat-grinder", "timestamp": 1417548321763}, {"url": 
"http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001", "timestamp": 1417548310507}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001", "timestamp": 1417548309162}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001?feat=6405fda1-43cc-42cc-8860-1c2a492555c5&tabsegment=key-features", "timestamp": 1417548297576}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001?tabsegment=key-features", "timestamp": 1417548284970}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001", "timestamp": 1417548264964}, {"url": "http://www.kenwoodworld.com/en-int/products/blenders/meat-grinders", "timestamp": 1417546314287}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg700-meat-grinder-0wmg700006", "timestamp": 1417545459520}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg700-meat-grinder-0wmg700006", "timestamp": 1417545200191}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/kmix-by-kenwood/kmix-kitchen-machines-/kmx51-kmix-kitchen-machine-0wkmx51002", "timestamp": 1417545116313}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/---mg517---0wmg517007", "timestamp": 1417544991760}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001", "timestamp": 1417544967371}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001?feat=ac86d868-3ea4-4523-93e1-885bbf4222cd&tabsegment=key-features", "timestamp": 
1417544772661}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001?feat=3a288c22-e5f2-448e-a573-ccde95fd2341&tabsegment=key-features", "timestamp": 1417544765049}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001?feat=ac86d868-3ea4-4523-93e1-885bbf4222cd&tabsegment=key-features", "timestamp": 1417544748628}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001?tabsegment=key-features", "timestamp": 1417544731238}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001", "timestamp": 1417544522237}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001", "timestamp": 1417544351791}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/-mg350-0w21910001", "timestamp": 1417544282950}, {"url": "http://www.kenwoodworld.com/ru-ru", "timestamp": 1417544269909}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg516-meat-grinder-and-roto-food-cutter-0wmg516006?tabsegment=specifications", "timestamp": 1417544204394}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg516-meat-grinder-and-roto-food-cutter-0wmg516006", "timestamp": 1417544190747}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg516-meat-grinder-and-roto-food-cutter-0wmg516006", "timestamp": 1417544045014}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg516-meat-grinder-and-roto-food-cutter-0wmg516006?tabsegment=specifications", "timestamp": 1417544035023}, {"url": 
"http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg516-meat-grinder-and-roto-food-cutter-0wmg516006", "timestamp": 1417544015196}, {"url": "http://www.kenwoodworld.com/ru-ru", "timestamp": 1417544004579}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009?tabsegment=specifications", "timestamp": 1417543914820}, {"url": "http://www.kenwoodworld.com/uk/search-results", "timestamp": 1417543814629}, {"url": "http://www.kenwoodworld.com/uk/search-results", "timestamp": 1417543642699}, {"url": "http://www.kenwoodworld.com/uk/search-results", "timestamp": 1417543628088}, {"url": "http://www.kenwoodworld.com/uk/search-results", "timestamp": 1417543616074}, {"url": "http://www.kenwoodworld.com/uk/products/food-mixers/chef-major-attachments/potato-peeler-at444-awat444001", "timestamp": 1417543439173}, {"url": "http://www.kenwoodworld.com/uk/search-results", "timestamp": 1417543352117}, {"url": "http://www.kenwoodworld.com/uk/search-results", "timestamp": 1417543294005}, {"url": "http://www.kenwoodworld.com/uk/search-results", "timestamp": 1417543192107}, {"url": "http://www.kenwoodworld.com/uk", "timestamp": 1417543022466}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009?tabsegment=specifications", "timestamp": 1417542940415}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009?tabsegment=support", "timestamp": 1417542907491}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009?tabsegment=specifications", "timestamp": 1417542866623}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009", "timestamp": 
1417542858206}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009", "timestamp": 1417542839578}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009", "timestamp": 1417542795850}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009", "timestamp": 1417542742883}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009", "timestamp": 1417542725367}, {"url": "http://www.kenwoodworld.com/ru-ru/all-products/blenders-mixers-and-meat-grinders/meat-grinders-ru/mg510-meat-grinder-0wmg510009", "timestamp": 1417542659966}, {"url": "http://www.kenwoodworld.com/ru-ru", "timestamp": 1417542501523}, {"url": "http://101.ru/?an=port_channel_mp3&channel=24", "timestamp": 1417542435930}, {"url": "http://www.shop-script.ru/platform/", "timestamp": 1417473193974}, {"url": "http://101.ru/?an=port_channel_mp3&channel=34", "timestamp": 1417451297674}]}
Post-processing code for the url from the JSON:
import re
import json
from urllib.parse import urlparse, unquote

for c in range(len(res)):
    # Decode percent-escapes, then keep only the network location (host)
    a = urlparse(unquote(res.loc[c, 'url']))
    # Drop a leading "www." if present; .loc avoids chained assignment
    res.loc[c, 'url'] = re.search(r"(?:www\.)?(.*)", a.netloc).group(1)
Before: http://news.yandex.ru/yandsearch?cl4url=chezasite.com/htc/htc-one-m9-delay-86327.html&lr=213&rpt=story
After: news.yandex.ru
Post-processing code for the 'timestamp' from the JSON:
import datetime
from datetime import timedelta

# Timestamps are in milliseconds; fromtimestamp() returns local time
mytime = datetime.datetime.fromtimestamp(int(res['timestamp'][1] / 1000))
# Round to the hour: up when the minutes are past 30, down otherwise
if mytime.replace(minute=30) < mytime:
    mytime = mytime.replace(second=0, microsecond=0, minute=0) + timedelta(hours=1)
else:
    mytime = mytime.replace(second=0, microsecond=0, minute=0)
Before: 1426666298001 After: 2015-03-18 11:00:00
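How can I build this post-processing into the code that deserializes the JSON strings and replicates the parent row for each JSON record? If anything is unclear, I will add details.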

Vectorized solution:
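A minimal sketch of what a vectorized version could look like, assuming res has the url and timestamp columns produced by json_normalize in the question. The pattern HOST_PATTERN, the function name postprocess_visits and the tz default are illustrative assumptions, not necessarily the exact code used for the timing quoted in this answer. The rounding mirrors the rule in the question: round up only when the minutes are past 30.

```python
import re
import pandas as pd

# Illustrative pattern (an assumption): optional scheme, optional "www.",
# then everything up to the first "/", "?" or "#".
HOST_PATTERN = r"^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?([^/?#]+)"

def postprocess_visits(res, tz="Europe/Moscow"):
    """Return a copy of res with url reduced to its host and timestamp
    rounded to the hour. tz is an assumed local time zone; the question's
    code relies on the machine's local time instead."""
    out = res.copy()
    # Host extraction for the whole column in one vectorized call
    out["url"] = out["url"].str.extract(HOST_PATTERN, flags=re.IGNORECASE, expand=False)
    # Milliseconds since the epoch -> naive wall-clock time in tz
    local = (pd.to_datetime(out["timestamp"], unit="ms", utc=True)
             .dt.tz_convert(tz)
             .dt.tz_localize(None))
    # Same rule as the question: add one hour only when minute > 30
    round_up = pd.to_timedelta((local.dt.minute > 30).astype("int64"), unit="h")
    out["timestamp"] = local.dt.floor("h") + round_up
    return out

sample = pd.DataFrame({
    "url": [
        "http://news.yandex.ru/yandsearch?cl4url=chezasite.com/htc/htc-one-m9-delay-86327.html&lr=213&rpt=story",
        "http://www.sotovik.ru/news/240283-htc-one-m9-zaderzhivaetsja.html",
    ],
    "timestamp": [1426666298001, 1426668338001],
})
print(postprocess_visits(sample))
```

Unlike the loop in the question, this sketch does not call unquote, so percent-escaped host names would stay escaped. Applied to the flattened frame it would be called as res = postprocess_visits(res).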
Result:
Timing on 5,350,000 rows (about 9 seconds on my hardware):
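P.S. Using urlparse is more idiomatic and probably covers more "tricky" URL variants, but it is hard to speed it up enough to compete with Pandas vectorized methods. I would therefore suggest using regular expressions (RegEx) and adjusting/correcting them for the more complex cases.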
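To see where the two approaches can disagree, here is a small comparison sketch. The pattern HOST_RE and the helper names host_regex and host_urlparse are assumptions made for illustration, and the example.com URLs are invented test inputs.

```python
import re
from urllib.parse import urlparse

# Illustrative pattern (an assumption): optional scheme, optional "www.",
# then everything up to the first "/", "?" or "#".
HOST_RE = re.compile(r"^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?([^/?#]+)", re.IGNORECASE)

def host_regex(url):
    """Host as seen by the regular expression (keeps userinfo and port)."""
    m = HOST_RE.match(url)
    return m.group(1) if m else None

def host_urlparse(url):
    """Host as seen by urlparse (strips userinfo and port, needs a scheme)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

for url in [
    "http://www.sotovik.ru/news/240283-htc-one-m9-zaderzhivaetsja.html",
    "http://user@example.com:8080/path",   # userinfo and port
    "example.com/path",                    # no scheme
]:
    print(url, "|", host_regex(url), "|", host_urlparse(url))
```

On ordinary URLs both give the same host. With userinfo or a port, the regex keeps them while urlparse drops them; without a scheme, urlparse finds no host at all. Cases like these are the ones to check when adjusting the regular expression for your data.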