python - Using pandas for loading huge JSON files
I have 500+ huge JSON files, each about 400 MB in compressed form (around 3 GB uncompressed). I am using the standard json library in Python 2.7 to process the data, and it is taking far too long; I think json.loads() is the major culprit for the time consumption. I am thinking of using pandas in Python for loading the data from the gzip files and doing the analysis.
I have heard of pandas, but I am not sure whether it is the right tool to use here. My concern is: will using pandas give me a substantial improvement in speed?
NB: I can of course parallelise the work and get some speedup that way, but I still find things pretty slow.
Also, would reading the data with gzip.open(), converting each JSON entry to a dictionary with json.loads() and storing it in sqlite3 help me in any way with the further analysis?
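Roughly, the pipeline I have in mind looks like this (the file name, table layout, selected fields and batch size are just placeholders for illustration, assuming one JSON object per line):

    import gzip
    import json
    import sqlite3

    conn = sqlite3.connect('tweets.db')
    conn.execute('CREATE TABLE IF NOT EXISTS tweets '
                 '(id INTEGER PRIMARY KEY, created_at TEXT, lang TEXT, text TEXT)')

    with gzip.open('tweets-001.json.gz', 'rb') as f:
        rows = []
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)          # one JSON object per line
            if 'delete' in entry:             # skip delete notices
                continue
            rows.append((entry['id'], entry['created_at'], entry['lang'], entry['text']))
            if len(rows) >= 10000:            # insert in batches so memory stays flat
                conn.executemany('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)', rows)
                conn.commit()
                rows = []
        if rows:
            conn.executemany('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)', rows)
            conn.commit()
    conn.close()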
A sample JSON entry:
{"created_at":"sun dec 01 01:19:00 +0000 2013","id":406955558441193472,"id_str":"406955558441193472","text":"todo va estar bn :d","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003etwitter blackberry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":483470963,"id_str":"483470963","name":"katheryn rodriguez","screen_name":"katheryn_93","location":"","url":null,"description":"no pretendo ser nadie mas y no soy perfecta lo se, tengo muchos errores tambi\u00e9n lo se pero me acepto y me amo como soy.","protected":false,"followers_count":71,"friends_count":64,"listed_count":0,"created_at":"sun feb 05 02:04:16 +0000 2012","favourites_count":218,"utc_offset":-21600,"time_zone":"central time (us & canada)","geo_enabled":true,"verified":false,"statuses_count":10407,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"dbe9ed","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/483470963\/1385144720","profile_link_color":"9d1dcf","profile_sidebar_border_color":"ffffff","profile_sidebar_fill_color":"e6f6f9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"es"}
and you can also find JSON entries of this kind:
{"delete":{"status":"id":380315814080937984,"user_id":318430801,"id_str":"380315814080937984","user_id_str":"318430801"}}}
A 3 GB JSON file is huge, and when stored as nested dicts in Python it becomes many times larger still, hence it uses a lot of memory. Watch how memory usage increases while loading one of these files and you will probably notice the machine starting to swap.
You need to either parse each line as a separate JSON object (if the entries are one per line, as your samples suggest) or split the files into smaller chunks.
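A minimal sketch of the line-by-line approach, assuming each gzip file contains one JSON object per line (as Twitter streaming dumps normally do). The generator parses and discards each entry as soon as it has been consumed, so memory stays flat regardless of file size; the file name and the example aggregation are made up for illustration:

    import gzip
    import json
    from collections import Counter

    def iter_entries(path):
        """Yield one parsed JSON object per line, never holding the whole file in memory."""
        with gzip.open(path, 'rb') as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

    def iter_tweets(path):
        """Yield only real tweets, skipping delete notices and other control entries."""
        for entry in iter_entries(path):
            if 'delete' not in entry:
                yield entry

    # Example: count tweets per language without building any large structure.
    lang_counts = Counter()
    for tweet in iter_tweets('tweets-001.json.gz'):
        lang_counts[tweet.get('lang')] += 1
    print(lang_counts.most_common(10))

If you do want the analysis in pandas, you can build a DataFrame from batches of, say, 100,000 parsed entries with pandas.DataFrame.from_records and combine the partial results, rather than loading a whole file at once. Note that pandas will not make the JSON parsing itself free; the win is mainly in doing the subsequent analysis on columnar data instead of nested dicts.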