python - Creating a matrix of joint number of hits from two columns using numpy/pandas -


i have 2 large columns of data (some 1.5million values). structured :

     col1 = [2,2,1,4,5,4,3,4,4,4,5,2,3,1,1 ..] etc.,      col2 = [1,1,8,8,3,5,6,7,2,3,10.........] etc., 

i want make joint count matrix countsmat. col1 has values ranging 1 5 , col2 has range of 1 10.

     counts of [(1,2),(1,3),...(1,10),                 (2,1), (2,2),....(2,10),                 (3,1),......,(3,3)...(3,10),                 ...........................                 (5,1),(5,2),...(5,5).....(5,10)] required ie.,       countsmat of size (5,10)       max(col1) = 5; max(col2) = 10 ;   

i've implemented version defaultdict & loop, takes while. sure more adeptly handled pandas directly , i'd appreciate optimal implementation using numpy/pandas. many other similar queries point 'groupby' of pandas, not well-versed it.

getting output in matrix format trickier dict based implementation. guess easier pandas/numpy. thanks!

this might work (using numpy.histogram2d):

import numpy np  col1 = np.random.random_integers(1, 5, 100) col2 = np.random.random_integers(1, 10, 100) bin1 = np.arange(0.5, 6.5, 1) bin2 = np.arange(0.5, 11.5, 1)  mat = np.histogram2d(col1, col2, bins=[bin1, bin2])[0] print mat.shape print mat 

which yields

(5, 10) array([[ 4.,  2.,  0.,  6.,  2.,  2.,  1.,  2.,  1.,  2.],        [ 3.,  3.,  3.,  0.,  3.,  0.,  1.,  4.,  1.,  0.],        [ 4.,  2.,  1.,  1.,  3.,  2.,  5.,  2.,  2.,  2.],        [ 1.,  1.,  3.,  2.,  3.,  1.,  4.,  4.,  0.,  0.],        [ 0.,  2.,  1.,  4.,  3.,  1.,  3.,  2.,  0.,  1.]]) 

i haven't tested using lists lengths on couple thousand, think should scale ok few million values.


edit:

hi @nahsivar. i'm not familiar pandas (but should be), spent few minutes playing around. here several ways want (i think):

#instantiate dataframe import pandas pd random import choice x_choice = 'abcde' y_choice = 'abcdefghij' x, y = [], [] in range(100):     x[i] = choice(x_choice)     y[i] = choice(y_choice)  df = pd.dataframe(data={'col1': x, 'col2': y})  # 1 df.pivot_table(rows='col1', cols='col2', aggfunc=len)  # use fill_value=0 replace nans 0 # output: col2    b   c   d   e  f  g  h    j col1                                       nan  1   3   1   2  2  2  1  4   2 b      1  1 nan   3   5  1  2  3  1 nan c      4  1   2 nan nan  4  3  2  1   2 d      2  2   2   1   1  3  3  4  4   2 e      1  1   1 nan   4  2  6  3  2   2  # 2 df.groupby('col2')['col1'].value_counts().unstack(level=0) # output: col2    b   c   d   e  f  g  h    j    nan  1   3   1   2  2  2  1  4   2 b      1  1 nan   3   5  1  2  3  1 nan c      4  1   2 nan nan  4  3  2  1   2 d      2  2   2   1   1  3  3  4  4   2 e      1  1   1 nan   4  2  6  3  2   2  # 3 pd.crosstab(df.col1, df.col2) # output: col2   b  c  d  e  f  g  h   j col1                                   0  1  3  1  2  2  2  1  4  2 b     1  1  0  3  5  1  2  3  1  0 c     4  1  2  0  0  4  3  2  1  2 d     2  2  2  1  1  3  3  4  4  2 e     1  1  1  0  4  2  6  3  2  2 

Comments

Popular posts from this blog

Android layout hidden on keyboard show -

google app engine - 403 Forbidden POST - Flask WTForms -

c - Why would PK11_GenerateRandom() return an error -8023? -