python - Creating a matrix of joint number of hits from two columns using numpy/pandas -
i have 2 large columns of data (some 1.5million values). structured :
col1 = [2,2,1,4,5,4,3,4,4,4,5,2,3,1,1 ..] etc., col2 = [1,1,8,8,3,5,6,7,2,3,10.........] etc.,
i want make joint count matrix countsmat. col1 has values ranging 1 5 , col2 has range of 1 10.
counts of [(1,2),(1,3),...(1,10), (2,1), (2,2),....(2,10), (3,1),......,(3,3)...(3,10), ........................... (5,1),(5,2),...(5,5).....(5,10)] required ie., countsmat of size (5,10) max(col1) = 5; max(col2) = 10 ;
i've implemented version defaultdict & loop, takes while. sure more adeptly handled pandas directly , i'd appreciate optimal implementation using numpy/pandas. many other similar queries point 'groupby' of pandas, not well-versed it.
getting output in matrix format trickier dict based implementation. guess easier pandas/numpy. thanks!
this might work (using numpy.histogram2d
):
import numpy np col1 = np.random.random_integers(1, 5, 100) col2 = np.random.random_integers(1, 10, 100) bin1 = np.arange(0.5, 6.5, 1) bin2 = np.arange(0.5, 11.5, 1) mat = np.histogram2d(col1, col2, bins=[bin1, bin2])[0] print mat.shape print mat
which yields
(5, 10) array([[ 4., 2., 0., 6., 2., 2., 1., 2., 1., 2.], [ 3., 3., 3., 0., 3., 0., 1., 4., 1., 0.], [ 4., 2., 1., 1., 3., 2., 5., 2., 2., 2.], [ 1., 1., 3., 2., 3., 1., 4., 4., 0., 0.], [ 0., 2., 1., 4., 3., 1., 3., 2., 0., 1.]])
i haven't tested using lists lengths on couple thousand, think should scale ok few million values.
edit:
hi @nahsivar. i'm not familiar pandas (but should be), spent few minutes playing around. here several ways want (i think):
#instantiate dataframe import pandas pd random import choice x_choice = 'abcde' y_choice = 'abcdefghij' x, y = [], [] in range(100): x[i] = choice(x_choice) y[i] = choice(y_choice) df = pd.dataframe(data={'col1': x, 'col2': y}) # 1 df.pivot_table(rows='col1', cols='col2', aggfunc=len) # use fill_value=0 replace nans 0 # output: col2 b c d e f g h j col1 nan 1 3 1 2 2 2 1 4 2 b 1 1 nan 3 5 1 2 3 1 nan c 4 1 2 nan nan 4 3 2 1 2 d 2 2 2 1 1 3 3 4 4 2 e 1 1 1 nan 4 2 6 3 2 2 # 2 df.groupby('col2')['col1'].value_counts().unstack(level=0) # output: col2 b c d e f g h j nan 1 3 1 2 2 2 1 4 2 b 1 1 nan 3 5 1 2 3 1 nan c 4 1 2 nan nan 4 3 2 1 2 d 2 2 2 1 1 3 3 4 4 2 e 1 1 1 nan 4 2 6 3 2 2 # 3 pd.crosstab(df.col1, df.col2) # output: col2 b c d e f g h j col1 0 1 3 1 2 2 2 1 4 2 b 1 1 0 3 5 1 2 3 1 0 c 4 1 2 0 0 4 3 2 1 2 d 2 2 2 1 1 3 3 4 4 2 e 1 1 1 0 4 2 6 3 2 2
Comments
Post a Comment