pandas - Counting occurrences of a word in a column of a TSV file using Python

June 15, 2011

A question from a Python beginner! I have a TSV file that looks like this:

whi5    yor083w cdc28   ybr160w physical interactions   19823668
whi5    yor083w cdc28   ybr160w physical interactions   21658602
whi5    yor083w cdc28   ybr160w physical interactions   24186061
whi5    yor083w rpd3    ynl330c physical interactions   19823668
whi5    yor083w swi4    yer111c physical interactions   15210110
whi5    yor083w swi4    yer111c physical interactions   15210111

I want to count the lines that contain the same word in row[3], and output only the first such line together with the number of occurrences in a new column:

whi5    yor083w cdc28   ybr160w physical interactions   19823668    3
whi5    yor083w rpd3    ynl330c physical interactions   19823668    1
whi5    yor083w swi4    yer111c physical interactions   15210110    2

So far I have tried a combination of 'csv' and 'Counter', or 'pandas' and 'Counter', without success...
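For reference, the 'csv' plus 'Counter' route mentioned in the question can be sketched roughly as follows. This is only an illustration using the standard library: the function name `count_first_rows`, the `RAW` sample string, and the assumption that the file is truly tab-separated with no header line are all hypothetical choices, not part of the original post.

```python
import csv
import io
from collections import Counter

# Sample rows from the question, assumed to be tab-separated with no header.
RAW = (
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t19823668\n"
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t21658602\n"
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t24186061\n"
    "whi5\tyor083w\trpd3\tynl330c\tphysical interactions\t19823668\n"
    "whi5\tyor083w\tswi4\tyer111c\tphysical interactions\t15210110\n"
    "whi5\tyor083w\tswi4\tyer111c\tphysical interactions\t15210111\n"
)


def count_first_rows(lines, key=3):
    """Return the first row for each distinct row[key], with its count appended."""
    rows = list(csv.reader(lines, delimiter="\t"))
    # First pass: how many rows share each value of row[key].
    counts = Counter(row[key] for row in rows)
    seen = set()
    out = []
    # Second pass: keep only the first row seen for each value, in file order.
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row + [str(counts[row[key]])])
    return out


result = count_first_rows(io.StringIO(RAW))
for row in result:
    print("\t".join(row))
```

With a real file, `open(path, newline="")` could be passed in place of the `io.StringIO` object.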

Using pandas:

>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_table(StringIO("""\
... col1 col2 col3 col4 col5 col6
... whi5    yor083w cdc28   ybr160w "physical interactions"   19823668
... whi5    yor083w cdc28   ybr160w "physical interactions"   21658602
... whi5    yor083w cdc28   ybr160w "physical interactions"   24186061
... whi5    yor083w rpd3    ynl330c "physical interactions"   19823668
... whi5    yor083w swi4    yer111c "physical interactions"   15210110
... whi5    yor083w swi4    yer111c "physical interactions"   15210111"""),
... sep=r"\s+")  # split on runs of whitespace; quoted fields stay intact

The pandas DataFrame looks like this:

>>> df
   col1     col2   col3     col4                   col5      col6
0  whi5  yor083w  cdc28  ybr160w  physical interactions  19823668
1  whi5  yor083w  cdc28  ybr160w  physical interactions  21658602
2  whi5  yor083w  cdc28  ybr160w  physical interactions  24186061
3  whi5  yor083w   rpd3  ynl330c  physical interactions  19823668
4  whi5  yor083w   swi4  yer111c  physical interactions  15210110
5  whi5  yor083w   swi4  yer111c  physical interactions  15210111

[6 rows x 6 columns]

To count, group by col3 and take the length of each group:

>>> df['cnt'] = df.groupby('col3')['col3'].transform(len)
>>> df
   col1     col2   col3     col4                   col5      col6  cnt
0  whi5  yor083w  cdc28  ybr160w  physical interactions  19823668    3
1  whi5  yor083w  cdc28  ybr160w  physical interactions  21658602    3
2  whi5  yor083w  cdc28  ybr160w  physical interactions  24186061    3
3  whi5  yor083w   rpd3  ynl330c  physical interactions  19823668    1
4  whi5  yor083w   swi4  yer111c  physical interactions  15210110    2
5  whi5  yor083w   swi4  yer111c  physical interactions  15210111    2

[6 rows x 7 columns]
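To see why `transform` is used here rather than a plain aggregation, it may help to compare the two on a stripped-down frame. The sketch below keeps only a col3 column (a simplification made for illustration): `size()` returns one number per group, while `transform(len)` repeats the group length on every row so it can be assigned back as a column.

```python
import pandas as pd

# Only the grouping column from the example above, for brevity.
df = pd.DataFrame({"col3": ["cdc28", "cdc28", "cdc28", "rpd3", "swi4", "swi4"]})

# One value per group, indexed by the group key.
per_group = df.groupby("col3").size()

# One value per original row: each row receives the length of its own group.
per_row = df.groupby("col3")["col3"].transform(len)

print(per_group)
print(per_row)
```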

To pick the first row of each group:

>>> df.groupby('col3').apply(lambda obj: obj.head(n=1))
         col1     col2   col3     col4                   col5      col6  cnt
col3
cdc28 0  whi5  yor083w  cdc28  ybr160w  physical interactions  19823668    3
rpd3  3  whi5  yor083w   rpd3  ynl330c  physical interactions  19823668    1
swi4  4  whi5  yor083w   swi4  yer111c  physical interactions  15210110    2

[3 rows x 7 columns]
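Putting the steps together for a genuinely tab-separated file, one possible end-to-end sketch is shown below. Several things here are assumptions rather than part of the answer above: the helper name `count_and_keep_first`, the `RAW` sample string, the col1..col6 names supplied for a header-less file, and the use of `drop_duplicates` in place of the `groupby(...).apply(...)` step as a simpler way to keep the first row per value.

```python
import io

import pandas as pd

# Sample rows from the question, assumed to be tab-separated with no header.
RAW = (
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t19823668\n"
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t21658602\n"
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t24186061\n"
    "whi5\tyor083w\trpd3\tynl330c\tphysical interactions\t19823668\n"
    "whi5\tyor083w\tswi4\tyer111c\tphysical interactions\t15210110\n"
    "whi5\tyor083w\tswi4\tyer111c\tphysical interactions\t15210111\n"
)

# Hypothetical column names, matching the ones used in the example above.
NAMES = ["col1", "col2", "col3", "col4", "col5", "col6"]


def count_and_keep_first(source, key="col3"):
    """Add a per-group row count, then keep the first row of each group."""
    df = pd.read_csv(source, sep="\t", header=None, names=NAMES)
    # Size of each group, broadcast back onto every row of that group.
    df["cnt"] = df.groupby(key)[key].transform("size")
    # Keep the first row for each distinct key value, in file order.
    return df.drop_duplicates(subset=key, keep="first")


result = count_and_keep_first(io.StringIO(RAW))
# Emit tab-separated text without the header or the index.
print(result.to_csv(sep="\t", header=False, index=False))
```

A file path can be passed instead of the `io.StringIO` object, and `result.to_csv("out.tsv", sep="\t", header=False, index=False)` would write the rows to a file (the name `out.tsv` is just a placeholder).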
