pandas - Counting occurrences of a word in a column of a TSV file using Python
A question from a Python beginner! I have a TSV file that looks like this:

whi5  yor083w  cdc28  ybr160w  physical interactions  19823668
whi5  yor083w  cdc28  ybr160w  physical interactions  21658602
whi5  yor083w  cdc28  ybr160w  physical interactions  24186061
whi5  yor083w  rpd3   ynl330c  physical interactions  19823668
whi5  yor083w  swi4   yer111c  physical interactions  15210110
whi5  yor083w  swi4   yer111c  physical interactions  15210111

I would like to count the lines containing the same word in row[3], and output only the first one of each, with the number of occurrences in a new column:

whi5  yor083w  cdc28  ybr160w  physical interactions  19823668  3
whi5  yor083w  rpd3   ynl330c  physical interactions  19823668  1
whi5  yor083w  swi4   yer111c  physical interactions  15210110  2

So far I have tried a combination of 'csv' and 'Counter', or 'pandas' and 'Counter', without success...
Using pandas:

>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_table(StringIO("""\
... col1 col2 col3 col4 col5 col6
... whi5 yor083w cdc28 ybr160w "physical interactions" 19823668
... whi5 yor083w cdc28 ybr160w "physical interactions" 21658602
... whi5 yor083w cdc28 ybr160w "physical interactions" 24186061
... whi5 yor083w rpd3 ynl330c "physical interactions" 19823668
... whi5 yor083w swi4 yer111c "physical interactions" 15210110
... whi5 yor083w swi4 yer111c "physical interactions" 15210111"""),
...     sep=r"\s+")

The pandas data frame looks like:
>>> df
   col1     col2   col3     col4                   col5      col6
0  whi5  yor083w  cdc28  ybr160w  physical interactions  19823668
1  whi5  yor083w  cdc28  ybr160w  physical interactions  21658602
2  whi5  yor083w  cdc28  ybr160w  physical interactions  24186061
3  whi5  yor083w   rpd3  ynl330c  physical interactions  19823668
4  whi5  yor083w   swi4  yer111c  physical interactions  15210110
5  whi5  yor083w   swi4  yer111c  physical interactions  15210111

[6 rows x 6 columns]

To count, group by col3 and take the length of each group:
>>> df['cnt'] = df.groupby('col3')['col3'].transform(len)
>>> df
   col1     col2   col3     col4                   col5      col6  cnt
0  whi5  yor083w  cdc28  ybr160w  physical interactions  19823668    3
1  whi5  yor083w  cdc28  ybr160w  physical interactions  21658602    3
2  whi5  yor083w  cdc28  ybr160w  physical interactions  24186061    3
3  whi5  yor083w   rpd3  ynl330c  physical interactions  19823668    1
4  whi5  yor083w   swi4  yer111c  physical interactions  15210110    2
5  whi5  yor083w   swi4  yer111c  physical interactions  15210111    2

[6 rows x 7 columns]

To pick the first row of each group:
>>> df.groupby('col3').apply(lambda obj: obj.head(n=1))
         col1     col2   col3     col4                   col5      col6  cnt
col3
cdc28 0  whi5  yor083w  cdc28  ybr160w  physical interactions  19823668    3
rpd3  3  whi5  yor083w   rpd3  ynl330c  physical interactions  19823668    1
swi4  4  whi5  yor083w   swi4  yer111c  physical interactions  15210110    2

[3 rows x 7 columns]
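The two steps above (count per group, then keep the first row of each group) could also be combined into one small script that reads a real tab-separated file and writes the result back out as TSV. This is only a sketch under some assumptions: the file has no header row, the column names `col1` to `col6` are made up for illustration, the helper `count_first_rows` is hypothetical, and `drop_duplicates` is swapped in for the `groupby(...).apply(...)` call as a way to keep a flat index.

```python
from io import StringIO

import pandas as pd

# Sample rows from the question, tab-separated. In practice this would be
# a path to the TSV file instead of an in-memory string.
TSV = (
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t19823668\n"
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t21658602\n"
    "whi5\tyor083w\tcdc28\tybr160w\tphysical interactions\t24186061\n"
    "whi5\tyor083w\trpd3\tynl330c\tphysical interactions\t19823668\n"
    "whi5\tyor083w\tswi4\tyer111c\tphysical interactions\t15210110\n"
    "whi5\tyor083w\tswi4\tyer111c\tphysical interactions\t15210111\n"
)

# Assumed column names; the original file has no header row.
COLUMNS = ["col1", "col2", "col3", "col4", "col5", "col6"]


def count_first_rows(source, key="col3"):
    """Return the first row of each `key` group with its row count in 'cnt'."""
    df = pd.read_csv(source, sep="\t", header=None, names=COLUMNS)
    # Attach the size of each group to every row of that group.
    df["cnt"] = df.groupby(key)[key].transform("size")
    # Keep only the first row seen for each distinct key value.
    return df.drop_duplicates(subset=key, keep="first").reset_index(drop=True)


result = count_first_rows(StringIO(TSV))
# Write the result as tab-separated text without header or index.
print(result.to_csv(sep="\t", header=False, index=False), end="")
```

Passing a file name instead of `StringIO(TSV)` should work the same way, and `result.to_csv("out.tsv", sep="\t", header=False, index=False)` would write the rows to a file (the name `out.tsv` is just a placeholder).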