Merge duplicates and assign value of highest frequency (except neutrals!) in R


I posted a similar question before, but I need to change the conditions. I have a data.frame full of multiple entries. The columns are "no", "article", and "class" ("p" = positive, "n" = negative, "x" = neutral). It looks like this:

    no <- c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41)
    article <- c("earnings went up.", "earnings went up.", "massive layoff.",
                 "they moved offices.", "mr. x joined company.",
                 "class action filed.", "accident in warehouse.",
                 "blabla one.", "blabla two.", "blabla three.",
                 "blabla four.", "blabla five.")
    class <- c("p", "p", "n", "x", "x", "n", "n", "x", "p", "p", "n", "p")

    mydf <- data.frame(no, article, class)
    mydf

    #    no                article class
    # 1   3      earnings went up.     p
    # 2   3      earnings went up.     p
    # 3   5        massive layoff.     n
    # 4   5    they moved offices.     x
    # 5   5  mr. x joined company.     x
    # 6  24    class action filed.     n
    # 7  24 accident in warehouse.     n
    # 8  35            blabla one.     x
    # 9  35            blabla two.     p
    # 10 41          blabla three.     p
    # 11 41           blabla four.     n
    # 12 41           blabla five.     p

I want to get rid of the multiple entries. The articles of multiple entries should be merged, but only if the articles are not the same! Then I want the class with the highest frequency to be assigned, except for "x". "x" means neutral, so if a duplicate has, e.g., "x" and "p", I still want "p" to be assigned. If it has "n" and "x", then "n" should be assigned. The same goes for the other multiple entries. If "p" and "n" occur with equal frequency, "x" should be assigned.

    # examples:
    # "p", "x"      --> "p"
    # "p", "n"      --> "x"
    # "x", "n", "x" --> "n"
    # "p", "n", "p" --> "p"

    # the resulting data.frame should look like this:

    #    no                                                   article class
    # 1   3                                         earnings went up.     p
    # 2   5 massive layoff. they moved offices. mr. x joined company.     n
    # 3  24                class action filed. accident in warehouse.     n
    # 4  35                                   blabla one. blabla two.     p
    # 5  41                   blabla three. blabla four. blabla five.     p

In my old question, articles were merged if they were the same, and the class with the highest frequency was assigned ("x", "n", and "p" were all treated the same). If there was no highest frequency, "x" was assigned. The helpful approaches were:

    library(qdap)

    # merge the articles per "no"
    df2 <- with(mydf, sentCombine(article, no))

    # look up the most frequent class per "no"; ties become "x"
    df2$class <- df2$no %l% vect2df(c(tapply(mydf[, 3], mydf[, 1], function(x) {
      tab <- table(x)
      ifelse(sum(tab %in% max(tab)) > 1, "x", names(tab)[max(tab) == tab])
    })))

I tried to change the code, but I know too little about how to write functions, and about qdap, to understand this.

How about dplyr?

    library(dplyr)

    # aggregation: majority vote between "n" and "p", ignoring "x"
    getclass <- function(class) {
      n.n <- length(class[class == "n"])   # number of negatives
      n.p <- length(class[class == "p"])   # number of positives
      ret <- "x"                           # return "x", unless
      if (n.n > n.p) ret <- "n"            # there are more n's than p's (return "n")
      if (n.n < n.p) ret <- "p"            # or more p's than n's (return "p")
      return(ret)
    }

    # one row per "no": distinct articles pasted together, class by vote
    group_by(mydf, no) %>%
      summarise(article = paste0(unique(article), collapse = " "),
                class = getclass(class))

    Source: local data frame [5 x 3]

      no                                                   article class
    1  3                                         earnings went up.     p
    2  5 massive layoff. they moved offices. mr. x joined company.     n
    3 24                class action filed. accident in warehouse.     n
    4 35                                   blabla one. blabla two.     p
    5 41                   blabla three. blabla four. blabla five.     p
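
If you would rather not load a package, the same rule can also be sketched in base R. This is only an illustration under the conditions stated in the question, not the approach from the old question; the helper name `resolve_class` and the use of `split()` with `lapply()` are my own choices here, and the data is repeated so the snippet runs on its own.

```r
# Hypothetical helper: majority vote between "p" and "n", ignoring "x".
# Ties (including groups with no "p" or "n" at all) fall back to "x".
resolve_class <- function(cls) {
  n_neg <- sum(cls == "n")
  n_pos <- sum(cls == "p")
  if (n_neg > n_pos) "n" else if (n_pos > n_neg) "p" else "x"
}

# Same data as in the question, with strings kept as characters.
mydf <- data.frame(
  no = c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41),
  article = c("earnings went up.", "earnings went up.", "massive layoff.",
              "they moved offices.", "mr. x joined company.",
              "class action filed.", "accident in warehouse.",
              "blabla one.", "blabla two.", "blabla three.",
              "blabla four.", "blabla five."),
  class = c("p", "p", "n", "x", "x", "n", "n", "x", "p", "p", "n", "p"),
  stringsAsFactors = FALSE
)

# Split by "no", then collapse each group into a single row:
# distinct articles are pasted together, identical ones appear once.
groups <- split(mydf, mydf$no)
result <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    no = g$no[1],
    article = paste(unique(g$article), collapse = " "),
    class = resolve_class(g$class),
    stringsAsFactors = FALSE
  )
}))
rownames(result) <- NULL

print(result)
```

Because `unique()` is applied before pasting, the two identical "earnings went up." rows for no 3 collapse into one article, while the distinct articles of the other groups are concatenated.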
