dataframe - Efficient conditional summing by multiple conditions in R
I'm struggling to find an efficient solution to the following problem:
I have a large, manipulated data frame with around 8 columns and 80,000 rows that includes multiple data types. I want to create a new data frame that contains the sum of one column wherever certain conditions on the large data frame are met.
Imagine the head of the original data frame looks like this. The column $years.raw indicates that the company measured data for x years.
> cbind(company.raw,years.raw,source,amount.inkg)
     company.raw years.raw source      amount.inkg
[1,] "c1"        "1"       "ink"       "5"
[2,] "c1"        "1"       "recycling" "2"
[3,] "c2"        "1"       "coffee"    "10"
[4,] "c2"        "1"       "combusted" "15"
[5,] "c2"        "2"       "printer"   "14"
[6,] "c2"        "2"       "tea"       "18"
What I need is to create a new data frame that sums the values of column $amount.inkg for every company and every year, based on certain string elements. I saved the string elements in the three vectors below. These string elements are part of the column $source in the original data frame.
> vector1 <- c("tea","coffee")
> vector2 <- c("ink","printer")
> vector3 <- c("recycling","combusted")
The preferred data frame would look like this:
company year amount.vector1 amount.vector2 amount.vector3
c1      1    0              5              2
c2      1    10             0              15
c2      2    18             14             0
The general approach for $amount.vector1 would be: sum all values of column $amount.inkg for every company and every year where the string elements of the original data frame column $source match the string elements of vector1. The same goes for $amount.vector2, except that the elements are different, of course.
If there are no values available, a "0" should be added instead of an NA error. This needs to be done for the whole raw data frame, which includes around 250 companies, with data for every company for years 1:8 (this differs quite a lot).
Edit: the data frame needs to be one-row-per-company-per-year.
c1 year 1
c1 year 2
c1 year 3
c2 year 1
c2 year 2
I tried to write a function that combines these conditions but failed. I'm quite new to R and didn't know how to link these conditions and apply them to the whole data frame.
Your data is in 'long form' (multiple rows per company, source, year, ...).
You want to aggregate amount.inkg over each company and year, for multiple values of source. In other words, you want to aggregate with conditionals on the 'source' field.
Again, please give a reproducible example. (Thanks, josilber.) It's a four-liner with either split-apply-combine (ddply) or logical indexing:
df = data.frame(company.raw = c("c1", "c1", "c2", "c2", "c2", "c2"),
                years.raw   = c(1, 1, 1, 1, 2, 2),
                source      = c("ink", "recycling", "coffee", "combusted", "printer", "tea"),
                amount.inkg = c(5, 2, 10, 15, 14, 18))

# Option 1. split-apply-combine: ddply(...summarize) with conditionals on the data
library(plyr) # use dplyr if performance on a large d.f. becomes an issue
ddply(df, .(company.raw, years.raw), summarize,
      amount.vector1 = sum(amount.inkg[source %in% c('tea','coffee')]),
      amount.vector2 = sum(amount.inkg[source %in% c('ink','printer')]),
      amount.vector3 = sum(amount.inkg[source %in% c('recycling','combusted')])
)

# Option 2. sum with logical indexing on the df:
# (this was written before you modified the question to one-row-per-company-and-per-year;
#  it computes a single overall total and recycles it into every row)
df$amount.vector1 <- sum( df[(df$source %in% c('tea','coffee')),]$amount.inkg )
# josilber clarifies that you want one-row-per-company ...
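Option 3. Use aggregate (see ?aggregate) together with subset(...), although aggregate is overkill for a sum.
sel <- subset(df, source %in% c('tea','coffee'))
aggregate(sel$amount.inkg, by = list(company.raw = sel$company.raw, years.raw = sel$years.raw), FUN = sum)
The by argument to aggregate is where the action is: it takes the list of grouping variables, while subset(...) applies the selection criteria.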
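To get all three sums per company and year from a single aggregate call, one possible sketch in base R is to zero out the amounts that fall outside each source group and then sum the masked columns. The vectors vector1, vector2 and vector3 come from the question; the names masked and result are made up for this illustration. Company/year pairs that have no rows at all in the raw data will not appear in the output.

```r
# Sample data, as in the reproducible example
df <- data.frame(
  company.raw = c("c1", "c1", "c2", "c2", "c2", "c2"),
  years.raw   = c(1, 1, 1, 1, 2, 2),
  source      = c("ink", "recycling", "coffee", "combusted", "printer", "tea"),
  amount.inkg = c(5, 2, 10, 15, 14, 18)
)

# Source groups as defined in the question
vector1 <- c("tea", "coffee")
vector2 <- c("ink", "printer")
vector3 <- c("recycling", "combusted")

# 'masked' (hypothetical name): one column per group, holding the amount
# where the row's source belongs to the group and 0 otherwise
masked <- data.frame(
  amount.vector1 = df$amount.inkg * (df$source %in% vector1),
  amount.vector2 = df$amount.inkg * (df$source %in% vector2),
  amount.vector3 = df$amount.inkg * (df$source %in% vector3)
)

# Sum each masked column within every company/year pair
result <- aggregate(masked, by = df[c("company.raw", "years.raw")], FUN = sum)
print(result)
```

Because the masked columns contain 0 rather than missing values, a group with no matching source in a given company/year comes out as 0.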
Note: %in% performs a scan operation, so if your vector and d.f. are large, or for scalability, you'd need to break it into boolean operations that can be vectorized: (source=='tea' | source=='coffee')
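As a small sanity check, the two selection forms can be compared directly. This is only a sketch on the six sample sources; src, via_in and via_or are names invented here. One difference worth knowing: for a missing source, %in% returns FALSE while == returns NA.

```r
# The six sources from the sample data
src <- c("ink", "recycling", "coffee", "combusted", "printer", "tea")

# Selection via %in% and via explicit comparisons joined with |
via_in <- src %in% c("tea", "coffee")
via_or <- src == "tea" | src == "coffee"

# Both forms select the same rows when there are no missing values
same_selection <- identical(via_in, via_or)

# With a missing value the two forms differ
na_in <- NA_character_ %in% c("tea", "coffee")              # FALSE
na_or <- NA_character_ == "tea" | NA_character_ == "coffee" # NA
```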
As for preventing NA sums when the subset is empty: sum(c()) = 0, so don't worry about that. If you do, use either na.omit or ifelse(is.na(x),0,x) on the final result.
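A quick illustration of both points, using made-up values (amounts, src and x are hypothetical names): summing an empty selection gives 0, and ifelse can replace any NA that still ends up in a result vector.

```r
# Three sample rows, none of which has the source "tea"
amounts <- c(5, 2, 10)
src     <- c("ink", "recycling", "coffee")

# The selection is empty, so sum() is applied to a zero-length vector
empty_sum <- sum(amounts[src %in% c("tea")])

# If NAs do appear in a final result, replace them with 0
x        <- c(10, NA, 18)
x_filled <- ifelse(is.na(x), 0, x)
```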