Text Categorization in R
My objective is to automatically route feedback emails to the respective division.
The fields are fnumber, category, subcategory and description.
I have the last 6 months of data in the above format. The entire email is stored in description, along with its category and subcategory.
I have to analyse the description column and find the keywords for each category/subcategory, so that when the next feedback email arrives it is automatically assigned to a category and subcategory based on the keywords generated from the historical data.
I have imported the XML file into R and converted the XML into a data frame with the required fields. I have 23017 records for a particular month; the first twenty rows of the data frame are listed below.
I have more than 100 categories and subcategories.
I am new to text mining, but with the help of the tm package I have tried the code below:
step1 <- structure(list(
  fnumber = structure(1:20, .Label = c(
    " 20131202-0885 ", "20131202-0886 ", "20131202-0985 ", "20131202-1145 ",
    "20131202-1227 ", "20131202-1228 ", "20131202-1235 ", "20131202-1236 ",
    "20131202-1247 ", "20131202-1248 ", "20131202-1249 ", "20131222-0157 ",
    "20131230-0668 ", "20131230-0706 ", "20131230-0776 ", "20131230-0863 ",
    "20131230-0865 ", "20131230-0866 ", "20131230-0868 ", "20131230-0874 "
  ), class = "factor"),
  category = structure(c(9L, 14L, 11L, 6L, 10L, 12L, 7L, 11L, 13L, 13L,
                         6L, 1L, 2L, 5L, 4L, 8L, 8L, 3L, 11L, 11L), .Label = c(
    " bvl-vocational licence (vl) investigation ",
    " bvl - bus licensing ",
    " corporate transformation office (cto) ",
    " csv - customer service ",
    " deregistration - transfer/split/encash rebates ",
    " enf - enforcement matters ",
    " enf - illegal parking ",
    " marina coastal expressway ",
    " ptq - public transport quality ",
    " road asset management ",
    " service quality (sq) ",
    " traffic management & cycling ",
    " vr - issuance/disputes of bookings vendors ",
    " vrlso - update owner's particulars "
  ), class = "factor"),
  subcategory = structure(c(2L, 15L, 5L, 1L, 3L, 14L, 6L, 12L, 8L, 8L,
                            18L, 17L, 11L, 10L, 16L, 7L, 9L, 4L, 13L, 12L), .Label = c(
    " abandoned vehicles ",
    " bus driver behaviour ",
    " claims accident ",
    " corporate development ",
    " faq ",
    " illegal parking ",
    " intra group (straddling case) ",
    " issuance/disputes of bookings vendors ",
    " mce ",
    " parf (transfer/split/encash) ",
    " private bus related matters ",
    " referrals ",
    " straddle cases (across groups) ",
    " traffic flow ",
    " update owner particulars ",
    " vehicle related matters ",
    " vl holders (complaint/investigation/appeal) ",
    " warrant of arrrest "
  ), class = "factor"),
  description = structure(c(3L, 1L, 2L, 9L, 4L, 7L, 8L, 6L, 5L, 3L,
                            1L, 2L, 9L, 4L, 7L, 8L, 6L, 5L, 7L, 8L), .Label = c(
    " street road leading & exit vehicles , buses (i think) 4 temples and, latest addition of 8b, 4 (!!) \nindustrial estate.",
    "could kindly increase frequencies service 58. colleagues travelled avoid 58!!!\nthey rather take 62-87 instead of 3-58",
    "i saw bus no. 169a approaching bus stop. @ time, passengers had boarded , alighted bus.",
    "i want apologise , excuse summon because dont know can't park motorcycle @ double line when friday prayer ..please forgive me",
    "many prompt action. please note rectification rather short term it's replacing bulb without proper cover protect against elements.ps. same job done i.e. without installing cover few months back; , same problem happen again.",
    "placed in such manner cannot seen due background ahead; colours blend.there not room angle divert 1st lane 2nd lane. outer cone covers more 1st lane",
    "the vehicle gx3368k observed driving along pie towards changi on 28th november 2013, 3:48pm without functioning braking lights during day.",
    "the vehicle behaving suspiciously many sudden brakes - caused vehicles behind heavy \"jam brakes\" due no warnings @ (no brake lights).",
    "we have received feedback regarding lane of said address being blocked items.\nkindly investigate , keep in loop on actions taken while fire safety issues on case again."
  ), class = "factor")
), .Names = c("fnumber", "category", "subcategory", "description"),
class = "data.frame", row.names = c(NA, -20L))

dim(step1)
names(step1)

library(tm)

# Map data frame columns to document id and content.
# Note: readTabular() and this readerControl usage reflect the tm release
# in use when this was written; newer tm releases may differ.
m <- list(id = "fnumber", content = "description")
myReader <- readTabular(mapping = m)
txt <- Corpus(DataframeSource(step1), readerControl = list(reader = myReader))
summary(txt)

# Standard clean-up steps
txt <- tm_map(txt, tolower)
txt <- tm_map(txt, removeNumbers)
txt <- tm_map(txt, removePunctuation)
txt <- tm_map(txt, stripWhitespace)
txt <- tm_map(txt, removeWords, stopwords("english"))
txt <- tm_map(txt, stemDocument)

tdm <- TermDocumentMatrix(txt, control = list(removePunctuation = TRUE, stopwords = TRUE))
tdm
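Update: I have now got the frequently occurring keywords for the whole dataset:

library(reshape2)  # provides melt()

tdm3 <- removeSparseTerms(tdm, 0.98)
tdm.dense <- as.matrix(tdm3)
# Long format: one row per (term, document) pair with its count
tdm.dense <- melt(tdm.dense, value.name = "count")
# Total count per term across all documents
tdm_final <- with(tdm.dense, aggregate(count, list(Terms), sum))
colnames(tdm_final) <- c("words", "word_freq")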
I am stuck after this. I am not sure how to get:
1. The relevant keywords (unigrams, bigrams and trigrams) for each category/subcategory, thereby generating a taxonomy list (keywords vs. category/subcategory).
2. When the next feedback email is entered, how to assign it to a category and subcategory (there are 100+ categories), based on the keyword taxonomy list generated in the step above.
3. Or, if my understanding and the solution above are not correct, please advise me on other possible options.
I have gone through materials on the internet (I was able to find classification of text into 2 classes, but not more than that), but I am not able to proceed further. I am new to text mining in R, so excuse me if this is naive.
Any help or starting point would be great.
I'll give a brief answer here because your question is a little vague.
The code below will create a TDM of 2-grams for each category.
library(tm)
library(RWeka)
library(SnowballC)

# Create a function that produces an 'nValue'-gram TDM for a data frame.
# The data frame is passed in as 'data', so that by() below can hand over
# each category's rows rather than the whole of step1.
makeNgramFeature <- function(nValue, data) {
  tokenize <- function(x) {
    NGramTokenizer(x, Weka_control(min = nValue, max = nValue))
  }
  m <- list(id = "fnumber", content = "description")
  myReader <- readTabular(mapping = m)
  txt <- Corpus(DataframeSource(data), readerControl = list(reader = myReader))
  txt <- tm_map(txt, tolower)
  txt <- tm_map(txt, removeNumbers)
  txt <- tm_map(txt, removePunctuation)
  txt <- tm_map(txt, stripWhitespace)
  txt <- tm_map(txt, removeWords, stopwords("english"))
  txt <- tm_map(txt, stemDocument)
  tdm <- TermDocumentMatrix(txt, control = list(removePunctuation = TRUE,
                                                stopwords = TRUE,
                                                tokenize = tokenize))
  return(tdm)
}

# A list of TDMs, one per category. You could create a 'cascade' of functions,
# or create a unique list of category/sub-category pairs to analyse.
all <- by(step1, INDICES = step1$category, FUN = function(x) { makeNgramFeature(2, x) })
The resulting list 'all' is a little ugly. You can run names(all) to look at the categories. I'm sure there is a cleaner way to solve this, but hopefully this gets you going on one of the many correct paths...
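To illustrate the remaining step the question asks about (assigning a new email to one of many categories from a per-category keyword list), here is a minimal base-R sketch. It is not the tm/RWeka pipeline above. The toy data frame `train` and the helpers `tokenize_words`, `ngrams`, `build_taxonomy` and `classify` are all hypothetical names invented for this example, and the scoring (summing raw n-gram counts) is a deliberately crude assumption rather than a recommended classifier.

```r
# Hypothetical toy data standing in for step1 (not the original records).
train <- data.frame(
  category = c("illegal parking", "illegal parking",
               "bus service", "bus service"),
  description = c("car parked on double yellow line near the temple",
                  "motorcycle parked on double line during prayer",
                  "please increase the frequency of bus service 58",
                  "bus driver skipped the bus stop again"),
  stringsAsFactors = FALSE
)

# Split text into lowercase word tokens; digits and punctuation act as
# separators, and words of one or two letters are dropped (crude filter).
tokenize_words <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[nchar(words) > 2]
}

# All n-grams of size 1..max_n, built over the filtered tokens.
ngrams <- function(text, max_n = 2) {
  w <- tokenize_words(text)
  out <- character(0)
  for (n in seq_len(max_n)) {
    if (length(w) >= n) {
      idx <- seq_len(length(w) - n + 1)
      grams <- vapply(idx,
                      function(i) paste(w[i:(i + n - 1)], collapse = " "),
                      character(1))
      out <- c(out, grams)
    }
  }
  out
}

# Taxonomy: for each category, a table of n-gram counts over its emails.
build_taxonomy <- function(data, max_n = 2) {
  per_cat <- split(data$description, data$category)
  lapply(per_cat, function(texts) {
    table(unlist(lapply(texts, ngrams, max_n = max_n)))
  })
}

# Score a new text against every category by summing the category counts
# of the n-grams it contains; return NA when nothing matches at all.
classify <- function(text, taxonomy, max_n = 2) {
  terms <- unique(ngrams(text, max_n))
  scores <- vapply(taxonomy, function(counts) {
    hits <- intersect(terms, names(counts))
    as.numeric(sum(counts[hits]))
  }, numeric(1))
  if (max(scores) == 0) return(NA_character_)
  names(scores)[which.max(scores)]
}

taxonomy <- build_taxonomy(train)
classify("someone parked a lorry on the double line", taxonomy)
classify("the bus service was late", taxonomy)
```

The same idea extends to subcategories by splitting on the category/subcategory pair instead of the category alone. With 100+ categories, raw counts will favour categories with more training emails, so some form of weighting or a proper multi-class model would likely be needed.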