Text Categorization in R


My objective is to automatically route feedback emails to the respective division.
The fields are fnumber, category, subcategory and description.
I have the last 6 months of data in the above format - the entire email is stored in the description column, along with its category and subcategory.

I have to analyse the description column and find keywords for each category/subcategory, so that when the next feedback email comes in, it is automatically assigned a category and subcategory based on the keywords generated from the historical data.

I have imported the XML file into R and converted the XML into a data frame with the required fields. There are 23017 records for a particular month - I have listed the first twenty rows of the data frame below.
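For reference, the conversion from XML to data frame can be done along these lines (this is only a sketch - the file name is a placeholder and the exact call may differ from what I actually ran):

library(XML)
feedback <- xmlToDataFrame("feedback_201312.xml")   # placeholder file name
step1 <- feedback[, c("fnumber", "category", "subcategory", "description")]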

I have more than 100 categories and subcategories.
I am new to the text mining concept and to the tm package - I have tried the code below:

step1 <- structure(list(
  fnumber = structure(1:20, .Label = c(" 20131202-0885 ", "20131202-0886 ",
    "20131202-0985 ", "20131202-1145 ", "20131202-1227 ", "20131202-1228 ",
    "20131202-1235 ", "20131202-1236 ", "20131202-1247 ", "20131202-1248 ",
    "20131202-1249 ", "20131222-0157 ", "20131230-0668 ", "20131230-0706 ",
    "20131230-0776 ", "20131230-0863 ", "20131230-0865 ", "20131230-0866 ",
    "20131230-0868 ", "20131230-0874 "), class = "factor"),
  category = structure(c(9L, 14L, 11L, 6L, 10L, 12L, 7L, 11L, 13L, 13L, 6L,
    1L, 2L, 5L, 4L, 8L, 8L, 3L, 11L, 11L), .Label = c(
    " bvl-vocational licence (vl) investigation ", " bvl - bus licensing ",
    " corporate transformation office (cto) ", " csv - customer service ",
    " deregistration - transfer/split/encash rebates ", " enf - enforcement matters ",
    " enf - illegal parking  ", " marina coastal expressway ",
    " ptq - public transport quality ", " road asset management ",
    " service quality (sq) ", " traffic management & cycling ",
    " vr - issuance/disputes of bookings vendors ",
    " vrlso - update owner's particulars "), class = "factor"),
  subcategory = structure(c(2L, 15L, 5L, 1L, 3L, 14L, 6L, 12L, 8L, 8L, 18L,
    17L, 11L, 10L, 16L, 7L, 9L, 4L, 13L, 12L), .Label = c(
    " abandoned vehicles ", " bus driver behaviour ", " claims accident ",
    " corporate development ", " faq ", " illegal parking ",
    " intra group (straddling case) ", " issuance/disputes of bookings vendors ",
    " mce ", " parf (transfer/split/encash) ", " private bus related matters ",
    " referrals ", " straddle cases (across groups) ", " traffic flow ",
    " update owner particulars ", " vehicle related matters ",
    " vl holders (complaint/investigation/appeal) ", " warrant of arrrest "),
    class = "factor"),
  description = structure(c(3L, 1L, 2L, 9L, 4L, 7L, 8L, 6L, 5L, 3L, 1L, 2L,
    9L, 4L, 7L, 8L, 6L, 5L, 7L, 8L), .Label = c(
    " street road leading & exit vehicles , buses (i think) 4 temples and, latest addition of 8b, 4 (!!) industrial estate.",
    "could kindly increase frequencies service 58. colleagues travelled avoid 58!!!\nthey rather take 62-87 instead of 3-58",
    "i saw bus no. 169a approaching bus stop. @ time, passengers had boarded , alighted bus.",
    "i want apologise , excuse summon because dont know can't park motorcycle @ double line when friday prayer ..please forgive me",
    "many prompt action. please note rectification rather short term it's replacing bulb without proper cover protect against elements.ps. same job done i.e. without installing cover few months back; , same problem happen again.",
    "placed in such manner cannot seen due background ahead; colours blend.there not room angle divert 1st lane 2nd lane. outer cone covers more 1st lane",
    "the vehicle gx3368k observed driving along pie towards changi on 28th november 2013, 3:48pm without functioning braking lights during day.",
    "the vehicle behaving suspiciously many sudden brakes - caused vehicles behind heavy \"jam brakes\" due no warnings @ (no brake lights).",
    "we have received feedback regarding lane of said address being blocked items.\nkindly investigate , keep in loop on actions taken while fire safety issues on case again."),
    class = "factor")),
  .Names = c("fnumber", "category", "subcategory", "description"),
  class = "data.frame", row.names = c(NA, -20L))

dim(step1)
names(step1)

library(tm)

m <- list(id = "fnumber", content = "description")
myReader <- readTabular(mapping = m)
txt <- Corpus(DataframeSource(step1), readerControl = list(reader = myReader))

summary(txt)
txt <- tm_map(txt, tolower)
txt <- tm_map(txt, removeNumbers)
txt <- tm_map(txt, removePunctuation)
txt <- tm_map(txt, stripWhitespace)
txt <- tm_map(txt, removeWords, stopwords("english"))
txt <- tm_map(txt, stemDocument)

tdm <- TermDocumentMatrix(txt,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
tdm
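As a quick sanity check on the resulting matrix, something like the following can be used (the frequency threshold and the slice indices are arbitrary and assume the matrix has at least that many terms and documents):

findFreqTerms(tdm, lowfreq = 10)   # terms appearing at least 10 times
inspect(tdm[1:10, 1:5])            # first few terms against the first few documents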

Update: I have got the frequently occurring keywords for the whole dataset:

library(reshape2)

tdm3 <- removeSparseTerms(tdm, 0.98)
tdm.dense <- as.matrix(tdm3)
tdm.dense <- melt(tdm.dense, value.name = "count")
tdm_final <- aggregate(tdm.dense$count, by = list(tdm.dense$Terms), sum)
colnames(tdm_final) <- c("words", "word_freq")
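To eyeball the result, the frequency table can be sorted (the cutoff of 10 rows is arbitrary):

head(tdm_final[order(-tdm_final$word_freq), ], 10)   # top 10 keywords overall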

I am stuck after this and am not sure how to get:

1. The relevant keywords (unigrams, bigrams and trigrams) for each category/subcategory, thereby generating a taxonomy list (keywords per category/subcategory).

2. When the next feedback email comes in, how to categorize it into a category and subcategory (there are 100+ categories) based on the keyword taxonomy list generated in the above step (a rough sketch of the kind of matching I have in mind follows this list).
3. Or, if my understanding and proposed approach above are not correct, please advise me on other possible options.
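To make point 2 concrete, this is only a rough sketch of the matching I imagine - the taxonomy data frame (with category and keyword columns), the example keywords and the function name are all made up, and it ignores subcategories:

classifyEmail <- function(text, taxonomy) {
  text <- tolower(text)
  # count how many of each category's keywords appear in the new email
  hits <- sapply(split(as.character(taxonomy$keyword), taxonomy$category),
                 function(kw) sum(sapply(kw, grepl, x = text, fixed = TRUE)))
  names(which.max(hits))   # return the category with the most keyword matches
}

# hypothetical usage with made-up keywords:
taxonomy <- data.frame(category = c(" enf - illegal parking ", " ptq - public transport quality "),
                       keyword  = c("parking", "bus"))
classifyEmail("the bus driver braked suddenly at the bus stop", taxonomy)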

I have gone through materials on the internet (I could only find examples classifying text into 2 classes, not more than that) and am not able to proceed further. I am new to text mining in R - excuse me if this is naive.

Any help or starting point would be great.

I'll give a brief answer here because the question is a little vague.

The code below creates a TDM for each category using 2-grams.

library(RWeka)
library(SnowballC)

# Create a function to produce an 'nvalue'-gram TDM from the underlying dataset.
# Notice this function accesses the step1 data.frame externally (it's not fed
# into the function). I'll leave that for you to fix up!
makeNgramFeature <- function(nvalue){
  tokenize <- function(x) NGramTokenizer(x, Weka_control(min = nvalue, max = nvalue))
  m <- list(id = "fnumber", content = "description")
  myReader <- readTabular(mapping = m)
  txt <- Corpus(DataframeSource(step1), readerControl = list(reader = myReader))
  summary(txt)
  txt <- tm_map(txt, tolower)
  txt <- tm_map(txt, removeNumbers)
  txt <- tm_map(txt, removePunctuation)
  txt <- tm_map(txt, stripWhitespace)
  txt <- tm_map(txt, removeWords, stopwords("english"))
  txt <- tm_map(txt, stemDocument)
  tdm <- TermDocumentMatrix(txt,
                            control = list(removePunctuation = TRUE,
                                           stopwords = TRUE,
                                           tokenize = tokenize))
  return(tdm)
}

# Build a list of TDMs, one per category. You could create a 'cascade' of these
# functions, or build a unique list of category/sub-category pairs to analyse.
all <- by(step1, INDICES = step1$category, FUN = function(x){ makeNgramFeature(2) })

The resulting list 'all' is a little ugly. You can run names(all) to look at the categories. I'm sure there is a cleaner way to solve this, but it gets you going on one of many correct paths...
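Once the function is changed to use the subset that by() passes in (rather than the global step1), something like this could pull the more frequent bigrams out of each category's TDM as a starting taxonomy list (the threshold of 2 is arbitrary):

top_bigrams <- lapply(all, function(tdm) findFreqTerms(tdm, lowfreq = 2))
top_bigrams[1:2]   # peek at the keywords for the first two categories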

