count - Python script that performs line matching over stale files generates inconsistent output -

March 15, 2015

I created a Python script to parse mail (Exim) logfiles and run pattern matching in order to get a top 100 list of the domains most often sent to on my SMTP servers. However, every time I execute the script I get a different count. These are stale logfiles, and I cannot find a functional flaw in my code.

Example output:

    1:
    70353 gmail.com
    68337 hotmail.com
    53657 yahoo.com
    2:
    70020 gmail.com
    67741 hotmail.com
    54397 yahoo.com
    3:
    70191 gmail.com
    67917 hotmail.com
    54438 yahoo.com

Code:

    #!/usr/bin/env python

    import os
    import datetime
    import re
    from collections import defaultdict

    class DomainCounter(object):
        def __init__(self):
            self.base_path = '/opt/mail_log'
            self.tmp = []
            self.date = datetime.date.today() - datetime.timedelta(days=14)
            self.file_out = '/var/tmp/parsed_exim_files-' + str(self.date.strftime('%y%m%d')) + '.decompressed'

        def parse_log_files(self):
            sub_dir = os.listdir(self.base_path)
            for directory in sub_dir:
                if re.search(r'smtp\d+', directory):
                    fileinput = self.base_path + '/' + directory + '/maillog-' + str(self.date.strftime('%y%m%d')) + '.bz2'
                    if not os.path.isfile(self.file_out):
                        os.popen('touch ' + self.file_out)
                    proccessfiles = os.popen('/bin/bunzip2 -cd ' + fileinput + ' > ' + self.file_out)
                    accessfilehandle = open(self.file_out, 'r')
                    readfilehandle = accessfilehandle.readlines()
                    print "proccessing %s." % fileinput
                    for line in readfilehandle:
                        # Exim logs a received message as "<= sender ... for recipients".
                        if '<=' in line and ' for ' in line and '<>' not in line:
                            distinctline = line.split(' for ')
                            recipientaddresses = distinctline[1].strip()
                            recipientaddresslist = recipientaddresses.strip().split(' ')
                            if len(recipientaddresslist) > 1:
                                for emailaddress in recipientaddresslist:
                                    # Since syslog messages are transmitted over UDP, some messages
                                    # are dropped and need to be filtered out.
                                    if '@' in emailaddress:
                                        (login, domein) = emailaddress.split("@")
                                        self.tmp.append(domein)
                                        continue
                            else:
                                try:
                                    (login, domein) = recipientaddresslist[0].split("@")
                                    self.tmp.append(domein)
                                except Exception as e:
                                    print e, '<<no valid email address found, skipping line>>'

                    accessfilehandle.close()
                    os.unlink(self.file_out)
            return self.tmp


    if __name__ == '__main__':
        domaincounter = DomainCounter()
        result = domaincounter.parse_log_files()
        domaincounts = defaultdict(int)
        top = 100
        for domain in result:
            domaincounts[domain] += 1

        sorteddict = dict(sorted(domaincounts.items(), key=lambda x: x[1], reverse=True)[:int(top)])
        for w in sorted(sorteddict, key=sorteddict.get, reverse=True):
            print '%-3s %s' % (sorteddict[w], w)

proccessfiles = os.popen('/bin/bunzip2 -cd ' + fileinput + ' > ' + self.file_out)

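This line is non-blocking. It starts the command and returns immediately, so the next few lines are already reading the file while bunzip2 is still writing to it. This is a concurrency issue: how much of the file has been written by the time it is read differs from run to run, which explains the varying counts. Wait for the command to complete before reading the file.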

Also see the question "Python popen command. Wait until the command is finished", since os.popen has been deprecated since Python 2.6 in favor of the subprocess module (depending on which version you are using).
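As a minimal sketch of waiting for the command, assuming Python 3 and the subprocess module: the helper name `run_and_wait` is hypothetical and not part of the original script. `subprocess.check_call` returns only after the child process has exited, so the output file is complete before anything reads it.

```python
import subprocess


def run_and_wait(command, out_path):
    """Run command, write its stdout to out_path, and block until it exits.

    Hypothetical helper for illustration; command is a list of arguments,
    so no shell redirection is involved.
    """
    with open(out_path, 'wb') as out:
        # check_call waits for the child process to finish and raises
        # CalledProcessError if it exits with a non-zero status.
        subprocess.check_call(command, stdout=out)
    return out_path
```

In the script above, this could be called as `run_and_wait(['/bin/bunzip2', '-cd', fileinput], self.file_out)` before opening `self.file_out` for reading.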

Side note: the same happens with the line below. The file may or may not exist yet after executing the following line:

os.popen('touch ' + self.file_out)
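One way to sidestep both races is to avoid the shell altogether. The sketch below is an alternative technique, not what the original script does, and assumes Python 3.3 or later (for `bz2.open`) and UTF-8 compatible log content; the function names `touch` and `read_bz2_lines` are hypothetical. Creating the file from Python is synchronous, and reading the compressed log directly removes the need for a temporary decompressed file.

```python
import bz2
import os


def touch(path):
    """Create path if it is missing; the file exists when this returns."""
    # Append mode creates the file without truncating existing content.
    with open(path, 'a'):
        pass
    # Update the access and modification times, as the touch command does.
    os.utime(path, None)


def read_bz2_lines(path):
    """Yield text lines from a bzip2-compressed file, decompressing in-process."""
    # errors='replace' keeps the loop going if a log line is not valid UTF-8.
    with bz2.open(path, 'rt', encoding='utf-8', errors='replace') as handle:
        for line in handle:
            yield line.rstrip('\n')
```

With `read_bz2_lines(fileinput)` the loop over log lines would no longer depend on an external process having finished writing.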
