Python script that performs line matching over stale files generates inconsistent output
I created a Python script to parse mail (Exim) log files and execute pattern matching in order to get a top 100 list of the most-sent domains on my SMTP servers. However, every time I execute the script I get a different count. These are stale log files, and I cannot find a functional flaw in my code.
Example output:

    1:
    70353 gmail.com
    68337 hotmail.com
    53657 yahoo.com
    2:
    70020 gmail.com
    67741 hotmail.com
    54397 yahoo.com
    3:
    70191 gmail.com
    67917 hotmail.com
    54438 yahoo.com

Code:

    #!/usr/bin/env python

    import os
    import datetime
    import re
    from collections import defaultdict


    class DomainCounter(object):
        def __init__(self):
            self.base_path = '/opt/mail_log'
            self.tmp = []
            self.date = datetime.date.today() - datetime.timedelta(days=14)
            self.file_out = '/var/tmp/parsed_exim_files-' + str(self.date.strftime('%y%m%d')) + '.decompressed'

        def parse_log_files(self):
            sub_dir = os.listdir(self.base_path)
            for directory in sub_dir:
                if re.search('smtp\d+', directory):
                    fileinput = self.base_path + '/' + directory + '/maillog-' + str(self.date.strftime('%y%m%d')) + '.bz2'
                    if not os.path.isfile(self.file_out):
                        os.popen('touch ' + self.file_out)
                    proccessfiles = os.popen('/bin/bunzip2 -cd ' + fileinput + ' > ' + self.file_out)
                    accessfilehandle = open(self.file_out, 'r')
                    readfilehandle = accessfilehandle.readlines()
                    print "Processing %s." % fileinput
                    for line in readfilehandle:
                        # Exim arrival lines look like "<= sender ... for recipient(s)".
                        if '<=' in line and ' for ' in line and '<>' not in line:
                            distinctline = line.split(' for ')
                            recipientaddresses = distinctline[1].strip()
                            recipientaddresslist = recipientaddresses.strip().split(' ')
                            if len(recipientaddresslist) > 1:
                                for emailaddress in recipientaddresslist:
                                    # Since syslog messages are transmitted over UDP, some
                                    # messages are dropped and need to be filtered out.
                                    if '@' in emailaddress:
                                        (login, domein) = emailaddress.split("@")
                                        self.tmp.append(domein)
                                        continue
                            else:
                                try:
                                    (login, domein) = recipientaddresslist[0].split("@")
                                    self.tmp.append(domein)
                                except Exception as e:
                                    print e, '<<No valid email address found, skipping line>>'
                    accessfilehandle.close()
                    os.unlink(self.file_out)
            return self.tmp


    if __name__ == '__main__':
        domaincounter = DomainCounter()
        result = domaincounter.parse_log_files()
        domaincounts = defaultdict(int)
        top = 100
        for domain in result:
            domaincounts[domain] += 1

        sorteddict = dict(sorted(domaincounts.items(), key=lambda x: x[1], reverse=True)[:int(top)])
        for w in sorted(sorteddict, key=sorteddict.get, reverse=True):
            print '%-3s %s' % (sorteddict[w], w)

Answer:
proccessfiles = os.popen('/bin/bunzip2 -cd ' + fileinput + ' > ' + self.file_out)
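This line is non-blocking. It only starts the command, while the next few lines are already reading the file, so the file may still be partially written when it is read. This is a concurrency issue. Try to wait for the command to complete before reading the file.
Also see: "Python popen command. Wait until the command is finished", since os.popen is deprecated since Python 2.6 (depending on which version you are using).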
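As a minimal sketch of waiting for the command before reading its output, the subprocess module can run the decompressor and block until it exits. The helper name run_to_file is hypothetical and not part of the original script; the bunzip2 invocation in the comment simply mirrors the command line from the question.

```python
import subprocess


def run_to_file(argv, out_path):
    """Run argv, send its stdout to out_path, and block until it exits."""
    with open(out_path, 'wb') as out:
        # check_call() returns only after the child process has finished and
        # raises CalledProcessError if the exit status is non-zero.
        subprocess.check_call(argv, stdout=out)


# Hypothetical usage inside parse_log_files(), replacing the os.popen() call:
#   run_to_file(['/bin/bunzip2', '-cd', fileinput], self.file_out)
```

Because the call returns only once the child has exited, the output file is complete by the time it is opened for reading, which should make repeated runs over the same logs produce the same counts.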
Side note: the same happens with the line below. The file may or may not exist after executing the following line:
os.popen('touch ' + self.file_out)
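One possible way to avoid that race is to create the file from Python itself instead of spawning a shell. The sketch below assumes a hypothetical helper named touch; it is an illustration, not code from the original script.

```python
import os


def touch(path):
    """Create path if it is missing and update its timestamps, synchronously."""
    # Opening in append mode creates the file without truncating existing data.
    with open(path, 'a'):
        os.utime(path, None)
```

Since no child process is involved, the file is guaranteed to exist when touch() returns. If the decompressor's output is redirected to the same path anyway, the separate touch step is arguably not needed at all.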