python - Comparing a file containing a chromosomal region and another containing point coordinates
Could I please be advised on the following problem. I have CSV files that I would like to compare. The first contains the coordinates of specific points in the genome (e.g. chr3: 987654 – 987654). The other CSV files contain the coordinates of genomic regions (e.g. chr3: 135596 – 123456789). I would like to cross compare the first file with the other files to see if the point locations in the first file overlap with the regional coordinates in the other files, and to write this set of overlaps into a separate file. To make things simple for a start, I have drafted a simple piece of code to cross compare between 2 CSV files. Strangely, my code runs and prints the coordinates but does not write the point coordinates into a separate file. My first question is whether my approach (from this code) to comparing these 2 files is optimal, or is there a better way of doing this? Secondly, why is it not writing into a separate file?
import csv

region = open ('region_test1.csv', 'rt', newline = '')
reader_region = csv.reader (region, delimiter = ',')
dmc = open ('dmc_test.csv', 'rt', newline = '')
reader_dmc = csv.reader (dmc, delimiter = ',')
dmc_testpoint = open ('dmc_testpoint.csv', 'wt', newline ='')
writer_exon = csv.writer (dmc_testpoint, delimiter = ',')

for col in reader_region:
    chr_region = col[0]
    start_region = int(col[1])
    end_region = int(col [2])

    for col in reader_dmc:
        chr_point = col[0]
        start_point = int(col [1])
        end_point = int(col[2])

        if chr_region == chr_point and start_region <= start_point and end_region >= end_point:
            print (True, col)
        else:
            print (False, col)
        writer_exon.writerow(col)

region.close()
dmc.close()
A couple of things are wrong, not least of which is that you never check to see if your files opened successfully. The most glaring is that you never close your writer.
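A minimal sketch of the closing issue, assuming Python 3. Opening the output file in a with block closes it (and flushes any buffered rows) when the block ends, even if an exception is raised. In Python, open() does not return a failure value; it raises OSError, so that is what gets caught here. The function name write_rows is hypothetical and only for illustration.

```python
import csv


def write_rows(path, rows):
    """Write rows to a CSV file; return True on success, False if the file
    could not be opened or written."""
    try:
        # The with block closes the file on exit, which flushes buffered rows.
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerows(rows)
    except OSError as exc:
        # open() raises OSError (e.g. FileNotFoundError, PermissionError).
        print(f"could not write {path}: {exc}")
        return False
    return True
```

Without the with block, the equivalent fix in the question's code would be to call close() on the output file object once the loops have finished.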
That said, this is an incredibly non-optimal way to go about your program. File I/O is slow. You don't want to keep rereading everything in a factorial fashion. Given how your search requires all possible comparisons, you'll want to store at least one of the two files completely in memory, and potentially use a generator/iterator over the other if you don't wish to store both complete sets of data in memory.
Once you have both sets loaded, proceed with your intersection checks.
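The approach described above could be sketched roughly as follows: the regions file is read into memory once, and the points file is streamed row by row. This assumes both files have no header row and that the first three columns are chromosome, start, end; the function names load_regions and write_overlaps are hypothetical. It also sidesteps a side effect of nesting two csv.reader loops: a reader over a file is a one-pass iterator, so an inner reader yields nothing once it has been consumed for the first region.

```python
import csv
from collections import defaultdict


def load_regions(path):
    """Read (chromosome, start, end) rows into a dict keyed by chromosome."""
    regions = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:  # skip blank lines
                continue
            chrom, start, end = row[:3]
            regions[chrom].append((int(start), int(end)))
    return regions


def write_overlaps(points_path, regions, out_path):
    """Stream the points file once; write every point lying inside a region.

    Returns the number of rows written.
    """
    written = 0
    with open(points_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue
            chrom, start, end = row[0], int(row[1]), int(row[2])
            # Only regions on the same chromosome are compared.
            candidates = regions.get(chrom, ())
            if any(r_start <= start and end <= r_end
                   for r_start, r_end in candidates):
                writer.writerow(row)
                written += 1
    return written
```

Each point is still checked against every region on its chromosome, so this is a plain linear scan. For large region sets, sorting the intervals and using a binary search or an interval tree would likely be faster, but that is beyond this sketch.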
I'd suggest you take a look at http://docs.python.org/2/library/csv.html for how to use a csv reader, because what you are doing doesn't appear to make any sense: col[0], col[1] and col[2] aren't going to be what you think they are.
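One quick way to check what col[0], col[1] and col[2] actually hold is to print a parsed row. The sketch below uses made-up in-memory sample data instead of a real file. With comma-delimited input, csv.reader yields each row as a list of strings, one per field, so numeric fields still need int().

```python
import csv
import io

# Made-up sample data standing in for a regions file.
sample = io.StringIO("chr3,135596,123456789\nchr1,100,200\n")

rows = list(csv.reader(sample))
print(rows[0])  # ['chr3', '135596', '123456789']

# Fields arrive as strings; convert the coordinates before comparing them.
chrom, start, end = rows[0][0], int(rows[0][1]), int(rows[0][2])
print(chrom, start, end)  # chr3 135596 123456789
```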
These are just style and readability things, but: the names of your iteration variables seem a bit off. for col in ... should be for token in ..., because you are processing token by token, and not column by column, line by line, etc.
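Additionally, it would be nice to pick something consistent to stick with for your variables: sometimes you start them with an uppercase letter, sometimes you save the uppercase for after the '_'.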
That, and putting a ' ' between some objects and function names, and not others, is also odd. But again, these don't change the functionality of your code.