I have a file A that contains sequence IDs together with binding site location information. I would like to extract only the location information, without the A, T, C, G sequence itself. In each pair of lines, the shorter sequence sits above the longer sequence and its position shows where it binds. The number on the left of the longer sequence (for example 451 in File A) is the location value for that line. I would like to get the start site of the shorter sequence on the longer sequence (453 in the first example), take the length of the shorter sequence (21), and add it to the start site to get the end site (453 + 21 = 474). Can anyone help me?
File A.txt
chr1:152806601-152807450
TTCAGCACCATGGACAGCGCC
451 GGCTTCAGCACCACGGACAGCGCCCCACCCGCGGCCCTCCCCCCGGCGGCGCGCTCCAGCCGGTGTAGGCGAGGC
TTCAGCACCATGGACAGCGCC
751 AGAGCCCCCCGGGACTGCAGAGAGCACCTGGGAGGCTGGACTGGGAACGAGACATACTCGAAGGAGTAAGTGAAG
chr10:125364276-125364825
TTCAGCACCATGGACAGCGCC
301 CAGTAATGTGGGGTTGTGGTCAGCACCATGGACAGCTCCCCTGTTGCTTCATATTGAGGAATAGGAAAGCGCCGC
TTCAGCACCATGGACAGCGCC
376 TATCTCCGGATCCTGGCTAGCTCCAGCCACTGCAGGTAACTGTCTTGAATGGGCTTAGAAACATGGTGATGTCTG
Desired output
chr1:152806601-152807450 453 474
chr1:152806601-152807450 757 778
chr10:125364276-125364825 318 339
chr10:125364276-125364825 378 399
Example code
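import re

with open("A.txt", "r") as f:
    lines = f.readlines()

# Fill in the three patterns below before running.
label_ptrn = re.compile("")  # insert regular expression for the sequence ID line (named group "label")
line_ptrn = re.compile("")   # insert regular expression for the numbered line (named groups "start_value" and "sequence")
inner_ptrn = re.compile("")  # insert regular expression for the binding site inside the sequence

all_matches = []
label = None
for line in lines:
    m = label_ptrn.match(line)
    if m:
        label = m.groupdict().get("label")
        continue
    m = line_ptrn.match(line)
    if m:
        start = m.groupdict().get("start_value")
        sequence = m.groupdict().get("sequence")
        mi = inner_ptrn.search(sequence)
        if not mi:
            continue
        span = mi.span()
        # span offsets are relative to the sequence, so shift them by the line's start value
        all_matches.append((label, int(start) + span[0], int(start) + span[1]))

# Text mode: the rows are str, so binary mode ("w+b") would raise TypeError in Python 3
with open("A_output.bed", "w") as f:
    for m in all_matches:
        f.write('%s\t%i\t%i\n' % m)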
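Since the shorter sequence does not always match the longer one exactly (there are mismatches in the example), searching for it with a regular expression may find nothing. A minimal sketch of an alternative is below. It reads the position from the indentation of the shorter line instead. It assumes that A.txt keeps the column alignment with spaces, so that the shorter sequence starts directly above the base it binds to, and that the reported start is one less than the coordinate of that base, which is what the desired output appears to imply. The function name extract_sites and the three patterns are hypothetical, and the indentation in the sample is reconstructed by hand to reproduce the first desired row.

```python
import re

# Assumed line shapes; adjust if the real file differs.
LABEL_RE = re.compile(r"^(?P<label>chr\w+:\d+-\d+)\s*$")
LONG_RE = re.compile(r"^(?P<start_value>\d+)\s+(?P<sequence>[ACGTN]+)\s*$", re.IGNORECASE)
SHORT_RE = re.compile(r"^(?P<indent>\s*)(?P<sequence>[ACGTN]+)\s*$", re.IGNORECASE)


def extract_sites(lines):
    """Return (label, start, end) tuples, one per short/long line pair."""
    sites = []
    label = None
    pending = None  # (indent width, length) of the most recent short sequence
    for raw in lines:
        line = raw.rstrip("\r\n")
        m = LABEL_RE.match(line)
        if m:
            label = m.group("label")
            pending = None
            continue
        m = LONG_RE.match(line)
        if m and pending is not None:
            indent, length = pending
            # Column where the long sequence begins, after the number and the gap.
            seq_col = m.start("sequence")
            offset = indent - seq_col
            # Assumption: the "- 1" converts to the start convention of the desired output.
            start = int(m.group("start_value")) + offset - 1
            sites.append((label, start, start + length))
            pending = None
            continue
        m = SHORT_RE.match(line)
        if m:
            pending = (len(m.group("indent")), len(m.group("sequence")))
    return sites


# Hand-made sample: the short sequence is indented 7 spaces, i.e. 3 columns
# past the start of the long sequence (hypothetical alignment).
sample = [
    "chr1:152806601-152807450\n",
    "       TTCAGCACCATGGACAGCGCC\n",
    "451 GGCTTCAGCACCACGGACAGCGCCCCACCC\n",
]
for label, start, end in extract_sites(sample):
    print(f"{label}\t{start}\t{end}")  # chr1:152806601-152807450  453  474 (tab-separated)
```

To run it on the real file, pass the opened file object instead of the sample list, for example extract_sites(open("A.txt")). If the alignment in the file uses tabs rather than spaces, the column arithmetic would need adjusting.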