Tuesday, April 14, 2015

187. Repeated DNA Sequences Leetcode

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].
Solution:
Rolling hash and hashtable. 
Since there are only 4 possible letters, we can build a rolling hash with base of 5 and calculate hashcode for each 10-letter-long subsequences.
If we define code('A'):1, code('C'):2; code('G'):3; code('T'):4, Then the hashcode for the first 10-letter-long sequence will be 1*5^0+1*5^1+1*5^2+...2*5^5+...2*5^9. The advantage of rolling hash method is that we don't need to recalculate the hash code from the beginning of the next 10-letter-long sequence, we can calculate the hash code hc[i] based on last hash code hc[i-1]. hc[i]=hc[i-1]/5+code(s.charAt(i))*5^9. It is guranteed that unique DNA sequence will have unique hashcode. By using a hashtable, we can see whether the sequence is occurred more than once.

 public List<String> findRepeatedDnaSequences(String s) {  
     List<String> res=new ArrayList<String>();  
     if(s==null || s.length()<10) return res;  
     HashMap<Integer,Integer> map=new HashMap<Integer,Integer>();  
     int hashCode=0;  
     int base=1;  
     for(int i=0;i<10;i++){  
       hashCode+=getCode(s.charAt(i))*base;  
       base*=5;  
     }  
    map.put(hashCode,1);  
    base=base/5;  
    for(int i=10;i<s.length();i++){  
      hashCode=hashCode/5+getCode(s.charAt(i))*base;  
      if(!map.containsKey(hashCode)) map.put(hashCode,1);  
      else{  
        if(map.get(hashCode)==1) {  
          res.add(s.substring(i-9,i+1));  
          map.put(hashCode,2);  
        }  
      }  
    }  
    return res;  
   }  
   public int getCode(char c){  
     switch (c){  
     case 'A': return 1;  
     case 'C': return 2;  
     case 'G': return 3;  
     case 'T': return 4;  
     default: return 0;  
     }  
   }  

No comments:

Post a Comment