All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Solution:
Rolling hash and hashtable.
Since there are only 4 possible letters, we can build a rolling hash with base of 5 and calculate hashcode for each 10-letter-long subsequences.
If we define code('A'):1, code('C'):2; code('G'):3; code('T'):4, Then the hashcode for the first 10-letter-long sequence will be 1*5^0+1*5^1+1*5^2+...2*5^5+...2*5^9. The advantage of rolling hash method is that we don't need to recalculate the hash code from the beginning of the next 10-letter-long sequence, we can calculate the hash code hc[i] based on last hash code hc[i-1]. hc[i]=hc[i-1]/5+code(s.charAt(i))*5^9. It is guranteed that unique DNA sequence will have unique hashcode. By using a hashtable, we can see whether the sequence is occurred more than once.
public List<String> findRepeatedDnaSequences(String s) {
List<String> res=new ArrayList<String>();
if(s==null || s.length()<10) return res;
HashMap<Integer,Integer> map=new HashMap<Integer,Integer>();
int hashCode=0;
int base=1;
for(int i=0;i<10;i++){
hashCode+=getCode(s.charAt(i))*base;
base*=5;
}
map.put(hashCode,1);
base=base/5;
for(int i=10;i<s.length();i++){
hashCode=hashCode/5+getCode(s.charAt(i))*base;
if(!map.containsKey(hashCode)) map.put(hashCode,1);
else{
if(map.get(hashCode)==1) {
res.add(s.substring(i-9,i+1));
map.put(hashCode,2);
}
}
}
return res;
}
public int getCode(char c){
switch (c){
case 'A': return 1;
case 'C': return 2;
case 'G': return 3;
case 'T': return 4;
default: return 0;
}
}