Skip to content

【滑窗哈希】重复子串

repeated substring

所有 DNA 都由一系列缩写为 'A','C','G' 和 'T' 的核苷酸组成,例如:"ACGAATTCCG"。在研究 DNA 时,识别 DNA 中的重复序列有时会对研究非常有帮助。

编写一个函数来找出所有目标子串,目标子串的长度为 10,且在 DNA 字符串 s 中出现次数超过一次。

滑动窗口哈希

\(O(NL)\)时间&空间复杂度。

class Solution {
    const int L = 10;
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> ans;
        unordered_map<string, int> cnt;
        int n = s.length();
        for (int i = 0; i <= n - L; ++i) {
            string sub = s.substr(i, L);
            if (++cnt[sub] == 2) {
                ans.push_back(sub);
            }
        }
        return ans;
    }
};

位运算优化

注意到只有四个字母,因此可以用2bit表示。又注意到固定串长为10,每个子串可用20bit表示,小于一个int的长度。因此可以预处理每个滑动窗口为一个int。

\(O(N)\)时间复杂度。

class Solution {
    const int L = 10;
    unordered_map<char, int> bin = {{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> ans;
        int n = s.length();
        if (n <= L) {
            return ans;
        }
        int x = 0;
        for (int i = 0; i < L - 1; ++i) {
            x = (x << 2) | bin[s[i]];
        }
        unordered_map<int, int> cnt;
        for (int i = 0; i <= n - L; ++i) {
            x = ((x << 2) | bin[s[i + L - 1]]) & ((1 << (L * 2)) - 1);
            if (++cnt[x] == 2) {
                ans.push_back(s.substr(i, L));
            }
        }
        return ans;
    }
};

Polynomial rolling Hash + 前缀和

polynomial rolling hash:

\[ \displaylines{ hash(s) = \sum_{i = 0}^Ns[L-i-1]\cdot P^i \mod M } \]

where:

  • \(L\) is the length of \(s\).
  • \(P\) is a prime roughly larger than number of distinct characters used.
  • \(M\) is a large prime like \(1e9+7\).

properties:

  • since it is in the form of a prefix-sum, it is easy to calculate the hash of substrings.
\[ \displaylines{ hash(s[i:j+1]) = hash(s[:j]) - hash(s[:i-1])\cdot P^{j-i+1} } \]

implements:

// init
const static int P = 13; // a large prime, try-and-error
h[0] = 0;
p[0] = 1;
// calculate hash
for (int i = 1; i <= s.size(); i++) {
    h[i] = h[i-1] * P + s[i-1]
    p[i] = p[i-1] * P;    
}
// query hash of s[i:j+1]
int hash = h[j] - h[i-1] * p[j-i+1];

the answer (faster):

class Solution {
    const static int P = 5;
    const static int M = 1e9+7;
    map<char, int> m{{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> ans;
        if (s.size() <= 10) return ans;
        vector<long long> h(s.size() + 1, 0);
        vector<long long> p(s.size() + 1, 1);       
        for (int i = 1; i <= s.size(); i++) {
            h[i] = ((h[i-1] * P) % M + m[s[i-1]]) % M;
            p[i] = (p[i-1] * P) % M;
        }
        map<long long, int> cnt;
        for (int i = 1; i <= s.size() - 9; i++) {
            int j = i + 9;
            long long hash = h[j] - (h[i-1] * p[j-i+1]) % M;
            if (++cnt[hash] == 2) {
                ans.push_back(s.substr(i-1, 10));
            }
        }
        return ans;
    }
};

然而并不推荐,P要试试才知道能不能过?7就过不了...感觉很随机(