TY - JOUR
T1 - A sublinear algorithm for weakly approximating edit distance
AU - Batu, Tuğkan
AU - Ergün, Funda
AU - Kilian, Joe
AU - Magen, Avner
AU - Raskhodnikova, Sofya
AU - Rubinfeld, Ronitt
AU - Sami, Rahul
PY - 2003
Y1 - 2003
N2 - We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns "CLOSE" if their edit distance is O(n^α), and "FAR" if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time Õ(n^{max{α/2, 2α-1}}) for any fixed α < 1. Our algorithm thus provides a trade-off between accuracy and efficiency that is particularly useful when the input data is very large. We also show a lower bound of Ω(n^{α/2}) on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.
AB - We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns "CLOSE" if their edit distance is O(n^α), and "FAR" if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time Õ(n^{max{α/2, 2α-1}}) for any fixed α < 1. Our algorithm thus provides a trade-off between accuracy and efficiency that is particularly useful when the input data is very large. We also show a lower bound of Ω(n^{α/2}) on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.
UR - http://www.scopus.com/inward/record.url?scp=0037770095&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0037770095&partnerID=8YFLogxK
U2 - 10.1145/780542.780590
DO - 10.1145/780542.780590
M3 - Conference article
AN - SCOPUS:0037770095
SN - 0734-9025
SP - 316
EP - 324
JO - Conference Proceedings of the Annual ACM Symposium on Theory of Computing
JF - Conference Proceedings of the Annual ACM Symposium on Theory of Computing
T2 - 35th Annual ACM Symposium on Theory of Computing
Y2 - 9 June 2003 through 11 June 2003
ER -