Multi-dimensional substring selectivity estimation. H. V. Jagadish, Olga Kapitskaia, Raymond T. Ng and Divesh Srivastava. With the explosion of the Internet, LDAP directories and XML, there is an ever greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use multi-dimensional count-suffix trees as the basic framework for substring selectivity estimation. Given the enormous size of these trees for large databases, we develop a space and time efficient probabilistic algorithm to construct multi-dimensional pruned count-suffix trees directly. We then present two techniques to obtain good estimates for a given multi-dimensional substring matching query, using a pruned count-suffix tree. The first one, called GNO (for Greedy Non-Overlap), generalizes the greedy parsing suggested by Krishnan et al. [KVI96] for one-dimensional substring selectivity estimation. The second one, called MO (for Maximal Overlap), uses all maximal multi-dimensional substrings of the query for estimation; these multi-dimensional substrings help to capture the correlation that may exist between strings in the multiple dimensions. We demonstrate experimentally, using real data sets, that MO is substantially superior to GNO in the quality of the estimate.