How long are your strings?
If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:
- For each word that you want to store in the table, instead take every possible suffix of that word. In other words, you keep stripping the first character until nothing is left. For example, the word
value
gives:
- Store each of these suffixes in the database.
- You can now search for substrings using
LIKE 'alu%'
(which will find 'alu' as part of 'value').
By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.
Storage Cost
The number of characters required to store a word becomes word_len*word_len / 2
, i.e. quadratic in the word length, on a per-word basis. Here is the factor of increase for various word sizes:
- 3-letter word:
(3*3/2) / 3 = 1.5
- 5-letter word:
(5*5/2) / 5 = 2.5
- 7-letter word:
(7*7/2) / 7 = 3.5
- 12-letter word:
(12*12/2) / 12 = 6
The number of rows required to store a word increases from 1 to word_len
. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.
Considerations
There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter
as one word or two?
The trade-off is as follows:
- Anything that is broken up cannot be found as a single element. If we store
five
and letter
separately, searching for five-letter
or fiveletter
will fail.
- Anything that is not broken up will take more storage space. Remember, the storage
requirement increases quadratically in the word length.
For convenience, you might want to remove the hyphen and store fiveletter
. The word can now be found by searching five
, letter
, and fiveletter
. (If you strip hyphens from any search query as well, users can still successfully find five-letter
.)
Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…