Accents in text searches

A good search engine must be flexible since words searched by a user may be present in database, but in a different shape. The most obvious example is the presence of uppercase letters in the text: if a sentence begins with "Tomorrow, ...", this word must be recognized by the search engine even if the user has typed "tomorrow" or "TOMORROW".

Many written languages use special characters (accentuated or other) that cause the same issue: the same word can be written in different shapes depending on the presence or absence of those characters (especially because of uppercase letters, which are often unaccentuated). The french word "sérénité", if it appears in a title, may be written in the shape "SERENITE". The engine must not only ignore the case but also the accents.

The special characters issue becomes even more complicated when you have to find again the matching words in the text once out of the database. I had to implement this function in order to highlight those words in the text results found by our search engine. So I had to find a way to make a case- and special-character-insensitive search in a PHP string and to obtain the exact position of the found words, considering multi-byte characters.

While searching for a satisfying method to make that kind of search, I realized that many different solutions are offered online, and that most of those are incomplete or even erroneous. I present in this article the ones I have adopted, which work very fine.

In MySQL database

To make a text search in a MySQL database, several solutions are available, the simplest one being the use of the LIKE operator : link1, link2. For more advanced behaviors, the operators REGEXP (a.k.a. RLIKE) and MATCH() ... AGAINST may be necessary, especially if regular expressions are involved.

Since we don't need to manage regular expressions at the moment, I had a choice. I didn't choose MATCH() ... AGAINST because it implies creating FULLTEXT indexes on the text columns in the database and an overhead for the MySQL engine on every search. Furthermore, this operator has some behaviors that are useless to us and we cannot turn off. The REGEXP operator is out of the question since it does not manage multi-byte characters, and I found no solution to this problem online.

 So I simply use the LIKE operator, which is implemented to be case- and accent-insensitive. To make it work reliably with accents, you just have to specify it which character collation to use. The 'utf8_general_ci' collation is entirely satisfying to us (we just have French to manage for now) but others are more adapted to specific languages.

The only Unicode characters the LIKE operator does not manage correctly are those corresponding to two basic letters, for instance 'Œ' that corresponds to 'OE'. To get around this problem, I de-accentuate the search patterns, then add to the pattern list their every possible variation according to those characters (lowercase to limit the list size, MySQL recognizes flawlessly that 'Œ' and 'œ' are equivalent). As an example, if the pattern list contains 'Caesar', the treated list will contain 'caesar', 'cæsar' and 'cǽsar'.

 I wrote a function that generates the condition array to use in a find() to search for a pattern list in a database field. Since our text blocks are generated by TinyMCE, I added a pattern treatment: the TinyMCE special characters are replaced with their respective code to correspond to the stored texts.

/**
 * @param string $field: database field to search
 * @param array $patterns: patterns as strings
 * @param boolean $markUp: true if searched field contains markup (then patterns must be sanitized)
 * @return array: find condition array to match given field against given patterns (to be put in an 'OR' find condition key)
 */
public function matchFieldPatterns($field, $patterns, $markup = false) {
	$markupReplacements = array(
		'&' => '&',
		'<' => '&LT;',
		'>' => '&GT;',
	);
	$managedCollations = array('utf8_general_ci');

	$conditions = array();
	foreach ($patterns as $pattern) {
		if ($markup) { // Some characters have been sanitized for database storage
			// Sanitize pattern as well
			$pattern = str_replace(array_keys($markupReplacements), array_values($markupReplacements), $pattern);
		}
		// Add conditions to match this field against this pattern for each managed collation
		foreach ($managedCollations as $collation) {
			$conditions[] = sprintf("%s LIKE '%%%s%%' COLLATE %s", $field, $pattern, $collation);
		}
	}

	return $conditions;
}

Called with parameters $field='AModel.a_field', $patterns=array('Foo', '<bar>') and $markUp=true, this function returns this condition array (to be associated to an 'OR' key in find 'conditions' option) :

array(
	"AModel.a_field LIKE 'Foo' COLLATE utf8_general_ci",
	"AModel.a_field LIKE '&LT;bar&GT;' COLLATE utf8_general_ci"
)

In a PHP string

Once the texts have been found in the database, it may be useful to know the positions of the found words, typically to highlight those in the results display. To do that you must make a second search in the PHP strings from the database (possibly treated, to remove hypertext tags for instance), with the same insensibility to case and accents.

 The simplest way I found to do this is to transform the text and the patterns: replace all accentuated letters with their corresponding basic letter(s) and make all letters uppercase, thereby all equivalent words end up in the same shape and there is no more ambiguity. The difficulty lies in the fact that the positions will be determined in the transformed text but used in the original one, so multi-byte characters must be handled carefully.

To remove accents, I opted for a correspondance array associating each group of equivalent accentuated letters (as a regular expression) to the corresponding basic letter. It just so happens that Cake's Inflector class contains such an array, but some modifications are to be performed because some special characters are associated to several basic letters, apparently for consistencity with MySQL behavior. For instance the letter 'ü' is associated to 'ue' in that array because in some widely used MySQL collations, ('ü' LIKE 'ue') is true whereas ('ü' LIKE 'u') is false.

 I thus created an Accents class extending the Inflector class in order to gain access to this array (it is declared protected) and to correct it before using it (these corrections depend on managed languages). Once the accents stripped from text and patterns thanks to the static function Accents::strip(), we can use the PHP functions that handle multi-byte characters (mb_strpos()mb_substr(), ...) to find the positions of the words to highlight. There still remains the issue of the few unicode characters that must correspond to two basic letters and thus are not corrected in the correspondance array, like 'æ'. To manage that, just decrement the position of each match (computed in the transformed text) by 1 for each occurence of such a character before that match in the PHP string, and you will obtain the correct positions in the original text.

/**
 * Utility wrapper for customizing Inflector functionalities
 */
class Accents extends Inflector {

	/**
	 * Replaces accentuated or special letters from given string with corresponding basic letter
	 * @param string $string
	 * @return string 
	 */
	public static function strip($string) {
		$map = self::correctMap();
		return preg_replace(array_keys($map), array_values($map), $string);
	}

	/**
	 * Corrects some de-accentuations in Inflector's transliteration map (i.e. 'ü' => 'ue')
	 * @return array 
	 */
	public static function correctMap() {
		$map = self::$_transliteration;
		
		// Override incorrect transliterations
		$overrides = array(
			'/æ|ǽ/' => 'ae',
			'/œ/' => 'oe',
			'/ü/' => 'u',
			'/Ä/' => 'A',
			'/Ü/' => 'U',
			'/Ö/' => 'O',
		);
		foreach ($overrides as $accent => $basic) {
			$map[$accent] = $basic;
		}
		// Remove remaining incorrect transliterations
		$deletions = array('/ä|æ|ǽ/', '/ö|œ/');
		foreach ($deletions as $key) {
			unset($map[$key]);
		}
		
		return $map;
	}
}

Our implementation

I gathered these solutions in a plug-in containing my class Accents and a Behavior that makes a new find type available to models: 'search'. If you never used a find type other than the core ones ('first', 'all', ...), I suggest consulting the CookBook documentation about custom find types.

This plug-in is freely available on GitHub, and contains a complete implementation of the concepts presented above. This can be a good source of examples if you intend to create your own implementation without starting from scratch.

In order to test the plugin, you can use the search field on top of the page.

If you use this plug-in directly, do not hesitate to share with us your remarks and suggestions !

This page belongs to the following categories: news , CakePHP , Code , Plugin , Accents.

Comments

Add a comment

5105
Petits fours baked