Fuzzy String Matching using Dice's Coefficient
By Frank Cox
(Janaury
2, 2013)
Here is the best algorithm that I'm current aware of for fuzzy string matching, i.e. finding approximate matches between two strings. Most of the time, all you need to know is whether String A matches String B. When this is the case, then strcmp is what you need (part of the C standard library). However, it is sometimes necessary to have a method for determining how similar two strings are. For example, consider a typical classified ad: "For sale: 1972 Ford Truck, blue in colour. Call Dave". We also have a second ad, "For Sale: 72 truck, blue Ford. Ask for Dave." You can I know that Dave is probably selling the same truck in both ads, but how can the computer determine that?
This algorithm tells us that there is a 72% chance that Dave is selling the same truck, even though the length of the strings and the words used are somewhat different.
This is a C function that implements Dice's Coefficient applied to fuzzy string matching, as described by Simon White in his article How To Strike A Match.
Mr. White's article contains a detailed description of the algorithm, but a brief description of how it works is that we break the strings into pairs of adjacent characters, ignoring any pairs that contain a space. Therefore, "For Sale: 72 truck" becomes "FO-OR-SA-AL-LE-E:-72-TR-RU-UC-CK". We then count the number of character pairs that match between the two strings, multiply that number by two, and divide the result by the total number of character pairs in the two strings.
/* dice-match.c Fuzzy String Matching using Dice's Coefficient Based on the algorithm as described by Simon White at http://www.catalysoft.com/articles/strikeamatch.html Copyright (c) 2013, Frank Cox All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY FRANK COX ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL FRANK COX BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include <stdio.h> #include <stdlib.h> #include <ctype.h> int dice_match(char *string1, char *string2); int main(void) { printf("\nMatch=%i%%\n",dice_match("For Sale: 72 truck, blue Ford. Ask for Dave.", "For sale: 1972 Ford Truck, blue in colour. Call Dave.")); return 0; } int dice_match(char *string1, char *string2) { struct pairs { char chars[2]; struct pairs *next; }; struct pairs *string1head = NULL; struct pairs *string1current; struct pairs *string1previous; struct pairs *string2head = NULL; struct pairs *string2current; struct pairs *string2previous; int counter=0, string1pairs=0, string2pairs=0; if (string1[0]=='\0' || string2[0]=='\0') return 0; while (string1[counter+1] != '\0') { if (string1[counter] != ' ' && string1[counter+1] != ' ') { string1current = (struct pairs *) malloc(sizeof(struct pairs)); if (string1current==NULL) exit(EXIT_FAILURE); if (string1head == NULL) string1head = string1current; else string1previous->next = string1current; string1current->next = NULL; string1current->chars[0]=toupper(string1[counter]); string1current->chars[1]=toupper(string1[counter+1]); string1previous=string1current; string1pairs++; } counter++; } counter=0; while (string2[counter+1] != '\0') { if (string2[counter] != ' ' && string2[counter+1] != ' ') { string2current = (struct pairs *) malloc(sizeof(struct pairs)); if (string2current==NULL) exit(EXIT_FAILURE); if (string2head == NULL) string2head = string2current; else string2previous->next = string2current; string2current->next = NULL; string2current->chars[0]=toupper(string2[counter]); string2current->chars[1]=toupper(string2[counter+1]); string2previous=string2current; string2pairs++; } counter++; } counter=0; string1current=string1head; while (string1current != NULL) { string2current=string2head; while (string2current != NULL) { if (string2current->chars[0]== string1current->chars[0] && string2current->chars[1]==string1current->chars[1]) { string2current->chars[0]='\0'; // Otherwise, 'GGGGG' would score a perfect match against 'GG'. counter+=2; break; } string2current=string2current->next; } string1current=string1current->next; } string1current = string1head; while (string1current != NULL) { free(string1current); string1current = string1current->next; } string2current = string2head; while (string2current != NULL) { free(string2current); string2current = string2current->next; } return (((double)counter/(string1pairs+string2pairs))*100); }
Other articles written by Frank Cox can be found here.
Frank Cox owns and operates the Melville Theatre in Melville, Saskatchewan, Canada, and has been playing with computers for over 30 years.
This work is licensed under a Creative
Commons Attribution-Share Alike 2.5 Canada License.