Class
Article
College
College of Engineering
Department
English Department
Faculty Mentor
Jack Elliott
Presentation Type
Poster Presentation
Abstract
Fuzzy name matching methods are helpful because names are often misspelled, sound alike, or occur in similar circumstances. A fuzzy name matching process assigns two names a similarity score between 0 and 1 using name-matching techniques (with 1 meaning identical and 0 meaning entirely dissimilar). Fuzzy name matching is used to optimize customer databases, create more accurate medical records, and support Social Network Analysis (SNA). To observe the links between students’ social practices and academic performance, our research group is performing a large-scale (1000+ student) SNA through open-response name-generator surveys. As a result, the survey responses contain ambiguous names, many of which are variant references to the same student. Our efforts to disambiguate this large student social network inspired our work in name-matching techniques. In a pilot study, we disambiguated the large student network manually. After completing the pilot study, we delineated our strategy, dividing it according to the kind of ambiguity involved, such as misspellings and names that are missing a last name. The resulting strategy has four stages, each of which outlines a process for resolving names with one particular kind of ambiguity. Each of these stages showed potential for automation. Our current automation is a hybrid method that uses text-based and language-based name-similarity algorithms, in addition to clustering techniques, to create a fuzzy-name score between key names and ambiguous names in the interaction data. Levenshtein distance specifies how similar two names are textually (in spelling). Further, the Metaphone II algorithm produces a key value that encodes how each name is pronounced. Accordingly, if two names have identical Metaphone key values, they sound similar.
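As an illustration of the text-based part of the score, the following is a minimal sketch of Levenshtein distance turned into a similarity between 0 and 1. The function names and the choice to normalize by the longer name's length (after lowercasing and trimming) are assumptions made for this example; the abstract does not state how the distance is scaled.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def text_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical names, 0.0 when every
    character position differs."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty names are trivially identical
    return 1.0 - levenshtein(a, b) / longest
```

Under this normalization, "Jon Smith" and "John Smith" differ by a single inserted letter over ten characters, giving a similarity of 0.9.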
In cases where ambiguous names cannot be matched with key names using simple name-similarity algorithms, we use agglomerative hierarchical clustering (consolidating the names that have the closest network proximity until the entire network is combined) to determine whether two names-of-interest have similar peer networks. If these peer networks match, we can be confident in consolidating the two names-of-interest. We combine the values from Levenshtein distance, the Metaphone II algorithm, and agglomerative hierarchical clustering to produce a comprehensive fuzzy-name score between an ambiguous name and a resolved name. Our completed program will be instrumental for creating accurate data sets and studying more holistic social networks.
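The peer-network step and the final combination might be sketched as follows. This is not the authors' program: measuring network proximity by Jaccard overlap of peer sets, using single linkage, the example names, and the weights in `fuzzy_name_score` are all assumptions chosen for illustration. The phonetic comparison is passed in as a boolean, on the assumption that the Metaphone keys have been computed elsewhere.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of peers two names have in common (assumed proximity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(peers: dict) -> list:
    """Single-linkage agglomerative clustering over peer sets.

    Repeatedly merges the two clusters containing the most similar pair
    of names until one cluster remains. Returns the merge history as
    (similarity, merged_cluster) tuples, closest pairs first.
    """
    clusters = [frozenset([name]) for name in peers]
    merges = []
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for left, right in combinations(clusters, 2):
            sim = max(jaccard(peers[x], peers[y]) for x in left for y in right)
            if sim > best_sim:
                best_sim, best_pair = sim, (left, right)
        left, right = best_pair
        clusters = [c for c in clusters if c not in best_pair]
        clusters.append(left | right)
        merges.append((best_sim, left | right))
    return merges


def fuzzy_name_score(text_sim: float, same_phonetic_key: bool, peer_sim: float,
                     weights=(0.5, 0.25, 0.25)) -> float:
    """Weighted combination of the three signals; the weights are hypothetical."""
    w_text, w_sound, w_peer = weights
    sound = 1.0 if same_phonetic_key else 0.0
    return w_text * text_sim + w_sound * sound + w_peer * peer_sim


# Hypothetical survey data: each name maps to the peers named alongside it.
peers = {
    "Jon Smith": {"Ana Lee", "Raj Patel", "Mia Chen"},
    "John Smith": {"Ana Lee", "Raj Patel", "Mia Chen"},
    "Mary Jones": {"Tom Baker", "Eve Stone"},
}
merges = agglomerate(peers)
```

In this toy data the first merge joins "Jon Smith" and "John Smith", whose peer sets are identical, which is the kind of evidence the abstract describes as grounds for consolidating two names. Because the weights sum to 1 and each input lies between 0 and 1, the combined score stays in the 0 to 1 range.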
Location
Logan, UT
Start Date
4-7-2022 12:00 AM
Name-Matching Techniques for Disambiguating Interaction Data