项目作者: searchhub

项目描述 :
Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts
高级语言: Java
项目地址: git://github.com/searchhub/preDict.git
创建时间: 2017-05-22T14:45:54Z
项目社区:https://github.com/searchhub/preDict

开源协议:GNU Lesser General Public License v3.0

下载


search|hub preDict (CE) Community Edition

search|hub enables blazing fast language independent spell correction at scale

Some basics about the spell correction problem

  • @searchhub.io/a-closer-look-into-the-spell-correction-problem-part-1-a6795bbf7112">A closer look into the spell correction problem — Part 1
  • @searchhub.io/a-closer-look-into-the-spell-correction-problem-part-2-introducing-predict-8993ecab7226">A closer look into the spell correction problem — Part 2 — introducing preDict
  • @searchhub.io/a-closer-look-into-the-spell-correction-problem-part-3-the-bells-and-whistles-19697a34011b">A closer look into the spell correction problem - Part 3 — the bells and whistles

Edit Distance

preDict is based on the spell correction fuzzy search library SymSpell with a few customizations and optimization:

  • The fundamental beauty of SymSpell is the Symmetric Delete spelling correction algorithm which reduces the complexity of edit candidate generation and dictionary lookup for a given edit distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

  • Additionally only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input phrase are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!

preDict customizations

Our main goal was to increase accuracy whilst maintaining the increabible speed by adding:

  • We replaced the Damerau-Levenshtein implementation with a weighted Damerau-Levenshtein implementation: where each operation (delete, insert, swap, replace) can have different edit weights.
  • We added some customizing “hooks” that are used to rerank the top-k results (candidate list). The results are then reordered based on a combined proximity :
    • added a phonetic proximity algorithm (Eudex)
    • added a prefix proximity algorithm (Jaro-Winkler)
    • added a fragment proximity algorithm (dice coefficient)
    • added keyboard-distance to get a dynamic replacement weight (since letters close to each other are more likely to be replaced)
    • do some query normalization before search

Benchmark Results

Run on Windows10 with a Intel(R) Core(TM) i7-6700 CPU (2.60GHz) with Java(TM) 1.8.0_121

  1. Benchmark Mode Cnt Score Error Units
  2. SearchHub with PreDict EE * thrpt 200 96,019 ± 1,007 ops/ms
  3. PreDict CE thrpt 200 82,116 ± 1,149 ops/ms
  4. Original SymSpell (Port) thrpt 200 68,105 ± 0,977 ops/ms
  5. Lucene (FST Based FuzzySuggester) thrpt 5 17,588 ± 0,690 ops/ms
  6. Lucene (Fuzzy Field-Search) thrpt 200 0,749 ± 0,017 ops/ms

Quality Results

Based on data we collected for a few months. The test data is attached to the comparison project and can be changed. Changes to the data will, of course, change the results, but the differences shouldn’t be that dramatical.

  1. Benchmark Accuracy TP TN Fail-Rate FN FP
  2. SearchHub with PreDict EE * 98,87% 7452 355 1,13% 52 37
  3. PreDict CE 90,04% 6937 320 9,96% 496 307
  4. Original SymSpell (Port) 88,87% 6842 321 11,13% 591 306
  5. Lucene (Fuzzy Field-Search) 88,87% 6803 360 11,13% 630 267
  6. Lucene (FST based FuzzySuggester) 78,96% 5883 481 21,04% 1550 146

*SearchHub with PreDict EE represents our commercial offering https://www.searchhub.io
This offering is a search platform independent, AI-powered search query intelligence API containing our concept of controlled precision reduction and PreDict EE (Enterprise Edition) which is capable of handling language agnostic term decomposition and disambiguation.