Project author: dedupeio

Project description: Examples for using the dedupe library
Primary language: Python
Repository: git://github.com/dedupeio/dedupe-examples.git
Created: 2014-04-02T13:04:28Z
Project page: https://github.com/dedupeio/dedupe-examples

License: MIT License


Dedupe Examples

Example scripts for dedupe, a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.

The dedupe library is part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data. For more details, see the differences between Dedupe.io and the dedupe library.

To get these examples:

  1. git clone https://github.com/dedupeio/dedupe-examples.git
  2. cd dedupe-examples

Or download this repository as a zip archive, then:

  1. cd /path/to/downloaded/file
  2. unzip master.zip
  3. cd dedupe-examples

Setup

We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.

Once you have virtualenvwrapper set up,

  1. mkvirtualenv dedupe-examples
  2. pip install -r requirements.txt

Afterwards, whenever you want to work on dedupe-examples,

  1. workon dedupe-examples

CSV example - early childhood locations

This example works with a list of early childhood education sites in Chicago from 10 different sources.

  1. cd csv_example
  2. pip install unidecode
  3. python csv_example.py

(use ‘y’, ‘n’ and ‘u’ keys to flag duplicates for active learning, ‘f’ when you are finished)

To see how you might use dedupe with smallish data, see the annotated source code for csv_example.py.
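
For orientation, the sketch below shows the kind of data preparation a script like this needs before any matching can happen: read a CSV, normalize each value, and build a dictionary of records keyed by ID. It uses only the standard library. The `Id` column and the helper names `pre_process` and `read_data` are assumptions made for illustration; this is not a copy of csv_example.py.

```python
import csv
import io
import re


def pre_process(value):
    """Normalize one CSV cell: collapse whitespace, lowercase, empty -> None."""
    value = re.sub(r"\s+", " ", value or "").strip().lower()
    return value or None


def read_data(csv_file, id_column="Id"):
    """Return {record_id: {field: cleaned_value}} from an open CSV file."""
    records = {}
    for row in csv.DictReader(csv_file):
        record_id = row[id_column]
        records[record_id] = {
            field: pre_process(value)
            for field, value in row.items()
            if field != id_column
        }
    return records


# Two rows modeled on the labeling example in the Training section.
# The "Id" column is hypothetical.
SAMPLE = """Id,Site name,Address,Zip,Phone
0,Ada S. McKinley  St. Thomas CDC,3801 S. Wabash,,2850617
1,Ada S. McKinley Community Services - McKinley - St. Thomas,3801 S Wabash Ave,,2850617
"""

data = read_data(io.StringIO(SAMPLE))
for record_id, record in data.items():
    print(record_id, record)
```

Turning empty cells into `None` keeps missing values distinct from values that happen to be blank strings, which matters once records are compared field by field.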

Patent example - patent holders

This example works with Dutch inventors from the PATSTAT international patent data file.

  1. cd patent_example
  2. pip install unidecode
  3. python patent_example.py

(use ‘y’, ‘n’ and ‘u’ keys to flag duplicates for active learning, ‘f’ when you are finished)

Record Linkage example - electronics products

This example links two spreadsheets of electronics products by matching up the entries that refer to the same product. Neither dataset contains duplicates within itself.

  1. cd record_linkage_example
  2. python record_linkage_example.py

To see how you might use dedupe for linking datasets, see the annotated source code for record_linkage_example.py.
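
The defining constraint in record linkage is that comparisons happen only across the two datasets, never within one. The sketch below illustrates that constraint with a naive token-overlap (Jaccard) score and a fixed threshold. It is a stand-in for illustration only, not how dedupe scores pairs: dedupe learns its rules from labeled examples. The product titles, the functions `jaccard` and `link_records`, and the 0.5 threshold are all made up for this example.

```python
def tokens(text):
    """Split a product title into a set of lowercase word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def link_records(left, right, threshold=0.5):
    """Greedily pair left and right records, best-scoring pairs first.

    Only cross-dataset pairs are compared, and each record is linked at
    most once, reflecting the assumption that neither dataset contains
    duplicates of its own.
    """
    candidates = sorted(
        (
            (jaccard(l_title, r_title), l_id, r_id)
            for l_id, l_title in left.items()
            for r_id, r_title in right.items()
        ),
        reverse=True,
    )
    links, used_left, used_right = [], set(), set()
    for score, l_id, r_id in candidates:
        if score < threshold:
            break  # candidates are sorted, so nothing better remains
        if l_id in used_left or r_id in used_right:
            continue
        links.append((l_id, r_id, round(score, 2)))
        used_left.add(l_id)
        used_right.add(r_id)
    return links


# Hypothetical product listings from two separate spreadsheets.
store_a = {
    "a1": "Sony Bravia 40 inch LCD TV",
    "a2": "Canon PowerShot SD1200 camera",
}
store_b = {
    "b1": "Canon PowerShot SD1200 digital camera",
    "b2": "Sony Bravia LCD TV 40 inch black",
}

links = link_records(store_a, store_b)
print(links)
```

Comparing every left record with every right record is fine for a toy example but grows quadratically, which is one reason a purpose-built library is preferable for real data.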

Gazetteer example - electronics products

This example links two spreadsheets of electronics products, matching up the corresponding entries using the Gazetteer class.

  1. cd gazetteer_example
  2. python gazetteer_example.py

MySQL example - IL campaign contributions

See mysql_example/README.md for details

To see how you might use dedupe with biggish data, see the annotated source code for mysql_example.

PostgreSQL big dedupe example - PostgreSQL example on a large dataset

See pgsql_big_dedupe_example/README.md for details

This is the same example as the MySQL IL campaign contributions dataset above, but ported to run on PostgreSQL.

Training

The secret sauce of dedupe is human input. In order to figure out the best rules to deduplicate a set of data, you must give it a set of labeled examples to learn from.

The more labeled examples you give it, the better the deduplication results will be. At minimum, you should try to provide 10 positive matches and 10 negative matches.

The results of your training will be saved in a JSON file for future runs of dedupe.
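
As a rough picture of what persisting training involves, the sketch below writes labeled pairs to a JSON file and reads them back, so a later run can pick up where the last one stopped. The `match`/`distinct` layout, the file name `training.json`, and the helper names are assumptions for illustration. The file the dedupe library actually writes may be structured differently, so use the library's own functions for real training files.

```python
import json
import os
import tempfile


def save_training(path, labeled):
    """Write labeled pairs ({'match': [...], 'distinct': [...]}) as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(labeled, f, indent=2)


def load_training(path):
    """Return previously saved labels, or empty labels if no file exists."""
    if not os.path.exists(path):
        return {"match": [], "distinct": []}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# One pair built from the records in this section's labeling example,
# assumed here to have been labeled as a match.
labeled = {
    "match": [
        [
            {
                "Site name": "ada s. mckinley st. thomas cdc",
                "Address": "3801 s. wabash",
            },
            {
                "Site name": "ada s. mckinley community services - mckinley - st. thomas",
                "Address": "3801 s wabash ave",
            },
        ]
    ],
    "distinct": [],
}

with tempfile.TemporaryDirectory() as tmp:
    training_file = os.path.join(tmp, "training.json")  # hypothetical name
    first_run = load_training(training_file)   # nothing saved yet
    save_training(training_file, labeled)
    second_run = load_training(training_file)  # labels restored from disk

print(len(first_run["match"]), len(second_run["match"]))
```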

Here’s an example labeling operation:

  Phone : 2850617
  Address : 3801 s. wabash
  Zip :
  Site name : ada s. mckinley st. thomas cdc

  Phone : 2850617
  Address : 3801 s wabash ave
  Zip :
  Site name : ada s. mckinley community services - mckinley - st. thomas

  Do these records refer to the same thing?
  (y)es / (n)o / (u)nsure / (f)inished
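
The prompt above suggests a simple loop: show a pair, read one key, and file the pair accordingly. A minimal sketch of such a loop follows. It assumes nothing about dedupe's internals: `label_pairs`, its `ask` parameter, and the sample records are hypothetical. The real library also decides which pairs are worth showing (active learning), which this sketch does not attempt.

```python
def label_pairs(pairs, ask=input):
    """Ask about each record pair until pairs run out or the user answers 'f'.

    Returns {'match': [...], 'distinct': [...]}; 'u' (unsure) pairs are skipped.
    """
    labeled = {"match": [], "distinct": []}
    for first, second in pairs:
        for record in (first, second):
            for field, value in record.items():
                print(f"{field} : {value or ''}")
            print()
        print("Do these records refer to the same thing?")
        answer = ""
        while answer not in ("y", "n", "u", "f"):
            answer = ask("(y)es / (n)o / (u)nsure / (f)inished ").strip().lower()
        if answer == "f":
            break
        if answer == "y":
            labeled["match"].append((first, second))
        elif answer == "n":
            labeled["distinct"].append((first, second))
    return labeled


# Hypothetical candidate pairs; the first mirrors the example above.
pairs = [
    (
        {"Phone": "2850617", "Address": "3801 s. wabash"},
        {"Phone": "2850617", "Address": "3801 s wabash ave"},
    ),
    (
        {"Phone": "2850617", "Address": "3801 s. wabash"},
        {"Phone": "5551234", "Address": "10 n. state st"},
    ),
    (
        {"Phone": "5551234", "Address": "10 n. state st"},
        {"Phone": "5551234", "Address": "10 north state"},
    ),
]

# Scripted answers stand in for keyboard input so the example runs unattended.
# The invalid "x" is ignored and the prompt repeats.
answers = iter(["y", "x", "n", "f"])
result = label_pairs(pairs, ask=lambda prompt: next(answers))
print(len(result["match"]), "match,", len(result["distinct"]), "distinct")
```

Passing the input function as a parameter keeps the loop testable. In interactive use you would leave `ask` at its default and keep labeling until you have at least the 10 matches and 10 non-matches recommended above.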