Project author: ravi3222

Project description:
A Word2vec (word embeddings) approach to identifying duplicate product listings in structured textual data from an e-commerce website.
Primary language: Jupyter Notebook
Project URL: git://github.com/ravi3222/DUPLICATE_PRODUCT_DETECTION.git
Created: 2019-01-17T20:11:26Z
Project community: https://github.com/ravi3222/DUPLICATE_PRODUCT_DETECTION

License:


DUPLICATE PRODUCT DETECTION

Problem definition:

Detect duplicate product listings in structured textual data from an e-commerce website.

Duplicate Products:

  • Products with similar characteristics and the same productUrl
  • Products with the same appearance that differ only in color
  • Products with the same images

Approach:

  1. Extracted the data for the Tops category from the larger dataset.
  2. Preprocessed the data:
    • Removed irrelevant columns
    • Dropped rows with incomplete data
    • Imputed null values for some rows in the Title column
  3. Framed duplicate detection as measuring the semantic similarity between product listing descriptions.
  4. Combined the relevant columns of each product into a single string.
  5. Converted each product listing string into a vector using Google’s pretrained word2vec model.
  6. Computed a cosine similarity score for each pair of products.
  7. Exported the scores in the required JSON file format.
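
Steps 4 to 7 can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: the small TOY_VECTORS table stands in for the pretrained Google word2vec model (which would normally be loaded with gensim), and averaging word vectors to get one vector per listing is an assumption, since the README does not say how the string is turned into a vector. The column names, the 0.95 threshold, and the output field names ("productIds", "score") are also hypothetical.

```python
import json
import math
from itertools import combinations

# Toy 3-dimensional embeddings standing in for the pretrained word2vec
# model (hypothetical values, for illustration only).
TOY_VECTORS = {
    "red": [0.9, 0.1, 0.0],
    "blue": [0.8, 0.2, 0.1],
    "cotton": [0.1, 0.9, 0.2],
    "top": [0.2, 0.8, 0.3],
    "leather": [0.0, 0.2, 0.9],
    "boots": [0.1, 0.1, 0.95],
}


def listing_text(product, columns):
    """Step 4: join the non-empty chosen columns into one lowercase string."""
    return " ".join(str(product[c]) for c in columns if product.get(c)).lower()


def embed(text, vectors):
    """Step 5 (assumed): average the vectors of all known words.

    Returns None when no word of the text is in the vocabulary.
    """
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Step 6: cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(products, columns, vectors, threshold=0.95):
    """Score every pair of products and keep pairs at or above the threshold."""
    embedded = {}
    for p in products:
        vec = embed(listing_text(p, columns), vectors)
        if vec is not None:
            embedded[p["productId"]] = vec
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embedded.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append({"productIds": [id_a, id_b], "score": round(score, 4)})
    return pairs


products = [
    {"productId": "A1", "title": "Red cotton top", "color": "red"},
    {"productId": "A2", "title": "Blue cotton top", "color": "blue"},
    {"productId": "B1", "title": "Leather boots", "color": ""},
]

duplicates = find_duplicates(products, ["title", "color"], TOY_VECTORS)
# Step 7: serialize the scored pairs as JSON.
print(json.dumps(duplicates, indent=2))
```

With the toy data, only the two tops that differ in color are reported as a duplicate pair, which matches the second duplicate definition above. Comparing every pair is quadratic in the number of products, so a larger catalog would likely need blocking or an approximate nearest-neighbor index.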

Contents

  • parse.py: Script to extract product category specific attributes based on product subcategory
  • processing.py: Script to clean and impute missing or NA values
  • model.py: Script for the pretrained gensim model
  • duplicates.py: Script to identify duplicates based on product details
  • json_output.json: Output JSON file containing duplicate product IDs with a similarity score for each member product
  • Image_Similarity/image_extractor.ipynb: Starter script to extract image details, convert them into JSON and download 200x200 images
  • imageData.json: JSON data containing product IDs with image URL links

Future Work:

Use a Siamese neural network (one-shot learning) to measure similarity between product images.
