Project author: ravi3222
Project description:
A word2vec (word embedding) approach to identifying duplicate product listings in structured textual data from an e-commerce website.
Language: Jupyter Notebook
Project URL: git://github.com/ravi3222/DUPLICATE_PRODUCT_DETECTION.git
DUPLICATE PRODUCT DETECTION

Problem definition:
Detect duplicate product listings in structured textual data from an e-commerce website.
Duplicate Products:
- Products with similar characteristics, same productUrl.
- Products with the same appearance but a different color
- Products with the same images
Approach:
- Extracted data for the Tops category from the large dataset.
- Performed data preprocessing:
  - Removed irrelevant columns
  - Dropped rows with incomplete data
  - Imputed null values for some rows in the Title column
- Duplicate product detection can be solved by measuring the semantic similarity between product listing descriptions.
- Combined product details into a single string using the relevant columns.
- Converted each product listing string into a vector using Google's pretrained word2vec model.
- Used cosine similarity to compute a score for two products at a time.
- Exported the scores in the required JSON file format.
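
The string, vector, and cosine-similarity steps above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes a listing vector is the average of its word vectors (the README does not say how word vectors are combined), and it uses a tiny hand-made embedding table, `TOY_VECTORS`, in place of Google's pretrained word2vec model. The column names `title` and `color` and all helper names are hypothetical.

```python
import math

# Toy 3-dimensional embeddings standing in for the pretrained word2vec
# vectors. The values are made up for illustration only.
TOY_VECTORS = {
    "red": [0.9, 0.1, 0.0],
    "blue": [0.8, 0.2, 0.1],
    "cotton": [0.1, 0.9, 0.2],
    "top": [0.2, 0.8, 0.3],
    "leather": [0.0, 0.2, 0.9],
    "boots": [0.1, 0.1, 0.95],
}


def listing_to_string(product, columns):
    """Join the non-empty values of the chosen columns into one lowercase string."""
    return " ".join(str(product[c]) for c in columns if product.get(c)).lower()


def text_to_vector(text, vectors):
    """Average the vectors of all in-vocabulary tokens (assumed pooling strategy).

    Returns None when no token of the text is in the vocabulary.
    """
    known = [vectors[tok] for tok in text.split() if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical product records with hypothetical column names.
columns = ["title", "color"]
top_red = {"title": "Red Cotton Top", "color": "red"}
top_blue = {"title": "Blue Cotton Top", "color": "blue"}
boots = {"title": "Leather Boots", "color": None}

vec_red = text_to_vector(listing_to_string(top_red, columns), TOY_VECTORS)
vec_blue = text_to_vector(listing_to_string(top_blue, columns), TOY_VECTORS)
vec_boots = text_to_vector(listing_to_string(boots, columns), TOY_VECTORS)

score_tops = cosine_similarity(vec_red, vec_blue)    # color variants of one product
score_mixed = cosine_similarity(vec_red, vec_boots)  # unrelated products
```

With real embeddings, `TOY_VECTORS` would be replaced by a lookup into the pretrained model; the two tops, which differ only in color, score much closer to 1 than the top and the boots.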
Contents
- parse.py: Script to extract product category specific attributes based on product subcategory
- processing.py: Script to clean and impute missing or NA values
- model.py: Script for the pretrained gensim model
- duplicates.py: Script to identify duplicates based on product details
- json_output.json: Output JSON file containing duplicate product IDs with a similarity score for each member product
- Image_Similarity/image_extractor.ipynb: Starter script to extract image details, convert them to JSON, and download 200x200 images
- imageData.json: JSON data containing product IDs with image URL links
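
As a rough illustration of how an output file like json_output.json could be produced from pairwise scores, the sketch below groups pairs above a cutoff under each product ID. The actual schema of json_output.json is not described here, so the layout, the field names `product_id` and `score`, the 0.9 threshold, and the function name `build_duplicate_report` are all assumptions.

```python
import json


def build_duplicate_report(pair_scores, threshold=0.9):
    """Map each product ID to its likely duplicates and their scores.

    pair_scores: dict of (id_a, id_b) -> cosine similarity.
    threshold: hypothetical cutoff above which a pair counts as a duplicate.
    """
    report = {}
    for (id_a, id_b), score in pair_scores.items():
        if score < threshold:
            continue
        rounded = round(score, 4)
        # Record the match in both directions so each product lists its duplicates.
        report.setdefault(id_a, []).append({"product_id": id_b, "score": rounded})
        report.setdefault(id_b, []).append({"product_id": id_a, "score": rounded})
    return report


# Hypothetical product IDs and scores.
pair_scores = {("P1", "P2"): 0.993, ("P1", "P3"): 0.316}
report = build_duplicate_report(pair_scores)
json_text = json.dumps(report, indent=2, sort_keys=True)
```

Writing `json_text` to disk would give a file keyed by product ID; products with no match above the threshold (here `P3`) are simply absent.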
Future Work:
Use a Siamese neural network (one-shot learning) to measure similarity between product images.
References: