DUPLICATE PRODUCT DETECTION

Problem definition:

The problem of detecting duplicate products listing from structured textual data from an e-commerce website.

Extracted data for Tops category from the large dataset.
Performed Data preprocessing:
- Carefully removing irrelevant columns
- Drop rows with incomplete data
- Imputing null values for some rows in Title column
This problem of duplicate product detection can be solved by measuring the similarity between product listings semantic description.
Combined product details into string using relevant columns.
Product listing string conversion into vector using word2vec pretrained Google’s model.
Cosine similarity is used to calculate score for two products at a time.
Export output score with required JSON file format.

parse.py: Script to extract product category specific attributes based on product subcategory
processing.py: Script to clean and impute missing or NA values
model.py: Script for pretrained gesim model
duplicates.py: Script to identify duplicates based on product details
json_output.json: Output json file contains duplicate product IDs with similarity score for each product members
Image_Similarity/image_extractor.ipynb: Starter script to extract image details, convert into json and download 200x200 images
imageData.json: JSON data contains product ids with image url links

Use siamese neural network (One shot learning) to find similarity between product images.