Deep Learning for Entity Matching: A Design Space Exploration


Authors:
Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra. 

Name:
Aditya Bhat

Motivation:
Entity matching identifies data instances that refer to the same real-world entity. This paper explores a design space of state-of-the-art deep learning techniques for this task, spanning models of varying representational power.

Pros:

1. The authors characterize a spectrum of representational power based on the choices made at each stage of the matching task. The simplest model (SIF) requires little time to train, whereas the most expressive one (Hybrid) takes much longer to train but provides better predictive power. The medium-complexity models (RNN, Attention) fall between these two extremes.

2. The deep learning (DL) techniques proposed in this paper handle structured, textual, and dirty data. This robustness is very useful when working with large corpora of data obtained from different sources. The models work better with large training datasets and show significant improvement in performance as training data grows.

3. The paper proposes a generic framework that can be used to streamline and leverage different DL techniques for the entity matching task. The three steps proposed are Attribute Embedding, Attribute Similarity Representation, and Classification. Each step is decoupled from the others, making it easy to plug in a different DL technique that gives better accuracy. [1,2]
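The three decoupled steps above can be sketched as independent functions that plug into one another. This is a minimal illustrative sketch, not the paper's actual implementation: the toy vocabulary, random word vectors, SIF-style averaging, and the linear classifier are all assumptions standing in for the learned components (e.g. fastText embeddings and trained networks) used in the paper.

```python
import numpy as np

# Toy word-vector table standing in for pre-trained embeddings (assumption).
rng = np.random.default_rng(0)
VOCAB = {w: rng.normal(size=8) for w in
         "apple iphone 12 pro smartphone galaxy s21 samsung".split()}

def embed_attribute(text):
    """Step 1, Attribute Embedding: map each token of an attribute
    value to a word vector."""
    vecs = [VOCAB[t] for t in text.lower().split() if t in VOCAB]
    return np.stack(vecs) if vecs else np.zeros((1, 8))

def similarity_representation(vecs_a, vecs_b):
    """Step 2, Attribute Similarity Representation (SIF-style here):
    aggregate token vectors by averaging, then compare the two
    attribute summaries with an element-wise absolute difference."""
    return np.abs(vecs_a.mean(axis=0) - vecs_b.mean(axis=0))

def classify(sim_vec, weights, bias=0.0):
    """Step 3, Classification: a toy linear scorer over the similarity
    vector, squashed to a match probability."""
    score = float(sim_vec @ weights + bias)
    return 1.0 / (1.0 + np.exp(-score))

# Usage: compare the 'title' attribute of two product records.
a = embed_attribute("Apple iPhone 12 Pro")
b = embed_attribute("apple iphone 12")
prob = classify(similarity_representation(a, b), rng.normal(size=8))
print(f"match probability: {prob:.3f}")
```

Because each step only consumes the previous step's output, swapping SIF averaging for an RNN or attention encoder in step 2 leaves steps 1 and 3 untouched, which is exactly the plug-and-play property the framework is designed for.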

Cons:

1. The DL models require a lot of computation power and also take much longer to train than the Magellan system. They showed poor performance on clean, structured data when compared against rule-based solutions: on some of the structured datasets the models overfit the data, hurting their performance, whereas Magellan performed better because of its restricted search space during the similarity check.

2. The paper does not consider the blocking stage of entity matching, which could be leveraged to produce better candidate pairs of entity mentions. Better blocking would further enhance the learning phase, helping the matcher match entities and generate fewer false negatives.

3. The paper assumes that high-fidelity labelled data is available, which is generally not the case; producing such labels requires substantial manual effort. Data in the wild is often very dirty, and this assumption can hinder the performance of the various models.

Comments

  1. Good point about the burden of labelling data. Handling noisy data with outliers, etc., seems to be a challenging future direction for all of these methods.
