Deep Learning for Entity Matching: A Design Space Exploration

Authors - Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra

Motivation/Abstract:
Introduces entity matching (EM) and deep learning separately, categorizes existing deep learning solutions to similar matching tasks in NLP, and then works toward applying them to EM. Discusses the single prior use of deep learning in EM, identifies the settings where deep learning might perform better, runs experiments, and reports results and conclusions.

Main Points:

The comparison between deep learning models and traditional EM methods takes place in three data settings -
1. Structured Data - records stored in relational (SQL-style) tables
2. Textual Data - attributes consisting of text blobs made up of long sentences
3. Dirty Data - erroneous structured data where attribute values are misplaced across fields

Four representative deep learning models are picked, and the deep learning process is explained through them.

Positive points about the paper:

1. The results are extensive, easy to understand, and make sense (they follow the usual deep learning advantages/disadvantages). For example, on structured data the deep learning models initially do not perform well because of a lack of training data, whereas once a model gets a lot of data, it outperforms the traditional EM method.

2. Throughout the experiments, one of the models (Hybrid) consistently performs better than the deep learning model previously used in EM. So it is a nice improvement over the prior model, with consistent performance.

3. The accuracy measure picked was the F1 score. It does not over-weight either precision or recall, and I believe it is a fair and just measure (in most cases).

4. Partitioning the data into these sub-parts and then studying each separately is a great way of looking at things, and it covers the majority of use cases.

5. The authors have done an extensive literature review and based their model architecture on a variety of existing deep learning approaches, resulting in a well-structured model layout where 32 different combinations could be tried and tested. The architecture offers modularity.
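To make point #3 concrete, here is a minimal sketch (not from the paper; the counts are made-up illustrative numbers) of how the F1 score balances precision and recall for a matcher's predictions:

```python
# Hypothetical sketch: F1 is the harmonic mean of precision and recall,
# so neither one dominates the score.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of predicted matches that are real
    recall = tp / (tp + fn)      # fraction of real matches that were found
    return 2 * precision * recall / (precision + recall)

# Example: 80 true matches found, 20 false matches, 40 missed matches.
print(round(f1_score(80, 20, 40), 3))  # 0.727 (precision 0.8, recall ~0.667)
```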

Things I have doubts about:

1. What happens in the cases where the F1 score is known to suffer a little bit, for instance when the output is heavily biased toward one class (imbalanced classes)?

2. The EM problem has two parts - blocking and matching.
This paper only addresses the matching part of the problem; it assumes we have perfect blocking output, which might not be the case. In particular, it assumes the blocking stage produced no false negatives, which cannot be the ideal situation. So how would one usually deal with blocking false negatives?

3. What happens when the input schemas do not match exactly? The authors assume the schemas of the two entity tables match completely (same number of attributes).
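To illustrate the blocking concern in point #2, here is a hypothetical sketch (the records, the `block_key` heuristic, and the attribute names are my own, not from the paper) of how a simple blocking rule can silently drop a true match before the matcher ever sees it:

```python
# Hypothetical sketch: blocking by the first token of the name attribute.
# A true match whose two records disagree on that token is dropped
# (a blocking false negative), so the matcher never gets to score it.
from itertools import combinations

records = [
    {"id": 1, "name": "Apple iPhone 7"},
    {"id": 2, "name": "iPhone 7 by Apple"},  # same phone, different first token
    {"id": 3, "name": "Apple iPad Mini"},
]

def block_key(rec):
    return rec["name"].split()[0].lower()

candidates = [
    (a["id"], b["id"])
    for a, b in combinations(records, 2)
    if block_key(a) == block_key(b)
]
print(candidates)  # [(1, 3)] -- the true match (1, 2) never reaches the matcher
```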


Name - Rahul Biswas



Comments

  1. Yes, false negatives are generally a more difficult problem and have a greater consequence. Point #3 is well taken.

