In this post we share how Skelter Labs prepared for the KorQuAD 2.0 challenge.
Machine Reading Comprehension (MRC) has long been one of the most challenging problems in NLP. With the advent of Transformer-based models such as BERT[1], the field has made huge advances over the past few years, with BERT and its variants becoming the architecture of choice for setting state-of-the-art results on NLP benchmarks.
With MRC technology we can now take search engines to the next level: we are getting ever closer to understanding user queries posed in natural language, and we can retrieve precise answers to users' questions.
At Skelter Labs we started working on MRC by participating in the KorQuAD v1[2] challenge, a Korean counterpart of the famous SQuAD[3] challenge. The KorQuAD v1 data consists of pairs of Korean Wikipedia[4] articles and questions, and the goal is to find exact-match answers to the questions within the articles. We took hold of first place after less than a quarter of effort, and this kick-started many of our current applications powered by this technology.
Following up on our success on KorQuAD v1, we set out to conquer the KorQuAD v2[5] leaderboard at the beginning of this year. KorQuAD v2 was much more challenging: it includes very long answers and takes entire raw Wikipedia HTML articles as input, making it a challenge not only on the modeling side but also on the engineering side, since we had to handle a huge amount of data and increased inference time.
Our KorQuAD v1 winning model, based on SpanBERT[6], handled inputs and post-processing in the conventional way. The question and a span of context are concatenated into a single input to the model. For long contexts, the context is split into multiple inputs in a sliding-window manner. The model outputs the start and end positions of the answer.
Figure 1. An example of KorQuAD v1-style context, question, answer, and converted form. The answer span starts at the 18th token ("조지") and ends at the 22nd token ("(").
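To make the sliding-window conversion concrete, here is a minimal sketch of how such inputs can be built with the Hugging Face tokenizer API; the model name, MAX_LEN, STRIDE, and the Korean example strings are illustrative assumptions, not our actual pipeline.

```python
# Minimal sketch of sliding-window input construction for an extractive QA model.
# The model name, MAX_LEN, STRIDE and the example strings are illustrative only.
from transformers import AutoTokenizer

MAX_LEN = 512   # maximum sequence length of the encoder
STRIDE = 128    # token overlap between consecutive context windows

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def make_windows(question: str, context: str):
    """Build [CLS] question [SEP] context-window [SEP] inputs for one article.

    Long contexts overflow into several windows; offset mappings let us convert
    character-level answer spans into token-level start/end labels per window.
    """
    return tokenizer(
        question,
        context,
        max_length=MAX_LEN,
        stride=STRIDE,
        truncation="only_second",          # only truncate/slide over the context
        return_overflowing_tokens=True,    # one encoded entry per window
        return_offsets_mapping=True,
    )

windows = make_windows("미국의 초대 대통령은 누구인가?", "조지 워싱턴은 미국의 초대 대통령이다. ...")
print(len(windows["input_ids"]))  # number of sliding windows for this article
```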
Running our KorQuAD v1 model on the KorQuAD v2 dataset without any modifications gave us very poor results:
- F1 score: 62.71%
- EM score: 51.00%
We used this first score as our baseline, and analyzing the dataset gave us a clearer picture of the challenge we were dealing with:
Figure 2. Analysis of our best KorQuAD v1 model's performance on the KorQuAD v2 dev data
Our model struggled with very long answers that spanned multiple context windows. Primarily these answers were HTML tables or large portions of text contained in <div> tags.
On KorQuAD v1, all answers were short and would be contained within a single input, making it easy to detect start and end positions. So our first challenge was to find a way to handle very long answers.
Figure 3. An example of KorQuAD v2 data that our best KorQuAD v1 model answered well.
Figure 4. An example of KorQuAD v2 data on which our best KorQuAD v1 model performed poorly.
As it turned out, this problem was easily solved by modifying our training-data labeling method. For a long answer, the window containing the beginning of the answer has a start position but no end position, so we labeled its end position as the [CLS] token. The window containing the end of the answer is labeled similarly, with its start position pointing to the [CLS] token.
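The sketch below illustrates this labeling rule under simplified assumptions: the function and variable names are hypothetical, positions are token indices within the article, and windows that contain neither boundary of the answer are treated here as no-answer windows.

```python
# Sketch of the modified labelling rule for answers spanning multiple windows.
# All names are hypothetical; positions are token indices within the article,
# and returned labels are relative to the window (in practice an extra offset
# for the [CLS] question [SEP] prefix would be added).
CLS_INDEX = 0  # position of the [CLS] token in a BERT-style input

def label_window(answer_start: int, answer_end: int,
                 window_start: int, window_end: int):
    """Return (start_label, end_label) for one sliding-window input."""
    has_start = window_start <= answer_start <= window_end
    has_end = window_start <= answer_end <= window_end

    if has_start and has_end:
        # Short answer fully contained in the window: label both ends normally.
        return answer_start - window_start, answer_end - window_start
    if has_start:
        # Window holds only the beginning of a long answer: end points at [CLS].
        return answer_start - window_start, CLS_INDEX
    if has_end:
        # Window holds only the end of a long answer: start points at [CLS].
        return CLS_INDEX, answer_end - window_start
    # Neither boundary falls in this window (assumption: treat as no-answer).
    return CLS_INDEX, CLS_INDEX
```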
This alone increased the F1 score to 80.40% and the EM score to 62.86%. However, it was not enough to get us to first place. Our second idea was to run a second round of pre-training of our model, using the masked LM task on the KorQuAD v2 training data only. Since HTML tags were broken down into multiple tokens by our previous vocabulary, we also added whole HTML tags to the vocabulary, reducing the tokenized context size by 51%. After another round of fine-tuning, this brought our F1 score to 87.39% and our EM score to 74.27%. A quick ablation study against our baseline showed that the additional pre-training round alone boosted the baseline F1 score from 62.71% to 70.44% and the EM score from 51.00% to 60.85%.
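As a rough illustration of the vocabulary change, the snippet below uses the Hugging Face API to register whole HTML tags as single tokens and resize the embedding matrix before the extra masked-LM pre-training round; the tag list and model name are assumptions for the example, not our exact setup.

```python
# Sketch: registering whole HTML tags as single vocabulary entries before the
# second masked-LM pre-training round. Tag list and model name are assumptions.
from transformers import AutoTokenizer, AutoModelForMaskedLM

HTML_TAGS = ["<table>", "</table>", "<tr>", "</tr>", "<td>", "</td>",
             "<th>", "</th>", "<div>", "</div>", "<ul>", "</ul>",
             "<li>", "</li>", "<p>", "</p>", "<a>", "</a>"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

sample = "<td>조지 워싱턴</td>"
print(len(tokenizer.tokenize(sample)))   # tags split into several sub-tokens

tokenizer.add_tokens(HTML_TAGS)                 # each tag becomes one token
model.resize_token_embeddings(len(tokenizer))   # grow the embedding matrix

print(len(tokenizer.tokenize(sample)))   # markedly fewer tokens per context
# The enlarged-vocabulary model can then run another round of masked-LM
# pre-training on the KorQuAD v2 training articles before QA fine-tuning.
```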