pg_eval_dataset_index_BERT.json
{
"queries": {
"4d499739-f971-4985-888c-f8db6c7b7efa": "Question 1: What is the main objective of introducing BERT?",
"b9c30c77-b575-406e-9a95-6d90263fcd20": "Question 2: How does BERT differ from other language representation models like GPT or RoBERTa?",
"7fe38245-bb09-4219-ac03-a7d0c1aed7d2": "Question 1: How does the feature-based approach differ from the fine-tuning approach when applying pre-trained language representations to downstream tasks?",
"cdc70093-b372-4515-ba59-452fdb6b7f8c": "Question 2: What are the limitations of using standard language models during pre-training, particularly regarding their unidirectional nature? Question 1: How do different approaches like ELMo and GPT handle pre-trained language representations differently for downstream tasks?",
"704a2abb-3313-4d02-8a18-173727858722": "What is the main limitation of current fine-tuning approaches? How does BERT address this limitation?",
"fd3ed9a5-4f0e-487a-9c83-93489521621a": "How does BERT alleviate the previously mentioned unidirectionality constraint? What is the \"masked language model\" (MLM) pre-training objective inspired by the Cloze task? How does this objective help in predicting the original vocabulary id of the masked token?",
"1c195a1f-4ad8-4916-8408-548ee22312dd": "Question 1: What is the main contribution of the paper presented by Google Research regarding Bidirectional Language Models?",
"85ac9ce9-cd7c-4df1-bc4c-31d91e1c51f4": "Question 2: How does the Masked Language Model differ from other pre-training techniques like Unidirectional or Shallow Concatenation? Question 1: What is the main contribution of the paper presented by Google Research regarding Bidirectional Language Models?",
"cab3c878-1b74-468f-b8d4-32d2f699acae": "What are some examples of pre-trained word embeddings? How do they improve upon embeddings learned from scratch? Can you provide any specific examples of how these techniques have been applied in practice? What are some other ways that these techniques can be generalized beyond just sentence embeddings? Are there any recent advancements in this field that have improved performance on certain tasks? Can you give an example of how ELMo's approach differs from traditional word embedding research? How does it integrate contextual word embeddings with existing task-specific architectures? Can you discuss the potential benefits of using contextual representations in natural language processing applications? Are there any limitations or challenges associated with using contextual representations in NLP? Can you explain the concept of ELMO and its predecessor, and how it differs from traditional word embedding research? Can you describe the process of training ELMo and what makes it unique compared to other methods? Can you discuss the potential benefits and drawbacks of using contextual representations in NLP? Can you compare and contrast the strengths and weaknesses of different types of embeddings, including those mentioned in the text? Can you discuss the potential impact of contextual representations on future developments in NLP? Can you explain the concept of LSTM and how it relates to ELMo? Can you discuss the potential benefits and drawbacks of using contextual representations in",
"375d961c-6abb-45a8-992e-e80563e6517f": "Question 1: What was the main focus of Melamud et al.'s (2016) study regarding contextual representations?",
"d84a25e5-6025-44ef-822a-cb1289c6e3c6": "Question 2: How did Fedus et al. (2018)'s work demonstrate the importance of using the cloze task in improving the robustness of text generation models?",
"34b28839-62f2-40fe-b1a1-4290bb7cdf3b": "What are some examples of transfer learning techniques that have been successfully applied in the field of natural language processing? How does BERT utilize these techniques in its architecture?",
"b91115fc-d825-489d-95f6-f0f096ff669b": "How does BERT's architecture differ from other transformer-based models like GPT or ALICE in terms of attention mechanisms, tokenization, and sequence length handling? Please compare and contrast these models' approaches to addressing common challenges faced by transformers in NLP tasks.",
"0e2b1a26-8011-4e1f-acf9-4ebad890bcfb": "Question 1: What is the distinctive feature of BERT that sets it apart from other transformer-based models?",
"a7d6fde0-84fc-46e5-ae45-d1c45b9bc058": "Question 2: How does the fine-tuning process work during the training of the BERT model? To answer these questions, students need to understand the core concepts of BERT's architecture, pre-training, and fine-tuning processes as described in the given text. These questions cover key aspects such as the unified architecture across tasks, minimal differences between pre-trained and final architectures, and the fine-tuning procedure. They require comprehension of the model's components and how it adapts to new tasks through parameter adjustments. This approach ensures that students can demonstrate their understanding of both the theoretical underpinnings and practical applications of BERT within the framework of computer vision research.",
"cf9b1128-3555-47ad-9f8f-98419073baba": "Question 1: What is the primary difference between Bidirectional Self-Attention and Constrained Self-Attention used in the BERT Transformer?",
"0486c3d1-760c-4757-9cd9-f82ed37a782f": "Question 2: How does the choice of model size (e.g., BERTBASE or BERTLARGE) affect the performance of the Transformer-based models? Please provide specific examples from the given text. Question 1: What is the main difference between Bidirectional Self-Attention and Constrained Self-Attention utilized in the BERT Transformer?",
"c98cd949-bc46-4534-a5f7-505ac87ef4af": "What is the main difference between the first token of every sequence being a special classification token and adding a learned embedding to every token indicating whether it belongs to sentence A or sentence B? How does this affect the overall representation of the input sequence?",
"c39068c8-21b1-454b-9977-0314cde62cc0": "Based on the context information provided, what is the primary distinction between the initial token within each sequence serving as a special classification token and incorporating a learned embedding that distinguishes tokens belonging to different sentences? Additionally, how does this modification influence the comprehensive representation of the input sequence?",
"1c1f9d1c-88c5-4faa-894d-1b94815913bc": "What is the main difference between a Transformer decoder and a Transformer encoder? How does this affect the training process?",
"d453b614-200e-4caa-87af-f733a1700ea4": "How does the masking technique work in BERT's masked language modeling task? What are the implications of this approach for the final hidden vectors?",
"be8dfb1c-a092-4b69-939e-efa07c8a131a": "Question 1: What is the main difference between denoising auto-encoders and other types of auto-encoders?",
"b4eefd84-e476-478e-86c8-5ca9fa4d8580": "Question 2: How does the proposed method address the issue of mismatch between pre-training and fine-tuning? Question 1: What is the primary advantage of using a bidirectional pre-trained model over traditional models?",
"03dd22e3-6aba-4064-8c61-87b48f93c7a8": "What benefits does pre-training towards predicting next sentences have? How effective is the final model's performance on the NSP task compared to other tasks?",
"58f553b4-1d17-41ef-9c68-2ad24a33643c": "Teacher: What benefit does pre-training towards predicting the next sentence provide? How well does the final model perform on the NSP task relative to other models?",
"09b58f0d-b4b8-421e-8783-ecd7df290b76": "Question 1: What is the main difference between fine-tuning BERT and using it for downstream tasks?",
"0b603baf-a569-43c0-a941-8728d57d87b9": "Question 2: How does BERT's self-attention mechanism differ from other models in terms of its ability to handle multi-modal data? Question 1: What is the main difference between fine-tuning BERT and using it for downstream tasks?",
"09a76baa-a56d-4441-913a-7473dcab6694": "What is the main difference between BERT's approach and traditional approaches like Parikh et al.'s method? How does BERT's self-attention mechanism improve upon these methods?",
"9d9fd296-3c57-446c-9f33-ee5f7d3f2dbd": "What are some key components of BERT's architecture that make it effective for various NLP tasks, and how do they contribute to its performance? To what extent has fine-tuning been shown to benefit the performance of BERT compared to pre-training, and why might this be the case?",
"db12a500-48c6-4fab-b853-b439fc31526a": "Question 1: What is the General Language Understanding Evaluation (GLUE) benchmark and how does it differ from other benchmarks?",
"92a91e62-aa12-42fd-98b2-195e8923b329": "Question 2: How long did it take to fine-tune BERT on the GLUE dataset using a single Cloud TPU, and what was the resulting F1 score? Question 1: What is the General Language Understanding Evaluation (GLUE) benchmark and how does it differ from other benchmarks?",
"d9d41473-a6e0-442c-afd5-332c81fbde94": "What were the average performance scores for each model in the GLUE benchmark? How did these scores compare with the overall average score? Which models performed better than others in terms of accuracy, F1-score or correlation? What was the difference between the pre-trained models and the latest versions of BERT and OpenAI GPT? Could you provide more details about the specific improvements made in the latest versions of BERT and OpenAI GPT compared to previous ones? Based on the GLUE benchmark results, which model(s) showed significant improvement over time and why do you think they did so? Can you explain how the performance of each model varied across different tasks such as QQP, MRPC, STS-B, and CoLA? Please discuss any potential limitations or drawbacks of using the GLUE benchmark for evaluating language models. Finally, can you suggest ways to improve the current state-of-the-art models like BERT and OpenAI GPT to achieve even higher performance in future? |<|",
"7bbda2c4-f90d-4c63-ab91-2be0c47ab779": "What were the main differences between BERTBASE and BERTLARGE? How did these differences impact the performance of the models on various GLUE tasks?",
"59d11fa2-049a-4c5b-adc1-bf98b374128e": "How does the performance of BERT compare to that of OpenAI GPT on the official GLUE leaderboard? What specific improvements can be attributed to BERT's larger model size? To what extent do you believe the differences in performance between BERTBASE and BERTLARGE reflect the impact of model size on various GLUE tasks? Could you elaborate on how the authors addressed potential instability issues with BERTLARGE during its fine-tuning process? Please provide insights into the overall implications of this research on the field of natural language processing and machine learning. Based on your analysis, could you suggest any future directions or areas of focus for further investigation in this area? |<|",
"06030106-46d3-4b7f-aa91-0225eb418947": "Question 1: How did BERTLARGE perform on the GLUE leaderboard?",
"332488ff-2eb6-4321-b7af-d314abfaf9b6": "Question 2: What was the Stanford Question Answering Dataset (SQuAD v1.1) used for, and how does BERT handle such datasets?",
"617e4fd8-ed5d-44bd-afff-c95c9cd7bc55": "Question 1: What was the main focus of the study conducted by Seo et al. in 2017?",
"8ce17f92-1bd6-4a89-b473-3b95a73d2b34": "Question 2: How did the authors fine-tune their BERT model before using it on SQuAD? Question 1: What were the key components used in the training process of the system developed by the researchers mentioned in the text?",
"59cf636f-353e-409f-9040-a6b342b260c1": "Question 1: What were the top leaderboard systems for Human participants in the SQuAD 1.1 competition?",
"9d049470-1d41-45ea-8a7f-12e293779c5a": "Question 2: How did the Ours system perform compared to other systems in the SQuAD 2.0 competition? To compare with the original question, please provide the performance metrics of the Ours system against the other models mentioned in the table. Question 1: Which systems achieved the highest EMM scores among the Human participants in the SQuAD 1.1 competition?",
"e3691d79-39b7-4c12-b86d-06fd7d63b0cf": "Question 1: How does the ESIM+GloVe model perform on the SQuAD v2.0 task compared to other models?",
"625185b4-e0e3-437b-820d-57b929cadede": "Question 2: What modifications were made to the SQuAD v1.1 model to address the new definition of the task, and how effective were these modifications?",
"2568a84d-acc6-4720-b720-85411e78ca6d": "Question 1: How was the model fine-tuned and what were the specific parameters used during the process?",
"6a3e8dd1-e938-463b-bacc-c34206954508": "Question 2: What improvements were made over the previous best system and how does it compare to other models like Sun et al.'s and Wang et al.'s? Question 1: How was the model fine-tuned and what were the specific parameters used during the process?",
"1b516bfb-64a7-4480-bd28-2abce91504bb": "What were the different ablations performed on the BERTBASE architecture? How did they affect the accuracy and F1 score metrics?",
"464de966-1d58-4360-94d9-f43e26d9fbef": "How does the addition of a randomly initialized BiLSTM layer impact the performance of the \"LTR + No NSP\" model compared to the original LTR model? What are the implications of this change on the overall effectiveness of the pre-trained language model? To what extent do these ablations contribute to improving the downstream performance of the model? Can you provide any insights or conclusions from your analysis? Please include relevant references and citations where applicable.",
"532888a5-cbeb-420c-8fba-020498caff57": "What were the key differences between the pre-training process and fine-tuning process used in the study? How did these differences affect the performance of the model during the fine-tuning stage? What were the specific impacts of the NSP task on the model's performance? How did adding a random-initialized BiLSTM on top of the LTR model improve its performance on SQuAD? Can you suggest any other ways to strengthen the LTR system beyond just using a BiLSTM? What were the limitations of the proposed approach in terms of computational cost and intuitive understanding for certain tasks such as QA? How did the performance of the model change when trained on different datasets or with varying levels of data augmentation? Could you provide more details on how the input representation affects the model's performance? How did the addition of the NSP task impact the performance of the model on various downstream tasks such as QNLI, MNLI, and SQuAD 1.1? What were the implications of using a larger training dataset and a different fine-tuning scheme compared to OpenAI GPT? How did the performance of the model change when trained on different types of neural networks such as LSTM, GRU, or Transformer-based architectures? How did the performance of the model change when trained",
"93f713f1-cc9a-44a6-9a95-7c76f9e2b28e": "Question 1: How did the researchers observe the impact of model size on fine-tuning task accuracy?",
"e18b624a-7c02-4ed6-9bf8-fe88206bf2cf": "Question 2: What were the results observed by the researchers when they compared the performance of different sized BERT models on various GLUE tasks?",
"48389c08-1f6b-4db1-8c68-219247371881": "What were some of the key findings or insights regarding the impact of increasing the pre-trained Bi-LM size from two to four layers? How does this research align with previous work mentioned in the paper?",
"c26f23e1-0d6d-47b1-acc7-7e84e1938536": "How do you think the feature-based approach differs from the fine-tuning approach in terms of its advantages and potential benefits for downstream tasks? Can you provide examples of how this approach could be applied in practice?",
"d26cb4ca-c419-40f0-906d-b0f3c84e82de": "What were the results of the study using BERT on the CoNLL-2003 Named Entity Recognition task? How did the inclusion of maximal document context affect the performance of the model? What was the accuracy of the model when applied to different datasets? Can you provide more details about the hyperparameters used in the experiment? How does the increase in model size impact the performance of the model? Are there any other factors that could have affected the performance of the model besides the model size? Could you explain how the use of a case-preserving WordPiece model affects the performance of the model? How does the use of a Maximal Document Context affect the performance of the model? Can you provide more details about the CRF Hyperparams Dev Set Accuracy? How does the use of a CRF Hyperparams Dev Set Accuracy affect the performance of the model? Can you provide more details about the Masked LM Perplexity of held-out training data? How does the use of a Masked LM Perplexity of held-out training data affect the performance of the model? Can you provide more details about the Number of Layers? How does the use of a Number of Layers affect the performance of the model? Can you provide more details about the Hidden Size? How does the",
"812d36f5-8203-407d-9502-65dad9de5e68": "Question 1: How does the performance of different methods compare in terms of accuracy and efficiency?",
"f18d9a22-2c73-4390-b79e-881cc41fa7a6": "Question 2: What are the key components and techniques involved in the fine-tuning approach compared to the feature-based approach? Question 1: How does the performance of different methods compare in terms of accuracy and efficiency?",
"c766cd49-e879-45d3-baaf-5bb53fe34b1b": "What were the key findings regarding the performance of different methods when using BERT? How did this demonstrate the effectiveness of BERT for both fine-tuning and feature-based approaches? Could you elaborate on how recent empirical improvements through transfer learning with language models have impacted various aspects of language understanding systems? What specific benefits can be observed in terms of resource efficiency when utilizing deep unidirectional architectures compared to their bidirectional counterparts? Please provide insights into how our study extends previous research by applying these principles to broader sets of natural language processing tasks. |<|",
"b03d5a41-8c0b-4378-a9be-1ddf18808725": "Teacher/Professor:",
"2c90fa96-f612-40fb-a3c7-e4e8c78e71eb": "What are some key contributions made by Alan Akbik, Duncan Blythe, and Roland Vollgraf in their work on contextual string embeddings for sequence labeling? How does this approach differ from previous approaches in character-level language modeling using self-attention?",
"8f1e7bee-54b2-4665-82e4-138c7ba37edf": "Teacher/Professor:",
"92cbcbf7-be26-45ea-b0ed-a4d2d32c428c": "What was the main focus of the paper by Daniel Cer et al. regarding the Semeval-2017 task?",
"25900e1c-96a1-4630-a54f-bea9c176b487": "What is the main focus of the research paper by Alexis Conneau et al.? How does their work differ from previous approaches in natural language processing? What are some potential applications or implications of their approach? Can you provide any examples of how their method could be used in real-world scenarios? Please also discuss the limitations or challenges that they faced during the development of their model. Finally, can you compare their approach to other recent advancements in NLP such as those mentioned in the context information? To what extent do you think their work has contributed to the field of NLP? Based on your understanding of the paper, please provide a brief summary of its key contributions and findings. Could you elaborate on the technical details behind their model, including the architecture, training process, and evaluation metrics? Lastly, how might future research build upon this work and what new directions could it take? Please provide a detailed response to these questions. |<|",
"bb189784-5b73-4b68-9ae7-14fa1788c777": "What are some key concepts or ideas discussed in the given documents? How do these relate to each other? Can you identify any common themes or patterns that emerge from analyzing the content? Please provide examples from the documents to support your answer. What are the main takeaways from this analysis? How does it impact our understanding of related topics or fields? Can you suggest potential applications or implications of these insights in real-world scenarios? Please provide specific examples from the documents to illustrate your points. How might these insights influence future research or development efforts in natural language processing or related areas? Please provide references to relevant papers or resources as evidence for your claims. Can you discuss how these insights could potentially lead to improvements in existing models or algorithms used in NLP tasks such as sentiment analysis, topic modeling, or machine translation? Please provide concrete examples from the documents to support your argument. How can we further leverage these insights to enhance the performance or efficiency of current NLP systems? Please provide suggestions for practical implementations or strategies that could be employed by researchers or practitioners working in the field. Can you explore the ethical considerations or limitations associated with using these insights in NLP applications? Please provide case studies or examples from the literature to support your discussion. How might these insights inform best practices or guidelines for developing",
"3dc6eadc-1479-42b6-b6af-f78c5ee818f3": "Teacher/Professor:",
"ed848d06-7fb1-417a-b916-5d9dce3b00cd": "What was the main focus of the paper by Mandar Joshi et al. at the time of writing?",
"fccc33d3-771f-4b88-9d26-27ad7c7f3256": "What are some key concepts or ideas discussed in this text? How do these relate to each other and how might they be applied in real-world scenarios? Could you provide examples of how these concepts have been implemented in practical applications? What are the potential limitations or drawbacks of using these methods or approaches? How can we improve upon them going forward? Please discuss your thoughts and insights on this topic. Thank you for sharing your thoughts. Let's get started! Question 1: Can you identify any specific techniques or algorithms mentioned in the text that could potentially be used in real-world applications? Please explain how these techniques or algorithms work and what benefits they offer. Question 2: Based on the information provided in the text, please summarize the main points and draw connections between different concepts. Additionally, suggest possible ways in which these concepts could be integrated into existing systems or processes. Finally, consider discussing the ethical implications of using these technologies and propose guidelines for responsible use. Thank you for participating in this discussion. Let's continue. Question 1: Can you elaborate on the concept of \"distributed representations\" as described in the text? Specifically, how does this approach differ from traditional machine learning techniques such as linear regression or decision trees? Please provide examples of how this technique has been successfully applied in",
"325dd4d0-67d4-4aa3-90f9-e83294399989": "Question 1:",
"6a710f48-eee2-4ea1-b15f-7d3605024d80": "What is the main focus of the paper \"Dissecting contextual word embeddings: Architecture and representation\" by Alec Radford et al?",
"08a3a347-a366-4fca-b5ae-ccac7d9a86ca": "Teacher/Professor:",
"1630ca98-7b4b-4257-b168-145913c6adc0": "What was the main focus of Erik F Tjong Kim Sang and Fien De Meulder's study published in \"Journalism Bulletin\"?",
"1e93729f-5220-4bd6-88e4-5e7f2ae37157": "Teacher/Professor:",
"a9be8260-e8cf-405c-a847-1146753474dc": "What is the main focus of the paper \"Multi-granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering\"?",
"44aa0539-cc2f-436c-ab08-d43560ebc4f7": "Based on the given context, here are two questions that cover different aspects:",
"bdb1fa9d-ec0d-420f-a9a7-f076e60f1a11": "**Question 1:**",
"334bdd40-a421-4d98-ab68-65aed3839cfb": "What are some examples of the pre-training tasks? How do they work? What is the purpose of using the Masked Language Model during the pre-training process?",
"188227f5-2e53-4fba-9da2-1aed25cc5ed6": "What are some examples of the pre-training tasks? How do they work? What is the purpose of using the Masked Language Model during the pre-training process?",
"ed89c40f-a8c7-4918-ace7-ec65a20ddf61": "Question 1: What is the main difference between BERT and OpenAI GPT in terms of their pre-training models?",
"86a76bcb-0c54-44cd-84d0-2e363a8fabce": "Question 2: How do BERT's joint conditioning on both left and right context compare to other pre-trained models like ELMo?",
"5df5c489-3042-4ec9-9dc6-b0d115159a0a": "What were the main components of the training process described in this text? How did the authors optimize the training process to handle longer sequences efficiently?",
"d0cfcf39-e56c-4e42-88a4-f6096bc0073f": "What were the key differences between BERTBASE and BERTLARGE mentioned in the text? How did these models differ in terms of their architecture and performance?",
"d0fc2eab-30f9-4f6e-95ea-53e026010a8d": "What is the optimal range of batch sizes for fine-tuning? What are some key factors that influence the performance of the model during pre-training?",
"3b5b82c1-d224-4a46-b6e6-77dcebe28050": "What is the optimal range of batch sizes for fine-tuning? How does the choice of learning rate affect the training process?",
"fafadd60-e3da-45b5-8555-5bbbd944766f": "What are some key differences between BERT and OpenAI GPT? How does the architecture of these models differ from each other? What specific features or components of BERT and OpenAI GPT contribute to their performance? Can you provide any insights into why certain hyperparameters may have been chosen differently for these models? How might the differences in training procedures impact the final results of these models? Are there any potential drawbacks or limitations associated with using either of these models in real-world applications? Based on your analysis, what can you conclude about the strengths and weaknesses of BERT and OpenAI GPT in terms of their ability to perform different types of natural language processing tasks? Could you discuss how the differences in training procedures and architecture affect the generalizability of these models to new domains or datasets? Please consider providing concrete examples or case studies where these models have been applied successfully or unsuccessfully in various contexts. Finally, would you recommend one of these models as a better choice for a particular application scenario, given the trade-offs involved in choosing between them? Please justify your answer based on the evidence presented in the document.",
"c7e97357-1156-4107-8d29-836c008dcefd": "What are the main differences between GPT and BERT? How do these differences affect their performance? Can you provide examples of how these differences manifest in real-world applications or scenarios? What are some potential future directions for research in this area? To what extent does the use of pre-trained models like BERT and GPT impact the training process and overall performance of machine learning algorithms? Please discuss the implications of using different types of embeddings such as sentence embeddings, token embeddings, and contextual representations in natural language processing tasks. How do these embeddings differ in terms of their structure, scope, and application? Can you explain the role of the [SEP] and [CLS] tokens in the BERT model and how they contribute to its effectiveness? Finally, please discuss the importance of fine-tuning techniques in improving the performance of machine learning models, especially when dealing with limited labeled data. How does the choice of learning rate, batch size, and number of steps influence the convergence of the model during training? Can you provide insights into the trade-offs involved in choosing different hyperparameters for achieving optimal performance? Based on your understanding of the above concepts, please design a comprehensive question bank for the examination. Question 1: Compare and contrast the key features and capabilities of GPT and BERT,",
"8aacfa2a-fda1-4b03-bc7a-d9fa6b017d34": "What is the MNLI dataset used for? How does it differ from other datasets like QQP and QNLI? <|USER|> What is the main difference between MNLI and QQP? How does this impact their performance on different tasks? <|ASSISTANT|>",
"ba7b8965-dc16-40fd-a2fa-4257f72ae2e6": "What are some examples of tasks that have been fine-tuned using BERT? To what extent has it been used in different domains such as CoLA, STS-B, MRPC? How does BERT perform in these tasks compared to traditional machine learning models? Can you provide more details about the datasets used for training and evaluation in each task? What are the limitations of using BERT in these tasks? Are there any potential improvements or modifications that can be made to enhance its performance? Please discuss the ethical implications of using BERT in these applications. Could you also compare the performance of BERT with other state-of-the-art models in these tasks? Finally, please suggest ways to further improve the accuracy and efficiency of BERT in these tasks. |<|",
"f2208672-4f8d-4cef-932a-1ea413df8dd3": "What is the main difference between MRPC and WNLI datasets? How does the performance of OpenAI GPT compare to other models in terms of these datasets? What are some potential ways to improve the performance of OpenAI GPT on these datasets? Can you provide any insights or examples related to the limitations of the GLUE dataset mentioned in the context? How might the use of multiple tasks affect the performance of OpenAI GPT on these datasets? Are there any specific techniques or methods that can help improve the performance of OpenAI GPT on these datasets? Can you suggest any strategies or approaches that can be used to address the challenges faced by OpenAI GPT on these datasets? How do you think the performance of OpenAI GPT will evolve over time as it continues to learn and adapt through interactions with users? Can you discuss any recent developments or advancements in the field of natural language processing that have impacted the performance of OpenAI GPT on these datasets? How do you think the performance of OpenAI GPT will change if it is exposed to more diverse and varied input data? Can you explain the concept of entailment and its importance in the context of these datasets? How do you think the performance of OpenAI GPT will change if it is exposed to more diverse",
"1532bc5b-a3d0-4c5c-b0dc-dec8621dc9b0": "Question 1: What were the key findings regarding the impact of pre-training steps on achieving high fine-tuning accuracy in the study?",
"9939d620-2bdc-4bb9-b5c0-fd52f08abb28": "Question 2: In what ways did the mixed strategy for masking target tokens during pre-training differ from other masking strategies? To understand these differences better, please provide a detailed analysis of how this approach affected the performance of BERT.",
"34cb12a7-2088-4a72-bae5-ec4c0d8bf7b7": "What were the main differences between fine-tuning and feature-based approaches in the study? How did these approaches perform differently on the Dev set? What were the key findings from this ablation study? Could you provide more details about the masking strategies used in the study? How does the feature-based approach compare to other methods in terms of performance? What were the specific masking rates tested in the study? How did the authors conclude their analysis? Can you explain how the authors determined the optimal masking rate for the study? What were the overall results of the study? How did the authors justify the choice of BERT's last four layers as features? What were the implications of the study for future research in natural language processing? How did the authors address potential biases or limitations in their study? What were the strengths and weaknesses of the study design? How did the authors balance the trade-off between training efficiency and model performance? What were the challenges faced by the authors in conducting the study? How did the authors ensure the reproducibility of their results? What were the limitations of the study in terms of sample size or data availability? How did the authors validate the effectiveness of their approach? What were the ethical considerations involved in the study? How did the authors incorporate feedback from experts or stakeholders into",
"3e83b371-aee3-4220-9d09-1afdce463f9e": "What were the main findings regarding the fine-tuning process in the study? How did the researchers compare different masking strategies during the pre-training phase?",
"937f4d22-be7b-4fc7-a137-ae4e08367757": "What were the key differences between the feature-based approach and the traditional method in terms of performance? How did the researchers determine the best approach for NER with the feature-based approach? What were the limitations or challenges faced by the researchers while implementing the feature-based approach? How did the researchers address these limitations or challenges in their implementation? Can you explain how the researchers determined the best approach for NER with the feature-based approach? What were the advantages and disadvantages of using the feature-based approach over other approaches in this study? How did the researchers evaluate the effectiveness of their approach in predicting entities in the Dev set? Can you explain how the researchers evaluated the effectiveness of their approach in predicting entities in the Dev set? What were the limitations or challenges faced by the researchers while implementing the feature-based approach? How did the researchers address these limitations or challenges in their implementation? Can you explain how the researchers addressed the limitations or challenges faced by the researchers while implementing the feature-based approach? What were the advantages and disadvantages of using the feature-based approach over other approaches in this study? How did the researchers evaluate the effectiveness of their approach in predicting entities in"
},
"corpus": {
"node_0": "Proceedings of NAACL-HLT 2019, pages 4171\u20134186\nMinneapolis, Minnesota, June 2 - June 7, 2019.c\u20dd2019 Association for Computational Linguistics\n4171\nBERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT, which stands for\nBidirectional Encoder Representations from\nTransformers. Unlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to pre-\ntrain deep bidirectional representations from\nunlabeled text by jointly conditioning on both\nleft and right context in all layers. As a re-\nsult, the pre-trained BERT model can be \ufb01ne-\ntuned with just one additional output layer\nto create state-of-the-art models for a wide\nrange of tasks, such as question answering and\nlanguage inference, without substantial task-\nspeci\ufb01c architecture modi\ufb01cations.\nBERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\nsults on eleven natural language processing\ntasks, including pushing the GLUE score to\n80.5% (7.7% point absolute improvement),\nMultiNLI accuracy to 86.7% (4.6% absolute\nimprovement), SQuAD v1.1 question answer-\ning Test F1 to 93.2 (1.5 point absolute im-\nprovement) and SQuAD v2.0 Test F1 to 83.1\n(5.1 point absolute improvement).\n1 Introduction\nLanguage model pre-training has been shown to\nbe effective for improving many natural language\nprocessing tasks (Dai and Le, 2015; Peters et al.,\n2018a; Radford et al., 2018; Howard and Ruder,\n2018).",
"node_1": "1 Introduction\nLanguage model pre-training has been shown to\nbe effective for improving many natural language\nprocessing tasks (Dai and Le, 2015; Peters et al.,\n2018a; Radford et al., 2018; Howard and Ruder,\n2018). These include sentence-level tasks such as\nnatural language inference (Bowman et al., 2015;\nWilliams et al., 2018) and paraphrasing (Dolan\nand Brockett, 2005), which aim to predict the re-\nlationships between sentences by analyzing them\nholistically, as well as token-level tasks such as\nnamed entity recognition and question answering,\nwhere models are required to produce \ufb01ne-grained\noutput at the token level (Tjong Kim Sang and\nDe Meulder, 2003; Rajpurkar et al., 2016).\nThere are two existing strategies for apply-\ning pre-trained language representations to down-\nstream tasks: feature-based and \ufb01ne-tuning. The\nfeature-based approach, such as ELMo (Peters\net al., 2018a), uses task-speci\ufb01c architectures that\ninclude the pre-trained representations as addi-\ntional features. The \ufb01ne-tuning approach, such as\nthe Generative Pre-trained Transformer (OpenAI\nGPT) (Radford et al., 2018), introduces minimal\ntask-speci\ufb01c parameters, and is trained on the\ndownstream tasks by simply \ufb01ne-tuning all pre-\ntrained parameters. The two approaches share the\nsame objective function during pre-training, where\nthey use unidirectional language models to learn\ngeneral language representations.\nWe argue that current techniques restrict the\npower of the pre-trained representations, espe-\ncially for the \ufb01ne-tuning approaches. The ma-\njor limitation is that standard language models are\nunidirectional, and this limits the choice of archi-\ntectures that can be used during pre-training.",
"node_2": "The two approaches share the\nsame objective function during pre-training, where\nthey use unidirectional language models to learn\ngeneral language representations.\nWe argue that current techniques restrict the\npower of the pre-trained representations, espe-\ncially for the \ufb01ne-tuning approaches. The ma-\njor limitation is that standard language models are\nunidirectional, and this limits the choice of archi-\ntectures that can be used during pre-training. For\nexample, in OpenAI GPT, the authors use a left-to-\nright architecture, where every token can only at-\ntend to previous tokens in the self-attention layers\nof the Transformer (Vaswani et al., 2017). Such re-\nstrictions are sub-optimal for sentence-level tasks,\nand could be very harmful when applying \ufb01ne-\ntuning based approaches to token-level tasks such\nas question answering, where it is crucial to incor-\nporate context from both directions.\nIn this paper, we improve the \ufb01ne-tuning based\napproaches by proposing BERT: Bidirectional\nEncoder Representations from Transformers.\nBERT alleviates the previously mentioned unidi-\nrectionality constraint by using a \u201cmasked lan-\nguage model\u201d (MLM) pre-training objective, in-\nspired by the Cloze task (Taylor, 1953). The\nmasked language model randomly masks some of\nthe tokens from the input, and the objective is to\npredict the original vocabulary id of the masked",
"node_3": "4172\nword based only on its context. Unlike left-to-\nright language model pre-training, the MLM ob-\njective enables the representation to fuse the left\nand the right context, which allows us to pre-\ntrain a deep bidirectional Transformer. In addi-\ntion to the masked language model, we also use\na \u201cnext sentence prediction\u201d task that jointly pre-\ntrains text-pair representations. The contributions\nof our paper are as follows:\n\u2022 We demonstrate the importance of bidirectional\npre-training for language representations. Un-\nlike Radford et al. (2018), which uses unidirec-\ntional language models for pre-training, BERT\nuses masked language models to enable pre-\ntrained deep bidirectional representations. This\nis also in contrast to Peters et al. (2018a), which\nuses a shallow concatenation of independently\ntrained left-to-right and right-to-left LMs.\n\u2022 We show that pre-trained representations reduce\nthe need for many heavily-engineered task-\nspeci\ufb01c architectures. BERT is the \ufb01rst \ufb01ne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level and token-level tasks, outper-\nforming many task-speci\ufb01c architectures.\n\u2022 BERT advances the state of the art for eleven\nNLP tasks. The code and pre-trained mod-\nels are available at https://github.com/\ngoogle-research/bert.\n2 Related Work\nThere is a long history of pre-training general lan-\nguage representations, and we brie\ufb02y review the\nmost widely-used approaches in this section.\n2.1 Unsupervised Feature-based Approaches\nLearning widely applicable representations of\nwords has been an active area of research for\ndecades, including non-neural (Brown et al., 1992;\nAndo and Zhang, 2005; Blitzer et al., 2006) and\nneural (Mikolov et al., 2013; Pennington et al.,\n2014) methods. Pre-trained word embeddings\nare an integral part of modern NLP systems, of-\nfering signi\ufb01cant improvements over embeddings\nlearned from scratch (Turian et al., 2010).",
"node_4": "Pre-trained word embeddings\nare an integral part of modern NLP systems, of-\nfering signi\ufb01cant improvements over embeddings\nlearned from scratch (Turian et al., 2010). To pre-\ntrain word embedding vectors, left-to-right lan-\nguage modeling objectives have been used (Mnih\nand Hinton, 2009), as well as objectives to dis-\ncriminate correct from incorrect words in left and\nright context (Mikolov et al., 2013).\nThese approaches have been generalized to\ncoarser granularities, such as sentence embed-\ndings (Kiros et al., 2015; Logeswaran and Lee,\n2018) or paragraph embeddings (Le and Mikolov,\n2014). To train sentence representations, prior\nwork has used objectives to rank candidate next\nsentences (Jernite et al., 2017; Logeswaran and\nLee, 2018), left-to-right generation of next sen-\ntence words given a representation of the previous\nsentence (Kiros et al., 2015), or denoising auto-\nencoder derived objectives (Hill et al., 2016).\nELMo and its predecessor (Peters et al., 2017,\n2018a) generalize traditional word embedding re-\nsearch along a different dimension. They extract\ncontext-sensitive features from a left-to-right and a\nright-to-left language model. The contextual rep-\nresentation of each token is the concatenation of\nthe left-to-right and right-to-left representations.\nWhen integrating contextual word embeddings\nwith existing task-speci\ufb01c architectures, ELMo\nadvances the state of the art for several major NLP\nbenchmarks (Peters et al., 2018a) including ques-\ntion answering (Rajpurkar et al., 2016), sentiment\nanalysis (Socher et al., 2013), and named entity\nrecognition (Tjong Kim Sang and De Meulder,\n2003). Melamud et al. (2016) proposed learning\ncontextual representations through a task to pre-\ndict a single word from both left and right context\nusing LSTMs.",
"node_5": "Melamud et al. (2016) proposed learning\ncontextual representations through a task to pre-\ndict a single word from both left and right context\nusing LSTMs. Similar to ELMo, their model is\nfeature-based and not deeply bidirectional. Fedus\net al. (2018) shows that the cloze task can be used\nto improve the robustness of text generation mod-\nels.\n2.2 Unsupervised Fine-tuning Approaches\nAs with the feature-based approaches, the \ufb01rst\nworks in this direction only pre-trained word em-\nbedding parameters from unlabeled text (Col-\nlobert and Weston, 2008).\nMore recently, sentence or document encoders\nwhich produce contextual token representations\nhave been pre-trained from unlabeled text and\n\ufb01ne-tuned for a supervised downstream task (Dai\nand Le, 2015; Howard and Ruder, 2018; Radford\net al., 2018). The advantage of these approaches\nis that few parameters need to be learned from\nscratch. At least partly due to this advantage,\nOpenAI GPT (Radford et al., 2018) achieved pre-\nviously state-of-the-art results on many sentence-\nlevel tasks from the GLUE benchmark (Wang\net al., 2018a). Left-to-right language model-",
"node_6": "4173\nBERT BERT\nE[CLS] E1 E[SEP]... EN E1\u2019 ... EM\u2019\nC\n T1\n T[SEP]...\n TN\n T1\u2019 ...\n TM\u2019\n[CLS] Tok 1 [SEP]... Tok N Tok 1 ... TokM\nQuestion Paragraph\nStart/End Span\nBERT\nE[CLS] E1 E[SEP]... EN E1\u2019 ... EM\u2019\nC\n T1\n T[SEP]...\n TN\n T1\u2019 ...\n TM\u2019\n[CLS] Tok 1 [SEP]... Tok N Tok 1 ... TokM\nMasked Sentence A Masked Sentence B\nPre-training Fine-Tuning\nNSP Mask LM Mask LM\nUnlabeled Sentence A and B Pair \nSQuAD\nQuestion Answer Pair\nNERMNLI\nFigure 1: Overall pre-training and \ufb01ne-tuning procedures for BERT. Apart from output layers, the same architec-\ntures are used in both pre-training and \ufb01ne-tuning. The same pre-trained model parameters are used to initialize\nmodels for different down-stream tasks. During \ufb01ne-tuning, all parameters are \ufb01ne-tuned. [CLS] is a special\nsymbol added in front of every input example, and [SEP] is a special separator token (e.g. separating ques-\ntions/answers).\ning and auto-encoder objectives have been used\nfor pre-training such models (Howard and Ruder,\n2018; Radford et al., 2018; Dai and Le, 2015).\n2.3 Transfer Learning from Supervised Data\nThere has also been work showing effective trans-\nfer from supervised tasks with large datasets, such\nas natural language inference (Conneau et al.,\n2017) and machine translation (McCann et al.,\n2017). Computer vision research has also demon-\nstrated the importance of transfer learning from\nlarge pre-trained models, where an effective recipe\nis to \ufb01ne-tune models pre-trained with Ima-\ngeNet (Deng et al., 2009; Yosinski et al., 2014).\n3 BERT\nWe introduce BERT and its detailed implementa-\ntion in this section.",
"node_7": "Computer vision research has also demon-\nstrated the importance of transfer learning from\nlarge pre-trained models, where an effective recipe\nis to \ufb01ne-tune models pre-trained with Ima-\ngeNet (Deng et al., 2009; Yosinski et al., 2014).\n3 BERT\nWe introduce BERT and its detailed implementa-\ntion in this section. There are two steps in our\nframework: pre-training and \ufb01ne-tuning. Dur-\ning pre-training, the model is trained on unlabeled\ndata over different pre-training tasks. For \ufb01ne-\ntuning, the BERT model is \ufb01rst initialized with\nthe pre-trained parameters, and all of the param-\neters are \ufb01ne-tuned using labeled data from the\ndownstream tasks. Each downstream task has sep-\narate \ufb01ne-tuned models, even though they are ini-\ntialized with the same pre-trained parameters. The\nquestion-answering example in Figure 1 will serve\nas a running example for this section.\nA distinctive feature of BERT is its uni\ufb01ed ar-\nchitecture across different tasks. There is mini-\nmal difference between the pre-trained architec-\nture and the \ufb01nal downstream architecture.\nModel Architecture BERT\u2019s model architec-\nture is a multi-layer bidirectional Transformer en-\ncoder based on the original implementation de-\nscribed in Vaswani et al. (2017) and released in\nthe tensor2tensor library.1 Because the use\nof Transformers has become common and our im-\nplementation is almost identical to the original,\nwe will omit an exhaustive background descrip-\ntion of the model architecture and refer readers to\nVaswani et al.",
"node_8": "(2017) and released in\nthe tensor2tensor library.1 Because the use\nof Transformers has become common and our im-\nplementation is almost identical to the original,\nwe will omit an exhaustive background descrip-\ntion of the model architecture and refer readers to\nVaswani et al. (2017) as well as excellent guides\nsuch as \u201cThe Annotated Transformer.\u201d2\nIn this work, we denote the number of layers\n(i.e., Transformer blocks) as L, the hidden size as\nH, and the number of self-attention heads as A.3\nWe primarily report results on two model sizes:\nBERTBASE (L=12, H=768, A=12, Total Param-\neters=110M) and BERTLARGE (L=24, H=1024,\nA=16, Total Parameters=340M).\nBERTBASE was chosen to have the same model\nsize as OpenAI GPT for comparison purposes.\nCritically, however, the BERT Transformer uses\nbidirectional self-attention, while the GPT Trans-\nformer uses constrained self-attention where every\ntoken can only attend to context to its left.4\n1https://github.com/tensor\ufb02ow/tensor2tensor\n2http://nlp.seas.harvard.edu/2018/04/03/attention.html\n3In all cases we set the feed-forward/\ufb01lter size to be 4H,\ni.e., 3072 for the H = 768and 4096 for the H = 1024.\n4We note that in the literature the bidirectional Trans-",
"node_9": "4174\nInput/Output Representations To make BERT\nhandle a variety of down-stream tasks, our input\nrepresentation is able to unambiguously represent\nboth a single sentence and a pair of sentences\n(e.g., \u27e8Question, Answer \u27e9) in one token sequence.\nThroughout this work, a \u201csentence\u201d can be an arbi-\ntrary span of contiguous text, rather than an actual\nlinguistic sentence. A \u201csequence\u201d refers to the in-\nput token sequence to BERT, which may be a sin-\ngle sentence or two sentences packed together.\nWe use WordPiece embeddings (Wu et al.,\n2016) with a 30,000 token vocabulary. The \ufb01rst\ntoken of every sequence is always a special clas-\nsi\ufb01cation token ( [CLS]). The \ufb01nal hidden state\ncorresponding to this token is used as the ag-\ngregate sequence representation for classi\ufb01cation\ntasks. Sentence pairs are packed together into a\nsingle sequence. We differentiate the sentences in\ntwo ways. First, we separate them with a special\ntoken ([SEP]). Second, we add a learned embed-\nding to every token indicating whether it belongs\nto sentence A or sentence B. As shown in Figure 1,\nwe denote input embedding as E, the \ufb01nal hidden\nvector of the special [CLS] token as C \u2208RH,\nand the \ufb01nal hidden vector for the ith input token\nas Ti \u2208RH.\nFor a given token, its input representation is\nconstructed by summing the corresponding token,\nsegment, and position embeddings. A visualiza-\ntion of this construction can be seen in Figure 2.\n3.1 Pre-training BERT\nUnlike Peters et al. (2018a) and Radford et al.\n(2018), we do not use traditional left-to-right or\nright-to-left language models to pre-train BERT.\nInstead, we pre-train BERT using two unsuper-\nvised tasks, described in this section. This step\nis presented in the left part of Figure 1.",
"node_10": "3.1 Pre-training BERT\nUnlike Peters et al. (2018a) and Radford et al.\n(2018), we do not use traditional left-to-right or\nright-to-left language models to pre-train BERT.\nInstead, we pre-train BERT using two unsuper-\nvised tasks, described in this section. This step\nis presented in the left part of Figure 1.\nTask #1: Masked LM Intuitively, it is reason-\nable to believe that a deep bidirectional model is\nstrictly more powerful than either a left-to-right\nmodel or the shallow concatenation of a left-to-\nright and a right-to-left model. Unfortunately,\nstandard conditional language models can only be\ntrained left-to-right or right-to-left, since bidirec-\ntional conditioning would allow each word to in-\ndirectly \u201csee itself\u201d, and the model could trivially\npredict the target word in a multi-layered context.\nformer is often referred to as a \u201cTransformer encoder\u201d while\nthe left-context-only version is referred to as a \u201cTransformer\ndecoder\u201d since it can be used for text generation.\nIn order to train a deep bidirectional representa-\ntion, we simply mask some percentage of the input\ntokens at random, and then predict those masked\ntokens. We refer to this procedure as a \u201cmasked\nLM\u201d (MLM), although it is often referred to as a\nCloze task in the literature (Taylor, 1953). In this\ncase, the \ufb01nal hidden vectors corresponding to the\nmask tokens are fed into an output softmax over\nthe vocabulary, as in a standard LM. In all of our\nexperiments, we mask 15% of all WordPiece to-\nkens in each sequence at random. In contrast to\ndenoising auto-encoders (Vincent et al., 2008), we\nonly predict the masked words rather than recon-\nstructing the entire input.\nAlthough this allows us to obtain a bidirec-\ntional pre-trained model, a downside is that we\nare creating a mismatch between pre-training and\n\ufb01ne-tuning, since the [MASK] token does not ap-\npear during \ufb01ne-tuning.",
"node_11": "In contrast to\ndenoising auto-encoders (Vincent et al., 2008), we\nonly predict the masked words rather than recon-\nstructing the entire input.\nAlthough this allows us to obtain a bidirec-\ntional pre-trained model, a downside is that we\nare creating a mismatch between pre-training and\n\ufb01ne-tuning, since the [MASK] token does not ap-\npear during \ufb01ne-tuning. To mitigate this, we do\nnot always replace \u201cmasked\u201d words with the ac-\ntual [MASK] token. The training data generator\nchooses 15% of the token positions at random for\nprediction. If the i-th token is chosen, we replace\nthe i-th token with (1) the [MASK] token 80% of\nthe time (2) a random token 10% of the time (3)\nthe unchanged i-th token 10% of the time. Then,\nTi will be used to predict the original token with\ncross entropy loss. We compare variations of this\nprocedure in Appendix C.2.\nTask #2: Next Sentence Prediction (NSP)\nMany important downstream tasks such as Ques-\ntion Answering (QA) and Natural Language Infer-\nence (NLI) are based on understanding the rela-\ntionship between two sentences, which is not di-\nrectly captured by language modeling. In order\nto train a model that understands sentence rela-\ntionships, we pre-train for a binarized next sen-\ntence prediction task that can be trivially gener-\nated from any monolingual corpus. Speci\ufb01cally,\nwhen choosing the sentencesA and B for each pre-\ntraining example, 50% of the time B is the actual\nnext sentence that follows A (labeled as IsNext),\nand 50% of the time it is a random sentence from\nthe corpus (labeled as NotNext). As we show\nin Figure 1, C is used for next sentence predic-\ntion (NSP). 5 Despite its simplicity, we demon-\nstrate in Section 5.1 that pre-training towards this\ntask is very bene\ufb01cial to both QA and NLI.",
"node_12": "As we show\nin Figure 1, C is used for next sentence predic-\ntion (NSP). 5 Despite its simplicity, we demon-\nstrate in Section 5.1 that pre-training towards this\ntask is very bene\ufb01cial to both QA and NLI. 6\n5The \ufb01nal model achieves 97%-98% accuracy on NSP.\n6The vector C is not a meaningful sentence representation\nwithout \ufb01ne-tuning, since it was trained with NSP.",
"node_13": "4175\n[CLS] he likes play ##ing [SEP]my dog is cute [SEP]Input\nE[CLS] Ehe Elikes Eplay E##ing E[SEP]Emy Edog Eis Ecute E[SEP]\nToken\nEmbeddings\nEA EB EB EB EB EBEA EA EA EA EASegment\nEmbeddings\nE0 E6 E7 E8 E9 E10E1 E2 E3 E4 E5Position\nEmbeddings\nFigure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmenta-\ntion embeddings and the position embeddings.\nThe NSP task is closely related to representation-\nlearning objectives used in Jernite et al. (2017) and\nLogeswaran and Lee (2018). However, in prior\nwork, only sentence embeddings are transferred to\ndown-stream tasks, where BERT transfers all pa-\nrameters to initialize end-task model parameters.\nPre-training data The pre-training procedure\nlargely follows the existing literature on language\nmodel pre-training. For the pre-training corpus we\nuse the BooksCorpus (800M words) (Zhu et al.,\n2015) and English Wikipedia (2,500M words).\nFor Wikipedia we extract only the text passages\nand ignore lists, tables, and headers. It is criti-\ncal to use a document-level corpus rather than a\nshuf\ufb02ed sentence-level corpus such as the Billion\nWord Benchmark (Chelba et al., 2013) in order to\nextract long contiguous sequences.\n3.2 Fine-tuning BERT\nFine-tuning is straightforward since the self-\nattention mechanism in the Transformer al-\nlows BERT to model many downstream tasks\u2014\nwhether they involve single text or text pairs\u2014by\nswapping out the appropriate inputs and outputs.\nFor applications involving text pairs, a common\npattern is to independently encode text pairs be-\nfore applying bidirectional cross attention, such\nas Parikh et al. (2016); Seo et al. (2017). BERT\ninstead uses the self-attention mechanism to unify\nthese two stages, as encoding a concatenated text\npair with self-attention effectively includes bidi-\nrectional cross attention between two sentences.",
"node_14": "For applications involving text pairs, a common\npattern is to independently encode text pairs be-\nfore applying bidirectional cross attention, such\nas Parikh et al. (2016); Seo et al. (2017). BERT\ninstead uses the self-attention mechanism to unify\nthese two stages, as encoding a concatenated text\npair with self-attention effectively includes bidi-\nrectional cross attention between two sentences.\nFor each task, we simply plug in the task-\nspeci\ufb01c inputs and outputs into BERT and \ufb01ne-\ntune all the parameters end-to-end. At the in-\nput, sentence A and sentence B from pre-training\nare analogous to (1) sentence pairs in paraphras-\ning, (2) hypothesis-premise pairs in entailment, (3)\nquestion-passage pairs in question answering, and\n(4) a degenerate text- \u2205 pair in text classi\ufb01cation\nor sequence tagging. At the output, the token rep-\nresentations are fed into an output layer for token-\nlevel tasks, such as sequence tagging or question\nanswering, and the [CLS] representation is fed\ninto an output layer for classi\ufb01cation, such as en-\ntailment or sentiment analysis.\nCompared to pre-training, \ufb01ne-tuning is rela-\ntively inexpensive. All of the results in the pa-\nper can be replicated in at most 1 hour on a sin-\ngle Cloud TPU, or a few hours on a GPU, starting\nfrom the exact same pre-trained model. 7 We de-\nscribe the task-speci\ufb01c details in the correspond-\ning subsections of Section 4. More details can be\nfound in Appendix A.5.\n4 Experiments\nIn this section, we present BERT \ufb01ne-tuning re-\nsults on 11 NLP tasks.\n4.1 GLUE\nThe General Language Understanding Evaluation\n(GLUE) benchmark (Wang et al., 2018a) is a col-\nlection of diverse natural language understanding\ntasks. Detailed descriptions of GLUE datasets are\nincluded in Appendix B.1.",
"node_15": "More details can be\nfound in Appendix A.5.\n4 Experiments\nIn this section, we present BERT \ufb01ne-tuning re-\nsults on 11 NLP tasks.\n4.1 GLUE\nThe General Language Understanding Evaluation\n(GLUE) benchmark (Wang et al., 2018a) is a col-\nlection of diverse natural language understanding\ntasks. Detailed descriptions of GLUE datasets are\nincluded in Appendix B.1.\nTo \ufb01ne-tune on GLUE, we represent the input\nsequence (for single sentence or sentence pairs)\nas described in Section 3, and use the \ufb01nal hid-\nden vector C \u2208 RH corresponding to the \ufb01rst\ninput token ([CLS]) as the aggregate representa-\ntion. The only new parameters introduced during\n\ufb01ne-tuning are classi\ufb01cation layer weights W \u2208\nRK\u00d7H, where Kis the number of labels. We com-\npute a standard classi\ufb01cation loss with C and W,\ni.e., log(softmax(CWT )).\n7For example, the BERT SQuAD model can be trained in\naround 30 minutes on a single Cloud TPU to achieve a Dev\nF1 score of 91.0%.\n8See (10) in https://gluebenchmark.com/faq.",
"node_16": "4176\nSystem MNLI-(m/mm) QQP QNLI SST-2 CoLA STS-B MRPC RTE Average\n392k 363k 108k 67k 8.5k 5.7k 3.5k 2.5k -\nPre-OpenAI SOTA 80.6/80.1 66.1 82.3 93.2 35.0 81.0 86.0 61.7 74.0\nBiLSTM+ELMo+Attn 76.4/76.1 64.8 79.8 90.4 36.0 73.3 84.9 56.8 71.0\nOpenAI GPT 82.1/81.4 70.3 87.4 91.3 45.4 80.0 82.3 56.0 75.1\nBERTBASE 84.6/83.4 71.2 90.5 93.5 52.1 85.8 88.9 66.4 79.6\nBERTLARGE 86.7/85.9 72.1 92.7 94.9 60.5 86.5 89.3 70.1 82.1\nTable 1: GLUE Test results, scored by the evaluation server ( https://gluebenchmark.com/leaderboard).\nThe number below each task denotes the number of training examples. The \u201cAverage\u201d column is slightly different\nthan the of\ufb01cial GLUE score, since we exclude the problematic WNLI set. 8 BERT and OpenAI GPT are single-\nmodel, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and\naccuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.",
"node_17": "The \u201cAverage\u201d column is slightly different\nthan the of\ufb01cial GLUE score, since we exclude the problematic WNLI set. 8 BERT and OpenAI GPT are single-\nmodel, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and\naccuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.\nWe use a batch size of 32 and \ufb01ne-tune for 3\nepochs over the data for all GLUE tasks. For each\ntask, we selected the best \ufb01ne-tuning learning rate\n(among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set.\nAdditionally, for BERTLARGE we found that \ufb01ne-\ntuning was sometimes unstable on small datasets,\nso we ran several random restarts and selected the\nbest model on the Dev set. With random restarts,\nwe use the same pre-trained checkpoint but per-\nform different \ufb01ne-tuning data shuf\ufb02ing and clas-\nsi\ufb01er layer initialization.9\nResults are presented in Table 1. Both\nBERTBASE and BERTLARGE outperform all sys-\ntems on all tasks by a substantial margin, obtaining\n4.5% and 7.0% respective average accuracy im-\nprovement over the prior state of the art. Note that\nBERTBASE and OpenAI GPT are nearly identical\nin terms of model architecture apart from the at-\ntention masking. For the largest and most widely\nreported GLUE task, MNLI, BERT obtains a 4.6%\nabsolute accuracy improvement. On the of\ufb01cial\nGLUE leaderboard10, BERTLARGE obtains a score\nof 80.5, compared to OpenAI GPT, which obtains\n72.8 as of the date of writing.\nWe \ufb01nd that BERT LARGE signi\ufb01cantly outper-\nforms BERTBASE across all tasks, especially those\nwith very little training data. The effect of model\nsize is explored more thoroughly in Section 5.2.",
"node_18": "On the of\ufb01cial\nGLUE leaderboard10, BERTLARGE obtains a score\nof 80.5, compared to OpenAI GPT, which obtains\n72.8 as of the date of writing.\nWe \ufb01nd that BERT LARGE signi\ufb01cantly outper-\nforms BERTBASE across all tasks, especially those\nwith very little training data. The effect of model\nsize is explored more thoroughly in Section 5.2.\n4.2 SQuAD v1.1\nThe Stanford Question Answering Dataset\n(SQuAD v1.1) is a collection of 100k crowd-\nsourced question/answer pairs (Rajpurkar et al.,\n2016). Given a question and a passage from\n9The GLUE data set distribution does not include the Test\nlabels, and we only made a single GLUE evaluation server\nsubmission for each of BERTBASE and BERTLARGE .\n10https://gluebenchmark.com/leaderboard\nWikipedia containing the answer, the task is to\npredict the answer text span in the passage.\nAs shown in Figure 1, in the question answer-\ning task, we represent the input question and pas-\nsage as a single packed sequence, with the ques-\ntion using the A embedding and the passage using\nthe B embedding. We only introduce a start vec-\ntor S \u2208RH and an end vector E \u2208RH during\n\ufb01ne-tuning. The probability of word i being the\nstart of the answer span is computed as a dot prod-\nuct between Ti and S followed by a softmax over\nall of the words in the paragraph: Pi = eS\u00b7Ti\n\u2211\nj eS\u00b7Tj .\nThe analogous formula is used for the end of the\nanswer span. The score of a candidate span from\nposition ito position jis de\ufb01ned as S\u00b7Ti + E\u00b7Tj,\nand the maximum scoring span where j \u2265 i is\nused as a prediction. The training objective is the\nsum of the log-likelihoods of the correct start and\nend positions. We \ufb01ne-tune for 3 epochs with a\nlearning rate of 5e-5 and a batch size of 32.",
"node_19": "The score of a candidate span from\nposition ito position jis de\ufb01ned as S\u00b7Ti + E\u00b7Tj,\nand the maximum scoring span where j \u2265 i is\nused as a prediction. The training objective is the\nsum of the log-likelihoods of the correct start and\nend positions. We \ufb01ne-tune for 3 epochs with a\nlearning rate of 5e-5 and a batch size of 32.\nTable 2 shows top leaderboard entries as well\nas results from top published systems (Seo et al.,\n2017; Clark and Gardner, 2018; Peters et al.,\n2018a; Hu et al., 2018). The top results from the\nSQuAD leaderboard do not have up-to-date public\nsystem descriptions available,11 and are allowed to\nuse any public data when training their systems.\nWe therefore use modest data augmentation in\nour system by \ufb01rst \ufb01ne-tuning on TriviaQA (Joshi\net al., 2017) befor \ufb01ne-tuning on SQuAD.\nOur best performing system outperforms the top\nleaderboard system by +1.5 F1 in ensembling and\n+1.3 F1 as a single system. In fact, our single\nBERT model outperforms the top ensemble sys-\ntem in terms of F1 score. Without TriviaQA \ufb01ne-\n11QANet is described in Yu et al. (2018), but the system\nhas improved substantially after publication.",
"node_20": "4177\nSystem Dev Test\nEM F1 EM F1\nTop Leaderboard Systems (Dec 10th, 2018)\nHuman - - 82.3 91.2\n#1 Ensemble - nlnet - - 86.0 91.7\n#2 Ensemble - QANet - - 84.5 90.5\nPublished\nBiDAF+ELMo (Single) - 85.6 - 85.8\nR.M. Reader (Ensemble) 81.2 87.9 82.3 88.5\nOurs\nBERTBASE (Single) 80.8 88.5 - -\nBERTLARGE (Single) 84.1 90.9 - -\nBERTLARGE (Ensemble) 85.8 91.8 - -\nBERTLARGE (Sgl.+TriviaQA) 84.2 91.1 85.1 91.8\nBERTLARGE (Ens.+TriviaQA) 86.2 92.2 87.4 93.2\nTable 2: SQuAD 1.1 results. The BERT ensemble\nis 7x systems which use different pre-training check-\npoints and \ufb01ne-tuning seeds.\nSystem Dev Test\nEM F1 EM F1\nTop Leaderboard Systems (Dec 10th, 2018)\nHuman 86.3 89.0 86.9 89.5\n#1 Single - MIR-MRC (F-Net) - - 74.8 78.0\n#2 Single - nlnet - - 74.2 77.1\nPublished\nunet (Ensemble) - - 71.4 74.9\nSLQA+ (Single) - 71.4 74.4\nOurs\nBERTLARGE (Single) 78.7 81.9 80.0 83.1\nTable 3: SQuAD 2.0 results. We exclude entries that\nuse BERT as one of their components.",
"node_21": "We exclude entries that\nuse BERT as one of their components.\ntuning data, we only lose 0.1-0.4 F1, still outper-\nforming all existing systems by a wide margin.12\n4.3 SQuAD v2.0\nThe SQuAD 2.0 task extends the SQuAD 1.1\nproblem de\ufb01nition by allowing for the possibility\nthat no short answer exists in the provided para-\ngraph, making the problem more realistic.\nWe use a simple approach to extend the SQuAD\nv1.1 BERT model for this task. We treat ques-\ntions that do not have an answer as having an an-\nswer span with start and end at the [CLS] to-\nken. The probability space for the start and end\nanswer span positions is extended to include the\nposition of the [CLS] token. For prediction, we\ncompare the score of the no-answer span: snull =\nS\u00b7C+ E\u00b7C to the score of the best non-null span\n12The TriviaQA data we used consists of paragraphs from\nTriviaQA-Wiki formed of the \ufb01rst 400 tokens in documents,\nthat contain at least one of the provided possible answers.\nSystem Dev Test\nESIM+GloVe 51.9 52.7\nESIM+ELMo 59.1 59.2\nOpenAI GPT - 78.0\nBERTBASE 81.6 -\nBERTLARGE 86.6 86.3\nHuman (expert)\u2020 - 85.0\nHuman (5 annotations)\u2020 - 88.0\nTable 4: SW AG Dev and Test accuracies.\u2020Human per-\nformance is measured with 100 samples, as reported in\nthe SW AG paper.\n\u02c6si,j = maxj\u2265iS\u00b7Ti + E\u00b7Tj. We predict a non-null\nanswer when \u02c6si,j > snull + \u03c4, where the thresh-\nold \u03c4 is selected on the dev set to maximize F1.\nWe did not use TriviaQA data for this model. We\n\ufb01ne-tuned for 2 epochs with a learning rate of 5e-5\nand a batch size of 48.",
"node_22": "\u02c6si,j = maxj\u2265iS\u00b7Ti + E\u00b7Tj. We predict a non-null\nanswer when \u02c6si,j > snull + \u03c4, where the thresh-\nold \u03c4 is selected on the dev set to maximize F1.\nWe did not use TriviaQA data for this model. We\n\ufb01ne-tuned for 2 epochs with a learning rate of 5e-5\nand a batch size of 48.\nThe results compared to prior leaderboard en-\ntries and top published work (Sun et al., 2018;\nWang et al., 2018b) are shown in Table 3, exclud-\ning systems that use BERT as one of their com-\nponents. We observe a +5.1 F1 improvement over\nthe previous best system.\n4.4 SWAG\nThe Situations With Adversarial Generations\n(SW AG) dataset contains 113k sentence-pair com-\npletion examples that evaluate grounded common-\nsense inference (Zellers et al., 2018). Given a sen-\ntence, the task is to choose the most plausible con-\ntinuation among four choices.\nWhen \ufb01ne-tuning on the SW AG dataset, we\nconstruct four input sequences, each containing\nthe concatenation of the given sentence (sentence\nA) and a possible continuation (sentence B). The\nonly task-speci\ufb01c parameters introduced is a vec-\ntor whose dot product with the [CLS] token rep-\nresentation C denotes a score for each choice\nwhich is normalized with a softmax layer.\nWe \ufb01ne-tune the model for 3 epochs with a\nlearning rate of 2e-5 and a batch size of 16. Re-\nsults are presented in Table 4. BERT LARGE out-\nperforms the authors\u2019 baseline ESIM+ELMo sys-\ntem by +27.1% and OpenAI GPT by 8.3%.\n5 Ablation Studies\nIn this section, we perform ablation experiments\nover a number of facets of BERT in order to better\nunderstand their relative importance. Additional",
"node_23": "4178\nDev Set\nTasks MNLI-m QNLI MRPC SST-2 SQuAD\n(Acc) (Acc) (Acc) (Acc) (F1)\nBERTBASE 84.4 88.4 86.7 92.7 88.5\nNo NSP 83.9 84.9 86.5 92.6 87.9\nLTR & No NSP 82.1 84.3 77.5 92.1 77.8\n+ BiLSTM 82.1 84.1 75.7 91.6 84.9\nTable 5: Ablation over the pre-training tasks using the\nBERTBASE architecture. \u201cNo NSP\u201d is trained without\nthe next sentence prediction task. \u201cLTR & No NSP\u201d is\ntrained as a left-to-right LM without the next sentence\nprediction, like OpenAI GPT. \u201c+ BiLSTM\u201d adds a ran-\ndomly initialized BiLSTM on top of the \u201cLTR + No\nNSP\u201d model during \ufb01ne-tuning.\nablation studies can be found in Appendix C.\n5.1 Effect of Pre-training Tasks\nWe demonstrate the importance of the deep bidi-\nrectionality of BERT by evaluating two pre-\ntraining objectives using exactly the same pre-\ntraining data, \ufb01ne-tuning scheme, and hyperpa-\nrameters as BERTBASE :\nNo NSP: A bidirectional model which is trained\nusing the \u201cmasked LM\u201d (MLM) but without the\n\u201cnext sentence prediction\u201d (NSP) task.\nLTR & No NSP: A left-context-only model which\nis trained using a standard Left-to-Right (LTR)\nLM, rather than an MLM. The left-only constraint\nwas also applied at \ufb01ne-tuning, because removing\nit introduced a pre-train/\ufb01ne-tune mismatch that\ndegraded downstream performance. Additionally,\nthis model was pre-trained without the NSP task.\nThis is directly comparable to OpenAI GPT, but\nusing our larger training dataset, our input repre-\nsentation, and our \ufb01ne-tuning scheme.",
"node_24": "The left-only constraint\nwas also applied at \ufb01ne-tuning, because removing\nit introduced a pre-train/\ufb01ne-tune mismatch that\ndegraded downstream performance. Additionally,\nthis model was pre-trained without the NSP task.\nThis is directly comparable to OpenAI GPT, but\nusing our larger training dataset, our input repre-\nsentation, and our \ufb01ne-tuning scheme.\nWe \ufb01rst examine the impact brought by the NSP\ntask. In Table 5, we show that removing NSP\nhurts performance signi\ufb01cantly on QNLI, MNLI,\nand SQuAD 1.1. Next, we evaluate the impact\nof training bidirectional representations by com-\nparing \u201cNo NSP\u201d to \u201cLTR & No NSP\u201d. The LTR\nmodel performs worse than the MLM model on all\ntasks, with large drops on MRPC and SQuAD.\nFor SQuAD it is intuitively clear that a LTR\nmodel will perform poorly at token predictions,\nsince the token-level hidden states have no right-\nside context. In order to make a good faith at-\ntempt at strengthening the LTR system, we added\na randomly initialized BiLSTM on top. This does\nsigni\ufb01cantly improve results on SQuAD, but the\nresults are still far worse than those of the pre-\ntrained bidirectional models. The BiLSTM hurts\nperformance on the GLUE tasks.\nWe recognize that it would also be possible to\ntrain separate LTR and RTL models and represent\neach token as the concatenation of the two mod-\nels, as ELMo does. However: (a) this is twice as\nexpensive as a single bidirectional model; (b) this\nis non-intuitive for tasks like QA, since the RTL\nmodel would not be able to condition the answer\non the question; (c) this it is strictly less powerful\nthan a deep bidirectional model, since it can use\nboth left and right context at every layer.\n5.2 Effect of Model Size\nIn this section, we explore the effect of model size\non \ufb01ne-tuning task accuracy.",
"node_25": "5.2 Effect of Model Size\nIn this section, we explore the effect of model size\non \ufb01ne-tuning task accuracy. We trained a number\nof BERT models with a differing number of layers,\nhidden units, and attention heads, while otherwise\nusing the same hyperparameters and training pro-\ncedure as described previously.\nResults on selected GLUE tasks are shown in\nTable 6. In this table, we report the average Dev\nSet accuracy from 5 random restarts of \ufb01ne-tuning.\nWe can see that larger models lead to a strict ac-\ncuracy improvement across all four datasets, even\nfor MRPC which only has 3,600 labeled train-\ning examples, and is substantially different from\nthe pre-training tasks. It is also perhaps surpris-\ning that we are able to achieve such signi\ufb01cant\nimprovements on top of models which are al-\nready quite large relative to the existing literature.\nFor example, the largest Transformer explored in\nVaswani et al. (2017) is (L=6, H=1024, A=16)\nwith 100M parameters for the encoder, and the\nlargest Transformer we have found in the literature\nis (L=64, H=512, A=2) with 235M parameters\n(Al-Rfou et al., 2018). By contrast, BERT BASE\ncontains 110M parameters and BERT LARGE con-\ntains 340M parameters.\nIt has long been known that increasing the\nmodel size will lead to continual improvements\non large-scale tasks such as machine translation\nand language modeling, which is demonstrated\nby the LM perplexity of held-out training data\nshown in Table 6. However, we believe that\nthis is the \ufb01rst work to demonstrate convinc-\ningly that scaling to extreme model sizes also\nleads to large improvements on very small scale\ntasks, provided that the model has been suf\ufb01-\nciently pre-trained. Peters et al. (2018b) presented",
"node_26": "4179\nmixed results on the downstream task impact of\nincreasing the pre-trained bi-LM size from two\nto four layers and Melamud et al. (2016) men-\ntioned in passing that increasing hidden dimen-\nsion size from 200 to 600 helped, but increasing\nfurther to 1,000 did not bring further improve-\nments. Both of these prior works used a feature-\nbased approach \u2014 we hypothesize that when the\nmodel is \ufb01ne-tuned directly on the downstream\ntasks and uses only a very small number of ran-\ndomly initialized additional parameters, the task-\nspeci\ufb01c models can bene\ufb01t from the larger, more\nexpressive pre-trained representations even when\ndownstream task data is very small.\n5.3 Feature-based Approach with BERT\nAll of the BERT results presented so far have used\nthe \ufb01ne-tuning approach, where a simple classi\ufb01-\ncation layer is added to the pre-trained model, and\nall parameters are jointly \ufb01ne-tuned on a down-\nstream task. However, the feature-based approach,\nwhere \ufb01xed features are extracted from the pre-\ntrained model, has certain advantages. First, not\nall tasks can be easily represented by a Trans-\nformer encoder architecture, and therefore require\na task-speci\ufb01c model architecture to be added.\nSecond, there are major computational bene\ufb01ts\nto pre-compute an expensive representation of the\ntraining data once and then run many experiments\nwith cheaper models on top of this representation.\nIn this section, we compare the two approaches\nby applying BERT to the CoNLL-2003 Named\nEntity Recognition (NER) task (Tjong Kim Sang\nand De Meulder, 2003). In the input to BERT, we\nuse a case-preserving WordPiece model, and we\ninclude the maximal document context provided\nby the data.",
"node_27": "In this section, we compare the two approaches\nby applying BERT to the CoNLL-2003 Named\nEntity Recognition (NER) task (Tjong Kim Sang\nand De Meulder, 2003). In the input to BERT, we\nuse a case-preserving WordPiece model, and we\ninclude the maximal document context provided\nby the data. Following standard practice, we for-\nmulate this as a tagging task but do not use a CRF\nHyperparams Dev Set Accuracy\n#L #H #A LM (ppl) MNLI-m MRPC SST-2\n3 768 12 5.84 77.9 79.8 88.4\n6 768 3 5.24 80.6 82.2 90.7\n6 768 12 4.68 81.9 84.8 91.3\n12 768 12 3.99 84.4 86.7 92.9\n12 1024 16 3.54 85.7 86.9 93.3\n24 1024 16 3.23 86.6 87.8 93.7\nTable 6: Ablation over BERT model size. #L = the\nnumber of layers; #H = hidden size; #A = number of at-\ntention heads. \u201cLM (ppl)\u201d is the masked LM perplexity\nof held-out training data.",
"node_28": "#L = the\nnumber of layers; #H = hidden size; #A = number of at-\ntention heads. \u201cLM (ppl)\u201d is the masked LM perplexity\nof held-out training data.\nSystem Dev F1 Test F1\nELMo (Peters et al., 2018a) 95.7 92.2\nCVT (Clark et al., 2018) - 92.6\nCSE (Akbik et al., 2018) - 93.1\nFine-tuning approach\nBERTLARGE 96.6 92.8\nBERTBASE 96.4 92.4\nFeature-based approach (BERTBASE )\nEmbeddings 91.0 -\nSecond-to-Last Hidden 95.6 -\nLast Hidden 94.9 -\nWeighted Sum Last Four Hidden 95.9 -\nConcat Last Four Hidden 96.1 -\nWeighted Sum All 12 Layers 95.5 -\nTable 7: CoNLL-2003 Named Entity Recognition re-\nsults. Hyperparameters were selected using the Dev\nset. The reported Dev and Test scores are averaged over\n5 random restarts using those hyperparameters.\nlayer in the output. We use the representation of\nthe \ufb01rst sub-token as the input to the token-level\nclassi\ufb01er over the NER label set.\nTo ablate the \ufb01ne-tuning approach, we apply the\nfeature-based approach by extracting the activa-\ntions from one or more layers without \ufb01ne-tuning\nany parameters of BERT. These contextual em-\nbeddings are used as input to a randomly initial-\nized two-layer 768-dimensional BiLSTM before\nthe classi\ufb01cation layer.\nResults are presented in Table 7. BERT LARGE\nperforms competitively with state-of-the-art meth-\nods. The best performing method concatenates the\ntoken representations from the top four hidden lay-\ners of the pre-trained Transformer, which is only\n0.3 F1 behind \ufb01ne-tuning the entire model. This\ndemonstrates that BERT is effective for both \ufb01ne-\ntuning and feature-based approaches.",
"node_29": "Results are presented in Table 7. BERT LARGE\nperforms competitively with state-of-the-art meth-\nods. The best performing method concatenates the\ntoken representations from the top four hidden lay-\ners of the pre-trained Transformer, which is only\n0.3 F1 behind \ufb01ne-tuning the entire model. This\ndemonstrates that BERT is effective for both \ufb01ne-\ntuning and feature-based approaches.\n6 Conclusion\nRecent empirical improvements due to transfer\nlearning with language models have demonstrated\nthat rich, unsupervised pre-training is an integral\npart of many language understanding systems. In\nparticular, these results enable even low-resource\ntasks to bene\ufb01t from deep unidirectional architec-\ntures. Our major contribution is further general-\nizing these \ufb01ndings to deep bidirectional architec-\ntures, allowing the same pre-trained model to suc-\ncessfully tackle a broad set of NLP tasks.",
"node_30": "4180\nReferences\nAlan Akbik, Duncan Blythe, and Roland V ollgraf.\n2018. Contextual string embeddings for sequence\nlabeling. In Proceedings of the 27th International\nConference on Computational Linguistics , pages\n1638\u20131649.\nRami Al-Rfou, Dokook Choe, Noah Constant, Mandy\nGuo, and Llion Jones. 2018. Character-level lan-\nguage modeling with deeper self-attention. arXiv\npreprint arXiv:1808.04444.\nRie Kubota Ando and Tong Zhang. 2005. A framework\nfor learning predictive structures from multiple tasks\nand unlabeled data. Journal of Machine Learning\nResearch, 6(Nov):1817\u20131853.\nLuisa Bentivogli, Bernardo Magnini, Ido Dagan,\nHoa Trang Dang, and Danilo Giampiccolo. 2009.\nThe \ufb01fth PASCAL recognizing textual entailment\nchallenge. In TAC. NIST.\nJohn Blitzer, Ryan McDonald, and Fernando Pereira.\n2006. Domain adaptation with structural correspon-\ndence learning. In Proceedings of the 2006 confer-\nence on empirical methods in natural language pro-\ncessing, pages 120\u2013128. Association for Computa-\ntional Linguistics.\nSamuel R. Bowman, Gabor Angeli, Christopher Potts,\nand Christopher D. Manning. 2015. A large anno-\ntated corpus for learning natural language inference.\nIn EMNLP. Association for Computational Linguis-\ntics.\nPeter F Brown, Peter V Desouza, Robert L Mercer,\nVincent J Della Pietra, and Jenifer C Lai. 1992.\nClass-based n-gram models of natural language.\nComputational linguistics, 18(4):467\u2013479.\nDaniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-\nGazpio, and Lucia Specia. 2017.",
"node_31": "Peter F Brown, Peter V Desouza, Robert L Mercer,\nVincent J Della Pietra, and Jenifer C Lai. 1992.\nClass-based n-gram models of natural language.\nComputational linguistics, 18(4):467\u2013479.\nDaniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-\nGazpio, and Lucia Specia. 2017. Semeval-2017\ntask 1: Semantic textual similarity multilingual and\ncrosslingual focused evaluation. In Proceedings\nof the 11th International Workshop on Semantic\nEvaluation (SemEval-2017) , pages 1\u201314, Vancou-\nver, Canada. Association for Computational Lin-\nguistics.\nCiprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge,\nThorsten Brants, Phillipp Koehn, and Tony Robin-\nson. 2013. One billion word benchmark for measur-\ning progress in statistical language modeling. arXiv\npreprint arXiv:1312.3005.\nZ. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018.\nQuora question pairs.\nChristopher Clark and Matt Gardner. 2018. Simple\nand effective multi-paragraph reading comprehen-\nsion. In ACL.\nKevin Clark, Minh-Thang Luong, Christopher D Man-\nning, and Quoc Le. 2018. Semi-supervised se-\nquence modeling with cross-view training. In Pro-\nceedings of the 2018 Conference on Empirical Meth-\nods in Natural Language Processing , pages 1914\u2013\n1925.\nRonan Collobert and Jason Weston. 2008. A uni\ufb01ed\narchitecture for natural language processing: Deep\nneural networks with multitask learning. In Pro-\nceedings of the 25th international conference on\nMachine learning, pages 160\u2013167. ACM.\nAlexis Conneau, Douwe Kiela, Holger Schwenk, Lo \u00a8\u0131c\nBarrault, and Antoine Bordes.",
"node_32": "2008. A uni\ufb01ed\narchitecture for natural language processing: Deep\nneural networks with multitask learning. In Pro-\nceedings of the 25th international conference on\nMachine learning, pages 160\u2013167. ACM.\nAlexis Conneau, Douwe Kiela, Holger Schwenk, Lo \u00a8\u0131c\nBarrault, and Antoine Bordes. 2017. Supervised\nlearning of universal sentence representations from\nnatural language inference data. In Proceedings of\nthe 2017 Conference on Empirical Methods in Nat-\nural Language Processing, pages 670\u2013680, Copen-\nhagen, Denmark. Association for Computational\nLinguistics.\nAndrew M Dai and Quoc V Le. 2015. Semi-supervised\nsequence learning. In Advances in neural informa-\ntion processing systems, pages 3079\u20133087.\nJ. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-\nFei. 2009. ImageNet: A Large-Scale Hierarchical\nImage Database. In CVPR09.\nWilliam B Dolan and Chris Brockett. 2005. Automati-\ncally constructing a corpus of sentential paraphrases.\nIn Proceedings of the Third International Workshop\non Paraphrasing (IWP2005).\nWilliam Fedus, Ian Goodfellow, and Andrew M Dai.\n2018. Maskgan: Better text generation via \ufb01lling in\nthe . arXiv preprint arXiv:1801.07736.\nDan Hendrycks and Kevin Gimpel. 2016. Bridging\nnonlinearities and stochastic regularizers with gaus-\nsian error linear units. CoRR, abs/1606.08415.\nFelix Hill, Kyunghyun Cho, and Anna Korhonen. 2016.\nLearning distributed representations of sentences\nfrom unlabelled data.",
"node_33": "Dan Hendrycks and Kevin Gimpel. 2016. Bridging\nnonlinearities and stochastic regularizers with gaus-\nsian error linear units. CoRR, abs/1606.08415.\nFelix Hill, Kyunghyun Cho, and Anna Korhonen. 2016.\nLearning distributed representations of sentences\nfrom unlabelled data. In Proceedings of the 2016\nConference of the North American Chapter of the\nAssociation for Computational Linguistics: Human\nLanguage Technologies. Association for Computa-\ntional Linguistics.\nJeremy Howard and Sebastian Ruder. 2018. Universal\nlanguage model \ufb01ne-tuning for text classi\ufb01cation. In\nACL. Association for Computational Linguistics.\nMinghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu,\nFuru Wei, and Ming Zhou. 2018. Reinforced\nmnemonic reader for machine reading comprehen-\nsion. In IJCAI.\nYacine Jernite, Samuel R. Bowman, and David Son-\ntag. 2017. Discourse-based objectives for fast un-\nsupervised sentence representation learning. CoRR,\nabs/1705.00557.",
"node_34": "4181\nMandar Joshi, Eunsol Choi, Daniel S Weld, and Luke\nZettlemoyer. 2017. Triviaqa: A large scale distantly\nsupervised challenge dataset for reading comprehen-\nsion. In ACL.\nRyan Kiros, Yukun Zhu, Ruslan R Salakhutdinov,\nRichard Zemel, Raquel Urtasun, Antonio Torralba,\nand Sanja Fidler. 2015. Skip-thought vectors. In\nAdvances in neural information processing systems,\npages 3294\u20133302.\nQuoc Le and Tomas Mikolov. 2014. Distributed rep-\nresentations of sentences and documents. In Inter-\nnational Conference on Machine Learning , pages\n1188\u20131196.\nHector J Levesque, Ernest Davis, and Leora Morgen-\nstern. 2011. The winograd schema challenge. In\nAaai spring symposium: Logical formalizations of\ncommonsense reasoning, volume 46, page 47.\nLajanugen Logeswaran and Honglak Lee. 2018. An\nef\ufb01cient framework for learning sentence represen-\ntations. In International Conference on Learning\nRepresentations.\nBryan McCann, James Bradbury, Caiming Xiong, and\nRichard Socher. 2017. Learned in translation: Con-\ntextualized word vectors. In NIPS.\nOren Melamud, Jacob Goldberger, and Ido Dagan.\n2016. context2vec: Learning generic context em-\nbedding with bidirectional LSTM. In CoNLL.\nTomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-\nrado, and Jeff Dean. 2013. Distributed representa-\ntions of words and phrases and their compositional-\nity. In Advances in Neural Information Processing\nSystems 26 , pages 3111\u20133119. Curran Associates,\nInc.\nAndriy Mnih and Geoffrey E Hinton. 2009. A scal-\nable hierarchical distributed language model.",
"node_35": "2013. Distributed representa-\ntions of words and phrases and their compositional-\nity. In Advances in Neural Information Processing\nSystems 26 , pages 3111\u20133119. Curran Associates,\nInc.\nAndriy Mnih and Geoffrey E Hinton. 2009. A scal-\nable hierarchical distributed language model. In\nD. Koller, D. Schuurmans, Y . Bengio, and L. Bot-\ntou, editors, Advances in Neural Information Pro-\ncessing Systems 21 , pages 1081\u20131088. Curran As-\nsociates, Inc.\nAnkur P Parikh, Oscar T \u00a8ackstr\u00a8om, Dipanjan Das, and\nJakob Uszkoreit. 2016. A decomposable attention\nmodel for natural language inference. In EMNLP.\nJeffrey Pennington, Richard Socher, and Christo-\npher D. Manning. 2014. Glove: Global vectors for\nword representation. In Empirical Methods in Nat-\nural Language Processing (EMNLP) , pages 1532\u2013\n1543.\nMatthew Peters, Waleed Ammar, Chandra Bhagavat-\nula, and Russell Power. 2017. Semi-supervised se-\nquence tagging with bidirectional language models.\nIn ACL.\nMatthew Peters, Mark Neumann, Mohit Iyyer, Matt\nGardner, Christopher Clark, Kenton Lee, and Luke\nZettlemoyer. 2018a. Deep contextualized word rep-\nresentations. In NAACL.\nMatthew Peters, Mark Neumann, Luke Zettlemoyer,\nand Wen-tau Yih. 2018b. Dissecting contextual\nword embeddings: Architecture and representation.\nIn Proceedings of the 2018 Conference on Empiri-\ncal Methods in Natural Language Processing, pages\n1499\u20131509.\nAlec Radford, Karthik Narasimhan, Tim Salimans, and\nIlya Sutskever. 2018. Improving language under-\nstanding with unsupervised learning.",
"node_36": "2018b. Dissecting contextual\nword embeddings: Architecture and representation.\nIn Proceedings of the 2018 Conference on Empiri-\ncal Methods in Natural Language Processing, pages\n1499\u20131509.\nAlec Radford, Karthik Narasimhan, Tim Salimans, and\nIlya Sutskever. 2018. Improving language under-\nstanding with unsupervised learning. Technical re-\nport, OpenAI.\nPranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and\nPercy Liang. 2016. Squad: 100,000+ questions for\nmachine comprehension of text. In Proceedings of\nthe 2016 Conference on Empirical Methods in Nat-\nural Language Processing, pages 2383\u20132392.\nMinjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and\nHannaneh Hajishirzi. 2017. Bidirectional attention\n\ufb02ow for machine comprehension. In ICLR.\nRichard Socher, Alex Perelygin, Jean Wu, Jason\nChuang, Christopher D Manning, Andrew Ng, and\nChristopher Potts. 2013. Recursive deep models\nfor semantic compositionality over a sentiment tree-\nbank. In Proceedings of the 2013 conference on\nempirical methods in natural language processing ,\npages 1631\u20131642.\nFu Sun, Linyang Li, Xipeng Qiu, and Yang Liu.\n2018. U-net: Machine reading comprehension\nwith unanswerable questions. arXiv preprint\narXiv:1810.06638.\nWilson L Taylor. 1953. Cloze procedure: A new\ntool for measuring readability. Journalism Bulletin,\n30(4):415\u2013433.\nErik F Tjong Kim Sang and Fien De Meulder.\n2003. Introduction to the conll-2003 shared task:\nLanguage-independent named entity recognition. In\nCoNLL.\nJoseph Turian, Lev Ratinov, and Yoshua Bengio.",
"node_37": "1953. Cloze procedure: A new\ntool for measuring readability. Journalism Bulletin,\n30(4):415\u2013433.\nErik F Tjong Kim Sang and Fien De Meulder.\n2003. Introduction to the conll-2003 shared task:\nLanguage-independent named entity recognition. In\nCoNLL.\nJoseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.\nWord representations: A simple and general method\nfor semi-supervised learning. In Proceedings of the\n48th Annual Meeting of the Association for Compu-\ntational Linguistics, ACL \u201910, pages 384\u2013394.\nAshish Vaswani, Noam Shazeer, Niki Parmar, Jakob\nUszkoreit, Llion Jones, Aidan N Gomez, Lukasz\nKaiser, and Illia Polosukhin. 2017. Attention is all\nyou need. In Advances in Neural Information Pro-\ncessing Systems, pages 6000\u20136010.\nPascal Vincent, Hugo Larochelle, Yoshua Bengio, and\nPierre-Antoine Manzagol. 2008. Extracting and\ncomposing robust features with denoising autoen-\ncoders. In Proceedings of the 25th international\nconference on Machine learning, pages 1096\u20131103.\nACM.\nAlex Wang, Amanpreet Singh, Julian Michael, Fe-\nlix Hill, Omer Levy, and Samuel Bowman. 2018a.\nGlue: A multi-task benchmark and analysis platform",
"node_38": "4182\nfor natural language understanding. In Proceedings\nof the 2018 EMNLP Workshop BlackboxNLP: An-\nalyzing and Interpreting Neural Networks for NLP ,\npages 353\u2013355.\nWei Wang, Ming Yan, and Chen Wu. 2018b. Multi-\ngranularity hierarchical attention fusion networks\nfor reading comprehension and question answering.\nIn Proceedings of the 56th Annual Meeting of the As-\nsociation for Computational Linguistics (Volume 1:\nLong Papers). Association for Computational Lin-\nguistics.\nAlex Warstadt, Amanpreet Singh, and Samuel R Bow-\nman. 2018. Neural network acceptability judg-\nments. arXiv preprint arXiv:1805.12471.\nAdina Williams, Nikita Nangia, and Samuel R Bow-\nman. 2018. A broad-coverage challenge corpus\nfor sentence understanding through inference. In\nNAACL.\nYonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V\nLe, Mohammad Norouzi, Wolfgang Macherey,\nMaxim Krikun, Yuan Cao, Qin Gao, Klaus\nMacherey, et al. 2016. Google\u2019s neural ma-\nchine translation system: Bridging the gap between\nhuman and machine translation. arXiv preprint\narXiv:1609.08144.\nJason Yosinski, Jeff Clune, Yoshua Bengio, and Hod\nLipson. 2014. How transferable are features in deep\nneural networks? In Advances in neural information\nprocessing systems, pages 3320\u20133328.\nAdams Wei Yu, David Dohan, Minh-Thang Luong, Rui\nZhao, Kai Chen, Mohammad Norouzi, and Quoc V\nLe. 2018. QANet: Combining local convolution\nwith global self-attention for reading comprehen-\nsion. In ICLR.\nRowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin\nChoi. 2018.",
"node_39": "Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui\nZhao, Kai Chen, Mohammad Norouzi, and Quoc V\nLe. 2018. QANet: Combining local convolution\nwith global self-attention for reading comprehen-\nsion. In ICLR.\nRowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin\nChoi. 2018. Swag: A large-scale adversarial dataset\nfor grounded commonsense inference. In Proceed-\nings of the 2018 Conference on Empirical Methods\nin Natural Language Processing (EMNLP).\nYukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhut-\ndinov, Raquel Urtasun, Antonio Torralba, and Sanja\nFidler. 2015. Aligning books and movies: Towards\nstory-like visual explanations by watching movies\nand reading books. In Proceedings of the IEEE\ninternational conference on computer vision , pages\n19\u201327.\nAppendix for \u201cBERT: Pre-training of\nDeep Bidirectional Transformers for\nLanguage Understanding\u201d\nWe organize the appendix into three sections:\n\u2022 Additional implementation details for BERT\nare presented in Appendix A;\n\u2022 Additional details for our experiments are\npresented in Appendix B; and\n\u2022 Additional ablation studies are presented in\nAppendix C.\nWe present additional ablation studies for\nBERT including:\n\u2013 Effect of Number of Training Steps; and\n\u2013 Ablation for Different Masking Proce-\ndures.\nA Additional Details for BERT\nA.1 Illustration of the Pre-training Tasks\nWe provide examples of the pre-training tasks in\nthe following.",
"node_40": "A Additional Details for BERT\nA.1 Illustration of the Pre-training Tasks\nWe provide examples of the pre-training tasks in\nthe following.\nMasked LM and the Masking ProcedureAs-\nsuming the unlabeled sentence is my dog is\nhairy, and during the random masking procedure\nwe chose the 4-th token (which corresponding to\nhairy), our masking procedure can be further il-\nlustrated by\n\u2022 80% of the time: Replace the word with the\n[MASK] token, e.g., my dog is hairy \u2192\nmy dog is [MASK]\n\u2022 10% of the time: Replace the word with a\nrandom word, e.g., my dog is hairy \u2192 my\ndog is apple\n\u2022 10% of the time: Keep the word un-\nchanged, e.g., my dog is hairy \u2192 my dog\nis hairy. The purpose of this is to bias the\nrepresentation towards the actual observed\nword.\nThe advantage of this procedure is that the\nTransformer encoder does not know which words\nit will be asked to predict or which have been re-\nplaced by random words, so it is forced to keep\na distributional contextual representation of ev-\nery input token. Additionally, because random\nreplacement only occurs for 1.5% of all tokens\n(i.e., 10% of 15%), this does not seem to harm\nthe model\u2019s language understanding capability. In\nSection C.2, we evaluate the impact this proce-\ndure.\nCompared to standard langauge model training,\nthe masked LM only make predictions on 15% of\ntokens in each batch, which suggests that more\npre-training steps may be required for the model",
"node_41": "4183\nBERT (Ours)\nTrm Trm Trm\nTrm Trm Trm\n...\n...\nTrm Trm Trm\nTrm Trm Trm\n...\n...\nOpenAI GPT\nLstm\nELMo\nLstm Lstm\nLstm Lstm Lstm\nLstm Lstm Lstm\nLstm Lstm Lstm\n T1 T2 TN...\n...\n...\n...\n...\n E1 E2 EN...\n T1 T2 TN...\n E1 E2 EN...\n T1 T2 TN...\n E1 E2 EN...\nFigure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT\nuses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-\nleft LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly\nconditioned on both left and right context in all layers. In addition to the architecture differences, BERT and\nOpenAI GPT are \ufb01ne-tuning approaches, while ELMo is a feature-based approach.\nto converge. In Section C.1 we demonstrate that\nMLM does converge marginally slower than a left-\nto-right model (which predicts every token), but\nthe empirical improvements of the MLM model\nfar outweigh the increased training cost.\nNext Sentence Prediction The next sentence\nprediction task can be illustrated in the following\nexamples.\nInput = [CLS] the man went to [MASK] store [SEP]\nhe bought a gallon [MASK] milk [SEP]\nLabel = IsNext\nInput = [CLS] the man [MASK] to the store [SEP]\npenguin [MASK] are flight ##less birds [SEP]\nLabel = NotNext\nA.2 Pre-training Procedure\nTo generate each training input sequence, we sam-\nple two spans of text from the corpus, which we\nrefer to as \u201csentences\u201d even though they are typ-\nically much longer than single sentences (but can\nbe shorter also). The \ufb01rst sentence receives the A\nembedding and the second receives the B embed-\nding.",
"node_42": "The \ufb01rst sentence receives the A\nembedding and the second receives the B embed-\nding. 50% of the time B is the actual next sentence\nthat follows A and 50% of the time it is a random\nsentence, which is done for the \u201cnext sentence pre-\ndiction\u201d task. They are sampled such that the com-\nbined length is \u2264512 tokens. The LM masking is\napplied after WordPiece tokenization with a uni-\nform masking rate of 15%, and no special consid-\neration given to partial word pieces.\nWe train with batch size of 256 sequences (256\nsequences * 512 tokens = 128,000 tokens/batch)\nfor 1,000,000 steps, which is approximately 40\nepochs over the 3.3 billion word corpus. We\nuse Adam with learning rate of 1e-4, \u03b21 = 0.9,\n\u03b22 = 0.999, L2 weight decay of 0.01, learning\nrate warmup over the \ufb01rst 10,000 steps, and linear\ndecay of the learning rate. We use a dropout prob-\nability of 0.1 on all layers. We use a gelu acti-\nvation (Hendrycks and Gimpel, 2016) rather than\nthe standard relu, following OpenAI GPT. The\ntraining loss is the sum of the mean masked LM\nlikelihood and the mean next sentence prediction\nlikelihood.\nTraining of BERT BASE was performed on 4\nCloud TPUs in Pod con\ufb01guration (16 TPU chips\ntotal).13 Training of BERTLARGE was performed\non 16 Cloud TPUs (64 TPU chips total). Each pre-\ntraining took 4 days to complete.\nLonger sequences are disproportionately expen-\nsive because attention is quadratic to the sequence\nlength. To speed up pretraing in our experiments,\nwe pre-train the model with sequence length of\n128 for 90% of the steps. Then, we train the rest\n10% of the steps of sequence of 512 to learn the\npositional embeddings.",
"node_43": "Each pre-\ntraining took 4 days to complete.\nLonger sequences are disproportionately expen-\nsive because attention is quadratic to the sequence\nlength. To speed up pretraing in our experiments,\nwe pre-train the model with sequence length of\n128 for 90% of the steps. Then, we train the rest\n10% of the steps of sequence of 512 to learn the\npositional embeddings.\nA.3 Fine-tuning Procedure\nFor \ufb01ne-tuning, most model hyperparameters are\nthe same as in pre-training, with the exception of\nthe batch size, learning rate, and number of train-\ning epochs. The dropout probability was always\nkept at 0.1. The optimal hyperparameter values\nare task-speci\ufb01c, but we found the following range\nof possible values to work well across all tasks:\n\u2022 Batch size: 16, 32\n13https://cloudplatform.googleblog.com/2018/06/Cloud-\nTPU-now-offers-preemptible-pricing-and-global-\navailability.html",
"node_44": "4184\n\u2022 Learning rate (Adam): 5e-5, 3e-5, 2e-5\n\u2022 Number of epochs: 2, 3, 4\nWe also observed that large data sets (e.g.,\n100k+ labeled training examples) were far less\nsensitive to hyperparameter choice than small data\nsets. Fine-tuning is typically very fast, so it is rea-\nsonable to simply run an exhaustive search over\nthe above parameters and choose the model that\nperforms best on the development set.\nA.4 Comparison of BERT, ELMo ,and\nOpenAI GPT\nHere we studies the differences in recent popular\nrepresentation learning models including ELMo,\nOpenAI GPT and BERT. The comparisons be-\ntween the model architectures are shown visually\nin Figure 3. Note that in addition to the architec-\nture differences, BERT and OpenAI GPT are \ufb01ne-\ntuning approaches, while ELMo is a feature-based\napproach.\nThe most comparable existing pre-training\nmethod to BERT is OpenAI GPT, which trains a\nleft-to-right Transformer LM on a large text cor-\npus. In fact, many of the design decisions in BERT\nwere intentionally made to make it as close to\nGPT as possible so that the two methods could be\nminimally compared. The core argument of this\nwork is that the bi-directionality and the two pre-\ntraining tasks presented in Section 3.1 account for\nthe majority of the empirical improvements, but\nwe do note that there are several other differences\nbetween how BERT and GPT were trained:\n\u2022 GPT is trained on the BooksCorpus (800M\nwords); BERT is trained on the BooksCor-\npus (800M words) and Wikipedia (2,500M\nwords).\n\u2022 GPT uses a sentence separator ( [SEP]) and\nclassi\ufb01er token ( [CLS]) which are only in-\ntroduced at \ufb01ne-tuning time; BERT learns\n[SEP], [CLS] and sentence A/B embed-\ndings during pre-training.",
"node_45": "\u2022 GPT uses a sentence separator ( [SEP]) and\nclassi\ufb01er token ( [CLS]) which are only in-\ntroduced at \ufb01ne-tuning time; BERT learns\n[SEP], [CLS] and sentence A/B embed-\ndings during pre-training.\n\u2022 GPT was trained for 1M steps with a batch\nsize of 32,000 words; BERT was trained for\n1M steps with a batch size of 128,000 words.\n\u2022 GPT used the same learning rate of 5e-5 for\nall \ufb01ne-tuning experiments; BERT chooses a\ntask-speci\ufb01c \ufb01ne-tuning learning rate which\nperforms the best on the development set.\nTo isolate the effect of these differences, we per-\nform ablation experiments in Section 5.1 which\ndemonstrate that the majority of the improvements\nare in fact coming from the two pre-training tasks\nand the bidirectionality they enable.\nA.5 Illustrations of Fine-tuning on Different\nTasks\nThe illustration of \ufb01ne-tuning BERT on different\ntasks can be seen in Figure 4. Our task-speci\ufb01c\nmodels are formed by incorporating BERT with\none additional output layer, so a minimal num-\nber of parameters need to be learned from scratch.\nAmong the tasks, (a) and (b) are sequence-level\ntasks while (c) and (d) are token-level tasks. In\nthe \ufb01gure, E represents the input embedding, Ti\nrepresents the contextual representation of tokeni,\n[CLS] is the special symbol for classi\ufb01cation out-\nput, and [SEP] is the special symbol to separate\nnon-consecutive token sequences.\nB Detailed Experimental Setup\nB.1 Detailed Descriptions for the GLUE\nBenchmark Experiments.\nThe GLUE benchmark includes the following\ndatasets, the descriptions of which were originally\nsummarized in Wang et al. (2018a):\nMNLI Multi-Genre Natural Language Inference\nis a large-scale, crowdsourced entailment classi\ufb01-\ncation task (Williams et al., 2018).",
"node_46": "B Detailed Experimental Setup\nB.1 Detailed Descriptions for the GLUE\nBenchmark Experiments.\nThe GLUE benchmark includes the following\ndatasets, the descriptions of which were originally\nsummarized in Wang et al. (2018a):\nMNLI Multi-Genre Natural Language Inference\nis a large-scale, crowdsourced entailment classi\ufb01-\ncation task (Williams et al., 2018). Given a pair of\nsentences, the goal is to predict whether the sec-\nond sentence is an entailment, contradiction, or\nneutral with respect to the \ufb01rst one.\nQQP Quora Question Pairs is a binary classi\ufb01-\ncation task where the goal is to determine if two\nquestions asked on Quora are semantically equiv-\nalent (Chen et al., 2018).\nQNLI Question Natural Language Inference is\na version of the Stanford Question Answering\nDataset (Rajpurkar et al., 2016) which has been\nconverted to a binary classi\ufb01cation task (Wang\net al., 2018a). The positive examples are (ques-\ntion, sentence) pairs which do contain the correct\nanswer, and the negative examples are (question,\nsentence) from the same paragraph which do not\ncontain the answer.\nSST-2 The Stanford Sentiment Treebank is a\nbinary single-sentence classi\ufb01cation task consist-\ning of sentences extracted from movie reviews",
"node_47": "4185\nBERT\nE[CLS] E1 E[SEP]... EN E1\u2019 ... EM\u2019\nC\n T1\n T[SEP]...\n TN\n T1\u2019 ...\n TM\u2019\n[CLS] Tok \n1\n [SEP]... Tok \nN\nTok \n1 ... Tok\nM\nQuestion Paragraph\nBERT\nE[CLS] E1 E2 EN\nC\n T1\n T2\n TN\nSingle Sentence \n...\n...\nBERT\nTok 1 Tok 2 Tok N...[CLS]\nE[CLS] E1 E2 EN\nC\n T1\n T2\n TN\nSingle Sentence \nB-PERO O\n...\n...E[CLS] E1 E[SEP]\nClass \nLabel\n... EN E1\u2019 ... EM\u2019\nC\n T1\n T[SEP]...\n TN\n T1\u2019 ...\n TM\u2019\nStart/End Span\nClass \nLabel\nBERT\nTok 1 Tok 2 Tok N...[CLS] Tok 1[CLS][CLS] Tok \n1\n [SEP]... Tok \nN\nTok \n1 ... Tok\nM\nSentence 1\n...\nSentence 2\nFigure 4: Illustrations of Fine-tuning BERT on Different Tasks.\nwith human annotations of their sentiment (Socher\net al., 2013).\nCoLA The Corpus of Linguistic Acceptability is\na binary single-sentence classi\ufb01cation task, where\nthe goal is to predict whether an English sentence\nis linguistically \u201cacceptable\u201d or not (Warstadt\net al., 2018).\nSTS-B The Semantic Textual Similarity Bench-\nmark is a collection of sentence pairs drawn from\nnews headlines and other sources (Cer et al.,\n2017). They were annotated with a score from 1\nto 5 denoting how similar the two sentences are in\nterms of semantic meaning.\nMRPC Microsoft Research Paraphrase Corpus\nconsists of sentence pairs automatically extracted\nfrom online news sources, with human annotations\nfor whether the sentences in the pair are semanti-\ncally equivalent (Dolan and Brockett, 2005).",
"node_48": "They were annotated with a score from 1\nto 5 denoting how similar the two sentences are in\nterms of semantic meaning.\nMRPC Microsoft Research Paraphrase Corpus\nconsists of sentence pairs automatically extracted\nfrom online news sources, with human annotations\nfor whether the sentences in the pair are semanti-\ncally equivalent (Dolan and Brockett, 2005).\nRTE Recognizing Textual Entailment is a bi-\nnary entailment task similar to MNLI, but with\nmuch less training data (Bentivogli et al., 2009).14\nWNLI Winograd NLI is a small natural lan-\nguage inference dataset (Levesque et al., 2011).\nThe GLUE webpage notes that there are issues\nwith the construction of this dataset, 15 and every\ntrained system that\u2019s been submitted to GLUE has\nperformed worse than the 65.1 baseline accuracy\nof predicting the majority class. We therefore ex-\nclude this set to be fair to OpenAI GPT. For our\nGLUE submission, we always predicted the ma-\njority class.\n14Note that we only report single-task \ufb01ne-tuning results\nin this paper. A multitask \ufb01ne-tuning approach could poten-\ntially push the performance even further. For example, we\ndid observe substantial improvements on RTE from multi-\ntask training with MNLI.\n15https://gluebenchmark.com/faq",
"node_49": "4186\nC Additional Ablation Studies\nC.1 Effect of Number of Training Steps\nFigure 5 presents MNLI Dev accuracy after \ufb01ne-\ntuning from a checkpoint that has been pre-trained\nfor ksteps. This allows us to answer the following\nquestions:\n1. Question: Does BERT really need such\na large amount of pre-training (128,000\nwords/batch * 1,000,000 steps) to achieve\nhigh \ufb01ne-tuning accuracy?\nAnswer: Yes, BERT BASE achieves almost\n1.0% additional accuracy on MNLI when\ntrained on 1M steps compared to 500k steps.\n2. Question: Does MLM pre-training converge\nslower than LTR pre-training, since only 15%\nof words are predicted in each batch rather\nthan every word?\nAnswer: The MLM model does converge\nslightly slower than the LTR model. How-\never, in terms of absolute accuracy the MLM\nmodel begins to outperform the LTR model\nalmost immediately.\nC.2 Ablation for Different Masking\nProcedures\nIn Section 3.1, we mention that BERT uses a\nmixed strategy for masking the target tokens when\npre-training with the masked language model\n(MLM) objective. The following is an ablation\nstudy to evaluate the effect of different masking\nstrategies.\nNote that the purpose of the masking strategies\nis to reduce the mismatch between pre-training\n200 400 600 800 1,000\n76\n78\n80\n82\n84\nPre-training Steps (Thousands)\nMNLI Dev Accuracy\nBERTBASE (Masked LM)\nBERTBASE (Left-to-Right)\nFigure 5: Ablation over number of training steps. This\nshows the MNLI accuracy after \ufb01ne-tuning, starting\nfrom model parameters that have been pre-trained for\nksteps. The x-axis is the value of k.\nand \ufb01ne-tuning, as the [MASK] symbol never ap-\npears during the \ufb01ne-tuning stage. We report the\nDev results for both MNLI and NER.",
"node_50": "This\nshows the MNLI accuracy after \ufb01ne-tuning, starting\nfrom model parameters that have been pre-trained for\nksteps. The x-axis is the value of k.\nand \ufb01ne-tuning, as the [MASK] symbol never ap-\npears during the \ufb01ne-tuning stage. We report the\nDev results for both MNLI and NER. For NER,\nwe report both \ufb01ne-tuning and feature-based ap-\nproaches, as we expect the mismatch will be am-\npli\ufb01ed for the feature-based approach as the model\nwill not have the chance to adjust the representa-\ntions.\nMasking Rates Dev Set Results\nMASK SAME RND MNLI NER\nFine-tune Fine-tune Feature-based\n80% 10% 10% 84.2 95.4 94.9\n100% 0% 0% 84.3 94.9 94.0\n80% 0% 20% 84.1 95.2 94.6\n80% 20% 0% 84.4 95.2 94.7\n0% 20% 80% 83.7 94.8 94.6\n0% 0% 100% 83.6 94.9 94.6\nTable 8: Ablation over different masking strategies.\nThe results are presented in Table 8. In the table,\nMASK means that we replace the target token with\nthe [MASK] symbol for MLM; SAME means that\nwe keep the target token as is; R ND means that\nwe replace the target token with another random\ntoken.\nThe numbers in the left part of the table repre-\nsent the probabilities of the speci\ufb01c strategies used\nduring MLM pre-training (BERT uses 80%, 10%,\n10%). The right part of the paper represents the\nDev set results. For the feature-based approach,\nwe concatenate the last 4 layers of BERT as the\nfeatures, which was shown to be the best approach\nin Section 5.3.",
"node_51": "The numbers in the left part of the table repre-\nsent the probabilities of the speci\ufb01c strategies used\nduring MLM pre-training (BERT uses 80%, 10%,\n10%). The right part of the paper represents the\nDev set results. For the feature-based approach,\nwe concatenate the last 4 layers of BERT as the\nfeatures, which was shown to be the best approach\nin Section 5.3.\nFrom the table it can be seen that \ufb01ne-tuning is\nsurprisingly robust to different masking strategies.\nHowever, as expected, using only the MASK strat-\negy was problematic when applying the feature-\nbased approach to NER. Interestingly, using only\nthe R ND strategy performs much worse than our\nstrategy as well."
},
"relevant_docs": {
"4d499739-f971-4985-888c-f8db6c7b7efa": [
"node_0"
],
"b9c30c77-b575-406e-9a95-6d90263fcd20": [
"node_0"
],
"7fe38245-bb09-4219-ac03-a7d0c1aed7d2": [
"node_1"
],
"cdc70093-b372-4515-ba59-452fdb6b7f8c": [
"node_1"
],
"704a2abb-3313-4d02-8a18-173727858722": [
"node_2"
],
"fd3ed9a5-4f0e-487a-9c83-93489521621a": [
"node_2"
],
"1c195a1f-4ad8-4916-8408-548ee22312dd": [
"node_3"
],
"85ac9ce9-cd7c-4df1-bc4c-31d91e1c51f4": [
"node_3"
],
"cab3c878-1b74-468f-b8d4-32d2f699acae": [
"node_4"
],
"375d961c-6abb-45a8-992e-e80563e6517f": [
"node_5"
],
"d84a25e5-6025-44ef-822a-cb1289c6e3c6": [
"node_5"
],
"34b28839-62f2-40fe-b1a1-4290bb7cdf3b": [
"node_6"
],
"b91115fc-d825-489d-95f6-f0f096ff669b": [
"node_6"
],
"0e2b1a26-8011-4e1f-acf9-4ebad890bcfb": [
"node_7"
],
"a7d6fde0-84fc-46e5-ae45-d1c45b9bc058": [
"node_7"
],
"cf9b1128-3555-47ad-9f8f-98419073baba": [
"node_8"
],
"0486c3d1-760c-4757-9cd9-f82ed37a782f": [
"node_8"
],
"c98cd949-bc46-4534-a5f7-505ac87ef4af": [
"node_9"
],
"c39068c8-21b1-454b-9977-0314cde62cc0": [
"node_9"
],
"1c1f9d1c-88c5-4faa-894d-1b94815913bc": [
"node_10"
],
"d453b614-200e-4caa-87af-f733a1700ea4": [
"node_10"
],
"be8dfb1c-a092-4b69-939e-efa07c8a131a": [
"node_11"
],
"b4eefd84-e476-478e-86c8-5ca9fa4d8580": [
"node_11"
],
"03dd22e3-6aba-4064-8c61-87b48f93c7a8": [
"node_12"
],
"58f553b4-1d17-41ef-9c68-2ad24a33643c": [
"node_12"
],
"09b58f0d-b4b8-421e-8783-ecd7df290b76": [
"node_13"
],
"0b603baf-a569-43c0-a941-8728d57d87b9": [
"node_13"
],
"09a76baa-a56d-4441-913a-7473dcab6694": [
"node_14"
],
"9d9fd296-3c57-446c-9f33-ee5f7d3f2dbd": [
"node_14"
],
"db12a500-48c6-4fab-b853-b439fc31526a": [
"node_15"
],
"92a91e62-aa12-42fd-98b2-195e8923b329": [
"node_15"
],
"d9d41473-a6e0-442c-afd5-332c81fbde94": [
"node_16"
],
"7bbda2c4-f90d-4c63-ab91-2be0c47ab779": [
"node_17"
],
"59d11fa2-049a-4c5b-adc1-bf98b374128e": [
"node_17"
],
"06030106-46d3-4b7f-aa91-0225eb418947": [
"node_18"
],
"332488ff-2eb6-4321-b7af-d314abfaf9b6": [
"node_18"
],
"617e4fd8-ed5d-44bd-afff-c95c9cd7bc55": [
"node_19"
],
"8ce17f92-1bd6-4a89-b473-3b95a73d2b34": [
"node_19"
],
"59cf636f-353e-409f-9040-a6b342b260c1": [
"node_20"
],
"9d049470-1d41-45ea-8a7f-12e293779c5a": [
"node_20"
],
"e3691d79-39b7-4c12-b86d-06fd7d63b0cf": [
"node_21"
],
"625185b4-e0e3-437b-820d-57b929cadede": [
"node_21"
],
"2568a84d-acc6-4720-b720-85411e78ca6d": [
"node_22"
],
"6a3e8dd1-e938-463b-bacc-c34206954508": [
"node_22"
],
"1b516bfb-64a7-4480-bd28-2abce91504bb": [
"node_23"
],
"464de966-1d58-4360-94d9-f43e26d9fbef": [
"node_23"
],
"532888a5-cbeb-420c-8fba-020498caff57": [
"node_24"
],
"93f713f1-cc9a-44a6-9a95-7c76f9e2b28e": [
"node_25"
],
"e18b624a-7c02-4ed6-9bf8-fe88206bf2cf": [
"node_25"
],
"48389c08-1f6b-4db1-8c68-219247371881": [
"node_26"
],
"c26f23e1-0d6d-47b1-acc7-7e84e1938536": [
"node_26"
],
"d26cb4ca-c419-40f0-906d-b0f3c84e82de": [
"node_27"
],
"812d36f5-8203-407d-9502-65dad9de5e68": [
"node_28"
],
"f18d9a22-2c73-4390-b79e-881cc41fa7a6": [
"node_28"
],
"c766cd49-e879-45d3-baaf-5bb53fe34b1b": [
"node_29"
],
"b03d5a41-8c0b-4378-a9be-1ddf18808725": [
"node_30"
],
"2c90fa96-f612-40fb-a3c7-e4e8c78e71eb": [
"node_30"
],
"8f1e7bee-54b2-4665-82e4-138c7ba37edf": [
"node_31"
],
"92cbcbf7-be26-45ea-b0ed-a4d2d32c428c": [
"node_31"
],
"25900e1c-96a1-4630-a54f-bea9c176b487": [
"node_32"
],
"bb189784-5b73-4b68-9ae7-14fa1788c777": [
"node_33"
],
"3dc6eadc-1479-42b6-b6af-f78c5ee818f3": [
"node_34"
],
"ed848d06-7fb1-417a-b916-5d9dce3b00cd": [
"node_34"
],
"fccc33d3-771f-4b88-9d26-27ad7c7f3256": [
"node_35"
],
"325dd4d0-67d4-4aa3-90f9-e83294399989": [
"node_36"
],
"6a710f48-eee2-4ea1-b15f-7d3605024d80": [
"node_36"
],
"08a3a347-a366-4fca-b5ae-ccac7d9a86ca": [
"node_37"
],
"1630ca98-7b4b-4257-b168-145913c6adc0": [
"node_37"
],
"1e93729f-5220-4bd6-88e4-5e7f2ae37157": [
"node_38"
],
"a9be8260-e8cf-405c-a847-1146753474dc": [
"node_38"
],
"44aa0539-cc2f-436c-ab08-d43560ebc4f7": [
"node_39"
],
"bdb1fa9d-ec0d-420f-a9a7-f076e60f1a11": [
"node_39"
],
"334bdd40-a421-4d98-ab68-65aed3839cfb": [
"node_40"
],
"188227f5-2e53-4fba-9da2-1aed25cc5ed6": [
"node_40"
],
"ed89c40f-a8c7-4918-ace7-ec65a20ddf61": [
"node_41"
],
"86a76bcb-0c54-44cd-84d0-2e363a8fabce": [
"node_41"
],
"5df5c489-3042-4ec9-9dc6-b0d115159a0a": [
"node_42"
],
"d0cfcf39-e56c-4e42-88a4-f6096bc0073f": [
"node_42"
],
"d0fc2eab-30f9-4f6e-95ea-53e026010a8d": [
"node_43"
],
"3b5b82c1-d224-4a46-b6e6-77dcebe28050": [
"node_43"
],
"fafadd60-e3da-45b5-8555-5bbbd944766f": [
"node_44"
],
"c7e97357-1156-4107-8d29-836c008dcefd": [
"node_45"
],
"8aacfa2a-fda1-4b03-bc7a-d9fa6b017d34": [
"node_46"
],
"ba7b8965-dc16-40fd-a2fa-4257f72ae2e6": [
"node_47"
],
"f2208672-4f8d-4cef-932a-1ea413df8dd3": [
"node_48"
],
"1532bc5b-a3d0-4c5c-b0dc-dec8621dc9b0": [
"node_49"
],
"9939d620-2bdc-4bb9-b5c0-fd52f08abb28": [
"node_49"
],
"34cb12a7-2088-4a72-bae5-ec4c0d8bf7b7": [
"node_50"
],
"3e83b371-aee3-4220-9d09-1afdce463f9e": [
"node_51"
],
"937f4d22-be7b-4fc7-a137-ae4e08367757": [
"node_51"
]
},
"mode": "text"
}