Creating a dialogue system, or chatbot, is a typical task in the field of natural language processing (NLP). Almost all modern approaches to solving it are based on deep learning technologies using neural networks of various architectures. Chatbots are usually considered auxiliary marketing tools that help website or social network users choose products or services. However, the scope of their application is much wider.
This article shares our experience of corporate chatbot development. The chatbot was developed for the HR department of a large tech company from scratch, without using any out-of-the-box solutions. The main purpose of the chatbot was to optimize the HR team’s time and effort and, to some extent, to replace the real HR manager with an AI “person” that could answer frequently asked questions. Below, we describe the basic steps for creating a prototype from which you can build a production-ready system.
Step 1. Gathering requirements
The target audience for the chatbot was the company’s employees. The system was expected to answer frequently asked questions, help newcomers in onboarding, and communicate on various corporate topics.
In machine learning, it is very important to define the business requirements for the system accurately. Usually, writing user stories is enough; for a chatbot, the most efficient way is to write out sample dialogues of different types. These can be one-off answers to single questions or in-depth dialogues that take context into account: the previous phrase, the last 3–4 sentences, or the complete communication history, including previous sessions with the same user.
Fig. 1. Example of contextless dialogue of two pairs of phrases
Thus, having collected a sufficient set of examples of different dialogue types, we can accurately define the requirements for the system to be developed.
Step 2. Selecting metrics
The second important step is discussing the acceptance criteria with the product owner. It is reasonable to assess a dialogue system on two levels: an external business metric and an internal technical metric. The former measures the system quality from the end user’s point of view.
In our project, we prepared a set of 100 “question–answer” pairs and asked the assessors (several dozen employees) to rate the quality of the chatbot’s answers from 0 to 10. Afterward, their assessments were averaged, and the value of the business metric was calculated.
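The averaging described above can be sketched in a few lines of Python. This is only an illustration: the article does not say exactly how the assessments were aggregated, so averaging per question first and then across questions is an assumption, and the function name, the second question, and the ratings are made up.

```python
from statistics import mean

def business_metric(ratings_by_question):
    """Average assessor ratings (0-10) per question, then across questions.

    ratings_by_question: dict mapping a question to the list of scores
    given by the assessors to the chatbot's answer. The two-stage
    averaging is an assumption, not the method confirmed by the article.
    """
    per_question = {q: mean(scores) for q, scores in ratings_by_question.items()}
    overall = mean(list(per_question.values()))
    return overall, per_question

# Hypothetical ratings from three assessors for two questions.
ratings = {
    "When does the company pay wages?": [8, 10, 9],
    "How do I request vacation?": [4, 6, 5],
}
overall, per_question = business_metric(ratings)
print(overall)
```

Keeping the per-question averages alongside the overall value makes it easier to see which questions pull the metric down.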
However, this approach is rather time-consuming. Therefore, it is convenient to use an additional internal technical metric that is calculated automatically. There are several popular technical metrics for evaluating the performance of NLP systems (Perplexity, BLEU, F1, Jaccard score, etc.).
In our project, we used the Jaccard score. To calculate this metric, you compare the system’s actual response to a user’s utterance with the ideal response specified in the dataset. The Jaccard score is the number of matching words in the two answers divided by the sum of three counts: the matching words, the words that appear only in the system’s response, and the words that appear only in the ideal answer. Figure 2 shows the calculation of the Jaccard score using the example of a one-phrase system response to a user question.
Fig. 2. Example calculation of Jaccard score metric comparing system’s answer and reference answer to question “When does the company pay wages?”
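As a minimal sketch, the Jaccard score described above can be computed as follows. The tokenization (lowercasing, splitting on word characters, treating each answer as a set of unique words) and the convention for two empty answers are assumptions, and the example sentences are made up rather than taken from Figure 2.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of unique words (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard_score(system_answer, reference_answer):
    """Matching words divided by all distinct words found in either answer."""
    system_words = tokenize(system_answer)
    reference_words = tokenize(reference_answer)
    all_words = system_words | reference_words
    if not all_words:
        # Convention chosen for this sketch: two empty answers count as identical.
        return 1.0
    matching = system_words & reference_words
    return len(matching) / len(all_words)

# Hypothetical system response and reference answer.
score = jaccard_score(
    "Wages are paid on the 10th",
    "Wages are paid on the 10th and 25th",
)
print(score)  # 6 matching words out of 8 distinct words -> 0.75
```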
Internal metrics help track the task progress and choose the solutions that give the largest increase in quality. The best selected solutions can then be submitted to assessors to calculate the final business metric and plan for further improvements to achieve satisfactory acceptance criteria.
Step 3. Preparing training sets
The next step is to prepare a dataset to train the system. For chatbot training, a set of question–answer pairs is required. The required set size depends heavily on the chosen neural network architecture and can range from thousands to millions of examples. As a rule, the larger the training set, the higher the quality of the system as a whole. For example, one of the well-known sets, the open SQuAD dataset, contains more than 100,000 question–answer pairs.
In addition to the general set, it is necessary to prepare a special dataset related to the chatbot’s purpose. The size of this additional set is determined by the resources available.
In our project, the main set was an open dataset consisting of 2 million question–answer pairs on general topics. Additionally, we prepared our own dataset based on corporate information, with about 1,000 question–answer pairs on a narrow topic. These pairs were compiled by three experts independently of each other. This allowed us to obtain a more stable metric, since each question had three reference answers.
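One way to use several reference answers per question is to score the system’s response against each reference and aggregate the results. The sketch below is only an illustration: the article does not state how the three references were combined, so the default of averaging (with the maximum as an alternative) is an assumption, as are the function names and the example answers.

```python
from statistics import mean

def word_set_overlap(answer_a, answer_b):
    """Simple word-set overlap (Jaccard-style) used as the per-reference metric."""
    a = set(answer_a.lower().split())
    b = set(answer_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def multi_reference_score(system_answer, references,
                          metric=word_set_overlap, aggregate=mean):
    """Score one answer against several references and aggregate the scores.

    aggregate=mean smooths out differences between experts;
    aggregate=max rewards matching any single expert.
    """
    return aggregate([metric(system_answer, ref) for ref in references])

# Hypothetical references written by three experts for one question.
references = ["on the 10th", "the 10th of each month", "10th"]
print(multi_reference_score("on the 10th", references))
print(multi_reference_score("on the 10th", references, aggregate=max))
```

With a single reference, a correct answer phrased differently from that one expert scores poorly; with several references, the score depends less on any one expert’s wording.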
Thus, this step resulted in two training sets: a large dataset of question–answer pairs on general topics and a small specialized dataset on the specific chatbot topic.
Step 4. Choosing an architecture
Historically, there have been various approaches to chatbot development. A significant breakthrough in this area came with the second wave of neural network adoption in 2016. At that time, recurrent neural networks (GRU, LSTM) were commonly used for NLP tasks. Since 2018, however, the focus has shifted toward non-recurrent architectures based on the “attention” mechanism. In particular, one of the main approaches at present is the Transformer architecture. Figure 3 shows the state-of-the-art models of the past two years using the example of the above-mentioned SQuAD task (finding the answer to a question in a text).
Fig. 3. Progress of neural networks of various architectures on the SQuAD problem since 2018. The vertical axis is the EM (exact match) quality metric: the percentage of the neural network’s responses that exactly match one of the experts’ reference responses.
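For readers unfamiliar with the EM metric, here is a simplified sketch of how it can be computed. The normalization below (lowercasing, removing punctuation, collapsing whitespace) is an assumption made for this illustration and is simpler than the normalization used by the official SQuAD evaluation script; the example data is made up.

```python
import string

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (simplified normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, references):
    """True if the prediction equals at least one reference after normalization."""
    return any(normalize(prediction) == normalize(ref) for ref in references)

def em_score(predictions, references_per_question):
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(
        exact_match(pred, refs)
        for pred, refs in zip(predictions, references_per_question)
    )
    return 100.0 * hits / len(predictions)

# Hypothetical predictions with their sets of reference answers.
predictions = ["Paris.", "in 1999"]
references_per_question = [["paris", "Paris, France"], ["1999"]]
print(em_score(predictions, references_per_question))  # one of two matches -> 50.0
```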
As you can see from the graph, at the beginning of 2019, the quality of the systems exceeded the human level achieved by the assessors for the same tasks.
Consequently, the recommended approach for creating chatbots today is to use the Transformer architecture. For our project, we chose pretrained models based on the BERT architecture.
Step 5. Training
We are now ready to train the selected model. Modern NLP models are highly complex and have many layers, so training them requires significant resources. For example, it takes about 100 GPU-days to train the BERT-Large model on a set of 800 million words, which is too slow and/or expensive for most commercial projects. That is why, in real-life scenarios, developers start from pretrained NLP models, which are freely available for many languages. However, such pretrained models are not off-the-shelf solutions: they have to be adapted to the selected task, a process known as fine-tuning. This means that new output layers are added to the pretrained model and additional training is performed on the sets of question–answer pairs discussed in Step 3.
Quality monitoring is essential during the additional training, so we advise using cross-validation and automatic internal metrics. After acceptable results on internal metrics have been achieved, it is advisable to send the resulting system to assessors for a review to re-calculate business metrics and verify that acceptance criteria have been reached.
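As an illustration of the cross-validation mentioned above, the following sketch splits a set of question–answer pairs into k folds, so that each fold serves once as the validation set while the remaining pairs are used for training. The number of folds, the shuffling seed, and the function name are assumptions; the article does not describe how cross-validation was set up in the project.

```python
import random

def k_fold_splits(examples, k=5, seed=0):
    """Yield (train, validation) lists for k-fold cross-validation.

    examples: list of question-answer pairs.
    k and seed are illustrative defaults, not values from the project.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]   # k folds of near-equal size
    for held_out, val_idx in enumerate(folds):
        train_idx = [
            j
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for j in fold
        ]
        yield [examples[j] for j in train_idx], [examples[j] for j in val_idx]

# Hypothetical dataset of ten question-answer pairs.
pairs = [("question %d" % i, "answer %d" % i) for i in range(10)]
for train, validation in k_fold_splits(pairs, k=5):
    # Fine-tune on `train`, then compute the internal metric on `validation`.
    print(len(train), len(validation))
```

Averaging the internal metric over all folds gives a more reliable estimate than a single train/validation split, which matters when the specialized dataset is small.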
Finally, we get a trained model that can be built into the chatbot infrastructure, and beta testing of the final system can begin in focus groups before the production release.
Conclusion
Here, we have described only the key stages of building a chatbot system: clarifying the task, defining quality criteria, collecting data, searching for relevant models, and training the models. This list is by no means exhaustive. Your chatbot project may face various issues associated, for instance, with finding pretrained models for the required languages, setting up equipment for training, or working with assessors. In addition, each chatbot has its own specifics related to the task at hand and the expected target audience. In any case, the best guideline is feedback from end users.