When using generative methods to build systems that respond to clients’ requests, one of the biggest challenges is the evaluation phase. This stage raises many questions, such as: What does “correct” mean? Can an answer be syntactically and grammatically correct yet still fail to meet the customer’s needs? What about a suggestion that saves the manager time but requires editing? How can the system’s overall performance be assessed? How can we prevent the model from producing hallucinations? What is the best metric to optimize the network?
NLG metrics with an accuracy above 75%. However, metrics alone may not always accurately reflect the quality of the generated text output, leading us to realize that the evaluation phase needed to include human-centered annotations to align automatic metrics with human criteria.
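As a minimal sketch of what aligning automatic metrics with human criteria can look like in practice, the snippet below scores generated answers against references with ROUGE-L and checks how well those scores track human ratings via rank correlation. It assumes Python with the `rouge_score` and `scipy` packages; the example data is hypothetical and not taken from the system described here.

```python
# Sketch: checking an automatic NLG metric against human annotations.
# Assumes the `rouge_score` and `scipy` packages; the data below is hypothetical.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Hypothetical (reference answer, generated answer, human rating 1-5) triples.
samples = [
    ("Your refund was issued on May 3.", "We issued your refund on May 3.", 5),
    ("Your refund was issued on May 3.", "Refunds take 5-7 business days.", 2),
    ("Please restart the router to fix it.", "Try restarting your router.", 4),
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Automatic scores: ROUGE-L F-measure of each generated answer vs. its reference.
metric_scores = [
    scorer.score(reference, generated)["rougeL"].fmeasure
    for reference, generated, _ in samples
]
human_scores = [rating for _, _, rating in samples]

# Rank correlation between metric scores and human judgments:
# a low value signals the metric does not reflect human criteria.
correlation, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation: {correlation:.2f} (p={p_value:.2f})")
```

A low correlation on a sample like this is the signal that the metric alone is not enough and that human-centered annotations are needed to calibrate the evaluation.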