Author | Cheng Qian
Editor | Desert Shadow
Voice interaction is up to new tricks?!
Zhi Dongxi reported on April 2 that on Monday, Baidu released the industry's first end-to-end speech language model based on a new Cross-Attention architecture, and it is already available online for free.
The newly upgraded Wen Xiaoyan voice interaction can be summed up in a few keywords: ultra-realistic, ultra-low latency, and ultra-low cost.
Ultra-realistic means that Wen Xiaoyan, powered by the speech language model, can understand characteristic dialects from Chongqing, Guangxi, Henan, Guangdong, Shandong and elsewhere, and can hold emotionally expressive conversations. Ultra-low latency means the user's waiting time drops from the 3-5 seconds common in the industry to about 1 second, almost the same as talking to a real person. Ultra-low cost means that in voice question-and-answer scenarios over the telephone voice channel, call costs are roughly 50%-90% lower than the industry average.
Jia Lei, chief architect of Baidu Voice, revealed that the model can be deployed on L20 cards; while meeting the latency requirements of voice interaction, a pair of L20 cards can support several hundred concurrent sessions. Training the speech language model is also straightforward at present: starting from the Wenxin large model, the optimization can be completed with a few hundred cards in about a week, and the optimization work itself is not complicated.
Compared with existing applications of large models in voice interaction, what makes this speech language model unique? How does it cut call costs by up to 90%? How should the innovations behind it be understood? Zhi Dongxi spoke in depth with Jia Lei, chief architect of Baidu Voice, to look for answers to these questions.
First, a dialogue experience close to a real person: voice interaction with the upgraded Wen Xiaoyan is smoother
Large models in voice interaction scenarios are evolving toward a more natural, lower-latency, and more lifelike experience. A glimpse of this more anthropomorphic interaction can already be seen in the newly upgraded Wen Xiaoyan, which is powered by the end-to-end speech language model and serves as both an emotional companion and an all-round assistant.
First of all, Wen Xiaoyan integrates 38 vertical assistants covering information queries such as weather, calendar, unit conversion, and stock prices. In these specialized scenarios, voice interaction is far more efficient than text interaction.
Secondly, Wen Xiaoyan can handle both time-sensitive and non-time-sensitive questions. For time-sensitive content such as encyclopedia lookups and current-affairs Q&A, it can search in real time and follow instructions accurately to reduce hallucinations; non-time-sensitive questions such as common-sense Q&A pose no problem either.
Finally, and this is the biggest difference between voice and text interaction, Wen Xiaoyan can communicate with users emotionally and naturally, respond to feedback quickly, and deliver a realistic, human-like interaction.
Without further ado, let's look at how Wen Xiaoyan actually performs.
A major difficulty in speech recognition is dialect recognition. Dialect pronunciation features are rich and varied: the same dialect may sound different across regions, and even the same word may be pronounced differently in different contexts. This makes it hard for a speech recognition system to capture and analyze every pronunciation variant, raising the difficulty of recognition. At present, Wen Xiaoyan can handle the characteristic dialects of Chongqing, Guangxi, Henan, Guangdong, and Shandong; it can not only understand them but also reply in the corresponding dialect.
Another major feature of voice communication is the need for multi-turn interaction. In the following example, Wen Xiaoyan not only explains how to tell adult budgerigars from juveniles along several dimensions, but also gives correct feedback promptly when the user interrupts with a new question.
Even when the reply involves many elements, such as telling a budgerigar's sex or judging from a particular fixed feature, Wen Xiaoyan gives a concise answer and, at the end, reminds the user that they can take notes while observing.
In addition, human-computer interaction is often interrupted midway, for example when the user has already obtained the core information they want or is not satisfied with the current output. When the user cuts in while voice is still playing, the speech recognition system can make mistakes because of environmental noise, unclear pronunciation, or confusion with the preceding speech.
Faced with a child's repeated interruptions, Wen Xiaoyan accurately identifies the request to “change to another story”, and when the child says “Mom has already told that one”, it does not mechanically switch to a different story but gives a timely, emotionally attuned reply, creating the atmosphere of a natural conversation.
This kind of emotional interaction also extends Wen Xiaoyan from assistant scenarios such as knowledge Q&A into companionship scenarios. When a user said “I'm in a bad mood”, Wen Xiaoyan replied in a concerned tone, guided the user to explain why, and went on to offer comfort.
Second, all-new Cross-Attention: building an advantage of extremely low training cost
The core difference between the speech language model and a text language model is that the former can produce emotion.
Jia Lei said that a text model only produces words, whereas the speech language model can carry emotion. The key lies in two special links in the model's architecture diagram: TN (text normalization) and prosody, and persona-and-style emotion control. Both are prepared for speech synthesis, allowing the large model to generate an answer while producing emotion that fits the content; this is also the key innovation of Baidu's end-to-end speech model.
Specifically, there are four key innovations.
First, this is the industry's first cross-modal speech language model based on Cross-Attention released by Baidu. Second, the model shares its Encoder with speech recognition, cutting KV computation to one tenth. Third, the Encoder is coupled with speech synthesis, so the emotion of the output content can be controlled. Finally, the efficient all-query attention mechanism EALLQA reduces the KV cache to a small fraction of its original size.
On this basis, the model achieves the integration of recognition and synthesis in a single end-to-end system. These coupled technologies enable the model to deliver a natural, realistic, and emotional interactive experience on top of fast understanding and fast question answering.
Jia Lei explained that the acoustic model is itself a speech model, whereas large language models usually take text as their interface. Therefore, when fusing speech recognition with the large language model, the researchers shared the Encoder between speech recognition and the large model to reduce the hard latency of voice interaction, and innovatively introduced cross-modal modeling, switching from Self-Attention to Cross-Attention, thereby completing the integration of speech recognition and the large language model.
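Baidu has not published the implementation details, but the general pattern of letting the text decoder attend to shared speech-encoder features through cross-attention, instead of appending audio tokens to the decoder's own self-attention sequence, can be illustrated with a minimal PyTorch sketch. The layer below is a generic illustration of that pattern, not Baidu's architecture; the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class SpeechCrossAttentionLayer(nn.Module):
    """Minimal sketch: a text decoder layer that attends to shared
    speech-encoder features via cross-attention, rather than mixing
    audio tokens into its own self-attention sequence."""

    def __init__(self, d_model=1024, n_heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text_states, speech_features, causal_mask=None):
        # Causal self-attention over the text tokens generated so far.
        h, _ = self.self_attn(text_states, text_states, text_states,
                              attn_mask=causal_mask, need_weights=False)
        x = self.norm1(text_states + h)
        # Cross-attention: queries come from text, keys/values from the shared
        # speech encoder, so audio frames never enter the decoder's own KV
        # cache -- the intuition behind the claimed KV savings.
        h, _ = self.cross_attn(x, speech_features, speech_features, need_weights=False)
        x = self.norm2(x + h)
        return self.norm3(x + self.ffn(x))
```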
Baidu proposes using Cross-Attention to solve the cross-modal problem between speech and language. Because existing attention techniques limited the speed of Cross-Attention speech-language modeling, Baidu developed EALLQA, an efficient all-query attention technique suited to Cross-Attention. It adopts an implicit RNN-style two-level position encoding, trains with MHA in a 128-dimensional space, and at inference uses MQA in a 512-dimensional space shared by all layers of the model, thereby making full use of limited training resources while reducing inference cost.
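The exact EALLQA design is not public, but the KV-cache argument itself can be checked with back-of-the-envelope arithmetic: a per-layer multi-head cache grows with both the layer count and the number of KV heads, while a single cache shared across layers does not. The numbers below are purely illustrative and are not Baidu's configuration.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim,
                   share_across_layers=False, dtype_bytes=2):
    """Rough KV-cache size: 2 (K and V) * seq_len * kv_heads * head_dim * bytes,
    multiplied by layer count unless the cache is shared across layers."""
    per_layer = 2 * seq_len * kv_heads * head_dim * dtype_bytes
    return per_layer if share_across_layers else per_layer * layers

# Hypothetical numbers for illustration only.
mha = kv_cache_bytes(seq_len=4096, layers=32, kv_heads=16, head_dim=128)
shared_mqa = kv_cache_bytes(seq_len=4096, layers=32, kv_heads=1, head_dim=512,
                            share_across_layers=True)
print(f"per-layer MHA cache: {mha / 2**20:.1f} MiB")
print(f"cross-layer shared MQA cache: {shared_mqa / 2**20:.1f} MiB")
print(f"ratio: {shared_mqa / mha:.4f}")  # a small fraction of the MHA cache
```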
For base training, Baidu starts from the mature Self-Attention-based Wenxin language pre-trained model and applies self-distillation post-training to obtain the Cross-Attention end-to-end speech language model.
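Baidu has not described this recipe in detail; the snippet below is only a generic sketch of what a self-distillation post-training step can look like, with the mature Self-Attention text model as a frozen teacher and the Cross-Attention speech model as the student. The model interfaces (`speech_features`, `.logits`) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def self_distill_step(student, teacher, batch, optimizer, T=2.0, alpha=0.5):
    """Generic self-distillation post-training step (not Baidu's actual recipe):
    the student learns the usual next-token objective while also matching the
    frozen teacher's output distribution."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"],
                             speech_features=batch["speech_features"]).logits
    # Standard language-modeling loss on the ground-truth targets.
    ce = F.cross_entropy(student_logits.transpose(1, 2), batch["labels"],
                         ignore_index=-100)
    # KL term pulling the student's distribution toward the teacher's.
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    loss = alpha * ce + (1 - alpha) * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```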
In fact, in a speech model the pressure of KV caching and KV computation is much greater than in a text model. Jia Lei explained that the essential difference from a text model is that the time to the first token at the start of a sentence determines the latency of the speech interaction. With a text model, users will wait 2-3 seconds after typing a passage before getting an answer; with a speech language model, users are far less tolerant of delay and expect to hear the answer within 0.5-1 second.
On this basis, the end-to-end speech language model achieves low-cost training and low-cost, high-speed inference. Beyond that, a speech language model also needs to respond quickly and with emotion, which is where another key technology comes into play: streaming, word-by-word, LLM-driven multi-emotion speech synthesis. Jia Lei said that multi-turn, continuous, emotionally engaged communication is what makes people want to keep talking.
With the streaming word-by-word approach, speech synthesis starts as soon as each word appears: the large model supplies the synthesizer with the text-normalized output, prosodic pauses, and emotion it needs, so that synthesis flows the way people speak. Driven by the text output, its adaptive emotion coverage reaches 17 types.
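The interfaces of Baidu's synthesizer are not public, so the sketch below only illustrates the general streaming word-by-word idea: audio for each word is produced while the LLM is still generating the rest of the sentence. The token format and `synthesize_word` are hypothetical placeholders, not Baidu's API.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class SynthesisChunk:
    word: str        # normalized text for this step
    emotion: str     # e.g. "neutral", "concerned" -- one of the emotion labels
    pause_ms: int    # prosodic pause to insert after the word

def stream_tts(llm_tokens: Iterator[dict], synthesize_word) -> Iterator[bytes]:
    """Hypothetical streaming loop: synthesize audio word by word as the LLM
    emits tokens, instead of waiting for the full sentence.

    `llm_tokens` is assumed to yield dicts like
    {"word": "...", "emotion": "...", "pause_ms": 120} produced by the LLM's
    TN/prosody/emotion outputs; `synthesize_word` stands in for the real
    synthesizer call. Both are illustrative assumptions."""
    for tok in llm_tokens:
        chunk = SynthesisChunk(tok["word"], tok.get("emotion", "neutral"),
                               tok.get("pause_ms", 0))
        # Audio for this word can start playing while the LLM is still
        # generating the rest of the sentence, keeping latency low.
        yield synthesize_word(chunk)
```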
In addition, a big pain point of speech recognition is that it cannot judge where the user's utterance starts and ends. With the large model behind it, the system can analyze from semantics whether the user has finished speaking; if the semantics are incomplete, it keeps waiting.
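A minimal sketch of such semantic endpointing, under the assumption that acoustic silence is combined with a language-model judgement of whether the partial transcript forms a complete thought, might look like this. The thresholds and the `is_complete` callback are illustrative assumptions, not a documented Baidu interface.

```python
def should_respond(transcript: str, silence_ms: int, is_complete) -> bool:
    """Hypothetical semantic endpointing rule: decide whether to answer now,
    based on both how long the user has been silent and whether the partial
    transcript reads as a finished utterance."""
    if silence_ms < 200:    # user is clearly still talking
        return False
    if silence_ms > 1500:   # long silence: respond even if semantics are ambiguous
        return True
    # Mid-length pause: let semantics decide -- an unfinished request such as
    # "help me check tomorrow's..." should make the system keep waiting.
    return is_complete(transcript)
```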
Jia Lei further explained that activating voice scenarios demands extremely low interaction cost, extremely fast response, and intelligent, emotional, human-like question answering. Baidu integrates speech recognition with the large model to solve pre-fetching, hesitation, content understanding, and rapid Q&A, and integrates speech synthesis with the large model so that the prosody and emotion needed in speech can be produced, solving context understanding and emotion control in synthesis. This greatly raises the application potential of voice scenarios.
Third, tackling the pain points of voice interaction head-on: Baidu's end-to-end speech language model makes its big move
The continuous optimization of large models has significantly improved the robustness, naturalness, and speaker similarity of speech, but the previous technical path still has many pain points, which is why Baidu has focused on the end-to-end speech language model.
Compared with conversation between people, large language models respond slowly, and users have to wait a while for a reply. Voice communication also usually involves multiple turns of dialogue, and completing colloquial multi-turn interaction is extremely hard for models. And because users turn to voice interaction in more scenarios than to text, the surge in interactions drives up the cost of applying large models, making large-scale adoption even harder.
On the traditional voice-interaction route, systems are constrained by contextual memory, noisy environments, hesitant questions, and the need to respond accurately to interruptions.
This has therefore become the core contradiction in the voice-interaction field: the convenience of voice interaction determines its potential for large-scale application, yet these pain points are holding back its adoption. Jia Lei believes that the chemical reaction between speech and text is the key to finding breakthroughs in specific domains in the future.
The emergence of the speech language model is a qualitative change: its innovative synthesis technology means the model does not need to see the full text of a sentence before speaking, but can synthesize word by word as the text arrives. On this basis, Baidu has uncovered distinctive application scenarios. Jia Lei gave an example: when asking about the weather, the user can interrupt with the next question as soon as the temperature range is heard. The advantage is that the cost of using the model drops sharply: a text model needs powerful hardware to achieve this kind of efficient application, whereas the speech language model can achieve high concurrency on low-cost hardware.
At the same time, looking at the voice-interaction field as a whole, large models have already greatly improved the accuracy of the speech recognition stage. Jia Lei believes the competition is now more about speed, cost, and answer accuracy, and that cost reduction is the key to large-scale use of multimodal voice interaction.
Jia Lei said: "Cost reduction is the inevitable path of technological progress." The extremely low cost of Baidu's speech language model opens the possibility of large-scale industrial deployment. Getting AI into real applications is the core of the model industry's development in 2025, and this model is the key to cracking the speech problem.
Baidu’s accumulation in the field of speech recognition has a long history.
In 2018, Baidu Voice released the Deep Peak 2 model, breaking through the traditional models that had been in use for more than a decade and greatly improving recognition accuracy across scenarios. In early 2019, the Baidu voice technology team announced SMLTA, the world's first streaming multi-level truncated attention model for online speech, delivering a relative accuracy improvement of 15%. In 2021, Baidu released SMLTA 2, a streaming truncated Conformer modeling technique based on historical information abstraction, which solved the problems the Transformer model faces when applied to online speech recognition tasks.
These technological innovations have been applied in fields such as automobiles, consumer electronics, and mobile phones. This time, to promote the large-scale application of the speech language model, Baidu has launched it in Wen Xiaoyan and opened it for free, plans to release it on its open platform in April, and will then bring it to call centers, smart speakers, and other business lines.
Jia Lei remarked: "Science may have national boundaries, but it has no corporate boundaries." Baidu's subsequent opening-up of the end-to-end speech language model is precisely intended to promote the application of large language models in the speech field, which benefits the development of the whole industry and ecosystem.
Conclusion: Baidu's speech language model is unsheathed, opening a new chapter of low cost and high efficiency
Convenient, efficient, natural, and friendly voice interaction, applicable across many scenarios, matters greatly in the digital age. Judging from current results, voice interaction has already improved greatly in recognition accuracy, and the industry's competitive focus has shifted to recognition speed, cost, and answer accuracy.
In this context, the release of Baidu's end-to-end speech language model pushes costs down further and proposes a new technical path, raising industry competition in voice interaction to a new level. At the same time, Baidu's plan to release it on an open platform will accelerate the application and popularization of large models in voice interaction scenarios.