Topic Tag: speech


 Sparsity-based Defense against Adversarial Attacks on Linear Classifiers

    

Deep neural networks represent the state of the art in machine learning in a growing number of fields, including vision, speech and natural language processing. However, recent work raises important questions about the robustness of such architectures, by showing that it is possible to induce class…


 A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation

    

We describe recurrent neural networks (RNNs), which have attracted great attention for sequential tasks such as handwriting recognition, speech recognition, and image-to-text. However, compared to general feedforward neural networks, RNNs contain feedback loops, which makes them a little harder to understa…


 Characterizing Types of Convolution in Deep Convolutional Recurrent Neural Networks for Robust Speech Emotion Recognition

    

Deep convolutional neural networks are being actively investigated in a wide range of speech and audio processing applications, including speech recognition, audio event detection, and computational paralinguistics, owing to their ability to reduce factors of variation when learning from speech. How…


 Conversational AI: The Science Behind the Alexa Prize

  

Conversational agents are exploding in popularity. However, much work remains in the area of social conversation as well as free-form conversation over a broad range of domains and topics. To advance the state of the art in conversational AI, Amazon launched the Alexa Prize, a 2.5-million-dollar un…


 Human and Machine Speaker Recognition Based on Short Trivial Events

Trivial events are ubiquitous in human-to-human conversations, e.g., coughs, laughs, and sniffs. Compared to regular speech, these trivial events are usually short and unclear, so they are generally regarded as not speaker-discriminative and are largely ignored by present speaker recognition research. Howe…


 What do we need to build explainable AI systems for the medical domain?

    

Artificial intelligence (AI) generally and machine learning (ML) specifically demonstrate impressive practical success in many different application domains, e.g. in autonomous driving, speech recognition, or recommender systems. Deep learning approaches, trained on extremely large data sets or usi…


 Letter-Based Speech Recognition with Gated ConvNets

  

In this paper we introduce a new speech recognition system, leveraging a simple letter-based ConvNet acoustic model. The acoustic model requires only audio transcription for training: no alignment annotations, and no forced-alignment step. At inference, our decoder takes o…
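Since training needs no alignment annotations, decoding with such letter-based models typically emits one symbol per audio frame, and a greedy collapse step turns that frame sequence into text. A minimal sketch of that collapse (CTC-style; the paper's own criterion and decoder differ in details, and the `blank` symbol here is an illustrative placeholder):

```python
def collapse(frames, blank="_"):
    """Greedy collapse of per-frame letter predictions:
    merge consecutive repeats, then drop the blank symbol."""
    out = []
    prev = None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Twelve frames of per-frame letters collapse to a five-letter word.
print(collapse("hh_eel_ll_oo"))  # -> hello
```

The blank lets the decoder separate genuine doubled letters (as in "ll") from one letter held across several frames.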


 Evaluation of Speech for the Google Assistant

     

Posted by Enrique Alfonseca, Staff Research Scientist, Google Assistant Voice interactions with technology are becoming a key part of our lives — from asking your phone for traffic conditions to work to using a smart device at home to turn on the lights or play music. The Google Assistant is desi…


 Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

  

This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (…


 Tacotron 2: Generating Human-like Speech from Text

     

Posted by Jonathan Shen and Ruoming Pang, Software Engineers, on behalf of the Google Brain and Machine Perception Teams Generating very natural sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few year…


 Understanding Medical Conversations

   

Posted by Katherine Chou, Product Manager and Chung-Cheng Chiu, Software Engineer, Google Brain Team Good documentation helps create good clinical care by communicating a doctor’s thinking, their concerns, and their plans to the rest of the team. Unfortunately, physicians routinely spend more…


 High-fidelity speech synthesis with WaveNet

High-fidelity speech synthesis with WaveNet by DeepMind


 Spoken Language Biomarkers for Detecting Cognitive Impairment

   

In this study we developed an automated system that evaluates speech and language features from audio recordings of neuropsychological examinations of 92 subjects in the Framingham Heart Study. A total of 265 features were used in an elastic-net regularized binomial logistic regression model to cla…
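The model named in the abstract, elastic-net regularized binomial logistic regression, blends an L1 penalty (which zeroes out uninformative features) with an L2 penalty (which shrinks the rest). A minimal numpy sketch on synthetic data standing in for the study's 265 features (the data, dimensions, and hyperparameters here are illustrative, not the study's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 features, only the first two carry signal.
n, d = 200, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def fit_elastic_net_logistic(X, y, alpha=0.1, l1_ratio=0.5, lr=0.5, steps=1000):
    """Binomial logistic regression with an elastic-net penalty, fit by
    proximal gradient descent. The soft-threshold step handles the L1
    term and drives uninformative weights exactly to zero."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted P(y=1 | x)
        grad = X.T @ (p - y) / n                # logistic-loss gradient
        grad += alpha * (1.0 - l1_ratio) * w    # smooth L2 part
        w = w - lr * grad
        thr = lr * alpha * l1_ratio             # proximal soft-threshold (L1 part)
        w = np.sign(w) * np.maximum(np.abs(w) - thr, 0.0)
    return w

w = fit_elastic_net_logistic(X, y)
accuracy = np.mean(((X @ w) > 0) == (y > 0.5))
```

The `l1_ratio` knob interpolates between pure ridge (0) and pure lasso (1); the sparsity it induces is what makes a 265-feature model interpretable.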


 Progressive Joint Modeling in Unsupervised Single-channel Overlapped Speech Recognition

 

Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling…
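The core idea of PIT is that with overlapped speakers there is no natural ordering of output streams, so the loss is taken over the best assignment of outputs to references. A minimal sketch using mean-squared error (the actual systems use frame-level senone or spectral losses; this is illustrative only):

```python
import itertools
import numpy as np

def pit_loss(outputs, targets):
    """Permutation invariant training (PIT) objective, sketched with MSE:
    score every assignment of the network's output streams to the
    reference speakers and keep the minimum, so the model need not
    commit to a fixed speaker ordering."""
    n_streams = outputs.shape[0]
    losses = []
    for perm in itertools.permutations(range(n_streams)):
        losses.append(np.mean((outputs[list(perm)] - targets) ** 2))
    return min(losses)
```

Because all permutations are scored, the cost grows factorially in the number of streams, which is why PIT is typically used with two or three speakers.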


 Recent Advances in Convolutional Neural Networks

    

In the last few years, deep learning has led to very good performance on a variety of problems, such as visual recognition, speech recognition and natural language processing. Among different types of deep neural networks, convolutional neural networks have been most extensively studied. Leveraging…


 Learning weakly supervised multimodal phoneme embeddings

  

Recent works have explored deep architectures for learning multimodal speech representations (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing lip movements, …


 Learning Scalable Deep Kernels with Recurrent Structure

     

Many applications in speech, robotics, finance, and biology deal with sequential data, where ordering matters and recurrent structures are common. However, this structure cannot be easily captured by standard kernel functions. To model such structure, we propose expressive closed-form kernel functi…


 LSTM: A Search Space Odyssey

  

Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in …
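The variants the paper compares all modify the same baseline cell, so a minimal statement of the vanilla LSTM step is useful context. A numpy sketch (the stacked-weight layout is one common convention, chosen here for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of the vanilla LSTM cell that the surveyed variants modify.
    W (4H x D), U (4H x H), and b (4H,) stack the input, forget, output,
    and candidate-cell transforms, in that order."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # gated cell update
    h_new = sigmoid(o) * np.tanh(c_new)               # gated output
    return h_new, c_new
```

Each named variant in the search (e.g. removing a gate, coupling input and forget gates, removing peepholes) is a small local change to this update.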


 Modular Representation of Layered Neural Networks

   

Layered neural networks have greatly improved the performance of various applications including image processing, speech recognition, natural language processing, and bioinformatics. However, it is still difficult to discover or interpret knowledge from the inference provided by a layered neural ne…


 How Important is Syntactic Parsing Accuracy? An Empirical Evaluation on Rule-Based Sentiment Analysis

  

Syntactic parsing, the process of obtaining the internal structure of sentences in natural languages, is a crucial task for artificial intelligence applications that need to extract meaning from natural language text or speech. Sentiment analysis is one example of application for which parsing has …


 DeepSafe: A Data-driven Approach for Checking Adversarial Robustness in Neural Networks

       

Deep neural networks have become widely used, obtaining remarkable results in domains such as computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, and bio-informatics, where they have produced results comparable to human…


 Improving speech recognition by revising gated recurrent units

   

Speech recognition is largely taking advantage of deep learning, showing that substantial benefits can be obtained by modern Recurrent Neural Networks (RNNs). The most popular RNNs are Long Short-Term Memory networks (LSTMs), which typically reach state-of-the-art performance in many tasks thanks to their a…


 Research on several key technologies in practical speech emotion recognition

  

This dissertation studies practical speech emotion recognition technology, covering several cognition-related emotion types, namely fidgetiness, confidence, and tiredness. High-quality naturalistic emotional speech data is the basis of this research. The following techniques are us…


 Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

  

A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network (DNN) techniques can be applied to artificially synthesize speech waveforms, the quality of the synthetic speech is low compared with that of natura…


 Learning Latent Representations for Speech Generation and Transformation

An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous succ…
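The VAE objective the abstract alludes to has two parts: a reconstruction term and a KL divergence pulling the approximate posterior toward the prior, with the reparameterization trick keeping sampling differentiable. A minimal numpy sketch of those pieces (a unit-Gaussian prior and squared-error reconstruction are assumed here for brevity; they are not necessarily the paper's exact choices):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps: sampling stays differentiable w.r.t. mu, log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def neg_elbo(x, x_recon, mu, log_var):
    """Negative evidence lower bound for a VAE with a unit-Gaussian prior:
    squared reconstruction error plus the closed-form Gaussian KL term."""
    recon = np.sum((x - x_recon) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl
```

For speech, `x` is typically a frame of spectral features, and the learned latent `z` is the representation used for generation and transformation.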