Welcome to CSTR’s NN-TTS documentation!

Get Started

Required software/tools

The system has been tested in a Linux environment, and the following packages are required.

Data Preparation for a neural network (NN) based speech synthesis system

To build a NN system, you need to prepare linguistic features as system input and acoustic features as system output. Please follow the instructions in this section to prepare your data.

Input Linguistic Features

Neural networks take vectors as input, so the textual representation of linguistic features must be vectorized.

  1. HTS style: Please check the HTS demo for the HTS style labels (http://hts.sp.nitech.ac.jp/).
    • Provide HTS full-context labels with state-level alignments.
    • Provide a question file that matches the HTS labels.
    • The questions in the question file are used to convert the full-context labels into binary and/or numerical features for vectorization. Manual selection of the questions is recommended, because the number of questions determines the dimensionality of the vectorized input features.
    • Unlike the HTS question format, the NN system can also extract numerical values using ‘CQS’, e.g., CQS “Pos_C-Word_in_C-Phrase(Fw)” {:(\d+)+}, where ‘:’ and ‘+’ are separators, and ‘(\d+)’ is a regular expression that matches a numerical feature.
  2. “Composed” style:

  3. Direct *vectorized* input: If you prefer to do vectorization yourself, you can feed the system binary files directly. Please prepare your binary files with the following instructions:
    • Align the input feature vectors with the acoustic features. Input and output features should have the same number of frames.
    • Store the data in binary format with ‘float32‘ precision.
    • In the config file, use an empty question file, and set appended_input_dim to be the dimensionality of the input vector.
    • Note: voice conversion can use this kind of direct vectorized input.
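
As a minimal sketch of the direct vectorized input format described above, the snippet below writes a frames-by-dimension matrix as headerless float32 values and reads it back. It assumes NumPy is available; the helper names write_float32 and read_float32, the file name, and the feature dimension are hypothetical and are not part of the system.

```python
import os
import tempfile

import numpy as np


def write_float32(path, features):
    """Write a (frames x dim) matrix as raw float32 values with no header."""
    np.asarray(features, dtype=np.float32).tofile(path)


def read_float32(path, dim):
    """Read raw float32 values and reshape them to (frames x dim)."""
    data = np.fromfile(path, dtype=np.float32)
    if data.size % dim != 0:
        raise ValueError("file size is not a multiple of the feature dimension")
    return data.reshape(-1, dim)


# Hypothetical example: 5 frames of 3-dimensional input features.
inputs = np.arange(15, dtype=np.float32).reshape(5, 3)
path = os.path.join(tempfile.mkdtemp(), "example.lab")
write_float32(path, inputs)
restored = read_float32(path, dim=3)
```

The same read-back check can be applied to the acoustic feature files to confirm that input and output have the same number of frames.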

Output Acoustic Features

The default setting assumes that you use the STRAIGHT vocoder (C version). This vocoder is free for academic users. The output includes
  • mel-cepstral coefficients (MCC),
  • band aperiodicities (BAP),
  • fundamental frequency (F0) on a logarithmic scale.
Please provide the three features in binary format with ‘float32’ precision. In the config file, provide the dimensionality of each feature, for example
  • [Outputs] mgc : 60
  • [Outputs] dmgc : 180

dmgc is the dimensionality of the MCC features including the delta and delta-delta features. If dmgc is set to 60, only the static features are used. Please also specify the file extension of each feature, for example

  • [Extensions] mgc_ext : .mgc
  • [Extensions] bap_ext : .bap
  • [Extensions] lf0_ext : .lf0
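
To illustrate how the [Outputs] and [Extensions] entries above fit together, here is a small sketch that parses such a fragment with Python's standard configparser module. Whether the system itself reads its config file this way is an assumption; the fragment only repeats the values shown above.

```python
import configparser

# Config fragment using the section and option names listed above.
CONFIG_TEXT = """
[Outputs]
mgc : 60
dmgc : 180

[Extensions]
mgc_ext : .mgc
bap_ext : .bap
lf0_ext : .lf0
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

mgc_dim = parser.getint("Outputs", "mgc")
dmgc_dim = parser.getint("Outputs", "dmgc")
# Static + delta + delta-delta features triple the static dimensionality.
uses_dynamic_features = dmgc_dim == 3 * mgc_dim
lf0_extension = parser.get("Extensions", "lf0_ext")
```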

The open-source WORLD vocoder is also supported. The modified version for SPSS can be found in the repository.

If you prefer a different vocoder, please name each of its features after the corresponding supported feature.

Recipes

In the system, several recipes for standard neural network architectures are provided. They are described below:

Architecture

The system supports a flexible way to change neural network architectures by changing the config file in the [Architecture] section:
  • hidden_layer_size : [512, 512, 512, 512]
  • hidden_layer_type : [‘TANH’, ‘TANH’, ‘TANH’, ‘TANH’]
By default, a feedforward neural network is used, but the system supports various types of hidden layers:
  • ‘TANH’ : The Hyperbolic Tangent activation function
  • ‘RNN’ : The simple but standard recurrent neural network unit
  • ‘LSTM’ : The standard Long Short-Term Memory unit
  • ‘GRU’ : The gated recurrent unit
  • ‘SLSTM’: The simplified LSTM unit
  • ‘BLSTM’: The bidirectional LSTM unit

You can define your own architecture by choosing a hidden unit at each hidden layer. For each type of hidden layer, please check the Models section.
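
The two [Architecture] settings must describe the same number of layers. The sketch below checks that for a hybrid architecture; the function name parse_architecture and the layer sizes are hypothetical, and the system may validate its config differently.

```python
import ast

# Hidden layer types listed in the [Architecture] description above.
SUPPORTED_TYPES = {"TANH", "RNN", "LSTM", "GRU", "SLSTM", "BLSTM"}


def parse_architecture(size_text, type_text):
    """Parse the two list-valued settings and check that they agree."""
    sizes = ast.literal_eval(size_text)
    types = ast.literal_eval(type_text)
    if len(sizes) != len(types):
        raise ValueError("one layer type is needed per hidden layer size")
    unknown = [t for t in types if t not in SUPPORTED_TYPES]
    if unknown:
        raise ValueError("unsupported hidden layer types: %s" % unknown)
    return list(zip(sizes, types))


# A hybrid architecture: three feedforward layers below one BLSTM layer.
layers = parse_architecture("[512, 512, 512, 384]",
                            "['TANH', 'TANH', 'TANH', 'BLSTM']")
```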

Deep Feedforward Neural Network

An example config file can be found in the ‘./recipes/dnn’ directory. Please use ‘submit.sh ./run_lstm.py ./recipes/dnn/feed_foward_dnn.conf’ to build the feedforward neural network. Please modify the config file to adapt to your own working environment (e.g., data path).

Mixture Density Neural Network

(Deep) Long Short-Term Memory (LSTM) based Recurrent Neural Network (RNN)

An example config file is provided in ‘./recipes/dnn/hybrid_lstm.conf’. Follow the same recipe as that in the deep feedforward neural network section.

(Deep/Hybrid) Bidirectional LSTM-based RNN

Example config files are provided in the ‘./recipes/blstm’ directory. ‘blstm.conf’ is for multiple bidirectional LSTM layers, while ‘hybrid_blstm.conf’ is for a hybrid architecture that uses several feedforward layers at the bottom and one BLSTM layer at the top.

Variants of LSTM

This recipe is to support the paper by Wu & King (ICASSP 2016). Several variants of LSTMs are provided. Please use the corresponding config files to do the experiments.

Stacked Bottlenecks

Trajectory modelling

Models

Deep Feedforward/Recurrent Networks

class models.deep_rnn.DeepRecurrentNetwork(n_in, hidden_layer_size, n_out, L1_reg, L2_reg, hidden_layer_type, output_type='LINEAR')[source]

This class assembles various neural network architectures, from a basic feedforward neural network to bidirectional gated recurrent neural networks and hybrid architectures. Hybrid means a combination of feedforward and recurrent architectures.

__init__(n_in, hidden_layer_size, n_out, L1_reg, L2_reg, hidden_layer_type, output_type='LINEAR')[source]

This function initialises a neural network

Parameters:
  • n_in – Dimensionality of input features
  • hidden_layer_size (A list of integers) – The layer size for each hidden layer
  • n_out (Integer) – Dimensionality of output features
  • hidden_layer_type – the activation type of each hidden layer, e.g., TANH, LSTM, GRU, BLSTM
  • L1_reg – the L1 regularisation weight
  • L2_reg – the L2 regularisation weight
  • output_type – the activation type of the output layer; the default is ‘LINEAR’ (linear regression).
  • p_dropout – the dropout rate, a float number between 0 and 1.
build_finetune_functions(train_shared_xy, valid_shared_xy)[source]

This function is to build finetune functions and to update gradients

Parameters:
  • train_shared_xy (tuple of shared variable) – theano shared variable for input and output training data
  • valid_shared_xy (tuple of shared variable) – theano shared variable for input and output development data
Returns:

finetune functions for training and development

parameter_prediction(test_set_x)[source]

This function predicts output features from the input features.

Parameters:
  • test_set_x (python array variable) – input features for a testing sentence
Returns:

predicted features

Layers

Recurrent Neural Network units

class layers.gating.VanillaRNN(rng, x, n_in, n_h)[source]

This class implements a standard recurrent neural network: h_{t} = f(W^{hx}x_{t} + W^{hh}h_{t-1}+b_{h})

__init__(rng, x, n_in, n_h)[source]

This is to initialise a standard RNN hidden unit

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input data to current layer
  • n_in – dimension of input data
  • n_h – number of hidden units/blocks
recurrent_as_activation_function(Wix, h_tm1)[source]

Implement the recurrent unit as an activation function. This function is called by self.__init__().

Parameters:
  • Wix (matrix) – equals W^{hx}x_{t}; because it does not depend on the recurrence, it is pre-computed for faster computation
  • h_tm1 (matrix, each row is the hidden activation vector of a time step) – the hidden activation from the previous time step
Returns:

h_t is the hidden activation of current time step
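
The recurrence h_{t} = f(W^{hx}x_{t} + W^{hh}h_{t-1}+b_{h}) and the pre-computation of W^{hx}x_{t} can be sketched in NumPy as follows. This is an illustration rather than the class's Theano code: f is assumed to be tanh, the initial hidden state is assumed to be zero, frames are stored as rows (so products are written x @ W), and the function names are hypothetical.

```python
import numpy as np


def recurrent_step(Wix_t, h_tm1, W_hh, b_h):
    """One time step: h_t = tanh(W^{hx} x_t + W^{hh} h_{t-1} + b_h)."""
    return np.tanh(Wix_t + h_tm1 @ W_hh + b_h)


def run_rnn(x, W_hx, W_hh, b_h):
    """Run the recurrence over a (frames x n_in) input sequence."""
    Wix = x @ W_hx                  # input projection for all frames at once
    h_t = np.zeros(W_hh.shape[0])   # initial hidden state (assumed to be zero)
    hidden = []
    for Wix_t in Wix:
        h_t = recurrent_step(Wix_t, h_t, W_hh, b_h)
        hidden.append(h_t)
    return np.stack(hidden)


rng = np.random.RandomState(1234)   # fixed seed for reproducible results
n_in, n_h, frames = 4, 3, 6
x = rng.normal(size=(frames, n_in))
W_hx = rng.normal(scale=0.1, size=(n_in, n_h))
W_hh = rng.normal(scale=0.1, size=(n_h, n_h))
b_h = np.zeros(n_h)
h = run_rnn(x, W_hx, W_hh, b_h)     # one hidden vector per frame
```

Computing the input projection once for the whole sequence leaves only the hidden-to-hidden product inside the loop.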

class layers.gating.LstmBase(rng, x, n_in, n_h)[source]

This class serves as a base for all long short-term memory (LSTM) related classes. Several variants of LSTM were investigated in (Wu & King, ICASSP 2016): Zhizheng Wu, Simon King, “Investigating gated recurrent neural networks for speech synthesis”, ICASSP 2016.

__init__(rng, x, n_in, n_h)[source]

Initialise all the components of an LSTM block, including the input gate, output gate, forget gate and peephole connections

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input to a network
  • n_in (integer) – number of input features
  • n_h (integer) – number of hidden units
lstm_as_activation_function()[source]

A generic recurrent activation function for variants of LSTM architectures. The function is called by self.recurrent_fn().

recurrent_fn(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1=None)[source]

This implements a generic recurrent function, called by self.__init__().

Parameters:
  • Wix – pre-computed matrix applying the weight matrix W on the input units, for input gate
  • Wfx – Similar to Wix, but for forget gate
  • Wcx – Similar to Wix, but for cell memory
  • Wox – Similar to Wix, but for output gate
  • h_tm1 – hidden activation from previous time step
  • c_tm1 – activation from cell memory from previous time step
Returns:

h_t is the hidden activation of current time step, and c_t is the activation for cell memory of current time step

class layers.gating.VanillaLstm(rng, x, n_in, n_h)[source]

This class implements the standard LSTM block, inheriting from the generic class layers.gating.LstmBase.

__init__(rng, x, n_in, n_h)[source]

Initialise a vanilla LSTM block

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input to a network
  • n_in (integer) – number of input features
  • n_h (integer) – number of hidden units
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]

This function treats the LSTM block as an activation function, and implements the standard LSTM activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()
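
For orientation, one step of a widely used LSTM formulation with peephole connections is sketched below in NumPy. It takes the pre-computed input projections Wix, Wfx, Wcx and Wox as arguments, like the method above, but the equations, the parameter dictionary p and all of its key names are assumptions made for illustration; the class's actual implementation may differ in detail.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1, p):
    """One LSTM step with peepholes; p holds the recurrent parameters."""
    # Input and forget gates look at the previous cell state through peepholes.
    i_t = sigmoid(Wix + h_tm1 @ p["W_hi"] + p["w_ci"] * c_tm1 + p["b_i"])
    f_t = sigmoid(Wfx + h_tm1 @ p["W_hf"] + p["w_cf"] * c_tm1 + p["b_f"])
    # New cell state: gated old memory plus gated candidate.
    c_t = f_t * c_tm1 + i_t * np.tanh(Wcx + h_tm1 @ p["W_hc"] + p["b_c"])
    # The output gate looks at the updated cell state.
    o_t = sigmoid(Wox + h_tm1 @ p["W_ho"] + p["w_co"] * c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


n_h = 3
params = {name: np.zeros((n_h, n_h)) for name in ("W_hi", "W_hf", "W_hc", "W_ho")}
params.update({name: np.zeros(n_h) for name in
               ("w_ci", "w_cf", "w_co", "b_i", "b_f", "b_c", "b_o")})
zeros = np.zeros(n_h)
# With all-zero parameters every gate is 0.5, so half of the old cell state is kept.
h_t, c_t = lstm_step(zeros, zeros, zeros, zeros,
                     h_tm1=zeros, c_tm1=np.ones(n_h), p=params)
```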

class layers.gating.LstmNFG(rng, x, n_in, n_h)[source]

This class implements an LSTM block without the forget gate, inheriting from the generic class layers.gating.LstmBase.

__init__(rng, x, n_in, n_h)[source]

Initialise an LSTM without the forget gate

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input to a network
  • n_in (integer) – number of input features
  • n_h (integer) – number of hidden units
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]

This function treats the LSTM block as an activation function, and implements the LSTM (without the forget gate) activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()

class layers.gating.LstmNIG(rng, x, n_in, n_h)[source]

This class implements an LSTM block without the input gate, inheriting from the generic class layers.gating.LstmBase.

__init__(rng, x, n_in, n_h)[source]

Initialise an LSTM without the input gate

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input to a network
  • n_in (integer) – number of input features
  • n_h (integer) – number of hidden units
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]

This function treats the LSTM block as an activation function, and implements the LSTM (without the input gate) activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()

class layers.gating.LstmNOG(rng, x, n_in, n_h)[source]

This class implements an LSTM block without the output gate, inheriting from the generic class layers.gating.LstmBase.

__init__(rng, x, n_in, n_h)[source]

Initialise an LSTM without the output gate

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input to a network
  • n_in (integer) – number of input features
  • n_h (integer) – number of hidden units
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]

This function treats the LSTM block as an activation function, and implements the LSTM (without the output gate) activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()

class layers.gating.LstmNoPeepholes(rng, x, n_in, n_h)[source]

This class implements an LSTM block without the peephole connections, inheriting from the generic class layers.gating.LstmBase.

__init__(rng, x, n_in, n_h)[source]

Initialise an LSTM without the peephole connections

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input to a network
  • n_in (integer) – number of input features
  • n_h (integer) – number of hidden units
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]

This function treats the LSTM block as an activation function, and implements the LSTM (without the peephole connections) activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()

class layers.gating.SimplifiedLstm(rng, x, n_in, n_h)[source]

This class implements a simplified LSTM block which only keeps the forget gate, inheriting from the generic class layers.gating.LstmBase.

__init__(rng, x, n_in, n_h)[source]

Initialise a simplified LSTM block

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input to a network
  • n_in (integer) – number of input features
  • n_h (integer) – number of hidden units
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]

This function treats the LSTM block as an activation function, and implements the simplified LSTM activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()

class layers.gating.GatedRecurrentUnit(rng, x, n_in, n_h)[source]

This class implements a gated recurrent unit (GRU), as proposed in Cho et al 2014 (http://arxiv.org/pdf/1406.1078.pdf).

__init__(rng, x, n_in, n_h)[source]

Initialise a gated recurrent unit

Parameters:
  • rng – random state; use a fixed value for reproducible objective results
  • x – input to a network
  • n_in (integer) – number of input features
  • n_h (integer) – number of hidden units
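
As a rough illustration of the unit, one GRU time step in the formulation of Cho et al. (2014) can be written in NumPy as below. Biases are omitted, frames are treated as row vectors, and the parameter dictionary p with its key names is hypothetical; the class's own parameterisation may differ.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_tm1, p):
    """One GRU step following Cho et al. (2014), with biases omitted."""
    z_t = sigmoid(x_t @ p["W_z"] + h_tm1 @ p["U_z"])             # update gate
    r_t = sigmoid(x_t @ p["W_r"] + h_tm1 @ p["U_r"])             # reset gate
    h_cand = np.tanh(x_t @ p["W_h"] + (r_t * h_tm1) @ p["U_h"])  # candidate state
    # Interpolate between the previous state and the candidate state.
    return z_t * h_tm1 + (1.0 - z_t) * h_cand


n_in, n_h = 4, 3
params = {name: np.zeros((n_in, n_h)) for name in ("W_z", "W_r", "W_h")}
params.update({name: np.zeros((n_h, n_h)) for name in ("U_z", "U_r", "U_h")})
# With all-zero weights the update gate is 0.5 and the candidate state is 0.
h_t = gru_step(np.ones(n_in), np.ones(n_h), params)
```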

I/O functions

Binary I/O collections

class io_funcs.binary_io.BinaryIOCollection[source]

Utils

Data Provider

class utils.providers.ListDataProvider(x_file_list, y_file_list, n_ins=0, n_outs=0, buffer_size=500000, sequential=False, shuffle=False)[source]

This class provides an interface to load data into CPU/GPU memory utterance by utterance or block by block.

In speech synthesis, it is usually not possible to load all the training/evaluation data into RAM at once, so the following three steps are used:

  • Step 1: the data provider loads part of the data into a buffer
  • Step 2: the DNN is trained using the data in the buffer
  • Step 3: steps 1 and 2 are repeated until all the data have been used for DNN training. At that point, one epoch of DNN training is finished.

The utterance-by-utterance data loading will be useful when sequential training is used, while block-by-block loading will be used when the order of frames is not important.

This provider assumes binary format with float32 precision without any header (e.g. HTK header).
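
The block-by-block loading scheme can be sketched with a small generator. This is not the ListDataProvider implementation: the function name iterate_blocks is hypothetical, and utterances are passed as in-memory arrays instead of being read from files. It fills a buffer utterance by utterance, emits a shuffled block whenever buffer_size frames are available, and carries leftover frames into the next block.

```python
import numpy as np


def iterate_blocks(utterances, buffer_size, rng):
    """Yield shuffled (x, y) blocks of at most buffer_size frames each."""
    buf_x, buf_y, n_frames = [], [], 0

    def emit(block_x, block_y):
        order = rng.permutation(len(block_x))   # same order for inputs and outputs
        return block_x[order], block_y[order]

    for x, y in utterances:
        if len(x) != len(y):
            raise ValueError("input and output must have the same number of frames")
        buf_x.append(x)
        buf_y.append(y)
        n_frames += len(x)
        while n_frames >= buffer_size:
            all_x, all_y = np.concatenate(buf_x), np.concatenate(buf_y)
            yield emit(all_x[:buffer_size], all_y[:buffer_size])
            buf_x, buf_y = [all_x[buffer_size:]], [all_y[buffer_size:]]
            n_frames = len(buf_x[0])
    if n_frames > 0:                            # last, possibly smaller, block
        yield emit(np.concatenate(buf_x), np.concatenate(buf_y))


rng = np.random.RandomState(0)
# Three hypothetical utterances with 4, 5 and 3 frames; y is simply 2 * x here.
utterances = []
for n in (4, 5, 3):
    x = rng.normal(size=(n, 2)).astype(np.float32)
    utterances.append((x, 2 * x))
block_sizes = []
for block_x, block_y in iterate_blocks(utterances, buffer_size=5, rng=rng):
    assert np.allclose(block_y, 2 * block_x)    # frames stay paired after shuffling
    block_sizes.append(len(block_x))
```

One pass over the generator corresponds to one epoch; for sequential training, the utterances would instead be consumed one at a time without shuffling frames.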

__init__(x_file_list, y_file_list, n_ins=0, n_outs=0, buffer_size=500000, sequential=False, shuffle=False)[source]

Initialise a data provider

Parameters:
  • x_file_list (python list) – list of file names for the input files to DNN
  • y_file_list – list of files for the output files to DNN
  • n_ins – the dimensionality for input feature
  • n_outs – the dimensionality for output features
  • buffer_size – the size of the buffer, indicating the number of frames in the buffer. The value depends on the memory size of RAM/GPU.
  • shuffle – True/False. Indicates whether the file list will be shuffled. When loading data block by block, the data in the buffer are shuffled regardless of this value.
load_next_partition()[source]

Load one block of data. The number of frames will be the buffer size set during initialisation.

load_next_utterance()[source]

Load the data for one utterance. This function will be called when utterance-by-utterance loading is required (e.g., sequential training).

make_shared(data_set, data_name)[source]

Make the data shared for the theano implementation. If you want to know why the data are made shared, please refer to the theano documentation: http://deeplearning.net/software/theano/library/compile/shared.html

Parameters:
  • data_set – normal data in CPU memory
  • data_name – indicate the name of the data (e.g., ‘x’, ‘y’, etc)
Returns:

shared dataset – data_set

reset()[source]

When all the files in the file list have been used for DNN training, reset the data provider to start a new epoch.

Front-end

Label normalisation

class frontend.label_normalisation.HTSLabelNormalisation(question_file_name=None, subphone_feats='full', continuous_flag=True)[source]

This class converts HTS format labels into continuous or binary values, and stores them in binary format with float32 precision.

The class supports two kinds of questions: QS and CQS.

QS: the same as that used in HTS

CQS: a new question type defined in this system. Here is an example of the question: CQS C-Syl-Tone {_(\d+)+}. A regular expression is used to extract continuous values.

Time alignments are expected in the HTS labels. Here is an example of the HTS labels:

3050000 3100000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[2]

3100000 3150000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[3]

3150000 3250000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[4]

3250000 3350000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[5]

3350000 3900000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[6]

In the first line, 3050000 and 3100000 are the start and end times. [2], [3], [4], [5], [6] are the HMM state indices.
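
A state-aligned label line of this form can be split into its parts with a short regular expression, as sketched below. HTK label times are expressed in units of 100 ns; the 5 ms frame shift, the function name parse_state_label and the shortened context string are assumptions made for this example.

```python
import re

# start time, end time, full-context label, and the state index in brackets
STATE_LABEL = re.compile(r"^(\d+)\s+(\d+)\s+(\S+)\[(\d+)\]$")
HTK_UNITS_PER_SECOND = 10000000     # HTK label times are in units of 100 ns


def parse_state_label(line, frame_shift_units=50000):
    """Split one state-aligned label line into its parts.

    frame_shift_units assumes a 5 ms frame shift; adjust it to your setup.
    """
    match = STATE_LABEL.match(line.strip())
    if match is None:
        raise ValueError("not a state-aligned label line: %r" % line)
    start, end = int(match.group(1)), int(match.group(2))
    return {
        "start_sec": start / HTK_UNITS_PER_SECOND,
        "end_sec": end / HTK_UNITS_PER_SECOND,
        "frames": (end - start) // frame_shift_units,
        "context": match.group(3),
        "state": int(match.group(4)),
    }


# The context is shortened here for readability; real labels are much longer.
info = parse_state_label("3350000 3900000 xx~#-p+l=i:1_4/A/0_0_0[6]")
```

Under the assumed 5 ms frame shift, the state duration in frames indicates how many input vectors that state contributes.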

wildcards2regex(question, convert_number_pattern=False)[source]

Convert an HTK-style question into a regular expression for searching labels. If convert_number_pattern is set, keep the following sequences unescaped for extracting continuous values:

  • (\d+) – handles digits without a decimal point
  • ([\d\.]+) – handles digits with and without a decimal point
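
The conversion can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the method's source: the name wildcard_to_regex is hypothetical, ‘*’ is taken to match any string and ‘?’ any single character, and every other character is treated as a literal.

```python
import re

# Sequences kept unescaped when extracting continuous values.
NUMBER_PATTERNS = (r"(\d+)", r"([\d\.]+)")


def wildcard_to_regex(question, convert_number_pattern=False):
    """Turn an HTK-style wildcard pattern into a regular expression string."""
    if convert_number_pattern:
        splitter = "(" + "|".join(re.escape(p) for p in NUMBER_PATTERNS) + ")"
        parts = re.split(splitter, question)
    else:
        parts = [question]
    converted = []
    for part in parts:
        if convert_number_pattern and part in NUMBER_PATTERNS:
            converted.append(part)              # keep the capture group as is
        else:
            escaped = re.escape(part)           # literal separators such as '+'
            converted.append(escaped.replace(r"\*", ".*").replace(r"\?", "."))
    return "".join(converted)


binary_regex = wildcard_to_regex("*-p+*")
numeric_regex = wildcard_to_regex(r":(\d+)+", convert_number_pattern=True)
is_p = re.search(binary_regex, "xx~#-p+l=i:1_4") is not None
value = float(re.search(numeric_regex, "w:12+3").group(1))
```

A QS question then yields a binary feature (match or no match), while a CQS question yields the captured number as a continuous feature.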