Get Started¶
Required software/tools¶
- The system has been tested in a Linux environment, and the following packages are required.
- Python 2.6/2.7
- Theano 0.6/0.7 (http://deeplearning.net/software/theano/). Version 0.8 has not been tested yet.
- Bandmat 0.5: https://pypi.python.org/pypi/bandmat/0.5
- SPTK: http://sp-tk.sourceforge.net/
Data Preparation for a neural network (NN) based speech synthesis system¶
To build a NN system, you need to prepare linguistic features as system input and acoustic features as system output. Please follow the instructions in this section to prepare your data.
Input Linguistic Features¶
Neural networks take vectors as input, so the textual representation of linguistic features needs to be vectorized.
- HTS style: Please check the HTS demo for the HTS style labels (http://hts.sp.nitech.ac.jp/).
- Provide HTS full-context labels with state-level alignments.
- Provide a question file that matches the HTS labels.
- The questions in the question file are used to convert the full-context labels into binary and/or numerical features for vectorization. We suggest selecting the questions manually, because the number of questions determines the dimensionality of the vectorized input features.
- Unlike the HTS question format, the NN system also supports extracting numerical values using ‘CQS‘, e.g., **CQS “Pos_C-Word_in_C-Phrase(Fw)” {:(\d+)+}**, where ‘:‘ and ‘+‘ are separators, and ‘(\d+)‘ is a regular expression that matches a numerical feature.
“Composed” style:
- Direct *vectorized* input: If you prefer to do vectorization yourself, you can feed the system binary files directly. Please prepare your binary files with the following instructions:
- Align the input feature vectors with the acoustic features. Input and output features should have the same number of frames.
- Store the data in binary format with ‘float32‘ precision.
- In the config file, use an empty question file, and set appended_input_dim to be the dimensionality of the input vector.
- Note: voice conversion can use this kind of direct vectorized input.
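As a rough illustration of the direct vectorized input described above, the following sketch writes frames as headerless ‘float32‘ values and recovers the frame count from the file size. It uses only the Python 3 standard library; the function names and the file name are hypothetical and are not part of the toolkit, and native byte order is assumed.

```python
import array
import os
import tempfile


def write_float32_matrix(path, frames):
    """Write a list of equal-length frames as headerless float32 values."""
    dim = len(frames[0])
    flat = array.array("f")  # 'f' is a 4-byte float on common platforms
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("all frames must have the same dimensionality")
        flat.extend(frame)
    with open(path, "wb") as handle:
        flat.tofile(handle)
    return dim


def count_frames(path, dim):
    """Infer the number of frames from the file size (4 bytes per value)."""
    size = os.path.getsize(path)
    if size % (4 * dim) != 0:
        raise ValueError("file size does not match the given dimensionality")
    return size // (4 * dim)


# Hypothetical two-frame, three-dimensional input file.
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "utt001.lab")
    dim = write_float32_matrix(example_path, [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25]])
    n_frames = count_frames(example_path, dim)
```

The same frame count check can be applied to the matching acoustic feature file, since input and output features should have the same number of frames.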
Output Acoustic Features¶
- The default setting assumes that you use the STRAIGHT vocoder (C version). This vocoder is free for academic users. The output includes
- mel-cepstral coefficients (MCC),
- band aperiodicities (BAP),
- fundamental frequency (F0) on a logarithmic scale.
- Please provide the three features in binary format with ‘float32’ precision. In the config file, provide the dimensionality of each feature, for example
- [Outputs] mgc : 60
- [Outputs] dmgc : 180
dmgc is the dimensionality of MCC with delta and delta-delta features. If dmgc is set to 60, only the static features are used. Please also specify the file extension for each feature, for example
- [Extensions] mgc_ext : .mgc
- [Extensions] bap_ext : .bap
- [Extensions] lf0_ext : .lf0
The open-source WORLD vocoder is also supported. The modified version for SPSS can be found in the repository.
If you prefer a different vocoder, please give each of its features a nickname that matches one of the supported ones.
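To make the relationship between the [Outputs] and [Extensions] settings concrete, here is a small sketch that reads such a fragment with Python 3's standard configparser module. This is only an illustration and not necessarily how the toolkit itself parses its config file; the config text is assembled from the example values given in this section.

```python
import configparser

# Example values taken from the settings shown in this section.
CONFIG_TEXT = """
[Outputs]
mgc : 60
dmgc : 180

[Extensions]
mgc_ext : .mgc
bap_ext : .bap
lf0_ext : .lf0
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

static_dim = parser.getint("Outputs", "mgc")
total_dim = parser.getint("Outputs", "dmgc")

# With delta and delta-delta features, the total is three times the static size.
uses_dynamic_features = total_dim == 3 * static_dim
mgc_extension = parser.get("Extensions", "mgc_ext")
```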
Recipes¶
In the system, several recipes for standard neural network architectures are provided. They are described below:
Architecture¶
- The neural network architecture can be changed flexibly by editing the [Architecture] section of the config file:
- hidden_layer_size : [512, 512, 512, 512]
- hidden_layer_type : ['TANH', 'TANH', 'TANH', 'TANH']
- By default, a feedforward neural network is used, but the system supports several types of hidden layers:
- ‘TANH’ : The Hyperbolic Tangent activation function
- ‘RNN’ : The simple but standard recurrent neural network unit
- ‘LSTM’ : The standard Long Short-Term Memory unit
- ‘GRU’ : The gated recurrent unit
- ‘SLSTM’: The simplified LSTM unit
- ‘BLSTM’: The bidirectional LSTM unit
You can define your own architecture by choosing a hidden unit type for each hidden layer. For details of each type of hidden layer, please check the Models section.
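The sketch below shows one way to sanity-check an [Architecture] setting before training. It is a hypothetical helper, not part of the toolkit: it assumes that the two lists must have the same length, and it only knows the layer types listed in this section. The example values describe a hybrid network with feedforward layers at the bottom and one BLSTM layer at the top.

```python
import ast

# Layer types listed in this section.
SUPPORTED_TYPES = {"TANH", "RNN", "LSTM", "GRU", "SLSTM", "BLSTM"}


def parse_architecture(size_text, type_text):
    """Parse the two list-valued settings and pair each size with its type."""
    sizes = ast.literal_eval(size_text)
    types = ast.literal_eval(type_text)
    if len(sizes) != len(types):
        raise ValueError("hidden_layer_size and hidden_layer_type differ in length")
    unknown = [name for name in types if name not in SUPPORTED_TYPES]
    if unknown:
        raise ValueError("unsupported hidden layer types: %s" % unknown)
    return list(zip(sizes, types))


layers = parse_architecture(
    "[512, 512, 512, 384]",
    "['TANH', 'TANH', 'TANH', 'BLSTM']",
)
```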
Deep Feedforward Neural Network¶
An example config file can be found in the ‘./recipes/dnn’ directory. Please use ‘submit.sh ./run_lstm.py ./recipes/dnn/feed_foward_dnn.conf’ to build the feedforward neural network. Please modify the config file to adapt to your own working environment (e.g., data path).
Mixture Density Neural Network¶
(Deep) Long Short-Term Memory (LSTM) based Recurrent Neural Network (RNN)¶
An example config file is provided in ‘./recipes/dnn/hybrid_lstm.conf’. Follow the same recipe as that in the deep feedforward neural network section.
(Deep/Hybrid) Bidirectional LSTM-based RNN¶
Example config files are provided in the ‘./recipes/blstm’ directory. ‘blstm.conf’ is for multiple bidirectional LSTM layers, while ‘hybrid_blstm.conf’ is for a hybrid architecture, which uses several feedforward layers at the bottom and one BLSTM layer at the top.
Models¶
Deep Feedforward/Recurrent Networks¶
-
class
models.deep_rnn.DeepRecurrentNetwork(n_in, hidden_layer_size, n_out, L1_reg, L2_reg, hidden_layer_type, output_type='LINEAR')[source]¶ This class assembles various neural network architectures, from a basic feedforward neural network to bidirectional gated recurrent neural networks and hybrid architectures. Hybrid means a combination of feedforward and recurrent architectures.
-
__init__(n_in, hidden_layer_size, n_out, L1_reg, L2_reg, hidden_layer_type, output_type='LINEAR')[source]¶ This function initialises a neural network
Parameters: - n_in – Dimensionality of input features
- hidden_layer_size (A list of integers) – The layer size for each hidden layer
- n_out (Integer) – Dimensionality of output features
- hidden_layer_type – the activation type of each hidden layer, e.g., TANH, LSTM, GRU, BLSTM
- L1_reg – the L1 regularisation weight
- L2_reg – the L2 regularisation weight
- output_type – the activation type of the output layer; the default is ‘LINEAR’ (linear regression).
- p_dropout – the dropout rate, a float number between 0 and 1.
-
build_finetune_functions(train_shared_xy, valid_shared_xy)[source]¶ This function is to build finetune functions and to update gradients
Parameters: - train_shared_xy (tuple of shared variable) – theano shared variable for input and output training data
- valid_shared_xy (tuple of shared variable) – theano shared variable for input and output development data
Returns: finetune functions for training and development
-
Layers¶
Recurrent Neural Network units¶
-
class
layers.gating.VanillaRNN(rng, x, n_in, n_h)[source]¶ This class implements a standard recurrent neural network: h_{t} = f(W^{hx}x_{t} + W^{hh}h_{t-1}+b_{h})
-
__init__(rng, x, n_in, n_h)[source]¶ This is to initialise a standard RNN hidden unit
Parameters: - rng – random state; use a fixed seed for reproducible objective results
- x – input data to current layer
- n_in – dimension of input data
- n_h – number of hidden units/blocks
-
recurrent_as_activation_function(Wix, h_tm1)[source]¶ Implement the recurrent unit as an activation function. This function is called by self.__init__().
Parameters: - Wix (matrix) – equal to W^{hx}x_{t}; because this term does not depend on the recurrence, it is pre-computed for speed
- h_tm1 (matrix, each row is the hidden activation vector of one time step) – the hidden activation from the previous time step
Returns: h_t, the hidden activation of the current time step
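The recurrence h_{t} = f(W^{hx}x_{t} + W^{hh}h_{t-1}+b_{h}) and the pre-computation of W^{hx}x_{t} can be sketched in plain Python as follows. This is not the Theano code in layers.gating: the function names are hypothetical, and f is assumed to be tanh, which the text above does not specify.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(wx_t, h_tm1, w_hh, b_h):
    """One recurrent step; wx_t is the pre-computed W^{hx} x_t."""
    recurrent = matvec(w_hh, h_tm1)
    # Assumption: f is tanh.
    return [math.tanh(a + r + b) for a, r, b in zip(wx_t, recurrent, b_h)]


def run_rnn(inputs, w_hx, w_hh, b_h):
    """Run the recurrence over a sequence, starting from a zero hidden state."""
    # The input projection does not depend on h, so compute it for all steps first.
    projected = [matvec(w_hx, x_t) for x_t in inputs]
    h_t = [0.0] * len(b_h)
    outputs = []
    for wx_t in projected:
        h_t = rnn_step(wx_t, h_t, w_hh, b_h)
        outputs.append(h_t)
    return outputs


# Tiny example: one input dimension, one hidden unit.
hidden = run_rnn([[0.0], [1.0]], w_hx=[[1.0]], w_hh=[[1.0]], b_h=[0.0])
```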
-
-
class
layers.gating.LstmBase(rng, x, n_in, n_h)[source]¶ This class serves as a base for all long short-term memory (LSTM) related classes. Several variants of LSTM were investigated in (Wu & King, ICASSP 2016): Zhizheng Wu, Simon King, “Investigating gated recurrent neural networks for speech synthesis”, ICASSP 2016
-
__init__(rng, x, n_in, n_h)[source]¶ Initialise all the components of an LSTM block, including the input gate, output gate, forget gate and peephole connections
Parameters: - rng – random state; use a fixed seed for reproducible objective results
- x – input to a network
- n_in (integer) – number of input features
- n_h (integer) – number of hidden units
-
lstm_as_activation_function()[source]¶ A generic recurrent activation function for variants of LSTM architectures. The function is called by self.recurrent_fn().
-
recurrent_fn(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1=None)[source]¶ This implements a generic recurrent function, called by self.__init__().
Parameters: - Wix – pre-computed matrix applying the weight matrix W on the input units, for input gate
- Wfx – Similar to Wix, but for forget gate
- Wcx – Similar to Wix, but for cell memory
- Wox – Similar to Wix, but for output gate
- h_tm1 – hidden activation from previous time step
- c_tm1 – activation from cell memory from previous time step
Returns: h_t is the hidden activation of current time step, and c_t is the activation for cell memory of current time step
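For orientation, the following plain-Python sketch shows how the pre-computed projections Wix, Wfx, Wcx and Wox and the previous states h_tm1 and c_tm1 could combine in one step of an LSTM with peephole connections. It follows the common textbook formulation and is not taken from layers.gating; the parameter dictionary and its key names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lstm_step(wix, wfx, wcx, wox, h_tm1, c_tm1, p):
    """One peephole-LSTM step; returns (h_t, c_t)."""
    ri, rf = matvec(p["W_hi"], h_tm1), matvec(p["W_hf"], h_tm1)
    rc, ro = matvec(p["W_hc"], h_tm1), matvec(p["W_ho"], h_tm1)
    h_t, c_t = [], []
    for k in range(len(h_tm1)):
        # Input and forget gates look at the previous cell state (peepholes).
        i_gate = sigmoid(wix[k] + ri[k] + p["w_ci"][k] * c_tm1[k] + p["b_i"][k])
        f_gate = sigmoid(wfx[k] + rf[k] + p["w_cf"][k] * c_tm1[k] + p["b_f"][k])
        candidate = math.tanh(wcx[k] + rc[k] + p["b_c"][k])
        c_k = f_gate * c_tm1[k] + i_gate * candidate
        # The output gate looks at the updated cell state.
        o_gate = sigmoid(wox[k] + ro[k] + p["w_co"][k] * c_k + p["b_o"][k])
        c_t.append(c_k)
        h_t.append(o_gate * math.tanh(c_k))
    return h_t, c_t


# One hidden unit with all parameters set to zero, so every gate equals 0.5.
zero_params = {name: [[0.0]] for name in ("W_hi", "W_hf", "W_hc", "W_ho")}
zero_params.update({name: [0.0] for name in
                    ("w_ci", "w_cf", "w_co", "b_i", "b_f", "b_c", "b_o")})
h_t, c_t = lstm_step([0.0], [0.0], [0.0], [0.0], [0.0], [2.0], zero_params)
```

In this formulation, the variants described below correspond to removing one of the gates or the peephole terms from the step.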
-
-
class
layers.gating.VanillaLstm(rng, x, n_in, n_h)[source]¶ This class implements the standard LSTM block, inheriting the generic class
layers.gating.LstmBase.
-
__init__(rng, x, n_in, n_h)[source]¶ Initialise a vanilla LSTM block
Parameters: - rng – random state; use a fixed seed for reproducible objective results
- x – input to a network
- n_in (integer) – number of input features
- n_h (integer) – number of hidden units
-
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]¶ This function treats the LSTM block as an activation function, and implements the standard LSTM activation function. The meaning of each input and output parameter is described in
layers.gating.LstmBase.recurrent_fn()
-
-
class
layers.gating.LstmNFG(rng, x, n_in, n_h)[source]¶ This class implements an LSTM block without the forget gate, inheriting the generic class
layers.gating.LstmBase.
-
__init__(rng, x, n_in, n_h)[source]¶ Initialise an LSTM without the forget gate
Parameters: - rng – random state; use a fixed seed for reproducible objective results
- x – input to a network
- n_in (integer) – number of input features
- n_h (integer) – number of hidden units
-
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]¶ This function treats the LSTM block as an activation function, and implements the LSTM (without the forget gate) activation function. The meaning of each input and output parameter is described in
layers.gating.LstmBase.recurrent_fn()
-
-
class
layers.gating.LstmNIG(rng, x, n_in, n_h)[source]¶ This class implements an LSTM block without the input gate, inheriting the generic class
layers.gating.LstmBase.
-
__init__(rng, x, n_in, n_h)[source]¶ Initialise an LSTM without the input gate
Parameters: - rng – random state; use a fixed seed for reproducible objective results
- x – input to a network
- n_in (integer) – number of input features
- n_h (integer) – number of hidden units
-
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]¶ This function treats the LSTM block as an activation function, and implements the LSTM (without the input gate) activation function. The meaning of each input and output parameter is described in
layers.gating.LstmBase.recurrent_fn()
-
-
class
layers.gating.LstmNOG(rng, x, n_in, n_h)[source]¶ This class implements an LSTM block without the output gate, inheriting the generic class
layers.gating.LstmBase.
-
__init__(rng, x, n_in, n_h)[source]¶ Initialise an LSTM without the output gate
Parameters: - rng – random state; use a fixed seed for reproducible objective results
- x – input to a network
- n_in (integer) – number of input features
- n_h (integer) – number of hidden units
-
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]¶ This function treats the LSTM block as an activation function, and implements the LSTM (without the output gate) activation function. The meaning of each input and output parameter is described in
layers.gating.LstmBase.recurrent_fn()
-
-
class
layers.gating.LstmNoPeepholes(rng, x, n_in, n_h)[source]¶ This class implements an LSTM block without the peephole connections, inheriting the generic class
layers.gating.LstmBase.
-
__init__(rng, x, n_in, n_h)[source]¶ Initialise an LSTM without the peephole connections
Parameters: - rng – random state; use a fixed seed for reproducible objective results
- x – input to a network
- n_in (integer) – number of input features
- n_h (integer) – number of hidden units
-
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]¶ This function treats the LSTM block as an activation function, and implements the LSTM (without the peephole connections) activation function. The meaning of each input and output parameter is described in
layers.gating.LstmBase.recurrent_fn()
-
-
class
layers.gating.SimplifiedLstm(rng, x, n_in, n_h)[source]¶ This class implements a simplified LSTM block which only keeps the forget gate, inheriting the generic class
layers.gating.LstmBase.
-
__init__(rng, x, n_in, n_h)[source]¶ Initialise a simplified LSTM block that keeps only the forget gate
Parameters: - rng – random state; use a fixed seed for reproducible objective results
- x – input to a network
- n_in (integer) – number of input features
- n_h (integer) – number of hidden units
-
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)[source]¶ This function treats the LSTM block as an activation function, and implements the simplified LSTM activation function. The meaning of each input and output parameter is described in
layers.gating.LstmBase.recurrent_fn()
-
-
class
layers.gating.GatedRecurrentUnit(rng, x, n_in, n_h)[source]¶ This class implements a gated recurrent unit (GRU), as proposed in Cho et al 2014 (http://arxiv.org/pdf/1406.1078.pdf).
Utils¶
Data Provider¶
-
class
utils.providers.ListDataProvider(x_file_list, y_file_list, n_ins=0, n_outs=0, buffer_size=500000, sequential=False, shuffle=False)[source]¶ This class provides an interface to load data into CPU/GPU memory utterance by utterance or block by block.
In speech synthesis, it is usually not possible to load all the training/evaluation data into RAM, so the following three steps are used:
- Step 1: the data provider loads part of the data into a buffer.
- Step 2: the DNN is trained on the data in the buffer.
- Step 3: steps 1 and 2 are repeated until all the data have been used for DNN training. At that point, one epoch of DNN training is finished.
The utterance-by-utterance data loading will be useful when sequential training is used, while block-by-block loading will be used when the order of frames is not important.
This provider assumes binary format with float32 precision without any header (e.g. HTK header).
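The block-by-block loading described above can be sketched with a generator that fills a buffer from headerless float32 files and yields it once it is full. This is a simplified illustration in standard-library Python 3, not the ListDataProvider implementation; the function names are hypothetical, and native byte order is assumed.

```python
import array
import random


def read_float32_frames(path, dim):
    """Read a headerless float32 file and split it into frames of `dim` values."""
    values = array.array("f")
    with open(path, "rb") as handle:
        values.frombytes(handle.read())
    if len(values) % dim != 0:
        raise ValueError("file size is not a multiple of the frame dimension")
    return [list(values[i:i + dim]) for i in range(0, len(values), dim)]


def split_pairs(pairs):
    """Turn a list of (x, y) frame pairs into separate x and y lists."""
    return [x for x, _ in pairs], [y for _, y in pairs]


def iterate_blocks(x_files, y_files, n_ins, n_outs, buffer_size, shuffle=True):
    """Yield (inputs, outputs) blocks of at most `buffer_size` frames."""
    buffer = []
    for x_path, y_path in zip(x_files, y_files):
        x_frames = read_float32_frames(x_path, n_ins)
        y_frames = read_float32_frames(y_path, n_outs)
        if len(x_frames) != len(y_frames):
            raise ValueError("input and output frame counts differ")
        buffer.extend(zip(x_frames, y_frames))
        while len(buffer) >= buffer_size:
            block, buffer = buffer[:buffer_size], buffer[buffer_size:]
            if shuffle:
                random.shuffle(block)  # frame order is irrelevant in this mode
            yield split_pairs(block)
    if buffer:  # last, partially filled block
        if shuffle:
            random.shuffle(buffer)
        yield split_pairs(buffer)
```

One pass over the generator corresponds to one epoch: each yielded block would be used for a round of training before the next block is loaded.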
-
__init__(x_file_list, y_file_list, n_ins=0, n_outs=0, buffer_size=500000, sequential=False, shuffle=False)[source]¶ Initialise a data provider
Parameters: - x_file_list (python list) – list of file names for the input files to DNN
- y_file_list – list of files for the output files to DNN
- n_ins – the dimensionality for input feature
- n_outs – the dimensionality for output features
- buffer_size – the size of the buffer, indicating the number of frames in the buffer. The value depends on the memory size of RAM/GPU.
- shuffle – True/False. Indicates whether the file list will be shuffled. When loading data block by block, the data in the buffer are shuffled regardless of this value.
-
load_next_partition()[source]¶ Load one block of data. The number of frames will be the buffer size set during initialisation.
-
load_next_utterance()[source]¶ Load the data for one utterance. This function will be called when utterance-by-utterance loading is required (e.g., sequential training).
Makes the data shared for the Theano implementation. To learn why the data are made shared, please refer to the Theano documentation: http://deeplearning.net/software/theano/library/compile/shared.html
Parameters: - data_set – normal data in CPU memory
- data_name – indicate the name of the data (e.g., ‘x’, ‘y’, etc)
Returns: shared dataset – data_set
Front-end¶
Label normalisation¶
-
class
frontend.label_normalisation.HTSLabelNormalisation(question_file_name=None, subphone_feats='full', continuous_flag=True)[source]¶ This class converts HTS format labels into continuous or binary values, and stores them in binary format with float32 precision.
- The class supports two kinds of questions: QS and CQS.
QS: is the same as that used in HTS
CQS: is the newly defined question type in the system. Here is an example of the question: CQS C-Syl-Tone {_(\d+)+}. A regular expression is used to extract continuous values.
Time alignments are expected in the HTS labels. Here is an example of the HTS labels:
3050000 3100000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[2]
3100000 3150000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[3]
3150000 3250000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[4]
3250000 3350000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[5]
3350000 3900000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[6]
In the first line, 3050000 and 3100000 are the start and end times. [2], [3], [4], [5], [6] are the HMM state indices.
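As an illustration of the label layout above, the following sketch splits one state-aligned label line into its start time, end time, context string and state index. It is not the toolkit's parser. It relies on HTK label times being expressed in units of 100 ns, and the 5 ms frame shift (50000 units) is an assumption made only for this example.

```python
import re

# start time, end time, full-context string, optional [state] suffix
LABEL_PATTERN = re.compile(r"^(\d+)\s+(\d+)\s+(\S+?)(?:\[(\d+)\])?$")
HTK_UNITS_PER_SECOND = 10000000  # HTK label times are in units of 100 ns


def parse_label_line(line, frame_shift=50000):
    """Parse one label line; frame_shift is an assumed 5 ms in HTK units."""
    match = LABEL_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a state-aligned HTS label line")
    start, end = int(match.group(1)), int(match.group(2))
    state = int(match.group(4)) if match.group(4) else None
    return {
        "start_seconds": start / HTK_UNITS_PER_SECOND,
        "end_seconds": end / HTK_UNITS_PER_SECOND,
        "frames": (end - start) // frame_shift,
        "context": match.group(3),
        "state": state,
    }


first_line = (
    r"3050000 3100000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i"
    r"/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%"
    r"/I/0_0/J/4+3-1[2]"
)
parsed = parse_label_line(first_line)
```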
-
wildcards2regex(question, convert_number_pattern=False)[source]¶ Convert an HTK-style question into a regular expression for searching labels. If convert_number_pattern is set, keep the following sequences unescaped for extracting continuous values:
(\d+) – handles digits without a decimal point; ([\d\.]+) – handles digits with and without a decimal point
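The idea behind this conversion can be sketched as follows. This is a simplified, hypothetical re-implementation for illustration and may differ from the toolkit's wildcards2regex in details such as anchoring. It assumes the usual wildcard convention in which ‘*‘ matches any sequence of characters and ‘?‘ matches a single character.

```python
import re

# Capture groups that stay unescaped when extracting continuous values.
NUMBER_PATTERNS = (r"(\d+)", r"([\d\.]+)")


def wildcards_to_regex(question, convert_number_pattern=False):
    """Turn a wildcard pattern into a regular expression string."""
    escaped = re.escape(question)
    # Map the escaped wildcards back to their regular-expression equivalents.
    escaped = escaped.replace(re.escape("*"), ".*").replace(re.escape("?"), ".")
    if convert_number_pattern:
        # Restore the capture groups that re.escape() has just escaped.
        for pattern in NUMBER_PATTERNS:
            escaped = escaped.replace(re.escape(pattern), pattern)
    return escaped


# QS-style pattern: does the current phone equal "p"?
qs_regex = wildcards_to_regex("*-p+*")
is_p = re.fullmatch(qs_regex, "xx~#-p+l=i:1_4") is not None

# CQS-style pattern: extract the number between ':' and '+'.
cqs_regex = wildcards_to_regex(r":(\d+)+", convert_number_pattern=True)
value = int(re.search(cqs_regex, "content+1:1+3&1+2").group(1))
```

A QS match would typically produce a binary feature, while the captured CQS group produces a numerical one.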