Recent advancements in technology have made us dependent on machines for accomplishing daily tasks. However, breaches of highly sensitive data have raised awareness that information accessible online is not safe. Some of these breaches involve human interaction, which has led the scientific community to search for alternative methods to minimize the risk of unauthorized access to personal data. One field that has gained considerable attention is biometrics. Biometrics involves the use of physical characteristics of the human body that are unique to a person, such as fingerprints, the iris, and the voice, in order to verify that person's identity.
1.1 Problem Statement
To the listener, the speech signal carries many levels of information. While speech conveys a message through words, it also carries information about the gender, emotion, language, and, more generally, the identity of the speaker [Reyn02]. This section discusses the motivation behind the use of speaker recognition and the general categories and tasks associated with it, followed by a discussion of the aim of this work and the proposed solution.
With the growing number of services accessed via telephone, web, or mobile apps, maintaining and remembering the multiple passwords, PINs, and authentication details required to gain remote access to accounts has become challenging, especially since security experts encourage the use of different credentials for different accounts.
Meanwhile, with the currently existing infrastructure, speaker identity is a biometric that can easily be tested in remote-access applications [Beig11]. This makes speaker recognition valuable for many real-world applications.
Speaker recognition is the identification of a speaker based on the words they speak, and it can be divided into two categories: text-dependent and text-independent. Text-dependent recognition requires the speaker to say the same words that were used for feature extraction, whereas text-independent recognition can identify the speaker regardless of the words spoken [ChLu09]. Text-dependent speaker recognition has prior knowledge of the text to be spoken [Reyn02], while text-independent speaker recognition relies on the physiological characteristics of the speaker and makes no assumption about the content of the speech [Beig11].
Speaker recognition can be divided into two general tasks, namely speaker identification and speaker verification [Reyn02]. Speaker identification is the determination of who the speaker is from a group of known voices or speakers. Speaker verification is the determination of whether a person is who he/she claims to be.
1.1.3 Thesis Aim
The aim of this work is to present an experimental evaluation of feature extraction techniques that could be used for text-independent speaker verification. Feature extraction is the process of extracting speaker-specific properties from the raw signal and storing them in a feature vector [KiLi10]. The speech signal contains many features, not all of which are necessarily important for speaker verification. A good feature should have the following characteristics [Rose02]:
· Discriminate between speakers while exhibiting small within-speaker variability;
· Be robust against noise;
· Occur frequently and naturally in speech;
· Be easy to extract from the speech signal;
· Not be susceptible to mimicry;
· Be stable over time and unaffected by the speaker's health.
Meanwhile, the number of features should also be considered, since the number of training samples required for reliable density estimation grows exponentially with the number of features [JaDM00]. Moreover, lower-dimensional features also yield computational savings.
1.1.4 Proposed Solution
One of the main structures of any speaker verification system is the front-end processing. Front-end processing generally consists of some form of voice activity detection (VAD) to remove the non-speech sections of the signal, followed by the extraction of features that carry the speaker's identity from the speech signal [Reyn02]. The extracted feature vectors are then used either to build a model of the speaker or to test against an existing model and decide whether the person is who he/she claims to be.
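The two front-end stages described above can be sketched as follows. This is a minimal illustration only: the frame size, hop size, and the simple energy threshold are illustrative placeholders (the system developed in this thesis uses an FD-based VAD instead of an energy threshold), and the toy signal stands in for a real recording.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (400 samples ~ 25 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def energy_vad(frames, threshold=0.01):
    """Keep only frames whose mean-square energy exceeds a fixed threshold."""
    energy = np.mean(frames ** 2, axis=1)
    return frames[energy > threshold]

# Toy signal: 0.5 s of silence followed by 0.5 s of a 200 Hz tone at 16 kHz.
t = np.arange(16000) / 16000.0
x = np.concatenate([np.zeros(8000), 0.5 * np.sin(2 * np.pi * 200 * t[:8000])])

speech_frames = energy_vad(frame_signal(x))  # silent frames are discarded
```

Feature extraction would then operate on `speech_frames` only, one feature vector per frame.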
Before proceeding to the front-end processing, however, a speech signal is required. A dataset consisting of 44 words was recorded using 12 male and 12 female volunteers raised in the province of Manitoba. These 44 words were chosen because they provide enough speech data to build a model while remaining quick to record, making them practical for a real-world application. Moreover, drawing the volunteers from a specific geographical location limits the variety of accents and speaking styles, so that the analysis is based on the physiological factors of the speaker.
The traditional approach to the speaker recognition problem involved the use of linear methods. However, the process of speech production is nonlinear [NeMM06]: speech has nonlinear characteristics, and its multifractal nature has been demonstrated [LaSK97]. A VAD based on the fractal dimension (FD) is therefore used to separate the non-speech segments of the signal. FD is chosen because its estimation is based on signal complexity rather than on amplitude.
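As an illustration of FD estimation, the following is a minimal sketch of Higuchi's algorithm (the HFD used in this work); the normalization details and the choice of `kmax` are simplifications, and Chapter 3 gives the full treatment. The estimate approaches 1 for a smooth curve and 2 for a highly irregular one such as white noise, which is what makes it useful for separating speech from noise-like segments independently of amplitude.

```python
import numpy as np

def higuchi_fd(x, kmax=8):
    """Estimate the Higuchi fractal dimension of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    lk = []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):
            idx = np.arange(m, N, k)
            # Normalized curve length of the subsequence starting at offset m
            L = np.sum(np.abs(np.diff(x[idx]))) * (N - 1) / ((len(idx) - 1) * k * k)
            lengths.append(L)
        lk.append(np.mean(lengths))
    # The FD is the slope of log L(k) versus log(1/k)
    k_arr = np.arange(1, kmax + 1)
    slope, _ = np.polyfit(np.log(1.0 / k_arr), np.log(lk), 1)
    return slope

fd_line = higuchi_fd(np.linspace(0.0, 1.0, 1000))  # a straight line gives FD = 1
```

Note that the amplitude of the input cancels out of the slope, consistent with the motivation above.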
Fusion is the combination of information from multiple sources [KiLi10]; here it is used to combine nonlinear methods with the traditional methods to form the feature vectors. The features used to form the feature vectors are the linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), Higuchi fractal dimension (HFD), variance fractal dimension (VFD), zero crossing rate (ZCR), and turns count (TC). The theory and implementation of these algorithms are fully discussed in Chapter 3, and the motivation behind using them is discussed in Section 5.4.
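At the feature level, fusion amounts to concatenating the outputs of several extractors into one vector per frame. The sketch below shows this for two of the simpler features named above, ZCR and TC; LPCC, MFCC, and the FD features would be appended the same way, and the exact definitions used in this work are those of Chapter 3.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))

def turns_count(frame):
    """Number of local extrema (slope sign changes) in the frame."""
    d = np.diff(frame)
    return np.sum(np.sign(d[:-1]) != np.sign(d[1:]))

def fuse(frame, feature_fns):
    """Concatenate the outputs of several feature extractors into one vector."""
    return np.hstack([np.atleast_1d(fn(frame)) for fn in feature_fns])

# One frame of a 5 Hz sine over 1 s (200 samples)
frame = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 200, endpoint=False))
vec = fuse(frame, [zero_crossing_rate, turns_count])  # 2-dimensional fused vector
```

Adding or removing an extractor from the list directly changes the fused vector, which is what makes the per-feature sensitivity analysis of Chapter 5 straightforward.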
Upon extraction of the feature vectors, a support vector machine (SVM) is used to build a model of the speaker and test it against unseen data. The SVM is chosen because of the availability of different kernel functions suitable for different types of features, and because highly optimized libraries are available.
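The verification step can be sketched as a two-class problem, target speaker versus impostor. The example below uses scikit-learn's SVM as one such optimized library; the Gaussian-blob "feature vectors", class sizes, kernel, and cost parameter are all illustrative stand-ins, not the configuration used in this thesis (kernel and cost selection is itself one of the research questions).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-ins for per-utterance fused feature vectors of two classes
rng = np.random.default_rng(42)
X_target = rng.normal(loc=0.0, scale=0.5, size=(40, 6))    # "target speaker"
X_impostor = rng.normal(loc=2.0, scale=0.5, size=(40, 6))  # "impostor"
X = np.vstack([X_target, X_impostor])
y = np.array([1] * 40 + [0] * 40)

# Hold out part of the data, train an RBF-kernel SVM, and score on unseen data
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

The held-out test set is what guards against overtraining: the reported accuracy comes only from vectors the model never saw.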
1.2 Thesis Formulation
This thesis comprises three portions: the recording of a dataset, front-end processing, and classification. The next section discusses the thesis statement, followed by the thesis objectives and the research questions.
1.2.1 Thesis Statement
The core of this thesis is to assess the suitability of embedding fractal methods, motivated by the nonlinear nature of speech, in the front-end processing of a speaker verification system, and to investigate the effectiveness of these methods. Before proceeding to the front-end processing of any speaker verification system, however, a speech signal is required. Therefore, volunteers are recorded and the acquired signals are stored in a repository. Moreover, a detailed description of the recording procedures is provided to serve as a guide and to ensure the repeatability of the recordings.
1.2.2 Thesis Objective
There are three main objectives in this thesis:
1. Recording of participating volunteers and establishment of a dataset that can be used for the study of text-independent speaker verification by:
a) Designing a set of protocols to ensure the quality and repeatability of the recordings;
b) Storing the dataset in a repository accessible to researchers to allow further study in the field of speaker recognition.
2. Study the suitability of using FD in the front-end processing of a speaker verification system by:
a) Comparing the HFD and the VFD using test data;
b) Studying the effects of the addition of colored noise on the estimation of FD using the HFD and the VFD algorithms;
c) Implementing a VAD based on the FD of the speech signal.
3. Assess the effectiveness of the VAD algorithm and of each algorithm in the fusion of features by:
a) Dividing the data into training and testing sets;
b) Extracting the features from the speech part of the signal and forming multiple feature vectors based on the fusion of different features, to assess the effect of each algorithm;
c) Building a training model from the training data for each combination of feature vectors and using the testing data to measure the classification accuracy;
d) Extracting the same feature vectors using a different VAD and comparing the accuracy results.
1.2.3 Research Questions
The goal of this thesis is robust feature extraction from speech that could be used in a text-independent speaker verification system. To achieve this goal, however, a number of research questions arise, which are addressed below:
1. What set of test words is practical for a real-world application while containing enough data for speaker verification?
2. How to record these test words to ensure quality, repeatability, and similarity of all the recordings?
3. How and where to store the recorded dataset to allow further research using the dataset?
4. Many algorithms exist for estimating the fractal dimension; which one is most suitable for text-independent speaker verification?
5. What are the effects of noise on FD estimation?
6. Can FD be used for VAD?
7. Will using FD for VAD improve the performance of the speaker verification system in comparison to other algorithms?
8. Does fusion of multiple algorithms increase the accuracy of speaker verification?
9. How can the accuracy be compared with the literature if the dataset is different?
10. What is the effect of the addition of each feature to the feature vector on the accuracy of speaker verification?
11. Which fusion of features is most appropriate for speaker verification?
12. How to divide the training and the testing data to avoid overtraining the SVM?
13. Which kernel and cost function should be chosen for the extracted feature vectors?
1.3 Thesis Organization
This thesis presents a study of text-independent speaker verification with the aim of embedding fractal algorithms in the front-end processing of the speaker verification system. The thesis consists of six chapters.

Chapter 2 presents the fundamental background for this study. It discusses the physiology of speech production and perception, the phonetics of speech, and some of the fundamental methods needed for speech processing.

Chapter 3 presents the algorithms used in this thesis. It discusses the algorithms used to generate colored noise, the test data (the Weierstrass function and fractional Brownian motion), the feature extraction methods (LPCC, MFCC, HFD, VFD, ZCR, and TC), and the classifier (SVM).

Chapter 4 presents the procedures for recording the dataset. It discusses the hardware and software used to ensure the quality and similarity of all the recordings. Moreover, the list of English phonemes chosen for this study and the demographics of the speakers are presented, followed by the recording environment and the set of protocols followed to ensure the repeatability of all the recordings.

Chapter 5 presents the design of experiments and the analysis of the results. It discusses the results of the HFD and the VFD on the test data, tests the effects of colored noise on FD estimation, introduces a VAD algorithm based on FD, introduces the feature vectors used for the experimental sensitivity analysis, and classifies the feature vectors to measure the accuracy.

Chapter 6 concludes the study. It summarizes the results and findings, answers the research questions with the reasoning behind each answer, and provides suggestions and recommendations for future work.