CHAPTER 2
AUDIO FEATURE EXTRACTION IN SPEECH PROFILING FOR QURANIC SEMANTIC AUDIO

2.1. Introduction

Speech recognition has been widely studied for many languages, not only English but also others, including the Arabic recitation of the Holy Al-Quran. The recitation of Al-Quran differs considerably from ordinary Arabic reading in both pronunciation and meaning. Without proper knowledge and learning of the Arabic language, misunderstanding or misinterpretation may occur when one reads and studies the Qur'an, especially when the understanding must be derived from the context of the Qur'anic verses (Ramli et al., 2018). A recent study shows that any attempt to comprehend the Quran using only its linguistic meaning may lead to misreading and misinterpretation, because the elevated Arabic of the Qur'an carries language properties, sound symbolism, linguistic forms, rhyme, word play, irony, and metaphors that exceed standard Arabic (Hedayat, 2013).

This chapter starts with an overview of past and present work on speech profiling. It then gives a brief introduction to Quranic maqamat and the sound elements relevant to this work, after which it describes the basic concept of audio feature extraction and its application to the complex cepstrum. This is followed by the techniques used in audio feature extraction for profiling the Quranic audio signal, and the chapter ends with an overview of semantic audio analysis for Quranic search.

2.2. Quranic Maqamat Recitation (QMR)

The Qur'an contains scientific knowledge ranging from basic arithmetic to the most complicated concepts. QMR, in turn, represents the emotions conveyed by the reciters while reciting the Holy Quran. This section elaborates the background, types, and acoustic elements contained in QMR.

2.2.1. Overview of Quranic Maqamat

Reading or reciting the Qur'an is fundamental to all Muslims because its content provides comprehensive guidance for every aspect of daily life. Muslims are therefore urged to recite it with a certain melody, beautifying the voice so that particular emotions are felt during the recitation. In the study of maqamat, learning Quranic recitation means beautifying the voice in reciting the Quran. The practice originates from the Arabic maqam technique, which grew out of the improvised patterns, pitches, and development of Arabian musical art. A recent study by Albakri (2020) places the maqams at the centre of Arabic Islamic musical culture; historical and academic references are quoted there to establish the place of the maqams in the music of Islam and, more specifically, in the performance of the Holy Quran, where meaningful accents have been pointed out over the tajweed and tartil in its performing art (Albakri, 2020). Like these varied melodies, the verses of the Quran vary widely in topic and event, generating different feelings in the listener. Maqamat, the plural of maqam, is a set of pitches with characteristic melodic elements and a traditional pattern of use.

2.2.2. Types of Maqamat

Approximately fifty maqamat exist, but only a few are widely used in Quranic recitation; among them are the maqamat of Bayyati, Hijaz, Rast, Jiharkah, Nahawand, Soba, and Sikah. The maqam names originate from many Arabic and Turkish names, most of which spread throughout the Middle East in the formative period of the Persian musical system (Nettl, 2007).
The next section provides a detailed description of each maqam and its features, and Table 2.1 summarizes these descriptions together with the themes each maqam suits.

1. Maqam Bayyati

Maqam Bayyati is the prime maqam in maqamat-based Quranic recitation. It serves as the opening and closing of the rhythm in the recitation. In Malaysia, the qaris and qariahs have popularised this tarannum in their recitations, placing Maqam Bayyati as the top maqam that acts as a stimulus during the recitation performance (M. Zaini et al., 2018). According to Drs. Muhsin Salim, the word Bayyati came from the Arabic word ( بیت ), which means "house"; it was subsequently used in the mubalaghah form ( بیات ), and the letter ya (ي) was added, forming the word Bayyati (Lembaga Bahasa Dan Ilmu Al-Quran, 1987). In this theory the word literally means "house", which links to the belief that Bayyati is the basis of any art form of Quranic rhythm, since it is used at the beginning and the end of each Quranic recitation. Other theories claim that the word Bayyati derives from a place or area in Iraq, the letter ya' having been added to relate the name to that area. This view comes from Mohd Ali b. Abu Bakar, who sourced it from Dr. Sayid Agil Hussain al-Munawwar, an Indonesian expert in Quranic rhythm (A. Bakar, 1997). However, neither theory about the origins of Bayyati can be officially recognized as accurate and valid, because the word is used in the daily conversation of locals in the Malay Archipelago rather than within the Arabic community (M. Zaini, 2009).

Bayyati has a wide scope for dividing and conveying the meaningful content of a Quranic recitation. It offers twelve melodic styles of reciting across four levels of intonation, known as Qarar, Nawa, Jawab, and Jawab al-Jawab. Bayyati can also be combined with other melodies of similar tone, forming the variations Syuri, Husainiy, 'Ajam, and Kurdi (M. Munir, 2005). In general, Bayyati has a unique, soft, scarcely audible, low tone with a sharp pitch; at times high, low, and normal tones are all used. By combining these tones, the maqam becomes flexible, easily accepted, and wide in coverage (Nik Jaafar, 1998).

2. Maqam Hijaz

The word Hijaz refers to one of the regions of the Arabian Territory, although no specific and detailed information identifies who linked the name to that particular place. A reliable source nevertheless states that the maqam originated and evolved in Hijaz (Nik, 1998). It is said that Hijaz was first introduced by the qaris of the communities living in that scarce, desert land; the melody was later brought to Egypt and localized in keeping with the greenery of the Nile Valley (A. Bakar, 1997). The features of Hijaz are a light and fast melody with a fairly strict rhythm, recited in a loud and clear voice. It was later softened into a more harmonious form called Hijaz Misri, whose beautiful melody leaves the listener more mesmerized (M. Zaini, 2009). In addition, Hijaz can be combined with various maqamat of similar tone.
This combination produces variations that form part of Hijaz, namely Kard, Kurd, Kard Kurd, and Nakriz (M. Munir, 2005). Maqam Hijaz has several characteristic features: it is recited in a slow but very effective mode, with a strict and hard rhythm. It can be adopted by any tabaqah of voice and is used effectively for verses of command, assertiveness, and reminder (M. Zaini et al., 2018).

3. Maqam Jiharkah

Like Rast, Jiharkah originated in Persia and was modified by experts in the field of Quranic art song from Hijaz and Egypt, who blended it with their own culture. The maqam was then adopted by qaris and qariahs all over the world, although other theories claim that it originated on the African continent (Nik, 1998). Jiharkah has a minor rhyme and rhythm, and its character depends on whether the qari recites at a slow or fast pace. Jiharkah has only one variation, namely Kurdi, and it can be recited in two intonations, tabaqat Nawa or Jawab and Jawab al-Jawab, across four rhythm tones (M. Munir, 2005). This maqam has its own uniqueness: it is easy and fast, effective, and suitable for a moderate tabaqah. Its roles and functions are to relieve tension in any recitation, to give fluency, notably for sad and emotional sentences, to provide accuracy in pronouncing letters, words, and sentences, and to support concentration and self-caution (A. Bakar, 1997).

4. Maqam Nahawand

Maqam Nahawand takes its name from the word Nahawand ( نھاوند ), which refers to a place in Hamadan, a district in Iran (Persia). Nahawand was adopted and modified by Egyptian qaris to localize it (Nik Jaafar, 1998). Nahawand has five types of rhythm harakat and two tone tabaqah, namely Jawab and Jawab al-Jawab. It can be combined with four different types of song, namely Asli, Nakriz, 'Usyaq, and Murakkab (M. Munir, 2005). The Nahawand rhythm is fast and light, with a soft and harmonious tone, which makes its rhyme more appealing, mesmerizing, and attractive. A high tone is required of qaris and qariahs who intend to adopt this maqam, moving a moderate tabaqah from the lowest to the highest tone while sustaining vibration and good voice control (M. Zaini, 2009). Nahawand is considered easy, soft, and touching, suitable for a moderate tabaqah. Its emotional effects include calm, focus, and awareness, which makes it suitable for both happy and sad verses (A. Bakar, 1997).

5. Maqam Rast

There are several views on the origins of Rast. The first, from Dr. Sayid Agil Husain Munawwar, states that Rast originated from a city called Rast ( راس ); as with maqam Hijaz, which eventually became widely adopted, the letter Ta (ت) was added to give ( راست ) (A. Bakar, 1997). In the second view, Haji Nik Ja'far b. Nik Ismail (1998) states that Rast comes from Persian itself and was modified by the experts of Hijaz and Egyptian art song according to their lahjah and culture; today Rast is well known among qaris and qariahs all over the world. Unlike the above views, Dr. Muhsin Salim, an expert in maqamat from the Quranic Institution of Jakarta, holds that Rast ( راست ) came from the Arabic phrase ( ذا رست ) or ( ھذا رست ), which leads to the word Rasydah ( رشدة ) or Rast (Lembaga Bahasa Dan Ilmu Al-Quran, 1987).
Rast is divided into seven types of harakat and two tabaqat intonations, Jawab and Jawab al-Jawab. Furthermore, Rast can be combined with three variations of maqamat, namely Usyaq, Zanjiran, and Syabir 'Ala Rast (M. Munir, 2005). According to Mohd Zaini (2009), the features of Rast include easy movement and a fast, enthusiastic character; it can be applied with any tabaqah and is suitable for use with any type of maqamat. The functions of Rast are essentially to provide the essence of the maqamat as a whole and to stimulate the other maqamat that may follow. To listeners, Rast is comforting and stimulates the soul. It also helps reciters achieve accuracy and fluency in pronunciation and in articulating the letters.

6. Maqam Sikah

Maqam Sikah is a rare member of the Sikah family. Its scale starts with the root Jins Sikah on the tonic, followed by Jins Upper Rast on the 3rd degree (with its tonic on the 6th degree), then Jins Rast on the 6th degree (which is a secondary ghammaz). Like Jiharkah, maqam Sikah, whose name is said to mean "guitar strumming", originated in Persia. The maqam was modified by experts in Quranic art song from Hijaz and Egypt according to their own culture, and was popularized by expert reciters around the world from the 7th century until the 19th century AD (Nik, 1998). Full concentration is required of any qari or qariah who wishes to adopt this maqam, since it is difficult to perform perfectly. Sikah can produce up to six types of harakat and two types of tabaqah, Jawab and Jawab al-Jawab, and it can be combined with four song variations, namely Asli, Turkiy, Raml, and Iraqiy (M. Munir, 2005). Among the features of this maqam are that it is soft, graceful, and harmonious, suitable for a higher tabaqah. The functions of Sikah are to soften the reading and give greater satisfaction to reciters and listeners, making it suitable for verses of desire, solace, and reliance (A. Bakar, 1997).

7. Maqam Soba

According to experts in Quranic art song, maqam Soba originated from one of the areas of Syria, a theory supported by Dr. Sayid Agil Husain al-Munawwar, whereas Nik Ja'far b. Nik Ismail claims it may have come from an Egyptian melody, in which case Soba is part of Bayyati and known as Soba Mesir (A. Bakar, 1997). Soba is divided into five types of harakat intonation and can be combined with three melodic variations, among them 'Ajami, Mahur or Muhur, and Bastanjar (M. Munir, 2005). This maqam conveys a sad feeling that expresses the desire for resolution and assistance. In comparison with Bayyati and Hijaz, which use high and low tones, Soba has a monotonous, high, and fast rhythm; its uniqueness lies in its harmonious, rhythmic, and slow pitch (Nik, 1998). Some of the features of Soba are a soft and fast recitation with lighter, rhythmic tones, suitable for a moderate tabaqah. The role of Soba is to bring peace and relieve stress, so that a person becomes more focused in life and aware of their mistakes, making it appropriate for verses describing self-happiness, sadness, and tranquility. The recitation therefore becomes smoother and more fluent (A. Bakar, 1997).

Table 2.1 Summary of maqamat types with their features and suitable themes.

Maqam    | Features                                                    | Suitable Theme
Bayyati  | Unique, soft, scarcely audible, low tone with a sharp pitch | Greetings, opening and closing remarks
Hijaz    | Slow mode but very effective, strict and hard rhythm        | Command, assertiveness, and reminder
Jiharkah | Soft, graceful, and harmonious                              | Awareness, happy and sad
Nahawand | Easy, calm, soft, and moderate                              | Awareness, happy and sad
Rast     | Easy movement, fast and enthusiastic                        | All types
Sikah    | Soft, graceful, and harmonious                              | Desire, solace, and reliance
Soba     | Harmonious, rhythmic, and slow pitch                        | Self-happiness, sadness, and tranquility

2.2.3. Characteristics and Sound Elements in QMR

The Quranic maqamat, also called 'tarannum', define a set of melodic patterns used by reciters in their recitation, together with proper articulation and the rhythm of tajweed. For instance, maqam Rast is said to evoke pride, power, soundness of mind, and masculinity; Bayyati relates to vitality and joy; Sikah leans towards love; Soba expresses sadness and pain; and Hijaz, associated with the distant desert, conveys reminder and assertiveness (M. Zaini, 2018). These emotions are said to be evoked in part through changes in interval size during a maqam presentation. Some references describe maqam moods in very vague and subjective terminology, and no significant research has yet applied a scientific methodology to a sample of reciters from different backgrounds or cultures (whether Arab or non-Arab) to establish the relationship between emotion and the selected maqam.

Several terms describe different attributes of sound: pitch, tune, and rhythm. For example, men, with deeper voices, generally have lower-pitched voices than women. The art of maqamat comes from varying the pitch; if the pitch did not vary at all, the whole recitation would be completely flat and monotonous. The pitch pattern as a whole is known as the tune. In the context of Quranic recitation, this is what the maqamat are: tune patterns, identifiable patterns of pitches progressing from one to another. With the maqamat, the patterns are loosely defined rather than fixed, and there is much improvisation in recitation, yet they remain recognizable enough to be categorized as one maqam or another. Rhythm refers to the time component of sound, which in Quranic recitation is the domain of tajweed; tajweed defines the rhythm of the Quran. Because tajweed determines the rhythm and is part of the Qur'an, one cannot apply an external rhythm to the recitation. All of these elements give a characteristic profile to the audio signal captured from Quranic maqamat recitation and contribute significant findings to acoustic analysis for speech profiling.

2.3. Audio Feature Extraction (AFE)

Theoretically, it should be possible to recognize speech directly from the digitized waveform. However, because of the large variability of the speech signal, it is better to perform some feature extraction that reduces that variability. Audio data cannot be used by models directly, so it must be converted into an interpretable format; this process is called Audio Feature Extraction (AFE). It captures most of the information in the data in a compact, understandable form, and it is required for classification, prediction, and recommendation algorithms.

The reason for computing the short-term spectrum is that the cochlea of the human ear performs a quasi-frequency analysis. The analysis in the cochlea takes place on a nonlinear frequency scale (known as the Bark scale or the mel scale). This scale is approximately linear up to about 1000 Hz and approximately logarithmic thereafter. In feature extraction, it is therefore very common to warp the frequency axis after the spectral computation.
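As a small illustration of this warping, the sketch below converts physical frequency to the mel scale using the common 2595·log10(1 + f/700) approximation, which also appears later in this chapter as Eq. (2.10). This is a minimal numpy sketch; the test frequencies are arbitrary values chosen only to show how equal steps in Hz shrink on the mel axis.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common mel approximation: near-linear below about 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse mapping back to physical frequency in Hz.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Doubling the frequency adds progressively fewer mels at higher frequencies.
print(hz_to_mel([500.0, 1000.0, 2000.0, 4000.0, 8000.0]))
```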
Different taxonomies exist for the classification of audio features. Scaringella (2006) followed a standard taxonomy, dividing the audio features used for genre classification into three groups based on timbre, rhythm, and pitch information, whereas Weihs et al. (2007) categorized audio features into four subcategories, namely short-term features, long-term features, semantic features, and compositional features. This section describes the basic feature extraction techniques for audio shown in Figure 2.1 that are in use today, or that may be useful in the future, especially in speech recognition.

Figure 2.1: Basic concept of AFE

2.3.1. AFE for the Complex Cepstrum

The cepstrum computed from the log magnitude spectrum, as described in the cepstral analysis below, is known as the real cepstrum. Because the real cepstrum is computed from the log magnitude spectrum alone, the phase is discarded, and the original sequence cannot be reconstructed from it. Reconstruction becomes possible if the Fourier phase is preserved, which is what the complex cepstrum provides. Instead of taking the inverse Fourier transform of the log magnitude spectrum, as for the real cepstrum, the Inverse Discrete Fourier Transform (IDFT) of the logarithm of the complex spectrum is used to compute the complex cepstrum. Since the logarithm of all the spectral values is used, the phase is preserved in the complex cepstral sequence, which can then be used to reconstruct the sequence. The methods for computing pitch and formant parameters from the complex cepstrum remain the same as those for the real cepstrum, since these parameters are obtained from the magnitude of the cepstral coefficients. The complex cepstrum is given by Eq. (2.1):

\hat{c}(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\lvert S(\omega)\rvert \, e^{j\omega n} \, d\omega \;+\; j \, \frac{1}{2\pi} \int_{-\pi}^{\pi} \theta(\omega) \, e^{j\omega n} \, d\omega    (2.1)

where S(\omega) is the speech spectrum and \theta(\omega) its phase. This can also be expressed as

\hat{c}(n) = c_r(n) + j \, c_i(n)    (2.2)

where c_r(n) is the real cepstrum and j c_i(n) is the imaginary, phase-derived part of the complex cepstrum.

2.3.2. Cepstral Analysis

The objective of cepstral analysis (CA) is to separate speech into its source and system components without any a priori knowledge about the source and/or the system. A voiced or unvoiced speech segment can be considered as the convolution of the respective excitation sequence with the vocal tract filter characteristics. According to the source-filter theory of speech production, voiced and unvoiced sounds are produced by exciting the time-varying vocal tract system with a periodic impulse sequence or a random noise sequence, respectively (Giacobello et al., 2012). In the frequency domain, this convolution becomes a product, so the speech spectrum S(\omega), excitation spectrum E(\omega), and vocal tract filter H(\omega) are related by Eq. (2.3):

S(\omega) = E(\omega) \, H(\omega)    (2.3)

From Eq. (2.3), the magnitude spectrum of the speech sequence can be represented as

\lvert S(\omega)\rvert = \lvert E(\omega)\rvert \, \lvert H(\omega)\rvert    (2.4)

To combine E(\omega) and H(\omega) linearly in the frequency domain, a logarithmic representation is used. Taking the logarithm of Eq. (2.4) gives

\log\lvert S(\omega)\rvert = \log\lvert E(\omega)\rvert + \log\lvert H(\omega)\rvert    (2.5)

c(n) = \mathrm{IDFT}\big(\log\lvert S(\omega)\rvert\big) = \mathrm{IDFT}\big(\log\lvert E(\omega)\rvert + \log\lvert H(\omega)\rvert\big)    (2.6)

As indicated in Eq. (2.5), the magnitude speech spectrum is transformed by the log operation, which converts the multiplication of the excitation and vocal tract components into a linear summation. The IDFT is then used to separate the linearly combined log spectra of the excitation and vocal tract system components. Whereas the IDFT of a linear spectrum transforms back to the time domain, the IDFT of a log spectrum transforms to the inverse-frequency, or cepstral, domain, which resembles the time domain. This is expressed mathematically in Eq. (2.6).
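The relationships in Eqs. (2.1)-(2.6) can be sketched directly in numpy. The fragment below is an illustrative sketch only: the FFT length and the small flooring constant are arbitrary choices, and the frame is assumed to be already windowed.

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Real cepstrum of a windowed frame: IDFT of the log magnitude spectrum, Eq. (2.6)."""
    spectrum = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small constant avoids log(0)
    return np.fft.ifft(log_mag).real             # quefrency-domain sequence c(n)

def complex_cepstrum(frame, n_fft=512):
    """Complex cepstrum: IDFT of the complex logarithm with the phase retained, Eq. (2.1)."""
    spectrum = np.fft.fft(frame, n_fft)
    log_spec = np.log(np.abs(spectrum) + 1e-10) + 1j * np.unwrap(np.angle(spectrum))
    return np.fft.ifft(log_spec)
```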
Figure 2.2 details the steps involved in converting a short-term speech signal into its cepstral-domain representation, showing the output obtained at each stage of the cepstrum computation. Here s(n) is the voiced frame under consideration and x(n) is the windowed frame, obtained by multiplying s(n) by a Hamming window. |X(ω)| represents the spectrum of the windowed sequence x(n); since the spectrum of the frame is symmetric, only one half of the spectral components is plotted. log|X(ω)| is the log magnitude spectrum obtained by taking the logarithm of |X(ω)|, and c(n) is the cepstrum computed for the voiced frame s(n). The resulting cepstrum contains the excitation and vocal tract components combined linearly, as in Eq. (2.6). Because the cepstrum is derived from the log magnitude of the linear spectrum, it is also symmetric in the quefrency domain.

Figure 2.2 Block diagram representing computation of the complex cepstrum

2.3.3. Mel-Frequency Cepstral Coefficients (MFCC)

MFCC is among the most widely used methods in speech recognition because its design has been empirically shown to work well for speaker recognition (Reynold, 1994). The aim is to mimic the frequency response of human hearing, so the coefficients rely on a mel-frequency spacing of filterbank energies. Mel-scaled feature extraction essentially involves windowing the signal, applying the DFT, taking the log of the magnitude, and warping the frequencies onto a mel scale. In the final step, instead of an IDFT, a Discrete Cosine Transform (DCT) is applied; it is simpler and faster because it uses only cosine functions, whereas the DFT uses both cosines and sines (in the form of complex exponentials), although both operate on a finite number of discrete data points. The steps involved in MFCC feature extraction are described next.

1. Pre-Emphasis Filter

The main goal of the pre-emphasis filter is to boost the high-frequency region that is suppressed during sound production at the vocal cords. It amplifies the significant formant frequencies that carry important information. The digitized speech waveform may also carry a wide range of additive noise, which the pre-emphasis filter helps to attenuate. Pre-emphasis is implemented as a first-order FIR high-pass filter. In the time domain, with input x[n] and 0.97 ≤ a ≤ 1.0, the filter equation is

y[n] = x[n] - a \, x[n-1]    (2.7)

and the transfer function of the FIR filter in the z-domain is

H(z) = 1 - \alpha z^{-1}, \quad 0.97 \le \alpha \le 1.0    (2.8)

where α is the pre-emphasis coefficient, which may be adapted over time based on the autocorrelation of the audio signal. The aim of this stage is to amplify the energy at high frequencies. The pre-emphasis filter is applied to the input signal before the next step, windowing; a short sketch of this filter follows.
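A minimal sketch of the pre-emphasis step of Eq. (2.7), assuming the conventional coefficient value of 0.97:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order FIR high-pass filter y[n] = x[n] - alpha * x[n-1], Eq. (2.7)."""
    x = np.asarray(x, dtype=float)
    # The first sample is passed through unchanged; every later sample has a scaled
    # copy of its predecessor subtracted, boosting high-frequency content.
    return np.append(x[0], x[1:] - alpha * x[:-1])
```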
2. Framing and Windowing

To avoid discontinuities and distortion in the underlying spectrum, the speech signal needs to be sliced into short durations for analysis. The speech is therefore segmented into small sections called frames, and each frame is analysed separately instead of analysing the entire signal at once. To capture the signal accurately, the analysis window is advanced by a pre-selected time step, in milliseconds (ms), so that the temporal characteristics of individual speech sounds can be tracked. A window of this duration is usually sufficient to provide good spectral resolution of these sounds while still resolving their significant temporal characteristics, and overlapping the analysis frames keeps each speech sound centred within some frame. The windows commonly used in mel analysis are the Hanning or Hamming windows, applied to each frame to taper the signal towards the frame boundaries. This enhances the harmonics, smooths the edges, and reduces the edge effect when taking the DFT of the signal.

3. Discrete Fourier Transform (DFT)

The input to the DFT is a windowed frame x[n], and the output, for each of N discrete frequency bands, is a complex number X[k] representing the magnitude and phase of that frequency component in the original signal. Each windowed frame is converted into a magnitude spectrum by applying the DFT:

X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1    (2.9)

where N is the number of points used to compute the DFT.

4. Mel-Scale Filterbank

The mel is a unit of measurement for the frequency perceived by the human ear. A mel filterbank is a set of bandpass filters used to filter the Fourier-transformed signal in order to compute the mel spectrum. Because the human ear does not perceive pitch linearly, the mel spectrum does not correspond linearly to the physical frequency of a tone. The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz. The mel value corresponding to a physical frequency f can be approximated as

f_{mel} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)    (2.10)

Mel filterbanks can be implemented in either the time domain or the frequency domain; for MFCC they are usually implemented in the frequency domain. To mimic human auditory perception, the frequency axis is warped according to the non-linear function in Eq. (2.10). The triangular filterbank with mel-frequency warping is shown in Figure 2.3. The centre frequencies of the filters are evenly spaced on the warped axis. The mel spectrum of the magnitude spectrum X(k) is computed by multiplying the magnitude spectrum by each of the triangular mel weighting filters, and the resulting filterbank values are then compressed by replacing each value with its natural logarithm, which converts the multiplicative components of the spectrum into additive ones and makes the feature estimates less sensitive to variations in the input.

Figure 2.3. Mel filterbank

s(m) = \sum_{k=0}^{N-1} \left[ \lvert X(k)\rvert^2 \, H_m(k) \right], \quad 0 \le m \le M-1    (2.11)

where M is the total number of triangular mel weighting filters and H_m(k) is the weight given to the kth energy spectrum bin contributing to the mth output band, expressed as

H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}

with m ranging from 0 to M-1; a short sketch of steps 2-4 follows.
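The sketch below illustrates framing and windowing, the DFT power spectrum, and the triangular mel filterbank of Eqs. (2.10)-(2.11). It is a simplified numpy illustration, not the implementation used later in the thesis; the frame length, hop size, FFT size, sampling rate, and number of filters are assumed values, and the weights follow the unnormalised triangular form of H_m(k) given above.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice the signal into overlapping frames and apply a Hamming window (step 2)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([x[i * hop: i * hop + frame_len] * window for i in range(n_frames)])

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular mel-spaced filterbank weights H_m(k), Eqs. (2.10)-(2.11) (step 4)."""
    high_mel = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    mel_points = np.linspace(0.0, high_mel, n_filters + 2)              # evenly spaced on the mel scale
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)  # FFT bin indices f(m)

    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):                                   # rising slope of the triangle
            H[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):                                  # falling slope of the triangle
            H[m - 1, k] = (right - k) / (right - centre)
    return H

def log_mel_spectrum(frames, H, n_fft=512):
    """Power spectrum of each frame (step 3), weighted by the filterbank, then log-compressed."""
    power = np.abs(np.fft.rfft(frames, n_fft, axis=-1)) ** 2
    return np.log(power @ H.T + 1e-10)
```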
5. Discrete Cosine Transform (DCT)

Since the vocal tract is a smooth system, the energy levels in adjacent mel bands tend to be correlated. Applying the DCT to the log mel-frequency coefficients decorrelates them and produces an array of cepstral coefficients; prior to the DCT, the mel spectrum is usually represented on a log scale. This results in a cepstral-domain signal with a quefrency peak corresponding to the pitch of the signal and low-quefrency peaks corresponding to the formants. Since most of the signal information is carried by the first few MFCC coefficients, the system can be made more robust by retaining only those coefficients and truncating the higher-order DCT components. The DCT is applied to the log mel energies as follows:

c(n) = \sum_{m=0}^{M-1} \log\big(s(m)\big) \cos\!\left(\frac{\pi n (m + 0.5)}{M}\right), \quad n = 0, 1, 2, \dots, C - 1    (2.12)

where c(n) are the cepstral coefficients and C is the number of MFCCs. Conventional MFCC systems use only 8-13 cepstral coefficients. The 0th coefficient is often excluded since it represents the average log energy of the input signal, which carries little speaker-specific information.

Since the cepstral coefficients contain information from a single frame only, they are commonly referred to as static features. To capture the temporal dynamics of the signal, the first and second derivatives of the cepstral coefficients are computed; these are known as delta and delta-delta coefficients. Delta coefficients carry information about the speech rate, and delta-delta coefficients provide information analogous to the acceleration of speech. A commonly used definition for computing the dynamic parameters is

\Delta c_m(n) = \frac{\sum_{i=-T}^{T} k_i \, c_m(n+i)}{\sum_{i=-T}^{T} \lvert i\rvert}    (2.13)

where c_m(n) denotes the mth feature for the nth time frame, k_i is the ith weight, and T is the number of successive frames used for the computation; in general, T is taken as 2. The delta-delta coefficients are computed by taking the first-order derivative of the delta coefficients. Figure 2.4 summarizes the overall MFCC feature extraction process explained above, and a short sketch of these final two computations follows.

Figure 2.4 MFCC feature extraction
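Continuing the earlier sketches, the DCT of Eq. (2.12) and the delta computation of Eq. (2.13) might look as follows. The number of retained coefficients, the weights k_i = i, and the edge-padding at the frame boundaries are assumptions made only for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_log_mel(log_mel, n_ceps=13):
    """DCT of the log mel energies, keeping only the first few coefficients, Eq. (2.12)."""
    return dct(log_mel, type=2, axis=-1, norm='ortho')[..., :n_ceps]

def delta(features, T=2):
    """Delta (first-order dynamic) coefficients over a (n_frames, n_ceps) array, Eq. (2.13),
    assuming weights k_i = i and edge frames repeated at the boundaries."""
    n_frames = features.shape[0]
    padded = np.pad(features, ((T, T), (0, 0)), mode='edge')
    denom = sum(abs(i) for i in range(-T, T + 1))          # = T * (T + 1)
    return np.array([
        sum(i * padded[t + T + i] for i in range(-T, T + 1)) / denom
        for t in range(n_frames)
    ])

# Delta-delta coefficients are obtained by applying delta() to the delta features again.
```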
2.3.4. Frequency warping

In audio signal processing, frequency warping is a technique commonly used for spectral analysis because of its ability to mimic human hearing through a non-uniform scaling of frequency. The basic idea is to apply a unitary warping operator to the function under analysis; this warping drastically changes the local density as well as the spectral density. A well-known filterbank based on such warping is the MFCC, which is explained in more detail in Chapter 4. The warping function defines how individual frequency components and frequency ranges are mapped onto the new scale, and therefore how the resolution of the original representation is compressed and expanded. There are several ways to create a frequency-warped domain, among them frequency warping of the Fourier spectra, Fourier transforming the frequency-warped time signal, and a non-uniform-resolution filterbank. The first method, when combined into one operation, is equivalent to spectral-domain processing with a non-uniform-resolution filterbank made of warped (frequency-domain) sine functions. The method chosen depends on the practical limitations at hand, e.g., in the implementation of the warping process. Frequency warping is used as the main enhancement method for MFCC throughout the profiling system. The detailed method, in which the frequency warping strategy is applied by warping the spectrum directly rather than the filterbank, is discussed further in Chapter 5.

2.3.5. Spectral Descriptors

This section presents a set of functions that describe the timbre of audio, known as spectral descriptors (SD). It defines the equations used to determine the spectral features, notes the common usage of each feature, and provides examples to describe the spectral descriptors more intuitively. SD are widely used in machine learning, deep learning, and perceptual analysis, and have been applied to a range of tasks including speaker identification and recognition (Murthy, 1999), music genre classification (Li et al., 2005), mood recognition (Tsang, 2000), and voice activity detection (Scheirer et al., 1997).

1. Spectral Centroid

The spectral centroid is the frequency-weighted sum of the spectrum normalized by its unweighted sum (Peeters, 2004). The centroid µ1 is defined as

\mu_1 = \frac{\sum_{k=b_1}^{b_2} f_k \, s_k}{\sum_{k=b_1}^{b_2} s_k}    (2.14)

where f_k is the frequency in Hz corresponding to bin k, s_k is the spectral value at bin k, and b_1 and b_2 are the band edges, in bins, over which the spectral centroid is calculated. The spectral centroid represents the centre of gravity of the spectral energy. In audio analysis it is often used for music or genre classification, since it acts as an indicator of brightness, and it is also commonly used to classify speech as voiced or unvoiced. Figure 2.5 shows an example of the spectral centroid of human speech; observe that the centroid jumps in regions of unvoiced speech.

Figure 2.5 Example of spectral centroid from human speech.

2. Spectral Spread

Spectral spread is the standard deviation of the spectrum around the spectral centroid (Peeters, 2004). It is the second central moment of the spectrum about the centroid:

\mu_2 = \sqrt{\frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^2 \, s_k}{\sum_{k=b_1}^{b_2} s_k}}    (2.15)

where f_k is the frequency in Hz corresponding to bin k, s_k is the spectral value at bin k, b_1 and b_2 are the band edges in bins, and µ1 is the spectral centroid. The spectral spread depicts the "instantaneous bandwidth" of the spectrum and indicates how dominant a single tone is: the spread increases as the tones diverge and decreases as they converge.

3. Spectral Roll-off Point

The spectral roll-off point measures the bandwidth of the audio signal by determining the frequency bin below which a given percentage of the total spectral energy lies (Scheirer, 1997). The roll-off bin R satisfies

\sum_{k=b_1}^{R} s_k = \kappa \sum_{k=b_1}^{b_2} s_k    (2.16)

where κ is the specified fraction of the total energy (typically 0.85 or 0.95). The spectral roll-off point has been used to distinguish voiced from unvoiced speech, for speech/music discrimination, music genre classification, acoustic scene recognition, and music mood classification.
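For illustration, the three descriptors can be computed over the magnitude spectrum of a single frame roughly as below. The FFT length, sampling rate, and 85% roll-off fraction are assumed values, and the band edges are taken as the full spectrum.

```python
import numpy as np

def spectral_descriptors(frame, sample_rate=16000, n_fft=512, rolloff_frac=0.85):
    """Spectral centroid, spread, and roll-off point of one frame, Eqs. (2.14)-(2.16)."""
    mag = np.abs(np.fft.rfft(frame, n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-10)                             # Eq. (2.14)
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * mag) / (np.sum(mag) + 1e-10))  # Eq. (2.15)

    cumulative = np.cumsum(mag)
    rolloff_bin = np.searchsorted(cumulative, rolloff_frac * cumulative[-1])           # Eq. (2.16)
    rolloff = freqs[min(rolloff_bin, len(freqs) - 1)]
    return centroid, spread, rolloff
```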
2.4. Automatic Speech Recognition for Speech Profiling

Automatic speech recognition (ASR) is the area of research that allows machines, via devices such as a microphone or telephone, to accept vocal input from humans and interpret it intelligently with the highest possible accuracy. It has a wide range of applications that make life easier and is very promising. This section starts with an overview of speech recognition over the decades and the developments in ASR throughout the years.

2.4.1. An overview of speech recognition

The study of speech recognition has been an active research area for more than five decades. The aim is to capture, understand, and act on the captured information. As shown in Figure 2.6, Santosh (2010) classified a speech recognition system into four main stages, each further divided into subsystems: analysis, feature extraction, modelling, and matching. These basic stages are used extensively throughout this research. Many models have been developed over the years to produce accurate systems that can benefit many users (Pahwa et al., 2020). Speech recognition software uses various techniques, such as natural language processing. For instance, an ASR system may use natural language processing techniques based on grammars (Reshamwala et al., 2013): context-free grammars represent the syntax of the language, and spontaneous speech can be handled through automatic summarization and indexing, which extract the gist of the speech transcriptions for information retrieval and dialogue system tasks. Although communication with machines has been dominated by keyboards and screens, speech is the most widely used, natural, and fastest means of communication for people. Many parameters affect the accuracy of a recognition system, including speaker dependence or independence, discrete or continuous word recognition, vocabulary, environment, the acoustic model, the language model, and many more. Problems such as noisy environments, the same word produced differently by two speakers, and mismatch between training and testing conditions prevent systems from achieving complete recognition.

Figure 2.6 Four basic stages of speech recognition (Santosh, 2010)

In this research, an open-source model based on Hidden Markov Models (HMMs) is used. The HMM approach was popularised for speech recognition by, among others, Huang and colleagues around 1990. For this purpose, the HMM Toolkit (HTK), designed for speech recognition, is used; HTK was developed in 1989 by Steve Young at the Speech Vision and Robotics Group of the Cambridge University Engineering Department. HTK training tools are used to train HMMs on training utterances from a speech corpus, while HTK recognition tools are used to transcribe unknown utterances and to evaluate system performance. A method using Gaussian Mixture Models, i.e., statistical pattern classification, is suggested to reduce the computational load. An HMM is a statistical model in which the system being modelled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from the observable ones. The extracted model parameters can then be used for further analysis, for example in pattern recognition applications. Extending such systems from English, the usual standard, to other languages, in this case Arabic, represents a real research challenge.
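The recogniser used in this research is built with HTK, so no new implementation is required here; nevertheless, the evaluation problem described above, computing how likely an observation sequence is under an HMM with given parameters, can be illustrated with a short sketch. The matrices below are made-up toy values, the observations are discrete symbol indices, and none of this is HTK code.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM (forward algorithm).

    obs : sequence of observation-symbol indices
    pi  : (N,) initial state probabilities
    A   : (N, N) transition matrix, A[i, j] = P(next state j | current state i)
    B   : (N, M) emission matrix, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]                 # initialise with the first observation
    log_like = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()                   # rescale to avoid numerical underflow
        log_like += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]
    return log_like + np.log(alpha.sum())

# Toy two-state example with three observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_log_likelihood([0, 1, 2, 2], pi, A, B))
```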
2.4.2. ASR for Quranic Semantic Search

Research on Arabic ASR has focused on developing recognition systems for Modern Standard Arabic. Since 2004, the main issues faced in developing highly accurate ASR for Arabic have been the predominance of non-diacritised text material, the enormous dialectal variety, and the morphological complexity of the language (Hussein et al., 2021). That study uses a morphology-based language model at different stages of a speech recognition system for conversational Arabic, together with automatic diacritisation of Arabic text for use in acoustic model training. Quranic Arabic is the form of Arabic in which the Quran is written; with similar language notation, the Quranic text shares this predominance of non-diacritised material. The creation of domain-specific ontologies that can be inferred against one another has led researchers towards semantic search. Recently, the state of the art in Arabic ASR has come from modular Hidden Markov Model-Deep Neural Network systems (P. Smit et al., 2017). According to Hussein, several major challenges must be faced in dealing with the complexity of the language; the best ASR results on Modern Standard Arabic data were reported by the Aalto University team. To deal with the morphological complexity of Arabic, a character-level language model was suggested by A. Ahmed (2018).

2.4.3. Ontologies for Semantic Audio Analysis

An ontology describes problem entities, operations, relations, and structures. In the context of semantic audio tools, the entities may be sounds or sound objects, while relations and structures are described by their organization; operations describe the available tools and their context. A model for building semantic audio tools facilitates the discussion of the information management and knowledge representation requirements of these tools. Following these design principles, a set of ontologies has been developed for describing the process of audio recording. The ontology proposed in this section is closely related to the information management framework for semantic audio tools outlined in the next section; it is designed to satisfy some of its requirements, for instance the need to collect information about production, and it uses the technologies deemed most appropriate for managing heterogeneous information in an open-ended way. The proposed ontology allows QMR production to be described in more detail than was possible with previously published ontologies.

2.4.4. Ontology design principles

This section summarises the features that make the proposed ontology more suitable for QMR audio files. Many audio feature ontologies already exist in the music domain (Allik et al., 2016), but they are based on musical arrangement rather than on the acoustic features contained in the frequency domain. The design principles and their advantages are as follows:

1. Time frames and temporal entities can be used to localise events.
2. The proposed ontology is published as a modular ontology library whose components may be reused or extended outside of its framework.
3. Ease of use: the proposed ontology provides only the terms required for descriptive knowledge representation, without additional foundational elements.
4. Adaptation to existing and future applications in industry and academia: the models provide the basis for content annotation as well as the decomposition of events in complex workflows.

While elements of these models can also be found in other ontologies, they are not all present at once in a single unified framework; the proposed ontology is designed to fill this gap. It provides a model describing the production workflow from composition to delivery, including QMR recording, together with the basic concepts needed to do so in detail. The following section provides an overview of the audio feature extraction techniques relevant to semantic analysis for knowledge base construction.
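To make the first design principle, temporal localisation of events, concrete, the sketch below annotates a segment of a hypothetical QMR recording with a maqam label and a time interval using rdflib. The namespace, class, and property names here are purely illustrative placeholders; the actual Maqamat Ontology terms are defined in a later chapter.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

# Hypothetical namespace and terms, used only for illustration.
MQ = Namespace("http://example.org/maqamat#")

g = Graph()
g.bind("mq", MQ)

recording = URIRef("http://example.org/recordings/rec001")
segment = URIRef("http://example.org/recordings/rec001#segment1")

g.add((recording, RDF.type, MQ.QMRRecording))
g.add((segment, RDF.type, MQ.RecitationSegment))
g.add((segment, MQ.partOf, recording))
g.add((segment, MQ.maqam, MQ.Bayyati))                               # maqam label for this segment
g.add((segment, MQ.startTime, Literal(0.0, datatype=XSD.decimal)))   # temporal localisation in seconds
g.add((segment, MQ.endTime, Literal(12.5, datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```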
2.5. Semantic Audio Analysis (SAA) for Knowledge Base Construction

Semantics has long been regarded as the study of the meaning of human expression through language (Ullmann, 1962), whereas computer science treats semantics as a knowledge representation issue (Guarino and Giaretta, 1995). Semantic audio refers to sound, or features of sound, that are meaningful or that carry information related to the production of the audio. The motivation of this work is to design software systems that support well-structured information for audio editing, facilitate data collection, and build an audio engineering knowledge base of QMR for future use.

Many applications have been developed that use semantic information to help the user identify, organize, explore, and manipulate audio signals. Speech recognition is an important SAA application, including language identification, speaker identification, and gender identification. SAA involves understanding audio information and draws on machine learning, digital signal processing, speech processing, source separation, perceptual models of hearing, and ontologies.

2.5.1. Utilities and applications

The concept of semantic audio is used in this work in the sense that the technologies involved should enable the analysis of audio content so that meaningful associations between the content and its acoustic elements can be represented and managed in a digital computer. Two crucial components of semantic audio applications are the capability to represent and structure information about audio elements, and the capability to analyse the association of these concepts with a representation of the recording. Extracting information from audio recordings is a prerequisite for building semantic audio applications, so it is important to review the basic categorical distinctions among audio features and the relationships of these features across the physical, perceptual, and audio domains. Olson's taxonomy of audio dimensions (Olson, 1952) provides an insightful parallel view of how the qualities of sound and audio are interpreted in various disciplines, together with a basic terminology for the concepts of physical and psychological qualities. Following this line of thought helps resolve ambiguities that often arise between acoustical, perceptual, and audio quantities.

Table 2.2 illustrates the physical quantities used to describe elementary sounds and the most closely related perceptual and musical concepts. As the basic physical quantities and concepts become more complex, it becomes more difficult to establish the correspondence between categories. A sound may be classified by its growth and decay, the attack and release times related to timbre, harmonicity or inharmonicity, regular or irregular spacing of frequency components, and frequency and amplitude modulation. Obtaining audio features corresponding to simple physical quantities, such as the fundamental frequency of a sound, is a question of measurement involving simple mathematical transformations. Recognising more complex acoustic elements in semantic audio applications, such as a note or an instrument, requires more complex processing, such as pattern recognition and classification, or knowledge-based processing, all of which rely on logical inference using contextual information alongside directly measured physical quantities.
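As an example of the simpler kind of measurement mentioned above, a rough fundamental-frequency estimate of a voiced frame can be obtained from its autocorrelation peak. This is only an illustrative sketch; the sampling rate and the 60-400 Hz search range are assumed values.

```python
import numpy as np

def estimate_f0(frame, sample_rate=16000, f_min=60.0, f_max=400.0):
    """Rough fundamental-frequency estimate of a voiced frame from its autocorrelation peak."""
    frame = np.asarray(frame, dtype=float) - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]   # keep non-negative lags only
    lag_lo = int(sample_rate / f_max)                               # smallest lag to search
    lag_hi = int(sample_rate / f_min)                               # largest lag to search
    lag = lag_lo + np.argmax(ac[lag_lo:lag_hi])                     # lag of the strongest periodicity
    return sample_rate / lag
```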
These considerations point to an underlying need for ontology design, and to some basic precautions that should be addressed in designing the ontologies and information management solutions discussed in the following chapters.

Table 2.2 Principal dimensions of elementary audio sounds (Olson, 1952)

Physical Quantity (or concept)  | Perceptual Quality | Musical Category
Frequency (fundamental)         | Perceived pitch    | Musical note
Amplitude (intensity)           | Perceived loudness | Dynamics
Duration (time)                 | Perceived duration | Beat and tempo
Waveform (or complex spectrum)  | Perceived timbre   | Tone quality

2.5.2. Semantic Audio Tool

A system integrating the components needed to implement the ideas discussed so far in this work may be modelled as shown in Figures 2.7 and 2.8. The model has three layers: analysis layers corresponding to audio feature extractors, information layers corresponding to ontologies that describe the tools and the results of audio analysis, and application layers corresponding to the tools that can be built from this information and its descriptions. The roles of the three layers and the components that may be utilised in the model are outlined below.

1. Analysis layers (Audio feature extraction)

AFE is the main technology used in this layer. This step is crucial, since it has the biggest impact on the next layer. Some of the signal processing components required for extracting information from audio content are well researched and may be adapted, and basic feature extraction techniques standardised in speech recognition have been applied successfully to general audio. For example, semantic segmentation of audio has received considerable research attention in recent years (Theodorou, 2014; Aggarwal, 2022), and many DSP techniques applicable to these problems are described in (Aggarwal, 2022). High- and mid-level feature extraction are the focus of MIR research, but high-level segmentation of audio recordings played by a single instrument, and the analysis of master recordings, were not considered in previous research.

2. Information layers (Audio Features Ontology)

The audio ontology and its extensions are defined as frames of reference for describing the domain, and they serve as a reference point for the information management layer considered here. This ontology is also useful for representing, for instance, an audio element and its relation to the corresponding signals. Its basic components allow entities to be associated with events in the time domain, which is crucial for representing audio features and serves as the basis for the Audio Features Ontology and the pre-designed Maqamat Ontology in the next chapter. It also provides the relations between features, which are important for efficient feature extraction. Ontologies describing audio analysis algorithms, ideally including even their low-level digital components, and ontologies describing audio processing tools are equally important in building intelligent audio processing environments. Currently there are no ontologies describing specific signal models or performance-related data; these will be developed in the future as the need arises.

3. Application layers (Interaction and navigation)

A well-defined and structured representation of the audio and its associated descriptions is key to developing a semantic audio tool. The ontological needs of describing applications include the ability to create a knowledge base.
Such information is useful for retrieving data, triggering feature extraction when needed, or requesting user interaction. Figures 2.7 and 2.8 show the knowledge representation model linking the analysis, information, and application layers (Fazekas, 2012).

Figure 2.7: Knowledge representation model for analysis and information layers (Fazekas, 2012)

Figure 2.8: Knowledge representation model for information and application layers (Fazekas, 2012)

2.6. Summary

Speech recognition has been widely studied for many languages, including Quranic recitation, which carries its own acoustic features. These properties are contained, in particular, in the formants, defined as concentrations of acoustic energy around particular frequencies corresponding to resonances of the vocal tract. Like the varied melodies of the maqamat, the verses of the Quran vary widely in topic and event, generating different feelings in the listener. This chapter provided a brief overview of the Quranic maqamat and the sound elements relevant to this work. The acoustic features contained in QMR form a complex speech signal that must be extracted and analysed in order to understand the characteristics of its components; such analysis is crucial for relating the phonological and morphological elements of QMR audio files to any correlation between acoustic properties and the Quranic rhetorical elements contained in the QMR audio features. Later sections described the basic concepts of speech recognition and AFE and their application to the complex cepstrum, followed by cepstral analysis. Within the cepstral analysis method, two types of algorithm were presented: the typical cepstral analysis with a frequency warping function, and the mel-scaled features also known as MFCC. The chapter also outlined the techniques used in audio feature extraction for profiling the Quranic audio signal. The final part briefly explained semantic analysis and its utilities and tools in audio for the development of a well-structured database system that supports a studio environment facilitating Quranic semantic audio search based on ontological audio features.