Automatic Speech Recognition: a Refutation Approach
Author: Julius Jonathan Guzy
Leicester Polytechnic
England
"Scientific theories are not statements of fact; they are not even descriptions of effects; they are explanations, which means that only their consequences are open to inspection and available to be compared with our experience"
Bronowski, Humanism and the Growth of Knowledge
ABSTRACT
The thesis describes a "refutation" approach to the design of recognition algorithms for use in Speech Input Interfaces that operate in natural "open systems" environments. The approach is based on a view of the scientific method due to Popper. This gives rise to a view of perception that differs from that underlying other approaches.
An example is presented of the derivation of a refutation based algorithm from a theory of the articulation /tu/ in the context of a following /nain/. The resultant algorithm shows the method to be capable of expressing the generality required of speaker independent speech recognition algorithms.
A comparison is performed between the refutation approach and past approaches, and significant technical and philosophical differences are shown to exist between them. An example of an isolated word, speaker independent, Refutation based digit utterance recogniser, the RB-1, is presented. The problem of recognition performance evaluation for recognisers designed to operate in open system environments is examined and a set of performance evaluation measures is proposed. A comparative evaluation of the recognition performance of the RB-1 and a commercial Pattern Matching based system, the Marconi Macrospeak, is described. For the Macrospeak to obtain a performance better than that of the RB-1 over all of the evaluation parameters considered, thresholds had to be imposed on its goodness of match score. For the Macrospeak and the RB-1 respectively, the percentage of correct responses to target utterances was 44% and 25%, the percentage of error responses to target utterances was 0.05% and 0.06%, and the percentage of error responses to non-vocabulary utterances was 16% and 22%.
In conclusion, the main area for future work is identified to be that of spoken command protocol design. The scientific contribution of this thesis is that of having introduced a different view of the problem of automatic speech recognition, and of having described a novel set of principles in recognition algorithm design.
Acknowledgements
This research was conducted as part of the author's contribution to the work of the Speech Recognition Group at Leicester Polytechnic, funded initially by SERC Grant No. GR/B/54599; later of the Speech Recognition Group of the Human Computer Interface Research Unit at Leicester Polytechnic, funded by SERC Rolling Grant No. GR/D/72662; and of the Loughborough University Human Computer Interface Research Centre, funded under Alvey Grant No. CR/E/39143 (MMI/062).
The author would like to offer special thanks to Maria Guzy, Ernest Edmonds, John Connolly, Abdullah Hashim, Helmut Bez, Ian Newman, Sue Johnson, Andree Woodcock, Eddie McDaid, Mike Smyth, Tom Bayley, Stella O'Brien, Steve Guest, Andre Schappo, Jackie Highfield, and Louise Poole for their help, comments, advice, support and all the other things that go towards doing useful research.
CONTENTS
Chapter 1 A Refutation Approach to Automatic Speech Recognition
- 1.1. The problem area
- 1.2. Instrumental and rationalist phenomena
- 1.3. The problem situation
- 1.4. Popper's solution to the problem of induction
- 1.5. Outline basis for an approach to the recognition problem
- 1.6. Outline of a refutation approach to the recognition problem
- 1.7. Structure of remaining chapters
Chapter 2 A Worked Example of Refutation Based Recognition
- 2.1. Introduction
- 2.2. Description of the articulation /tu/ in the context /tunain/
- 2.3. Data capture
- 2.4. Outline recognition strategy
- 2.5. Representation of temporal, frequency and intensity information
- 2.6. Presentation of results
- 2.7. The refutation process
- 2.7.1. Start of /t/: refutation 1
- 2.7.2. Start of /t/: refutation 2
- 2.7.3. Start of /t/: refutation 3
- 2.7.4. Start of /t/: problems of silence
- 2.7.4.1. Start of /t/: problems of silence - refutation 4
- 2.7.4.2. Start of /t/: problems of silence - refutation 5
- 2.7.4.3. Start of /t/: problems of silence - refutations 6 and 7
- 2.8. Start of voicing
- 2.8.1. Start of voicing: refutation 1
- 2.8.2. Start of voicing: refutation 2
- 2.9. Start of nasalisation
- 2.9.1. Start of nasalisation: refutation 1
- 2.9.2. Start of nasalisation: refutation 2
- 2.9.3. Start of nasalisation: refutation 3
- 2.9.4. Start of nasalisation: refutation 4
- 2.9.5. Start of nasalisation: refutation 5
- 2.9.6. Start of nasalisation: refutation 6
- 2.10. Formant structure /tu/
- 2.10.1. Formant structure /tu/: refutation 1
- 2.10.2. Formant structure /tu/: refutation 2
- 2.10.3. Formant structure /tu/: refutation 3
Chapter 3 Discussion of Example of Refutation Based Recognition
- 3.1. Introduction
- 3.2. Spectrograms used in the tests.
- 3.3. Differences between the spectrograms
- 3.2.1. The initial /t/
- 3.2.1.1. Improving the tests for /t/
- 3.2.1.1. Incomplete closure.
- 3.2.1.2. /s/ identified as /t/
- 3.2.1.3. /d/ identified as /t/
- 3.2.1.4. /v/ identified as /t/
- 3.2.1.5. Start of voicing identified as /t/
- 3.2.2. Start of voicing
- 3.2.2.1. Improving the tests for start of voicing.
- 3.2.2.2. Increasing the constraints of the tests of start of voicing
- 3.2.3. Nasalisation.
- 3.2.4. Formants
- 3.3. Gaps in the demonstration
- 3.4. Issues in computer program design
- 3.5. Applicability to speaker independent recognition
- 3.6. Summary of conclusions
Chapter 4 Related Approaches
- 4.1. Introduction
- 4.2. The templates of Blumstein and Stevens
- 4.3. Comparison
- 4.4. Conclusions
Chapter 5. Comparison with Previous Approaches
- 5.1. Introduction
- 5.2. Current systems and the recognition of ordinary speech
- 5.3. A model of spoken communication
- 5.4. Speech variability and the concept of invariance
- 5.5. Basic approach
- 5.6. Three basic pattern-matching techniques
- 5.7. Differences between pattern-matching techniques and refutation
- 5.7.1. Introductory comments
- 5.7.2. The reproduction of refutation criteria by pattern matching
- 5.7.3. Comparison with respect to recognition in open systems
- 5.7.4. Assumptions incompatible with the open system problem
- 5.7.5. Recognition of rationalist phenomena
- 5.8. Complexity and linguistic constraints
- 5.9. State of the Art
- 5.10. System Evaluation
- 5.11. Lack of universality and the evaluation problem
- 5.12. Comparative assessment of approaches
- 5.13. Summary of Differences
Chapter 6. A Refutation based spoken digit recogniser
- 6.1. Introduction.
- 6.2. System overview
- 6.3. Recognition performance evaluation
- 6.3.1. Theoretical basis
- 6.3.2. Measures of recognition performance
- 6.3.2.1. Notation
- 6.3.2.2. Preliminary definitions
- 6.3.2.3. Measures
- 6.4. Recognition performance estimates
- 6.4.1. Relation between algorithm performance and system performance
- 6.4.2. Estimates of system performance from algorithm performance
- 6.4.3. Estimates based on an assumed equality of algorithm performance
- 6.4.4.Estimated system performance characteristics
- 6.4.4.1. Estimated system response to non-target input
- 6.4.4.2. Limits on permitted algorithm error response to non-target input
- 6.4.4.3. Estimated system response to target input
- 6.4.4.4. Limits on permitted algorithm error response to target input
- 6.4.5. Estimation of recognition algorithm reliability
- 6.4.6. Estimation of system performance in applications
- 6.5. Applicability to evaluation of Pattern Matching based systems
- 6.5.1. Mapping a Refutation based system onto a Pattern Matching system
- 6.5.2. Basis for a comparative evaluation
- 6.5.3. Performance predictions for a Pattern Matching based system
- 6.6. Use of a statistical decision function in the result consolidator
- 6.6.1. A statistical decision function
- 6.6.2. Redundancy of employing a statistical decision function
- 6.7. The evaluation experiment
- 6.7.1. The Macrospeak system
- 6.7.2. Recording environment
- 6.7.3. Speaker population
- 6.7.4. Choice of counter example utterances
- 6.7.5. Data recording
- 6.7.6. RB-1 data
- 6.7.7. Macrospeak data and macros
- 6.8. Results
- 6.8.1. Run-time performance
- 6.8.2. RB-1 recognition performance
- 6.8.3. Macrospeak recognition performance
- 6.8.4. Comparative recognition performance
- 6.8.5. Discussion
- 6.9. Conclusions
Chapter 7. Concluding Remarks
- 7.1. Achievements
- 7.2. Change of problem conceptualisation
- 7.3. Applicability to speaker independent recognition
- 7.4. Future work
- 7.5. Final remarks
References
APPENDIX A
Format Used in Presenting the Results
APPENDIX B
Results: recognition of "two" when followed by "nine"
spectrogram: JD1A28
hypotheses
- start of /t/
- start of voicing
- nasal - hypothesis refuted (valid result)
spectrogram: 2A22
hypotheses
- start of /t/
- start of voicing
- nasal - hypothesis refuted (valid result)
- start of /t/
- start of /t/
- start of voicing
- nasal
- formants - hypothesis accepted (valid result)
- start of /t/ - hypothesis refuted (invalid result)
- start of /t/
- start of voicing
- nasal
- formants - hypothesis accepted (valid result)
- start of /t/
- start of /t/ - hypothesis refuted (valid result)
- start of /t/
- start of voicing - hypothesis refuted (invalid result)
- start of /t/
- start of voicing
- nasal - hypothesis refuted (valid result)
- start of /t/
- start of voicing
- nasal
- formants - hypothesis refuted (valid result)
- start of /t/
- start of voicing
- nasal
- formants - hypothesis accepted (valid result)
- start of /t/
- start of voicing
- nasal
- formants - hypothesis accepted (invalid result)
- start of /t/
- start of voicing
- nasal
- formants - hypothesis refuted (valid result)
- start of /t/
- start of voicing
- nasal
- formants - hypothesis refuted (valid result)
- start of /t/
- start of voicing
- nasal
- formants - hypothesis accepted (valid result)
- start of /t/
APPENDIX C
Implementation of a Refutation based speech recogniser
- C.1. Introduction
- C.2. System configuration
- C.3. Conceptual view of system
- C.4. Overview of the word analysis procedures
- C.5. Hypothesise and test procedures
- C.5.1. Single temporal hypotheses
- C.5.2. Start-or-end-of-formant-frequency-hypotheses
- C.5.3. Formant-hypotheses
- C.5.4. Combined-F2-F3-hypotheses
- C.6. Granularity reduction procedures
- C.7. Mutual compatibility procedures
- C.7.1. Mutual compatibility of formant trajectories
- C.7.2. Mutual compatibility of surviving same-word hypotheses
- C.7.2.1. Stage 1 (surviving temporal variable identification)
- C.7.2.2. Stage 1 (legitimate temporal contiguity check)
- C.7.2.3. Stage 2 (granularity reduction)
- C.7.2.4. Stage 3 (hypothesise and test)
- C.8. Result consolidator procedure
- C.9. A toolkit of test procedures
- C.9.1. Basic Procedures
- C.9.1.1. The average box
- C.9.1.2. The rectangular slider
- C.9.1.3. The angled slider
- C.9.1.4. The slope average
- C.9.2. Compound Procedures
- C.9.2.1. Relative masks
- C.9.2.2. Static box and slider
- C.9.2.3. Gap searcher
- C.9.2.4. Frequency of occurrence counters
- C.10. Groups of tests and the selective application of test criteria