Automatic Speech Recognition: a Refutation Approach

Author: Julius Jonathan Guzy

These pages contain, with minor modification, the thesis submitted in partial fulfilment of the requirements for my degree of Doctor of Philosophy (Ph.D.).
School of Mathematics, Computing and Statistics
Leicester Polytechnic
England
December 1988

"Scientific theories are not statements of fact; they are not even descriptions of effects; they are explanations, which means that only their consequences are open to inspection and available to be compared with our experience"

Bronowski, Humanism and the Growth of Knowledge

ABSTRACT

The thesis describes a "refutation" approach to the design of recognition algorithms for use in Speech Input Interfaces that operate in natural "open systems" environments. The approach is based on a view of the scientific method due to Popper. This gives rise to a view of perception that differs from that underlying other approaches.

An example is presented of the derivation of a refutation based algorithm from a theory of the articulation /tu/ in the context of a following /nain/. The resultant algorithm shows the method to be capable of expressing the generality required of speaker independent speech recognition algorithms.
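As a rough illustration of the idea only, a refutation based recogniser can be pictured as a hypothesis that must survive an ordered series of tests, and is abandoned at the first test it fails. The sketch below is not the thesis's algorithm: the function name `first_refutation`, the dictionary input and the toy tests are all assumptions made for the example. The test names simply echo the stages listed in Appendix B.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical type: a test returns True if the hypothesis SURVIVES it,
# and False if the hypothesis is refuted by the data.
Test = Callable[[Dict[str, bool]], bool]


def first_refutation(
    tests: Sequence[Tuple[str, Test]], data: Dict[str, bool]
) -> Optional[str]:
    """Return the name of the first refuting test, or None if none refutes."""
    for name, test in tests:
        if not test(data):
            return name  # hypothesis refuted here; later tests are not run
    return None  # hypothesis survived every test and is accepted


# Toy stand-ins for real acoustic tests (assumed, not from the thesis).
tests = [
    ("start of /t/", lambda d: d["burst"]),
    ("start of voicing", lambda d: d["voicing"]),
    ("nasal", lambda d: d["nasal"]),
    ("formants", lambda d: d["formants"]),
]

accepted = {"burst": True, "voicing": True, "nasal": True, "formants": True}
refuted = {"burst": True, "voicing": True, "nasal": False, "formants": True}

print(first_refutation(tests, accepted))  # None
print(first_refutation(tests, refuted))   # nasal
```

In this picture a hypothesis is never confirmed by a test; it is only accepted for want of a refutation.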

A comparison is performed between the refutation approach and past approaches. Significant technical and philosophical differences are shown to exist between them. An example of an isolated word, speaker independent, Refutation based digit utterance recogniser, the RB-1, is presented. The problem of recognition performance evaluation for recognisers designed to operate in open system environments is examined, and a set of performance evaluation measures is proposed.

A comparative evaluation of the recognition performance of the RB-1 and a commercial Pattern Matching based system, the Marconi Macrospeak, is described. For the Macrospeak to obtain a performance better than that of the RB-1 over all of the evaluation parameters considered, thresholds had to be imposed over the goodness of match score. The percentages of correct responses to target utterances for the Macrospeak and the RB-1 were 44% and 25% respectively; the percentages of error responses to target utterances were 0.05% and 0.06%; and the percentages of error responses to non-vocabulary utterances were 16% and 22%.
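To make the three quoted percentages concrete, the following sketch shows one way such figures could be computed from a log of trials. It is an assumption-laden illustration, not the definitions given in Chapter 6: the function `response_percentages`, the trial format and the rule that a rejection (no response) counts as neither correct nor an error are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical trial record: (word actually spoken, recogniser response).
# A response of None means the recogniser rejected the utterance.
Trial = Tuple[str, Optional[str]]


def response_percentages(
    trials: Iterable[Trial], vocabulary: Set[str]
) -> Dict[str, float]:
    """Compute three illustrative percentages from a list of trials."""
    target = correct = target_err = non_target = non_target_err = 0
    for spoken, response in trials:
        if spoken in vocabulary:  # target utterance
            target += 1
            if response == spoken:
                correct += 1
            elif response is not None:  # wrong word emitted
                target_err += 1
        else:  # non-vocabulary utterance
            non_target += 1
            if response is not None:  # any response is an error
                non_target_err += 1

    def pct(part: int, whole: int) -> float:
        return 100.0 * part / whole if whole else 0.0

    return {
        "correct_on_target": pct(correct, target),
        "error_on_target": pct(target_err, target),
        "error_on_non_vocabulary": pct(non_target_err, non_target),
    }


trials = [
    ("two", "two"), ("two", None), ("nine", "five"), ("nine", "nine"),
    ("hello", None), ("hello", "two"),
]
result = response_percentages(trials, {"two", "five", "nine"})
print(result)
```

Under this reading, the correct and error percentages for target utterances need not sum to 100%, because rejected utterances fall into neither class.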

In conclusion, the main area for future work is identified as that of spoken command protocol design. The scientific contribution of this thesis is that of having introduced a different view of the problem of automatic speech recognition, and of having described a novel set of principles in recognition algorithm design.

Acknowledgements

This research was conducted as part of this author's contribution to the work of the Speech Recognition Group at Leicester Polytechnic, funded initially by SERC Grant No. GR/B/54599; later of the Speech Recognition Group of the Human Computer Interface Research Unit at Leicester Polytechnic, funded by SERC Rolling Grant No. GR/D/72662; and of the Loughborough University Human Computer Interface Research Centre, funded under Alvey Grant No. CR/E/39143 (MMI/062).

The author would like to offer special thanks to Maria Guzy, Ernest Edmonds, John Connolly, Abdullah Hashim, Helmut Bez, Ian Newman, Sue Johnson, Andree Woodcock, Eddie McDaid, Mike Smyth, Tom Bayley, Stella O'Brien, Steve Guest, Andre Schappo, Jackie Highfield, and Louise Poole for their help, comments, advice, support and all the other things that go towards doing useful research.

CONTENTS

  • Chapter 1 A Refutation Approach to Automatic Speech Recognition

    • 1.1. The problem area
    • 1.2. Instrumental and rationalist phenomena
    • 1.3. The problem situation
    • 1.4. Popper's solution to the problem of induction
    • 1.5. Outline basis for an approach to the recognition problem
    • 1.6. Outline of a refutation approach to the recognition problem
    • 1.7. Structure of remaining chapters
  • Chapter 2 A Worked Example of Refutation Based Recognition

    • 2.1. Introduction
    • 2.2. Description of the articulation /tu/ in the context /tunain/
    • 2.3. Data capture
    • 2.4. Outline recognition strategy
    • 2.5. Representation of temporal, frequency and intensity information
    • 2.6. Presentation of results
    • 2.7. The refutation process
      • 2.7.1. Start of /t/: refutation 1
      • 2.7.2. Start of /t/: refutation 2
      • 2.7.3. Start of /t/: refutation 3
      • 2.7.4. Start of /t/: problems of silence
        • 2.7.4.1. Start of /t/: problems of silence - refutation 4
        • 2.7.4.2. Start of /t/: problems of silence - refutation 5
        • 2.7.4.3. Start of /t/: problems of silence - refutations 6 and 7
    • 2.8. Start of voicing
      • 2.8.1. Start of voicing: refutation 1
      • 2.8.2. Start of voicing: refutation 2
    • 2.9. Start of nasalisation
      • 2.9.1. Start of nasalisation: refutation 1
      • 2.9.2. Start of nasalisation: refutation 2
      • 2.9.3. Start of nasalisation: refutation 3
      • 2.9.4. Start of nasalisation: refutation 4
      • 2.9.5. Start of nasalisation: refutation 5
      • 2.9.6. Start of nasalisation: refutation 6
    • 2.10. Formant structure /tu/
      • 2.10.1. Formant structure /tu/: refutation 1
      • 2.10.2. Formant structure /tu/: refutation 2
      • 2.10.3. Formant structure /tu/: refutation 3
  • Chapter 3 Discussion of Example of Refutation Based Recognition

    • 3.1. Introduction
    • 3.2. Spectrograms used in the tests.
    • 3.3. Differences between the spectrograms
      • 3.2.1. The initial /t/
        • 3.2.1.1. Improving the tests for /t/
        • 3.2.1.1. Incomplete closure.
        • 3.2.1.2. /s/ identified as /t/
        • 3.2.1.3. /d/ identified as /t/
        • 3.2.1.4 /v/ identified as /t/
        • 3.2.1.5. Start of voicing identified as /t/
      • 3.2.2. Start of voicing
        • 3.2.2.1. Improving the tests for start of voicing.
        • 3.2.2.2. Increasing the constraints of the tests of start of voicing
      • 3.2.3. Nasalisation.
      • 3.2.4. Formants
    • 3.3. Gaps in the demonstration
    • 3.4. Issues in computer program design
    • 3.5. Applicability to speaker independent recognition
    • 3.6. Summary of conclusions
  • Chapter 4 Related Approaches

    • 4.1. Introduction
    • 4.2. The templates of Blumstein and Stevens
    • 4.3. Comparison
    • 4.4. Conclusions
  • Chapter 5. Comparison with Previous Approaches

    • 5.1. Introduction
    • 5.2. Current systems and the recognition of ordinary speech
    • 5.3. A model of spoken communication
    • 5.4. Speech variability and the concept of invariance
    • 5.5. Basic approach
    • 5.6. Three basic pattern-matching techniques
    • 5.7. Differences between pattern-matching techniques and refutation
      • 5.7.1. Introductory comments
      • 5.7.2. The reproduction of refutation criteria by pattern matching
      • 5.7.3. Comparison with respect to recognition in open systems
      • 5.7.4. Assumptions incompatible with the open system problem
      • 5.7.5. Recognition of rationalist phenomena
    • 5.8. Complexity and linguistic constraints
    • 5.9. State of the Art
    • 5.10. System Evaluation
    • 5.11. Lack of universality and the evaluation problem
    • 5.12. Comparative assessment of approaches
    • 5.13. Summary of Differences
  • Chapter 6. A Refutation based spoken digit recogniser

    • 6.1. Introduction.
    • 6.2. System overview
    • 6.3. Recognition performance evaluation
      • 6.3.1. Theoretical basis
      • 6.3.2. Measures of recognition performance
        • 6.3.2.1. Notation
        • 6.3.2.2. Preliminary definitions
        • 6.3.2.3. Measures
      • 6.4. Recognition performance estimates
        • 6.4.1. Relation between algorithm performance and system performance
        • 6.4.2. Estimates of system performance from algorithm performance
        • 6.4.3. Estimates based on an assumed equality of algorithm performance
        • 6.4.4.Estimated system performance characteristics
          • 6.4.4.1. Estimated system response to non-target input
          • 6.4.4.2. Limits on permitted algorithm error response to non-target input
          • 6.4.4.3. Estimated system response to target input
          • 6.4.4.4. Limits on permitted algorithm error response to target input
        • 6.4.5. Estimation of recognition algorithm reliability
        • 6.4.6. Estimation of system performance in applications
      • 6.5. Applicability to evaluation of Pattern Matching based systems
        • 6.5.1. Mapping a Refutation based system onto a Pattern Matching system
        • 6.5.2. Basis for a comparative evaluation
        • 6.5.3. Performance predictions for a Pattern Matching based system
      • 6.6. Use of a statistical decision function in the result consolidator
        • 6.6.1. A statistical decision function
        • 6.6.2. Redundancy of employing a statistical decision function
      • 6.7. The evaluation experiment
        • 6.7.1. The Macrospeak system
        • 6.7.2. Recording environment
        • 6.7.3. Speaker population
        • 6.7.4. Choice of counter example utterances
        • 6.7.5. Data recording
        • 6.7.6. RB-1 data
        • 6.7.7. Macrospeak data and macros
      • 6.8. Results
        • 6.8.1. Run-time performance
        • 6.8.2. RB-1 recognition performance
        • 6.8.3. Macrospeak recognition performance
        • 6.8.4. Comparative recognition performance
        • 6.8.5. Discussion
      • 6.9 Conclusions
    • Chapter 7. Concluding Remarks

      • 7.1. Achievements
      • 7.2. Change of problem conceptualisation
      • 7.3. Applicability to speaker independent recognition
      • 7.4. Future work
      • 7.5. Final remarks

    References

    APPENDIX A

    Format Used in Presenting the Results

    APPENDIX B

    Results: recognition of "two" when followed by "nine"
    • spectrogram: JD1A28 hypotheses
      • start of /t/
      • start of voicing
      • nasal - hypothesis refuted (valid result)
    • spectrogram: 2A22 hypotheses
      • start of /t/
      • start of voicing
      • nasal - hypothesis refuted (valid result)
    • spectrogram: 8A219 hypotheses
      • start of /t/
      • start of voicing
      • nasal - hypothesis refuted (valid result)
    • spectrogram: 15X829 hypotheses
      • start of /t/
      • start of voicing
      • nasal
      • formants - hypothesis accepted (valid result)
    • spectrogram: 24A2 hypotheses
      • start of /t/ - hypothesis refuted (invalid result)
    • spectrogram: 24A29 hypotheses
      • start of /t/
      • start of voicing
      • nasal
      • formants - hypothesis accepted (valid result)
    • spectrogram: 36X6 hypotheses
      • start of /t/
    • spectrogram: 37A7 hypotheses
      • start of /t/ - hypothesis refuted (valid result)
    • spectrogram: 38X29 hypotheses
      • start of /t/
      • start of voicing - hypothesis refuted (invalid result)
    • spectrogram: 39A127 hypotheses
      • start of /t/
      • start of voicing
      • nasal - hypothesis refuted (valid result)
    • spectrogram: 48X801 hypotheses
      • start of /t/
      • start of voicing
      • nasal
      • formants - hypothesis refuted (valid result)
    • spectrogram: 49A29 hypotheses
      • start of /t/
      • start of voicing
      • nasal
      • formants - hypothesis accepted (valid result)
    • spectrogram: 50X7 hypotheses
      • start of /t/
      • start of voicing
      • nasal
      • formants - hypothesis accepted (invalid result)
    • spectrogram: 50XDUB hypotheses
      • start of /t/
      • start of voicing
      • nasal
      • formants - hypothesis refuted (valid result)
    • spectrogram: 54A27 hypotheses
      • start of /t/
      • start of voicing
      • nasal
      • formants - hypothesis refuted (valid result)
    • spectrogram: 55X529 hypotheses
      • start of /t/
      • start of voicing
      • nasal
      • formants - hypothesis accepted (valid result)
    • spectrogram: 56X47 hypotheses
      • start of /t/

    APPENDIX C

    Implementation of a Refutation based speech recogniser
    • C.1. Introduction
    • C.2. System configuration
    • C.3. Conceptual view of system
    • C.4. Overview of the word analysis procedures
    • C.5. Hypothesise and test procedures
      • C.5.1. Single temporal hypotheses
      • C.5.2. Start-or-end-of-formant-frequency-hypotheses
      • C.5.3. Formant-hypotheses
      • C.5.4. Combined-F2-F3-hypotheses
    • C.6. Granularity reduction procedures
    • C.7. Mutual compatibility procedures
      • C.7.1. Mutual compatibility of formant trajectories
      • C.7.2. Mutual compatibility of surviving same-word hypotheses
        • C.7.2.1. Stage 1 (surviving temporal variable identification)
        • C.7.2.2. Stage 1 (legitimate temporal contiguity check)
        • C.7.2.3. Stage 2 (granularity reduction)
        • C.7.2.4. Stage 3 (hypothesise and test)
    • C.8. Result consolidator procedure
    • C.9. A toolkit of test procedures
      • C.9.1. Basic Procedures
        • C.9.1.1. The average box
        • C.9.1.2. The rectangular slider
        • C.9.1.3. The angled slider
        • C.9.1.4. The slope average
      • C.9.2. Compound Procedures
        • C.9.2.1. Relative masks
        • C.9.2.2. Static box and slider
        • C.9.2.3. Gap searcher
        • C.9.2.4. Frequency of occurrence counters
    • C.10. Groups of tests and the selective application of test criteria

    APPENDIX D

    Reprint: Guzy, J.J. (1982) "The acquisition of linguistic knowledge from visible speech spectrograms of ordinary speech: a proposal", International Journal of Man-Machine Studies, 16, pp. 327-332

    APPENDIX E

    Reprint: Connolly, J.H., Edmonds, E.A., Guzy, J.J., Johnson, S.R., Woodcock, A. (1986) "Automatic Speech Recognition Based on Spectrogram Reading", International Journal of Man-Machine Studies, Vol. 16, pp. 327-332


Page last modified: 11:57 Monday 7th. November 2011