Introduction to Automatic Speech Recognition: A Refutation Approach

I obtained my Ph.D. sometime in 1988 and found it hard to write anything technical after that.

My experience of doing the Ph.D. was wonderful. I had decided to get one after seeing that all the people I admired for their ability to think quickly, incisively, analytically and self-critically, and, faced with any serious question, to come up instantly with at least six plausible explanations, had a doctorate. Of course other people had doctorates too, but what was noticeable was that no one without a doctorate appeared to have all those qualities, all together as a lump. I wanted that ability and that is what I went for. And that's what I got. In fact I think I got a bit too much. After writing the thesis, I was useless for any speculative technical writing.

But doing the Ph.D. was fantastic. I read masses, wrote tons of code, and thought and thought and thought. I had got into Automatic Speech Recognition (ASR) almost by accident. After graduating in Computer Science in 1978 I had wanted to do a higher degree in Artificial Intelligence (AI) and had dreamt up an interesting project. It struck me that we had great difficulty in seeing brains in action, so it was difficult to see how nature did things. I also thought there was a problem with building AI programs: knowing whether they actually worked. I mean, we don't always make the best decisions, and a lot of the time we seemingly do nothing at all. How was one to know that a complex program was thinking and had not got itself locked in a complex loop?

The idea I had was to model a natural decision-making system that we actually knew something about: bees in a beehive. We could model the bees, how they move about through the hive, go and find honey or water, decide where to set up their nest and so on, and we had data from beekeepers and researchers as to how bees tended to behave. Thus I thought that in principle we could determine whether the system was working, and then, once it was, we could play with it by changing the system parameters. For instance, we could make the bees able to communicate more quickly by extending and branching their antennae so that effectively each bee became a neuron.

But this door was closed to me because I had only got a lower second, and to get research council funding for the studentship I needed an upper second or a first. So I wrote to a whole lot of companies to see if any would ask me to work only four days a week, leaving the fifth free to do research. All came up negative. But one day, when I was freelancing as a DIBOL programmer in Nottingham, I got a call from, what was his name... Dave I think, David... it will come to me later; he was head of the research division at Marconi New Parks in Leicester, and he said he couldn't offer me four days a week but thought I would find it interesting to visit. That's how I became part of the team building the TEPGEN ship's bridge and flight simulator.

It was a great experience and I met a whole lot of amazingly bright people, including a mass of New Zealanders. I have lots to say about approaches to the business of modelling the visual world, but now's not the time. I got involved in database design and realised that no matter how much I tried to keep abreast of developments in this area, there would soon be kids coming out of universities who knew more than me. I had to get back to academia.

As it happened, Ernest Edmonds, who was reader in computing at Leicester Poly where I did my degree, together with John Connolly and Abdullah Hashim, got a research grant on the back of work that had been done there by Bill Booth and Lynn Thomas developing the Palantype-to-text translation system, I think still used by the BBC for real-time subtitling of programs for the deaf. The research grant was to take a look at whether it might be possible to take the Palantype technology and for the Palantype substitute a vision system which identified patterns in visible speech spectrograms. This was 1980, and Zue and Cole had just demonstrated, to the surprise of all and sundry, that experts could read speech spectrograms (Zue, V. and Cole, R., "Experiments on spectrogram reading," Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, pp. 116-119, Washington, D.C., 1979). I remember Abdullah asking me what I wanted from the research position and me saying "fame and fortune". I was back in academia as a research assistant.

It was a wonderful time. It became immediately obvious to me that people had a wholly erroneous view of the nature of "recognition". The word itself says it all: re-cognition, as if we somehow matched something seen or heard previously with that which we were seeing or hearing now. Impossible! I had been a painter and had a bit of an idea of how complex the business of seeing is. One does not draw what one sees. One draws what one understands, and the business of looking is a process of discovery akin to exploration. It could not possibly be a matter of mapping templates onto anything. Well, I understand a little more now than I did then, and I might be a little more cautious with my language, so much the worse for me! But the thrust of it is the same. Vision, hearing, the business of understanding the world about us, is a far more complex business than can be captured in a few templates.

I threw myself into the business. I read everything. It was a really exciting time in computing research. The Americans had thrown a mountain of money into speech recognition; the results were just coming out, and there was a mass of ideas being tried out. But reading some of that stuff required occasional convolutions of logic. Speech was a mysterious place, and reading some authors you'd think that if you only heard part of a sentence one day and the rest of it the next, you'd not identify any of the words or anything in it until then! Yes, this is not bad grammar on my part. I swear, there were some crazy thoughts being thought, or rather, the logic was not being pursued to its proper and ridiculous conclusion.

The worst of it was that although there were a lot of assertions about how things worked, there was next to no evidence, and even less examination of the nature of the problem of determining what is happening in the world about us. We were by now a couple of years into a major SERC grant into Human Computer Interface research, and I was reading about philosophy of language and perception, psychology of perception, books on language, not just the empiricist tradition but also semiotics, Derrida and other crazy neo-language-mysticism stuff, as well as the technical literature on speech and vision. The business of putting a visual front end onto the Palantype was just to substitute the problem of vision for the problem of speech. We needed to do something else. We needed a handle on how to approach the problem. We needed to know what the problem was!

I had our spectrogram reader, Sue Johnston, sitting across the desk from me, and a spectrogram I had made of something like two and a half minutes of speech from the House of Commons circling the room like a Greek frieze. The fact of our ability to read that stuff was there in front of me in the form of Sue, and the nonsense of trying to identify by computer what she identified in those patterns just screamed at me. We were caught in a process of infinite regress. The expert could identify the patterns that carried the information, but to try to identify those patterns was as big a problem as the whole of speech. We needed some way of getting Sue to express what she saw in the spectrograms in terms the machine could understand. It was not a matter, then, of building a program which could identify what Sue saw, but rather of providing Sue with the means to formalise the significant bits of her understanding. That is, we needed to talk to the machine in terms it could understand. Express ourselves in terms that the computer could handle. But what might that look like?

Take a simple, everyday piece of research computing: curve smoothing. Whenever one wants to extract the information-bearing peaks in a signal, one typically passes the signal through a filter which smooths out the bumps. But just how big a bump constitutes noise rather than the peak one wants to extract? We don't want to smooth things to the point where the peaks disappear. And of course, the size of bump that constitutes a peak varies fantastically. There is no absolute rule to use. So the interesting question is: what are the criteria to use to determine whether a bump is noise or an information-bearing peak in the signal? The answer, of course, is that things are complex. It all depends on what you are looking for, what you know of the behaviour of the thing sought, and so on and so on. And how do you express all that? Not by pattern-matching templates, that's for sure.
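
To make the dilemma concrete, here is a minimal sketch in Python (my own illustration, nothing from the thesis): a moving-average smoother in which everything hangs on the choice of window size, and nothing in the signal itself tells you what that choice should be.

    import numpy as np

    def smooth(signal, window):
        # Moving-average filter: bumps narrower than `window` are
        # flattened; wider features survive as peaks.
        kernel = np.ones(window) / window
        return np.convolve(signal, kernel, mode="same")

    # A broad "information-bearing" peak plus narrow noise spikes.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 500)
    signal = np.exp(-((t - 0.5) ** 2) / 0.005)
    signal += 0.4 * rng.standard_normal(500) * (rng.random(500) > 0.95)

    # The crux: a small window leaves the noise spikes standing, while a
    # large one may erase a genuinely narrow peak. The criterion for
    # choosing cannot come from the signal alone.
    for w in (3, 25, 101):
        smoothed = smooth(signal, w)
        maxima = int((np.diff(np.sign(np.diff(smoothed))) < 0).sum())
        print(f"window={w:3d}: {maxima} local maxima remain")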

That's when I came across Popper. At last! Someone who was looking at the problem. Not just him but Hume also, and Kant. WOW! My education was beginning. Ah, logic. There is nothing like it. We now had the beginnings of a handle on the problem, at least a point of view that differed from what people had been doing to date. Oh, I've failed to tell this bit of the story: it was gradually becoming apparent from the literature that without exception all research in speech recognition, no matter how complex, boiled down to endless repetitions on a single theme: create a template, compare it with the signal to obtain a number, compare that number with the numbers obtained from other templates, choose the highest number, or next best, or whatever. Just repetition of the same thing over and over.
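
Stripped to its skeleton, the theme looks something like the following sketch (my own illustration, not any particular published system; the distance measure, labels and numbers are invented for the example):

    import numpy as np

    def score(signal, template):
        # One common choice of "number": negative Euclidean distance,
        # so that a higher score means a closer match.
        n = min(len(signal), len(template))
        return -np.linalg.norm(signal[:n] - template[:n])

    def recognise(signal, templates):
        # Compare the signal against every stored template and choose
        # the highest number. Note the closed world: the answer must be
        # one of the stored labels, "noise" included.
        return max(templates, key=lambda label: score(signal, templates[label]))

    templates = {
        "yes":   np.array([1.0, 2.0, 1.0]),
        "no":    np.array([2.0, 0.0, 2.0]),
        "noise": np.array([0.0, 0.0, 0.0]),
    }
    print(recognise(np.array([1.1, 1.9, 0.9]), templates))  # -> "yes"

All the elaboration in the literature, on this view, goes into the score function and the templates; the decision rule stays the same.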

Ah, and I really had missed out the best part. These attempts were all based on the idea that the world consisted of a finite number of different types of thing to be recognised, or, to stand logic on its head, the set of things you want your machine to identify plus one more thing which will be given the name: noise. So even the techniques bore no relation to the problem!

Ah, and again I missed another. Not only is the machine to perform this truly incredible feat of identifying things in the world, but it has to learn to do it from examples provided by the user or whoever. Lunacy.

Oh, and another thing; how it all comes tumbling out once the door's opened. Shannon and Weaver's information theory has been one of the major tools by which the whole of our electronics industry has developed. And what is the main point of the theory? That the amount of information carried by a signal is inversely related to its probability, i.e. if you know exactly what's in the letter delivered by the postman there is no point opening it. So how does the speech recognition community make use of this? By looking for the most probable matches of symbol to template, of course! That is, the most frequently occurring correlations. Of course, the least probable messages are given an increased probability of successful transmission by increasing the redundancy, but if out of the blue someone shouts "LOOK OUT!" you don't have that much redundancy to work with. And it is in these life-and-death situations that you might think you should have a recogniser that was less interested in getting the most probable signal!
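
In symbols, the self-information of an event of probability p is -log2(p) bits: the rarer the event, the more it tells you. A two-line sketch (the probabilities are invented for illustration):

    import math

    def self_information(p):
        # Shannon: the information content of an event of probability p,
        # in bits. Rare events carry the most information.
        return -math.log2(p)

    print(self_information(0.5))    # a predictable word: 1.0 bit
    print(self_information(0.001))  # a rare shout of "LOOK OUT!": ~9.97 bits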

Not really surprising, then, that people tried not to engage in technical discussions with me. It was all a bit fragile really. I remember at a conference in Windermere, at dinner, talking about Popper's ideas with a researcher across the table from me and her almost having an aneurysm at the thought that perception might not involve templates! It was unbelievable; she literally foamed at the mouth. Her colleague had to calm her by saying that it was all right to have different ideas in research! Who was it... some biggie from the USA... was asked by a group of researchers I had been chatting to whether he didn't agree that our perception operated on the principles of a Boltzmann machine. The visitor replied that the interlocutor might be a Boltzmann machine, and so might be his colleague, but that he himself wasn't, and I wasn't, and the other people in the room weren't. Wow, humour!

Aye, Boltzmann machines. You'd think, for all the attendant excitement, that these things gave us some new decision-making function. Well, OK, they transform the signal in interesting ways, but people still have to use pattern matching on their outputs! I could not believe it.

The trouble, in my opinion, with so much of the research done in this field is that the linguists rarely want to get their hands dirty doing anything as technical as writing computer programs, and the techies have no time for doing proper scientific work like investigating actual vocal tract properties and how to detect their various configurations from the acoustic signal, never mind how humans then interpret it as speech. Instead there's nothing as popular as tweaking this algorithm or that and seeing whether it gives a better result than the other, like hamsters on a treadmill.

By the time of Windermere I had already worked out the basic structure of the computer procedures based on Popper's ideas on the growth of scientific knowledge, but there were miles to go yet, especially on the Ph.D. write-up. It is one thing to think that you have something that is quite distinct from what has gone before. It is quite another to prove it. First you have to show how things were done in the past, and there are an awful lot of these things. You need to show what the ideas were, or where they came from, and people don't always tell you the background of ideas at a level useful to your purposes. Then sometimes the claims of one author as to the conclusions reached by another author cannot be reconciled with the literature. And Wittgenstein... don't talk to me about Wittgenstein.

I went mad, of course. I got locked into trying to show that Levinson's demonstration that all current approaches were logically equivalent did not apply to my method. Whether his argument is or is not wrong, it is as yet beyond my skill to demonstrate. It is clear to me that there are aspects of the formalism of my recognition procedures which are in some fundamental sense distinct from what others have done, but to date I can't show it. We are up against the fact that the computer is a Turing machine, and a Turing machine is a Turing machine is a Turing machine. And a Turing machine is a finite beast: new elements do not come into existence outside of the logic of all the possible combinations of what is initially given. The tape may be infinite, but you can only write on the tape that which can be written by a Turing machine, and, as argued by Penrose, there are things that a Turing machine cannot write but we can.

Anyway, as part of the process of getting to grips with the business of writing the thesis, I wrote a rather nice but esoteric hundred-plus pages on the subject of recognition, language, knowledge and everything: the Physics of Thought. This enabled me to finish enough to get it bound. My examiner, Frank Fallside, thought I needed to add an extra chapter. That's Chapter 6 and Appendix C. It was a useful extra year's work. I think the evaluation measures described in Chapter 6 might have a future. I learnt masses. He was right: I had not done enough. This is a technical area, and the idea needed implementing in something more concrete than the simple demonstration I had provided. Frank said to my boss, Prof Edmonds, that he didn't agree with me but could not fault the logic. That's a real compliment. His premature death shocked me deeply.

The madness that locked me into the ridiculous attempt to square the circle, so to speak, together with the desire to demonstrate too much and the feeling that, because everything I was trying to say was outside the set of accepted ideas, I needed to dot every i and cross every t, rendered me unable to write anything. I was fine at helping others, but for myself I became immobile. The inability to discuss this with anyone did not help. I was out on a limb and that was that.

I did present a paper to the IEE in London sometime in the early '90s, but that turned into a farce. I was going there to argue that we needed to take account of the real world. I could not have hoped for a better example of the problem I was referring to. A researcher from one of the government research labs had demonstrated a ten- or twenty-word recogniser. The results came out in big letters on a screen behind him. He then forgot to turn it off, and as he spoke all his words were being identified as the wrong thing. For a while he never noticed, but he turned it off fast enough when he saw it! I thought people might at least make a comment or something. This was, after all, a symposium on making these things usable in the real world. But by the time I came to present, this had all been forgotten.

I was not popular. I had given a bad presentation, mind, but even so. Papers were thrown, so to speak. The chairman, whom I knew well, said to me, "But Julius, we've had noise templates for years", to which I replied that if he thought we could capture the whole of the world on a template then I had simply to disagree. It was a disaster. At the tea break there was a two-metre empty space all around me.

By this time I was working at the University of Ulster, and I was not set up to do any speech work even had I wanted to. I dabbled a bit in looking to see if there were any learning algorithms to be had for my method, but the philosophy I had developed by then ruled this out as a logical possibility. I dabbled, but could not come up with anything. And I just could not bear to go back, once again, to trying to write up the thesis for publication. Instead I developed a painting system and had a few thoughts about the theory of evolution.

And thus things have rested until now. I have a regular subscription to the New Scientist, and occasionally they have an article on how things are progressing in speech. Invariably this leads me to dash off an angry letter to the effect that the speech recognition community is still peddling the same old thing it always has. But there was an article a week or so back saying how researchers were split over whether any significant progress was going to be made in this area in the short term. I just thought it was about time I put the thesis out into the world and see what happens.

It will take a bit of time to get it all up on the web because it was written in a lovely Mac editor called WriteNow which is no longer supported. The business of translating the document into something new technology can read has been interesting. Suffice it to say I'm doing it in stages:

  1. the summary and index, so as to provide an overview;
  2. then Chapter 1, which outlines the "Refutation Based" approach and presents the main arguments;
  3. then Chapter 5, which compares the approach with other approaches;
  4. then Chapter 2, which describes the method in a highly constrained context;
  5. then Appendix C, which describes a ten-digit recogniser built using the approach;
  6. then Chapter 6, which describes a comparative performance evaluation of the digit recogniser against the Marconi Macrospeak, and presents the rationale of an approach to system evaluation together with the corresponding evaluation formulae.


Please send me your comments.

If you include your e-mail address I may reply!

Page last modified: 18:47, Monday 13th May 2013