[MV] Voice recognition for transcription: Opinions sought

Jay A. Kreibich jay at kreibi.ch
Fri Feb 23 22:43:45 PST 2007


On Thu, Feb 22, 2007 at 08:33:36AM -0600, Chuck Rogers scratched on the wall:
> Rena (and everyone else):
> 
> Simply put, no speech recognition can do what you ask, and it will  
> probably be many years before computers have the processing power  
> necessary to do so. 

  That's true, but one might make a strong argument that this is true
  of all of today's dictation-style systems.  The need for per-user
  training and near perfect microphone placement both point to this
  being a technology that isn't really ready for the masses.  It's
  interesting research, but for someone like me who is able to use a
  keyboard quite well, it is nothing more than a toy... and one I gave
  up on fairly quickly, at that.

  Even if the benchmark is a human trained in dictation-- as opposed to
  someone such as a Court Recorder, who is trained in stenography and
  essentially does on-the-fly transcription-- modern software
  still isn't up to the task.

  In the case of my post, I'm well aware that what I proposed and
  envision as the ideal solution is far beyond current systems.
  Long term goals usually are.

> The problem will be that the speaker is not  
> speaking his or her punctuation,

  Yes, they are, they just aren't using words.  Oral languages came
  first.  All the "extra" bits in the written language are there to
  fill the expression gap that the oral language carries in pauses,
  speed changes, pitch bends, and a number of other nuances.  Those
  "extra marks" wouldn't be in the written language if the concepts
  they're attempting to express weren't in the oral language first.

  Saying punctuation isn't spoken is like saying every mark on a
  musical score that doesn't happen to be a note isn't "played."

> and not speaking in an environment  
> with a consistent noise level, nor will they be using a noise- 
> canceling microphone that is in a consistent position in relation to  
> their mouth.

  Again, true, but the human ear and auditory processing systems are
  extremely good at dealing with this.  A typical non-technical
  customer doesn't care about the fact that we don't really understand
  why the human auditory system is so amazingly good at isolating and
  tracking a single voice in a noisy environment.  All they know is
  that it is really easy for them to do, so that is the expectation.

  Once more, I would say this is an example of why the technology is
  not ready for mainstream mass use.  The fact that computers can't
  overcome these issues means the technology is lacking, not that
  consumers should re-adjust their expectations.  Nature has shown us
  all in a very personal way that this is possible.

  I know it's extremely hard.  Good things usually are.

> All of these factors will introduce enough inaccuracy in  
> the transcribed text to make it not worth the effort.

  Exactly.  So unless you're willing to learn the new and non-trivial
  skill of dictation and are able/willing to set up an environment in
  which that works, current voice recognition systems are a bust.


  Don't get me wrong-- I love the fact this technology is on the market
  and available to those that need it.  While I think this type of
  technology has a long way to go, I also think it is "good enough" to
  justify being on the market, having people pay money for it, and for
  research and development to continue.  If you have no other choice,
  even the existing systems are a godsend.  But I'll stand by the
  idea that, for the general consumer market, voice recognition in
  its current state is a "no other choice" type thing.  Building a
  market based off "no other choice" is an extremely poor position to
  work from.



> We have many people using our transcription solution and what they do  
> is re-speak the audio in their own voice, inserting punctuation as  
> they go. This produces much more reliable transcription and still  
> saves about 30% of what it would take to type in the text manually.

  You must type very slowly.
  
  In my experience this method actually took longer, in all but the
  most informal or short writings.  It is extremely difficult to go
  back through several pages of text that contain no punctuation
  whatsoever and figure out something even as simple as where all
  the periods and commas go.  You more or less re-create the whole
  authoring process to understand the flows and blocking of the
  thoughts as they were put into words.  And once all that is done,
  all you have is a rough draft that still needs all the required
  editing and revising any other draft would require.

  This is actually why I first got into speech recognition systems.
  It wasn't worth it.  I'm glad others have had better luck, as I'm
  happy to see the technology continue to evolve and improve-- and
  there is a lot of room for improvement.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"'People who live in bamboo houses should not throw pandas.' Jesus said that."
   - "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006"

