AWS #Transcribe Speech to Text using C#

AWS has a Transcribe service which converts audio containing speech to text. It is very tightly integrated with the AWS ecosystem, so it's probably best used for systems that already use AWS for other services: specifically, S3 for storage of audio, and perhaps CloudWatch and Lambda for post-processing.
So, the Transcribe service takes an audio file that is already in an S3 bucket, and produces text output, which is placed in another bucket. The process is asynchronous, so it's best to have another event (e.g. CloudWatch + Lambda) dealing with the output.
First off, you need the NuGet package AWSSDK.TranscribeService installed in your project (Install-Package AWSSDK.TranscribeService). You should also have your local dev environment set up to access AWS via the CLI (aws configure). That last step is optional, but the code below assumes you have done it.
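using Amazon;
using Amazon.TranscribeService;
using Amazon.TranscribeService.Model;

// Credentials come from the local AWS profile (aws configure).
var client = new AmazonTranscribeServiceClient(RegionEndpoint.EUWest1);
// .Result blocks only until AWS accepts the job, not until transcription finishes.
var job = client.StartTranscriptionJobAsync(new StartTranscriptionJobRequest
{
    LanguageCode = LanguageCode.EnUS,
    Media = new Media
    {
        MediaFileUri = "s3://audioBucket/message.mp3"
    },
    MediaFormat = MediaFormat.Mp3,
    OutputBucketName = "aws.serverless.2",
    TranscriptionJobName = "message"
}).Result;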
Here, we specify the input S3 URI, which points to an MP3 file in US English. I also specify the output bucket and the job name, which Transcribe also uses to name the output file (message.json in this case).
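Since transcription happens in the background, one option during development, instead of wiring up CloudWatch and Lambda, is to poll the job status until it finishes. The following is a minimal sketch under that assumption; the class name TranscriptionPoller, the method WaitForJobAsync, and the 5-second interval are illustrative choices of mine, not part of the SDK.

```csharp
using System;
using System.Threading.Tasks;
using Amazon.TranscribeService;
using Amazon.TranscribeService.Model;

public static class TranscriptionPoller
{
    // Polls the named job until it reaches a terminal state, then returns it.
    // The 5-second delay is an arbitrary value chosen for illustration.
    public static async Task<TranscriptionJob> WaitForJobAsync(
        IAmazonTranscribeService client, string jobName)
    {
        while (true)
        {
            var response = await client.GetTranscriptionJobAsync(
                new GetTranscriptionJobRequest { TranscriptionJobName = jobName });
            var job = response.TranscriptionJob;

            // COMPLETED and FAILED are the two terminal states.
            if (job.TranscriptionJobStatus == TranscriptionJobStatus.COMPLETED ||
                job.TranscriptionJobStatus == TranscriptionJobStatus.FAILED)
            {
                return job;
            }

            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}
```

Calling TranscriptionPoller.WaitForJobAsync(client, "message") from an async method should return once the job is done. On COMPLETED, job.Transcript.TranscriptFileUri points to the output file; on FAILED, job.FailureReason explains what went wrong. For production use, an event-driven handler is likely a better fit than a polling loop.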
StartTranscriptionJobAsync returns as soon as the job is queued. Some time later, a file will appear in the output bucket with contents such as:
{
"jobName":"hotline2",
"accountId":"005445879168",
"results":{
"transcripts":[
{
"transcript":"thank you for Collins in his bank, the first American bank designed for international customers. Please leave your message and we will return your call shortly."
}
],
"items":[
{
"start_time":"0.44",
"end_time":"0.79",
"alternatives":[
{
"confidence":"1.0",
"content":"thank"
}
],
"type":"pronunciation"
},
{
"start_time":"0.79",
"end_time":"0.88",
"alternatives":[
{
"confidence":"1.0",
"content":"you"
}
],
"type":"pronunciation"
},
{
"start_time":"0.88",
"end_time":"1.01",
"alternatives":[
{
"confidence":"1.0",
"content":"for"
}
],
"type":"pronunciation"
},
{
"start_time":"1.01",
"end_time":"1.52",
"alternatives":[
{
"confidence":"0.9214",
"content":"Collins"
}
],
"type":"pronunciation"
},
{
"start_time":"1.52",
"end_time":"1.65",
"alternatives":[
{
"confidence":"0.9884",
"content":"in"
}
],
"type":"pronunciation"
},
{
"start_time":"1.65",
"end_time":"1.82",
"alternatives":[
{
"confidence":"0.9662",
"content":"his"
}
],
"type":"pronunciation"
},
{
"start_time":"1.82",
"end_time":"2.48",
"alternatives":[
{
"confidence":"1.0",
"content":"bank"
}
],
"type":"pronunciation"
},
{
"alternatives":[
{
"confidence":"0.0",
"content":","
}
],
"type":"punctuation"
},
{
"start_time":"2.51",
"end_time":"2.77",
"alternatives":[
{
"confidence":"1.0",
"content":"the"
}
],
"type":"pronunciation"
},
{
"start_time":"2.78",
"end_time":"3.12",
"alternatives":[
{
"confidence":"1.0",
"content":"first"
}
],
"type":"pronunciation"
},
{
"start_time":"3.12",
"end_time":"3.63",
"alternatives":[
{
"confidence":"1.0",
"content":"American"
}
],
"type":"pronunciation"
},
{
"start_time":"3.63",
"end_time":"3.99",
"alternatives":[
{
"confidence":"0.996",
"content":"bank"
}
],
"type":"pronunciation"
},
{
"start_time":"4.0",
"end_time":"4.58",
"alternatives":[
{
"confidence":"1.0",
"content":"designed"
}
],
"type":"pronunciation"
},
{
"start_time":"4.58",
"end_time":"4.74",
"alternatives":[
{
"confidence":"1.0",
"content":"for"
}
],
"type":"pronunciation"
},
{
"start_time":"4.74",
"end_time":"5.39",
"alternatives":[
{
"confidence":"0.9987",
"content":"international"
}
],
"type":"pronunciation"
},
{
"start_time":"5.39",
"end_time":"6.16",
"alternatives":[
{
"confidence":"0.9995",
"content":"customers"
}
],
"type":"pronunciation"
},
{
"alternatives":[
{
"confidence":"0.0",
"content":"."
}
],
"type":"punctuation"
},
{
"start_time":"6.54",
"end_time":"6.94",
"alternatives":[
{
"confidence":"1.0",
"content":"Please"
}
],
"type":"pronunciation"
},
{
"start_time":"6.94",
"end_time":"7.12",
"alternatives":[
{
"confidence":"1.0",
"content":"leave"
}
],
"type":"pronunciation"
},
{
"start_time":"7.12",
"end_time":"7.26",
"alternatives":[
{
"confidence":"1.0",
"content":"your"
}
],
"type":"pronunciation"
},
{
"start_time":"7.26",
"end_time":"7.86",
"alternatives":[
{
"confidence":"1.0",
"content":"message"
}
],
"type":"pronunciation"
},
{
"start_time":"7.87",
"end_time":"8.06",
"alternatives":[
{
"confidence":"1.0",
"content":"and"
}
],
"type":"pronunciation"
},
{
"start_time":"8.06",
"end_time":"8.17",
"alternatives":[
{
"confidence":"1.0",
"content":"we"
}
],
"type":"pronunciation"
},
{
"start_time":"8.17",
"end_time":"8.36",
"alternatives":[
{
"confidence":"1.0",
"content":"will"
}
],
"type":"pronunciation"
},
{
"start_time":"8.36",
"end_time":"8.78",
"alternatives":[
{
"confidence":"0.5229",
"content":"return"
}
],
"type":"pronunciation"
},
{
"start_time":"8.78",
"end_time":"8.95",
"alternatives":[
{
"confidence":"1.0",
"content":"your"
}
],
"type":"pronunciation"
},
{
"start_time":"8.95",
"end_time":"9.33",
"alternatives":[
{
"confidence":"1.0",
"content":"call"
}
],
"type":"pronunciation"
},
{
"start_time":"9.34",
"end_time":"10.05",
"alternatives":[
{
"confidence":"1.0",
"content":"shortly"
}
],
"type":"pronunciation"
},
{
"alternatives":[
{
"confidence":"0.0",
"content":"."
}
],
"type":"punctuation"
}
]
},
"status":"COMPLETED"
}
As you can see from the result, it can make some errors. For example, here it used the word "Collins" instead of "Calling", so the process is not perfect. However, the word-by-word breakdown is really useful for any other post-processing you may want to do.
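As one example of such post-processing, the per-word confidence scores can be used to flag words that may need a human check. This is a rough sketch using System.Text.Json; TranscriptReview, LowConfidenceWords, and the threshold parameter are hypothetical names of mine, and the JSON layout is assumed to match the sample output above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class TranscriptReview
{
    // Returns (word, confidence) pairs for spoken words scoring below the threshold.
    public static List<(string Word, double Confidence)> LowConfidenceWords(
        string transcriptJson, double threshold)
    {
        var flagged = new List<(string Word, double Confidence)>();
        using var doc = JsonDocument.Parse(transcriptJson);
        var items = doc.RootElement.GetProperty("results").GetProperty("items");

        foreach (var item in items.EnumerateArray())
        {
            // Punctuation items always report "0.0", so only look at spoken words.
            if (item.GetProperty("type").GetString() != "pronunciation") continue;

            var best = item.GetProperty("alternatives")[0];
            // Confidence is serialized as a string; parse it culture-independently.
            var confidence = double.Parse(
                best.GetProperty("confidence").GetString() ?? "0",
                CultureInfo.InvariantCulture);

            if (confidence < threshold)
            {
                flagged.Add((best.GetProperty("content").GetString() ?? "", confidence));
            }
        }

        return flagged;
    }
}
```

With the sample output above and a threshold of 0.95, this would flag "Collins" (0.9214) and "return" (0.5229). Note that a low score does not always mean a wrong word: "return" was transcribed correctly despite having the lowest confidence.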
The per-word timings would be excellent for generating subtitles from movie audio, for example.
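To illustrate the subtitle idea, here is a minimal sketch that turns the items array into SubRip (SRT) text, starting a new cue after each sentence-ending punctuation mark. SubtitleBuilder and ToSrt are hypothetical names, and the one-cue-per-sentence rule is a simplification of mine; the JSON layout is assumed to match the sample output above.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.Json;

public static class SubtitleBuilder
{
    // Formats seconds as an SRT timestamp (HH:MM:SS,mmm).
    private static string Timestamp(decimal seconds)
    {
        long ms = (long)Math.Round(seconds * 1000m);
        return string.Format(CultureInfo.InvariantCulture, "{0:00}:{1:00}:{2:00},{3:000}",
            ms / 3600000, ms / 60000 % 60, ms / 1000 % 60, ms % 1000);
    }

    private static decimal Seconds(JsonElement item, string name) =>
        decimal.Parse(item.GetProperty(name).GetString() ?? "0", CultureInfo.InvariantCulture);

    public static string ToSrt(string transcriptJson)
    {
        using var doc = JsonDocument.Parse(transcriptJson);
        var items = doc.RootElement.GetProperty("results").GetProperty("items");

        var srt = new StringBuilder();
        var text = new StringBuilder();
        decimal start = 0, end = 0;
        int cue = 0;

        // Writes the words collected so far as one numbered cue.
        void Flush()
        {
            if (text.Length == 0) return;
            cue++;
            srt.Append(cue).Append('\n')
               .Append(Timestamp(start)).Append(" --> ").Append(Timestamp(end)).Append('\n')
               .Append(text.ToString()).Append("\n\n");
            text.Clear();
        }

        foreach (var item in items.EnumerateArray())
        {
            var content = item.GetProperty("alternatives")[0]
                              .GetProperty("content").GetString() ?? "";

            if (item.GetProperty("type").GetString() == "punctuation")
            {
                // Punctuation has no timings; attach it to the previous word.
                text.Append(content);
                if (content == "." || content == "?" || content == "!") Flush();
                continue;
            }

            if (text.Length == 0) start = Seconds(item, "start_time");
            else text.Append(' ');
            end = Seconds(item, "end_time");
            text.Append(content);
        }

        Flush(); // emit trailing words that had no closing punctuation
        return srt.ToString();
    }
}
```

For the sample output above, this would produce two cues: the first running from 00:00:00,440 to 00:00:06,160 and the second from 00:00:06,540 to 00:00:10,050. Real subtitles would probably want shorter cues, for instance by also splitting on commas or on a maximum character count.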