Audio silliness in the era of videoconferencing

Newsletter Jul 28, 2020

This week is less about data per se and more about some rabbit holes I fell into, repeatedly, over the past few months. But we do get into some UX about meeting software! That’s pretty on topic, right? Incidentally, I link to various product pages in this article. None of them are affiliate links or even stores; they all point to the original manufacturer’s page.

The Plague, which I’m continuing to avoid writing about, nevertheless continues to ravage the United States, fueling massive ridiculousness at all levels. Many of us who are lucky enough to be working are doing so from home, in a land of endless video conferencing.

As evidenced by a literal shortage of video conferencing peripherals like webcams (reseller price gouging has also been out of this world), a lot of people have been looking into upgrading their videoconferencing setups.

Wires and equipment EVERYWHERE

Prior to the pandemic, thanks to my constantly shifting hobby-scape, I had accumulated enough stuff to hold a decent-quality video conference. I had gotten myself a nice Blue Yeti USB mic and an HD webcam to do some streaming for game-related work… then upgraded to a nice dynamic mic and mixer after a failed attempt at making a home-based karaoke system.

But when the opportunity to make further upgrades specifically for work came up (aka, permission to expense some home office upgrades), I picked up the pieces needed to connect my existing equipment to the corp laptop, resulting in a very complicated mixed setup that routes between both devices, pictured below:

Don’t ask about the total cost, it’s a lot… much of this was bought for projects over the years

Since decent webcams from reputable brands are impossible to obtain right now, I gave up on improving that aspect. But I did add lights, and then a significant amount of audio gear, because the mic (and camera) on my work MacBook is pretty garbage.

But was buying all this stuff even worth it?

On a long-term basis, since I’m likely to use the gear for other things down the road, it’s probably worth it for me. But we’ll put that detail aside for this discussion. Today I want to consider whether, objectively speaking and for the purpose of video conferencing alone, I was just wasting my time and money adding to my existing setup.

The lighting was an easy win. I have a giant window behind me, so getting as much light as I could afford onto my face was always going to help keep it out of shadow. Plus, regardless of how cheap or expensive your camera is, if there’s more light, the camera sensor doesn’t have to boost the signal gain as much, which results in less noise in the picture.
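As a toy illustration of that last point (a back-of-the-envelope model I’m making up here, assuming photon shot noise is the only noise source), the noise in a pixel grows only as the square root of the light it collects, while digital gain boosts signal and noise equally, so the only route to a cleaner picture is more photons:

    import numpy as np

    rng = np.random.default_rng(0)

    def snr_db(photons_per_pixel: float, target_level: float = 1000.0) -> float:
        # Toy model: a shot-noise-limited pixel, digitally boosted to a
        # target brightness. Gain amplifies signal and noise equally, so
        # the SNR is set entirely by how much light the sensor collected.
        samples = rng.poisson(photons_per_pixel, size=100_000).astype(float)
        gain = target_level / photons_per_pixel   # dimmer scene -> more gain
        boosted = samples * gain
        return 20 * np.log10(boosted.mean() / boosted.std())

    for photons in (100, 400, 1600):              # 4x more light each step
        print(f"{photons:5d} photons -> SNR {snr_db(photons):5.1f} dB")
    # Every 4x increase in light buys about 6 dB of SNR, regardless of gain.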

But what about all the audio gear?

I’ve heard on multiple occasions, and it’s even in YouTube’s various tutorials and documentation, that audio is important to video: that audiences are more forgiving of lower-quality video than they are of low-quality audio. And while it passes a basic sniff test, I’ve only ever seen it passed around as received wisdom, with no citations offered.

Good sound can significantly impact how viewers experience your video. Audiences are typically more forgiving of camera and lighting mistakes than they are of poor sound quality and recording. — YouTube, recording sound like a pro

So obviously, we’re gonna go look into some research! Because science and overthinking everything.

First off, we should be clear about what research question we’re interested in here.

We would like to know: within a working-from-home videoconferencing context, would increasing the quality of the audio by upgrading the microphone give the audience a better meeting experience?

The reason we need to be clear about this is that a lot of the research in this space comes from people who are in the business of building communications systems, like video chat software, or video sites like YouTube. As engineers, they need to know how to spend their available bandwidth budget: would devoting more of it to audio, or to video, make users more satisfied, or be a waste? What kind of compression codec should be used to still get a good quality rating? When do stutters, drops, and echo become unacceptable?

If we’re in the business of building such products, then this central UX question takes center stage. We’d definitely want to know that “x% of packet loss yields very dissatisfied users” because we need to set performance metrics around these failures. Since I work in cloud services, I don’t have to worry about any of this.
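To make that engineering perspective concrete, here’s a sketch of the kind of thing a service team might wire up. Every number in it is hypothetical, including the estimated_mos breakpoints, which a real team would fit from user studies like the ones cited below:

    def estimated_mos(packet_loss_pct: float) -> float:
        # Hypothetical mapping from packet loss to a 1-5 mean opinion
        # score. These breakpoints are invented for illustration; a real
        # team would fit this curve from user studies.
        if packet_loss_pct < 1.0:
            return 4.5
        if packet_loss_pct < 3.0:
            return 3.5
        if packet_loss_pct < 8.0:
            return 2.5
        return 1.5  # "very dissatisfied" territory

    def should_page_oncall(packet_loss_pct: float, slo_mos: float = 3.0) -> bool:
        # Alert when the modeled user experience falls below the target.
        return estimated_mos(packet_loss_pct) < slo_mos

    assert not should_page_oncall(0.5)   # healthy
    assert should_page_oncall(5.0)       # users are suffering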

Instead, because we’re all USERS of a product where the engineers have (hopefully) digested the relevant research, a lot of the findings aren’t directly relevant to us. For example, we can’t control which codec is being used; the software picks it for us. We have to have faith that the engineers did their best to give us a good user experience, and build on top of that.

Even if I bought a $10k audio/video setup, if the signals are compressed horribly and sent over a legacy 2.5G phone line, I’ll still look and sound horrible. Similarly, if I use absolutely garbage camera/mic hardware, no software can ever make that look or sound good.

Our locus of control extends only to the signals we feed into the software.

What the Research Says

For the rest of this article, I’m going to rely primarily on the work in this 2015 thesis by Sara Långvik. It’s recent enough that the lit review covers the major prior papers in this space. Here’s my “quick” summary of what I believe I understand of the field. Obviously, as a non-expert, I’m likely to get some details wrong and leave out important results.

  • “Quality” is complicated. The field seems to agree on dividing the construct into “audio quality,” “video quality,” and “audiovisual quality” components, where audiovisual refers to an overall quality rating for a clip of video with its audio included.
  • Quality scores are typically measured on a scale of some sort, like a 9-point scale. This thesis used a scale from 10 (low rating) to 50 (high rating); I guess it’s meant to be a 1-5 rating with decimals (see the sketch after this list).
  • Changes in quality (typically via added noise or similar degradations) in the audio channel will also affect the quality rating of the other channel. Having bad audio will cause people to rate the video as lower quality too, despite the video being unchanged. The effect isn’t linear, but it is observable.
  • Similarly, video quality also has an effect on audio quality perception. The effect is apparently stronger in this direction compared to the audio->video route.
  • Context and setting are important - people are pickier about audio/visual quality for a music video than for a home recording. People watching the same video on a TV versus a desktop computer experience it differently, due to the hardware and the context in which the device is used.
  • User task is important - if it’s important to follow the audio closely (spoken instructions, a meeting interaction, etc.), people will be more unhappy about poor audio than in tasks where they mainly need to refer to the video.
  • Quality and comprehension are very different things - one fun result in the 2015 thesis was that when subjects answered questions about the videos they watched, those who had coffee-shop noise added to their audio rated the audio lower (which is expected), but answered more questions correctly. Apparently the added noise made subjects pay more attention.
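For what it’s worth, here’s the scale arithmetic I’m assuming in that second bullet (my reading, not something the thesis spells out): divide by 10 and the 10-50 scale becomes the familiar 1-5 mean opinion score, with an extra decimal of resolution.

    def to_mos(rating_10_to_50: int) -> float:
        # Assumed mapping: 10 -> 1.0 (worst), 50 -> 5.0 (best).
        if not 10 <= rating_10_to_50 <= 50:
            raise ValueError("rating must be between 10 and 50")
        return rating_10_to_50 / 10

    # Made-up ratings for one clip, averaged into a mean opinion score
    ratings = [42, 38, 45, 30, 41]
    mos = sum(to_mos(r) for r in ratings) / len(ratings)
    print(f"MOS: {mos:.2f}")  # 3.92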

Uh, that’s nice, but tl;dr. What’s it mean for us?

Since we’re video-conferencing, audio is probably quite important

The majority of my time on VC is usually spent looking at tiny 1-inch thumbnails of my coworkers, while most of the window is taken up by a presentation, or someone sharing their screen to show something. Only occasionally do I see someone’s face in detail.

You should be fine so long as the basics of good video are observed: avoid backlighting yourself, have lots of even lighting on your face, try to have the camera above your eye level for a more flattering angle, and use a decent camera if you have one.

I’ve got an old Logitech C920 that works just fine, though it’s honestly too wide-angle and shows my kid’s toy explosion behind me. I know some people have taken their DSLRs or other cameras with HDMI output and used an HDMI-to-UVC capture device to make their fancy camera show up as a webcam. It’s great video quality if you can set it up, but likely way overkill.

The diminishing returns for audio equipment kick in very quickly.

A Blue Yeti (~$130) or Audio-Technica’s AT2020USB+ (~$150) will get you about 95% of the way to “better than needed for a video conference,” and both connect straight to your computer with no fuss or extra gear. You could also easily host your own podcast or whatever with either mic. No need for a fancypants audio interface ($150-$350+) with phantom power to connect to pro-audio mics ($100-$∞) with XLR connections.

While more expensive mics will give you marginally improved sound quality and various features that are not super relevant to having a meeting, they’re not worth your trouble unless you plan on finding alternate uses for the mic, like recording music or voice-over work.

Instead, what’s more important is noise control, because the above research implies that people will be more unhappy if they have to work to hear you.

Lucky for us, in 2020 there are software solutions that can help! The most recent ones are powered by AI-based systems that isolate human speech from noisy backgrounds with surprising accuracy.

For example, Google’s Meet software recently added a noise cancellation feature that works quite well at muting random background noise. Nvidia’s RTX graphics cards also have a similar AI-based noise cancellation feature, called RTX Voice, that works across a bunch of applications. Some demos of these software filters are quite impressive. I have to use this functionality to keep my super-loud air conditioner from ruining my meetings.
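For flavor, here’s the classic pre-AI baseline, a crude spectral noise gate, sketched with numpy. This is emphatically not how Meet or RTX Voice work (those are learned models), but it shows the underlying idea: keep only the parts of the spectrum where your voice rises above a measured noise floor.

    import numpy as np

    def spectral_gate(audio, noise_clip, frame=1024, factor=1.5):
        # audio and noise_clip are float numpy arrays of mono samples.
        # Estimate a per-frequency noise floor from the noise-only clip,
        # then zero out any bin in the live audio that doesn't rise
        # clearly above that floor. Crude, but it's the core idea.
        usable = len(noise_clip) // frame * frame
        noise_frames = noise_clip[:usable].reshape(-1, frame)
        noise_floor = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

        out = np.zeros_like(audio)
        for start in range(0, len(audio) - frame + 1, frame):
            spectrum = np.fft.rfft(audio[start:start + frame])
            keep = np.abs(spectrum) > factor * noise_floor  # loud bins only
            out[start:start + frame] = np.fft.irfft(spectrum * keep, n=frame)
        return out

A gate like this deals with a steady air conditioner reasonably well; the learned models earn their keep on non-stationary noise like keyboards, dogs, and kids.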

Hrmmm…. maybe I can justify the cost of a graphics card upgrade because I need to do some Deep Learning, and now meetings for my side gigs…

Improve your sound environment

Software, no matter how good, will never be perfect at isolating voice from everything else. There will always be stray sounds that don’t get caught, and weird artifacts where it overreacts. It’d be much better if the software didn’t have to do that work to begin with.

The cheapest fix (aka free, if you have an external mic) is to get closer to your mic. The closer you are, the louder you’ll appear to the mic, and the conferencing software will adjust the amplification downward to suit (within certain bounds). This means more distant noises will be less obvious. It’s an instant, free signal-to-noise-ratio boost, so long as you remember to place yourself in the right spot. As a bonus, most mics (except omnidirectional ones) have a proximity effect that makes your voice sound richer and deeper the closer you get.
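Here’s the back-of-the-envelope version of that free boost, assuming your voice follows the inverse-square law (roughly true up close) while the noise across the room stays put:

    import math

    def snr_gain_db(old_distance_m: float, new_distance_m: float) -> float:
        # dB gained on your voice by moving closer, assuming the voice
        # follows the inverse-square law while room noise stays constant.
        return 20 * math.log10(old_distance_m / new_distance_m)

    # Moving from arm's length (~60 cm, laptop-mic territory) to 15 cm:
    print(f"{snr_gain_db(0.60, 0.15):.1f} dB")  # ~12 dB more voice, same noise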

The next best thing is to work in a good acoustical space. We can’t all have a sound-treated room, but we can avoid big empty rooms with lots of very distracting reverb. In fact, some professional broadcasters have broadcast from their closets, because the hanging clothing acts as very good room treatment.

Have more fun with your setup

Finally, if you’re sick of going to video conferences and worrying about how you look, the power of AI and technology has some toys for you.

You can take a stab at being a VTuber, a Virtual YouTuber, by using software like FaceRig, Luppet, or Live3D… feed that data into OBS Studio to make a virtual camera, then throw in motion-control goodies like a Leap Motion controller that lets you track your hands… just pile on some art skills and never appear as a human in meetings ever again.
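If you’re curious about the plumbing on the OBS side, here’s a minimal sketch using the pyvirtualcam Python library (assuming you have the OBS virtual camera installed for it to drive; render_avatar_frame is a hypothetical stand-in for whatever your tracking software renders):

    import numpy as np
    import pyvirtualcam  # pip install pyvirtualcam; drives the OBS virtual camera

    def render_avatar_frame(t: float) -> np.ndarray:
        # Hypothetical stand-in for your FaceRig/Luppet render; here it
        # just pulses a solid color so there's something to send.
        frame = np.zeros((720, 1280, 3), dtype=np.uint8)
        frame[:, :, 1] = int(127 + 127 * np.sin(t))  # pulsing green channel
        return frame

    with pyvirtualcam.Camera(width=1280, height=720, fps=30) as cam:
        t = 0.0
        while True:
            cam.send(render_avatar_frame(t))  # RGB uint8, shape (720, 1280, 3)
            cam.sleep_until_next_frame()      # pace output to the camera's fps
            t += 1 / 30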

Or I guess you can use a “simple” camera filter like Snap Camera.

🙃
