The Dark Mod Forums

Check this reddit post - "AI Speech Synthesis for FMs"


Hooded Lantern

While using this to clone professional voice actors for missions is likely illegal, it may be worthwhile to see whether our own voice actors would be OK with FM authors using it when they are not available to offer their services.

For example, ERH+ has a WIP mission, "Seed of the Loadstar", where he is using text-to-speech for the cinematic sequence. Rather than waiting for voice actors to have time to assist, it might be easier to use this AI tool, even if just for beta-testing the concept to see how it sounds in-mission.

@New Horizon @redleaf @Norbert @AndrosTheOxen @Deadlove @Lux @Mollyness

@Goldwell @Goldchocobo @Mortem Desino @ocn @BrokenArts @Narrator @Noelker

@V-Man339 Would anyone in the voice-acting group allow mission authors or the TDM team to use AI to generate new vocal lines for missions or new AI characters (etc.)?

Does anyone strongly object to any AI usage even for beta-testing?

If we get some responses, we can add the voice actors' stances to the wiki:

https://wiki.thedarkmod.com/index.php?title=Voice_actors

To my knowledge, the legality and ethics of AI-generated derivative works have yet to be determined. Until then we should be cautious about assuming an IP-maximalist perspective will prevail. In almost all creative fields, copyright terms already extend far beyond what is natural or healthy for maximizing creative output, and copyright is frequently overextended to monopolize ideas when it is meant to cover only expressions. IP holders don't need the help.

It's important to remember, despite what some irresponsible commentators have asserted, that generative AI does not store or reproduce copies of the original (training) works OR even their constituent components. Rather it works by modulating a random input seed into a completely novel product that imitates its inspiration by sharing as many salient features of the relevant training works as the AI can recognize and match. This is no different from a human voice actor doing an imitation of Stephen Russell, and if the law you are proposing were to be applied consistently, both would be equally illegal.

Of course, courts and legislatures may decide that applying the rules consistently between humans and computers is not what's best for society, but until then let's not jump the gun. 

Stephen Russell's vocal performances in the Thief games don't belong to Stephen Russell. He sold them to Looking Glass Studios, who then gifted them to the public by publishing the games onto the open marketplace, retaining only the copyright over the artistic expressions distributed in the game, for a limited time, as codified in the law. As the law currently stands, we as the public retain the absolute right to produce derivative imitations with the qualities that shaped those artistic expressions, even if we use generative AI models to do so. So long as we don't 1:1 copy the actual expression or its separable expressive components, which generative AI does not, it is all fair use. (And IMO it should remain so, even if that hurts the revenue stream of a few performers.)

Yes, if we did this we'd just have our own voice actors contribute their voices. The old way was concatenation: you have the voice actors say literally every possible phoneme and transition in English, in multiple ways apiece if possible, and then the program knits them together. I think I read that it can take more than 6 hours of recording. But I believe newer systems can take a good stretch of recorded speech from a person and generate the phonemes themselves. That would be a great project for us if someone wants to take it on.
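
If anyone wants to picture how the concatenation approach works mechanically, here is a minimal Python sketch. The unit inventory is a hypothetical stand-in: a real system indexes thousands of recorded diphones from those 6+ hours of studio audio and does far smarter unit selection and smoothing.

```python
# Minimal sketch of concatenative synthesis: knit recorded units together
# with short crossfades so the seams don't click.
import numpy as np

def crossfade_join(units, overlap=220):
    """Join recorded units (1-D float arrays) with a linear crossfade."""
    out = units[0]
    fade = np.linspace(0.0, 1.0, overlap)
    for nxt in units[1:]:
        head, tail = out[:-overlap], out[-overlap:]
        blended = tail * (1.0 - fade) + nxt[:overlap] * fade
        out = np.concatenate([head, blended, nxt[overlap:]])
    return out

# Stand-ins for studio recordings of diphones like "h-e", "e-l", "l-o".
units = [np.random.randn(4000) * 0.1 for _ in range(3)]
speech = crossfade_join(units)
print(speech.shape)  # one continuous utterance
```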

There may also be some open source voice models out there at this point, but you'd have to make very sure they're consistent with our CC license.

What if this AI software is used with existing forum member voice actors, and the output is then modified or adjusted to the extent that the voice is similar or even identical to Stephen Russell's?

How can that be illegal?

edit: Ok I think I just repeated with fewer words what others already said.

I was looking at ElevenLabs earlier. I think the quality is very good in some cases but falls flat in others. Also, it's not open source, so it could be yanked or paywalled at any moment.

4 hours ago, ChronA said:

As the law currently stands, we as the public retain the absolute right to produce derivative imitations with the qualities that shaped those artistic expressions, even if we use generative AI models to do so. So long as we don't 1:1 copy the actual expression or its separable expressive components, which generative AI does not, it is all fair use. (And IMO it should remain so, even if that hurts the revenue stream of a few performers.) 

There are additional rights such as personality rights that may provide an avenue for living persons or estates to legally attack commercial or fan projects. They aren't universally recognized but could muddy the waters. Obviously, we will see attempts to expand aspects of intellectual property rights as a response to AI in the near future.

1 hour ago, kin said:

What if this AI software is used with existing forum member voice actors, and the output is then modified or adjusted to the extent that the voice is similar or even identical to Stephen Russell's?

How can that be illegal?

Remember that one guy with a pretty good imitation of Stephen Russell's Garrett who ended up voicing a few Thief fan missions? What happens if you use his voice samples to train the AI?

As someone who likes nothing better than writing dialogs, briefings, and texts for audio logs for missions, I can only agree with @Narrator.

I love giving my texts to real people and giving them directions here and there (quite often I don't even have to do that, because the guys already recognize from the text what I want from them).

So in the future, even if it were legally okay, I will prefer to continue working with real people... Cyberdyne Systems can keep its malicious technology, because I know where this will lead, at the latest when Arnold is at my door!

 

The value is that authors can make their own dialog instantly, listen to it, change it instantly, go through 20 iterations in half an hour, and do it all night long. In particular, you can keep doing takes of the same line until the prosody is how you like it.

You also need only about one minute of sample audio to make a convincing rendition of a person's voice.

And the source doesn't even have to be a good voice actor. You can use your own voice, a family member's, etc., and the system makes the delivery sound like decent voice acting. Using a real person is great, but it can't really compete.

44 minutes ago, kin said:

I imagine the day when you can feed the existing fan mission database to an AI and get back random missions, with the option to adjust their features and, at the end, do some hand polishing.

 

I'd say that's difficult, very difficult, but not an impossible scenario.

We have seen some experimentation with procedural generation. Creating random objectives and story within a predefined city/map might also be possible.

On 2/4/2023 at 8:16 AM, jaxa said:

I'd say that's difficult, very difficult, but not an impossible scenario.

We have seen some experimentation with procedural generation. Creating random objectives and story within a predefined city/map might also be possible.

It depends how it would be used.

By having AI randomly build a mission (in an architectural manner at least), even coarsely, authors could save a lot of time and focus on refining it.

I am thinking it could very well be used as inspiration. Kind of like adopting an abandoned project.

It could also draw more authors into the editing field, since it would be a lot more interesting and less time-consuming to modify a mission than to build one from scratch.

Hell, I would try that for sure if it were real.

On 1/30/2023 at 11:35 AM, Hooded Lantern said:

"AI Speech Synthesis for FMs" by u/that1sluttycelebrity

The voice itself is pretty good. But the sentence pacing just doesn't exist. As a result, the voice has no emotion - it sounds "dead". That said, an emotionless dead-sounding voice would obviously be perfect for characters like Dagoth Ur...

2 hours ago, Oktokolo said:

The voice itself is pretty good. But the sentence pacing just doesn't exist. As a result, the voice has no emotion - it sounds "dead". That said, an emotionless dead-sounding voice would obviously be perfect for characters like Dagoth Ur...

IMO, the ElevenLabs AI is attempting sentence pacing and emotion; it just gets them wrong often, and it sounds uncanny when it does.

To fix it probably requires adding other tools like markup to give more manual control over the model. Similar to how Stable Diffusion has tools like inpainting and negative prompts that can give a skilled user incredible flexibility.
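
For a sense of what that markup could look like, here is standard SSML (the same family of languages linked further down the thread). Whether ElevenLabs or any given AI service accepts SSML is an assumption on my part; their own controls may differ.

```python
# A sketch of manual prosody control via SSML markup.
ssml = """\
<speak>
  <s>The Builder watches.</s>
  <break time="600ms"/>
  <s><prosody rate="slow" pitch="low">And the Builder <emphasis>judges</emphasis>.</prosody></s>
  <s><prosody volume="soft">Move along, taffer.</prosody></s>
</speak>"""
print(ssml)
```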

Another idea would be to allow users to record the lines so they can get the pacing and emphasis just right, and use that as input. So you read the script exactly the way you want, and then the transformation happens and you have Barack Obama talking instead. This would be the equivalent of Stable Diffusion's image-to-image generation.
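
As a sketch of that pipeline (all stand-in code, since real voice-conversion toolkits each have their own APIs): separate *what* was said from *who* said it, then re-render the first with the second.

```python
# Stand-in for a content encoder: keeps the user's pacing and emphasis,
# discards their timbre.
def extract_performance(source_wav):
    return {"frames": len(source_wav)}

# Stand-in for a decoder trained on the target speaker's voice.
class TargetVoice:
    def render(self, performance):
        # A real model would output audio in the target voice here;
        # this stub just preserves the source recording's timing.
        return [0.0] * performance["frames"]

my_take = [0.1, -0.2, 0.3] * 1000  # the line, read exactly the way you want it
converted = TargetVoice().render(extract_performance(my_take))
print(len(converted))
```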

That method would actually give existing voice actors a competitive advantage when using the software, because they know how to speak with precision.

Just like Stable Diffusion, the user who puts in 5 hours of work is going to get a better result than someone typing prompts for 5 minutes. If you are sufficiently motivated to make a very convincing deepfake, you'll put in the hours.

16 hours ago, jaxa said:

To fix it probably requires adding other tools like markup to give more manual control over the model.

Humans use semantic information to modulate pacing and pitch over sentences and even entire paragraphs.
Having a language model like ChatGPT detect the semantic "features" of the text and feed them as additional input into the speech synthesis model might reduce the amount of markup significantly, or even eliminate the need for it in the common case where the speaker's emotional state is rather neutral and the meaning of the message matches the actual text.
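
A rough sketch of that idea: the keyword rules below are a trivial stand-in for asking a language model "what is the speaker feeling here?", and the synthesize() call is hypothetical.

```python
import re

def tag_emotion(sentence):
    # Stand-in for an LLM-based semantic classifier.
    if "!" in sentence:
        return "agitated"
    if any(w in sentence.lower() for w in ("alas", "sorry", "lost")):
        return "mournful"
    return "neutral"

script = "Who goes there! Alas, the key is lost. I will check the cellar."
for sentence in re.split(r"(?<=[.!?])\s+", script):
    print(f"[{tag_emotion(sentence)}] {sentence}")
    # synthesize(sentence, emotion=tag_emotion(sentence))  # hypothetical TTS call
```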

16 hours ago, jaxa said:

Another idea would be to allow users to record the lines so they can get the pacing and emphasis just right, and use that as input.

That might be a pretty intuitive way to provide additional emotional context that can't be derived from the text alone - like the state of the speaker (exhausted, happy...) or subtext (sarcastic, ironic, bragging, threatening).
But just slapping an emoticon in front of some parts of the text might also work well enough when combined with a language model trained to detect them.
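
Something like this could do the splitting before the text ever reaches the speech model; the marker set and mapping are made up for illustration.

```python
# A leading emoticon sets the delivery style for that chunk of text.
STYLES = {">:(": "threatening", ":)": "happy", ":(": "sad", ";)": "sarcastic"}

def split_style(line):
    for mark, style in STYLES.items():
        if line.startswith(mark):
            return style, line[len(mark):].strip()
    return "neutral", line

print(split_style(">:( Stay out of the vault."))
# -> ('threatening', 'Stay out of the vault.')
```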

I'm excited to see which path speech synthesis will take. I'm pretty sure results will become indistinguishable from professional voice acting within the next few years.

1 hour ago, Oktokolo said:

Humans use semantic information to modulate pacing and pitch over sentences and even entire paragraphs.
Having a language model like ChatGPT detect the semantic "features" of the text and feed them as additional input into the speech synthesis model might reduce the amount of markup significantly, or even eliminate the need for it in the common case where the speaker's emotional state is rather neutral and the meaning of the message matches the actual text.

Could be. There will always be some edge cases where you want finer control.

https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language
https://en.wikipedia.org/wiki/Java_Speech_Markup_Language
https://en.wikipedia.org/wiki/SABLE

Some of these seem to have been abandoned due to lack of interest. I think interest in the topic just exploded.
