I) How to capture your footage to create your personal Avatar

1. Environment and Background

Location: Choose a quiet area to ensure clear audio.
Setting: Select a well-lit location, preferably with a clean white background.

2. Camera Setup and Technical Specifications

Device Options: You can use a professional camera, a smartphone, or a webcam.
Positioning: Set the camera at eye level. If needed, stack books under the camera to raise it.
Framing: Frame the subject from the chest up, ensuring there is some headroom and the subject is centered.
Resolution and Aspect Ratio:
- Professional Camera: 16x9 (horizontal) or 9x16 (vertical). 4K at 30 fps is recommended, though 1080p is acceptable.
- Smartphone: 4K or 1080p. Important: Turn off HDR in camera settings to avoid color saturation issues.
- Webcam: 16x9 horizontal at 1080p or 720p.
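
As a rough illustration, the resolution and aspect-ratio guidelines above can be turned into a quick check before you upload a clip. The Python sketch below assumes you already know the clip's width and height in pixels (for example from your editing software). The function name `check_video_spec` and the device labels are made up for this example, and "4K" is assumed to mean 3840x2160.

```python
from fractions import Fraction

# Short side of the frame in pixels for each accepted resolution.
# Assumption: "4K" means 3840x2160, "1080p" 1920x1080, "720p" 1280x720.
ACCEPTED_SHORT_SIDES = {
    "camera": {2160, 1080},
    "smartphone": {2160, 1080},
    "webcam": {1080, 720},
}


def check_video_spec(device, width, height):
    """Return a list of problems; an empty list means the clip fits the guidelines."""
    problems = []
    ratio = Fraction(width, height)
    horizontal = ratio == Fraction(16, 9)
    vertical = ratio == Fraction(9, 16)

    if device == "camera" and not (horizontal or vertical):
        problems.append("camera footage should be 16x9 or 9x16")
    if device == "webcam" and not horizontal:
        problems.append("webcam footage should be 16x9 horizontal")
    # No aspect-ratio check for smartphones: the guidelines do not state one.

    accepted = ACCEPTED_SHORT_SIDES[device]
    if min(width, height) not in accepted:
        allowed = ", ".join(f"{side}p" for side in sorted(accepted, reverse=True))
        problems.append(f"resolution should be one of: {allowed}")
    return problems


print(check_video_spec("camera", 3840, 2160))  # []
print(check_video_spec("webcam", 1080, 1920))  # ['webcam footage should be 16x9 horizontal']
```

The check only covers frame size and orientation. Frame rate, HDR, and framing still need to be confirmed in the camera settings.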

3. Lighting

Quality: Use soft, even lighting and avoid harsh shadows.
Direction: Face toward a window or a natural light source.

4. Performance and Recording

Duration: Record for a minimum of 2 minutes for best results.
Speech: Speak naturally about any topic. Every word should be clear and intentional.
Pacing: Pause for 1 to 2 seconds after every second sentence to allow for natural transitions.
Demeanor: Maintain a calm, confident, and natural presence, and let your expressions convey emotion.
Movement: Stay centered and keep your delivery smooth and steady.

II) How to capture your Audio to create your voice clone

Record at least 1 minute of audio
Avoid recording more than 3 minutes; additional audio yields little improvement and can, in some cases, even be detrimental to the clone.
How the audio was recorded matters more than the total length (total runtime) of the samples. The number of samples you use does not matter either; what counts toward length is their total combined runtime.
Approximately 1-2 minutes of clear audio without any reverb, artifacts, or background noise of any kind is recommended. When we speak of “audio or recording quality,” we do not mean the codec, such as MP3 or WAV; we mean how the audio was captured. That said, if you use MP3, a bitrate of at least 128 kbps is advised. Bitrates above that have no significant impact on the quality of the clone.
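
Because it is the combined runtime of all samples that counts, it can help to add up the durations before uploading. The sketch below is one possible way to do that in Python for uncompressed WAV files, using the standard-library `wave` module. The function names and the 60 to 180 second limits are assumptions taken from the 1 minute and 3 minute figures above, not part of any official tool.

```python
import wave

MIN_SECONDS = 60    # "at least 1 minute of audio"
MAX_SECONDS = 180   # "avoid recording more than 3 minutes"


def wav_duration_seconds(path):
    """Duration of one WAV file: number of frames divided by frames per second."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_runtime_verdict(durations):
    """Judge the combined runtime of all samples, however many there are."""
    total = sum(durations)
    if total < MIN_SECONDS:
        return "too short"
    if total > MAX_SECONDS:
        return "too long"
    return "ok"


# Example with made-up durations in seconds; for real files you could build
# the list with [wav_duration_seconds(p) for p in paths].
print(total_runtime_verdict([40.0, 35.5]))  # ok
```

The `wave` module reads only WAV files, so MP3 samples would need to be converted first or measured with another tool.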
Keep the audio consistent
The AI will attempt to mimic everything it hears in the audio: the speed of the person talking, the inflections, the accent, tonality, breathing pattern and strength, as well as noise and mouth clicks. Noise and artifacts are factored in too, even though they can confuse it.
Ensure that the voice keeps a consistent tone and performance throughout, and that the audio quality stays consistent across all the samples. Even if you only use a single sample, it should remain consistent from start to finish. Feeding the AI very dynamic audio, meaning audio with wide fluctuations in pitch and volume, will yield less predictable results.
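
One way to get a feel for volume fluctuations is to measure the loudness of short windows of the recording and compare the loudest with the quietest. The Python sketch below is only an illustration of that idea: it assumes the audio is already loaded as a list of floats between -1.0 and 1.0, and the function names, the window size, and the -50 dB floor used to skip pauses are all arbitrary choices made for this example. It says nothing about pitch.

```python
import math


def windowed_rms_db(samples, window, floor_db=-50.0):
    """RMS level in dB for each full window, skipping near-silent windows (pauses)."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(s * s for s in chunk) / window)
        if rms > 0:
            level = 20 * math.log10(rms)
            if level >= floor_db:
                levels.append(level)
    return levels


def loudness_spread_db(samples, window, floor_db=-50.0):
    """Difference between the loudest and quietest non-silent window, in dB."""
    levels = windowed_rms_db(samples, window, floor_db)
    return max(levels) - min(levels) if levels else 0.0


# Synthetic example: the second half is twice as loud as the first half,
# which is a spread of about 6 dB.
samples = [0.1, -0.1] * 50 + [0.2, -0.2] * 50
print(round(loudness_spread_db(samples, window=100), 2))  # 6.02
```

For real recordings a window of a few hundred milliseconds (for example 0.4 times the sample rate) is a reasonable starting point; what spread counts as "too dynamic" is a judgment call that the guidelines above do not quantify.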
Replicate your performance
Keep in mind that the AI will try to replicate the performance of the voice you provide. If you talk in a slow, monotone voice without much emotion, that is what the AI will mimic. On the other hand, if you talk quickly with a lot of emotion, that is what the AI will try to replicate.
It is crucial that the voice remains consistent throughout all the samples, not only in tone but also in performance. If there is too much variance, it might confuse the AI, leading to more varied output between generations.
Find a good balance for the volume
Set the recording level so the audio is neither too quiet nor too loud. The ideal is an RMS level between -23 dB and -18 dB with a true peak of -3 dB.
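
The level targets above can be checked with a few lines of arithmetic. The sketch below, in Python, assumes the audio is available as floats between -1.0 and 1.0 and that the dB figures are relative to full scale. It treats the -3 dB figure as a ceiling, which is an interpretation, and it measures the plain sample peak, which only approximates true peak (a proper true-peak meter oversamples the signal first). The function names are made up for this example.

```python
import math


def to_db(value):
    """Convert a linear amplitude (1.0 = full scale) to decibels."""
    return 20 * math.log10(value) if value > 0 else float("-inf")


def level_report(samples):
    """Return (RMS level, sample peak level), both in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return to_db(rms), to_db(peak)


def within_target(rms_db, peak_db):
    """RMS between -23 and -18 dB, peak no higher than -3 dB (assumed ceiling)."""
    return -23.0 <= rms_db <= -18.0 and peak_db <= -3.0


# Synthetic example: a square-like signal at 10% of full scale sits at -20 dB.
rms_db, peak_db = level_report([0.1, -0.1] * 100)
print(round(rms_db, 1), round(peak_db, 1), within_target(rms_db, peak_db))
```

If a recording falls outside the range, adjusting the input gain or the distance to the microphone and re-recording is usually safer than boosting the file afterwards, since boosting also raises the noise the clone will pick up.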