Senior Speech & Audio ML Engineer

  • August 25, 2025
  • Programming & Tech
  • Full time
  • Remote
  • Senior level
  • 4400 to 5500 USD
Audio Signal Processing
Deep Learning
Machine Learning
Speech Recognition
Tensorflow
English advanced
Spanish intermediate

Job Title: Senior Speech & Audio ML Engineer
Location: Remote
Type: Full-time
Salary Range: $2,500–5,500 USD / month

Role Purpose

We are looking for a Senior ML Engineer to build and ship core models for a speech-driven behavioral engine. You will own end-to-end modeling, from raw, long-form audio and layered annotations through to production inference.

Responsibilities include:

  • Designing audio features and embeddings.
  • Training and evaluating a suite of models.
  • Delivering reproducible pipelines that meet targets for accuracy, robustness, latency, and cost.

Non-Negotiables

  • Experience: 5+ years building production ML systems, including 2+ years in speech/audio.
  • Speech & Signal Processing: VAD, diarization, segmentation, denoising, spectral features (log-mel/MFCC), prosody (pitch/energy), long-form audio handling.
  • SOTA Audio Models & Embeddings: Wav2Vec2, HuBERT, WavLM (or similar); fine-tuning/self-supervised learning; contrastive/metric learning for downstream tasks.
  • Data Engineering & Quality: SQL, Python data stack (Pandas/Polars), ETL for audio + metadata, stratified sampling, leakage prevention, feature stores.
  • Evaluation Discipline: Golden sets, robust speaker/content splits, ROC/PR/calibration, fairness/bias checks, ablations, drift/shift detection on embeddings and audio quality.
  • MLOps, Serving & Reproducibility: FastAPI/gRPC around HF/torchaudio models, experiment tracking (W&B/MLflow), artifact/model versioning, CI/CD, observability, scalable batch/streaming inference.
  • Novel IP: Proven ability to create and document novel IP (methods, architectures, or training/eval techniques) with clear prior-art awareness.

Nice to Have

  • Tooling: SpeechBrain, Lightning, OpenSMILE/Praat, Kaldi/Conformer/Emformer, Label Studio.
  • Multimodal Skills: ASR (e.g., Whisper) + paralinguistic features; emotion/prosody modeling; speaker embeddings (x-vectors, ECAPA-TDNN).
  • Performance & Deployment: Quantization/distillation, Triton/CUDA basics, distributed training, real-time/streaming inference, on-device DSP (Rust/C++).
  • Publications/Patents/Competitions: Demonstrating novel audio modeling work.
