AWARD-WINNING ML SYSTEM

HearAll

An accessibility software platform employing custom deep-learning computer vision models to translate hand-sign gestures into spoken voice and clear text in real time.

github.com/Thet9354/HearAll2.0
HearAll ML Architecture Demonstration

The Core Problem & Mission

For over 70 million deaf and hard-of-hearing individuals worldwide, daily communication is a persistent barrier. While sign language is a rich, natural medium, only a tiny fraction of the hearing population understands it. This creates a severe structural divide, leaving deaf individuals excluded from standard classrooms, workplaces, and public interfaces.

Existing transcription tools work in one direction only, converting speech to text. They do not solve the reciprocal problem: enabling a deaf individual to express their thoughts and be understood instantly by hearing listeners. HearAll was designed to bridge this divide using live, on-device computer vision.

Field-Tested Validity: When we tested HearAll with Singaporean accessibility advocates and members of the hard-of-hearing community, they gave the platform an **87.5% usefulness rating**, praising its speed and low lag compared with human transcription and interpretation services.

Design Thinking: High-Contrast, Zero-Latency

Designing for accessibility meant re-evaluating core smartphone UX:

  • Real-time Overlay UI: We kept the camera viewport unobstructed. Translated captions appear as large, high-contrast text overlaid directly beneath the signer's hands, so they stay readable in direct sunlight.
  • Offline Edge Inference: Sending video to a cloud server introduced over 1.2 seconds of network lag, which destroyed conversational flow. We converted the trained model to **CoreML** and **TensorFlow Lite**, allowing it to run entirely on-device at **60fps** with no network dependency and minimal battery impact.
  • Haptic Gestural Prompts: Deaf users cannot hear sound indicators for calibration. We mapped detailed micro-haptic patterns to guide users' hands into the camera frame.
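To put the latency figures above in perspective, here is a small back-of-the-envelope calculation. It uses only the two numbers quoted in the text (60fps and a 1.2-second round trip, which the text gives as a lower bound); the variable names are illustrative.

```python
# Frame-budget arithmetic for the figures quoted above.
FPS = 60                 # on-device frame rate stated in the text
NETWORK_LAG_S = 1.2      # cloud round-trip lag stated in the text (a lower bound)

frame_budget_ms = 1000 / FPS          # time available to process one frame
frames_behind = NETWORK_LAG_S * FPS   # frames captured while waiting on the server

print(f"per-frame budget: {frame_budget_ms:.1f} ms")
print(f"frames elapsed during one cloud round trip: {frames_behind:.0f}")
```

At this rate a single cloud round trip spans dozens of frames, which is why the lag is noticeable in conversation.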

AI Architecture Model

The pipeline below extracts hand landmarks from each camera frame and routes them into an LSTM sequence classifier for translation:

┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  Camera Frame   │ ───> │ MediaPipe Hand  │ ───> │ Landmark Vector │
│ (30 FPS Stream) │      │ Landmark Engine │      │   (21 Points)   │
└─────────────────┘      └─────────────────┘      └─────────────────┘
                                                           │
                                                           ▼
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│ Spoken TTS Audio│ <─── │   Text Output   │ <─── │   LSTM Neural   │
│ (AVSpeechSynth) │      │ Caption Overlay │      │   Classifier    │
└─────────────────┘      └─────────────────┘      └─────────────────┘
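The landmark-to-classifier stage of this pipeline might look roughly like the PyTorch sketch below. It is a minimal illustration, not the project's actual model: the class name `SignClassifier`, the function `on_frame`, the hidden size, and the number of classes are all assumptions, the network is untrained, and random tensors stand in for MediaPipe output. Only the 21 points per frame and the 30-frame window come from the text.

```python
from collections import deque
from typing import List, Optional

import torch
from torch import nn

NUM_LANDMARKS = 21   # points per hand, as in the diagram
COORDS = 3           # x, y, z per point
WINDOW = 30          # frames per sequence, as stated in the text
NUM_CLASSES = 10     # hypothetical vocabulary size
HIDDEN = 64          # hypothetical hidden size


class SignClassifier(nn.Module):
    """Hypothetical LSTM classifier over flattened landmark frames."""

    def __init__(self) -> None:
        super().__init__()
        self.lstm = nn.LSTM(NUM_LANDMARKS * COORDS, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, NUM_CLASSES)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, WINDOW, 63) -> logits: (batch, NUM_CLASSES)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # final hidden state of the last layer


model = SignClassifier().eval()        # untrained; weights are random
window: deque = deque(maxlen=WINDOW)   # rolling buffer of recent frames


def on_frame(landmarks: List[List[float]]) -> Optional[int]:
    """Buffer one frame of 21 (x, y, z) points; classify once the window is full."""
    window.append([c for point in landmarks for c in point])
    if len(window) < WINDOW:
        return None
    batch = torch.tensor([list(window)], dtype=torch.float32)  # (1, 30, 63)
    with torch.no_grad():
        logits = model(batch)
    return int(logits.argmax(dim=1).item())


# Random stand-in frames; a real app would pass landmark-engine output here.
results = [on_frame(torch.rand(NUM_LANDMARKS, COORDS).tolist()) for _ in range(WINDOW)]
```

The function returns `None` until 30 frames have been buffered, then a class index for each subsequent frame. Mapping that index to caption text and speech is left out of the sketch.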

Technical Achievements

  1. Spatial Landmark Normalization: Raw pixel coordinates vary with camera distance and hand placement. We normalized the 21 3D landmark coordinates relative to the palm center, so the classifier receives consistent input regardless of hand size or position.
  2. LSTM Temporal Sequencing: Signs are dynamic movements over time. We trained a Long Short-Term Memory (LSTM) recurrent network in PyTorch, mapping 30-frame landmark sequences to individual sign actions with **94.2% accuracy**.
  3. Cross-Platform Edge Integration: We wrapped the on-device models (CoreML on iOS, TensorFlow Lite on Android) within a Flutter shell, so the same low-latency interface serves both platforms.
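The landmark normalization in item 1 can be sketched as follows. This is one possible reading, not the project's confirmed method: the text only says coordinates are taken relative to the palm center, so the choice of palm points (wrist plus the four finger bases, using MediaPipe's landmark indices) and the scale factor (wrist to middle-finger base distance) are assumptions, as is the function name `normalize_hand`.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]

# Assumption: MediaPipe hand indices (0 = wrist, 5/9/13/17 = finger bases).
PALM_INDICES = (0, 5, 9, 13, 17)


def normalize_hand(points: List[Point]) -> List[Point]:
    """Center 21 landmarks on the palm and divide by a palm-size scale."""
    if len(points) != 21:
        raise ValueError("expected 21 landmarks")
    palm = [points[i] for i in PALM_INDICES]
    # Palm center: mean of the chosen palm points on each axis.
    center = tuple(sum(p[a] for p in palm) / len(palm) for a in range(3))
    centered = [tuple(p[a] - center[a] for a in range(3)) for p in points]
    # Assumed scale: wrist to middle-finger base; fall back to 1.0 if degenerate.
    scale = math.dist(points[0], points[9]) or 1.0
    return [tuple(c / scale for c in p) for p in centered]


# The same synthetic hand, then a copy twice as large and shifted in space.
hand = [(0.1 * i, 0.05 * i, 0.01 * i) for i in range(21)]
moved = [(2 * x + 5.0, 2 * y - 3.0, 2 * z + 1.0) for x, y, z in hand]

a = normalize_hand(hand)
b = normalize_hand(moved)
same = all(
    math.isclose(u, v, abs_tol=1e-9) for p, q in zip(a, b) for u, v in zip(p, q)
)
print("normalized outputs match:", same)
```

Subtracting the palm center removes the effect of hand position, and dividing by a palm-length scale removes the effect of hand size and camera distance. Hand rotation is not handled in this sketch.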