Audio and Visual Perception in Sign Language

Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation

Abstract: Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual ...

IEEE

EMVP: An Edge-Assisted Multi-Task Visual Perception System for Multi-Vehicle Scenarios

Abstract: Visual perception, as a core component of Intelligent Transportation Systems (ITS), plays a key role in enhancing safety and efficiency in urban mobility. While single-task visual perception ...

GitHub

Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions.

Requirements: Python == 3.12, PyTorch == 2.8.0, ffmpeg. GPU memory: ~24GB for inference, 4×80GB for training. For more details, please refer to web_demo/server/README.md and web_demo ...

marktechpost

Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval

Perception Encoder, PE, is the core vision stack in Meta’s Perception Models project. It is a family of encoders for images, video, and audio that reaches state of the art on many vision and audio ...

Wired

What Is Lossless Audio, and Do You Really Need It?

Our New SAM Audio Model Transforms Audio Editing