Xianzhi Du

I'm a Staff Research Scientist at Apple AI/ML, where I work on building and training large language models (LLMs), multimodal LLMs, and sparse Mixture-of-Experts (MoE) models.

I was previously a Senior Research Engineer at Google Brain and Google CoreML, working on computer vision research and building the TensorFlow official models. I also collaborated extensively with Alphabet teams, including Waymo, Google Cloud, Google Maps, Google Photos, Nest, and X, to apply state-of-the-art research models to Alphabet's applications.

I did my PhD at UMIACS, University of Maryland, College Park, where I was advised by Larry Davis and David Doermann.

(Last update: May 2024)

Email  /  Google Scholar  /  LinkedIn  /  Twitter  

Research Highlights

Apple Intelligence Foundation Language Models
arXiv, 2024
blog post / project page / Github

Introduces Apple's on-device and server foundation models. Announced at the 2024 Worldwide Developers Conference.

Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training
arXiv, 2024

Compares MoE and dense LLMs by using step time to measure model speed and setting the total training budget according to the Chinchilla compute-optimal setting. Shows that MoE consistently outperforms dense models at the 6B, 13B, and 30B scales.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
arXiv, 2024
project page / Github

Builds high-performance MLLMs by identifying crucial architectural and data choices, leading to the creation of MM1 models, which excel in pre-training metrics and few-shot learning across various benchmarks.

Ferret: Refer and ground anything anywhere at any granularity
ICLR, 2024 (Spotlight)
project page / Github

A novel MLLM that excels in fine-grained spatial understanding and grounding descriptions within images, using a hybrid region representation and a specialized dataset, demonstrating superior performance and reduced object hallucination.

Guiding instruction-based image editing via multimodal large language models
ICLR, 2024 (Spotlight)
project page / Github

Leverages multimodal LLMs to enhance instruction-based image editing by deriving expressive instructions and providing explicit guidance.

AdaMV-MoE: Adaptive multi-task vision mixture-of-experts
ICCV, 2023

An adaptive MoE framework that dynamically adjusts experts per task, enhancing multi-task vision recognition performance on ImageNet and COCO.

A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation
ECCV, 2022

Designs a simple single-scale vision transformer architecture that achieves strong performance on object localization and instance segmentation tasks.

Revisiting ResNets: Improved training and scaling strategies
NeurIPS, 2021 (Spotlight)
Github / Google Cloud / Blog post

Revisits training recipes and model scaling strategies for ResNets, making them competitive with state-of-the-art models.

Simple training strategies and model scaling for object detection
arXiv, 2021
Github / Google Cloud

Revisits training recipes and model scaling strategies for object detection, making conventional object detectors, e.g. RetinaNet and Cascade R-CNN, competitive with state-of-the-art models.

Dilated SpineNet for semantic segmentation
arXiv, 2021

Designs a scale-permuted backbone with dilated convolutions that is learned by Neural Architecture Search (NAS) on semantic segmentation. Achieved SoTA on the Cityscapes semantic segmentation benchmark in 03/2021.

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
CVPR, 2020
Github / Yannic Kilcher's Tutorial / Google Cloud / Blog post

Designs a scale-permuted backbone with intermediate features and cross-scale connections that is learned by Neural Architecture Search (NAS) on object detection. Achieved SoTA on the COCO detection and segmentation benchmarks in 12/2019.

AMNet: Deep atrous multiscale stereo disparity estimation networks
ICCE, 2020

Introduces AMNet with depthwise-separable convolutions, an extended cost volume, and a stacked atrous multiscale network for disparity estimation. Ranked No. 1 on the KITTI Stereo 2015 and 2012 benchmarks in 11/2018.

Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection
WACV, 2017

Proposes a deep neural network fusion architecture for accurate, fast, and robust pedestrian detection, especially for small and occluded pedestrians. Ranked No. 1 on the Caltech Pedestrian Detection benchmark in 08/2016.