Xianzhi Du
I'm a Staff Research Scientist at Apple AI/ML, where I work on building and training large language models (LLMs), multimodal LLMs, and sparse Mixture-of-Experts (MoE) models.
I was previously a Senior Research Engineer at Google Brain and Google CoreML, working on computer vision research and building TensorFlow official models. I also collaborated extensively with Alphabet teams, including Waymo, Google Cloud, Google Maps, Google Photos, Nest, and X, to apply state-of-the-art research models to Alphabet's products.
I did my PhD at UMIACS, University of Maryland, College Park, where I was advised by Larry Davis and David Doermann.
(Last update: May 2024)
Email /
Google Scholar /
LinkedIn /
Twitter
Apple Intelligence Foundation Language Models
arXiv, 2024
blog post
/
project page
/
Github
Introduces Apple's on-device and server foundation models. Announced at the 2024 Worldwide Developers Conference.
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training
arXiv, 2024
Github
Compares MoE and dense LLMs by using step time to measure model speed and setting the total training budget with the Chinchilla compute-optimal rule. We show MoE consistently outperforms dense models at the 6B, 13B, and 30B scales.
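For concreteness, here is a minimal Python sketch of the budgeting described above, assuming the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter; the heuristic, batch size, and step time are illustrative assumptions, not numbers from the paper.

    def chinchilla_token_budget(n_params: float, tokens_per_param: float = 20.0) -> float:
        """Compute-optimal training-token budget for a model with n_params parameters."""
        return n_params * tokens_per_param

    def tokens_per_second(batch_tokens: int, step_time_s: float) -> float:
        """Training speed measured from step time: tokens processed per second."""
        return batch_tokens / step_time_s

    budget = chinchilla_token_budget(6e9)        # 1.2e11 tokens for a 6B model
    speed = tokens_per_second(4_000_000, 2.5)    # assumed batch size and step time
    print(f"budget: {budget:.2e} tokens, est. training time: {budget / speed / 86_400:.1f} days")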
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
arXiv, 2024
project page
/
Github
Builds high-performance MLLMs by identifying crucial architectural and data choices, leading to the creation of MM1 models, which excel in pre-training metrics and few-shot learning across various benchmarks.
Ferret: Refer and Ground Anything Anywhere at Any Granularity
ICLR, 2024 (Spotlight)
project page
/
Github
A novel MLLM that excels at fine-grained spatial understanding and at grounding descriptions within images. It uses a hybrid region representation and a specialized dataset, achieving superior performance with reduced object hallucination.
Guiding Instruction-Based Image Editing via Multimodal Large Language Models
ICLR, 2024 (Spotlight)
project page
/
Github
Leverages multimodal LLMs to enhance instruction-based image editing, deriving expressive instructions and providing explicit guidance.
AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts
ICCV, 2023
Github
An adaptive MoE framework that dynamically adjusts the number of activated experts per task, improving multi-task visual recognition on ImageNet and COCO.
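As a rough illustration of the adaptive-routing idea, here is a minimal sketch of per-task top-k expert selection; the router weights and per-task k values are hypothetical, not the paper's implementation.

    import numpy as np

    def route(tokens, gate_w, k):
        """Pick the top-k experts per token and return softmax-normalized gate weights.

        tokens: [n, d] token features; gate_w: [d, n_experts] router weights.
        """
        logits = tokens @ gate_w                          # [n, n_experts] routing scores
        topk = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k best experts
        picked = np.take_along_axis(logits, topk, axis=-1)
        gates = np.exp(picked - picked.max(-1, keepdims=True))
        gates /= gates.sum(-1, keepdims=True)             # softmax over the chosen experts
        return topk, gates

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 8))
    gate_w = rng.normal(size=(8, 16))
    for task, k in {"imagenet": 2, "coco": 4}.items():    # hypothetical per-task choices
        experts, gates = route(tokens, gate_w, k)
        print(task, experts.shape, gates.sum(-1))         # each row's gates sum to 1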
A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation
ECCV, 2022
Github
Designs a simple single-scale vision transformer architecture that achieves strong performance on object localization and instance segmentation tasks.
Revisiting ResNets: Improved Training and Scaling Strategies
NeurIPS, 2021 (Spotlight)
Github
/
Google Cloud
/
Blog post
Revisits training recipes and model scaling strategies for ResNets, improving them to be competitive with the state of the art.
Simple Training Strategies and Model Scaling for Object Detection
arXiv, 2021
Github
/
Google Cloud
Revisits training recipes and model scaling strategies for object detection, improving conventional detectors such as RetinaNet and Cascade R-CNN to be competitive with the state of the art.
Dilated SpineNet for Semantic Segmentation
arXiv, 2021
Github
Designs a scale-permuted backbone with dilated convolutions, learned by Neural Architecture Search (NAS), for semantic segmentation. Achieved SoTA on the Cityscapes semantic segmentation benchmark as of 03/2021.
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
CVPR, 2020
Github
/
Yannic Kilcher's Tutorial
/
Google Cloud
/
Blog post
Designs a scale-permuted backbone with intermediate features and cross-scale connections, learned by Neural Architecture Search (NAS) for object detection. Achieved SoTA on the COCO detection and segmentation benchmarks as of 12/2019.
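The cross-scale connection idea can be illustrated with a small sketch: features from blocks at different scales are resampled to a target resolution and merged. The nearest-neighbor resizing and merge-by-sum below are simplifying assumptions, not the exact SpineNet operations.

    import numpy as np

    def resample(feat, target_hw):
        """Nearest-neighbor resize of an [h, w, c] feature map to [target_hw, target_hw, c]."""
        idx = np.arange(target_hw) * feat.shape[0] // target_hw
        return feat[idx][:, idx]

    f8 = np.ones((64, 64, 16))                      # stride-8 (high-resolution) feature
    f32 = np.ones((16, 16, 16))                     # stride-32 (low-resolution) feature
    target = resample(f8, 32) + resample(f32, 32)   # merge into a stride-16 block
    print(target.shape)                             # (32, 32, 16)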
AMNet: Deep Atrous Multiscale Stereo Disparity Estimation Networks
ICCE, 2020
Introduces AMNet, which combines depthwise-separable convolutions, an extended cost volume, and a stacked atrous multiscale network for disparity estimation. Ranked No. 1 on the KITTI Stereo 2015 and 2012 benchmarks as of 11/2018.
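As a sketch of the cost-volume component, the snippet below builds a basic stereo matching volume by shifting right-image features across candidate disparities. The absolute-difference cost is illustrative; the paper's extended cost volume aggregates several matching costs.

    import numpy as np

    def cost_volume(left, right, max_disp):
        """Build a [max_disp, h, w] volume of matching costs from [h, w, c] features."""
        h, w, _ = left.shape
        vol = np.zeros((max_disp, h, w))
        for d in range(max_disp):
            shifted = np.zeros_like(right)
            shifted[:, d:] = right[:, : w - d]           # shift the right view by d pixels
            vol[d] = np.abs(left - shifted).mean(-1)     # per-pixel, per-disparity cost
        return vol

    left, right = np.random.rand(8, 8, 4), np.random.rand(8, 8, 4)
    print(cost_volume(left, right, max_disp=4).shape)    # (4, 8, 8)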
Fused DNN: A Deep Neural Network Fusion Approach to Fast and Robust Pedestrian Detection
WACV, 2017
Proposes a deep neural network fusion architecture for accurate, fast, and robust pedestrian detection, especially of small and occluded pedestrians. Ranked No. 1 on the Caltech Pedestrian Detection benchmark as of 08/2016.