I have over 14 years of experience delivering large-scale AI solutions and leading multi-year technology strategies. I’ve spent more than seven years managing research and engineering teams across multiple sites, driving innovation and impact in AI.
Compressed Sensing and Network Coding
Peer-to-Peer Networks
I formed and currently lead a team of over 50 AI scientists and engineers that pre-trains, post-trains, and deploys a 200B+ parameter foundation model for LinkedIn’s personalization tasks at scale.
Led a medium-sized team of research scientists and software engineers. Our mission was to advance AI technologies that keep users safe online. My team built multimodal content understanding services used across many Meta integrity products.
Led the LinkedIn Ads Sponsored Update relevance team (five engineers, one analyst, one PM), responsible for modeling and ranking advertising content on the LinkedIn news feed, which serves ads from millions of advertisers to hundreds of millions of LinkedIn daily active users.
Led a four-engineer group for forecasting. We were responsible for (a) predicting sales attributes (dollar amount, close date, and closing probability) for the Sales team, and (b) predicting the likelihood of churn for the Customer Success (CSM) team. (media coverage)
Built an early warning system, based on a Bayesian network, that provides diagnosis and prognosis for large industrial machines.
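A minimal sketch of the idea, assuming pgmpy's discrete Bayesian network API; the variables (Wear, Vibration, Failure) and their probability tables are hypothetical placeholders, not the production model.

```python
# Toy diagnosis/prognosis network: Wear -> Vibration, Wear -> Failure.
# Variables and probabilities are hypothetical, for illustration only.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("Wear", "Vibration"), ("Wear", "Failure")])

cpd_wear = TabularCPD("Wear", 2, [[0.9], [0.1]])        # prior P(Wear)
cpd_vib = TabularCPD("Vibration", 2,
                     [[0.8, 0.2],                       # P(Vibration=0 | Wear)
                      [0.2, 0.8]],                      # P(Vibration=1 | Wear)
                     evidence=["Wear"], evidence_card=[2])
cpd_fail = TabularCPD("Failure", 2,
                      [[0.95, 0.4],
                       [0.05, 0.6]],
                      evidence=["Wear"], evidence_card=[2])
model.add_cpds(cpd_wear, cpd_vib, cpd_fail)
assert model.check_model()

# Diagnosis: given an abnormal vibration reading, how likely is imminent failure?
infer = VariableElimination(model)
print(infer.query(["Failure"], evidence={"Vibration": 1}))
```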
CoT-ICL Lab is a framework for studying chain-of-thought (CoT) and in-context learning (ICL) by decoupling causal structure from token-processing functions. Experiments show that (1) CoT accelerates the transition to higher accuracy across model sizes, and (2) deeper models require fewer in-context examples to leverage CoT effectively, while additional examples help shallower models match deeper-model performance. Along with detailed analyses, we provide theoretical insights. The code is available here.
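A toy sketch of the decoupling idea (my own simplification, not the released CoT-ICL Lab code): a fixed causal DAG determines which earlier positions feed each chain step, while independently sampled token-processing functions map the parent tokens to the next chain token. The vocabulary size, fan-in, and chain length are arbitrary.

```python
# Illustrative synthetic generator separating the causal structure (which
# positions feed each chain step) from the token-processing functions.
import random

VOCAB = list(range(64))          # toy token vocabulary
NUM_INPUTS = 3                   # input tokens per example
CHAIN_LEN = 3                    # chain-of-thought tokens per example

def sample_dag(num_inputs, chain_len, fan_in=2):
    """For each chain position, pick parent positions among earlier tokens."""
    dag = []
    for step in range(chain_len):
        candidates = list(range(num_inputs + step))
        dag.append(random.sample(candidates, min(fan_in, len(candidates))))
    return dag

def sample_token_fns(chain_len):
    """Independent random functions from parent tokens to an output token."""
    tables = [random.sample(VOCAB, len(VOCAB)) for _ in range(chain_len)]
    return [lambda parents, t=t: t[sum(parents) % len(VOCAB)] for t in tables]

def generate_example(dag, fns):
    tokens = [random.choice(VOCAB) for _ in range(NUM_INPUTS)]
    for parents, fn in zip(dag, fns):
        tokens.append(fn([tokens[p] for p in parents]))
    return tokens[:NUM_INPUTS], tokens[NUM_INPUTS:]   # (inputs, chain; last token = answer)

dag, fns = sample_dag(NUM_INPUTS, CHAIN_LEN), sample_token_fns(CHAIN_LEN)
for _ in range(4):                                    # in-context examples share the dag and fns
    print(generate_example(dag, fns))
```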
In this work, we show how to leverage on-policy forward knowledge distillation, model compression (pruning and quantization), and serving optimizations (including RadixAttention, FlashInfer, and tensor parallelism) to deliver a 20x reduction in both cost and latency while maintaining the quality of our 360Brew XL model.
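A minimal sketch of the on-policy forward-KL distillation step, with public GPT-2 checkpoints standing in for the teacher and student (the 360Brew models are not used here) and an illustrative prompt; this is not the paper's training pipeline.

```python
# On-policy forward-KL distillation sketch: the student samples its own
# rollout and is trained to match the teacher's distribution on those tokens.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()

prompt = tok("The member is likely to click because", return_tensors="pt")

# On-policy: sample a continuation from the student itself.
with torch.no_grad():
    rollout = student.generate(**prompt, do_sample=True, max_new_tokens=16,
                               pad_token_id=tok.eos_token_id)

# Score the sampled sequence with both models.
student_logits = student(rollout).logits[:, :-1]
with torch.no_grad():
    teacher_logits = teacher(rollout).logits[:, :-1]

# Forward KL(teacher || student) over the sampled positions.
loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                F.softmax(teacher_logits, dim=-1),
                reduction="batchmean")
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```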
In this report, we demonstrate that the 360Brew model, a 150B-parameter foundation model trained on 1T tokens, can solve over 30 personalization tasks on the LinkedIn platform without task-specific fine-tuning or complex feature engineering. It generalizes to out-of-domain tasks and surfaces, and achieves performance similar to or better than the production models.
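The key point is that tasks are posed to the model as text rather than engineered features. A hypothetical illustration of that framing follows; the field names and wording are mine, not the report's prompt format.

```python
# Hypothetical verbalization of a personalization task as a plain-text prompt;
# schema and wording are illustrative, not the 360Brew prompt format.
def verbalize_task(member_profile, history, candidate):
    past = "\n".join(f"- {h['item']}: {'clicked' if h['clicked'] else 'skipped'}"
                     for h in history)
    return (
        "Instruction: predict whether the member will engage with the candidate item.\n"
        f"Member profile: {member_profile}\n"
        f"Recent activity:\n{past}\n"
        f"Candidate item: {candidate}\n"
        "Answer (yes/no):"
    )

prompt = verbalize_task(
    member_profile="software engineer interested in distributed systems",
    history=[{"item": "post on GPU training", "clicked": True},
             {"item": "post on sales tips", "clicked": False}],
    candidate="article on LLM inference optimization",
)
print(prompt)
```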
In this work, we demonstrate that LLM performance is affected by the relative distance between pieces of information in the context: the farther apart the relevant pieces of information are within a long context, the more the model’s performance deteriorates.
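A small probe sketch in the spirit of this finding (the facts, filler, and gap sizes are illustrative, and `query_fn` is a user-supplied wrapper around any LLM call, not an API from the paper): two facts that must be combined are separated by a growing amount of filler, and accuracy is recorded per gap.

```python
# Illustrative probe: vary the distance between two facts the model must combine.
FILLER = "The quick brown fox jumps over the lazy dog. "
FACT_A = "Alice's badge number is 4471. "
FACT_B = "The badge number of the person who owns the red car is Alice's. "
QUESTION = "\nQuestion: What is the badge number of the red car's owner? Answer:"

def build_prompt(gap_sentences):
    return FACT_A + FILLER * gap_sentences + FACT_B + QUESTION

def measure(query_fn, gaps=(0, 50, 200, 800)):
    """query_fn: callable taking a prompt string and returning the model's answer."""
    return {gap: "4471" in query_fn(build_prompt(gap)) for gap in gaps}

# Example usage with a stub in place of a real model call:
print(measure(lambda prompt: "4471" if len(prompt) < 4000 else "unknown"))
```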
Liger Kernel is a collection of Triton kernels designed specifically for LLM training. It can increase multi-GPU training throughput by 20% and reduce memory usage by 60%.
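A usage sketch, assuming the package exposes `apply_liger_kernel_to_llama` as its Llama patching entry point (as in the public repository); the model name is illustrative.

```python
# Patch a Llama architecture with Liger's Triton kernels before loading the model.
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Swap in fused Triton kernels (RMSNorm, RoPE, SwiGLU, cross-entropy) for Llama.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",      # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Train as usual (e.g. with the transformers Trainer); much of the memory
# saving comes from the fused linear + cross-entropy kernel avoiding
# materialization of the full logits tensor.
```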
This system-model co-design work leverages synchronization in hierarchically partitioned data parallelism to avoid race conditions in gradient updates during LLM training.
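A distributed-PyTorch sketch of the general pattern, not the paper's implementation: gradients are reduce-scattered within a node, synchronized explicitly, and only then reduced across nodes, so no rank updates its gradient shard before the intra-node communication has finished. It assumes a torchrun launch (which sets LOCAL_RANK and LOCAL_WORLD_SIZE); group layout and buffer sizes are illustrative.

```python
# Hierarchical gradient reduction with explicit synchronization (illustrative).
# Launch with: torchrun --nproc-per-node=<gpus> ... on multiple nodes.
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
local_rank = int(os.environ["LOCAL_RANK"])
gpus_per_node = int(os.environ["LOCAL_WORLD_SIZE"])
num_nodes = world // gpus_per_node
torch.cuda.set_device(local_rank)

# Every rank must take part in every new_group call; keep only the groups it belongs to.
intra = inter = None
for n in range(num_nodes):
    g = dist.new_group([n * gpus_per_node + i for i in range(gpus_per_node)])
    if n == rank // gpus_per_node:
        intra = g
for i in range(gpus_per_node):
    g = dist.new_group([n * gpus_per_node + i for n in range(num_nodes)])
    if i == local_rank:
        inter = g

grad = torch.randn(gpus_per_node * 1024, device="cuda")   # toy gradient buffer
shard = torch.empty(1024, device="cuda")                   # this rank's shard

# Stage 1: reduce-scatter inside the node, launched asynchronously.
work = dist.reduce_scatter_tensor(shard, grad, group=intra, async_op=True)
work.wait()   # explicit sync: the shard must be complete before stage 2 touches it

# Stage 2: all-reduce the shard across nodes; only then is it safe to apply
# the optimizer update to this rank's gradient shard.
dist.all_reduce(shard, group=inter)
dist.destroy_process_group()
```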
We formulate CoT as a reasoning graph and propose a prompting strategy for multi-step reasoning that can capture complex processes in tasks such as mathematics and commonsense reasoning.
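An illustrative sketch of the graph-to-prompt idea (my own simplification, not the paper's exact strategy): reasoning steps form a DAG, and the prompt is emitted in topological order so each step only refers to results that have already been stated. The example problem and step texts are made up.

```python
# Turn a small reasoning DAG into a step-by-step prompt by emitting nodes
# in topological order (a simplification of the graph-based strategy).
from graphlib import TopologicalSorter

# Each node is one reasoning step; edges point from prerequisites to dependents.
steps = {
    "unit_price": "Each notebook costs $3.",
    "quantity": "Sara buys 4 notebooks.",
    "subtotal": "Multiply the unit price by the quantity to get the subtotal.",
    "tax": "Add 10% tax to the subtotal.",
    "answer": "State the final amount Sara pays.",
}
graph = {
    "subtotal": {"unit_price", "quantity"},
    "tax": {"subtotal"},
    "answer": {"tax"},
}

def graph_to_prompt(steps, graph, question):
    order = TopologicalSorter(graph).static_order()
    lines = [f"Step {i + 1}: {steps[node]}" for i, node in enumerate(order)]
    return question + "\nLet's reason step by step.\n" + "\n".join(lines)

print(graph_to_prompt(steps, graph, "Question: How much does Sara pay in total?"))
```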
A race condition between the AllGather and the device-to-device copy for the second partition causes instability when training large models such as Llama-7B and Falcon-40B on a moderately large number of GPUs. After discovering the algorithmic issue, we landed a fix in the DeepSpeed repository.
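A single-GPU sketch of this class of hazard and its fix (not the actual DeepSpeed patch): a device-to-device copy issued on one stream consumes a buffer that another stream is still producing, and the fix is to order the copy after the producer stream explicitly. The stream names and buffer are placeholders.

```python
# Illustrative stream-ordering hazard and fix on a single GPU.
import torch

assert torch.cuda.is_available()
comm_stream = torch.cuda.Stream()     # stands in for the AllGather stream
copy_stream = torch.cuda.Stream()     # stream issuing the device-to-device copy

src = torch.zeros(1 << 24, device="cuda")
dst = torch.empty_like(src)

with torch.cuda.stream(comm_stream):
    src.add_(1.0)                     # stand-in for the gathered partition landing in src

# THE FIX: order the copy after the producer stream. Without this wait, the
# copy below may read src before the producing kernel has finished.
copy_stream.wait_stream(comm_stream)

with torch.cuda.stream(copy_stream):
    dst.copy_(src, non_blocking=True)

torch.cuda.synchronize()
print(bool((dst == 1.0).all()))       # True once the streams are ordered correctly
```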
We propose a framework for understanding how data augmentation interacts with class-level learning dynamics, and show that simple class-conditional augmentation strategies informed by this framework improve performance on the negatively affected classes.
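A sketch of one simple class-conditional strategy consistent with this description (the specific transforms, dataset, and affected-class ids are illustrative placeholders): classes that uniform augmentation hurts get a milder policy, while the rest keep the standard one.

```python
# Illustrative class-conditional augmentation: classes hurt by the uniform
# policy (per-class accuracy drop measured beforehand) get a milder transform.
import torchvision.transforms as T
from torchvision.datasets import CIFAR10
from torch.utils.data import Dataset

strong = T.Compose([T.RandomCrop(32, padding=4), T.RandomHorizontalFlip(),
                    T.ColorJitter(0.4, 0.4, 0.4), T.ToTensor()])
mild = T.Compose([T.RandomHorizontalFlip(), T.ToTensor()])

NEGATIVELY_AFFECTED = {3, 5}   # hypothetical: classes whose accuracy drops under `strong`

class ClassConditionalAug(Dataset):
    def __init__(self, base):
        self.base = base                       # base dataset returning (PIL image, label)
    def __len__(self):
        return len(self.base)
    def __getitem__(self, idx):
        img, label = self.base[idx]
        transform = mild if label in NEGATIVELY_AFFECTED else strong
        return transform(img), label

train = ClassConditionalAug(CIFAR10(root="./data", train=True, download=True))
```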