I have over 14 years of experience delivering large-scale AI solutions and leading multi-year technology strategies. I’ve spent more than seven years managing research and engineering teams across multiple sites, driving innovation and impact in AI.
Compressed Sensing and Network Coding
Peer-to-Peer Networks
I formed and currently lead a team of over 50 AI scientists and engineers to pre-train, post-train, and deploy a 200B+ parameter foundation model for LinkedIn’s personalization tasks at scale.
Led a medium-sized team of research scientists and software engineers whose mission was to advance AI technologies to keep users safe online. My team built multimodal content understanding services used across many Meta integrity products.
Led the LinkedIn Ads Sponsored Update relevance team (five engineers, one analyst, one PM). The team was responsible for modeling and ranking advertising content on the LinkedIn news feed, which serves ads from millions of advertisers to hundreds of millions of LinkedIn daily active users.
Led a four-engineer forecasting group responsible for a) predicting sales attributes (dollar amount, close date, and closing probability) for the Sales team and b) predicting the likelihood of churn for the Customer Success (CSM) team.
Built an early warning system based on a Bayesian network that provides diagnosis and prognosis for large industrial machines.
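A minimal sketch of the modeling idea, using the pgmpy library; the network structure, variable names, and probabilities below are illustrative placeholders, not the deployed system:

```python
# Toy Bayesian network for machine-health diagnosis (illustrative only).
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Hypothetical structure: bearing wear and overheating both influence vibration;
# vibration drives the failure-warning node.
model = BayesianNetwork([
    ("BearingWear", "Vibration"),
    ("Overheating", "Vibration"),
    ("Vibration", "FailureWarning"),
])

model.add_cpds(
    TabularCPD("BearingWear", 2, [[0.9], [0.1]]),
    TabularCPD("Overheating", 2, [[0.8], [0.2]]),
    TabularCPD(
        "Vibration", 2,
        [[0.95, 0.5, 0.4, 0.05],   # P(Vibration=low  | parents)
         [0.05, 0.5, 0.6, 0.95]],  # P(Vibration=high | parents)
        evidence=["BearingWear", "Overheating"], evidence_card=[2, 2],
    ),
    TabularCPD(
        "FailureWarning", 2,
        [[0.99, 0.3], [0.01, 0.7]],
        evidence=["Vibration"], evidence_card=[2],
    ),
)

# Diagnosis: given high vibration, infer the probability of bearing wear.
posterior = VariableElimination(model).query(
    variables=["BearingWear"], evidence={"Vibration": 1}
)
print(posterior)
```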
In this report, we demonstrate that the 360Brew model, a 150B-parameter foundation model trained on 1T tokens, can solve over 30 personalization tasks on the LinkedIn platform without task-specific fine-tuning or complex feature engineering. The model generalizes to out-of-domain tasks and surfaces, and achieves performance similar to or better than the production models.
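A minimal sketch of how a personalization task can be posed as plain text to such a model; the prompt format below is an illustrative assumption, not the actual 360Brew template:

```python
# Express a recommendation decision as a text prompt so a single decoder-only
# model can score it without task-specific features (format is hypothetical).
def build_prompt(member_history: list[str], candidate: str) -> str:
    history = "\n".join(f"- {item}" for item in member_history)
    return (
        "Instruction: Given the member's recent activity, will they engage "
        "with the candidate item? Answer Yes or No.\n"
        f"Recent activity:\n{history}\n"
        f"Candidate item: {candidate}\n"
        "Answer:"
    )

print(build_prompt(
    ["liked a post about Triton kernels", "followed an ML engineering page"],
    "article: 'Scaling LLM training throughput'",
))
```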
In this work, we demonstrate that LLM performance is affected by the relative distance between pieces of information in the context: the further apart the relevant pieces are within a long context, the more the model’s performance deteriorates.
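A minimal sketch of the kind of probe this finding suggests: place two related facts at a controlled distance inside filler text and check whether answers degrade as the gap grows. The harness, facts, and wording are illustrative, not the paper’s benchmark:

```python
# Probe relative-distance sensitivity by padding the gap between two facts.
FILLER = "This sentence is unrelated filler. "

def build_context(fact_a: str, fact_b: str, gap_sentences: int) -> str:
    return fact_a + " " + FILLER * gap_sentences + fact_b

def probe(model_answer_fn, question: str, expected: str, gaps=(0, 50, 200, 800)):
    """model_answer_fn(context, question) -> str is assumed to wrap an LLM call."""
    for gap in gaps:
        context = build_context(
            "Alice transferred the package to Bob.",
            "Bob later handed the package to Carol.",
            gap,
        )
        answer = model_answer_fn(context, question)
        print(f"gap={gap:4d} sentences  correct={expected.lower() in answer.lower()}")

# Example question whose answer requires combining both distant facts:
# probe(my_llm, "Who ended up with the package?", expected="Carol")
```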
Liger Kernel is a collection of Triton kernels designed specifically for LLM training. It can increase multi-GPU training throughput by 20% and reduce memory usage by 60%.
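A minimal usage sketch; the patch function name follows the Liger Kernel README as I recall it, so check the repository for the current API before relying on it:

```python
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Monkey-patch Hugging Face Llama modules (RMSNorm, RoPE, SwiGLU, cross-entropy)
# with the corresponding Triton kernels before the model is instantiated.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # any HF Llama checkpoint
    torch_dtype=torch.bfloat16,
)
# Training then proceeds with the usual Trainer / FSDP / DeepSpeed setup;
# the patched kernels reduce activation memory and improve throughput.
```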
This system-model co-design work focuses on leveraging synchronization in data-parallel hierarchical partitioning to avoid race conditions in gradient updates during LLM training.
We formulate chain-of-thought (CoT) prompting as a reasoning graph and propose a prompting strategy for multi-step reasoning that can capture complex processes in tasks such as mathematics and commonsense reasoning.
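A minimal sketch of the general idea: serialize a small dependency graph of reasoning steps into the prompt so the model fills in each node in order. The graph and wording are illustrative, not the exact strategy from the paper:

```python
# Represent intermediate reasoning steps as graph nodes with explicit dependencies.
REASONING_GRAPH = {
    "step1": {"depends_on": [], "goal": "Extract the quantities given in the problem."},
    "step2": {"depends_on": ["step1"], "goal": "Set up the equation relating them."},
    "step3": {"depends_on": ["step2"], "goal": "Solve the equation."},
    "answer": {"depends_on": ["step3"], "goal": "State the final answer."},
}

def graph_prompt(question: str, graph: dict) -> str:
    lines = [f"Question: {question}",
             "Reason by completing the following steps in order:"]
    for name, node in graph.items():
        deps = ", ".join(node["depends_on"]) or "none"
        lines.append(f"[{name}] (uses: {deps}) {node['goal']}")
    return "\n".join(lines)

print(graph_prompt(
    "A train travels 120 km in 1.5 hours. What is its average speed?",
    REASONING_GRAPH,
))
```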
The race condition between AllGather and the device-to-device copy for the second partition causes instability when training large models such as Llama-7B and Falcon-40B on a moderately large number of GPUs. After discovering the algorithmic issue, we landed the fix in the DeepSpeed repository.
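A minimal sketch of the general synchronization pattern, not the actual DeepSpeed patch: an explicit CUDA event makes the device-to-device copy wait for the collective that produces its input when the two run on different streams.

```python
import torch
import torch.distributed as dist

def gather_then_copy(partition: torch.Tensor, gathered: torch.Tensor,
                     staging: torch.Tensor, copy_stream: torch.cuda.Stream):
    # AllGather writes `gathered` on the current (communication) stream.
    dist.all_gather_into_tensor(gathered, partition)

    # Record completion of the collective and make the copy stream wait on it,
    # so the D2D copy cannot start before the partition has been written.
    done = torch.cuda.Event()
    done.record()                 # records on the current stream
    copy_stream.wait_event(done)  # orders the copy after the collective
    with torch.cuda.stream(copy_stream):
        staging.copy_(gathered, non_blocking=True)
```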
We propose a framework for understanding how data augmentation interacts with class-level learning dynamics, and show that simple class-conditional augmentation strategies informed by the framework improve performance on the negatively affected classes.
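A minimal sketch of a class-conditional policy in PyTorch; the affected classes and transforms are illustrative placeholders, not the paper’s strategy:

```python
import torchvision.transforms as T

AFFECTED_CLASSES = {3, 7}  # hypothetical: classes hurt by aggressive cropping

light_aug = T.Compose([T.RandomHorizontalFlip(), T.ToTensor()])
full_aug = T.Compose([
    T.RandomResizedCrop(32, scale=(0.6, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

class ClassConditionalAugment:
    """Wrap a labeled image dataset and pick the transform based on the label."""
    def __init__(self, dataset):
        self.dataset = dataset  # yields (PIL image, label)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        img, label = self.dataset[idx]
        transform = light_aug if label in AFFECTED_CLASSES else full_aug
        return transform(img), label
```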