Data Science

Estimation and Sensitivity Analysis for Causal Decomposition: Assessing Robustness Toward Omitted Variable Bias

Abstract: A key objective of decomposition analysis is to identify risks or resources (‘mediators’) that contribute to disparities between groups of individuals defined by social characteristics such as race, ethnicity, gender, class, and sexual orientation. In decomposition analysis, scholarly interest often centers on estimating how much the disparity (e.g., health disparities between Black women...

June 03, 2022

A Bayesian multilevel time-varying framework for joint modeling of hospitalization and survival in patients on dialysis

Abstract: Over 782,000 individuals in the U.S. have end-stage kidney disease, with about 72% of patients on dialysis, a life-sustaining treatment. Dialysis patients experience high mortality and frequent hospitalizations, about twice per year. These poor outcomes are exacerbated at key time periods, such as the fragile period after transition to dialysis. In order to...

May 27, 2022

Deplatforming Right-Wing Extremists on Twitter Following the January 6 Insurrection

Abstract: What happened when Twitter deplatformed 70,000 right-wing extremists following the January 6 insurrection? Using a panel of over a half million active Twitter users and a sharp regression discontinuity design, we test the causal effects of this intervention on the circulation of misinformation by those deplatformed, and by users from adjacent groups such as...

May 20, 2022

Understanding Large ML Models through the Structure of Feature Covariance

Abstract: An overarching goal in machine learning is to enable accurate statistical inference in the setting where the sample size is less than the number of parameters. This overparameterized setting is particularly common in deep learning, where it is typical to train large neural nets with relatively small sample sizes and little concern about overfitting...

May 06, 2022

Multiview learning for knowledge discovery

Abstract: Extracting hidden patterns of multiview data containing heterogeneous feature representations is attracting more and more attention in various scientific fields such as image processing and natural language processing. In this talk we will present a comprehensive unsupervised framework that leverages existing and novel multiview learning models, towards obtaining a single node embedding from a...

April 29, 2022

Characterizing soil – plant – water relationships across scales for sustainable agricultural management

Abstract: Agricultural systems are pressured by growing global population, increasing water scarcity, and changing climate. In the pursuit of increasing food security, agriculture (especially intensive systems) should also minimize negative and undesired impacts on the environment and on rural societies. Part of the solution to this challenge lies in understanding how environmental factors such as...

March 06, 2020

Immune regulatory pathways in infection, inflammation and sepsis

Abstract: My lab investigates the immune responses to infection and inflammation using mouse models of parasitic worm infection and clinical samples from sepsis patients. Our ultimate goal is to identify protective or pathogenic immune pathways that we can target for diagnostic or therapeutic purposes. In our mouse infection models we investigate macrophages as first responders...

February 28, 2020

Learning Binary Code Representations for Security Applications

Abstract: Learning a numeric representation (also known as an embedded vector, or simply an embedding) for a piece of binary code (an instruction, a basic block, a function, or even an entire program) has many important security applications, ranging from vulnerability search and plagiarism detection to malware classification. By reducing a binary code with complex control-flow and data-flow...

February 14, 2020

Too Many Dimensional Gas Chromatography/Mass Spectrometry Analysis of Compounds in Smoke Samples from Wildland Fires

The air quality and fire management communities are faced with increasingly difficult decisions regarding critical fire management activities, given the potential contribution of wildland fires to fine particulate matter (PM2.5). Unfortunately, in model frameworks used for air quality management, the ability to represent PM2.5 from fires is severely limited. This is due in part to...

February 07, 2020

Lost in translation: The challenges and benefits of understanding complex insect societies

Social insects include the termites, ants and the social bees and wasps, which are a very large and ecologically very successful group of animals. They are also of tremendous importance for humans. Whereas some social insects are serious pest species that become increasingly difficult to control, others are of central importance for agricultural food production...

January 31, 2020

Outcomes from an experiment in creating data science centers

The Berkeley Institute for Data Science (or BIDS) was founded as part of a high-profile, multi-university initiative funded by the Moore and Sloan Foundations, collectively known as the Moore-Sloan Data Science Environments (or MSDSE), with the mission of creating “institutional change” around data science in academia. I will discuss some of the lessons learned in...

January 24, 2020

Fusioning big-data ecology and genomics: from data to dynamic system understanding and prediction

Much of the current effort to apply data science in both ecology and genomics has focused on a data-driven, static rather than fully dynamic understanding of those systems. In this talk, I will introduce our recent work on fusioning data- and model-driven approaches to understand the fundamental nitrogen biochemical processes in fluctuating soil redox...

January 17, 2020

Automating Deep Neural Network Model Selection for Edge Inference

The ever-increasing size of deep neural network (DNN) models once implied that they were limited to cloud data centers for runtime inference. Nonetheless, the recent plethora of DNN model compression techniques has successfully overcome this limit, making it a reality that DNN-based inference can be run on numerous resource-constrained edge devices including mobile...

January 10, 2020

Using bio-monitoring data to infer ecological dynamics in streams and rivers

Government agencies have long collected biological samples to assess and monitor the environmental health of streams and rivers. In California, one such example of this monitoring is the Surface Water Ambient Monitoring Protocol (SWAMP); monitoring under this protocol has generated a vast amount of data that has recently been made publicly available. These data include...

December 06, 2019

Accelerated Machine Learning for Computational Proteomics

In the past few decades, mass spectrometry-based proteomics has dramatically improved our fundamental knowledge of biology, leading to advancements in the understanding of diseases and methods for clinical diagnoses. However, the complexity and sheer volume of typical proteomics datasets make both fast and accurate analysis difficult to accomplish simultaneously; while machine learning methods have proven...

November 22, 2019

Data Science and Environmental Systems: Applications of Deterministic Models, Optimization, and Machine Learning to Address Multi-scale Air Quality Challenges

Globally, human exposure to air pollution is a known risk factor for increased morbidity and mortality, and its chemical composition can vary significantly by region and season. Variabilities are largely driven by topography, meteorology, land cover, and human activities. State-of-the-science air quality modeling systems, such as the U.S. EPA’s Community Multiscale Air Quality (CMAQ) model...

November 15, 2019

The adapting brain: The role of posterior parietal cortex in learning and adaption

The ability to select between competing options and adapt to new situations underlies our impressive capabilities of playing soccer, flying aircraft and skiing at the Olympics. To select between actions, the brain needs an accurate representation of the state of the body and the environment it is in. Despite the sophistication of our sensory system...

November 08, 2019

Mobile AR/VR with Edge-based Deep Learning

Augmented and virtual reality (AR/VR) are at the frontier of mobile computing. While AR/VR applications are gaining popularity today, the technologies to support these applications are far from mature. This talk will first outline the current state of mobile AR/VR platforms, including what functionality is currently available, what is needed/desired, and how edge computing can...

November 01, 2019

Putting the ‘Science’ Into Data Science

When people talk about skills that are important for data science, they tend to focus only on the technical skills, like statistics and computer programming. Often overlooked is the scientific mindset. Being a critical thinker helps you interpret data and avoid doing analysis on auto-pilot. A skeptical mindset will keep you vigilant for the “silent...

October 18, 2019

Constructing quantitative models of pathogen evolution

Highly mutable pathogens such as influenza and HIV pose a serious threat to public health. Better understanding of how these pathogens evolve could inform efforts to treat and prevent infection. In this talk, I’ll discuss the statistical problem of inferring an evolutionary model from data, and how we’ve developed a new method to solve this...

October 11, 2019
