Latest News

Learning Binary Code Representations for Security Applications

Abstract: Learning a numeric representation (also known as an embedded vector, or simply an embedding) for a piece of binary code (an instruction, a basic block, a function, or even an entire program) has many important security applications, ranging from vulnerability search and plagiarism detection to malware classification. By reducing a binary code with complex control-flow and data-flow...

Too Many Dimensional Gas Chromatography/Mass Spectrometry Analysis of Compounds in Smoke Samples from Wildland Fires

The air quality and fire management communities are faced with increasingly difficult decisions regarding critical fire management activities, given the potential contribution of wildland fires to fine particulate matter (PM2.5). Unfortunately, in model frameworks used for air quality management, the ability to represent PM2.5 from fires is severely limited. This is due in part to...

Lost in translation: The challenges and benefits of understanding complex insect societies

Social insects include the termites, ants and the social bees and wasps, which are a very large and ecologically very successful group of animals. They are also of tremendous importance for humans. Whereas some social insects are serious pest species that become increasingly difficult to control, others are of central importance for agricultural food production...

Outcomes from an experiment in creating data science centers

The Berkeley Institute for Data Science (or BIDS) was founded as part of a high-profile, multi-university initiative funded by the Moore and Sloan Foundations, collectively known as the Moore-Sloan Data Science Environments (or MSDSE), with the mission of creating "institutional change" around data science in academia. I will discuss some of the lessons learned in...

Fusioning big-data ecology and genomics: from data to dynamic system understanding and prediction

Much of the current effort to apply data science in both ecology and genomics has focused on a data-driven, static rather than fully dynamic, understanding of those systems. In this talk, I will introduce our recent work on fusing data- and model-driven approaches to understand the fundamental nitrogen biochemical processes in fluctuating soil redox...

Automating Deep Neural Network Model Selection for Edge Inference

The ever-increasing size of deep neural network (DNN) models once implied that they were limited to cloud data centers for runtime inference. Nonetheless, the recent plethora of DNN model compression techniques has successfully overcome this limit, making it a reality that DNN-based inference can be run on numerous resource-constrained edge devices including mobile...

Using bio-monitoring data to infer ecological dynamics in streams and rivers

Government agencies have long collected biological samples to assess and monitor the environmental health of streams and rivers. In California, one such example of this monitoring is the Surface Water Ambient Monitoring Protocol (SWAMP); monitoring under this protocol has generated a vast amount of data that has recently been made publicly available. These data include...

Accelerated Machine Learning for Computational Proteomics

In the past few decades, mass spectrometry-based proteomics has dramatically improved our fundamental knowledge of biology, leading to advancements in the understanding of diseases and methods for clinical diagnoses. However, the complexity and sheer volume of typical proteomics datasets make both fast and accurate analysis difficult to accomplish simultaneously; while machine learning methods have proven...

Data Science and Environmental Systems: Applications of Deterministic Models, Optimization, and Machine Learning to Address Multi-scale Air Quality Challenges

Globally, human exposure to air pollution is a known risk factor for increased morbidity and mortality, and its chemical composition can vary significantly by region and season. Variabilities are largely driven by topography, meteorology, land cover, and human activities. State-of-the-science air quality modeling systems, such as the U.S. EPA’s Community Multiscale Air Quality (CMAQ) model...

The adapting brain: The role of posterior parietal cortex in learning and adaption

The ability to select between competing options and adapt to new situations underlies our impressive capabilities of playing soccer, flying aircraft and skiing at the Olympics. To select between actions, the brain needs an accurate representation of the state of the body and the environment it is in. Despite the sophistication of our sensory system...

Mobile AR/VR with Edge-based Deep Learning

Augmented and virtual reality (AR/VR) are at the frontier of mobile computing. While AR/VR applications are gaining popularity today, the technologies to support these applications are far from mature. This talk will first outline the current state of mobile AR/VR platforms, including what functionality is currently available, what is needed/desired, and how edge computing can...

Putting the ‘Science’ Into Data Science

When people talk about skills that are important for data science, they tend to focus only on the technical skills, like statistics and computer programming. Often overlooked is the scientific mindset. Being a critical thinker helps you interpret data and avoid doing analysis on auto-pilot. A skeptical mindset will keep you vigilant for the “silent...

Constructing quantitative models of pathogen evolution

Highly mutable pathogens such as influenza and HIV pose a serious threat to public health. Better understanding of how these pathogens evolve could inform efforts to treat and prevent infection. In this talk, I’ll discuss the statistical problem of inferring an evolutionary model from data, and how we’ve developed a new method to solve this...

Information Loss in Neural Classifiers from Sampling

An estimator is limited to the information that it has about the variable it's estimating. But this information is limited to what the estimator has seen from the samples training it. The full information of a random variable cannot be transferred to an estimator by finite samples - some information is lost. This presentation analyzes...

Training machines to understand the Universe

Upcoming large-scale datasets in astrophysics will challenge our ability to effectively analyze and interpret the data. Surveys of the 2020s (e.g., Euclid, LSST, WFIRST and SPHEREx) will provide multiple deep views of the universe, each survey with its own observational characteristics such as noise levels, resolution, and wavelength coverage. How do we best interpret the...

Constructing Confidence Intervals for Selected Parameters

In large-scale problems, it is common practice to select important parameters by a procedure such as the BH procedure (Benjamini and Hochberg, 1995) and construct confidence intervals (CIs) for further investigation while the false coverage-statement rate (FCR) for the CIs is controlled at a desired level. Although the well-known BY CIs (Benjamini and Yekutieli, 2005)...

Spatial Analytics for Efficient and Equitable Public Transportation

Assessing the performance of public transportation services has long been an important yet challenging issue for transportation agencies and researchers. However, the performance evaluation of transportation services is complicated by an array of quantitative measures available to assess the goals and the diversity in the goals themselves, which usually include improving operational efficiency and providing...

Leveraging big datasets to understand how ecological communities respond to global change

Simultaneous ongoing changes to earth's ecosystems, including climate change and species invasions, are reshuffling ecological communities in space and time. Spatially, species distributions are shifting, often in species-specific ways, leading to novel communities. Changing climate is also altering species’ phenologies – i.e., the seasonal timing of life cycle events such as flowering, bird migration, or...

Multilevel Joint Modeling of Hospitalization and Survival in Patients on Dialysis

More than 720,000 patients with end-stage renal disease in the US require life-sustaining dialysis treatment that is predominantly received at local dialysis facilities. In this population of typically older patients with a high morbidity burden, hospitalization is frequent at a rate of about twice per patient-year. Aside from frequent hospitalizations, which is a major source...

Modeling Data Using Regression: Testing Conjectures Strongly

Too often, exploratory approaches to data analysis are used, even in situations in which confirmatory methods could be used. The contrast between exploratory and confirmatory approaches to analyses will be emphasized, and several examples will be presented that illustrate the advantages of confirmatory methods — particularly the avoidance of Type II errors when confirmatory methods...