Abstract: Learning a numeric representation (also known as an embedded vector, or simply an embedding) for a piece of binary code (an instruction, a basic block, a function, or even an entire program) has many important security applications, ranging from vulnerability search and plagiarism detection to malware classification. By reducing a binary code with complex control-flow and data-flow...
The air quality and fire management communities are faced with increasingly difficult decisions regarding critical fire management activities, given the potential contribution of wildland fires to fine particulate matter (PM2.5). Unfortunately, in model frameworks used for air quality management, the ability to represent PM2.5 from fires is severely limited. This is due in part to...
Social insects, which include the termites, the ants, and the social bees and wasps, are a very large and ecologically very successful group of animals. They are also of tremendous importance for humans. Whereas some social insects are serious pest species that are becoming increasingly difficult to control, others are of central importance for agricultural food production...
The Berkeley Institute for Data Science (or BIDS) was founded as part of a high-profile, multi-university initiative funded by the Moore and Sloan Foundations, collectively known as the Moore-Sloan Data Science Environments (or MSDSE), with the mission of creating “institutional change” around data science in academia. I will discuss some of the lessons learned in...
Much of the current application of data science in both ecology and genomics has focused on a data-driven, static rather than fully dynamic understanding of those systems. In this talk, I will introduce our recent work on fusing data- and model-driven approaches to understand the fundamental nitrogen biochemical processes in fluctuating soil redox...
The ever-increasing size of deep neural network (DNN) models once implied that runtime inference was limited to cloud data centers. Nonetheless, the recent plethora of DNN model compression techniques has successfully overcome this limit, making it a reality that DNN-based inference can run on numerous resource-constrained edge devices including mobile...
Government agencies have long collected biological samples to assess and monitor the environmental health of streams and rivers. In California, one such example of this monitoring is the Surface Water Ambient Monitoring Protocol (SWAMP); monitoring under this protocol has generated a vast amount of data that has recently been made publicly available. These data include...
In the past few decades, mass spectrometry-based proteomics has dramatically improved our fundamental knowledge of biology, leading to advancements in the understanding of diseases and methods for clinical diagnoses. However, the complexity and sheer volume of typical proteomics datasets make both fast and accurate analysis difficult to accomplish simultaneously; while machine learning methods have proven...
Globally, human exposure to air pollution is a known risk factor for increased morbidity and mortality, and the chemical composition of that pollution can vary significantly by region and season. This variability is largely driven by topography, meteorology, land cover, and human activities. State-of-the-science air quality modeling systems, such as the U.S. EPA’s Community Multiscale Air Quality (CMAQ) model...
The ability to select between competing options and adapt to new situations underlies our impressive capabilities of playing soccer, flying aircraft, and skiing at the Olympics. To select between actions, the brain needs an accurate representation of the state of the body and the environment it is in. Despite the sophistication of our sensory system...
Augmented and virtual reality (AR/VR) are at the frontier of mobile computing. While AR/VR applications are gaining popularity today, the technologies to support these applications are far from mature. This talk will first outline the current state of mobile AR/VR platforms, including what functionality is currently available, what is needed/desired, and how edge computing can...
When people talk about skills that are important for data science, they tend to focus only on the technical skills, like statistics and computer programming. Often overlooked is the scientific mindset. Being a critical thinker helps you interpret data and avoid doing analysis on auto-pilot. A skeptical mindset will keep you vigilant for the “silent...
Highly mutable pathogens such as influenza and HIV pose a serious threat to public health. Better understanding of how these pathogens evolve could inform efforts to treat and prevent infection. In this talk, I’ll discuss the statistical problem of inferring an evolutionary model from data, and how we’ve developed a new method to solve this...
An estimator is limited by the information it has about the variable it is estimating, and that information is in turn limited to what the estimator has seen in its training samples. The full information of a random variable cannot be transferred to an estimator by a finite sample: some information is lost. This presentation analyzes...
Upcoming large-scale datasets in astrophysics will challenge our ability to effectively analyze and interpret the data. Surveys of the 2020s (e.g., Euclid, LSST, WFIRST and SPHEREx) will provide multiple deep views of the universe, each survey with its own observational characteristics such as noise levels, resolution, and wavelength coverage. How do we best interpret the...
In large-scale problems, it is common practice to select important parameters by a procedure such as the BH procedure (Benjamini and Hochberg, 1995) and construct confidence intervals (CIs) for further investigation while the false coverage-statement rate (FCR) for the CIs is controlled at a desired level. Although the well-known BY CIs (Benjamini and Yekutieli, 2005)...
Assessing the performance of public transportation services has long been an important yet challenging issue for transportation agencies and researchers. The evaluation is complicated both by the array of quantitative measures available to assess the goals and by the diversity of the goals themselves, which usually include improving operational efficiency and providing...
Simultaneous ongoing changes to Earth's ecosystems, including climate change and species invasions, are reshuffling ecological communities in space and time. Spatially, species distributions are shifting, often in species-specific ways, leading to novel communities. Changing climate is also altering species’ phenologies – i.e., the seasonal timing of life cycle events such as flowering, bird migration, or...
More than 720,000 patients with end-stage renal disease in the US require life-sustaining dialysis treatment, which is predominantly received at local dialysis facilities. In this population of typically older patients with a high morbidity burden, hospitalization is frequent, occurring about twice per patient-year. Aside from frequent hospitalizations, which are a major source...
Too often, exploratory approaches to data analysis are used, even in situations in which confirmatory methods could be used. The contrast between exploratory and confirmatory approaches to analyses will be emphasized, and several examples will be presented that illustrate the advantages of confirmatory methods — particularly the avoidance of Type II errors when confirmatory methods...