Hello world!

Hello world! I am Amy Zhang, a PhD candidate in Statistics at Penn State. My research focuses mainly on Bayesian methods, but I have also done applied work in text analysis and HIV prevalence estimation. My research projects are listed below.

 

Methodology work

The most interesting problems to me are those that are human, and so I started off working on applied projects at the intersection of health and social issues. These kinds of problems are still the most motivating for me, but I found in my research that, because of extreme data imbalance, it was necessary to take a step back and examine how that imbalance was affecting model estimates. To my surprise, as someone who has always emphasized application, I have found myself happily occupied with theoretical work for the past two years. Along the way, we also discovered a novel method for approximating cross-validated mean estimates. These two projects have become the bulk of my PhD work: a) quantifying the effect of data imbalance, and b) approximating cross-validated mean estimates for Bayesian hierarchical regression models.

 

The effect of data imbalance and information borrowing

Through many historical cross-validation studies, Bayesian hierarchical models have gained a reputation for being robust against data imbalance and data scarcity. It has not been clear how or why, though early studies have suggested that this robustness is due to shrinkage and information pooling. The application of shrinkage and pooling within the literature is limited, however: it is not possible, for instance, to get an overall value of how much a data point Y is shrunk to obtain the estimate for Y, versus how much is borrowed from other data points. The goal of our work has been to quantify exactly how much is borrowed from point X to obtain the estimate for point Y. Our novel method generalizes the shrinkage and pooling factors of the Bayesian literature to any regression model, including Frequentist regression methods. In essence, it can be used to look “under the hood” of Bayesian hierarchical regression models and to understand exactly how data imbalance affects estimates. This is particularly important for any method applied to social issues, where well-documented instances of data imbalance have led to what is now commonly called “machine bias”. The job of a model is to replicate the data; if the data are imbalanced, and that imbalance is due to social bias, then the model can end up reinforcing social biases. Some overview papers from Nature below:

We are also working on a generic Shiny app and R package which can be used to implement our method.

 

Approximate cross-validated mean estimates

Bayesian hierarchical regression models (BHRMs) are popular for their ability to model complex dependence structures and provide probabilistic uncertainty estimates, but they can be computationally expensive to run. Cross-validation (CV) is therefore not commonly used to evaluate their predictive performance. Our method circumvents the need to re-run computationally costly estimation methods for each cross-validation fold and makes CV more feasible for large BHRMs. We have established theoretical results, with empirical comparisons on several publicly available data sets. Our method is an order of magnitude faster than existing methods and at least as accurate, often producing estimates that are equivalent to full CV.

 

Applied work

I believe that technical problems require human solutions. In healthcare, a surprisingly large challenge for doctors and pharmacies is simply getting their patients, particularly elderly patients, to take their medicine consistently. Another example is in agriculture, where it was found that social conditions such as number of children, general health, and income level are far more predictive of crop growth in third-world countries than which fertilizer is actually applied to the soil (this work was done by the Data Science for Social Good organization at the University of Washington).

 

Estimating HIV prevalence for high-risk groups†

The HIV epidemic as a whole is in decline, but HIV still disproportionately impacts certain high-risk groups. These high-risk groups are often overlooked or ignored in society (one example is intravenous drug users), and so while they remain some of the people most affected by the HIV epidemic, the data for these groups are scarce. Effective policies that will help these people require, above all, accurate data on the size of these groups and their HIV prevalence, both of which are lacking.

As part of my thesis, my advisor, Dr. Le Bao, and I work with UNAIDS to help improve HIV estimation for these at-risk populations through Bayesian hierarchical modeling. Bayesian hierarchical modeling offers efficient pooling of information, essentially allowing us to borrow information from populations that are data-rich and use it to extend our knowledge of HIV prevalence within high-risk subpopulations.
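As a rough illustration of what pooling means here, the sketch below uses the textbook normal-normal hierarchical model with known variances. It is not the model used in the UNAIDS work, and the group names, sample sizes, and variance values are all made up for illustration. The weight it reports is the classical pooling factor: the share of a group's estimate that comes from its own data, with the rest borrowed from the overall mean.

```python
def pooled_estimate(group_mean, n, sigma2, mu, tau2):
    """Posterior mean of a group effect in a normal-normal model.

    group_mean: observed mean of the group's n data points
    sigma2:     within-group (observation) variance, assumed known
    mu, tau2:   overall mean and between-group variance, assumed known

    Returns (estimate, weight), where weight is the share of the estimate
    that comes from the group's own data and 1 - weight is borrowed
    from the overall mean mu.
    """
    data_precision = n / sigma2
    prior_precision = 1.0 / tau2
    weight = data_precision / (data_precision + prior_precision)
    estimate = weight * group_mean + (1.0 - weight) * mu
    return estimate, weight


# Hypothetical values, chosen only to contrast a data-rich and a data-scarce group.
mu, tau2, sigma2 = 0.10, 0.01, 0.25
groups = {"data_rich": (0.20, 400), "data_scarce": (0.20, 4)}

for name, (observed_mean, n) in groups.items():
    estimate, weight = pooled_estimate(observed_mean, n, sigma2, mu, tau2)
    print(f"{name}: estimate={estimate:.3f}, own-data weight={weight:.2f}")
```

Both groups have the same observed mean, but the data-scarce group receives a much smaller own-data weight, so its estimate is pulled further toward the overall mean.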

†This work is supported under NIH grants R56AI120812-01A1 and R01 AI136664-01.

 

Analysis of Twitter data: bullying and sentiment*

I worked with Dr. Diane Felmlee in Sociology to collect and identify tweets that contain bullying. We collected tweets that contain certain slurs, identified by her research team as often correlating with bullying tweets. I then built a classifier whose goal was to extract the tweets that contain bullying (rather than reports of being bullied, complaints about tasks or objects, or tongue-in-cheek non-insults between friends). Because the available training data were limited, we opted for more parsimonious models, which involved principal components analysis (PCA) of the bag of words and classification into binary predictions of “contains bullying” or “does not contain bullying” using support vector machines (SVM) and the passive-aggressive classifier (PAC).
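To give a sense of how a passive-aggressive classifier operates on a bag of words, here is a minimal sketch of the standard PA-I update rule in plain Python. It is not the project's code: the PCA step is omitted for brevity, the function names are hypothetical, and the four example texts are invented toy data rather than real tweets.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to token counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def train_passive_aggressive(examples, C=1.0, epochs=5):
    """Train a linear classifier with the PA-I update.

    examples: list of (text, label) pairs with label in {+1, -1}.
    C caps the step size so a single example cannot move the weights too far.
    """
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            x = bag_of_words(text)
            score = sum(weights.get(tok, 0.0) * cnt for tok, cnt in x.items())
            loss = max(0.0, 1.0 - label * score)  # hinge loss
            norm_sq = sum(cnt * cnt for cnt in x.values())
            if loss > 0.0 and norm_sq > 0:
                # Passive when the margin is met, aggressive otherwise.
                tau = min(C, loss / norm_sq)
                for tok, cnt in x.items():
                    weights[tok] = weights.get(tok, 0.0) + tau * label * cnt
    return weights


def predict(weights, text):
    """Return +1 ("contains bullying") or -1 ("does not contain bullying")."""
    x = bag_of_words(text)
    score = sum(weights.get(tok, 0.0) * cnt for tok, cnt in x.items())
    return 1 if score > 0 else -1


# Invented toy examples; +1 = contains bullying, -1 = does not.
toy_examples = [
    ("you are such a loser", 1),
    ("nobody likes you loser", 1),
    ("i got called a loser today", -1),   # a report of being bullied
    ("this homework is the worst", -1),   # a complaint about a task
]

weights = train_passive_aggressive(toy_examples)
for text, label in toy_examples:
    print(text, "->", predict(weights, text), "(label", label, ")")
```

The third example shows why the task is harder than keyword matching: the same slur-like token appears in both classes, so the classifier has to rely on the surrounding words.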

In another project, Dr. Felmlee and I collected tweets on the Women’s March on DC, which took place on January 21, 2017, the day after President Trump’s inauguration, and we are working on a project that analyzes sentiment towards women in tweets.

*This work was supported by the National Science Foundation under IGERT grant DGE-1144860, Big Data Social Science.