Hello world! I am Amy Zhang, currently a third-year PhD student in Statistics at Penn State. Methodologically, I am interested in Bayesian methods and text analysis. My research projects are listed below.
The applications I work on tend to be at the intersection of health and social issues, because I think the most interesting problems are those that are human. I strongly believe that technical problems require human solutions, and I could go on with examples of these forever :). In healthcare, a surprisingly large challenge for doctors and pharmacies is actually just getting their patients to consistently take medicine, particularly for the elderly. Another example is in agriculture, where it was found that social conditions such as number of children, general health, and income level are far more predictive of crop growth in third-world countries than what fertilizer is actually applied to the soil (this work was done by the Data Science for Social Good organization at the University of Washington).
Estimation when data is sparse
Estimating HIV prevalence for at-risk populations
Technical problems require human solutions, and this project is one of those cases. The HIV epidemic as a whole is in decline, but HIV still disproportionately impacts certain at-risk populations. These are typically marginalized groups whose members are unlikely to identify themselves (intravenous drug users, sex workers, clients of sex workers, gay men, and so on), and so while they remain among the people most affected by the epidemic, data on these groups are scarce. Effective policies to help these people require, above all, accurate data on the size of these groups and their HIV prevalence, both of which are lacking.
My advisor, Dr. Le Bao, and I work with UNAIDS to improve HIV estimation for these at-risk populations through hierarchical Bayesian modeling. The basic idea is to pool information from the at-risk populations and the general population to estimate a mean HIV prevalence curve over time, and to use that curve to inform our estimates for the individual subpopulations. The Bayesian framework provides an elegant, efficient way to share information across subpopulations and thereby produce better prevalence estimates.
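To give a flavor of how pooling works, here is a minimal partial-pooling sketch in a Normal-Normal model with known variances. This is not our actual UNAIDS model: the groups, numbers, and variance values below are all invented for illustration.

```python
import numpy as np

# Toy example: logit-scale prevalence estimates for four hypothetical
# subpopulations, with very different amounts of data (all numbers invented).
y = np.array([-2.2, -1.5, -1.9, -2.8])  # observed group means (logit scale)
n = np.array([40, 15, 60, 8])           # number of surveys per group
sigma2 = 1.0                            # within-group variance (assumed known)
tau2 = 0.3                              # between-group variance (assumed known)

# Normal-Normal partial pooling: each group's posterior mean is a
# precision-weighted compromise between its own data and the pooled mean.
se2 = sigma2 / n                        # sampling variance of each group mean
mu = np.sum(y / (se2 + tau2)) / np.sum(1.0 / (se2 + tau2))  # pooled mean
shrinkage = se2 / (se2 + tau2)          # how far each group is pulled toward mu
posterior_means = y + shrinkage * (mu - y)

# Data-poor groups (small n, large se2) are shrunk hardest toward mu,
# which is exactly how pooling helps the sparsest subpopulations.
```

In the real model the variances are unknown and the prevalence curve varies over time, but the same borrowing-of-strength mechanism is at work.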
Addressing missing data: pattern mixture models
It is the sad reality of things that we live in an imperfect world and cannot always mount data collection efforts designed to yield unbiased statistical estimates…but the upside is that we are then faced with more interesting problems. The HIV prevalence data we have from the countries working with UNAIDS suggest that when countries chose sites for HIV clinics (the sites our data would later come from), they sensibly selected locations where they believed HIV prevalence was higher, so they could better serve those areas. Later, as efforts to address HIV expanded, new sites were added in areas with lower HIV prevalence.
This naturally skews early-year HIV prevalence estimates upward relative to the likely reality. I am working with Dr. Le Bao and Dr. Michael Daniels to characterize this skew and to jointly model the missingness mechanism with our HIV model, producing better estimates. We do this using pattern mixture models in a Bayesian setting.
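As a toy illustration of the pattern-mixture idea, consider two missingness patterns: early sites (fully observed) and late-added sites (early years missing). The numbers below are invented, and the identifying assumption is deliberately crude; this is a sketch of the logic, not our actual model.

```python
# Pattern A: early clinic sites, chosen where prevalence was believed high.
# Pattern B: sites added later in lower-prevalence areas; early years missing.
prev_A = 0.12    # mean prevalence at pattern-A sites (invented)
prev_B = 0.04    # mean prevalence at pattern-B sites, observed years (invented)
share_A = 0.3    # fraction of the population represented by pattern A
share_B = 0.7

# A naive estimate for the early years uses only the sites observed then:
naive_early = prev_A

# A pattern mixture model instead specifies the outcome distribution
# separately within each missingness pattern, then averages over pattern
# probabilities. Pattern B's unobserved early years are filled in via an
# identifying assumption (here, crudely, its observed-years mean).
mixture_early = share_A * prev_A + share_B * prev_B

# The naive estimate overstates early prevalence relative to the mixture,
# matching the site-selection story above.
```

In the Bayesian version, the pattern-specific parameters and the identifying assumptions get priors, so the extra uncertainty from the unidentified pieces propagates into the final prevalence estimates.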
Bullying on Twitter*
I worked with Dr. Diane Felmlee to collect and identify tweets that contain bullying. We collected tweets containing certain slurs that her research team had identified as frequently co-occurring with bullying. I then built a classifier whose goal was to extract the tweets that contain bullying (rather than reports of being bullied, complaints about tasks or objects, or tongue-in-cheek non-insults between friends). In broad strokes, we used bag-of-words features, principal components analysis (PCA), and finally support vector machines (SVM) and the passive-aggressive classifier algorithm (PAC) to produce binary predictions of “contains bullying” or “does not contain bullying”.
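To show what the PAC step looks like on its own, here is a minimal passive-aggressive classifier on a toy bag-of-words problem. The vocabulary, examples, and labels are all invented, and our real pipeline also includes the PCA and SVM stages; this is just a sketch of the update rule.

```python
import numpy as np

# Toy vocabulary and labeled examples (all invented for illustration).
vocab = ["you", "loser", "homework", "ugh", "idiot", "so", "much"]

def bow(tokens):
    """Bag-of-words count vector over the toy vocabulary."""
    return np.array([tokens.count(w) for w in vocab], dtype=float)

# +1 = contains bullying, -1 = does not.
train = [
    ("you loser".split(), 1),
    ("you idiot".split(), 1),
    ("ugh so much homework".split(), -1),
    ("so much homework ugh".split(), -1),
]

w = np.zeros(len(vocab))
for _ in range(5):                            # a few passes over the data
    for tokens, y in train:
        x = bow(tokens)
        loss = max(0.0, 1.0 - y * w.dot(x))   # hinge loss at this example
        if loss > 0:                          # passive if correct with margin,
            w += (loss / x.dot(x)) * y * x    # aggressive: smallest update that
                                              # makes this example correct

def predict(tokens):
    return 1 if w.dot(bow(tokens)) >= 0 else -1
```

The “passive-aggressive” name comes from that update: the weights are left alone when an example is classified with sufficient margin, and otherwise moved just far enough to fix it.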
For me, the most interesting (and occasionally frustrating!) part of this project was working with algorithms that come with no theoretical guarantees. In statistics, we rely on probability theory to know that, at least asymptotically, our estimates are good, which lets us produce confidence intervals and evaluate how well our models perform. Much of machine learning offers no such guarantees, and so a large ecosystem of widely varying algorithms has cropped up within text analysis. My goal for this project was to learn more about the properties of these algorithms (and achieve something helpful to society in the process).
Separately, Dr. Felmlee and I collected tweets on the Women’s March on DC, which took place on January 21, 2017, the day after President Trump’s inauguration, and we are working on a project that analyzes sentiment towards women in these tweets.
*This work was supported by the National Science Foundation under IGERT grant DGE-1144860, Big Data Social Science.
Extracting information from research articles
The newest project I am working on with Dr. Bao is a statistical model that will extract relevant information from research articles. The main application is gathering information for a data repository on at-risk populations for HIV: the sizes of these populations, programmatic and intervention data, individual and structural determinants of HIV risk, and their HIV prevalence and incidence. Text analysis is not a common topic in statistics, as extracting the data alone tends to require significant effort and domain knowledge. We hope that statistics, with its emphasis on distributions rather than point estimates, can be of aid in this area.