Hello world! I am Amy Zhang, currently a third-year PhD student in Statistics at Penn State. Methodologically, I am interested in Bayesian methods and text analysis. My research projects are listed below.
The applications I work on tend to be at the intersection of health and social issues, because I think the most interesting problems are the human ones. I strongly believe that technical problems require human solutions, and I could go on with examples forever :). In healthcare, a surprisingly large challenge for doctors and pharmacies is simply getting patients to take their medicine consistently, particularly among the elderly. Another example is in agriculture, where social conditions such as number of children, general health, and income level were found to be far more predictive of crop growth in developing countries than which fertilizer is actually applied to the soil (this work was done by the Data Science for Social Good organization at the University of Washington).
Estimation when data are sparse
Estimating HIV prevalence for high risk-groups†
Technical problems require human solutions, and this project is one of those cases. The HIV epidemic as a whole is on the decline, but HIV still disproportionately impacts certain high-risk groups. These groups are often overlooked or ignored in society (intravenous drug users are one example), and so while they remain some of the people most affected by the epidemic, the data for these groups are scarce. Effective policies to help these people require, above all, accurate data on the size of these groups and on their HIV prevalence, both of which are lacking.
My advisor, Dr. Le Bao, and I work with UNAIDS to improve HIV estimation for these at-risk populations through Bayesian hierarchical modeling. Hierarchical modeling offers efficient pooling of information, essentially allowing us to borrow strength from data-rich populations and use it to sharpen our estimates of HIV prevalence within data-sparse high-risk subpopulations.
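To give a flavor of what "borrowing information" means, here is a toy sketch of partial pooling. This is not our actual UNAIDS model; the counts and the `prior_strength` value are invented for illustration, and the shrinkage estimate stands in for the posterior mean of a beta-binomial hierarchical model.

```python
# Toy illustration of partial pooling (not the UNAIDS model itself):
# each group's raw prevalence is shrunk toward the pooled rate, with
# small-sample groups shrunk the most.

def partial_pool(positives, totals, prior_strength=50):
    """Shrink each group's raw prevalence toward the pooled prevalence.

    `prior_strength` plays the role of a prior sample size in a
    beta-binomial model: larger values mean more borrowing of strength.
    """
    pooled = sum(positives) / sum(totals)
    estimates = []
    for y, n in zip(positives, totals):
        # Weighted average of the group's raw rate and the pooled rate,
        # weighted by the group's sample size vs. the prior strength.
        est = (y + prior_strength * pooled) / (n + prior_strength)
        estimates.append(est)
    return estimates

# A data-rich group (n = 2000) barely moves from its raw rate of 0.05;
# a data-sparse group (n = 20) is pulled strongly from its raw rate of
# 0.20 toward the pooled prevalence.
ests = partial_pool(positives=[100, 4], totals=[2000, 20])
```

The point of the sketch is the asymmetry: the estimate for the large group is essentially its own data, while the estimate for the small group lands between its noisy raw rate and the pooled rate.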
†This work is supported under NIH grants R56AI120812-01A1 and R01 AI136664-01.
Addressing missing data: pattern mixture models‡
It is the sad reality of things that we live in an imperfect world and cannot always mount data collection efforts designed to yield unbiased statistical estimates. The upside is that we are then faced with more interesting problems. The HIV prevalence data we have from countries working with UNAIDS suggest that when countries chose sites for HIV clinics (the source of our data), they sensibly selected sites where they believed HIV prevalence was higher, so they could better serve those areas. Later, as efforts to address HIV expanded, new sites in areas with lower HIV prevalence were added.
This naturally skews early-year HIV prevalence estimates to be higher than the likely reality. I am working with Dr. Le Bao and Dr. Michael Daniels to evaluate how the data are skewed and to jointly model the missingness mechanism with our HIV model to produce better estimates. We do this using pattern mixture models in a Bayesian setting.
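The bias itself is easy to see in a toy simulation. This is not our pattern mixture model, just an invented illustration of the site-selection story: if the first wave of clinics sits in the highest-prevalence areas, a naive average of early-site data overstates the overall prevalence.

```python
import random

random.seed(7)

# Invented site-level prevalences; in reality these are unknown.
sites = [random.uniform(0.01, 0.30) for _ in range(200)]

# First-wave clinics were placed where prevalence was believed highest.
sites.sort(reverse=True)
early = sites[:40]      # early years: only high-prevalence sites report
all_sites = sites       # later years: coverage eventually expands

naive = sum(early) / len(early)          # estimate from early sites only
truth = sum(all_sites) / len(all_sites)  # average over all sites

# naive > truth: ignoring how sites were chosen biases estimates upward,
# which is why the missingness mechanism has to be modeled jointly.
```

A pattern mixture model addresses this by modeling the outcome separately within each missingness pattern (here, "reported early" vs. "added later") rather than pretending the early sites are a random sample.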
‡This work is supported under NIH grant R01 AI136664-01.
Bullying on Twitter*
I worked with Dr. Diane Felmlee to collect and identify tweets that contain bullying. We collected tweets containing certain slurs, identified by her research team as often correlating with bullying tweets. I then built a classifier whose goal was to extract the tweets that contain bullying (rather than reports of being bullied, complaints about tasks or objects, or tongue-in-cheek non-insults between friends). In broad strokes, we used bag-of-words features, principal components analysis (PCA), and finally support vector machines (SVM) and the passive-aggressive classifier (PAC) to produce binary predictions of “contains bullying” or “does not contain bullying”.
For me, the most interesting (and occasionally frustrating!) part of this project was working with algorithms that don’t come with theoretical guarantees. In statistics, we rely on probability theory to know that, at least asymptotically, our estimates are good, which lets us produce confidence intervals and evaluate how well our models perform. Machine learning offers no such guarantees, and so a large ecosystem of widely varying algorithms has cropped up within text analysis. It is necessary to understand these algorithms well in order to predict which are suited to your data. PCA, for example, reduces the dimension of the data and thus pairs naturally with SVM, which estimates a hyperplane separating the two classes.
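To make the passive-aggressive classifier concrete, here is a from-scratch sketch of its update rule on bag-of-words features. The tweets, vocabulary, and labels below are invented stand-ins (the real project used a much larger vocabulary and PCA before classification); the update itself is the standard PA rule: stay passive when a training example is classified correctly with sufficient margin, otherwise shift the weights just enough to fix the violation.

```python
# From-scratch sketch of the passive-aggressive classifier (PAC) on
# bag-of-words features. Vocabulary and tweets are hypothetical.

def bag_of_words(text, vocab):
    """Count occurrences of each vocabulary word in the text."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def train_pa(X, y, epochs=5):
    """Online PA updates: if the hinge loss max(0, 1 - y * (w . x)) is
    zero, do nothing (passive); otherwise update aggressively with
    step size tau = loss / ||x||^2, the smallest change that fixes
    the margin violation."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x))
            loss = max(0.0, 1.0 - label * score)
            norm_sq = sum(xi * xi for xi in x)
            if loss > 0 and norm_sq > 0:
                tau = loss / norm_sq
                w = [wi + tau * label * xi for wi, xi in zip(w, x)]
    return w

vocab = ["loser", "ugly", "stupid", "homework", "boring", "traffic"]
tweets = ["you are such a loser", "stupid ugly loser",
          "this homework is boring", "stuck in boring traffic"]
labels = [1, 1, -1, -1]  # +1 = contains bullying, -1 = does not

X = [bag_of_words(t, vocab) for t in tweets]
w = train_pa(X, labels)

# Score an unseen tweet: a positive score means "contains bullying".
score = sum(wi * xi
            for wi, xi in zip(w, bag_of_words("what a stupid loser", vocab)))
```

The "passive" half of the name is visible in the loop: once an example sits on the correct side of the margin, it never changes the weights again, which is what makes the algorithm fast on streaming data like tweets.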
Separately, Dr. Felmlee and I collected tweets on the Women’s March on DC, which took place after President Trump’s inauguration on January 21, 2017, and are working on a project that analyzes sentiment towards women in tweets.
*This work was supported by the National Science Foundation under IGERT grant DGE-1144860, Big Data Social Science.
Extracting data from research articles
The newest project I am working on with Dr. Bao is a statistical model that extracts relevant information from research articles. The main application is to populate a data repository on at-risk populations for HIV: the size of these populations, programmatic and intervention data, individual and structural determinants of HIV risk, and their HIV prevalence and incidence. Text analysis is not a common topic in statistics, as extracting the data alone tends to require significant effort and domain knowledge. We hope that statistics, with its emphasis on distributions rather than point estimates, can be of aid in this area.