How to build a “reasonable” data set to sample from

With the fall semester quickly approaching (next Monday), I’m trying to get this project ready for launch. I’ve got a few interviews lined up for the student research position (still accepting applications!) and need to revisit my interview questions. However, the big question I’m tackling this week is how to create a data set of students to sample from when selecting whom to interview.

When I originally proposed this project, I’m not sure how I thought I would find my students (granted, I thought I would sample 100 students, haha).

[Gif of Shirley Temple giggling, via Giphy]

As soon as I got to IRDL, my mentors quickly asked me to think through my sampling choices more fully. From the initial 100 students I suggested in my proposal, I ended up proposing a more modest number: 24 students. Through conversations, I decided to limit my study to University Park (with the aspiration of extending this research to our campuses), talk to students with sophomore, junior, or senior standing, and interview 2 students from each of the 12 colleges located here.

With a more reasonable number of students I could feasibly interview in the next year, the question became how to recruit them. At first, I thought about using my network to find “champions” in each college, who would help recruit these two students. This would be a type of convenience sample, because I would be selecting from those “closest” to me. While this made a lot of sense, the problem I would run into is that these champions would gravitate towards a select group of students. These are the students we know are engaged, the ones we call on for help and to share their experiences (we all know those students). While I’m definitely interested in their maps, I also want to make sure I get varied experiences, including some from students who might not be the most visible in their colleges.

Near the end of IRDL, during a conversation with Marie (one of my mentors), she suggested using a stratified random sample. From the large population (students at University Park), I would create subpopulations (strata). In this sampling strategy, I would gather lists of student names from each college, across the three class standings. From each college’s list, I would randomly select two students, email them, and ask whether they had completed one of the ten student engagement types and whether they were interested in doing an interview with me. If a student said no, or did not get back to me, I would take another random name from the list. By using a stratified random sample, I increase the rigor of the research, and while I am not trying to generalize student engagement experiences from these 24 students, I would potentially increase my chances of having a varied set of student journey maps.
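For anyone who likes to see the idea written out, here is a rough Python sketch of a stratified random draw. It is only an illustration under my own assumptions, not the process actually used for this project: the roster, the college labels, and the function name `stratified_draw` are all invented. Each college is treated as one stratum, its list is shuffled, the first two names are the people to invite, and the remaining names act as the ordered backup list for when someone declines or does not reply.

```python
import random
from collections import defaultdict


def stratified_draw(students, per_stratum=2, seed=None):
    """Shuffle each college's list; the first names are invited, the rest are backups."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable

    # Group the eligible students by college (one stratum per college).
    strata = defaultdict(list)
    for student in students:
        strata[student["college"]].append(student["name"])

    draw = {}
    for college, names in strata.items():
        order = names[:]    # copy so the input list is left untouched
        rng.shuffle(order)  # random order within the stratum
        draw[college] = {
            "invite": order[:per_stratum],   # first people to email
            "backups": order[per_stratum:],  # next names if someone declines
        }
    return draw


# Hypothetical roster: invented colleges and names, for illustration only.
roster = [
    {"name": f"Student {letter}{i}", "college": f"College {letter}"}
    for letter in "ABC"
    for i in range(1, 6)
]

for college, result in stratified_draw(roster, seed=2018).items():
    print(college, "->", result["invite"])
```

Shuffling once and reading down the list is equivalent to drawing a new random name each time someone says no, and it keeps a record of the order in which people were contacted.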

I was very jazzed about the stratified random sample when I left IRDL. I came back to Penn State and reached out to our Office of Planning and Assessment to see if they would be able to help me create the data set. They directed me to the Registrar’s Office, and then I was met with radio silence.

While I waited for a response, I had a conversation with our Libraries’ Head of Assessment. He was interested in exploring how his department could support me in getting this data sample. This led to me meeting with one of our Analysis and Planning Consultants, Leigh, this week to discuss my sample further.

My meeting with Leigh was a big brainstorming conversation. Leigh informed me that while she could pull the list I wanted, I would essentially be looking at a data set of approximately 31,307 student names (based on undergraduate enrollment figures for 2018 from the Penn State Factbook).

[Gif of Rebecca Bloom saying ‘Gasp’, via Giphy]

Yeah, that’s a lot of names. And probably an unmanageable list.

With that in mind, Leigh and I tried to figure out a way to build our own set of student names to randomly select from. The data warehouse we have access to contains student information, such as courses taken, credit standing (which roughly corresponds to class standing), campus location, degree type, major, etc. As you can see, it’s mostly academically focused information; our system currently doesn’t “flag” extra-curricular work such as clubs. So while we could start to find students who have taken certain classes that correspond to student engagement opportunities (like study abroad courses, independent studies for undergraduate research, or an internship course), we could be missing students who have participated in more extra-curricular opportunities. It can be a lot to think about to find just 24 students!

So what’s next?

It was great to talk about my sample with Leigh. The new plan is to try to create our own data set that we will stratify and then randomly sample from. I am going to try to identify courses that correspond with student engagement opportunity types and then use my network to collect other student names to add to the set. We are going to focus on the 2018-19 academic year, just to keep it manageable. I’ve got a lot of thoughts swirling in my head, but I’m happy to be challenged and to see how this plays out. I’m definitely learning a lot about Penn State, our systems for keeping track of student information, and how all of this will add to my research project.
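To make the plan a little more concrete, here is a small Python sketch of how course records and names gathered through a network could be merged into one de-duplicated list. Everything in it is an assumption for illustration: the field names, the course codes, the engagement labels, and the function `build_sampling_frame` are invented, and the real data warehouse almost certainly looks different.

```python
def build_sampling_frame(course_records, engagement_courses, network_referrals):
    """Merge course-based and network-based names into one entry per student."""
    frame = {}

    # Keep only enrollments in courses that map to an engagement type.
    for record in course_records:
        kind = engagement_courses.get(record["course"])
        if kind is None:
            continue  # an ordinary course, not an engagement opportunity
        entry = frame.setdefault(
            record["student_id"],
            {"college": record["college"], "engagement": set()},
        )
        entry["engagement"].add(kind)

    # Add students suggested by colleagues, e.g. for clubs the system cannot flag.
    for referral in network_referrals:
        entry = frame.setdefault(
            referral["student_id"],
            {"college": referral["college"], "engagement": set()},
        )
        entry["engagement"].add(referral["engagement"])

    return frame


# Hypothetical inputs: all IDs, course codes, and labels are made up.
engagement_courses = {"COURSE-ABROAD": "study abroad", "COURSE-INTERN": "internship"}
course_records = [
    {"student_id": "s1", "college": "College A", "course": "COURSE-ABROAD"},
    {"student_id": "s2", "college": "College B", "course": "COURSE-OTHER"},
]
network_referrals = [
    {"student_id": "s1", "college": "College A", "engagement": "student club"},
    {"student_id": "s3", "college": "College B", "engagement": "student club"},
]

frame = build_sampling_frame(course_records, engagement_courses, network_referrals)
for student_id, info in sorted(frame.items()):
    print(student_id, info["college"], sorted(info["engagement"]))
```

Keying the result by student ID means a student who shows up in both sources appears only once, which matters because duplicates would give that student a higher chance of being drawn.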

Get in touch!

Have thoughts on this? Courses I shouldn’t forget about? Feel free to send me an email at hmf14@psu.edu! I really want this project to be collaborative and a way to build our networks and conversations around student engagement.
