‘Uncategorized’ Category

  1. Examining Trends via Google Ngram: Taking a Closer Look

    September 29, 2013 by Stephanie E. Vasko

    Recently, I have seen increased use of Google Ngram in academic talks, as well as academic talks devoted to Google Ngram itself.  Never one to turn down an opportunity to investigate data visualization tools, I decided to play around with Ngram on my own to see how I would use it in my talks and what I felt the strengths and weaknesses of this tool were.

    First and foremost, I am excited by the prospect of scanning so many books for the keywords I am interested in.  Never has it been easier to do this much research so quickly.  However, note that I said “books” above: Ngram does not include research articles, blog posts, websites, white papers, etc.  Additionally, if you are interested in studying current trends in this manner, Ngram may not be for you.  With data available only up until 2008, Ngram gives you no sense of the last five years’ worth of activity.

    The Ngram corpus represents about 6% of all books ever published (http://aclweb.org/anthology-new/P/P12/P12-3029.pdf), so the data represented by an Ngram, if not presented with this caveat, can be misleading.

    Let’s look at two examples I constructed comparing searches of different terms.  I am mostly concerned with relatively recent ideas and terminology in my research, be it the research I did for my Ph.D. or the research I am pursuing here at Penn State.

    The first search is for “MOOC” (blue), a relatively new term, but perhaps not a new concept.  I’ve also paired this with “distance learning” (green), “distance education” (purple), “online learning” (red), and “online education” (yellow) to see how these trends have evolved from 1960 until 2008.  MOOCs have grown exponentially in the last year or two, but the Ngram, because its corpus ends in 2008, does not reflect this growth.

    [Figure: Ngram chart comparing “MOOC,” “distance learning,” “distance education,” “online learning,” and “online education,” 1960–2008]

    I also created a search based on “nanotechnology” (blue) and “biotechnology” (red), which I thought might show a greater percentage change in the 1990s and early 2000s.  Given the very small percentages listed for each, I do not believe this reflects the reality of the current literature and the trends for these terms, perhaps because a vast amount of literature on these topics appears in journal articles rather than books.

    [Figure: Ngram chart comparing “nanotechnology” and “biotechnology”]

    For now, Ngram is a tool that will not feature in my talks.  For those of you who love R, there is a new R package, ngramr, for working with Google Ngram data (http://www.r-bloggers.com/ngramr-an-r-package-for-google-ngrams/).

     


  2. Easy Entry Points for Learning a Computer Programming Language

    August 19, 2013 by Stephanie E. Vasko

    Last week I had the pleasure of attending Penn State’s Liberal Arts Scholarship and Technology Summit.  From my perspective, one of the pervasive themes of this event was coding: learning to code (and the barriers to doing so) and figuring out how and where to start.

    To me, this is a very exciting time to learn something about computer programming!  In the last year, several new resources have come online that make it easier and more fun to learn to code.  For the purposes of this post, I’m going to focus on two languages (which were the subjects of excellent workshops at LASTS13): Python and R.  As mentioned in the previous post, I’m a convert to R.  I love its flexibility and power for applications like data mining and data visualization.  Python is an extremely powerful and widespread language and supports data mining as well.  And while you may not need or want to dig too deeply into either language, an understanding of the basic concepts of computer programming (assigning variables, loops, arrays) could be beneficial in future collaborations or discussions.
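    To make those basic concepts concrete, here is a minimal Python sketch of variables, a list (Python’s version of an array), and a loop; the names and numbers are my own invented illustration, not from any particular course:

```python
# Assigning variables: a name bound to a value
language = "Python"
course_count = 3

# An array (called a "list" in Python) holding several values
scores = [88, 92, 79]

# A loop: repeat an action once for each element of the list
total = 0
for score in scores:
    total = total + score

# Compute the average from the running total
average = total / len(scores)
print(language, course_count, average)
```

    Each of the courses listed below starts from ideas at roughly this level before building up to larger projects.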

    My recommended first step for anyone looking to learn about computer programming is to think about a project or problem that interests you.  For me, it’s the needs of a project, and my passion for completing it, that motivate me to learn a language or learn new features of a language I already work with.  For example, my desire to make aesthetically appealing visualizations for learning analytics has driven me to learn and use the ggplot2 package in R (which I will be posting about in the near future).

    Below I have provided some online resources and starting points.  I’ve neglected books from this list because after reading books on both languages, I found the online approaches to R and Python to both sustain my interest in programming and aid me in developing skill mastery.  If you would like some book suggestions on either, please send me an email.

    Course and Course-like resources:
    Udacity offers an asynchronous course that teaches Python skills through building a web crawler.  (Note: while this is listed as beginner, I played with this course during its first offering and found some of the exercises to be beginner-plus or intermediate, and some I found hard.): https://www.udacity.com/course/cs101

    Codecademy is a great way to try several different languages without needing to install anything.  It offers a Python track and is also an extremely fun and easy-to-navigate resource for learning HTML, CSS, JavaScript, jQuery, and (more recently) PHP and Ruby:
    http://www.codecademy.com

    Coursera has several Python-related courses coming up or starting right now (August 18th).  Note: you must have a Coursera account to access these courses:
    From Rice University: “An Introduction to Interactive Programming in Python”
    https://www.coursera.org/course/interactivepython

    From the University of Toronto: “Learn to Program: The Fundamentals”
    https://www.coursera.org/course/programming1
    (Full disclosure: I have completed the University of Toronto course previously and lurked in the previous offering of the Rice course)

    Coursera is also offering an R course that starts mid-September:
    From Johns Hopkins University: https://www.coursera.org/course/compdata
    (Full disclosure: I have lurked in the previous offering of this course)

    Google is offering a playlist on YouTube featuring Introduction to R video tutorials:
    http://www.youtube.com/playlist?list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP

    CodeSchool (with this specific offering sponsored by O’Reilly) offers an introductory course on R for free (unlike most CodeSchool offerings) that allows you to try R without installing the language or an IDE:
    http://tryr.codeschool.com/

    Twotorials offers two-minute videos on various aspects of R and can provide a good starting place for R and brief answers for how to approach many topics:
    http://www.twotorials.com/

    Alternate resources for Getting Started on your machine:
    Don’t want to install Python (or, if you’re on a Mac, can’t find Python)?  Try CodeSkulptor, which allows you to try out coding in Python in an online environment:
    (http://www.codeskulptor.org/)

    The CodeSchool course above lets you try the same with R.
    [If I am missing any and you’d like to see them added, please drop me a line!]

    Finally, I’d like to echo/paraphrase what Jeanne Spicer said at #LASTS13 last week: don’t be afraid to Google for help.  Learning to code requires building basic skills, learning by example, trying new things, and seeing how functions or packages work, among other things.  Looking to places like Google, r-bloggers, and Stack Overflow (for both the R and Python communities) will help you build your skills and maybe discover new and interesting aspects of coding that you hadn’t previously thought about.  Common to both of these languages are very active, very engaged user communities, which can be helpful at all stages of your programming career.


  3. Data Mining in R: Analyzing the word use in tweets with the hashtag #tltsym13

    March 18, 2013 by Stephanie E. Vasko

    This weekend I had the opportunity to attend Penn State’s “Teaching and Learning with Technology” Symposium.  In addition to hearing some great talks about innovation, learning analytics, and the PSU strategy on MOOCs, I was energized to pick back up with data/text mining in R.  Learning analytics (LA) and the future of its use have fascinated me for quite some time, and I have been eager to combine my developing R skills with data mining techniques.

    I’ve been lurking in George Siemens’ MOOC on Learning Analytics over at Canvas (https://www.canvas.net/courses/learning-analytics-and-knowledge), which features some tutorials on getting started with technologies for LA.  R is a powerful open-source language best known for applications in statistics and the sciences.  It has a very active developer community, and there are many specially developed packages for doing analytics, including sna (for social network analysis), twitteR (for interfacing with Twitter), and tm (a text mining package).

    In order to do data mining with Twitter, you must both sign up for a Twitter account and create a Twitter developer application.  As the twitteR vignette puts it (all R packages come with documentation; vignettes expand on it), “This is because OAuth authentication is required for all Twitter transactions.”  Of all of the coding for this project, setting up the authentication was the hardest part.  I was previously unfamiliar with Twitter, and many of the examples I had seen for using twitteR did not expressly show the code for this portion.  I was thankful for the explicit way the vignette spelled out the syntax, but still found it tricky to make everything work.

    In order to complete this project, I followed two very informative tutorials: this tutorial from Crunch (linked from the MOOC discussed above: http://crunch.kmi.open.ac.uk/people/~fwild/services/twitter-demo.Rmw) in conjunction with Gaston Sanchez’s wordcloud example (https://sites.google.com/site/miningtwitter/questions/talking-about/wordclouds/wordcloud1).

    When I work in R, I tend to do much of my development in RStudio (which can be downloaded for free at www.rstudio.com).  It’s an enjoyable IDE to work with; I have it set to show the default view of a file editor, console, history, and plotting window.  The problem I ran into with RStudio during this project was copying my Twitter authentication link and entering my PIN into the console (info on this is also found at http://crunch.kmi.open.ac.uk/people/~fwild/services/twitter-demo.Rmw).  Thus, I switched back to the standard R console for sourcing and running the project, where I did not encounter the same issues.  I did, however, choose to keep RStudio open while working, as I find installing new packages to be easier with this application.

    Below you will find a wordcloud I was able to create using the twitteR, wordcloud, tm, and RColorBrewer packages (RColorBrewer has its roots at Penn State; check it out: http://colorbrewer2.org/).  I chose not to strip out “#tltsym13” for aesthetic reasons; I wanted the cloud to center around the hashtag of interest.  I also left in punctuation so that I could see who conference attendees were most interested in tweeting at and what other hashtags were of interest during the conference.  I chose to analyze the last 1000 tweets starting from the time I ran the R program, but this number can be easily changed, and time parameters could be defined around which tweets are analyzed.  For additional ways to analyze this specific hashtag, one could envision looking at wordclouds of time-specific tweets, i.e., analyzing the wordclouds from the days before, during, and after a conference to understand what attendees are excited to see, what they are enjoying during the conference, and what subjects/talks/interactions have resonated with them afterward.
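    My analysis was done in R, but the idea underneath a wordcloud (count how often each token appears, with the most frequent tokens drawn largest, and “#”/“@” prefixes kept so hashtags and mentions stay visible) is language-agnostic.  A minimal Python sketch of that counting step, using a few invented sample tweets in place of the real #tltsym13 data:

```python
from collections import Counter

# Invented sample tweets standing in for the ~1000 #tltsym13 tweets
tweets = [
    "Great talk on learning analytics #tltsym13",
    "#tltsym13 thanks @psu for hosting",
    "Excited about MOOCs #tltsym13",
]

# Split each tweet into lowercase tokens; '#' and '@' are kept so
# hashtags and mentions remain distinct tokens, as in my wordcloud
tokens = [word.lower() for tweet in tweets for word in tweet.split()]
counts = Counter(tokens)

# The highest-count tokens are the ones a wordcloud draws largest
print(counts.most_common(3))
```

    A wordcloud package then simply maps each token’s count to a font size.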

    It is clear that my text mining of the contents of the tweets could be cleaned up by adding additional arguments to strip punctuation and other artifacts.  This would clear up problems like “MOOCs” and “MOOCs,” being counted as two different entries.  However, as a first foray into data mining with R, I am pleased with the results and look forward to working more with R for data mining, learning analytics, and data visualization in the future.
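    That cleanup step is also language-agnostic.  Here is a minimal Python sketch of the problem and the fix (the example tokens are my own, not taken from the actual tweet data):

```python
import string

# Tokens as they might come out of raw tweets, punctuation attached
tokens = ["MOOCs", "MOOCs,", "analytics", "analytics."]

# Without cleanup, "MOOCs" and "MOOCs," count as different entries
raw_unique = set(tokens)

# Stripping leading/trailing punctuation merges the duplicates
cleaned = [t.strip(string.punctuation) for t in tokens]
cleaned_unique = set(cleaned)

print(len(raw_unique), len(cleaned_unique))  # 4 distinct tokens become 2
```

    In R, the tm package’s transformations serve the same purpose.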

     

    With 1000 tweets:

    [Figure: wordcloud generated from the last 1000 #tltsym13 tweets]

    Finally, I would like to cite all the additional resources that I consulted and used in the construction of this example:

    http://cran.r-project.org/web/packages/twitteR/vignettes/twitteR.pdf
    http://cran.r-project.org/web/packages/tm/tm.pdf
    The wordcloud package reference manual (PDF on CRAN)

