Beginning Data Visualization in Unity: Scatterplot Creation

Introduction

This guide is intended to illustrate from the ground up how to create a simple 3D scatterplot based on tabular data using the Unity game engine, for use in virtual reality (VR) experiences. This includes creating a Unity project, creating a prefab, loading a CSV (comma-separated values) file, and assigning positions to objects according to values in the CSV. This guide assumes very little prior knowledge of Unity, but some basic programming skills (though theoretically you can copy and paste and it will work). Essentially, it is targeted at someone who is basically familiar with working with languages like R and Python, but has never used Unity or similar 3D software. For the bare minimum understanding of Unity needed to follow this guide, watching this two-minute video will give you a usable overview of the interface nomenclature.

Unity (or Unity 3D) is a game engine built for making games. It can create games and other experiences for web, PC, Android, iOS, and more. Importantly for VR research, it makes working with VR technology like the Oculus Rift or HTC Vive fairly straightforward. Unity projects themselves are essentially composed of content (assets), scripts (the programming), and the engine itself, which smashes everything together and makes it work on your particular platform. The completed project this guide sets up is uploaded to GitHub here for reference.

If you have not used Unity before, I highly recommend first checking out the whole series official tutorial videos, starting with Interface Essentials. One upside of Unity is that there is a large knowledge base aimed at beginners, and very accessible documentation, among them the Unity Manual and the Unity Scripting Reference. Note that this guide uses C# scripting, and Microsoft’s C# Programming Guide might also help clear some nomenclature and syntax differences if you are coming from another language like R.

Setting Up Unity

First, you will need to make a Unity account and download the latest version of Unity. Unity is not open source, but it is free for educational purposes. Warning: when you install Unity, it may default to installing Visual Studio community IDE, which is a huge download. While it is a nice piece of software, Unity also includes the Monodevelop IDE, which works just fine (and also is well supported in the Unity community).

Once you have Unity installed, open it up and you will be greeted by a window (create a new account/login if necessary). You want to create a new project, which takes you to the following window. Pick a name for your project, make sure it is “3D,” and select a location to save it. This will create a whole file hierarchy for the project.

One useful thing to do is add the Standard Assets package under Add Asset Package, as shown in the screenshot below. It contains a variety of useful assets for testing/prototyping such as sample 3D models and basic interaction features. We will be using some of those assets at the very end of this guide.

Once you hit Create Project Unity will dutifully create the project. It may take some time to complete, as Standard Assets needs to be downloaded and unpacked. You will then be greeted by the Unity interface, which looks like this:

I again recommend taking a look at this video for a very short overview of the interface if you haven’t used Unity before or need a refresher, but in short:

  • The Hierarchy window shows the GameObjects that exist in the scene. By default there is a “camera” and a “directional light”.
  • The center panel defaults to the scene view. This is how you navigate in 3D space (full list of hotkeys here, Mousewheel/Q is Pan, W goes back to default, Alt pivots, right click rotates view)
  • Switching to the game view gives you the view from the camera GameObject.
  • The Inspector shows the details of any selected GameObjects or scripts, which is how many, many options are changed in Unity.
  • The Project folder shows you the assets/scripts of the project and is basically Unity’s file system view, but with recognition of different Unity asset types so that you can simply drag and drop into the hierarchy or scene view.
  • The Console works much like in any other development environment, notifying you of errors and allowing things to be printed to it through Debug.Log() (example: Debug.Log("Hello world"); )

Each window can be dragged to reposition and resized, and you can select from several precooked layouts in Window -> Layouts at the top of the main window. The one pictured above is “default,” but I recommend placing the Console window somewhere it is always visible. Note that since Standard Assets do not update with each Unity version, the console will be full of warnings and/or errors. Generally you can ignore these or hit Clear to erase them. If they give you too much trouble, you can go ahead and delete the Standard Assets folder, since it is not core to anything we will be doing.

Adding GameObjects

In order to create a scatterplot, we need points to represent the data. There are many ways to do this, but one of the more straightforward ways is to use a Sphere, one of Unity’s built-in 3D assets, and turn it into what’s called a “prefab,” essentially a template object that can be cloned and modified as needed.

To make a prefab, add a sphere to your scene by selecting GameObject -> 3D Object -> Sphere at the top of the main window. The sphere should now be visible in the Hierarchy and in the Scene view, and should be selected (has arrows pointing out of it), as in the screenshot below.

The sphere is an example of a “GameObject,” a generic type of object within Unity to which different attributes can be assigned or modified via the inspector or via C# scripting. The sphere comes with some of these already populated, as shown in the Inspector. In essence, these are:

  • Transform controls the location, rotation, and scale of the sphere in Euclidean X/Y/Z space (units are theoretically meters). All GameObjects at the very least must have a transform.
  • The Mesh is the actual 3D model.
  • The Collider is the boundary of the object for simulated physical interactions.
  • The Mesh Renderer controls how the model is rendered (displayed), such as how it is affected by light (e.g., if it can cast shadows).
  • The Material contains the texture information, or how the model is “painted.”

Creating a prefab

We need to turn the sphere into a prefab, so we can create clones of it on demand for our scatterplot. First, select the sphere and change its scale to be 0.25 in X, Y, and Z, since the default size is too large for our purposes. This can be done either by changing the values in the Inspector (under Transform), or by selecting the scale tool button in the top left corner of the main Unity window and manipulating the sphere in the Scene window.

Then, create a prefab object by right-clicking in the Project window, under Assets, and selecting Create -> Prefab in the menu that opens up, and name it something meaningful, like DataBall.

Once the prefab exists in the Project, you can populate it with the Sphere GameObject by dragging it from the Hierarchy into the prefab object in the Project window. It should look something like this:

Saving the Scene

While Assets are stored as part of the file structure on your computer, the associations and placement of objects within the “Scene” are not. To save the Scene, simply go to File -> Save Scenes, or hit Ctrl+S. You will be prompted for a name and location the first time. The easiest place to store them is within the Assets folder of the Unity project.

Importing CSV Data into Unity

Unity can read a wide variety of data types. As you can imagine, it takes a lot of different kinds of data to build a fully-functioning video game, such as images, 3D models, and sounds. Naturally, Unity can recognize CSV files. Getting Unity to parse CSVs, however, is unfortunately not as easy. There are two main steps: 1) creating a script to actually read and parse the data, and 2) getting your data into a folder called Resources.

Creating a Script

Parsing a CSV into something usable, like an array of values, is not built in, requiring some scripting magic. Fortunately, brave internet souls have provided code to do this, so we do not need to reinvent the wheel (at least here). Namely, developer Teemu Ikonen has created a lightweight script for loading CSVs into a List of Dictionaries, described here, and posted to GitHub here.

We have the code available, but how to get it, you know, into Unity? We need to create a script in the editor. To do this, within the Project window in Assets, right-click then select Create -> C# Script, and give it the name CSVReader.

Note: It is very important the name of the script file matches the name of the Class within the script. This is an assumption Unity makes in order to reference scripts from other scripts, and your scene will not run if there is a mismatch.

Okay, open up the script. This will mean a new program will open up, probably Monodevelop, but maybe Visual Studio (depending on what you chose to install). In the window, you should find some very plain code:

 

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class CSVReader : MonoBehaviour {

 // Use this for initialization
 void Start () {
 
 }
 
 // Update is called once per frame
 void Update () {
 
 }
}

First, it declares namespaces at the top for access to various other classes. If you are used to R or Python, these are roughly analogous to packages, giving you access to various pre-made functions. As noted, the class name defined here must match the name of the script. We can use it to reference the methods we define in this script from others (in our case, to connect to the script that places our points in the scatterplot). Two functions are included by default, Start() and Update(). Start runs once when the scene starts playing, while Update runs every rendered “frame” (think video/movie frames). Others exist for more specialized purposes.
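As a quick illustration of that lifecycle, a throwaway sketch like the following (not part of our scatterplot) would print one message when Play begins and another message every frame:

```csharp
using UnityEngine;

public class LifecycleDemo : MonoBehaviour {

    // Runs once, when the scene starts playing
    void Start () {
        Debug.Log("Scene started");
    }

    // Runs every rendered frame
    void Update () {
        Debug.Log("Another frame rendered");
    }
}
```

Spamming Debug.Log in Update like this floods the Console very quickly, which is one reason nearly all of the code in this guide lives in Start().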

In our case, all we need to do is take Ikonen’s code and paste it in, replacing all existing code, and save it. If you are curious about how it works, I have a commented version of his code uploaded on github here.

We can leave this script alone for the remainder of the guide, since we will be referencing (i.e., running) the function it contains from another script.

Getting some data

First, to get Unity to read the CSV, we need to create a folder called Resources under Assets within the Project window. Make sure the naming is exact, since a folder with the name Resources has a special meaning for Unity (it allows for simple direct references to assets rather than manually associating them in the editor).

Think very hard about what dataset you want to graph. Now stop thinking, because I have something cooked up already: the iris dataset (wikipedia). You can download my cleaned up version here: iris.csv. To get it, I exported it from R in Rstudio using this code (getting rid of the quotes is not necessary but makes for cleaner display):

# Write without quotes
write.csv(iris, file = "iris.csv", quote = FALSE)

Okay, now make sure the iris.csv is in the Resources folder (and make sure it is actually named that). You can drag and drop it into the Project window as if it were an OS file window, or you can actually right-click on the folder and select the option to view it in your OS file system (“Show in Explorer” on Windows), and then put the file there like any other. Whenever you import a new file, Unity will take a moment to import it, which involves creating some additional Unity-specific metadata.
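As an aside, the Resources folder is what allows scripts to load the file by name at runtime; under the hood, the CSVReader script does something along these lines (a sketch for illustration only, nothing you need to add):

```csharp
// Loads Assets/Resources/iris.csv as a text asset; note the extension is omitted
TextAsset csvData = Resources.Load("iris") as TextAsset;

// The raw file contents are available as one big string
Debug.Log(csvData.text);
```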

Plotting the Points

Step 1: Loading the Data

In order to plot the points, we need a script that gets the values from the CSVReader script, turns those values into XYZ coordinates, and then creates a clone of our prefab DataBall at each location.

To start, create another C# Script (in Project, right-click -> Create -> C# Script) titled something appealing, like DataPlotter, and open it in your development environment of choice. It will look very familiar:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class DataPlotter : MonoBehaviour {

 // Use this for initialization
 void Start () {
 
 }
 
 // Update is called once per frame
 void Update () {
 
 }
}

First, we need to make sure we can get our input file read using our CSVReader script. To do this, we need a variable to hold the name of the file, a variable to hold the data that the script outputs, and the code to run the other script and populate that variable. Delete the Update() function, as we will not be needing it. Namespaces are omitted here for space.

public class DataPlotter : MonoBehaviour {

 // Name of the input file, no extension
 public string inputfile;

 
 // List for holding data from CSV reader
 private List<Dictionary<string, object>> pointList;

 // Use this for initialization
 void Start () {

 // Set pointlist to results of function Reader with argument inputfile
 pointList = CSVReader.Read(inputfile);

 //Log to console
 Debug.Log(pointList);
 }
 
}

Public variables are accessible by other scripts, but also modifiable within the Unity editor, which will override any default values at runtime. Private variables are inaccessible by other scripts and hidden in the editor.
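To make the distinction concrete, here is a minimal sketch (hypothetical names, not part of our plotter): speed would appear as an editable field in the Inspector, while frameCount would stay hidden:

```csharp
using UnityEngine;

public class VisibilityDemo : MonoBehaviour {

    // Public: shows up in the Inspector; a value typed there overrides this default
    public float speed = 1.5f;

    // Private: hidden from the Inspector and inaccessible to other scripts
    private int frameCount = 0;

    void Update () {
        frameCount++;
    }
}
```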

Once you save your script, move back to Unity. For a script to run, it needs to be attached to a GameObject within the Scene. We have a sphere sitting in our scene, but a better option is to right-click within the Hierarchy and select Create Empty. This will create an empty GameObject, which is useful for holding other GameObjects and serving as something to attach scripts to. Give it a memorable name, such as Plotter, and drag your script onto it from the Project window to make the script a Component of the Plotter GameObject.

Warning: Make sure you are not in “Play mode” (that the buttons at the top are not blue) when you are changing settings in the Editor. Anything changed during Play mode will revert back to the way it was when Play mode began, including the placement of GameObjects, components attached to GameObjects (like scripts), and variables set within scripts.

You should now be able to select Plotter in the Hierarchy and see the script in the Inspector, like this:

As you can see, there is a field for Inputfile, which we defined within the script. Go ahead and put the name of the CSV, minus file extension, into that field (in our case, iris).

Now, hit the big play button at the top of the screen and look at the Console (I recommend setting it to “Clear on Play”). The last entry should be something like:

System.Collections.Generic.List`1[System.Collections.Generic.Dictionary`2[System.String,System.Object]]
UnityEngine.Debug:Log(Object)
DataPlotter:Start() (at Assets/DataPlotter.cs:18)

While this does not neatly print our data, it does show that our data was loaded (it is a List that contains Dictionaries), and that it came from line 18 of the DataPlotter.cs script.

Step 2: Setting up Column Names

To actually begin printing meaningful things (that we can also store and use for displaying our data), we need to do a little conversion. This code goes right after the Debug.Log() in the previous code block, within the Start() function:

 
// Declare list of strings, fill with keys (column names)
 List<string> columnList = new List<string>(pointList[1].Keys);

 // Print number of keys (using .Count)
 Debug.Log("There are " + columnList.Count + " columns in the CSV");

 foreach (string key in columnList)
  Debug.Log("Column name is " + key);

pointList[1].Keys is technically the list of “keys” of the index 1 Dictionary in pointList. These are the column names within the CSV. These are counted and printed, and then each column name is printed via a foreach loop. If you hit the play button, the console should fill with the following:

Okay, we have a list of column names we can use to reference points within pointList in order to get the coordinates for the data points.
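As a concrete example of that kind of reference (assuming the iris column names; you do not need to add this to the script), a single value can be pulled out by row index and column name:

```csharp
// Row 0 of the data, column "Sepal.Length"; values come back as generic objects
object rawValue = pointList[0]["Sepal.Length"];

// Convert to a float (Single) before doing math with it
float sepalLength = System.Convert.ToSingle(rawValue);
Debug.Log("First Sepal.Length is " + sepalLength);
```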

What that lets us do is designate which column we want graphed by its index, rather than having to manually type the full name in the Editor. To implement that, we need to create two sets of variables, three ints to expose in the editor, and three strings which hold the full column names. We can leave the strings Public in order to see the actual Column names at runtime.

These should go below the existing variables, but before the Start() function:

// Indices for columns to be assigned
 public int columnX = 0;
 public int columnY = 1;
 public int columnZ = 2;

 // Full column names
 public string xName;
 public string yName;
 public string zName;

Keep in mind that here we have some values assigned to the columns. This ensures that if the user does nothing, the first three columns are taken as the default. Anything the user inputs in the Inspector will override these values.

The next block actually assigns these variables; it goes within the Start() function, after the last Debug.Log(). Keep in mind this needs to come after the creation of columnList, because it relies on that variable being populated:

// Assign column name from columnList to Name variables
 xName = columnList[columnX];
 yName = columnList[columnY];
 zName = columnList[columnZ];

What this does is take the string within columnList, at the index specified by the column variables, and assign it to the Name variables.

Save and go back to the Editor, and hit Play. Select the Plotter GameObject and look in the Inspector. You should see the fields populated (if not, you may need to exit Play mode, then manually input the Column values), like so:

Note that Column X has no name… which is true, it doesn’t in our data! Try exiting Play mode, changing the columns to 1, 2, 3, and hitting Play again to see how it updates. Also note that you can put nonsense into the Names fields, and it just gets overwritten by the script (since you just told it to get new names from columnList).

Your entire DataPlotter script should now look like this:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class DataPlotter : MonoBehaviour {

 // Name of the input file, no extension
 public string inputfile;
 
 // List for holding data from CSV reader
 private List<Dictionary<string, object>> pointList;

 // Indices for columns to be assigned
 public int columnX = 0;
 public int columnY = 1;
 public int columnZ = 2;

 // Full column names
 public string xName;
 public string yName;
 public string zName;

 // Use this for initialization
 void Start () {

 // Set pointlist to results of function Reader with argument inputfile
 pointList = CSVReader.Read(inputfile);

 //Log to console
 Debug.Log(pointList);

 // Declare list of strings, fill with keys (column names)
 List<string> columnList = new List<string>(pointList[1].Keys);

 // Print number of keys (using .count)
 Debug.Log("There are " + columnList.Count + " columns in the CSV");

 foreach (string key in columnList)
 Debug.Log("Column name is " + key);

 // Assign column name from columnList to Name variables
 xName = columnList[columnX];
 yName = columnList[columnY];
 zName = columnList[columnZ];

 }
}

Step 3: Instantiating the Prefab

Before placing our points, we first need to associate the prefab DataBall we made with our script, then direct the script to instantiate (make a clone).

First, we need to let our script know what the prefab is it will be placing. To do this, we need to declare a public GameObject variable within the script, like this (place it just below our other variables, but above Start() ):

// The prefab for the data points to be instantiated
 public GameObject PointPrefab;

Save your script and go back to the Editor, and look at Plotter again. There is now another field open in the Inspector, under our Column and Names variables. It has some text, with a little circle next to it:

To populate it, all you need to do is drag the DataBall prefab we created from the Project window to that field. It should look like this:

Now, our script “knows” about DataBall, but right now it is doing nothing with it.

While you are looking at the editor, delete the Sphere in our Scene within the Hierarchy (Not our prefab!). Don’t worry, the Sphere will rise again, in the form of the prefab.

Placing a clone of DataBall means instantiating it. Three pieces of information are needed to instantiate a prefab: a reference to the prefab itself (our PointPrefab variable will do), its position in XYZ space in a data type called a Vector3, and its rotation in a data type called a Quaternion.

For now, we can leave the position and rotation at zero (Quaternion.identity is shorthand for zero rotation), like so:

//instantiate prefab
Instantiate(PointPrefab, new Vector3(0,0,0), Quaternion.identity);

Note that this code needs to be within the Start() function in order to run, preferably at the end (for now).

Once you save, go back to the Editor, and hit Play, you should see DataBall appear in the scene view and the hierarchy, with the name (Clone) appended. If you stop playing, it will disappear. This is because instantiated objects only last for as long as the scene is run (unless destroyed by something else, which won’t happen in this guide).

Feel free to change the values within the Vector3 to other values, such as 1, 3, 4, then save and hit Play to see how the position it spawns at moves.

Step 4: Looping and Instantiating

Now that we know how to instantiate a DataBall, we need to instantiate one for each row in our table, according to the values in the three columns we have selected.

To do this, we need to loop through every row, get the value at each column position, then use those values as the coordinates to instantiate our DataBall.

This code will replace the Instantiate() code above, but still be within Start(). It needs to be after where the Name variables are assigned.

//Loop through Pointlist
 for (var i = 0; i < pointList.Count; i++)
 {
 // Get value in pointList at ith "row", in "column" Name
 float x = System.Convert.ToSingle(pointList[i][xName]);
 float y = System.Convert.ToSingle(pointList[i][yName]);
 float z = System.Convert.ToSingle(pointList[i][zName]);

 //instantiate the prefab with coordinates defined above
 Instantiate(PointPrefab, new Vector3(x, y, z), Quaternion.identity);

 }

pointList.Count returns the length of the List (in other words, the number of rows), so that the loop runs as many times as there are rows.

I want to point out that System.Convert.ToSingle simply ensures the value given by pointList is a float (Single is a type of floating point number).

Once you save and hit Play, you should have a whole mess of Clones in your hierarchy, and a cloud of data points in your Game/Scene view, as shown below. Remember, at the moment these points are positioned according to the raw values in the CSV, so they aren’t necessarily going to be around the origin (0,0,0). To get a better look, you can go into the Scene view tab and navigate around.

Try changing the column values (and remember to start/stop Play) and see what happens. Warning: if you put in a column that does not exist (like 12), or if the column is full of strings (like 5), nothing will be plotted and there will be an error in the console. Essentially, make sure you are giving the script numerical data, or it won’t work.

To recap, your full DataPlotter script should look like this:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class DataPlotter : MonoBehaviour {

 // Name of the input file, no extension
 public string inputfile;
 
 // List for holding data from CSV reader
 private List<Dictionary<string, object>> pointList;

 // Indices for columns to be assigned
 public int columnX = 0;
 public int columnY = 1;
 public int columnZ = 2;

 // Full column names
 public string xName;
 public string yName;
 public string zName;
 
 // The prefab for the data points that will be instantiated
 public GameObject PointPrefab; 

 // Use this for initialization
 void Start () {

 // Set pointlist to results of function Reader with argument inputfile
 pointList = CSVReader.Read(inputfile);

 //Log to console
 Debug.Log(pointList);

 // Declare list of strings, fill with keys (column names)
 List<string> columnList = new List<string>(pointList[1].Keys);

 // Print number of keys (using .count)
 Debug.Log("There are " + columnList.Count + " columns in the CSV");

 foreach (string key in columnList)
 Debug.Log("Column name is " + key);

 // Assign column name from columnList to Name variables
 xName = columnList[columnX];
 yName = columnList[columnY];
 zName = columnList[columnZ];

 //Loop through Pointlist
 for (var i = 0; i < pointList.Count; i++)
 {
 // Get value in pointList at ith "row", in "column" Name
 float x = System.Convert.ToSingle(pointList[i][xName]);
 float y = System.Convert.ToSingle(pointList[i][yName]);
 float z = System.Convert.ToSingle(pointList[i][zName]);

 //instantiate the prefab with coordinates defined above
 Instantiate(PointPrefab, new Vector3(x, y, z), Quaternion.identity);

 } 

 }
 
}

Cleanup: Instantiating Clones as Children

Right now, we dump a series of clones in the Hierarchy, which is messy. What would be better is to instantiate clones as a child of another object in the Hierarchy, which is both neater in terms of organization and lets you manipulate all the points at once by manipulating the parent object.

First, create an empty GameObject by right-clicking in the Hierarchy, and selecting Create Empty, and give it a name like PointHolder.

Now that we have this object ready to go, we need to make space for it in our script.

// The prefab for the data points that will be instantiated
 public GameObject PointPrefab;

Much like before, we simply need to declare a GameObject variable in our script (and save the script!), then drag our PointHolder object from the Hierarchy into the newly empty slot in the Inspector (make sure to select Plotter in the Hierarchy).

Here is where things get slightly more involved. In short, instead of just calling Instantiate() in our loop, we need to assign its result to a new GameObject variable. By doing so, we can more easily manipulate the attributes of each prefab clone right after it’s made. This code replaces the previous Instantiate() line.

// Instantiate as gameobject variable so that it can be manipulated within loop
 GameObject dataPoint = Instantiate(
 PointPrefab, 
 new Vector3(x, y, z), 
 Quaternion.identity);

Now we can assign it to be a child of our PointHolder object. In Unity, this entails making the Transform component of our newly generated prefab (dataPoint) a child of PointHolder’s Transform. Remember, order is important, so this code needs to come after you instantiate the object.

// Make dataPoint child of PointHolder object 
 dataPoint.transform.parent = PointHolder.transform;

Remember that the Transform is also the thing that determines position in the Hierarchy, in addition to 3D location/rotation/scale. What this means is that anything affecting the Transform of the parent will also affect the children (relative to the parent).
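For example, once all the points are children of PointHolder, a single line like this (placed anywhere with access to the PointHolder variable, e.g. at the end of Start()) would shrink the entire plot at once:

```csharp
// Halving the parent's scale halves the position and size of every child DataBall
PointHolder.transform.localScale = new Vector3(0.5f, 0.5f, 0.5f);
```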

We can also do one more simple thing and give our prefab clones a more meaningful name, like the actual values they represent:

// Assigns original values to dataPointName
 string dataPointName = 
 pointList[i][xName] + " " 
 + pointList[i][yName] + " " 
 + pointList[i][zName];

Then, of course, we actually need to give the string dataPointName to dataPoint, which is done like so (name is a property every Unity Object has, so it can be set through the Transform):

// Assigns name to the prefab
 dataPoint.transform.name = dataPointName;

Save and return to the Editor, Play, and check that the points are neatly nested in the Hierarchy and have their new names assigned.

Your entire script should look like this:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class DataPlotter : MonoBehaviour
{

    // Name of the input file, no extension
    public string inputfile;

    // List for holding data from CSV reader
    private List<Dictionary<string, object>> pointList;

    // Indices for columns to be assigned
    public int columnX = 0;
    public int columnY = 1;
    public int columnZ = 2;

    // Full column names
    public string xName;
    public string yName;
    public string zName;
    
    // The prefab for the data points that will be instantiated
    public GameObject PointPrefab;

    // Object which will contain instantiated prefabs in hierarchy
    public GameObject PointHolder;

    // Use this for initialization
    void Start()
    {

        // Set pointlist to results of function Reader with argument inputfile
        pointList = CSVReader.Read(inputfile);

        //Log to console
        Debug.Log(pointList);

        // Declare list of strings, fill with keys (column names)
        List<string> columnList = new List<string>(pointList[1].Keys);

        // Print number of keys (using .count)
        Debug.Log("There are " + columnList.Count + " columns in the CSV");

        foreach (string key in columnList)
            Debug.Log("Column name is " + key);

        // Assign column name from columnList to Name variables
        xName = columnList[columnX];
        yName = columnList[columnY];
        zName = columnList[columnZ];

        //Loop through Pointlist
        for (var i = 0; i < pointList.Count; i++)
        {
            // Get value in pointList at ith "row", in "column" Name
            float x = System.Convert.ToSingle(pointList[i][xName]);
            float y = System.Convert.ToSingle(pointList[i][yName]);
            float z = System.Convert.ToSingle(pointList[i][zName]);

            // Instantiate as gameobject variable so that it can be manipulated within loop
            GameObject dataPoint = Instantiate(
                    PointPrefab,
                    new Vector3(x, y, z),
                    Quaternion.identity);

            // Make child of PointHolder object, to keep points within container in hierarchy
            dataPoint.transform.parent = PointHolder.transform;

            // Assigns original values to dataPointName
            string dataPointName =
                pointList[i][xName] + " "
                + pointList[i][yName] + " "
                + pointList[i][zName];

            // Assigns name to the prefab
            dataPoint.transform.name = dataPointName;

            
        }

    }

}

Normalizing the Values for Display

Right now, our data is placed according to the raw values in the file. This isn’t very flexible: if you are dealing with a different dataset with values in the hundreds or thousands, the graph will be that large. This becomes very important when implementing interaction, particularly in VR, since you don’t really want people to have to walk a mile just to look at the graph.

To do this, we will scale all values to between 0 and 10 before using them as coordinates for instantiating our DataBalls. First, we need to find the minimum and maximum values per column. We can then work a little math on our raw values to get them into our 0-10 range.
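The math in question is ordinary min-max normalization. Sketched for a single value (with hypothetical variable names, and assuming a plot size of 10), it looks like this:

```csharp
// Rescale a raw value into the 0-10 range using its column's extremes
float plotScale = 10f;
float normalized = plotScale * (rawValue - minValue) / (maxValue - minValue);
```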

Finding the min and max within our Start() function would mean packing in quite a few more lines, so this is a good time to do the C# thing and create a Method (function) outside Start(), that we can then call as needed.

First, some housekeeping: we will be using Convert.ToSingle a lot. Currently, we specify its namespace explicitly each time we use it (System.Convert.ToSingle), but instead we can declare it at the beginning of our script along with others (like using UnityEngine;).

using System;

Now we can just use Convert.ToSingle by itself, which makes the code a little cleaner. Next, onto the implementation.

Find Maximums and Minimums

I won’t walk through it line by line, but here is the method FindMaxValue() to find the maximum value in a column in our pointList. In essence, it takes the column name as the argument (string columnName) and returns the maximum value as a float (private float). It does this by looping through the particular Dictionary (column), and overwriting the value it has if it finds a bigger one. Keep in mind that it assumes that pointList is defined, which is fine for our purposes, but means it would need to be modified to be extensible.

Place this function as its own method after Start(), but before the final curly bracket (i.e., within the DataPlotter class). Note that we could place almost all of our code within separate methods, which would be preferable for most projects, but we are keeping much of our code within Start() so that it is more readable.

    private float FindMaxValue(string columnName)
    {
        //set initial value to first value
        float maxValue = Convert.ToSingle(pointList[0][columnName]);

        //Loop through Dictionary, overwrite existing maxValue if new value is larger
        for (var i = 0; i < pointList.Count; i++)
        {
            if (maxValue < Convert.ToSingle(pointList[i][columnName]))
                maxValue = Convert.ToSingle(pointList[i][columnName]);
        }

        //Spit out the max value
        return maxValue;
    }

Similarly, here is the code for finding the minimum value. Place it after the block for FindMaxValue, and make sure all your curly brackets make sense; remember, these are separate methods, at the same level as Start().

    private float FindMinValue(string columnName)
    {
        float minValue = Convert.ToSingle(pointList[0][columnName]);

        //Loop through Dictionary, overwrite existing minValue if new value is smaller
        for (var i = 0; i < pointList.Count; i++)
        {
            if (Convert.ToSingle(pointList[i][columnName]) < minValue)
                minValue = Convert.ToSingle(pointList[i][columnName]);
        }

        return minValue;
    }

From now on, we can just call these methods within our Start() function. Since they take a column name, we can pass in our existing string variables (xName, yName, zName) and store the results in new floats.

// Get maxes of each axis
 float xMax = FindMaxValue(xName);
 float yMax = FindMaxValue(yName);
 float zMax = FindMaxValue(zName);

 // Get minimums of each axis
 float xMin = FindMinValue(xName);
 float yMin = FindMinValue(yName);
 float zMin = FindMinValue(zName);

Now all we need to do is work a little math magic to calculate the normalized position of each point: (value - min) / (max - min). This code can then replace the existing code defining the x, y, and z floats within Start():

 // Get value in pointList at ith "row", in "column" Name, then normalize
 float x = 
 (System.Convert.ToSingle(pointList[i][xName]) - xMin) / (xMax - xMin);

 float y = 
 (System.Convert.ToSingle(pointList[i][yName]) - yMin) / (yMax - yMin);

 float z = 
 (System.Convert.ToSingle(pointList[i][zName]) - zMin) / (zMax - zMin);
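To sanity-check the formula outside Unity, here is the same min-max normalization sketched in Python (the values are hypothetical; this is not part of the Unity script):

```python
# Min-max normalization: maps each value in a column onto the 0-1 range.
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([2.0, 4.0, 6.0, 10.0]))  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always lands at 0 and the largest at 1, regardless of the raw scale of the data, which is exactly why the graph stays a predictable size for any dataset.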

One more thing: it would be nice to be able to change the scale of the graph, meaning how far in space the maximum points go, so let’s quickly create a variable to do that, defined at the top of the script with the other variables.

public float plotScale = 10;

Then we can just add that variable to our Instantiate code, so that it becomes:

GameObject dataPoint = Instantiate(
                    PointPrefab,
                    new Vector3(x, y, z) * plotScale,
                    Quaternion.identity);

Save and return to the Editor. Hit Play, and if all your curly brackets are right (they probably won’t be the first time), you should see all the points in their new positions. Try exiting play mode and changing the columns they represent in the Inspector for Plotter. The positions of the DataBalls will not exceed 10 in any axis; you can alter that limit by changing the plotScale variable in the Inspector (but you will need to exit and re-enter play mode to see the change).


Adding Color

Obviously, our plot is a little drab. Fortunately, we only need one line of code to dynamically assign color to our DataBall prefabs as we instantiate them. This is because we already have normalized x, y, z values in 3D world space that we can instead map to red, green, and blue in RGB color space. There are several ways to define color in Unity, but the most straightforward is as a set of four floats with values between 0-1: one for red, one for green, one for blue, and one for alpha (transparency).

Conveniently, we already have x, y, z in our rendering loop in that format.  We can use them to create a new color (we can just leave Alpha at 1), and then assign it as the color of the prefab dataPoint, to override the default color. That code looks like this:

// Gets the material color and sets it to a new RGBA color we define
 dataPoint.GetComponent<Renderer>().material.color = 
 new Color(x, y, z, 1.0f);

Note the syntax is a little different because we need to get the Renderer component on the GameObject; its material is what contains the color information.

Now (after saving and hitting Play in the Editor), you can see our fancily colored DataBalls, like so:

Finally, our code should look like this:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using System;

public class DataPlotter : MonoBehaviour {

    // Name of the input file, no extension
    public string inputfile;
    
    // List for holding data from CSV reader
    private List<Dictionary<string, object>> pointList;

    // Indices for columns to be assigned
    public int columnX = 0;
    public int columnY = 1;
    public int columnZ = 2;

    // Full column names
    public string xName;
    public string yName;
    public string zName;

    public float plotScale = 10;

    // The prefab for the data points that will be instantiated
    public GameObject PointPrefab;

    // Object which will contain instantiated prefabs in hierarchy
    public GameObject PointHolder;

    // Use this for initialization
    void Start () {

        // Set pointlist to results of function Reader with argument inputfile
        pointList = CSVReader.Read(inputfile);

        //Log to console
        Debug.Log(pointList);

        // Declare list of strings, fill with keys (column names)
        List<string> columnList = new List<string>(pointList[1].Keys);

        // Print number of keys (using .count)
        Debug.Log("There are " + columnList.Count + " columns in the CSV");

        foreach (string key in columnList)
            Debug.Log("Column name is " + key);

        // Assign column name from columnList to Name variables
        xName = columnList[columnX];
        yName = columnList[columnY];
        zName = columnList[columnZ];

        // Get maxes of each axis
        float xMax = FindMaxValue(xName);
        float yMax = FindMaxValue(yName);
        float zMax = FindMaxValue(zName);

        // Get minimums of each axis
        float xMin = FindMinValue(xName);
        float yMin = FindMinValue(yName);
        float zMin = FindMinValue(zName);

        
        // Loop through pointList
        for (var i = 0; i < pointList.Count; i++)
        {
            // Get value in pointList at ith "row", in "column" Name, then normalize
            float x = 
                (System.Convert.ToSingle(pointList[i][xName]) - xMin) 
                / (xMax - xMin);

            float y = 
                (System.Convert.ToSingle(pointList[i][yName]) - yMin) 
                / (yMax - yMin);

            float z = 
                (System.Convert.ToSingle(pointList[i][zName]) - zMin) 
                / (zMax - zMin);


            // Instantiate as gameobject variable so that it can be manipulated within loop
            GameObject dataPoint = Instantiate(
                    PointPrefab, 
                    new Vector3(x, y, z)* plotScale, 
                    Quaternion.identity);
                       
            // Make child of PointHolder object, to keep points within container in hierarchy
            dataPoint.transform.parent = PointHolder.transform;

            // Assigns original values to dataPointName
            string dataPointName = 
                pointList[i][xName] + " " 
                + pointList[i][yName] + " " 
                + pointList[i][zName];

            // Assigns name to the prefab
            dataPoint.transform.name = dataPointName;

            // Gets the material color and sets it to a new RGBA color we define
            dataPoint.GetComponent<Renderer>().material.color = 
                new Color(x, y, z, 1.0f);
        }       

    }

    private float FindMaxValue(string columnName)
    {
        //set initial value to first value
        float maxValue = Convert.ToSingle(pointList[0][columnName]);

        //Loop through Dictionary, overwrite existing maxValue if new value is larger
        for (var i = 0; i < pointList.Count; i++)
        {
            if (maxValue < Convert.ToSingle(pointList[i][columnName]))
                maxValue = Convert.ToSingle(pointList[i][columnName]);
        }

        //Spit out the max value
        return maxValue;
    }

    private float FindMinValue(string columnName)
    {

        float minValue = Convert.ToSingle(pointList[0][columnName]);

        //Loop through Dictionary, overwrite existing minValue if new value is smaller
        for (var i = 0; i < pointList.Count; i++)
        {
            if (Convert.ToSingle(pointList[i][columnName]) < minValue)
                minValue = Convert.ToSingle(pointList[i][columnName]);
        }

        return minValue;
    }

}

Exploring your Data

So far we’ve been looking at our data largely through the view of the camera when you hit Play, which is not very dynamic. While implementing actual VR technology is beyond the scope of this guide, we can emulate a VR-ish experience by adding a controllable camera. We can use a prefab player controller from Standard Assets, which lets you wander around the environment like a standard first-person game. If you did not download Standard Assets at the very beginning, you can go to the Asset Store window, search for Standard Assets, and download them from there (there will be a few windows asking about import settings, but the defaults are safe and you can agree to the prompts).

Before we add the prefab, we need to create a ground to walk on, or our FPS controller will fall through to infinity. Reminder: make sure you are not in play mode!

We do this by adding a Plane to our scene, through Game Object -> 3D Object -> Plane at the top of the main window. It will be a little small, so change the scale to 10 in x, y, and z by modifying the scale values in the Transform component visible in the Inspector (with the plane selected).

Next add the prefab FPS controller by dragging it from the Project window into the Scene (make sure it’s on the plane!). It’s located in Standard Assets -> Characters -> FirstPersonCharacter -> FPSController.prefab. Also make sure it’s above the plane, or it might fall through.

Your scene should look like this:

Now disable the Main Camera, which you can do by unchecking it in the Inspector (a little box near the top, pictured below), since it will clash with the camera attached to the FPSController.

Okay, now finally hit Play! You should be able to walk around the environment freely by using the WASD or the arrow keys to move, and looking around with the mouse. These controls should seem very familiar if you have played a first person shooter game. You can even press spacebar to jump on top of the dataPoints!

Little Things (Aesthetics)

If you haven’t noticed, having a white ground is pretty jarring, and our bluish points don’t contrast well with the sky. Predictably, we can change both those things.

To get a prettier ground, we can just apply one of the existing materials in Standard Assets that we have downloaded. Personally, I like NavyGrid, which is contained within Standard Assets -> Prototyping -> NavyGrid. To assign it to our plane, all you need to do is drag it from the Project window onto the Plane in the Hierarchy. Your plane should instantly update to look like this:

Feel free to manipulate the options of the Material in the Inspector, either by selecting the Material itself, or selecting the plane. For example, I made my plane a little darker by picking a darker color in Albedo (click the colored square by the little eyedropper symbol).

Changing the sky color is less intuitive: it’s actually a property of the camera. For us, that camera is now in the FPSController GameObject, on the FirstPersonCharacter in the Hierarchy. If you select FirstPersonCharacter, you can see the Camera in the Inspector.

First, change the Clear Flags option from “Skybox” to “Solid Color,” and then set the Background by clicking on the colored bar (I prefer solid black for contrast, but you can pick whatever you want). Once set, it should look like this:

Now when you run your script, you are given a more modern scene, ripe for screenshots, or amazing your friends.

Extensions

You may have noticed we are still missing some important components of a real plot, such as labels and axes for reference. These are not difficult to make, but are time consuming, and I won’t be covering them here. If you wish to make your own labels, we already have much of the information coded in our script. To create labels, you will need to add 3D Text (Game Object -> 3D Object -> 3D Text), give that object a specific name, such as X_Title, and add code in your script to find that object by name and change the text (via TextMesh), which looks like the following:

GameObject.Find("X_Title").GetComponent<TextMesh>().text = xName;

Similarly, we have mins and maxes already stored, so it is largely a matter of placing 3D Text GameObjects at the extent of our graph area, which is defined by plotScale.

At the beginning, I mentioned deployment to different platforms. This involves “building” your project, and how you build depends on your target platform; that process is out of scope for this guide, but documentation on it is not hard to find.

Conclusion

This guide is meant to give a taste of Unity for data visualization, and to illustrate many of the idiosyncrasies that need to be dealt with in order to use Unity for displaying data. At present, few tools exist to quickly and easily create data graphics for the variety of VR technology we have today, but hopefully it won’t be long until this post is a quaint reminder of how things used to be.

Update March 2, 2018:

I have uploaded a more complete scatterplot with the labels included, along with some other minor improvements here: https://github.com/PrinzEugn/Scatterplot_Standalone

Scraping Data in Python Using WebDrivers*

* An interactive IPython notebook version of this post can be found on my GitHub page (here).

1. Introduction

The Internet is a treasure trove of interesting, unique, and often underutilized data. Getting that data, however, can be an arduous task. There are numerous libraries, implemented in various programming languages, that can help to ease the burden and have been a boon to data miners everywhere. Most notable are the BeautifulSoup and urllib2 libraries in Python.

1.1 The Problem

Still, there are a number of instances where you may feel that these aren’t the right tools for the job. Sites where you need to enter information into forms, select boxes, or navigate by dropdown menus are especially tricky with these more traditional methods. Dynamic websites, with or without static addresses, are close to impossible.

1.2 The Solution

WebDrivers can provide a (generally) user-friendly answer to these problems. Although this post will focus on using the selenium library paired with ChromeDriver in Python, there are other WebDrivers (e.g., Firefox, headless browsers) and languages (e.g., Java) that can be used for this.

1.2.1 What is a WebDriver and Why is it the Solution?

A WebDriver is simply a live instance of an Internet browser controlled by a program rather than by real-time human interaction. In essence, it looks like a regular browser (both to you and to sites’ servers), it quacks like a regular browser, but it isn’t quite a regular browser. The appeal lies in the fact that you are able to automate natural web navigation that would be difficult or impossible with traditional HTML and XML extractors.

A simple example is filling out a form. Say that you want to search a site for documents associated with a set of boolean strings (e.g., [“selenium NOT java”, “java NOT selenium”, …]) over a set of specific time spans. However, the addresses of those search results are dynamic, making them impossible to generate a priori. That’s going to be a problem for other tools, but with a WebDriver you can execute the search by filling out the search bar and specifying the date range (e.g., by clicking on a calendar GUI, entering in the dates, or using a dropdown menu).


2. Setup

Assuming that you have Python 2.7+ up and running on your machine, the first thing that you will need is ChromeDriver. You’ll want to install this somewhere that’s easily accessible (I just have it in my “Desktop” folder).

Next you will need to install selenium, which can be done via pip:

pip install selenium

Optional libraries that you may find useful are: os, random, re, time, and sys.


3. Getting Started

Once those are installed, you can start getting acquainted with using a WebDriver through Python.

First, we need to import the necessary libraries. The Select utility allows us to isolate user interfaces that we want to operate on.

import os
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import random
import time
import sys

Next, we’re actually going to connect to the WebDriver and open up a browser window.

## Connecting
chromedriver = "C:\\Users\\rbm166\\Desktop\\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

## Opening
driver = webdriver.Chrome(chromedriver)
# Clear cache
driver.delete_all_cookies()
# Full screen
driver.maximize_window()
# Wait...
time.sleep(5)

This should open up a Google Chrome Browser page that isn’t on any particular webpage:


4. Example

Now that you know how to get started, let’s move on to a real example.

4.1 The Constitute Project

The Constitute Project is a collection of national constitutions from across the world. Users can read full constitutions, compare countries, and explore topics within and across documents. This database, however, would be very difficult to scrape without using a WebDriver. This is because the site’s main page listing the constitutions is dynamic and based on JavaScript.

Using Python and ChromeDriver, we can easily navigate this page and grab all the links to countries’ constitutions, whose pages are static HTML and can be parsed with a standard library like BeautifulSoup.

First, we’re going to point our driver to the page we want to load:

## Provide the starting page:
link0 = 'https://www.constituteproject.org/search?lang=en'

## Go to that starting page:
driver.get(link0)

## Wait for it to load:
time.sleep(5)

Once loaded, the driver’s browser should look something like this:

Using the inspect element option available in that window or a separate browser, we can find the identifier associated with the links that we want to grab. In this case, we can use the link text “View HTML”.

Now we can tell our driver to find all of the page’s elements associated with that link text:

## Pull all "View HTML" related objects:
objects = driver.find_elements_by_link_text('View HTML')

We can perform a quick check to make sure that we’re only grabbing what we want. The number of objects should equal the number of constitutions, n = 194, listed at the top of the page:

## Sanity Check:
len(objects)
194

We just have to get the links out of these objects now. Selenium makes this easy:

## List to hold the links:
links = []

## Iterate over list and get the links:
for obj in objects:
    links.append(obj.get_attribute('href'))

## Inspect the links:
links[0:6]
[u'https://www.constituteproject.org/constitution/Afghanistan_2004?lang=en',
 u'https://www.constituteproject.org/constitution/Albania_2012?lang=en',
 u'https://www.constituteproject.org/constitution/Algeria_2008?lang=en',
 u'https://www.constituteproject.org/constitution/Andorra_1993?lang=en',
 u'https://www.constituteproject.org/constitution/Angola_2010?lang=en',
 u'https://www.constituteproject.org/constitution/Antigua_and_Barbuda_1981?lang=en']

These links can now be written out to a *.txt or *.csv file and parsed using a library like BeautifulSoup, or manipulated later in the script.
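As a minimal sketch of that last step (using a hypothetical filename and a truncated copy of the links list), writing the links out to a text file looks like:

```python
# Hypothetical output file; in practice you would loop over the full
# links list collected by the driver above.
links = [
    "https://www.constituteproject.org/constitution/Afghanistan_2004?lang=en",
    "https://www.constituteproject.org/constitution/Albania_2012?lang=en",
]

# Write one link per line to a plain-text file.
with open("constitution_links.txt", "w") as f:
    for link in links:
        f.write(link + "\n")
```

From there, each saved link can be fetched and parsed on its own schedule, without keeping the browser session open.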


## Now we can close the driver:
driver.close()

5. Conclusion

Traditional web scraping libraries and packages are well developed tools that make web scraping easier. They do, however, fall short on some fronts. WebDrivers provide an elegant solution to many of the problems faced by these traditional methods. As shown above, WebDrivers can navigate dynamic websites with ease and are easily adaptable to most situations.

Disclaimer

Just because you can doesn’t mean you should. Be sure to check and observe sites’ policies regarding scrapers. These policies can most often be found at “[insert url here].com/robots.txt”.
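Python 3’s standard library can check those policies for you via urllib.robotparser. Here is a minimal sketch against a made-up policy (the rules string and the example.com URLs are hypothetical):

```python
from urllib import robotparser

# Hypothetical robots.txt content: disallow /search for all user agents.
rules = """User-agent: *
Disallow: /search
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Check whether a given user agent may fetch a given URL.
print(rp.can_fetch("*", "https://example.com/search"))  # False
print(rp.can_fetch("*", "https://example.com/about"))   # True
```

In real use, you would point rp.set_url() at the site’s live robots.txt and call rp.read() instead of parsing a string.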


 

Introduction to Network Description and Visualization in R

Cassie McMillan

2016-12-05

This vignette is designed to give an introduction to performing network analysis in R using the statnet package. Statnet is a suite of several network-related R packages, including sna, network, ergm, and tergm. The statnet website includes a complete list of the included packages. While one of the main focuses of statnet is the statistical modeling of networks, this vignette will mainly focus on data handling, calculating descriptives, and visualization. At the end, I’ll provide resources you can check out if you’re interested in learning more about the more advanced functions in statnet. I’ll also quickly note that other network analysis R packages exist, most notably igraph. While all of these packages can do the more basic tasks, there are some differences that are worth checking out.

Getting Started

Like any R package, it’s first necessary to install and access it. The package can easily be installed through CRAN.

install.packages("statnet")
library(statnet)

For this vignette, I’ll be using data that David Krackhardt collected from managers in a company. There are two csv files containing the data. The first, krack_advice, is an adjacency matrix indicating who gives advice to whom. The second, krack_attributes, is a list of attributes for each of the managers, such as age and tenure. After setting your working directory, load in the data like this:

advice <- read.csv("Krack_Advice.csv", header=T, row.names=1, check.names=FALSE)
#include header = T and row.names = 1 to account for the node labels

att = read.csv("Krack_Attributes.csv", header=T)

advice
head(att)

Data Management

First, I’ll go over how to manage network data using statnet. To use most of statnet’s functions, it is first necessary to convert the adjacency matrices into network objects. This can be done with the following code:

advice.m = as.matrix(advice) # tell R data should be understood as a matrix
advice.n = network(advice.m, matrix.type="adjacency", directed=TRUE) # transforms into a network object
advice.m

The argument matrix.type specifies that we are putting in an adjacency matrix; you can also read in edge lists. Also in the network command, you can specify whether the matrix is directed or undirected, whether it’s a bipartite network, whether self-loops are allowed, etc.

Next, you need to link the attribute data with the network object. To do this, it’s necessary that the nodes in the network object and the attribute data be in the same order. Then, there are two ways to link the data. First, you can add the attributes one at a time. Alternatively, you can add the attributes when you create the network object. It’s also possible to add edge attributes, but the current data set does not have any.

#check if nodes are in the same order
att$ID==rownames(as.sociomatrix(advice.n))

#option 1
advice.n %v% "dept" <- att$DEPT
advice.n

#option 2
advice.n = network(advice.m, matrix.type = "adjacency", directed = T, vertex.attr = att)
advice.n

If you need to, you can easily add in new edges to the network object or delete existing edges. You can also check whether edges exist between pairs.

advice.n[2,1] <- 1 #adds an advice tie from node 2 to node 1
advice.n[2,1] <- 0 #removes an advice tie from node 2 to node 1
advice.n[1,2] #does node 1 send an advice tie to node 2?

Reading in your network data is pretty simple if it is already in adjacency matrices. However, network data can come in a variety of formats, including adjacency lists and edge lists. If your data are in one of these forms, then there’s a lot of code out there that you can use to transform them into adjacency matrices, which tend to be much easier to work with. In the past, I’ve used code from this blog to transform other data formats into adjacency matrices.

Descriptives

Now I’m going to present a couple descriptive measures that are available from the statnet package. First, here’s a bunch of functions that calculate network-level descriptives:

summary(advice.n) #summary of network object, also provides an edge list
network.size(advice.n) #number of nodes in the network
network.dyadcount(advice.n) #number dyads that exist in network (n*(n-1))
network.edgecount(advice.n) #number of edges present in the network
gden(advice.n) #network density (edge count/dyad count)
grecip(advice.n, measure = "dyadic") #proportion of symmetric dyads
gtrans(advice.n, measure = "weak") #proportion of transitive triads
symmetrize(advice.n, rule = "weak") #symmetrize so i<->j iff i->j OR i<-j 
symmetrize(advice.n, rule = "strong") #symmetrize so i<->j iff i->j AND i<-j 
dyad.census(advice.n) #MAN dyad census
triad.census(advice.n) #standard directed triad census
kpath.census(advice.n, maxlen=3, tabulate.by.vertex=FALSE) # Count paths of length <=3
kcycle.census(advice.n, maxlen=3, tabulate.by.vertex=FALSE) # Count cycles of length <=3
clique.census(advice.n, tabulate.by.vertex=FALSE, enumerate=FALSE) # counts of cliques by size

And here are some node-level descriptive statistics:

degree(advice.n, cmode="indegree") #indegree, number of nominations received
degree(advice.n, cmode="outdegree") #outdegree, number of nominations sent
degree(advice.n) #total degree (sent+received)
betweenness(advice.n) #betweenness
closeness(advice.n) #closeness
isolates(advice.n) #lists the isolates in the graph
geodist(advice.n) #gives number and lengths of all geodesics (shortest paths) between all nodes 

Visualization

The statnet package also allows for you to easily visualize your network graphs with the gplot function.

gplot(advice.n)

blog_post_rplot

There are a lot of interesting arguments that you can add into the gplot function. You can add in vertex labels and change the size and color of these labels:

gplot(advice.n, displaylabels=TRUE,
 label.cex=.75, label.col="black")

For directed graphs, you can turn off the arrows. This tends to be especially helpful for large graphs with a lot of nodes and edges.

gplot(advice.n, displaylabels=TRUE,
 label.cex=.75, label.col="black", usearrows = FALSE)

It’s easy to differentiate nodes based on their attributes. The code below colors the nodes based on their department; each color represents a different department. I also include code for adding a legend that tells us what each color represents.

gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black",vertex.col=att$DEPT)
legend("bottomleft",fill=0:4,legend=paste("DEPT",0:4),cex=0.75)

blog_post2_rplot

Here, I changed the shape of the nodes based on the level of their positions. You change the shape of the nodes by specifying the number of sides you want the shape to have using the vertex.sides argument. For instance, vertex.sides = 4 will result in squares. If you want circles, set vertex.sides = 50.

gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black",vertex.cex = 2, vertex.sides=(att$LEVEL+2))

blog_post3_rplot

You can also change the size of the nodes based on an attribute. In the examples below, I do this for both tenure of the employees and the indegree (nominations received) by each employee. I divide the size values by 6 so they can be reasonably scaled. In the graph shown below, the nodes have been sized according to indegree. Larger nodes received more advice nominations.

#sized by tenure
gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black", vertex.cex = (att$TENURE/6)) 

#sized by indegree
gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black", vertex.cex = (degree(advice.n, cmode="indegree")/6))

blog_post4_rplot

As a default, gplot uses the Fruchterman-Reingold algorithm to lay out the nodes. However, you can change this as well. For instance, you can lay out the nodes using multidimensional scaling (MDS) or in a circle.

gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black", mode = "mds") #multi-dimension scaling 
gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black", mode = "circle") #circle

If you like a layout, you can save the coordinates of it and then reapply these coordinates later to preserve your same layout.

coordinates <- gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black") 
coordinates #prints the coordinates

#applying saved coordinates to a new graph
gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black", coord = coordinates)

Note that you can also use the coord argument to supply your own coordinates. The Krackhardt data we’ve been working with doesn’t have any isolates (i.e., nodes that neither send nor receive ties), but when we do have data with isolates, these can get annoying when visualizing. You can plot a graph without displaying isolates by including the argument displayisolates=FALSE.

Furthermore, gplot does include an interactive function where you can move around the positioning of vertices yourself until you find a display that you like.

gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black", interactive=TRUE)

There are also several other arguments that you can include to make graphs that are both pretty and interesting.

palette(rainbow(6)) 
gplot(advice.n, displaylabels=TRUE,
 label.cex=.75,label.col="black",
 usecurve=TRUE, vertex.col=att$DEPT,
 vertex.cex = (degree(advice.n, cmode="indegree")/7),
 edge.col = "black", usearrows = FALSE,
 edge.curve = 0.5, vertex.border = "black")
legend("bottomleft",fill=0:4,legend=paste("DEPT",0:4),cex=0.75)

blog_post5_rplot

Statistical Modeling

As mentioned previously, the statnet package also includes a lot of functions for statistically modeling networks. This includes QAP correlations, MRQAP, ERGMs, and TERGMs/STERGMs. An in-depth discussion of these packages is beyond the scope of this vignette, but here are a couple additional resources that go into more detail about these functions. I’ve found these to be helpful in the past at explaining how these functions work:

  • INSNA Sunbelt 2016 statnet workshop resources (here)
  • Notes from SNA and R workshop put on by Michael Heaney (here)
  • ERGMs applied to Grey’s Anatomy hook up network example (here)

Getting Started With GERGM

Matthew Denny

2016-09-18

This vignette is designed to introduce you to the GERGM R package. GERGM stands for Generalized Exponential Random Graph Model. This class of models was developed to characterize the structure of networks with real-valued edges. GERGMs represent a generalization of ERGMs, which were developed to model the structure of networks with binary edge values, and many network statistics commonly included in ERGM specifications have identical formulations in the weighted case. The relevant papers detailing the model can be found at the links below:

  • Bruce A. Desmarais, and Skyler J. Cranmer, (2012). “Statistical inference for valued-edge networks: the generalized exponential random graph model”. PloS One. [Available Here]
  • James D. Wilson, Matthew J. Denny, Shankar Bhamidi, Skyler Cranmer, and Bruce Desmarais (2015). “Stochastic Weighted Graphs: Flexible Model Specification and Simulation”. [Available Here]
  • Matthew J. Denny (2016). “The Importance of Generative Models for Assessing Network Structure”. [Available Here]

Installation

The easiest way to do this is to install the package from CRAN via the standard install.packages command:

install.packages("GERGM")

This will take care of some weird compilation issues that can arise, and is the best option for most people. If you want the most current development version of the package, you will need to start by making sure you have Hadley Wickham’s devtools package installed.

If you want to get the latest version from GitHub, start by checking out the Requirements for using C++ code with R section in the following tutorial: Using C++ and R code Together with Rcpp. You will likely need to install either Xcode or Rtools depending on whether you are using a Mac or Windows machine before you can install the GERGM package via GitHub, since it makes use of C++ code to speed up inference. That said, the development version often has additional functionality not found in the CRAN release.

install.packages("devtools")

Now we can install from GitHub using the following line:

devtools::install_github("matthewjdenny/GERGM")

Once the GERGM package is installed, you may access its functionality as you would any other package by calling:

library(GERGM)

If all went well, calling vignette("getting_started") will pull up this vignette!

Basic Usage

We begin by loading in some example network data. In our case, these data are (logged) aggregate public and private lending volumes between 17 large countries from 2005. The data are included in the GERGM package and were used in the Wilson et al. study listed at the beginning of this vignette. In addition to the network (a square matrix), we are also going to load in some node-level covariate data, and a network covariate: the normalized net exports between these countries in 2005. We will make use of these data in fitting our example GERGM model.

The GERGM package provides a plot_network() function, which we can use to visualize the network as follows:

library(GERGM)
set.seed(12345)
data("lending_2005")
data("covariate_data_2005")
data("net_exports_2005")
plot_network(lending_2005) 

Alternatively, if we prefer a white background, and no legend, we can select options for this as well. Typing ?plot_network into the console will pull up a manual for this function.
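As a sketch, such a call might look like the following; the argument names white_background and show_legend are assumptions here, so check ?plot_network for the exact option names in your version of the package:

```r
# Plot on a white background, without the legend
# (white_background / show_legend are assumed argument names; see ?plot_network)
plot_network(lending_2005,
             white_background = TRUE,
             show_legend = FALSE)
```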

Having plotted the raw network data, we can now proceed to model it using the gergm() function. Detailed documentation for this function (along with a large number of advanced options) can be accessed by typing ?gergm into the console. We are going to focus on a simpler version of the application from the Wilson et al. paper, one that highlights creating a formula object with node- and network-level covariates, as well as endogenous (network) effects. While this model will not provide a perfect fit to the data, it serves to illustrate a number of key concepts. If we look at the first couple of rows of the covariate_data_2005 object, we can see that it includes information about each country’s log GDP and whether it was a member of the G8.

head(covariate_data_2005)
##              GDP  log_GDP  G8
## ARG 2.229108e+11 26.13004  No
## AUS 6.933386e+11 27.26478  No
## BRA 8.921068e+11 27.51685  No
## CAN 1.164144e+12 27.78301 Yes
## PRC 2.268594e+12 28.45018  No
## FRA 2.203679e+12 28.42115 Yes

To model this network, we are going to include an edges term, which functions similarly to an intercept term in a regression model and parameterizes the density of the network. We are also going to include sender and receiver effects for a country’s GDP. These effects are designed to capture the effect of having a large economy on the amount of lending and borrowing a country does. We are also going to include a nodemix term to capture the propensity for members and non-members of the G8 to lend to each other, compared to the base case of non-G8 to non-G8 lending. The final covariate effect we are going to include in the model is a netcov, or network covariate, term capturing the effect of the structure of the international trade network on the international lending network. Finally, we are going to include one endogenous statistic in the model, to capture the degree of reciprocal lending in the network. For this endogenous statistic, we are also going to include an exponential down-weight. This means that when the value of the network statistic is calculated, it will then be raised to the power of (in this case) 0.8. This will have the effect of reducing its value, but more importantly of smoothing out statistic values as the GERGM parameter controlling the propensity for mutual dyads in the network varies. Practically, this can make it easier to get starting values for the mutual dyads parameter that are in the right ballpark, aiding in the estimation process. The formula object is defined below:

formula <- lending_2005 ~ edges + mutual(alpha = 0.8) + sender("log_GDP") + 
  receiver("log_GDP") + nodemix("G8", base = "No") + netcov(net_exports_2005) 

Note that the terms used in GERGM formulas are analogous to those used in the ergm package, and are documented in greater detail in the ?gergm help file.
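As a quick base-R illustration of what the exponential down-weight does: raising statistic values to the power alpha = 0.8 compresses large values much more than small ones, which is what smooths out the statistic as the parameter varies. The statistic values below are made up for illustration:

```r
# Hypothetical raw statistic values and their down-weighted versions
stat_values <- c(1, 10, 100, 1000)
alpha <- 0.8
round(stat_values^alpha, 2)
# 1.00   6.31   39.81  251.19
```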

If you are interested in experimenting, try setting alpha = 1 and rerunning the model. You will see lots of error messages, and output indicating that your parameter estimates have zoomed off to infinity. If you are familiar with ERGMs (for binary network data), you may have heard of an issue these models can run into called “degeneracy”, which can make certain models impossible to estimate. In this particular example, as with all GERGM specifications we have tried so far, the GERGM does not seem to suffer from this issue. However, as the experiment described above can attest, GERGMs can still be difficult to estimate. This is primarily due to challenges in getting good starting values for our model parameters. The current implementation of the GERGM software does so using maximum pseudo likelihood (MPLE), which does a pretty good job in many cases. However, in some cases, such as the example here, it can be far enough off the mark that networks simulated from the initial MPLE parameter estimates look very different from the observed network. This can cause the optimizer in R (which is used to update our estimates of the model parameters) to zoom off to infinity.

If this happens to you, do not (immediately) panic! This usually means you are dealing with a tricky network, or a tricky specification (typically one with lots of endogenous statistics included). The first thing to do is try to use alpha weighting. A good rule of thumb is to set alpha = 0.8 for all of the endogenous statistics included in the model. Note that these currently include: out2stars, in2stars, ctriads, mutual, and ttriads (or just twostars and ttriads if your network is undirected). If this does not work, you can try cranking down the weights to around 0.5. If this still does not work, you will need to explore the theta_grid_optimization_list option in the gergm documentation, which should always work if given enough time (although this could be weeks, depending on how complex your model is). A fuller example is provided at the end of this vignette.
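Following that rule of thumb, a down-weighted version of a specification with several endogenous statistics might look like the sketch below. Constructing the formula object itself requires only base R, since the terms are not evaluated until gergm() is called:

```r
# Down-weight every endogenous statistic to alpha = 0.8 (directed network assumed)
formula_dw <- lending_2005 ~ edges + mutual(alpha = 0.8) +
  ttriads(alpha = 0.8) + in2stars(alpha = 0.8) +
  sender("log_GDP") + receiver("log_GDP")
class(formula_dw)
# "formula"
```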

Having addressed the challenges that come with estimating a GERGM model, let’s try an example!

test <- gergm(formula,
              covariate_data = covariate_data_2005,
              number_of_networks_to_simulate = 40000,
              thin = 1/100,
              proposal_variance = 0.05,
              MCMC_burnin = 10000,
              seed = 456,
              convergence_tolerance = 0.5)

The output displayed in this vignette only includes diagnostic plots, and not all of the information that would be spit out by the gergm() function if you were to run this code on your computer. All of that output is meant to help you track the estimation process (which can take days or weeks for larger networks), and diagnose issues with the estimation. Note that if you wish to tweak some of the parameters in the diagnostic and estimate plots, you may do so and regenerate the plots after estimation is complete using the following functions:

# Generate Estimate Plot
Estimate_Plot(test)
# Generate GOF Plot
GOF(test)
# Generate Trace Plot
Trace_Plot(test)

In particular, we might want to make a nicer looking estimate plot. We can do this using the following block of code, where we leave out the intercept estimate, and provide a custom list of parameter names to produce a publication quality plot:

Estimate_Plot(test,
              coefficients_to_plot = "both",
              coefficient_names = c("Mutual Dyads",
                                    "log(GDP) Sender",
                                    "log(GDP) Receiver",
                                    "Non-G8 Sender, G8 Receiver",
                                    "G8 Sender, Non-G8 Receiver",
                                    "G8 Sender, G8 Receiver",
                                    "intercept",
                                    "Normalized Net Exports",
                                    "Dispersion Parameter"),
              leave_out_coefficients = "intercept")

In order to verify the claim made earlier in this vignette that the current model is not degenerate, just hard to fit, we can generate a hysteresis plot for this model using the hysteresis() function. This function simulates large numbers of networks at parameter values around the estimated parameter values and plots the mean network density at each of these values to examine whether the model becomes degenerate due to small deviations in the parameter estimates. See the following reference for details:

  • Snijders, Tom AB, et al. “New specifications for exponential random graph models.” Sociological methodology 36.1 (2006): 99-153.

So long as we see a smooth upward sloping series of points, we have strong evidence that the specification is not degenerate.

# Generate Hysteresis plots for all structural parameter estimates
hysteresis_results <- hysteresis(test,
                                 networks_to_simulate = 1000,
                                 burnin = 300,
                                 range = 8,
                                 steps = 20,
                                 simulation_method = "Metropolis",
                                 proposal_variance = 0.05)

As we can see this specification does not display signs of degeneracy, even though we needed to use exponential down-weighting in order to fit the model.

Edge Prediction

Following on from the example above, we can also predict individual edge values, conditioning on the rest of the observed edges and estimated parameters. We can then calculate the mean edgewise mean squared error (MSE) for these predictions, and compare it against the MSE from a null model with no parameters included. First we generate the conditional edge predictions:

test2 <- conditional_edge_prediction(
  GERGM_Object = test,
  number_of_networks_to_simulate = 100,
  thin = 1,
  proposal_variance = 0.05,
  MCMC_burnin = 100,
  seed = 123)

Next we can calculate the MSE of these predictions and compare it to the null model predictions.

MSE_results <- conditional_edge_prediction_MSE(test2)
## Mean MSE for Predicted Edge Values: 8099.79 
## Mean MSE for Max Ent Predicted Edge Values: 42927.12 
## This represents a 81.13 percent reduction in the average edgewise MSE when using the GERGM model.

As we can see, this model does significantly better in terms of conditional edgewise predictive performance than the null model.
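The reported percent reduction is just the relative difference between the two mean MSEs; the base-R arithmetic below reproduces the figure from the output above:

```r
# Percent reduction in average edgewise MSE, GERGM vs. null (max ent) model
mse_gergm  <- 8099.79
mse_maxent <- 42927.12
round((1 - mse_gergm / mse_maxent) * 100, 2)
# 81.13
```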

Bonus: A More Complex Model

Here we have included code to run the full model which appears in the Wilson et al. paper. It requires a 30-core machine to run as currently specified, and can take several days to weeks to run, depending on your computer setup. We include this more complex specification to highlight the flexibility the GERGM package gives users to deal with more difficult-to-model data. These more advanced features are covered in the ?gergm documentation, and will be the subject of future vignettes.

formula <- net ~  mutual(0.8) + ttriads(0.8) + out2stars(0.8) + 
  sender("log_GDP") + netcov(net_exports) + 
  receiver("log_GDP") + nodemix("G8", base = "No")


result <- gergm(formula,
              covariate_data = covariate_data_2005,
              number_of_networks_to_simulate = 400000,
              thin = 1/100,
              proposal_variance = 0.05,
              MCMC_burnin = 200000,
              seed = 456,
              convergence_tolerance = 0.8,
              hyperparameter_optimization = TRUE,
              target_accept_rate = 0.25,
              weighted_MPLE = TRUE,
              theta_grid_optimization_list = list(grid_steps = 2,
                                                  step_size = 0.1,
                                                  cores = 30,
                                                  iteration_fraction = 1))

Introduction to using R for Webmaps and Spatial Data Analysis

R Spatial Vignette

This is an R vignette to introduce spatial data analysis. Spatial data comes in, and can be stored in, many formats. The first part of the vignette introduces how spatial data can be visualized on web-based platforms through the Google Visualisation API, covering the use of basemaps, selecting areas, and plotting spatial data onto a web map.

The next part of the vignette gives an overview of how to use R to load vector geographic data, (1) points and (2) polygons, into a spatial data frame, then analyze it in R and display it on a webmap.

Section 1: Create a map from R for web-based platforms

R spatial packages for basemaps and webmaps

library(sp)  # classes for spatial data
suppressPackageStartupMessages(library(rgeos)) # needed for maptools
suppressPackageStartupMessages(library(maptools)) # has spatial data
library(raster) #needed for dismo
library(dismo) # retrieving base maps from Google with gmap function
library(RgoogleMaps) # web based map abilities through Google server
suppressPackageStartupMessages(library(googleVis)) # interactive web based maps with data frames
## Creating a generic function for 'toJSON' from package 'jsonlite' in package 'googleVis'

Create a basemap in R; zoom/select areas and plot.

# BasemapJapan <- gmap("Japan")
# plot(BasemapJapan)

Change the basemap type; the options are ‘roadmap’, ‘satellite’, ‘hybrid’, and ‘terrain’:

# SatJapan <- gmap("Japan", type = "satellite", exp = 1, filename = "Japan.gmap")
# plot(SatJapan)

Manually choose a spatial area of interest:

# select.area <- drawExtent()
# SelectArea <- gmap(select.area)
# plot(SelectArea)

Visualise data in a web browser using Google Visualisation API.

Note: gvisGeoMap needs Chrome set as default browser, example from R Documentation.

# WorldPopulation=data.frame(Country=Population$Country,
#         Population.in.millions=round(Population$Population/1e6,0),
#         Rank=paste(Population$Country, "Rank:", Population$Rank))
# PopMap <- gvisGeoMap(WorldPopulation, "Country", "Population.in.millions", "Rank",
#         options=list(dataMode="regions", width=600, height=300))
# plot(PopMap)

Section 2: Load spatial data vectors into data frames and visualize as a map

Load R packages for spatial vectors and define projections

library(sp) # classes and methods for spatial data: points, lines, polygons and grids
suppressPackageStartupMessages(library(rgdal)) # GIS functionality

CRS.WGS84 = CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
CRS.UTM   = CRS("+proj=utm +zone=54 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0")
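With these two CRS objects defined, reprojecting coordinates is a single spTransform() call. Here is a minimal sketch using one made-up point in Japan (this assumes the sp and rgdal packages are installed and loaded as above):

```r
# Reproject a single lon/lat point (WGS84) to UTM zone 54 (units: metres)
pt     <- SpatialPoints(cbind(140.0, 37.5), proj4string = CRS.WGS84)
pt_utm <- spTransform(pt, CRS.UTM)
coordinates(pt_utm)  # easting/northing in metres
```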

Load spatial points to map and plot radiation data onto webmap

# safecast.file   = "data/Safecast-2011.Rdata"
# load(safecast.file)
# sc.full <- sc.full[sample(1:nrow(sc.full), 2e4, replace=FALSE),]

Take time slice and set values

# sc.time          = as.Date(sc.full$`Captured Time`)
# begin.date       = "2011-05-01"
# decay.date       = "2011-08-01"
# timeFrame = sc.time > as.Date(begin.date) & sc.time < decay.date
# sc   = sc.full[timeFrame,]  # only keep the data for the valid times
# time = sc.time[timeFrame]  # recall to only keep the times the 'valid' dates
# 
# 
# safe.coords     = cbind(sc$Longitude,sc$Latitude)
# cpm             = sc$Value
# safe.spdf       = SpatialPoints(coords = safe.coords, proj4string=CRS.WGS84)
# 
# safe.data = data.frame(time=time, rad = cpm, radLog = log(cpm) )

Create a new spatial data frame with the spatial/temporal subset

# safe.in.spdf = SpatialPointsDataFrame(coords = safe.coords,
#                                       data = safe.data,
#                                     proj4string=CRS.WGS84)
# 
# plot(safe.in.spdf,cex=.1, pch=3)
# 
# writeOGR(safe.in.spdf, ".", paste(basename(safecast.file),"_",decay.date,sep=""), driver="ESRI Shapefile",overwrite_layer=T)


# fn = "rad"
# 
# safe.in.spdf.utm         = spTransform( safe.in.spdf, CRS.UTM )
# safe.in.raster.utm       = raster(safe.in.spdf.utm, resolution=c(50,50))
# safe.in.raster.utm.max   = rasterize(safe.in.spdf.utm, safe.in.raster.utm, field=fn, fun=max)
# 
# safe.in.raster           = raster(safe.in.spdf, resolution=c(50,50))
# safe.in.raster.max       = projectRaster(safe.in.raster.utm.max, safe.in.raster)

Plot radiation values to web

# scWeb <- sc[sample(1:nrow(sc), 50, replace=FALSE),]
# 
# scWeb$locationvar = paste(scWeb$Latitude, scWeb$Longitude, sep = ":")
#  
# Webmap <- gvisMap(scWeb, locationvar = "locationvar", "Value", 
#                 options=list(showTip=TRUE, showLine=F, enableScrollWheel=TRUE,  mapType='satellite', useMapTypeControl=TRUE, width=400,height=800))
# plot(Webmap)

R spatial data mapping for polygons

Load packages for choropleth mapping of polygon areas

suppressPackageStartupMessages(library(choroplethr))
library(choroplethrAdmin1)
library(ggplot2)
suppressPackageStartupMessages(library(data.table))

Load administrative polygon data

data(admin1.map)
japan.map = get_admin1_map("japan")

Build polygon based map

ggplot(japan.map, aes(long, lat, group=group)) + 
  geom_polygon() + ggtitle("Japan")

Plot population data into map

data(df_japan_census)
df_japan_census$value=df_japan_census$pop_density_km2_2010
PopJapan = admin1_choropleth("japan", df_japan_census, 
          num_colors = 5, title = '2010 Population Density of Japan', 
          legend = "Population Estimate")
plot(PopJapan)

This vignette serves as an overview of how R can be used for spatial data. Of course, the R documentation is useful for learning how to use these packages, and many tutorials are available, such as the following recommended resources:

  • “Introduction to visualising spatial data in R”, a reference for how spatial data are loaded into R and visualised in popular packages such as ggplot2: https://cran.r-project.org/doc/contrib/intro-spatial-rl.pdf
  • “The leaflet package for online mapping in R”, on Leaflet, an R interface to JavaScript mapping: http://robinlovelace.net/r/2015/02/01/leaflet-r-package.html
  • A cheatsheet for data visualization in ggplot2: https://alyssasfu.files.wordpress.com/2015/04/screen-shot-2015-04-01-at-10-15-47-pm.png
  • The CRAN Spatial Task View, an excellent overview of the available R spatial packages and their uses: https://cran.r-project.org/web/views/Spatial.html

This is the end of the R Spatial Vignette.