Finding artefacts in archived data

Introduction

The official Search for Extraterrestrial Intelligence, or SETI, has been on-going for nearly 60 years. This search, which began with intentional signals at radio frequencies and eventually expanded to other wavelengths, has been far from exhaustive, as a signal could occur anywhere, on any frequency, at any time. In addition, SETI undertaken assuming a beacon, an intentional signal, assumes that whoever is out there is actively attempting to get our attention. This assumption, combined with the fact that light takes time to travel and mankind has only been technologically capable of receiving signals for a short period of time (compared to the lifetime of the galaxy), is sometimes seen as a downside to SETI, and sometimes even as a reason not to do SETI.

One way to resolve (or at least mitigate) both the assumption above and the timing problem is artefact SETI. Artefact SETI, sometimes known as Dysonian SETI (see Bradbury et al. 2011), searches for traces of a civilization that are not intentional beacons to outsiders; rather, these traces could be technological products or artefacts. Assuming an alien species to be roughly human-like in motivation and reasoning, Davies & Wagner 2013 listed and expanded on four classes of artefacts that could be detected: messages, scientific instruments, geo-engineering structures, and trash.

To this list, we would like to add a fifth category: structures, such as buildings or roads, that exist during the lifetime of the civilization and for some time after it. Such structures are often lost and re-discovered here on Earth, centuries after the civilization responsible for them has fallen. These structures, as well as the other artefact classes described by Davies & Wagner 2013, might now be visible in the high resolution data we have of nearby terrestrial bodies.

On Earth, archaeologists have re-discovered numerous “lost cities” that had been abandoned and then buried in sand, covered in soil, or obscured by foliage. Many of these discoveries, including cities in Mexico, Egypt, Guatemala, and Cambodia, Roman farms from 400 AD, and Roman roads from 43 AD, occurred within the past few years. This is because the field of archaeology is moving beyond traditional on-foot excavations to the use of LIDAR. LIDAR, which stands for Light Detection and Ranging, is a remote sensing method that uses a pulsed laser to measure variable distances to the Earth, typically at visible or near-infrared wavelengths. Around 2009, many countries started initiatives to map all of their land using LIDAR, for a variety of reasons, resulting in these numerous and recent discoveries.

Unfortunately, the field of archaeology differs from astronomy in that discoveries are often not published in journals or conference proceedings; instead, they are sold as stories or documentaries. Because of this, neither the data from the LIDAR surveys nor the methods and algorithms used for the discoveries are typically available to the public.

The few studies that have been published are not completely comparable to a search for ET on another planet. One reason is that we do not have LIDAR data for any planet other than Earth. LIDAR data, although similar in character to RADAR data, are of much higher resolution, allowing for the detection of finer or less visible structures. Moreover, archaeological searches on Earth typically knew where to look for these cities: either buildings were already visible in the area or some text described the location of the city (e.g., Comer & Blom 2006, Rowlands & Sarris 2007, Iorio et al. 2008, Corrie 2011). Archaeological searches on Earth also benefit from knowing what to look for. Even if archaeologists do not have the exact layouts of the cities, as some archaeological LIDAR searches do (e.g., Iorio et al. 2008), mankind has built the same types of structures (buildings, roads, farm land) in similar manners and sizes across the continents and across the centuries. ET, by contrast, might build only round structures, might prefer tunnels to roads, or might choose to build only subsurface structures.

Methods

Given the lack of available publications and open source code, we were unable to apply any of the techniques used in structure detection on Earth to our data. Instead, we search the data numerically for lines: we apply a Hough transform (Hough 1959) to find lines, then overplot them on the data for visual inspection. Before a Hough transform can be performed, the image must first undergo edge detection so that the result is a binary image. Below, we describe the edge detection algorithm used, followed by the Hough transform.

Canny Edge Detection

One multi-step algorithm that has been used frequently over the past few decades is the Canny Edge Detection algorithm (Canny 1987). The Canny algorithm is computationally inexpensive and fast; it takes a greyscale image and returns a binary image, with edges in white. The algorithm contains multiple steps, each of which we describe below.

Initially, we apply a Gaussian filter to the image in order to smooth it, reduce the noise, and remove unwanted detail and texture. Once the data have been smoothed, we calculate the intensity gradient of the image and suppress pixels that do not correspond to a local maximum. Each pixel above a predefined threshold is evaluated to see whether its gradient magnitude is greater than that of its neighbors along the gradient direction; if it is, the value remains unchanged, and if not, the value is set to zero.

The suppressed gradient image is then thresholded with two different values, T1 and T2 with T2 > T1, to obtain two binary images. The higher-threshold image has far less noise and fewer false edges but contains larger gaps between edge segments. These gaps are bridged using coinciding edges in the lower-threshold image; once the gaps are bridged, the higher-threshold image is the final binary edge map.

We use the function Canny in the python module OpenCV (Bradski 2000) to perform our edge detection.
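
As a concrete illustration, this step might look as follows in python (the filename and threshold values are placeholders; in practice the thresholds must be tuned per image):

```python
import cv2

# Read an image as greyscale; "example.png" is a placeholder filename.
image = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)

# Step 1: Gaussian smoothing to reduce noise (kernel size is illustrative).
blurred = cv2.GaussianBlur(image, (5, 5), 0)

# The gradient computation, non-maximum suppression, and hysteresis
# thresholding are all performed inside cv2.Canny; the two arguments
# are the hysteresis thresholds T1 and T2 (illustrative values).
edges = cv2.Canny(blurred, 50, 150)

# The result is a binary image with edges in white.
cv2.imwrite("edges.png", edges)
```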

Hough Transform

To isolate lines in our data, we perform a Hough Transform on the data, an algorithm originally described by Hough 1959. When a Hough Transform is performed on an image, each point in image space is analyzed by considering all lines that could pass through that point at a discrete set of angles. For each such line, the closest distance between the origin and the line is computed and discretized. These discrete sets of angles and distances correspond to a discrete grid in the (angle, distance) space known as Hough space. Each grid cell is called a Hough accumulator. Every time a line in image space is considered, the corresponding accumulator in Hough space is incremented. After all lines through all data points have been considered, the accumulators with the highest values, often called votes, likely correspond to lines.
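
To make the voting procedure concrete, the following minimal numpy sketch (our own illustration; the analysis itself uses the OpenCV implementation described below) builds a Hough accumulator from a binary edge image:

```python
import numpy as np

def hough_accumulator(binary, n_theta=180):
    """Build a Hough accumulator from a binary edge image.

    Each edge pixel votes for every discretized (angle, distance) pair
    describing a line that could pass through it; peaks in the
    accumulator then correspond to likely lines.
    """
    h, w = binary.shape
    diag = int(np.ceil(np.hypot(h, w)))        # largest possible distance
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    accumulator = np.zeros((2 * diag + 1, n_theta), dtype=np.int64)

    ys, xs = np.nonzero(binary)                # edge pixel coordinates
    for x, y in zip(xs, ys):
        # distance = x*cos(theta) + y*sin(theta); shift by diag so the
        # (possibly negative) distance maps to a valid row index.
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rhos + diag, np.arange(n_theta)] += 1
    return accumulator, thetas
```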

Once Hough space is fully explored, we apply a threshold on the number of votes required for a point in Hough space to be considered a line. We also apply a threshold on the shortest line allowed, and a maximum number of pixels permitted between line segments for them to be considered a single line.

Below, we show an example of a few values of distance and angle applied to an image containing three discrete points:

The top panels show the six lines chosen to go through each point; the middle panels show the angle and distance corresponding to each of the six lines; and the bottom panel shows the corresponding curves in Hough space, one for each point in image space. Since the three points lie along a single line, the curves intersect at one point in Hough space.

Below, we show an image of two lines, and the corresponding distribution in Hough space:

We use the function HoughLinesP in the python module OpenCV (Bradski 2000) to perform our Hough line detection. Once a line is detected, it is plotted over the original image and followed up with visual inspection.
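
A minimal sketch of this step, continuing from the edge image above (all parameter values are illustrative; in our tests they were fine tuned per image):

```python
import cv2
import numpy as np

# Binary edge image produced by the Canny step above.
edges = cv2.imread("edges.png", cv2.IMREAD_GRAYSCALE)

# Probabilistic Hough transform. rho and theta set the resolution of
# the Hough-space grid; threshold is the minimum number of votes; and
# minLineLength and maxLineGap implement the two line-acceptance
# thresholds described above.
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                        minLineLength=50, maxLineGap=10)

# Overplot each detected line segment on the original image for
# visual inspection.
overlay = cv2.imread("example.png")
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(overlay, (x1, y1), (x2, y2), (0, 0, 255), 2)
cv2.imwrite("overlay.png", overlay)
```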

Data

We perform the analysis described above on a variety of images to test the algorithms and corresponding code. We do this as a proof of concept for this study, but will later apply the algorithms and subsequent analysis to all data from HiRISE, the High Resolution Imaging Science Experiment. HiRISE is a camera aboard the Mars Reconnaissance Orbiter, which launched in 2005, and is to this day the most powerful camera to have orbited another planet, with a resolution of up to 30 centimeters per pixel.

We pulled our first test images from the internet, available under a Creative Commons license, choosing images with visually obvious lines as a first step to verify the code. After these tests, we tested the algorithm on LIDAR images of Earth, which are of higher resolution than RADAR data from other planets yet closer in style and resolution to such data than the previous test images. We then tested the algorithm on a few randomly selected HiRISE images.

Results

Below, we describe the data used to test the algorithm and the results from each image, split into three categories: everyday images, Earth LIDAR data, and Mars HiRISE data.

Everyday Images

We test the algorithm described above on a variety of images taken from the internet under a Creative Commons license. In the three figures below, we show the performance of the algorithm on increasingly complex and difficult images, with the original image in the left or top panel and the overplotted image in the right or bottom panel. We dedicate some discussion to each image below.

We test the algorithm on a simple logic grid:

The image consists of only black and white, and contains only straight lines. Given the simplicity of the image, the algorithm was able to identify every line, even the thicker ones. This image required no fine tuning of the parameters.

We test the algorithm on a grid that has been altered in such a way that “waves” have been introduced into the data:

The algorithm picked up a large fraction of the straight lines, although a few were missed. Since the image is purely black and white, this oversight cannot be due to the edge detection and must stem from the Hough Transform step. This image required no fine tuning of the parameters, although decreasing the threshold required for a line to be kept led to spurious “lines” being accepted. We therefore set the threshold at the level where no spurious lines were kept.

We test the algorithm on an aerial image of farm fields in India:

These fields are laid out almost on a grid, and the image is in color, providing a good opportunity to test on an image converted to greyscale. This image required a fair amount of fine tuning of all of the parameters, in particular the vote threshold, the spacing of distances in Hough space, and the two thresholds used for edge detection. We fine tuned until the majority of the field boundaries had been identified while trying to minimize spurious “lines.”

One interesting thing to note about this test was how the algorithm handled buildings. Instead of identifying the edges of the buildings, the algorithm chose lines with a wide variety of distance and angle values. Attempting to eliminate these lines caused the genuine lines separating the fields to go undetected. A metric of how well the algorithm performed, such as the percentage of lines correctly identified or the ratio of true positives to false positives, would be valuable in a study such as this, but is currently beyond the scope of this work.

Earth

We test the algorithm described above on a variety of LIDAR images of Earth from archaeological discoveries. In the three following figures, we show the performance of the algorithm on images more difficult than the everyday test images, with the original image in the left panel and the overplotted image in the right panel. We dedicate some discussion to each image below.

We test the algorithm on a LIDAR image of fossilized field systems in Cambridgeshire dating from 410 AD:

Just as for the fields in India, we tested this image for its grid structure. The image was already greyscale, but the contrast between lines and background was greatly reduced, leading to a reduction in the number of lines detected. Detection could possibly be improved by artificially increasing the contrast of the image or by using a different edge detection algorithm. A few of the longer and more obvious roads were identified, and the algorithm labeled building foundations the same way it had in the previous test, choosing seemingly random lines on top of the buildings instead of the building edges.

We test the algorithm on a LIDAR image of Mahendraparvata, a 1200 year old lost city in Cambodia discovered via LIDAR beneath the heavy foliage covering the ruins:

This image shows clear roads (lines), has some of the sites labeled with lines pointing to them, and includes a legend for scale. Interestingly, although the algorithm identified a fair fraction of the roads present, it did not identify any of the label lines or the legend lines.

We test the algorithm on a LIDAR image of medieval coal pits in Staffordshire, UK:

Although a few roads are present, the main features in this image are the coal pits, which look like very large and tidy ant hills. The algorithm was able to identify a fair fraction of the roads, as well as mark some of the coal pits, a promising result for geo-engineering structures.

Mars

We test the algorithm described above on a variety of images of Mars taken by the HiRISE camera. In the figures below, we show the performance of the algorithm on images from the collection to which it will eventually be applied, with the original image in the left panel and the overplotted image in the right panel. The parameters were fine tuned for the first image such that some of the image border was detected; the same parameters were then applied to each subsequent image.

We visually analyzed each line in each image, isolating the line and comparing it to the raw data. Every line corresponded to a crater rim, folded or fractured crust, a shadow of a large rocky feature, or spotting in the regolith. We did not find evidence of extraterrestrial artefacts.

Where are the lines?

As mentioned throughout the analysis sections above, many very obvious lines were missed by the algorithm, meaning that a human search would have been more fruitful (though more time consuming). Since the aim of this study is an automated method of line detection that requires very little human input, this problem of missed lines needs to be identified and corrected.

It is possible that the contrast between the lines and the surrounding background is so low that the edges that would later correspond to lines are simply not being detected. The edge detection algorithm used has multiple steps, each of which could be modified to potentially improve the result. For example, the first step, applying a Gaussian filter to minimize noise, could be skipped altogether. The edge detection algorithm could also be replaced with another, possibly one that performs better in low-contrast situations. Another way around this potential problem would be to artificially increase the contrast of an image beforehand.
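
One such option is adaptive histogram equalization; a minimal sketch using OpenCV's CLAHE follows (an approach we have not tested here; the filename and parameter values are illustrative):

```python
import cv2

# Placeholder filename for a low-contrast greyscale image.
image = cv2.imread("lidar_image.png", cv2.IMREAD_GRAYSCALE)

# Contrast Limited Adaptive Histogram Equalization boosts local
# contrast while limiting noise amplification; clipLimit and
# tileGridSize would need tuning.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(image)

# Edge detection on the contrast-enhanced image.
edges = cv2.Canny(cv2.GaussianBlur(enhanced, (5, 5), 0), 50, 150)
```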

It is also possible that so many lines were missed because of the way we applied the Hough Transform. Instead of applying the transform to every point, we use a variant called the Probabilistic Hough Transform, which transforms a random subset of points rather than the entire image. In theory, this yields the same Hough space as a full Hough Transform, just with a decreased number of votes in each accumulator, meaning the same lines should still be detected as long as the vote threshold is lowered appropriately. Given that each test image in this paper was fine tuned over a variety of thresholds, we do not believe this to be the reason for our missed lines. However, it still warrants additional study.

Another problem we faced in this testing was the high level of fine tuning required to pull out lines. Although this might be alleviated if all lines are detected after implementing some of the changes suggested above, it is still worth mentioning as a problem. The aim of this study is to find artefacts in a way that does not require much human time (nor computing time), and this is not currently being achieved. A possible remedy would be to artificially add a line, at a reasonable and recoverable resolution and size, into an example image from a dataset, and find a set of parameters that can recover that line. This set of parameters could then be applied to all images in the dataset, requiring fine tuning of only one image.
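
A minimal sketch of this injection-and-recovery test (the filename, coordinates, line brightness, and detection parameters are all illustrative assumptions):

```python
import cv2
import numpy as np

# Inject a synthetic line into a sample image from the dataset.
image = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)
injected = image.copy()
cv2.line(injected, (100, 120), (400, 300), color=200, thickness=3)

# Run the same detection pipeline with one candidate parameter set.
edges = cv2.Canny(cv2.GaussianBlur(injected, (5, 5), 0), 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=50, maxLineGap=10)

# Crude recovery check: does any detected segment end near both
# endpoints of the injected line? (HoughLinesP may return sub-segments,
# so a production version would need a more forgiving match.)
def near(p, q, tol=10):
    return np.hypot(p[0] - q[0], p[1] - q[1]) < tol

recovered = lines is not None and any(
    (near((x1, y1), (100, 120)) and near((x2, y2), (400, 300))) or
    (near((x1, y1), (400, 300)) and near((x2, y2), (100, 120)))
    for x1, y1, x2, y2 in lines[:, 0])
print("injected line recovered:", recovered)
```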

Another way to reduce the amount of fine tuning and human time required would be to implement some variant of machine learning. A training set of Earth LIDAR images could be constructed and used to score each detected line as a real or false detection. Constructing the training set would require a fair amount of work initially, but once it was built, the level of fine tuning required could potentially be decreased.

Another aspect of this process that currently takes too much time is the vetting of images. Since the detected lines were plotted over the original data, we had to visually inspect each image ourselves, identifying what lay under each line and its origin. Although not too bad for the subset of images tested herein, this is far too time-consuming for full datasets. A possible solution would be to create some metric of image importance, so that only the most probable candidates would need to be visually inspected. This metric could be based on the performance of the algorithm on the training set or on the data with the line artificially added in, possibly including the percentage of lines correctly identified or the ratio of true positives to false positives.

The metric could also take into account the distribution of lines, their lengths, the number of lines found, and the location of lines. If the lines are clustered in one area, this could be indicative of a building. If the lines are short, corresponding to a few kilometers at most, this could be indicative of roads rather than larger-scale geological features. If a large number of lines are found, this could mean that the image was simply too noisy for real lines to be discovered. And if the lines lie only around the edges of the image itself, as they do in the HiRISE data, then the image would not be worthy of visual inspection.
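
As an illustration, a hypothetical vetting function might compute simple summary statistics over the detected lines (a sketch of the idea only, not something we have implemented):

```python
import numpy as np

def line_statistics(lines, image_shape, border_frac=0.05):
    # `lines` is the (N, 1, 4) array returned by cv2.HoughLinesP;
    # `image_shape` is the (height, width) of the image.
    if lines is None:
        return {"n_lines": 0, "median_length": 0.0, "border_fraction": 0.0}
    segs = lines[:, 0, :].astype(float)  # columns: x1, y1, x2, y2
    lengths = np.hypot(segs[:, 2] - segs[:, 0], segs[:, 3] - segs[:, 1])

    # Fraction of lines lying entirely within a narrow border margin,
    # as in the HiRISE images where only the frame edges were detected.
    h, w = image_shape
    mx, my = border_frac * w, border_frac * h
    xs, ys = segs[:, [0, 2]], segs[:, [1, 3]]
    near_border = ((xs < mx).all(axis=1) | (xs > w - mx).all(axis=1) |
                   (ys < my).all(axis=1) | (ys > h - my).all(axis=1))

    return {"n_lines": len(segs),
            "median_length": float(np.median(lengths)),
            "border_fraction": float(near_border.mean())}
```

Images whose lines are very numerous, very short, or confined to the border could then be deprioritized for visual inspection.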

We applied the algorithm described above to LIDAR data of Earth, and motivate its use on RADAR data, since such observations can penetrate foliage and sand on Earth and potentially regolith on other bodies. However, we also have large amounts of high resolution visible data of Solar System bodies, such as the HiRISE images analyzed here. Although visible data cannot reveal buried structures, they still warrant searching, as they could unveil evidence of a current or past ET civilization.

Conclusions and Future work

The Search for Extraterrestrial Intelligence has only recently begun to include artefact SETI, the search through (often archival) data for structures left by other lifeforms. These artefacts could exist for a variety of reasons, and might be visible in recent high resolution imaging and RADAR data. Artefact SETI alleviates the temporal problem of requiring another civilization to exist at the same time as us, and eliminates the possibly incorrect assumption of traditional SETI that ET wants to communicate with us. RADAR specifically allows us to see possible structures that have persisted for years, even if they are slightly buried in regolith.

Using a combination of Canny Edge Detection and Hough Transforms, we motivate a search through HiRISE data of Mars, looking for lines in the data that might result from cities, buildings, roads, or other structures. We test our algorithms on a variety of data, starting with simple images of grids, moving on to LIDAR data of Earth, and finishing with a few HiRISE images. Many of these tests uncovered potential difficulties with the algorithm, particularly its failure to detect obvious lines. We discuss a few possible causes for this, as well as a few solutions that could reduce and possibly eliminate the issue.

Once the algorithm detects all lines in test images, or at least most of them, we will run it on all HiRISE data, after we acquire said data, of course. We also plan to run it on other remote sensing data available for the Solar System, such as RADAR data of Venus and the Moon, and possibly on other sets of high resolution data, even those in the visible. Although visible data will not reveal artefacts that have been covered with regolith, they should still be explored.