Florida Institute for Human & Machine Cognition, Pensacola, FL
This summer, I spent nine weeks working at the Florida Institute for Human and Machine Cognition (IHMC) in Pensacola, Florida. I worked as an intern under Larry Bunch and Michael Vignati, the researchers who head IHMC's Cybersecurity team, and I also worked closely with another ASIP intern, Quan Do ’19. I lived in off-campus apartment-style housing at the nearby University of West Florida.
The mission of IHMC’s Cybersecurity team is to develop a biologically inspired security infrastructure that enforces and maintains security policies, configurations, and application integrity in networked tactical systems. This approach combines previous research in Artificial Immune Systems with new theories and techniques involving Danger Theory, autonomous damage detection, and context-dependent reputation systems. The CyberLab’s research in Security and Information Assurance also includes the protection of the nation’s critical infrastructure and cyber infrastructure. For that domain, IHMC is creating distributed security frameworks that use intelligent mobile agents as permanent roaming security guards for plant and network monitoring and protection.
One of the IHMC CyberLab’s main projects is the Observatory, a data visualization system designed for large, time-sensitive data sets that have a flow component. For example, one configuration had been set up to display IHMC’s network data. An outline of a world map is projected against a solid black background, and square “nodes” labeled with different categories of network traffic are constructed in the space in front of it. As time progresses, items flow from the nodes to geographic locations on the map outline, where each item is a clickable square. Clicking an item displays information about one specific network element, such as its website address, IP address, and other data. The Observatory can display many different data sets in this way; another configuration illustrates a multi-year collection of loan data from around the United States, with loans categorized by purpose and payoff status (fully paid, partially paid, and unpaid).
With the Observatory’s purpose introduced, an explanation of its software architecture is necessary to understand my work from this summer. The architecture consists of two distinct pieces of software that concern us here. The first is a collection of “data-source” files, and the second is a collection of “configuration” files. The data-source extracts and collects raw data from databases or web sources and organizes it into a format the Observatory expects. The configuration files then specify what objects are displayed in the visualization and how, as well as which pieces of data from the data-source correspond to which objects.
In other words, the data-source and configuration files perform what is known in computing as the Extract, Transform, Load (ETL) cycle. In the extraction step, data is pulled from homogeneous or heterogeneous sources; in the transformation step, it is reshaped into the proper format or structure for querying and analysis; and in the loading step, it is written to the final target, typically an operational data store, data mart, or data warehouse. In our case, the final target is the Observatory’s visualizations themselves.
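To make the cycle concrete, the sketch below walks through a toy version of the three stages in Python. It is only an illustration under assumed names: the record fields, the transformed schema, and the output file "observatory_feed.json" are hypothetical stand-ins, not the Observatory's actual data-source or configuration format.

```python
# A minimal ETL sketch illustrating the three stages described above.
# Field names and the output file are hypothetical stand-ins.
import json

def extract():
    """Extract: pull raw records from a database or web source.
    Here, two inline demo records stand in for that source."""
    return [
        {"hash": "abc123", "total": 125000, "received": "2017-07-01T12:00:00Z"},
        {"hash": "def456", "total": 98000, "received": "2017-07-01T12:05:00Z"},
    ]

def transform(raw_records):
    """Transform: keep only the fields the visualization needs,
    renamed into one consistent schema."""
    return [
        {"id": r["hash"], "value": r["total"], "timestamp": r["received"]}
        for r in raw_records
    ]

def load(records, path):
    """Load: write the cleaned records to the file the final target
    (here, the visualization's input feed) reads from."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    load(transform(extract()), "observatory_feed.json")
```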
My project was an addition to the Observatory: a new data-source and corresponding configuration that performed the ETL cycle on Bitcoin transaction data. The ultimate goal of this data manipulation was to de-anonymize the Bitcoin system by identifying individual Bitcoin users, or “wallets,” assigning arbitrary identities to them, and attaching a rough geolocation to each. In short, my project for the summer was to extract, transform, and load Bitcoin transaction data in order to de-anonymize Bitcoin, a decentralized and supposedly anonymous electronic currency system.
To understand precisely how our data-source was able to identify otherwise anonymous Bitcoin users, one must first acquire a solid grasp of how Bitcoin itself works.
Bitcoin is a cryptocurrency. It is completely electronic and decentralized, so there is no central bank or single administrator. What is specifically important for understanding our project is the structure of a single transaction. A transaction consists of two sets of “addresses”: the input addresses and the output addresses. The input addresses are combined to make up the amount of Bitcoin to be sent, while the output addresses each receive a portion of that total. Each input address is actually a reference to an output of a previously made transaction. For example, if I want to send eight Bitcoins to some address “Address1,” I must use some number of previously received Bitcoins as the input of this transaction (call it “T0”). Perhaps at some point in the past, in some transaction “T1,” I received six Bitcoins; “T1” therefore has an output, the address of which belongs to me, containing six Bitcoins. Perhaps at some other point in the past, in some transaction “T2,” I received two Bitcoins. Considering both of these transactions, I have two outputs that belong to me, one of six Bitcoins and one of two. To send eight Bitcoins to “Address1,” I must use these two outputs as the input of “T0,” so the inputs field of “T0” will contain two references, one to the output of “T1” and one to the output of “T2.”
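The sketch below restates the “T0”/“T1”/“T2” example as simple Python dictionaries. It is a deliberate simplification: real transactions carry scripts, signatures, and amounts in satoshis, and the address names (“MyAddressA,” “MyAddressB”) are invented for illustration. The point is structural: an input holds no value of its own, only a reference to an earlier output.

```python
# Simplified picture of the T0/T1/T2 example above, keeping only the
# structure that matters here: inputs are references to earlier outputs.

# Two earlier transactions in which "my" addresses received coins.
T1 = {"id": "T1", "outputs": [{"address": "MyAddressA", "amount": 6}]}
T2 = {"id": "T2", "outputs": [{"address": "MyAddressB", "amount": 2}]}

# The new transaction spends those two outputs to pay 8 Bitcoins to Address1.
T0 = {
    "id": "T0",
    # Each input names a previous transaction and the index of the output it
    # spends, rather than holding any value of its own.
    "inputs": [
        {"prev_tx": "T1", "output_index": 0},
        {"prev_tx": "T2", "output_index": 0},
    ],
    "outputs": [{"address": "Address1", "amount": 8}],
}
```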
Using a popular heuristic called “address clustering,” we were able to assign arbitrary identities to individual Bitcoin users. The principle behind address clustering is that in any transaction, the input addresses used must all belong to the same user. Because an address can appear as an input in multiple transactions (provided it still has a Bitcoin balance), this principle can, in theory, be used to compile all of the addresses belonging to any single Bitcoin user. In other words, all the addresses a user has ever used can be grouped together.
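A common way to implement this kind of grouping is a union-find (disjoint-set) structure: every pair of addresses that appear together as inputs of some transaction is merged into one cluster. The sketch below is a minimal version of that idea, assuming each transaction has already been reduced to a list of resolved input addresses under a hypothetical input_addresses field; it is not our data-source's actual code.

```python
# Address clustering via union-find: all input addresses of a transaction
# are merged into the same cluster, and clusters that share an address merge.

def cluster_addresses(transactions):
    """Return a mapping from address to a cluster (user) representative."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for tx in transactions:
        addrs = tx["input_addresses"]
        for a in addrs:
            find(a)                  # register every address
        for other in addrs[1:]:
            union(addrs[0], other)   # all inputs belong to the same user

    return {addr: find(addr) for addr in parent}

# Example: two transactions that share an input address end up in one cluster.
txs = [
    {"input_addresses": ["A1", "A2"]},
    {"input_addresses": ["A2", "A3"]},
]
print(cluster_addresses(txs))  # A1, A2, and A3 all map to one representative
```

With path compression, the merging stays cheap even when the transform step runs over a very large number of transactions, which is why union-find is a natural fit for this heuristic.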
To assign a rough geolocation to each user, we sourced IP address data from BlockCypher.com, an online Bitcoin data service. Briefly, we converted the IP addresses, which are recorded per transaction, into geographical coordinates, and took an educated average of those values as the user’s rough geolocation.
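As a rough sketch of that step, the snippet below averages the coordinates of a user's transaction IPs. The coordinate table is fake demo data standing in for an IP-geolocation lookup (the IP addresses themselves came from BlockCypher, as described above), so this illustrates only the averaging, not the actual service calls.

```python
# Purely illustrative coordinates, not real geolocation data: a stand-in for
# whatever IP-geolocation lookup a real data-source would use.
IP_COORDINATES = {
    "203.0.113.5": (40.71, -74.01),
    "203.0.113.9": (40.73, -73.99),
    "203.0.113.14": (40.69, -74.02),
}

def rough_location(ip_addresses):
    """Arithmetic mean of the coordinates of a user's transaction IPs.

    Good enough for a rough, city-scale estimate, though a plain mean can
    misbehave for users whose IPs straddle the 180th meridian.
    """
    coords = [IP_COORDINATES[ip] for ip in ip_addresses]
    lat = sum(c[0] for c in coords) / len(coords)
    lon = sum(c[1] for c in coords) / len(coords)
    return lat, lon

print(rough_location(["203.0.113.5", "203.0.113.9", "203.0.113.14"]))
```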
To summarize: we used the address clustering heuristic to assign arbitrary identities to Bitcoin users, and we assigned each user a geolocation by averaging the locations corresponding to the IP addresses on that user’s transactions.
IHMC is an unexpectedly robust institution: from the outside, it seems that its research is limited to topics such as robotics, cybersecurity, and computing. However, as I’ve come to know the researchers who work at IHMC, I’ve realized that IHMC hires people who do good work; it does not hire people simply to do the work it wants done. IHMC is thus filled with researchers who have a passion to dive deeply into subjects in which they may be the only ones interested, and this dynamic creates an environment of great diversity. My summer at IHMC has been enlightening; I not only learned deeply technical skills and came to understand dangerously complex systems, but I also experienced the atmosphere of a genuine research institution, where curiosity and learning are the driving forces.
I have none other than the ’68 Center for Career Exploration and the Class of 1972 to thank for the opportunity to have spent a summer at IHMC. Without their support, none of the above would have been possible.