Controlled Vocabulary Tool
2017 - 2018
A multi-threaded Python CLI tool to create a controlled vocabulary.
I was inspired to start this project while reading Data+Design by Trina Chiasson and Dyanna Gregory (with over 50 global contributors). Chapter 4 contains a section named Controlling for Inconsistencies. While reading it, I figured this was a good, simple problem that could benefit from a multi-threaded solution. I decided to work in Python for quick and easy development.
The main idea of the project was to create a CLI tool that uses multiple threads to convert a dataset into a new dataset with a controlled vocabulary (see image below). The threads exploit the fact that it takes the user a few seconds to supply a mapping for a value. During this time the tool can read input, convert values for which it already has a mapping, and output mapped values.
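To make the idea concrete, here is a minimal single-threaded sketch of what converting to a controlled vocabulary means. It is not the tool's actual code: the function name `convert`, the `ask` callback (standing in for the interactive prompt), and the sample values are all assumptions made for illustration.

```python
def convert(values, mapping, ask):
    """Replace each raw value with its controlled term.

    `ask` is a hypothetical stand-in for the interactive prompt; it is
    called only once per distinct raw value, after which the answer is
    reused from `mapping`.
    """
    result = []
    for raw in values:
        if raw not in mapping:
            mapping[raw] = ask(raw)  # the slow step: a human has to answer
        result.append(mapping[raw])
    return result


# Canned answers replace a real user so the sketch runs unattended.
canned = {"NYC": "New York", "New York": "New York", "nyc": "New York"}
asked = []


def fake_prompt(raw):
    asked.append(raw)
    return canned[raw]


mapping = {}
cleaned = convert(["NYC", "New York", "nyc", "NYC"], mapping, fake_prompt)
```

In this sketch the fourth value is converted without asking again, because "NYC" already has a mapping. Everything else waits while `ask` runs, which is the idle time the multi-threaded design aims to use.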
With this goal in mind, I first drafted a single-threaded script that read input from a file, prompted the user for a conversion, and displayed the resulting dataset in the console. The next step was to split this work across three threads: one to read, one to convert, and one to output. To let the threads communicate with each other, I created a few global variables guarded by mutexes to synchronize access to them between the threads (see the diagram below).
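The three-thread layout can be sketched roughly as follows. This is an illustration under assumptions, not the original code: the variable and function names are invented, the `threading.Event` flags used to signal completion are my addition (the project description only mentions global variables and mutexes), and `ask` again stands in for the user prompt.

```python
import threading
import time

# Shared global state, each list guarded by its own mutex.
raw_values = []        # filled by the reader, drained by the converter
converted_values = []  # filled by the converter, drained by the writer
raw_lock = threading.Lock()
converted_lock = threading.Lock()
reading_done = threading.Event()     # assumption: completion flags
converting_done = threading.Event()


def reader(lines):
    for line in lines:
        with raw_lock:
            raw_values.append(line.strip())
    reading_done.set()


def converter(mapping, ask):
    while True:
        finished = reading_done.is_set()  # read the flag before draining
        with raw_lock:
            batch = raw_values[:]
            raw_values.clear()
        for raw in batch:
            if raw not in mapping:
                mapping[raw] = ask(raw)   # blocks this thread only
            with converted_lock:
                converted_values.append(mapping[raw])
        if finished and not batch:
            break
        time.sleep(0.01)
    converting_done.set()


def writer(sink):
    while True:
        finished = converting_done.is_set()
        with converted_lock:
            batch = converted_values[:]
            converted_values.clear()
        sink.extend(batch)
        if finished and not batch:
            break
        time.sleep(0.01)


def run_pipeline(lines, ask):
    reading_done.clear()
    converting_done.clear()
    mapping, sink = {}, []
    threads = [
        threading.Thread(target=reader, args=(lines,)),
        threading.Thread(target=converter, args=(mapping, ask)),
        threading.Thread(target=writer, args=(sink,)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink, mapping
```

Calling `run_pipeline(["NYC\n", "New York\n"], input)` would prompt on the console for each new value. Note that in this layout the converter still stalls while a prompt is open, even if later values already have a mapping.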
Finally, I split the conversion thread into a conversion thread and a prompting thread, so the tool can convert newly read values while the user is being prompted for a mapping. I also added options to output to a file, fuse multiple input files into one output file, output the conversion map that was created, and use an existing mapping for conversion.
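One way to picture the split between converting and prompting is the sketch below. It is an assumption-heavy illustration: it swaps the global variables for a `queue.Queue` from the standard library, works on an in-memory list instead of a stream, and uses hypothetical names (`prompter`, `convert_all`, `ask`). Values with a known mapping are converted immediately, while unknown ones are handed to the prompting thread and filled in once answered. Passing a pre-filled `mapping` mirrors the option to reuse an existing mapping.

```python
import queue
import threading


def prompter(unknown, mapping, lock, ask):
    """Ask for a mapping for each unknown value, one at a time."""
    while True:
        raw = unknown.get()
        if raw is None:  # sentinel: nothing left to ask about
            break
        term = ask(raw)  # slow: waits for the user
        with lock:
            mapping[raw] = term


def convert_all(values, ask, mapping=None):
    mapping = {} if mapping is None else mapping
    lock = threading.Lock()
    unknown = queue.Queue()
    worker = threading.Thread(target=prompter, args=(unknown, mapping, lock, ask))
    worker.start()

    result = [None] * len(values)
    deferred = []      # (position, raw) pairs still waiting for an answer
    requested = set()  # raw values already sent to the prompter
    for pos, raw in enumerate(values):
        with lock:
            term = mapping.get(raw)
        if term is not None:
            result[pos] = term  # known value: convert without waiting
        else:
            deferred.append((pos, raw))
            if raw not in requested:
                requested.add(raw)
                unknown.put(raw)

    unknown.put(None)
    worker.join()  # every requested value now has a mapping
    for pos, raw in deferred:
        result[pos] = mapping[raw]
    return result, mapping
```

Filling results by position keeps the output in input order even though answers arrive later. Whether the original tool preserved order this way is not stated in the description above.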