Name Classification

Gender

To examine the gender diversity of State Department officers, we use the open-source Python package gender-guesser, which classifies first names as male, female, mostly_male, mostly_female, androgynous, or unknown. We group male with mostly_male and female with mostly_female, and we limit our dataset to known names only (omitting unknown names). Though the package doesn’t publicly report its accuracy level, this study evaluating various name-gender classifiers gives it a 2.6% misclassification rate, the lowest of the free classifiers.
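The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function names are hypothetical, the sample labels are made up, and it assumes that androgynous names (which gender-guesser reports with the string "andy") are left out along with unknown ones.

```python
from collections import Counter

# Collapse gender-guesser's labels into the two groups used in the analysis.
# Assumption: "andy" (androgynous) and "unknown" map to no group and are dropped.
GROUPS = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
}


def collapse(label):
    """Return 'male' or 'female', or None for labels left out of the analysis."""
    return GROUPS.get(label)


def gender_shares(labels):
    """Share of each group among the names that received a usable label."""
    counts = Counter(g for g in map(collapse, labels) if g is not None)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()} if total else {}


# Hypothetical labels, as a classifier might return them for six first names.
labels = ["male", "mostly_female", "unknown", "female", "andy", "mostly_male"]
shares = gender_shares(labels)
```

Because shares are computed only over names with a usable label, the result describes the classified subset rather than the full officer list.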

It is important to note that the gender-guesser package classifies only first names, whereas our dataset includes full names. We therefore use the open-source Python package nameparser, which takes a “simple, rule-based approach,” to parse out first names for classification.
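The two steps chain together as sketched below. To keep the example runnable without third-party packages, the parser and classifier are passed in as functions, and the stand-ins shown here (a whitespace split and a two-name lookup table) are hypothetical placeholders. With the real packages installed, `first_name_of` could be `lambda n: HumanName(n).first` from nameparser and `gender_of` could be the `get_gender` method of a gender-guesser `Detector`.

```python
def classify_full_names(full_names, first_name_of, gender_of):
    """Map each full name to the label its parsed first name receives."""
    results = {}
    for full_name in full_names:
        first = first_name_of(full_name)
        # A record with no recoverable first name cannot be classified.
        results[full_name] = gender_of(first) if first else "unknown"
    return results


# Hypothetical stand-in for a name parser: take the first whitespace-separated token.
def naive_first_name(full_name):
    parts = full_name.split()
    return parts[0] if parts else ""


# Hypothetical stand-in for a name-gender classifier: a tiny lookup table.
TOY_LABELS = {"Maria": "female", "John": "male"}


def toy_gender(first_name):
    return TOY_LABELS.get(first_name, "unknown")


labels = classify_full_names(
    ["Maria Lopez", "John A. Smith", ""], naive_first_name, toy_gender
)
```

A whitespace split mishandles titles and surname-first formats, which is the reason to use a dedicated parser rather than the stand-in.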

We also perform analysis of gender diversity in the State Department by job “cone,” or officer function. After scraping and cleaning the Abbreviations and Symbols section of the officer lists, we classify each position as management, consular, economic, political, public diplomacy, principal officer, security, or non-FSO (Foreign Service Officer). Principal officers include leadership positions such as ambassadors and directors; security officers include specialists assigned by the Department of Defense, such as regional security officers.

We then match our classified dataset against our “key,” which translates abbreviated positions and offices to job cones.
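A matching step of this kind might look like the following sketch. The abbreviations, the cone assignments in `CONE_KEY`, and the sample rows are illustrative assumptions rather than our actual key, and the normalization rule (uppercase, letters and digits only) is one plausible way to absorb formatting differences between the scraped lists and the key.

```python
import re

# Hypothetical key: position abbreviation -> job cone (illustrative entries only).
CONE_KEY = {
    "POL": "political",
    "ECON": "economic",
    "CON": "consular",
    "MGT": "management",
    "PAO": "public diplomacy",
    "AMB": "principal officer",
    "RSO": "security",
}


def normalize(abbrev):
    """Uppercase the abbreviation and strip everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", abbrev.upper())


def cone_for(abbrev, key=CONE_KEY):
    """Return the job cone, or None so unmatched entries can be reviewed by hand."""
    return key.get(normalize(abbrev))


# Hypothetical (officer, abbreviation) rows as they might come out of scraping.
rows = [("Maria Lopez", "pol."), ("John Smith", " ECON "), ("Ann Lee", "XYZ")]
matched = [(name, cone_for(abbrev)) for name, abbrev in rows]
unmatched = [name for name, cone in matched if cone is None]
```

Returning `None` instead of guessing keeps unmatched abbreviations visible, so they can be added to the key or assigned manually.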

Ethnicity

To examine the ethnic, national, and cultural diversity of State Department officers, we use two tools. The first is the open-source Python classifier ethnicseer, a pre-trained model that handles 12 ethnicities: Middle-Eastern, Chinese, English, French, Vietnamese, Spanish, Italian, German, Japanese, Russian, Indian, and Korean. The paper that originally proposed it reports 84% accuracy on its test data set.

The second is the publicly available name-to-ethnicity classifier API. Though the API allows for training custom models, we opt for the pre-trained “8_nationality_groups” model, which classifies names as African, Celtic, East Asian, European, Hispanic, Muslim, Nordic, or South Asian with a reported 84% accuracy. Since the classifier also provides an estimated confidence level (maximum 100%) with each classification, we limit our analysis sample to classifications with over 50% confidence.
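The confidence filter amounts to a strict threshold on each prediction. In the sketch below, the record layout (`name`, `label`, `confidence` keys, with confidence on a 0 to 100 scale) and the sample values are assumptions made for illustration; they are not the API's documented response format.

```python
def keep_confident(predictions, threshold=50.0):
    """Keep only predictions whose confidence is strictly above the threshold."""
    return [p for p in predictions if p["confidence"] > threshold]


# Hypothetical predictions; names are placeholders and values are invented.
predictions = [
    {"name": "Officer A", "label": "European", "confidence": 91.3},
    {"name": "Officer B", "label": "Hispanic", "confidence": 50.0},
    {"name": "Officer C", "label": "East Asian", "confidence": 47.8},
]

confident = keep_confident(predictions)
```

With a strict comparison, a prediction at exactly 50% is excluded, which matches the "over 50%" wording.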

To simplify our analysis, we group our ethnicities and nationalities into five groups: White, Asian, Black, Hispanic, and Middle-Eastern.
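One way to express such a grouping is a lookup table from each classifier label to a broad group. The assignments below are an illustrative assumption, not a record of the exact mapping we used; in particular, where labels such as Spanish, Italian, or Muslim belong is a judgment call that a real analysis would need to document.

```python
# Hypothetical mapping from classifier labels to the five broad groups.
BROAD_GROUP = {
    # Labels from the 12-ethnicity model
    "Middle-Eastern": "Middle-Eastern",
    "Chinese": "Asian",
    "Vietnamese": "Asian",
    "Japanese": "Asian",
    "Korean": "Asian",
    "Indian": "Asian",
    "English": "White",
    "French": "White",
    "Italian": "White",
    "German": "White",
    "Russian": "White",
    "Spanish": "Hispanic",
    # Labels from the 8-group model
    "African": "Black",
    "Celtic": "White",
    "European": "White",
    "Nordic": "White",
    "East Asian": "Asian",
    "South Asian": "Asian",
    "Hispanic": "Hispanic",
    "Muslim": "Middle-Eastern",
}


def broad_group(label):
    """Return the broad group for a classifier label, or None if it is unmapped."""
    return BROAD_GROUP.get(label)


groups_used = set(BROAD_GROUP.values())
```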