Determining Gender of a Name with 80% Accuracy Using Only Three Features


Introduction

I thought guessing the gender of a name from characteristics of the name itself would be an easy project for learning machine learning. After playing around with different ways of encoding the characters of a name as features, I discovered you only need THREE features to reach 80% accuracy, which is pretty impressive. I am by no means an expert at machine learning, so if you see any errors, feel free to point them out.

Example:

Name      Actual  Classified
shea      F       F
lucero    F       M
damiyah   F       F
nitya     F       F
sloan     M       M
porter    F       M
jalaya    F       F
aubry     F       F
mamie     F       F
jair      M       M

(Click here for Source: IPython Notebook)

Dataset

The names come from the SSA’s (Social Security Administration) baby names dataset for the year 2014.
https://www.ssa.gov/oact/babynames/names.zip

Methodology

I kept only the name/gender rows from the dataset with a count of at least 20, since I found the rarely used entries were low quality (for example, a handful of boys were recorded with the name Amy in 2014).

Loading

Code for loading the dataset into numpy arrays ready for machine learning:

import numpy as np
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

# Load name, gender and count columns; lowercase the names on the way in
my_data = np.genfromtxt('names/yob2014.txt', delimiter=',',
                        dtype=[('name','S50'), ('gender','S1'), ('count','i4')],
                        converters={0: lambda s: s.lower()})
# Keep only rows with a count of at least 20
my_data = np.array([row for row in my_data if row[2] >= 20])
# Apply name_count (defined in the sections below) to every name
name_map = np.vectorize(name_count, otypes=[np.ndarray])
Xlist = name_map(my_data['name'])
X = np.array(Xlist.tolist())
y = my_data['gender']

X is an N × M np.array, where N is the number of names and M is the number of features.
y is an array of 'M'/'F' gender labels.
name_map wraps name_count, the function (a different one in each section below) that converts a name (string) to an array of features.

Fitting and Validation

We will split the data into training and test sets for cross-validation and use a RandomForest for classification, since it generally performs well on this kind of task.
for x in xrange(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print np.mean(clf.predict(Xte) == yte)
By default, RandomForestClassifier sets max_features (the number of features to consider when looking for the best split) to sqrt(n_features), which is the recommended value for classification problems (http://scikit-learn.org/stable/modules/ensemble.html#parameters). We will be using an n_estimators (number of trees) of 100 and a min_samples_split (the minimum number of samples required to split an internal node) of 2, which we will tune once we have settled on a good feature set.
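As an aside, the cross_val_score helper imported above can produce the same kind of estimate in a single call. This is just an equivalent sketch; the rest of the post sticks with the manual loop:
# 5-fold cross-validated accuracy (numbers will differ slightly from the
# random train/test splits above)
clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
print cross_val_score(clf, X, y, cv=5)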

Picking Features

Character Frequency

My first attempt at features was the frequency of each character:
def name_count(name):
    # One slot per letter; only the first 26 are used in this version
    arr = np.zeros(52)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
    return arr
Example:
aaabd
freq: [a:3, b:1, d:1]
* Note that we encode the frequencies as an array indexed by letter, e.g. [3, 1, 0, 1, 0, 0, …, 0]. Most of the array will be zeroes.
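A quick, illustrative sanity check of the encoding:
# The counts for 'aaabd' land at indices 0 (a), 1 (b) and 3 (d)
v = name_count('aaabd')
print v[0], v[1], v[3]  # 3.0 1.0 1.0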
Accuracy:
0.690232056125
0.692390717755
0.693739881274
0.688073394495
0.694819212089
Not bad for simple features.

Character Frequency + Order

The second attempt adds ordering to the frequency features:
def name_count(name):
    arr = np.zeros(52)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1         # frequency (indices 0-25)
        arr[ord(x)-ord('a')+26] += ind+1  # sum of 1-based positions (indices 26-51)
    return arr
Example: aaabc
freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]

We combine the two encodings into a single array by offsetting the order features by 26.
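Again, a quick illustrative check:
# For 'aaabc': a appears at positions 1+2+3=6, b at position 4, c at position 5
v = name_count('aaabc')
print v[26], v[27], v[28]  # 6.0 4.0 5.0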

Accuracy:
0.766864543983
0.760388559093
0.766864543983
0.76740420939
0.759848893686
We are getting somewhere!

Character Frequency + Order + 2-grams

Let’s try adding all the 2-grams in the name as features to see if we can extract more information.
def name_count(name):
    arr = np.zeros(52+26*26)
    # Iterate over each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate over every pair of adjacent characters, offset past the
    # 52 frequency/order features
    for x in xrange(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    return arr
Example: aaabc
freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]
2-gram: [ aa: 2, ab: 1, bc: 1]
We can encode a 2-gram by treating its two letters as a base-26 number, e.g. aa = 0, bc = 1*26 + 2 = 28 (before the +52 offset).
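The index computation, pulled out as a standalone helper (hypothetical, just to illustrate the arithmetic):
def bigram_index(c1, c2):
    # Treat the pair as a two-digit base-26 number
    return (ord(c1)-ord('a'))*26 + (ord(c2)-ord('a'))

print bigram_index('a', 'a')  # 0
print bigram_index('b', 'c')  # 28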

Accuracy:

0.78548300054
0.771451699946
0.783864004317
0.777388019428
0.77172153265

We get a slight increase in accuracy, but I think we can do better.

Character Frequency + Order + 2-grams + Heuristics

Examining the names in more depth, I hypothesized that the length of the name and its last and second-last characters could be important.
def name_count(name):
    arr = np.zeros(52+26*26+3)
    # Iterate over each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate over every pair of adjacent characters
    for x in xrange(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    # Last character
    arr[-3] = ord(name[-1])-ord('a')
    # Second-last character
    arr[-2] = ord(name[-2])-ord('a')
    # Length of name
    arr[-1] = len(name)
    return arr
Example: aaabc
freq: [a:3, b:1, c:1]
ord: [a:6, b:4, c:5]
2-gram: [ aa: 2, ab: 1, bc: 1]
last_char: 2 (c)
second_last_char: 1 (b)
length: 5
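And one more illustrative check of the new slots:
# For 'aaabc': last char c -> 2, second-last char b -> 1, length 5
v = name_count('aaabc')
print v[-3], v[-2], v[-1]  # 2.0 1.0 5.0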
Accuracy:
0.801672962763
0.804641122504
0.803022126282
0.801672962763
0.805450620615

Fine-tuning

After playing around with n_estimators and min_samples_split, I found good values:
clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
which gives the accuracy:

0.814085267134
0.821370750135
0.818402590394
0.825148407987
0.82245008095

That is another small accuracy increase.
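If you would rather search these values systematically than by hand, scikit-learn’s GridSearchCV can do it. This is a sketch of that approach, not what I actually ran:
from sklearn.grid_search import GridSearchCV  # same sklearn vintage as the imports above

param_grid = {'n_estimators': [50, 100, 150],
              'min_samples_split': [2, 10, 20]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(X, y)
print search.best_params_, search.best_score_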

Feature Reduction

Let’s look at the 10 most important features in each run, as given by clf.feature_importances_:
[728  26 729   0  40  50  30 390  39  37]
[728  26 729  50   0  40  37  30  34 390]
[728  26 729  50  40   0  37  30  39 390]
[728  26 729   0  50  40  30  37 390  39]
[728  26 729   0  50  40  30  37  39  34]

Each row lists the feature indices of one run, sorted from most to least important.
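These lists can be produced from a fitted classifier like so (an assumed snippet; it is not shown in the notebook):
# Indices of the 10 largest importances, in descending order
print np.argsort(clf.feature_importances_)[::-1][:10]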

Mapping the top indices back to features:

728 – last character
26 – order of a
729 – second-last character
0 – number of a’s
50 – order of y
40 – order of o

It looks like these six features are consistently good.
Let’s see how well the top feature does on its own:
def name_count(name):
    arr = np.zeros(1)
    # Last character only, 1-indexed so 'a' maps to 1
    arr[0] = ord(name[-1])-ord('a')+1
    return arr

Accuracy:

0.771451699946
0.7536427415
0.753912574204
0.7536427415
0.760658391797

Wow! We actually get 75% accuracy! This means the last letter of a name is, on its own, a strong predictor of gender.
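For context, you can compare this against always guessing whichever class is more common in the test set (an assumed baseline check; it is not in the original notebook):
from collections import Counter

# Accuracy of always predicting the majority class
print max(Counter(yte).values()) / float(len(yte))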

Let’s take the top three features (last character, second-last character and order of a’s) and see how well they do together. (But if you already read the title of this blog post, you know what to expect.)

def name_count(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Order of a's
    for ind, x in enumerate(name):
        if x == 'a':
            arr[2] += ind+1
    return arr

Accuracy:

0.798165137615
0.794117647059
0.795736643281
0.801133297356
0.803561791689

I would say 80% accuracy from 3 features is pretty good for determining the gender of a name. That’s about the same accuracy as a mammogram detecting cancer in a 45-49 year old woman!

Sample Predictions

We can sample random datapoints to see how well our model is performing:
def name_count(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Order of a's
    for ind, x in enumerate(name):
        if x == 'a':
            arr[2] += ind+1
    return arr

my_data = np.genfromtxt('names/yob2014.txt',
                        delimiter=',',
                        dtype=[('name','S50'), ('gender','S1'), ('count','i4')],
                        converters={0: lambda s: s.lower()})
my_data = np.array([row for row in my_data if row[2] >= 20])
name_map = np.vectorize(name_count, otypes=[np.ndarray])
Xname = my_data['name']
Xlist = name_map(Xname)
X = np.array(Xlist.tolist())

y = my_data['gender']

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
clf.fit(Xtr, ytr)

# Pick 10 random names and compare actual vs. predicted gender
idx = np.random.choice(np.arange(len(Xlist)), 10, replace=False)
xs = Xname[idx]
ys = y[idx]
pred = clf.predict(X[idx])

for a, b, p in zip(xs, ys, pred):
    print a, b, p

Output:

Name      Actual  Classified
shea      F       F
lucero    F       M
damiyah   F       F
nitya     F       F
sloan     M       M
porter    F       M
jalaya    F       F
aubry     F       F
mamie     F       F
jair      M       M

Conclusion

Many features are good, but finding the important features is better.
If you are unsure of the gender of a name, just look at the last letter: that alone gives you about a 75% chance of getting it right.
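As a final illustration (assuming clf is the three-feature model trained above; the names here are arbitrary), you can classify any lowercase name directly:
# Predict gender for names that may not be in the training data
for name in ['alexandra', 'david']:
    print name, clf.predict([name_count(name)])[0]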

I hope you have learned something from reading this blog post, as I did writing it! (Click here for Source: IPython Notebook)