## Determining Gender of a Name with 80% Accuracy Using Only Three Features

## Introduction

I thought an easy project to learn machine learning was to guess the gender of a name using characteristics of the name. After playing around with different features by encoding characters of the name, I discovered you only needed THREE features for 80% accuracy which is pretty impressive. I am by no means an expert at machine learning, so if you see any errors, feel free to point them out.

Example:

Name Actual Classified shea F F lucero F M damiyah F F nitya F F sloan M M porter F M jalaya F F aubry F F mamie F F jair M M

## Dataset

## Methodology

### Loading

Code for loading data from dataset into numpy arrays ready for machine learning

import numpy as np from sklearn.cross_validation import train_test_split, cross_val_score from sklearn.ensemble import RandomForestClassifier from sklearn import svm my_data = np.genfromtxt('names/yob2014.txt', delimiter=',', dtype=[('name','S50'), ('gender','S1'),('count','i4')], converters={0: lambda s:s.lower()}) my_data = np.array([row for row in my_data if row[2]>=20]) name_map = np.vectorize(name_count, otypes=[np.ndarray]) Xlist = name_map(my_data['name']) X = np.array(Xlist.tolist()) y = my_data['gender']

X is an np.array of N * M, where N is number of names and M is number of features

y is M or F

name_map will be a function that converts a name (string) to an array of features

### Fitting and Validation

for x in xrange(5): Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33) clf = RandomForestClassifier(n_estimators=100, min_samples_split=2) clf.fit(Xtr, ytr) print np.mean(clf.predict(Xte) == yte)

## Picking Features

### Character Frequency

def name_count(name): arr = np.zeros(52) for ind, x in enumerate(name): arr[ord(x)-ord('a')] += 1 return arr

freq: [a:3, b:1, d:1]

0.690232056125

### Character Frequency + Order

def name_count(name): arr = np.zeros(52) for ind, x in enumerate(name): arr[ord(x)-ord('a')] += 1 arr[ord(x)-ord('a')+26] += ind+1 return arr

freq: [a:3, b:1, c:1]

ord: [a:6, b:4, c:5]

We can combine these encodings by adding the two arrays together and offsetting the second array

0.766864543983

0.760388559093

0.766864543983

0.76740420939

0.759848893686

### Character Frequency + Order + 2-grams

def name_count(name): arr = np.zeros(52+26*26) # Iterate each character for ind, x in enumerate(name): arr[ord(x)-ord('a')] += 1 arr[ord(x)-ord('a')+26] += ind+1 # Iterate every 2 characters for x in xrange(len(name)-1): ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) arr[ind] += 1 return arr

ord: [a:6, b:4, c:5]

2-gram: [ aa: 2, ab: 1, bc: 1]

Accuracy:

0.78548300054 0.771451699946 0.783864004317 0.777388019428 0.77172153265

We get a slight increase in accuracy, but I think we can do better.

### Character Frequency + Order + 2-grams + Heuristics

def name_count(name): arr = np.zeros(52+26*26+3) # Iterate each character for ind, x in enumerate(name): arr[ord(x)-ord('a')] += 1 arr[ord(x)-ord('a')+26] += ind+1 # Iterate every 2 characters for x in xrange(len(name)-1): ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52 arr[ind] += 1 # Last character arr[-3] = ord(name[-1])-ord('a') # Second Last character arr[-2] = ord(name[-2])-ord('a') # Length of name arr[-1] = len(name) return arr

ord: [a:6, b:4, c:5]

2-gram: [ aa: 2, ab: 1, bc: 1]

last_char: 3

second_last_char: 2

length: 5

0.801672962763 0.804641122504 0.803022126282 0.801672962763 0.805450620615

## Fine-tuning

0.814085267134

0.821370750135

0.818402590394

0.825148407987

0.82245008095

## Feature Reduction

[728 26 729 0 40 50 30 390 39 37] [728 26 729 50 0 40 37 30 34 390] [728 26 729 50 40 0 37 30 39 390] [728 26 729 0 50 40 30 37 390 39] [728 26 729 0 50 40 30 37 39 34]

These numbers refer to the feature index by most importance.

40 – order of o

def name_count(name): arr = np.zeros(1) arr[0] = ord(name[-1])-ord('a')+1 return arr

Accuracy:

0.771451699946 0.7536427415 0.753912574204 0.7536427415 0.760658391797

Wow! We actually get 75% accuracy! This means the last letter of a name is really important in determining the gender.

Let’s take the top three features (last and second last character and order of a’s) and see the importance of these. (But if you already read the title of this blog post, you should know what to expect.)

def name_count(name): arr = np.zeros(3) arr[0] = ord(name[-1])-ord('a')+1 arr[1] = ord(name[-2])-ord('a')+1 # Order of a's for ind, x in enumerate(name): if x == 'a': arr[2] += ind+1 return arr

Accuracy:

0.798165137615 0.794117647059 0.795736643281 0.801133297356 0.803561791689

I would say 80% accuracy for 3 features is pretty good for determining gender of a name. Thats about the same accuracy as a mammogram detecting cancer in a 45-49 year old woman!

## Sample Example

def name_count(name): arr = np.zeros(3) arr[0] = ord(name[-1])-ord('a')+1 arr[1] = ord(name[-2])-ord('a')+1 # Order of a's for ind, x in enumerate(name): if x == 'a': arr[2] += ind+1 return arr my_data = np.genfromtxt('names/yob2014.txt', delimiter=',', dtype=[('name','S50'), ('gender','S1'),('count','i4')], converters={0: lambda s:s.lower()}) my_data = np.array([row for row in my_data if row[2]>=20]) name_map = np.vectorize(name_count, otypes=[np.ndarray]) Xname = my_data['name'] Xlist = name_map(Xname) X = np.array(Xlist.tolist()) y = my_data['gender'] Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33) clf = RandomForestClassifier(n_estimators=150, min_samples_split=20) clf.fit(Xtr, ytr) idx = np.random.choice(np.arange(len(Xlist)), 10, replace=False) xs = Xname[idx] ys = y[idx] pred = clf.predict(X[idx]) for a,b, p in zip(xs,ys, pred): print a,b, p

Output:

Name Actual Classified shea F F lucero F M damiyah F F nitya F F sloan M M porter F M jalaya F F aubry F F mamie F F jair M M

## Conclusion

I hope you have learned something from reading this blog post as I did writing it!(Click here for Source: IPython Notebook)