Hello World for PyML

After a while concentrating on more abstract stuff, I thought I would return to Support Vector Machines. These are primarily classifiers which assign data to one of two categories, e.g. red or blue in the picture below. Having read up on elementary vector geometry, and more optimisation stuff (through economics), I found the subject much more penetrable.

PyML is pretty easy to get hold of and install. Don't expect much in the way of documentation though. These are my notes on how to wring something visible out of it as a 'Hello World' use of SVM. Now after writing some code to generate a data set (more on that below), the following few lines get us to a visible output:

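from pylab import *  # provides array (plus sin, cos, randn used in the data code)
from PyML import *
from PyML.demo import demo2d
# f2 and transformed_gauss are defined in the data-generation code further down
mu=[array([1,-2,3,-3]),array([-2,3,-1,3])]
data = transformed_gauss( 150, f2, mu )
data.attachKernel( 'Gaussian' )
s = SVM()
s.train( data )
demo2d.setData( data )
demo2d.decisionSurface( s )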

 

You can then use the cross-validation method to get an estimate of the classifier's performance:

In [36]: s.cv(data)
[...]
Confusion Matrix:
      Given labels:
       0    1
    0  52   23
    1  22   53
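
To turn that matrix into a single number, here is a minimal sketch in plain Python. It assumes the usual reading of a confusion matrix, where the diagonal holds the correctly classified points; the variable names are mine, not PyML's.

```python
# Confusion matrix copied from the s.cv(data) output above.
confusion = [[52, 23],
             [22, 53]]

# Assumption: diagonal entries are the points whose predicted and given labels agree.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / float(total)

print("correct: %d of %d, accuracy: %.2f" % (correct, total, accuracy))
```

That is 105 of 150 points, so roughly 70% accuracy on this deliberately messy data set.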

Here's the code used to generate the data. I wanted something a bit messier than the inbuilt data, and something amenable to 2d visualisation.  The code is to generate a set of 4-dimensional Gaussians, and then map them on to two dimensions. My thought was to take a data set that is linearly separable in its original dimensionality, distort it down, then see how easily SVM can restore the separation.

from pylab import *
from PyML import *
from PyML.demo import demo2d

def gauss_data( N, mu ):
    p = mu[0].shape[0]
    N = N + N%2 #make N divisible by two
    X = []
    Y = []
    for i in range(N):
        #select a random class
        class_index = randint(0,2)
        #create a new X point -> a p-dimensional Gaussian with mean of that class
        X.append( randn(p) + mu[class_index] )
        #create the new Y label -> '1' if from mu[1], '0' if from mu[0]
        Y.append( str(class_index) )
    return X, Y

f2 = lambda x: array([ sin(x[0] + x[1] ), cos( x[2] + x[3] ) ])

def transform( X, f ):
    return [f(x) for x in X]

def transformed_gauss( N, f, mu):
    X, Y = gauss_data( N, mu )
    Z = transform( X, f )
    D = VectorDataSet( Z, L=Y )
    return D

My other tip on PyML is that it likes to construct DataSet instances with the label list as strings.

Some pictures of simple differential equation systems

This week I have been mostly wondering what simple systems of differential equations look like. The pictures in the books often either have too many arrows or too few.  There's something aesthetically unappealing about it. Being a humanities person, I want to see the stories of individual points. Just thought I'd share them.

Here's a stable set of equations:

\begin{equation} \begin{aligned} \dot{x} &= -2x+y \\ \dot{y} &= x-2y \end{aligned} \end{equation}

The lines on the graph are the nullclines \dot{x}=0 and \dot{y}=0. Where they meet is an equilibrium point, which may be stable or unstable.

Then there's its more awkward cousin:

\begin{equation} \begin{aligned} \dot{x} &= 10+2x-3y \\ \dot{y} &= 9+x-2y \end{aligned} \end{equation}

Zooming out a bit (not using arrow method), but before it turns into a straight line zipping off to infinity:

I particularly like this damnably simple, stable spiral:

\begin{equation} \begin{aligned} \dot{x} &= -x+y \\ \dot{y} &= -x-y \end{aligned} \end{equation}

And here's its badly behaved cousin, flinging points out all over the show:

\begin{equation} \begin{aligned} \dot{x} &= -2x+y \\ \dot{y} &= x+2y \end{aligned} \end{equation}

Code on github, should you wish to play.
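
If you'd rather check the behaviour of these four systems without drawing anything, here is a rough sketch (not the code from the repository) that finds the equilibrium and classifies it from the eigenvalues of the coefficient matrix. The function names and the parameterisation \dot{x}=ax+by+e, \dot{y}=cx+dy+f are my own choices for illustration, and degenerate cases (zero determinant, repeated eigenvalues) are not treated carefully.

```python
import cmath

def equilibrium(a, b, c, d, e=0.0, f=0.0):
    """Solve a*x + b*y + e = 0 and c*x + d*y + f = 0 by Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("nullclines are parallel: no unique equilibrium")
    return (b * f - d * e) / float(det), (c * e - a * f) / float(det)

def classify(a, b, c, d):
    """Classify the equilibrium from the eigenvalues of [[a, b], [c, d]]."""
    trace = a + d
    det = a * d - b * c
    # Eigenvalues are (trace +/- sqrt(trace^2 - 4*det)) / 2.
    root = cmath.sqrt(trace * trace - 4 * det)
    if det < 0:
        return "saddle"          # one positive, one negative eigenvalue
    if root.imag != 0:           # complex pair: trajectories rotate
        if trace == 0:
            return "centre"
        return "stable spiral" if trace < 0 else "unstable spiral"
    return "stable node" if trace < 0 else "unstable node"

# The four systems from this post, as (a, b, c, d, e, f).
systems = [
    ("stable set", (-2, 1, 1, -2, 0, 0)),
    ("awkward cousin", (2, -3, 1, -2, 10, 9)),
    ("simple spiral", (-1, 1, -1, -1, 0, 0)),
    ("badly behaved cousin", (-2, 1, 1, 2, 0, 0)),
]

for name, (a, b, c, d, e, f) in systems:
    print(name, equilibrium(a, b, c, d, e, f), classify(a, b, c, d))
```

By this reckoning the first system is a stable node, the third a stable spiral, and both cousins are saddles, with the awkward cousin's equilibrium sitting at (7, 8) rather than the origin.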

Acme Rink Company Ltd

Continuing on my quest to find start-up fads of times past. I wanted to move past the tactic of just ranking based on keyword count. Here's the metric I'm testing:

\begin{equation} M = \frac{(N-1)^b}{(H+0.001)^a}, \quad 0 < a, b \leq 1 \end{equation}

Where H is the information entropy of the probability distribution of the keyword conditioned on the year of incorporation. So H is smaller if the event is more predictable, say you know there was a massive boom in chip shops during 1922. Hence H sits in the denominator, as we want low entropy keywords to rank high (for now).  N is the keyword count, so the rank metric is increasing with N.
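
Here is a minimal sketch of how the metric could be computed for one keyword. It assumes H is the Shannon entropy (in bits) of the distribution of incorporation years among the companies containing that keyword; the base of the logarithm, the function names and the sample years are all my assumptions for illustration, not taken from the real data.

```python
import math
from collections import Counter

def year_entropy(years):
    """Shannon entropy (bits) of the years in which a keyword appears."""
    counts = Counter(years)
    total = float(len(years))
    return -sum((c / total) * math.log(c / total, 2) for c in counts.values())

def rank_metric(years, a=1.0, b=0.5):
    """M = (N-1)^b / (H+0.001)^a, with N the keyword count."""
    n = len(years)
    return (n - 1) ** b / (year_entropy(years) + 0.001) ** a

# Made-up examples: one keyword concentrated in a single year, one spread out.
boom = [1908, 1908, 1908, 1908, 1908]
spread = [1900, 1910, 1920, 1930]

print("boom:   H=%.3f M=%.1f" % (year_entropy(boom), rank_metric(boom)))
print("spread: H=%.3f M=%.3f" % (year_entropy(spread), rank_metric(spread)))
```

The concentrated keyword has zero entropy, so only the 0.001 offset keeps the denominator finite and it ranks far above the evenly spread one.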

Using a=1, b=0.5, these are the twenty highest ranked keywords:

0) Rink
1) Skating
2) Coy
3) Greyhound
4) (1920)
5) Tavern
6) Wireless
7) Cinema
8) Sailing
9) Radio
10) Exploring
11) Oilfields
12) Columbia
13) Aircraft
14) Picture
15) Golden
16) Son,
17) Mines,
18) Theatres
19) Reefs

Investigating 'Rink', it's clear when the skating expansion takes place:

For those interested, here are some of the names.  Prize goes to Acme Rink Company Limited, 1892.

The Sports Arenas and Ice Rinks Construction Corporation Ltd, 1928
Rink Equipment Company Ltd, 1926
Acme Rink Company Ltd, 1892
Crystal Ice Rink, Ltd, 1891
Brighton Rink Syndicate Ltd, 1896
Savoy Rink Ltd, 1928
Palais Roller Rink (Hull) Ltd, 1929
Keighley Skating Rink Ltd, 1929
Hinckley Roller Skating Rink Ltd, 1930
Rink (Bishop Auckland) Ltd, 1930
Billy-Jeans Ice Rinks Ltd, 1972
Edinburgh Skating Rink Company Ltd, 1908
Sunderland Skating Rink Company Ltd, 1908
Belfast Skating Rink Company Ltd, 1908
Leeds Skating Rink Company Ltd, 1908
Glasgow Skating Rink Company Ltd, 1908
Dublin Skating Rink Company Ltd, 1908
Birmingham Skating Rink Company Ltd, 1908
London Olympia Skating Rink Company Ltd, 1908
St James's Hall, Manchester, Skating Rink Company Ltd, 1908

Black Swan Gold Mine Ltd

At Hack on the Record I took a big chunk of Board of Trade data on historic company incorporations (hat-tip @Baloun).

The end goal is to identify past start-up fads and bubbles by looking at keywords. Probably through rankings based around minimum information entropy. Thought it could provide an interesting way to do a spot of economic history.

So, I was pretty pleased when, taking the first 5000 incorporations off the top, the term 'gold' appeared high in the ranks, with a spike in 1896.

It's the mixed effects of the Klondike gold rush, Cecil Rhodes, and hunting the golden bunyip in Western Australia. Sitting there in the data like a pure nugget in a clear mountain stream. No fancy smelting processes needed.

Here are the unfiltered keywords for 1896:

Company, 375
and, 240
Syndicate, 139
Gold, 97
Mines, 51
Mining, 34
Corporation, 25
London, 23
Club, 23
Development, 22
New, 18
Exploration, 18
Investment, 17
Steamship, 16
Cycle, 15
British, 15
of, 14
Publishing, 14
General, 13
Association, 13
Steam, 12
United, 11
South, 11
Brick, 11
Zealand, 10
Manufacturing, 10
African, 10
Works, 9
W, 9
Trust, 9
J, 9
Explorers, 9
City, 9
West, 8
Patent, 8
Laundry, 8
Gas, 8
Finance, 8
F, 8
Brothers, 8
Universal, 7
Tile, 7
Supply, 7
Sons, 7
Mine, 7
James, 7
H, 7
Creek, 7
Colonial, 7
Colliery, 7
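
A tally like the one above takes only a few lines. This is a rough sketch rather than the code I actually ran: it assumes keywords are simply whitespace-separated words of the company name, and that the "Ltd"/"Limited" suffix is dropped (it does not appear in the list above). The three names are taken from the 1896 list further down.

```python
from collections import Counter

def keyword_counts(company_names, drop=("Ltd", "Limited")):
    """Count whitespace-separated words across names, skipping the suffix words in drop."""
    counts = Counter()
    for name in company_names:
        counts.update(word for word in name.split() if word not in drop)
    return counts

names_1896 = [
    "Cobar Gold Mines Ltd",
    "Gold Securities Ltd",
    "Black Swan Gold Mine Ltd",
]

counts = keyword_counts(names_1896)
for word, n in counts.most_common(3):
    print("%s, %d" % (word, n))
```

Note that splitting on whitespace keeps punctuation attached, which would explain tokens like "Son," and "(1920)" turning up as keywords.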

Here are the names of the companies involved, some good fun in here:

Cobar Gold Mines Ltd
Kootenay Gold Fields Syndicate Ltd
New Zealand Gold Development Syndicate Ltd
Rooderand Main Reef Gold Mining Company Ltd
Seine River (Ontario) Gold Mines Ltd
Chili Gold Gravels Ltd
Towranna Gold Mines of Western Australia Ltd
Hannans "Empress" Gold Mining and Development Company Ltd
Candelaria Gold Mines Ltd
Lucky Guss Gold Mine Ltd
Hauraki (N Z) Associated Gold Reefs Ltd
Hannan's Premier Gold Mines Ltd
Summit Flat Gold Mines Ltd
Good Luck Gold Properties Ltd
Associated Southern Gold Mines (W A) Ltd
Gold Securities Ltd
Rockhampton (Queensland) Gold Estate Ltd
Truer River Gold Mining Company Ltd
Antenior (Matabelle) Gold Mines Ltd
90-Mile Proprietary Gold Mines Ltd
Captain Robinson's Gold Reefs Ltd
Waitekauri Cross Gold Mining Company Ltd
New Zealand Gold Investment Company Ltd
Gullewa Gold Mines Ltd
Merced Monster Gold Mines Ltd
Waihi Consolidated Gold Mines Ltd
Irassu Gold Exploration Syndicate Ltd
Menzies Golden Rhine Gold Mines (WA) Ltd
Hannans Gold Hill Ltd
Lady Margaret Gold Mining Company Ltd
Princess Alix Gold Mines Ltd
Renmark Gold Mines Ltd
Morris Ravine Gold Mines Ltd
Lone Ridge Gold Mine Ltd
Easter Gift Proprietary Gold Mines Ltd
White Flag Consols Gold Mines Ltd
Selukwe Gold Mining Company Ltd
Gold Reefs of Western Australia Ltd
Gold Mines Corporation Ltd
Hannans Mount Ferrum Gold Mines Ltd
Westralia and New Zealand Gold Explorers Ltd
Nil Desperandum Gold Mines Ltd
Regina (Canada) Gold Mine Ltd
Lake View and Boulder Junction Gold Mines Ltd
Kinsella Gold Mines Ltd
Armadale Gold Mining Company Ltd
Lynx Creek Gold Mining Company Ltd
Kurnalpi Gold Exploration and Development Company (W A) Ltd
Oliphants' Olei Gold Mining Company Ltd
Universal Gold Syndicate Ltd
Norseman Gold Mines Ltd
Pinnacles Gold Mine Ltd
Hannan's Queen Gold Mines Ltd
City of London Gold Mines Ltd
Lady Maude Gold Mines Ltd
Bingham's Randfontein Gold Mining Company Ltd
Shamrock Gold Mining Company Ltd
Herbert Gold Ltd
Utah Consolidated Gold Mines Ltd
General Gordon Gold Mines Ltd
Rose-Hill United Gold Mines Ltd
Joker (Yalgoo) Gold Mines Ltd
Lady Emily Gold Mining Company Ltd
Bellibetta Gold Company Ltd
Victoria Reef Gold Mine Ltd
African Daspoort Gold Mines Ltd
Lochinvar Gold Mines Ltd
Golconda Gold Mines Ltd
Seven Sisters Gold Mines Syndicate Ltd
Moel Offrwm Gold Mining Company Ltd
All Nations Gold Mines Ltd
Cripple Creek Gold and Exploration Ltd
Lady Evelyn Gold Mines Ltd
Hannans United Gold Estates Ltd
Western Star Gold Mining Company Ltd
Corsair Consolidated Gold Mines Ltd
Elandsfontein No 2 Gold Mining Company Ltd
Sunbeam and Vigilant Gold Mines Ltd
Black Swan Gold Mine Ltd
"Hesperus" Gold Mining Company Ltd
Gold Mining Association Ltd
Mount Hepburn Gold Mine Ltd
Unionist Gold Mining Syndicate Ltd
Rhodesian Gold Properties Ltd
Hauraki Gold Properties Ltd
Santa Anna Gold Mining Company Ltd
Menzies Gold Development Company Ltd
British Columbia Gold Syndicate Ltd
Woodleys Reward Gold Mines Ltd
Huttons (Bechuanaland) Gold Reefs Development Company Ltd
Anglo-Rhodesian Gold Mining and Engineering Company Ltd
Mount McDonald Gold Mines Ltd
Great Victoria Gold Mining Company Ltd
Bunyip Gold Mines Ltd
Hikutaia Gold Syndicate Ltd
Dorothy Gold Mining Company Ltd
New Alburnia Gold Mining Company Ltd

Can't wait to see what stories are lurking in the other 175k records!

China seems to have been long stationary, and had, probably, long ago acquired that full complement of riches which is consistent with the nature of its laws and institutions. But this complement may be much inferior to what, with other laws and institutions, the nature of its soil, climate, and situation, might admit of.

Adam Smith in the 'Wealth of Nations'.

I knew the book was a classic, but reading it I discover it is absolutely top-tier. Can't believe it ever took me so long to get round to it. A huge slice of modern economics, up to an intermediate level, consists of formalising concepts contained in it. Smith sets them out intuitively, and in context. It's free on Project Gutenberg.

Primary literature

I've just started using Mendeley to keep track of the primary literature I read. Good interface which picks up a lot of metadata from academic sites.  Much better than delicious which gets more annoying with every upgrade.

A big benefit comes from being able to move on quickly: stuff that's half-way interesting/relevant can be saved for later.  When I know I can find it again any time I have no qualms about putting a paper to one side.

Delving in to the primary literature is something I'm really starting to appreciate.  First perception of a field is complete chaos: unknown names, jargon, many different journals.  The stuff textbooks try (for better or worse) to shield you from.

It's unfortunate really.  Journal articles don't spoon-feed the reader.  That makes them harder work.  They do tend to be short, and to the point though. You can read mounds of popular prose on a topic, and encounter it many times, but often taking a look at a few of the fundamental articles can clear things up much more quickly.

Three of my recent favourites:

It's such an obvious thing to do, but reading 'the literature' has the aura of something difficult.  Often it's easier because first hand accounts of well contained topics over 10-20 pages are surprisingly easy to get into.

So my exhortation of the week is: get onto JSTOR somehow and read some journals!

 

"Since then I have lived without following any particular Way. Thus with the virtue of strategy I practise many arts and abilities - all things with no teacher." Shinmen Musashi, The Book of Five Rings.

The book is about Musashi's martial art. As a duellist undefeated in over sixty fights, his advice on sword technique is no doubt sage.

Of greater importance is his advice on life and approach to any art or skill. It is difficult to write of in detail, but this book is deeply connected with Zen and texts such as the Dao De Jing. Musashi's Way of Walking Alone sets out his precepts concisely.

It is also still relevant to game theory: the study of strategic interactions. The concept of best response is intimately connected with putting yourself in your opponent's mind:

"To become the enemy" means to think yourself into the enemy's position. In the world people tend to think of a robber trapped in a house as a fortified enemy. However, if we think of "becoming the enemy", we feel that the whole world is against us and that there is no escape. He who is shut inside is a pheasant. He who enters to arrest is a hawk. You must appreciate this.

There are also a host of analogies for successfully executing mixed strategies, forcing another party to accept your moves as given, diversion and the like. It's worth listening to just for those.

Taking up two of Sebastian Marshall's recommendations at the same time, I listened to this version from Audible. Excellent translation and well read.  If so inclined you can get a free audiobook with this offer.

Prospect theory with hyperbolic discounting

Reading Ainslie's Breakdown of Will made me wonder what prospect theory type utility functions would look like under hyperbolic discounting.

Hyperbolic discounting says that we give great weight to imminent events. It explains why we eat cookies when we're trying to diet.

Essentially prospect theory says that we like certain gains; and that we detest losses and will take risks to avoid them.

Put the two together and you have a utility function in two dimensions utility = hyperbolic( sigmoidal( reward ), time ).
Here's a very sketchy visual of what you might get from doing this.  There are two rewards, red is half the sigmoid reward of blue, but delivered three units earlier.
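
To make the composition concrete, here is a minimal sketch, separate from the code behind the picture. It assumes tanh as the sigmoidal value function and the common hyperbolic form value / (1 + k * delay) with k = 1; both choices, and the function names, are mine for illustration. Like the picture, it ignores the asymmetry between gains and losses.

```python
import math

def sigmoidal(reward):
    """S-shaped value of a reward (assumption: tanh, symmetric in gains and losses)."""
    return math.tanh(reward)

def hyperbolic(value, delay, k=1.0):
    """Hyperbolic discounting: value shrinks as 1 / (1 + k * delay)."""
    return value / (1.0 + k * delay)

def utility(reward, delay, k=1.0):
    """utility = hyperbolic( sigmoidal( reward ), time )"""
    return hyperbolic(sigmoidal(reward), delay, k)

# Blue: the larger reward. Red: half blue's sigmoid value, three units earlier.
blue_value = sigmoidal(2.0)
red_value = 0.5 * blue_value

def compare(blue_delay):
    """Discounted (red, blue) values when blue is blue_delay units away."""
    return hyperbolic(red_value, blue_delay - 3), hyperbolic(blue_value, blue_delay)

far_red, far_blue = compare(10)    # both rewards still distant
near_red, near_blue = compare(4)   # red is now only one unit away

print("far:  red=%.3f blue=%.3f" % (far_red, far_blue))
print("near: red=%.3f blue=%.3f" % (near_red, near_blue))
```

From a distance the larger, later blue reward wins; once the red reward is imminent it overtakes blue. That is the cookie-versus-diet reversal in miniature.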


Code is on GitHub if you want to play in 3d.

There does seem to be theory on the combination of the two approaches knocking around already - just need to get stuck into it, and see if the sketch-anticipation matches up. (E.g. this sketch doesn't include the asymmetry of gains and losses.)

Thought experiment: alien assault on the financial system?

How happy would you be in utopia? I'm guessing you'd be pretty happy. Imagine some mischievously beneficent aliens came across our little planet and thought so too. The next day they would drop off numerous copies of the plans for uber-duper technology, together with working replicators, cold fusion, intelligent nanobots, et cetera ad fantasium.

The cost of producing food and goods would drop to near zero. The barrage of productivity would bring prices crashing down around us. No more reliance on intensive farming or factories. Wage slaves would be freed from their cubicles to emerge, blinking, into the light of a new dawn. People would wander lonely as clouds in pursuit of their talents. Devote themselves to science and the arts. Or just plain hedonism.

Of course, the global financial system would collapse. With limitless energy, we could kiss goodbye to pension funds stuffed with BP shares. Who'd care about AstraZeneca when their body was defended by legions of nanoscopic robots? With replicators a clutch of pork belly futures would mean about as much as a High Score on Super Mario Bros 3.

If GDP measures financial transactions, and prices dropped to zero, growth would be seriously screwed. The labour market would contract massively in real terms: why work when everything's free? Some might pursue lofty goals, others might prefer to sit back and chillax by a pool somewhere warm.

Would we actually be worse off? It's doubtful.

Sure, we'd have a new set of trade-offs. Study painting or astronomy? Watch TV or play games? There'd be competition for attention and status in creative pursuits. People always operate at the margin, and I'm sure we'd find something else to fall out about and rank ourselves by. But those arguments would take place on a higher plateau of well-being.

There'd probably be all kinds of battles to be fought over the remaining limited resources, such as land. I'm sure our ingenious selves could find some new solutions there: sea steading, levitating cities, moon bases, Martian terraforming, extra-solar planets, whatever.

What's my point? It's just a reminder from a traditional framework: utility is based on consumption of goods, not the associated spending. You derive satisfaction from eating apples, not price(apples)*apples. Technology and other innovations may increase the former and yet decrease the latter. Welfare is not identical to total financial exchange.

General heuristics 1: What happened last time?

The world's a complex place. Full of arbitrary rules and chance. Or deterministic processes pretending to be chance. This presents a general class of problems along the lines of: given the state of the world today what's going to happen tomorrow? Well, one general heuristic is to recall what happened the day after the last day a bit like today.

That sounds hopelessly vague.  But it does submit to formalisation. In machine learning this is the basis for the nearest neighbour technique I've talked about before. Take the n components that make up a state of the world. Your training set is some collection of points in this n-dimensional space, each paired with an endogenous outcome variable.

Confronted with some new state of the world, look which point in your training set is closest to the new point. Hypothesise that the outcome for your new point will match the outcome for the past point.
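
As a minimal sketch of that procedure, here is a one-nearest-neighbour lookup in plain Python. It assumes Euclidean distance as the measure of "closest"; the function names and the toy weather data are made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def nearest_neighbour_predict(past_states, past_outcomes, new_state):
    """Return the outcome paired with the past state closest to new_state."""
    distances = [squared_distance(state, new_state) for state in past_states]
    closest = distances.index(min(distances))
    return past_outcomes[closest]

# Toy training set: each state of the world is a point, paired with what happened next.
past_states = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
past_outcomes = ["rain", "rain", "sun"]

prediction = nearest_neighbour_predict(past_states, past_outcomes, (4.0, 4.5))
print(prediction)
```

There is no smoothing and no model here: the prediction is simply whatever followed the single most similar precedent.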

In the 'real' world this is what lawyers do when looking at precedents. What happened last time a case had X, Y and Z features? It's also a big chunk of what you get from studying history. You're not smoothing the data out at all.  Not saying 'well, this point looks like an outlier so let's weight it down'. Literally, what's the closest situation that happened before, and what was the outcome? Taking into account the full complexity of the situation.

The vulnerability of this strategy lies in data density.  You need to have a hell of a lot of precedents to get this working on any kind of complex system. No good to read a couple of cases on BAILII and expect to best the barristers.

But when it gets working it can beat the pants off parametric solutions. Those are solutions which try to identify the principles at work and extrapolate logically what will happen. But how many rules can you remember at any one time? What's your processing capacity - can you apply the rules and not just enumerate them? When you dial a phone number every piece of technology between you and the recipient operates by the rules. But do you know all the rules, and how they interact? Or just that button pressing leads to voicey noises on the plastic banana?