Rational Girl Attempting to be rational while dreaming of 3.141592653589793... en-us Wed, 08 May 2013 00:00:00 -0700 <![CDATA[profiling code]]>

profiling code

Ipython is my favorite interactive environment. So lets set up profiling in Ipython. You need to install three key packages (Im assuming you have ipython installed, this example is based on ipython 0.13). For each package I show three possible ways to install

  • using easy_install on a machine where you have sudo
  • using easy_install on a machine you do NOT have sudo, into your own local directory
  • using pip

ipython version 0.13

install line_profile


sudo easy_install line_profiler

easy_install --prefix /path/to/users/local line_profiler

pip install line_profiler

install psutil


sudo easy_install psutil

easy_install --prefix /path/to/users/local psutil

pip install psutil



sudo easy_install memory_profiler

easy_install --prefix /path/to/users/local memory_profiler

pip install memory_profiler

Create ipython profile


To generate the default configuration files, do:

$> ipython profile create profiler

$> mkdir ~/.config/ipython/profile_profiler/extensions

ipython profile create profiler

creates ipython_config.py in ~/.config/ipython/profile_profiler

The location may be different for you, but when you call the command it tells you where it put the new profiler directory (cause thats the right thing to do)...

If you dont see this you can locate your ipython config directory which will contain your profiles

ipython locate

in the ~/.config/ipython/profile_profiler/extensions directory create two new files


import line_profiler

def load_ipython_extension(ip):
    ip.define_magic('lprun', line_profiler.magic_lprun)


import memory_profiler

def load_ipython_extension(ip):
    ip.define_magic('memit', memory_profiler.magic_memit)
    ip.define_magic('mprun', memory_profiler.magic_mprun)

Next update ipython_config.py to use these extensions

c.InteractiveShellApp.extensions = [

c.TerminalIPythonApp.extensions = [

Now you can call your profiler profile

ipython --profile=profiler

Python 2.7.3 |EPD 7.3-2 (64-bit)| (default, Apr 11 2012, 17:52:16)
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

IPython profile: profiler

In [1]: %memit?
In [2]: %lprun?
In [3]: %timeit?

Lets look at a simple example profiling MaskedArray methods

Time Profile

In [11]: from numpy.ma import MaskedArray
In [12]: jnk = np.random.random(100)
In [13]: jnk = MaskedArray(jnk, jnk>.6)

In [14]: %timeit jnk.mean()
10000 loops, best of 3: 29.1 us per loop

In [15]: %timeit np.mean(jnk[jnk <= .6])
10000 loops, best of 3: 150 us per loop

Line Profile

Usually you run this on a file to profile the code line by line

%lprun -f <filename>.py <method>

Memory Profile

%memit -r <R> statement

this gets the max memory usage over 100 loops

In [22]: %memit -r 100 jnk.mean()
maximum of 100: 50.273438 MB per loop

In [23]: %memit -r 100 np.mean(jnk[jnk <= .6])
maximum of 10: 50.273438 MB per loop
Wed, 08 May 2013 00:00:00 -0700 <![CDATA[Errors Cluster]]>

Errors Cluster

Software Facts and Fallacies

Robert Glass

I recently did a quick slide stack looking into Fact 49: Errors Cluster


Loved the book, but was a little light on research. Luckily I was pointed to this book.

` Making Software: What Really Works, and Why We Believe It by Oram and Wilson http://www.amazon.com/Making-Software-Really-Works-Believe/dp/0596808321/

Sat, 04 May 2013 00:00:00 -0700 <![CDATA[masked arrays]]>

masked arrays

Masked arrays can be handy

  • to isolate important data (grab ROI from a dataset, eg 4D fMRI)
  • or remove invalid entries from a data set ( missing or NAN )

Methods not using maskedarray

Find indicies matching logical condition (find all the values in a random matrix > 0.3)

In [1]: import numpy as np
In [2]: data = np.random.random((2, 3, 4))
In [3]: ind = np.where(data < 0.3)

In [5]: ind
(array([0, 0, 1, 1, 1, 1, 1]),
 array([0, 1, 0, 0, 1, 1, 2]),
 array([0, 2, 0, 2, 0, 1, 1]))

This can give you a flat array of this data

In [10]: data[ind]
array([ 0.22709003,  0.05852228,  0.09808892,  0.02375857,  0.20972922,
        0.00940121,  0.04954049])

This is another way to quickly get that flat array

In [11]: sample = data[data < .3]

In [12]: sample
array([ 0.22709003,  0.05852228,  0.09808892,  0.02375857,  0.20972922,
        0.00940121,  0.04954049])

In [13]: np.allclose(sample, data[ind])
Out[13]: True

But sometimes we want a slightly different behavior, this is where masked arrays turn out to be quite handy.

However, when working with a masked array, setting ismasked is a logical True, so we can get unexpected behavior.

In [14]: import numpy.ma as ma
In [16]: mdata = ma.MaskedArray(data, data < .3)

[19]: mdata.data
array([[[ 0.22709003,  0.45430854,  0.79467758,  0.93408729],
        [ 0.96287444,  0.35465963,  0.05852228,  0.78461907],
        [ 0.45160738,  0.93019333,  0.33235178,  0.55158758]],
       [[ 0.09808892,  0.63505732,  0.02375857,  0.50787565],
        [ 0.20972922,  0.00940121,  0.87705912,  0.6506519 ],
        [ 0.88436047,  0.04954049,  0.42319976,  0.42837593]]])
In [24]: mdata.data.shape
Out[24]: (2, 3, 4)

In [25]: mdata.min()
Out[25]: 0.33235177579218356

Note that we have the values that are > 0.3, not < 0.3. This is because the mask puts True values where the condition we pass is True, so to get the same behavior as above, we pass the negation of our logical mask.

In [40]: mdata = ma.MaskedArray(data, (data < .3) == False )

In [41]: mdata.min()
Out[41]: 0.0094012111913013285

In [42]: mdata.max()
Out[42]: 0.22709002636526754

The beautiful thing about this is you can work on the data in the array, without reshaping, or dealing with mapping a flat array back to a nd array....which turns out to be useful.

But if you do need the masked data, it is also simple

In [45]: mdata.compressed()
array([ 0.88658832,  0.61704277,  0.82931015,  0.52735743,  0.96018349,
        0.43557534,  0.52140037,  0.74956647,  0.33020142,  0.35757183,
        0.94524188,  0.39382885,  0.67163231,  0.54989501,  0.82149955,
        0.7622946 ,  0.81627146,  0.97645765,  0.87780762,  0.84369126])

Though do not mistake this with the compress method which returns a MaskedArray. This requires a flat mask, but you can pass a new mask to the data, which will ignore the old mask and return values in the new mask. NOTE: it will only return the data in the mask, and by default, its mask is all False.

In [17]: simple = MaskedArray([1,2,3,4], [0,0,1,1])

In [18]: simple.data
Out[18]: array([1, 2, 3, 4])

In [19]: simple.compressed()
Out[19]: array([1, 2])

In [21]: newsimple = simple.compress([1,1,0,0])

In [22]: newsimple.data
Out[22]: array([1, 2])

In [23]: newsimple.mask
Out[23]: array([False, False], dtype=bool)
Mon, 29 Apr 2013 00:00:00 -0700 <![CDATA[Pandas: create dataframe]]>

Pandas: create dataframe

I often find myself needing to extract data from disparate data sources and combine them into a csv or excel file for others to then work with. Of course I fall back on pandas as it is a data wrangling guru...

to create an empty dataframe:

import numpy as np
import pandas
columns = ['some', 'column', 'headers']
index = np.arange(103) # array of numbers for the number of samples
df = pandas.DataFrame(columns=columns, index = index)

Now I can populate my new data frame:

myarray = np.random.random((10,3))
for val, item in enumerate(myarray):
    df.ix[val] = item

If my data is already in a list of dicts, it is even easier:

mydata = [{'subid' : 'B14-111', 'age': 75, 'fdg':1.78},
          {'subid' : 'B14-112', 'age': 22, 'fdg':1.56},]
df = pandas.DataFrame(mydata)
Mon, 15 Apr 2013 00:00:00 -0700 <![CDATA[Award IPython]]>

Award IPython

Kudos to the ipython team (esp Fernando) for a well deserved award.

Free Software Award

While at pydata, the percent of talks that used ipython notebook @ 98% For the other talks I saw, if not done in ipython, I remember feeling a bit sad.

For a great collection of notebooks, including my favorite

XKCD plots in matplotlib

check out http://nbviewer.ipython.org/

Sat, 23 Mar 2013 00:00:00 -0700 <![CDATA[pydata 2013]]>

pydata 2013

Peter Norvig - Learning Python

I attended this amazing talk at pydata2013

Learning Python by Peter Norvig (vimeo)

Heres a few of the key points I noted.

Major errors The hard parts of learning python, common mistakes

  • create list where each item is a reference
    • change one -> change all
  • setting value vs checking equality
    • x = 1 vs x == 1
  • understanding pointer to reference vs copy:

a = 1
b = a
a = 2

# what is b = ?

Compare this to this behavior

a = [1, 2, 3, ]
b = a
a[0] = 9

b[0] # what is b[0]
  • implicit operators ( x + 1 )( x * 2 )
    • understanding why this fails

The 2 Sigma Effect

Benjamin Bloom wikipedia: Bloom 2 sigma problem

one-to-one tutoring students using mastery learning results in performance of 2 standard deviations better than control (classroom instruction) class.

Effective Measures for MOOC (Massive Open Online Courses)

  • willpower / due dates
  • peer support / peer grading
  • Faculty Engagement ( email), personal feel to contact
  • Authenticity (reputable teachers, institution )

Try to influence how a student learns

Introduce the problem, then the explanations

Rate of Learning

  • mastery vs tutoring, hold student back until 90% mastery
  • interactive assignments, use tools to see how students solve, and struggle
  • student control over rewind (esp video which allows questions without
  • self-consciousness
  • changing opinions: value struggling through problems to find solution
  • flipped classroom video at home, discussion in class

Hal Varian

get together problem sets and exams you want them to be able to solve, and then write book that teaches them how to solve

simliar to test driven development

Software Carpentry

Greg Wilson has been at the forefront of this issue in teaching computing to scientists, check out Software Carpentry

Tue, 19 Mar 2013 00:00:00 -0700 <![CDATA[large correlation]]>

large correlation

I recently needed to run cross correlations on some largish data sets, and the standard numpy.correlate was a bit too slow.

In [30]: import numpy as np
In [31]: sample = np.random.random(1000)
In [32]: sample_same = np.correlate(a,a, mode='same')
1000 loops, best of 3: 801 us per loop

So moved to use fftconvolve in scipy.signal

In [34]: from scipy import signal
In [35]: # need to pad second array with zeros (twice size of a)
In [36]: padded_sample = np.zeros((1000*2))
In [37]: start = 1000 /2
In [38]: stop = 1000 /2 + 1000
In [39]: padded_sample[start:stop] = sample
In [40]: # note , we need to flip a to do correlation (instead of convolution)
In [41]: timeit signal.fftconvolve(padded_sample, sample[::-1],mode = 'valid')
1000 loops, best of 3: 548 us per loop

Results should be the same, though the fft_convolve pads

In [53]: sample_same.shape
Out[53]: (1000,)

In [54]: sample_fft.shape
Out[54]: (1001,)

In [55]: np.allclose(sample_same, sample_fft[:-1])
Out[55]: True
Sun, 17 Mar 2013 00:00:00 -0700 <![CDATA[Pi Day]]>

Pi Day

March 14 pi day

I love Pie (strawberry rhubarb, blueberry) and \(\pi\) . So this is a favroite day.

And on this day, many wonderful people were born (not that it really means anything)

I have taught a few elementry classes on Pi day. The best exercise involved having a set of colored paper strips (each color specific to a digit 0-9), and having the kids make a paper chain by adding one digit at a time. We tried to find patterns in paper-chain code. (It was exciting to see kids find pseudo patterns and get excited, the human brain loves patterns). But then we would add more and another pattern would be ruined. I left it as an excercise for the kids, and their teacher said thay had a blast adding paper all through the day, they were disappointed when they ran out of paper.

And now, for no reason:

../../../_images/Random_Sierpinski_Triangle_animation.gif ]]>
Thu, 14 Mar 2013 00:00:00 -0700 <![CDATA[Setting up Vim]]>

Setting up Vim

I recently found myself using VIM more and more. This, along with some OS and software upgrades had me revisit my .vimrc

There were some really nice examples on Sontak’s Humble Abode: vim as python ide

And, as I use ReST so often, I found this little keybinding useful, and added it to my .vimrc

" reST macro. Copy and paste a header, then select it
" and replace with the next character I type.
" For easily generating those section headers.

let @h = "yypVr"

A number of people have asked me if they should learn vim or emacs. IMHO, either can be useful, just as well as gedit , text wrangler, and many others. If you ae new to coding, it doesnt really matter, use something intuitive to you, with syntax highlighting for your coding language(s). Later, I find that you will want to install add-ons, for greater functionality like code folding , auto completion, and default templates for classes and funtions and other standard structures. Until then, an intuitive editor is the best.

Vim is useful, as the command keys can help you avoid using a mouse (RSI , and has the same key-mappings as unix man pages, and programs such as less.

Best advice:

Keep the configuration files (.vimrc, .emacsd ) for your editor in a repository somewhere accessible (bitbucket or github, or even in your dropbox.)

When you are on a new system, it makes it easy to grab your configuration and work in a familiar environment.

Wed, 27 Feb 2013 00:00:00 -0800 <![CDATA[Sorting structured array]]>

Sorting structured array

Here is a nice numpy quickie that can be handy.

Lets say you have some data of mixed type, and you need to do some fancy sorting

import numpy as np

subid = ['B12-01', 'B12-05', 'B12-10', 'B13-02', 'B13-05']
age = [ 65, 72, 62, 88, 71]
biomarker_index = [1.312, 2.335, 1.122, 1.091, 3.005]
values = [(a,b,c) for a,b,c in zip(subid, age, biomarker_index)]
dtype = [('subid', 'S5'), ('age', int), ('biom', float)]
sarray = np.array(values, dtype=dtype)

You now have a structured array that allows you to easily index, the data, but it also makes sorting the data easy For example, you could sort by age, then by biomarker,and get your subids in this new order.


age_biom_sorted = np.sort(sarray, order=['age', 'biom'])
Wed, 27 Feb 2013 00:00:00 -0800 <![CDATA[Ipython Notebook]]>

Ipython Notebook

Teaching quick courses can be difficult, and I find that the ipython notebook can be a great tool, it allows you to interlace text (using my favorite fall-back Markdown with an ipython interpreter.

These can be shared on github as gists nbviewer

One of my favorite recent uses of the ipython notebook refers to the xkcd comic, and shows how to generate hand drawn looking plots. hand drawn plots in matplotlib

Keyboard Shortcuts

Control-m s

Save notebook


Execute current code


execute in current node, but stay in node (for quick clcs you dont wnt to keep)


execute and open new cell
Sun, 24 Feb 2013 00:00:00 -0800 <![CDATA[Command Line Basics (cd, pwd, ls)]]>

Command Line Basics (cd, pwd, ls)

I love the command line, it is a direct interface for talking to your OS.

In addition it does not require using a mouse, and is like a well stocked toolbox, has everything you need to do so many things.

Lets start with a couple simple ones...

  • cd (change directory)
  • pwd (print working directory)
  • ls (list directory)

And a directory structure...

Directory File Structure

graph ER {           node [shape=box]; root [label = "/"]; home; tmp; datadir; local; bin; user;  node [shape=ellipse]; file1; file2; file3; hidden [label = ".hidden"]; Bee; FileTwo;  node [shape=diamond,style=filled,color=lightgrey]; ls; which; cat;   root -- local -- bin;         bin -- ls;         bin -- which;         bin -- cat;         root -- tmp;         root -- home -- user --datadir;         datadir -- file1;         datadir -- file2;         datadir -- file3;         datadir -- hidden;  datadir -- Bee;  datadir -- FileTwo;  fontsize=20;         rankdir = "LR"; }

cd or “Change directory” and pwd “Print Working Directory”

When you want to move around the file structure, cd is your tool

Used alone, it will take you to your home directory.

We will use pwd (print working directory) to see where it takes us


When using cd it is relative to your current location, so use pwd to know where you are, and then use cd to get to where you want to be.

In the example below, we move to the root directory /. since home is relative to this directory, we can just move to home/user.

cd /

cd home/user

special options

cd .. (moves you backward a directory)


cd ..


cd ../../tmp

List files and directories

Change directory to user

cd /home/datadir/user

List everything in the current directory (alphabetical order, Capital comes before lower case))


Bee FileTwo file1 file2 file3

Notice we do not see the .hidden file. Hidden files by default are not displayed. To ist hidden files, use the -a flag.

ls -a

. .. .hidden Bee FileTwo file1 file2 file3

the -l flag allows you to see more information about your files and directories

ls -al

total 0
drwxr-xr-x  6 user  group  204 Feb 17 22:11 .
drwxr-xr-x  5 user  group  170 Feb 17 22:00 ..
-rw-r--r--  1 user  group    0 Feb 17 22:10 .hidden
-rw-r--r--  1 user  group    0 Feb 17 22:00 Bee
-rw-r--r--  1 user  group    0 Feb 17 22:00 FileTwo
-rw-r--r--  1 user  group    0 Feb 17 22:09 file1
-rw-r--r--  1 user  group    0 Feb 17 22:10 file2
-rw-r--r--  1 user  group    0 Feb 17 22:11 file3

List items in directory in order of creation time (most recent first)

ls -t

Bee FileTwo file3 file2 file1

List items in reverse order they were created (most recent last)

ls -rt

file1 file2 file3 FileTwo Bee
Sun, 17 Feb 2013 00:00:00 -0800 <![CDATA[pandas: some basics]]>

pandas: some basics

PANDAS is a data analysis library, that I love.

For one, while I dislike excel files, I find myself having to deal with them on a regular basis. Luckily PANDAS provides some nice tools for pulling them into python

import pandas
myfile = 'some_excel_file.xls'
Read more...]]>
Fri, 08 Feb 2013 00:00:00 -0800 <![CDATA[starting a blog]]>

starting a blog

I was looking to use Sphinx and ReST to start a blog, not because I have so much to say, but as a way to keep track of daily code solutions. I found Tinkerer and so here it is.

Set-up was trivial.

Read more...]]>
Tue, 05 Feb 2013 00:00:00 -0800