Logically correct operationally horrible

A med school classmate who graduated from the University of Chicago was fond of saying — that’s how it works in practice, but how does it work in theory?

Exactly the opposite happened when I had to do some programming. It shows the exact difference between computer science and mathematics.

Basically I had to read a large textEdit file (minimum 2 megaBytes, Maximum 8) into a Filemaker Table and do something similar 15 times. The files ranged in size from 20,000 to 70,000 lines (each delimited by a carriage return). They needed to be broken up into 1000 records.

Each record began with “Begin Card Xnnnn” and ended with “End Card Xnnn” so it was easy to see where each of the 1000 cards began and ended. So a program was written to
l. look for “Begin Card Xnnn”
2. count the number of lines until “End Card Xnnn” was found
3. Open a new record in Filemaker
4. Put the data from Card Xnnnn into a field of the record
5. Repeat 1000 times.

Before I started, I checked the program out with smaller sized Files with 1, 5, 10, 50, 100, 200, 500 Cards.

The first program used a variable called “lineCounter” which just pointed to the line being processed. As each line was read, the pointer was advanced.

It was clear that the runTime was seriously nonLinear 10 cards took more than twice the time that 5 cards did. Even worse the more cards in the file the worse things got, so that 1000 cards took over an hour.

Although the logic of using an advancing pointer to select and retrieve lines was impeccable, the implementation was not.

You’ve really not been given enough information to figure out what went wrong but give it a shot before reading further.

I was thinking of the LineCounter variable as a memory pointer (which it was), similar to memory pointers in C.

But it wasn’t — to get to line 25,342, the high level command in Filemaker — MiddleValues (Text; Starting_Line ; Number_of_Lines_to_get) had to start at the beginning of the file, examine each character for a character return, keep a running count of character returns, and stop after 25,342 lines had been counted.

So what happened to run time?

Assume the LinePointer had to read each line (not exactly true, but close enough).

Given n lines in the file — that’s the sum of 1 to n — which turns out to be n^2 + n. (Derivation at the end)

So when there were 2*n lines in the file, the runtime went up by over 4 times (exactly 2^2 * n^2 + 2n)

So run times scaled in a polynomial fashion k * n lines would scale as k^2 * n^2 + k * n

At least it wasn’t exponential time which would have scaled as 2^k

How to solve the problem ?

Think about it before reading further

The trick was to start at the first lines in the file, get one card and then throw those lines away, starting over at the top each time. The speed up was impressive.

It really shows the difference between math and computer science. Both use logic, but computer science uses more

Derivation of sum of 1 to n.

Consider a square n little squares on a side. The total number of squares is n^2. Throw away the diagonal giving n^2 – n. The number of squares left is twice the sum of 1 to n – 1. So divide n^2 – n by 2, and add back n giving you n^2 + n

Advertisements
Post a comment or leave a trackback: Trackback URL.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: