On the importance of backups
I like to keep my stuff backed up. I've got a huge external drive next to my laptop that I use as a mac time machine. For stuff like tax backups, I also throw a copy in my dropbox. And another on my web server. I used to put a copy on a thumb drive. I wish I could find the source of a piece of advice I once read, but my google skill is lacking today. Basically, the jist was: "If you have n
copies of something, you really only have n - 1
copies for all practical purposes." If you only have one copy of anything, you could find yourself without that thing any moment.
Making bunches of copies may also appeal to my inner hoarder sensibilities without taking up extra space in my house, but I'm going with the "just being safe" argument. At least publicly.
Either way, I realized just the other day that, to my horror, the only copy I have of my blog is the single database at my web host. I know for a fact that they do regular backups, but for my own piece of mind, I'd like to have copies myself.
So I wrote a little script that hooks into my django models, converts my posts to markdown formatting, and saves them to flat files. It also saves some meta data about each post in a related json file. At the moment it's nothing to write home about really, and it messes up the markup that I use for code blocks, but it's a good enough backup for most purposes. I plan on extending it a bit so that I can write posts as flat files and have them converted to html and synced back out to the live site, automatically push the copies to github, etc.
That's the other advantage. Now that my blog posts exist outside of a database, I can push the markdown files out to github. That way, if my grammar sucks or something, anyone can correct it and submit a pull request. That's assuming anyone cares to correct my grammar, but let's not get all academic here.
You can check out my backup script here
, and the repository with copies of all my blog posts here
A little more Quicksorting
This has been quite a week. I've made the decision to take a new job and move my fiancee and myself to Buffalo from Brooklyn. We think that moving there will give us a better opportunity to live the lifestyle we want. A shorter commute, more space and proximity to family are among a whole host of reasons we've decided to do it. It was a difficult decision, and we'll both miss a lot of things about Brooklyn, but I think its for the best, even though I feel a little overwhelmed right now.
Anyway, I was poking through my work computer to see if there was any code that I didn't want to leave behind, and I found something that played off last week's interest in Quicksort. It was a piece of code that I had written when I first taught myself the algorithm. And, unlike most of your old code you run into, this was more nicely written than what I wrote just recently
#what I wrote last week
if len(set) <= 1:
high = 
low = 
pivot = set.pop(len(set) / 2)
for i in set:
if i >= pivot:
return quicksortLoop(low) + [pivot] + quicksortLoop(high)
#what I wrote a long time ago
if len(list) <= 1:
pivot = list.pop(len(list) / 2)
less = [i for i in list if i < pivot]
more = [i for i in list if i >= pivot]
return quicksort(less) + [pivot] + quicksort(more)
Instead of looping through the set and putting values into two arrays, this implementation uses python's list comprehensions to build the more and less lists. I think this approach is much more elegant, yielding easier to read code.
This lead me to the question of whether or not the list comprehensions incurred a signifigant amount of overhead. To find out, I stripped away all the parts that distiguished the two different implementations from one another with the exception of the loop vs list comprehension pieces. Then I wrote some code to the run the two different function through the same paces.
from time import time
from random import randint
#function definitions here
if __name__ == "__main__":
loopTimes = 
compTimes = 
for i in range(1, 1000):
testSet = [randint(0, 100) for x in range(1000)]
cSet = list(testSet)
loopSet = list(testSet)
compTime = time()
compTime = time() - compTime
loopTime = time()
loopTime = time() - loopTime
print "Avg with loop: ", sum(loopTimes)/1000
print "Avg with list comprehension: ", sum(compTimes)/1000
What this is does is sort a thousand lists, each containing a thousand numbers between one and one hundred, using both sorting functions. It counts the time each function takes on each list, and then calculates an average.
The results are sort of interesting.
[snpxw@PWadeiMAC:random-bits ]$ python quicksortTimer.py
Avg with loop: 0.00623641085625
Avg with list comprehension: 0.0061008477211
List comprehensions beat the loop every time, which is the opposite of what I expected. I'll speculate that the difference is either some internal optimization python makes for list comprehensions, or the pre-allocation of the low and high arrays in the loop version. I'm going to have to do a bit more research about what happens inside python to figure it out.
Regardless, the difference is negligible. Even if the list being sorted was five thousand times larger (that is, five million elements), the difference in the two implementations would be about 0.5 seconds. Not really enough to bother most people.
(The full code from this post is here
I've spent a bit of time this week thinking about QuickSort
. More specifically, I've been thinking about the complexity of the Quicksort worst case. On an already sorted set, Quicksort has a complexity of O(n^2), which I guess I understand on a basic level. Generally speaking, when the set is already sorted, Quicksort needs to compare every element to every other element each time it partitions the set. Partition n times, doing n comparisons, then you have n^2 comparisons. Makes sense, but instead of just leaving it, I've been trying to grasp it more definetly. In my mind, the comparisons on subsequent sets should be getting smaller. ie n-1, n-2, n-3 comparisons and so on.
I thought perhaps coding up a measurement of the complexity would help me wrap my head around it.
self.depth = 0
self.complexity = 0
def quickSort(self, set):
self.depth += 1
self.complexity += 1;
if(len(set) == 1 or len(set) == 0):
high = 
low = 
pivot = set
for i in set:
self.complexity += 2
if i > pivot:
return self.quickSort(low) + [pivot] + self.quickSort(high)
if __name__ == "__main__":
q = QuickSorter()
runset = [1, 2, 3, 4, 5, 6, 7]
After a bit of messing around trying to decide what should be part of the complexity measurement, you'll see I decided on counting the check on the length of the array, and the greater than and less than checks on values. It all seemed to make perfect sense, but my numbers kept coming out wrong.
On an already sorted array of length 3, instead of getting back 9 operations, I got 11. A set of 6? 41 operations. It was making me insane, but then I realized that the actual complexity of the worst case ended up being O(n^2 + (n - 1)). The n-1 comes from the length checks on all of the empty sets. This promptly made me remember the way big O notation is handled, where the slowest growing elements of the complexity are discarded. What is the growth of n-1 in comparison to the speed of n^2? Insignificant.
It was a nice bonus to figure that all out, but I still don't think I'm satisifed. The issue with n-1 should be a lesson that understanding the complexity of these algorithms is less about actual exact measurement, and more about mathematical understanding of the situation. So, while I think I "get" it enough to get by, I'm going to spend a bit of time re-familiarizing myself with the math.
Anyway, I put the code up here.
There is also an implementation of the in-place version of Quicksort there as a bonus.
As a quick aside, I realize that picking the first element in the array as a pivot is bad practice, as it often triggers worst case behavior. That was my intention here.
I've spent a bit more time reading after writing this post, and I've come to the following understanding of Quicksort: in the best case, we evenly divide the array into roughly half the size, so we recursively build log n levels before we reach our base case, a list of size one. On each level, there is O(n) amount of work done, which gives us the case, O(n log n). By the same token, when we encounter the worst case, we must recursively divide the set n times before we get to the base case, giving us n levels with O(n) complexity, coming out to the worst case O(n^2).
Whew. That feels much better.
Instead of a blog post here, this week I opted to write about bread over at my (and Hillary's) cooking blog: Us Versus Dinner
. If you're into that sort of thing, or just like looking at pictures of tasty bread, give it a look. I'll be back here next week with something more appropriately nerdy.
Why bitbucket is great
Version control is the software miracle that nobody seems to know about until they get their first job. I've read plenty of forum posts from computer science students asking "What don't they teach you in school?", and inevitably the first response is about version control. It was true for me.
Even for solo development VC is great, and since I've known about it, I've hosted all of my code in a private Subversion repository. This makes working on my code from anywhere easy, and has the added benefit of keeping a backup of my work.
Just as I was becoming comfortable with Subversion, newer distributed version control systems popped up. As the hype around both Git and Mercurial grew, I decided it would be in my best interest to bring myself up to speed. At work we considered switching to Mercurial, mainly on the grounds that it had wider operating system compatibility, so I created myself a Bitbucket account to aid in learning. At the time, Github was exclusively git, and Bitbucket exclusively mercurial.
Both Bitbucket and Github brand themselves as "social coding" websites, but even as I began to use my Bitbucket for all my non-private projects, I never really thought about it. My main reasoning for using it was as an added facet to my resume. Because of the nature of programming, it is sometimes difficult to express your ability level, especially in an interview. Being able to point someone at a big pile of code you've written is helpful.
I really discovered why the "social" aspect of these sites is a big selling point a couple of weeks ago. I posted about GreyHolder
, a little jQuery plugin that I wrote and released on my Bitbucket. What I didn't realize was that is contained a bug that could potentially throw away user input. Not very cute behavior for something that is supposed to make life easier on the user. I didn't notice the bug, but my friend Dan
did. Because the code was on Bitbucket, he was able to grab it, fix the bug and then send me a pull request to grab the fix. That meant that all I had to do to fix the problem was click a button accepting his new code.
I don't know if you've ever fixed a bug with the press of a button, but it's great, and it should be (and is) the main selling point for code hosting sites. That capability is the reason I'm going to be using code hosting for all my future projects, even the private ones.