A positive day 2006 July 13 21:00

Posted by diamond in: Cycling, Work

For those who aren’t interested in cycling, skip the next paragraph. For those who aren’t interested in tech details of a bug i fixed today, skip the rest of the post. You’ve been warned -)

Yesterday, i went for a cycle and hit a new personal best average speed of 28.56kmph over 21.76km. Today, i went out for a recovery spin; take things nice and easy, take extra care to stretch out properly, avoid pushing. I think i might have done it wrong though: 24.89km @ 28.73kmph. A new record, over a longer distance too. Woo -) I also remembered something i decided on a few weeks ago: the difference between road biking and mountain biking is that the latter is about conquering your environment, and the former is about conquering yourself.

Over the last 6 weeks or so, since i had to basically re-work my code for the shannon portal project from the ground up, there’s been a very subtle and impossible-to-reproduce bug occurring every now and then. I didn’t focus too much on it before this week because it was very rare, there was always a good chance that i had fixed it without noticing, and i had lots and lots of more pressing issues. However, for whatever reason, as soon as we installed the portal in shannon, it started occurring constantly.

The portal software is a message-passing system with a central state-machine daemon (monitor.py). It receives the events from the other pieces of software, and sends out the appropriate replies based on the state it’s currently in. In order to make sure that the links are still working, all daemons send out ‘ping’ events every 10 seconds. This, after a fair bit of initial work, works well and reliably.
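To make that a bit more concrete, here is a minimal sketch of the idea, not the real monitor.py. The class name, the states, the events and the three-missed-pings grace period are all made up for illustration; only the 10 second ping interval comes from the description above.

```python
import time

PING_INTERVAL = 10.0  # seconds between pings, as described in the post


class Monitor:
    """Toy central state machine: maps (state, event) to (new state, reply)."""

    # Hypothetical transition table; the real states and events are not given.
    TRANSITIONS = {
        ("idle", "page_loaded"): ("active", "redirect:/welcome"),
        ("active", "page_loaded"): ("active", "redirect:/status"),
        ("active", "timeout"): ("idle", None),
    }

    def __init__(self):
        self.state = "idle"
        self.last_seen = {}  # peer name -> time of the last message from it

    def handle_event(self, peer, event, now=None):
        """Record that the peer is alive, then reply based on the current state."""
        now = time.monotonic() if now is None else now
        self.last_seen[peer] = now
        if event == "ping":
            return None  # pings only refresh liveness, they never get a reply
        new_state, reply = self.TRANSITIONS.get(
            (self.state, event), (self.state, None)
        )
        self.state = new_state
        return reply

    def dead_peers(self, now=None, grace=3):
        """Peers that have been silent for more than `grace` ping intervals."""
        now = time.monotonic() if now is None else now
        limit = grace * PING_INTERVAL
        return sorted(p for p, t in self.last_seen.items() if now - t > limit)
```

Passing `now` explicitly is only there so the sketch can be poked at without waiting around for real seconds to pass.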

However, there is also a cgi script that provides all the web logic for the front end. When certain pages are loaded it needs to query monitor.py to find out where it should redirect to. So, before outputting anything to the browser, it connects to monitor.py using a standard comms library i’d written, sends an event to say what page has just been called, and expects a message in return.

The bug caused the cgi script to not always receive a reply. I only managed to pin this down yesterday. The code would loop waiting for a message, discarding any ‘ping’ messages that it might receive in the meantime. The issue lay in the fact that the comms.handle() function it was using to wait for a message also took care of sending pings, and would return after sending or receiving one message. So, every now and then, comms.handle() would send a ping instead of receiving the message, and the cgi script would close the connection thinking it had what it needed. Instant crash.
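Here is one way that failure could play out, as a small simulation rather than the actual code. FakeComms, its `ping_due` flag and wait_for_reply_buggy() are hypothetical stand-ins, and having handle() return None after sending a ping is an assumption about how the real library behaved.

```python
from collections import deque


class FakeComms:
    """Stand-in for the comms library: one action per handle() call."""

    def __init__(self, incoming, ping_due):
        self.incoming = deque(incoming)  # messages waiting to be received
        self.ping_due = ping_due         # True if the 10 second timer has expired
        self.sent = []

    def handle(self):
        """Send a ping if one is due, otherwise receive one message."""
        if self.ping_due:
            self.ping_due = False
            self.sent.append("ping")
            return None  # nothing was received on this call
        return self.incoming.popleft() if self.incoming else None


def wait_for_reply_buggy(comms):
    """Loop until something other than a ping comes back."""
    while True:
        msg = comms.handle()
        if msg == "ping":
            continue  # discard keepalives from the other side
        return msg  # bug: None (we only *sent* a ping) is taken as the reply


# Usually no ping is due, so the reply arrives as expected.
ok = wait_for_reply_buggy(FakeComms(["ping", "redirect:/home"], ping_due=False))

# Every now and then a ping is due, and the caller walks away empty-handed.
lost = wait_for_reply_buggy(FakeComms(["ping", "redirect:/home"], ping_due=True))
```

The timing dependence is what makes this sort of thing so hard to reproduce: it only goes wrong when the ping timer happens to expire in the short window while the cgi script is waiting.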

So, adding a parameter to comms.handle() to say whether you wanted pings to be sent or not fixed the issue, and relieved me greatly. I was hoping to go to waterford this weekend for a birthday party, but if i couldn’t track down this bug, my weekend would be gone. As a bonus, fixing this also cleared up a bunch of other mysterious, previously unresolved issues, lowering my stress levels considerably -)
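A rough sketch of what such a parameter might look like, using a simulated comms object. The name `send_pings`, the FakeComms class and the `max_calls` safety limit are all assumptions for illustration; the real signature of comms.handle() isn't shown here.

```python
from collections import deque


class FakeComms:
    """Stand-in for the comms library: one action per handle() call."""

    def __init__(self, incoming, ping_due):
        self.incoming = deque(incoming)  # messages waiting to be received
        self.ping_due = ping_due         # True if the ping timer has expired
        self.sent = []

    def handle(self, send_pings=True):
        """Optionally send a due ping, otherwise receive one message."""
        if send_pings and self.ping_due:
            self.ping_due = False
            self.sent.append("ping")
            return None  # nothing was received on this call
        return self.incoming.popleft() if self.incoming else None


def wait_for_reply(comms, max_calls=100):
    """Receive until a real reply arrives, never spending a call on a ping."""
    for _ in range(max_calls):
        msg = comms.handle(send_pings=False)
        if msg is None or msg == "ping":
            continue  # nothing yet, or just a keepalive from the other side
        return msg
    raise TimeoutError("no reply from monitor")


comms = FakeComms(["ping", "redirect:/home"], ping_due=True)
reply = wait_for_reply(comms)
```

A short-lived client like a cgi script has little reason to send keepalives at all, so switching pings off for it seems like a natural fit; the long-running daemons can keep the default.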
