o11y.guru: the history of the first bug

07 Nov 2019 · Four minute read · on Gianluca's blog

Introduction

This article is part of a series about a side project of mine called o11y.guru. Who knows what this series, or the project itself, will become. I started it to have my own wonderland: a place where I can make my mistakes without any intermediary.

This series is about my journey there. I will keep this Introduction common to all the articles in the series, and I will keep the Table of Contents up to date. The best way to follow my journey is to subscribe to the RSS feed or to follow me on Twitter.

What is this project?

I had the idea of building a mechanism that lets people who use Twitter follow a group of leaders in a particular space. I decided to start with observability (#o11y) because it is the area I am working in at the moment.

o11y.guru is pretty simple: a list of faces and a button. You press it and, if you authorize the Twitter application, you follow all of them.

Table of Contents

  1. First day and first set of iterations
  2. Build process and automation driven by simplicity
  3. Monitoring and instrumentation with Honeycomb
  4. The history of the first bug
  5. OpenTelemetry it is time to embrace a unicorn standard
  6. The magic of structured logging
  7. Infrastructure monitoring with InfluxCloud
  8. Infrastructure as code with Terraform and CherryServer. First deploy.
  9. FAQ

The history of the first bug

After the first deploy I used my Twitter accounts @gianarb and @devcampy to try the application. I also asked a friend to try it out.

So far so good. The way I coded the follow workflow is very basic, and it will probably reach its scalability limit quickly. It is a loop with a time.Sleep(5 * time.Second) pause between accounts to stay under the Twitter rate limit.


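for _, guru := range gurus {
    // Pause before each follow request to stay under the Twitter rate limit.
    time.Sleep(5 * time.Second)
    err = newFriendship(ctx, twitterClient, guru)
    if err != nil {
        // Log the failure and move on to the next guru; there is no retry.
        logger.Warn(err.Error(), zap.Error(err))
    }
}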

No retries or anything like that for now. Very simple. I hope to iterate on it in the future, when it stops working well enough.

If the Twitter API request to follow a person fails, the application does not surface the error to the user; it logs a warning and moves on to the next one. As far as I could tell, all three tests went well: all three accounts followed the gurus.
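To make the "log and move on" behaviour easier to reason about, here is a minimal sketch of the same loop with the follow call injected as a function, so it can be exercised without the Twitter API. The names followFunc, followAll and runExample are hypothetical and are not part of the o11y.guru code; the real application calls newFriendship directly and sleeps five seconds between accounts.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// followFunc is a hypothetical stand-in for newFriendship with the
// Twitter client already bound to it.
type followFunc func(ctx context.Context, guru string) error

// followAll tries to follow every guru, pausing before each attempt.
// A failed follow is recorded and the loop moves on: there is no retry.
func followAll(ctx context.Context, gurus []string, pause time.Duration, follow followFunc) map[string]error {
	failed := make(map[string]error)
	for _, guru := range gurus {
		time.Sleep(pause)
		if err := follow(ctx, guru); err != nil {
			failed[guru] = err
		}
	}
	return failed
}

// runExample drives followAll with a fake follow function that rejects
// one account, mimicking the error shown later in this article.
func runExample() map[string]error {
	fake := func(ctx context.Context, guru string) error {
		if guru == "gianarb" {
			return errors.New("you can't follow yourself")
		}
		return nil
	}
	gurus := []string{"guru_one", "gianarb", "guru_two"}
	return followAll(context.Background(), gurus, time.Millisecond, fake)
}

func main() {
	for guru, err := range runExample() {
		fmt.Printf("could not follow %s: %v\n", guru, err)
	}
}
```

Returning the failures instead of only logging them is a design choice of this sketch; it makes a silent failure visible to the caller, which is exactly what was missing during the first manual tests.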

One of the first benefits of using Honeycomb is that, out of the box, it detects errors by looking at the events you send, and it builds the graphs for you. Just clicking around the UI, I ended up with something weird, like this graph:

Requests break down by HTTP Status

I noticed some responses with a 500 status, and I do not like that. As you can see there is an Error tab, also built by Honeycomb, and this is what it showed me:

Span with an error

At this point it was clear to me where the problem was: “You can’t follow yourself”. It sounds reasonable.

I changed the code and added a simple if statement to skip a guru when it is the same person who is following everybody else.

// me comes from above, where I validate that the token belongs to a user.
if guru == me.ScreenName {
    continue
}

It sounds trivial, but when I tried it the fix didn’t work.

I decided to approach the problem differently. Spoiler alert: I didn’t write any unit test yet. Feel free to leave now.

Looking at the trace, I knew that every follow request had its own span carrying the name of the guru to follow, and that the root span recorded who had asked to follow the gurus. In practice, the root span had required_by=me.ScreenName, and every guru had a span with their name. The next image shows those two spans side by side:

Looking at these spans the situation is clear: I was comparing GianArb, the required_by value you see on the right, with gianarb, the follower_screenname you see in the span on the left.

In the end, the check needs to be case-insensitive. And that’s how it is now:

if strings.EqualFold(guru, me.ScreenName) {
    continue
}
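As a small, self-contained illustration of why the first fix failed and the second one works, the sketch below compares the two screen names from the trace with == and with strings.EqualFold. The helper name shouldSkip is hypothetical; in o11y.guru the check lives inline in the loop.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSkip reports whether guru and me are the same Twitter account,
// ignoring case. Hypothetical helper wrapping the check shown above.
func shouldSkip(guru, me string) bool {
	return strings.EqualFold(guru, me)
}

func main() {
	guru := "gianarb" // screen name as stored in the list of gurus
	me := "GianArb"   // screen name as returned for the authenticated user

	fmt.Println(guru == me)           // false: == is case-sensitive
	fmt.Println(shouldSkip(guru, me)) // true: EqualFold ignores case
}
```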

This is the history of the first bug I randomly discovered, and had to fix twice, for o11y.guru.
