Bandit: An A/B Testing Alternative for Rails
Posted 12 Nov 2011 to rails, vanity, statistics, testing
In a typical A/B test, two alternatives are compared to see which produces more “conversions” (that is, desired results). For instance, if you have a website with a big “Sign Up” button that you want visitors to click, you may wish to try different background colors. Under typical A/B testing guidelines, you would pick a number (say, N) of users for a test and show half of them one color and half of them another color. After users are shown the button, you record the number of clicks that result from viewing each color. Once N users have viewed one of the two alternatives, a statistical test (generally categorical, like a Chi-Square Test or a G-Test) is run to determine whether the number of clicks (aka, “conversions”) for one color was higher than the number of clicks for the other color. This test determines whether the difference you observed was likely due simply to chance or more likely due to an actual difference in the rate of conversion.
This method of testing is popular, but is fraught with issues (practical and statistical). The bandit gem provides an implementation of an alternative method of testing for Rails that solves many of these issues.
Issues with A/B Testing
There are a number of issues with A/B testing (some of which have been described in more detail here):
- You can’t try anything too crazy without having to worry about half of your users not converting. For instance, you may want to try a horrendous color for your “Buy Now” button but are too afraid about potentially harming sales if your users hate it. In this case, the risk of a big change may outweigh the possible benefit if your users like it.
- A/B testing provides a way of only testing two alternatives at once. Pick two, wait, pick two more, wait - this is not the easiest workflow if you want to test 50 options.
- With A/B Testing, you need to have a fixed sample size to make the test valid (otherwise, you run the risk of repeated significance testing errors, as described in more detail here).
- Due to the fixed sample size requirement, you may have to wait a while before you get any results from your test (especially if the expected improvement is marginal, in which case your sample size would need to be larger). This problem can be compounded if you don’t get much traffic.
- Designers and developers generally don’t want to (and shouldn’t have to) understand statistical concepts like power, p-values, or confidence when creating and evaluating tests.
- There are no good answers for what you should do when A performs just as well as B. Was the sample size just too small (implying you should try again with a large sample)? Go with A? Go with B? Does it matter? The reality is it may matter - but you won’t know.
The Bandit Method
The ultimate goal of A/B testing is to increase conversions. The problem can be described in terms that differ greatly from the multitude of questions A/B testing brings (i.e., “Is A better than B?” followed by “Is B better than C?” followed by “Is C better than D?” ad infinitum). Instead, imagine you have a multitude of possible alternatives, and each time a user requests a page you want to make a decent choice between alternatives you know perform well and alternatives you haven’t tried very often. With each page load, pick the best alternative most of the time and an alternative that hasn’t been displayed much some of the time. After each display, monitor the conversions and update what you consider the “better” alternatives to be. This is the basic method of a solution to what is called the multi-armed bandit problem.
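As a rough illustration of that loop, here is a minimal epsilon-greedy style sketch. It is not the bandit gem's API: the class name `EpsilonGreedy`, its methods, and the `epsilon` parameter are all made up for this example, and exploration here simply picks a random alternative rather than specifically favoring the least-displayed one.

```ruby
# Hypothetical sketch of an epsilon-greedy chooser (NOT the bandit gem's API).
class EpsilonGreedy
  # epsilon is the assumed fraction of page loads spent exploring.
  def initialize(alternatives, epsilon: 0.1, rng: Random.new)
    @epsilon = epsilon
    @rng = rng
    @stats = {}
    alternatives.each { |alt| add(alt) }
  end

  # Alternatives can be added or removed at any time.
  def add(alt)
    @stats[alt] ||= { shown: 0, converted: 0 }
  end

  def remove(alt)
    @stats.delete(alt)
  end

  # Observed conversion rate; 0.0 until the alternative has been shown.
  def rate(alt)
    s = @stats.fetch(alt)
    s[:shown].zero? ? 0.0 : s[:converted].to_f / s[:shown]
  end

  # Pick an alternative for this page load and count the display.
  def choose
    alts = @stats.keys
    alt = if @rng.rand < @epsilon
            alts[@rng.rand(alts.size)]      # explore: any alternative
          else
            alts.max_by { |a| rate(a) }     # exploit: best rate so far
          end
    @stats[alt][:shown] += 1
    alt
  end

  # Record a conversion for an alternative that was displayed.
  def convert!(alt)
    @stats[alt][:converted] += 1 if @stats.key?(alt)
  end
end

bandit = EpsilonGreedy.new(%w[red green blue])
color = bandit.choose        # show this color to the visitor
bandit.convert!(color)       # call only if the visitor clicked
```

Because the rates are recomputed on every call, there is no separate "test" phase: a poorly performing alternative is still shown occasionally through exploration and can recover if its rate improves.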
With a bandit solution, there is no concept of a “test”. At no point does the system announce a winner and a loser. Alternatives can be added or removed at any time. The better performing alternatives will be displayed more often, and the worst alternatives will rarely be displayed. At any point, if one of the poorly performing alternatives begins to perform better it will be shown more often. This provides solutions to all of the problems listed above:
- Go ahead and try something crazy. If it performs poorly, it won’t be shown very often.
- Pick as many alternatives as you’d like and add them.
- There’s no “test”, and no minimal sample size needed before optimization can start.
- Information about conversions is utilized as users convert or do not convert. There is no pause before results can be immediately used in selecting the next alternative to display to a visitor.
- Designers and developers can add alternatives or remove them at any time. The system will adjust immediately. If an alternative seems to be consistently performing poorly, it can be removed at any time. Alternatively, it can just be left forever. The best option will always be displayed the most often. There are no complicated decisions that have to be made up front or requirements that designers or developers know anything about proper statistical hypothesis testing.
- If one alternative performs the same as another, they will both be displayed with the same regularity. There would be no need to choose one over the other or remove either of them.
Bandit Gem
While there are a few A/B testing libraries for Rails out there, the preeminent one (Vanity) has statistical issues and is unreliable in a production environment. Bandit was created to test the feasibility of a multi-armed bandit based alternative to A/B testing and to solve the issues with the Rails based A/B testing gems. It is still in development, though - use at your own risk.
Resources
- bandit gem
- http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit
- http://en.wikipedia.org/wiki/Multi-armed_bandit
- http://www.evanmiller.org/how-not-to-run-an-ab-test.html
Campfirer.com - A Jabber to Campfirenow.com Gateway
Posted 13 Sep 2011 to campfirer project, campfirenow, jabber
Campfire is a web-based group chat service that is directed at businesses. Rather than using a standard protocol, the folk at 37 Signals decided to invent their own. This has led to the necessary creation of a number of custom clients to interact with the API using their unique, one-of-a-kind protocol (for those who don’t want to have to chat in a browser window).
I heart Jabber (XMPP). There’s a good reason Google and Facebook chose that protocol to power their chat. I have no idea why 37 Signals didn’t use Jabber too. Maybe they’re mavericks.
Naturally, I’d like to be able to use one of many Jabber clients to access Campfire, along with all of my other Jabber based accounts. To do this, I wrote a Jabber Component. It provides Multi-User Chat (MUC) support for Jabber servers that utilizes Campfire’s API, so you can “join” a room, “talk”, and see other posts by other users. It’s called Campfirer (campfire + jabber = campfirer).
I’ve set up a running instance of the service at campfirer.com. A description of how to download / set up the code for your own Jabber server can be found there.
The code and more info can be found on the github project page. Pull requests welcome.
Incr/Decr Counters Using memcache-client
Posted 13 Aug 2011 to memcache, ruby
Based on some recent changes in the memcached library, the incr method in the memcache-client gem no longer works as expected. For instance, the following:
require 'rubygems'
require 'memcache-client'
m = MemCache.new 'localhost'
m.set('counter', 0)
m.incr('counter')
will result in the following error:
MemCache::MemCacheError: cannot increment or decrement non-numeric value
    from /usr/lib/ruby/gems/1.8/gems/memcache-client-1.8.5/lib/memcache.rb:926:in `raise_on_error_response!'
    from /usr/lib/ruby/gems/1.8/gems/memcache-client-1.8.5/lib/memcache.rb:831:in `cache_incr'
    from /usr/lib/ruby/gems/1.8/gems/memcache-client-1.8.5/lib/memcache.rb:865:in `call'
    from /usr/lib/ruby/gems/1.8/gems/memcache-client-1.8.5/lib/memcache.rb:865:in `with_socket_management'
    from /usr/lib/ruby/gems/1.8/gems/memcache-client-1.8.5/lib/memcache.rb:827:in `cache_incr'
    from /usr/lib/ruby/gems/1.8/gems/memcache-client-1.8.5/lib/memcache.rb:342:in `incr'
    from /usr/lib/ruby/gems/1.8/gems/memcache-client-1.8.5/lib/memcache.rb:886:in `with_server'
    from /usr/lib/ruby/gems/1.8/gems/memcache-client-1.8.5/lib/memcache.rb:341:in `incr'
    from (irb):5
    from /usr/local/lib/site_ruby/1.8/rubygems.rb:123
This is caused by the memcache-client gem marshalling everything before it’s stored in memcache. Memcache needs the actual, unmarshalled, integer value to be stored. The code above should be changed to:
require 'rubygems'
require 'memcache-client'
m = MemCache.new 'localhost'
# set the raw value initially by passing in a fourth argument of true
m.set('counter', 0, 0, true)
# increment the raw integer value
m.incr('counter')
# you can now decrement the raw integer value as well
m.decr('counter')
The fix is simple, but not noted anywhere (that I can find) in the memcache-client documentation. Besides a few mentions on Google Groups sans solution, I couldn’t find any references to this issue elsewhere on the world wide intertubes. I find the atomic incr/decr functionality in memcache to be quite useful; I hope this can help alleviate any issues others might be having with this problem.
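To see why the marshalled value is rejected, it helps to look at the bytes that actually get stored. The sketch below uses only Ruby's built-in Marshal module and needs no memcached server. The helper `memcached_counter?` is a made-up name for this example; it only approximates the server-side check, assuming that incr/decr operate on values stored as plain decimal digits.

```ruby
# What memcached would see for the counter in each case.
raw        = 0.to_s            # "0": plain decimal text
marshalled = Marshal.dump(0)   # binary Marshal format, not decimal text

# Hypothetical helper approximating the server-side requirement:
# the stored value must consist solely of decimal digits.
def memcached_counter?(stored)
  stored.match?(/\A\d+\z/)
end

puts marshalled.bytes.inspect        # => [4, 8, 105, 0]
puts memcached_counter?(raw)         # => true
puts memcached_counter?(marshalled)  # => false
```

Passing true as the fourth argument to set skips the marshalling step, so the stored value is the first form rather than the second.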
HBaseRB: A Ruby HBase Library
Posted 01 Aug 2011 to hbase, ruby, hadoop
I recently upgraded the HBaseRB library I wrote a few months ago. HBaseRB provides a means for Ruby to interact with HBase using a Thrift interface. Most other libraries (like hbase-ruby, for instance) use the REST interface provided by HBase. This may work in many situations, but for our applications at LivingSocial we wanted the benefit of using a binary protocol without the overhead of XML parsing.
Some Google searching showed that HBaseRB is a bit hard to find, so I thought I’d mention it here.
Changing Namenode Hostname Breaks Hive
Posted 18 Jul 2011 to hive, hadoop
Hive is a great piece of software - but there are still some major issues. I ran into one recently when I changed the hostname of the Hadoop namenode. I couldn’t figure out why Hive was using the old hostname, even after changing all of the config files in $HADOOP_HOME to use the new one and testing other map/red jobs.
Apparently, Hive stores all partition information with full references to the location (for instance, hdfs://host:9000/user/hive/warehouse/some/path). This makes lookups faster in the metastore, but makes it impossible to easily change the hostname of your namenode.
The best way I could find to do this was the following:
- mysqldump the metadata database to a local file
- Edit the dump and do a global search and replace on any instances of the old hostname
- Reimport the dump
If the location was saved in a separate table (w/ a one to many relationship between partitions and hosts / locations) it would make this process quite a bit easier.
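The search-and-replace step can be scripted instead of done by hand in an editor. Below is a minimal Ruby sketch; the function name `rewrite_namenode`, the hostnames, and the file names are all hypothetical, and it assumes the dump stores locations as `hdfs://host:port/...` URIs like the example above. The mysqldump and reimport steps still happen outside the script.

```ruby
# Hypothetical helper: rewrite hdfs:// locations in a metastore dump so
# they point at the new namenode host. Only the host part of hdfs:// URIs
# is touched; other occurrences of the old name are left alone.
def rewrite_namenode(dump_sql, old_host, new_host)
  pattern = %r{hdfs://#{Regexp.escape(old_host)}(?=[:/])}
  dump_sql.gsub(pattern) { "hdfs://#{new_host}" }
end

# Example usage (file names and hostnames are placeholders):
#   sql = File.read('metastore.sql')
#   File.write('metastore.fixed.sql', rewrite_namenode(sql, 'old-nn', 'new-nn'))
```

The lookahead keeps a host such as `old-nn-backup` from being rewritten when only `old-nn` was renamed. Keep the original dump around until the reimported metastore has been verified.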
Good DC Coffee Shops
Posted 19 Jun 2011 to dc, coffee
I moved to DC about four months ago, and since then, my weekends have been frequently occupied with one quest: find the best DC coffee shop. When I was in Charleston, the answer used to be easy (Kudu Coffee, if you’re wondering). In Baltimore, it was even easier (Red Emma’s Bookstore Coffeehouse). In The District, however, I’ve had a much harder time. There are many meretricious options to choose from, and few are real winners. There are quite a few convenience stores, bars, and restaurants that call themselves a “cafe” and really shouldn’t.
What’s a winner? Admittedly, it has a lot to do with a place that I can break out a laptop, drink some coffee, do some work, and be just distracted enough by nearby conversation that I don’t mind the fact that I’m working. Here are the metrics I take into consideration:
- outdoor seating
- free wifi
- ample seating
- power outlets
- eavesdropping payoffs (audible interesting conversations, often philosophical in nature)
- quality music or live performances
- good collaborative space (big tables, etc)
- proximity to public transportation
So here are some top performers on this list, with a final entry of what I believe to be the winner.
Ebenezer’s Coffeehouse
This is the first place I went to in DC. It’s right next to Union Station, so it’s quite accessible. I didn’t realize it at first, but this establishment is owned and operated by a Christian church. This, naturally, leads to a rather homogeneous clientele makeup, which often consists of small Bible study groups and prayer groups. Seating is generally available, the coffee is alright, and there is free timed wifi (with a purchase) - but don’t expect an interesting space, interesting characters, or any stimulating conversation.
Tryst
This place is more of a restaurant / cafe. It’s generally completely packed on the weekends with hungover college students looking for food and coffee. This is not a good work place, even if you decide to wait for a seat.
Big Bear Cafe
A great location with plenty of hits on my list of important qualities. There’s outdoor seating, free wifi, good collaborative space, great music, and more. The disadvantages are major, though - seating is impossible on the weekends and there’s no nearby metro stop.
Chinatown Coffee Co.
Excellent coffee can be found here. There’s generally enough seating, free wifi, and it’s right next to the Chinatown metro stop. You’re not likely to overhear any juicy conversations, though; most people keep to themselves at tables meant for one or two.
Filter Coffeehouse and Espresso Bar
Great coffee here, too, and it’s a short walk from the Dupont metro stop. There’s outdoor seating as well, though that and all of the few seats indoors are generally taken. With better seating options or fewer patrons, this place would be a real winner.
MidCity Caffe
The winner at this point is MidCity. They always have enough seating (though all seats are really close to each other, so you’ll probably make a friend), free wifi, great coffee, and excellent music. I’ve even seen a live performance or two there. It’s not too far from the U St metro stop. Another great thing about this place is the owners have made a special effort to put power strips everywhere.
There are plenty of mediocre places I’ve left off (Jolt n Bolt Coffee & Tea House, Windows Cafe & Market, and many more not worth mentioning), so this short list is by no means comprehensive. I’ll add to it if I find any other locations worth a plug.
Asynchronous MySQL in Python: Twistar 1.0
Posted 18 Jun 2011 to python, twisted, twistar project
After a few more updates and contributions, I’ve finally decided to release version 1.0 of Twistar. The recent work and contributions have brought it in line with what I consider to be a feature rich enough library ready for a version one release.
Description from the website:
Twistar is a Python implementation of the active record pattern (also known as an object-relational mapper or ORM) that uses the Twisted framework’s RDBMS support to provide a non-blocking interface to relational databases.
Twistar currently features:
- A thoroughly asynchronous API
- Object validations (and support for the easy creation of new validation methods)
- Support for callbacks before saving / creating / updating / deleting
- Support for object relational models that can be queried asynchronously
- A simple interface to DBAPI objects
- A framework to support any relational database that has a module that implements the Python Database API Specification v2.0 (MySQL, PostgreSQL, and SQLite are all supported now)
- Support for object polymorphism
- Unit tests
For more information, check out the website or the github page.
Fun with Ruby Symbol Expressions
Posted 24 May 2011 to ruby, metaprogramming
Groupon released an interesting extension to the Symbol#to_proc method named symbol_expressions over a year ago (I didn’t notice it until recently). It allows you to compose procs based on combinations of existing methods. For instance, to split and then join strings:
["foo", "bar"].map(&:split['']+:join['_'])
# => ["f_o_o", "b_a_r"]
I thought this was nifty, but the syntax is a bit odd (brackets are not generally used as argument list boundaries). Additionally, this sort of Proc composition is something a Proc should know how to create, but it doesn’t make sense to have a Symbol keeping track of a list of other Symbols that have been “added” to it (especially via an internal array class). It just seems like a bit of a hack to have Symbols acting as lists of other Symbols.
Based on these ideas, I reduced the symbol_expressions lib to the following lines:
With this little bit of code (which simply prefixes argument lists with a | symbol), you can now do stuff like this:
# composition using Proc (rather than Symbols that have lists of Symbols in them)
splitjoin = Proc.from_sym(:split | '', :join | " ", :upcase)
splitjoin.call "what"
# => "W H A T"
["foo", "bar"].map(&splitjoin)
# => ["F O O", "B A R"]
["foo", "bar"].map(&:split | '')
# => [["f", "o", "o"], ["b", "a", "r"]]
Fun stuff. Ruby consistently amazes me with its expressiveness.
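For readers who want to try the examples above, here is one way the supporting code could look. This is a guess written to match the behavior shown, not the original listing: the `BoundSymbol` class is a name invented for this sketch, and only `Symbol#|` and `Proc.from_sym` come from the examples.

```ruby
# Hypothetical helper: a method name plus the arguments to call it with.
class BoundSymbol
  def initialize(name, args)
    @name = name
    @args = args
  end

  # Chaining | appends another argument, e.g. :gsub | 'a' | 'b'.
  def |(arg)
    BoundSymbol.new(@name, @args + [arg])
  end

  # Lets a BoundSymbol be passed with & just like a plain Symbol.
  def to_proc
    name = @name
    args = @args
    proc { |receiver| receiver.public_send(name, *args) }
  end
end

class Symbol
  # :split | '' pairs the method name with its first argument.
  def |(arg)
    BoundSymbol.new(self, [arg])
  end
end

class Proc
  # Compose the steps left to right into a single Proc.
  def self.from_sym(*steps)
    procs = steps.map(&:to_proc)
    proc { |value| procs.inject(value) { |acc, step| step.call(acc) } }
  end
end

splitjoin = Proc.from_sym(:split | '', :join | ' ', :upcase)
puts splitjoin.call('what')
```

The composition lives in Proc, and a Symbol only ever produces a small value object, which keeps Symbols from having to act as lists of other Symbols.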
HiveSwarm: Additional User Defined Functions for Hive
Posted 09 Apr 2011 to hive, hadoop
There are a number of user defined functions that would be quite useful in Hive but that have not been created and added to the library. Hive does provide the ability to define custom functions, but, as I’ve noted before, the documentation is sparse and sometimes simply wrong. For instance, the instructions for creating a user defined table generating function (found here) incorrectly show the close method calling forward, which will cause an error when you try to run the function even in Hive 0.5.0.
In an effort to both collect useful functions that we are writing at LivingSocial as well as to make the compiling process easier, we’ve created a new open source project on Github called HiveSwarm. There are only a few functions there now, but more and more functions will be added over time.
| server | page_load |
| 10.0.0.1 | 2011-04-01 10:01:01 |
| 10.0.0.1 | 2011-04-01 10:01:05 |
| 10.0.0.1 | 2011-04-01 10:03:00 |
| 10.0.0.2 | 2011-04-01 10:01:02 |
| 10.0.0.2 | 2011-04-01 10:01:05 |
One of the most useful new functions is called intervals. The function will generate a table with the intervals between values in an input table. For instance, let’s say you have a table that has one column for server IP addresses and another that has dates and times for page loads (shown in the first table above). Imagine you wish to know the intervals between page loads per server.
After compiling HiveSwarm, you can load the jar and add the function:
add jar /path/to/HiveSwarm.jar;
create temporary function intervals as 'com.livingsocial.hive.udtf.Intervals';
Then, to select the intervals, just specify the grouping column and the column you wish to get intervals from:
select intervals(server, page_load) as (server, intervals) from server_page_loads;
This will produce the results shown in the second table (with intervals in seconds).
| server | intervals |
| 10.0.0.1 | 4.0 |
| 10.0.0.1 | 115.0 |
| 10.0.0.2 | 3.0 |
The column to pull intervals from can be either numeric or a string type. If it is a string, then it will be converted into a timestamp (so the resulting difference will be calculated in seconds). All numerical types (including timestamps from strings) will be converted into floats.
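To make the semantics concrete, the same computation can be sketched in a few lines of Ruby. This only illustrates what the output means and is not how the UDTF is implemented; the method names are made up, and it assumes rows arrive already ordered within each group and that timestamp strings can be read as UTC.

```ruby
require 'time'

# Convert a value to a float; strings are parsed as timestamps (seconds).
def to_number(value)
  value.is_a?(String) ? Time.parse("#{value} UTC").to_f : value.to_f
end

# rows: [group_key, value] pairs, assumed already ordered within each group.
# Returns [group_key, difference] for each pair of consecutive values.
def intervals(rows)
  rows.group_by(&:first).flat_map do |key, group|
    values = group.map { |_, value| to_number(value) }
    values.each_cons(2).map { |earlier, later| [key, later - earlier] }
  end
end

page_loads = [
  ['10.0.0.1', '2011-04-01 10:01:01'],
  ['10.0.0.1', '2011-04-01 10:01:05'],
  ['10.0.0.1', '2011-04-01 10:03:00'],
  ['10.0.0.2', '2011-04-01 10:01:02'],
  ['10.0.0.2', '2011-04-01 10:01:05']
]

intervals(page_loads).each { |server, gap| puts "#{server}\t#{gap}" }
```

Each group of n rows yields n - 1 intervals, which is why five input rows produce three output rows.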
Pull requests are welcomed if you have a function you’d like to see added.
Additional information can be found on the github page.
Statistical Analysis and A/B Testing (Correctly)
Posted 17 Mar 2011 to vanity, statistics, testing
We’ve been playing with Vanity recently at LivingSocial and have found it to be generally useful. During a recent test, however, we saw the option listed as the “best choice” change with almost every dashboard page load. This should not generally be happening if the test for significance is implemented correctly. The first thing we did was add some numbers to the dashboard showing total views for each option and the number of track events for our conversion metric. That’s when I noticed a problem: what Vanity was claiming as a significant difference (at a 95% confidence level) wasn’t actually significant (based on a G-test). After some digging in the source, I found the following issues.
Issue One: Wrong Number of Tails
The first issue relates to the way in which the two-proportion Z-test is implemented. Vanity links to this instructional post on their result interpretation page, and I assume it was used as the basis for Vanity’s implementation. While I think there are a few things wrong with the post (see the next issue), I believe one of the biggest issues in Vanity is the improper use of a one-tailed test. The instructional post on stats and ab-testing describes the correct use of the one-tailed test in the case where you have identified a “control” (presumably the original page) and an “experiment” page and want to only test whether the new page performs better than the old one. One-tailed tests are used in this sort of case, when one wants to know if a statistic from one defined group is greater than that of another defined group (say, case over control proportion).
Vanity, however, picks the second best performing group and then uses it as the “control” group in a one-tailed test to see whether the best group’s proportion is greater than the second best. This “control” group may be a different group on each dashboard page load. The result is a test to see whether the proportions are equal or not equal, as opposed to a test to see whether or not one specific proportion is greater than another specific proportion. Essentially, a one-tailed test is being used for a two-tailed hypothesis.
Why does this matter? Well, in our case, it mattered quite a bit. Vanity was calling a difference significant when it shouldn’t have been. The counts are in the following table.
| Group | Viewed | Converted |
| A | 409199 | 22399 |
| B | 409351 | 22779 |
Vanity’s conclusion was:
With 95% probability this result is statistically significant.
For a one-tailed test, this conclusion is correct. For a two-tailed test, however, the confidence level is only 92.5% and is not significant. To see how far off the result is, the results of my G-test produced a p-value of 0.0721, which is not significant. Based on Vanity’s conclusion, though, we might have assumed a difference and then put effort into making changes that would not have actually mattered.
Ultimately, what you generally want to know in A/B testing isn’t just what the post Vanity links to claims, i.e., “Does A perform better than B?” What you actually want to know is “Does A perform better or worse than B?” These questions might seem equivalent, but they have very different implications in terms of choosing a statistical hypothesis and resulting test. The one-tailed test chosen by Vanity is only applicable when you want to specifically test whether or not some well-defined A performs better than a well-defined B. Not only is that not what an A/B tester probably wants to know (rather, they want to know “better or worse”), but the test itself is implemented incorrectly because the A vs B groups can flip back and forth depending on which is currently performing better at the time the dashboard is loaded.
These combined problems result in false positives in terms of identifying significant differences between proportions and can lead to wasted development time in terms of making unnecessary changes. Additionally, because the rate of false positives is high due to the incorrect implementation of a one-tailed test, Vanity will vacillate between calling an option significantly different and not.
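The gap between the one-tailed and two-tailed readings can be checked directly from the counts in the table above. The following is a minimal sketch using a pooled two-proportion Z-test and Ruby's built-in Math.erfc; the method names are made up for this example, and because it pools the proportion it is not a reproduction of Vanity's own calculation.

```ruby
# Pooled two-proportion z statistic for B versus A.
def two_proportion_z(conv_a, n_a, conv_b, n_b)
  p_a = conv_a.to_f / n_a
  p_b = conv_b.to_f / n_b
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  std_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  (p_b - p_a) / std_error
end

# Probability of a standard normal value above z (upper tail).
def upper_tail(z)
  0.5 * Math.erfc(z / Math.sqrt(2))
end

z = two_proportion_z(22_399, 409_199, 22_779, 409_351)
one_tailed = upper_tail(z.abs)   # "is the best better than the other?"
two_tailed = 2 * one_tailed      # "are the two different at all?"

puts format('z = %.3f', z)
puts format('one-tailed p = %.4f', one_tailed)
puts format('two-tailed p = %.4f', two_tailed)
```

With these counts the one-tailed p-value falls below 0.05 while the two-tailed p-value stays above it, which is exactly the disagreement described above.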
Issue Two: Wrong Test Application
The second issue is related to the Z-test itself. The implementation used in Vanity does not pool the sample proportion, which is necessary to produce the best estimate for sample variance. I’ll leave out an explanation as to why pooling the proportion produces a more accurate result (it’s rather involved), but I will say that it is trivial to modify existing code to use a pooled method. For those interested in learning more about the reasoning behind pooling proportions, more information can be found here.
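As a sketch of what the change amounts to, the two standard error formulas are shown side by side below. The method names are invented for this example and the code is not taken from Vanity. How much the two estimates differ depends on the data; when the sample sizes and proportions are close, as in the table above, they are nearly identical.

```ruby
# Unpooled: each group's variance is estimated from its own proportion.
def unpooled_std_error(conv_a, n_a, conv_b, n_b)
  p_a = conv_a.to_f / n_a
  p_b = conv_b.to_f / n_b
  Math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
end

# Pooled: under the null hypothesis both groups share one proportion,
# so a single combined estimate is used for both variance terms.
def pooled_std_error(conv_a, n_a, conv_b, n_b)
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
end

counts = [22_399, 409_199, 22_779, 409_351]
puts unpooled_std_error(*counts)
puts pooled_std_error(*counts)
```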
Fix
To fix the above issues, I’m going to fork Vanity and switch to a completely different test. Since the result of an A/B test is categorical data, it’s perfect for a Pearson’s chi-square test of independence or better yet a G-test. Such a test will show the amount of difference, if any, between any number of testing variations. In addition, we will be adding more information about the extent of the difference, with a recommendation noting whether or not a user should continue running a test.
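For the two-variation case, a G-test is small enough to sketch in plain Ruby. This is an illustration rather than the code that will go into the fork: the method name is made up, and it handles only a 2x2 table (one degree of freedom), where the chi-square tail probability can be written with Math.erfc. More variations would need a general chi-square distribution function.

```ruby
# G-test of independence for a 2x2 table of converted / not converted.
# Returns [g_statistic, p_value]; valid only for two groups (df = 1).
def g_test_2x2(conv_a, n_a, conv_b, n_b)
  observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
  row_totals = observed.map { |row| row.inject(:+) }
  col_totals = observed.transpose.map { |col| col.inject(:+) }
  total = (n_a + n_b).to_f

  g = 0.0
  observed.each_with_index do |row, i|
    row.each_with_index do |obs, j|
      next if obs.zero?                  # a zero cell contributes nothing
      expected = row_totals[i] * col_totals[j] / total
      g += 2.0 * obs * Math.log(obs / expected)
    end
  end

  g = [g, 0.0].max                       # guard against rounding below zero
  p_value = Math.erfc(Math.sqrt(g / 2.0)) # chi-square upper tail, df = 1
  [g, p_value]
end

g, p_value = g_test_2x2(22_399, 409_199, 22_779, 409_351)
puts format('G = %.3f, p = %.4f', g, p_value)
```

Because the test asks only whether the groups differ, the result does not depend on which group is labeled A or B, so it cannot flip between dashboard loads the way the one-tailed comparison does.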
Our modifications will be available on github.