# SiteCatalyst de-duplication and American football

One of the most difficult concepts to explain is also behind one of the questions that my colleagues and I hear most often from SiteCatalyst users. There are several variations of this question, each with roughly the same answer:

• Why doesn’t the sum total of visits from each line item in the Pages report add up to the visit total at the bottom of the report?
• Why doesn’t the sum total of orders from each line item in the Products report add up to the order total at the bottom of the report?
• Why doesn’t the sum total of [any success metric] from each line item in my merchandising eVar report add up to the total at the bottom of the report?

When users ask this question (in any of its forms), I typically explain that the report in question involves a one-to-many relationship between the metric being viewed and the line items in the report. But this explanation can be difficult to grasp. I have been trying to come up with an analogy to help explain these phenomena, and I think I’ve got one. I’m hoping it will clarify this behavior for you, or will help you better explain it to the users at your organization.

There are few things that I enjoy more than relaxing on my couch on an autumn weekend (in between home improvement tasks requested/mandated by my wife, of course) and watching football from all across America. Being the sports geek that I am, I often have my laptop by my side so I can check scores and stats from games that I’m not watching. I mention this because, as the title of this blog post suggests, there’s an apparent statistical anomaly in football that parallels this behavior in SiteCatalyst.

When a quarterback throws for a touchdown, someone has to catch the pass—usually a wide receiver. When this happens, the quarterback’s numbers reflect that he threw for a touchdown. At the same time, the wide receiver’s numbers show one touchdown:

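| Passing | C/ATT | YDS | AVG | TD | INT |
|---|---|---|---|---|---|
| Quarterback | 6/9 | 83 | 9.2 | 1 | 0 |

| Receiving | REC | YDS | AVG | TD | LG |
|---|---|---|---|---|---|
| R. Moss | 2 | 27 | 13.5 | 1 | 14 |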

If you didn’t know better, you might see a touchdown tallied on both the quarterback’s stat sheet and the wide receiver’s, conclude that the team scored two separate touchdowns, and credit it with 14 points. Of course, this isn’t actually the case. There is simply a one-to-many relationship between touchdowns and the players involved: the statistics cannot show both players’ part in the touchdown without showing a touchdown next to each of them.

Hopefully I haven’t confused you. (If you’re a hockey fan, there’s a similar analogy in there, where two players can be credited with an assist on a single goal. And if you’re not a sports fan at all, hopefully the rest of this post will still make sense!) Consider the order discrepancy described above in the list of questions. Here’s what a typical order might look like.
`s.events="purchase"; s.products=";Macbook 13.3-inch;1;1199.99,;Adobe Photoshop CS4;1;799.99,;Kingston DataTraveler 16GB Flash Drive;1;39.99"; s.purchaseID="220236197";`

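Based on this order, in the Products report you’d see something like this:

| Products | Orders |
|---|---|
| Macbook 13.3-inch | 1 |
| Adobe Photoshop CS4 | 1 |
| Kingston DataTraveler 16GB Flash Drive | 1 |
| Total | 1 |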

The line items add up to three orders, but there was really only one order—you saw it above—so how should SiteCatalyst handle this?

Show 0.3 orders for each product, so that the line items add up to one? Well, that wouldn’t be quite right, because then you would see a bunch of strange numbers that wouldn’t give you a real sense of how popular an item is; its popularity would be determined in part by how many products belonged to the order (e.g., an order with five products would assign 0.2 orders to each product, but an order with 10 products would only assign 0.1 orders to each product).

Show the summed total of the orders from each line item at the bottom of the report? That might lead users to think that your site had a lot more orders than it really did.

Instead, SiteCatalyst shows the site-wide total regardless of the sum of the line items.
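To make the three options concrete, here is a minimal sketch in JavaScript that works through the sample order from earlier in this post. It is only an illustration of the arithmetic, not how SiteCatalyst computes its reports; the `parseProducts` helper and the `orders` array are names I made up for this example.

```javascript
// Hypothetical helper: pull the product names out of an s.products string.
// Each comma-separated entry has the form "category;product;quantity;price".
function parseProducts(productsString) {
  return productsString
    .split(",")
    .map((entry) => entry.split(";")[1].trim());
}

// One purchase, modeled on the sample order shown earlier in this post.
const orders = [
  {
    purchaseID: "220236197",
    products:
      ";Macbook 13.3-inch;1;1199.99,;Adobe Photoshop CS4;1;799.99,;Kingston DataTraveler 16GB Flash Drive;1;39.99",
  },
];

// Line items: every product in an order is credited with that whole order.
const ordersByProduct = new Map();
for (const order of orders) {
  for (const product of new Set(parseProducts(order.products))) {
    ordersByProduct.set(product, (ordersByProduct.get(product) || 0) + 1);
  }
}

// Rejected option: split the order evenly across its products.
const fractionalShare = 1 / parseProducts(orders[0].products).length;

// Rejected option: show the sum of the line items as the total.
const summedLineItems = [...ordersByProduct.values()].reduce(
  (sum, count) => sum + count,
  0
);

// What the report shows: the de-duplicated count of distinct orders.
const reportTotal = new Set(orders.map((order) => order.purchaseID)).size;

console.log(Object.fromEntries(ordersByProduct));
console.log({ fractionalShare, summedLineItems, reportTotal });
```

Each product is credited with one order, the line items sum to three, and the distinct-order total is one, which is the same gap you see between the line items and the total at the bottom of the report.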

Saying that you need to add up the metric totals for various line items for any reason is really just a way of saying that you need classifications around your report values so that they’re grouped appropriately. For example, why would you add up the orders for all product names containing the word “shoes” other than to get a sense of the total orders for shoe-related products (taking into account that some orders may involve multiple such products)? This can be accomplished using SAINT classifications and Omniture Discover.

After all, it might be hard to make sense of exactly how many points a football team scored just by looking at the individual players’ statistics, but the handy “scoring summary” that you’ll see in newspapers and on web sites will correctly pair quarterbacks with receivers to give you a better sense of how much scoring really went on:

| Team | Time | Scoring play | NE | Opponent |
|---|---|---|---|---|
| NE | 5:07 | Randy Moss 14 Yd Pass From Tom Brady (Stephen Gostkowski Kick) | 7 | 0 |

Now that clarifies what happened! Randy Moss, a receiver, caught a 14-yard pass from the quarterback, Tom Brady. And we can see that the New England Patriots scored once, not twice. We’ve de-duplicated the number of touchdowns that the team has scored.

To de-duplicate your data and show the exact total number of [metric] that occurred across multiple line items in a report, create a classification category to meet your needs and apply the necessary classifications to group line items in the parent report however you need them.

When these classifications sync into Discover, you’ll be able to go to the associated report, add the Orders metric, and see de-duplicated orders within that category as a line item. To continue with the example above (products containing the word “shoes”), if an order contained five product IDs that involve the word “shoes,” and you add a category classification and label each of these product IDs as belonging to the “Footwear” category, then you will correctly see one order for the “Footwear” category in Discover when you run the report that corresponds to the classification you’ve set up (e.g., Product Category)—even though this order involved five products.
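The “Footwear” example can be sketched the same way. This is a rough illustration under my own assumptions, not the actual SAINT or Discover logic: the product names, the `productCategory` lookup, and the `ordersByCategory` function are all hypothetical. The idea is simply that orders are counted once per category rather than once per product.

```javascript
// Hypothetical classification lookup: product ID -> category.
// In practice this mapping would come from your classification upload.
const productCategory = new Map([
  ["Running Shoes", "Footwear"],
  ["Tennis Shoes", "Footwear"],
  ["Dress Shoes", "Footwear"],
  ["Hiking Shoes", "Footwear"],
  ["Kids Shoes", "Footwear"],
]);

// Hypothetical order containing five shoe products.
const sampleOrders = [
  {
    purchaseID: "A1",
    products: [
      "Running Shoes",
      "Tennis Shoes",
      "Dress Shoes",
      "Hiking Shoes",
      "Kids Shoes",
    ],
  },
];

// Count distinct orders per category by collecting purchase IDs in a Set,
// so an order with several products in one category is counted only once.
function ordersByCategory(orderList, lookup) {
  const idsByCategory = new Map();
  for (const order of orderList) {
    for (const product of order.products) {
      const category = lookup.get(product) ?? "Unclassified";
      if (!idsByCategory.has(category)) {
        idsByCategory.set(category, new Set());
      }
      idsByCategory.get(category).add(order.purchaseID);
    }
  }
  return new Map(
    [...idsByCategory].map(([category, ids]) => [category, ids.size])
  );
}

// Summing per-product line items would give one order per product.
const lineItemSum = sampleOrders.reduce(
  (sum, order) => sum + order.products.length,
  0
);

const categoryOrders = ordersByCategory(sampleOrders, productCategory);
console.log({ lineItemSum, footwearOrders: categoryOrders.get("Footwear") });
```

Adding up the five product line items would suggest five orders, while grouping by category reports a single “Footwear” order.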

And there you have it. Hopefully this clarifies (at least a little bit) what can be a confusing situation for many users. And unfortunately for me, I’ll never again be able to look at football statistics without thinking about de-duplication.

As always, please feel free to follow me at OmnitureCare on Twitter and/or FriendFeed. I’m also available by e-mail at omniturecare@omniture.com and would love to hear from you via any of these channels!