Crowdsourcing for Data Enrichment

Data problems are age-old, whether they involve inaccurate data, incomplete data, categorization issues, duplicates, or data in need of enrichment.

IT executives consistently agree that data quality and data consistency are among the biggest roadblocks to getting full value from their data. In today’s information-driven businesses, this issue is more critical than ever.

Technology, however, has done little to help us solve the problem. In fact, it has accelerated the creation of mountains of “bad data” while giving organizations few tools to deal with it.

One “technology” holds much promise in helping organizations mitigate this issue – crowdsourcing.

Crowd Computing – for Data Problems

The Human “Crowd Computing” model is an ideal approach for newly entered data that needs to be validated or enriched in near real time, or for existing data that needs to be cleansed, validated, de-duplicated and enriched. Typical data issues where this model is applicable include:

 

    • Verification of correctness
    • Data conflict and resolution between different data sources
    • Judgment calls (such as determining relevance, format or general “moderation”)
    • “Fuzzy” referential integrity judgment
    • Data error corrections
    • Data enrichment or enhancement
    • Classification of data based on attributes into categories
    • De-duplication of data items
    • Sentiment analysis
    • Data merging
    • Image data – correctness, appropriateness, appeal, quality
    • Transcription (e.g. hand-written comments, scanned content)
    • Translation

 

This approach is ideal for the Data Warehouse, Master Data Management, Customer Data Management, marketing databases, catalogs, sales force automation data and inventory data, or any time that business data needs to be enriched as part of a business process.

Human Crowd Computing is NOT Outsourcing or “Hiring Temps”

Human Crowd Computing is completely different from outsourcing the problem or hiring a large number of temporary workers.

Human Crowd Computing is instantly scalable, both up and down. Outsourcing is the equivalent of renting some other company’s data center, and “hiring temps” is the equivalent of bringing in a temporary data center. Both approaches take time to “turn on”, neither scales up well, and neither is elastic: you pay for the resource whether you use its full capacity or not.

A Leading Online Marketplace and Human Crowd Computing

A second proof point for this type of technology is an implementation at a leading online marketplace, which has hundreds of millions of listings live at any given moment.

This marketplace has an incredible variety of items listed – in the past, those items have included old gum, entire towns, and even spouses. The fact that anyone can list almost anything makes this marketplace the place to go to find rare or outlandish items.

Major Product Categorization Problems

It’s no surprise, then, that one of the biggest challenges this marketplace faces is product categorization. Product categories are a key way that people search for items.

Here, categorizing a product means matching it to a Global Trade Item Number, a unique 12 to 14 digit number, based on product information which typically must be gathered from multiple different sources.

Depending on the month, the number of products requiring categorization ranges from below 5,000 to close to 100,000. A scalable and elastic computing model is required to support these variations in workload.
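To make the elasticity argument concrete, here is a rough sketch of how much of a fixed team would actually be used if it were sized for the busiest month. Only the 5,000 and 100,000 endpoints come from the figures above; the intermediate volumes and the function name are hypothetical and purely illustrative.

```python
def fixed_capacity_utilization(monthly_volumes):
    """Return, per month, the share of a peak-sized fixed team that is actually used."""
    peak = max(monthly_volumes)
    return [volume / peak for volume in monthly_volumes]


# Hypothetical monthly volumes; only the 5,000 and 100,000 endpoints are from the article.
volumes = [5_000, 20_000, 60_000, 100_000]
utilization = fixed_capacity_utilization(volumes)

for volume, share in zip(volumes, utilization):
    print(f"{volume:>7,} items -> {share:.0%} of peak capacity used")
```

Under these assumptions, a team staffed for the peak month would sit almost entirely idle in the quietest month, which is the cost an elastic model avoids.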

Because judgment calls are involved, and data must be retrieved and compared from potentially many different sources, the CrowdFlower platform asks multiple people to make each judgment call to ensure high levels of accuracy. About 60% of categorizations are completed with 2 or 3 individual responses; however, particularly complex judgment calls can require 10 or more responses. I’ve confirmed that this algorithm is quite tunable: if your data needs a higher level of certainty, you simply involve more human opinions until you reach it.
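The idea of collecting responses until enough people agree can be sketched as a simple majority vote with a tunable agreement threshold. This is not CrowdFlower’s actual algorithm, which the article does not describe; the function names, the thresholds and the category labels below are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_judgments(responses, min_responses=2, min_agreement=0.7):
    """Return (label, agreement) if enough contributors agree, else (None, agreement)."""
    if not responses:
        return None, 0.0
    label, votes = Counter(responses).most_common(1)[0]
    agreement = votes / len(responses)
    if len(responses) >= min_responses and agreement >= min_agreement:
        return label, agreement
    return None, agreement  # not settled yet: another response is needed


def resolve(response_stream, max_responses=10, **thresholds):
    """Consume responses one at a time until the item is settled or the cap is hit."""
    collected = []
    label, agreement = None, 0.0
    for response in response_stream:
        collected.append(response)
        label, agreement = aggregate_judgments(collected, **thresholds)
        if label is not None or len(collected) >= max_responses:
            break
    return label, agreement, len(collected)


# An easy item settles after two matching responses.
easy = resolve(["Toys", "Toys", "Games"])
# A contested item needs more responses before agreement reaches the threshold.
hard = resolve(["Toys", "Games", "Toys", "Toys"])
print(easy, hard)
```

Raising `min_agreement` or `min_responses` in this sketch trades cost for certainty, which is the kind of tuning described above.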

Results Delivered

The marketplace formerly outsourced product categorization, essentially paying for a large staff of contractors who were alternately overwhelmed and idle, depending on the day. From week to week, there could be as much as a 400% difference in workload.

With a Human Crowd Computing platform, the marketplace more than tripled its throughput for product categorization, from 300 per hour to 1,000 per hour. At the same time, the number of improper classifications was reduced by over 67%. To cap it off, CrowdFlower claims that this solution reduced the marketplace’s costs by some 70%.
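The hourly figures quoted above (300 and 1,000 categorizations per hour) can be checked with simple arithmetic. The helper name below is my own; the sketch just shows that the new rate is a little over three times the old one, which corresponds to an increase of roughly 233% over the baseline.

```python
def percent_increase(before, after):
    """Percentage growth from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100


ratio = 1000 / 300                       # new throughput as a multiple of the old
increase = percent_increase(300, 1000)   # growth relative to the old throughput

print(f"{ratio:.2f}x the original rate, an increase of about {increase:.0f}%")
```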

CrowdFlower has published a nice 7-page customer success brief on one of their larger customers that is worth reading. I can’t link to the report directly, but if you go to their home page and click on the “get a free report” button, you’ll get it via e-mail within about 30 seconds – after you answer 3 or 4 pesky questions.

Conclusion

Without question, this model of Human Crowd Computing will become increasingly mainstream in organizations. It’s highly appropriate for any situation involving large to huge numbers of small tasks that require human judgment. With the appropriate software platform, the internet and commonly available connectivity/interoperability software, this solution may be exactly what you need for your data problem.


 
