Gems Trust Scores – a diamond in the rough


Microtasking is a growing market, with some valuing the industry at $2 billion a year. Most commonly, these are small jobs that software can’t do well, like labeling images with meaningful captions, or doing usability testing for new apps. People sign up to do these small tasks, which pay maybe $0.05 USD a pop.

Amazon’s Mechanical Turk is the market leader, connecting people who need things done with workers, often low-skilled, who don’t mind making an average of $5/hour or less. Pay is pushed down further by the 20-40% fee that Amazon levies on every transaction. That’s a huge market inefficiency, opening the door to competitors.

A more efficient market with blockchain

Gems is a newly proposed open source replacement for Amazon’s Mechanical Turk. It addresses a number of flaws in existing centralized microtasking services, including the massive amount of redundant tasking required to ensure quality results: the problem of who’s watching the watcher. The Gems whitepaper proposes introducing a “verifier” role into the mix, and identifying users eligible for this role by implementing a trust model to evaluate the performance of every worker (called miners). The paper shares the mathematical functions implemented in the smart contracts that form the basis of the platform.

Unfortunately, as a single “trust score” value, it’s pretty useless.

No objective reality

\frac{ \frac{ \hat{p} + \frac{z_{\alpha /2}^{2}}{n} + z_{\alpha /2}\sqrt{ \frac{\hat{p}(1-\hat{p})}{n} + \frac{ z_{\alpha /2}^{2} }{4n^2} } }{ 1 + \frac{ z_{\alpha /2}^{2} }{n} } + \frac{ \hat{p_{i}} + \frac{z_{\alpha /2}^{2}}{n} + z_{\alpha /2}\sqrt{ \frac{\hat{p_{i}}(1-\hat{p_{i}})}{n} + \frac{ z_{\alpha /2}^{2} }{4n^2} } }{ 1 + \frac{ z_{\alpha /2}^{2} }{n} } }{2}


  • \hat{p} is the fraction of positive task completions
  • \hat{p_{i}} is the fraction of positive task completions completed on the first try
  • n is the total number of task completions
  • z_{\alpha /2} is the (1−α/2) quantile of the standard normal distribution (like, 1.96 for 95%)
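
To make the formula easier to experiment with, here is a minimal Python sketch that transcribes the expression exactly as printed above. The function names and the default α of 0.05 are assumptions for illustration, not anything taken from the Gems whitepaper or its smart contracts. The expression resembles the upper bound of a Wilson score interval, although the standard Wilson form has z²/(2n) where the printed formula has z²/n. The sketch follows the formula as printed, so its output may not match the graph exactly.

```python
from math import sqrt
from statistics import NormalDist


def interval_term(p_hat: float, n: int, z: float) -> float:
    """One of the two large fractions in the formula, transcribed as printed."""
    z2_over_n = z * z / n
    spread = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (p_hat + z2_over_n + spread) / (1 + z2_over_n)


def trust_score(p_hat: float, p_hat_first: float, n: int, alpha: float = 0.05) -> float:
    """Average of the term for all positive completions and the first-try term."""
    if n <= 0:
        raise ValueError("n must be a positive number of task completions")
    # (1 - alpha/2) quantile of the standard normal, about 1.96 for alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (interval_term(p_hat, n, z) + interval_term(p_hat_first, n, z)) / 2


# Same 50% accuracy, increasing task counts: the score drifts toward 0.5 as n grows.
for n in (3, 30, 300, 3000):
    print(n, round(trust_score(0.5, 0.5, n), 3))
```

For a fixed accuracy, the score is highest when n is small and settles toward the miner's actual accuracy as n grows, which is the behavior the rest of this post takes issue with.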

This looked unintuitive to me, so I ended up graphing it. Each line represents a correct task completion percentage, from 1 (top line of graph) to 0 (below the visible axis) in increments of 0.05. Don’t mind my upside-down legend.

Gems Trust Score graph

The trust score ranges from a minimum of 0.5 to a maximum of 1.0. The score isn’t linear, so even as miners perform larger and larger numbers of tasks, their trust (and hence their financial value to your microtasks) is indistinguishable from that of someone who’s just starting out, possibly on a new account created to escape a bad reputation. Since Gems is advocating a low amount of friction in the creation of accounts, this could be exploited with automation: create a huge number of accounts, then submit randomly generated task results in the hope of getting some through and gaining Gem credits.

If you choose a relatively high-sounding trust score, like 0.80, you’re including people who’ve only gotten 50% of tasks done correctly, on up to 64 tasks. Even if you raise the bar to a trust score of 0.95, the system still permits someone who has done 3 tasks with 66% accuracy into your labor pool. This means there is no “objective” level of trust that can be used for a project: every project will have to experimentally determine the amount of trust that results in valid, affordable results for the project.

We can sell trust for you wholesale

Longer term Trust Scores graph

The trust score takes your entire tasking history into account. There are two big consequences to putting too little weight on recent performance:

  1. people with low scores are motivated to abandon their low scoring accounts and start new ones
  2. no one will hold onto old accounts with high trust scores, because the resale value to disreputable miners will be so high and because that experienced miner can have an equally high trust score quickly with a new account.
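
To see how a whole-history score can hide recent behavior, here is a small Python sketch comparing lifetime accuracy with accuracy over a recent window. The window size of 50 and the task history are made up for illustration; nothing here comes from the Gems contracts.

```python
def accuracy(results):
    """Fraction of tasks completed correctly; results is a list of booleans."""
    return sum(results) / len(results) if results else 0.0


def recent_accuracy(results, window=50):
    """Accuracy over only the most recent `window` tasks (hypothetical window size)."""
    return accuracy(results[-window:])


# Made-up history: 1,000 correct tasks followed by 50 incorrect ones,
# for example a trusted account that has just changed hands.
history = [True] * 1000 + [False] * 50

print(f"lifetime accuracy: {accuracy(history):.3f}")
print(f"recent accuracy:   {recent_accuracy(history):.3f}")
```

The lifetime figure stays above 95% even though every one of the last 50 tasks failed, which is the gap a recency-aware score would need to close.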

With a stable of accounts with high enough reputation, a malicious actor could conceivably act as a verifier for a certain percentage of its fraudulent workers, allowing a higher percentage of fraudulently awarded Gems.

You can try tweaking the parameters yourself, changing the variables relating to positive task completion, first time accuracy, and quantitative factors by using this Google Sheet: Gems Trust Score Playground 

Let me know if you come up with any of your own insights.

Blanket trust leads to Trust Farming

Trust is calculated across all task types and difficulties. Task difficulty IS tracked in modules, in order to automatically determine the amounts of verifier payouts over the regular mining payout, but it carries no weight in the trust score. With blanket trust like this, someone who is perfectly trustworthy on easy tasks, but not hard ones, will also be let into your pool.
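
As a rough illustration of what weighting by difficulty would change, the Python sketch below compares a plain accuracy with a difficulty-weighted one. The 1-to-5 difficulty scale and the sample history are invented for the example; how Gems modules actually represent difficulty is not covered here.

```python
def blanket_accuracy(tasks):
    """Plain fraction correct, ignoring difficulty. tasks: list of (difficulty, correct)."""
    return sum(correct for _, correct in tasks) / len(tasks)


def weighted_accuracy(tasks):
    """Fraction correct where each task counts in proportion to its difficulty."""
    total = sum(difficulty for difficulty, _ in tasks)
    earned = sum(difficulty for difficulty, correct in tasks if correct)
    return earned / total


# Invented history: 90 easy tasks (difficulty 1) all correct,
# 10 hard tasks (difficulty 5) all wrong.
tasks = [(1, True)] * 90 + [(5, False)] * 10

print(f"blanket accuracy:  {blanket_accuracy(tasks):.2f}")
print(f"weighted accuracy: {weighted_accuracy(tasks):.2f}")
```

The same miner looks 90% accurate under blanket scoring and roughly 64% accurate once the hard tasks are weighted in.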

It is conceivable someone could create a “trust farm” by writing a simple “Hello World” work module, with the completed task being “push this button.” For the minimum payout allowed, and with no verifiers (cheaper), a malicious actor could build an army of trusted miner accounts and then use them to exploit high-paying tasks.

Better Than Nothing

Despite the flaws of this trust model, implementing it into the system as a method of reducing redundancy in the verification step is a great idea. Some things that could maybe be integrated to address some of these issues:

  • Change trust from a single numerical value to a vector, to indicate the recent trend in trust: steady, high score? Hire! Extreme acceleration? Buyer beware.
    • Increases the motivation a poorly performing miner has to keep their account, knowing they can turn things around as they learn the ropes, and
    • decreases resale value on highly trusted accounts, since recent performance is factored in at a more reasonable rate.
  • Consider an EigenTrust-type algorithm, which constructs trust from how much trust each microtask requester puts in each worker. This helps ensure trust farming is much, much more expensive since a single requester can’t boost the trust of a worker pool beyond believability.
  • Change the grading scale — something intuitive like letter grades A, B, C, D, and F.
    • Start new users as C-grade miners, and make them work to achieve A status, rather than starting them out up there.
  • Gamify the miner experience — trust is given by requesters, reputation is owned by the individual miner. If there are mechanisms to help build a miner’s reputation, plus mechanisms to motivate the miner to progress through reputation/trust tiers, you incentivize workers to stick with their accounts.
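
The EigenTrust suggestion in the list above can be sketched in Python as follows. This is a simplified power iteration in the spirit of EigenTrust, not code from Gems and not a complete implementation: the ratings matrix, the choice of pre-trusted peers, and the parameter values are all assumptions for illustration.

```python
def eigentrust(local, pretrusted, alpha=0.15, iterations=50):
    """Simplified EigenTrust-style global trust.

    local[i][j] is how much peer i trusts peer j (any non-negative number).
    pretrusted is a list of peer indices assumed trustworthy from the start.
    """
    n = len(local)
    p = [1 / len(pretrusted) if i in pretrusted else 0.0 for i in range(n)]

    # Normalize each peer's outgoing ratings so they sum to 1. A peer that
    # rates nobody falls back to the pre-trusted distribution.
    normalized = []
    for row in local:
        total = sum(row)
        normalized.append([v / total for v in row] if total > 0 else p[:])

    # Power iteration: trust flows along ratings, with a fraction alpha
    # always returning to the pre-trusted peers.
    t = p[:]
    for _ in range(iterations):
        t = [
            (1 - alpha) * sum(normalized[i][j] * t[i] for i in range(n)) + alpha * p[j]
            for j in range(n)
        ]
    return t


# Hypothetical peers: 0 and 1 are requesters, 2 and 3 are honest workers,
# 4 is a farmed account rated only by requester 1.
ratings = [
    [0, 0, 5, 5, 0],
    [0, 0, 0, 0, 9],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
scores = eigentrust(ratings, pretrusted=[0])

# Requester 1 inflating its rating of the farmed account changes nothing,
# because each peer's outgoing ratings are normalized.
inflated = [row[:] for row in ratings]
inflated[1] = [0, 0, 0, 0, 9000]
inflated_scores = eigentrust(inflated, pretrusted=[0])

print([round(s, 3) for s in scores])
```

In this toy setup only requester 0 is pre-trusted, so the ratings handed out by requester 1 carry no weight and the farmed account ends up with no global trust, however large those ratings are.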


Written by: masyukun

Software engineer consultant.
