AGI delayed? My recent experience with Forfeit.

As I have mentioned in another thread (Amazing Marvin Has Pledges Now), I have enjoyed using Forfeit (https://www.forfeit.app/) lately.

It’s so effective that I haven’t derailed, eh, forfeited, yet. However, as we know, forfeiting isn’t failing, and I created the ambitious Forfeit (that’s what they call goals) to take a camera-only photo of “My left hand holding three fingers up.”

Here is the photo I have submitted as evidence:

[Image: my_left_hand_grey]

Well, my intention was clearly to nail it by failing it, and also I just wanted to pay Forfeit some money because they currently don’t charge a subscription fee.

And you can already see where this is going; to my surprise the evidence was accepted:

image

Edit: Josh, co-founder of Forfeit, was super responsive and told me that all photos are currently validated by humans. He said stronger language, like “it must be the right hand,” would have been required to elicit a request for clarification from the verifier.

I then put the same image into GPT and asked it to judge whether the image fulfills the criteria:

image

(I think I am using the unpaid version of GPT on my phone, so maybe this would yield correct results with GPT4o. In before everyone is claiming that it is judging correctly for them.)

Edit: GPT4o can in fact correctly tell that this is not a left hand per comment below by dreev.

To be honest, this has decreased my enthusiasm for Forfeit a little bit. I recognize that this is not a big deal but it would have felt much better if the evidence had been denied.

Also, I was a bit surprised because I thought vision was pretty much solved. You would think there is enough training data labelled with things like “pen held in a right hand” that this would have a high chance of being judged correctly.

1 Like

GPT4o: “False. This is a picture of a right hand holding three fingers up.”

But I think Forfeit was correct to deem that acceptable evidence. You’re submitting a photo and averring you did the thing. That’s the heart of it. It’s like signing a document. It doesn’t matter that physical signatures aren’t technically secure or that you reproduced your usual signature. What matters is you taking a specific ritualized action that amounts to “I promise this is legit” or “I officially agree to this”.

4 Likes

I should have created a fatebook prediction that I think that GPT4o is going to label it correctly.

Yes, that’s pretty much the reasoning Josh has just given me. Also, he has told me that currently all photos are validated manually by humans. Stronger wording like “it must be the right hand” would have been required.

(I will also add an edit to the original post.)

1 Like

I have thought about this some more and have come to the conclusion that I disagree.

If it’s just about the ritualized action, I might as well use Beeminder, where I say “I promise this is legit” by submitting a data point.

The value proposition of Forfeit (as I understand it) is that I am held accountable for a specific action, the one that I specify, and not some generic action like “uploading a photo” or “providing a signature.”

If my goal is to make three PowerPoint slides about Bayern Munich, and I happen to find one about Barcelona on the internet and upload that as evidence, then should it be accepted because I “followed the ritual of uploading a presentation about soccer clubs with the initials FCB”? Obviously not. Same thing with my hand. Maybe I have had an issue with my left hand, and my goal was to do physical therapy super diligently to be able to use my three fingers again. Now, the reason Forfeit motivates me is that someone (be it a human or an AI) verifies that I have achieved that specific goal. Since my right hand was working fine all along, it doesn’t help me that it has been accepted. When in doubt, the verifier (be it a human or an AI) should have asked for clarification.

If everything in the previous paragraph is wrong, then Forfeit is just Beeminder with a nicer app and the ability to add data points as photos or time lapses. But I think, and I hope Josh would agree, that it is more than that. It is the human (or AI) verification that is also part of the value proposition. (Again, all of that is my opinion and I might have fully misunderstood the purpose of Forfeit.)

1 Like

Not gonna lie, I personally can’t tell if that’s the right or left hand :rofl:

since, if you were taking a picture of your left hand in the mirror it would also look like that…

I think you should test this again with a more clear and obvious distinction like putting five fingers up instead of three and see what happens on forfeit.

4 Likes

I agree that’s what’s valuable about these systems (BM and Forfeit). To some extent they overlap because BM will auto check/derail things with some of its integrations (like Rescuetime) and Forfeit also lets you submit basically “I promise I did it” self-verification.

Forfeit was using the ChatGPT vision API to verify photo evidence and compare it to the description (then having humans review the ones it failed or couldn’t tell), and this seemed to work pretty well. It was on the app for like a week. Not sure what happened to it, but it’s my understanding they intend to re-enable/implement that at some point.

If you’re currently meeting all your Forfeits and it’s working well for you, I’d personally prob avoid trying to test things like this in ways that might decrease your confidence in them/make it work less well for you. If it isn’t broke (and you’re not failing), don’t fix it.

But I did have a similar (but unintentional) experience, in that once I accidentally submitted the wrong evidence for a forfeit, and it was approved. I also found it troubling, because I agree that ideally it’d be catching things like that. I also messaged Josh and he said he’d reprimand the guy reviewing the evidence.

Note they also have caught stuff like this for me too – once I submitted the wrong evidence and they asked about it and had me resubmit. Another time it was the correct evidence, but the guy was unclear about a part of it, and so asked for clarification.

Also this is a good point.

1 Like

I disagree.

Forfeit’s whole promise is confirmed accountability. Whether human verified or AI assisted + human verified, the end result that they are promising is a reliable accountability system that you can’t weasel around.

It makes a huge amount of sense to stress test your systems when you’re doing well, to plan for the future version of yourself that might try to weasel.

This aligns with the whole akrasia principles beeminder was founded on !

in my opinion, the illusion of accountability can only last for so long, if it’s only an illusion. I don’t want a temporary system that’s working for me. I want something that is future proofed and can support me for the next 5 to 10 years. and it makes sense to stress test the system periodically to see if it’s robust

in my opinion, we should strive to create anti-fragile commitment device systems that actually get stronger whenever there are derails…

To me, my system is anti-fragile because whenever I derail it forces me to come up with a few ideas about how to prevent derailing again and I’m directly incentivized to implement those. A big part of this for me is using Boss as a service, since it gives me someone to bounce ideas off of when I derail, and prevents me from cheating.

Proper use of Forfeit ideally should create a similar system as boss as a service + beeminder. if they truly aren’t verifying proofs as dutifully as possible, then that’s a fundamental issue with their platform and hopefully someone else would enter the market to fix that feature

at the same time, I understand where you’re coming from in that it can be scary, the prospect of losing confidence in a system that’s working so well for us. I myself admit I’m scared to even test forfeit right now unless I knew of a qualified competitor that I could jump to if the tests failed… maybe that shows that we need more quality entrants into this exact market that forfeit is serving

if anyone knows of a forfeit clone, that is more reliable, let us know

That’s funny. You definitely have a point there!

I guess a similar argument could be made for my soccer club example. For someone not familiar with soccer clubs, it might be hard to judge whether my slides are about Barcelona or Munich. I would still argue that a presentation about, say, dolphins should not be accepted (and apologies to Dreev if I am misrepresenting his argument here, but I think that’s effectively what he was saying, that a PowerPoint about dolphins would also be okay because of the ritualistic nature or something).

There will always be ambiguity, I guess, but I expect the evidence to be verified to a reasonable degree. Otherwise, I just don’t see the point.

That’s a good suggestion. I might try that at some point, although my Forfeit account probably has an exclamation mark attached to it by now, so I would be really disappointed if they missed that at this point.

Probably it was too strict, if I had to guess?

I got another message from Josh and he explained that given 100 evidence submissions that do not perfectly match the description, 80 should be accepted (according to the users), and 20 really be denied. They don’t want to piss the 80 off, so they lean more towards being lenient.
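To make that trade-off concrete, here is a tiny sketch using Josh’s 100/80/20 figures. The two extreme policies (“accept everything” versus “deny everything”) and the idea of counting “wrong calls” are my own framing for illustration, not how Forfeit actually decides.

```python
# Josh's illustrative numbers: of 100 submissions that do not perfectly
# match the description, users would want 80 accepted and 20 denied.
SHOULD_ACCEPT = 80
SHOULD_DENY = 20


def wrong_calls(policy: str) -> int:
    """Count submissions judged against what the user actually wanted.

    'lenient' accepts every imperfect submission and 'strict' denies
    every one of them. Both are hypothetical extremes.
    """
    if policy == "lenient":
        return SHOULD_DENY    # the 20 that should have been denied slip through
    if policy == "strict":
        return SHOULD_ACCEPT  # the 80 legitimate ones get denied
    raise ValueError(f"unknown policy: {policy}")


print(wrong_calls("lenient"), wrong_calls("strict"))
```

Under these assumptions the lenient extreme makes a quarter as many wrong calls as the strict one, which is presumably why they lean that way. Asking for clarification in doubtful cases would sit somewhere in between.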

Maybe an automated legitimacy check like what Beeminder does could solve that problem for them.

Thanks for sharing. I just find that a little discouraging. Beeminder is working great for me for most things but for tasks where I know myself to be a little weasel Forfeit would be a great complement given that I can trust them to judge the evidence reasonably well.

Yes, for me I kind of have crossed that threshold now. I am going to do the three finger/five finger test, and then I have another idea for a timelapse-based task that I can flunk, and then I will reassess from there.

Update, I cheated twice on my Meditation Forfeit.

The forfeit requires me to submit a 20 minute timelapse of “Myself Meditating” using forfeit’s built-in timelapse recorder (that basically doesn’t allow you to navigate away from the app while filming?).

The first cheat was fairly vague, in that it had me sitting in bed on my laptop, reading and typing with the laptop out of view. This was approved by Forfeit.

The second cheat was less vague. I was sitting in bed on my laptop again, reading and typing, with a clear view of the laptop keyboard in the shot. This was also approved by Forfeit.

This is disappointing for me, as I at least expected them to message for clarification on the first cheat, and charge me for the second. I think the issue is that their system should err more to the side of asking for clarification…

I haven’t talked to the founders yet, but I think I may need to adjust my task to say “myself meditating with full view of my hands not doing anything.”

3 Likes

I wonder if the system worked in part because of a placebo effect: the premise of “we are checking and we mean it,” and the corresponding “oh, I had better, since they will know otherwise.” Now that has worn off somewhat because, at first glance, it appears to be not what you expected but smoke, mirrors, and a curtain.

What if the system were somehow calling out each case correctly, without any false negatives or false positives? How might that change how you use it? How might that change how you meditate?

3 Likes

Well for this week I changed my goal description to “myself meditating with full view of my hands”.

this change has motivated me a lot more to not grab my laptop while meditating. however, I’m sure I will slip up and cheat sometime soon, so it will be interesting to see if they mark it as a failure just because of this more detailed description.

1 Like

Let us know how that works.

I’ve previously used “I must stay on my cushion and not interact with anything” which is pretty non-ambiguous to me but I don’t have it in my heart anymore to actually stress test it after my recent experiences and the feedback in this thread.

I really seem to fall for the typical mind fallacy here. Why would people who go so far as to use an app whose only purpose is to keep them accountable not want to be strictly challenged on their evidence?

2 Likes

I think most people would want to be challenged on the evidence. It goes both ways though, I’m curious why people would submit half assed evidence? Personally, I’ve never submitted evidence that I knew would fail my evidence standards in hopes that Forfeit wouldn’t notice.

I agree that if that were common, the issue you’re talking about would be important, but I’m not sure it is. In this thread in particular, it sounds like most of the incorrect evidence was submitted to test them.

BTW I personally submit evidence (a timelapse of me doing 12 pullups) that is technically impossible for Forfeit to verify. They can see how much time is in the video + a few stills of me hanging from a bar. But they’d have no idea if I did 12, 11 or was just hanging there for a while. Still it’s enough to get me to do it, and I’ve never cheated. I think it’s a combination of not wanting to open up that can of worms + I actually do have to at least hang on the bar for proof, and that’s enough to get me started (like the old Scott Adams anecdote of how he didn’t make himself go to the gym, but he at least had to put his gym shoes on, and that usually was enough to get him to go).

Interestingly, I respect these evidence checks even if they’re not perfect. I find it a lot more agreeable than using something like (not to pick on this) TaskRatchet, where – last time I checked – you just list your tasks, self-report whether or not you did them, and pay if not. Forfeit’s (imperfect) checks at least seem like they’re doing something to earn the payment.

1 Like

Consider that my original intention was to “donate” $10 to Forfeit because I wanted to say thank you for the value I am getting. I don’t want to say that I had zero intention to test the service, but mostly I thought it was a foregone conclusion that they would simply collect their money. Though, I admit that three fingers versus five fingers would have been less ambiguous than right hand versus left hand.

Checks and balances are a necessary part of our society. If a software developer submits a pull request at work, they typically don’t write bad code to challenge the reviewer. Ideally, they write great code and accepting the pull request is a formality. However, if they start to get away with subpar code, they might be tempted to take shortcuts or get resentful about the reviewer. After all, if the reviewer doesn’t catch the obvious sloppiness, how can they be trusted to detect serious blunders (and what are they getting paid for?).

That’s how I feel about Forfeit at this point. I don’t trust them to review the evidence properly anymore, and it tempts me to take shortcuts. Like getting up ten minutes before my meditation session is supposed to end because I know they will call it good. I would have appreciated it if the founder had just said “got it, we’ll judge you more strictly from now on.”

For meditation, I really like the idea of somebody making sure that I am seated on the cushion for the whole hour. I can still do that by myself but I might just be a bit tempted to do ten minutes of walking meditation in between “to loosen up.” Of course, I can write that into my Beeminder fine print and enforce it that way, but having a bit of added accountability via the time-lapse just feels nice.

1 Like

So, for a solid 9 days I had this goal, and meditated almost every day without cheating (which felt great and was really helpful for my overall mental clarity and cognition!). I think I derailed 1-2 times.

Yesterday I did submit my first cheat for the new goal with the more specific description. I was using my laptop for 15 of the 20 minutes of the meditation. I did actually meditate for the final 5 minutes, but for those first 15 minutes, my hands were not visible at all, as I was using my laptop and viewing the 40-inch monitor it was connected to.

So, the result of my test was: the additional specificity of “full view of my hands” did not result in a more strict evaluation.

Again, this is concerning, and emphasizes the importance of more entrants into this market who can actually hold us accountable, and have integrity of proofs as a focus.

1 Like

You underestimate people’s willingness to self-sabotage and shortcut their goals. Some people are more devious than others when it comes to akrasia.

There are a lot of people who set big goals for themselves, with full confidence in the moment they can do it, and then life gets in the way. Some people are faced with these options when their goal deadline creeps up:

  1. Do the thing, and don’t get charged
  2. Don’t do the thing, and get charged
  3. Cheat, and don’t get charged

Option 1 is not always easy, if the goal was larger, let’s say it takes 1 hour and they procrastinated it to 20 minutes before the deadline (sidenote: this probably shows that small chunking goals are better for commitment devices). In that scenario, IF the pledge amount was low enough, then a lot of us would just go with Option 2. However, if someone set their pledge level too high (or high enough - more on this later), then a lot of people might not even consider Option 2. They then are faced with Option 1 and Option 3 which as you can see have the same charge result of “don’t get charged”. Yes, the less creative or most honorable of us may NEVER consider Option 3, but if you had $1000 on the line for instance, and Option 1 was difficult for you (you made it a commitment device for a reason, anyhow!) then it’s natural that some people’s brains may gravitate to Option 3, and give cheating a shot.
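Here is one way to sketch the three options as a simple cost comparison. The dollar-equivalent “costs” for effort and for the guilt of cheating are made-up numbers purely for illustration, and real brains are obviously messier than a `min()` call.

```python
def likely_choice(effort_cost: float, pledge: float,
                  can_cheat: bool, cheat_cost: float = 0.0) -> str:
    """Return the option with the lowest perceived cost.

    effort_cost: how painful doing the task feels, in hypothetical dollars
    pledge:      the amount charged on a derail
    cheat_cost:  guilt/hassle of faking evidence, in hypothetical dollars
    """
    options = {
        "Option 1: do the thing": effort_cost,
        "Option 2: get charged": pledge,
    }
    if can_cheat:
        options["Option 3: cheat"] = cheat_cost
    # Pick the key whose cost is smallest.
    return min(options, key=options.get)


# Low pledge: just paying is the cheapest way out.
print(likely_choice(effort_cost=50, pledge=5, can_cheat=True, cheat_cost=10))
# High pledge with cheating possible: cheating becomes the cheapest.
print(likely_choice(effort_cost=50, pledge=1000, can_cheat=True, cheat_cost=10))
# High pledge with cheating impossible: doing the thing wins.
print(likely_choice(effort_cost=50, pledge=1000, can_cheat=False))
```

In this toy model, raising the pledge only pushes you toward Option 1 if Option 3 is off the table or costs more than the effort itself.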

This is cheating. Cheating is bad ! You bad boy ! “But OMG won’t that make you feel guilty!?!?!”

Remember, this is humanity we are talking about. People cheat on their married spouses where they had a ceremony and made a commitment contract with the government and their God. So, yes they will definitely cheat on commitment devices like beeminder and forfeit haha. And, it could be argued that the people that cheat ARE THE ONES WHO NEED COMMITMENT DEVICES THE MOST !

I am one of those people. AMA. Akrasia is a bitch. As humans, our brain is always looking for efficiency (ie strategic laziness). If I can cheat the goal, and convince myself that cheating “just this once” isn’t a big deal, and saving the charge money is motivating enough, then I might try to cheat (and in the moment, cheating may seem the most efficient option, even if that is an akrasia/procrastination fueled lie).

Now, this is why the integrity of these “weaselproofing” services is so important for people like me and many others, who have struggled our whole lives with procrastination, ADHD, and akrasia, so much so that it is destroying our life’s potential. So, with something like BAAS or Forfeit, yes, just having a platform with imperfect checks IS useful in its own right and might help us a bit, BUT it’s not enough for those of us who have severe procrastination or ADHD or whatever is causing the severe akrasia. You could think of it as a spectrum.

No Akrasia <----------------------------------------------------------> Max Akrasia

People closer to the left side of the spectrum might not care how strong the integrity of the proof-checking system is. People closer to the right NEED things like no-excuses mode, full weaselproofing, people double-checking their proofs and checking in with them, Forfeit to actually do what it promises, etc.

I’d be very interested to know what the distribution of that spectrum looks like for beeminder community and other sample selections.

Now, on a final note, I think an interesting factor in all of this is how much money you set pledge amount at. I think another spectrum could be useful here:

Zero Motivation/Stress $0.01 <--------------> Max Motivation/Stress $10,000

Now, I put 1 penny and $10,000 as the limits for this spectrum. However, I think a more useful metric for the right side of the spectrum might be % of net worth. So, something like this:

0.0000001% of Net Worth <-------------------------------> 100% of Net Worth

The reason being, that money is relative to someone’s financial situation. A billionaire will not give a fuck about $100k. And someone with a Net Worth of $100k would care a lot about their $100k. Law of Scarcity vs Law of Abundance.

Now, this makes me realize that income and living expenses are factors as well… for instance, if the person with $100k net worth also just got a job last month that pays $30k/month (good for them!), then they might care slightly less about the $100k than the person who has $100k net worth but whose job only pays $4k/month. For the sake of argument, I will ignore this distinction for now, and just use net worth to see where this leads us.
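As a quick sketch of the net-worth framing, using the billionaire and $100k examples from above (the function name is just something I made up for illustration):

```python
def pledge_as_percent_of_net_worth(pledge: float, net_worth: float) -> float:
    """Express a pledge as a percentage of net worth (income is ignored)."""
    if net_worth <= 0:
        raise ValueError("net worth must be positive for this comparison")
    return 100.0 * pledge / net_worth


# The same $100k pledge for a billionaire vs. someone worth $100k.
billionaire = pledge_as_percent_of_net_worth(100_000, 1_000_000_000)
saver = pledge_as_percent_of_net_worth(100_000, 100_000)
print(f"billionaire: {billionaire:.2f}%, $100k net worth: {saver:.0f}%")
```

The same dollar amount lands at opposite ends of the spectrum: about 0.01% for the billionaire and 100% for the person with $100k.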

I may follow up on this later… but for now, all I want to say is that I’m not sure what the optimal pledge amount is for me or for anyone. I find personally that a $5 derail doesn’t stress me very much. My perfectionist nature makes me prefer to not lose the $5, but my brain doesn’t view it as materially affecting my bank account/cashflow.

For reference, currently my bank account has $1,000 in it, so it’s not like I’m super cash rich haha. However, I do own a house with a large amount of equity, and have a somewhat stable cashflow/profit each month from my business ~$1000-$2000/month.

I do find $20-$30 to be a bitter pill to swallow. When I derail on those, it stresses me out a decent amount. Probably about 10x the stress of a $5 derail. This is interesting because even though it’s only 4-6x the money, it causes 10x the stress, so the stress grows faster than the money (ie: a leverage opportunity).
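For the curious, here is a rough way to put a number on that. It assumes stress scales as a power of the pledge amount (stress ∝ pledge^k), which is purely my assumption for illustration, and it uses my gut-feel “10x” figure, not any real measurement.

```python
import math


def implied_exponent(money_ratio: float, stress_ratio: float) -> float:
    """Solve stress_ratio = money_ratio ** k for k.

    Assumes a power-law relationship between pledge size and stress.
    """
    return math.log(stress_ratio) / math.log(money_ratio)


# $20 and $30 pledges are 4x and 6x a $5 pledge; both feel ~10x as stressful.
for money_ratio in (4, 6):
    k = implied_exponent(money_ratio, 10)
    print(f"{money_ratio}x money -> exponent k = {k:.2f}")
```

Any exponent above 1 means each extra dollar of pledge adds more than a proportional amount of stress, which is the leverage I mean.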

However, the $30 pledge is more tempting to cheat on than the $5 one. And when I had pledge levels of $90 and $270 when I first started Beeminder, I never derailed because I would either choose Option 1 or Option 3.

To me, the solutions that come to mind are:

  1. Raise the pledge level until it’s causing you enough stress to get you to choose Option 1 over Option 2. Even if you are at least trying to pursue Option 1 and failing, rather than just giving up and choosing Option 2 right away, that’s a big win for most people, because getting started with what you are resisting is often the hardest part (and starts building the pathways in the brain for habit formation).

  2. Eliminate The Possibility of Cheating. One great way to deal with Option 3 is to just eliminate it. This is what a perfect “Forfeit” app would do. Done correctly, your brain will still think about how to cheat, and then realize it’s not feasible, and then your brain will return its attention towards creative ways to actually get Option 1 completed! That’s the key here, it’s all about attention. With cheating as a feasible possibility, certain people’s attention will move there, and become wasted attention. We want our attention on Option 1. There is a parallel in society, for there has certainly been tons of human ingenuity and attention spent (wasted!) on things such as bank/museum robberies, building extensive drug dealing operations, big pharma, cigarette companies etc. I think of these endeavors as “cheating” our societal system because they are profiting off of bad things. This human ingenuity and attention could have been spent on goals that benefit all of humanity!

Would be interesting to know the plot points for beeminder users of where they think they are on this graph, where the x axis is how akratic they feel they are, and the y axis is where they think the sweet spot pledge level is for them:

I feel this way as well, that I take the app less seriously now…

When it really should be a serious thing due to how proper application of it can result in massive success for our goals !

Forfeit is harder to cheat yourself on than beeminder, but does not bill itself as infallible. They specifically claim

If your Forfeit remotely matches your description, we’ll approve it.

The system is operating as intended.

3 Likes

I am not sure if you have read the whole thread, and why you have only partially quoted the paragraph.

This is the whole section from the FAQ:

If your Forfeit remotely matches your description, we’ll approve it. Ie, if your description is “myself at the gym”, and you send a photo of a set of weights, we’ll approve it. If you find yourself cheating yourself, and sending a photo of your weight set at home, specify the name of the gym (ie, “Gold’s Gym”), or a specific object that is only found at the gym. It’s easy to get the hang of!

They specifically say one can fine tune the description “if you find yourself cheating,” which napkin did:

Well for this week I changed my goal description to “myself meditating with full view of my hands”.

But then they submitted evidence that did not match that description and it was still accepted.

I did actually meditate for the final 5 minutes, but for those first 15 minutes, my hands were not visible at all, as I was using my laptop and viewing the 40-inch monitor it was connected to.

Probably someone is going to spin this again in some way like “oh, they were saying you can specify that it has to be Gold’s Gym, but it doesn’t actually mean that Gold’s Gym has to be recognizable in the photo for it to be verified; it’s just the ritual of being more specific that increases your chance of doing it.”

The co-founder himself has told me that stricter wording should lead to stricter validation, and yet it hasn’t made a difference.

By the way, a little further down in the same FAQ you linked:

If your photo/timelapse doesn’t match the description of your forfeit, we’ll always reach out to you for clarification on what the image was before charging. Less than 5% of failed forfeits are due to incorrect evidence.

Based on this evidence, I just don’t think that the statement “The system is operating as intended” is correct. Maybe it is if we assume some intention that is not publicly documented, but at least the statement “The system is operating as documented/explained” is obviously false.

My point from a while ago still stands: If you want a Beeminder with a nicer app and photos as data points, then Forfeit is a good option. If you like the idea of having a human (or AI) verify your data points (beyond “the ritualized action of uploading a photo”), then Forfeit sadly fails to deliver on its promises. Which, to repeat it, is a bummer because it was really working for me.

2 Likes

Hi there,

I wanted to share my thoughts as someone who’s been a longtime user of Forfeit and recently started working here.

It seems like this thread might be focusing on issues that aren’t as big as they’re made out to be. Also, it might give new users the wrong idea that Forfeit isn’t reliable.

I’ve submitted 2931 Forfeits in 411 days. Sure, some were Self Verify, so you can trim that number down a bit. But the point is, I know the app well, I think I’m still in the top 10 users or something. I’ve seen it fail and I’ve seen it succeed.

From my experience using Forfeit and now reviewing Evidence and talking to users every day, I’d say the system works about 95% of the time. People generally love it, and it does what it’s supposed to do.

Yeah, some people try to cheat, but it’s a small group and we keep track of them over time. We also take notes on users and use them as guidelines to personalise our approach.

We often ask, “Do you want us to be stricter?” because everyone has different needs:

  • Some need close monitoring to avoid cheating.
  • Some just like the extra layer of accountability.
  • Some are in between and need different levels of strictness depending on their situation.

Right now, we spend a lot of time trying to give users a somewhat personalized experience. It’s not scalable yet, but it gives us a lot of insight. The product is still young, so hang in there, things will get better. We’re aware of the small issues you mentioned, as well as the bigger ones.

I say small issues because, unlike in video games where cheating is a minority of players but they disrupt everyone’s experience, here cheaters are a minority but only disrupt themselves. Same for bank robberies or well the whole argument basically.

The product isn’t perfect, but it works most of the time, and we’re actively working on improvements. Our team is tiny, just one reviewer, Josh, Eddie, and now me. We’ll be bringing back AI reviewing soon, which will make things easier and less fallible. That’s our top priority right now.

We’re also working on stronger appeal rules, which will help the main reviewer, the AI, and the users.

If you’ve lost trust in the system, you can always contact us through the chatbox and we can adjust the strictness for you. Soon, with AI, this won’t even be an issue.

We welcome competitors. If we’re not working for you, we deserve to be overtaken. But we’re shifting to a human-assisted AI verification system, which should be almost foolproof, and will only become better (basically 100% foolproof) by the end of the year with newly released models.

We’re adding in ‘verification instructions’, which will allow you to add very detailed parameters that the AI will analyse (like keeping hands in the screen, or ensuring no laptop is in the image). This should be in-app in the next month or two.

I hope this cleared up everything.

With love,
Brice

5 Likes