AGI delayed? My recent experience with Forfeit.

As I have mentioned in another thread (Amazing Marvin Has Pledges Now), I have enjoyed using Forfeit (https://www.forfeit.app/) lately.

It’s so effective that I haven’t derailed, eh, forfeited, yet. However, as we know, forfeiting isn’t failing, so I created an ambitious Forfeit (that’s what they call goals): take a camera-only photo of “My left hand holding three fingers up.”

Here is the photo I have submitted as evidence:

[Image: my_left_hand_grey]

Well, my intention was clearly to nail it by failing it, and also I just wanted to pay Forfeit some money because they currently don’t charge a subscription fee.

And you can already see where this is going; to my surprise the evidence was accepted:

[Image]

Edit: Josh, co-founder of Forfeit, was super responsive and told me that all photos are currently validated by humans. He said that stronger language, like “it must be the right hand,” would have been required to elicit a request for clarification from the verifier.

I then put the same image into GPT and asked it to judge whether the image fulfills the criteria:

[Image]

(I think I am using the unpaid version of GPT on my phone, so maybe this would yield correct results with GPT4o. In before everyone is claiming that it is judging correctly for them.)

Edit: GPT4o can in fact correctly tell that this is not a left hand per comment below by dreev.

To be honest, this has decreased my enthusiasm for Forfeit a little bit. I recognize that this is not a big deal but it would have felt much better if the evidence had been denied.

Also, I was a bit surprised because I thought vision was pretty much solved. You would think there is enough training data labelled with things like “pen held in a right hand” that this would have a high chance of being judged correctly.

GPT4o: “False. This is a picture of a right hand holding three fingers up.”

But I think Forfeit was correct to deem that acceptable evidence. You’re submitting a photo and averring you did the thing. That’s the heart of it. It’s like signing a document. It doesn’t matter that physical signatures aren’t technically secure or that you reproduced your usual signature. What matters is you taking a specific ritualized action that amounts to “I promise this is legit” or “I officially agree to this”.

I should have created a Fatebook prediction that GPT4o would label it correctly.

Yes, that’s pretty much the reasoning Josh has just given me. He also told me that all photos are currently validated manually by humans. Stronger wording, like “it must be the right hand,” would have been required.

(I will also add an edit to the original post.)

I have thought about this some more and have come to the conclusion that I disagree.

If it’s just about the ritualized action, I might as well use Beeminder, where I say “I promise this is legit” by submitting a data point.

The value proposition of Forfeit (as I understand it) is that I am held accountable for a specific action, the one that I specify, and not some generic action like “uploading a photo” or “providing a signature.”

If my goal is to make three PowerPoint slides about Bayern Munich, and I happen to find a presentation about Barcelona on the internet and upload that as evidence, should it be accepted because I “followed the ritual of uploading a presentation about soccer clubs with the initials FCB”? Obviously not. Same thing with my hand. Maybe I have had an issue with my left hand, and my goal was to do physical therapy super diligently to be able to use my three fingers again. Now, the reason Forfeit motivates me is that someone (be it a human or an AI) verifies that I have achieved that specific goal. Since my right hand was working fine all along, it doesn’t help me that it has been accepted. When in doubt, the verifier (be it a human or an AI) should have asked for clarification.

If everything in the previous paragraph is wrong, then Forfeit is just Beeminder with a nicer app and the ability to add data points as photos or time lapses. But I think, and I hope Josh would agree, that it is more than that. It is the human (or AI) verification that is also part of the value proposition. (Again, all of that is my opinion and I might have fully misunderstood the purpose of Forfeit.)

Not gonna lie, I personally can’t tell if that’s the right or left hand :rofl:

since, if you were taking a picture of your left hand in the mirror it would also look like that…

I think you should test this again with a clearer and more obvious distinction, like putting five fingers up instead of three, and see what happens on Forfeit.

I agree that’s what’s valuable about these systems (BM and Forfeit). To some extent they overlap, because BM will auto-check/derail things with some of its integrations (like RescueTime) and Forfeit also lets you submit what is basically “I promise I did it” self-verification.

Forfeit was using the ChatGPT vision API to verify photo evidence by comparing it to the description (then having humans review the ones it failed or couldn’t tell), and this seemed to work pretty well. It was on the app for like a week. Not sure what happened to it, but it’s my understanding they intend to re-enable/implement it at some point.
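
To make that flow concrete, here is a rough sketch of the triage as I understand it. This is not Forfeit’s actual code: the `Verdict` values and the `route` function are names I made up for illustration, and the automated check itself (the vision API call) is left out entirely.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes of the automated photo check."""
    MATCH = "match"        # the model thinks the photo fits the description
    MISMATCH = "mismatch"  # the model thinks it does not
    UNSURE = "unsure"      # the model could not tell


def route(auto_verdict: Verdict) -> str:
    """Accept on an automated match; send everything else to a human reviewer."""
    if auto_verdict is Verdict.MATCH:
        return "accepted"
    # Failed or unclear checks are not charged automatically; a person looks at them.
    return "human_review"


for verdict in Verdict:
    print(verdict.value, "->", route(verdict))
```

The point of the design is that the automated step can only approve; anything it fails or can’t judge still ends up in front of a human.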

If you’re currently meeting all your Forfeits and it’s working well for you, I’d personally prob avoid testing things like this in ways that might decrease your confidence in them or make it work less well for you. If it isn’t broke (and you’re not failing), don’t fix it.

But I did have a similar (but unintentional) experience, in that once I accidentally submitted the wrong evidence for a forfeit, and it was approved. I also found it troubling, because I agree that ideally it’d be catching things like that. I also messaged Josh and he said he’d reprimand the guy reviewing the evidence.

Note they also have caught stuff like this for me too – once I submitted the wrong evidence and they asked about it and had me resubmit. Another time it was the correct evidence, but the guy was unclear about a part of it, and so asked for clarification.

Also this is a good point.

I disagree.

Forfeit’s whole promise is confirmed accountability. Whether human-verified or AI-assisted plus human-verified, the end result they are promising is a reliable accountability system that you can’t weasel around.

It makes a huge amount of sense to stress test your systems when you’re doing well, to plan for the future version of yourself that might try to weasel.

This aligns with the akrasia principles Beeminder was founded on!

In my opinion, the illusion of accountability can only last for so long if it’s only an illusion. I don’t want a temporary system that’s working for me. I want something that is future-proofed and can support me for the next 5 to 10 years, and it makes sense to stress test the system periodically to see if it’s robust.

In my opinion, we should strive to create anti-fragile commitment device systems that actually get stronger whenever there are derails…

To me, my system is anti-fragile because whenever I derail, it forces me to come up with a few ideas about how to prevent derailing again, and I’m directly incentivized to implement those. A big part of this for me is using Boss as a Service, since it gives me someone to bounce ideas off of when I derail and prevents me from cheating.

Proper use of Forfeit should ideally create a system similar to Boss as a Service + Beeminder. If they truly aren’t verifying proofs as dutifully as possible, then that’s a fundamental issue with their platform, and hopefully someone else will enter the market to fix that.

At the same time, I understand where you’re coming from: the prospect of losing confidence in a system that’s working so well for us can be scary. I admit I’m scared to even test Forfeit right now unless I know of a qualified competitor that I could jump to if the tests failed… Maybe that shows that we need more quality entrants into the exact market that Forfeit is serving.

If anyone knows of a Forfeit clone that is more reliable, let us know.

That’s funny. You definitely have a point there!

I guess a similar argument could be made for my soccer club example. For someone not familiar with soccer clubs, it might be hard to judge whether my slides are about Barcelona or Munich. I would still argue that a presentation about, say, dolphins should not be accepted (and apologies to Dreev if I am misrepresenting his argument here, but I think that’s effectively what he was saying, that a PowerPoint about dolphins would also be okay because of the ritualistic nature or something).

There will always be ambiguity, I guess, but I expect the evidence to be verified to a reasonable degree. Otherwise, I just don’t see the point.

That’s a good suggestion. I might try that at some point, although my Forfeit account probably has an exclamation mark attached to it by now, so I would be really disappointed if they missed that at this point.

Probably it was too strict, if I had to guess?

I got another message from Josh. He explained that out of 100 evidence submissions that do not perfectly match the description, 80 should be accepted (according to the users) and 20 should really be denied. They don’t want to piss the 80 off, so they lean towards being lenient.
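
Spelled out as arithmetic, using Josh’s 80/20 split. The two all-or-nothing policies and the names `wrong_calls`, `LEGIT`, and `NOT_LEGIT` are my own simplification for illustration, not something Josh described:

```python
# Josh's numbers: of 100 submissions that don't perfectly match the description,
# 80 are legit according to the users and 20 should really be denied.
LEGIT, NOT_LEGIT = 80, 20


def wrong_calls(policy: str) -> dict:
    """Count the mistakes made by an all-or-nothing policy on imperfect matches."""
    if policy == "strict":   # deny every imperfect match
        return {"wrongly_denied": LEGIT, "wrongly_accepted": 0}
    if policy == "lenient":  # accept every imperfect match
        return {"wrongly_denied": 0, "wrongly_accepted": NOT_LEGIT}
    raise ValueError(f"unknown policy: {policy}")


print(wrong_calls("strict"))   # {'wrongly_denied': 80, 'wrongly_accepted': 0}
print(wrong_calls("lenient"))  # {'wrongly_denied': 0, 'wrongly_accepted': 20}
```

So being lenient makes a quarter as many wrong calls as being strict (20 versus 80), but every one of those 20 is a weasel getting away with it.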

Maybe an automated legitimacy check like what Beeminder does could solve that problem for them.

Thanks for sharing. I just find that a little discouraging. Beeminder is working great for me for most things, but for tasks where I know myself to be a little weasel, Forfeit would be a great complement, provided I can trust them to judge the evidence reasonably well.

Yes, for me I kind of have crossed that threshold now. I am going to do the three finger/five finger test, and then I have another idea for a timelapse-based task that I can flunk, and then I will reassess from there.

Update: I cheated twice on my Meditation Forfeit.

The Forfeit requires me to submit a 20-minute timelapse of “Myself Meditating” using Forfeit’s built-in timelapse recorder (that basically doesn’t allow you to navigate away from the app while filming?).

The first cheat was fairly vague: it had me sitting in bed on my laptop, reading and typing, with the laptop out of view. This was approved by Forfeit.

The second cheat was less vague. I was sitting in bed on my laptop again, reading and typing, with a clear view of the laptop keyboard in the shot. This was also approved by Forfeit.

This is disappointing for me, as I at least expected them to message me for clarification on the first cheat and charge me for the second. I think the issue is that their system should err more on the side of asking for clarification…

I haven’t talked to the founders yet, but I think I may need to adjust my task to say “myself meditating with full view of my hands not doing anything.”

I wonder if the system worked in part because of a placebo effect: the premise of “we are checking and we mean it,” and the corresponding “oh, I had better, since they will know otherwise.” Now that has worn off somewhat, because at first glance it appears not to be what you expected: smoke, mirrors, and a curtain.

What if the system were somehow calling out each case correctly, without any false negatives or false positives? How might that change how you use it? How might that change how you meditate?

Well, for this week I changed my goal description to “myself meditating with full view of my hands”.

This change has motivated me a lot more to not grab my laptop while meditating. However, I’m sure I will slip up and cheat sometime soon, so it will be interesting to see if they mark it as a failure just because of this more detailed description.

Let us know how that works.

I’ve previously used “I must stay on my cushion and not interact with anything,” which is pretty unambiguous to me, but I don’t have it in my heart anymore to actually stress test it after my recent experiences and the feedback in this thread.

I really seem to be falling for the typical mind fallacy here. Why would people who go so far as to use an app whose only purpose is to keep them accountable not want to be strictly challenged on their evidence?

I think most people would want to be challenged on the evidence. It goes both ways though: I’m curious why people would submit half-assed evidence. Personally, I’ve never submitted evidence that I knew would fail my evidence standards in hopes that Forfeit wouldn’t notice.

I agree that if that were common, the issue you’re talking about would be important, but I’m not sure it is. In this thread in particular, it sounds like most of the incorrect evidence was submitted to test them.

BTW I personally submit evidence (a timelapse of me doing 12 pullups) that is technically impossible for Forfeit to verify. They can see how much time is in the video + a few stills of me hanging from a bar. But they’d have no idea if I did 12, 11 or was just hanging there for a while. Still it’s enough to get me to do it, and I’ve never cheated. I think it’s a combination of not wanting to open up that can of worms + I actually do have to at least hang on the bar for proof, and that’s enough to get me started (like the old Scott Adams anecdote of how he didn’t make himself go to the gym, but he at least had to put his gym shoes on, and that usually was enough to get him to go).

Interestingly, I respect these evidence checks even if they’re not perfect. I find it a lot more agreeable than using something like (not to pick on this) TaskRatchet, where – last time I checked – you just list your tasks, self-report whether or not you did them, and pay if not. Forfeit’s (imperfect) checks at least seem like they’re doing something to earn the payment.

Consider that my original intention was to “donate” $10 to Forfeit because I wanted to say thank you for the value I am getting. I don’t want to say that I had zero intention to test the service, but mostly I thought it was a foregone conclusion that they would simply collect their money. Though, I admit that three fingers versus five fingers would have been less ambiguous than right hand versus left hand.

Checks and balances are a necessary part of our society. If a software developer submits a pull request at work, they typically don’t write bad code to challenge the reviewer. Ideally, they write great code and accepting the pull request is a formality. However, if they start to get away with subpar code, they might be tempted to take shortcuts or become resentful of the reviewer. After all, if the reviewer doesn’t catch the obvious sloppiness, how can they be trusted to detect serious blunders (and what are they getting paid for)?

That’s how I feel about Forfeit at this point. I don’t trust them to review the evidence properly anymore, and it tempts me to take shortcuts. Like getting up ten minutes before my meditation session is supposed to end because I know they will call it good. I would have appreciated it if the founder had just said “got it, we’ll judge you more strictly from now on.”

For meditation, I really like the idea of somebody making sure that I am seated on the cushion for the whole hour. I can still do that by myself but I might just be a bit tempted to do ten minutes of walking meditation in between “to loosen up.” Of course, I can write that into my Beeminder fine print and enforce it that way, but having a bit of added accountability via the time-lapse just feels nice.