Skip to content
This repository was archived by the owner on Apr 2, 2026. It is now read-only.

recursive-leak: fix leak by using middle-man#494

Closed
Zij-IT wants to merge 11 commits into
zesterer:mainfrom
Zij-IT:fixes/recursive-leak
Closed

recursive-leak: fix leak by using middle-man#494
Zij-IT wants to merge 11 commits into
zesterer:mainfrom
Zij-IT:fixes/recursive-leak

Conversation

@Zij-IT
Copy link
Copy Markdown
Contributor

@Zij-IT Zij-IT commented Jul 28, 2023

This introduces a middle man to prevent both memory leaks (#486) in recursive parsers, and prevents the user from foot-gunning themselves by calling the wrong clone (referring to craftspider's weak_clone) suggestion.

Copy link
Copy Markdown
Owner

@zesterer zesterer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat trick!

Comment thread src/recursive.rs Outdated
Comment on lines +159 to +162
pub fn define<P: Parser<'a, I, O, E> + Clone + MaybeSync + 'a + 'b>(
parser: P,
declaration: Recursive<Indirect<'a, 'b, I, O, E>, Declared>,
) -> Self {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function be on Recursive<..., Declared> and return a Recursive<..., Defined>?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you could do that too. For type inference it doesn’t matter, but it could make more sense logically.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say so, yes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completed with 4572bb5

@CraftSpider
Copy link
Copy Markdown
Collaborator

While this method seems nice, it also makes more complicated recursive patterns harder - you can't have a single context struct on which you both use and define the recursive parser, since the two things are separate types. While you may be able to re-architect code in most cases to avoid this, I'm curious if that's an intended limitation?

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Aug 7, 2023

While this method seems nice, it also makes more complicated recursive patterns harder - you can't have a single context struct on which you both use and define the recursive parser, since the two things are separate types. While you may be able to re-architect code in most cases to avoid this, I'm curious if that's an intended limitation?

No, it's not intended. We could wrap them both in an enum (somewhat like Cow), and have an arm for both Defined and Declared, which would mean that the user would only have to deal with Declared and Defined. The one thing I don't like about the enum is that it opens the door for the user to attempt to use define on a parser that is already been defined, which isn't possible with the way it's currently structured.

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Aug 7, 2023

Given that those checks fail for reasons outside of this PR, how would you like me to proceed?

@zesterer
Copy link
Copy Markdown
Owner

zesterer commented Aug 7, 2023

You should be able to rebase to fix the failing CI now :)

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Aug 17, 2023

Finally got around to it. My bad for the mess that is the commit history in this pull request 😅 One would think that after using it for four+ years that I would be able to keep it somewhat clean

@zesterer
Copy link
Copy Markdown
Owner

I totally forgot about this PR, I'll take a final look at it tomorrow.

@zesterer
Copy link
Copy Markdown
Owner

Was the issue mentioned by CraftSpider above resolved?

@zesterer zesterer mentioned this pull request Oct 10, 2023
35 tasks
@faassen
Copy link
Copy Markdown

faassen commented Oct 10, 2023

Is there any way I could help on this issue?

@zesterer
Copy link
Copy Markdown
Owner

zesterer commented Oct 10, 2023

I think it probably just needs a rebase and then it's ready to go. That said, I am a bit concerned about the issue @CraftSpider mentioned above: it will probably break the way some people are using Recursive. That said, arguably such uses are just wrong and should be discouraged? Interested in thoughts on this.

@CraftSpider
Copy link
Copy Markdown
Collaborator

I think that you can probably change it to always store one or the other in the state, but I haven't really gotten around to trying it out with this change. I'm not too worried personally.

@zesterer
Copy link
Copy Markdown
Owner

Right, in that case it's probably fine. I would appreciate the commit history being cleaned up a bit here before merging too (otherwise I think I'll squash).

@faassen
Copy link
Copy Markdown

faassen commented Oct 10, 2023

I need to look at my use of recursive tomorrow to see whether whether I can understand the limitation and whether it hurts my usage, I don't get it yet without staring at the code.

@zesterer
Copy link
Copy Markdown
Owner

I believe it's specifically for the case of Recursive::declare/Recursive::define. Doing my_recursive.define(...) previously had the signature fn(self: &mut Recursive, impl Parser) (i.e: it was an in-place mutation) but now has the signature fn(self: Recursive<A>, impl Parser) -> Recursive<B> (i.e: it consumes self and transforms it to a new type, thereby turning it into a strong reference, as is the case internally with the recursive function).

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Jan 28, 2024

So, my apologies for dropping off of the face of this earth for so long. If I get this rebased in the next coming days, would it be good to be merged, or is there anything else that it is needing?

@zesterer
Copy link
Copy Markdown
Owner

zesterer commented Jan 29, 2024

No worries, there's no time constraint here :) I think this change is probably the way to go, on reflection. It's unfortunate that it breaks certain uses of Recursive::declare/define but those modes of use are inherently leaky in a way that seems like it would require generalised garbage collection to disentangle. So, yes, I'd be happy to see this merged.

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Jan 30, 2024

So... I'll be honest. I tried to rebase this thing for like an hour :') Merging took all of two seconds, so obviously I suck at rebasing.

Edit: It seems to fail with an issue that is unrelated to this PR. Should I fix it up anyway and push the change?

@zesterer
Copy link
Copy Markdown
Owner

zesterer commented Jan 31, 2024

Maybe it would be better to just take the diff from main, then go back to main and apply that diff on top, creating a new commit entirely? Then you can force-push that to this branch.

Don't worry about unrelated CI stuff, I'll address that elsewhere :)

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Jan 31, 2024

Sounds good! I’ll take care of that after I get outta work!

@Flygrounder
Copy link
Copy Markdown

@zesterer Could we have a new alpha release once this gets merged?

@Flygrounder
Copy link
Copy Markdown

Maybe it would be better to just take the diff from main, then go back to main and apply that diff on top, creating a new commit entirely? Then you can force-push that to this branch.

Don't worry about unrelated CI stuff, I'll address that elsewhere :)

Wanted to ask, isn't it the same as using "Squash and merge" option when merging on GitHub? It seems like @Zij-IT got busy and it's a bit sad that this PR is essentially ready, but not merged. We want to use this library in production in our company, but this memory leak is a problem in our use case (long-running process that can parse thousands of inputs).

@zesterer
Copy link
Copy Markdown
Owner

I believe the behaviour is the same, yes: with the exception that the commit message will be a concatenation of every intermediate commit, which is undesirable. I would do it myself, but I don't want to be taking credit for @Zij-IT's work in the commit log.

@Zij-IT Zij-IT force-pushed the fixes/recursive-leak branch from b1a9004 to 32c97f9 Compare February 29, 2024 18:57
@Zij-IT Zij-IT force-pushed the fixes/recursive-leak branch from 32c97f9 to 30a48ca Compare February 29, 2024 19:13
@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Feb 29, 2024

My apologies to everyone that has been waiting for me to make the change. I just did what was recommended by @zesterer, so this should be good to roll.

The test that fails fails because what was a runtime error became a compiler error in the tests case. There isn't AFAIK a way to test that something fails to compile, so I can't write an equivalent test. As such, I just deleted the test :/

@wackbyte
Copy link
Copy Markdown
Contributor

wackbyte commented Feb 29, 2024

You can test something that fails to compile by writing a doctest:

/// ```compile_fail
/// // ...
/// ```
fn test_name() {}

A specific error code can be provided, separated by a comma: compile_fail,E9999.

There might be some more setup required, but this is a method I've used in the past.

This has the drawback of not being able to specify an expected error (other than its code, and not all errors have codes). A better solution would likely use trybuild.

(rustdoc documentation)

Comment thread src/recursive.rs
Comment on lines +157 to +160
pub fn define<P: Parser<'a, I, O, E> + Clone + MaybeSync + 'a + 'b>(
parser: P,
declaration: Self,
) -> Recursive<Indirect<'a, 'b, I, O, E>, Defined> {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for not having the declaration argument be self?

Copy link
Copy Markdown
Contributor Author

@Zij-IT Zij-IT Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly nothing in particular. I'll swap it over to having it be a method!

Edit: Actually, there is a small factor for why. If it's a method that consumes self, than you end up having to do something like:

let mut chain = Recursive::declare();

// Note the clone  vvvvvvv
let result = chain.clone().define(just::<_, _, extra::Err<Simple<char>>>('+')
    .then(chain.clone())
    .map(|(c, chain)| Chain::Link(c, Box::new(chain)))
    .or_not()
    .map(|chain| chain.unwrap_or(Chain::End)));

// As opposed to:
let result = Recursive::define(just::<_, _, extra::Err<Simple<char>>>('+')
    .then(chain.clone())
    .map(|(c, chain)| Chain::Link(c, Box::new(chain)))
    .or_not()
    .map(|chain| chain.unwrap_or(Chain::End)), chain);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zesterer Which one do you prefer? I'll do it whichever way you'd prefer. I personally like the symmetry between Recursive::declare and Recursive::define and the second way makes it more obvious that Recursive::define consumes the parser.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think I prefer the first, personally. The extra parameter way off at the end of the definition feels very isolated and random if you're not already sure what it is. Also, putting it at the front makes clear what the target of the definition is.

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented May 4, 2024

I'm pretty sure GitHub decided the 7 commits before mine are different because the hashes are different... I try and do the right thing with a rebase I manage to create clutter :D

Anyhow, this should be the final commit for the PR to be merged, right?

@zesterer
Copy link
Copy Markdown
Owner

I'll try to find time to take another pass of this today. Hopefully, yep!

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Oct 25, 2025

So, I don't believe this solution will work in the way that I was hoping. It was intended to help situations like seen here.

However, due to the issue described jkylling in #664. I found a test case in the wild by Sail's SQL Parser. If you apply this patch to the lastest version of Chumsky, and run the tests for sail-sql-analyzer, it will fail due to a parser "not being defined before it's used". The definition of the failing parser is here.

I haven't tested if the solution in #664 can be extended to support $N$ recursive parsers as seen in the sail parser without. If so, it would be great as it prevents the foot-gun of holding onto a parser for too short a time, which this solution clearly does not.

Here is the patch for the Sail parser: patch


After toying around with ideas for a bit, I believe similarly to jkylling that it is probably worth making Recursive::declared private, and offering wrapped ways to work around it, so that the user doesn't footgun themselves. I have a similar take to the solution offered in #664, but this solution involves another trait.

Here is an example of how this function can be used:

recursive_n(|(expression, query, data_type, table_with_joins)| {
    let expression = Expr::parser(
        (expression.clone(), query.clone(), data_type.clone()),
        options,
    );

    let query = Query::parser(
        (query.clone(), expression.clone(), table_with_joins.clone()),
        options,
    );

    let data_type = DataType::parser(data_type.clone(), options);

    let table_with_joins = TableWithJoins::parser(
        (query.clone(), expression.clone(), table_with_joins.clone()),
        options,
    );

    (expression, query, data_type, table_with_joins)
})

This solution allows the user to define many queries, with the leftmost-parser of the result of recursive_n being used as the parser, and the rest being considered the dependencies. It would also be possible to define a intermediate struct like Recurse { parser: expression, dependencies: (query, data_type, table_with_joins) } so that it is explicit which parser is actually going to be run as the "primary".

I was able to using sail-sql-analyzer as a dependency in a crate and call parse_expression which constructs the above parser. Miri doesn't report a leak, and I can't think of how one would be done. Here are both patches:

Sail - fix.patch

Chumsky - recursive_fix.patch

@zesterer - What are your thoughts?

@zesterer
Copy link
Copy Markdown
Owner

After toying around with ideas for a bit, I believe similarly to jkylling that it is probably worth making Recursive::declared private, and offering wrapped ways to work around it, so that the user doesn't footgun themselves. I have a similar take to the solution offered in #664, but this solution involves another trait.

I agree with your assessment. Unfortunately I don't have the background in formal methods to prove it, but I'm beginning to suspect that solving this problem in the general case requires cycle-aware garbage collection, which is not something we have the power to apply here.

Here is an example of how this function can be used:

I do quite like this solution! My initial instinct was to use a macro, but perhaps this is entirely achievable with the type system and a generic function alone. I'd love to see what this ends up looking like.

For the sake of completeness, my macro idea looked roughly like:

parsers! {
    atom = ... ;

    // Parsers are internally declared as `Recursive`s and can reference on-another
    expr: Expr = atom.or(block);

    block = expr
        .separated_by(just(';'))
        .delimited_by(just('{'), just('}'));
}

I suspect that your generic function design is likely to be more robust though.

Is this something you were interested in pushing to the PR stage?

@Zij-IT
Copy link
Copy Markdown
Contributor Author

Zij-IT commented Oct 27, 2025

My initial instinct was to use a macro, but perhaps this is entirely achievable with the type system and a generic function alone. I'd love to see what this ends up looking like.

I had thought about a macro, but then Recursive::declare and Recursive::define would have to remain in the public API, and I don't know if that is something that is necessarily wanted. I suppose it would be possible to #[doc(hidden)] both of them. I abandoned the idea though because I had no interest in breaking my brain trying to write such a macro 😆

Is this something you were interested in pushing to the PR stage?

Yeah, I honestly don't know why I didn't just write a PR for it instead of uploading the patches 😅

Given that this solution doesn't solve the leak without introducing another problem, I am going to go ahead and close this in favor of PR #894.

@Zij-IT Zij-IT closed this Oct 27, 2025
@zesterer
Copy link
Copy Markdown
Owner

I had thought about a macro, but then Recursive::declare and Recursive::define would have to remain in the public API, and I don't know if that is something that is necessarily wanted. I suppose it would be possible to #[doc(hidden)] both of them.

A while back I polled some others in the Rust community, and the consensus seems to be that most people consider doc(hidden) to be sufficient to pull an item out of the semver & soundness guarantees of a crate, so that's the policy I'm adopting from now on (it makes my life easier!).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants