
Streaming example #276

Open · 01mf02 opened this issue Nov 29, 2019 · 9 comments

@01mf02 commented Nov 29, 2019

I am trying to create a program that reads files (potentially not fitting into memory) and folds over the items found in them. An item can span several lines.
My problem is that I have found no example for combine that I could adapt to my use case.
(examples/async.rs seems to go in the right direction, but it is too complicated for me.)

So consider this simplified assignment:
Assume we want to implement wc -l, with the little twist that the program should fail when it encounters a character that is neither alphabetic nor a newline.
Effectively, this program should recognise the grammar ([a-zA-Z]*\n)* and then output the number of newlines. The program should be able to handle files that do not fit into RAM.

A naive attempt at handling this assignment is:

extern crate combine;
use combine::parser::char::newline;
use combine::parser::range::take_while;
use combine::{sep_by, Parser};

fn main() {
  // A line is a run of ASCII-alphabetic characters; lines are separated by newlines.
  let word = take_while(|c: char| c.is_ascii_alphabetic());
  let mut lines = sep_by(word, newline());

  let result: Result<(Vec<&str>, &str), _> = lines.parse("a\naa\naaa");
  println!("result: {:?}", result);

  let n_lines = result.unwrap().0.iter().fold(0, |acc, x| acc + 1);
  println!("number of lines: {}", n_lines);
}

However, there are two problems here:

  • We do not read from a file, but from a fixed string.
  • We need to hold the entire input string in memory.

I would love a solution where I could just specify a line parser, then run fold(0, |acc, word| acc + 1) over repeated parses of line on some file to obtain the number of lines.
The file should be read lazily, e.g. by using BufRead.
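
Something like the following is what I have in mind. This is an untested sketch against the stream adapters that exist in combine 3.x (ReadStream, State and buffered::BufferedStream; I understand these get reshuffled for 4.0), so the exact nesting may be off. The idea: give the byte stream one token of lookahead, then repeatedly run a one-line parser on the remaining stream, threading the accumulator through by hand:

use std::fs::File;
use std::io::BufReader;

use combine::parser::byte::byte;
use combine::stream::buffered::BufferedStream;
use combine::stream::state::State;
use combine::stream::ReadStream;
use combine::{eof, satisfy, skip_many, Parser};

fn main() {
  let file = File::open("test8GB").unwrap();
  // ReadStream pulls bytes one at a time, so keep the BufReader;
  // one token of lookahead is enough for this grammar.
  let mut stream =
    BufferedStream::new(State::new(ReadStream::new(BufReader::new(file))), 1);

  let mut n_lines: u64 = 0;
  loop {
    // `Some(())` = parsed one more `[a-zA-Z]*\n` item, `None` = clean end of input.
    // Note that this grammar requires a trailing newline on the last line.
    let line = (skip_many(satisfy(|b: u8| b.is_ascii_alphabetic())), byte(b'\n'))
      .map(|_| Some(()));
    let mut parser = eof().map(|_| None).or(line);
    match parser.parse(stream) {
      Ok((Some(()), rest)) => {
        // The fold step: update the accumulator and keep the remaining stream.
        n_lines += 1;
        stream = rest;
      }
      Ok((None, _)) => break,
      Err(_) => panic!("Encountered non-alphabetic character on line {}", n_lines + 1),
    }
  }
  println!("{}", n_lines);
}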

@01mf02 (Author) commented Nov 29, 2019

A plain Rust program (without combine) that does the task I described:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
  let file = File::open("test8GB").unwrap();
  let reader = BufReader::new(file);
  let n_lines = reader.lines().fold(0, |n, maybe_line| match maybe_line {
    Ok(line) => {
      // Enforce the grammar: every byte of a line must be ASCII-alphabetic.
      if !line.bytes().all(|b| b.is_ascii_alphabetic()) {
        panic!("Encountered non-alphabetic character on line {}", n);
      }
      n + 1
    }
    Err(_) => panic!("Could not read line"),
  });
  println!("{}", n_lines);
}

@01mf02 (Author) commented Nov 29, 2019

To generate test data, I wrote the following program:

fn main() {
  // Print "a", "aa", "aaa", ... forever; truncate the output with `head`.
  let mut line = String::new();
  loop {
    line.push('a');
    println!("{}", line);
  }
}

You can run it with

cargo run | head -n 131072 > test8GB

to generate a roughly 8 GB test file (the 131072 line lengths sum to 65536 × 131073 + 131072 ≈ 8.6 × 10⁹ bytes, newlines included), consisting of:

a
aa
aaa
[...]

@01mf02 (Author) commented Nov 29, 2019

By the way, I know that fold is overkill for counting lines (in particular because |acc, x| acc + 1 does not use x). However, in my actual application, I want to perform an operation that depends on the outcome of all previously parsed items as well as the current item, so a fold is precisely what I need; a toy example follows below.
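
To illustrate with a toy example (std only, not my actual application): counting lines while checking that every line is strictly longer than its predecessor is such a fold, because each step depends on state accumulated from all previous items:

use std::io::{BufRead, BufReader};

fn main() {
  let reader = BufReader::new("a\naa\naaa\n".as_bytes());
  let (n_lines, _) = reader.lines().fold((0u64, 0usize), |(n, prev_len), maybe_line| {
    let line = maybe_line.expect("Could not read line");
    // The step uses `prev_len`, which was accumulated from all previous items.
    if line.len() <= prev_len {
      panic!("Line {} is not longer than its predecessor", n + 1);
    }
    (n + 1, line.len())
  });
  println!("{}", n_lines);
}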

@Marwes (Owner) commented Nov 30, 2019

The way I have done this in redis-rs is here: https://github.com/mitsuhiko/redis-rs/blob/75bfe24f7f34faad2460343699bd65bdcfaabaf0/src/parser.rs#L213-L282 . I always meant to port that code into a more generalized form in combine, but I kept forgetting.

Should have something later today or tomorrow.

@Marwes (Owner) commented Dec 2, 2019

It is on master now; it will be released in 4.0 in a few days.

@01mf02 (Author) commented Dec 2, 2019

Hello @Marwes! Thank you for your work on generalizing the redis-rs code.
I am, however, still a bit puzzled about how to implement this functionality concretely in my own program.
For example, is it a good idea to copy-paste the impl_decoder macro from tests/async.rs into my own code? (And thus maintain it, knowing for example that the Decoder in Tokio 0.1 seems to have disappeared in Tokio 0.2?) My experience says that it's never a good idea to copy-paste. ;)

I just researched a bit to see how to adapt the examples in tests/async.rs to my needs, i.e. reading from a file instead of from a string. I will share a small part of this experience: it seems I have to replace the Cursor in run_decoder by something else to read from a file. To find this something else, I read the documentation for PartialAsyncRead in partial_io, which in turn brought me to the documentation for AsyncRead in Tokio. I'm feeling a bit "Lost in Translation" at this point.
Rereading the documentation for Cursor, I now understand that I could simply replace the Cursor by a BufReader.

Long story short: for a complete Rust newcomer (but experienced functional programmer) like me, it is not obvious how to put things together, or what the best practice is for a simple program that folds over a sequence of items parsed from a file. A small best-practice example would help me tremendously, and probably quite a few other people out there. I believe my use case appears often enough to justify a small example in examples/.
(I also simply did not see tests/async.rs before, because I only looked into examples/.)

I hope that my feedback is useful to you to improve the experience of users of your library. :)

@Marwes (Owner) commented Dec 2, 2019

Sorry, I should have linked explicitly to what I ported over. With the added decode_buf_read! macro, you just construct the added Decoder struct with a std::io::BufRead and call the macro with it and a parser, and it will work. No tokio needed (but there is also a variant that reads from a tokio::io::AsyncBufRead: https://github.com/Marwes/combine/pull/277/files#diff-84a79536e77b4270b1c96c11abe5c0afR1480-R1563).

@01mf02 (Author) commented Dec 4, 2019

Thanks, that makes more sense!

@Marwes (Owner) commented Dec 25, 2019

Figured out a way to relax the BufRead requirement and make the decoding more efficient by handling all buffering internally. The decode* macros therefore lose the _buf_read suffix, and you will no longer need a BufReader wrapper when 4.0 is released.
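
A rough, untested sketch of the intended usage (the Decoder constructor and the exact decode! invocation below are assumptions rather than a confirmed example; consult the stream::Decoder documentation in the 4.0 release for the authoritative form):

use std::fs::File;

use combine::parser::byte::newline;
use combine::parser::range::take_while;
use combine::stream::Decoder;
use combine::{decode, eof, Parser};

fn main() {
  let mut file = File::open("test8GB").unwrap();
  // The decoder handles all buffering internally, so no BufReader is needed.
  let mut decoder = Decoder::new();
  let mut n_lines: u64 = 0;
  loop {
    // Decode one `[a-zA-Z]*\n` item per call; `eof` signals a clean end of input.
    let word = take_while(|b: u8| b.is_ascii_alphabetic());
    let line = (word, newline()).map(|_| ());
    let parsed = decode!(
      decoder,
      file,
      eof().map(|_| None).or(line.map(Some)),
      |input, _position| combine::easy::Stream::from(input),
    );
    match parsed {
      Ok(Some(())) => n_lines += 1,
      Ok(None) => break,
      // Either an I/O error or a byte that does not match the grammar.
      Err(_) => panic!("Encountered non-alphabetic character on line {}", n_lines + 1),
    }
  }
  println!("{}", n_lines);
}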
