
Streaming example #276

Open · 01mf02 opened this issue Nov 29, 2019 · 9 comments

@01mf02 commented Nov 29, 2019

I am trying to create a program that reads files (potentially not fitting into memory) and folds over the items found in them. An item can span several lines.
My problem is that I have found no example for combine that I could adapt to my use case.
(examples/async.rs seems to go in the right direction, but it is too complicated for me.)

So consider this simplified assignment:
Assume we want to implement wc -l, with the little twist that the program should fail when it encounters a character that is neither alphabetic nor a newline.
Effectively, this program should recognise the grammar ([a-zA-Z]*\n)* and then output the number of newlines. The program should be able to handle files that do not fit into RAM.

A naive attempt at handling this assignment is:

extern crate combine;
use combine::parser::char::newline;
use combine::parser::range::take_while;
use combine::{sep_by, Parser};

fn main() {
  // A line is a run of ASCII-alphabetic characters; lines are separated by newlines.
  let word = take_while(|c: char| c.is_ascii_alphabetic());
  let mut lines = sep_by(word, newline());

  let result: Result<(Vec<&str>, &str), _> = lines.parse("a\naa\naaa");
  println!("result: {:?}", result);

  let n_lines = result.unwrap().0.iter().fold(0, |acc, x| acc + 1);
  println!("number of lines: {}", n_lines);
}

However, there are two problems here:

  • We do not read from a file, but from a fixed string.
  • We need to hold the entire input string in memory.

I would love a solution where I could just specify a line parser, then run fold(0, |acc, word| acc + 1) over repeated parses of line on some file to obtain the number of lines.
The file should be read lazily, e.g. by using BufRead.
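
Something like the following is what I have in mind. This is an untested sketch against the stream adapters that exist in combine 3.x (ReadStream, State and buffered::BufferedStream; I understand these get reshuffled for 4.0), so the exact nesting may be off. The idea: give the byte stream one token of lookahead, then repeatedly run a one-line parser on the remaining stream, threading the accumulator through by hand:

use std::fs::File;
use std::io::BufReader;

use combine::parser::byte::byte;
use combine::stream::buffered::BufferedStream;
use combine::stream::state::State;
use combine::stream::ReadStream;
use combine::{eof, satisfy, skip_many, Parser};

fn main() {
  let file = File::open("test8GB").unwrap();
  // ReadStream pulls bytes one at a time, so keep the BufReader;
  // one token of lookahead is enough for this grammar.
  let mut stream =
    BufferedStream::new(State::new(ReadStream::new(BufReader::new(file))), 1);

  let mut n_lines: u64 = 0;
  loop {
    // `Some(())` = parsed one more `[a-zA-Z]*\n` item, `None` = clean end of input.
    // Note that this grammar requires a trailing newline on the last line.
    let line = (skip_many(satisfy(|b: u8| b.is_ascii_alphabetic())), byte(b'\n'))
      .map(|_| Some(()));
    let mut parser = eof().map(|_| None).or(line);
    match parser.parse(stream) {
      Ok((Some(()), rest)) => {
        // The fold step: update the accumulator and keep the remaining stream.
        n_lines += 1;
        stream = rest;
      }
      Ok((None, _)) => break,
      Err(_) => panic!("Encountered non-alphabetic character on line {}", n_lines + 1),
    }
  }
  println!("{}", n_lines);
}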

@01mf02 (Author) commented Nov 29, 2019

A plain Rust program (without combine) that does the task I described:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
  let file = File::open("test8GB").unwrap();
  let reader = BufReader::new(file);
  let n_lines = reader.lines().fold(0, |n, maybe_line| match maybe_line {
    Ok(line) => {
      // Enforce the grammar: every byte of a line must be ASCII-alphabetic.
      if !line.bytes().all(|b| b.is_ascii_alphabetic()) {
        panic!("Encountered non-alphabetic character on line {}", n);
      }
      n + 1
    }
    Err(_) => panic!("Could not read line"),
  });
  println!("{}", n_lines);
}

@01mf02 (Author) commented Nov 29, 2019

To generate test data, I wrote the following program:

fn main() {
  // Print "a", "aa", "aaa", ... forever; truncate the output with `head`.
  let mut line = String::new();
  loop {
    line.push('a');
    println!("{}", line);
  }
}

You can run it with

cargo run | head -n 131072 > test8GB

to generate a roughly 8 GB test file (the 131072 line lengths sum to 65536 × 131073 + 131072 ≈ 8.6 × 10⁹ bytes, newlines included), consisting of:

a
aa
aaa
[...]

@01mf02 (Author) commented Nov 29, 2019

By the way, I know that fold is overkill for counting lines (in particular because |acc, x| acc + 1 does not use x). However, in my actual application, I want to perform an operation that depends on the outcome of all previously parsed items as well as the current item, so a fold is precisely what I need; a toy example follows below.
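
To illustrate with a toy example (std only, not my actual application): counting lines while checking that every line is strictly longer than its predecessor is such a fold, because each step depends on state accumulated from all previous items:

use std::io::{BufRead, BufReader};

fn main() {
  let reader = BufReader::new("a\naa\naaa\n".as_bytes());
  let (n_lines, _) = reader.lines().fold((0u64, 0usize), |(n, prev_len), maybe_line| {
    let line = maybe_line.expect("Could not read line");
    // The step uses `prev_len`, which was accumulated from all previous items.
    if line.len() <= prev_len {
      panic!("Line {} is not longer than its predecessor", n + 1);
    }
    (n + 1, line.len())
  });
  println!("{}", n_lines);
}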

@Marwes (Owner) commented Nov 30, 2019

The way I have done this in redis-rs is here: https://github.com/mitsuhiko/redis-rs/blob/75bfe24f7f34faad2460343699bd65bdcfaabaf0/src/parser.rs#L213-L282 . I always meant to port that code into a more generalized form in combine, but I kept forgetting.

Should have something later today or tomorrow.

@Marwes (Owner) commented Dec 2, 2019

It is on master now; it will be released in 4.0 in a few days.

@01mf02 (Author) commented Dec 2, 2019

Hello @Marwes! Thank you for your work on generalizing the redis-rs code.
I am, however, still a bit puzzled about how to implement this functionality concretely in my own program.
For example, is it a good idea to copy-paste the impl_decoder macro from tests/async.rs into my own code? (And thus maintain it, knowing for example that the Decoder in Tokio 0.1 seems to have disappeared in Tokio 0.2?) My experience says that it's never a good idea to copy-paste. ;)

I just researched a bit to see how to adapt the examples in tests/async.rs to my needs, i.e. reading from a file instead of from a string. I will share a small part of this experience: it seems I have to replace the Cursor in run_decoder by something else to read from a file. To find this something else, I read the documentation for PartialAsyncRead in partial_io, which in turn brought me to the documentation for AsyncRead in Tokio. I'm feeling a bit "Lost in Translation" at this point.
Rereading the documentation for Cursor, I now understand that I could simply replace the Cursor by a BufReader.

Long story short: for a complete Rust newcomer (but experienced functional programmer) like me, it is not obvious how to put things together, or what the best practice is for a simple program that folds over a sequence of items parsed from a file. A small best-practice example would help me tremendously, and probably quite a few other people out there. I believe my use case appears often enough to justify a small example in examples/.
(I also simply did not see tests/async.rs before, because I only looked into examples/.)

I hope that my feedback is useful to you to improve the experience of users of your library. :)

@Marwes (Owner) commented Dec 2, 2019

Sorry, I should have linked explicitly to what I ported over. With the added decode_buf_read! macro, you just construct the added Decoder struct with a std::io::BufRead and call the macro with it and a parser, and it will work. No tokio needed (but there is also a variant that reads from a tokio::io::AsyncBufRead: https://github.com/Marwes/combine/pull/277/files#diff-84a79536e77b4270b1c96c11abe5c0afR1480-R1563).

@01mf02 (Author) commented Dec 4, 2019

Thanks, that makes more sense!

@Marwes (Owner) commented Dec 25, 2019

Figured out a way to relax the BufRead requirement and make the decoding more efficient by handling all buffering internally. The decode* macros therefore lose the _buf_read suffix, and you will no longer need a BufReader wrapper when 4.0 is released.
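
A rough, untested sketch of the intended usage (the Decoder constructor and the exact decode! invocation below are assumptions rather than a confirmed example; consult the stream::Decoder documentation in the 4.0 release for the authoritative form):

use std::fs::File;

use combine::parser::byte::newline;
use combine::parser::range::take_while;
use combine::stream::Decoder;
use combine::{decode, eof, Parser};

fn main() {
  let mut file = File::open("test8GB").unwrap();
  // The decoder handles all buffering internally, so no BufReader is needed.
  let mut decoder = Decoder::new();
  let mut n_lines: u64 = 0;
  loop {
    // Decode one `[a-zA-Z]*\n` item per call; `eof` signals a clean end of input.
    let word = take_while(|b: u8| b.is_ascii_alphabetic());
    let line = (word, newline()).map(|_| ());
    let parsed = decode!(
      decoder,
      file,
      eof().map(|_| None).or(line.map(Some)),
      |input, _position| combine::easy::Stream::from(input),
    );
    match parsed {
      Ok(Some(())) => n_lines += 1,
      Ok(None) => break,
      // Either an I/O error or a byte that does not match the grammar.
      Err(_) => panic!("Encountered non-alphabetic character on line {}", n_lines + 1),
    }
  }
  println!("{}", n_lines);
}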
