Friday, 7 September 2012

Iterating over a text file

It's a simple task -- you want iterate over all the lines in a file and perform an action on each one. In Python it's as simple as:
for line in open("somefile.txt"):
    print line
How about C++11. How hard can it be? This stackoverflow post gives us a starting point. We first define an iterator:
struct line_t : std::string {
     friend std::istream & operator>>(std::istream& is, line_t& line) {
         return std::getline(is, line);
     }
};
typedef std::istream_iterator<line_t> line_iterator;
Boost.Range packages a pair of begin/end iterators into a range which makes the code more elegant. It defines all those algorithms found in std namespace in terms of ranges. So we'll give ranges a try and hope it improves our style. Since std::pair of iterators is a valid Boost.Range range, let's define a line_range in terms of that and also a helper function to construct the range:
typedef std::pair<line_iterator, line_iterator> line_range;

line_range lines(std::ifstream& is) {
    return line_range(line_iterator(is), line_iterator());
}
Now we can use Boost.Range's for_each to iterate over a file:
std::ifstream file("somefile.txt");
boost::for_each(lines(file), [](std::string const& line) {
    std::cout << line << std::endl;
});
Not bad. Not counting the very last line, it's 3 lines of code. One more than Python. Let's see if we can get it down to two lines by constructing the ifstream object as a temporary:
boost::for_each(lines(std::ifstream("somefile.txt")), [](std::string const& line) {
    std::cout << line << std::endl;
});
g++ -std=c++0x iter.cpp
iter.cpp: In function ‘int main()’:
iter.cpp:97:54: error: invalid initialization of non-const reference of type ‘std::ifstream& {aka std::basic_ifstream<char>&}’ from an rvalue of type ‘std::ifstream {aka std::basic_ifstream<char>}’
iter.cpp:82:12: error: in passing argument 1 of ‘line_range lines(std::ifstream&)’
That doesn't work. Of course -- passing a temporary into a function that takes a non-const reference will not work. Unfortunately we can't change it to a const reference since istream_iterator's constructor takes a non-const reference. Fortunately we can deal with this situation by taking it as an rvalue reference:
line_range lines(std::ifstream&& is) {
    return line_range(line_iterator(is), line_iterator());
}
This will happily compile and work but we just broke the previous usage. That is because an rvalue reference won't bind to an lvalue. We can do two things to solve this problem. First is to provide two overloads:
line_range lines(std::ifstream& is) {
    return line_range(line_iterator(is), line_iterator());
}

line_range lines(std::ifstream&& is) {
    return lines(is);
}
The second overload ends up calling the first one because an rvalue reference is itself an lvalue! See the tutorial by Thomas Becker to learn everything you'll ever need about rvalue references. The other way is to use the mechanism employed by std::forward:
template <typename S>
line_range lines(S&& is) {
    return line_range(line_iterator(is), line_iterator());
}
A call with lvalue will bind S to std::ifstream& and with && will collapse to just std::ifstream&. A call with rvalue will bind S to std::ifstream and the whole thing will be std::ifstream&&. Again, see Becker's Section 8 for more detail.
Presto! We got two perfectly good ways to make it work with temporaries or not.

Room for Improvement: Using range-based-for

As much as I love lambdas and functional programming, there is a lot to be said for a good old for loop. And with C++11 we have the new range-based-for (aka foreach). I'd rather write this:
for( auto& line: lines(std::ifstream("somefile.txt")) )
    std::cout << line << std::endl;
In order for the range-based-for to work we must expose begin/end iterators via one of two ways. Either our line_range needs to have begin() and end() methods (just like STL containers) or we need to have free functions with same names such that ADL can pick them up. Since Boost.Range defines such free functions (boost::begin() and boost::end()), all we need to do is dump them into our namespace:
using boost::begin;
using boost::end;

std::ifstream file("somefile.txt");
for( auto& line: lines(file) )
    std::cout << line << std::endl;
This works great but unfortunately this does not (although it compiles):
for( auto& line: lines(std::ifstream("somefile.txt")) )
    std::cout << line << std::endl;
The reason being is that the compiler transforms this loop [stmt.ranged/1] into:
auto&& __range = lines(std::ifstream("somefile.txt"));
// regular for loop using iterators returned by begin(__range) and end(__range)
lines() returns the instance of line_range by value but its lifetime is extended to that of a the __range reference. However the lifetime of std::ifstream temporary is not extended! It is destroyed before the loop is even started. It's a problem that didn't exist when we were using boost::for_each since the whole iteration was all one statement. See [class.temporary/5] for details.
How should we fix it? Since the line_range temporary lives on for the duration of the loop, we need to move the instance of std::ifstream into it. For the cases where std::ifstream is an lvalue rather than a temporary, we can just store a reference to it. Let's get started.

Start off by redefining line_range as a class, ditching the std::pair of iterators:
template <typename T>
class line_range {
    T istr;
public:
    line_range(T&& is) : istr(std::forward<T>(is)) {}
    line_iterator begin() { return line_iterator(istr); }
    line_iterator end() { return line_iterator(); }
};

We need to rig this up so that T=std::ifstream if a temporary is used and T=std::ifstream& otherwise. Well, we know how to this! You did read Becker's Section 8, didn't you?!
It's as simple as (albeit syntax is not simple):
template <typename S>
auto lines(S&& is) -> decltype(line_range<S>(std::forward<S>(is))) {
    return line_range<S>(std::forward<S>(is));
}
It's an example of the Perfect Forwarding problem -- we just forward rvalueness/lvalueness through to the line_range and end up either moving the object or storing the reference to it. As a very last step, we have to slightly modify the line_range to once again accommodate Boost.Range. Complete code is shown below:
#include <iostream>
#include <fstream>
#include <boost/range/algorithm/for_each.hpp>

struct line_t : std::string {
     friend std::istream & operator>>(std::istream& is, line_t& line) {
         return std::getline(is, line);
     }
};

typedef std::istream_iterator<line_t> line_iterator;

template <typename T>
class line_range {
    T istr;
    line_iterator b;

public:
    typedef line_iterator iterator;
    typedef line_iterator const_iterator;

    line_range(T&& is) :
        istr(std::forward<T>(is)),
        b(istr)
    {}

    line_iterator begin() const { return b; }
    line_iterator end() const { return line_iterator(); }
};

template <typename S>
auto lines(S&& is) -> decltype(line_range<S>(std::forward<S>(is))) {
    return line_range<S>(std::forward<S>(is));
}

int main()
{
    std::ifstream file("somefile.txt");
    for( auto& line: lines(file) )
        std::cout << line << std::endl;

    for( auto& line: lines(std::ifstream("somefile.txt")) )
        std::cout << line << std::endl;

    std::ifstream file2("somefile.txt");
    boost::for_each(lines(file2), [](std::string const& line) {
        std::cout << line << std::endl;
    });

    boost::for_each(lines(std::ifstream("somefile.txt")), [](std::string const& line) {
        std::cout << line << std::endl;
    });

    return 0;
}