
RegEx Proposal


Allan Odgaard

Aug 31, 2003, 8:38:08 AM
I sent this to comp.std.c++ several days ago, but it never showed up,
so I'll try this group instead.

I am aware of the proposal to include boost's regex library in C++0x,
but was not entirely satisfied by the API of this library (and had
some problems using it with non char/wchar_t types, avoiding
basic_string etc.), so I present here a proposal for an alternative
API which more closely resembles the existing std::find functions.

If I wish to skip characters until next space, I can already do:
first = find(first, last, ' ');

It could be I have some non-word chars, and instead I'd do:
static const char delim[] = { ' ', ',', '.', '\n' };
first = find_first_of(first, last, delim, delim + sizeofA(delim));

So in case I had a regular expression, I'd prefer if I could simply
write:
first = regex::find(first, last, "\\>");

There are a few problems with this though. The pattern argument is
currently a c-string and this is no good. We could make it an
iterator-range (to support arbitrary types) similar to std::search,
but a problem remains, namely that the DFA/NFA would have to be
constructed from scratch each time. So instead let us make the
argument "nfa<T> const&" and let the NFA be constructible from both an
iterator range and a c-string, that way we can write:

// pattern is e.g. in unicode
vector<T> ptrn = ...;
first = regex::find(first, last, nfa<T>(ptrn.begin(), ptrn.end()));
or
// cache this "expensive" operation
static const nfa<char> ptrn("\\>");
first = regex::find(first, last, ptrn);
or
// implicit conversion from 'char const*' to 'nfa<char> const&'
first = regex::find<char>(first, last, "\\>");
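
The find-style call with a cached pattern can be approximated today with
std::regex (standardised later, in C++11), which stands in here for the
proposed nfa<T>. This is only a sketch: regex_find and
skip_to_comma_or_space are hypothetical names, not part of the proposal
or of <regex>, and the pattern "[ ,]" is just an illustration.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: returns an iterator to the start of the first
// match in [first, last), or `last` when nothing matches (like std::find).
template <class BidirIt>
BidirIt regex_find(
    BidirIt first, BidirIt last,
    const std::basic_regex<
        typename std::iterator_traits<BidirIt>::value_type>& re)
{
    std::match_results<BidirIt> m;
    return std::regex_search(first, last, m, re) ? m[0].first : last;
}

// The pattern is compiled once and reused, mirroring the
// "static const nfa<char>" idiom in the post.
inline std::string::const_iterator
skip_to_comma_or_space(const std::string& s)
{
    static const std::regex delim("[ ,]");
    return regex_find(s.begin(), s.end(), delim);
}
```

Because the "not found" result is `last`, the return value can be used
unchecked in the same way as the result of std::find.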

This takes care of the pattern, but another problem is that sometimes
we need more info than just where the match started. For this reason
we introduce the type regex::match<_ForwardIter> which has an API very
similar to std::vector [1]:

_Iter begin () const; // start of match
_Iter end () const; // end of match
bool empty () const; // did it actually match?

To support sub-matches we provide the same functions with a numeric
argument [2]:

_Iter begin (size_t n); // start of the n'th sub-match
_Iter end (size_t n); // end of the n'th sub-match
bool empty (size_t n); // was there an n'th sub-match
size_t size (); // number of sub-matches

And to support our 'old' usage of regex::find we also introduce this
handy conversion member:
operator _Iter () const { return begin(); }

That way, we can still do:
first = regex::find(first, last, ptrn);

But if we need more info, we'd instead do:
regex::match<_Iter> const& m = regex::find(first, last, ptrn);
for(size_t i = 0; i < m.size(); i++)
copy(m.begin(i), m.end(i), ostream_iterator<char>(cout));

(when nothing is matched, size() returns 0 and begin/end returns
'last' (this also applies to begin/end for sub-matches) -- this allows
the result to be used unchecked).
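
A minimal sketch of such a match type, assuming it is layered over
std::match_results (C++11) rather than over the proposed NFA. Two
assumptions differ from the post and are marked in the comments:
evaluation is eager, so all members can be const, and index 0 denotes the
whole match, following std::match_results numbering.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical match type modelled on the members listed in the post.
template <class Iter>
class match {
public:
    match(Iter first, Iter last,
          const std::basic_regex<
              typename std::iterator_traits<Iter>::value_type>& re)
        : last_(last)
    {
        std::regex_search(first, last, results_, re);  // eager (assumption)
    }

    bool empty() const { return results_.empty(); }  // true: no match
    Iter begin() const { return begin(0); }          // start of match
    Iter end() const { return end(0); }              // end of match

    // Assumption: 0 is the whole match, 1..size()-1 are the sub-matches.
    std::size_t size() const { return results_.size(); }
    bool empty(std::size_t n) const
    {
        return n >= results_.size() || !results_[n].matched;
    }
    // Unmatched or out-of-range groups yield the empty range [last, last).
    Iter begin(std::size_t n) const
    {
        return empty(n) ? last_ : results_[n].first;
    }
    Iter end(std::size_t n) const
    {
        return empty(n) ? last_ : results_[n].second;
    }

    operator Iter() const { return begin(); }  // keeps "first = find(...)"

private:
    Iter last_;
    std::match_results<Iter> results_;
};
```

With this shape, a failed search reports size() == 0 and every begin/end
call returns `last`, so the result can be passed to algorithms without a
prior check.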

It would probably be nice to also have this member:
bool operator== (std::pair<_Iter, _Iter> const& rhs) const
{
    return rhs.first == begin() && rhs.second == end();
}

That way we can check for full matches without resorting to
temporary variables, e.g.:
if(regex::find(first, last, ptrn) == std::make_pair(first, last))
...

As an alternative one may however just use begin-of-buffer and
end-of-buffer codes in the pattern, which would probably be faster.

The above is really the essence of the proposal. There would also be a
regex::find_end, which would find the last match. This would require
that regex::nfa can return a reversed NFA, but it would otherwise just
be a simple wrapper for regex::find (using reverse iterators).

To traverse all matches there is regex::find_all, which takes a unary
function and calls it for each match (so again just a simple wrapper
for regex::find).
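
A find_all along these lines might look as follows. This sketch is built
on std::regex_iterator (C++11) instead of on a repeated regex::find,
because the iterator already handles empty matches and look-behind
context; the name find_all is taken from the post, everything else is an
assumption.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical find_all: calls f(m) for every non-overlapping match m
// in [first, last) and returns the functor, like std::for_each.
template <class BidirIt, class UnaryFunction>
UnaryFunction find_all(
    BidirIt first, BidirIt last,
    const std::basic_regex<
        typename std::iterator_traits<BidirIt>::value_type>& re,
    UnaryFunction f)
{
    for (std::regex_iterator<BidirIt> it(first, last, re), end;
         it != end; ++it)
        f(*it);  // *it is a std::match_results<BidirIt>
    return f;
}
```

Returning the functor lets a stateful callback (a counter, or the
formatter described next) hand its accumulated state back to the caller.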

To support format strings, there is a functor which is constructed
from a format string and an output iterator. It implements operator()
with regex::match<T> as argument and does formatted output (this
should also be trivial). In the usage below I use a helper function to
hide the template arguments of the format class; like the nfa class,
the format object can be instantiated from both a c-string and an
iterator range:
regex::find_all(first, last,
ptrn, make_fmt("$1", ostream_iterator<char>(cout)));

This, however, will only format/output each match; the text between
matches is lost. To remedy this problem, we can create a
'pass_thru'-functor, which is a subclass of the 'format'-functor, but
which (in operator()) will do:

bool operator() (match<_Iter> const& m)
{
    std::copy(M_previous_last, m.begin(), M_out);
    M_previous_last = m.end();
    return format::operator()(m);
}

Though we need to flush this functor, and it must also be initialised
with 'first, last' representing the range to be searched, so the above
will become:
regex::find_all(first, last, ptrn,
make_passthru_fmt(first, last, fmt, out)).flush();

Which I suggest we hide in "regex::replace", so that we can settle
with:
regex::replace(first, last, ptrn, fmt, out);
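
The pass-through functor and the replace wrapper can be sketched on top
of std::match_results::format (C++11). The names passthru_fmt and
regex_replace_copy are hypothetical, and the sketch is restricted to char
patterns and std::string format strings for brevity.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <utility>

// Hypothetical pass-through functor: copies the text between matches
// unchanged and writes a formatted replacement for each match.
template <class BidirIt, class OutIt>
class passthru_fmt {
public:
    passthru_fmt(BidirIt first, BidirIt last, std::string fmt, OutIt out)
        : previous_last_(first), last_(last),
          fmt_(std::move(fmt)), out_(out) {}

    void operator()(const std::match_results<BidirIt>& m)
    {
        out_ = std::copy(previous_last_, m[0].first, out_);  // gap text
        out_ = m.format(out_, fmt_);                         // the match
        previous_last_ = m[0].second;
    }

    OutIt flush()  // copy whatever follows the final match
    {
        out_ = std::copy(previous_last_, last_, out_);
        previous_last_ = last_;
        return out_;
    }

private:
    BidirIt previous_last_, last_;
    std::string fmt_;
    OutIt out_;
};

// Hypothetical wrapper hiding the functor and the flush() call.
template <class BidirIt, class OutIt>
OutIt regex_replace_copy(BidirIt first, BidirIt last, const std::regex& re,
                         const std::string& fmt, OutIt out)
{
    passthru_fmt<BidirIt, OutIt> f(first, last, fmt, out);
    for (std::regex_iterator<BidirIt> it(first, last, re), end;
         it != end; ++it)
        f(*it);
    return f.flush();
}
```

For comparison, the iterator overload of std::regex_replace in C++11
provides the same convenience directly.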

Similar to make_fmt and make_passthru_fmt, where we need two versions
(to support both c-strings and iterator ranges) we'd also need two
versions of regex::replace, but since there is no logic in the
function, I do not see this as such a big problem.

So that is basically it -- there are several details I have left out,
since this is merely an appetiser to see, if there is any further
interest in this API over the current regex proposal.

Even if you do not support putting forth a second regex proposal for
the LWG then I'd still be interested in your feedback regarding the
API.

Finally it should be noted that this proposal actually supports
searching arbitrary (non char-like!) types. The code would go
along the lines of what's below (note that this code is untested, but
I've made it work earlier). regex::mixed_sequence is just a
std::vector subclass which adds operator<< for push_back, and which
stores the elements using a wrapper that can be constructed from both
the actual type and 'char'. This wrapper supports operator== for
both the actual type and char (where the latter is used when the NFA
is constructed (or the format string is parsed), and the former when
it is used/visited).

int main (int argc, char const* argv[])
{
    regex::mixed_sequence<char const*> ptrn, fmt;
    ptrn << '(' << "-o" << '|' << "--output" << ')';
    ptrn << '(' << '.' << ')';
    fmt << '$' << '2';

    std::vector<char const*> files;
    regex::find_all(argv, argv + argc,
        regex::make_nfa(ptrn.begin(), ptrn.end()),
        regex::make_fmt(fmt.begin(), fmt.end(),
            std::back_inserter(files)));

    std::cout << "Write to these files:" << std::endl;
    std::copy(files.begin(), files.end(),
        std::ostream_iterator<char const*>(std::cout, "\n"));

    return 0;
}

[1] The reason I have mimicked std::vector is so that one does not
have to learn a new API -- I am however sceptical about using
the size member function for 'number of submatches'.

[2] I did not make these functions 'const' because I want to
support lazy evaluation of sub-matches (which allows for
regex::find to use a DFA), but this will require that the
caller of regex::find makes a temporary copy of the result, if
he wishes to access the sub-matches (even if the implementation
doesn't use lazy evaluation).

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Beman Dawes

Sep 2, 2003, 10:21:42 AM
Du...@DIKU.DK (Allan Odgaard) wrote in message news:<689e217.03083...@posting.google.com>...

> I sent this to comp.std.c++ several days ago, but it never showed up,
> so I'll try this group instead.

That really is the right place; you might try again.

> I am aware of the proposal to include boost's regex library in C++0x,
> but was not entirely satisfied by the API of this library (and had
> some problems using it with non char/wchar_t types, avoiding
> basic_string etc.), so I present here a proposal for an alternative
> API which more closely resembles the existing std::find functions.
>

> ...

>
> So that is basically it -- there are several details I have left out,
> since this is merely an appetiser to see, if there is any further
> interest in this API over the current regex proposal.


The C++ Committee's Library Working Group first looked at John
Maddock's regex++ proposal a couple of years ago. They also had some
concerns about the interface, and John reworked it as a result to
reflect those concerns. The interface as accepted by the committee is
documented at http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1429.htm

As the proposal moved through the committee process, the interface was
again subject to scrutiny when Eric Niebler, the author of GRETA regex
library, suggested further changes. Although the final decision was
not to incorporate Eric's suggestions, they were quite interesting,
and so more analysis and discussion went into the interface at that
point.

Now that the proposal has been accepted, library vendors are starting
to implement it. That brings yet another level of scrutiny. While the
usual typos and minor specification glitches are turning up, there
haven't been a lot of complaints about the overall interface. That's a
really good sign given the acid bath of implementation working from
the proposal.

That's a long way of saying the train has already left the station.
While a change at this point isn't totally impossible, it isn't
likely. Can you boil your argument for change down into a paragraph or
two? That would help.

--Beman Dawes

Allan Odgaard

Sep 15, 2003, 5:47:56 AM
Beman Dawes wrote:

> > I sent this to comp.std.c++ several days ago, but it never showed up,
> > so I'll try this group instead.
> That really is the right place; you might try again.

I have since sent other letters to that group, most of them replies to
posts, and none have come through. Do they have a rule about not
allowing posts coming from Google or similar?

> The C++ Committee's Library Working Group first looked at John
> Maddock's regex++ proposal a couple of years ago. They also had some
> concerns about the interface, and John reworked it as a result to
> reflect those concerns. The interface as accepted by the committee is
> documented at http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1429.htm

Yes, I have read it. I must confess that my grievances are with the
current Boost implementation; some things have changed in the
proposal, but it still doesn't appeal to me.

> As the proposal moved through the committee process, the interface was
> again subject to scrutiny when Eric Niebler, the author of GRETA regex
> library, suggested further changes. Although the final decision was
> not to incorporate Eric's suggestions, they were quite interesting,

Are these suggestions/comments available?

> [...] there haven't been a lot of complaints about the overall interface.

And you were the only one to reply to my 'proposal', so I guess the
silent majority have spoken ;-)

> [...] Can you boil your argument for change down into a paragraph
> or two? That would help.

My major concerns are:

o it is designed for homogeneous type usage. Often I have a non-char
buffer which needs to be searched for a hardcoded (at compile time)
pattern. If I provide this pattern as a c-string then the character
traits will be for char, thus failing to correctly identify word
characters/boundaries and similar.

o it is in effect limited to char and wchar_t. On Mac OS X the
'unichar' type is an unsigned short (16 bit), whereas wchar_t is 32
bit, so I must use unichar, but not only did that require several
changes in the current implementation (but that might just be an
implementation issue), it also required me to write a character_traits
specialisation for unichar to go together with std::basic_string
(which is heavily used by boost), and that is in fact not permitted by
the standard (and did cause cross-compiler problems).

Other concerns are:

o it doesn't try to offer any syntactical shortcuts. E.g. often I
just wish to advance an iterator to the next occurrence of a pattern,
this requires that I define a) a structure to hold the result, where I
must provide a template argument (and pass by-reference), b) a
structure for the actual pattern (and if I search a unichar sequence,
I probably need to copy my c-string into a std::vector<unichar> first,
and create the pattern object from this vector), and again, I must
provide template arguments and c) the result of the actual search
algorithm is a boolean, so I would need an if(...).

o it doesn't try to hide the degenerate cases. I.e. one of the IMHO
beautiful features of the STL's iterator concept is that iterators are
always valid, e.g. I can do "v.insert(find(v.begin(), v.end(),
someValue), otherValue);", and even if find() "fails", the
statement is still valid. Similarly with "v.erase(unique(v.begin(),
v.end()), v.end())": it could be that unique doesn't alter the
sequence, but no checks are required -- with boost lots of checks are
required, even in the general case.
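
The two standard-library idioms mentioned in the last point, written out
as a small sketch. The helper names insert_before and
drop_adjacent_duplicates are hypothetical; the calls inside them are
plain <algorithm> and std::vector.

```cpp
#include <algorithm>
#include <vector>

// Inserts `other` before the first `value`. When `value` is absent,
// std::find returns v.end() and the insert simply appends.
inline void insert_before(std::vector<int>& v, int value, int other)
{
    v.insert(std::find(v.begin(), v.end(), value), other);
}

// Removes consecutive duplicates. When there are none, std::unique
// returns v.end() and the erase is a no-op.
inline void drop_adjacent_duplicates(std::vector<int>& v)
{
    v.erase(std::unique(v.begin(), v.end()), v.end());
}
```

Neither helper needs an if(...): the "nothing found" result is itself a
valid argument to the next call.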

Finally I would also like to point out that boost doesn't seem very
well integrated with the standard library components (STL). It doesn't
inherit their naming or syntax, nor does it seem very orthogonal in
the way one can combine boost with the standard classes.

For example let us say I wish to print all the unique words in a text,
something that I just did with my own library using this code:

set<match_type> m;
regex::copy_if<char>(first, last, "\\<\\w*\\>",
    inserter(m, m.begin()));
transform(m.begin(), m.end(),
    ostream_iterator<string>(cout, "\n"), to_string());

Here to_string is just a functor taking match_type and does a:
return std::string(m.begin(), m.end());

And I have added a global operator< for match_type which does:
return std::lexicographical_compare(m1.begin(), m1.end(),
m2.begin(), m2.end());

I made the copy_if part of my regex library, but the actual code is:
while(match_type const& m = find(first, last, ptrn))
{
    *out++ = m;
    first = m.end();
}
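
For comparison, the same "unique words" task can be sketched with
std::regex (C++11). This is not the library described above: the function
name unique_words is hypothetical, and the pattern is "\\w+" because the
default ECMAScript grammar of std::regex has no \< and \> anchors.

```cpp
#include <iterator>
#include <regex>
#include <set>
#include <string>

// Collects the distinct words of `text`; std::set keeps them sorted.
inline std::set<std::string> unique_words(const std::string& text)
{
    static const std::regex word("\\w+");  // compiled once, then reused
    std::set<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), word), end;
         it != end; ++it)
        words.insert(it->str());
    return words;
}
```

The set here stores copies of the matched text, whereas the version in
the post stores match objects and orders them with a custom operator<.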

I realise that this could probably be built on top of boost in
some way, and my beloved "first = find(first, last, ptrn)" could be
implemented as a wrapper, and so on -- this was how I originally tried
to handle boost (in a project of mine), but eventually I grew tired of
it, so I ditched boost and wrote my own library. It tries to avoid
the set of fixed functions that boost offers, and instead provides
smaller components which can be combined, so the end-user can
rewrite/supply just one component to adapt the library, e.g. the
format-string thing is isolated in a functor. Write a new functor and
it will work perfectly with the existing library. Or you could write
another 'matcher' and use my format-string functor with that instead
:-)

Another thing I find noteworthy is that my own library currently
supports more or less everything that boost has to offer with respect
to regex grammar, and the implementation is less than 1,200 lines of
code. Last I checked boost was over 13,500, and GRETA is in the same
range.

My implementation has a fully self-contained regex parser, which
builds a parse tree from the regular expression, and the NFA is
created using a visitor interface (so these two modules have no
dependencies on each other), making it very simple to change the
syntax.

The matching algorithm uses an iterative back-tracking depth-first
search of the NFA.
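
An iterative backtracking depth-first search can be sketched with an
explicit stack of (state, position) pairs. The instruction encoding below
is an assumption made for illustration (it is not the representation used
by the library described here), the program is built by hand rather than
by a parser, and there is no guard against loops that consume no input.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical NFA encoding: one instruction per state.
struct inst {
    enum kind { chr, split, jump, accept } op;
    char c;            // chr: the character to consume
    std::size_t x, y;  // split: try x first, then y; jump: go to x
};

// Returns true if `prog` matches a prefix of [first, last).
template <class Iter>
bool nfa_match(const std::vector<inst>& prog, Iter first, Iter last)
{
    // The explicit stack of untried alternatives replaces recursion.
    std::vector<std::pair<std::size_t, Iter>> todo;
    todo.push_back(std::make_pair(std::size_t(0), first));
    while (!todo.empty()) {
        std::size_t pc = todo.back().first;
        Iter pos = todo.back().second;
        todo.pop_back();
        for (;;) {  // follow one path as deep as it goes
            const inst& i = prog[pc];
            if (i.op == inst::accept)
                return true;
            if (i.op == inst::chr) {
                if (pos == last || *pos != i.c)
                    break;  // dead end: backtrack to a saved alternative
                ++pos;
                ++pc;
            } else if (i.op == inst::jump) {
                pc = i.x;
            } else {  // split: save y for later, continue with x
                todo.push_back(std::make_pair(i.y, pos));
                pc = i.x;
            }
        }
    }
    return false;
}

// Hand-built program for the pattern a+b:
//   0: chr 'a'   1: split 0, 2   2: chr 'b'   3: accept
inline std::vector<inst> a_plus_b()
{
    return {
        {inst::chr, 'a', 0, 0},
        {inst::split, 0, 0, 2},
        {inst::chr, 'b', 0, 0},
        {inst::accept, 0, 0, 0},
    };
}
```

Preferring the first branch of each split makes the repeat greedy; the
saved alternatives are only visited when the greedy path fails.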

Although I have not done so yet, the parse tree of the regular
expressions allows for easy creation of a reversed NFA, which can be
used for backward searches, and is required for look-behind (as I
would like to avoid the approach taken by GRETA).

Parsing the pattern is of course linear in the pattern size and
creating the NFA is likewise linear in the number of nodes. Matching
is what can be expected from a backtracking NFA, but I intend to use
the bit parallel algorithm when the pattern doesn't contain
back-references or counted repeats.

I intend to put the thing on SourceForge, but I haven't found a good
name yet, so suggestions would be welcome :-)

Beman Dawes

Sep 16, 2003, 2:43:59 AM
Du...@DIKU.DK (Allan Odgaard) wrote in message news:<689e217.03091...@posting.google.com>...

> Beman Dawes wrote:
>
> > > I sent this to comp.std.c++ several days ago, but it never showed up,
> > > so I'll try this group instead.
> > That really is the right place; you might try again.
>
> I have since sent other letters to that group, most of them replies to
> posts, and none have come through. Do they have a rule about not
> allowing posts coming from Google or similar?

I post from Google and have no problems. Perhaps you could ask one of
the moderators why your posts aren't showing up.

> > As the proposal moved through the committee process, the interface was
> > again subject to scrutiny when Eric Niebler, the author of GRETA regex
> > library, suggested further changes. Although the final decision was
> > not to incorporate Eric's suggestions, they were quite interesting,
>
> Are these suggestions/comments available?

Some were private email discussion, but the key points appeared in
committee emails so are available to committee participants. If you
are interested, you might want to think about joining. C++ continues
to evolve because volunteers are willing to contribute their time.
There are no big corporations funding standards development efforts.

> > [...] there haven't been a lot of complaints about the overall interface.
>
> And you were the only one to reply to my 'proposal', so I guess the
> silent majority have spoken ;-)

My guess is that regular expressions fall in the category of
infrastructure that everybody wants but that just isn't sexy enough to
draw more than one or two top developers.

> > [...] Can you boil your argument for change down into a paragraph
> > or two? That would help.
>
> My major concerns are:
>
> o it is designed for homogeneous type usage. Often I have a non-char
> buffer which needs to be searched for a hardcoded (at compile time)
> pattern. If I provide this pattern as a c-string then the character
> traits will be for char, thus failing to correctly identify word
> characters/boundaries and similar.

The traits aspect is a concern. John Maddock is working on a paper
which will include some discussion of traits issues. Pete Becker has
been helping by providing a lot of feedback. The paper should be in
the committee's pre-meeting mailing due in a few weeks, and like all
committee technical papers will be available to the public via the
WG21 web site.

> o it is in effect limited to char and wchar_t. On Mac OS X the
> 'unichar' type is an unsigned short (16 bit), whereas wchar_t is 32
> bit, so I must use unichar, but not only did that require several
> changes in the current implementation (but that might just be an
> implementation issue), it also required me to write a character_traits
> specialisation for unichar to go together with std::basic_string
> (which is heavily used by boost), and that is in fact not permitted by
> the standard (and did cause cross-compiler problems).

Without understanding the issue in any depth, that sounds like an
issue which can be dealt with as an issue with the traits aspects of
the current interface, rather than a need for an entirely new
interface.

> Other concerns are:
>
> o it doesn't try to offer any syntactical shortcuts. E.g. often I
> just wish to advance an iterator to the next occurrence of a pattern,
> this requires that I define a) a structure to hold the result, where I
> must provide a template argument (and pass by-reference), b) a
> structure for the actual pattern (and if I search a unichar sequence,
> I probably need to copy my c-string into a std::vector<unichar> first,
> and create the pattern object from this vector), and again, I must
> provide template arguments and c) the result of the actual search
> algorithm is a boolean, so I would need an if(...).

Understood. But can't that be solved by providing additional
functionality that in effect just wraps the proposed interface? Darin
Adler did a prototype on Boost once, and it was quite nice IIRC. But
Darin got busy with his real job, and although several others started
to carry the work forward, they haven't finished yet. You might ask on
the Boost list how that work is progressing.

In fact, the Boost list would be a good place for a lot of your
detailed concerns - they may be a bit detailed for
comp.lang.c++.moderated.

Sometimes helping to improve an ongoing effort is more productive
than starting a new effort. Particularly where the Boost regular
expression package is well on its way to standardization.

--Beman Dawes

Allan Odgaard

Sep 17, 2003, 5:13:41 AM
Beman Dawes:

> > Are these suggestions/comments available?
> Some were private email discussion, but the key points appeared in
> committee emails so are available to committee participants. If you
> are interested, you might want to think about joining. [...]

I did not know that "normal people" could participate -- is there a
recommended way to be introduced to the work done, i.e. a mailing list
one can subscribe to? Looking at their pages it would seem that to
participate "please contact your national member body", which seems a
bit too drastic, as I would prefer to start by being a silent observer
of the "work done".

> > o it is in effect limited to char and wchar_t. [...]


> Without understanding the issue in any depth, that sounds like an
> issue which can be dealt with

std::basic_string would have to go.

> as an issue with the traits aspects of the current interface

The traits issue is also problematic as currently drafted. John sort
of criticises GRETA for templating the pattern object on the iterator
type -- while I agree with John that this could easily lead to code
bloat, there is a middle way: create the NFA independent of type, and
provide a visitor API for this NFA. That way the actual find function
can instantiate a visitor subclass that is templated on the type (and
thus uses the correct traits), and the actual work of parsing the
regex and building the NFA is done only once.

The downside is that one cannot do optimisations on character ranges
-- but this is IMO completely unnecessary.
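
The "middle way" can be sketched as follows: the pattern is built once
from char, and only the visitor is instantiated per iterator type. All
names are hypothetical, and the "NFA" is reduced to a linear sequence of
nodes with a toy syntax to keep the example short.

```cpp
#include <iterator>
#include <memory>
#include <vector>

struct literal_node;
struct any_node;

// Visitor interface: knows the node kinds, not the searched type.
struct node_visitor {
    virtual ~node_visitor() {}
    virtual void visit(const literal_node&) = 0;
    virtual void visit(const any_node&) = 0;
};

struct node {
    virtual ~node() {}
    virtual void accept(node_visitor& v) const = 0;
};
struct literal_node : node {
    explicit literal_node(char ch) : c(ch) {}
    void accept(node_visitor& v) const override { v.visit(*this); }
    char c;
};
struct any_node : node {
    void accept(node_visitor& v) const override { v.visit(*this); }
};

typedef std::vector<std::unique_ptr<node>> pattern;

// Parsed once, independent of the sequence searched later.
// Toy syntax (assumption): '.' matches any element, the rest literally.
inline pattern make_pattern(const char* s)
{
    pattern p;
    for (; *s; ++s) {
        if (*s == '.')
            p.push_back(std::unique_ptr<node>(new any_node));
        else
            p.push_back(std::unique_ptr<node>(new literal_node(*s)));
    }
    return p;
}

// Only this part is templated on the iterator (and thus element) type.
template <class Iter>
struct matcher : node_visitor {
    typedef typename std::iterator_traits<Iter>::value_type value_type;
    matcher(Iter first, Iter end_) : pos(first), last(end_), ok(true) {}
    void visit(const literal_node& n) override
    {
        ok = ok && pos != last && *pos == static_cast<value_type>(n.c);
        if (ok) ++pos;
    }
    void visit(const any_node&) override
    {
        ok = ok && pos != last;
        if (ok) ++pos;
    }
    Iter pos, last;
    bool ok;
};

// True if the pattern matches at the start of [first, last).
template <class Iter>
bool matches_at(const pattern& p, Iter first, Iter last)
{
    matcher<Iter> m(first, last);
    for (const auto& n : p)
        n->accept(m);
    return m.ok;
}
```

The same pattern object can then be applied to a char sequence and to a
sequence of 16-bit units. The cost is two virtual calls per node (accept,
then visit), which is the runtime price of keeping the pattern type
independent.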

A more justified concern might have to do with collating and named
elements, which one might want to delegate to the traits class (at
parse time), although I doubt that it makes sense in practice, i.e.
only having [[:digit:]] or similar available when searching a
char-sequence (which could be the consequence).

If delegation is required I would suggest introducing a new delegate
class, unrelated to the traits, which would also allow for better
re-use of these things (Boost provides four different traits files
with an *average* size of 31.5 KB).

> rather than a need for an entirely new interface.

I realise that my actions might seem counter-productive, but the story
is that I needed regex support earlier this year, I turned to Boost
and had a friend "adapt" it to work on Mac OS X (which wasn't easy),
then I used it for some time, writing several wrappers around the
rather verbose API.

Then GCC was updated to 3.3 and Boost suddenly failed to compile, due
to its use of std::basic_string<CharT>. Added to the fact that Boost
did not satisfy me performance-wise and lacked some features I
desired, this made me finally think to hell with it; I spent a day
re-creating the features I needed from Boost and "moved on".

Later I saw Boost proposed as a standard library addition and wrote
John to offer him my "experience" using it for a project, but nothing
came out of it, so after talking with a few friends, who shared my
scepticism, I wrote the "proposal" (you replied to) to see if there
were anyone besides me who felt that Boost didn't really cut it as a
standard library addition.

> > o it doesn't try to offer any syntactical shortcuts. [...]


> Understood. But can't that be solved by providing additional
> functionality that in effect just wraps the proposed interface?

This was my first approach when I used Boost. The problem was that
without also wrapping the Boost structures, I would generally need a
new (uniquely named) function for each usage -- so I set out to create
wrappers for the structures as well, but quickly decided, that then it
was actually easier just to re-create the entire library.

And what is the point of having something in the standard which really
requires wrappers to be used? If not wrappers, then at least a look
through the documentation, as with Boost even the simplest task (e.g.
checking which of a set of file names have C++ extensions) requires a
vast amount of typing and the use of hard-to-remember names like
"reg_expression". What I mean by hard here is that "Regular
Expression" can be abbreviated in a dozen different ways (and so it is
in practice), so remembering the actual abbreviation used by Boost is
hard, not to mention which of the seven constructors is suited for my
current task.

> In fact, the Boost list would be a good place for a lot of your
> detailed concerns - they may be a bit detailed for
> comp.lang.c++.moderated.

As stated above, I did write John with most of my concerns.

> Sometimes helping to improve an ongoing effort is more productive
> than starting a new effort. Particularly where the Boost regular
> expression package is well on its way to standardization.

I realise that, however, my initial goal was not for my API to be part
of the standard, but simply provide a better library for my own
personal use, which I deemed was easier to achieve by starting from
scratch.

Beman Dawes

Sep 17, 2003, 4:24:46 PM
Du...@DIKU.DK (Allan Odgaard) wrote in message news:<689e217.03091...@posting.google.com>...
> Beman Dawes:
>
> > > Are these suggestions/comments available?
> > Some were private email discussion, but the key points appeared in
> > committee emails so are available to committee participants. If you
> > are interested, you might want to think about joining. [...]
>
> I did not know that "normal people" could participate -- is there a
> recommended way to be introduced to the work done, i.e. a mailing list
> one can subscribe to? Looking at their pages it would seem that to
> participate "please contact your national member body", which seems a
> bit too drastic, as I would prefer to start by being a silent observer
> of the "work done".

There is a membership category called Advisor just for that purpose.
See the FAQ entry at http://www.jamesd.demon.co.uk/csc/faq.html#B5 and
the next several entries. While corporations or universities often
participate, a number of individuals also hold memberships.

The Boost members on the committee often help other Boosters through
the process of becoming active on the committee. One of the objectives
of Boost right from the start was to develop new members for the
committee's Library Working Group.

> > rather than a need for an entirely new interface.
>
> I realise that my actions might seem counter-productive, but the story
> is that I needed regex support earlier this year, I turned to Boost
> and had a friend "adapt" it to work on Mac OS X (which wasn't easy),
> then I used it for some time, writing several wrappers around the
> rather verbose API.
>
> Then GCC was updated to 3.3 and Boost suddenly failed to compile, due
> to its use of std::basic_string<CharT> -- this added to the fact that
> Boost did not satisfy me performance wise and it lacked some features
> I desired, then I finally thought to hell with it, spent a day
> re-creating the features I needed from Boost and "moved on".

Standard library components have to satisfy a lot of different
interests. Not just the interests of ordinary users, but also
implementors, teachers, authors, and others who care about C++. Even
just focusing on users, there is a lot of variation in needs. Thus it
may be possible to come up with your own library components that are
"better" by ignoring needs that don't apply to your applications.
OTOH, the advantages of standardization are great enough that many
programmers prefer standard library components even knowing that they
may be sub-optimal in some sense.

--Beman

Eric Niebler

Sep 20, 2003, 6:08:28 AM

Allan Odgaard wrote:
> Beman Dawes:

>>as an issue with the traits aspects of the current interface
>
>
> The traits issue is also problematic as currently drafted. John sort
> of criticises GRETA for templating the pattern object on the iterator
> type -- while I agree with John that this could easily lead to code
> bloat, there is a middle way: create the NFA independent of type, and
> provide a visitor API for this NFA. That way the actual find function
> can instantiate a visitor subclass that is templated on the type (and
> thus uses the correct traits), and the actual work of parsing the
> regex and building the NFA is done only once.

I suspect that this approach would lead to severe performance problems.
Visitor typically involves double virtual calls, which cannot be
inlined. Some clever trickery could be used to avoid *some* of the
virtuals, but not all, and performance would suffer.

Note that the code bloat argument is somewhat of a red herring. Even
if your NFA is iterator agnostic, your regex algorithms are certainly
not. They can't be -- at some level they have to know how to traverse
your character sequence. Tricks like visitor can increase the amount of
code that is blissfully unaware of the actual iterator type, but you pay
for that at runtime.

At the other extreme, if your NFA is parameterized on the iterator type,
it knows at compile time how to traverse and extract characters from
your character sequence, and there is extra potential for inlining and
optimization. That is the approach that I took with GRETA. There is no
distinction between pattern data and matching algorithm -- they are one
and they work together. You can take this to an extreme -- a pattern
can be compiled into an array of assembly instructions stored in a
vector<BYTE>, which can execute directly. Sick, but quick, and not for
the faint of heart. (The .NET Regex class can do something similar to
this, but it's slow for other reasons, alas.)

On the whole, I think your ideas regarding interface are interesting and
worthwhile. I also believe the Boost.Regex interface is successful, and
will satisfy a large number of users and educators. It also leaves room
for more specialized solutions like yours, GRETA and Spirit.

If you are so motivated, I highly encourage you to publish your regex
code and get actively involved in standardization.

Eric

Andy Heninger

Sep 22, 2003, 3:21:10 PM

"Allan Odgaard" <Du...@DIKU.DK> wrote
[regarding Boost, and proposed std library regular expressions]

> o it is in effect limited to char and wchar_t. On Mac OS X the
> 'unichar' type is an unsigned short (16 bit), whereas wchar_t is 32
> bit, so I must use unichar, but not only did that require several
> changes in the current implementation (but that might just be an
> implementation issue), it also required me to write a character_traits
> specialisation for unichar to go together with std::basic_string
> (which is heavily used by boost), and that is in fact not permitted by
> the standard (and did cause cross-compiler problems).
>

IBM's ICU Unicode library includes C++ regular expressions that work
directly on UTF-16 encoded Unicode, and are fully aware of the Unicode
character properties, types, names, etc.
Full disclosure: I work for IBM on the ICU project.

The ICU regexp API for C++ was modelled after that in Java 1.4.

Even though it is a C++ library, ICU and the C++ standard library
have next to nothing in common in style or design philosophy, for
a variety of mostly historical reasons. It would be nice,
some day, to see good Unicode support in a library that was
conceptually more aligned with the std library.

The Unicode Consortium has a paper on Unicode and Regular Expressions
at http://unicode.org/reports/tr18/

Information on ICU is at http://oss.software.ibm.com/icu/

-- Andy Heninger
heni...@us.ibm.com
