Thursday 13 October 2016

Information about LuaJIT

The following are various notes about the design and implementation of LuaJIT.

Design overview

Email to lua-l mailing list.

From: Mike Pall mike.de
Subject: LuaJIT 2.0 intellectual property disclosure and research opportunities
Newsgroups: gmane.comp.lang.lua.general
Date: Monday 2nd November 2009 10:17:04 UTC


It has been brought to my attention that it might be advantageous
for some parts of the research community and the open source
community, that I make a public statement about the intellectual
property (IP) contained in LuaJIT 2.0 and earlier versions:

  I hereby declare any and all of my own inventions contained in
  LuaJIT to be in the public domain and up for free use by anyone
  without payment of any royalties whatsoever.

  [Note that the source code itself is licensed under a permissive
  license and is not placed in the public domain. But this is an
  orthogonal issue.]

I cannot guarantee it to be free of third-party IP however. In
fact nobody can. Writing software has become a minefield and any
moderately complex piece of software is probably (unknowingly to
the author) encumbered by hundreds of dubious patents. This
especially applies to compilers. The current IP system is broken
and software patents must be abolished. Ceterum censeo.

The usual form of disclosure is to write papers and publish them.
I'm sorry, but I don't have the time for this right now. But I
would consider publishing open source software as a form of
disclosure.

In the interest of anyone doing research on virtual machines,
compilers and interpreters, I've compiled a list of some of the
new aspects to be found in LuaJIT 2.0. I do not claim all of them
are original (I cannot possibly know all of the literature), but
my research indicates that many of them are quite innovative.

This also presents some research opportunities for 3rd parties.
I have little use for academic merits myself -- I'm more interested
in coding than writing papers. Anyone is welcome to dig out any
aspects, explore them in detail and publish them (giving due credit).

Design aspects of the VM:

- NaN-tagging: 64 bit tagged values are used for stack slots and
  table slots. Unboxed floating-point numbers (doubles) are
  overlayed with tagged object references. The latter can be
  distinguished from numbers via the use of special NaNs as tags.
  It's a remote descendant of pointer-tagging.

  [The idea dates back to 2006, but I haven't disclosed it before
  2008. Special NaNs have been used to overlay pointers before.
  Others have used it for tagging later on. The specific layout is
  of my own devising.]

- Low-overhead call frames: The linear, growable stack implicitly
  holds the frame structure. The tags for the base function of
  each call frame hold a linked structure of frames, using no
  extra space. Calls/returns are faster due to lower memory
  traffic. This also allows installing exception handlers at zero
  cost (it's a special bit pattern in the frame link).
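
To make the NaN-tagging item above concrete, here is a minimal C++ sketch. It is an illustration only: the tag values, the 32-bit payload and the names (TValue, make_tagged and so on) are assumptions made for this example and are not LuaJIT's actual layout.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative layout only (an assumption for this sketch, not LuaJIT's):
// bits 51..63 all set marks a tagged value, bits 32..35 hold a non-zero tag,
// bits 0..31 hold the payload (e.g. an object reference).
constexpr std::uint64_t kTagBase = 0xFFF8000000000000ull;
constexpr unsigned kTagShift = 32;

enum class Tag : std::uint64_t { Nil = 1, Bool = 2, Ref = 3 };

struct TValue {
    std::uint64_t bits;
};
static_assert(sizeof(TValue) == 8, "one slot is exactly 64 bits");
static_assert(sizeof(double) == 8, "sketch assumes 64-bit IEEE doubles");

// Store a double unboxed. NaNs are canonicalised so they never alias a tag.
inline TValue make_number(double d) {
    TValue v;
    if (d != d) {
        v.bits = kTagBase;
    } else {
        std::memcpy(&v.bits, &d, sizeof d);
    }
    return v;
}

// Store a tagged payload inside the otherwise unused NaN space.
inline TValue make_tagged(Tag t, std::uint32_t payload) {
    return TValue{kTagBase | (static_cast<std::uint64_t>(t) << kTagShift) | payload};
}

// Every bit pattern above the canonical NaN is a tagged value.
inline bool is_number(TValue v) { return v.bits <= kTagBase; }

inline double as_number(TValue v) {
    double d;
    std::memcpy(&d, &v.bits, sizeof d);
    return d;
}

inline Tag tag_of(TValue v) { return static_cast<Tag>((v.bits >> kTagShift) & 0xF); }
inline std::uint32_t payload_of(TValue v) { return static_cast<std::uint32_t>(v.bits); }
```

The point of the sketch is that the number test is a single unsigned compare and that numbers need no boxing at all.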

Design of the IR (intermediate representation) used by the compiler:

- Linear, pointer-free IR: The typed IR is SSA-based and highly
  orthogonal. An instruction takes up only 64 bits. It has up to
  two operands which are 16 bit references. It's implemented with
  a bidirectionally growable array. No trees, no pointers, no cry.
  Heavily optimized for minimal D-cache impact, too.

- Skip-list chains: The IR is threaded with segregated, per-opcode
  skip-list chains. The links are stored in a multi-purpose 16 bit
  field in the instruction. This facilitates low-overhead lookup
  for CSE, DSE and alias analysis. Back-linking enables short-cut
  searches (average overhead is less than 1 lookup). Incremental
  build-up is trivial. No hashes, no sets, no complex updates.

- IR references: Specially crafted IR references allow fast const
  vs. non-const decisions. The trace recorder uses type-tagged
  references (a form of caching) internally for low-overhead
  type-based dispatch.

- High-level IR: A single, uniform high-level IR is used across
  all stages of the compiler. This reduces overall complexity.
  Careful instruction design avoids any impact on low-level CSE
  opportunities. It also allows cheap and effective high-level
  semantic disambiguation for memory references.

Design of the compiler pipeline:

- Rule-based FOLD engine: The FOLD engine is primarily used for
  constant folding, algebraic simplifications and reassociation.
  Most traditional compilers have an evolutionary grown set of
  implicit rules, spread over thousands of hand-coded tiny
  conditionals.

  The rule-based FOLD engine uses a declarative approach to
  combine the first and second level of lookup. It allows wildcard
  lookup with masked keys, too. A pre-processor generates a
  semi-perfect hash table for constant-time rule lookup. It's able
  to deal with thousands of rules in a uniform manner without
  performance degradation. A declarative approach is also much
  easier to maintain.

- Unified stage dispatch: The FOLD engine is the first stage in
  the compiler pipeline. Wildcard rules are used to dispatch
  specific instructions or instruction types (loads, stores,
  allocations etc.) to later optimization stages (load forwarding,
  DSE etc.). Unmatched instructions are passed on to CSE.

  Unified stage dispatch facilitates modular and pluggable
  optimizations with only local knowledge. It's also faster than
  doing multiple dispatches in every stage.

Trace compiler:

- NLF region-selection: The trace heuristics use a natural-loop
  first (NLF) region-selection mechanism to come up with a
  close-to optimal set of (looping) root traces. Only special
  bytecode instructions trigger new root traces -- regular
  conditionals never do this. Root traces that leave the loop are
  aborted and retried later. This also gives outer loops a chance
  to inline inner loops with a low trip count.

  NLF usually generates a superior set of root traces than the
  MRET/NET (next-executing tail) and LEI (last-executed iteration)
  region-selection mechanisms known from the literature.

- Hashed profile counters: Bytecode instructions to trigger the
  start of a hot trace use low-overhead hashed profiling counters.
  The profile is imprecise because collisions are ignored. The
  hash table is kept very small to reduce D-cache impact (only two
  hot cache lines). Since NLF weeds out most false positives, this
  doesn't deteriorate hot trace detection.

  [Neither using hashed profile counters, nor imprecise profiling,
  nor using profiling to detect hot loops is new. But the specific
  combination may be original.]

- Code sinking via snapshots: The VM must be in a consistent state
  when a trace exits. This means that all updates (stores) to the
  state (stack or objects) must track the original language
  semantics.

  Naive trace compilers achieve this by forcing a full update of
  the state to memory before every exit. This causes many on-trace
  stores and seriously diminishes code quality.

  A better approach is to sink these stores to compensation code,
  which is only executed if the trace exits are actually taken.
  A common solution is to emit actual code for these stores. But
  this causes code cache bloat and the information often needs to
  be stored redundantly, for linking of side traces.

  Code sinking via snapshots allows sinking of arbitrary code
  without the overhead of the other approaches. A snapshot stores
  a consistent view of all updates to the state before an exit. If
  an exit is taken the on-trace machine state (registers and spill
  slots) and the snapshot can be used to restore the VM state.

  State restoration using this data-driven approach is slow of
  course. But repeatedly taken side exits quickly trigger the
  generation of side traces. The snapshot is used to initialize
  the IR of the side trace with the necessary state using
  pseudo-loads. These can be optimized together with the remainder
  of the side trace. The pseudo-loads are unified with the machine
  state of the parent trace by the backend to enable zero-cost
  linking to side traces.

  [Currently snapshots only allow store sinking of scalars. It's
  planned to extend this to allow arbitrary store and allocation
  sinking, which together with store forwarding would be a unique
  way to achieve scalar-replacement of aggregates.]

- Sparse snapshots: Taking a full snapshot of all state updates
  before every exit would need a considerable amount of storage.
  Since all scalar stores are sunk, it's feasible to reduce the
  snapshot density. The basic idea is that it doesn't matter which
  state is restored on a taken exit, as long as it's consistent.

  This is a form of transactional state management. Every snapshot
  is a commit; a taken exit causes a rollback to the last commit.
  The on-trace state may advance beyond the last commit as long as
  this doesn't affect the possibility of a rollback. In practice
  this means that all on-trace updates to the state (non-scalar
  stores that are not sunk) need to force a new snapshot for the
  next exit.

  Otherwise the trace recorder only generates a snapshot after
  control-flow constructs that are present in the source, too.
  Guards that have a low probability of being wrongly predicted do
  not cause snapshots (e.g. function dispatch). This further
  reduces the snapshot density. Sparse snapshots also improve
  on-trace code quality, because they reduce the live range of the
  results of intermediate computations. Scheduling decisions can
  be made over a longer stream of instructions, too.

  [It's planned to switch to compressed snapshots. 2D-compression
  across snapshots may be able to remove even more redundancy.]
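
The hashed profile counters described in the list above can be sketched roughly as follows. The table size (64 16-bit counters, i.e. two 64-byte cache lines), the threshold and the hash function are assumptions chosen for this example, not values taken from the email.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed sizes for the sketch: 64 x 16-bit counters = 128 bytes.
constexpr std::size_t kHotCountSize = 64;
constexpr std::uint16_t kHotThreshold = 56;  // hypothetical trip count
static_assert(kHotCountSize * sizeof(std::uint16_t) == 128, "two 64-byte lines");

struct HotCounters {
    std::array<std::uint16_t, kHotCountSize> count{};

    HotCounters() { count.fill(kHotThreshold); }

    // Hash a bytecode address down to a slot. Collisions are simply ignored,
    // so the profile is imprecise by design.
    static std::size_t slot(std::uintptr_t pc) { return (pc >> 2) % kHotCountSize; }

    // Called when a loop or call bytecode executes. Returns true when the
    // counter runs out, i.e. when a root trace should be started.
    bool tick(std::uintptr_t pc) {
        std::uint16_t& c = count[slot(pc)];
        if (--c == 0) {
            c = kHotThreshold;  // re-arm the counter
            return true;
        }
        return false;
    }
};
```

Two addresses that hash to the same slot share a counter, which can only make a loop look hot too early. As the email notes, the region selection then weeds out most of these false positives.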

Optimizations:

- Hash slot specialization: Hash table lookup for constant keys is
  specialized to the predicted hash slot. This avoids a loop to
  follow the hash chain. Pseudocode:

    HREFK:  if (hash[17].key != key) goto exit
    HLOAD:  x = hash[17].value
    -or-
    HSTORE: hash[17].value = x

  HREFK is shared by multiple HLOADs/HSTOREs and may be hoisted
  independently. The verification of the prediction (HREFK) is
  moved out of the dependency chain by a super-scalar CPU. This
  makes hash lookup as cheap as array lookup with minimal complexity.

  It also avoids all the complications (cache invalidation,
  ordering constraints, shape mismatches) associated with hidden
  classes (V8) or shape inference/property caching (TraceMonkey).

- Code hoisting via unrolling and copy-substitution (LOOP):
  Traditional loop-invariant code motion (LICM) is mostly useless
  for the IR resulting from dynamic languages. The IR has many
  guards and most subsequent instructions are control-dependent on
  them. The first non-hoistable guard would effectively prevent
  hoisting of all subsequent instructions.

  The LOOP pass does synthetic unrolling of the recorded IR,
  combining copy-substitution with redundancy elimination to
  achieve code hoisting. The unrolled and copy-substituted
  instructions are simply fed back into the compiler pipeline,
  which allows reuse of all optimizations for redundancy
  elimination. Loop recurrences are detected on-the-fly and a
  minimized set of PHIs is generated.

- Narrowing of numbers to integers: Predictive narrowing is used
  for induction variables. Demand-driven narrowing is used for
  index expressions using a backpropagation algorithm.

  This avoids the complexity associated with speculative, eager
  narrowing, which also causes excessive control-flow dependencies
  due to the many overflow checks. Selective narrowing is better
  at exploiting the combined bandwidth of the FP and integer units
  of the CPU and avoids clogging up the branch unit.
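
The hash slot specialization item earlier in this list lends itself to a small sketch. Everything here is hypothetical (the Node layout, the toy hash and the names find_slot and href_k); it only contrasts a generic chained lookup with a guarded access to a predicted slot.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Node {
    std::string key;
    double value;
    int next;  // index of the next node in the collision chain, -1 = end
};

// Toy hash for the sketch: sum of the characters modulo the table size.
inline std::size_t main_position(const std::string& key, std::size_t n) {
    std::size_t h = 0;
    for (unsigned char c : key) h += c;
    return h % n;
}

// Generic lookup: start at the main position and follow the chain.
inline int find_slot(const std::vector<Node>& t, const std::string& key) {
    int i = static_cast<int>(main_position(key, t.size()));
    while (i != -1) {
        if (t[static_cast<std::size_t>(i)].key == key) return i;
        i = t[static_cast<std::size_t>(i)].next;
    }
    return -1;
}

// HREFK analogue: one compare against the slot predicted at recording time.
// A null result stands in for "guard failed, take the trace exit".
inline Node* href_k(std::vector<Node>& t, std::size_t predicted, const std::string& key) {
    if (predicted < t.size() && t[predicted].key == key) return &t[predicted];
    return nullptr;
}

// Hand-built table: "x" and "|" collide at slot 0, so "|" is chained into slot 2.
inline std::vector<Node> demo_table() {
    return {{"x", 1.0, 2}, {"y", 2.0, -1}, {"|", 3.0, -1}, {"", 0.0, -1}};
}
```

In this sketch the recorder would call find_slot once, remember the result, and the compiled code would then use only href_k followed by a plain load or store on the returned node.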

Register allocation:

- Blended cost-model for R-LSRA: The reverse-linear-scan register
  allocator uses a blended cost model for its spill decisions.
  This takes into account multiple factors (e.g. PHI weight) and
  benefits from the special layout of IR references (constants
  before invariant instructions, before variant instructions).

- Register hints: The register allocation heuristics take into
  account register hints, e.g. for loop recurrences or calling
  conventions. This is very cheap to implement, but improves the
  allocation decisions considerably. It reduces register shuffling
  and prevents unnecessary spills.

- x86-specific improvements: Special heuristics for move vs.
  rename produce close to optimal code for two-operand machine
  code instructions.

  Fusion of memory operands into instructions is required to
  generate high-quality x86 code. Late fusion in the backend
  allows better, local decisions, based on actual register
  pressure, rather than estimates of prior stages.

Ok, that's it! Sorry for the length of this posting, but I hope it
was at least informative to someone out there.

--Mike

Wednesday 14 September 2016

Xcode 8 notes

Just installed Xcode 8. Let's see what Apple have in store. Nervous about installing Xcode as anything pre 7.x was patchy. The 4-5 series were just alphas and should never have replaced 3.x.

First thing noticed, the font has changed! Ok, we'll roll with that. GUI looks largely the same. Try some editing: comment in/out shortcut not working.

Fixing comment shortcut

Other users were having the same problem in the betas. Fix kudos to Chris Hanson from Twitter. From terminal:
sudo /usr/libexec/xpccachectl
And you have to restart your Mac. Just logging out and in won't work.

I think plug-ins are used in Xcode 8. XPC appears to be an IPC protocol, part of Grand Central Dispatch. So I guess this is how the plug-ins talk to Xcode. Ok so that looks fixed.

Still no column display

For something that claims to be a source code editor, it is pretty strange that the column number isn't displayed anywhere. You can get line number from the sidebar, but not column, which is useful for commenting and layout. Will update issue ticket for this.


Wednesday 31 August 2016

Ponder Design Review


Review of function features

The Ponder library is a fork of CAMP and takes its design decisions from there. The biggest change was the removal of Boost, which should leave the functionality of the API unchanged.

CAMP has some interesting features, e.g. in Function we can assign a function callback to test whether a function is currently callable.

  /**
   * \brief Set the callable state of the current function with a dynamic value
   *
   * function can be any C++ callable type, and will be called to return the
   * callable state of the function each time it is requested. This way, the callable
   * state of a function can depend on metaclass instances.
   *
   * \param function Function to call to get the callable state of the function
   *
   * \return Reference to this, in order to chain other calls
   */
  template <typename F>
  ClassBuilder<T>& callable(F function);


I'm not sure of the rationale of some of the features. This is a feature I have not used. There are other features, like parent-child user objects, again unused. These features may have a use in a particular application, but they might not be viewed as widely used. So, perhaps they should not be so tightly coupled with the function data.

Design is choice

As Andrei Alexandrescu says, "Design is choice". There may be many solutions to a problem, but the design is the one you chose.

In CAMP the data of an object is mixed with its use, e.g. function data also contains methods to call the function. There may be different ways in which we want to call the function. The current call method takes a dynamic array of arguments which are value types. This is quite inefficient, along with the value mapping that occurs, where many of the objects may be copied.

CAMP call behaviour has several particular traits:
  • Coercion of values through ValueMapper.
  • Calling with dynamic array of values.
  • Ability to block calls (callable).
It might be best to separate the call behaviour from the type information. This way calling, and any other uses of the type, can be customised for its use. This is a significant change away from CAMP.
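
As a rough sketch of that separation (all names here are hypothetical and this is not the Ponder or CAMP API), the type data can be a plain immutable record, with each way of calling the function layered on top as a separate object:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Immutable description of a function: what it is, not how it is called.
struct FunctionInfo {
    const std::string name;
    const std::size_t arity;
};

// One possible "use" of the type data: a dynamic caller working on doubles.
// Other callers (e.g. a Lua binding) could sit beside it without touching
// FunctionInfo.
class DynamicCaller {
public:
    using Impl = std::function<double(const std::vector<double>&)>;

    // The FunctionInfo must outlive the caller; it is only referenced.
    DynamicCaller(const FunctionInfo& info, Impl impl)
        : info_(info), impl_(std::move(impl)) {}

    // The ability to block calls lives with the caller, not the type data.
    void setCallable(std::function<bool()> pred) { callable_ = std::move(pred); }

    // Returns false if the argument count is wrong or the call is blocked.
    bool call(const std::vector<double>& args, double& result) const {
        if (args.size() != info_.arity) return false;
        if (callable_ && !callable_()) return false;
        result = impl_(args);
        return true;
    }

private:
    const FunctionInfo& info_;
    Impl impl_;
    std::function<bool()> callable_;
};
```

With this shape a cheaper, statically typed caller could be added for hot paths while the dynamic one remains available for scripting.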

I am currently extending Ponder with Lua scripting ability. This has been complicated by the Ponder value mapping and its difficulty in dealing with the ambiguity of references.

Type data is immutable

Type information is static. It is baked into the program at compile-time. The Ponder types should reflect this. Any uses of the data should refer to the data, but not modify it.


Ponder: what is reflection?

This is a discussion of the current state of Ponder and thoughts on future changes.

What is reflection?

Wikipedia states:
In computer science, reflection is the ability of a computer program to examine, introspect, and modify its own structure and behavior at runtime
and uses are:
... observing and modifying program execution at runtime. A reflection-oriented program component can monitor the execution of an enclosure of code and can modify itself according to a desired goal related to that enclosure. This is typically accomplished by dynamically assigning program code at runtime. 
In object-oriented programming languages such as Java, reflection allows inspection of classes, interfaces, fields and methods at runtime without knowing the names of the interfaces, fields, methods at compile time. It also allows instantiation of new objects and invocation of methods. 
Reflection can be used to adapt a given program to different situations dynamically. Reflection-oriented programming almost always requires additional knowledge, framework, relational mapping, and object relevance in order to take advantage of more generic code execution.
Features we might expect are:

Type Introspection

The ability to introspect a program type. E.g. see what type it is and which members it contains. This might be useful for runtime data binding, e.g. loading an XML file and assigning the values to class members based on element name matches.

Some C++ reflection systems offer this data automatically by parsing the symbols in a compiled C++ file. Ponder does not offer this, and there is some discussion of this in a previous post. It is generally thought that you do not want to export all data, and that sometimes the data needs annotating in order to remove ambiguity. For example, functions returning references: should the values be copied or kept as references?

Self Modification

Since C++ is statically compiled, self-modifying code might be limited to setting pointers and callbacks to a chosen type. More might be possible by implementing a runtime dynamic C++ compiler, but that is complicated, and also likely something you wouldn't want to distribute with your program. A more popular way would be to customise behaviour with data, or use an embeddable scripting language, perhaps with dynamic features, e.g. Lua.


Tuesday 16 August 2016

Gwork Continuous Integration

I added a Null renderer to Gwork, i.e. one that doesn't draw anything. This makes it easier to do things like cross-platform build testing. We might check several different configs of the build without having to link to a graphical API.
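
The idea can be sketched like this. The interface below is a cut-down, hypothetical one for illustration (Gwork's real renderer base class is larger), and the call counter is only there so a test has something to check:

```cpp
#include <cassert>
#include <string>

struct Rect { int x, y, w, h; };

// Hypothetical minimal renderer interface.
class Renderer {
public:
    virtual ~Renderer() = default;
    virtual void drawFilledRect(const Rect& r) = 0;
    virtual void renderText(int x, int y, const std::string& text) = 0;
};

// Draws nothing, so it needs no graphics API at link time.
class NullRenderer final : public Renderer {
public:
    void drawFilledRect(const Rect&) override { ++calls_; }
    void renderText(int, int, const std::string&) override { ++calls_; }
    int calls() const { return calls_; }

private:
    int calls_ = 0;
};

// Control code only sees the interface, so it runs unchanged on a CI machine.
inline void drawButton(Renderer& r, const std::string& label) {
    r.drawFilledRect(Rect{0, 0, 80, 24});
    r.renderText(4, 4, label);
}
```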

This is useful because if anyone submits any patches to Gwork they will be tested in the pull-request queue. Users can also add test builds to their own Travis accounts so they can see if their fork is building.

Travis

For Linux and MacOS (OSX) builds I used Travis. It is a free service, so I can't complain too much, but it took a considerable amount of fiddling around to get Linux builds working. I won't bore you with the details.




Monday 15 August 2016

Gwork memory allocation stats

I added memory allocation tracking to Gwork as I'd like to keep track of the number and size of allocations. A CSV file is generated, which is then parsed to produce a report. The following are tables from the report.
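
A minimal sketch of this kind of tracker is shown below. It is not the Gwork implementation: the class name, the per-scope grouping and the CSV columns are assumptions for the example, and in real use record() would be driven from an allocator hook such as a replaced operator new.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

struct AllocStats {
    std::size_t count = 0;  // number of allocations
    std::size_t bytes = 0;  // total bytes requested
};

// Hypothetical tracker: allocations are attributed to the current named scope.
class AllocTracker {
public:
    void beginScope(const std::string& name) { current_ = name; }

    void record(std::size_t size) {
        AllocStats& s = stats_[current_];
        ++s.count;
        s.bytes += size;
    }

    // One CSV row per scope, sorted by name (std::map ordering).
    std::string toCsv() const {
        std::ostringstream out;
        out << "Name,Alloc count,Alloc size\n";
        for (const auto& kv : stats_) {
            out << kv.first << ',' << kv.second.count << ',' << kv.second.bytes << '\n';
        }
        return out.str();
    }

private:
    std::string current_ = "(none)";
    std::map<std::string, AllocStats> stats_;
};
```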

Current state

The current Gwork allocation stats from:

commit a95a0fde3afb68d2fd4d7af817159c4891db970d
AuthorDate: Mon Aug 15 20:26:21 2016 +0100


Name             Alloc count  Alloc size
API test         9018         1566856 (1530.133KB)
Button           53           11024 (10.766KB)
Checkbox         34           6088 (5.945KB)
CollapsibleList  208          39744 (38.812KB)
ColorPicker      281          32680 (31.914KB)
ComboBox         4194         755408 (737.703KB)
CrossSplitter    54           11128 (10.867KB)
GroupBox         13           3928 (3.836KB)
ImagePanel       6            1248 (1.219KB)
Label            45           8168 (7.977KB)
LabelMultiline   36           8736 (8.531KB)
ListBox          252          46040 (44.961KB)
MenuStrip        817          153304 (149.711KB)
Numeric          39           5056 (4.938KB)
PageControl      58           11568 (11.297KB)
ProgressBar      72           9728 (9.500KB)
Properties       608          83240 (81.289KB)
RadioButton      62           12072 (11.789KB)
ScrollControl    602          113640 (110.977KB)
Slider           34           5040 (4.922KB)
StatusBar        10           2000 (1.953KB)
TabControl       181          34016 (33.219KB)
TextBox          234          23080 (22.539KB)
TreeControl      714          117984 (115.219KB)
Window           14           2712 (2.648KB)


GWEN stats

Name             Alloc count  Alloc size
API test         10051        1721899 (1681.542KB)
Button           62           11736 (11.461KB)
Checkbox         35           6152 (6.008KB)
CollapsibleList  229          40952 (39.992KB)
ColorPicker      341          36256 (35.406KB)
ComboBox         4742         792208 (773.641KB)
CrossSplitter    63           15352 (14.992KB)
GroupBox         15           10264 (10.023KB)
ImagePanel       8            1408 (1.375KB)
Label            53           9304 (9.086KB)
LabelMultiline   41           18920 (18.477KB)
ListBox          289          66771 (65.206KB)
MenuStrip        983          203992 (199.211KB)
Numeric          44           5216 (5.094KB)
PageControl      64           12096 (11.812KB)
ProgressBar      73           10016 (9.781KB)
Properties       661          80348 (78.465KB)
RadioButton      68           12592 (12.297KB)
ScrollControl    574          105864 (103.383KB)
Slider           34           4912 (4.797KB)
StatusBar        11           2144 (2.094KB)
TabControl       182          41952 (40.969KB)
TextBox          306          37552 (36.672KB)
TreeControl      727          121572 (118.723KB)
Window           16           2856 (2.789KB)

As you can see, the "unit test" from GWEN uses 1681.542KB, with Gwork using 1530.133KB. So that's roughly 160KB smaller. Note, this is from the gwen branch in the Gwork repo.

Comparison

There is a more detailed comparison below.

Name             Count delta  Size delta            % size
API test         -1063        -163051 (-159.229KB)  90.5%
Button           -9           -712                  93.9%
Checkbox         -1           -64                   99.0%
CollapsibleList  -21          -1208 (-1.180KB)      97.1%
ColorPicker      -64          -3752 (-3.664KB)      89.7%
ComboBox         -548         -36800 (-35.938KB)    95.4%
CrossSplitter    -9           -4224 (-4.125KB)      72.5%
GroupBox         -2           -6336 (-6.188KB)      38.3%
ImagePanel       -2           -160                  88.6%
Label            -8           -1136 (-1.109KB)      87.8%
LabelMultiline   -5           -10184 (-9.945KB)     46.2%
ListBox          -37          -20731 (-20.245KB)    69.0%
MenuStrip        -166         -50688 (-49.500KB)    75.2%
Numeric          -5           -160                  96.9%
PageControl      -7           -776                  93.6%
ProgressBar      -1           -288                  97.1%
Properties       -53          2892 (2.824KB)        103.6%
RadioButton      -6           -520                  95.9%
ScrollControl    4            200                   100.2%
Slider           0            128                   102.6%
StatusBar        -1           -144                  93.3%
TabControl       -1           -7936 (-7.750KB)      81.1%
TextBox          -72          -14472 (-14.133KB)    61.5%
TreeControl      -13          -3588 (-3.504KB)      97.0%
Window           -2           -144                  95.0%

There isn't much detail about where the savings come from, but the Unicode string changes will have had an effect. GWEN stored every control string both as wide Unicode and as ASCII. LabelMultiline has a considerable saving, probably due to this. Interesting that a couple are larger than GWEN. Will have to investigate this.

Future Work

This gives a reference point on which to compare any future memory saving work. Work here might include:

  • Event system refactor. This is pretty inefficient as every control contains listeners and the associated containers whether used or not.
  • Type size and reordering. E.g. booleans might be better as chars, enums as chars etc. These might also be more efficient packed in the controls by reordering them.
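
To illustrate the type size and reordering point, here is a small sketch with made-up control fields (these are not Gwork's actual members). Giving the enum a one-byte underlying type and grouping the small fields together removes most of the padding on typical ABIs; the exact sizes are implementation-defined, so the code only compares them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Dock { None, Left, Right, Top, Bottom };                  // int-sized by default
enum class DockSmall : std::uint8_t { None, Left, Right, Top, Bottom };

// Small fields interleaved with ints: padding after each bool.
struct ControlLoose {
    bool hidden;
    int  x;
    bool disabled;
    int  y;
    Dock dock;
    bool tabable;
};

// Same information, largest fields first and a one-byte enum.
struct ControlPacked {
    int       x;
    int       y;
    DockSmall dock;
    bool      hidden;
    bool      disabled;
    bool      tabable;
};

// Bytes saved per control by the reordered layout (zero if the ABI adds no padding).
inline std::size_t bytesSaved() {
    return sizeof(ControlLoose) - sizeof(ControlPacked);
}
```

Multiplied by every control in a UI this kind of saving adds up, although it needs measuring against the allocation report rather than assuming.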


Tuesday 24 May 2016

Cheatsheet online

Cheatsheet is now online. Previously it ran in a browser on the desktop. Now it is hosted on Github pages.

Moonscript now has syntax highlighting via Highlight.js, which I added.

Thursday 10 March 2016

Ponder Unittests Changed to Catch

Catch

The unit tests have been moved over to Catch from Boost
Test. This is a single header unit testing library. It is convenient as there are no external
library dependencies or downloads for users. (And for basic unit testing, why should you need more
than one header?!)

Additionally, there appeared to be issues with Visual Studio 2015 and Boost, so this should result
in more portability.

Standalone library

This now means that Boost is no longer required at all to use or test Ponder, and makes Ponder a
standalone library. The Boost comparison code has been left in for now, in case there are future
issues. Boost is a very competent library and if our type traits fail it may be useful to make a
comparison to Boost's behaviour, but this is done on demand and is not part of the regular testing.

Wednesday 2 March 2016

Gwork update

Have done a fair amount of refactoring of Gwork of late. This recently went into mainline on Github.

Architecture

Have restructured the source folders. We now have:
        source/
               platform/
               gwork/
               util/
               test/
               designer/
               samples/
This reflects the dependencies in the project, with each parent folder relying on the child folders above it. There are no dependencies from child to parent! Previously things like input and rendering were a little intertwined with the controls. Ideally, once you get to the gwork level everything should be platform agnostic.

Cmake

Gwork now has a cmake build system. Premake just moves too glacially and still doesn't have proper support for mobile platforms like iOS. Cmake is a little intimidating at first, but worth it due to its capabilities and range of outputs. It has full support for Ninja, which is pretty zippy, and handy when testing.

Renaming

I finally got round to renaming GWEN to Gwork (or "Gwk" namespace in the project). It has changed pretty significantly with the directory structure refactoring and is worthy of a rename. Don't want confusion between the original project and this one.

Ponder - C++ Reflection

Preamble

Some time ago I asked a Stack Overflow question: "How can I add reflection to a C++ application?" I posed this in a non-specific way as there are many ways to do this, depending on your application. As you can see, over time, many varied answers have appeared. Also, in the interim, C++11 has appeared, and reflection is still not a part of the C++ language specification; it is questionable whether it will ever be built in to the language, for multiple reasons.

This reflection project is more of an itch that needs scratching than a concrete problem. I've written serialisation code in the past, and looked at other solutions, so this is to try and solve those problems and try to be a more general solution to introspection within an application.

The Problem

Another user posed the question "why does C++ not have reflection?" to which jalf writes this informative reply:
There are several problems with reflection in C++. 
  • It's a lot of work to add, and the C++ committee is fairly conservative, and don't spend time on radical new features unless they're sure it'll pay off. (A suggestion for adding a module system similar to .NET assemblies has been made, and while I think there's general consensus that it'd be nice to have, it's not their top priority at the moment, and has been pushed back until well after C++0x. The motivation for this feature is to get rid of the #include system, but it would also enable at least some metadata). 
  • You don't pay for what you don't use. That's one of the most basic design philosophies underlying C++. Why should my code carry around metadata if I may never need it? Moreover, the addition of metadata may inhibit the compiler from optimizing. Why should I pay that cost in my code if I may never need that metadata? 
  • Which leads us to another big point: C++ makes very few guarantees about the compiled code. The compiler is allowed to do pretty much anything it likes, as long as the resulting functionality is what is expected. For example, your classes aren't required to actually be there. The compiler can optimize them away, inline everything they do, and it frequently does just that, because even simple template code tends to create quite a few template instantiations. The C++ standard library relies on this aggressive optimization. Functors are only performant if the overhead of instantiating and destructing the object can be optimized away. operator[] on a vector is only comparable to raw array indexing in performance because the entire operator can be inlined and thus removed entirely from the compiled code. C# and Java make a lot of guarantees about the output of the compiler. If I define a class in C#, then that class will exist in the resulting assembly. Even if I never use it. Even if all calls to its member functions could be inlined. The class has to be there, so that reflection can find it. Part of this is alleviated by C# compiling to bytecode, which means that the JIT compiler can remove class definitions and inline functions if it likes, even if the initial C# compiler can't. In C++, you only have one compiler, and it has to output efficient code. If you were allowed to inspect the metadata of a C++ executable, you'd expect to see every class it defined, which means that the compiler would have to preserve all the defined classes, even if they're not necessary. 
  • And then there are templates. Templates in C++ are nothing like generics in other languages. Every template instantiation creates a new type. std::vector<int> is a completely separate class from std::vector<float>. That adds up to a lot of different types in an entire program. What should our reflection see? The template std::vector? But how can it, since that's a source-code construct, which has no meaning at runtime? It'd have to see the separate classes std::vector<int> and std::vector<float>. And std::vector<int>::iterator and std::vector<float>::iterator, same for const_iterator and so on. And once you step into template metaprogramming, you quickly end up instantiating hundreds of templates, all of which get inlined and removed again by the compiler. They have no meaning, except as part of a compile-time metaprogram. Should all these hundreds of classes be visible to reflection? They'd have to, because otherwise our reflection would be useless, if it doesn't even guarantee that the classes I defined will actually be there. And a side problem is that the template class doesn't exist until it is instantiated. Imagine a program which uses std::vector<int>. Should our reflection system be able to see std::vector<int>::iterator? On one hand, you'd certainly expect so. It's an important class, and it's defined in terms of std::vector<int>, which does exist in the metadata. On the other hand, if the program never actually uses this iterator class template, its type will never have been instantiated, and so the compiler won't have generated the class in the first place. And it's too late to create it at runtime, since it requires access to the source code. 
  • And finally, reflection isn't quite as vital in C++ as it is in C#. The reason is again, template metaprogramming. It can't solve everything, but for many cases where you'd otherwise resort to reflection, it's possible to write a metaprogram which does the same thing at compile-time. boost::type_traits is a simple example. You want to know about type T? Check its type_traits. In C#, you'd have to fish around after its type using reflection. Reflection would still be useful for some things (the main use I can see, which metaprogramming can't easily replace, is for autogenerated serialization code), but it would carry some significant costs for C++, and it's just not necessary as often as it is in other languages.
This is an excellent summary of the problems with adding reflection to C++. In essence, C++ may be transformed significantly from the source to the compiled product. Also, we may not want to reflect exactly what is in the source.
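
Two of the points in the quoted answer are easy to check in code. The sketch below uses only the standard library; the helper names describe and distinctAtRuntime are made up for the example. It shows that each template instantiation is an unrelated type, and that a type can be queried at compile time through type traits without any runtime reflection.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <typeinfo>
#include <vector>

// Each instantiation is its own, unrelated type.
static_assert(!std::is_same<std::vector<int>, std::vector<float>>::value,
              "vector<int> and vector<float> are distinct classes");
static_assert(!std::is_same<std::vector<int>::iterator,
                            std::vector<float>::iterator>::value,
              "and so are their iterators");

// The runtime type information the compiler does emit agrees.
inline bool distinctAtRuntime() {
    return typeid(std::vector<int>) != typeid(std::vector<float>);
}

// "You want to know about type T? Check its type_traits."
template <typename T>
constexpr const char* describe() {
    return std::is_integral<T>::value         ? "integral"
           : std::is_floating_point<T>::value ? "floating point"
           : std::is_class<T>::value          ? "class"
                                              : "other";
}
```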

Current solutions

Some of the solutions I considered were:

Macros

C style macro solutions feature in the above Stack Overflow question. The solution offered here uses Boost. I wanted to avoid macros because anything non-trivial (e.g. list iteration) quickly becomes complicated. Also, with the new features in C++11, like variadic templates, it is possible to do something more elegant in C++. This can also make debugging easier, as finding errors in the middle of a nested macro that uses complicated C++ can get hairy.

Qt

This is an excellent framework, and I generally enjoy using it for GUI work. It has a form of markup for performing reflection. However, it is not a general solution as you are tied to Qt, and its licensing model (GPL/LGPL).

Reflex

This is an interesting way of doing reflection: you get a compiler to generate the metadata for you, in this case, gcc-xml. Not a bad solution, and the generator does the leg-work for you. Have to be careful to keep the generated metadata up to date with the program. For sizeable applications, the metadata can get large, and time consuming to generate and parse. Licensing is LGPL.

Reflect

This uses macros (see above). Also overloads "RTTI", the compiler version of runtime introspection. MIT licence.

Classdesc

A mature solution, but one that consequently comes with some cruft. Several large applications depend on it so unlikely that it will change significantly. MIT licence.

CAMP

This is a nicely engineered library, aiming to be general purpose with solid cmake build system. It relies on Boost for type trait information and other utilities. Initially this was LGPL but then relaxed to MIT. Project now retired by authors.

Requirements

My requirements were:
  • Use C++11, due to better template and type support.
  • Avoid Boost. Great library, but leads to bloated compile times.
  • Avoid macros.
  • Liberal licence. LGPL is impractical when you require non-shared (static) libraries, which may be common when using tightly coupled information like reflection.

Ponder

I decided on CAMP as it fit my requirements the best. I forked it on Github and subsequently renamed it as CAMP has been retired. The new name is Ponder, i.e.
ponder, synonyms: reflect on
The Boost dependency has been removed for the reflection library, although Boost unit testing is still used. I tried to simplify the library as much as possible, using variadic templates to remove longhand template argument lists, using C++11 type traits, etc. Also added a Jekyll website on Github pages to support a project blog, documentation, and discussion (via Disqus).
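
As an example of the kind of simplification variadic templates allow, here is a small function-traits sketch. The names FunctionTraits and arityOf are hypothetical and not taken from the Ponder source. Before C++11 a library typically needed one specialisation per argument count; a parameter pack covers them all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Primary template, left undefined: only function pointers are supported here.
template <typename F>
struct FunctionTraits;

// One specialisation replaces the longhand list for 0, 1, 2, ... arguments.
template <typename R, typename... Args>
struct FunctionTraits<R (*)(Args...)> {
    using ReturnType = R;
    static constexpr std::size_t arity = sizeof...(Args);
};

// The same information deduced directly from a function pointer.
template <typename R, typename... Args>
constexpr std::size_t arityOf(R (*)(Args...)) {
    return sizeof...(Args);
}

// Two sample functions to inspect.
inline int add(int a, int b) { return a + b; }
inline std::string greet() { return "hi"; }

static_assert(std::is_same<FunctionTraits<decltype(&add)>::ReturnType, int>::value,
              "return type is recovered from the signature");
```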

The plan next would be to use the API to support its original aims. It can be...
used to expose and edit objects' attributes into a graphical user interface. It can also be used to do automatic binding of C++ classes to script languages such as Python or Lua. Another possible application would be the serialization of objects to XML, text or binary formats. Or you can even combine all these examples to provide a powerful and consistent interface for manipulating your objects outside C++ code

[Jun-2016] Boost unit testing no longer used. Catch used instead.