- You have the STXXL containers, which are similar to STL containers. Anyway, an STXXL container can be mapped to (multiple) external file(s) in a transparent way. Files may have different implementations that might be bases on various file systems or even remote storage interface. Examples of containers are:
- vectors: example: stxxl::vector
v(64 * int64(1024 * 1024)); - map: example: stxxl::map
- queue: example: stxxl:queue
- stacks and priority queues
- You can access the containers with STXXL Iterators and the library is taking care of all I/O operation, supporting large files up to dozens of terabytes even if they are mapped on multiple physical disks.
- unsigned int big_size = 1024 * 1024 * 2 * 16 * 16;
typedef stxxl::vectorvec_big;
vec_big my_vec(big_size);
vec_big::iterator big_it = my_vec.begin(); - You can use a collection of STXXL Algorithms, similar to STL ones but for external memory:
- stxxl::generate(v.begin(), v.end(), stxxl::random_number32(), 4);
- stxxl::for_each_m(v.begin(), v.end(), square
(), 4); - ...
- A particular mention goes to sorting algorithms for external memory:
- typedef stxxl::vector
vector_type;
const stxxl::int64 n_records =
stxxl::int64(384) * stxxl::int64(1024 * 1024) / sizeof(my_type);
vector_type v(n_records);
stxxl::random_number32 rnd;
for (vector_type::size_type i = 0; i < v.size(); i++)
v[i]._key = 1 + (rnd() % 0xfffffff); stxxl::sort(v.begin(), v.end(), cmp(), memory_to_use); - This paper describes the library Stxxl : Standard Template Library for XXL Data Sets ; R. Dementiev, L. Kettner, P. Sanders "STXXL: standard template library for XXL data sets" Software: Practice and Experience Volume 38, Issue 6, Pages 589-637, May 2008 DOI: 10.1002/spe.844. See also the technical report.
- Here you have a collection of example codes; performance are pretty good, and some stxxl users have reported that stxxl sort is much faster than UNIX sort. In addition, i found this paper which reports very good performance (Breadth first search on massive graphs). Do not forget to consult the STXXL forum if you need any help.
Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search
Friday, February 13, 2009
Generic programming and out-of-the core computation
STXXL is a C++ library for extra large data sets. The library allows to write generic programs, with an idiom typical of STL, and to leverage the power of out-of-the-core computations for massive data-sets. I played with the library and I loved it.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment