The Standard Librarian: File-Based Containers
Matt Austern
http://www.cuj.com/experts/1907/AUSTERN.HTM topic=experts
-------------------------------------------------- ------------------------------
Looking for a way to create a file-based container? You may have to look outside the Standard C++ library, and memory mapping may be one answer.
The Standard C++ library makes it easy to combine file I/O with generic algorithms. For a structured file in which you always read (or write) the same data type, you can treat the file as a sequence of values and access it through istream_iterator (or ostream_iterator). When you treat a file as a sequence of characters, you can access it through istreambuf_iterator (or ostreambuf_iterator). So, for example, you can combine istreambuf_iterator with an STL generic algorithm to perform a "y to k" conversion:
    std::ifstream in("f1");
    std::ofstream out("f2");
    std::replace_copy(std::istreambuf_iterator<char>(in),
                      std::istreambuf_iterator<char>(),
                      std::ostreambuf_iterator<char>(out),
                      'y', 'k');
When you turn to more complicated (and less contrived) examples, you start to run into problems. You can use std::find() to find the first occurrence of a character c, but what do you do when you need to find a multi-character string? That is what std::search() is for, but you can't use it: its arguments must be Forward Iterators, and istreambuf_iterator is only an Input Iterator.
This is not an arbitrary restriction. Think about how you would implement std::search(). If you are looking for a string in an input sequence and you find a partial match, you can't simply continue from the place where you noticed the mismatch: in general, you have to back up some number of characters and start again. (Consider looking for "ababac" in "abababac".) There are clever algorithms that avoid unnecessary backtracking, but none that avoid backing up altogether. Returning to a saved "bookmark" is exactly what you can't do with an Input Iterator such as istreambuf_iterator: input consumes characters, and once you have read a character, you have read it.
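The "ababac" example can be tried directly on a container whose iterators do allow backing up. This is a small sketch; find_offset is a hypothetical helper name, and std::string stands in for any sequence with at least Forward Iterators.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Returns the offset of the first occurrence of needle in haystack,
// or haystack.size() if there is none. std::string iterators are
// random access iterators, so std::search is free to back up.
std::size_t find_offset(const std::string& haystack, const std::string& needle)
{
    std::string::const_iterator pos =
        std::search(haystack.begin(), haystack.end(),
                    needle.begin(), needle.end());
    return static_cast<std::size_t>(pos - haystack.begin());
}
```

For "abababac" and "ababac" the result is offset 2: the attempt at offset 0 fails only on its sixth character, and the search then has to resume at a position it has already read past.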
This is a common problem with I/O: often you don't know that you have read enough until you have read too much. If you are reading an integer and the input is "123x", you can't know that you have reached the end of the integer until you have consumed the unwanted character 'x'. There are several common solutions.
First, you can transform the algorithm (by using auxiliary storage, by changing the problem, or by accepting restrictions on arguments or error handling) so that it requires nothing beyond Input Iterator semantics. For example, std::time_get in the Standard C++ library uses an Input Iterator-based form of string matching to read the names of days and months. It is not as flexible or robust as std::search(), but it is good enough for that purpose.
Second, you can simply read more characters than you want and then put the unwanted characters back into the input stream. If you are reading characters through a streambuf, you can use streambuf::sputbackc() to "unread" a character. This technique is usually easy and efficient, but it has limitations (you can't rely on putting back more than one character), and there is no way to integrate it into the iterator interface of istream_iterator or istreambuf_iterator.
Third, you can read part of the input file into an internal buffer and then work with that buffer instead of working directly with the input stream. You have to be careful when you cross a buffer boundary, but that is not always necessary: for example, it is often possible to process one line at a time.
All of these techniques are useful, but none is completely satisfactory from the point of view of combining I/O with generic algorithms. They all require changes to the algorithm, and one of them prevents you from using the iterator interface at all. You may be able to write a putback-based algorithm that is exactly equivalent to a Forward Iterator-based algorithm, but you have to write the algorithm twice.
Using an internal buffer is easy if you never need to cross a buffer boundary, and you can guarantee that by using a buffer large enough to hold the entire file. It takes only a few lines of code:
    std::ifstream in("f1");
    std::istreambuf_iterator<char> first(in);
    std::istreambuf_iterator<char> last;
    std::vector<char> buf(first, last);
This buf provides Random Access Iterators. But the technique is not very satisfactory: it works well when f1 is a small file, and it becomes infeasible as the file grows.
A file on disk is (on the vast majority of operating systems) just a sequence of characters, which looks a lot like a container. Why can't we find a way to present it as a container, without reading the whole thing into memory?
Memory Mapping
Of course we can. Modern operating systems provide a much richer set of I/O primitives than those available in the C or C++ standard libraries. Some of them can be fitted into the framework of the Standard C++ library.
Most of today's operating systems allow memory-mapped I/O: associating a file with a region of memory, so that reads from (writes to) that memory are translated into reads from (writes to) the file. To the program, this memory looks like any other memory. You refer to it with a pointer, and you can use all the usual pointer operations. Dereferencing a pointer into the memory-mapped region yields the corresponding value in the file, and assigning through the pointer writes the corresponding value to the file.
It should not be surprising that this is possible: it is not very different from the way virtual memory is managed. When your program refers to an address, that address sometimes translates to a location in physical memory and sometimes to a page that has been swapped out to disk. The operating system moves memory pages between physical memory and swap space without the programmer's intervention. You can think of memory mapping as an interface that exposes part of this mechanism. Why do operating systems provide memory mapping? The basic answer is performance.
- Memory mapping can save space. Mapping a file into memory lets you access the file as if you had read it into an array (it is part of your program's address space), but you don't have to allocate physical memory for that array.
- Memory mapping can save time. When you perform ordinary file I/O with istream::read() or ostream::write(), the operating system uses internal buffers of its own that are associated with the physical hardware, but the data must then be copied between those buffers and another buffer in your program. Memory mapping avoids this extra copy because it exposes the operating system's internal buffers. In tests of a simple file copy, Stevens [1] found that memory mapping can sometimes provide a speedup of a factor of two.
Apart from performance, memory-mapped I/O also gives us a simple container-like interface, and that is exactly what we want. Memory mapping lets us access the contents of a file through pointers. Pointers are iterators, so all we have to do is write a wrapper class that encapsulates memory mapping and provides the interface required of containers (Table 65 of the C++ Standard, plus Table 66, since our container will provide random access iterators).
The first requirement in Table 65 is that every container class X must contain a typedef, X::value_type, that names the type of the container's elements. What value_type should we choose for this container? One possible answer, following the example of std::vector and std::list, is that we should write a templated container class to allow a completely general value_type. For this example, however, that choice would cause problems.
- Classes with nontrivial constructors would be a problem. After all, the whole point of constructors is that an object should not be treated as a sequence of bytes. In general, for example, you can't copy objects with memcpy().
- Even when the constructors are trivial, classes that contain pointers are a problem. When you memory-map a file, the operating system chooses the base address. When you write an object to disk and read it back in again, its pointers will no longer point to the right place.
- Even built-in data types, like long and float, are a problem. The binary representation of the built-in types varies from system to system; if you write a float as a raw sequence of bytes and then read it back on a different system, it is unlikely to give you the right answer.
Of course, all of these problems can be solved: the key is to store objects in a memory-mapped container by means of some kind of serialization protocol (using offsets instead of pointers, for example) rather than a byte-for-byte copy. But serialization and file I/O are separate problems; it makes more sense to build a persistent storage system on top of a low-level I/O library than to mix the two together. A file is a sequence of bytes, so that is what our container will provide. It will be a non-template class, and its value_type will be char.
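As an illustration of what such a serialization layer does, here is a small sketch that stores a 32-bit value in a fixed byte order so that it reads back the same way on any system. The helper names put_u32 and get_u32 and the choice of least-significant-byte-first order are assumptions made for this example; the article does not define a protocol.

```cpp
#include <cstdint>

// Hypothetical helpers: store a 32-bit value as four bytes, least
// significant byte first, regardless of the host's native byte order.
void put_u32(unsigned char* out, std::uint32_t v)
{
    out[0] = static_cast<unsigned char>(v & 0xFF);
    out[1] = static_cast<unsigned char>((v >> 8) & 0xFF);
    out[2] = static_cast<unsigned char>((v >> 16) & 0xFF);
    out[3] = static_cast<unsigned char>((v >> 24) & 0xFF);
}

std::uint32_t get_u32(const unsigned char* in)
{
    return  static_cast<std::uint32_t>(in[0])
         | (static_cast<std::uint32_t>(in[1]) << 8)
         | (static_cast<std::uint32_t>(in[2]) << 16)
         | (static_cast<std::uint32_t>(in[3]) << 24);
}
```

The same encoding can carry an offset from the start of the file in place of a pointer, which stays meaningful whatever base address the operating system picks for the mapping.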
The next question is how to handle reading and writing. When you open a file, you supply a flag that says whether it is opened read/write or read-only. That difference is not reflected in the type: you can try to write to a file that was opened read-only, and the attempt will fail at run time. For a container, however, that is not a reasonable choice. Containers provide iterators, and an iterator is either mutable or constant. We can't provide mutable iterators into a read-only file. So if we support both read-only and read/write access, we need two classes with different type declarations. For now I will discuss only the read-only class, ro_file.
Once we have made these decisions, it is easy to satisfy everything that Table 65 requires. First come the typedefs: ro_file::value_type is char, ro_file::iterator and ro_file::const_iterator are both const char*, ro_file::reverse_iterator is std::reverse_iterator<const char*>, and so on. The class declaration appears in Listing 1.
All the real work is done in the constructor; the other member functions are built on the private member variables base and length. The member functions that do the real work are not inline, so their definitions do not appear in the class declaration. They are also highly system-dependent. Unix and Windows (and other operating systems) support memory mapping, but they support it in different ways. I will show a Unix-based implementation, which was tested under Linux.
The basic idea is simple: the constructor takes a file name. It opens the file with the low-level system call open(), finds the length of the file, and then maps the entire file into memory with mmap(). Any of these operations can fail, so we check the return values and throw an exception when necessary. Finally, we close the file. Under Unix, a file does not have to stay open to stay mapped: the mapping persists until you call munmap().
The only tricky part is assignment and copying. What should their semantics be? One possible answer is that assignment should replace the contents of one file with the contents of another, and that copy construction should create and open a new file. But those answers are not quite right: it would be odd for an otherwise immutable class to have one mutating operation, and odder still to have a file copy operation that you can't call copy(). We will provide copying and assignment, but what is copied and assigned is the handle, not the file. When an ro_file is copied, we map the file a second time. The two objects will have the same contents. (This comes close to the edge of what the Standard allows for containers, since two different containers are not supposed to share the same objects, but it does not really cross it. Our container is immutable and its value_type is char; there is no way to distinguish two distinct immutable char objects from a single immutable char object that appears at two different addresses.)
The complete Unix implementation is shown in Listing 2.
Limitations
Memory mapping is a useful technique, but it is not always appropriate.
The most obvious limitation of ro_file is in its name: it is read-only. What would an rw_file class look like? That is not entirely clear. Assignment and copying would probably have to change: mapping the same file twice is reasonable for an immutable container, but not for a mutable one. (It would be very strange if changing a value in container c1 caused a visible change in container c2.) It is also unclear how useful rw_file would be. We can use ro_file to avoid the restrictions of Input Iterators when reading from a file, and to avoid the overhead of extra copies between kernel space and user space, but neither point is as compelling for output.
The second limitation is less obvious. Unix programmers are used to thinking of a file as a simple sequence of characters, and of a line as the region between two '\n' characters, but life is not always that simple. Text files on some operating systems have more structure. (Some mainframe operating systems, for example, support text files with fixed-length records.) The C++ Standard, like the C Standard before it, distinguishes between opening a file in text mode and opening it in binary mode; opening a file in text mode means that the standard library automatically takes care of these format requirements.
When you memory-map a file, you get exactly the sequence of bytes stored on disk. That is, memory mapping is essentially limited to binary-mode I/O. If you have a file that you would expect to open in text mode, you must know how your operating system represents text files [2] and perform the conversion yourself.
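For the Windows case described in note [2], the conversion amounts to dropping the '\r' of each "\r\n" pair. The following sketch assumes that this is the only difference between the text and binary views of the file; the helper name crlf_to_lf is hypothetical.

```cpp
#include <string>

// Hypothetical helper: copies the bytes in [first, last) and drops every
// '\r' that is immediately followed by '\n'.
std::string crlf_to_lf(const char* first, const char* last)
{
    std::string out;
    out.reserve(static_cast<std::string::size_type>(last - first));
    for (const char* p = first; p != last; ++p) {
        if (*p == '\r' && p + 1 != last && p[1] == '\n')
            continue;           // skip the '\r'; the '\n' is copied next
        out.push_back(*p);
    }
    return out;
}
```

The function takes a pointer range, so it can be applied directly to the begin() and end() of a mapped file. It does copy the data, which gives up part of the benefit of mapping.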
Third, this class does not use all of the capabilities of memory mapping. In particular, it does not support the common technique of using memory mapping for interprocess communication.
Finally, error reporting with memory mapping is very unfriendly. Because I/O operations are written as pointer accesses, I/O errors show up as memory access violations. This affects output more than input, but even input can fail. Imagine, for example, that while you are reading from a file, another process opens it and truncates it. Memory mapping is best suited to cases where you know that this can't happen, or where you can catch errors that show up as signals. (You could write a proxy class that wraps pointer dereferences, catches signals, and converts them into C++ exceptions, but the overhead would be prohibitive. If you need that form of error reporting, you are better off with something other than memory-mapped I/O.)
Summary
File I/O in the Standard C++ library uses a simple read and write model. This simple model is adequate for many purposes. Modern operating systems, however, provide a richer set of operations: asynchronous I/O, signal-driven I/O, multiplexing, and memory mapping. All of these techniques are used by real programs, but all of them lie outside the Standard C++ library.
Memory mapping is one technique that fits into the framework of the Standard C++ library, because a memory-mapped file looks very much like a container, and for many generic algorithms it is more convenient to treat a file as a container than to access it one character at a time. How to support other kinds of advanced I/O within the architecture of the C++ library is an open question.
Listing 1: Header File for Class ro_file

    #include <cstddef>
    #include <iterator>
    #include <stdexcept>
    #include <string>

    class ro_file {
    public:
        ro_file(const std::string& name);
        ro_file(const ro_file&);
        ro_file& operator=(const ro_file&);
        ~ro_file();

    public:
        typedef char           value_type;
        typedef const char*    pointer;
        typedef const char*    const_pointer;
        typedef const char&    reference;
        typedef const char&    const_reference;
        typedef std::ptrdiff_t difference_type;
        typedef std::size_t    size_type;

        typedef const char*    iterator;
        typedef const char*    const_iterator;
        typedef std::reverse_iterator<const char*> reverse_iterator;
        typedef std::reverse_iterator<const char*> const_reverse_iterator;

        const_iterator begin() const { return base; }
        const_iterator end() const   { return base + length; }
        const_reverse_iterator rbegin() const
            { return const_reverse_iterator(end()); }
        const_reverse_iterator rend() const
            { return const_reverse_iterator(begin()); }

        const_reference operator[](size_type n) const {
            return base[n];
        }
        const_reference at(size_type n) const {
            if (n >= size())
                throw std::out_of_range("ro_file");
            return base[n];
        }

        size_type size() const     { return length; }
        size_type max_size() const { return length; }
        bool empty() const         { return size() == 0; }

        void swap(ro_file&);

    private:
        std::string file;    // name of the mapped file
        char*       base;    // start of the mapping
        size_type   length;  // number of bytes mapped
    };
    bool operator==(const ro_file&, const ro_file&);
    bool operator< (const ro_file&, const ro_file&);

    inline bool operator!=(const ro_file& x, const ro_file& y)
        { return !(x == y); }
    inline bool operator> (const ro_file& x, const ro_file& y)
        { return y < x; }
    inline bool operator<=(const ro_file& x, const ro_file& y)
        { return !(y < x); }
    inline bool operator>=(const ro_file& x, const ro_file& y)
        { return !(x < y); }

- End of listing -

Listing 2: Unix Implementation of Class ro_file

    #include "ro_file.h"
    #include <algorithm>
    #include <cerrno>
    #include <cstring>
    #include <utility>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    namespace {
        // Maps the named file read-only and returns (base address, length).
        // The helper's name did not survive in this copy of the listing;
        // open_file is a stand-in.
        std::pair<char*, std::size_t> open_file(const std::string& name)
        {
            int fd = open(name.c_str(), O_RDONLY);
            if (fd == -1)
                throw std::runtime_error("Can't open " + name + ": "
                                         + std::strerror(errno));

            // Seeking to the end yields the file length.
            off_t n = lseek(fd, 0, SEEK_END);
            void* p = MAP_FAILED;
            if (n != off_t(-1))
                p = mmap(0, (std::size_t) n, PROT_READ, MAP_PRIVATE, fd, 0);
            close(fd);    // the mapping persists after the file is closed
            if (p == MAP_FAILED)
                throw std::runtime_error("Can't map " + name + ": "
                                         + std::strerror(errno));

            return std::make_pair(static_cast<char*>(p),
                                  static_cast<std::size_t>(n));
        }
    }

    ro_file::ro_file(const std::string& name)
        : file(name), base(0), length(0)
    {
        std::pair<char*, std::size_t> p = open_file(name);
        base = p.first;
        length = p.second;
    }

    // Copying copies the handle: the same file is mapped a second time.
    ro_file::ro_file(const ro_file& c)
        : file(c.file), base(0), length(0)
    {
        std::pair<char*, std::size_t> p = open_file(file);
        base = p.first;
        length = p.second;
    }

    ro_file& ro_file::operator=(const ro_file& c)
    {
        if (this != &c) {
            // Map the new file first, so that a failure leaves *this intact.
            std::string tmp = c.file;
            std::pair<char*, std::size_t> p = open_file(c.file);
            munmap(base, length);
            file.swap(tmp);
            base = p.first;
            length = p.second;
        }
        return *this;
    }

    ro_file::~ro_file()
    {
        munmap(base, length);
    }

    void ro_file::swap(ro_file& c)
    {
        std::swap(file, c.file);
        std::swap(base, c.base);
        std::swap(length, c.length);
    }

    bool operator==(const ro_file& x, const ro_file& y)
    {
        return x.size() == y.size() &&
               std::equal(x.begin(), x.end(), y.begin());
    }

    bool operator<(const ro_file& x, const ro_file& y)
    {
        return std::lexicographical_compare(x.begin(), x.end(),
                                            y.begin(), y.end());
    }

- End of listing -

Notes and References
[1] W. Richard Stevens. Advanced Programming in the Unix Environment (Addison-Wesley, 1992).
[2] Fortunately, the answer is simple on some popular operating systems. On Unix there is no distinction between text and binary files, and on Windows the only issue to worry about is that, if you open a text file in binary mode, you will see that lines end with the two-character sequence "\r\n".