ITEM: G5567L

Questions about file system internals.


Question:

We have an RS/6000-520 running 3.2.? on which we run an application that
creates, copies, moves, and removes a very large number of files and
directories (sometimes > 80,000).  We are observing tremendous slowness
and are inclined to attribute this symptom to the large number of files
we are creating and moving around, causing the directory structures and
i-node tables to grow to huge sizes.  I would like answers to the
following questions to help troubleshoot this problem:

   1) If I keep increasing the number of files and directories
      within a directory, it increases the sizes of the directory
      structure and the i-node table, doesn't it?

   2) If I tried to access a file in a directory which has a very
      large number of files and directories, is the whole dir.
      structure/i-node table loaded into memory before the file
      name is searched?

   3) After the i-node data is loaded into memory, is the search
      that is performed a sequential search?

   4) If the file that I am trying to access is way down the list,
      is it bound to take a lot more time, relatively?

Response:

Answer To Question (1):

Increasing the number of files in a directory DOES increase the size
of the directory (the size reported by ls -ld <directory>). With
respect to indirect blocks, directory data is accessed just like file
data:

   - With size < 32 KB, all of its data blocks are addressed directly
     from the inode (it takes one block read to find the correct data
     block).
   - With size > 32 KB and < 4 MB, it has one level of indirect
     blocks (it takes two block reads to find the correct data block).
   - With size > 4 MB, it has two levels of indirect blocks (it takes
     three block reads to find the correct data block).
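
If it helps while troubleshooting, you can check how large a
directory's data has grown by calling stat() on the directory
itself. The sketch below is only an illustration (the 32 KB / 4 MB
thresholds are the ones quoted above; the program itself is my own,
not AIX code):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(int argc, char *argv[])
    {
        struct stat sb;

        if (argc != 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            return 1;
        }
        if (stat(argv[1], &sb) != 0 || !S_ISDIR(sb.st_mode)) {
            fprintf(stderr, "%s: not an accessible directory\n", argv[1]);
            return 1;
        }

        printf("directory data size: %ld bytes\n", (long)sb.st_size);

        /* Thresholds quoted from the response above. */
        if (sb.st_size <= 32 * 1024)
            printf("direct blocks only: 1 block read to reach a data block\n");
        else if (sb.st_size <= 4L * 1024 * 1024)
            printf("single indirection: 2 block reads to reach a data block\n");
        else
            printf("double indirection: 3 block reads to reach a data block\n");

        return 0;
    }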

Increasing the number of files in a directory DOES NOT increase the
size of the inode table. The inode table has a fixed size which is
calculated based on the amount of real memory available. If you have
PTF U422053 (APAR IX37293) on your system, a different equation is
used to calculate the size of the inode table, but it still depends
on the amount of memory you have.

Note On Information Following Question (1):

The number of inodes on a JFS filesystem DOES increase
dynamically. AIX does not allocate a fixed number of inodes when the
JFS filesystem is created. You will notice that the -i option for
the mkfs command is ignored when a JFS filesystem is created. This
is because JFS allows any data block to be used to hold
inodes. Therefore the number of inodes on a JFS filesystem is limited
only by the number of data blocks (minus any administrative blocks,
of course).

Answer To Question (2):

When a directory is searched, the data blocks for the directory are
read in as needed, just as the data blocks of a file are. Therefore
if I search for the 8th file in the directory, it is unnecessary to
read in the data block containing information on the 80,000th entry.

Answer To Question (3):

YES, the search through a directory is sequential. Therefore to get
to the last file in a directory you must read through the whole
directory. You can use od -c <directory> to observe the actual order
of the entries in the directory.
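
For example, a plain opendir()/readdir() loop returns entries in the
same order the sequential search walks them. A minimal sketch (my own
illustration, not AIX code):

    /* Print directory entries in the order readdir() returns them,
     * which reflects the on-disk order the sequential search walks. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <dirent.h>

    int main(int argc, char *argv[])
    {
        DIR *dirp;
        struct dirent *dp;
        long pos = 0;

        if (argc != 2 || (dirp = opendir(argv[1])) == NULL) {
            perror("opendir");
            return 1;
        }
        while ((dp = readdir(dirp)) != NULL)
            printf("%6ld  %s\n", ++pos, dp->d_name);
        closedir(dirp);
        return 0;
    }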

Answer To Question (4):

The above paragraph DOES mean that it will take much longer
(relatively) to access the last file in a directory with 80,000
entries than to access the last file in a directory with 10,000
entries.

Disclaimer:

Please realize that all the numbers given above are according to the
current implementation. They are all subject to change without
notice. 

Comments On Two Remaining Issues:

1. Although the wording of the opendir() documentation seems to
imply that the whole contents of the directory are read in, this is
not really what happens internally. As indicated above, the data
contained in a directory is accessed just as the data contained in a
file: data blocks are read in as needed. If you do not scan to the
end of a directory via readdir() and seekdir(), the data blocks
containing information for the last entries are not read in. Therefore
opendir() itself is not made extremely expensive by having 80,000
files in the directory. Nevertheless, if you do scan to the end of
the directory it will take a relatively long time, because the search
is sequential and all the data blocks must be read in to get to the
last data block.
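
As an illustration of that cost, the sketch below looks up one name
with readdir() and reports how many entries had to be scanned first;
a name near the end of an 80,000-entry directory forces nearly the
whole directory to be read. The program is my own example, not part
of AIX:

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <dirent.h>

    int main(int argc, char *argv[])
    {
        DIR *dirp;
        struct dirent *dp;
        long scanned = 0;

        if (argc != 3) {
            fprintf(stderr, "usage: %s directory name\n", argv[0]);
            return 1;
        }
        if ((dirp = opendir(argv[1])) == NULL) {
            perror("opendir");
            return 1;
        }
        while ((dp = readdir(dirp)) != NULL) {
            scanned++;
            if (strcmp(dp->d_name, argv[2]) == 0) {
                printf("found %s after scanning %ld entries\n",
                       argv[2], scanned);
                closedir(dirp);
                return 0;
            }
        }
        printf("%s not found after scanning %ld entries\n",
               argv[2], scanned);
        closedir(dirp);
        return 1;
    }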

2. There is no way to compress the directory data. If you create
80,000 files in a directory and the directory grows to size X (as
shown by ls -ld <directory>), it will never be less than size X,
even if you delete all 80,000 files. UNIX supports direct reading of
a directory by applications, but it does not support direct writing
of a directory by applications. Therefore you cannot compress the
entries in a directory to decrease seek time, even if you are willing
to leave the directory size of X unchanged. In addition, you will
notice if you do an od -c <directory> that the file names shown are
truncated after 14 characters even though AIX allows longer
filenames. This is done for historical reasons to support old
programs which don't use opendir() but read the directory data
directly.
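
If you want to see the first point for yourself, a small test program
along these lines (the scratch directory name and the file count are
arbitrary choices of mine) shows that the directory's size does not
shrink after its entries are deleted:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Return the directory's data size in bytes, or -1 on error. */
    static long dirsize(const char *path)
    {
        struct stat sb;
        return stat(path, &sb) == 0 ? (long)sb.st_size : -1;
    }

    int main(void)
    {
        char name[64];
        int i, fd;

        mkdir("scratchdir", 0755);
        for (i = 0; i < 5000; i++) {
            sprintf(name, "scratchdir/file%05d", i);
            if ((fd = open(name, O_CREAT | O_WRONLY, 0644)) >= 0)
                close(fd);
        }
        printf("size with 5000 entries: %ld bytes\n", dirsize("scratchdir"));

        for (i = 0; i < 5000; i++) {
            sprintf(name, "scratchdir/file%05d", i);
            unlink(name);
        }
        printf("size after deleting all: %ld bytes\n", dirsize("scratchdir"));
        return 0;
    }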

Question:

How many file entries can there be in one disk page of a directory file? 
I think the answer may be 256:

1 page = 4K
1 directory record = 16 bytes

4K/16 = 256

Also, why does the directory entry in the directory file consist of only
a two-byte inode and a 14-character filename, while fsdb shows the entire
direct structure that is found in /usr/include/jfs/dir.h?

Response:

Some UNIX History To Clear This Up A Bit:

In early versions of UNIX, filenames were limited to 14
characters. Each filename was given a 16-byte entry in the directory
data. If AIX used this method then there would be exactly 256
directory entries in a block of directory data. The first two bytes
gave the inode number and the last 14 bytes gave the filename (null
terminated if less than 14 characters). These old style directories
were read with open(), read(), lseek(), and close() just like normal
files. At this time there were no directory reading functions like
opendir() and friends.  Then as now, it was possible to read
directory data directly but not write it. The only way to "write"
directory data is to use the system calls made for that purpose.
System V generally supports this style of directory data.
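
As a sketch (the struct and field names here are illustrative, not
copied from any AIX header), the old-style entry looked roughly like
this, which is where the 4096 / 16 = 256 entries-per-block figure
comes from:

    #include <stdio.h>

    /* Classic fixed-length (System V style) directory entry. */
    struct old_direct {
        unsigned short d_ino;      /* 2-byte inode number; 0 marks a free slot */
        char           d_name[14]; /* filename, NUL-padded if shorter */
    };

    int main(void)
    {
        printf("entry size: %d bytes\n", (int)sizeof(struct old_direct));
        printf("entries per 4096-byte block: %d\n",
               (int)(4096 / sizeof(struct old_direct)));
        return 0;
    }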

When BSD 4.2 came out it allowed filenames up to 255 characters
long. To support this, the internal structure of the directory data
was changed so that each filename has an entry formatted according
to the struct dirent data structure found in the include file
/usr/include/sys/dir.h. Unlike the old method, which made all
directory entries the same length (16 bytes), this new method uses
variable-length entries; making every entry a fixed sizeof(struct
dirent) bytes long would have been the obvious choice, but that is
so large it would be wasteful. I will address the algorithm used to
determine how much space is allocated for each entry at the end of
my reply. At the same time the opendir() function and friends were
created. These functions are capable of reading both the old style
directory data (which I will call SYSV directories) and the new
style directory data (which I will call BSD directories).
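
A rough sketch of the BSD-style layout follows; the struct and field
names are illustrative only, and the authoritative definition is the
struct dirent in /usr/include/sys/dir.h:

    #include <stdio.h>
    #include <stddef.h>

    /* Illustrative BSD-style variable-length directory entry. */
    struct new_direct {
        unsigned long  d_ino;     /* inode number */
        unsigned short d_reclen;  /* total length of this entry, name included */
        unsigned short d_namlen;  /* length of the name that follows */
        char           d_name[1]; /* name, NUL terminated, variable length */
    };

    int main(void)
    {
        /* Only the header is fixed; each entry then takes as much space
         * as its own name needs, rather than a full 255-character field. */
        printf("fixed header before the name: %d bytes\n",
               (int)offsetof(struct new_direct, d_name));
        return 0;
    }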

The existence of two types of directory data presents a problem in
the UNIX world. Many old programs used open(), read(), lseek(), and
close() to scan directories because they were written before
opendir() and friends were available (there may also be some newer
System V programs which do this, I suppose). These programs would
fail on BSD directories because the internal data structure is
different from the SYSV directory data they expect. To avoid this
problem and provide backward compatibility for these old programs, a
decision was made: when you use open(), read(), lseek(), and close()
to access BSD directory data, the data is reformatted so that what
the program sees appears to be SYSV directory data. Each version of
UNIX must deal with the above issue in some way if it allows both
BSD and SYSV directory styles. Most versions will choose one style
for their own system, but have the capability to access the other if
necessary (via an NFS mount for example). Some may actually allow
you to specify BSD or SYSV for each filesystem and allow both to
coexist at the same time.

Applying History To AIX To Answer Your Last Append:

AIX uses BSD directory data for all directories.

When you use fsdb you are dumping raw data blocks. At this level,
fsdb is not concerned about whether the data is inode data, or file
data, or directory data. Since AIX uses BSD directory data this is
what you see when you dump a directory data block using fsdb. This
is probably what you would expect.

Since od was written long before BSD directories were available, it
does not check to see if the file you are accessing is a directory
or not. It just uses open(), read(), lseek(), and close() to access
the desired data. Internally AIX converts its BSD directory data to
SYSV directory data to feed to od. This is why od -c <directory>
gives 16 characters for each directory entry. On the disk the data
is still in BSD directory format. This explains why od does not give
you what you would expect.
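
If you want to see the converted view programmatically, the sketch
below reads a directory with open()/read() and prints each 16-byte
record as a 2-byte inode number plus a 14-character name. This
reflects the historical AIX 3.2 behavior described above; on many
other systems read() on a directory simply fails with EISDIR, and the
byte-order handling here is illustrative only:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        unsigned char rec[16];
        unsigned int ino;
        int fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
            perror("open");
            return 1;
        }
        /* Each record: 2-byte inode number, then a 14-character name. */
        while (read(fd, rec, sizeof(rec)) == sizeof(rec)) {
            ino = rec[0] | (rec[1] << 8);   /* byte order is illustrative */
            if (ino != 0)                   /* inode 0 marks a free slot */
                printf("inode %5u  name %.14s\n", ino, (char *)rec + 2);
        }
        close(fd);
        return 0;
    }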

Answer To Question Of How Many Entries Can Exist In A 4KB Page:

There is no easy answer to this question. The macro LDIRSIZE() in
/usr/include/jfs/dir.h is used to allocate the necessary amount,
plus a little room to grow, for each directory entry. A directory
entry cannot span directory data blocks, so if there are 30 bytes
free at the end of a directory data block but the new entry needs 40
bytes, a whole new block is added to the directory data and all of
the new entry falls in the new block. As the directory gets used
over time it becomes fragmented, so it is possible for there to be
40 bytes free in one block but not contiguous. To recover from this,
AIX does compression of directory entries within each block when
necessary and possible. However, as mentioned previously, the total
size of a directory does not shrink; AIX does not compress multiple
directory data blocks into fewer data blocks. To determine exactly
how many entries you could fit in your directory pages, you would
have to keep close track of the length of your filenames, the length
returned by LDIRSIZE() for each filename length, the order in which
you create and delete entries in the directory, and understand when
the compression algorithm is called and how thorough it is. I cannot
give you this last bit of information because it is confidential.
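
As a generic illustration only (this is NOT the AIX LDIRSIZE() macro,
and the header size and alignment below are assumptions of mine),
variable-length directory entries are typically sized by rounding a
fixed header plus the name length up to an alignment boundary:

    #include <stdio.h>
    #include <string.h>

    #define ENTRY_HEADER 8  /* assumed fixed fields before the name */

    /* Representative record-length formula: header + name + NUL,
     * rounded up to a multiple of 4. Not the AIX implementation. */
    static size_t entry_size(const char *name)
    {
        size_t need = ENTRY_HEADER + strlen(name) + 1;
        return (need + 3) & ~(size_t)3;
    }

    int main(void)
    {
        printf("\"a\"            -> %lu bytes\n",
               (unsigned long)entry_size("a"));
        printf("\"longfilename\" -> %lu bytes\n",
               (unsigned long)entry_size("longfilename"));
        return 0;
    }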

Question:

I have a couple more questions:

- I have seen it documented that a disk page is 4K, but I have
  also seen quite a few references to 512-byte units.  Is a disk
  page 4K, and how do the 512-byte blocks fit into this picture?

- Does a disk read read in one 4K page at a time, or is it more (i.e.
  a sector, which would be multiple 4K pages)?

Response:

The LVM and disk device drivers make requests on 4K boundaries.  You can
indeed read more than 4K at one time.  If you request fewer than 4096
bytes on a read, the disk device driver will still retrieve 4K, but will
only hand the application the number of bytes requested.

The controller on the disk does things in increments of 512 bytes.  The
platter is formatted in 512-byte blocks.  When the disk device driver
makes a request for some number of 4K blocks, the disk controller reads
in blocks of 512 bytes (eight 512-byte blocks per 4K block) and returns
them in increments of 4K.

Question:

The last information you provided implies that there are 8 disk reads to
fulfill a request for one 4K block.  Is this correct?

Also, if I request a 512-byte block of information, the LVM/DD requests a
4K block and returns 512 bytes to my application.  Now... what happens if
I later request the next 512 bytes?  Will there be another disk read, or
will this data already be available because of the previous 4K request
made from the LVM/DD (assuming that the next 512-byte block of data I
have now requested is located within the 4K chunk that was read
previously)?

Response:

The answer to your first question is no.  A single read request is issued
covering all 8 of the 512-byte disk blocks.

The answer to your second question is yes, the data is cached in a kernel
buffer.  If the data is available in the cache, it will be moved from the
buffer to user memory instead of issuing another read.
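
For example, an application reading a file sequentially in 512-byte
pieces causes one physical read for the first request that touches
each 4K block; the next seven requests in that block are satisfied
from the kernel buffer. A simple sketch of that access pattern (my
own example; the comments restate the behavior described above):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        char buf[512];
        ssize_t n;
        long chunk = 0;
        int fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
            perror("open");
            return 1;
        }
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* Chunks 0-7 share the first 4K block, 8-15 the second, and
             * so on; only the first chunk of each block needs physical
             * I/O, per the response above. */
            printf("chunk %ld (4K block %ld): %ld bytes\n",
                   chunk, chunk / 8, (long)n);
            chunk++;
        }
        close(fd);
        return 0;
    }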


Support Line: Questions about file system internals. ITEM: G5567L
Dated: March 1994 Category: N/A