LCOV - code coverage report
Current view: top level - src/backend/utils/sort - logtape.c (source / functions) Coverage Total Hit
Test: Code coverage Lines: 93.0 % 401 373
Test Date: 2026-01-26 10:56:24 Functions: 100.0 % 26 26
Legend: Lines:     hit not hit
Branches: + taken - not taken # not executed
Branches: 65.1 % 238 155

             Branch data     Line data    Source code
       1                 :             : /*-------------------------------------------------------------------------
       2                 :             :  *
       3                 :             :  * logtape.c
       4                 :             :  *        Management of "logical tapes" within temporary files.
       5                 :             :  *
       6                 :             :  * This module exists to support sorting via multiple merge passes (see
       7                 :             :  * tuplesort.c).  Merging is an ideal algorithm for tape devices, but if
       8                 :             :  * we implement it on disk by creating a separate file for each "tape",
       9                 :             :  * there is an annoying problem: the peak space usage is at least twice
      10                 :             :  * the volume of actual data to be sorted.  (This must be so because each
      11                 :             :  * datum will appear in both the input and output tapes of the final
      12                 :             :  * merge pass.)
      13                 :             :  *
      14                 :             :  * We can work around this problem by recognizing that any one tape
      15                 :             :  * dataset (with the possible exception of the final output) is written
      16                 :             :  * and read exactly once in a perfectly sequential manner.  Therefore,
      17                 :             :  * a datum once read will not be required again, and we can recycle its
      18                 :             :  * space for use by the new tape dataset(s) being generated.  In this way,
      19                 :             :  * the total space usage is essentially just the actual data volume, plus
      20                 :             :  * insignificant bookkeeping and start/stop overhead.
      21                 :             :  *
      22                 :             :  * Few OSes allow arbitrary parts of a file to be released back to the OS,
      23                 :             :  * so we have to implement this space-recycling ourselves within a single
      24                 :             :  * logical file.  logtape.c exists to perform this bookkeeping and provide
      25                 :             :  * the illusion of N independent tape devices to tuplesort.c.  Note that
      26                 :             :  * logtape.c itself depends on buffile.c to provide a "logical file" of
      27                 :             :  * larger size than the underlying OS may support.
      28                 :             :  *
      29                 :             :  * For simplicity, we allocate and release space in the underlying file
      30                 :             :  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
      31                 :             :  * of which blocks in the underlying file belong to which logical tape,
      32                 :             :  * plus any blocks that are free (recycled and not yet reused).
      33                 :             :  * The blocks in each logical tape form a chain, with a prev- and next-
      34                 :             :  * pointer in each block.
      35                 :             :  *
      36                 :             :  * The initial write pass is guaranteed to fill the underlying file
      37                 :             :  * perfectly sequentially, no matter how data is divided into logical tapes.
      38                 :             :  * Once we begin merge passes, the access pattern becomes considerably
      39                 :             :  * less predictable --- but the seeking involved should be comparable to
      40                 :             :  * what would happen if we kept each logical tape in a separate file,
      41                 :             :  * so there's no serious performance penalty paid to obtain the space
      42                 :             :  * savings of recycling.  We try to localize the write accesses by always
      43                 :             :  * writing to the lowest-numbered free block when we have a choice; it's
      44                 :             :  * not clear this helps much, but it can't hurt.  (XXX perhaps a LIFO
      45                 :             :  * policy for free blocks would be better?)
      46                 :             :  *
      47                 :             :  * To further make the I/Os more sequential, we can use a larger buffer
      48                 :             :  * when reading, and read multiple blocks from the same tape in one go,
      49                 :             :  * whenever the buffer becomes empty.
      50                 :             :  *
      51                 :             :  * To support the above policy of writing to the lowest free block, the
      52                 :             :  * freelist is a min heap.
      53                 :             :  *
      54                 :             :  * Since all the bookkeeping and buffer memory is allocated with palloc(),
      55                 :             :  * and the underlying file(s) are made with OpenTemporaryFile, all resources
      56                 :             :  * for a logical tape set are certain to be cleaned up even if processing
      57                 :             :  * is aborted by ereport(ERROR).  To avoid confusion, the caller should take
      58                 :             :  * care that all calls for a single LogicalTapeSet are made in the same
      59                 :             :  * palloc context.
      60                 :             :  *
      61                 :             :  * To support parallel sort operations involving coordinated callers to
      62                 :             :  * tuplesort.c routines across multiple workers, it is necessary to
      63                 :             :  * concatenate each worker BufFile/tapeset into one single logical tapeset
      64                 :             :  * managed by the leader.  Workers should have produced one final
      65                 :             :  * materialized tape (their entire output) when this happens in leader.
      66                 :             :  * There will always be the same number of runs as input tapes, and the same
      67                 :             :  * number of input tapes as participants (worker Tuplesortstates).
      68                 :             :  *
      69                 :             :  * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
      70                 :             :  * Portions Copyright (c) 1994, Regents of the University of California
      71                 :             :  *
      72                 :             :  * IDENTIFICATION
      73                 :             :  *        src/backend/utils/sort/logtape.c
      74                 :             :  *
      75                 :             :  *-------------------------------------------------------------------------
      76                 :             :  */
      77                 :             : 
      78                 :             : #include "postgres.h"
      79                 :             : 
      80                 :             : #include <fcntl.h>
      81                 :             : 
      82                 :             : #include "storage/buffile.h"
      83                 :             : #include "utils/builtins.h"
      84                 :             : #include "utils/logtape.h"
      85                 :             : #include "utils/memdebug.h"
      86                 :             : #include "utils/memutils.h"
      87                 :             : 
      88                 :             : /*
      89                 :             :  * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
      90                 :             :  *
      91                 :             :  * The first block of a tape has prev == -1.  The last block of a tape
      92                 :             :  * stores the number of valid bytes on the block, inverted, in 'next'
      93                 :             :  * Therefore next < 0 indicates the last block.
      94                 :             :  */
      95                 :             : typedef struct TapeBlockTrailer
      96                 :             : {
      97                 :             :         int64           prev;                   /* previous block on this tape, or -1 on first
      98                 :             :                                                                  * block */
      99                 :             :         int64           next;                   /* next block on this tape, or # of valid
     100                 :             :                                                                  * bytes on last block (if < 0) */
     101                 :             : } TapeBlockTrailer;
     102                 :             : 
     103                 :             : #define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
     104                 :             : #define TapeBlockGetTrailer(buf) \
     105                 :             :         ((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
     106                 :             : 
     107                 :             : #define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
     108                 :             : #define TapeBlockGetNBytes(buf) \
     109                 :             :         (TapeBlockIsLast(buf) ? \
     110                 :             :          (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
     111                 :             : #define TapeBlockSetNBytes(buf, nbytes) \
     112                 :             :         (TapeBlockGetTrailer(buf)->next = -(nbytes))
     113                 :             : 
     114                 :             : /*
     115                 :             :  * When multiple tapes are being written to concurrently (as in HashAgg),
     116                 :             :  * avoid excessive fragmentation by preallocating block numbers to individual
     117                 :             :  * tapes. Each preallocation doubles in size starting at
     118                 :             :  * TAPE_WRITE_PREALLOC_MIN blocks up to TAPE_WRITE_PREALLOC_MAX blocks.
     119                 :             :  *
     120                 :             :  * No filesystem operations are performed for preallocation; only the block
     121                 :             :  * numbers are reserved. This may lead to sparse writes, which will cause
     122                 :             :  * ltsWriteBlock() to fill in holes with zeros.
     123                 :             :  */
     124                 :             : #define TAPE_WRITE_PREALLOC_MIN 8
     125                 :             : #define TAPE_WRITE_PREALLOC_MAX 128
     126                 :             : 
     127                 :             : /*
     128                 :             :  * This data structure represents a single "logical tape" within the set
     129                 :             :  * of logical tapes stored in the same file.
     130                 :             :  *
     131                 :             :  * While writing, we hold the current partially-written data block in the
     132                 :             :  * buffer.  While reading, we can hold multiple blocks in the buffer.  Note
     133                 :             :  * that we don't retain the trailers of a block when it's read into the
     134                 :             :  * buffer.  The buffer therefore contains one large contiguous chunk of data
     135                 :             :  * from the tape.
     136                 :             :  */
     137                 :             : struct LogicalTape
     138                 :             : {
     139                 :             :         LogicalTapeSet *tapeSet;        /* tape set this tape is part of */
     140                 :             : 
     141                 :             :         bool            writing;                /* T while in write phase */
     142                 :             :         bool            frozen;                 /* T if blocks should not be freed when read */
     143                 :             :         bool            dirty;                  /* does buffer need to be written? */
     144                 :             : 
     145                 :             :         /*
     146                 :             :          * Block numbers of the first, current, and next block of the tape.
     147                 :             :          *
     148                 :             :          * The "current" block number is only valid when writing, or reading from
     149                 :             :          * a frozen tape.  (When reading from an unfrozen tape, we use a larger
     150                 :             :          * read buffer that holds multiple blocks, so the "current" block is
     151                 :             :          * ambiguous.)
     152                 :             :          *
     153                 :             :          * When concatenation of worker tape BufFiles is performed, an offset to
     154                 :             :          * the first block in the unified BufFile space is applied during reads.
     155                 :             :          */
     156                 :             :         int64           firstBlockNumber;
     157                 :             :         int64           curBlockNumber;
     158                 :             :         int64           nextBlockNumber;
     159                 :             :         int64           offsetBlockNumber;
     160                 :             : 
     161                 :             :         /*
     162                 :             :          * Buffer for current data block(s).
     163                 :             :          */
     164                 :             :         char       *buffer;                     /* physical buffer (separately palloc'd) */
     165                 :             :         int                     buffer_size;    /* allocated size of the buffer */
     166                 :             :         int                     max_size;               /* highest useful, safe buffer_size */
     167                 :             :         int                     pos;                    /* next read/write position in buffer */
     168                 :             :         int                     nbytes;                 /* total # of valid bytes in buffer */
     169                 :             : 
     170                 :             :         /*
     171                 :             :          * Preallocated block numbers are held in an array sorted in descending
     172                 :             :          * order; blocks are consumed from the end of the array (lowest block
     173                 :             :          * numbers first).
     174                 :             :          */
     175                 :             :         int64      *prealloc;
     176                 :             :         int                     nprealloc;              /* number of elements in list */
     177                 :             :         int                     prealloc_size;  /* number of elements list can hold */
     178                 :             : };
     179                 :             : 
     180                 :             : /*
     181                 :             :  * This data structure represents a set of related "logical tapes" sharing
     182                 :             :  * space in a single underlying file.  (But that "file" may be multiple files
     183                 :             :  * if needed to escape OS limits on file size; buffile.c handles that for us.)
     184                 :             :  * Tapes belonging to a tape set can be created and destroyed on-the-fly, on
     185                 :             :  * demand.
     186                 :             :  */
     187                 :             : struct LogicalTapeSet
     188                 :             : {
     189                 :             :         BufFile    *pfile;                      /* underlying file for whole tape set */
     190                 :             :         SharedFileSet *fileset;
     191                 :             :         int                     worker;                 /* worker # if shared, -1 for leader/serial */
     192                 :             : 
     193                 :             :         /*
     194                 :             :          * File size tracking.  nBlocksWritten is the size of the underlying file,
     195                 :             :          * in BLCKSZ blocks.  nBlocksAllocated is the number of blocks allocated
     196                 :             :          * by ltsReleaseBlock(), and it is always greater than or equal to
     197                 :             :          * nBlocksWritten.  Blocks between nBlocksAllocated and nBlocksWritten are
     198                 :             :          * blocks that have been allocated for a tape, but have not been written
     199                 :             :          * to the underlying file yet.  nHoleBlocks tracks the total number of
     200                 :             :          * blocks that are in unused holes between worker spaces following BufFile
     201                 :             :          * concatenation.
     202                 :             :          */
     203                 :             :         int64           nBlocksAllocated;       /* # of blocks allocated */
     204                 :             :         int64           nBlocksWritten; /* # of blocks used in underlying file */
     205                 :             :         int64           nHoleBlocks;    /* # of "hole" blocks left */
     206                 :             : 
     207                 :             :         /*
     208                 :             :          * We store the numbers of recycled-and-available blocks in freeBlocks[].
     209                 :             :          * When there are no such blocks, we extend the underlying file.
     210                 :             :          *
     211                 :             :          * If forgetFreeSpace is true then any freed blocks are simply forgotten
     212                 :             :          * rather than being remembered in freeBlocks[].  See notes for
     213                 :             :          * LogicalTapeSetForgetFreeSpace().
     214                 :             :          */
     215                 :             :         bool            forgetFreeSpace;        /* are we remembering free blocks? */
     216                 :             :         int64      *freeBlocks;         /* resizable array holding minheap */
     217                 :             :         int64           nFreeBlocks;    /* # of currently free blocks */
     218                 :             :         Size            freeBlocksLen;  /* current allocated length of freeBlocks[] */
     219                 :             :         bool            enable_prealloc;        /* preallocate write blocks? */
     220                 :             : };
     221                 :             : 
     222                 :             : static LogicalTape *ltsCreateTape(LogicalTapeSet *lts);
     223                 :             : static void ltsWriteBlock(LogicalTapeSet *lts, int64 blocknum, const void *buffer);
     224                 :             : static void ltsReadBlock(LogicalTapeSet *lts, int64 blocknum, void *buffer);
     225                 :             : static int64 ltsGetBlock(LogicalTapeSet *lts, LogicalTape *lt);
     226                 :             : static int64 ltsGetFreeBlock(LogicalTapeSet *lts);
     227                 :             : static int64 ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt);
     228                 :             : static void ltsReleaseBlock(LogicalTapeSet *lts, int64 blocknum);
     229                 :             : static void ltsInitReadBuffer(LogicalTape *lt);
     230                 :             : 
     231                 :             : 
     232                 :             : /*
     233                 :             :  * Write a block-sized buffer to the specified block of the underlying file.
     234                 :             :  *
     235                 :             :  * No need for an error return convention; we ereport() on any error.
     236                 :             :  */
     237                 :             : static void
     238                 :        7585 : ltsWriteBlock(LogicalTapeSet *lts, int64 blocknum, const void *buffer)
     239                 :             : {
     240                 :             :         /*
     241                 :             :          * BufFile does not support "holes", so if we're about to write a block
     242                 :             :          * that's past the current end of file, fill the space between the current
     243                 :             :          * end of file and the target block with zeros.
     244                 :             :          *
     245                 :             :          * This can happen either when tapes preallocate blocks; or for the last
     246                 :             :          * block of a tape which might not have been flushed.
     247                 :             :          *
     248                 :             :          * Note that BufFile concatenation can leave "holes" in BufFile between
     249                 :             :          * worker-owned block ranges.  These are tracked for reporting purposes
     250                 :             :          * only.  We never read from nor write to these hole blocks, and so they
     251                 :             :          * are not considered here.
     252                 :             :          */
     253         [ +  + ]:        8250 :         while (blocknum > lts->nBlocksWritten)
     254                 :             :         {
     255                 :         665 :                 PGIOAlignedBlock zerobuf;
     256                 :             : 
     257   [ +  -  +  -  :         665 :                 MemSet(zerobuf.data, 0, sizeof(zerobuf));
          +  -  +  -  #  
                      # ]
     258                 :             : 
     259                 :         665 :                 ltsWriteBlock(lts, lts->nBlocksWritten, zerobuf.data);
     260                 :         665 :         }
     261                 :             : 
     262                 :             :         /* Write the requested block */
     263         [ +  - ]:        7585 :         if (BufFileSeekBlock(lts->pfile, blocknum) != 0)
     264   [ #  #  #  # ]:           0 :                 ereport(ERROR,
     265                 :             :                                 (errcode_for_file_access(),
     266                 :             :                                  errmsg("could not seek to block %" PRId64 " of temporary file",
     267                 :             :                                                 blocknum)));
     268                 :        7585 :         BufFileWrite(lts->pfile, buffer, BLCKSZ);
     269                 :             : 
     270                 :             :         /* Update nBlocksWritten, if we extended the file */
     271         [ +  + ]:        7585 :         if (blocknum == lts->nBlocksWritten)
     272                 :        2249 :                 lts->nBlocksWritten++;
     273                 :        7585 : }
     274                 :             : 
     275                 :             : /*
     276                 :             :  * Read a block-sized buffer from the specified block of the underlying file.
     277                 :             :  *
     278                 :             :  * No need for an error return convention; we ereport() on any error.   This
     279                 :             :  * module should never attempt to read a block it doesn't know is there.
     280                 :             :  */
     281                 :             : static void
     282                 :        6875 : ltsReadBlock(LogicalTapeSet *lts, int64 blocknum, void *buffer)
     283                 :             : {
     284         [ +  - ]:        6875 :         if (BufFileSeekBlock(lts->pfile, blocknum) != 0)
     285   [ #  #  #  # ]:           0 :                 ereport(ERROR,
     286                 :             :                                 (errcode_for_file_access(),
     287                 :             :                                  errmsg("could not seek to block %" PRId64 " of temporary file",
     288                 :             :                                                 blocknum)));
     289                 :        6875 :         BufFileReadExact(lts->pfile, buffer, BLCKSZ);
     290                 :        6875 : }
     291                 :             : 
     292                 :             : /*
     293                 :             :  * Read as many blocks as we can into the per-tape buffer.
     294                 :             :  *
     295                 :             :  * Returns true if anything was read, 'false' on EOF.
     296                 :             :  */
     297                 :             : static bool
     298                 :       10111 : ltsReadFillBuffer(LogicalTape *lt)
     299                 :             : {
     300                 :       10111 :         lt->pos = 0;
     301                 :       10111 :         lt->nbytes = 0;
     302                 :             : 
     303                 :       10111 :         do
     304                 :             :         {
     305                 :       11262 :                 char       *thisbuf = lt->buffer + lt->nbytes;
     306                 :       11262 :                 int64           datablocknum = lt->nextBlockNumber;
     307                 :             : 
     308                 :             :                 /* Fetch next block number */
     309         [ +  + ]:       11262 :                 if (datablocknum == -1L)
     310                 :        4485 :                         break;                          /* EOF */
     311                 :             :                 /* Apply worker offset, needed for leader tapesets */
     312                 :        6777 :                 datablocknum += lt->offsetBlockNumber;
     313                 :             : 
     314                 :             :                 /* Read the block */
     315                 :        6777 :                 ltsReadBlock(lt->tapeSet, datablocknum, thisbuf);
     316         [ +  + ]:        6777 :                 if (!lt->frozen)
     317                 :        6660 :                         ltsReleaseBlock(lt->tapeSet, datablocknum);
     318                 :        6777 :                 lt->curBlockNumber = lt->nextBlockNumber;
     319                 :             : 
     320         [ +  + ]:        6777 :                 lt->nbytes += TapeBlockGetNBytes(thisbuf);
     321         [ +  + ]:        6777 :                 if (TapeBlockIsLast(thisbuf))
     322                 :             :                 {
     323                 :        4637 :                         lt->nextBlockNumber = -1L;
     324                 :             :                         /* EOF */
     325                 :        4637 :                         break;
     326                 :             :                 }
     327                 :             :                 else
     328                 :        2140 :                         lt->nextBlockNumber = TapeBlockGetTrailer(thisbuf)->next;
     329                 :             : 
     330                 :             :                 /* Advance to next block, if we have buffer space left */
     331   [ -  +  +  +  :       11262 :         } while (lt->buffer_size - lt->nbytes > BLCKSZ);
                      + ]
     332                 :             : 
     333                 :       10111 :         return (lt->nbytes > 0);
     334                 :             : }
     335                 :             : 
     336                 :             : static inline uint64
     337                 :      253114 : left_offset(uint64 i)
     338                 :             : {
     339                 :      253114 :         return 2 * i + 1;
     340                 :             : }
     341                 :             : 
     342                 :             : static inline uint64
     343                 :      253114 : right_offset(uint64 i)
     344                 :             : {
     345                 :      253114 :         return 2 * i + 2;
     346                 :             : }
     347                 :             : 
     348                 :             : static inline uint64
     349                 :      158686 : parent_offset(uint64 i)
     350                 :             : {
     351                 :      158686 :         return (i - 1) / 2;
     352                 :             : }
     353                 :             : 
     354                 :             : /*
     355                 :             :  * Get the next block for writing.
     356                 :             :  */
     357                 :             : static int64
     358                 :        6920 : ltsGetBlock(LogicalTapeSet *lts, LogicalTape *lt)
     359                 :             : {
     360         [ +  + ]:        6920 :         if (lts->enable_prealloc)
     361                 :        4661 :                 return ltsGetPreallocBlock(lts, lt);
     362                 :             :         else
     363                 :        2259 :                 return ltsGetFreeBlock(lts);
     364                 :        6920 : }
     365                 :             : 
     366                 :             : /*
     367                 :             :  * Select the lowest currently unused block from the tape set's global free
     368                 :             :  * list min heap.
     369                 :             :  */
     370                 :             : static int64
     371                 :       38203 : ltsGetFreeBlock(LogicalTapeSet *lts)
     372                 :             : {
     373                 :       38203 :         int64      *heap = lts->freeBlocks;
     374                 :       38203 :         int64           blocknum;
     375                 :       38203 :         int64           heapsize;
     376                 :       38203 :         int64           holeval;
     377                 :       38203 :         uint64          holepos;
     378                 :             : 
     379                 :             :         /* freelist empty; allocate a new block */
     380         [ +  + ]:       38203 :         if (lts->nFreeBlocks == 0)
     381                 :        2312 :                 return lts->nBlocksAllocated++;
     382                 :             : 
     383                 :             :         /* easy if heap contains one element */
     384         [ +  + ]:       35891 :         if (lts->nFreeBlocks == 1)
     385                 :             :         {
     386                 :          63 :                 lts->nFreeBlocks--;
     387                 :          63 :                 return lts->freeBlocks[0];
     388                 :             :         }
     389                 :             : 
     390                 :             :         /* remove top of minheap */
     391                 :       35828 :         blocknum = heap[0];
     392                 :             : 
     393                 :             :         /* we'll replace it with end of minheap array */
     394                 :       35828 :         holeval = heap[--lts->nFreeBlocks];
     395                 :             : 
     396                 :             :         /* sift down */
     397                 :       35828 :         holepos = 0;                            /* holepos is where the "hole" is */
     398                 :       35828 :         heapsize = lts->nFreeBlocks;
     399                 :      253114 :         while (true)
     400                 :             :         {
     401                 :      253114 :                 uint64          left = left_offset(holepos);
     402                 :      253114 :                 uint64          right = right_offset(holepos);
     403                 :      253114 :                 uint64          min_child;
     404                 :             : 
     405   [ +  +  +  + ]:      253114 :                 if (left < heapsize && right < heapsize)
     406         [ +  + ]:      219229 :                         min_child = (heap[left] < heap[right]) ? left : right;
     407         [ +  + ]:       33885 :                 else if (left < heapsize)
     408                 :        7275 :                         min_child = left;
     409         [ -  + ]:       26610 :                 else if (right < heapsize)
     410                 :           0 :                         min_child = right;
     411                 :             :                 else
     412                 :       26610 :                         break;
     413                 :             : 
     414         [ +  + ]:      226504 :                 if (heap[min_child] >= holeval)
     415                 :        9218 :                         break;
     416                 :             : 
     417                 :      217286 :                 heap[holepos] = heap[min_child];
     418                 :      217286 :                 holepos = min_child;
     419      [ -  +  + ]:      253114 :         }
     420                 :       35828 :         heap[holepos] = holeval;
     421                 :             : 
     422                 :       35828 :         return blocknum;
     423                 :       38203 : }
     424                 :             : 
     425                 :             : /*
     426                 :             :  * Return the lowest free block number from the tape's preallocation list.
     427                 :             :  * Refill the preallocation list with blocks from the tape set's free list if
     428                 :             :  * necessary.
     429                 :             :  */
     430                 :             : static int64
     431                 :        4661 : ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt)
     432                 :             : {
     433                 :             :         /* sorted in descending order, so return the last element */
     434         [ +  + ]:        4661 :         if (lt->nprealloc > 0)
     435                 :         172 :                 return lt->prealloc[--lt->nprealloc];
     436                 :             : 
     437         [ +  + ]:        4489 :         if (lt->prealloc == NULL)
     438                 :             :         {
     439                 :        4485 :                 lt->prealloc_size = TAPE_WRITE_PREALLOC_MIN;
     440                 :        4485 :                 lt->prealloc = palloc_array(int64, lt->prealloc_size);
     441                 :        4485 :         }
     442         [ -  + ]:           4 :         else if (lt->prealloc_size < TAPE_WRITE_PREALLOC_MAX)
     443                 :             :         {
     444                 :             :                 /* when the preallocation list runs out, double the size */
     445                 :           4 :                 lt->prealloc_size *= 2;
     446         [ +  - ]:           4 :                 if (lt->prealloc_size > TAPE_WRITE_PREALLOC_MAX)
     447                 :           0 :                         lt->prealloc_size = TAPE_WRITE_PREALLOC_MAX;
     448                 :           8 :                 lt->prealloc = (int64 *) repalloc(lt->prealloc,
     449                 :           4 :                                                                                   sizeof(int64) * lt->prealloc_size);
     450                 :           4 :         }
     451                 :             : 
     452                 :             :         /* refill preallocation list */
     453                 :        4489 :         lt->nprealloc = lt->prealloc_size;
     454         [ +  + ]:       40433 :         for (int i = lt->nprealloc; i > 0; i--)
     455                 :             :         {
     456                 :       35944 :                 lt->prealloc[i - 1] = ltsGetFreeBlock(lts);
     457                 :             : 
     458                 :             :                 /* verify descending order */
     459   [ +  +  -  + ]:       35944 :                 Assert(i == lt->nprealloc || lt->prealloc[i - 1] > lt->prealloc[i]);
     460                 :       35944 :         }
     461                 :             : 
     462                 :        4489 :         return lt->prealloc[--lt->nprealloc];
     463                 :        4661 : }
     464                 :             : 
     465                 :             : /*
     466                 :             :  * Return a block# to the freelist.
     467                 :             :  */
     468                 :             : static void
     469                 :       37943 : ltsReleaseBlock(LogicalTapeSet *lts, int64 blocknum)
     470                 :             : {
     471                 :       37943 :         int64      *heap;
     472                 :       37943 :         uint64          holepos;
     473                 :             : 
     474                 :             :         /*
     475                 :             :          * Do nothing if we're no longer interested in remembering free space.
     476                 :             :          */
     477         [ +  + ]:       37943 :         if (lts->forgetFreeSpace)
     478                 :        1393 :                 return;
     479                 :             : 
     480                 :             :         /*
     481                 :             :          * Enlarge freeBlocks array if full.
     482                 :             :          */
     483         [ +  + ]:       36550 :         if (lts->nFreeBlocks >= lts->freeBlocksLen)
     484                 :             :         {
     485                 :             :                 /*
     486                 :             :                  * If the freelist becomes very large, just return and leak this free
     487                 :             :                  * block.
     488                 :             :                  */
     489         [ -  + ]:          10 :                 if (lts->freeBlocksLen * 2 * sizeof(int64) > MaxAllocSize)
     490                 :           0 :                         return;
     491                 :             : 
     492                 :          10 :                 lts->freeBlocksLen *= 2;
     493                 :          20 :                 lts->freeBlocks = (int64 *) repalloc(lts->freeBlocks,
     494                 :          10 :                                                                                          lts->freeBlocksLen * sizeof(int64));
     495                 :          10 :         }
     496                 :             : 
     497                 :             :         /* create a "hole" at end of minheap array */
     498                 :       36550 :         heap = lts->freeBlocks;
     499                 :       36550 :         holepos = lts->nFreeBlocks;
     500                 :       36550 :         lts->nFreeBlocks++;
     501                 :             : 
     502                 :             :         /* sift up to insert blocknum */
     503         [ +  + ]:      164252 :         while (holepos != 0)
     504                 :             :         {
     505                 :      158686 :                 uint64          parent = parent_offset(holepos);
     506                 :             : 
     507         [ +  + ]:      158686 :                 if (heap[parent] < blocknum)
     508                 :       30984 :                         break;
     509                 :             : 
     510                 :      127702 :                 heap[holepos] = heap[parent];
     511                 :      127702 :                 holepos = parent;
     512         [ +  + ]:      158686 :         }
     513                 :       36550 :         heap[holepos] = blocknum;
     514                 :       37943 : }
     515                 :             : 
     516                 :             : /*
     517                 :             :  * Lazily allocate and initialize the read buffer. This avoids waste when many
     518                 :             :  * tapes are open at once, but not all are active between rewinding and
     519                 :             :  * reading.
     520                 :             :  */
     521                 :             : static void
     522                 :        4641 : ltsInitReadBuffer(LogicalTape *lt)
     523                 :             : {
     524         [ +  - ]:        4641 :         Assert(lt->buffer_size > 0);
     525                 :        4641 :         lt->buffer = palloc(lt->buffer_size);
     526                 :             : 
     527                 :             :         /* Read the first block, or reset if tape is empty */
     528                 :        4641 :         lt->nextBlockNumber = lt->firstBlockNumber;
     529                 :        4641 :         lt->pos = 0;
     530                 :        4641 :         lt->nbytes = 0;
     531                 :        4641 :         ltsReadFillBuffer(lt);
     532                 :        4641 : }
     533                 :             : 
     534                 :             : /*
     535                 :             :  * Create a tape set, backed by a temporary underlying file.
     536                 :             :  *
     537                 :             :  * The tape set is initially empty. Use LogicalTapeCreate() to create
     538                 :             :  * tapes in it.
     539                 :             :  *
     540                 :             :  * In a single-process sort, pass NULL argument for fileset, and -1 for
     541                 :             :  * worker.
     542                 :             :  *
     543                 :             :  * In a parallel sort, parallel workers pass the shared fileset handle and
     544                 :             :  * their own worker number.  After the workers have finished, create the
     545                 :             :  * tape set in the leader, passing the shared fileset handle and -1 for
     546                 :             :  * worker, and use LogicalTapeImport() to import the worker tapes into it.
     547                 :             :  *
     548                 :             :  * Currently, the leader will only import worker tapes into the set, it does
     549                 :             :  * not create tapes of its own, although in principle that should work.
     550                 :             :  *
     551                 :             :  * If preallocate is true, blocks for each individual tape are allocated in
     552                 :             :  * batches.  This avoids fragmentation when writing multiple tapes at the
     553                 :             :  * same time.
     554                 :             :  */
     555                 :             : LogicalTapeSet *
     556                 :         139 : LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
     557                 :             : {
     558                 :         139 :         LogicalTapeSet *lts;
     559                 :             : 
     560                 :             :         /*
     561                 :             :          * Create top-level struct including per-tape LogicalTape structs.
     562                 :             :          */
     563                 :         139 :         lts = palloc_object(LogicalTapeSet);
     564                 :         139 :         lts->nBlocksAllocated = 0L;
     565                 :         139 :         lts->nBlocksWritten = 0L;
     566                 :         139 :         lts->nHoleBlocks = 0L;
     567                 :         139 :         lts->forgetFreeSpace = false;
     568                 :         139 :         lts->freeBlocksLen = 32;     /* reasonable initial guess */
     569                 :         139 :         lts->freeBlocks = (int64 *) palloc(lts->freeBlocksLen * sizeof(int64));
     570                 :         139 :         lts->nFreeBlocks = 0;
     571                 :         139 :         lts->enable_prealloc = preallocate;
     572                 :             : 
     573                 :         139 :         lts->fileset = fileset;
     574                 :         139 :         lts->worker = worker;
     575                 :             : 
     576                 :             :         /*
     577                 :             :          * Create temp BufFile storage as required.
     578                 :             :          *
     579                 :             :          * In leader, we hijack the BufFile of the first tape that's imported, and
     580                 :             :          * concatenate the BufFiles of any subsequent tapes to that. Hence don't
     581                 :             :          * create a BufFile here. Things are simpler for the worker case and the
     582                 :             :          * serial case, though.  They are generally very similar -- workers use a
     583                 :             :          * shared fileset, whereas serial sorts use a conventional serial BufFile.
     584                 :             :          */
     585   [ +  +  +  + ]:         139 :         if (fileset && worker == -1)
     586                 :          28 :                 lts->pfile = NULL;
     587         [ +  + ]:         111 :         else if (fileset)
     588                 :             :         {
     589                 :          82 :                 char            filename[MAXPGPATH];
     590                 :             : 
     591                 :          82 :                 pg_itoa(worker, filename);
     592                 :          82 :                 lts->pfile = BufFileCreateFileSet(&fileset->fs, filename);
     593                 :          82 :         }
     594                 :             :         else
     595                 :          29 :                 lts->pfile = BufFileCreateTemp(false);
     596                 :             : 
     597                 :         278 :         return lts;
     598                 :         139 : }
     599                 :             : 
     600                 :             : /*
     601                 :             :  * Claim ownership of a logical tape from an existing shared BufFile.
     602                 :             :  *
     603                 :             :  * Caller should be leader process.  Though tapes are marked as frozen in
     604                 :             :  * workers, they are not frozen when opened within leader, since unfrozen tapes
     605                 :             :  * use a larger read buffer. (Frozen tapes have smaller read buffer, optimized
     606                 :             :  * for random access.)
     607                 :             :  */
     608                 :             : LogicalTape *
     609                 :          56 : LogicalTapeImport(LogicalTapeSet *lts, int worker, TapeShare *shared)
     610                 :             : {
     611                 :          56 :         LogicalTape *lt;
     612                 :          56 :         int64           tapeblocks;
     613                 :          56 :         char            filename[MAXPGPATH];
     614                 :          56 :         BufFile    *file;
     615                 :          56 :         int64           filesize;
     616                 :             : 
     617                 :          56 :         lt = ltsCreateTape(lts);
     618                 :             : 
     619                 :             :         /*
     620                 :             :          * build concatenated view of all buffiles, remembering the block number
     621                 :             :          * where each source file begins.
     622                 :             :          */
     623                 :          56 :         pg_itoa(worker, filename);
     624                 :          56 :         file = BufFileOpenFileSet(&lts->fileset->fs, filename, O_RDONLY, false);
     625                 :          56 :         filesize = BufFileSize(file);
     626                 :             : 
     627                 :             :         /*
     628                 :             :          * Stash first BufFile, and concatenate subsequent BufFiles to that. Store
     629                 :             :          * block offset into each tape as we go.
     630                 :             :          */
     631                 :          56 :         lt->firstBlockNumber = shared->firstblocknumber;
     632         [ +  + ]:          56 :         if (lts->pfile == NULL)
     633                 :             :         {
     634                 :          28 :                 lts->pfile = file;
     635                 :          28 :                 lt->offsetBlockNumber = 0L;
     636                 :          28 :         }
     637                 :             :         else
     638                 :             :         {
     639                 :          28 :                 lt->offsetBlockNumber = BufFileAppend(lts->pfile, file);
     640                 :             :         }
     641                 :             :         /* Don't allocate more for read buffer than could possibly help */
     642         [ -  + ]:          56 :         lt->max_size = Min(MaxAllocSize, filesize);
     643                 :          56 :         tapeblocks = filesize / BLCKSZ;
     644                 :             : 
     645                 :             :         /*
     646                 :             :          * Update # of allocated blocks and # blocks written to reflect the
     647                 :             :          * imported BufFile.  Allocated/written blocks include space used by holes
     648                 :             :          * left between concatenated BufFiles.  Also track the number of hole
     649                 :             :          * blocks so that we can later work backwards to calculate the number of
     650                 :             :          * physical blocks for instrumentation.
     651                 :             :          */
     652                 :          56 :         lts->nHoleBlocks += lt->offsetBlockNumber - lts->nBlocksAllocated;
     653                 :             : 
     654                 :          56 :         lts->nBlocksAllocated = lt->offsetBlockNumber + tapeblocks;
     655                 :          56 :         lts->nBlocksWritten = lts->nBlocksAllocated;
     656                 :             : 
     657                 :         112 :         return lt;
     658                 :          56 : }
     659                 :             : 
     660                 :             : /*
     661                 :             :  * Close a logical tape set and release all resources.
     662                 :             :  *
     663                 :             :  * NOTE: This doesn't close any of the tapes!  You must close them
     664                 :             :  * first, or you can let them be destroyed along with the memory context.
     665                 :             :  */
     666                 :             : void
     667                 :         139 : LogicalTapeSetClose(LogicalTapeSet *lts)
     668                 :             : {
     669                 :         139 :         BufFileClose(lts->pfile);
     670                 :         139 :         pfree(lts->freeBlocks);
     671                 :         139 :         pfree(lts);
     672                 :         139 : }
     673                 :             : 
     674                 :             : /*
     675                 :             :  * Create a logical tape in the given tapeset.
     676                 :             :  *
     677                 :             :  * The tape is initialized in write state.
     678                 :             :  */
     679                 :             : LogicalTape *
     680                 :        8588 : LogicalTapeCreate(LogicalTapeSet *lts)
     681                 :             : {
     682                 :             :         /*
     683                 :             :          * The only thing that currently prevents creating new tapes in leader is
     684                 :             :          * the fact that BufFiles opened using BufFileOpenFileSet() are read-only
     685                 :             :          * by definition, but that could be changed if it seemed worthwhile.  For
     686                 :             :          * now, writing to the leader tape will raise a "Bad file descriptor"
     687                 :             :          * error, so tuplesort must avoid writing to the leader tape altogether.
     688                 :             :          */
     689   [ +  +  +  - ]:        8588 :         if (lts->fileset && lts->worker == -1)
     690   [ #  #  #  # ]:           0 :                 elog(ERROR, "cannot create new tapes in leader process");
     691                 :             : 
     692                 :        8588 :         return ltsCreateTape(lts);
     693                 :             : }
     694                 :             : 
     695                 :             : static LogicalTape *
     696                 :        8644 : ltsCreateTape(LogicalTapeSet *lts)
     697                 :             : {
     698                 :        8644 :         LogicalTape *lt;
     699                 :             : 
     700                 :             :         /*
     701                 :             :          * Create per-tape struct.  Note we allocate the I/O buffer lazily.
     702                 :             :          */
     703                 :        8644 :         lt = palloc_object(LogicalTape);
     704                 :        8644 :         lt->tapeSet = lts;
     705                 :        8644 :         lt->writing = true;
     706                 :        8644 :         lt->frozen = false;
     707                 :        8644 :         lt->dirty = false;
     708                 :        8644 :         lt->firstBlockNumber = -1L;
     709                 :        8644 :         lt->curBlockNumber = -1L;
     710                 :        8644 :         lt->nextBlockNumber = -1L;
     711                 :        8644 :         lt->offsetBlockNumber = 0L;
     712                 :        8644 :         lt->buffer = NULL;
     713                 :        8644 :         lt->buffer_size = 0;
     714                 :             :         /* palloc() larger than MaxAllocSize would fail */
     715                 :        8644 :         lt->max_size = MaxAllocSize;
     716                 :        8644 :         lt->pos = 0;
     717                 :        8644 :         lt->nbytes = 0;
     718                 :        8644 :         lt->prealloc = NULL;
     719                 :        8644 :         lt->nprealloc = 0;
     720                 :        8644 :         lt->prealloc_size = 0;
     721                 :             : 
     722                 :       17288 :         return lt;
     723                 :        8644 : }
     724                 :             : 
     725                 :             : /*
     726                 :             :  * Close a logical tape.
     727                 :             :  *
     728                 :             :  * Note: This doesn't return any blocks to the free list!  You must read
     729                 :             :  * the tape to the end first, to reuse the space.  In current use, though,
     730                 :             :  * we only close tapes after fully reading them.
     731                 :             :  */
     732                 :             : void
     733                 :        4585 : LogicalTapeClose(LogicalTape *lt)
     734                 :             : {
     735         [ -  + ]:        4585 :         if (lt->buffer)
     736                 :        4585 :                 pfree(lt->buffer);
     737                 :        4585 :         pfree(lt);
     738                 :        4585 : }
     739                 :             : 
     740                 :             : /*
     741                 :             :  * Mark a logical tape set as not needing management of free space anymore.
     742                 :             :  *
     743                 :             :  * This should be called if the caller does not intend to write any more data
     744                 :             :  * into the tape set, but is reading from un-frozen tapes.  Since no more
     745                 :             :  * writes are planned, remembering free blocks is no longer useful.  Setting
     746                 :             :  * this flag lets us avoid wasting time and space in ltsReleaseBlock(), which
     747                 :             :  * is not designed to handle large numbers of free blocks.
     748                 :             :  */
     749                 :             : void
     750                 :          45 : LogicalTapeSetForgetFreeSpace(LogicalTapeSet *lts)
     751                 :             : {
     752                 :          45 :         lts->forgetFreeSpace = true;
     753                 :          45 : }
     754                 :             : 
     755                 :             : /*
     756                 :             :  * Write to a logical tape.
     757                 :             :  *
     758                 :             :  * There are no error returns; we ereport() on failure.
     759                 :             :  */
     760                 :             : void
     761                 :     1551181 : LogicalTapeWrite(LogicalTape *lt, const void *ptr, size_t size)
     762                 :             : {
     763                 :     1551181 :         LogicalTapeSet *lts = lt->tapeSet;
     764                 :     1551181 :         size_t          nthistime;
     765                 :             : 
     766         [ +  - ]:     1551181 :         Assert(lt->writing);
     767         [ +  - ]:     1551181 :         Assert(lt->offsetBlockNumber == 0L);
     768                 :             : 
     769                 :             :         /* Allocate data buffer and first block on first write */
     770         [ +  + ]:     1551181 :         if (lt->buffer == NULL)
     771                 :             :         {
     772                 :        4669 :                 lt->buffer = (char *) palloc(BLCKSZ);
     773                 :        4669 :                 lt->buffer_size = BLCKSZ;
     774                 :        4669 :         }
     775         [ +  + ]:     1551181 :         if (lt->curBlockNumber == -1)
     776                 :             :         {
     777         [ +  - ]:        4669 :                 Assert(lt->firstBlockNumber == -1);
     778         [ +  - ]:        4669 :                 Assert(lt->pos == 0);
     779                 :             : 
     780                 :        4669 :                 lt->curBlockNumber = ltsGetBlock(lts, lt);
     781                 :        4669 :                 lt->firstBlockNumber = lt->curBlockNumber;
     782                 :             : 
     783                 :        4669 :                 TapeBlockGetTrailer(lt->buffer)->prev = -1L;
     784                 :        4669 :         }
     785                 :             : 
     786         [ +  - ]:     1551181 :         Assert(lt->buffer_size == BLCKSZ);
     787         [ +  + ]:     3103931 :         while (size > 0)
     788                 :             :         {
     789         [ +  + ]:     1552750 :                 if (lt->pos >= (int) TapeBlockPayloadSize)
     790                 :             :                 {
     791                 :             :                         /* Buffer full, dump it out */
     792                 :        2251 :                         int64           nextBlockNumber;
     793                 :             : 
     794         [ +  - ]:        2251 :                         if (!lt->dirty)
     795                 :             :                         {
     796                 :             :                                 /* Hmm, went directly from reading to writing? */
     797   [ #  #  #  # ]:           0 :                                 elog(ERROR, "invalid logtape state: should be dirty");
     798                 :           0 :                         }
     799                 :             : 
     800                 :             :                         /*
     801                 :             :                          * First allocate the next block, so that we can store it in the
     802                 :             :                          * 'next' pointer of this block.
     803                 :             :                          */
     804                 :        2251 :                         nextBlockNumber = ltsGetBlock(lt->tapeSet, lt);
     805                 :             : 
     806                 :             :                         /* set the next-pointer and dump the current block. */
     807                 :        2251 :                         TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
     808                 :        2251 :                         ltsWriteBlock(lt->tapeSet, lt->curBlockNumber, lt->buffer);
     809                 :             : 
     810                 :             :                         /* initialize the prev-pointer of the next block */
     811                 :        2251 :                         TapeBlockGetTrailer(lt->buffer)->prev = lt->curBlockNumber;
     812                 :        2251 :                         lt->curBlockNumber = nextBlockNumber;
     813                 :        2251 :                         lt->pos = 0;
     814                 :        2251 :                         lt->nbytes = 0;
     815                 :        2251 :                 }
     816                 :             : 
     817                 :     1552750 :                 nthistime = TapeBlockPayloadSize - lt->pos;
     818         [ +  + ]:     1552750 :                 if (nthistime > size)
     819                 :     1550499 :                         nthistime = size;
     820         [ -  + ]:     1552750 :                 Assert(nthistime > 0);
     821                 :             : 
     822                 :     1552750 :                 memcpy(lt->buffer + lt->pos, ptr, nthistime);
     823                 :             : 
     824                 :     1552750 :                 lt->dirty = true;
     825                 :     1552750 :                 lt->pos += nthistime;
     826         [ -  + ]:     1552750 :                 if (lt->nbytes < lt->pos)
     827                 :     1552750 :                         lt->nbytes = lt->pos;
     828                 :     1552750 :                 ptr = (const char *) ptr + nthistime;
     829                 :     1552750 :                 size -= nthistime;
     830                 :             :         }
     831                 :     1551181 : }
     832                 :             : 
     833                 :             : /*
     834                 :             :  * Rewind logical tape and switch from writing to reading.
     835                 :             :  *
     836                 :             :  * The tape must currently be in writing state, or "frozen" in read state.
     837                 :             :  *
     838                 :             :  * 'buffer_size' specifies how much memory to use for the read buffer.
     839                 :             :  * Regardless of the argument, the actual amount of memory used is between
     840                 :             :  * BLCKSZ and MaxAllocSize, and is a multiple of BLCKSZ.  The given value is
     841                 :             :  * rounded down and truncated to fit those constraints, if necessary.  If the
     842                 :             :  * tape is frozen, the 'buffer_size' argument is ignored, and a small BLCKSZ
     843                 :             :  * byte buffer is used.
     844                 :             :  */
     845                 :             : void
     846                 :        4641 : LogicalTapeRewindForRead(LogicalTape *lt, size_t buffer_size)
     847                 :             : {
     848                 :        4641 :         LogicalTapeSet *lts = lt->tapeSet;
     849                 :             : 
     850                 :             :         /*
     851                 :             :          * Round and cap buffer_size if needed.
     852                 :             :          */
     853         [ +  + ]:        4641 :         if (lt->frozen)
     854                 :           1 :                 buffer_size = BLCKSZ;
     855                 :             :         else
     856                 :             :         {
     857                 :             :                 /* need at least one block */
     858         [ +  + ]:        4640 :                 if (buffer_size < BLCKSZ)
     859                 :          30 :                         buffer_size = BLCKSZ;
     860                 :             : 
     861                 :             :                 /* palloc() larger than max_size is unlikely to be helpful */
     862         [ +  + ]:        4640 :                 if (buffer_size > lt->max_size)
     863                 :          56 :                         buffer_size = lt->max_size;
     864                 :             : 
     865                 :             :                 /* round down to BLCKSZ boundary */
     866                 :        4640 :                 buffer_size -= buffer_size % BLCKSZ;
     867                 :             :         }
     868                 :             : 
     869         [ +  + ]:        4641 :         if (lt->writing)
     870                 :             :         {
     871                 :             :                 /*
     872                 :             :                  * Completion of a write phase.  Flush last partial data block, and
     873                 :             :                  * rewind for normal (destructive) read.
     874                 :             :                  */
     875         [ +  + ]:        4640 :                 if (lt->dirty)
     876                 :             :                 {
     877                 :             :                         /*
     878                 :             :                          * As long as we've filled the buffer at least once, its contents
     879                 :             :                          * are entirely defined from valgrind's point of view, even though
     880                 :             :                          * contents beyond the current end point may be stale.  But it's
     881                 :             :                          * possible - at least in the case of a parallel sort - to sort
     882                 :             :                          * such small amount of data that we do not fill the buffer even
     883                 :             :                          * once.  Tell valgrind that its contents are defined, so it
     884                 :             :                          * doesn't bleat.
     885                 :             :                          */
     886                 :        4584 :                         VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
     887                 :             :                                                                           lt->buffer_size - lt->nbytes);
     888                 :             : 
     889                 :        4584 :                         TapeBlockSetNBytes(lt->buffer, lt->nbytes);
     890                 :        4584 :                         ltsWriteBlock(lt->tapeSet, lt->curBlockNumber, lt->buffer);
     891                 :        4584 :                 }
     892                 :        4640 :                 lt->writing = false;
     893                 :        4640 :         }
     894                 :             :         else
     895                 :             :         {
     896                 :             :                 /*
     897                 :             :                  * This is only OK if tape is frozen; we rewind for (another) read
     898                 :             :                  * pass.
     899                 :             :                  */
     900         [ +  - ]:           1 :                 Assert(lt->frozen);
     901                 :             :         }
     902                 :             : 
     903         [ +  + ]:        4641 :         if (lt->buffer)
     904                 :        4585 :                 pfree(lt->buffer);
     905                 :             : 
     906                 :             :         /* the buffer is lazily allocated, but set the size here */
     907                 :        4641 :         lt->buffer = NULL;
     908                 :        4641 :         lt->buffer_size = buffer_size;
     909                 :             : 
     910                 :             :         /* free the preallocation list, and return unused block numbers */
     911         [ +  + ]:        4641 :         if (lt->prealloc != NULL)
     912                 :             :         {
     913         [ +  + ]:       35768 :                 for (int i = lt->nprealloc; i > 0; i--)
     914                 :       31283 :                         ltsReleaseBlock(lts, lt->prealloc[i - 1]);
     915                 :        4485 :                 pfree(lt->prealloc);
     916                 :        4485 :                 lt->prealloc = NULL;
     917                 :        4485 :                 lt->nprealloc = 0;
     918                 :        4485 :                 lt->prealloc_size = 0;
     919                 :        4485 :         }
     920                 :        4641 : }
     921                 :             : 
     922                 :             : /*
     923                 :             :  * Read from a logical tape.
     924                 :             :  *
     925                 :             :  * Early EOF is indicated by return value less than #bytes requested.
     926                 :             :  */
     927                 :             : size_t
     928                 :     1592277 : LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size)
     929                 :             : {
     930                 :     1592277 :         size_t          nread = 0;
     931                 :     1592277 :         size_t          nthistime;
     932                 :             : 
     933         [ +  - ]:     1592277 :         Assert(!lt->writing);
     934                 :             : 
     935         [ +  + ]:     1592277 :         if (lt->buffer == NULL)
     936                 :        4641 :                 ltsInitReadBuffer(lt);
     937                 :             : 
     938         [ +  + ]:     3180702 :         while (size > 0)
     939                 :             :         {
     940         [ +  + ]:     1592910 :                 if (lt->pos >= lt->nbytes)
     941                 :             :                 {
     942                 :             :                         /* Try to load more data into buffer. */
     943         [ +  + ]:        5470 :                         if (!ltsReadFillBuffer(lt))
     944                 :        4485 :                                 break;                  /* EOF */
     945                 :         985 :                 }
     946                 :             : 
     947                 :     1588425 :                 nthistime = lt->nbytes - lt->pos;
     948         [ +  + ]:     1588425 :                 if (nthistime > size)
     949                 :     1582804 :                         nthistime = size;
     950         [ +  - ]:     1588425 :                 Assert(nthistime > 0);
     951                 :             : 
     952                 :     1588425 :                 memcpy(ptr, lt->buffer + lt->pos, nthistime);
     953                 :             : 
     954                 :     1588425 :                 lt->pos += nthistime;
     955                 :     1588425 :                 ptr = (char *) ptr + nthistime;
     956                 :     1588425 :                 size -= nthistime;
     957                 :     1588425 :                 nread += nthistime;
     958                 :             :         }
     959                 :             : 
     960                 :     3184554 :         return nread;
     961                 :     1592277 : }
     962                 :             : 
     963                 :             : /*
     964                 :             :  * "Freeze" the contents of a tape so that it can be read multiple times
     965                 :             :  * and/or read backwards.  Once a tape is frozen, its contents will not
     966                 :             :  * be released until the LogicalTapeSet is destroyed.  This is expected
     967                 :             :  * to be used only for the final output pass of a merge.
     968                 :             :  *
     969                 :             :  * This *must* be called just at the end of a write pass, before the
     970                 :             :  * tape is rewound (after rewind is too late!).  It performs a rewind
     971                 :             :  * and switch to read mode "for free".  An immediately following rewind-
     972                 :             :  * for-read call is OK but not necessary.
     973                 :             :  *
     974                 :             :  * share output argument is set with details of storage used for tape after
     975                 :             :  * freezing, which may be passed to LogicalTapeSetCreate within leader
     976                 :             :  * process later.  This metadata is only of interest to worker callers
     977                 :             :  * freezing their final output for leader (single materialized tape).
     978                 :             :  * Serial sorts should set share to NULL.
     979                 :             :  */
     980                 :             : void
     981                 :          85 : LogicalTapeFreeze(LogicalTape *lt, TapeShare *share)
     982                 :             : {
     983                 :          85 :         LogicalTapeSet *lts = lt->tapeSet;
     984                 :             : 
     985         [ +  - ]:          85 :         Assert(lt->writing);
     986         [ +  - ]:          85 :         Assert(lt->offsetBlockNumber == 0L);
     987                 :             : 
     988                 :             :         /*
     989                 :             :          * Completion of a write phase.  Flush last partial data block, and rewind
     990                 :             :          * for nondestructive read.
     991                 :             :          */
     992         [ -  + ]:          85 :         if (lt->dirty)
     993                 :             :         {
     994                 :             :                 /*
     995                 :             :                  * As long as we've filled the buffer at least once, its contents are
     996                 :             :                  * entirely defined from valgrind's point of view, even though
     997                 :             :                  * contents beyond the current end point may be stale.  But it's
     998                 :             :                  * possible - at least in the case of a parallel sort - to sort such
     999                 :             :                  * small amount of data that we do not fill the buffer even once. Tell
    1000                 :             :                  * valgrind that its contents are defined, so it doesn't bleat.
    1001                 :             :                  */
    1002                 :          85 :                 VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
    1003                 :             :                                                                   lt->buffer_size - lt->nbytes);
    1004                 :             : 
    1005                 :          85 :                 TapeBlockSetNBytes(lt->buffer, lt->nbytes);
    1006                 :          85 :                 ltsWriteBlock(lt->tapeSet, lt->curBlockNumber, lt->buffer);
    1007                 :          85 :         }
    1008                 :          85 :         lt->writing = false;
    1009                 :          85 :         lt->frozen = true;
    1010                 :             : 
    1011                 :             :         /*
    1012                 :             :          * The seek and backspace functions assume a single block read buffer.
    1013                 :             :          * That's OK with current usage.  A larger buffer is helpful to make the
    1014                 :             :          * read pattern of the backing file look more sequential to the OS, when
    1015                 :             :          * we're reading from multiple tapes.  But at the end of a sort, when a
    1016                 :             :          * tape is frozen, we only read from a single tape anyway.
    1017                 :             :          */
    1018   [ +  -  -  + ]:          85 :         if (!lt->buffer || lt->buffer_size != BLCKSZ)
    1019                 :             :         {
    1020         [ #  # ]:           0 :                 if (lt->buffer)
    1021                 :           0 :                         pfree(lt->buffer);
    1022                 :           0 :                 lt->buffer = palloc(BLCKSZ);
    1023                 :           0 :                 lt->buffer_size = BLCKSZ;
    1024                 :           0 :         }
    1025                 :             : 
    1026                 :             :         /* Read the first block, or reset if tape is empty */
    1027                 :          85 :         lt->curBlockNumber = lt->firstBlockNumber;
    1028                 :          85 :         lt->pos = 0;
    1029                 :          85 :         lt->nbytes = 0;
    1030                 :             : 
    1031         [ +  - ]:          85 :         if (lt->firstBlockNumber == -1L)
    1032                 :           0 :                 lt->nextBlockNumber = -1L;
    1033                 :          85 :         ltsReadBlock(lt->tapeSet, lt->curBlockNumber, lt->buffer);
    1034         [ +  + ]:          85 :         if (TapeBlockIsLast(lt->buffer))
    1035                 :          75 :                 lt->nextBlockNumber = -1L;
    1036                 :             :         else
    1037                 :          10 :                 lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
    1038         [ +  + ]:          85 :         lt->nbytes = TapeBlockGetNBytes(lt->buffer);
    1039                 :             : 
    1040                 :             :         /* Handle extra steps when caller is to share its tapeset */
    1041         [ +  + ]:          85 :         if (share)
    1042                 :             :         {
    1043                 :          82 :                 BufFileExportFileSet(lts->pfile);
    1044                 :          82 :                 share->firstblocknumber = lt->firstBlockNumber;
    1045                 :          82 :         }
    1046                 :          85 : }
    1047                 :             : 
    1048                 :             : /*
    1049                 :             :  * Backspace the tape a given number of bytes.  (We also support a more
    1050                 :             :  * general seek interface, see below.)
    1051                 :             :  *
    1052                 :             :  * *Only* a frozen-for-read tape can be backed up; we don't support
    1053                 :             :  * random access during write, and an unfrozen read tape may have
    1054                 :             :  * already discarded the desired data!
    1055                 :             :  *
    1056                 :             :  * Returns the number of bytes backed up.  It can be less than the
    1057                 :             :  * requested amount, if there isn't that much data before the current
    1058                 :             :  * position.  The tape is positioned to the beginning of the tape in
    1059                 :             :  * that case.
    1060                 :             :  */
    1061                 :             : size_t
    1062                 :          12 : LogicalTapeBackspace(LogicalTape *lt, size_t size)
    1063                 :             : {
    1064                 :          12 :         size_t          seekpos = 0;
    1065                 :             : 
    1066         [ +  - ]:          12 :         Assert(lt->frozen);
    1067         [ +  - ]:          12 :         Assert(lt->buffer_size == BLCKSZ);
    1068                 :             : 
    1069         [ +  - ]:          12 :         if (lt->buffer == NULL)
    1070                 :           0 :                 ltsInitReadBuffer(lt);
    1071                 :             : 
    1072                 :             :         /*
    1073                 :             :          * Easy case for seek within current block.
    1074                 :             :          */
    1075         [ +  + ]:          12 :         if (size <= (size_t) lt->pos)
    1076                 :             :         {
    1077                 :          11 :                 lt->pos -= (int) size;
    1078                 :          11 :                 return size;
    1079                 :             :         }
    1080                 :             : 
    1081                 :             :         /*
    1082                 :             :          * Not-so-easy case, have to walk back the chain of blocks.  This
    1083                 :             :          * implementation would be pretty inefficient for long seeks, but we
    1084                 :             :          * really aren't doing that (a seek over one tuple is typical).
    1085                 :             :          */
    1086                 :           1 :         seekpos = (size_t) lt->pos; /* part within this block */
    1087         [ +  - ]:           1 :         while (size > seekpos)
    1088                 :             :         {
    1089                 :           1 :                 int64           prev = TapeBlockGetTrailer(lt->buffer)->prev;
    1090                 :             : 
    1091         [ -  + ]:           1 :                 if (prev == -1L)
    1092                 :             :                 {
    1093                 :             :                         /* Tried to back up beyond the beginning of tape. */
    1094         [ +  - ]:           1 :                         if (lt->curBlockNumber != lt->firstBlockNumber)
    1095   [ #  #  #  # ]:           0 :                                 elog(ERROR, "unexpected end of tape");
    1096                 :           1 :                         lt->pos = 0;
    1097                 :           1 :                         return seekpos;
    1098                 :             :                 }
    1099                 :             : 
    1100                 :           0 :                 ltsReadBlock(lt->tapeSet, prev, lt->buffer);
    1101                 :             : 
    1102         [ #  # ]:           0 :                 if (TapeBlockGetTrailer(lt->buffer)->next != lt->curBlockNumber)
    1103   [ #  #  #  # ]:           0 :                         elog(ERROR, "broken tape, next of block %" PRId64 " is %" PRId64 ", expected %" PRId64,
    1104                 :             :                                  prev,
    1105                 :             :                                  TapeBlockGetTrailer(lt->buffer)->next,
    1106                 :             :                                  lt->curBlockNumber);
    1107                 :             : 
    1108                 :           0 :                 lt->nbytes = TapeBlockPayloadSize;
    1109                 :           0 :                 lt->curBlockNumber = prev;
    1110                 :           0 :                 lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
    1111                 :             : 
    1112                 :           0 :                 seekpos += TapeBlockPayloadSize;
    1113         [ +  - ]:           1 :         }
    1114                 :             : 
    1115                 :             :         /*
    1116                 :             :          * 'seekpos' can now be greater than 'size', because it points to the
    1117                 :             :          * beginning the target block.  The difference is the position within the
    1118                 :             :          * page.
    1119                 :             :          */
    1120                 :           0 :         lt->pos = seekpos - size;
    1121                 :           0 :         return size;
    1122                 :          12 : }
    1123                 :             : 
    1124                 :             : /*
    1125                 :             :  * Seek to an arbitrary position in a logical tape.
    1126                 :             :  *
    1127                 :             :  * *Only* a frozen-for-read tape can be seeked.
    1128                 :             :  *
    1129                 :             :  * Must be called with a block/offset previously returned by
    1130                 :             :  * LogicalTapeTell().
    1131                 :             :  */
    1132                 :             : void
    1133                 :        1032 : LogicalTapeSeek(LogicalTape *lt, int64 blocknum, int offset)
    1134                 :             : {
    1135         [ +  - ]:        1032 :         Assert(lt->frozen);
    1136         [ +  - ]:        1032 :         Assert(offset >= 0 && offset <= TapeBlockPayloadSize);
    1137         [ +  - ]:        1032 :         Assert(lt->buffer_size == BLCKSZ);
    1138                 :             : 
    1139         [ +  - ]:        1032 :         if (lt->buffer == NULL)
    1140                 :           0 :                 ltsInitReadBuffer(lt);
    1141                 :             : 
    1142         [ +  + ]:        1032 :         if (blocknum != lt->curBlockNumber)
    1143                 :             :         {
    1144                 :          13 :                 ltsReadBlock(lt->tapeSet, blocknum, lt->buffer);
    1145                 :          13 :                 lt->curBlockNumber = blocknum;
    1146                 :          13 :                 lt->nbytes = TapeBlockPayloadSize;
    1147                 :          13 :                 lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
    1148                 :          13 :         }
    1149                 :             : 
    1150         [ +  - ]:        1032 :         if (offset > lt->nbytes)
    1151   [ #  #  #  # ]:           0 :                 elog(ERROR, "invalid tape seek position");
    1152                 :        1032 :         lt->pos = offset;
    1153                 :        1032 : }
    1154                 :             : 
    1155                 :             : /*
    1156                 :             :  * Obtain current position in a form suitable for a later LogicalTapeSeek.
    1157                 :             :  *
    1158                 :             :  * NOTE: it'd be OK to do this during write phase with intention of using
    1159                 :             :  * the position for a seek after freezing.  Not clear if anyone needs that.
    1160                 :             :  */
    1161                 :             : void
    1162                 :        1468 : LogicalTapeTell(LogicalTape *lt, int64 *blocknum, int *offset)
    1163                 :             : {
    1164         [ +  - ]:        1468 :         if (lt->buffer == NULL)
    1165                 :           0 :                 ltsInitReadBuffer(lt);
    1166                 :             : 
    1167         [ +  - ]:        1468 :         Assert(lt->offsetBlockNumber == 0L);
    1168                 :             : 
    1169                 :             :         /* With a larger buffer, 'pos' wouldn't be the same as offset within page */
    1170         [ +  - ]:        1468 :         Assert(lt->buffer_size == BLCKSZ);
    1171                 :             : 
    1172                 :        1468 :         *blocknum = lt->curBlockNumber;
    1173                 :        1468 :         *offset = lt->pos;
    1174                 :        1468 : }
    1175                 :             : 
    1176                 :             : /*
    1177                 :             :  * Obtain total disk space currently used by a LogicalTapeSet, in blocks. Does
    1178                 :             :  * not account for open write buffer, if any.
    1179                 :             :  */
    1180                 :             : int64
    1181                 :        4625 : LogicalTapeSetBlocks(LogicalTapeSet *lts)
    1182                 :             : {
    1183                 :        4625 :         return lts->nBlocksWritten - lts->nHoleBlocks;
    1184                 :             : }
        

Generated by: LCOV version 2.3.2-1