busybox/docs/keep_data_small.txt
                Keeping data small

When many applets are compiled into busybox, the rw data and
bss of all applets are concatenated, together with those of libc
if busybox is built statically. When busybox starts, _all_ of this
data is allocated, not just the part belonging to the selected applet.

What "allocated" means exactly depends on the architecture.
On NOMMU it probably bites the most, since rwdata and bss occupy
real RAM. On i386, bss is lazily allocated as COWed zero pages.
rwdata is presumably COWed too, but this has not been verified.

In order to keep busybox friendly to NOMMU and small-memory systems,
we should avoid large global data in our applets, and should
minimize the use of libc functions which implicitly use
such structures.
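
To make the terms concrete, here is a minimal sketch (not busybox code,
all names are hypothetical) of where typical C objects end up. The section
placement in the comments is the usual behaviour of gcc on ELF targets
and may differ elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Initialized and writable: stored in .data ("rw data"). */
static int retry_limit = 5;

/* Zero-initialized and writable: takes no file space, lives in .bss. */
static char line_buf[4096];

/* Read-only: usually placed in .rodata, which is never written. */
static const char banner[] = "hello";

/* Reference each object so that none of them is flagged as unused. */
int use_objects(void)
{
	assert(retry_limit > 0);
	line_buf[0] = banner[0];
	return retry_limit + line_buf[0];
}

/* Writable bytes this hypothetical applet contributes to the binary. */
size_t writable_footprint(void)
{
	return sizeof(retry_limit) + sizeof(line_buf);
}
```

Only the first two objects count towards the data discussed here:
moving line_buf into malloc'ed storage would remove 4096 bytes of bss
from every applet in the binary, not just from this one.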

Small experiment to measure "parasitic" bbox memory consumption:
here we start 1000 "busybox sleep 10" processes in parallel.
The busybox binary is a static, practically allyesconfig build
against uclibc, run on an x86-64 machine with a 64-bit kernel:

bash-3.2# nmeter '%t %c %m %p %[pn]'
23:17:28 .......... 168M    0  147
23:17:29 .......... 168M    0  147
23:17:30 U......... 168M    1  147
23:17:31 SU........ 181M  244  391
23:17:32 SSSSUUU... 223M  757 1147
23:17:33 UUU....... 223M    0 1147
23:17:34 U......... 223M    1 1147
23:17:35 .......... 223M    0 1147
23:17:36 .......... 223M    0 1147
23:17:37 S......... 223M    0 1147
23:17:38 .......... 223M    1 1147
23:17:39 .......... 223M    0 1147
23:17:40 .......... 223M    0 1147
23:17:41 .......... 210M    0  906
23:17:42 .......... 168M    1  147
23:17:43 .......... 168M    0  147

Memory usage rises from 168M to 223M while the processes run, so the
1000 processes require 55M. Thus one trivial busybox applet
takes about 55k of memory on a 64-bit x86 kernel.

On a 32-bit kernel we need ~26k per applet.

Script:

i=1000; while test $i != 0; do
        echo -n .
        busybox sleep 30 &
        i=$((i - 1))
done
echo
wait

(Data from NOMMU arches are sought. Provide 'size busybox' output too.)


                Example 1

One example of how to reduce global data usage is in
archival/libarchive/decompress_gunzip.c:

/* This is somewhat complex-looking arrangement, but it allows
 * to place decompressor state either in bss or in
 * malloc'ed space simply by changing #defines below.
 * Sizes on i386:
 * text    data     bss     dec     hex
 * 5256       0     108    5364    14f4 - bss
 * 4915       0       0    4915    1333 - malloc
 */
#define STATE_IN_BSS 0
#define STATE_IN_MALLOC 1

(see the rest of the file to get the idea)

This example completely eliminates globals in that module.
The required memory is allocated in unpack_gz_stream() [the module's
main function] and then passed as a parameter to all subroutines
which need to access the 'globals'.
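
The pattern can be sketched as follows. This is a simplified illustration
and not the gunzip code: state_t, put_byte() and unpack_stream() are
hypothetical names, and the "decompression" merely copies bytes into a
window.

```c
#include <assert.h>
#include <stdlib.h>

/* Everything that would otherwise be a global goes into one struct. */
typedef struct state_t {
	unsigned char window[256];
	size_t window_pos;
} state_t;

/* Subroutines receive the state explicitly instead of touching globals. */
static void put_byte(state_t *state, unsigned char c)
{
	assert(state != NULL);
	state->window[state->window_pos % sizeof(state->window)] = c;
	state->window_pos++;
}

/* Entry point: allocates the state, passes it down, frees it.
 * Returns the number of bytes processed, or -1 if allocation fails. */
long unpack_stream(const unsigned char *in, size_t len)
{
	state_t *state = calloc(1, sizeof(*state));
	size_t i;
	long n;

	if (!state)
		return -1;
	for (i = 0; i < len; i++)
		put_byte(state, in[i]);
	n = (long)state->window_pos;
	free(state);
	return n;
}
```

The module has no data or bss at all, and the memory is held only
while unpack_stream() is running.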


                Example 2

If you don't want to pass this additional parameter everywhere,
take a look at archival/gzip.c. Here all global data is replaced by
a single global pointer (ptr_to_globals) to allocated storage.

In order to not duplicate ptr_to_globals in every applet, you can
reuse a single common one. It is defined in libbb/ptr_to_globals.c
as struct globals *const ptr_to_globals, but struct globals itself is
NOT defined in libbb.h. You first define your own struct:

struct globals { int a; char buf[1000]; };

and then define G as shorthand for the struct ptr_to_globals points to:

#define G (*ptr_to_globals)

ptr_to_globals is declared as a constant pointer.
This helps gcc understand that it won't change, resulting in noticeably
smaller code. In order to assign it, use the SET_PTR_TO_GLOBALS macro:

        SET_PTR_TO_GLOBALS(xzalloc(sizeof(G)));

Typically this is done in <applet>_main(). Another variation is
to use the stack:

int <applet>_main(...)
{
#undef G
        struct globals G;
        memset(&G, 0, sizeof(G));
        SET_PTR_TO_GLOBALS(&G);

Now you can reference "globals" as G.a, G.buf and so on, in any function.
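
A self-contained sketch of the same idea, buildable outside the busybox
tree. The pointer and the SET_PTR_TO_GLOBALS macro below are simplified
stand-ins (the pointer is not const here, and calloc() replaces xzalloc());
remember() and demo_main() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct globals {
	int a;
	char buf[1000];
};

/* Stand-ins for what libbb provides (assumption: simplified, non-const). */
static struct globals *ptr_to_globals;
#define SET_PTR_TO_GLOBALS(x) do { ptr_to_globals = (x); } while (0)
#define G (*ptr_to_globals)

/* Any function can reach the "globals" through G. */
static void remember(const char *s)
{
	assert(ptr_to_globals != NULL);
	strncpy(G.buf, s, sizeof(G.buf) - 1);
	G.a++;
}

/* Plays the role of <applet>_main(): returns how often remember() ran. */
int demo_main(const char *arg)
{
	struct globals *g = calloc(1, sizeof(G));
	int calls;

	if (!g)
		return -1;
	SET_PTR_TO_GLOBALS(g);
	remember(arg);
	remember(arg);
	calls = G.a;
	free(ptr_to_globals);
	SET_PTR_TO_GLOBALS(NULL);
	return calls;
}
```

The only object left in bss is the pointer itself; the 1004-odd bytes of
struct globals exist only while the applet runs.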


                bb_common_bufsiz1

There is one big common buffer in bss - bb_common_bufsiz1. It is a much
earlier mechanism to reduce bss usage. Each applet can use it for
its own needs. Library functions are prohibited from using it.

The 'G.' trick can be done using bb_common_bufsiz1 instead of a malloced
buffer:

#define G (*(struct globals*)&bb_common_bufsiz1)

Be careful, though, and use it only if the globals fit into
bb_common_bufsiz1. Since bb_common_bufsiz1 is BUFSIZ + 1 bytes long and
BUFSIZ can change from one libc to another, you have to add a
compile-time check for it:

if (sizeof(struct globals) > sizeof(bb_common_bufsiz1))
        BUG_<applet>_globals_too_big();
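
The check above presumably relies on BUG_<applet>_globals_too_big() being
declared but never defined: when the condition is a compile-time false,
the compiler drops the call, otherwise the link fails. With a C11
compiler the same guard can be written with static_assert, which is
a different technique from the one used above. A minimal sketch, in which
common_buf is a hypothetical stand-in for bb_common_bufsiz1:

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>
#include <stdio.h>	/* BUFSIZ */
#include <string.h>

/* Stand-in for bb_common_bufsiz1 (assumption: BUFSIZ + 1 bytes). */
static char common_buf[BUFSIZ + 1];

struct globals {
	int dev_fd;
	unsigned sector;
	char name[64];
};

/* The build stops here if the struct outgrows the buffer. */
static_assert(sizeof(struct globals) <= sizeof(common_buf),
	"globals do not fit into the common buffer");

/* Clears the area the globals would occupy and reports
 * how many bytes of the buffer are left over. */
size_t globals_headroom(void)
{
	memset(common_buf, 0, sizeof(struct globals));
	return sizeof(common_buf) - sizeof(struct globals);
}
```

Unlike the link-time trick, this form does not depend on the optimizer
removing a dead call, but it does require C11.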


                Drawbacks

You have to initialize the globals by hand. xzalloc() can be helpful in
clearing allocated storage to 0, but anything more must be done by hand.

All global variables are prefixed by 'G.' now. If this makes code
less readable, use #defines:

#define dev_fd (G.dev_fd)
#define sector (G.sector)


                Finding non-shared duplicated strings

strings busybox | sort | uniq -c | sort -nr


                gcc's data alignment problem

The following attribute, added in vi.c:

static int tabstop;
static struct termios term_orig __attribute__ ((aligned (4)));
static struct termios term_vi __attribute__ ((aligned (4)));

reduces bss size by 32 bytes, because gcc sometimes aligns structures to
ridiculously large values. The asm output diff for the above example:

 tabstop:
        .zero   4
        .section        .bss.term_orig,"aw",@nobits
-       .align 32
+       .align 4
        .type   term_orig, @object
        .size   term_orig, 60
 term_orig:
        .zero   60
        .section        .bss.term_vi,"aw",@nobits
-       .align 32
+       .align 4
        .type   term_vi, @object
        .size   term_vi, 60

gcc doesn't seem to have options for altering this behaviour.

gcc 3.4.3 and 4.1.1 tested:
char c = 1;
// gcc aligns to 32 bytes if sizeof(struct) >= 32
struct {
    int a,b,c,d;
    int i1,i2,i3;
} s28 = { 1 };    // struct will be aligned to 4 bytes
struct {
    int a,b,c,d;
    int i1,i2,i3,i4;
} s32 = { 1 };    // struct will be aligned to 32 bytes
// same for arrays
char vc31[31] = { 1 }; // unaligned
char vc32[32] = { 1 }; // aligned to 32 bytes

-fpack-struct=1 reduces alignment of s28 to 1 (but will probably
break the layout of many libc structs), while s32 and vc32
are still aligned to 32 bytes.
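
A rough way to observe this at run time is to look at object addresses.
The sketch below is an illustration only: rec32, plain, capped and the
helper functions are hypothetical names, and the attribute syntax is
specific to gcc-compatible compilers. An address can happen to be more
aligned than required, so the assembler output remains the more reliable
check.

```c
#include <assert.h>
#include <stdint.h>

struct rec32 {
	int a, b, c, d;
	int i1, i2, i3, i4;
};

/* gcc on x86 may raise the alignment of this object on its own. */
static struct rec32 plain = { 1 };

/* Explicitly requests 4-byte alignment, as in the vi.c example. */
static struct rec32 capped __attribute__((aligned(4))) = { 1 };

/* Largest power of two that divides v (v must be nonzero). */
uintptr_t lowest_set_bit(uintptr_t v)
{
	assert(v != 0);
	return v & (~v + 1);
}

/* Alignment actually observed for each object's address. */
uintptr_t plain_alignment(void)
{
	return lowest_set_bit((uintptr_t)&plain);
}

uintptr_t capped_alignment(void)
{
	return lowest_set_bit((uintptr_t)&capped);
}
```

Printing both values for a build of interest shows whether the compiler
padded the unattributed object; the results depend on compiler version,
target and link order.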

I will try to cook up a patch to add a gcc option for disabling it.
Meanwhile, this is where it can be disabled in gcc source:

gcc/config/i386/i386.c
int
ix86_data_alignment (tree type, int align)
{
#if 0
  if (AGGREGATE_TYPE_P (type)
       && TYPE_SIZE (type)
       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
       && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
           || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
    return 256;
#endif

Result (non-static busybox built against glibc):

# size /usr/srcdevel/bbox/fix/busybox.t0/busybox busybox
   text    data     bss     dec     hex filename
 634416    2736   23856  661008   a1610 busybox
 632580    2672   22944  658196   a0b14 busybox_noalign



                Keeping code small

Use scripts/bloat-o-meter to check that your changes
did not introduce unnecessary bloat. This script needs unstripped binaries
to generate a detailed report. To automate this, just use
"make bloatcheck". It requires a busybox_old binary to be present:
use "make baseline" to generate it from unmodified source, or
copy busybox_unstripped to busybox_old before modifying sources
and rebuilding.

Set CONFIG_EXTRA_CFLAGS="-fno-inline-functions-called-once",
run "make bloatcheck", and see the biggest auto-inlined functions.
Now set CONFIG_EXTRA_CFLAGS back to "", but add NOINLINE
to some of these functions. In the 1.16.x timeframe, the results were
(annotated "make bloatcheck" output):

function             old     new   delta
expand_vars_to_list    -    1712   +1712 win
lzo1x_optimize         -    1429   +1429 win
arith_apply            -    1326   +1326 win
read_interfaces        -    1163   +1163 loss, leave w/o NOINLINE
logdir_open            -    1148   +1148 win
check_deps             -    1148   +1148 loss
rewrite                -    1039   +1039 win
run_pipe             358    1396   +1038 win
write_status_file      -    1029   +1029 almost the same, leave w/o NOINLINE
dump_identity          -     987    +987 win
mainQSort3             -     921    +921 win
parse_one_line         -     916    +916 loss
summarize              -     897    +897 almost the same
do_shm                 -     884    +884 win
cpio_o                 -     863    +863 win
subCommand             -     841    +841 loss
receive                -     834    +834 loss

855 bytes saved in total.
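
For readers unfamiliar with the annotation, here is a minimal sketch of
marking a single-caller function. The NOINLINE definition below is
an assumed stand-in using the gcc attribute, and checksum() and
checksum_str() are hypothetical functions.

```c
#include <assert.h>
#include <string.h>

/* Assumption: stand-in for busybox's NOINLINE (gcc-compatible compilers). */
#define NOINLINE __attribute__((__noinline__))

/* Static function with one caller: a typical auto-inlining candidate.
 * NOINLINE keeps it as a separate function in the object file. */
static NOINLINE unsigned checksum(const unsigned char *p, size_t len)
{
	unsigned sum = 0;

	assert(p != NULL);
	while (len--)
		sum = sum * 31 + *p++;
	return sum;
}

unsigned checksum_str(const char *s)
{
	return checksum((const unsigned char *)s, strlen(s));
}
```

Whether this is a win or a loss has to be measured per function, as the
annotated table shows.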

scripts/mkdiff_obj_bloat may be useful to automate this process: run
"scripts/mkdiff_obj_bloat NORMALLY_BUILT_TREE FORCED_NOINLINE_TREE"
and select modules which shrank.