splitting-strategy.md 3.3KB

# Semantic Splitting Strategy

When the source content is large (exceeds ~15,000 tokens) or a token_budget requires it, split the distillate into semantically coherent sections rather than at arbitrary size breaks.

## Why Semantic Over Size-Based

Arbitrary splits (every N tokens) break coherence. A downstream workflow loading "part 2 of 4" gets fragments without their surrounding context. Semantic splits produce self-contained topic clusters that a workflow can load selectively ("give me just the technical decisions section"), which is more useful and more token-efficient for the consumer.

## Splitting Process

### 1. Identify Natural Boundaries

After the initial extraction and deduplication (Steps 1-2 of the compression process), look for natural semantic boundaries:

- Distinct problem domains or functional areas
- Different stakeholder perspectives (users, technical, business)
- Temporal boundaries (current state vs future vision)
- Scope boundaries (in-scope vs out-of-scope vs deferred)
- Phase boundaries (analysis, design, implementation)

Choose boundaries that produce sections a downstream workflow might load independently.

### 2. Assign Items to Sections

For each extracted item, assign it to the most relevant section. Items that span multiple sections go in the root distillate.

Cross-cutting items (items relevant to multiple sections):

- Constraints that affect all areas → root distillate
- Decisions with broad impact → root distillate
- Section-specific decisions → section distillate

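
The routing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input shape (a list of `(text, sections)` pairs), the `ROOT` key, and the function name `assign_items` are hypothetical and not part of the strategy itself. Sending items with no section at all to the root is also an assumption.

```python
from collections import defaultdict

ROOT = "_root"  # hypothetical key standing in for the root distillate


def assign_items(items):
    """Route each item to its single section, or to the root distillate.

    `items` is a list of (text, sections) pairs, where `sections` is the set
    of section names the item is relevant to (an assumed input shape).
    """
    buckets = defaultdict(list)
    for text, sections in items:
        if len(sections) == 1:
            (only,) = sections
            buckets[only].append(text)
        else:
            # Items spanning several sections are cross-cutting and go to
            # the root; items with no section also land here (assumption).
            buckets[ROOT].append(text)
    return dict(buckets)


if __name__ == "__main__":
    demo = [
        ("Must run offline", {"technical", "users"}),
        ("Use SQLite for local storage", {"technical"}),
    ]
    print(assign_items(demo))
```
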
### 3. Produce Root Distillate

The root distillate contains:

- **Orientation** (3-5 bullets): what was distilled, from what sources, for what consumer, how many sections
- **Cross-references**: list of section distillates with 1-line descriptions
- **Cross-cutting items**: facts, decisions, and constraints that span multiple sections
- **Scope summary**: high-level in/out/deferred if applicable

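
As a rough sketch, the four parts listed above could be assembled into `_index.md` like this. The heading names, the function name `render_root`, and the argument shapes are all assumptions made for illustration; the strategy does not prescribe a particular layout.

```python
def render_root(orientation, sections, cross_cutting, scope=None):
    """Assemble root distillate text from its four parts.

    orientation:   list of 3-5 bullet strings
    sections:      list of (filename, one_line_description) pairs
    cross_cutting: list of bullet strings
    scope:         optional dict such as {"In": "...", "Deferred": "..."}
    Heading names below are hypothetical, not mandated by the strategy.
    """
    if not 3 <= len(orientation) <= 5:
        raise ValueError("orientation should have 3-5 bullets")
    lines = ["## Orientation"]
    lines += [f"- {bullet}" for bullet in orientation]
    lines += ["", "## Sections"]
    lines += [f"- {filename}: {desc}" for filename, desc in sections]
    lines += ["", "## Cross-cutting items"]
    lines += [f"- {item}" for item in cross_cutting]
    if scope:  # the scope summary is included only if applicable
        lines += ["", "## Scope summary"]
        lines += [f"- {label}: {text}" for label, text in scope.items()]
    return "\n".join(lines) + "\n"
```
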
### 4. Produce Section Distillates

Each section distillate must be self-sufficient: a reader loading only one section should understand it without the others.

Each section includes:

- **Context header** (1 line): "This section covers [topic]. Part N of M from [source document names]."
- **Section content**: thematically-grouped bullets following the same compression rules as a single distillate
- **Cross-references** (if needed): pointers to other sections for related content

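
The one-line context header follows a fixed template, so it can be generated mechanically. A minimal sketch, assuming a hypothetical helper named `context_header` and that source document names are joined with commas:

```python
def context_header(topic, part, total, sources):
    """Build the one-line context header for a section distillate."""
    if not 1 <= part <= total:
        raise ValueError("part must be between 1 and total")
    # Joining source names with ", " is an assumption about formatting.
    return (
        f"This section covers {topic}. "
        f"Part {part} of {total} from {', '.join(sources)}."
    )


if __name__ == "__main__":
    print(context_header("technical decisions", 2, 3, ["product-brief.md"]))
```
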
### 5. Output Structure

Create a folder `{base-name}-distillate/` containing:

```
{base-name}-distillate/
├── _index.md # Root distillate: orientation, cross-cutting items, section manifest
├── 01-{topic-slug}.md # Self-contained section
├── 02-{topic-slug}.md
└── 03-{topic-slug}.md
```

Example:

```
product-brief-distillate/
├── _index.md
├── 01-problem-solution.md
├── 02-technical-decisions.md
└── 03-users-market.md
```

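
The naming convention above can be reproduced with a small helper. This is a sketch, not a prescribed implementation: `slugify` and `planned_files` are hypothetical names, and the slug rule (lowercase, runs of non-alphanumeric characters collapsed to a single hyphen) is an assumption inferred from the example folder.

```python
import re


def slugify(topic):
    """Lowercase the topic and collapse non-alphanumeric runs to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")


def planned_files(base_name, topics):
    """List the relative paths the distillate folder would contain."""
    folder = f"{base_name}-distillate"
    files = [f"{folder}/_index.md"]  # root distillate comes first
    files += [
        f"{folder}/{number:02d}-{slugify(topic)}.md"
        for number, topic in enumerate(topics, start=1)
    ]
    return files


if __name__ == "__main__":
    topics = ["Problem & Solution", "Technical Decisions", "Users / Market"]
    for path in planned_files("product-brief", topics):
        print(path)
```
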
## Size Targets

When a token_budget is specified:

- Root distillate: ~20% of budget (orientation + cross-cutting items)
- Remaining budget split proportionally across sections based on content density
- If a section exceeds its proportional share, compress more aggressively or sub-split

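
The budget arithmetic above can be worked through in code. The sketch below assumes that "content density" is represented by an integer weight per section (for example, an estimated token count of the items assigned to it); that representation, and the name `allocate_budget`, are assumptions for illustration.

```python
def allocate_budget(token_budget, section_weights, root_percent=20):
    """Split a token budget between the root and the sections.

    section_weights maps section name -> positive integer weight standing in
    for content density (an assumed representation).
    Returns (root_tokens, {section_name: tokens}).
    """
    total_weight = sum(section_weights.values())
    if total_weight <= 0:
        raise ValueError("section weights must sum to a positive number")
    root_tokens = token_budget * root_percent // 100
    remaining = token_budget - root_tokens
    # Integer division rounds down, so a few tokens may stay unallocated.
    per_section = {
        name: remaining * weight // total_weight
        for name, weight in section_weights.items()
    }
    return root_tokens, per_section


if __name__ == "__main__":
    weights = {"problem": 2, "technical": 5, "users": 3}
    print(allocate_budget(10_000, weights))
```

With a budget of 10,000 tokens and weights 2:5:3, the root gets 2,000 tokens and the sections get 1,600, 4,000, and 2,400. A section whose compressed content still exceeds its share is the case the last bullet covers: compress further or sub-split.
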
When no token_budget is specified but splitting is needed:

- Aim for sections of 3,000-5,000 tokens each
- Keep the root distillate as small as possible while still being useful on its own
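
The 3,000-5,000 token target implies a rough range for the number of sections. A back-of-the-envelope sketch, assuming `total_tokens` is an estimate of the section content (excluding the root) and using a hypothetical helper name:

```python
import math


def section_count_range(total_tokens, low=3_000, high=5_000):
    """Return (fewest, most) sections that keep sizes near low..high tokens."""
    fewest = max(1, math.ceil(total_tokens / high))  # each section <= high
    most = max(fewest, total_tokens // low)          # each section >= low
    return fewest, most


if __name__ == "__main__":
    print(section_count_range(18_000))
```

For an estimated 18,000 tokens of section content this gives 4 to 6 sections. The count is only a starting point: semantic boundaries, not the arithmetic, should decide where the splits actually fall.
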