support PP-OCRv6 #17921

Open
leo-q8 wants to merge 16 commits into PaddlePaddle:main from leo-q8:v6_main

leo-q8 (Collaborator) commented Apr 14, 2026

No description provided.

paddle-bot (Bot) commented Apr 14, 2026

Thanks for your contribution!

1. Restore set_epoch_as_seed in SimpleDataSet to re-enable adaptive
   shrink_ratio (0.4→0.6) curriculum learning in MakeBorderMap and
   MakeShrinkMap across training epochs.
2. Remove default p=1.0 for Affine augmentation, restoring
   albumentations default p=0.5 (50% rotation probability).

The two changes being undone here (dropping set_epoch_as_seed and forcing
Affine p=1.0) caused V5 dataset precision to drop significantly
(e.g. blur 0.904→0.709, printing_ch 0.926→0.708), while V4 remained
unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leo-q8 and others added 12 commits April 18, 2026 07:55
- Add EMA (Exponential Moving Average) support in training pipeline:
  build EMA in train.py, apply/restore in program.py, save/load in
  save_load.py (gated by use_ema config, disabled by default)
- Replace build_dataloader with lightweight reset_data_lines to avoid
  rebuilding DataLoader every epoch when need_reset=True
- Store _base_shrink_ratio/_total_epoch in MakeBorderMap/MakeShrinkMap
  to support in-place shrink_ratio updates across epochs
- Add dilated_kernel_size=7 for PP-OCRv6 small det config
- Remove redundant balance_loss=true from base/small configs (matches
  Python default)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rruption

Previously, Global.seed (1024) was passed to build_dataloader as the epoch
parameter, causing shrink_ratio to be incorrectly computed as 0.81 instead
of 0.4 at training start. This separates the two concerns: seed controls
data shuffling randomness, epoch controls adaptive shrink_ratio progression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
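
The arithmetic behind the 0.81 figure can be reproduced with a small sketch. The linear schedule and total_epoch=500 below are assumptions, not taken from the diff; they are chosen because they match the 0.4→0.6 range and the 0.81 value quoted in this commit message. The function name is hypothetical.

```python
def shrink_ratio_for_epoch(base_ratio: float, epoch: int, total_epoch: int) -> float:
    """Assumed linear schedule: base_ratio at epoch 0, base_ratio + 0.2 at total_epoch."""
    return base_ratio + 0.2 * epoch / float(total_epoch)


BASE_RATIO = 0.4
TOTAL_EPOCH = 500  # assumption; 500 epochs is mentioned elsewhere in this PR

# Intended behaviour: the real epoch index drives the schedule.
start_ratio = shrink_ratio_for_epoch(BASE_RATIO, 0, TOTAL_EPOCH)
end_ratio = shrink_ratio_for_epoch(BASE_RATIO, TOTAL_EPOCH, TOTAL_EPOCH)

# Bug described above: Global.seed (1024) was passed where the epoch belongs.
buggy_ratio = shrink_ratio_for_epoch(BASE_RATIO, 1024, TOTAL_EPOCH)

print(round(start_ratio, 2), round(end_ratio, 2), round(buggy_ratio, 2))
```

Under these assumptions the seed value lands well outside the intended range, which is why the seed and the epoch need separate parameters.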
Core EMA implementation with threshold/exponential/normal decay types,
bias-correction, ema_filter_no_grad for distillation, and exchange-
pattern save/restore compatible with PaddleDetection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
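
For readers unfamiliar with the exchange pattern, here is a minimal sketch of an EMA with bias correction and apply/restore. It covers only a plain constant-decay update; the threshold and exponential decay types and ema_filter_no_grad are not shown. The class name, the default decay, and the use of plain floats in place of tensors are all hypothetical and do not reflect the code in this PR.

```python
class SimpleEMA:
    """Toy EMA over a dict of float 'parameters' (stand-ins for tensors)."""

    def __init__(self, params, decay=0.9998):
        self.decay = decay
        self.step = 0
        # Zero-initialised shadow values; bias correction compensates for this.
        self.shadow = {name: 0.0 for name in params}
        self.backup = {}

    def update(self, params):
        """Blend the current parameter values into the running average."""
        self.step += 1
        for name, value in params.items():
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * value
            )

    def averaged(self):
        """Return bias-corrected averages (raw shadow if no update happened yet)."""
        if self.step == 0:
            return dict(self.shadow)
        correction = 1.0 - self.decay ** self.step
        return {name: value / correction for name, value in self.shadow.items()}

    def apply(self, params):
        """Exchange: stash the live weights, then write the averaged ones in place."""
        self.backup = dict(params)
        params.update(self.averaged())

    def restore(self, params):
        """Exchange back: put the stashed live weights into place again."""
        params.update(self.backup)
        self.backup = {}
```

A typical flow would call update() after each optimizer step, apply() before evaluation or checkpointing, and restore() before training resumes.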
Root cause: each epoch rebuilt the DataLoader, destroying and re-forking
16 worker processes. Worker join() triggers COW page cleanup that grows
linearly with parent process memory (epoch 1: ~1s, epoch 400: ~466s),
totaling ~27 hours over 500 epochs.

Solution:
- Enable persistent_workers=True in DataLoader for training
- Pre-load all label lines once; use _index_map for per-epoch ratio
  sampling (bit-exact with original get_image_info_list + shuffle)
- Signal epoch changes to workers via multiprocessing.Value shared memory
- Workers lazily rebuild _index_map on next __getitem__ (no disk I/O)
- MakeBorderMap/MakeShrinkMap dynamically compute shrink_ratio from
  shared epoch value (_get_shrink_ratio), replacing per-epoch op rebuild
- Reset reader_start after save_model to fix avg_reader_cost accounting

Verified against 53b006f baseline with real config (34 label files,
ratio 0.01~0.8): data sampling bit-exact, shrink_ratio numerically
identical, ext_data pool equivalent, DistributedBatchSampler unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace _shared_epoch injection into MakeBorderMap/MakeShrinkMap with
a simpler approach: dataset.__getitem__ sets data["epoch"], ops read
it in __call__/draw_border_map. This keeps ops as pure data transforms
with no cross-process state, and removes _setup_shared_epoch_in_ops,
_get_shrink_ratio, and _update_epoch_in_ops (-44 lines, +7 lines).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
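
A rough sketch of the data flow described above: the dataset stamps the current epoch into each sample and the op reads it at call time, so no state is shared across processes. ToyDataset, ToyShrinkOp, set_epoch, and the linear schedule are hypothetical stand-ins, not the classes touched by this PR.

```python
class ToyShrinkOp:
    """Stand-in for an op that derives shrink_ratio from data["epoch"]."""

    def __init__(self, shrink_ratio=0.4, total_epoch=500):
        self._base_shrink_ratio = shrink_ratio
        self._total_epoch = total_epoch

    def __call__(self, data):
        # The op keeps no epoch state of its own; it reads the value from the sample.
        epoch = data.get("epoch", 0)
        # Assumed linear schedule, used here only for illustration.
        ratio = self._base_shrink_ratio + 0.2 * epoch / float(self._total_epoch)
        data["shrink_ratio"] = ratio
        return data


class ToyDataset:
    """Stand-in dataset that injects the epoch into every sample."""

    def __init__(self, samples, ops):
        self.samples = samples
        self.ops = ops
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __getitem__(self, idx):
        data = dict(self.samples[idx])  # copy so the stored sample stays untouched
        data["epoch"] = self.epoch
        for op in self.ops:
            data = op(data)
        return data


dataset = ToyDataset([{"img_path": "a.jpg"}], [ToyShrinkOp()])
dataset.set_epoch(250)
sample = dataset[0]
print(sample["epoch"], round(sample["shrink_ratio"], 3))
```

Helpers that never see the data dict would need the computed ratio passed in as an argument from __call__.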
set_epoch_as_seed conflated two unrelated concerns: data sampling seed
and shrink_ratio epoch tracking. Now that epoch flows through data dict,
ops only need total_epoch from config to store _base_shrink_ratio —
no need to inject epoch at init time. Remove set_epoch_as_seed and
_update_epoch_in_ops entirely (-25 lines).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
draw_border_map does not have access to the data dict, so epoch-based
shrink_ratio must be computed in __call__ and passed as a parameter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When pyclipper.Execute() returns [] for degenerate polygons (very small
area or floating-point precision issues), skip the polygon instead of
crashing the entire sample with IndexError: list index out of range.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pyclipper internally converts float coords to integers and raises
ValueError when encountering NaN. NaN can appear after aggressive
geometric augmentations (affine, crop) on degenerate polygons.

- make_border_map: return early from draw_border_map if polygon has NaN
- make_shrink_map: mark polygon as ignored and continue if NaN detected

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
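
The two guards from this commit and the previous one (empty offset result, NaN coordinates) could look roughly like the following. This is a sketch in plain Python so that it runs without pyclipper or NumPy; the helper names are hypothetical, and offset_result stands in for whatever list Execute() returned.

```python
import math


def polygon_has_nan(polygon):
    """True if any coordinate of any point is NaN."""
    return any(math.isnan(coord) for point in polygon for coord in point)


def first_offset_polygon(offset_result):
    """Return the first polygon of an offset result, or None if the list is empty.

    Callers can skip the polygon on None instead of indexing [0] and
    hitting IndexError on degenerate input.
    """
    if not offset_result:
        return None
    return offset_result[0]


def mark_nan_polygons(polygons, ignore_tags):
    """Return new ignore tags with every NaN-containing polygon marked True."""
    return [
        tag or polygon_has_nan(polygon)
        for polygon, tag in zip(polygons, ignore_tags)
    ]


quad = [[0.0, 0.0], [4.0, 0.0], [4.0, 2.0], [0.0, 2.0]]
broken = [[0.0, 0.0], [float("nan"), 0.0], [4.0, 2.0], [0.0, 2.0]]
print(mark_nan_polygons([quad, broken], [False, False]))
print(first_offset_polygon([]))
```

Checking for NaN before handing coordinates to the clipper keeps the failure local to one polygon instead of aborting the whole sample.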
ppocr/data/simple_dataset.py
- Accept http/https URLs as image paths in label files; local paths
  unchanged (os.path.join / os.path.exists behaviour identical to before)
- _load_image_bytes(): cache hit → in-flight future → synchronous download
- _img_path_exists(): returns True for URLs (server reachability checked
  lazily at download time, not up-front)
- Per-worker LRU cache (200 entries, ~54 MB/worker) backed by a 4-thread
  ThreadPoolExecutor; cache and pool are fork-local, no cross-process lock
- _prefetch_epoch_urls(): called inside each worker on epoch-0 first access
  and on every epoch change (via _ensure_index_map); scans _index_map and
  submits the first 200 URL entries for background download so that most
  URL items are warm by the time __getitem__ needs them
- _download_url_bytes(): always removes the futures-dict entry on both
  success and failure, so a transiently-failing URL is retried next access
- get_ext_data(): wraps _load_image_bytes in try/except so a URL error
  skips that ext sample without aborting the main item
- Missing local file behaviour is identical to before (raise → except →
  log error → random retry in train, sequential retry in eval)
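
The lookup order listed above (cache hit → in-flight future → synchronous download) can be sketched as follows. UrlBytesCache and its fetch callback are hypothetical; the real helper names are the ones in the bullets, and their bodies may differ. One simplification: here the consumer removes the futures entry, whereas the commit describes the download function doing so.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class UrlBytesCache:
    """Per-process LRU cache of downloaded bytes with background prefetch.

    The cache and futures dict are only touched by the calling thread;
    pool threads just run `fetch`, so no lock is used in this sketch.
    """

    def __init__(self, fetch, capacity=200, workers=4):
        self._fetch = fetch  # callable: url -> bytes (injected for testing)
        self._capacity = capacity
        self._cache = OrderedDict()
        self._futures = {}
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def _store(self, url, data):
        self._cache[url] = data
        self._cache.move_to_end(url)
        while len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used

    def prefetch(self, url):
        """Start a background download unless the URL is cached or in flight."""
        if url not in self._cache and url not in self._futures:
            self._futures[url] = self._pool.submit(self._fetch, url)

    def load(self, url):
        # 1) Cache hit.
        if url in self._cache:
            self._cache.move_to_end(url)
            return self._cache[url]
        # 2) In-flight future; popped first so a failure is retried next access.
        future = self._futures.pop(url, None)
        if future is not None:
            data = future.result()
        else:
            # 3) Nothing in flight: download synchronously.
            data = self._fetch(url)
        self._store(url, data)
        return data
```

Because each DataLoader worker would build its own instance after fork, nothing here needs cross-process coordination.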

configs/det/PP-OCRv6/{base,small,tiny}_det.yml
- IaaAugment Affine: add p=0.5, widen rotate range to ±25, cap resize to 2×
- Regularizer factor: normalise scientific notation
- base: DetResizeForTest limit_side_len=4000 / limit_type=max

tools/infer_det.py
- infer_img can now be a .txt file listing one image path per line
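
As an illustration of the .txt option, a list file might be resolved like this. resolve_image_list is a hypothetical helper, not the function in tools/infer_det.py, and skipping blank lines is an assumption.

```python
import os
import tempfile


def resolve_image_list(infer_img):
    """Return image paths: the lines of a .txt list file, or the path itself."""
    if infer_img.lower().endswith(".txt"):
        with open(infer_img, encoding="utf-8") as list_file:
            # One path per line; blank lines are skipped (assumption).
            return [line.strip() for line in list_file if line.strip()]
    return [infer_img]


# Small demonstration with a temporary list file.
with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = os.path.join(tmp_dir, "images.txt")
    with open(list_path, "w", encoding="utf-8") as list_file:
        list_file.write("imgs/a.jpg\n\nimgs/b.png\n")
    paths = resolve_image_list(list_path)

print(paths)
```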

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
URLs containing CJK filenames (e.g. ancient_img dataset paths with
Chinese characters) caused urllib to raise UnicodeEncodeError because
http.client requires ASCII-only URLs.

Add _encode_url() which uses urllib.parse.urlparse + quote to
percent-encode only the path component, leaving scheme and netloc
intact.  _download_url_bytes() now encodes the URL before passing
it to urlopen; the original URL is still used as the cache key so
duplicate downloads are correctly deduplicated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
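
A minimal sketch of the path-only encoding described above, using urllib.parse. The function below is not the PR's _encode_url; in particular, keeping "%" in the safe set (so already-encoded URLs are not double-encoded) is an assumption of this sketch.

```python
from urllib.parse import quote, urlparse, urlunparse


def encode_url(url: str) -> str:
    """Percent-encode only the path component; scheme, netloc, query stay as-is."""
    parts = urlparse(url)
    # "/" keeps path separators; "%" keeps existing escapes intact (assumption).
    encoded_path = quote(parts.path, safe="/%")
    return urlunparse(parts._replace(path=encoded_path))


original = "http://example.com/数据/img 1.jpg"
encoded = encode_url(original)
print(encoded)

# The original string remains the cache key; only the request uses `encoded`.
cache_key = original
```

A non-ASCII host name would still pass through unchanged, since only the path is touched.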
…mCropData

Non-quad polygons (e.g. from pyclipper) don't have exactly 4 vertices,
so get_min_quad_side would produce incorrect results. Guard the check
with a len==4 condition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leo-q8 and others added 2 commits May 19, 2026 09:36
Some label files (e.g. ancient.txt, handwrite_en.txt) contain [null, null]
coordinate points which become NaN after np.array(..., dtype=np.float32).
Instead of letting NaN propagate through the pipeline and crash in
downstream transforms (CopyPaste, MakeShrinkMap, etc.), detect them at
the source and mark as ignore so they are properly masked in loss.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
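
Detecting null points at label-decode time might look like the sketch below. It assumes a JSON label with "points" and "transcription" keys and treats "*" and "###" as ignore markers; decode_label is a hypothetical helper, and plain Python lists stand in for the float32 array used in the pipeline.

```python
import json
import math


def decode_label(label_json):
    """Parse a JSON label line into (boxes, ignore_tags).

    JSON null coordinates become NaN, mirroring what a float32 array
    conversion would produce, and such boxes are flagged as ignored.
    """
    boxes, ignore_tags = [], []
    for annotation in json.loads(label_json):
        points = [
            [float("nan") if coord is None else float(coord) for coord in point]
            for point in annotation["points"]
        ]
        has_nan = any(math.isnan(coord) for point in points for coord in point)
        # Assumed ignore markers for unreadable text.
        is_ignored_text = annotation["transcription"] in ("*", "###")
        boxes.append(points)
        ignore_tags.append(has_nan or is_ignored_text)
    return boxes, ignore_tags


label = (
    '[{"transcription": "ok", "points": [[0, 0], [4, 0], [4, 2], [0, 2]]},'
    ' {"transcription": "bad", "points": [[0, 0], [null, null], [4, 2], [0, 2]]}]'
)
boxes, ignore_tags = decode_label(label)
print(ignore_tags)
```

Flagging the box here means later transforms only ever need to honour the ignore tag.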
…ention

Lower the minimum area threshold from 120 to 80 and char height ratio
from 0.5 to 0.35 to retain more partially-clipped text polygons during
random cropping, reducing unnecessary label loss.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>