support PP-OCRv6 by leo-q8 · Pull Request #17921 · PaddlePaddle/PaddleOCR

leo-q8 · 2026-04-14T12:14:51Z

No description provided.

paddle-bot · 2026-04-14T12:14:58Z

Thanks for your contribution!

1. Restore set_epoch_as_seed in SimpleDataSet to re-enable adaptive shrink_ratio (0.4→0.6) curriculum learning in MakeBorderMap and MakeShrinkMap across training epochs.
2. Remove default p=1.0 for Affine augmentation, restoring albumentations default p=0.5 (50% rotation probability).
These two changes caused V5 dataset precision to drop significantly (e.g. blur 0.904→0.709, printing_ch 0.926→0.708) while V4 remained unaffected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add EMA (Exponential Moving Average) support in training pipeline: build EMA in train.py, apply/restore in program.py, save/load in save_load.py (gated by use_ema config, disabled by default)
- Replace build_dataloader with lightweight reset_data_lines to avoid rebuilding DataLoader every epoch when need_reset=True
- Store _base_shrink_ratio/_total_epoch in MakeBorderMap/MakeShrinkMap to support in-place shrink_ratio updates across epochs
- Add dilated_kernel_size=7 for PP-OCRv6 small det config
- Remove redundant balance_loss=true from base/small configs (matches Python default)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rruption Previously, Global.seed (1024) was passed to build_dataloader as the epoch parameter, causing shrink_ratio to be incorrectly computed as 0.81 instead of 0.4 at training start. This separates the two concerns: seed controls data shuffling randomness, epoch controls adaptive shrink_ratio progression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
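The 0.81 figure can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the schedule is linear and adds 0.2 over the whole run (consistent with the 0.4→0.6 range quoted in an earlier commit) and that the run lasts 500 epochs (the figure quoted in a later commit). The function name `adaptive_shrink_ratio` and the `span` parameter are illustrative, not taken from the PR.

```python
def adaptive_shrink_ratio(base: float, epoch: int, total_epoch: int, span: float = 0.2) -> float:
    """Grow shrink_ratio linearly from `base` towards `base + span` (assumed schedule)."""
    return base + span * epoch / float(total_epoch)


# Intended behaviour: training starts at the base ratio.
start = adaptive_shrink_ratio(0.4, epoch=0, total_epoch=500)

# The bug: Global.seed (1024) ends up in the epoch slot.
corrupted = adaptive_shrink_ratio(0.4, epoch=1024, total_epoch=500)

print(start, round(corrupted, 2))
```

Under these assumptions the corrupted value is 0.4 + 0.2 × 1024 / 500 = 0.8096, which rounds to the 0.81 reported in the commit message.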

Core EMA implementation with threshold/exponential/normal decay types, bias correction, ema_filter_no_grad for distillation, and exchange-pattern save/restore compatible with PaddleDetection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
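For readers unfamiliar with the three decay types, the following is a minimal sketch over a plain dict of floats rather than real model weights. The class name `SimpleEMA`, the default decay, and the exact decay formulas are assumptions modelled on commonly used EMA schedules; the PR's actual implementation may differ, and `ema_filter_no_grad` and the save/restore logic are not covered here.

```python
import math


class SimpleEMA:
    """Toy EMA over a dict of float parameters (a stand-in for model weights)."""

    def __init__(self, params, decay=0.9998, decay_type="threshold", bias_correction=False):
        self.decay = decay
        self.decay_type = decay_type
        self.bias_correction = bias_correction
        self.step = 0
        # With bias correction the average starts at zero and is rescaled on read.
        self.shadow = {k: 0.0 if bias_correction else v for k, v in params.items()}

    def _current_decay(self):
        if self.decay_type == "threshold":
            # Small decay early on, capped at the configured value (assumed formula).
            return min(self.decay, (1 + self.step) / (10 + self.step))
        if self.decay_type == "exponential":
            # Smooth warm-up towards the configured value (assumed formula).
            return self.decay * (1 - math.exp(-(self.step + 1) / 2000))
        return self.decay  # "normal": constant decay

    def update(self, params):
        d = self._current_decay()
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
        self.step += 1

    def averaged(self):
        if not self.bias_correction or self.step == 0:
            return dict(self.shadow)
        correction = 1 - self.decay ** self.step
        return {k: v / correction for k, v in self.shadow.items()}


ema = SimpleEMA({"w": 1.0})
ema.update({"w": 2.0})
print(ema.averaged())
```

The bias-correction factor used here is exact only for a constant decay, so the sketch pairs it with the "normal" type in practice.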

Root cause: each epoch rebuilt the DataLoader, destroying and re-forking 16 worker processes. Worker join() triggers COW page cleanup that grows linearly with parent process memory (epoch 1: ~1s, epoch 400: ~466s), totaling ~27 hours over 500 epochs.
Solution:
- Enable persistent_workers=True in DataLoader for training
- Pre-load all label lines once; use _index_map for per-epoch ratio sampling (bit-exact with original get_image_info_list + shuffle)
- Signal epoch changes to workers via multiprocessing.Value shared memory
- Workers lazily rebuild _index_map on next __getitem__ (no disk I/O)
- MakeBorderMap/MakeShrinkMap dynamically compute shrink_ratio from shared epoch value (_get_shrink_ratio), replacing per-epoch op rebuild
- Reset reader_start after save_model to fix avg_reader_cost accounting
Verified against 53b006f baseline with real config (34 label files, ratio 0.01~0.8): data sampling bit-exact, shrink_ratio numerically identical, ext_data pool equivalent, DistributedBatchSampler unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
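The shared-epoch idea can be sketched without any framework code. The example below is an illustration under assumptions: the class name `EpochAwareDataset`, the `set_epoch` method, and the sampling scheme (seeded `random.sample` per label group) are hypothetical and are not claimed to be bit-exact with the original `get_image_info_list`. Only `_index_map`, `_ensure_index_map`, and the use of `multiprocessing.Value` come from the commit messages.

```python
import multiprocessing as mp
import random


class EpochAwareDataset:
    def __init__(self, label_groups, ratios, seed=0):
        # label_groups: one list of label lines per label file, loaded once up front.
        self.label_groups = label_groups
        self.ratios = ratios
        self.seed = seed
        self._shared_epoch = mp.Value("i", 0)  # shared with forked workers
        self._local_epoch = -1                 # epoch the current index map was built for
        self._index_map = []

    def set_epoch(self, epoch):
        """Called by the main process between epochs; workers stay alive."""
        with self._shared_epoch.get_lock():
            self._shared_epoch.value = epoch

    def _ensure_index_map(self):
        epoch = self._shared_epoch.value
        if epoch == self._local_epoch:
            return  # already up to date, nothing to rebuild
        rng = random.Random(self.seed + epoch)
        index_map = []
        for group, (lines, ratio) in enumerate(zip(self.label_groups, self.ratios)):
            count = round(len(lines) * ratio)
            index_map.extend((group, i) for i in rng.sample(range(len(lines)), count))
        rng.shuffle(index_map)
        self._index_map = index_map
        self._local_epoch = epoch

    def __len__(self):
        self._ensure_index_map()
        return len(self._index_map)

    def __getitem__(self, idx):
        self._ensure_index_map()  # lazy rebuild, in memory only
        group, i = self._index_map[idx]
        return {"label": self.label_groups[group][i], "epoch": self._local_epoch}
```

Because each worker compares the shared value against its own `_local_epoch`, the rebuild happens at most once per worker per epoch and touches only data already in memory.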

Replace _shared_epoch injection into MakeBorderMap/MakeShrinkMap with a simpler approach: dataset.__getitem__ sets data["epoch"], ops read it in __call__/draw_border_map. This keeps ops as pure data transforms with no cross-process state, and removes _setup_shared_epoch_in_ops, _get_shrink_ratio, and _update_epoch_in_ops (-44 lines, +7 lines). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

set_epoch_as_seed conflated two unrelated concerns: data sampling seed and shrink_ratio epoch tracking. Now that epoch flows through data dict, ops only need total_epoch from config to store _base_shrink_ratio — no need to inject epoch at init time. Remove set_epoch_as_seed and _update_epoch_in_ops entirely (-25 lines). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

draw_border_map does not have access to the data dict, so epoch-based shrink_ratio must be computed in __call__ and passed as a parameter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
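A minimal sketch of that control flow follows. The class `BorderMapOp`, the `polys` and `border_maps` keys, the placeholder body of `draw_border_map`, and the linear `0.2 * epoch / total_epoch` schedule are all assumptions for illustration; only `_base_shrink_ratio`, `_total_epoch`, `data["epoch"]`, and the idea of passing the ratio as a parameter come from the commit messages.

```python
class BorderMapOp:
    def __init__(self, shrink_ratio=0.4, total_epoch=None):
        self._base_shrink_ratio = shrink_ratio
        self._total_epoch = total_epoch

    def __call__(self, data):
        # The epoch travels with the sample, so the op holds no cross-process state.
        epoch = data.get("epoch")
        shrink_ratio = self._base_shrink_ratio
        if self._total_epoch and epoch is not None:
            shrink_ratio += 0.2 * epoch / float(self._total_epoch)  # assumed schedule
        data["border_maps"] = [
            self.draw_border_map(polygon, shrink_ratio) for polygon in data["polys"]
        ]
        return data

    def draw_border_map(self, polygon, shrink_ratio):
        # Placeholder: a real implementation would rasterise the border region.
        # It cannot see `data`, so the ratio has to arrive as an argument.
        return {"polygon": polygon, "shrink_ratio": shrink_ratio}


op = BorderMapOp(shrink_ratio=0.4, total_epoch=500)
sample = {"polys": [[(0, 0), (10, 0), (10, 5), (0, 5)]], "epoch": 250}
print(op(sample)["border_maps"][0]["shrink_ratio"])
```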

When pyclipper.Execute() returns [] for degenerate polygons (very small area or floating-point precision issues), skip the polygon instead of crashing the entire sample with IndexError: list index out of range. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

pyclipper internally converts float coords to integers and raises ValueError when encountering NaN. NaN can appear after aggressive geometric augmentations (affine, crop) on degenerate polygons.
- make_border_map: return early from draw_border_map if polygon has NaN
- make_shrink_map: mark polygon as ignored and continue if NaN detected
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
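The NaN guard and the empty-result guard from the previous commit can be sketched together. To keep the example dependency-free, `offset_fn` stands in for the pyclipper offset call; it, `has_nan`, and `shrink_polygons` are hypothetical names, and marking the empty-result case as ignored (rather than silently dropping it) is an assumption.

```python
import math


def has_nan(polygon):
    """True if any coordinate of any point is NaN."""
    return any(math.isnan(coord) for point in polygon for coord in point)


def shrink_polygons(polygons, ignore_tags, offset_fn):
    """offset_fn(polygon) returns a list of shrunk polygons, possibly empty."""
    kept = []
    for idx, polygon in enumerate(polygons):
        if ignore_tags[idx]:
            continue
        if has_nan(polygon):
            # Check before the offset call, which would raise on NaN input.
            ignore_tags[idx] = True
            continue
        shrunk = offset_fn(polygon)
        if not shrunk:
            # Degenerate polygon: skip it instead of indexing into an empty list.
            ignore_tags[idx] = True
            continue
        kept.append(shrunk[0])
    return kept, ignore_tags


# Stand-in offset: pretend polygons with fewer than 3 points are degenerate.
fake_offset = lambda poly: [poly] if len(poly) >= 3 else []
polys = [
    [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)],
    [(float("nan"), 1.0), (2.0, 1.0), (2.0, 3.0)],
    [(0.0, 0.0), (0.0, 0.0)],
]
print(shrink_polygons(polys, [False, False, False], fake_offset)[1])
```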

ppocr/data/simple_dataset.py
- Accept http/https URLs as image paths in label files; local paths unchanged (os.path.join / os.path.exists behaviour identical to before)
- _load_image_bytes(): cache hit → in-flight future → synchronous download
- _img_path_exists(): returns True for URLs (server reachability checked lazily at download time, not up-front)
- Per-worker LRU cache (200 entries, ~54 MB/worker) backed by a 4-thread ThreadPoolExecutor; cache and pool are fork-local, no cross-process lock
- _prefetch_epoch_urls(): called inside each worker on epoch-0 first access and on every epoch change (via _ensure_index_map); scans _index_map and submits the first 200 URL entries for background download so that most URL items are warm by the time __getitem__ needs them
- _download_url_bytes(): always removes the futures-dict entry on both success and failure, so a transiently-failing URL is retried next access
- get_ext_data(): wraps _load_image_bytes in try/except so a URL error skips that ext sample without aborting the main item
- Missing local file behaviour is identical to before (raise → except → log error → random retry in train, sequential retry in eval)
configs/det/PP-OCRv6/{base,small,tiny}_det.yml
- IaaAugment Affine: add p=0.5, widen rotate range to ±25, cap resize to 2×
- Regularizer factor: normalise scientific notation
- base: DetResizeForTest limit_side_len=4000 / limit_type=max
tools/infer_det.py
- infer_img can now be a .txt file listing one image path per line
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
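The lookup order described for URL images (cache hit, then in-flight future, then synchronous download) might look roughly like the sketch below. `UrlImageCache` and its method names are hypothetical, and the `fetch` callable is injected so the example runs without network access; in real use it would wrap something like `urllib.request.urlopen`. The lock here only coordinates threads inside one worker process.

```python
import threading
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class UrlImageCache:
    def __init__(self, fetch, max_entries=200, workers=4):
        self._fetch = fetch            # callable: url -> bytes
        self._cache = OrderedDict()    # LRU order: oldest first
        self._futures = {}             # url -> in-flight Future
        self._lock = threading.Lock()  # thread-level only, not cross-process
        self._max_entries = max_entries
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def _download(self, url):
        try:
            data = self._fetch(url)
            with self._lock:
                self._cache[url] = data
                self._cache.move_to_end(url)
                while len(self._cache) > self._max_entries:
                    self._cache.popitem(last=False)  # evict least recently used
            return data
        finally:
            # Drop the entry on success and on failure so a failed URL is retried.
            with self._lock:
                self._futures.pop(url, None)

    def prefetch(self, urls):
        with self._lock:
            for url in urls:
                if url not in self._cache and url not in self._futures:
                    self._futures[url] = self._pool.submit(self._download, url)

    def load(self, url):
        with self._lock:
            if url in self._cache:
                self._cache.move_to_end(url)
                return self._cache[url]
            future = self._futures.get(url)
        if future is not None:
            return future.result()  # re-raises if the background download failed
        return self._download(url)
```

Keeping the cache and pool as plain instance attributes means each forked worker ends up with its own copy, which matches the fork-local design the commit describes.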

URLs containing CJK filenames (e.g. ancient_img dataset paths with Chinese characters) caused urllib to raise UnicodeEncodeError because http.client requires ASCII-only URLs. Add _encode_url() which uses urllib.parse.urlparse + quote to percent-encode only the path component, leaving scheme and netloc intact. _download_url_bytes() now encodes the URL before passing it to urlopen; the original URL is still used as the cache key so duplicate downloads are correctly deduplicated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
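A helper along these lines can be written with the standard library alone. This is a sketch of what such a function might look like, not the PR's `_encode_url`; the name `encode_url` and the choice of `safe="/%"` (which leaves already percent-encoded sequences untouched) are assumptions.

```python
from urllib.parse import quote, urlparse, urlunparse


def encode_url(url: str) -> str:
    """Percent-encode only the path; scheme, netloc and query are left as-is."""
    parts = urlparse(url)
    # "/" keeps path separators, "%" avoids double-encoding existing escapes.
    encoded_path = quote(parts.path, safe="/%")
    return urlunparse(parts._replace(path=encoded_path))


original = "http://example.com/data/中.jpg"
print(encode_url(original))
```

The original, unencoded string remains a stable cache key, so the same image requested twice maps to one download regardless of encoding.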

…mCropData Non-quad polygons (e.g. from pyclipper) don't have exactly 4 vertices, so get_min_quad_side would produce incorrect results. Guard the check with a len==4 condition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
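The guard can be illustrated as follows. This is a hedged sketch: the body of `get_min_quad_side`, the wrapper `is_too_small`, and the `min_side` threshold are assumptions, since the commit message only states that the check is skipped for polygons without exactly four vertices.

```python
import math


def get_min_quad_side(quad):
    """Shortest edge length of a 4-point polygon (assumed definition)."""
    return min(math.dist(quad[i], quad[(i + 1) % 4]) for i in range(4))


def is_too_small(polygon, min_side=5.0):
    # Only quads get the side-length check; other shapes are passed through.
    if len(polygon) != 4:
        return False
    return get_min_quad_side(polygon) < min_side


quad = [(0, 0), (20, 0), (20, 3), (0, 3)]
hexagon = [(0, 0), (2, 0), (3, 1), (2, 2), (0, 2), (-1, 1)]
print(is_too_small(quad), is_too_small(hexagon))
```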

Some label files (e.g. ancient.txt, handwrite_en.txt) contain [null, null] coordinate points which become NaN after np.array(..., dtype=np.float32). Instead of letting NaN propagate through the pipeline and crash in downstream transforms (CopyPaste, MakeShrinkMap, etc.), detect them at the source and mark as ignore so they are properly masked in loss. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
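Detecting the problem at parse time could look like this. The sketch assumes a JSON label line whose entries carry `points` and `transcription` keys and treats `*` and `###` as the usual ignore markers; `parse_label` is a hypothetical name, and the real label encoder works on NumPy arrays rather than nested lists.

```python
import json
import math


def parse_label(label_json):
    """Return (boxes, ignore_tags); boxes containing null coordinates are ignored."""
    boxes, ignore_tags = [], []
    for ann in json.loads(label_json):
        # JSON null becomes None; map it to NaN the way a float array would.
        points = [
            [math.nan if coord is None else float(coord) for coord in point]
            for point in ann["points"]
        ]
        invalid = any(math.isnan(coord) for point in points for coord in point)
        boxes.append(points)
        ignore_tags.append(invalid or ann["transcription"] in ("*", "###"))
    return boxes, ignore_tags


line = json.dumps([
    {"transcription": "hello", "points": [[0, 0], [10, 0], [10, 5], [0, 5]]},
    {"transcription": "world", "points": [[0, 0], [None, None], [10, 5], [0, 5]]},
])
print(parse_label(line)[1])
```

Flagging the box here means later transforms see an ordinary ignored region instead of having to handle NaN themselves.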

…ention Lower the minimum area threshold from 120 to 80 and char height ratio from 0.5 to 0.35 to retain more partially-clipped text polygons during random cropping, reducing unnecessary label loss. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

support PP-OCRv6

cbf6837

leo-q8 force-pushed the v6_main branch from 4cbad2c to cbf6837 on April 14, 2026 12:20

leo-q8 force-pushed the v6_main branch from 65425ef to 8110b2e on April 18, 2026 03:04

leo-q8 and others added 12 commits April 18, 2026 07:55

fix: compute shrink_ratio in __call__ and pass to draw_border_map

e2554b5


leo-q8 force-pushed the v6_main branch from 1c36b54 to 12efbc1 on May 8, 2026 08:29

leo-q8 and others added 2 commits May 19, 2026 09:36

leo-q8 wants to merge 16 commits into PaddlePaddle:main from leo-q8:v6_main
